Just a quick one during my lunch hour … ran into an issue yesterday at my current client that shows once more that when you do not work with a specific OS for a while, you really lose your touch for the small details.
The Saga of WAS, AIX and the damn Java Cache
We installed iFixes yesterday and that all went well. However, the syncing of the nodes (kicked off from the Dmgr console) took forever, and then one of the app clusters on one of the nodes would not restart (it eventually did, after 4 hours).
To clean the system and get rid of any old temp files we:
- Stopped all WebSphere servers: /WAS_Profile/bin/stopServer.sh xxx -user xxx -password ****
- Stopped the NodeAgent: /WAS_Profile/bin/stopNode.sh -username xxxx -password ****
- Cleaned all temp files in /WAS_Profile/temp and /wstemp (everything inside both folders)
- Ran /WAS_Profile/bin/osgiCfgInit.sh
- Ran /WAS_Profile/bin/clearClassCache.sh
Note: you can also use the command “./stopNode.sh -stopservers -username xxxx -password ****” to shut down the node agent AND the servers at the same time. We wanted to see the individual servers come down as we had issues with one of them.
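The clean-up steps above could be wrapped in a small script, roughly like the sketch below. It assumes the servers and the node agent are already stopped, and that both temp and wstemp sit directly under the profile root (adjust if yours live elsewhere). The function names empty_dir and clean_profile are made up for illustration and are not part of WAS.

```shell
#!/bin/sh
# Sketch only: clear a stopped profile's temp folders, then rebuild the caches.
# The profile root argument stands in for the /WAS_Profile placeholder above.

# Remove everything inside a directory, but keep the directory itself.
empty_dir() {
    dir=$1
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "skip: '$dir' is not a directory" >&2
        return 0
    fi
    # Portable idiom: act on the direct children of $dir only (hidden files too).
    find "$dir/." ! -name . -prune -exec rm -rf {} \;
}

# Empty temp and wstemp, then run the two cache tools if they are present.
clean_profile() {
    profile_root=$1
    if [ -z "$profile_root" ]; then
        echo "usage: clean_profile <profile_root>" >&2
        return 1
    fi
    empty_dir "$profile_root/temp"
    empty_dir "$profile_root/wstemp"
    for tool in osgiCfgInit.sh clearClassCache.sh; do
        if [ -x "$profile_root/bin/$tool" ]; then
            "$profile_root/bin/$tool" || return 1
        else
            echo "skip: $profile_root/bin/$tool not found" >&2
        fi
    done
}
```

After sourcing the file, something like `clean_profile /WAS_Profile` would run the whole sequence for one profile. Stopping the servers is deliberately left out so you can still watch each one come down by hand.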
We then tried to restart the node agent … and it failed. We found this in the startServer.log for the node agent:
ADMU3011E: Server launched but failed initialization
Damn, nothing worked … re-cleaned, checked, cursed, cried ….. and then opened a Sev 1 ticket with IBM support online. (had a REALLY fast response – thanks guys!)
The Cavalry to the Rescue …
The Connections support guy had a look at the logs and brought in a WAS support specialist, who had me repeat the clean-up steps above AND clean this location as well (everything in this folder, without deleting the folder itself):
The IBM tech thinks we had a corrupted system-level Java cache that was causing the issue. After that, a ./startNode.sh worked like a charm and the servers started fine as well.
Incidentally, we ended up shutting each AIX WAS server (including the Dmgr) down one by one, so that we would not have a service outage, and ran the above maintenance once more. On the nodes we also ran a “./syncNode.sh” with the node agent turned off – just to eliminate any possibility of the nodes being out of sync (thanks for the idea, Stuart).
We will also be going through our automated scripts to test adding some more items to them (email notifications when individual steps are done, adding “/tmp/javasharedresources” to the list of folders to be cleaned, etc.).
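As a rough idea of what that script change might look like, here is a sketch that walks a list of folders, empties each one and reports after every step. CLEAN_DIRS, notify and clean_all are illustrative names of my own, the notify hook only prints a line (swap in whatever mail command your site uses), and the list assumes paths without spaces. It should only be run with the servers and node agent stopped.

```shell
#!/bin/sh
# Sketch only: empty each folder in a list and report after every step.
# CLEAN_DIRS, notify() and clean_all() are illustrative names, not part of WAS.

# Space-separated list of folders to empty (paths must not contain spaces).
CLEAN_DIRS=${CLEAN_DIRS:-"/WAS_Profile/temp /WAS_Profile/wstemp /tmp/javasharedresources"}

# Placeholder notification hook: prints a timestamped line to stdout.
# Replace the echo with your own mail command to get email notifications.
notify() {
    echo "[$(date '+%H:%M:%S')] $*"
}

# Empty every folder in CLEAN_DIRS, keeping the folders themselves.
clean_all() {
    for dir in $CLEAN_DIRS; do
        if [ -d "$dir" ]; then
            # Delete the direct children of $dir only (hidden files too).
            find "$dir/." ! -name . -prune -exec rm -rf {} \;
            notify "cleaned $dir"
        else
            notify "skipped $dir (not a directory)"
        fi
    done
}
```

Nothing runs until clean_all is called, so the list can be checked or overridden first, for example `CLEAN_DIRS="/some/test/dir" clean_all` on a throwaway folder.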
Lessons to be learned:
- When you don’t work with an OS for a while you forget the important SMALL stuff (/tmp/javasharedresources) – I had run into this very issue a few years ago and totally forgot about it. I actually did not remember it until this morning, the day after.
- When in doubt – call support RIGHT AWAY, if for no other reason than to validate your thought process is correct and you are not barking up the wrong tree. We did not wait very long to call, but sometimes even 5 minutes can mean the difference between failure and success.