Hi and welcome to my blog!
For my first post, I’d like to share an issue I recently had to look into: the SCOM agent not always picking up an unexpected restart.
SCOM Version: SCOM 2012 SP1 UR5 – 7.0.9538.1106
ServerA: Physical Server running Windows Server 2008
ServerA: SCOM agent version 7.0.9538.0
ServerA has a known issue where it can unexpectedly restart. It could run for a week or only a couple of hours between restarts.
On one night it suffered 3 unexpected restarts, but we only got one alert notification. What happened to the other 2?
So I checked the event logs to confirm: yes, the 6008 events were in the System log, and yes, the agent was running. There were no issues with time drift, the alert that came through was closed 15 minutes after being raised so it didn’t repeat, and there were no issues with the monitoring configuration or with the agent itself. I had also previously configured another rule, “Unexpected Server Reboot”, to pick up unexpected restarts (just in case).
ServerA Unexpected Restart Events
ServerA Unexpected Restart Alerts
So I created a new monitor to watch for dummy events and attempted to reproduce the issue by killing the agent process. Unfortunately, it kept picking up the events as expected.
So I logged a call with Microsoft to investigate this.
After the Microsoft support engineer reviewed the logs and discussed the case with the escalation engineer, he advised that if the server suffers an unexpected restart, there is a possibility that the SCOM agent won’t pick up the 6008 event. Microsoft support was able to go through the source code/agent logic and advise on the circumstances that may result in unexpected restarts not being picked up.
The circumstances are:
- Server suffers an unexpected restart
- The agent’s bookmark (where it last read from in the event log) may not have been written to the EDB
- Server starts up and writes the 6008 event into the log
- Agent starts up 1-2 minutes after the server starts, goes through its own checks and checks the EDB for where it last read from (if the bookmark wasn’t written, it uses the current date and time)
- Agent ignores the recent 6008 event because it is considered an old event, and so does not alert on the unexpected restart.
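To make the sequence above concrete, here is a minimal sketch of the bookmark behaviour as Microsoft support described it to me. It is only a toy model, not the agent’s real code: the `Event` type, the `events_seen_by_agent` function and all the timestamps are names and values I made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    """Hypothetical stand-in for a System log entry."""
    event_id: int
    written_at: datetime


def events_seen_by_agent(
    log: List[Event],
    bookmark: Optional[datetime],
    agent_start: datetime,
) -> List[Event]:
    """Return the events a bookmark-based reader would process.

    Assumption: if no bookmark was persisted, the reader falls back to
    its own start time, so anything logged before that looks "old".
    """
    read_from = bookmark if bookmark is not None else agent_start
    return [e for e in log if e.written_at > read_from]


# Made-up timeline: server boots and logs 6008, agent starts 2 minutes later.
boot_time = datetime(2014, 1, 1, 2, 0, 0)
system_log = [Event(6008, boot_time)]
agent_start = boot_time + timedelta(minutes=2)
saved_bookmark = boot_time - timedelta(minutes=10)

with_bookmark = events_seen_by_agent(system_log, saved_bookmark, agent_start)
without_bookmark = events_seen_by_agent(system_log, None, agent_start)

print(len(with_bookmark))     # 1: bookmark predates the 6008 event
print(len(without_bookmark))  # 0: the 6008 event is treated as old
```

With a saved bookmark the 6008 event is processed, but when the bookmark is missing the event falls before the agent’s start time and is skipped, which matches what we saw on ServerA.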
Another cause I suspected was that the EDB had become corrupt and needed to be rebuilt, which would produce the same result. This was unlikely, though, as the agent logs didn’t report downloading MPs when the restarts occurred.
A solution I believe will work is to create a monitor that runs a script every 15 minutes to check the System log for 6008 events from the past 15 minutes and, if any are found, generate an alert. That way, when the agent starts up it won’t rely on the EDB. I know this will sometimes generate 2 alerts for the same issue, but I’d rather know than miss it altogether. I will post the solution once it’s done.
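Until the real solution is ready, here is a rough sketch of the check I have in mind. It only shows the time-window filtering logic: the events are a hand-made list of `(event_id, timestamp)` tuples rather than a real read of the System log, and `recent_unexpected_restarts` is a hypothetical name of my own. The actual monitor script would need to read the log itself and hand the result back to SCOM.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

UNEXPECTED_SHUTDOWN_EVENT_ID = 6008
CHECK_WINDOW = timedelta(minutes=15)  # same as the planned monitor interval


def recent_unexpected_restarts(
    events: Iterable[Tuple[int, datetime]],
    now: datetime,
    window: timedelta = CHECK_WINDOW,
) -> List[datetime]:
    """Return timestamps of 6008 events logged within `window` of `now`.

    The check depends only on the event timestamps, not on any stored
    bookmark, so a lost bookmark cannot hide a restart.
    """
    cutoff = now - window
    return [
        logged_at
        for event_id, logged_at in events
        if event_id == UNEXPECTED_SHUTDOWN_EVENT_ID and cutoff <= logged_at <= now
    ]


# Made-up sample data standing in for the System log.
now = datetime(2014, 1, 1, 2, 15, 0)
sample_events = [
    (6008, datetime(2014, 1, 1, 2, 5, 0)),   # inside the window
    (6008, datetime(2014, 1, 1, 1, 30, 0)),  # too old
    (7036, datetime(2014, 1, 1, 2, 10, 0)),  # different event ID
]

hits = recent_unexpected_restarts(sample_events, now)
if hits:
    print(f"ALERT: {len(hits)} unexpected restart(s) in the last 15 minutes")
```

Because the window is the same length as the run interval, a restart logged right on the boundary could be reported by two consecutive runs, which is another way duplicate alerts might show up.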
Thanks for reading!