Stale State Change Events detected in OpsMgr database – My take on it.

I’m not a DBA, but if you work with SCOM you need to know a bit about its databases and what’s inside them. For the past few weeks I have been receiving the alert “Stale State Change Events detected in OpsMgr database” and have been running the SQL query to clean this up manually. (Running an alert report showed that this had been happening for some time.) I started investigating how this alert is generated and why I was getting it.

So this rule comes from Tao Yang’s OpsMgr Self Maintenance Management Pack: opsmgr-self-maintenance-management-pack
I then checked the MP XML for how this was detected and found the following two SQL queries:

SELECT DaysToKeep FROM PartitionAndGroomingSettings WHERE ObjectName = 'StateChangeEvent'

 

SELECT DATEDIFF(d, MIN(TimeAdded), GETDATE()) AS [Current] FROM statechangeevent

The first query gets the grooming setting (in my case 14 days) and the second returns the number of days since the oldest entry (when I last ran it I got 22 days). A comparison is then done and an alert is generated if there are N+1 more days of data which should have been groomed. So the alerts are legitimate.
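To make the comparison concrete, here is a minimal Python sketch of the check as I understand it. The function names and the `tolerance_days` parameter (the "N" above) are my own and hypothetical; the management pack's real threshold and implementation may differ.

```python
def stale_days(days_to_keep: int, oldest_age_days: int) -> int:
    """Days of data older than the grooming window (0 if there are none)."""
    return max(0, oldest_age_days - days_to_keep)


def should_alert(days_to_keep: int, oldest_age_days: int, tolerance_days: int) -> bool:
    """Alert when at least tolerance_days + 1 days of data should have been groomed.

    tolerance_days is a hypothetical stand-in for the "N" in the text above.
    """
    return stale_days(days_to_keep, oldest_age_days) > tolerance_days


# Values from my environment: grooming set to 14 days, oldest entry 22 days old.
print(stale_days(14, 22))
print(should_alert(14, 22, tolerance_days=1))
```

With my numbers there are 8 days of data past the grooming window, so the check fires for any small tolerance.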

The issue I have is that the stale state change events are not removed automatically; they stay until I manually remove them (using the query supplied in the alert knowledge base).

I manually ran the stored procedure “Exec p_PartitioningAndGrooming” but this did not clean up the events. It’s still reporting 22 days.
My question is, what is responsible for cleaning up the state change events? Is it “Exec p_PartitioningAndGrooming”? If so, why wouldn’t it be working?

When I check the table I can see that the procedure runs and that events get groomed, and there are no errors when the query runs.

I did some more research and found a comment Kevin Holman made about the query that manually removes these events (see: useful-operations-manager-2007-sql-queries)

“To clean up old StateChangeEvent data for state changes that are older than the defined grooming period, such as monitors currently in a disabled, warning, or critical state. By default we only groom monitor statechangeevents where the monitor is enabled and healthy at the time of grooming.”

So from what I understand, the SP which runs daily works as intended. It does not remove the state change events of monitors that are in a warning or error state, so this data will remain unless it is manually removed, or until the monitor goes green, in which case it is removed the next time grooming runs automatically.

So this may or may not be an issue. To be on the safe side I have left this rule enabled and will look at creating some custom PRTG sensors to monitor this and also create a task to execute grooming automatically. More on that later.

SCOM 2012 R2 – Issue: Console errors “Verification failed with 1 errors” when deleting overrides or importing a Management Pack

So a couple of weeks ago one of the Unix engineers raised an issue with me that when he tried to delete an override from the SCOM Console he got the following error:

Note:  The following information was gathered when the operation was attempted.  The information may appear cryptic but provides context for the error.  The application will continue to run.: Verification failed with 3 errors:
——————————————————-
Error 1:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=a27e1d14-8ad4-56fb-da46-0c0994054e92,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=a27e1d14-8ad4-56fb-da46-0c0994054e92,LanguageID=ENA refers to an invalid sub element Filter.
——————————————————-
Error 2:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA refers to an invalid sub element Filter.
——————————————————-
Error 3:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=06522f67-c195-2dfa-c310-a0134b961fc4,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=06522f67-c195-2dfa-c310-a0134b961fc4,LanguageID=ENA refers to an invalid sub element Filter.
——————————————————-
After spending a week troubleshooting and researching, I found a couple of other posts describing a similar issue. But in those cases the error pointed to an “ElementReference” that existed. The problem I had was that none of the “ElementReference” values existed in the management pack, so I wasn’t able to work out where these were. Then I had a moment of clarity.

The first issue was that I was missing the key details of what the error was telling me.

Error 1:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA refers to an invalid sub element Filter.

ENA refers to an invalid sub element “Filter”.

Understand the error: “Filter” is the element it is referring to. Realizing that, I recalled that I had added custom XML in the past where I wanted to “filter” the result of some Unix log monitoring. So some of the Unix log monitoring rules have the below “Filter” applied.

Example of the XML:

    <ConditionDetection ID="Filter" TypeID="System!System.ExpressionFilter">
      <Expression>
        <RegExExpression>
          <ValueExpression>
            <XPathQuery Type="String">//row</XPathQuery>
          </ValueExpression>
          <Operator>DoesNotMatchRegularExpression</Operator>
          <Pattern>27037|00245|00227|227|245|00202</Pattern>
        </RegExExpression>
      </Expression>
    </ConditionDetection>

I then searched the MP for all rules that contain ID="Filter" and recorded their IDs (my example below):

ID="LogFileTemplate_8d97498787e44bb08d640e79a58c4919.Alert"
ID="LogFileTemplate_4b8e8427aa7342eab7e57b8f28b68240.Alert"
ID="LogFileTemplate_0cb6bd324de345e9b5f66ea17338c8ca.Alert"
ID="LogFileTemplate_8e0db25739e74ab1bc37afd5b48f53ee.Alert"
ID="LogFileTemplate_985eafabee8144789b026a3292405849.Alert"
ID="LogFileTemplate_f05331c6df034b63ab364900655235ba.Alert"
ID="LogFileTemplate_aba7d5aaf42b4c5589d124b24f33aa71.Alert"
ID="LogFileTemplate_d7b747d1708943b1a19cfa57960c692c.Alert"
ID="LogFileTemplate_77cc9a10c11e4ec2be13d755c4ce1f4d.Alert"
ID="LogFileTemplate_b8ec93656e75432eb6f9959311513f93.Alert"
ID="LogFileTemplate_e36f7c7d961e43f295c0c62e5668bf94.Alert"
ID="LogFileTemplate_42923834c6da47229a860b6b0bf51838.Alert"
ID="LogFileTemplate_a13fa86a600a4f079804953d87fbd686.Alert"
ID="LogFileTemplate_0cf825ac45214a6da85097a11f529a79.Alert"
ID="LogFileTemplate_3c292e133a08417fa96f00cc63fa7050.Alert"
ID="LogFileTemplate_1d0eefcfac314d1e9868459d44b6bd83.Alert"

I then searched through the language pack XML for SubElementID="Filter" and recorded the elements that don’t match any ID in the list above.

These were the 3 I found (and I have 3 errors for this MP... interesting):
1. <DisplayString ElementID="LogFileTemplate_df92799e9f064db282f6abc981f1e5d3.Alert" SubElementID="Filter"> not found – checked and no filter condition exists in configuration

2. <DisplayString ElementID="LogFileTemplate_4226bd92be9e4a10871cbc2fcac6da5d.Alert" SubElementID="Filter"> not found – checked and no filter condition exists in configuration

3. <DisplayString ElementID="LogFileTemplate_d3d9588aa82a4a53a920d6d99a400940.Alert" SubElementID="Filter"> not found – checked and no filter condition exists in configuration
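The manual search above could be automated. Below is a rough Python sketch using only the standard library. It assumes the exported MP XML has no XML namespace on its elements, that rules appear as `Rule` elements, and that language pack entries appear as `DisplayString` elements with `ElementID` and `SubElementID` attributes, as in my MP. The function name and the shortened sample IDs are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def find_orphaned_display_strings(mp_xml: str, sub_element_id: str = "Filter") -> List[str]:
    """Return ElementIDs of DisplayStrings whose rule has no matching sub element."""
    root = ET.fromstring(mp_xml)

    # Collect the IDs of rules that really contain a module with ID == sub_element_id.
    rules_with_sub = set()
    for rule in root.iter("Rule"):
        for element in rule.iter():
            if element is not rule and element.get("ID") == sub_element_id:
                rules_with_sub.add(rule.get("ID"))
                break

    # Any DisplayString pointing at that sub element on another rule is an orphan.
    orphans = []
    for display_string in root.iter("DisplayString"):
        if (display_string.get("SubElementID") == sub_element_id
                and display_string.get("ElementID") not in rules_with_sub):
            orphans.append(display_string.get("ElementID"))
    return orphans


# Hypothetical, heavily trimmed management pack for demonstration only.
SAMPLE_MP = """
<ManagementPack>
  <Monitoring>
    <Rules>
      <Rule ID="LogFileTemplate_aaa.Alert">
        <ConditionDetection ID="Filter" TypeID="System!System.ExpressionFilter" />
      </Rule>
      <Rule ID="LogFileTemplate_bbb.Alert" />
    </Rules>
  </Monitoring>
  <LanguagePacks>
    <LanguagePack ID="ENA">
      <DisplayStrings>
        <DisplayString ElementID="LogFileTemplate_aaa.Alert" SubElementID="Filter" />
        <DisplayString ElementID="LogFileTemplate_bbb.Alert" SubElementID="Filter" />
        <DisplayString ElementID="LogFileTemplate_ccc.Alert" SubElementID="Filter" />
      </DisplayStrings>
    </LanguagePack>
  </LanguagePacks>
</ManagementPack>
"""

for element_id in find_orphaned_display_strings(SAMPLE_MP):
    print(element_id)
```

In the sample, the `bbb` rule exists but has no Filter condition and the `ccc` rule does not exist at all, so both DisplayString entries are reported. For a real MP you would read the exported XML file into a string first.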

I then deleted those references in the language XML part and imported the MP. It imported with no errors. Oh yeah!

Once I had removed these references the management pack loaded, I was able to delete overrides, and everything was gravy.

So the cause of the error was that the Unix log XML configuration must have previously contained a Filter condition that was later removed from the rule but not from the language component of the MP.

If you are also interested in finding out the name of the log monitoring rule: under the Unix logfile template, search for the ElementID, and when you get to the “KnowledgeArticle” XML you will see the name/details of the rule. I suspect that this may have been caused by the recent upgrade from SCOM 2012 SP1 to R2, or by the filter configuration being cleared from the console with a cleanup that was not 100%.

 

SCOM Reporting Forecasting/trending Powershell Report

SCOM Reporting Forecasting/trending powershell report solution

 

The “issue”:

The reporting tool in our previous monitoring toolset could generate forecasting reports, but unfortunately SCOM cannot produce these reports or perform this analysis out of the box.

 

I investigated other solutions and did find a couple (one was at a significant cost while another only provided graphics and not statistics/report). These didn’t seem to fit the bill so I then decided to develop my own.

 

So I created a PowerShell script that gets the data, analyzes it and produces HTML reports. The breakdown below shows how my script is structured, in case you would like to write the code yourself (the best way to learn).

  • get the list of Unix servers from the specified resource pool (we use different resource pools for different gateways, but you can modify this to take the list of servers from a group)
  • get the list of Windows servers from the gateway server (we use gateway servers, but you can modify this to take the list of servers from a group)
  • run a SQL query to extract performance data from the data warehouse for the servers listed above and store the results
  • analyze each performance counter/instance and work out the average and projections (then store these results for reporting)
  • clean the data table (remove negative numbers and replace them with 0)
  • generate HTML reports based on the processed data
    • 0-45 days to upgrade (report lists CPU/Memory/Disks that will run out of space in the next 45 days or have already run out of space)
    • 46-200 days to upgrade (report lists CPU/Memory/Disks that will run out of space in the next 46-200 days or have already run out of space)
    • Full table of results for reference
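The projection and bucketing steps above can be sketched as follows. This is not my PowerShell script: it is a simplified Python illustration that assumes a plain least-squares straight line through (day, percent used) samples, which is only one possible way to do the projection. All function names and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple


def linear_fit(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares line through (day, value) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need samples from at least two different days")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(points: List[Tuple[float, float]], capacity: float = 100.0) -> Optional[float]:
    """Projected days from the last sample until the trend line reaches capacity."""
    slope, intercept = linear_fit(points)
    last_day = points[-1][0]
    if slope * last_day + intercept >= capacity:
        return 0.0                     # already at or over capacity
    if slope <= 0:
        return None                    # flat or shrinking usage: never fills up
    # Mirrors the clean-up step: negative results are replaced with 0.
    return max(0.0, (capacity - intercept) / slope - last_day)


def report_bucket(days: Optional[float]) -> str:
    """Assign a result to one of the report ranges listed above."""
    if days is None:
        return "no upgrade forecast"
    if days <= 45:
        return "0-45 days"
    if days <= 200:
        return "46-200 days"
    return "more than 200 days"


# Hypothetical disk usage samples: (day number, percent used).
samples = [(0, 50.0), (10, 60.0), (20, 70.0)]
remaining = days_until_full(samples)
print(remaining, report_bucket(remaining))
```

Here usage grows by one percentage point per day from 70%, so the line reaches 100% thirty days after the last sample and the disk lands in the 0-45 days report.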

The report is then scheduled on a management server and is only run once per month.

Note: It takes 6 hours to process, as this is a synchronous script and the run time depends on the number of servers or objects that need to be analysed (some Unix servers have upwards of 30 file systems, which causes the script to take some time).

I originally had this all in one script but to speed up processing I split it into 2 (Windows and Unix).

I also investigated PowerShell workflows, but they have their limitations, as they won’t run some of the PowerShell commands I had in my script.

 

Download Capacity Report Script Here

Edit: I have moved this to github: https://github.com/Buzzcola81/scom2012forecasting

Please contact me if you have any questions.