SCOM – Calculation of “Memory Utilization” under Linux

I was reading the SCOM TechNet forums and found an interesting post titled “Calculation of ‘Memory Utilization’ under Linux”.

Long story short, SCOM’s calculation for “Available Memory” (values are gathered from /proc/meminfo) is:

 

m_availablememory = freemem + buffers + cached

 

But according to the Linux admin, it would make more sense to calculate “Available Memory” like this:

 

m_availablememory = freemem + Inactive

 

The response from a Microsoft employee was:

“Again, let me stress: There is no “right way” or “wrong way” to derive available memory. There are different ways, and many of them make sense. This is why different system utilities on the same system often report different values for this figure.”
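To see how far the two formulas can drift apart, here is a minimal Python sketch that evaluates both against /proc/meminfo-style text. The sample figures are made up for illustration, and mapping "freemem", "buffers" and "cached" to the MemFree, Buffers and Cached fields is my assumption, not something taken from the SCOM provider source.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def available_scom(mem):
    """SCOM-style figure: free + buffers + cached."""
    return mem["MemFree"] + mem["Buffers"] + mem["Cached"]


def available_inactive(mem):
    """Alternative suggested by the Linux admin: free + inactive."""
    return mem["MemFree"] + mem["Inactive"]


# Invented sample values, not taken from a real server.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:         1000000 kB
Buffers:          200000 kB
Cached:          3000000 kB
Inactive:        2500000 kB
"""

mem = parse_meminfo(SAMPLE)
print(available_scom(mem))      # 4200000
print(available_inactive(mem))  # 3500000
```

On a real Linux host you could pass the contents of /proc/meminfo to parse_meminfo instead of SAMPLE and compare the two figures directly.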

 

If you would like to see the full post on TechNet it’s located here.


SCOM Discovery Issue: Monitoring Cluster Shared Volumes with a Cluster NetBIOS name longer than 15 characters doesn’t work

Recently I was asked to investigate why no alerts were received for a Cluster Shared Volume that was filling up. The short version was that SCOM didn’t discover the CSVs of the cluster. But other clusters configured in exactly the same way on the same hardware had their CSVs discovered and monitored. Strange.

I started off in the usual way by manually running discovery, restarting the SCOM agents, flushing the cache and checking the event logs… nothing. So I started digging into the SCOM configuration to see how discovery works and why it would be failing.

As a cluster is monitored agentlessly, I checked its agentless entry and found that the cluster’s Virtual Server Name had been discovered incorrectly. It was missing the last character.

Running “Get-SCOMAgentlessManagedComputer” and finding the cluster’s details:

ManagementGroup : XXXXXXX
Computer : XXXXXXXXXSCLUSTE.connecteast.local
LastModified : 14/01/2015 1:20:52 AM
Path :
Name : XXXXXXXXXSCLUSTE.testdomain.local
DisplayName : XXXXXXXXXSCLUSTE.testdomain.local
HealthState : Uninitialized
PrincipalName : XXXXXXXXXSCLUSTE.testdomain.local
ComputerName : XXXXXXXXXSCLUSTE
Domain : testdomain
IPAddress : 10.10.1.100
ProxyAgentPrincipalName : XXXXXPRDHYP26.testdomain.local
Id : f345538e-03b8-f673-da83-2bd2f49a53a8
ManagementGroupId : f71ed7ba-0ae9-f130-8b74-11fda2c11ba1

 

Doing some more checking on the cluster, I found that the Name (NetBIOS name) and the DNS name didn’t match. Running this command on one of the Hyper-V servers confirmed the mismatch:

 

Get-ClusterResource -Name "Cluster Name" | Get-ClusterParameter

Object              Name                Value               Type
------              ----                -----               ----
Cluster Name        Name                XXXXXXXXXCLUSTE     String
Cluster Name        DnsName             XXXXXXXXXCluster    String
Cluster Name        Aliases                                 String

 

Turning my attention to the discovery rules and scripts, I found the CSV discovery script in the XML of the management pack “Microsoft.Windows.Server.ClusterSharedVolumeMonitoring”. The function described in the MP where I think the relationship is breaking is:

'****************************************************************************************************************
'   FUNCTION:       DiscoverClusterName
'   DESCRIPTION:    Discover instances of the relationship class
'                   'Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.Microsoft.Windows.Cluster.Contains.Microsoft.Windows.Cluster.VirtualServer'.
'   PARAMETERS:     IN String strTargetComputer: principal name of the targeted 'Microsoft.Windows.Cluster.VirtualServer' instance.
'                   OUT Object objDiscoveryData: initialised DiscoveryData instance
'   RETURNS:        Boolean: True if successful
'****************************************************************************************************************

So if the Virtual Server class for the cluster is discovered with the wrong name (i.e. “XXXXXXXXXCLUSTE”) while the cluster name actually in use is “XXXXXXXXXCluster”, the discovery won’t find any relationship to the CSVs, because it is looking for them under the incorrect cluster name.
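The mismatch is easy to reproduce outside SCOM. The Python sketch below applies the 15-character NetBIOS limit to a DNS host name and reports whether the two names diverge. The function names are hypothetical, and the sketch only illustrates the truncation; it is not how the management pack itself compares names.

```python
NETBIOS_MAX = 15  # NetBIOS computer names are limited to 15 characters


def netbios_name(dns_host_name):
    """Return the upper-cased, truncated NetBIOS form of a DNS host name."""
    short_name = dns_host_name.split(".")[0]
    return short_name[:NETBIOS_MAX].upper()


def names_diverge(dns_host_name):
    """True when truncation makes the NetBIOS name differ from the DNS name."""
    short_name = dns_host_name.split(".")[0]
    return netbios_name(dns_host_name) != short_name.upper()


print(netbios_name("XXXXXXXXXCluster.testdomain.local"))   # XXXXXXXXXCLUSTE
print(names_diverge("XXXXXXXXXCluster.testdomain.local"))  # True
print(names_diverge("SHORTCLUSTER.testdomain.local"))      # False
```

Running a check like this over planned cluster names before building them would flag any name that discovery is likely to truncate.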

The only solutions I can think of are to rename the cluster so that the NetBIOS name and DNS name match, or for Microsoft to update the discovery to allow for this condition.

On that note, Microsoft doesn’t recommend cluster or server names longer than 15 characters.

We do have a secondary monitoring system (PRTG), which I was able to configure to monitor the CSVs. The only limitation is that it isn’t cluster aware and monitors from only one node.

Stale State Change Events detected in OpsMgr database – My take on it.

I’m not a DBA, but if you work with SCOM you need to know a bit about its databases and what’s inside them. For the past few weeks I have been receiving the alert “Stale State Change Events detected in OpsMgr database” and have been running the SQL query to clean this up manually. (An alert report showed that this had been happening for some time.) I started to investigate how this alert is generated and why I was getting it.

This rule comes from Tao Yang’s OpsMgr Self Maintenance Pack: opsmgr-self-maintenance-management-pack
I then checked the MP XML for how this was detected and found the following two SQL queries:

SELECT DaysToKeep FROM PartitionAndGroomingSettings WHERE ObjectName = 'StateChangeEvent'

SELECT DATEDIFF(d, MIN(TimeAdded), GETDATE()) AS [Current] FROM StateChangeEvent

The first query gets the grooming setting (in my case 14 days) and the second returns the number of days since the oldest entry (when I last ran it I got 22 days). The two values are then compared and an alert is generated if there are N+1 more days of data which should have been groomed. So the alerts are legitimate.
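The comparison can be sketched in a few lines of Python. This is only my reading of the check, not the management pack's actual code: the function names and the one-day tolerance parameter are assumptions, and the dates are invented so that the result matches the 22 days I saw.

```python
from datetime import datetime


def days_of_data(oldest_time_added, now):
    """Mirror DATEDIFF(d, MIN(TimeAdded), GETDATE()): count calendar day boundaries."""
    return (now.date() - oldest_time_added.date()).days


def is_stale(days_to_keep, current_days, tolerance_days=1):
    """Assumed rule: alert once the data is at least days_to_keep + 1 days old."""
    return current_days >= days_to_keep + tolerance_days


# Invented timestamps chosen to reproduce the 22-day result.
now = datetime(2015, 3, 23, 9, 0)
oldest = datetime(2015, 3, 1, 18, 30)

current = days_of_data(oldest, now)
print(current)                # 22
print(is_stale(14, current))  # True
print(is_stale(14, 14))       # False
```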

The issue I have is that the stale state change events are not removed automatically; they stay until I remove them manually (using the query supplied in the alert knowledge base).

I manually ran the stored procedure “Exec p_PartitioningAndGrooming”, but this did not clean up the events. It still reported 22 days.
My question is: what is responsible for cleaning up the state change events? Is it “Exec p_PartitioningAndGrooming”? If so, why wouldn’t it be working?

When I check the table after it runs, I can see that events do get groomed, and there are no errors when the query runs.

I did some more research and found a comment Kevin Holman made about the query that manually removed these events (see: useful-operations-manager-2007-sql-queries)

“To clean up old StateChangeEvent data for state changes that are older than the defined grooming period, such as monitors currently in a disabled, warning, or critical state. By default we only groom monitor statechangeevents where the monitor is enabled and healthy at the time of grooming.”

So from what I understand, the stored procedure that runs daily works as intended. It does not remove the state change events of monitors that are in a warning or critical state, so this data will remain until it is manually removed or until the monitor goes green, at which point it will be removed the next time grooming runs automatically.

So this may or may not be an issue. To be on the safe side I have left this rule enabled and will look at creating some custom PRTG sensors to monitor this and also create a task to execute grooming automatically. More on that later.

SCOM 2012 R2 – Issue: Console errors “Verification failed with 1 errors” when deleting overrides or importing a Management Pack

So a couple of weeks ago one of the Unix engineers raised an issue with me that when he tried to delete an override from the SCOM Console he got the following error:

Note:  The following information was gathered when the operation was attempted.  The information may appear cryptic but provides context for the error.  The application will continue to run.: Verification failed with 3 errors:
——————————————————-
Error 1:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=a27e1d14-8ad4-56fb-da46-0c0994054e92,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=a27e1d14-8ad4-56fb-da46-0c0994054e92,LanguageID=ENA refers to an invalid sub element Filter.
——————————————————-
Error 2:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA refers to an invalid sub element Filter.
——————————————————-
Error 3:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=06522f67-c195-2dfa-c310-a0134b961fc4,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=06522f67-c195-2dfa-c310-a0134b961fc4,LanguageID=ENA refers to an invalid sub element Filter.
——————————————————-
After a week of troubleshooting and researching, I found a couple of other posts describing a similar issue. But in those cases the error referenced an “ElementReference” that actually existed. The problem I had was that none of the “ElementReference” values existed in the management pack, so I wasn’t able to work out where they were. Then I had a moment of clarity.

The first issue was that I was missing the key detail of what the error was telling me.

Error 1:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA refers to an invalid sub element Filter.

ENA refers to an invalid sub element “Filter“.

Understanding the error: “Filter” is the element it is referring to. Realizing that, I recalled that I had added custom XML in the past where I wanted to “filter” the results of some Unix log monitoring. So some of the Unix log monitoring rules have the “Filter” below applied.

Example of the XML:

<ConditionDetection ID="Filter" TypeID="System!System.ExpressionFilter">
  <Expression>
    <RegExExpression>
      <ValueExpression>
        <XPathQuery Type="String">//row</XPathQuery>
      </ValueExpression>
      <Operator>DoesNotMatchRegularExpression</Operator>
      <Pattern>27037|00245|00227|227|245|00202</Pattern>
    </RegExExpression>
  </Expression>
</ConditionDetection>

I then searched the MP for all rules that contain ID="Filter" and recorded their IDs (my example below):

ID=”LogFileTemplate_8d97498787e44bb08d640e79a58c4919.Alert”
ID=”LogFileTemplate_4b8e8427aa7342eab7e57b8f28b68240.Alert”
ID=”LogFileTemplate_0cb6bd324de345e9b5f66ea17338c8ca.Alert”
ID=”LogFileTemplate_8e0db25739e74ab1bc37afd5b48f53ee.Alert”
ID=”LogFileTemplate_985eafabee8144789b026a3292405849.Alert”
ID=”LogFileTemplate_f05331c6df034b63ab364900655235ba.Alert”
ID=”LogFileTemplate_aba7d5aaf42b4c5589d124b24f33aa71.Alert”
ID=”LogFileTemplate_d7b747d1708943b1a19cfa57960c692c.Alert”
ID=”LogFileTemplate_77cc9a10c11e4ec2be13d755c4ce1f4d.Alert”
ID=”LogFileTemplate_b8ec93656e75432eb6f9959311513f93.Alert”
ID=”LogFileTemplate_e36f7c7d961e43f295c0c62e5668bf94.Alert”
ID=”LogFileTemplate_42923834c6da47229a860b6b0bf51838.Alert”
ID=”LogFileTemplate_a13fa86a600a4f079804953d87fbd686.Alert”
ID=”LogFileTemplate_0cf825ac45214a6da85097a11f529a79.Alert”
ID=”LogFileTemplate_3c292e133a08417fa96f00cc63fa7050.Alert”
ID=”LogFileTemplate_1d0eefcfac314d1e9868459d44b6bd83.Alert”

I then searched through the language pack XML for SubElementID="Filter" and recorded the elements that don’t match anything in the list above.

These were the 3 I found (and I had 3 errors for this MP… interesting):
1. <DisplayString ElementID=”LogFileTemplate_df92799e9f064db282f6abc981f1e5d3.Alert” SubElementID=”Filter”> not found – checked and no filter condition exists in configuration

2. <DisplayString ElementID=”LogFileTemplate_4226bd92be9e4a10871cbc2fcac6da5d.Alert” SubElementID=”Filter”> not found – checked and no filter condition exists in configuration

3. <DisplayString ElementID=”LogFileTemplate_d3d9588aa82a4a53a920d6d99a400940.Alert” SubElementID=”Filter”> not found – checked and no filter condition exists in configuration
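The manual comparison above can also be scripted. The Python sketch below parses an unsealed management pack with the standard library and lists DisplayString entries whose SubElementID points at a module the rule no longer contains. The sample XML is a heavily trimmed, made-up stand-in for a real MP, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, invented stand-in for an unsealed management pack.
SAMPLE_MP = """<ManagementPack>
  <Monitoring>
    <Rules>
      <Rule ID="LogFileTemplate_aaa.Alert">
        <ConditionDetection ID="Filter" TypeID="System!System.ExpressionFilter" />
      </Rule>
      <Rule ID="LogFileTemplate_bbb.Alert" />
    </Rules>
  </Monitoring>
  <LanguagePacks>
    <LanguagePack ID="ENA">
      <DisplayStrings>
        <DisplayString ElementID="LogFileTemplate_aaa.Alert" SubElementID="Filter" />
        <DisplayString ElementID="LogFileTemplate_bbb.Alert" SubElementID="Filter" />
      </DisplayStrings>
    </LanguagePack>
  </LanguagePacks>
</ManagementPack>"""


def orphaned_display_strings(mp_xml, sub_element_id="Filter"):
    """Return ElementIDs whose DisplayString references a missing sub element."""
    root = ET.fromstring(mp_xml)

    # Rules that really contain a module with the given ID.
    rules_with_sub = set()
    for rule in root.iter("Rule"):
        if any(node.get("ID") == sub_element_id for node in rule.iter() if node is not rule):
            rules_with_sub.add(rule.get("ID"))

    # DisplayStrings that point at that sub element on a rule that lacks it.
    orphans = []
    for display in root.iter("DisplayString"):
        if (display.get("SubElementID") == sub_element_id
                and display.get("ElementID") not in rules_with_sub):
            orphans.append(display.get("ElementID"))
    return orphans


print(orphaned_display_strings(SAMPLE_MP))  # ['LogFileTemplate_bbb.Alert']
```

Against a real exported MP you would read the file contents into a string and pass that in; take a backup of the XML before deleting anything the script reports.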

I then deleted those references from the language pack section of the XML and imported the MP. It imported with no errors. Oh yeah!

Once these references were removed, the management pack loaded, I was able to delete overrides, and everything was gravy.

So the cause of the error was that the Unix log XML configuration must have previously contained a Filter condition that was later removed from the rule but not from the language component of the MP.

If you also want to find the name of the log monitoring rule, search the Unix log file template for the ElementID; when you get to the “KnowledgeArticle” XML you will see the name and details of the rule. I suspect that this may have been caused by the recent upgrade from SCOM 2012 SP1 to R2, or by the filter configuration being cleared from the console without the cleanup being 100% complete.

 

SCOM Reporting Forecasting/trending Powershell Report

SCOM Reporting Forecasting/trending powershell report solution

 

The “issue”:

The reporting tool in our previous monitoring toolset could generate forecasting reports, but unfortunately SCOM cannot produce these reports or perform this analysis out of the box.

 

I investigated other solutions and did find a couple (one was at a significant cost while another only provided graphics and not statistics/report). These didn’t seem to fit the bill so I then decided to develop my own.

 

So I created a PowerShell script that gets the data, analyzes it and produces HTML reports. The breakdown below shows how my script is structured, in case you would like to write the code yourself (the best way to learn).

  • get list of unix servers from resource pool specified  (we use different resource pools for different gateways but you can modify this into list of servers from a group)
  • gets list of windows servers from gateway server (we use gateway servers but you can modify this into list of servers from a group)
  • run SQL query to extract performance data from the data warehouse (for the list of servers specified above and store the results)
  • analyze each performance counter/instance and work out the average and projections (then store these results for reporting)
  • clean the data table (remove negative numbers and replace with 0)
  • generate html reports based on the processed data
    • 0-45 days to upgrade (report lists CPU/Memory/Disks that will run out of space in the next 45 days or have already run out of space)
    • 46-200 days to upgrade (report lists CPU/Memory/Disks that will run out of space in the next 46-200 days or have already run out of space)
    • Full table of results for reference
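The analysis step in the list above can be sketched as follows. This Python version fits a straight line through "percent used" samples with ordinary least squares and projects the days remaining until 100%. Using a linear trend is my assumption for illustration; the published PowerShell script may calculate its projections differently, and all names and sample figures here are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_pct, limit=100.0):
    """Project days until the counter reaches limit; None if it is not growing."""
    if len(days) < 2 or len(set(days)) < 2:
        return None  # not enough distinct samples to fit a trend
    slope, intercept = linear_fit(days, used_pct)
    if slope <= 0:
        return None  # flat or shrinking usage never reaches the limit
    current = slope * days[-1] + intercept
    # Negative results (already over the limit) are cleaned to 0.
    return max(0.0, (limit - current) / slope)


def report_bucket(remaining_days):
    """Sort a projection into the report ranges used by the HTML reports."""
    if remaining_days is None:
        return "no growth"
    if remaining_days <= 45:
        return "0-45 days"
    if remaining_days <= 200:
        return "46-200 days"
    return "beyond 200 days"


# Invented samples: a disk growing by 2% every 10 days.
remaining = days_until_full([0, 10, 20, 30], [70.0, 72.0, 74.0, 76.0])
print(round(remaining), report_bucket(remaining))  # 120 46-200 days
```

A straight line is a crude model for counters such as CPU that fluctuate rather than grow, so the projections are most useful for disk and file system usage.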

The report is then scheduled on a management server and is only run once per month.

Note: It takes 6 hours to process, as this is a synchronous script and the run time depends on the number of servers or objects that need to be analysed (some Unix servers have upwards of 30 file systems, which causes the script to take some time).

I originally had this all in one script, but to speed up processing I split it into two (Windows and Unix).

I also investigated PowerShell workflows, but they have their limitations: a workflow can’t use some of the PowerShell commands I had in my script.

 

Download Capacity Report Script Here

Edit: I have moved this to github: https://github.com/Buzzcola81/scom2012forecasting

Please contact me if you have any questions.

My SCOM 2012/PRTG Maintenance Mode Scheduler application

Problem: Engineers that are working on systems don’t have the ability in SCOM to schedule maintenance modes.

 

Business case: We use both SCOM and PRTG, and we needed to schedule maintenance mode (MM) not just for Windows servers but also for Unix servers and for business-critical Unix applications known as clustered packages.

 

Research: I found the MM scheduler tool, but it was built only for Windows and Unix servers and was written by someone else, and I needed to be able to add features and integrate with PRTG. So the choice was either to spend the time reverse engineering someone else’s solution or to build it from scratch.

 

Solution:

Scripted Solution with 2 parts.

1. Back end service that stores and executes maintenance modes when needed.

2. Script shared with the business to schedule.

 

How it all works:

I outlined the workflow for how I wanted it all to work.

For V1 of my solution I’m sticking with the command line. It’s the fastest way for me to deliver the solution and the easiest to fix any bugs in.

I used PowerShell to script a service and installed it on the SCOM management server. I packaged the PowerShell script as an executable service using a PowerShell command developed by Daniel Sorlov. You can find his module here. He also posted a YouTube video on how to package it. Check it out here.

The service checks a file share on the SCOM management server and if any xml files exist processes them.

The XML contained details like customer, server, start date time, duration, frequency, frequency data, end date time, reason and comment.

The service checks whether there are any errors in the XML, whether the server is already in maintenance mode, and whether it is a Unix package. This last check is what makes this different from the other SCOM maintenance mode schedulers: it also integrates with PRTG and pauses the PRTG sensors for the specific packages. This is done by extracting the sensor IDs for each package and placing them into a CSV, which the service checks to find the sensors it needs to place into maintenance mode. PRTG’s API makes this easy to achieve.
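As a rough illustration of the CSV lookup, the Python sketch below maps a package name to its sensor IDs and builds one pause request URL per sensor. The CSV column names, package names and sensor IDs are invented. The URL uses PRTG's pauseobjectfor.htm endpoint with id, pausemsg and duration parameters as I understand it from the PRTG HTTP API documentation; check the API reference of your own PRTG version before relying on it. The sketch only builds the URLs and does not send anything.

```python
import csv
import io
from urllib.parse import urlencode

# Invented mapping of clustered packages to PRTG sensor IDs.
SENSOR_CSV = """package,sensor_id
pkg_billing,2001
pkg_billing,2002
pkg_crm,3001
"""


def sensors_for_package(csv_text, package):
    """Return the sensor IDs listed for one package."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [int(row["sensor_id"]) for row in reader if row["package"] == package]


def pause_urls(base_url, sensor_ids, minutes, message, username, passhash):
    """Build one pause URL per sensor (endpoint and parameters are assumptions)."""
    urls = []
    for sensor_id in sensor_ids:
        query = urlencode({
            "id": sensor_id,
            "pausemsg": message,
            "duration": minutes,
            "username": username,
            "passhash": passhash,
        })
        urls.append(f"{base_url}/api/pauseobjectfor.htm?{query}")
    return urls


ids = sensors_for_package(SENSOR_CSV, "pkg_billing")
for url in pause_urls("https://prtg.example.local", ids, 60,
                      "SCOM maintenance", "svc_scom", "0000000000"):
    print(url)
```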

The second component is the interface. So as I said earlier V1 was just a command line tool that generates the xml.

 

The most difficult part of this process was getting my head around how to make the weekly and monthly frequencies work. I did this by representing the frequency data for weekly as follows (snippet from my help file):

If Frequency is “Weekly” you must set the position flag to 1 for the corresponding day of the week and give the 24-hour time in “HH:mm” format. Format is “0000000 HH:mm”

Monday Flag    1000000 HH:mm
Tuesday Flag   0100000 HH:mm
Wednesday Flag 0010000 HH:mm
Thursday Flag  0001000 HH:mm
Friday Flag    0000100 HH:mm
Saturday Flag  0000010 HH:mm
Sunday Flag    0000001 HH:mm

Note: Combinations can also be used, e.g. for Mondays and Fridays you can use 1000100

Example “1010000 HH:mm”, e.g. Monday and Wednesday at 2pm is “1010000 14:00”
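To show how such a value can be decoded, here is a small Python sketch of a parser for the weekly format. The real service is written in PowerShell; the function name and error messages here are purely illustrative.

```python
from datetime import datetime

# Flag positions run Monday (first character) to Sunday (last character).
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def parse_weekly(spec):
    """Decode '0000000 HH:mm' into (list of day names, time of day)."""
    flags, time_text = spec.split()
    if len(flags) != 7 or set(flags) - {"0", "1"}:
        raise ValueError("day flags must be seven characters of 0 or 1")
    days = [day for day, flag in zip(DAYS, flags) if flag == "1"]
    if not days:
        raise ValueError("at least one day flag must be set")
    at = datetime.strptime(time_text, "%H:%M").time()
    return days, at


days, at = parse_weekly("1010000 14:00")
print(days, at.strftime("%H:%M"))  # ['Monday', 'Wednesday'] 14:00
```

Validating the flags up front means a malformed request is rejected when it is read rather than silently never firing.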

 

and for Monthly:

If Frequency is “Monthly” you must specify which week of the month (1-5), which day of the week (0-6) and what

time “HH:mm” in 24 hour time format.

Format is “Week Number” “Weekday” “Time” looks like “2 3 21:00” eg 2nd Wednesday of the Month at 21:00

 

The Week Number flag represents the week of the month:

1 – First week of Month

2 – Second week of Month

3 – Third week of Month

4 – Fourth week of Month

5 – Fifth week of Month

 

The Weekday flag represents the week day you are after:

0    Sunday

1    Monday

2    Tuesday

3    Wednesday

4    Thursday

5    Friday

6    Saturday
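The monthly format can be resolved to a calendar date with a little date arithmetic. The Python sketch below reads the week number as "the nth occurrence of that weekday in the month", which matches the "2nd Wednesday" example above but is otherwise my assumption about how the service interprets it. Function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def parse_monthly(spec):
    """Decode 'Week Weekday HH:mm' into (week 1-5, weekday flag 0-6, time)."""
    week_text, weekday_text, time_text = spec.split()
    week, weekday = int(week_text), int(weekday_text)
    if not 1 <= week <= 5:
        raise ValueError("week number must be 1-5")
    if not 0 <= weekday <= 6:
        raise ValueError("weekday must be 0 (Sunday) to 6 (Saturday)")
    return week, weekday, datetime.strptime(time_text, "%H:%M").time()


def nth_weekday(year, month, week, weekday_flag):
    """Return the date of the nth given weekday, or None if the month has no such day."""
    # The help file uses 0=Sunday..6=Saturday; date.weekday() uses 0=Monday..6=Sunday.
    python_weekday = (weekday_flag - 1) % 7
    first = date(year, month, 1)
    offset = (python_weekday - first.weekday()) % 7
    candidate = first + timedelta(days=offset + 7 * (week - 1))
    return candidate if candidate.month == month else None


week, weekday, at = parse_monthly("2 3 21:00")
print(nth_weekday(2015, 1, week, weekday), at.strftime("%H:%M"))  # 2015-01-14 21:00
print(nth_weekday(2015, 1, 5, 3))  # None
```

Returning None for a fifth occurrence that does not exist lets the caller decide whether to skip that month or fall back to the last occurrence.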

 

It has been running for a few weeks now and, after fixing a couple of small bugs, I haven’t had any issues with crashes or maintenance modes not being applied. This was an enjoyable learning curve for me: it took weeks of effort, and I was able to deliver a solution that I was confident about.

 

I wouldn’t call myself a programmer just yet, but I’m liking being a script kiddy!

 

 

As this was made specifically for my employer and is heavily customized, sharing all the source code won’t benefit you, but on request I can help you develop, or share, the functions you need to build your own solution.

Script Monitor to check for unexpected shutdown events

A few days ago I posted about an issue where the SCOM agent might miss unexpected restart events.

So I developed a solution that does not rely on the way SCOM normally does log monitoring and does not rely on a time stamp to read the event log.

How it works:

  • I have created a new monitor called “Monitor Unexpected Shutdown”. It can be seen under Entity Health>Availability>Operating System Availability.
  • This monitor executes a script every 10 minutes that checks the System Event Log for the past 30 minutes for any 6008 events and counts the number of matches.
  • If the number of events is greater than 0 the monitor will turn critical and generate a critical alert.
  • After 30 minutes (or the 3rd check) the script will then report 0 and the monitor will go back to green and the alert will be auto closed.
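The behaviour described in the list above can be simulated without SCOM. The Python sketch below counts events inside a sliding 30-minute window at each 10-minute check and derives the monitor state from the count. The timestamps are invented, and the real monitor reads event ID 6008 from the System event log through its own script rather than from a list.

```python
from datetime import datetime, timedelta


def count_recent(event_times, now, minutes=30):
    """Count events that fall within the last `minutes` before `now`."""
    cutoff = now - timedelta(minutes=minutes)
    return sum(1 for stamp in event_times if cutoff <= stamp <= now)


def monitor_state(event_times, now, minutes=30):
    """Critical while any event is inside the window, otherwise Healthy."""
    return "Critical" if count_recent(event_times, now, minutes) > 0 else "Healthy"


# One invented 6008 event, then four checks ten minutes apart.
events = [datetime(2015, 1, 14, 12, 5)]
first_check = datetime(2015, 1, 14, 12, 10)
states = [monitor_state(events, first_check + timedelta(minutes=10 * i))
          for i in range(4)]
print(states)  # ['Critical', 'Critical', 'Critical', 'Healthy']
```

Exactly which check clears the alert depends on where the event falls relative to the schedule, but it never stays open for more than the window length plus one interval.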


I have designed it this way as I have integration with our ticketing system for alerts.

You can override the “Minutes” parameter to check for events going back further, so that the alerts are kept open for longer, and you can run the script less often if you don’t need it to run that frequently.

It’s my first self-authored SCOM management pack from scratch, so I welcome any comments and feedback.

Please be sure to test it as the monitor is enabled by default (assuming you have experienced this issue)

You can find the Management Pack here: WindowsUnexpectedRestart.xml