SCOM – Calculation of “Memory Utilization” under Linux

I was reading the SCOM TechNet forums and found an interesting post about the calculation of “Memory Utilization” under Linux.

Long story short, SCOM’s calculation for “Available Memory” (values are gathered from /proc/meminfo) is:

 

m_availablememory = freemem + buffers + cached

 

But according to the Linux admin, it would make more sense to calculate “Available Memory” like this:

 

m_availablememory = freemem + Inactive
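
To make the difference between the two formulas concrete, here is a minimal Python sketch that parses /proc/meminfo-style text and evaluates both. It assumes that "freemem", "buffers", "cached" and "Inactive" correspond to the MemFree, Buffers, Cached and Inactive fields, and the sample figures are made up for illustration. It is not SCOM's actual implementation.

```python
def parse_meminfo(text):
    """Return a dict of /proc/meminfo fields (values in kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def available_scom(info):
    # Formula quoted in the post: freemem + buffers + cached
    return info["MemFree"] + info["Buffers"] + info["Cached"]


def available_inactive(info):
    # Alternative suggested by the Linux admin: freemem + Inactive
    return info["MemFree"] + info["Inactive"]


# Made-up sample values in kB, for illustration only.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:         1000000 kB
Buffers:          200000 kB
Cached:          3000000 kB
Inactive:        2500000 kB
"""

info = parse_meminfo(SAMPLE)
print(available_scom(info))      # 4200000
print(available_inactive(info))  # 3500000
```

With these sample numbers the two definitions differ by 700000 kB, which is the kind of gap that makes different tools on the same system report different figures.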

 

The response from a Microsoft employee was:

“Again, let me stress: There is no “right way” or “wrong way” to derive available memory. There are different ways, and many of them make sense. This is why different system utilities on the same system often report different values for this figure.”

 

If you would like to see the full post on TechNet it’s located here.


SCOM Discovery Issue: Monitoring Cluster Shared Volumes with the Cluster NetBIOS name longer than 15 Characters doesn’t work

Recently I was asked to investigate why no alerts were received for a Cluster Shared Volume that was filling up. The short version was that SCOM didn’t discover the CSVs of the cluster. But other clusters configured in the exact same way on the same hardware etc. had their CSVs discovered and monitored. Strange.

I started off in the usual way by manually running discovery, restarting the SCOM agents, flushing the cache and checking the event logs… nothing. So I started digging into the SCOM configuration to see how discovery works and why it would be failing.

As a cluster is monitored agentlessly, I found that the cluster’s Virtual Server Name was incorrectly discovered. It was missing the last character.

Running “Get-SCOMAgentlessManagedComputer” and finding the cluster’s details:

ManagementGroup : XXXXXXX
Computer : XXXXXXXXXSCLUSTE.connecteast.local
LastModified : 14/01/2015 1:20:52 AM
Path :
Name : XXXXXXXXXSCLUSTE.testdomain.local
DisplayName : XXXXXXXXXSCLUSTE.testdomain.local
HealthState : Uninitialized
PrincipalName : XXXXXXXXXSCLUSTE.testdomain.local
ComputerName : XXXXXXXXXSCLUSTE
Domain : testdomain
IPAddress : 10.10.1.100
ProxyAgentPrincipalName : XXXXXPRDHYP26.testdomain.local
Id : f345538e-03b8-f673-da83-2bd2f49a53a8
ManagementGroupId : f71ed7ba-0ae9-f130-8b74-11fda2c11ba1

 

Doing some more checking on the cluster, I found that the Name (NetBIOS name) and the DNS name didn’t match. By running this command on one of the Hyper-V servers I was able to confirm the mismatch:

 

Get-ClusterResource -Name "Cluster Name" | Get-ClusterParameter

Object              Name                Value               Type
------              ----                -----               ----
Cluster Name        Name                XXXXXXXXXCLUSTE     String
Cluster Name        DnsName             XXXXXXXXXCluster    String
Cluster Name        Aliases                                 String

 

Then, turning my attention to the discovery rule/scripts, I found the discovery script for CSVs in the XML of the management pack “Microsoft.Windows.Server.ClusterSharedVolumeMonitoring”. The function in the MP where I think the relationship is breaking is:

'****************************************************************************************************************
'   FUNCTION:       DiscoverClusterName
'   DESCRIPTION:    Discover instances of the relationship class
'                   'Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.Microsoft.Windows.Cluster.Contains.Microsoft.Windows.Cluster.VirtualServer'.
'   PARAMETERS:     IN String strTargetComputer: principal name of the targeted 'Microsoft.Windows.Cluster.VirtualServer' instance.
'                   OUT Object objDiscoveryData: initialised DiscoveryData instance
'   RETURNS:        Boolean: True if successful
'****************************************************************************************************************

So if the Virtual Server class for the cluster is discovered with the wrong name (i.e. “XXXXXXXXXCLUSTE”) while the cluster name used is “XXXXXXXXXCluster”, it won’t discover any relationship to the CSVs, because it’s using the incorrect cluster name.
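
The mismatch can be illustrated with a small Python sketch. The truncation rule here (take the host label, upper-case it, cut at 15 characters) is a simplified assumption about how the NetBIOS name ends up shorter than the DNS name; the function names are hypothetical and this is not the management pack's code.

```python
NETBIOS_MAX = 15  # NetBIOS computer names are limited to 15 characters


def netbios_from_dns(dns_name):
    """Illustrative truncation: host label, upper-cased, cut at 15 characters."""
    host = dns_name.split(".")[0]
    return host.upper()[:NETBIOS_MAX]


def names_match(netbios_name, dns_name):
    """True when the host part of the DNS name equals the NetBIOS name."""
    return dns_name.split(".")[0].upper() == netbios_name.upper()


dns_name = "XXXXXXXXXCluster"          # masked DnsName from the post (16 characters)
netbios = netbios_from_dns(dns_name)
print(netbios)                         # XXXXXXXXXCLUSTE
print(names_match(netbios, dns_name))  # False
```

A check like this against each cluster's DnsName would flag the clusters whose CSVs are at risk of not being discovered.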

The only solutions I can think of are to rename the cluster so that the NetBIOS name and DNS name match, or for Microsoft to update the discovery to allow for this condition.

On that note, Microsoft doesn’t recommend naming clusters or servers with more than 15 characters.

We do have a secondary monitoring system (PRTG), which I was able to configure to monitor the CSVs. The only limitation is that it isn’t cluster aware and monitors from only 1 node.

SCOM Reporting Forecasting/Trending PowerShell Report


 

The “issue”:

The reporting tool in our previous monitoring toolset could generate forecasting reports, but unfortunately SCOM cannot produce these reports or perform this analysis out of the box.

 

I investigated other solutions and found a couple (one came at a significant cost, while another only provided graphs and not statistics/reports). These didn’t fit the bill, so I decided to develop my own.

 

So I created a PowerShell script that gets the data, analyzes it and produces HTML reports. The breakdown below shows how my script is structured, in case you would like to write the code yourself (the best way to learn).

  • get the list of Unix servers from the specified resource pool (we use different resource pools for different gateways, but you can change this to a list of servers from a group)
  • get the list of Windows servers from the gateway server (we use gateway servers, but you can change this to a list of servers from a group)
  • run a SQL query to extract performance data from the data warehouse for the servers listed above and store the results
  • analyze each performance counter/instance and work out the average and projections (then store these results for reporting)
  • clean the data table (remove negative numbers and replace them with 0)
  • generate HTML reports based on the processed data
    • 0-45 days to upgrade (report lists CPU/Memory/Disks that will run out of space in the next 45 days or have already run out of space)
    • 46-200 days to upgrade (report lists CPU/Memory/Disks that will run out of space in the next 46-200 days or have already run out of space)
    • Full table of results for reference
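
The "work out the projections" step can be sketched in Python as a straight-line fit over (day, value) samples. This is only one way to do it, under the assumption that a simple linear trend is good enough; the post does not say which method the script uses, and the function names, sample data and the 100% limit are hypothetical.

```python
def linear_fit(samples):
    """Least-squares slope and intercept for (day, value) pairs.

    Needs samples from at least two different days.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(samples, limit):
    """Days after the last sample until the trend line reaches limit.

    Returns 0 if the trend is already at the limit and None if the
    trend is flat or falling.
    """
    slope, intercept = linear_fit(samples)
    last_day = max(x for x, _ in samples)
    if slope * last_day + intercept >= limit:
        return 0
    if slope <= 0:
        return None
    return (limit - intercept) / slope - last_day


def bucket(days):
    """Map a projection onto the report buckets described above."""
    if days is None or days > 200:
        return "no action"
    return "0-45 days" if days <= 45 else "46-200 days"


# Made-up disk usage (% used) sampled every 10 days.
samples = [(0, 70.0), (10, 72.0), (20, 74.0), (30, 76.0)]
remaining = days_until(samples, 100.0)
print(round(remaining, 1), bucket(remaining))  # 120.0 46-200 days
```

Returning 0 for counters already at the limit plays the same role as the cleaning step that replaces negative numbers with 0.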

The report is then scheduled on a management server and runs only once per month.

Note: It takes 6 hours to process, as this is a synchronous script and the run time depends on the number of servers or objects that need to be analysed (some Unix servers have upwards of 30 file systems, which causes the script to take some time).

I originally had this all in one script, but to speed up processing I split it into 2 (Windows and Unix).

I also investigated PowerShell workflows, but they have limitations: they won’t run some of the PowerShell commands I had in my script.

 

Download Capacity Report Script Here

Edit: I have moved this to GitHub: https://github.com/Buzzcola81/scom2012forecasting

Please contact me if you have any questions.

My SCOM 2012/PRTG Maintenance Mode Scheduler application

Problem: Engineers working on systems don’t have the ability to schedule maintenance modes in SCOM.

Business case: We use SCOM and PRTG, and we needed to schedule MM not just for Windows servers but also for Unix servers and for business-critical Unix applications known as clustered packages.

Research: I found the MM scheduler tool, but it was built just for Windows and Unix servers and was built by someone else, and I needed to be able to add additional features and integrate with PRTG. So it was either spend the time reverse engineering someone else’s solution or build it from scratch.

 

Solution:

Scripted Solution with 2 parts.

1. Back end service that stores and executes maintenance modes when needed.

2. Script shared with the business to schedule.

 

How it all works:

I outlined the workflow for how I wanted it all to work.

For V1 of my solution I’m sticking with the command line. It’s the fastest way for me to deliver the solution and the easiest way to fix any bugs.

I used PowerShell to script a service and installed it on the SCOM management server. I packaged the PowerShell script as an executable service using a PowerShell command developed by Daniel Sorlov. You can find his module here. He also posted a YouTube video on how to package it. Check it out here.

The service checks a file share on the SCOM management server and, if any XML files exist, processes them.

The XML contained details like customer, server, start date time, duration, frequency, frequency data, end date time, reason and comment.
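
As an illustration of the kind of request file and error check involved, here is a minimal Python sketch. The element names, the root tag, the set of required fields and the "duration in minutes" rule are all hypothetical, since the post lists only the kinds of detail stored and not the actual XML layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical required fields, loosely based on the details listed above.
REQUIRED = ("Customer", "Server", "StartDateTime", "Duration", "Frequency", "Reason")


def load_request(xml_text):
    """Parse one request and return (fields, errors)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return {}, ["not well-formed XML: %s" % exc]
    fields = {child.tag: (child.text or "").strip() for child in root}
    errors = ["missing " + name for name in REQUIRED if not fields.get(name)]
    if fields.get("Duration") and not fields["Duration"].isdigit():
        errors.append("Duration must be a whole number of minutes")
    return fields, errors


# Hypothetical request file content.
GOOD = """\
<MaintenanceRequest>
  <Customer>ExampleCustomer</Customer>
  <Server>SERVER01</Server>
  <StartDateTime>2015-01-14 21:00</StartDateTime>
  <Duration>60</Duration>
  <Frequency>Once</Frequency>
  <Reason>Patching</Reason>
  <Comment>Monthly patch window</Comment>
</MaintenanceRequest>
"""

fields, errors = load_request(GOOD)
print(fields["Server"], errors)  # SERVER01 []
```

Returning a list of errors instead of raising lets a service log every problem in a file and move on to the next one.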

The service checks whether there are any errors in the XML, whether the server is already in maintenance mode and whether it’s a Unix package. This last check is what makes this different from the other SCOM maintenance mode schedulers: it also integrates with PRTG and pauses the PRTG sensors for the specific packages. This is done by extracting the sensor IDs for each package and placing them into a CSV, which the service checks to find the sensors it needs to place into maintenance mode. PRTG’s API makes this easy to achieve.
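
A rough Python sketch of the CSV lookup and the PRTG calls follows. The CSV column names, package names, server URL and credentials are hypothetical. The endpoint shown is PRTG's HTTP API call for pausing an object for a number of minutes as I understand it, so verify it against the API documentation for your PRTG version. The sketch only builds the URLs and does not send any requests.

```python
import csv
import io
from urllib.parse import urlencode


def sensors_for_package(csv_text, package):
    """Return the sensor IDs listed for a package (hypothetical CSV layout)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["SensorId"] for row in reader if row["Package"] == package]


def pause_urls(base_url, sensor_ids, minutes, message, username, passhash):
    """Build one pause URL per sensor; nothing is sent here."""
    urls = []
    for sensor_id in sensor_ids:
        query = urlencode({
            "id": sensor_id,
            "pausemsg": message,
            "duration": minutes,
            "username": username,
            "passhash": passhash,
        })
        urls.append("%s/api/pauseobjectfor.htm?%s" % (base_url, query))
    return urls


# Hypothetical package-to-sensor mapping.
MAPPING = """\
Package,SensorId
PKG_DB01,2001
PKG_DB01,2002
PKG_APP01,3001
"""

ids = sensors_for_package(MAPPING, "PKG_DB01")
for url in pause_urls("https://prtg.example.local", ids, 60,
                      "Planned maintenance", "svc_scom", "0000000000"):
    print(url)
```

Keeping the mapping in a CSV means new packages can be added without touching the service itself.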

The second component is the interface. So as I said earlier V1 was just a command line tool that generates the xml.

 

I found that the most difficult part of this process was getting my head around how to get the weekly and monthly frequency working. I did this by representing the frequency data for weekly as follows (snippet from my help file):

If Frequency is "Weekly" you must set the position flag to 1 for the corresponding day of the week and give the time in 24-hour "HH:mm" format. Format is "0000000 HH:mm"

Monday Flag    1000000 HH:mm
Tuesday Flag   0100000 HH:mm
Wednesday Flag 0010000 HH:mm
Thursday Flag  0001000 HH:mm
Friday Flag    0000100 HH:mm
Saturday Flag  0000010 HH:mm
Sunday Flag    0000001 HH:mm

Note: Combinations can also be used, e.g. for Mondays and Fridays use 1000100.

Example: Monday and Wednesday at 2pm is "1010000 14:00"
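
A minimal Python sketch of how a value in this weekly format could be decoded, assuming the seven flag positions run Monday to Sunday. The function name is hypothetical and this is not the service's actual code.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def parse_weekly(value):
    """Split '1010000 14:00' into (day names, hour, minute)."""
    flags, time_text = value.split()
    if len(flags) != 7 or set(flags) - {"0", "1"}:
        raise ValueError("expected seven 0/1 flags, got %r" % flags)
    hour, minute = (int(part) for part in time_text.split(":"))
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("time must be HH:mm in 24-hour format")
    days = [day for day, flag in zip(DAYS, flags) if flag == "1"]
    return days, hour, minute


print(parse_weekly("1010000 14:00"))  # (['Monday', 'Wednesday'], 14, 0)
```

Using one character per day keeps the value easy to type and easy to validate, at the cost of having to remember that the first position is Monday.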

 

and for Monthly:

If Frequency is "Monthly" you must specify which week of the month (1-5), which day of the week (0-6) and what time "HH:mm" in 24-hour time format.

Format is "Week Number" "Weekday" "Time", which looks like "2 3 21:00", e.g. the 2nd Wednesday of the month at 21:00

The Week Number flag represents the week of the month:

1 – First week of Month
2 – Second week of Month
3 – Third week of Month
4 – Fourth week of Month
5 – Fifth week of Month

The Weekday flag represents the week day you are after:

0    Sunday
1    Monday
2    Tuesday
3    Wednesday
4    Thursday
5    Friday
6    Saturday
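
A minimal Python sketch of how the monthly value could be turned into a date. It assumes "week of the month" means the nth occurrence of that weekday (as in the "2nd Wednesday" example) and that a fifth occurrence which does not exist in a given month is simply skipped. The function names are hypothetical.

```python
import calendar
from datetime import date


def parse_monthly(value):
    """Split '2 3 21:00' into (week, weekday, hour, minute)."""
    week_text, weekday_text, time_text = value.split()
    week, weekday = int(week_text), int(weekday_text)
    hour, minute = (int(part) for part in time_text.split(":"))
    if not 1 <= week <= 5 or not 0 <= weekday <= 6:
        raise ValueError("week must be 1-5 and weekday 0-6")
    return week, weekday, hour, minute


def nth_weekday(year, month, week, weekday):
    """Date of the nth given weekday, or None if the month has no such day.

    weekday uses the numbering above (0 = Sunday ... 6 = Saturday).
    """
    python_weekday = (weekday - 1) % 7       # datetime uses 0 = Monday
    first = date(year, month, 1)
    offset = (python_weekday - first.weekday()) % 7
    day = 1 + offset + 7 * (week - 1)
    if day > calendar.monthrange(year, month)[1]:
        return None
    return date(year, month, day)


week, weekday, hour, minute = parse_monthly("2 3 21:00")
print(nth_weekday(2015, 1, week, weekday))   # 2015-01-14
```

Note the conversion between the two weekday numberings: the help file counts from Sunday, while Python's datetime counts from Monday.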

 

So it has been running for a few weeks now and, after fixing a couple of small bugs, I haven’t had any issues with crashing or maintenance modes not being applied. This was an enjoyable learning curve for me, as this was weeks’ worth of effort and I was able to deliver a solution that I was confident about.

 

I wouldn’t call myself a programmer just yet, but I’m liking being a script kiddy!

 

 

As this was made specifically for my employer and is customized, sharing all the source code won’t benefit you, but on request I can help you develop, or share, the necessary functions to build your own solution.

Script Monitor to check for unexpected shutdown events

A few days ago I posted about an issue where the SCOM agent might miss unexpected restart events.

So I developed a solution that does not rely on the way SCOM normally does log monitoring and does not rely on a time stamp to read the event log.

How it works:

  • I have created a new monitor called “Monitor Unexpected Shutdown”. It can be seen under Entity Health>Availability>Operating System Availability.
  • This monitor executes a script every 10 minutes that checks the System Event Log for the past 30 minutes for any 6008 events and counts the number of matches.
  • If the number of events is greater than 0 the monitor will turn critical and generate a critical alert.
  • After 30 minutes (or the 3rd check) the script will then report 0 and the monitor will go back to green and the alert will be auto closed.
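
The counting logic in the steps above can be sketched in Python as follows. The event list and timestamps are made up, and the real monitor reads the Windows System event log from its own script rather than an in-memory list.

```python
from datetime import datetime, timedelta

UNEXPECTED_SHUTDOWN = 6008


def count_recent(events, now, minutes=30):
    """Count 6008 events logged in the last `minutes` minutes."""
    cutoff = now - timedelta(minutes=minutes)
    return sum(1 for stamp, event_id in events
               if event_id == UNEXPECTED_SHUTDOWN and cutoff <= stamp <= now)


def state(count):
    """Monitor state for a given match count."""
    return "Critical" if count > 0 else "Healthy"


# One made-up 6008 event, then a check every 10 minutes.
events = [(datetime(2015, 1, 14, 1, 5), 6008)]
first_check = datetime(2015, 1, 14, 1, 10)
for i in range(4):
    now = first_check + timedelta(minutes=10 * i)
    print(now.strftime("%H:%M"), state(count_recent(events, now)))
# 01:10 Critical
# 01:20 Critical
# 01:30 Critical
# 01:40 Healthy
```

Because each check looks at a fixed window instead of a saved read position, a check that runs after a restart still sees the event.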


I have designed it this way as I have integration with our ticketing system for alerts.

You can override the “Minutes” parameter to check for events going back further so that the alerts are kept open for longer, and increase the interval between script executions if you don’t need to run it that frequently.

It’s my first self-authored SCOM management pack from scratch, so I welcome any comments and feedback.

Please be sure to test it, as the monitor is enabled by default (assuming you have experienced this issue).

You can find the Management Pack here: WindowsUnexpectedRestart.xml

 

 

SCOM agent not guaranteed to pick up unexpected shutdown event

Hi and Welcome to my blog!

For my first post I’d like to share an issue I recently had to look into with the SCOM agent not always picking up an unexpected restart.

Some background:
SCOM Version: SCOM 2012 SP1 UR5 – 7.0.9538.1106
ServerA: Physical Server running Windows Server 2008
ServerA: SCOM agent version 7.0.9538.0

ServerA has a known issue where it can unexpectedly restart. It could run for a week or only a couple of hours between restarts.
One night it suffered 3 unexpected restarts but we only got one alert notification. What happened to the other 2?

So I checked the event logs and confirmed that yes, the 6008 events were in the System log and yes, the agent was running. There were no issues with time drift, the alert that came through was closed 15 minutes after being raised so it didn’t repeat, and there were no issues with the monitoring configuration or with the agent itself. I had also previously configured another rule, “Unexpected Server Reboot”, to pick up unexpected restarts (just in case).

ServerA Unexpected Restart Events


ServerA Unexpected Restart Alerts


Curious….

So I created a new monitor to monitor dummy events and attempted to reproduce the issue by killing the agent process. But unfortunately it kept picking up the events as expected.

So I logged a call with Microsoft to investigate this.

After the Microsoft support engineer reviewed the logs and discussed the case with the escalation engineer, he advised that if the server suffers an unexpected restart, there is a possibility that the SCOM agent won’t pick up the 6008 event. Microsoft support was able to go through the source code/agent logic and advise on the circumstances that may result in unexpected restarts not being picked up.

Circumstances are when:

  • Server suffers unexpected restart
  • A possibility exists where the bookmark is not written to the EDB
  • Server starts up and writes 6008 event into log
  • Agent starts up 1-2 minutes after the server starts, goes through its own checks and checks the EDB for where the agent last read from (if the bookmark wasn’t written, it uses the current date and time)
  • Agent ignores the recent 6008 event as it is considered an old event, thus not alerting on the unexpected restart.
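
The sequence above can be modelled with a few lines of Python. This is a simplified illustration of the behaviour described by support, not the agent's real logic; the function name and timestamps are made up.

```python
from datetime import datetime


def events_to_process(events, bookmark, agent_start):
    """Events the agent would read after starting up.

    bookmark is the last read position stored in the EDB, or None when
    it was never written; in that case reading starts at agent_start.
    """
    read_from = bookmark if bookmark is not None else agent_start
    return [event for event in events if event[0] > read_from]


# Made-up timeline: 6008 logged at boot, agent starts two minutes later.
events = [(datetime(2015, 1, 14, 1, 3), 6008)]
agent_start = datetime(2015, 1, 14, 1, 5)

# Bookmark saved before the crash: the 6008 event is picked up.
print(len(events_to_process(events, datetime(2015, 1, 14, 0, 59), agent_start)))  # 1

# Bookmark missing: the 6008 event predates the agent start and is skipped.
print(len(events_to_process(events, None, agent_start)))  # 0
```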

Another cause I suspected was that the EDB had become corrupt and needed to be rebuilt (unlikely, as I didn’t see the agent logs report downloading MPs). This would produce the same result, but it wasn’t seen in the agent logs when the restarts occurred.

A solution I believe will work is to create a monitor that executes a script every 15 minutes to check the System log for 6008 events in the past 15 minutes and generates an alert if any are found. That way, detection won’t rely on the EDB when the agent starts up. I know this will sometimes generate 2 alerts for the same issue, but I’d rather know than miss it altogether. I will post the solution once it’s done.

Thanks for reading!

Martin.