SCOM Discovery Issue: Monitoring Cluster Shared Volumes doesn’t work when the cluster name is longer than 15 characters

Recently I was asked to investigate why no alerts were received for a Cluster Shared Volume that was filling up. The short version was that SCOM didn’t discover the CSVs of the cluster, yet other clusters configured in exactly the same way, on the same hardware etc., had their CSVs discovered and monitored. Strange.

I started off with the usual methods: manually running discovery, restarting the SCOM agents, flushing the cache and checking the event logs… nothing. So I started digging into the SCOM configuration to see how discovery works and why it would be failing.

As a cluster is monitored agentlessly, I found that the cluster’s Virtual Server Name was incorrectly discovered: it was missing the last character.

Running "Get-SCOMAgentlessManagedComputer" and finding the cluster's details:

ManagementGroup : XXXXXXX
Computer : XXXXXXXXXSCLUSTE.testdomain.local
LastModified : 14/01/2015 1:20:52 AM
Path :
Name : XXXXXXXXXSCLUSTE.testdomain.local
DisplayName : XXXXXXXXXSCLUSTE.testdomain.local
HealthState : Uninitialized
PrincipalName : XXXXXXXXXSCLUSTE.testdomain.local
ComputerName : XXXXXXXXXSCLUSTE
Domain : testdomain
IPAddress : 10.10.1.100
ProxyAgentPrincipalName : XXXXXPRDHYP26.testdomain.local
Id : f345538e-03b8-f673-da83-2bd2f49a53a8
ManagementGroupId : f71ed7ba-0ae9-f130-8b74-11fda2c11ba1
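If you just want the cluster's record rather than paging through the full list, a filter like the one below does the job (a quick sketch; the wildcard is just an example, adjust it to your cluster's name):

# Connect to the management group first, then filter the agentless managed computers
Import-Module OperationsManager

Get-SCOMAgentlessManagedComputer |
    Where-Object { $_.ComputerName -like "*CLUSTE*" } |
    Select-Object ComputerName, Name, PrincipalName, HealthState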

 

Doing some more checking on the cluster I found that the Name (NetBIOS name) and DNS name didn’t match. By running this command on one of the Hyper-V hosts I was able to verify this:

 

Get-ClusterResource -Name "Cluster Name" | Get-ClusterParameter

Object         Name      Value              Type
------         ----      -----              ----
Cluster Name   Name      XXXXXXXXXCLUSTE    String
Cluster Name   DnsName   XXXXXXXXXCluster   String
Cluster Name   Aliases                      String
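To check this programmatically rather than eyeballing the output, a quick comparison of the two parameters on a cluster node could look like the sketch below (it assumes the FailoverClusters module and the default "Cluster Name" core resource):

# Pull the Name (NetBIOS) and DnsName parameters of the core cluster name resource
Import-Module FailoverClusters

$params  = Get-ClusterResource -Name "Cluster Name" | Get-ClusterParameter
$netbios = ($params | Where-Object { $_.Name -eq 'Name' }).Value
$dnsName = ($params | Where-Object { $_.Name -eq 'DnsName' }).Value

# Flag the mismatch that breaks the CSV discovery relationship
if ($netbios -ne $dnsName) {
    Write-Warning "NetBIOS name '$netbios' and DNS name '$dnsName' do not match - CSV discovery may fail."
}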

 

Turning my attention to the discovery rules/scripts, I found the discovery script for CSVs in the XML of the management pack “Microsoft.Windows.Server.ClusterSharedVolumeMonitoring”. The function described in the MP where I think the relationship is breaking is:

'****************************************************************************************************************
'   FUNCTION:       DiscoverClusterName
'   DESCRIPTION:    Discover instances of the relationship class
'                   'Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.Microsoft.Windows.Cluster.Contains.Microsoft.Windows.Cluster.VirtualServer'.
'   PARAMETERS:     IN String strTargetComputer: principal name of the targeted 'Microsoft.Windows.Cluster.VirtualServer' instance.
'                   OUT Object objDiscoveryData: initialised DiscoveryData instance
'   RETURNS:        Boolean: True if successful
'****************************************************************************************************************

So if the discovery of the Virtual Computer class for the cluster is incorrect (i.e. “XXXXXXXXXCLUSTE”) while the actual cluster name is “XXXXXXXXXCluster”, it won’t discover any relationship to the CSVs, as it is using the incorrect name of the cluster.

The only solution I can think of is to rename the cluster so that the NetBIOS name and DNS name match, or for Microsoft to update the discovery to allow for this condition.

On that note, Microsoft doesn’t recommend cluster or server names longer than 15 characters.

We do have a secondary monitoring system (PRTG) which I was able to configure to monitor the CSVs (the only limitation being that it isn’t cluster aware and monitors from only one node).

Stale State Change Events detected in OpsMgr database – My take on it.

I’m not a DBA, but if you work with SCOM you need to know a bit about its databases and what’s inside them. For the past few weeks I have been receiving the alert “Stale State Change Events detected in OpsMgr database” and have been executing the SQL query to clean this up manually (running an alert report showed this had been happening for some time). I started to investigate how this alert is generated and why I was getting it.

So this rule comes from Tao Yang’s OpsMgr Self Maintenance Management Pack: opsmgr-self-maintenance-management-pack
I then checked the MP XML to see how this is detected and found the following two SQL queries:

SELECT DaysToKeep FROM PartitionAndGroomingSettings WHERE ObjectName = 'StateChangeEvent'

SELECT DATEDIFF(d, MIN(TimeAdded), GETDATE()) AS [Current] FROM StateChangeEvent

The first query gets the grooming setting (in my case 14 days) and the second returns the number of days since the oldest entry (when I last ran it I got 22 days). A comparison is then done and an alert is generated if there are more days of data than the grooming setting plus one, i.e. data that should already have been groomed. So the alerts are legitimate.
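To reproduce the rule’s check outside of SCOM, the two queries can be run and compared with a few lines of PowerShell. This is only a sketch: it assumes Invoke-Sqlcmd is available, and the server/database names below need adjusting for your environment.

# Connection details are assumptions - adjust for your environment
$sqlServer = "SCOMSQL01"
$database  = "OperationsManager"

# Grooming setting for state change events (e.g. 14 days)
$daysToKeep = (Invoke-Sqlcmd -ServerInstance $sqlServer -Database $database -Query "SELECT DaysToKeep FROM PartitionAndGroomingSettings WHERE ObjectName = 'StateChangeEvent'").DaysToKeep

# Age in days of the oldest state change event still in the table
$oldestDays = (Invoke-Sqlcmd -ServerInstance $sqlServer -Database $database -Query "SELECT DATEDIFF(d, MIN(TimeAdded), GETDATE()) AS [Current] FROM StateChangeEvent").Current

# Warn if there is more data in the table than the grooming setting allows for
if ($oldestDays -gt ($daysToKeep + 1)) {
    Write-Warning "Stale state change events: oldest entry is $oldestDays days old, grooming keeps $daysToKeep days."
}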

The issue I have is that the stale state change events are not being removed automatically; they stay until I manually remove them (using the query supplied in the alert’s knowledge base).

I manually ran the stored procedure “Exec p_PartitioningAndGrooming”, but this did not clean up the events; it still reported 22 days.
My question was: what is responsible for cleaning up the state change events? Is it “Exec p_PartitioningAndGrooming”? If so, why wouldn’t it be working?

When I check the table after the stored procedure runs I can see that events do get groomed, and there are no errors when it runs.

I did some more research and found a comment Kevin Holman made about the query that manually removes these events (see: useful-operations-manager-2007-sql-queries):

“To clean up old StateChangeEvent data for state changes that are older than the defined grooming period, such as monitors currently in a disabled, warning, or critical state. By default we only groom monitor statechangeevents where the monitor is enabled and healthy at the time of grooming.”

So from what I understand, the SP that runs daily works as intended. It does not remove the state change events of monitors that are in a warning or error state, so this data will remain unless it is manually removed, or until the monitor goes green, at which point it will be removed the next time grooming runs automatically.

So this may or may not be an issue. To be on the safe side I have left this rule enabled and will look at creating some custom PRTG sensors to monitor this, and also at creating a task to execute grooming automatically. More on that later.
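As a starting point for that task, something like the snippet below could be scheduled on a management server (just a sketch reusing Invoke-Sqlcmd with assumed server/database names; and keep in mind, as noted above, the SP won’t touch state change events for monitors that aren’t healthy):

# Hypothetical scheduled-task payload: run the built-in grooming stored procedure
$sqlServer = "SCOMSQL01"          # assumption - your OperationsManager SQL instance
$database  = "OperationsManager"

# Kick off partitioning and grooming; allow a generous timeout as it can run for a while
Invoke-Sqlcmd -ServerInstance $sqlServer -Database $database -Query "EXEC p_PartitioningAndGrooming" -QueryTimeout 3600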

SCOM 2012 R2 – Issue: Console errors “Verification failed with 1 errors” when deleting overrides or importing a Management Pack

So a couple of weeks ago one of the Unix engineers raised an issue with me: when he tried to delete an override from the SCOM console he got the following error:

Note:  The following information was gathered when the operation was attempted.  The information may appear cryptic but provides context for the error.  The application will continue to run.: Verification failed with 3 errors:
——————————————————-
Error 1:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=a27e1d14-8ad4-56fb-da46-0c0994054e92,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=a27e1d14-8ad4-56fb-da46-0c0994054e92,LanguageID=ENA refers to an invalid sub element Filter.
——————————————————-
Error 2:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA refers to an invalid sub element Filter.
——————————————————-
Error 3:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=06522f67-c195-2dfa-c310-a0134b961fc4,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=06522f67-c195-2dfa-c310-a0134b961fc4,LanguageID=ENA refers to an invalid sub element Filter.
——————————————————-
So after spending a week troubleshooting and researching, I found a couple of other posts with a similar issue. But in those cases the error had a reference (an “ElementReference”) that actually existed. The problem I had was that none of the “ElementReference”s existed in the management pack, so I wasn’t able to work out where they were. Then I had a moment of clarity.

The first issue was that I was missing the key detail in what the error was telling me.

Error 1:
Found error in 1|XXXXX.XX.Unix.Management.Pack|1.0.0.0|DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA|| with message:
Element Info with Identity DisplayString,ElementReference=ef78d342-5a82-d629-224d-2e476003f6e1,LanguageID=ENA refers to an invalid sub element Filter.

ENA refers to an invalid sub element “Filter“.

Understanding the error: “Filter” is the sub element it is referring to. Realising that, I recalled I had added custom XML in the past where I wanted to filter the results of some Unix log monitoring. So some of the Unix log monitoring rules have the “Filter” below applied.

Example of the XML:

<ConditionDetection ID="Filter" TypeID="System!System.ExpressionFilter">
  <Expression>
    <RegExExpression>
      <ValueExpression>
        <XPathQuery Type="String">//row</XPathQuery>
      </ValueExpression>
      <Operator>DoesNotMatchRegularExpression</Operator>
      <Pattern>27037|00245|00227|227|245|00202</Pattern>
    </RegExExpression>
  </Expression>
</ConditionDetection>

I then searched the MP for all rules that have ID="Filter" and recorded their IDs (my example below):

ID="LogFileTemplate_8d97498787e44bb08d640e79a58c4919.Alert"
ID="LogFileTemplate_4b8e8427aa7342eab7e57b8f28b68240.Alert"
ID="LogFileTemplate_0cb6bd324de345e9b5f66ea17338c8ca.Alert"
ID="LogFileTemplate_8e0db25739e74ab1bc37afd5b48f53ee.Alert"
ID="LogFileTemplate_985eafabee8144789b026a3292405849.Alert"
ID="LogFileTemplate_f05331c6df034b63ab364900655235ba.Alert"
ID="LogFileTemplate_aba7d5aaf42b4c5589d124b24f33aa71.Alert"
ID="LogFileTemplate_d7b747d1708943b1a19cfa57960c692c.Alert"
ID="LogFileTemplate_77cc9a10c11e4ec2be13d755c4ce1f4d.Alert"
ID="LogFileTemplate_b8ec93656e75432eb6f9959311513f93.Alert"
ID="LogFileTemplate_e36f7c7d961e43f295c0c62e5668bf94.Alert"
ID="LogFileTemplate_42923834c6da47229a860b6b0bf51838.Alert"
ID="LogFileTemplate_a13fa86a600a4f079804953d87fbd686.Alert"
ID="LogFileTemplate_0cf825ac45214a6da85097a11f529a79.Alert"
ID="LogFileTemplate_3c292e133a08417fa96f00cc63fa7050.Alert"
ID="LogFileTemplate_1d0eefcfac314d1e9868459d44b6bd83.Alert"

I then searched through the language pack XML for SubElementID="Filter" and recorded the elements that don't match the list above.

These were the 3 I found (and I have 3 errors for this MP… interesting):
1. <DisplayString ElementID="LogFileTemplate_df92799e9f064db282f6abc981f1e5d3.Alert" SubElementID="Filter"> not found – checked and no filter condition exists in the configuration

2. <DisplayString ElementID="LogFileTemplate_4226bd92be9e4a10871cbc2fcac6da5d.Alert" SubElementID="Filter"> not found – checked and no filter condition exists in the configuration

3. <DisplayString ElementID="LogFileTemplate_d3d9588aa82a4a53a920d6d99a400940.Alert" SubElementID="Filter"> not found – checked and no filter condition exists in the configuration
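This cross-check can also be scripted instead of comparing the two lists by eye. The sketch below (the MP file path is only an example) collects the rule IDs that actually contain a ConditionDetection with ID="Filter", then flags any DisplayString with SubElementID="Filter" that points to a rule without one:

# Load the unsealed management pack XML - the path is an example
[xml]$mp = Get-Content 'C:\Temp\XXXXX.XX.Unix.Management.Pack.xml'

# Rule IDs that really contain a ConditionDetection with ID="Filter"
$rulesWithFilter = $mp.SelectNodes('//ConditionDetection[@ID="Filter"]') |
    ForEach-Object { $_.ParentNode.GetAttribute('ID') }

# DisplayStrings that reference SubElementID="Filter" but point at an element with no Filter
$mp.SelectNodes('//DisplayString[@SubElementID="Filter"]') |
    Where-Object { $rulesWithFilter -notcontains $_.GetAttribute('ElementID') } |
    ForEach-Object { "Orphaned reference: $($_.GetAttribute('ElementID'))" }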

I then deleted those references in the language XML section and imported the MP. It imported with no errors. Oh yeah!

Once I removed these references the management pack loaded, I was able to delete overrides, and everything was gravy.

So the cause of the error was that the Unix log XML configuration must have previously contained a Filter condition that was later removed, but the corresponding reference was not removed from the language component of the MP.

If you are also interested in finding out the name of the log monitoring rule, search for the ElementID under the Unix log file template; when you get to the “KnowledgeArticle” XML you will see the name/details of the rule. I suspect this may have been caused by the recent upgrade from SCOM 2012 SP1 to R2, or by the filter configuration being cleared from the console with a cleanup that was not 100%.

 

SCOM Reporting Forecasting/Trending PowerShell Report


 

The “issue”:

The previous monitoring toolset’s reporting tool had the ability to generate forecasting reports, but unfortunately SCOM does not have the capability out of the box to produce these reports or perform this analysis.

 

I investigated other solutions and did find a couple (one came at a significant cost, while another only provided graphs and not statistics/reports). These didn’t seem to fit the bill, so I decided to develop my own.

 

So I created a PowerShell script that gets the data, analyses it and produces HTML reports. The breakdown below shows how my script is structured, in case you would like to create the code for yourself (the best way to learn).

  • get a list of Unix servers from the specified resource pool (we use different resource pools for different gateways, but you can modify this to use a list of servers from a group)
  • get a list of Windows servers from the gateway server (we use gateway servers, but you can modify this to use a list of servers from a group)
  • run a SQL query to extract performance data from the data warehouse for the list of servers above, and store the results
  • analyse each performance counter/instance and work out the average and projections, then store these results for reporting (a minimal sketch of the projection step follows this list)
  • clean the data table (remove negative numbers and replace them with 0)
  • generate HTML reports based on the processed data
    • 0-45 days to upgrade (lists CPU/memory/disks that will run out of capacity in the next 45 days or have already run out)
    • 46-200 days to upgrade (lists CPU/memory/disks that will run out of capacity in the next 46-200 days)
    • full table of results for reference
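To give an idea of the projection step mentioned above, it boils down to fitting a linear trend to the daily samples and working out how many days until the counter hits its limit. A minimal sketch of that calculation with made-up sample data (not the actual report script) is below:

# Hypothetical daily samples for one counter instance (e.g. % disk space used)
$samples = @(
    [pscustomobject]@{ Day = 0;  Value = 61.0 }
    [pscustomobject]@{ Day = 30; Value = 66.5 }
    [pscustomobject]@{ Day = 60; Value = 72.2 }
    [pscustomobject]@{ Day = 90; Value = 78.0 }
)

# Least-squares slope and intercept (simple linear regression over Day vs Value)
$n     = $samples.Count
$sumX  = ($samples | Measure-Object Day   -Sum).Sum
$sumY  = ($samples | Measure-Object Value -Sum).Sum
$sumXY = ($samples | ForEach-Object { $_.Day * $_.Value } | Measure-Object -Sum).Sum
$sumXX = ($samples | ForEach-Object { $_.Day * $_.Day }   | Measure-Object -Sum).Sum

$slope     = ($n * $sumXY - $sumX * $sumY) / ($n * $sumXX - $sumX * $sumX)
$intercept = ($sumY - $slope * $sumX) / $n

# Project how many days after the last sample the counter reaches 100%
$threshold = 100
$lastDay   = ($samples | Measure-Object Day -Maximum).Maximum
if ($slope -gt 0) {
    $daysToFull = [math]::Round((($threshold - $intercept) / $slope) - $lastDay)
    "Projected to reach $threshold% in about $daysToFull days"
} else {
    "No upward trend - no projection"
}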

The report is then scheduled on a management server and is only run once per month.

Note: it takes around 6 hours to process, as this is a synchronous script and the runtime depends on the number of servers and objects that need to be analysed (some Unix servers have upwards of 30 file systems, which causes the script to take some time).

I originally had this all in one script but to speed up processing I split it into 2 (Windows and Unix).

I also investigated PowerShell workflows, but they have their limitations as they won’t run some of the PowerShell commands I use in my script.
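Another option, instead of workflows, would be plain background jobs: split the server list into batches and process each batch in parallel. A rough sketch of the idea, with placeholder names for the real work, is below:

# Placeholder for the per-server collection/analysis work the report script does
function Invoke-CapacityAnalysis {
    param([string[]]$Servers)
    foreach ($server in $Servers) {
        # ... query the data warehouse and calculate projections for $server ...
    }
}

# Split the full server list into batches and run each batch as a background job
$allServers = @('SERVER01', 'SERVER02', 'SERVER03', 'SERVER04')   # example list
$batchSize  = 2
$jobs = for ($i = 0; $i -lt $allServers.Count; $i += $batchSize) {
    $batch = $allServers[$i..([math]::Min($i + $batchSize, $allServers.Count) - 1)]
    Start-Job -ScriptBlock ${function:Invoke-CapacityAnalysis} -ArgumentList (,$batch)
}

# Wait for all batches to finish and collect whatever they return
$results = $jobs | Wait-Job | Receive-Job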

 

Download Capacity Report Script Here

Edit: I have moved this to github: https://github.com/Buzzcola81/scom2012forecasting

Please contact me if you have any questions.

Activate – Administration – Hands on training

Every business wants automation, especially when it comes to user provisioning and services. There are a number of tasks which IT/service desk staff do day in and day out, and we all know that almost all modern enterprise tools and services can be driven with scripts or APIs. So Activate comes in as a self-service portal that lets end users service their own IT needs.

After a couple of jam-packed days of training, I must say that the guys at Activate really put a lot of work into the labs and test VMs to create an environment that looked like a typical customer setup. The training delivered was 100% spot on with what we needed to know to drive the product for success, without the sales pitch. This was one of the best instructor-led courses I have been to. Hats off to Fuad Baloch for two great days of training!

 


My PRTG Dashboards (aka Maps)

So a few weeks ago our operations team got some new 42″ LCDs to be used for monitoring.

As they were just displaying the same old alert screens the team already sits in front of, I thought I’d give PRTG dashboards a go.

We use SCOM to monitor the PRTG core server and PRTG to monitor the SCOM core infrastructure, for which I have configured additional SQL sensors and auto-ticketing notifications.

It’s a great screen for me, as I can quickly see if any functions are broken in our monitoring tools.

 

After I had completed this I was thinking about what else I could do that would really pop. One of our clients, VICSES, has some interesting feeds that I can leverage.

After a little research I found a number of sources that I could use to get some info from:

  • VICSES RSS feed – for current emergency information and warnings
  • BOM (Australian Bureau of Meteorology) – live rain radar image feeds for all around the state
  • Weather widget – provides live weather stats for Melbourne and a 7-day outlook
  • Declared operations – where we have integration into our ticketing system, and the colour changes to yellow when we are in declared operations
  • Animated background – using GIFs

 

 

It took a couple of days to get all this working, and I found a number of limitations to be aware of:

  • Pages are very static and rely on the PRTG refresh to update
  • If you have enabled SSL on your PRTG instance you can’t link to unsecured (HTTP) sources (a mixed-content limitation of modern browsers rather than PRTG itself)
  • JavaScript and a number of other HTML elements don’t load
  • There is a 2 MB limit on background images by default (workaround: use a custom HTML element with a link to the image and set it to layer 1)

 

After understanding the limitations and developing some workarounds, you can still build some decent interactive dashboards that impress!

 

This created a bit of buzz when I loaded it onto the big screens. Overall I learned more about HTML, the limitations of PRTG dashboards, and a bit more about PRTG in general.

 

If you would like to know more about the HTML coding or sources I have used please feel free to contact me.

 

AWS Puppet Labs workshop Melbourne

I was able to attend a Puppet Labs workshop at the AWS offices in Melbourne on Monday. Being new to Puppet, I quickly found that it uses a declarative method of describing state rather than procedural code. Something new to me (as well as a lot of other AWS things I have recently been discovering).

There was a real mix of people attending the Puppet Labs workshop, from those who had used it for some time to the noobs (like myself) who were Windows-based IT guys.

We were run through the basic setup of a master node and then some of the post-configuration, to start looking into how Puppet takes control of files, packages, installations, services and other configuration. Again, the idea of letting go of control and letting the tools do the work for you is a repeated theme that a lot of engineers (and users) need to adopt if IT wants to become efficient (cattle, not pets).

The next step for me is to think of things I’d like to have under Puppet control in a Windows lab environment… I might need to do some more research and come back to that one.

I think most attendees got something out of the day. At the very least, working with AWS, spinning up servers and becoming more familiar with the interface is always good practice.
