Troubleshooting the SLA in SCSM 2012 (SP1)

If you have already read my blog posts about SLA in SCSM 2012 (Object model, How it’s works and “Hidden” features and “pausing” SLA) then you know that the SLA process in SCSM 2012 is really complex and depends on many internal features of SCSM. We work in IT and know that sometimes things can go wrong even if no one touched them (or someone touched them but doesn’t want to talk about it). This article is an attempt to summarize information about issues with SLA in SCSM 2012 and how to troubleshoot and solve them.

Note: please re-read my blog posts about SLA, especially SLA in SCSM 2012. Part 2. How it’s works. This is absolutely required for troubleshooting SLA.

OK, let’s imagine the root problem: the Service Level tab is empty. People usually describe this situation as “SLA not working!!!”. First of all, let’s list the possible reasons:

  1. Wrong SLO configuration. The queue can have a wrong filter, the SLO can have the wrong queue selected and so on.
  2. The queue calculation process is stuck. This is the most common issue (from my point of view).
  3. The SLO workflow is broken. This issue isn’t discussed in this article because there can be only one reason for it: manual changes to the management pack with the SLO workflows.

Troubleshooting check list

If you have an empty Service Level tab then the first step is to check the queue. To do this:

  1. Open your SLO configuration and check the name of the queue(s).
  2. Open the properties of each queue and re-check the criteria. Tip: the queue’s properties can be opened directly from the SLO configuration.
  3. If the queue’s criteria are OK then find a work item that meets the queue’s criteria.
  4. Open this work item and navigate to the History tab.
  5. Find the history records with an empty “changed by” field.
  6. If there are no such records then your queue calculation process is broken. You can stop here and go to Troubleshooting queue calculation process.
  7. If you find such records then you must expand each record and check the “Item” column. One of the records must contain the name of your queue in this column.
  8. If you can’t find your queue in the records with an empty “changed by” field but other records with an empty “changed by” field exist then this is the worst scenario. It means that the queue calculation process works fine but something is wrong with the given queue. You must re-check the queue criteria twice before doing additional troubleshooting. Stop here and go to Troubleshooting the given queue\group.

If you find the queue in the history records then the next step is to check the SLO group. To do this:

  1. Using the same work item, open it and navigate to the History tab. Find the history records with an empty “changed by” field and check the “Item” column. One of the records must contain the name of the SLO object with the “SLO Group:” prefix.
  2. If you can’t find your SLO object in the records with an empty “changed by” field but other records with an empty “changed by” field exist then stop here and go to Troubleshooting the given queue\group.

If you find your SLO in the history records but the Service Level tab is still empty then something is totally wrong here. I’ve never seen such cases so I can’t give you any recommendations on how to solve this. But you should check the workflow engine and the SLO’s workflows.

Troubleshooting queue calculation process

The most common reason for a broken queue\group calculation process is a connection from the SCSM management server to the SQL database that was unavailable for a long time and was then restored. SCSM 2012 is able to restore the connection to the SQL database. For instance, if you rebooted the SQL server then (generally) you don’t need to restart the SCSM services: SCSM will try to restore the connection. But if your SQL database was unavailable for a long time then SCSM can’t restore a fully functional connection to the SQL database. The worst thing here is that I don’t know the exact time after which it’s totally broken, but I have seen the issue when the database was unavailable for about one hour. This is a usual situation for night maintenance, for instance. The second bad thing is that at first glance SCSM is working fine: workflows are running, you can use the console without problems, the portal is working and so on.

The normal scenario in case of lost connectivity to the SQL database is as follows:

1. SCSM loses the connection to the SQL database. It can log different events; the most common are:

Log Name:      Operations Manager
Source:        OpsMgr SDK Service
Date:          6/12/2013 2:56:40 AM
Event ID:      26330
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      SCSM12.scsmsolutions.local
Description:
The System Center Data Access service lost database connectivity.
Database name: ServiceManager
Server instance name: SQL01
Exception message: %ERROR MESSAGE FROM SQL SERVER%

Log Name:      Operations Manager
Source:        OpsMgr Config Service
Date:          6/12/2013 2:57:33 AM
Event ID:      29200
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SCSM12.scsmsolutions.local
Description:
OpsMgr Config Service has lost connectivity to the OpsMgr database, therefore it can not get any updates from the database. This may be a temporary issue that may be recovered from automatically. If the problem persists, it usually indicates a problem with the database. Reason:

%ERROR MESSAGE FROM SQL SERVER%

2. SCSM restores the connection to the SQL database. The event log must contain this event:

Log Name:      Operations Manager
Source:        Health Service Modules
Date:          6/12/2013 2:59:43 AM
Event ID:      31404
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      SCSM12.scsmsolutions.local
Description:
Group membership calculation recovered after retrying on a database error.

But sometimes, if the SQL database was unavailable for a long time, SCSM can’t recover all processes. The most common “alarm light” in this case is an error event with ID 26340 in the Operations Manager log:

Log Name:      Operations Manager
Source:        OpsMgr SDK Service
Date:          7/17/2013 3:43:39 PM
Event ID:      26340
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SCSM12.scsmsolutions.local
Description:
System Center Data Access Service and/or System Center Management is unresponsive because Authorization Manager is unable to recover from database errors. Please restart services System Center Data Access Service and System Center Management.

together with the absence of event 31404 (see above).
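To see these events side by side you can query the log from PowerShell. This is only a sketch: it assumes you run it on the SCSM management server, and the seven-day window is an arbitrary value I picked for illustration. The log name and event IDs are the ones shown above.

```powershell
# Sketch: list the connectivity-related events in chronological order.
# Assumption: the last 7 days is a wide enough window; adjust as needed.
$since = (Get-Date).AddDays(-7)

# 26330 / 29200 = connectivity lost, 31404 = recovered, 26340 = unresponsive
$ids = 26330, 29200, 31404, 26340

Get-WinEvent -FilterHashtable @{
    LogName   = 'Operations Manager'
    Id        = $ids
    StartTime = $since
} -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Select-Object TimeCreated, Id, ProviderName, LevelDisplayName
```

A 26330 or 29200 event that is never followed by 31404, or any 26340 event, is the pattern to look for.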

Now let’s take a closer look at troubleshooting the queue\group. The queue\group calculation process is handled by the Health Service. The first thing you should do is find the earliest events with ID 31410:

Log Name:      Operations Manager
Source:        Health Service Modules
Date:          7/19/2013 4:00:53 AM
Event ID:      31410
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      SCSM12.scsmsolutions.local
Description:
Starting new group membership calculation rule:
Subscription ID: f13e48ec-d255-4c80-9d18-b945c00406f5
Rule ID: fd0ee6c4-d6b4-d45a-94b9-e5b0e0bc97f4
Group ID: 2cdaf852-7b1d-fe64-2b84-40b439f3ee30
Group type name: WorkItemGroup.dd8723afe991483fb6cccbc284bd4b6a
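A quick way to pull the earliest of these events is the -Oldest switch of Get-WinEvent. The sketch below also extracts the “Group type name” line from the description; the limit of 20 events and the GroupTypeName column name are my own choices for illustration.

```powershell
# Sketch: show the earliest 31410 events still present in the log.
# Assumption: 20 events is enough for a first look.
Get-WinEvent -FilterHashtable @{
    LogName = 'Operations Manager'
    Id      = 31410
} -Oldest -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated,
        @{ Name = 'GroupTypeName'; Expression = {
            # Pull the value after "Group type name:" out of the description
            if ($_.Message -match 'Group type name:\s*(\S+)') { $Matches[1] }
        } }
```

If the command returns nothing, there are no events with ID 31410 in the log at all.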

Normally, these events must exist after each Health Service restart or after the group membership calculation has recovered (see the event above). If one of the following statements:

  1. There are no events with event ID 31410 at all
  2. There are no events with event ID 31404 between the last connectivity loss and the current time
  3. There are events with event ID 26340

is true then you must restart all of the SCSM services:

  • System Center Management Configuration (OMCFG)
  • System Center Management (HealthService)
  • System Center Data Access Service (OMSDK)
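The three checks can be written down as a small function. This is a sketch only: Test-ScsmRestartNeeded is a hypothetical helper name, it works on any objects that have Id and TimeCreated properties, and it treats events 26330 and 29200 as “connectivity lost” (my reading of the events shown earlier).

```powershell
# Hypothetical helper: returns $true when one of the three statements holds.
function Test-ScsmRestartNeeded {
    param(
        # Events from the Operations Manager log (need Id and TimeCreated)
        [object[]]$Events = @()
    )

    # 1. No "Starting new group membership calculation rule" events at all
    $started = @($Events | Where-Object { $_.Id -eq 31410 })
    if ($started.Count -eq 0) { return $true }

    # 3. "Unable to recover from database errors" was logged
    $unresponsive = @($Events | Where-Object { $_.Id -eq 26340 })
    if ($unresponsive.Count -gt 0) { return $true }

    # 2. No recovery event (31404) after the last connectivity loss
    $lastLoss = $Events |
        Where-Object { (26330, 29200) -contains $_.Id } |
        Sort-Object TimeCreated |
        Select-Object -Last 1
    if ($lastLoss) {
        $recovered = @($Events | Where-Object {
            $_.Id -eq 31404 -and $_.TimeCreated -gt $lastLoss.TimeCreated
        })
        if ($recovered.Count -eq 0) { return $true }
    }

    return $false
}
```

You could feed it the output of Get-WinEvent for the Operations Manager log filtered to IDs 26330, 29200, 26340, 31404 and 31410. Keep in mind that if the log has wrapped since the last restart, the first check can fire even on a healthy server.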

After the HealthService is restarted you should check the Operations Manager log for events with ID 31410 (“Starting new group membership calculation rule”). In most cases restarting the HealthService helps to solve or troubleshoot the queue calculation process: either the queue calculation process starts fine or the HealthService logs an error event if it can’t start the group calculation. In the latter case you just need to read the exception and fix its cause (or open a support case if you can’t fix it).
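The restart and the check afterwards can be done in one go. Again this is a sketch, not a tested procedure: the service names are the ones from the list above, the two-minute wait is a guess, and I don’t know whether the order of the services matters.

```powershell
# Sketch: restart the SCSM services, then look at what was logged afterwards.
$restartTime = Get-Date

# -Force lets the restart proceed even if a service has dependent services.
Restart-Service -Name OMCFG, HealthService, OMSDK -Force

# Assumption: two minutes is enough for the group calculation rules to start.
Start-Sleep -Seconds 120

# Show 31410 events and any errors (Level 2) logged since the restart.
Get-WinEvent -FilterHashtable @{
    LogName   = 'Operations Manager'
    StartTime = $restartTime
} -ErrorAction SilentlyContinue |
    Where-Object { $_.Id -eq 31410 -or $_.Level -eq 2 } |
    Sort-Object TimeCreated |
    Format-List TimeCreated, Id, LevelDisplayName, Message
```

Run it from an elevated PowerShell session on the management server, and expect the console to lose its connection while the services restart.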

Troubleshooting the given queue\group

If you have issues with only a given queue\group (and are absolutely sure that the configuration of the group\queue is fine) then the only thing I can recommend is:

  1. Get the internal name of the queue\group using PowerShell. For instance, to get the internal name of the “Incidents – High urgent” queue:
    Get-SCSMClass | Where-Object { $_.DisplayName -eq "Incidents - High urgent" } | Select-Object Name

  2. Restart the HealthService.
  3. Check the Operations Manager log for an event with ID 31410 and the name of the queue\group inside the description:
    Log Name:      Operations Manager
    Source:        Health Service Modules
    Date:          7/19/2013 5:44:59 AM
    Event ID:      31410
    Task Category: None
    Level:         Information
    Keywords:      Classic
    User:          N/A
    Computer:      SCSM12.scsmsolutions.local
    Description:
    Starting new group membership calculation rule:
     Subscription ID: 70755663-5337-43dd-ac5c-985e2bc6f036
     Rule ID: 5dafa47d-f6ad-f747-1dec-6a61129f2b02
     Group ID: f8f4525d-4b32-655a-21d6-d6c6363d7c51
     Group type name: WorkItemGroup.47864e052af944dcb981c9bf18ffda9d

If you can’t find it then try searching the Operations Manager log for the name of the queue\group (see step 1). Almost always you will find an error\warning event with a description of what is wrong with your queue\group.
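The search itself can be scripted. This sketch assumes Get-SCSMClass (the same cmdlet as in step 1) is available in the session and reuses the example queue name from step 1; it scans the whole log, so it can take a while on a large one.

```powershell
# Assumption: Get-SCSMClass from step 1 is available in this session.
$internalName = Get-SCSMClass |
    Where-Object { $_.DisplayName -eq 'Incidents - High urgent' } |
    Select-Object -First 1 -ExpandProperty Name

if (-not $internalName) { throw 'Queue not found, check the display name.' }

# Scan the log for any event whose description mentions the internal name.
Get-WinEvent -LogName 'Operations Manager' -ErrorAction SilentlyContinue |
    Where-Object { $_.Message -like "*$internalName*" } |
    Sort-Object TimeCreated |
    Format-List TimeCreated, Id, LevelDisplayName, Message
```

Events with the level Error or Warning in the output are the ones worth reading first.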
