Friday, October 19, 2012

[OpsMgr 2007R2][OpsMgr 2012] Troubleshooting SQL Server Performance on Operational Database and Data Warehouse

Troubleshooting SQL Server Performance on Operational Database and Data Warehouse

 

Operational Database (OperationsManager)

For the OperationsManager database, the most likely bottleneck is the disk array. If the disk array is not at maximum I/O capacity, the next most likely bottleneck is the CPU. The database will experience occasional slowdowns and operational "data storms” (very high incidences of events, alerts, and performance data or state changes that persist for a relatively long time). A short burst typically does not cause any significant delay for an extended period of time.

During operational data insertion, the database disks are primarily used for writes. CPU use is usually caused by SQL Server churn. This may occur when you have large and complex queries, heavy data insertion, and the grooming of large tables (which, by default, occurs at midnight). Typically, the grooming of even large Events and Performance Data tables does not consume excessive CPU or disk resources. However, the grooming pf the Alert and State Change tables can be CPU-intensive for large tables.

The database is also CPU-bound when it handles configuration redistribution bursts, which are caused by MP imports or by a large instance space change. In these cases, the Config service queries the database for new agent configuration. This ususally causes CPU spikes to occur on the database before the service sends the configuration updates to the agents.

Data Warehouse (OperationsManagerDW)

For the OperationsManagerDW database, the most likely bottleneck is the disk array. This usually occurs because of very large operational data insertions. In these cases, the disks are mostly busy performing writes. Usually, the disks are performing few reads, except to handle manually-generated Reporting views because these run queries on the data warehouse.

CPU usage is usually caused by SQL Server churn. CPU spikes may occur during heavy partitioning activity (when tables become very large and then get partitioned), the generation of complex reports, and large amounts of alerts in the database, with which the data warehouse must constantly sync up.

General troubleshooting

To troubleshoot the issue in this situation, collect the following information for each affected management server or gateway:
  • Exact Windows version, edition, and build number (for example, Windows Server 2003 Enterprise x64 SP2)
  • Number of processors
  • Amount of RAM
  • Amount of memory that is allocated to SQL Server
  • Whether SQL Server is 32-bit and is AWE enabled

    Note You can find most of this information in SQL Server Management Studio or in SQL Server Enterprise Manager. To do this, open the Properties window of the server, and then click the General and Memory tabs. The General tab includes the SQL Server version, the Windows version, the platform, the amount of RAM, and the number of processors. The Memory tab includes the memory that is allocated to SQL Server. In Microsoft SQL Server 2008 and in Microsoft SQL Server 2005, the Memory tab also includes the AWE option. To determine whether AWE is enabled in Microsoft SQL Server 2000, run the following command in the Microsoft SQL Query Analyzer:
    sp_configure 'show advanced options', 1
    RECONFIGURE
    GO
    sp_configure 'awe enabled'
    The returned values for config_value and for run_value will be 1 if AWE is enabled.

    If OS is 32-bit and RAM is 4 GB or greater, check whether the /pae or /3gb switches exist in the Boot.ini. file. These options could be configured incorrectly if the server was originally installed by having 4 GB or less of RAM, and if the RAM was later upgraded.

    For 32-bit servers that have 4 GB of RAM, the /3gb switch in Boot.ini increases the amount of memory that SQL Server can address (from 2 to 3 GB). For 32-bit servers that have more than 4 GB of RAM, the /3gb switch in Boot.ini could actually limit the amount of memory that SQL Server can address. For these systems, add the /pae switch to Boot.ini, and then enable AWE in SQL Server.

    On a multi-processor system, check the Max Degree of Parallelism (MAXDOP) setting. In SQL Server 2008 and in SQL Server 2005, this option is on the Advanced tab in the Properties dialog box for the server. To determine this setting in SQL Server 2000, run the following command in SQL Query Analyzer:

    sp_configure 'show advanced options', 1
    RECONFIGURE
    GO
    sp_configure 'max degree of parallelism'


    The default value is 0, which means that all available processors will be used. A setting of 0 is fine for servers that have eight or fewer processors. For servers that have more than eight processors, the time that it takes SQL Server to coordinate the use of all processors may be counterproductive. Therefore, for servers that have more than eight processors, you generally should set Max Degree of Parallelism to a value of 8. To do this, run the following command in SQL Query Analyzer:

    sp_configure 'show advanced options', 1
    GO
    RECONFIGURE WITH OVERRIDE
    GO
    sp_configure 'max degree of parallelism', 8
    GO
    RECONFIGURE WITH OVERRIDE
    GO
  • Drive letters that contain data warehouse or Ops and Tempdb files
  • Whether the antivirus software is configured to exclude SQL data and log files (Antivirus software cannot scan SQL database files. Trying to do this can degrade performance.)
  • Amount of free space on drives that contain data warehouse or Ops and Tempdb files
  • Storage type (SAN or local)
  • RAID level (0, 1, 5, 0+1 or 1+0) for drives that are used by SQL Server
  • If SAN storage us used: amount of spindles on each LUN that is used by SQL Server
  • In OpsMgr 2007 SP1: whether hotfix 969130 (data warehouse event grooming) or SP1 hotfix rollup 971541 is applied
  • If the converted Exchange 2007 managment pack is being used or has ever been used: amount of rows in the LocalizedText table in the Ops DB and in the EventPublisher table in the data warehouse database

    Note To determine the row amounts, run the following commands: 
    USE OperationsManager SELECT COUNT(*) FROM LocalizedText
    USE OperationsManagerDW SELECT COUNT(*) FROM EventPublisher

Counters to identify memory pressure

  • MSSQL$<instance>: Buffer Manager: Page Life expectancy – How long pages persist in the buffer pool. If this value is below 300 seconds, it may indicate that the server could use more memory. It could also result from index fragmentation.
  • MSSQL$<instance>: Buffer Manager: Lazy Writes/sec – Lazy writer frees space in the buffer by moving pages to disk. Generally, the value should not consistently exceed 20 writes per second. Ideally, it would be close to zero.
  • Memory: Available Mbytes - Values below 100 MB may indicate memory pressure. Memory pressure is clearly present when this amount is less than 10 MB.
  • Process: Private Bytes: _Total: This is the amount of memory (physical and page) being used by all processes combined.
  • Process: Working Set: _Total: This is the amount of physical memory being used by all processes combined. If the value for this counter is significantly below the value for Process: Private Bytes: _Total, it indicates that processes are paging too heavily. A difference of more than 10% is probably significant.

Counters to identify disk pressure

Capture these Physical Disk counters for all drives that contain SQL data or log files:
  • % Idle Time: How much disk idle time is being reported. Anything below 50 percent could indicate a disk bottleneck.
  • Avg. Disk Queue Length: This value should not exceed 2 times the number of spindles on a LUN. For example, if a LUN has 25 spindles, a value of 50 is acceptable. However, if a LUN has 10 spindles, a value of 25 is too high. You could use the following formulas based on the RAID level and number of disks in the RAID configuration:
    • RAID 0: All of the disks are doing work in a RAID 0 set
    • Average Disk Queue Length <= # (Disks in the array) *2
    • RAID 1: half the disks are “doing work”; therefore, only half of them can be counted toward Disks Queue
    • Average Disk Queue Length <= # (Disks in the array/2) *2
    • RAID 10: half the disks are “doing work”; therefore, only half of them can be counted toward Disks Queue
    • Average Disk Queue Length <= # (Disks in the array/2) *2
    • RAID 5: All of the disks are doing work in a RAID 5 set
    • Average Disk Queue Length <= # Disks in the array *2
    • Avg. Disk sec/Transfer: The number of seconds it takes to complete one disk I/O
    • Avg. Disk sec/Read: The average time, in seconds, of a read of data from the disk
    • Avg. Disk sec/Write: The average time, in seconds, of a write of data to the disk

      Note The last three counters in this list should consistently have values of approximately .020 (20 ms) or lower and should never exceed.050 (50 ms). The following are the thresholds that are documented in the SQL Server performance troubleshooting guide:
      • Less than 10 ms: very good
      • Between 10 - 20 ms: okay
      • Between 20 - 50 ms: slow, needs attention
      • Greater than 50 ms: serious I/O bottleneck
    • Disk Bytes/sec: The number of bytes being transferred to or from the disk per second
    • Disk Transfers/sec: The number of input and output operations per second (IOPS)
    When % Idle Time is low (10 percent or less), this means that the disk is fully utilized. In this case, the last two counters in this list (“Disk Bytes/sec” and “Disk Transfers/sec”) provide a good indication of the maximum throughput of the drive in bytes and in IOPS, respectively. The throughput of a SAN drive is highly variable, depending on the number of spindles, the speed of the drives, and the speed of the channel. The best bet is to check with the SAN vendor to find out how many bytes and IOPS the drive should support. If % Idle Time is low, and the values for these two counters do not meet the expected throughput of the drive, engage the SAN vendor to troubleshoot.
The following links provide deeper insight into troubleshooting SQL Server performance:

OpsMgr Performance counters

The following sections describe the performance counters that you can use to monitor and troubleshoot OpsMgr performance.
Gateway server role
  • Overall performance counters: These counters indicate the overall performance of the gateway:
    • Processor(_Total)\% Processor Time
    • Memory\% Committed Bytes In Use
    • Network Interface(*)\Bytes Total/sec
    • LogicalDisk(*)\% Idle Time
  • LogicalDisk(*)\Avg. Disk Queue LengthOpsMgr process generic performance counters: These counters indicate the overall performance of OpsMgr processes on the gateway:
    • Process(HealthService)\%Processor Time
    • Process(HealthService)\Private Bytes (depending on how many agents this gateway is managing, this number may vary and could be several hundred megabytes)
    • Process(HealthService)\Thread Count
    • Process(HealthService)\Virtual Bytes
    • Process(HealthService)\Working Set
    • Process(MonitoringHost*)\% Processor Time
    • Process(MonitoringHost*)\Private Bytes
    • Process(MonitoringHost*)\Thread Count
    • Process(MonitoringHost*)\Virtual Bytes
  • Process(MonitoringHost*)\Working SetOpsMgr specific performance counters: These counters are OpsMgr specific counters that indicate the performance of specific aspects of OpsMgr on the gateway:
    • Health Service\Workflow Count
    • Health Service Management Groups(*)\Active File Uploads: The number of file transfers that this gateway is handling. This represents the number of management pack files that are being uploaded to agents. If this value remains at a high level for a long time, and there is not much management pack importing at a given moment, these conditions may generate a problem that affects file transfer.
    • Health Service Management Groups(*)\Send Queue % Used: The size of persistent queue. If this value remains higher than 10 for a long time, and it does not drop, this indicates that the queue is backed up. This condition is cause by an overloaded OpsMgr system because the management server or database is too busy or is offline.
    • OpsMgr Connector\Bytes Received: The number of network bytes received by the gateway – i.e., the amount of incoming bytes before decompression.
    • OpsMgr Connector\Bytes Transmitted: The number network bytes sent by the gateway – i.e., the amount of outgoing bytes after compression.
    • OpsMgr Connector\Data Bytes Received: The number of data bytes received by the gateway – i.e., the amount of incoming data after decompression.
    • OpsMgr Connector\Data Bytes Transmitted: The number of data bytes sent by the gateway – i.e. the amount of outgoing data before compression.
    • OpsMgr Connector\Open Connections: The number of connections that are open on gateway. This number should be same as the number of agents or management servers that are directly connected to the gateway.

Management server role

Overall performance counters: These counters indicate the overall performance of the management server:
  • Processor(_Total)\% Processor Time
  • Memory\% Committed Bytes In Use
  • Network Interface(*)\Bytes Total/sec
  • LogicalDisk(*)\% Idle Time
LogicalDisk(*)\Avg. Disk Queue LengthOpsMgr process generic performance counters: These counters indicate the overall performance of OpsMgr processes on the management server:
  • Process(HealthService)\% Processor Time
  • Process(HealthService)\Private Bytes – Depending on how many agents this management server is managing, this number may vary, and it could be several hundred megabytes.
  • Process(HealthService)\Thread Count
  • Process(HealthService)\Virtual Bytes
  • Process(HealthService)\Working Set
  • Process(MonitoringHost*)\% Processor Time
  • Process(MonitoringHost*)\Private Bytes
  • Process(MonitoringHost*)\Thread Count
  • Process(MonitoringHost*)\Virtual Bytes
Process(MonitoringHost*)\Working SetOpsMgr specific performance counters: These counters are OpsMgr specific counters that indicate the performance of specifric aspects of OpsMgr on the management server:
  • Health Service\Workflow Count: The number of workflows that are running on this management server.
  • Health Service Management Groups(*)\Active File Uploads: The number of file transfers that this management server is handling. This represents the number of management pack files that are being uploaded to agents. If this value remains at a high level for a long time, and there is not much management pack importing at a given moment, these conditions may generate a problem that affects file transfer.
  • Health Service Management Groups(*)\Send Queue % Used: The size of the persistent queue. If this value remains higher than 10 for a long time, and it does not drop, this indicates that the queue is backed up. This condition is cause by an overloaded OpsMgr system because the OpsMgr system (for example, the root management server) is too busy or is offline.
  • Health Service Management Groups(*)\Bind Data Source Item Drop Rate: The number of data items that are dropped by the management server for database or data warehouse data collection write actions. When this counter value is not 0, the management server or database is overloaded because it can’t handle the incoming data item fast enough or because a data item burst is occurring. The dropped data items will be resent by agents. After the overload or burst situation is finished, these data items will be inserted into the database or into the data warehouse.
  • Health Service Management Groups(*)\Bind Data Source Item Incoming Rate: The number of data items received by the management server for database or data warehouse data collection write actions.
  • Health Service Management Groups(*)\Bind Data Source Item Post Rate: The number of data items that the management server wrote to the database or data warehouse for data collection write actions.
  • OpsMgr Connector\Bytes Received: The number of network bytes received by the management server – i.e., the size of incoming bytes before decompression.
  • OpsMgr Connector\Bytes Transmitted: The number of network bytes sent by the management server – i.e., the size of outgoing bytes after compression.
  • OpsMgr Connector\Data Bytes Received: The number of data bytes received by the management server – i.e., the size of incoming data after decompress)
  • OpsMgr Connector\Data Bytes Transmitted: The number of data bytes sent by the management server – i.e., the size of outgoing data before compression)
  • OpsMgr Connector\Open Connections: The number of connections open on management server. It should be same as the number of agents or root management server that are directly connected to it.
  • OpsMgr database Write Action Modules(*)\Avg. Batch Size: The number of a data items or batches that are eceived by database write action modules. If this number is 5,000, a data item burst is occurring.
  • OpsMgr DB Write Action Modules(*)\Avg. Processing Time: The number of seconds a database write action modules takes to insert a batch into database. If this number is often greater than 60, a database insertion performance issue is occurring.
  • OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms: The number of milliseconds for data warehouse write action to insert a batch of data items into a data warehouse.
  • OpsMgr DW Writer Module(*)\Avg. Batch Size: The average number of data items or batches received by data warehouse write action modules.
  • OpsMgr DW Writer Module(*)\Batches/sec: The number of batches received by data warehouse write action modules per second.
  • OpsMgr DW Writer Module(*)\Data Items/sec: The number of data items received by data warehouse write action modules per second.
  • OpsMgr DW Writer Module(*)\Dropped Data Item Count: The number of data items dropped by data warehouse write action modules.
  • OpsMgr DW Writer Module(*)\Total Error Count: The number of errors that occurred in a data warehouse write action module.

Root management server role

Overall performance counters: These counters indicate the overall performance of the root management server:
  • Processor(_Total)\% Processor Time
  • Memory\% Committed Bytes In Use
  • Network Interface(*)\Bytes Total/sec
  • LogicalDisk(*)\% Idle Time
LogicalDisk(*)\Avg. Disk Queue LengthOpsMgr process generic performance counters: These counters indicate the overall performance of OpsMgr processes on the root management server:
  • Process(HealthService)\% Processor Time
  • Process(HealthService)\Private Bytes (Depending on how many agents this root management server is managing, this number may vary and could be several hundred Megabytes.)
  • Process(HealthService)\Thread Count
  • Process(HealthService)\Virtual Bytes
  • Process(HealthService)\Working Set
  • Process(MonitoringHost*)\% Processor Time
  • Process(MonitoringHost*)\Private Bytes
  • Process(MonitoringHost*)\Thread Count
  • Process(MonitoringHost*)\Virtual Bytes
  • Process(MonitoringHost*)\Working Set
  • Process(Microsoft.Mom.ConfigServiceHost)\% Processor Time
  • Process(Microsoft.Mom.ConfigServiceHost)\Private Bytes
  • Process(Microsoft.Mom.ConfigServiceHost)\Thread Count
  • Process(Microsoft.Mom.ConfigServiceHost)\Virtual Bytes
  • Process(Microsoft.Mom.ConfigServiceHost)\Working Set
  • Process(Microsoft.Mom.Sdk.ServiceHost)\% Processor Time
  • Process(Microsoft.Mom.Sdk.ServiceHost)\Private Bytes
  • Process(Microsoft.Mom.Sdk.ServiceHost)\Thread Count
  • Process(Microsoft.Mom.Sdk.ServiceHost)\Virtual Bytes
Process(Microsoft.Mom.Sdk.ServiceHost)\Working SetOpsMgr specific performance counters: These counters are OpsMgr specific counters that indicate the performance of specific aspects of OpsMgr on the root management server:
  • Health Service\Workflow Count: The number of workflows that are running on this root management server.
  • Health Service Management Groups(*)\Active File Uploads: The number of file transfers that this root management server is handling – i.e., configuration and management pack uploads to agents. If this value remains higher for a long time, and it does not drop, this indicates that not much discovery or management pack is being imported at the moment, and that there could be a problem in file transfer.
  • Health Service Management Groups(*)\Send Queue % Used: The size of the persistent queue.
  • Health Service Management Groups(*)\Bind Data Source Item Drop Rate: The number of data items dropped by the root management server for database or data warehouse data collection write actions. When this counter value is not 0, the root management server or database is overloaded because it can’t handle the incoming data item fast enough or because a data item burst is occurring. The dropped data items will be resent by agents. After the overloaded or burst situation is finished, these data items will be inserted into the database or into the data warehouse.
  • Health Service Management Groups(*)\Bind Data Source Item Incoming Rate: The number of data items received by the root management server for database or data warehouse data collection write actions.
  • Health Service Management Groups(*)\Bind Data Source Item Post Rate: The number of data items that the root management server wrote to the database or to the data warehouse for database or data warehouse data collection write actions.
  • OpsMgr Connector\Bytes Received: The number of network bytes received by the root management server – i.e., the size of incoming bytes before decompress.
  • OpsMgr Connector\Bytes Transmitted: The number of network bytes sent by the root management server – i.e., the size of outgoing bytes after compression.
  • OpsMgr Connector\Data Bytes Received: The number of data bytes received by the root management server – i.e., the size of incoming data after decompression.
  • OpsMgr Connector\Data Bytes Transmitted: The number of data bytes sent by the root management server – i.e., the size of outgoing data before compression.
  • OpsMgr Connector\Open Connections: The number of connections open on the root management server. It should be same as the number of agents or management servers that are directly connected to it.
  • OpsMgr Config Service\Number Of Active Requests: The number of configuration or management pack requests that are being processing by the Config service.
  • OpsMgr Config Service\Number Of Queued Requests: The number of queued config or management pack requests sent to the Config service. If it is high for a long time, the instance space or management pack space is changing too frequently.
  • OpsMgr SDK Service\Client Connections: The number of SDK connections.
  • OpsMgr DB Write Action Modules(*)\Avg. Batch Size: The number of a data items or batches that are received by database write action modules. If this number is 5,000, a data item burst is occurring.
  • OpsMgr DB Write Action Modules(*)\Avg. Processing Time: The number of seconds that a database write action modules takes to insert a batch into a database. If this number is often larger than 60, a database insertion performance issue is occurring.
  • OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms: The number of milliseconds that it takes for a data warehouse write action to insert a batch of data items into a data warehouse.
  • OpsMgr DW Writer Module(*)\Avg. Batch Size: The average number of data items or batches that are received by data warehouse write action modules.
  • OpsMgr DW Writer Module(*)\Batches/sec: The number of batches received by data warehouse write action modules per second.
  • OpsMgr DW Writer Module(*)\Data Items/sec: The number of data items received by data warehouse write action modules per second.
  • OpsMgr DW Writer Module(*)\Dropped Data Item Count: The number of data items that are dropped by data warehouse write action modules)
  • OpsMgr DW Writer Module(*)\Total Error Count (This is number of errors happened in data warehouse write action modules.

This posting is provided "AS IS" with no warranties.

No comments:

Post a Comment