Friday, October 19, 2012

[OpsMgr 2007R2][OpsMgr 2012] Troubleshooting management server and gateway performance

 

For OpsMgr 2007 and R2 - Root management server (RMS)

Configuration update bursts are caused by management pack imports and by discovery data. When system performance is slow, the most likely bottlenecks are, first, the CPU and, second, the OpsMgr installation disk I/O.

The RMS is responsible for generating and sending configuration files to all affected Health Services.

For workflow reloading (which is caused by new configuration on the RMS), the most likely bottlenecks are the same: the CPU first, and the OpsMgr installation disk I/O second. The RMS is responsible for reading the configuration file, for loading and initializing all workflows that run on it, and for updating the RMS HealthService store when the configuration file is updated on the RMS.

For local workflow activity bursts (which occur when agents change their availability), the most likely bottleneck is the CPU. If you find that the CPU is not working at maximum capacity, the next most likely bottleneck is the hard disk. The RMS is responsible for monitoring the availability of all agents by using RMS local workflows. The RMS also hosts distributed dependency monitors that use the disk.
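The check order described above (CPU first, then disk) can be written down as a small triage helper. This is only a sketch: the function name, the input values, and the 90 percent cutoff are assumptions made for illustration, not values documented for OpsMgr. The inputs are assumed to be averages that you have already measured, for example with Windows Performance Monitor.

```python
def likely_bottleneck(cpu_percent, disk_busy_percent, saturated_at=90.0):
    """Apply the CPU-first, disk-second check order from the text.

    saturated_at is an illustrative cutoff (an assumption), not a
    threshold documented for Operations Manager.
    """
    # The CPU is the most likely bottleneck, so it is checked first.
    if cpu_percent >= saturated_at:
        return "cpu"
    # Only if the CPU is not at maximum capacity do we look at the disk.
    if disk_busy_percent >= saturated_at:
        return "disk"
    # Neither resource looks saturated; more data is needed.
    return "inconclusive"


if __name__ == "__main__":
    print(likely_bottleneck(cpu_percent=97.0, disk_busy_percent=95.0))
    print(likely_bottleneck(cpu_percent=40.0, disk_busy_percent=95.0))
```

Checking the CPU first mirrors the order given for the RMS and management server roles. A result of "inconclusive" simply means neither measurement crossed the assumed cutoff.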

Management server

During a configuration update burst (caused by MP import and discovery), the typical bottlenecks are, first, the CPU and, second, the OpsMgr installation disk I/O. The management server is responsible for forwarding configuration files from the RMS to the target agents.

For operational data collection, bottlenecks are typically caused by the CPU. The disk I/O may also be at maximum capacity, but that is not as likely. The management server is responsible for decompressing and decrypting incoming operational data, and for inserting it into the Operational Database. It also sends acknowledgements (ACKs) back to the agents or gateways after it receives operational data, and uses disk queuing to temporarily store these outgoing ACKs. Lastly, the management server forwards monitor state changes to the RMS (by using a disk queue) for distributed dependency monitors.

Gateway

The gateway is both CPU-bound and I/O-bound. When the gateway is relaying a large amount of data, both the CPU and I/O operations may show high usage. Most of the CPU usage is caused by the decompression, compression, encryption, and decryption of the incoming data, and also by the transfer of that data. All data that the gateway receives from the agents is stored in a persistent queue on disk, to be read and forwarded to the management server by the gateway Health Service. This can cause heavy disk usage. The usage can be especially significant after the gateway has been temporarily offline, because it must then handle the accumulated data that the agents generated and tried to send while the gateway was offline.
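To get a rough feel for how long a gateway stays busy after an outage, the backlog can be estimated with simple arithmetic. The sketch below rests on stated assumptions: every agent produces data at the same steady rate, nothing is dropped while the gateway is offline, and the gateway forwards at a fixed rate. All names and numbers are hypothetical and are not OpsMgr measurements.

```python
def backlog_drain_seconds(agents, bytes_per_agent_per_sec,
                          offline_seconds, forward_bytes_per_sec):
    """Estimate how long the gateway needs to work off accumulated data.

    Simplified model (assumptions): constant per-agent data rate, no
    data dropped during the outage, constant forwarding rate.
    Returns None if the gateway cannot keep up with live traffic.
    """
    # Data still arriving from agents while the backlog is drained.
    incoming = agents * bytes_per_agent_per_sec
    # Data the agents accumulated while the gateway was offline.
    backlog = incoming * offline_seconds
    # Only the capacity left over after live traffic reduces the backlog.
    spare = forward_bytes_per_sec - incoming
    if spare <= 0:
        return None
    return backlog / spare


if __name__ == "__main__":
    # Hypothetical example: 200 agents, 500 bytes/s each, one hour
    # offline, gateway forwards 400,000 bytes/s.
    print(backlog_drain_seconds(200, 500, 3600, 400_000))  # 1200.0
```

In this made-up example a one-hour outage takes about 20 minutes to clear, during which both the CPU and the disk queue are under sustained load. A real gateway will behave differently, so treat the result as an order-of-magnitude guide only.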

To troubleshoot the issue in this situation, collect the following information for each affected management server or gateway:
  • Exact Windows version, edition, and build number (for example, Windows Server 2003 Enterprise x64 SP2)
  • Number of processors
  • Amount of RAM
  • Drive that contains the Health Service State folder
  • Whether the antivirus software is configured to exclude the Health Service store

    Note For more information, click the following article number to view the article in the Microsoft Knowledge Base: 975931 (http://support.microsoft.com/kb/975931/) Recommendations for antivirus exclusions that relate to Operations Manager

  • RAID level (0, 1, 5, 0+1 or 1+0) for the drive that is used by the Health Service State
  • Number of disks used for the RAID
  • Whether battery-backed write cache is enabled on the array controller
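If you collect this information for several servers, it can help to keep it in one record per server so that missing items stand out. The following is a minimal sketch. The class name, field names, and sample values are made up for illustration and are not part of any OpsMgr tool.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ServerFacts:
    """One record per affected management server or gateway.

    Field names are hypothetical; they mirror the checklist above.
    A value of None means the item has not been collected yet.
    """
    windows_version: Optional[str] = None
    processor_count: Optional[int] = None
    ram_gb: Optional[float] = None
    health_service_state_drive: Optional[str] = None
    antivirus_excludes_store: Optional[bool] = None
    raid_level: Optional[str] = None
    raid_disk_count: Optional[int] = None
    battery_backed_write_cache: Optional[bool] = None


def missing_items(facts):
    """Return the names of checklist items that are still None."""
    return [f.name for f in fields(facts) if getattr(facts, f.name) is None]


if __name__ == "__main__":
    facts = ServerFacts(
        windows_version="Windows Server 2003 Enterprise x64 SP2",
        processor_count=4,
        antivirus_excludes_store=False,
    )
    print(missing_items(facts))
```

Using None for "not collected yet" keeps a recorded False (for example, antivirus exclusions not configured) distinct from an unanswered question.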

This posting is provided "AS IS" with no warranties.
