DCE performance troubleshooting guide

EcoStruxureIT · ‎2019-11-20

This guide covers some of the symptoms you might see when a DCE server has a performance bottleneck, tools you can use to better understand the resource that is being strained, and steps to try to mitigate the issue.

Symptoms of performance problems

There are several symptoms you might see while you interact with DCE that could suggest that there is a potential performance problem in your environment. This list is by no means exhaustive. Some common symptoms are:

Missed sensor update values

The nbc.xml log from the DCE server contains ERROR level messages about dropped sensor updates coming from com.apc.isxc.vb.listeners.sensor.impl.SensorQProcessorRunnable or com.netbotz.server.services.repository.impl.RepositoryEventServiceImpl

Log in to the DCE web client and click Logs in the upper right corner to view the nbc.xml log.

Delay in receiving alarm data

Alarms come into the system significantly after they were triggered on the monitored device.

Server hang, crash, or timeouts

This error message is displayed on the DCE server: Hung_task_timeout error

Contact technical support to gather capture server logs.

See https://www.apc.com/us/en/faqs/index?page=content&id=FA303596

Performance analysis from DCE

Top

Top is a standard Linux diagnostic tool used to monitor system performance. Direct access to the DCE server is not allowed. Contact technical support to capture server logs that include a top_output.

Note: Prior to DCE 7.7, the top output from captured server logs is averaged across all CPU cores. Starting with 7.7, the output per core is available, which is more insightful.

Within the top output, support looks at a few different values:

CPU load average

This lists a load average for the last one, five, and fifteen-minute period. If this number is abnormally high relative to the number of cores you have defined for the system, it is a good indicator that the system is running with a lot of CPU load. The exact cause of the CPU load won’t be clear from this data. The system could be CPU starved if this value remains high for an extended period of time.

It is expected that this value is elevated for some period of time after a system reboot or during a large discovery. You divide this value by the number of cores, and then multiply by 100 to get a percent utilization. Each physical core counts as 1; a hyperthreaded core counts as ½.

For example, an 8 core / 8 thread virtual machine should be able to sustain a load average of 8.0 without being considered oversubscribed. If you are using an 8 core, 16 thread configuration, your acceptable load average is more like 12.0 because not all 16 threads are backed by physical cores.

Mitigation in this case consists of either reducing the load on your DCE (fewer devices, longer poll period) or allocating more CPU resources to your DCE to get your load average to a more acceptable level. Make sure to review the DCE sizing guide for insight on the best starting values for CPU configuration to use based on the system workload.

Wait average

This represents the amount of time your system is stopped waiting for the underlying storage device to service requests. DCE is extremely sensitive to IO path delays, so even a slightly elevated value for wait average that persists for any extended period of time can be an issue for the system.

Ideally you want to see this value listed by CPU core. If you see any one individual core with a %wa continually over 20, your storage is likely not keeping up with DCE. If the system is allowed to stay in this state for an extended period of time, you usually start to see the missed sensor update processing symptom listed above. Bear in mind that if you are reviewing the average top output instead of by core, this value can deceptively appear to be much lower due to averaging and the number of cores in the system.

Mitigation requires a deeper dive into your storage path. If you are using network storage, you want to review the latency and utilization of the storage array. If using local ESXi storage, you can review the Host performance data in VMWare. Usually, either reducing the load on the DCE by decreasing device count or increasing poll period will help. If the storage is truly subpar, upgrading to SSDs, removing other load from the storage system, or improving the network path between the DCE and storage may be required. Reference the DCE sizing guide for more details on appropriate storage sizing.

SensorQstats

This is a statistic that the DCE server keeps track of. It represents the amount of sensor processing the server is doing every hour. This value can be monitored at:

http://<dce server ip>/nbc/compress/support/sensorqstats

The dataset can also be retrieved by technical support with a capture server logs gather request.

Regardless of where you view the data, this statistic will publish once an hour every hour. This metric is good to monitor because it shows whether the DCE is keeping up with the current work load or if it is falling behind. These values are of particular interest:

Processed

This is the number of sensor updates that the server has processed in the last hour. This value is directly impacted by the number of devices in your system, your poll period, and the number of sensor changes that are occurring.

This value represents the total number of unique events that the system completed within that 1-hour period of time. It is best observed during steady system processing. Events like discovering a large quantity of new devices can skew this number for a period or two. Use this value when you review the DCE sizing guide to determine CPU / RAM / Storage sizing.

Dropped

This value should always be zero on a healthy system. Any non-zero value indicates a sensor data point that was dropped because a component of the system cannot keep up. When this value is not zero, we often see %wa elevated in top output.

Remember, DCE is very intolerant of storage latency. If the value for dropped is a recurring non-zero value, some amount of data is constantly lost. If there is a non-zero value occasionally, look into the system during those times; it is likely running near the edge of its capabilities and is pushed beyond its limits. Events such as a large alarm storm, a discovery pulling in a large number of devices, or similar high load events can all push the system temporarily into this state.

A properly configured system should always have zero drops. Anything dropped will be lost forever, so it’s important to monitor and adjust resources accordingly to prevent this.

Remaining

This value represents the amount of sensor data still in the queue to be processed when the qstats report was run. This is not dropped data; it is data that had not yet finished being processed. On smaller systems, this will likely always be zero. As the workload increases, this value could start to become non-zero.

By itself, having some non-zero values here is not cause for alarm. If you are regularly seeing non-zero values, or the value is growing in size every hour, it’s a sign that the system is starting to have trouble keeping up.

Performance analysis from DCE VM

The primary focus of this section is specific to DCE run as a virtual machine. Information that is not VMware-centric technically applies to DCE physical servers also.

Sometimes, there are delays for reasons not readily seen from the DCE or DCE OS point of view. In these cases, it helps to review the performance data from the VMware side of things to see if there are any performance issues.

Resource Limits

It is good to verify whether any resource limits are defined for the DCE virtual machine. Resource limits are a throttling mechanism that allow a VM administrator to restrict the amount of resources a virtual machine can consume. These limits can be imposed on CPU, RAM, and storage resources, effectively restricting the virtual machines use of these resources.

If there are resource limits in place, try removing them or raising them to a higher value. Monitor the system utilization values as a result of the change to monitor for improvements.

Disk Latency

From within VMWare you can monitor the real-time disk performance results of the storage that is backing the DCE VM. The specifics of finding this data differs a bit between versions of VMWare, and whether you investigate from the ESXi locally or from within vCenter. All the versions have support for monitoring the disk latency.

To start, identify the DCE VM from within the hypervisor and review the details of the virtual machine. Specifically, look for the disk drive(s) of the DCE, and the storage backing that drive. If your DCE has more than one disk drive, they should ALL be located on the same storage destination. Splitting DCE drives among multiple storage backings almost always results in decreased VM performance and should be avoided as a general rule.

Look for the Advanced Performance Monitoring section of the ESXi host running the DCE VM. In that section, you can view the real time latency of all IO operations that host is sending to disk. The DCE is very sensitive to disk latency. You should ensure that the latency value, in ms, is less than 1 for the datastore backing the DCE VM. While some short-lived spikes can be tolerated, it is best to ensure the steady state and average response time remains below 1ms.

If response times exceed of 1ms, look for ways to lower that value. You can reduce the amount of systems that also use that shared volume, isolate the DCE VM to be the only system using that volume, or upgrade the target volume to have more disks, faster disks, or preferably SSDs.

Esxtop

Drilling into another level of the hypervisor, you can run esxtop, a real-time performance analysis tool provided by VMWare. This utility is very similar to Linux top, and its usage is the same.

To start, SSH must be enabled on the ESXi running your DCE VM, and you must have the proper credentials to SSH into the ESXi. This is a real time analysis, so the information gathered will only be applicable if your DCE is in the performance degraded state while you run this tool. For intermittent issues, you should run this tool and then cause the event that triggers the degraded system state.

As an example, the following steps cover how to perform a 30-minute esxtop capture from the ESXi. There is additional documentation about running esxtop interactively in Additional resources below.

To capture a 30-minute data set from esxtop:

Enable SSH
SSH to the ESXi server hosting the DCE VM.
Run the command:
esxtop -b -d 5 -n 360 -a |gzip >esxtopOutput.csv.gz

The esxtop command should monitor the ESXi for 30 minutes and create a report of all the performance counters.
After the collection completes, SCP the file from the ESXi host and put the output on a Windows machine.
From the Windows machine, use Performance Monitor to analyze the data set collected.
To do this:

1. Launch Performance Monitor.
2. From the left navigation, under Monitoring Tools, right click Performance Monitor and choose Properties.
3. Under the Source tab, change Data Source to Log Files and point it to the extracted contents you gathered from running esxtop.
4. Click OK.
5. Right click the graph and choose Add Counters.

You can now choose which data from the log collection you want to graph to determine signs of stress from typical system resources: CPU, RAM, drives. Some values of interest are:

CPU %Used of the DCE VM
- Returns data similar to the Linux top data collected before, just another way to reference it
- CPU load average of the host
  - Like with top, gives you CPU insight into how much load the ESXi CPU is under
- %VMwait
  - Percentage of time the VM is waiting for kernel activity, usually disk IO
- DAVG / KAVG / GAVG
  - Stats for latency of disk commands.
  - DAVG : Latency at the device driver level
  - KAVG: Latency at the VMKernel level
  - GAVG: GAVG= DAVG + KAVG

For additional analysis of the esxtop data, see Additional resources.

Additional resources

Additional resources to help you better understand some of the performance tools, what they mean, and how to use them

ESXTOP

ESXTOP quick overview

http://www.running-system.com/wp-content/uploads/2015/04/ESXTOP_vSphere6.pdf

ESXTOP metrics

https://www.virten.net/vmware/esxtop/

ESXTOP interpretation

https://communities.vmware.com/docs/DOC-9279

VMWare

VMWare KB: Troubleshooting ESXi virtual machine performance issues

https://kb.vmware.com/s/article/2001003

VMWare KB: Troubleshooting ESXi storage performance issues

https://kb.vmware.com/s/article/1008205

Questionnaire for performance escalation

Use these questions as a starting point for data that you should gather from the site if you suspect a DCE VM performance issue. If you open a case to diagnose this problem, technical support and engineering will request this data. Proactively gathering the data will help expedite issue resolution.

The questions are written to gain a better understanding of the environment hosting the DCE virtual machine. The goal is to understand the capabilities of the ESXi, the storage supporting DCE, and resource utilization.

VMWare Configuration

What version of VMware are you running on vCenter?
What version VMware are you running on the ESXi?
What VM Hardware version do you have applied to the DCE VM?
Are your ESXi hosts in a cluster supporting vMotion of your VMs for load balancing?
- How many hosts are in the cluster?
- Is DRS enabled such that VMs can migrate between ESXi?
- How often is your DCE VM migrating?
Which ESXis are hosting the DCE VMs in question? (if multiple ESXis, please list)
What is the make, model, and hardware specs of the ESXi server?
- Specifically interested in CPU type and quantity, RAM quantity
Are your DCE VMs configured with multiple drives?
- If yes, are all the drives located on the same storage location?
Are there any resource limit restrictions being set on your DCE VM?
- CPU Limits? CPU Shares?
- Memory Limits? Memory Shares?
- For all DCE VM disk drives: Disk Shares? Disk IOPs Limit?

ESXi Local Storage

Are the ESXis using local storage to run any of the VMs? If no, skip this section.
If VMs are using local ESXi storage, what are the ESXi disk types (HDD / SSD)?
- If HDD what are the RPM speed of disk?
- If multiple disks are being used, what is the RAID scheme?
- What is the size of your RAID Controller Cache?

Network Storage

If your DCE VM is leveraging network storage for their disk backing:

What are the make and model of the shared storage disk array?
What protocol is your network storage running (NFS / VMFS / SCSI )?
How many disks are there in the storage solution?
What are the drive types? SSD? HDD?
- If HDD, what are the disk speeds?
- How is the array provisioned (Single disk pool, multiple disk pools)?
  - If multiple pools, how many disks per pool?
- What is the RAID configuration on the volume?
- Is the DCE VM using an isolated volume or is it shared with other VMs?

Network Topology

Please describe the network topology where this DCE is deployed. Link speeds between nodes of the system are of specific interest.

Running System Data Collection

While running your typical DCE workload, use the esxtop tool to collect a snapshot of your system. Ideally, the collection should cover the period of time where you are experiencing the performance issue.

Esxtop collection

Enable SSH and SSH to the ESXi server hosting DCE.
Run the command:
esxtop -b -d 5 -n 360 -a |gzip >esxtopOutput.csv.gz
Monitor the ESXi for 30 minutes and create a report of all the performance counters.
SCP the output from the ESXi and send to support.

DCE performance troubleshooting guide

Related Forums

EcoStruxure IT forum

APC UPS Data Center & Enterprise Solutions Forum