Thursday, April 16, 2015

NUMA and VMware


ESXi uses a sophisticated NUMA scheduler to dynamically balance processor load and memory locality across NUMA nodes.


Each virtual machine managed by the NUMA scheduler is assigned a home node. A home node is one of the system’s NUMA nodes containing processors and local memory, as indicated by the System Resource Allocation Table (SRAT).

When memory is allocated to a virtual machine, the ESXi host preferentially allocates it from the home node. The virtual CPUs of the virtual machine are constrained to run on the home node to maximize memory locality.

The NUMA scheduler can dynamically change a virtual machine's home node to respond to changes in system load. The scheduler might migrate a virtual machine to a new home node to reduce processor load imbalance. Because this might cause more of its memory to be remote, the scheduler might migrate the virtual machine’s memory dynamically to its new home node to improve memory locality. The NUMA scheduler might also swap virtual machines between nodes when this improves overall memory locality.
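As a rough mental model only (this is not ESXi's actual algorithm, and the numbers are made up), the rebalancing decision can be pictured as moving a virtual machine from the busiest node to the least busy one when the imbalance is large enough:

# Toy illustration of NUMA rebalancing -- NOT the real ESXi scheduler,
# just the idea of moving a VM from the busiest node to the idlest one.

# Hypothetical per-node CPU load contributed by each VM (home node -> VMs).
nodes = {
    0: {"vm-a": 40, "vm-b": 35},   # node 0 is heavily loaded
    1: {"vm-c": 20},               # node 1 has spare capacity
}

def node_load(node):
    return sum(nodes[node].values())

busiest = max(nodes, key=node_load)
idlest = min(nodes, key=node_load)

if node_load(busiest) - node_load(idlest) > 10:   # arbitrary imbalance threshold
    # Move the smallest VM; its memory would then be migrated to the new
    # home node over time to restore locality.
    vm = min(nodes[busiest], key=nodes[busiest].get)
    nodes[idlest][vm] = nodes[busiest].pop(vm)
    print(f"migrated {vm}: node {busiest} -> node {idlest}")

print({n: node_load(n) for n in nodes})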
Some virtual machines are not managed by the ESXi NUMA scheduler. For example, if you manually set the processor or memory affinity for a virtual machine, the NUMA scheduler might not be able to manage this virtual machine. Virtual machines that are not managed by the NUMA scheduler still run correctly. However, they don't benefit from ESXi NUMA optimizations.
The NUMA scheduling and memory placement policies in ESXi can manage all virtual machines transparently, so that administrators do not need to address the complexity of balancing virtual machines between nodes explicitly.
The optimizations work seamlessly regardless of the type of guest operating system. ESXi provides NUMA support even to virtual machines whose guest operating systems do not support NUMA hardware, such as Windows NT 4.0. As a result, you can take advantage of new hardware even with legacy operating systems.
A virtual machine that has more virtual processors than the number of physical processor cores available on a single hardware node can be managed automatically. The NUMA scheduler accommodates such a virtual machine by having it span NUMA nodes. That is, it is split up as multiple NUMA clients, each of which is assigned to a node and then managed by the scheduler as a normal, non-spanning client. This can improve the performance of certain memory-intensive workloads with high locality. For information on configuring the behavior of this feature, see Advanced Virtual Machine Attributes.
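To picture how such a "wide" virtual machine is split, here is a small illustrative sketch (the core count and vCPU count are assumptions, not ESXi internals) that divides a VM's vCPUs into NUMA clients no larger than one physical node:

# Sketch: split a wide VM's vCPUs into NUMA clients, each no larger than
# the number of physical cores in one node. Numbers are assumptions.

cores_per_node = 8       # physical cores per NUMA node (assumed)
vcpus = list(range(12))  # a 12-vCPU VM spans two 8-core nodes

numa_clients = [
    vcpus[i:i + cores_per_node]
    for i in range(0, len(vcpus), cores_per_node)
]

for idx, client in enumerate(numa_clients):
    print(f"NUMA client {idx}: vCPUs {client}")
# NUMA client 0: vCPUs 0-7  -> scheduled on one home node
# NUMA client 1: vCPUs 8-11 -> scheduled on another home node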
ESXi 5.0 and later includes support for exposing virtual NUMA topology to guest operating systems. For more information about virtual NUMA control, see Using Virtual NUMA.
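Inside a Linux guest, you can inspect whatever NUMA topology the virtual machine exposes by reading the standard sysfs nodes (the same information numactl --hardware reports). A small Python sketch, run inside the guest:

# Print the NUMA topology visible inside a Linux guest by reading sysfs.
import glob
import os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    name = os.path.basename(node)
    with open(os.path.join(node, "cpulist")) as f:
        cpus = f.read().strip()
    with open(os.path.join(node, "meminfo")) as f:
        total_kb = next(line.split()[-2] for line in f if "MemTotal" in line)
    print(f"{name}: cpus={cpus}, MemTotal={total_kb} kB")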

More on this in later posts.

Wednesday, April 15, 2015

Calculating Average Guest Latency in VMware

If you are using VMware vSphere, ESXi cannot see application latency because the application sits above the ESXi stack. What ESXi can do is detect three types of latency, which are reported in esxtop and VMware vCenter.
Average guest latency (GAVG) has two major components: average disk latency (DAVG) and average kernel latency (KAVG).
DAVG measures the time that I/O commands spend in the device, from the host bus adapter (HBA) driver down to the back-end storage array.
KAVG is how much time I/O spends in the ESXi kernel. Both values are measured in milliseconds. KAVG is a derived metric: it is not measured directly but is calculated by subtracting DAVG from GAVG.
 
In addition, the VMkernel processes I/O very efficiently, so there should be no significant time spent in the kernel (KAVG). In a well-configured, well-running VDI environment, KAVG should be close to zero. If it is not, the I/O might be stuck in a queue inside the VMkernel; the time spent in that kernel queue is reported as QAVG.
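Because KAVG is derived rather than measured, the arithmetic is simple enough to script. A minimal sketch, with illustrative numbers and thresholds (not official VMware guidance):

def kernel_latency_ms(gavg_ms: float, davg_ms: float) -> float:
    """Derive KAVG: time an I/O spends in the VMkernel (GAVG - DAVG)."""
    return gavg_ms - davg_ms

# Example values as reported by esxtop/vCenter (made up for illustration).
gavg, davg = 12.0, 11.5
kavg = kernel_latency_ms(gavg, davg)
print(f"KAVG = {kavg:.1f} ms")

# In a healthy environment KAVG should be near zero; anything persistently
# higher suggests I/O is queuing inside the VMkernel (check QAVG).
if kavg > 2.0:   # illustrative threshold only
    print("I/O appears to be queuing in the kernel -- investigate QAVG")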
 
To get a sense of the latency the application sees in the guest OS, use a tool such as Perfmon to compare GAVG with the actual latency the application is reporting.
 
This comparison reveals how much latency the guest OS is adding to the storage stack. For instance, if ESXi is reporting a GAVG of 10 ms but the application or Perfmon in the guest OS is reporting storage latency of 30 ms, then 20 ms of latency is building up somewhere in the guest OS layer, and you should focus your debugging on the guest OS storage configuration.
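The same subtraction works one layer up: compare what the guest reports (Perfmon, iostat, and so on) with the GAVG that ESXi reports. A quick sketch using the numbers from the example above:

def guest_added_latency_ms(in_guest_ms: float, gavg_ms: float) -> float:
    """Latency accumulating inside the guest OS storage stack."""
    return in_guest_ms - gavg_ms

# ESXi reports a GAVG of 10 ms while the application (or Perfmon)
# inside the guest reports 30 ms.
delta = guest_added_latency_ms(in_guest_ms=30.0, gavg_ms=10.0)
print(f"{delta:.0f} ms of latency is added inside the guest OS")  # -> 20 ms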