VMworld 2013


Hands On Labs

Thanks for registering for Test Drive on VMware® Hands-on Labs Online.

http://labs.hol.vmware.com/testdrive

You can now access Test Drive on Hands-on Labs Online, a select set of labs that highlight key VMware solutions for your business needs. Please use your existing Hands-on Labs log-in to access: labs.hol.vmware.com/testdrive.

Before you begin using the labs, follow these recommendations for best performance:

  • Internet: Access labs on the latest version of Firefox, Chrome, or Safari (Internet Explorer not supported)
  • Viewing: Screen 13" or larger (no tablet support, except iPad for Lab Manual reading)
  • Networking: 300 – 500 kbps of bandwidth and network latency better than 240 ms.

Once you are ready to start a lab, keep these tips in mind:

  • Review the Welcome message: For a quick overview of how to best navigate the lab environment, review the Welcome message that appears when you first log in
  • Create a virtual machine: In minutes, spin up a virtual machine in the VMware cloud to start your lab, no installation on your hardware required
  • Use the manual: Integrated lab manuals can be accessed within each lab by clicking the "Manual" tab on the right side of your console
  • End a lab: Click "Exit" to pause a lab at any point and return later with your saved work, or click "End" to end the lab completely if you will not return

For issues or questions, visit the Hands-on Labs support community.

Enjoy your test drive of VMware products.

–VMware Hands-on Labs Team

---

Thank you for participating in the VMworld 2013 Hands-on Labs. Be sure to visit http://hol.vmware.com/ to continue your lab experience online.

Community: VMware Hands-on Labs | VMware Communities

https://communities.vmware.com/community/vmtn/resources/how

Hands-on Labs Online: The VMware Hands-on Labs are now available online! The public beta has over 30,000 worldwide users taking 32K labs and growing every day. Self-registration is now active: visit https://www.projectnee.com/HOL

The HOL Blog - Keep up-to-date on the latest HOL news at the Hands-on Labs Blog. Here you can read about VMware innovations for both event-based and online labs.


HOL-SDC-1304 vSphere Performance Optimization

For the complete Performance Troubleshooting Methodology and a list of VMware Best Practices, please visit the VMware.com website:

-

  • Module 1 - Basic vSphere Performance Concepts and Troubleshooting (60 minutes)
  • Module 2 - Performance Features in vSphere (vSphere Flash Read Cache) (45 minutes)
  • Module 3 - Understanding the New Latency Sensitivity Feature in vSphere (30 minutes)
  • Module 4 - vBenchmark: Free Tool for Measurement and Peer Benchmarking of Datacenter's Operational Statistics (15 minutes)
  • Module 5 - StatsFeeder: Scalable Statistics Collection for vSphere (20 minutes)
  • Module 6 - Using esxtop (60 minutes plus 20 minute bonus section)

Thank you for participating in the VMworld 2013 Hands-on Labs. Be sure to visit http://hol.vmware.com/ to continue your lab experience online.

Lab SKU: HOL-SDC-1304

Version: 20130826-105549

Module 1 - Basic vSphere Performance Concepts and Troubleshooting (60 minutes)

"This experiment compares the difference between 2-way vCPU virtual machines. Your manager has asked you if there is any difference between multi-socket virtual machines compared to multi-core virtual machines. He also wanted to know if Hot-Add CPUs perform well."

---

Overview of CPU Test

Below is a list of the most common CPU performance issues…

High Ready Time: Ready Time above 10% could indicate CPU contention and might impact the performance of CPU-intensive applications. However, some less CPU-sensitive applications and virtual machines can have much higher values of ready time and still perform satisfactorily.

High Costop time: Costop time indicates that CPU contention is occurring among vCPUs of a multi-way virtual machine. Costop time above 10% could be an indicator that vSphere is having contention issues when trying to schedule all the vCPUs of a multi-way virtual machine.

CPU Limits: CPU Limits directly prevent a virtual machine from using more than a set amount of CPU resources. Any CPU limit might cause a CPU performance problem if the virtual machine needs resources beyond the limit.

Host CPU Saturation: When the physical CPUs of a vSphere host are consistently utilized at 85% or more, the vSphere host may be saturated. When a vSphere host is saturated, it is more difficult for the scheduler to find free physical CPU resources on which to run virtual machines.

Guest CPU Saturation: Guest CPU (vCPU) Saturation is when the application inside the virtual machine is using 90% or more of the CPU resources assigned to the virtual machine. This may be an indicator that the application is being bottlenecked on vCPU resource. In these situations, adding additional vCPU resources to the virtual machine might improve performance.

Incorrect SMP Usage: Using large SMP virtual machines can cause extra overhead. Virtual machines should be correctly sized for the application that is intended to run in them. Some applications only support multithreading up to a certain number of threads, and assigning additional vCPUs to the virtual machine may cause additional overhead. If vCPU usage shows that a machine configured with multiple vCPUs is only using one of them, it might be an indicator that the application inside the virtual machine is unable to take advantage of the additional vCPU capacity, or that the guest OS is not configured correctly.

Low Guest Usage: Low in-guest CPU utilization might be an indicator that the application is not configured correctly or that the application is starved on some other resource such as I/O or Memory and therefore cannot fully utilize the assigned vCPU resources.

---

The key metrics to examine when investigating a potential CPU issue are:

  • Demand: Amount of CPU the virtual machine is demanding / trying to use.
  • Usage: Amount of CPU the virtual machine is currently being allowed to use.
  • Ready: Amount of time the virtual machine is ready to run but unable to because vSphere could not find physical resources to run the virtual machine on.

To view these metrics in the vSphere Web Client:

  1. Select the VM
  2. Select the Monitor tab
  3. Select the Performance screen
  4. Select the Advanced view
  5. Click on Chart Options
  6. Select CPU from the Chart metrics
  7. Select just Demand, Ready, and Usage in MHz

Virtual machines can be in any one of four high-level CPU States:

  • Wait: This can occur when the virtual machine's guest OS is idle (waiting for work), or when the virtual machine is waiting on vSphere tasks, such as waiting for I/O to complete or for ESX-level swapping to complete. These non-idle vSphere system waits are called VMWAIT.
  • Ready (RDY): A CPU is in the Ready state when the virtual machine is ready to run but the vSphere scheduler is unable to find physical host CPU resources on which to run it. One potential reason for elevated Ready time is that the virtual machine is constrained by a user-set CPU limit or resource pool limit, reported as max limited (MLMTD).
  • Co-stop (CSTP): Time the vCPUs of a multi-way virtual machine spent waiting to be co-started. This gives an indication of the co-scheduling overhead incurred by the virtual machine.
  • Run: Time the virtual machine was running on a physical processor.


TIP: You can shrink the left-hand navigation pane by clicking on the thumbtack icon. You can also shrink the advanced/overview selector by pressing the arrows at the top of that pane.

TIP: You can right click on the Performance Chart Legend column header bar and select or deselect the columns you want to see.

Notice the amount of CPU this virtual machine is demanding and compare that to the amount of CPU usage the virtual machine is actually allocated (Usage in MHz). The virtual machine is demanding more than it is currently being allowed to use. Notice that the virtual machine is also seeing a large amount of ready time.

Guidance: Ready time greater than 10% could be a performance concern.

Value Conversion:

  • Metric Value (in percent) = Metric Value (in milliseconds) / Total Time of Sample Period (by default 20,000 ms in vCenter) × 100

NOTE: vCenter reports some metrics such as "Ready Time" in milliseconds (ms). Use the formula above to convert the milliseconds (ms) value to a percentage.

For multi-vCPU virtual machines, you need to multiply the sample period by the number of vCPUs in the virtual machine to determine the total time of the sample period. It is also beneficial to monitor Co-stop time on multi-vCPU virtual machines. Like Ready time, Co-stop time greater than 10% could indicate a performance problem.
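
A minimal sketch of this conversion in Python (the function and the sample values are illustrative, not part of the lab files):

 SAMPLE_PERIOD_MS = 20000  # vCenter real-time sample period (20 seconds)

 def ms_to_percent(metric_ms, num_vcpus=1):
     # Scale the sample period for multi-vCPU virtual machines
     total_ms = SAMPLE_PERIOD_MS * num_vcpus
     return metric_ms * 100.0 / total_ms

 # A 4-vCPU VM reporting 8,000 ms of ready time in one sample period:
 print(ms_to_percent(8000, num_vcpus=4))  # 10.0 -> right at the 10% threshold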

Note: VMware does not recommend setting CPU affinity in most cases; the vSphere scheduler already manages CPU placement well by default. Enabling affinity prevents some features such as vMotion, can become a management headache, and can lead to performance issues like the one we just diagnosed.

---

SMP

If you hot add a CPU you must also online that CPU in the Guest OS, so that the Guest OS knows that it can now use the newly-added CPU.

NOTE: It makes no difference in performance if you add vCPUs as cores or sockets.

NOTE: Multi-way virtual machines can scale well from 1-way to 64-way in vSphere, but you need to be mindful of the potential waste that unused vCPUs can have on the environment.

vSphere 5.1+ allows you to create very large virtual machines that have up to 64 vCPUs. It is highly recommended to size your virtual machine for the application workload that will be running in them. Sizing your virtual machine with resources that are unnecessarily larger than the workload can actually use may result in hypervisor overhead and can also lead to performance issues.

Avoid a large VM on too small a platform

  • Rule of thumb: 1-4 vCPUs on dual-socket hosts, 8+ vCPUs on quad-socket hosts. This rule of thumb changes as core counts increase.
  • With 8+ vCPUs, ensure you are on at least vSphere 4.1.
  • Sizing a VM too large is wasteful. The OS will spend more cycles trying to keep its work in sync across the extra vCPUs.

Don't expect consolidation ratios with busy workloads to be as high as they were with the low-hanging fruit:

  • Virtualizing larger workloads requires revisiting consolidation ratios.
  • Tier 1 applications and more performance-hungry workloads demand more resources.

---

Memory Performance

Host memory is a limited resource. VMware vSphere incorporates sophisticated mechanisms that maximize the use of available memory through page sharing, resource-allocation controls, and other memory management techniques. However, several of vSphere's memory over-commitment techniques only kick in when the host is under memory pressure.

Note: vSphere randomly samples 100 memory pages per virtual machine to estimate active virtual machine memory. It is by no means 100% accurate, but for the statistics majors out there, it has a very high confidence level and is generally quite accurate.

-

Transparent page sharing (TPS) is a method by which redundant copies of memory pages are eliminated. TPS is always running by default; however, on modern systems with hardware-assisted memory virtualization, vSphere will preferentially back guest physical pages with large host physical pages (2MB contiguous memory regions instead of 4KB regular pages) for better performance. vSphere will not attempt to share large physical pages because the probability of finding two identical large pages is very low. If memory pressure occurs on the host, vSphere may break the large memory pages into regular 4KB pages, which TPS will then be able to consolidate.

For that reason, it is no longer recommended to look solely at the "host memory consumed" metric for capacity planning; host memory consumed may be constantly high in most environments. Instead, Active Memory (memory demand) should be used for memory capacity planning.

-

There is a nifty “VMware vSphere 5 Memory Management and Monitoring diagram” that shows in a picture/diagram fashion the various memory overcommit techniques that vSphere uses. http://kb.vmware.com/kb/2017642

-

There are 4 main memory overcommit techniques and 4 main memory states/levels when these techniques are enabled.

  • High (no memory pressure): Transparent Page Sharing
  • Soft (less than MinFree memory available): TPS, Ballooning
  • Hard (less than 2/3 of MinFree memory available): TPS, Ballooning, Compression, and Host Swapping. It is at the Hard state that large memory pages are broken down into small pages so that TPS can consolidate identical pages.
  • Low (less than 1/3 of MinFree memory available): Swapping. VM activity is halted until memory pressure is relieved.

MinFree Memory for a vSphere 5.x host is calculated by default on a sliding scale from 6% to 1% of physical host memory.
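
A rough sketch of that sliding scale in Python, using the commonly documented tier boundaries (6% of the first 4 GB, 4% of the next 8 GB, 2% of the next 16 GB, and 1% of the remainder); the exact tiers are an assumption here, so verify them against your ESXi version's documentation:

 def minfree_mb(host_memory_gb):
     # (tier size in GB, fraction of that tier counted toward MinFree)
     tiers = [(4, 0.06), (8, 0.04), (16, 0.02), (float("inf"), 0.01)]
     remaining, total_gb = host_memory_gb, 0.0
     for size, pct in tiers:
         chunk = min(remaining, size)
         total_gb += chunk * pct
         remaining -= chunk
         if remaining <= 0:
             break
     return total_gb * 1024

 print(minfree_mb(96))  # ~1597 MB of MinFree on a 96 GB host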

-

Tip: Memory over-allocation tends to be fine for most applications and environments. A 20% memory over-allocation is generally safe. If you are starting to over-allocate memory, it is therefore recommended to begin with 20% or less and increase or decrease from there while monitoring application performance and verifying that the over-allocation does not cause a constant swap-in rate.

---

Storage Performance

Approximately 90% of performance problems in a vSphere deployment are related to storage in some way. There have been significant advances in storage technologies over the past 6-12 months to help improve storage performance. There are a few things that you should be aware of:

In a well-architected environment, there is no difference in performance between storage fabric technologies. A well-designed NFS, iSCSI or FC implementation will work just about the same as the others.

Despite advances in the interconnects, the performance limit is still hit at the media itself; in fact, 90% of storage performance cases seen by GSS (Global Support Services, VMware support) that are not configuration-related are media-related. Some things to remember:

  • Payload (throughput) is fundamentally different from IOPS (cmd/s)
  • IOPS performance is always lower than throughput

A good rule of thumb on the total number of IOPS any given disk will provide:

  • 7.2k rpm – 80 IOPS
  • 10k rpm – 120 IOPS
  • 15k rpm – 150 IOPS
  • EFD/SSD – 5k-10k IOPS (max ≠ real world)

So, if you want to know how many IOPS you can achieve with a given number of disks:

  • Total Raw IOPS = Disk IOPS * Number of disks
  • Functional IOPS = (Raw IOPS * Write%) / (RAID Penalty) + (Raw IOPS * Read%)
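
A small worked example of these two formulas in Python (the per-disk IOPS values and RAID write penalties are the usual rules of thumb, not lab-specific figures):

 RAID_PENALTY = {"RAID0": 1, "RAID1": 2, "RAID5": 4, "RAID6": 6, "RAID10": 2}

 def functional_iops(disk_iops, num_disks, write_pct, raid="RAID5"):
     raw = disk_iops * num_disks  # Total Raw IOPS
     return raw * write_pct / RAID_PENALTY[raid] + raw * (1 - write_pct)

 # Eight 15k rpm disks (150 IOPS each) in RAID5 with a 30% write mix:
 print(functional_iops(150, 8, 0.30, "RAID5"))  # 930.0 functional IOPS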

-

Iometer is a commonly used tool for testing storage. For more details on Iometer or the free I/O Analyzer tool (offered as a VMware fling) that improves on Iometer, be sure to take the extended Iometer and I/O Analyzer modules offered in this lab.

Poor performance shows up in the Iometer GUI as...

  • Long Latencies (Average I/O Response Time), latencies greater than 20ms.
  • Low IOPs (Total I/O per Second)
  • Low Throughput (Total MBs per Second)

-

vSphere provides several storage features to help manage and control storage performance:

  • Storage I/O control
  • Storage IOP Limits
  • Storage DRS
  • Disk Shares

-



When we think about storage performance problems, the top issue is generally latency, so we need to look at the storage stack and understand what layers there are in the storage stack and where latency can build up.

At the topmost layer is the application running in the guest operating system. That is ultimately the place where we care most about latency: it is the total amount of latency the application sees, and it includes the latencies of the total storage stack, including the guest OS, the VMkernel virtualization layers, and the physical hardware.

ESX can’t see application latency because that is a layer above the ESX virtualization layer.

From ESXi we see 3 main latencies that are reported in esxtop and vCenter.

The topmost is GAVG, or Guest Average latency, which is the total amount of latency that ESXi can detect.

That is not to say it is the total amount of latency the application sees. In fact, if you compare GAVG (the total amount of latency ESX sees) with the actual latency the application sees, you can tell how much latency the guest OS is adding to the storage stack, which can tell you whether the guest OS is configured incorrectly or is causing a performance problem. For example, if ESX is reporting a GAVG of 10ms, but the application or perfmon in the guest OS is reporting a storage latency of 30ms, then 20ms of latency is somehow building up in the guest OS layer, and you should focus your debugging on the guest OS's storage configuration.

GAVG is made up of two major components: KAVG and DAVG. DAVG is essentially how much time is spent in the device (the driver, HBA, and storage array), and KAVG is how much time is spent in the ESXi kernel (that is, how much overhead the kernel is adding).

KAVG is actually a derived metric: ESX does not measure it directly, but calculates it with the following formula:

KAVG = Total Latency (GAVG) - DAVG

The VMkernel is very efficient at processing IO, so an IO really should not spend any significant time waiting in the kernel; KAVG should be equal to 0 in well-configured, well-running environments. When KAVG is not equal to 0, it most likely means that the IO is stuck in a kernel queue inside the VMkernel. So, the vast majority of the time, KAVG will equal QAVG, or Queue Average latency (the amount of time an IO is stuck in a queue waiting for a slot in a lower queue to free up so it can move down the stack).
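
As a minimal illustration of the derivation (the values are made up; the 2 ms threshold echoes the kernel-latency guidance in the storage wrap-up below):

 def kavg_ms(gavg_ms, davg_ms):
     # KAVG is not measured directly: total guest-visible latency minus device latency
     return gavg_ms - davg_ms

 gavg, davg = 25.0, 21.0
 kavg = kavg_ms(gavg, davg)
 if kavg > 2.0:  # kernel latency above ~2 ms suggests queuing
     print("KAVG %.1f ms: IO is likely stuck in a VMkernel queue (check QAVG)" % kavg)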

-

Storage DRS

SDRS makes storage moves based on performance only after it has collected performance data for more than 8 hours. Since the workloads only recently started, SDRS will not make a recommendation to balance the workloads based on performance until it has collected more data.

-

Guidance: This shows the importance of sizing your storage correctly. It also shows that when two storage-intensive sequential workloads share the same spindles, performance can be greatly impacted. If possible, keep sequential workloads separated (backed by different spindles/LUNs) from random workloads.

Guidance: From a vSphere perspective, for most applications, the use of one large datastore vs. several small datastores tends not to have a performance impact. However, the use of one large LUN vs. several LUNs is storage-array dependent, and most storage arrays perform better in a multi-LUN configuration than with a single large LUN.

Guidance: Follow your storage vendor’s best practices and sizing guidelines to properly size and tune your storage for your virtualized environment.

-

Wrap Up - Storage

In this test, we learned that storage latency greater than 20 to 30ms may cause performance problems for applications. Not having enough spindles or sharing the same storage resources/disk spindles with competing workloads can cause poor storage performance. vSphere 5.1+ provides several storage features, such as Storage DRS that can perform Storage vMotions to balance storage workloads based on capacity and performance.

Other Things to keep in mind with storage are....

  • Kernel latency greater than 2ms may indicate a storage performance issue.
  • Use the Paravirtualized SCSI (PVSCSI) device drivers for the best storage performance and lower CPU utilization.
  • VMFS performs equally well compared to RDMs. In general, there are no performance reasons to use RDMs instead of VMFS.
  • vSphere has several storage queues, and queues may cause bottlenecks for storage-intensive applications. Check the VM, adapter, and device/LUN queues for bottlenecks.

For more details on these topics, see the Performance Best Practices and Troubleshooting Guides on the VMware.com website.

http://pubs.vmware.com/vsphere-51/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-51-storage-guide.pdf

http://communities.vmware.com/docs/DOC-19166

Module 2 - Performance Features in vSphere (vSphere Flash Read Cache) (45 minutes)

Introducing vSphere Flash Read Cache
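
As a brief orientation to the two commands below (a description based on how they are used in this lab): the first lists the vFlash caches configured for the vfc module, and the second retrieves statistics for a specific cache, where <descriptor> is the cache identifier returned by the list command.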

esxcli storage vflash cache list -m vfc
esxcli storage vflash cache stats get -m vfc -c <descriptor>

Module 3 - Understanding the New Latency Sensitivity Feature in vSphere (30 minutes)

What is Latency Sensitivity?

The Latency Sensitivity feature in ESXi 5.5 is a per-VM setting that lets you signal to ESXi that the VM contains a workload that is sensitive to I/O latency. The sensitivity is classified as "Low", "Normal", "Medium", or "High".

VMs with Latency Sensitivity set to High will experience reduced I/O latency and jitter.

Jitter: variability in inter-packet arrival time

I/O Network Latency: Can be measured either as

  • One-way, which is elapsed time between sending a packet and the destination receiving the packet, or as
  • Round-trip, which is elapsed time between sending a packet to its destination and receiving the corresponding response at the source.

-

Who should use this feature?

The Latency Sensitivity feature is intended for specialized use cases that require extremely low latency. It's highly important to determine whether or not your workload could benefit from this feature before enabling it. In a nutshell, Latency Sensitivity provides extremely low network latency with a tradeoff of increased CPU and memory cost as a result of less resource sharing, and increased power consumption.

We define a 'highly' latency-sensitive application as one that requires network latencies on the order of tens of microseconds and very small jitter. Stock market trading applications are an example of highly latency-sensitive applications.

Before deciding if this setting is right for you, you should be aware of the network latency needs of your application. If you set latency sensitivity to High, it could lead to increased host CPU utilization, power consumption, and even negatively impact performance in some cases.

-

Who shouldn't use this feature?

Enabling the Latency Sensitivity feature reduces network latency. Latency Sensitivity will not decrease application latency if latency is influenced by storage latency or other sources of latency besides the network.

The Latency Sensitivity feature should be enabled in environments in which the CPU is undercommitted. VMs which have Latency Sensitivity set to High will be given exclusive access to the physical CPU they need to run. This means the latency sensitive VM can no longer share the CPU with neighboring VMs.

Generally, VMs that use the latency sensitivity feature should have a number of vCPUs which is less than the number of cores per socket in your host to ensure that the latency sensitive VM occupies only one NUMA node.

-

Latency Sensitivity Under the Covers

What changes are made on behalf of the VM, which is 'latency sensitive'?

With just one setting, many optimizations are made by ESXi on behalf of the 'High' Latency Sensitive VM. The most important changes are to the VM's CPU access, its virtual NIC coalescing, and additional optimizations not discussed here.

-

Changes to CPU access

When a VM has 'High' Latency Sensitivity set in vCenter, the VM is given exclusive access to the physical cores it needs to run. This is termed exclusive affinity. These cores will be reserved for the latency sensitive VM only, which results in greater CPU accessibility to the VM and less L1 and L2 cache pollution from multiplexing other VMs onto the same cores. When the VM is powered on, each vCPU is assigned to a particular physical CPU and remains on that CPU.

When the Latency Sensitive VM's vCPU is idle, ESXi also alters its halting behavior so that the physical CPU remains active. This reduces wakeup latency when the VM becomes active again.

-

Changes to virtual NIC coalescing

A virtual NIC (vNIC) is a virtual device which exchanges data packets between the VMkernel and the Guest operating system. Exchanges are typically triggered by interrupts to the Guest OS or by the Guest OS calling into VMKernel, both of which are expensive operations. Virtual NIC coalescing, which is default behavior in ESXi, attempts to reduce CPU overhead by holding onto packets for some time before posting interrupts or calling into VMKernel. In doing so, coalescing introduces additional network latency and jitter, but these effects are negligible for most non-latency sensitive workloads.

Enabling 'High' Latency Sensitivity disables virtual NIC coalescing, so that there is less latency between when a packet is sent or received and when the CPU is interrupted to process the packet.

The tradeoff for this reduced network latency is greater power and CPU consumption. Disabling coalescing helps when the number of packets being processed is small; if the number of packets becomes large, however, disabling virtual NIC coalescing can actually be counterproductive due to the increased CPU overhead.

-

Demonstration of the 'High' Latency Sensitivity Setting

Let's walk through how to configure the Latency Sensitivity setting in the vSphere Web Client and compare network latency between a VM with 'high' latency sensitivity set and one with 'normal' latency sensitivity. First, we will test a VM with 'normal' latency sensitivity, then we will change the setting to 'high' latency sensitivity and test again. Comparing network latency between the two will show us the Latency Sensitivity setting at work.

Module 4 - vBenchmark: Free Tool for Measurement and Peer Benchmarking of Datacenter's Operational Statistics (15 minutes)

VMware vBenchmark

Traditionally, datacenter operators have been focused on benchmarking application performance. In recent years, there has been an emphasis on the measurement of infrastructure performance (e.g. consolidation ratio) in addition to application performance. However, there have been no tools available to measure the operational statistics calculated from actual tasks and events within a datacenter, such as how long it takes to provision a new service, how often SLAs (Service Level Agreements) are being met, or how efficiently datacenter resources are being utilized. VMware vBenchmark is the industry's first tool designed to measure such metrics for virtualized datacenters. With this tool, you can:

Measure key operational statistics/metrics of your VMware infrastructure across three categories:

  • Efficiency: for example, physical RAM savings by using virtualization.
  • Agility: for example, average time to provision a VM.
  • Quality of Service: for example, downtime avoided by using availability features.

Compare your operational metrics against your peer group by choosing to contribute your stats to the community repository. The stats that you submit are anonymized and encrypted for secure transmission.

Track your operational metrics over time. For instance, you can query metrics for the last 1 month, 3 month or 6 month time period and compare how your operational metrics have changed with time.

VMware vBenchmark is available for free download at http://labs.vmware.com/flings/vbenchmark.

-

vBenchmark also allows you to specify the number of months (1 to 6) of historical data that you want to analyze. For this lab, you can use the default value of 1 month.

Note: To use vBenchmark, you don't need vCenter administrator/root privileges; any user account with read-only privileges on the vCenter inventory will work. However, vBenchmark won't be able to query data from objects for which it does not have read access. For this reason, it is recommended that you use an account with read access to all the objects in the inventory.

-

How Does Your Cloud Stack Up Against Your Peer Groups?

With vBenchmark you can upload your data to a community repository to view the peer group metrics. To do so, please click on the “Share” tab at the top.

-

vBenchmark allows you to save your current session and reload it later. This will be useful when you want to snapshot your current state and use it for comparison at a later date.

-

The Dashboard page summarizes the key metrics across all vCenters and hosts. Individual tabs (such as Infrastructure configuration, Efficiency, Agility, Quality of Service) offer a more granular view of these metrics.

-

Infrastructure tab - This tab provides a summary of statistics about your infrastructure configuration, such as consolidation ratio, average number of physical CPUs (sockets) in your servers, etc. You can sort this data by vCenter instance or by vSphere License Edition. You can use this tab to identify which cluster has the VM with the largest number of vCPUs.

Efficiency tab - This tab shows how efficiently your datacenter resources are being utilized, specifically your memory and CPU resources. This page also allows you to specify the number of IT administrators that manage your virtualized environment, so that vBenchmark can calculate the number of VMs managed per administrator. This metric is useful if you want to determine the efficiency with which you are managing your infrastructure. One way to improve memory efficiency is to utilize VMware’s advanced memory over-commitment technologies, which allow you to provision more vRAM than the available physical RAM. You can use this tab to identify the clusters which have had virtual machines with more vRAM provisioned than the physical memory available.

Agility tab - With virtualization, it is much easier and quicker to provision your workloads, improving the agility of your datacenter operations. The agility tab allows you to track how quickly you can provision and configure a VM. You can use this tab to identify the clusters which have had the most VMs provisioned in the recent past.

Quality of Service tab - Maintaining quality of service involves meeting the performance and availability SLAs of your applications. VMware HA, VMware Fault Tolerance, and VMware Site Recovery Manager technologies mitigate unplanned downtime, while VMware vMotion technology eliminates planned downtime. For example, when you put a host in maintenance mode, VMware vMotion automatically kicks in and prevents the VMs from having downtime. This page tracks the downtime avoided using all the aforementioned technologies. You can use this tab to identify the cluster which benefited the most from VMware vMotion by having the most hours of application downtime avoided. Note: This is a function of 3 factors: a) the number of VMs per host, b) the hours the host spent in maintenance mode, and c) the number of hosts in the cluster.

-

That wraps up our tour of vBenchmark, which you can use to measure and track your datacenter operational statistics and compare them with your peer group. vBenchmark is free and is available for download at http://labs.vmware.com/flings/vbenchmark. Thanks for taking the vBenchmark lab!

Module 5 - StatsFeeder: Scalable Statistics Collection for vSphere (20 minutes)

StatsFeeder is a tool for collecting performance metrics from VMware vCenter and sending them to multiple destinations. You specify which statistics to collect from vCenter, and StatsFeeder can retrieve them in CSV format, write them into a time series database, or you can write your own Java-based receiver to send statistics to any arbitrary endpoint. StatsFeeder is a lightweight tool that makes it easy to collect just the statistics you need from vSphere in a scalable fashion, and it is run at the command line as a script.

StatsFeeder is a VMware Fling; Flings are apps and tools built by our engineers that are intended to be played with and explored. They are free, unsupported software tools. You can obtain StatsFeeder from the VMware Labs Fling site at http://labs.vmware.com/flings/StatsFeeder.

-

Using StatsFeeder

StatsFeeder allows you to collect statistics from all entities in vSphere, including hosts, VMs, and cluster-level statistics. StatsFeeder makes it easy to specify both the target entities as well as the type of statistics using a single XML configuration file and a simple command line interface. It not only streamlines statistics collection across a large number of entities, but also allows a deeper analysis of these statistics that goes beyond current monitoring tools by putting the data in your hands. Statistics gathered using StatsFeeder can be used to answer questions such as "Is DRS working?", and "What is the difference in Quality of Service I am providing between two resource pools?".

StatsFeeder can serve as an alternative to using esxtop to monitor your environment. To monitor a cluster, esxtop must be started on each host individually, requires access to the hosts' shell, and does not capture statistics for VMs which are migrated off the host during the statistics collection period. StatsFeeder scalably collects statistics for an entire cluster and all of its hosts and VMs, and retrieves statistics for all entities that are created or destroyed during the collection period.

For examples of prototyped use cases for StatsFeeder and much more information, see the StatsFeeder academic paper, StatsFeeder: An Extensible Statistics Collection Framework for Virtualized Environments here: http://labs.vmware.com/academic/publications/ravi-vmtj-spring2012

-

System Requirements

StatsFeeder can be run from any Windows or Linux system that has network connectivity with the target vCenter Server and has a JRE 1.6 or higher installed.

-

StatsFeeder command

Run it from a Windows command prompt, Cygwin shell, or Linux command line.

cd C:\StatsFeeder
C:\StatsFeeder\Statsfeeder.bat -h vc-l.corp.local -u root -p VMware1! -c C:\StatsFeeder\config\config1.xml

Output:

 C:\StatsFeeder\output\Output-PointSample.csv

-

The XML config file contains the list of statistics that we will collect in the next demonstration. It's a grab bag of statistics that are useful for troubleshooting performance problems and, for your convenience, they are included in the default config file shipped with StatsFeeder.

To add or remove statistics, simply add or remove lines from the list. We specify a statistic by using its vCenter Performance Counter name, which is of the form <group>.<name>.<roll-up>.

  • Most statistics fall into one of four groups: cpu, mem (memory), disk, or net (network).
  • Name is the statistic name, such as packetsRx.
  • Roll-up describes how the statistic is aggregated over the sample window. The roll-up type can be summation, latest, average, none, maximum, or minimum.

As denoted in the XML file, some statistics are collected only for hosts, some only for VMs, and others for both.

A full list and descriptions of the vCenter Performance Counters, along with more information about counters, can be found in the vSphere documentation.

-

Which Statistics are Important?

In a virtualized environment, certain metrics have proven to be especially useful in identifying virtualization-specific problems. These metrics can indicate that the shortage of some resource is causing performance problems, whether that is CPU, memory, or access to storage. This is not a comprehensive list, but is meant to be a useful starting point.

Keep your eye on:

  • CPU ready time (cpu.ready.summation): The amount of time that a virtual CPU was ready to be scheduled, but could not be scheduled because there was no underlying physical CPU available.
  • Memory swapIn rate (mem.swapIn.average): The rate at which memory is being swapped in from disk after it had been swapped out by ESX; a nonzero rate indicates a lack of memory.
  • Memory swapOut rate (mem.swapOut.average): The rate at which memory is being swapped out to disk; also indicates a lack of memory.
  • CPU swapWait (cpu.swapWait.summation): Time spent waiting for memory to swap in before the virtual machine can run.
  • Disk maxTotalLatency (disk.maxTotalLatency.latest): The highest reported total latency in the sample window.

-

Persisting Data

StatsFeeder handles the collection of data and allows users to write a custom receiver to persist the data. Two default receivers that we will investigate in this lab include:

-

CSV receiver - Outputs each statistic on its own line, with characteristics of the data such as host name, entity type, timestamp, and statistic name as comma-separated values on the line. The CSV receiver is targeted in the StatsFeeder config file.

Perfmon receiver - Outputs all statistics for a sample on one comma-separated line. The Perfmon receiver is likewise targeted in the StatsFeeder config file.

Other Receivers

StatsFeeder also supports writing data into the time series databases KairosDB or OpenTSDB. See the StatsFeeder download for details.

You can create a custom receiver by implementing a single receiveStats() method and specifying the class name in the config file.

For more information about custom receivers, see the VMware academic publication StatsFeeder: An Extensible Statistics Collection Framework for Virtualized Environments at http://labs.vmware.com/academic/publications/ravi-vmtj-spring2012

-

Because we specified the Perfmon receiver in the config file, we can now view the StatsFeeder statistics in Windows Performance Monitor, also known as Perfmon. Perfmon is a convenient way to examine vCenter statistics because it is already part of all Windows systems. You can also view statistics in a similar tool called esxplot. Esxplot is a VMware fling designed to make it easy to filter and view statistics from a large dataset.

-

Perfmon: C:\Windows\System32\perfmon.exe

Load statistics in Perfmon

1. Maximize the window.

2. Click on Performance Monitor.

3. Click the red X to stop showing the default metric.

4. Click on the View Log Data icon. (Ctrl+L)

Load statistics in Perfmon continued

1. Click the radio button Log Files: to view historical data in Perfmon.

2. Click Add...

3. Navigate to C:\StatsFeeder\output

4. Click on Output-ContinuousSample.csv

5. Click Open.

Add VM-level metrics

Back at the Performance Monitor Properties window,

1. Click on the Data tab.

2. Click Add...

3. Under Select counters from computer: open the drop down menu.

You'll see all hosts and VMs for which statistics were collected. If you have recently taken other lab modules, your results may differ from those shown here.

Right now, we are going to look at VM-level statistics.

4. Select perf_cpu_worker-l-01a (vm-64)

In one short run, we've collected all of the metrics that are most relevant to performance troubleshooting. You can scroll in the main pane to see the names of each statistic we collected.

5. Click on cpu.ready.

6. This VM has only 1 vCPU, vCPU 0. Click 0 to graph statistics for vCPU 0.

7. Click Add > >.

8. Click on cpu.used.

9. Click 0.

10. Click Add > >.

11. Click OK.

At the Performance Monitor Properties window, click Add... again to go back to the Add Counters screen.

1. Under Select counters from computer:, select perf_cpu_worker-l-02a (vm-63).

2. Click cpu.ready.

3. Click 0.

4. Click Add > >.

5. Click cpu.used.

6. Click 0.

7. Click Add > >.

8. Click OK.

Set Graph

We need to set the graph's Y-axis, or 'Vertical scale' to an appropriate value. But what is an appropriate value for the statistics we are graphing?

Converting Metrics reported in Milliseconds into Percentages

Metric Value (in percent) = Metric Value (in milliseconds) / Total Time of Sample Period (by default 20,000 ms in vCenter) × 100

The statistics we are currently graphing are:

cpu.used.summation : milliseconds during which the virtual machine was running on a physical processor.

cpu.ready.summation : milliseconds during which the virtual machine was ready to run but unable to run because the vSphere scheduler could not find physical host CPU resources to run it on.

Both of these statistics are reported as the sum of the milliseconds for each vCPU across the sample period. Because the sample period was 20 seconds, this converts to 20,000 ms total.

Ready time greater than 10% could indicate a performance problem.

As a guideline, for 1 vCPU, remember that

200 ms = 1%

2000 ms = 10%

20,000 ms = 100%

So we should set the graph maximum to 20,000.

Set graph range

1. Change the Maximum: from 100 to 20000.

The collected statistics start partway through the collection period. This is because the VMs were initially powered off (so no statistics were collected) and were then powered on. Unlike esxtop, StatsFeeder dynamically collects statistics for all powered-on VMs, even those that were not powered on at the beginning of the collection period.

Remove metrics

1. Right click on the counters.

2. Select Remove All Counters.

3. At the Performance Monitor Control window, click OK.

-

This wraps up our discussion of StatsFeeder. If you're interested in trying it out in your environment, download StatsFeeder from the VMware Labs Fling site (http://labs.vmware.com/flings/StatsFeeder); more information is published in the VMware academic article StatsFeeder: An Extensible Statistics Collection Framework for Virtualized Environments (http://labs.vmware.com/academic/publications/ravi-vmtj-spring2012).

Module 6 - Using esxtop (60 minutes plus 20 minute bonus section)

When To Use esxtop

There are several tools to monitor and diagnose performance in vSphere environments. It is best to use esxtop to diagnose and further investigate a performance issue that has already been identified through another tool or method. Esxtop is not designed for monitoring performance over the long term, but it is great for deep investigation or for monitoring a specific issue or VM over a defined period of time.

In this lab, which should take about 60 minutes, we will use esxtop to enable a deep dive into performance troubleshooting. First, we will take a look at other tools that are used before esxtop. We will then spend some time with esxtop to see how to use it both interactively and in batch data capture mode. Finally, we will use the data captured from esxtop in a few different tools to be able to better analyze and examine our data.

An additional bonus section is included with the lab that uses everything covered in the first sections about esxtop to investigate NUMA and vNUMA behavior in a few different scenarios. This additional bonus section should take about 20 minutes. If you are already very familiar with esxtop, you can skip directly to the bonus NUMA section via the table of contents.

-

vCenter Operations Manager (vCOPs)

Day To Day Performance Monitoring

There are a variety of tools that can be used to monitor your vSphere environment on a day-to-day basis. vCenter Operations Manager (vCOPs) is a powerful tool that can be used to monitor your entire virtual infrastructure. It incorporates high-level dashboard views and built-in intelligence to analyze the data and identify possible problems.

http://www.vmware.com/products/vcenter-operations-management/

VMware® vCenter™ Operations Management Suite™ provides automated operations management using patented analytics and an integrated approach to performance, capacity and configuration management.

Cost: ~$3,800

-

vCenter Performance Monitoring

vCenter Advanced Performance Graphs Limitations - vCenter performance graphs are limited to two distinct measurement units at a time. If you followed the directions and only had two counters selected, you won't see this error:

Invalid Selection - You can only select two distinct units at a time.

Using vCenter Operations Manager, most performance issues can be identified and resolved. In addition to the high-level stats and analysis available with vCenter Operations Manager, vCenter (via the vSphere Client) provides detailed real-time and historical performance information. These tools are designed to be used in the day-to-day operations and management of vSphere. esxtop is designed to provide deep-dive, real-time performance information for an ESXi host. This is usually only needed when the performance issue is either not identified by one of these other tools, or the necessary deep-level details are not available in them. The rest of this lab takes a look at how to use esxtop and how to sort through and analyze the data that it produces.

-

Diagnosing Performance In Interactive Mode

Esxtop can be used to diagnose performance issues involving almost any aspect of performance, from both the host and virtual machine perspectives. This section will step through how to use esxtop in interactive mode to view CPU, memory, network, and disk performance.

-

Initial esxtop screen - CPU

The initial esxtop screen you will see is the CPU screen, which shows CPU utilization for the ESX host and all the VMs that are running. Initially, we see any running VMs in the list of processes, with their %USED, %RUN, %SYS, %WAIT, and %VMWAIT shown. There are also many other processes in the list; these are ESXi host processes and functions that, for the most part, can be identified by name.

Expand the PuTTY session window to fullscreen. This will allow esxtop to display many more fields; once the session is fullscreen you will see %RDY, %IDLE, %OVRLP, and others.

-

Customizing esxtop Screens

Press the letter o and you will be presented with the Current Field Order screen. Here you can customize the order from left to right that the different groups of data columns appear on the esxtop screen.

Press and hold the Shift key while pressing the letter d three times. Watch the string of letters after "Current Field order" at the top of the screen: you will see the letter d move with each press of SHIFT-d. This changes the screen so that the NAME field is displayed first, on the far left. Press the space bar to return to the CPU screen.

Now press the letter f to open the Current Field Order screen in toggle mode. In this screen you can toggle groups of fields on or off. Press the E and F keys to remove the NWLD and %STATE TIMES fields from the screen; you will now see only the NAME, ID, and GID columns. Press the space bar to return to the CPU screen.

TIP: Press SHIFT-V (or capital V) to tell esxtop to just display the VMs. You will now only see the three VMs we just started and not all the other processes.

Expand to see the details for a VM by pressing e and then entering the ID. In the example screen shot, we expanded the perf_cpu_worker VM with ID 13455. You can see all of the associated processes that make up the VM, including this VM's single vCPU. VMs with multiple vCPUs will show each as a separate process, which allows you to see the distribution of the workload running inside the VM. (Sometimes after you expand the VM it might only show for one refresh cycle before disappearing; this is because multiple changes were made in a single refresh cycle.)

Memory:

Press m to switch to the Memory screen of esxtop. This screen shows us stats in a similar style as the initial CPU screen, but all of these stats are dedicated to memory related information.

Network:

Press n to go to the network screen of esxtop. Here you will see all of the virtual networks and the virtual NICs attached. On this screen, you can quickly see how much network activity is occurring for each of the VMs and on which virtual NICs.

Disk:

Press d to open the disk adapter esxtop screen. This screen displays the disk performance data from the disk adapter perspective. There is also a disk device screen and virtual disk screen that provide disk performance information from different perspectives. We will look at those in just a few minutes.

Press e to expand the details for a disk adapter. Enter the adapter that you want to expand. In the example screen shot, we have entered vmhba1 to be expanded. Once expanded it will show all the paths currently on the expanded disk adapter. In the lab environment, this is only a single path per adapter - so not so exciting to see. In most real environments this additional detail will show many paths and will allow you to see exactly where the activity is occurring on the adapter.

Disk Device:

Press u to reach the disk device screen. This shows the disk performance information from the perspective of each specific disk device. In this example, we see that the two local disks and the NFS mounted disk are all shown.

Press f to access the field selection list for the disk device screen. This includes many interesting stats including VAAI related performance information.

Virtual Disks:

Press v to display the virtual disk screen. This displays disk performance from the perspective of the virtual machines and the virtual disks that they have. The CMD/s displayed here should be very close to what is reported inside the guest.

After returning to the main virtual disk screen, press e to expand a specific VM and enter the GID for the perf_cpu_worker VM. In the example screen shot the GID is 13455. You will see each of the virtual disks that are attached to the virtual machine and the stats per virtual disk shown. This allows you to quickly see the IOPS (CMD/s) or latency (LAT/rd or LAT/wr) per virtual disk.

-

Saving Your esxtop Customizations

When using the field selection and ordering options of esxtop you will end up creating the environment that you want. In order to save these views and have esxtop default to these settings on the next start you can press <SHIFT>W at any time. This will create an esxtop configuration file that looks like the above example in the screen shot. It will default to use the file name that esxtop automatically looks for when it starts up. You can also specify a different file name and then have esxtop use that file on startup with the -c myconfigfilename option.

-

Capturing Performance Data with esxtop

Esxtop can be used to capture or record performance data. This approach is often used to capture data when a known performance problem is occurring. The data can then be analyzed in detail after the capture. It is also often used while running performance tests or benchmarks so that performance metrics can be calculated and viewed from a variety of different perspectives after the testing is completed.

-

Batch Mode esxtop

Batch mode is used to capture esxtop data. This mode is entered by including the -b option on the esxtop command line. In addition, we will want to specify the number of data samples to capture and how long to wait between samples. This is done via -n for the number of iterations (samples) and -d for the delay, or time between samples. The combination of the iterations and delay parameters determines how long esxtop will capture data. For example, if iterations is set to 5 and delay is set to 60, the capture will run for 5 minutes.

We will also be redirecting the output of the batch mode esxtop to a file using an output redirect (the > symbol).

esxtop -b -a -d 5 -n 12 > HOLesxtop1.csv

Please wait until this command completes - read below while waiting to get more details on what is happening.

This will capture data for one minute because we have specified a 5-second delay between samples and 12 samples, for a total of 60 seconds. We have added the -a parameter so that all possible performance counters are captured. This makes the output file larger but ensures that as many counters as possible are captured. Nothing much will happen during the minute that the esxtop capture is running.

-

Using esxplot and Windows Performance Monitor

The amount of data created and captured by esxtop can be very large. Esxplot is a tool released as a VMware fling that can import esxtop data and quickly sort and filter the results while also displaying graphs. Esxtop data can also be directly imported to Windows Performance Monitor (perfmon) which provides its own set of powerful tools for filtering and displaying data.
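
Esxtop batch output is a perfmon-style CSV: one timestamp column followed by one column per counter instance. Before opening esxplot or perfmon, you can also take a quick look at a capture with pandas; a minimal sketch, assuming the HOLesxtop1.csv file created earlier and that your capture contains "% Ready" counters:

 import pandas as pd

 df = pd.read_csv("HOLesxtop1.csv")
 # Column names look like \\host\Group Cpu(id:vmname)\% Ready
 ready_cols = [c for c in df.columns if "% Ready" in c]
 print(df[ready_cols].describe())  # quick per-instance summary of ready time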


esxplot – VMware Labs - http://labs.vmware.com/flings/esxplot

Esxplot is a GUI-based tool that lets you explore the data collected by esxtop in batch mode. The program loads files of this data and presents it as a hierarchical tree where the values are selectable in the left panel of the tool, and graphs of the selected metrics are plotted in the right panel.

Esxplot allows you to “browse” the contents of these somewhat unwieldy files. You can plot up to 16 metrics on the same canvas and export the graphs to a gif, jpg, png or bmp file format. Subsets of the data can be worked with by using the regex query box which will produce a subtree that can be browsed or exported as a csv file which can, in turn, be loaded into esxplot, PERFMON or Excel.

The program is written in Python and uses the platform-independent windowing library wxPython. Programs written with wxPython can run unchanged on Linux, Windows, and OS X. In order to run esxplot you need to have Python 2.6.x or later installed (the program will not yet run under Python 3.x due to the lack of wxPython support).

-

Expand the perf_cpu_worker Group Cpu metric. Click on %Run. You will see the CPU run time for the VM graphed in the panel on the right. Hold down <CTRL> and click on %Ready to multiselect a second metric. You will now see both %Run and %Ready graphed together. (Press and hold the Command key to multiselect if using a Mac keyboard.)

You can also add the CPU usage of another VM to the same graph. Scroll in the list of metrics and expand the list for another VM - memhog-o-01a. Hold down <CTRL> and click on %Run and %Ready for the memhog VM. You will now see all four metrics graphed together.

This ability to quickly graph and see multiple metrics is very powerful in being able to easily spot issues. A visual representation of performance data is often much more understandable because it is easy to spot trends, outliers, and correlations.

-

Comparing vCPU 0 on Two Different VMs

One of the most powerful features of esxplot is the ability to search the esxtop data set using a simple string. We are going to find all performance counters directly related to our memhog VMs.

Click in the Query search box. Delete the default text so the query search box is empty.

Type memhog into the search box and click on the GO button.

The results will return as a new entry in the panel on the lower left.

Double click on the Query:memhog to see the results.

-

Exporting esxplot Search Results

A very powerful feature of esxplot is the ability to export the results of an esxplot search to a new csv file that only contains the data metrics found in the search. This allows you to reduce the data set that you are working with to a much smaller size and only focus on the metrics that you are interested in. The resulting file is much easier to consume in other tools like a spreadsheet or windows performance monitor.

To export the results of an esxplot search, right-click on the top level of the results in the results panel on the lower left. In the lab example, right-click on Query: memhog.

Select Export from the popup that appears.

Save the export as memhogexport.csv

Close esxplot.

-

Using Windows Performance Monitor With esxtop Data

The data file produced by esxtop in batch mode can be directly imported into Windows Performance Monitor (commonly referred to as perfmon). Perfmon has a strong set of graphing tools. It does not require a download, as it is already part of all Windows systems. In the next steps, we will import a couple of esxtop data files into perfmon and see how easily it works with esxtop data.

-

Data Overload in Windows Performance Monitor

A downside of having such detailed data is that there is often too much of it. The screenshot shows only the %Used metric for vCPU for all instances collected. The graph becomes chaotic, and even selecting smaller sets of data in perfmon can be time-consuming. Unlike esxplot, there isn't a function that allows you to quickly search and filter data counters.

-

Combining esxplot and Windows Performance Monitor

It is possible to get the great search and filter capability of esxplot combined with the strong graphing tools of perfmon. We are going to use the exported data query set from esxplot and import it to perfmon.

Right click on Performance Monitor and select Properties.

Click on the Data tab.

Select all Counters and then click on Remove.

Click on Apply.

-

VisualEsxtop

A new VMware Labs fling, VisualEsxtop, was released just after we finalized this lab, so we weren't able to include it. However, we did want to make you aware of it.

VisualEsxtop is an enhanced version of resxtop and esxtop. VisualEsxtop can connect to VMware vCenter Server or ESX hosts, and display ESX server stats with a better user interface and more advanced features.

With VisualEsxtop, you can connect live to an ESXi host or vCenter Server, or load and replay batch output, and it even color-codes the important counters for you. It also includes tooltips describing the different counters. You can see this illustrated in the screenshot above.

You can download VisualEsxtop from the VMware Labs fling site here:

http://labs.vmware.com/flings/visualesxtop

-

NUMA

There is a bonus section next that uses esxtop, esxplot, and Windows Performance Monitor to examine NUMA and vNUMA on ESXi. If you have time remaining, you can continue on to the bonus section, or click on the Table Of Contents to jump to another module in this lab.

Looking at NUMA and vNUMA With esxtop Data

Using esxtop data collected from large ESXi hosts, this section takes a close look at NUMA CPU and vCPU scheduling. It illustrates the depth of information available in esxtop and shows how the ESXi scheduler considers and takes advantage of NUMA architecture.

-

In the Hardware section of the Summary tab, expand CPU. You will see that this ESXi host is running on a two-socket server, which means the host has 2 NUMA nodes.

A NUMA node is composed of processor cores and RAM. From a performance point of view, it is advantageous for processes to use the RAM attached to the same socket as the processor they are scheduled on. When a hypervisor, operating system, or application has been written to optimize its performance by keeping processes and the memory they use on the same NUMA node, it is said to be "NUMA aware".

Virtual NUMA is a feature of ESXi that allows for a virtual NUMA architecture to be exposed to the guest operating system. It is configured in the VM settings.

-

Edit VM Settings...

Set the number of CPUs to 2. You will see that the number of sockets increases to 2 with the Cores per Socket still at 1. This means that you would have a VM with two virtual NUMA nodes. This would exactly match the underlying hardware of the ESXi host (also 2 sockets with 1 core per socket).
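For reference, the same topology shows up in the VM's .vmx configuration file. A minimal, illustrative snippet (values chosen to match this example; this is not a complete configuration):

    numvcpus = "2"
    cpuid.coresPerSocket = "1"

Here numvcpus is the total vCPU count and cpuid.coresPerSocket controls how many cores each virtual socket presents, so 2 vCPUs at 1 core per socket yields the two virtual sockets described above.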

-

In the lab environment everything is very small, so it is not possible to simulate a very large NUMA environment. To give you the opportunity to work with more interesting data, we have included some esxtop data files from larger servers. We are going to use these data files to investigate how ESXi decides to schedule vCPUs with regard to NUMA. This will also give you a chance to try out esxplot and Windows Performance Monitor with esxtop data, as covered in the previous sections.

The first scenario we are going to look at is 65 View desktops running on an ESXi host. Each View desktop has a single vCPU. The physical server they are running on has two sockets with 6 cores per socket, giving the host 12 cores, or 24 logical threads with hyperthreading enabled. These 24 logical threads appear in esxtop as "physical CPUs". So 65 vCPUs have been defined and are in use, but there are only 24 threads for the ESXi scheduler to schedule them on: the host is overcommitted in terms of CPU resources.

Open esxplot by clicking on the shortcut on the desktop.

Click on File -> Import -> Dataset

Select 65Viewdesktops_run1.csv

Query For vCPU Data in esxplot

We want to reduce the data set down to just the information about the vCPUs used by the View desktop VMs, and see which physical CPU each was running on, to observe the results of the decisions made by the ESXi scheduler.

Enter the string -vcpu- in the query search box and click on the GO button.

Double click on the Query: -vcpu- in the results panel. This will expand the results. Scroll down and you will see the results only contain the vcpu data.

Right click on the Query: -vcpu- and select Export.

Save as 65Viewdesktops_run1_vcpuexport.csv.

Close esxplot.

Many Small VMs in Windows Performance Monitor

Start Perfmon from the desktop shortcut.

Click on Performance Monitor in the left panel.

Right click on the graph area and select Properties.

Click on the Source tab.

Select Log Files:

Click on Add...

Browse to C:\esxtopdata\

Select 65Viewdesktops_run1_vcpuexport.csv. Click Open.

Click Apply.

Many Small VMs Select vCPU Data in Windows Performance Monitor

Click on the Data tab.

The only counter category that will be available is vcpu because we have limited the data set with the query and export from esxplot.

Expand vcpu, scroll down, and select only Physical CPU. This counter reports which physical CPU the vcpu has been scheduled to run on.

Leave <All instances> selected and click on Add>>.

Then click on OK, and OK again.

Graph of vCPU Scheduling of Many Small VMs

The resulting graph shows a line for each vCPU, representing the physical CPU on which each of the View desktops was scheduled to run. It clearly shows the clean separation of NUMA nodes 1 and 2. Once a VM was scheduled to run on a NUMA node, it stayed on that NUMA node. Even though there was contention for CPU resources, the scheduler puts a high priority on keeping processes on the NUMA node they have been running on.

You can use the highlight feature of perfmon (select the little highlighter marker on the menu bar) to see specific vCPUs and how they were scheduled.

When finished browsing the data, close perfmon.
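If you would rather build the same picture programmatically, a short script can plot the Physical CPU counter for every vCPU instance in the exported file. The substring used to find the columns is an assumption about how esxtop labels the counter:

    # plot_vcpu_placement.py -- sketch: chart where each vCPU ran over
    # time; NUMA nodes should appear as separate horizontal bands.
    import csv
    import matplotlib.pyplot as plt

    SRC = "65Viewdesktops_run1_vcpuexport.csv"

    with open(SRC) as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = [i for i, name in enumerate(header) if "Physical" in name]
        series = [[] for _ in cols]
        for row in reader:
            for s, i in zip(series, cols):
                if row[i]:
                    s.append(float(row[i]))

    for s in series:
        plt.plot(s, linewidth=0.5)   # one line per vCPU
    plt.xlabel("sample")
    plt.ylabel("physical CPU id")
    plt.show()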

One VM Per NUMA Node

The next scenario we will look at is a four socket ESXi host with four VMs running. The number of VMs exactly matches the number of NUMA nodes. We will look at the same Physical CPU counter for the vCPUs, but in this case, each virtual machine will have many vCPUs.

We are going to follow essentially the same steps as before.

Open esxplot using the shortcut on the desktop.

File -> Import -> Dataset.

Select MultiLargeVM_run1.csv. Click on Open.

Query For vCPU Data in esxplot

Use the search string -vcpu- and click on GO

Expand the results in the lower left panel. You will see that the vCPUs are numbered for each VM from 0 to 19. You can expand any of these and look at individual metrics.

Export the -vcpu- data by right clicking on Query: -vcpu- and selecting Export.

Save as MultiLargeVM_run1_vcpuexport.csv.

Close esxplot.

One VM Per NUMA Node in Windows Performance Monitor

Open perfmon using the shortcut on the desktop.

Select Performance Monitor and then right click anywhere on the graph.

Select Properties and go to the Source tab.

Click on Add... and select MultiLargeVM_run1_vcpuexport.csv.

Then click Open and finally, Apply.

One VM Per NUMA Node Select vCPU Data

Click on the Data tab. Click on Add...

Expand vcpu and scroll down and select only Physical CPU.

Click Add>> to add instances.

Click OK, and OK again.

Graph of vCPU Scheduling for One VM Per NUMA Node

The resulting graph shows that each VM, with its 20 vCPUs, is scheduled on its own NUMA node and does not move. This is another example of the ESXi scheduler being NUMA aware and making scheduling decisions that optimize for performance. The guest operating system and the database running inside the guest know nothing about the NUMA architecture of the server they are running on, but they still benefit from NUMA performance in this scenario.

Again, you can use the highlight function of perfmon to quickly and visually see where each vCPU is running. They are grouped by VM in the list, so this makes it very easy to see where all of the vCPUs for a given VM are running.

Close perfmon.

Large Monster VM and vNUMA

The final scenario we are going to look at is a single large VM running on a four-socket server. The server has 10 cores per socket, for a total of 40 physical cores and 80 logical threads with hyperthreading enabled. These 80 hyperthreads will appear in the esxtop data as 80 "physical cpus". The monster VM has been configured with 4 virtual sockets and 10 vCPUs per socket. This reflects the underlying physical hardware, and providing this information to the guest allows it to make NUMA-aware decisions and achieve improved performance.

We will once again follow the same basic steps to take a look at the scheduling of vCPUs on Physical CPUs using the esxtop data.

Open esxplot from the shortcut on the desktop.

File -> Import -> Dataset. Select LargeDBVM_run1.csv

Query for vCPU Data from Monster VM in esxplot

Type -vcpu- in the Query search box and click on GO.

Double click on the Query: -vcpu- in the lower left panel and you will see the results. You can browse the results and see that this was a system under heavy utilization.

Export the results of the vCPU query by right clicking on Query: -vcpu- and selecting Export. Save as LargeDBVM_run1_vcpuexport.csv.

Close esxplot.

Large Monster VM esxtop Data in Windows Performance Monitor

Open perfmon using the shortcut on the desktop.

Click on Performance Monitor in the panel on the left, then right click anywhere on the graph and select Properties.

Click on the Source tab, select Log files:, and then click on Add...

Select LargeDBVM_run1_vcpuexport.csv.

Click Open, then Apply.

Large Monster VM Select Data in Windows Performance Monitor

Click on the Data tab and then click on Add...

Expand vcpu, scroll down, and select Physical CPU. Then click on Add >> to add all instances.

Click OK, and OK again.

Graph of Large Monster VM vCPU Data in Windows Performance Monitor

The resulting graph again shows the clear separation between the four NUMA nodes on the host. Further, if you use the highlight feature, you can see that the vCPUs map directly, such that all of the vCPUs on a virtual NUMA node stay together on the same physical NUMA node. This shows how the virtual NUMA feature allows NUMA to be used by the guest operating system and applications.

Close perfmon.
