
virtualization - vSphere education - What are the downsides of configuring VMs with *too* much RAM?



VMware memory management seems to be a tricky balancing act. With cluster RAM, Resource Pools, VMware's management techniques (TPS, ballooning, host swapping), in-guest RAM utilization, swapping, reservations, shares and limits, there are a lot of variables.



I'm in a situation where clients are using dedicated vSphere cluster resources. However, they are configuring the virtual machines as though they were on physical hardware. In turn, this means a standard VM build may have 4 vCPUs and 16GB or more of RAM. I come from the school of starting small (1 vCPU, minimal RAM), checking real-world use and adjusting up as necessary. Unfortunately, many vendor requirements and people unfamiliar with virtualization request more resources than necessary... I'm interested in quantifying the impact of this decision.







Some examples from a "problem" cluster.



Resource pool summary - Looks almost 4:1 overcommitted. Note the high amount of ballooned RAM.



Resource allocation - The Worst Case Allocation column shows that these VMs would have access to less than 50% of their configured RAM under constrained conditions.




The real-time memory utilization graph of the top VM in the listing above. 4 vCPU and 64GB RAM allocated. It averages under 9GB use.



Summary of the same VM








  • What are the downsides of overcommitting and overconfiguring resources (specifically RAM) in vSphere environments?


  • Assuming that the VMs can run in less RAM, is it fair to say that there's overhead to configuring virtual machines with more RAM than they actually need?


  • What is the counter-argument to: "if a VM has 16GB of RAM allocated, but only uses 4GB, what's the problem?" E.g. do customers need to be educated that VMs are not the same as physical hardware?


  • What specific metric(s) should be used to meter RAM usage? Tracking the peaks of "Active" over time? Watching "Consumed"?







Update: I used vCenter Operations Manager to profile this environment and get some detail on the cluster stats listed above. While things are definitely overcommitted, the VMs are actually so overconfigured with unnecessary RAM that the real (tiny) memory footprint shows no memory contention at the cluster/host level...




My takeaway is that VMs should really be right-sized with a little bit of buffer for OS-level caching. Overcommitting out of ignorance or vendor "requirements" leads to the situation presented here. Memory ballooning seems to be bad in every case, as there is a performance impact, so right-sizing can help prevent this.



Update 2:
Some of these VMs are beginning to crash with:



kernel:BUG: soft lockup - CPU#1 stuck for 71s! 


VMware describes this as a symptom of heavy memory overcommitment. So I guess that answers the question.










vCops "Oversized Virtual Machines" report...



vCops "Reclaimable Waste" graph...






Answer



vSphere's memory management is pretty decent, though the terms used often cause a lot of confusion.



In general, memory over-commit should be avoided as it creates exactly this type of problem. However, there are times when it cannot be avoided, so forewarned is forearmed!




What are the downsides of overcommitting and over-configuring resources
(specifically RAM) in vSphere environments?





The major downside of over-committing resources is that, should you have contention, your hosts will be forced to balloon, swap or de-duplicate memory behind the scenes in order to give each VM the RAM it needs.



For ballooning, vSphere inflates a "balloon" of RAM within a chosen VM, then hands the reclaimed RAM to the guest that needs it. This isn't really "bad" - the VMs are effectively lending each other RAM, so no disk swapping is going on - but it can lead to misfired alerting and skewed metrics if those rely on analysing the guest's RAM usage, because ballooned RAM isn't marked as such inside the guest; it simply appears to be "in use" by the OS.
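If you want to see exactly which VMs are being ballooned or swapped at any given moment, the per-VM quick stats in the vSphere API expose this directly. Below is a minimal pyVmomi sketch; the vCenter hostname, credentials and certificate handling are placeholder assumptions for a lab setup:

```python
# Sketch: list VMs that currently have ballooned or swapped memory (pyVmomi).
# Hostname and credentials are placeholders.
import atexit
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; use valid certificates in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ctx)
atexit.register(Disconnect, si)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], recursive=True)

for vm in view.view:
    qs = vm.summary.quickStats
    if qs.balloonedMemory or qs.swappedMemory:
        print(f"{vm.name}: ballooned={qs.balloonedMemory} MB, "
              f"swapped={qs.swappedMemory} MB")
```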



The other feature that vSphere can use is Transparent Page Sharing (TPS) - which is essentially RAM de-duplication. vSphere will periodically scan all allocated RAM, looking for duplicated pages. When found, it will de-duplicate and free up the duplicated pages.



Take a look at vSphere's Memory Management whitepaper (PDF) - specifically "Memory Reclamation in ESXi" (page 8) - if you need a more in-depth explanation.
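If you're curious how much RAM TPS is actually reclaiming on a host, the host-level performance counters "mem.shared.average" and "mem.sharedcommon.average" give a rough picture (shared minus sharedcommon is approximately the saving). A hedged sketch, reusing the `si` session from the previous example; the ESXi hostname is a placeholder:

```python
# Sketch: sample host-level TPS counters via the performance manager (values in KB).
from pyVmomi import vim

content = si.RetrieveContent()
perf = content.perfManager

# Build a "group.name.rollup" -> counter-ID map once.
counter_ids = {
    f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
    for c in perf.perfCounter
}

host = content.searchIndex.FindByDnsName(dnsName="esxi01.example.com",
                                         vmSearch=False)

spec = vim.PerformanceManager.QuerySpec(
    entity=host,
    metricId=[
        vim.PerformanceManager.MetricId(
            counterId=counter_ids["mem.shared.average"], instance=""),
        vim.PerformanceManager.MetricId(
            counterId=counter_ids["mem.sharedcommon.average"], instance=""),
    ],
    intervalId=20,   # 20-second "real-time" samples
    maxSample=1,
)

for result in perf.QueryPerf(querySpec=[spec]):
    for series in result.value:
        print(series.id.counterId, series.value[-1], "KB")
```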




Assuming that the VMs can run in less RAM, is it fair to say that
there's overhead to configuring virtual machines with more RAM than
they need?




There's no visible overhead - you can allocate 100 GB of RAM to VMs on a host with only 16 GB (however, that doesn't mean you should, for the reasons above).



Total memory in use by all of your VMs is the "Active" curve shown in your graphs. Of course, you should never rely on just that figure when calculating how much you would like to overcommit, but since you have historical metrics, you can analyse them and work out a sensible level based on actual usage.



The difference between "Active" and "Consumed" RAM is discussed in this VMware Community thread.
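If it helps to put those two figures (and the configured size) side by side per VM, the same quick stats used above expose an "Active" estimate (guestMemoryUsage) and a "Consumed" figure (hostMemoryUsage), both in MB. A sketch reusing the `si` session from earlier; the 25% threshold below is an arbitrary illustration, not a recommendation:

```python
# Sketch: compare configured vs consumed vs active memory per VM and flag
# obvious right-sizing candidates (the 25% threshold is illustrative only).
from pyVmomi import vim

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], recursive=True)

for vm in view.view:
    configured = vm.summary.config.memorySizeMB or 0         # allocated (MB)
    consumed = vm.summary.quickStats.hostMemoryUsage or 0    # "Consumed" (MB)
    active = vm.summary.quickStats.guestMemoryUsage or 0     # "Active" estimate (MB)
    flag = "  <-- oversized?" if configured and active < configured * 0.25 else ""
    print(f"{vm.name}: configured={configured} MB, "
          f"consumed={consumed} MB, active={active} MB{flag}")
```

Bear in mind a single "Active" sample is only an estimate; for sizing decisions it's better to look at historical peaks, as above.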





What is the counter-argument to: "if a VM has 16GB of RAM allocated,
but only uses 4GB, what's the problem?" E.g. do customers need to be
educated?




The short answer to this is yes - customers should always be educated in best practices, regardless of the tools at their disposal.



Customers should be educated to size their VMs according to what they use, rather than what they want. A lot of the time, people will over-specify their VMs just because they might need 16 GB of RAM, even if they're historically bumbling along on 2 GB day after day. As a vSphere administrator, you have the knowledge, metrics and power to challenge them and ask them if they actually need the RAM they've allocated.




That said, if you combine vSphere's memory management with carefully-controlled overcommit limits, you should rarely have an issue in practice; the likelihood of running out of RAM for an extended period of time is relatively remote.



In addition to this, automated vMotion (the Distributed Resource Scheduler, or DRS, in VMware terms) is essentially a load-balancer for your VMs - if a single VM becomes a resource hog, DRS should migrate VMs around to make the best use of the cluster's resources.




What specific metric should be used to meter RAM usage? Tracking the
peaks of "Active" versus time?




Mostly covered above - your main concern should be "Active" RAM usage, though you should carefully define your overcommit thresholds so that you're warned if you reach a certain ratio (this is a decent example, though it may be slightly outdated). Typically, I would stay within 120% of total cluster RAM, but it's up to you to decide what ratio you're comfortable with.
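To put a number on that ratio (total configured VM RAM against total physical host RAM in the cluster), something along these lines can work; the cluster name is a placeholder and the 1.2 ceiling simply mirrors the 120% figure above:

```python
# Sketch: compute a cluster's RAM overcommit ratio and compare it to a 120% ceiling.
# Assumes all hosts in the cluster are connected; cluster name is a placeholder.
from pyVmomi import vim

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], recursive=True)

for cluster in view.view:
    if cluster.name != "ProblemCluster":                      # placeholder name
        continue
    host_ram_mb = sum(h.hardware.memorySize for h in cluster.host) / (1024 * 1024)
    vm_ram_mb = sum(vm.summary.config.memorySizeMB or 0
                    for h in cluster.host
                    for vm in h.vm
                    if vm.summary.runtime.powerState ==
                    vim.VirtualMachinePowerState.poweredOn)
    ratio = vm_ram_mb / host_ram_mb if host_ram_mb else 0
    status = "OK" if ratio <= 1.2 else "over the 120% ceiling"
    print(f"{cluster.name}: {vm_ram_mb:.0f} MB allocated / "
          f"{host_ram_mb:.0f} MB physical = {ratio:.0%} ({status})")
```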







