A friend pointed me to a few articles discussing some of the new features in VMware View 3. These articles went straight to the point for me regarding the hype surrounding linked-clone technology and how it might not be the great solution that everyone makes it out to be.
The first article comes from a blog that I will now start tracking: vinternals: VMware View – Linked Clones Not A Panacea for VDI Storage Pain!.
The author makes two points, the first being that “snapshots can grow up to the same size as the source disk.” While this is not a common situation, the author points out that the Windows NTFS filesystem will always write to blocks on the disk that are zeroed (completely empty) before it will write to blocks containing deleted files. The author gives the example of a guest with 10GB of free space according to Windows: writing and then deleting a 1GB file 10 times will result in the snapshot growing to 10GB.
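To make that write/delete example concrete, here is a minimal Python sketch. It is a toy model of the behavior as the author describes it (zeroed blocks are always used before freed ones), not a model of real NTFS allocation, and the function name and GB-level granularity are my own assumptions.

```python
def delta_growth_gb(free_gb, file_gb, cycles):
    """Return the delta-disk size (GB) after each write/delete cycle.

    Toy model (assumption): the guest always fills never-written, zeroed
    free space before it reuses space freed by deleted files.
    """
    never_written = free_gb  # zeroed free space the guest has not touched yet
    freed = 0                # space released by deletes, already in the delta
    delta = 0                # size of the snapshot delta disk
    history = []
    for _ in range(cycles):
        fresh = min(file_gb, never_written)  # zeroed blocks are used first
        reused = file_gb - fresh             # the rest must come from freed space
        if reused > freed:
            raise ValueError("file does not fit in the guest's free space")
        never_written -= fresh
        freed -= reused
        delta += fresh         # only first-time writes grow the delta
        history.append(delta)
        freed += file_gb       # delete the file; its blocks become reusable
    return history


print(delta_growth_gb(free_gb=10, file_gb=1, cycles=10))
```

Under this model the guest never holds more than 1GB of data at a time, yet the delta grows by 1GB per cycle until it reaches 10GB. Only once the zeroed space is exhausted does the delta stop growing.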
The gist of this point is that linked clones are pitched as a space-saving measure: all the clones reference the master snapshot for their base image and record any changes they make in their own snapshot delta disk. The problem arises when you have a user population with a very wide diversity of applications that, for various reasons, cannot be included in the base image. This means snapshot growth for installation, patching, regular use, etc. I can’t think of any good way to estimate how large a snapshot will grow without actually trying it. So it becomes very scary if you need to plan a storage environment of a certain size and cannot plan for the growth, other than by expecting the worst-case scenario.
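A rough sketch of why planning for the worst case hurts, in Python. All the numbers here (a 20GB base image, 100 clones, a hoped-for 2GB delta per clone) are made up for illustration, and the function name is hypothetical; the only input taken from the article is that a delta can grow to the size of the source disk.

```python
def linked_clone_capacity_gb(base_gb, clones, hoped_delta_gb):
    """Compare storage needs under three simple planning assumptions."""
    return {
        # every clone is a full, independent copy of the base image
        "full_clones": clones * base_gb,
        # one shared base plus a small delta per clone (the sales pitch)
        "linked_hoped": base_gb + clones * hoped_delta_gb,
        # one shared base plus deltas that each grow to the source disk size
        "linked_worst": base_gb + clones * base_gb,
    }


plan = linked_clone_capacity_gb(base_gb=20, clones=100, hoped_delta_gb=2)
for name, gb in plan.items():
    print(f"{name}: {gb} GB")
```

With these made-up numbers the hoped-for case needs 220GB, but the worst case needs 2020GB, which is slightly more than the 2000GB that full clones would use. If you have to provision for the worst case, the space saving disappears on paper.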
The second point that the author makes concerns a problem that I’ve battled a lot over the last couple of years: LUN locking on the storage array. As I have long known, and as the author points out, “a lock is acquired on a VMFS volume whenever volume metadata is updated. Metadata updates occur everytime a snapshot file is incremented, at the moment this is hardcoded to 16MB increments.” This plays into the recommendation to keep the number of snapshotted VMs per LUN low. So if you have a very large VDI environment, a very large number of LUNs need to be allocated in order to keep LUN locking/SCSI reservations under control, leading to a storage management nightmare. And VMFS locks aren’t only for snapshots; they also occur for VM power-on/off and VMotions. We don’t have a lot of VM power-on/off operations, but VMotions are almost always happening.
The author’s main point about LUN locking, though, is that snapshots grow in 16MB increments, so during an initial deployment, when users start launching and installing applications (which again, for various reasons, can’t be in the master snapshot), a lot of locks would be acquired as the snapshots expand.
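For a sense of scale, here is a back-of-the-envelope Python sketch. It assumes exactly one metadata update (and so one lock) per 16MB of snapshot growth, as the quote states, and ignores how long each lock is held. The 500MB install size and 30 VMs per LUN are made-up figures, and the function name is hypothetical.

```python
import math

INCREMENT_MB = 16  # snapshot growth increment, per the quoted article


def metadata_updates(growth_mb, vms_per_lun=1):
    """Estimate VMFS metadata updates caused by snapshot growth on one LUN.

    Assumption: each 16MB increment of each snapshot triggers exactly one
    metadata update, and therefore one lock on the volume.
    """
    per_vm = math.ceil(growth_mb / INCREMENT_MB)
    return per_vm * vms_per_lun


# One user installing a 500MB application:
print(metadata_updates(500))
# Thirty users on the same LUN doing the same thing:
print(metadata_updates(500, vms_per_lun=30))
```

That works out to 32 locks for a single install and 960 for thirty of them on one LUN, before counting any power operations or VMotions.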
These two points make me wary of using snapshots, something I have stayed away from in our server environment and will continue to use sparingly, and only for temporary purposes such as upgrades.
The second article is a response to the first. Musings of Rodos: Linked Clones Not A Panacea takes the two points and responds in a way that seems positive to him but to me further confirms my fears about snapshots and LUN locking.
In regards to snapshot growth, the second author recommends automated desktop refreshes at a regular interval, where the VM delta file is removed and the machine is essentially reverted to a clean slate. For a large environment like mine, this is not practical. With so many users and a diverse collection of applications (over 400), it is not possible to force the users to reinstall their applications every couple of days in order to keep the snapshot growth low. And either way, once they reinstall, the snapshots start growing out of control again. Some would recommend ThinApp’ing these applications so there is no install in the VM, and I would agree that this is a good idea in theory, but it remains to be seen whether it is possible. The number of hours needed to move all our applications from their current delivery method to ThinApp is a hidden cost to the whole environment and basically not feasible given our manpower and other commitments. And one can pretty much guarantee that many of the apps will not play nice with ThinApp packaging in one way or another, requiring further hours to track down and resolve issues.
So frequently refreshing VMs would not be possible in my environment.
The second author’s response to the LUN locking issue is basically to pay close attention and rebalance datastores/LUNs if there are problems. That’s a great statement to make but again, I can imagine the few weeks of time that would take – not to mention the interruption to the users – so it becomes a pretty big deal and not just something to toss around lightly.
This author does make a point that the first author did not: the possibility of I/O storms in the VDI environment, something with which I am extremely familiar. All the author says is that “things could get very interesting.” Well, I’ll tell you how interesting they can get: how about bringing the whole environment to its knees? And this was without snapshots! Imagine all the LUN locking that would occur for snapshot growth if we were using linked clones. And speaking of that, imagine just the growth! We already need to be careful with how we roll out updates for applications, antivirus, OS patches, etc. Adding these new storage pieces into the mix just makes things more complicated.
The first author (Stu) comments on the second author’s article and points out that the support cost for refreshing the environment every 9-12 months would have wiped out any storage savings. I think this needs to be brought front and center. Any savings achieved through technology are great, but if the technology demands a huge increase in manpower to manage it correctly, the savings become negative. I don’t have any numbers, but I am certain that our overtime for this project, along with our reliance on enterprise storage, has negated any savings we hoped to have.
Basically, I feel like people are tripping over themselves to get into VMware View 3, and I don’t want to be the first to get there and discover new challenges that need solving. I need to hear all the mundane, minute details from enterprise customers with environments similar to mine, showing exactly how linked clones and application virtualization have saved them time and money, before I’ll give it a shot.
In defense of VMware, so this post doesn’t sound too negative, I tell them all the time that I am consistently impressed with their innovation and drive forward in this and other virtualization arenas. I am a fan of their virtualization goals, and I always look forward to their software releases for the new features and improvements they bring. If View 3 and this implementation of linked clones isn’t the panacea that some make it out to be, then perhaps the next few iterations will make progress towards getting there.