How to reclaim thin provisioned space on IBM FlashSystem V840

Reclaiming deleted VMware space on thin provisioned LUNs is a topic that has been discussed for a long time, ever since the capability was introduced in vSphere 5.0.  This VMware link includes some background on the efforts VMware has made to address the problem of thin provisioned volumes growing but never shrinking.

The current status of this functionality for VMware is described in this link, but I can save you the trouble and summarize.  The process requires a manual command to be executed against a specified VMFS volume, along with a number of blocks to process per iteration.  VMware creates a temporary file on the VMFS volume that zeroes out blocks that are no longer used, and then instructs the supported storage array via a VAAI API that those blocks can be reclaimed.  For an administrator this means starting a CLI command per VMFS datastore and monitoring it through completion, since “If the UNMAP operation is interrupted (for instance by pressing CTRL-C), a temporary file may be left on the root of a VMFS datastore.”  The key to the VMware procedure is that the storage array must support the VAAI UNMAP primitive, which IBM XIV does, but several of the IBM storage products (IBM FlashSystem, Storwize, and SVC) do not.
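On vSphere 5.5, for example, the manual reclaim looks something like the following, run from the ESXi shell (a sketch only; the datastore label and reclaim unit below are placeholders):

    # Reclaim free blocks on one VMFS datastore, processing 200 blocks per iteration
    esxcli storage vmfs unmap -l MyDatastore -n 200

The command has to be repeated (or scripted) for every datastore you want to reclaim.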

I was working with a customer this week who was deploying their VMware environment onto storage volumes with IBM Real-time Compression enabled.  This has become a very popular option, since Real-time Compression on IBM FlashSystem V840 can reduce certain virtualized data sets by up to 80% while still providing microsecond latency.  A compressed volume is thin provisioned by default, so this customer was concerned about how to reclaim deleted space given that VMware UNMAP was unavailable.  Fortunately, compressed volumes CAN reclaim deleted space for VMware (or other operating systems and applications) without the VMware UNMAP functionality.  Here is how:

Real-time Compression runs a periodic maintenance process; I will remain vague here as I have not yet spoken to the developers to understand the details of how it runs.  This maintenance cleans up deleted blocks and reclaims space automatically, provided it can tell that the blocks are clear, that is, zeroed out.

There are a couple of scenarios in which an administrator may want to reclaim space.  The first is changes within the guest VM file systems.  Blocks can be zeroed out from inside guests using a tool such as Microsoft SDelete, which zeroes space that may have been previously consumed and then deleted.  The second scenario is changes at the VMFS level, for example moving VMs to/from volumes or creating/deleting VMs.  I will focus on this second scenario.
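For the first scenario, running something like the following inside a Windows guest will zero its free space (a sketch; the drive letter is a placeholder, and the exact flag depends on the SDelete version you have):

    REM Zero out free space on the guest's C: drive so the array can reclaim it
    sdelete.exe -z c: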

So the first thing I did was provision a 4 TB FlashSystem V840 compressed volume, add it as a VMFS datastore, and use a script to provision many VMs from a Windows 2008 R2 base template.  After that was all done, the stats for my datastore from VMware appeared as:

[Screenshot: RTC Reclaim - VMware space allocated]

4.00 TB capacity, 6.67 TB provisioned, 557.98 GB free.

And from my V840 they appeared as:

[Screenshot: RTC Reclaim - V840 space allocated]

4 TB capacity, 3.44 TB consumed before compression, 1.78 TB consumed after compression.

So for this base template Real-time Compression reduced the capacity from 3.44 TB to 1.78 TB, or a 48% reduction.

Next I deleted all of the VMs from my datastore so that it was 100% free from a VMware viewpoint.  From a V840 viewpoint, however, my usage did not change from the 1.78 TB in the previous screenshot, which is expected because the V840 is not aware that those blocks are now empty.  So, similar to the VMware UNMAP process or the in-guest SDelete referenced earlier, the VMFS datastore needs to be zeroed out.  One way to zero out the capacity is by provisioning an Eager Zeroed Thick (EZT) VMDK.

Since in this example the VMFS datastore is empty, my VMDK could be created to consume the majority of the capacity.  In a real environment I would probably provision the VMDK to consume all but 5% of the free capacity.

My next step was to create a 4 TB VMDK:

[Screenshot: RTC Reclaim - Create VMDK]
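If you prefer the ESXi shell to the vSphere Client, an eager zeroed thick VMDK can be created with vmkfstools along these lines (a sketch; the datastore path and size are placeholders):

    # Create an eager zeroed thick VMDK that fills most of the datastore's free space
    mkdir /vmfs/volumes/RTC_Datastore/zerofill
    vmkfstools -c 3800g -d eagerzeroedthick /vmfs/volumes/RTC_Datastore/zerofill/zerofill.vmdk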

Something to consider is that the FlashSystem V840 will not actually write zero blocks to the flash capacity; our controllers detect zeros and discard them.  And since we utilize VAAI block zero, the process of creating a large VMDK is offloaded to the FlashSystem V840, where it executes extremely quickly.
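If you want to confirm that the block zero offload is active on a host, the relevant advanced setting can be checked from the ESXi shell (a value of 1 means enabled):

    # 1 = VAAI block zero (HardwareAcceleratedInit) is enabled on this host
    esxcli system settings advanced list -o /DataMover/HardwareAcceleratedInit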

Once my VMDK creation completed, I deleted the new VM from the datastore, which left my 4 TB VMFS datastore 100% free and zeroed out.  Then all I had to do was wait for Real-time Compression on the FlashSystem V840 to automatically perform the cleanup operation on the volume.  I actually went to bed for this part, and upon logging on the next morning my volume had been reclaimed back down to near its original size:

[Screenshot: RTC Reclaim - V840 reclaimed]


This process could easily be scripted to run over a weekend or during a downtime window, or simply used on an as-needed basis; a rough sketch follows below.  Since I don’t have a production environment it is hard to get a feel for how frequently an administrator may want to perform this operation, as it will depend heavily on the data change rate in your environment.  But in any case, I hope I have demonstrated how easy this process is to use.
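As an illustration only, a scheduled reclaim pass from the ESXi shell could look something like this (the datastore path and VMDK size are assumptions, and in practice you would leave some free headroom rather than filling the datastore completely):

    #!/bin/sh
    # Fill free space with zeros via an eager zeroed thick VMDK, then delete it.
    # Real-time Compression reclaims the zeroed space during its next maintenance pass.
    DS=/vmfs/volumes/RTC_Datastore
    mkdir -p $DS/zerofill
    vmkfstools -c 3500g -d eagerzeroedthick $DS/zerofill/zerofill.vmdk
    vmkfstools -U $DS/zerofill/zerofill.vmdk
    rmdir $DS/zerofill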


Document availability – IBM FlashSystem and VMware configuration guide

The IBM FlashSystem Solution Engineering team has recently published a document which should be useful for those deploying VMware vSphere on IBM FlashSystem products.  The IBM FlashSystem and VMware vSphere configuration guide can be downloaded here.

The document was inspired by a peer who had a great idea for a quick reference with best practices for VMware and FlashSystem.  I expect this to be a living document that is updated frequently and added to, so let me know if there are topics you’d like more information on.

This document covers the configuration for both of our FlashSystem products, FlashSystem 840 and FlashSystem V840, with VMware vSphere.


Flash this blog forward

This blog was always intended to be a hobby for sharing information related to my profession that I find interesting.  As most hobbies go when time is limited, this blog has fallen by the wayside over the past year.  Unfortunately there have been a significant number of interesting things happening which I’ve failed to blog about; fortunately, I’ve been involved in many of them!

Professionally, I’ve been working within a new organization at IBM over the past nine months.  I’m still working with server, desktop, and storage virtualization, but now focused on those solutions with IBM FlashSystem.  We are very busy and things are moving very quickly in this space.  However, I am renewing my focus on this blog and will be attempting to share some unique information.  If there is anything you’d like to hear about please comment; otherwise I will be working on a series of posts.


Warning – SVC Stretched Cluster with vSphere 5.5

A new feature called Permanent Device Loss (PDL) AutoRemoval was introduced in vSphere 5.5.  This KB outlines the feature, what it does, and why.  The basic idea is to automatically remove devices from an ESXi host once they enter a PDL state, that is, once the array reports via SCSI sense codes that the device is permanently unavailable.  It helps to keep things clean and tidy.

When I first heard of this feature I thought there might be implications for vMSC environments (SVC Stretched Cluster running VMware), and this morning Duncan Epping from Yellow-Bricks confirmed this.

The recommendation is to disable PDL AutoRemoval for SVC Stretched Clusters.  If devices are auto removed, the result when access is restored is undesirable behavior: the volumes will not be accessible from a VMware perspective without manual intervention.
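On ESXi 5.5 this can be done per host through the Disk.AutoremoveOnPDL advanced setting, for example from the ESXi shell (a sketch; verify the recommended value against the KB for your build):

    # 0 = do not automatically remove devices that enter a PDL state
    esxcli system settings advanced set -i 0 -o /Disk/AutoremoveOnPDL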


Need a storage solution for 120,000 Exchange 2013 users? Read on…

The phrase “we need storage for 120,000 Exchange 2013 users” likely causes most storage administrators to shudder, in some cases faint, and in extreme cases run away.  To be fair, Microsoft has made significant strides in reducing the storage performance requirements of Exchange with each release; Exchange 2013 reduced IOPS by 50% compared to Exchange 2010, for example.  But regardless of the efficiency of Exchange 2013, 120,000 users is still a large number of mailboxes to manage in terms of capacity, resiliency, and performance.

In case you are not familiar with Microsoft ESRP, here’s a little background.  The Microsoft Exchange Solution Reviewed Program (ESRP) is a program designed by Microsoft to help third parties test and validate storage solutions for Exchange.  IBM was an early participant in this program and has been very active since.  The ESRP submissions are evaluated by Microsoft and include log files so customers can review the performance themselves.

One of the things we try to do when putting together an ESRP solution is to use configurations that make sense to customers.  We ask what type of configuration a customer is likely to deploy (or could deploy with the storage) and use that in the solution.  The tests are not intended to be benchmarks of the storage systems; instead they demonstrate a storage configuration that can support a given Exchange workload.

Late last week the Exchange 2013 ESRP for IBM XIV Gen3.2 was published.  It can be accessed here.  The XIV is unique in its method of provisioning storage; “simple” is the single best word I can use to describe it.  When you are designing a storage solution for 120,000 Exchange users, “simple” is a welcome adjective.

I will leave the details of why XIV makes sense for Exchange to the white paper, but here are a few highlights of the solution that was tested:

– 120,000 mailboxes
– 2 GB per mailbox
– 0.16 IOPS per mailbox
– IBM XIV Gen3.2
– 360x 4 TB disk drives


Benefits of Atomic Test and Set (ATS) with IBM Storwize family

As the regular author of this blog I wanted to provide a short introduction for a guest contributor.  Over the past few months I have been transitioning to new challenges, so my day-to-day work on VMware and IBM storage has become limited.  A new technical expert has taken on the role of authoring VMware papers and best practices, and he has graciously agreed to write about some of his recent work.  Thank you Jeremy Canady for the great contributions. – Rawley Burbridge

At this point, vSphere Storage APIs for Array Integration (VAAI) is old hat for most VMware administrators. In fact, this blog already covered VAAI in a previous post and white paper back in 2011. Since that time there have been multiple releases of vSphere, and support for the various block and NAS VAAI primitives continues to increase among storage systems. With the 7.1.0.1 code release for the Storwize family, we decided to revisit the topic using current vSphere versions and the new 7.1.0.1 release.

I have no intention of diving into all the details of VAAI, as there are many very good resources online as well as the full white paper. What I would like to highlight in this post is the testing and results of the Atomic Test and Set (ATS) primitive.

Atomic Test and Set

As you are likely aware, VMFS is a clustered file system that allows multiple hosts to simultaneously access the data located on it. The key to a clustered file system is handling conflicting requests from multiple hosts. VMFS utilizes file level locking to prevent conflicting access to the same file. The locking information is stored in the VMFS metadata which also must be protected from conflicting access. To provide locking for the metadata, VMFS originally only relied upon a SCSI reservation for the full VMFS volume. The SCSI reservation locks the complete LUN and prevents access from other hosts. This results in actions such as snapshot creation or virtual machine power on temporarily locking the complete VMFS datastore. This locking mechanism poses performance and scalability issues.

Atomic Test and Set (ATS) provides an alternative to the SCSI reservation method of locking. Instead of locking the complete LUN with a SCSI reservation, ATS allows the host to lock only the single sector that contains the metadata it would like to update. As a result, only genuinely conflicting access is prevented.
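To see whether ATS is in play for a host and a device, a couple of quick checks from the ESXi shell can help (a sketch; the device identifier below is a placeholder):

    # 1 = ATS (HardwareAcceleratedLocking) is enabled on this host
    esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking
    # Shows whether the array reports ATS and the other VAAI primitives as supported
    esxcli storage core device vaai status get -d naa.60050768xxxxxxxxxxxxxxxxxxxxxxxx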

With the update to the VAAI paper we wanted to show the actual benefits of ATS with the Storwize family. To do so, two tests were devised: one with artificially generated worst-case locking and one with a more real-world configuration. The setup consisted of two IBM Flex System x240 compute nodes running ESXi 5.1 and sharing a single VMFS datastore provided via FC from a Storwize V7000. Each test was run with the locking load and ATS disabled, without the locking load, and with the locking load and ATS enabled.

Simulated Workload

To generate worst-case locking we needed a way to quickly generate metadata updates on the VMFS file system. To do this, a simple bash script was created that continually executed the touch command against a file located on the VMFS datastore. The touch command updates the time stamp of the file, resulting in a temporary metadata lock each time it is run.
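The script was essentially nothing more than a tight loop; a reconstruction would look something like this (the datastore path and file name are assumptions):

    #!/bin/sh
    # Continually update a file's timestamp to force repeated VMFS metadata locks
    while true; do
        touch /vmfs/volumes/ATS_Test/lockfile
    done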

To measure the impact, a VMDK located on the shared VMFS datastore was attached to a VM that had IOMeter loaded. A simple 4 KB sequential read workload was placed on the VMDK and measured for each run. The results can be seen below.

[Chart: Artificial locking workload - IOPS]

As you can see, without ATS there is a severe impact on disk performance. With ATS enabled, performance increased by over 400% and equaled the performance when no locking workload was applied. This is to be expected, as the artificially generated locking should never conflict with access to the VMDK when ATS is enabled.

To measure the severity of the locking, esxtop was used to monitor the number of SCSI reservation conflicts per second. As you can see, ATS all but eliminated the conflicts.
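For reference, the conflict counter lives in esxtop's disk device view (a usage sketch; you may need to enable the field with 'f' depending on your esxtop defaults):

    # Run esxtop, press 'u' for the disk device view, and watch the CONS/s column
    # (SCSI reservation conflicts per second)
    esxtop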

[Chart: Artificial locking workload - conflicts per second (CONS/s)]

Real World Backup Snapshot Workload

Backup processes need to be quick and have the smallest impact possible. In a VMware environment most backup processes require a snapshot of the virtual machine to be created and deleted. Snapshot creation and deletion require metadata updates and when scaled out can cause issues on heavily populated datastores.

To test the effects of ATS on large-scale snapshot creation and deletion, a VMDK was placed on a VMFS volume that contained ten additional virtual machines. The VMDK was attached to a VM running Iometer with a simple 4 KB sequential workload applied to it. A script was created to continually create and remove snapshots from the ten running VMs, and the resulting Iometer performance was monitored while the snapshot workload was applied. The results can be seen below.
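A snapshot-churn loop along these lines could be built with vim-cmd from the ESXi shell (a rough sketch; the VM IDs are placeholders you would take from vim-cmd vmsvc/getallvms):

    #!/bin/sh
    # Repeatedly create and then remove snapshots on ten VMs to generate VMFS metadata updates
    VMIDS="1 2 3 4 5 6 7 8 9 10"
    while true; do
        for id in $VMIDS; do
            vim-cmd vmsvc/snapshot.create $id ats-test "ATS locking test" 0 0
        done
        for id in $VMIDS; do
            vim-cmd vmsvc/snapshot.removeall $id
        done
    done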

[Chart: Snapshot workload - IOPS]

As you can see, there was a sizable impact on performance with ATS disabled. Additionally, notice that the performance with ATS enabled is nearly identical to the performance when no snapshot workload was running.

Again, esxtop was used to track the conflicts per second. As you can see below, the number of conflicts was significantly reduced.

[Chart: Snapshot workload - conflicts per second (CONS/s)]

I hope these simple tests show the huge improvement that ATS provides to the scaling of a single VMFS datastore.

A complete white paper on this topic as well as the other supported VAAI primitives is available here.


Using Microsoft Windows Server 2012 thin provisioning space reclamation on IBM XIV

Last month the IBM XIV 11.2 microcode was announced.  One of the features in this code is support for the T10 SCSI UNMAP standard.  The benefit of SCSI UNMAP is that it solves a problem commonly associated with thin provisioned volumes.

The typical behavior of thin provisioned volumes is that as data is generated and space consumed, the thin provisioned volume in turn consumes more physical capacity.  If data is removed from the thin provisioned volume (for example by a virtual machine migration), that space is freed on the file system; however, the physical capacity on the storage system remains consumed.  This puts your physical storage capacity in a state of perpetual growth.  There have been solutions (usually manual) that consisted of running tools like SDelete and then mirroring the volume.

Enter SCSI UNMAP, which solves this problem by issuing SCSI commands to the physical storage when blocks are freed.  In simple functional terms, if I free up space on my file system, for example by migrating a virtual machine, then that space is also reclaimed from the physical capacity on the storage system.  More details on SCSI UNMAP are available in this Microsoft document.

SCSI UNMAP is a standard, so the support built into XIV should enable other applications to take advantage of it in the future.  For example, VMware released SCSI UNMAP support before announcing issues with it; eventually this functionality should be available as well.

Microsoft Windows Server 2012 included SCSI UNMAP support for thin provisioning at its release, and it is the key application supported by XIV in the 11.2 code.  IBM has released a comprehensive white paper which discusses this feature in greater detail along with implementation considerations.  The white paper is a good read for anyone, but particularly for those who are running or will be running Windows Server 2012 on IBM XIV.
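As a quick sanity check on a Windows Server 2012 host, you can verify that delete notifications (the trigger for UNMAP) are enabled (a sketch; behavior may vary by configuration):

    REM 0 means delete notifications (SCSI UNMAP/TRIM) are enabled on this host
    fsutil behavior query disabledeletenotify

From PowerShell, Optimize-Volume with the -ReTrim switch can also be used to send UNMAP for space that has already been freed on a volume.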

Microsoft Windows Server 2012 thin provisioning space reclamation using the IBM XIV Storage System
