
Data Storage: The Foundation & potential Achilles Heel of Cloud Computing

November 17, 2009

In almost anything that you read about Cloud Computing, the statement that it is ‘nothing new’ is usually made at some point. The statement then goes on to qualify Cloud Computing as a cumulative epiphenomenon that serves more as a single label for a multi-faceted substrate of component technologies than as the name of a single new technology paradigm. Used together, these technologies make up what could be defined as a cloud. As the previous statement makes apparent, the definition is somewhat nebulous. I could provide a long list of the component technologies within the substrate that could ‘potentially’ be involved. Instead, I will filter out the majority and focus on a subset that could be considered ‘key’ components in making cloud services work.

If we were to try to identify the most important component of this substrate, most would agree that it is virtualization. In the cloud, virtualization occurs at several levels. It can range from ‘what does what’ (server & application virtualization) to ‘what goes where’ (data storage virtualization) to ‘who is where’ (mobility and virtual networking). Viewed this way, one could even conclude that virtualization is the key enabling technology that all other components either rely on or embody in some subset of their functionality.

As an example, at the application level Web Services and Service Oriented Architecture serve to abstract & virtualize the application resources required to provide a certain set of user exposed functions. Going further, whole logical application component processes can be strung together in a workflow to create an automated, complex business process that can be kicked off by the simple submission of an online form on a web server.

If we look further, underneath this we can identify another set of technologies in which the actual physical machine is host to multiple resident ‘virtual machines’ (VMs) that house different applications within the data center. Additionally, these VMs can migrate from one physical machine to another or invoke clones of themselves that can in turn be load balanced for improved performance during peak demand hours. At first this was a more or less local capability, limited to the physical machines within the Data Center, but recent advances in something known as ‘stretch clustering’ enable migrations to remote Data Centers or secondary sites in response to primary site failures and outages. This capability has been a great enabling tool for prompt Disaster Recovery plans covering critical applications that absolutely need to stay running and accessible.

In order for the above remote VM migration to work, however, there needs to be consistent representation of and access to data. In other words, the image of the working data that VM #1 has access to at the primary site needs to be available to VM #2 at the secondary site. Making this occur with traditional data storage management methods is possible but extremely complex, inefficient and costly.

Virtualization is also used within storage environments to create virtual pools of storage resources that can be used transparently by the dependent servers and applications. Storage Virtualization not only simplifies data management for virtualized services but also provides the actual foundation for all of the other forms of virtualization within the cloud, in that the data needs to be always available to the dependent layers within the cloud. Indeed, without the data, the cloud is nothing but useless vapor.

This is painfully evident in some of the recent press around cloud failures, most notably the T-Mobile Sidekick failure that was the result of the failure of Microsoft’s Danger subsidiary to back up key data prior to a storage upgrade that was being performed by Hitachi. Many T-Mobile users woke up one morning to find that their calendars and contact lists were non-existent. After some time, T-Mobile was forced to tell many of its subscribers that the data was permanently lost and not recoverable. This particular instance has had a multi-level reverberation that impacted T-Mobile (the Mobile Service Provider), Microsoft Danger (the Data Management Provider), Hitachi (the company performing the storage upgrade) and finally the thousands of poor mobile subscribers who arguably bore the brunt of the failure. To be fair, Microsoft was able to restore most of the lost data, but this was only after days had passed. Needless to say, the legal community is now abuzz over potential lawsuits and some are already in the process of being filed.

The reasons for the failure are not really the primary purpose of the example. The example is intended to illustrate two things. First, while many think that Cloud Computing somehow takes us beyond traditional IT practices, it does not. In reality, Cloud Computing builds upon them and is in turn dependent upon them to function as intended. The responsibility for performing them can be vague, however, and needs to be clearly understood by all parties. Second, Cloud Computing without data is severely crippled, if not totally worthless. After all, the poor T-Mobile subscriber did not know who to meet or call, or even how to call to cancel or reschedule (unless they had taken the time to copy all of that information locally to the PDA, and some did). What good is next generation mobile technology if you have no idea of where to be or who to contact?

If we view it as such, then it could be argued that proper data storage management is the key foundation and enabler for Cloud Computing. If this is the case, then it needs to be treated as such when the services are being designed. You often hear that security should not be an afterthought and that it needs to be considered in every step of a design process. This is most definitely true. The point of this article is that the same thing needs to be said for data storage and management.

The figure below illustrates this relationship. The top layer, which represents the user, leverages mobility and virtual networking to provide access to resources anywhere, anytime. Key enabling technologies such as 3G or 4G wireless and Virtual Private Networking provide secure, almost ubiquitous connectivity into the cloud where key resources reside.

Figure 1. Cloud Virtualization Layers

In the next layer the enabling services are provided by underlying applications. Some may be atomic, like simple email, in that they provide a single function from a single application. More and more, however, services are becoming composite in that they may depend on multiple applications acting in concert to complete whole business processes. These types of services are typically SOA enabled in that they follow process flows that are defined by an overarching policy and rule set that is maintained and driven by the SOA framework. In these types of services there is a high degree of inter-dependency which, while enabling enhanced service offerings, also creates areas of vulnerability that can become critical outages if one of the component applications in the process flow were to suddenly become unavailable. To accommodate this, many SOA environments provide recovery workflows that allow graceful rollback of a particular transaction. Optimally, any failure of a component application should be totally transparent to the composite service. If a server that is providing the application were to fail, another server should be ready to take over that function and allow the service’s process flow to proceed uninterrupted.

The layer below the service application layer is the layer that would provide for this transparent resiliency and redundancy.  Here physical servers provide hosting for multiple virtual machines which can provide for redundant and even load balanced application service environments.

In the figure below, we see that these added features provide the resource abstraction that allows one VM to step in for another’s failure so that a higher level business process flow can proceed without a glitch. Additionally, applications can be load balanced to allow for scale and higher capacity.

Figure 2. VMs set up in a Fault Tolerant configuration

As we pointed out earlier, however, this apparent Nirvana of application resiliency can only be reached if consistent data is available to both systems at the time of the failover at the VM level. In the case of a transaction database the secondary VM should ideally be able to capture the latest exchange so as to allow the application to proceed without interruption. In other words, the data has to have ‘full transactional integrity’. Short of that, the user may have to fill out once again the form page that they are currently working on. Without availability of the data, any and all resiliency provided by the higher layers is null and void. The figure below builds upon Figure 2 to illustrate this.

Figure 3. Redundant Data Stores key to service resiliency

As the user interacts with the service they should ideally be totally oblivious to any failures within the cloud. As we see in the figure above, however, this can only be the case if there are consistent data repositories, current up to the latest transaction, that the failover VM can mount in order to carry on the user service with as little interruption as possible. Doing this with traditional Direct Attached Storage (DAS) is a monumental task that is prone to vulnerabilities, and transactional integrity is difficult to achieve with this approach. Storage Virtualization helps solve this complexity by creating one large virtual storage space that can be leveraged at the logical level by multiple server resources within the environment. As shown below, this virtualized storage space can be divided up and allocated by a process known as provisioning. Once these logical storage spaces (LUNs) are created, they can be allocated not only to physical servers but also to individual VMs and to any higher level fault tolerant arrangement. The value of this is that failure at the VM level is totally independent of failure at the data storage level.

Figure 4. Failure mode independence

As shown in the figure above, most VM level failures can be addressed at the local site. As a result, the failover VM can effectively mount the primary data store. Data consistency is not an issue in this case because it is the exact same data set. In instances of total site failure the secondary site must take over completely. In this instance the secondary storage must be used. It was pointed out earlier that this secondary store must have complete transactional integrity with the primary store and the dependent application. In a remote secondary site scenario that is designed for disaster recovery, up to the minute traditional data backups are cost prohibitive and logistically impossible. Consequently, reliable backup data is in many instances 12 hours old or older.

Newer storage technologies come into play here that allow for drastic reduction in the amount of data that has to be copied as well as optimization in the methods for doing so.

Thin Provisioning

One of the major reasons for the difficulties noted in the previous section is the prevalence of overprovisioning in the data storage environment. This seems counterintuitive. If there is more and more data, how can data storage environments be overprovisioned? This occurs because of the friction between two sets of demands. When installing a server environment, one of the key steps is the allocation of the data volume. This is done at install time and is not an easy allocation to adjust once the environment has been provisioned. As a result, most administrators will weigh the risk and downtime of increasing the volume size later against the cost of storage. In the end they will typically choose to overprovision the allocation so that they do not have to be concerned about any issues with storage space later on.

This logic is fine in a static example. However, if we consider this practice in light of Business Continuity and Disaster Recovery it becomes problematic and costly. The reason is that traditional volume management and backup methods require the backup of the whole data volume. This is the case even if the application is only actually using 20% of the allocated disk space. Now, size translates to WAN bandwidth. Suddenly disk space is not so cheap.

Storage virtualization makes possible something known as thin provisioning. Because the virtualized storage environment abstracts the actual data storage from the application environment, it can allocate a much smaller physical space than the application believes it has. The concept of pooling allows the virtualized environment to allocate additional space as the data store requirements of the application environment grow. This is all transparent to the application, however. The end result is a much more efficient data storage environment, and the need to re-configure the application environment is eliminated. The figure below illustrates an application that has been provisioned for 1 Terabyte of data storage. The storage virtualization environment, however, has only allocated 200 Gigabytes of actual storage. This translates into an 80% reduction in the physical storage consumed relative to the provisioned size.
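The allocation behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `StoragePool` and `ThinVolume`, the 500 GB pool size, and the use of decimal units (1 TB = 1000 GB) are hypothetical and do not describe any vendor's implementation.

```python
class StoragePool:
    """Shared pool of physical capacity, tracked in gigabytes (hypothetical model)."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.allocated_gb = 0

    def allocate(self, gb):
        # Refuse to hand out more space than physically exists in the pool.
        if self.allocated_gb + gb > self.physical_gb:
            raise RuntimeError("pool exhausted: add physical capacity")
        self.allocated_gb += gb


class ThinVolume:
    """Volume that advertises provisioned_gb but draws from the pool only as it grows."""

    def __init__(self, pool, provisioned_gb):
        self.pool = pool
        self.provisioned_gb = provisioned_gb
        self.allocated_gb = 0

    def grow_to(self, used_gb):
        """Back the volume with physical space up to used_gb."""
        if used_gb > self.provisioned_gb:
            raise ValueError("application exceeded its provisioned size")
        extra = used_gb - self.allocated_gb
        if extra > 0:
            self.pool.allocate(extra)
            self.allocated_gb = used_gb

    def savings_percent(self):
        """Physical space not consumed, as a percentage of the provisioned size."""
        return 100 * (self.provisioned_gb - self.allocated_gb) / self.provisioned_gb


pool = StoragePool(physical_gb=500)             # assumed pool size
volume = ThinVolume(pool, provisioned_gb=1000)  # the application sees 1 TB
volume.grow_to(200)                             # only 200 GB is physically backed
print(volume.savings_percent())
```

With the 1 TB and 200 GB figures from the example, the volume consumes 200 GB of the pool and the remaining 80% of the provisioned size is never physically allocated.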

Figure 5. Thin Provisioning & Replication

The real impact comes when considering this practice in Business Continuity and Disaster Recovery. At the primary site, only the allocated portion of the virtualized data store needs to be replicated for business continuity at the local site. This is termed thin replication. For disaster recovery purposes the benefits translate directly into an 80% reduction in the WAN usage required to provide full resiliency. Now it becomes possible not only to seriously entertain network based DR (as opposed to the ‘tape and truck’ method), but to perform the replications multiple times during the day rather than once at the end of the day during off hours. Two things enable this: first, the drastic reduction in the data being moved, and second, the fact that the server is relieved of these tasks by the storage virtualization. This means that the application server environment can be up 24/7 and provide for a more consistent Business Continuity and Disaster Recovery practice.
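To put the 80% figure in terms of replication windows, here is a rough back-of-the-envelope calculation. The 100 Mbit/s link speed is purely an assumption for illustration, and the function `transfer_hours` is a hypothetical helper that ignores protocol overhead, compression and contention.

```python
def transfer_hours(data_gb, link_mbps):
    """Hours needed to move data_gb gigabytes over a link of link_mbps megabits per second."""
    megabits = data_gb * 1000 * 8  # decimal units: 1 GB = 1000 MB = 8000 megabits
    return megabits / link_mbps / 3600


LINK_MBPS = 100  # hypothetical WAN link speed, not taken from the article

full = transfer_hours(1000, LINK_MBPS)  # replicating the whole 1 TB volume
thin = transfer_hours(200, LINK_MBPS)   # replicating only the 200 GB actually allocated
print(f"full volume: {full:.1f} h, thin replica: {thin:.1f} h")
```

Under these assumptions the full volume needs roughly 22 hours on the wire, while the thin replica needs under 4.5 hours, which is what makes several replications per day plausible.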

Continuous Data Protection (CDP)

The next of these technologies is Continuous Data Protection. CDP is based on the concept of splitting each write to disk so that a copy also goes to a separate data journal volume. This process is illustrated below. While the write to primary storage occurs as normal, a secondary write occurs which is replicated into the CDP data journal. This split can occur in the host, within the Storage Area Network, in an appliance or in the storage array itself. If the added process is handled by the host (via a write splitter agent), the host must support the additional overhead.

Figure 6. Continuous Data Protection split on writes to disk

If the split is done in the disk array the journal must be local within that array or within an array that is local, hence its use in DR is somewhat limited. If the split occurs within the SAN Fabric or in an appliance the CDP data journal can be located in a different location than the primary store.  This can be supported in multiple configurations but the main point is that on primary storage failover there is a consistent data set that has full transactional integrity available and the secondary VM can take over in as transparent a fashion as possible regardless of which site it’s located at.
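The write split and journal replay described above can be sketched as follows. This is a minimal in-memory model, not a real CDP product: the class `CdpVolume` and its methods are hypothetical names, blocks are plain dictionary entries, and the journal is assumed to have been recording since the volume was empty (a real deployment would start from a baseline copy).

```python
class CdpVolume:
    """Primary store plus a journal that receives a copy of every write (toy model)."""

    def __init__(self):
        self.primary = {}  # block number -> data currently on primary storage
        self.journal = []  # ordered (sequence, block, data) records

    def write(self, block, data):
        """Split the write: apply it to primary storage and append it to the journal."""
        self.primary[block] = data
        sequence = len(self.journal) + 1
        self.journal.append((sequence, block, data))
        return sequence

    def recover(self, up_to_sequence=None):
        """Rebuild a volume image by replaying the journal, optionally up to a given point."""
        image = {}
        for sequence, block, data in self.journal:
            if up_to_sequence is not None and sequence > up_to_sequence:
                break
            image[block] = data
        return image


vol = CdpVolume()
vol.write(0, "order-1")
vol.write(1, "order-2")
vol.write(0, "order-1-amended")

latest = vol.recover()                   # up to the latest write, for failover
earlier = vol.recover(up_to_sequence=2)  # the state before the amendment
```

Replaying the full journal reproduces the primary store, which is the property a failover VM relies on; replaying only part of it yields an earlier consistent state for historical recovery.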

Figure 7. CDP and its use in creating ongoing DR Backup

As shown above, with less than the original volume size, data consistency can be provided at whatever time granularity the administrator requires for historical purposes, and up to the minute for real time recovery with data journaling. Also consider that disk space is cheap in comparison to bandwidth and even cheaper in comparison to lost business. With only the used disk deltas being copied, far less bandwidth is used. Additionally, with a complete consistent data set always available, offline backups to archive Virtual Tape Libraries (VTL) or directly to tape can occur at any time, even during production hours, to provide complete DR compliance in the event of total catastrophe at the primary site.

Data De-Duplication

Full traditional backups will usually store a majority of redundant data. This means that each full image will consist mostly of data that was already contained in the last full image. The replication of this data seems pointless and it is.* Data De-duplication works on the assumption that most of the data that moves into backup is repetitive and redundant. While CDP by its very nature goes a long way towards reducing this for database & file based environments, most tape based backups will simply save the whole file if any change has been recorded (typically detected by size or last modification date).

*There may be instances where certain types of data cannot be de-duplicated due to regulatory requirements. Be sure that the vendor can support such exceptions.

Data De-Duplication works at the sub block level to identify only the sections of the file that have changed, and thereby back up only the delta sub blocks while maintaining complete consistency of not only the most recent version but also all archived versions of the file. (This is accomplished by in depth indexing at the time of the de-duplication, which preserves all versions of the data file for complete historical consistency.) As an example, when a file is first saved the de-duplication ratio is obviously 1:1, as this is the first time that the data is saved. Over time, however, as subsequent backups occur, file based repositories can realize de-duplication ratios as high as 30:1. The chart below illustrates some of the potential reduction ratios for different types of data files.

Document type                  De-dupe ratio    Reduction in data backed up
New working documents          2:1              50% less data
                               5:1              80% less data
Active working documents       10:1             90% less data
                               20:1             95% less data
Archived inactive documents    30:1             97% less data
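The idea of storing only changed sub blocks, and the relationship between a de-dupe ratio and the percentage reduction in the chart, can be sketched as follows. This is a simplified illustration rather than how any particular product works: `DedupStore` and `reduction_percent` are hypothetical names, chunks are fixed-size with a tiny 4 byte size chosen for readability, and production systems may chunk and index data quite differently.

```python
import hashlib


class DedupStore:
    """Toy backup store that keeps each unique fixed-size chunk only once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}        # SHA-256 digest -> chunk bytes (stored once)
        self.versions = []      # one list of digests (a recipe) per backup
        self.logical_bytes = 0  # total bytes submitted across all backups

    def backup(self, data):
        """Store a new version; only chunks not seen before consume space."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.versions.append(recipe)
        self.logical_bytes += len(data)
        return len(self.versions) - 1

    def restore(self, version):
        """Reassemble any archived version from its recipe."""
        return b"".join(self.chunks[d] for d in self.versions[version])

    def ratio(self):
        """De-dupe ratio: bytes submitted divided by bytes actually stored."""
        stored = sum(len(c) for c in self.chunks.values())
        return self.logical_bytes / stored


def reduction_percent(ratio):
    """Convert an N:1 de-dupe ratio into a percentage reduction in stored data."""
    return 100 * (ratio - 1) / ratio


store = DedupStore(chunk_size=4)
first = store.backup(b"AAAABBBBCCCCDDDD")   # first save: every chunk is new
ratio_after_first = store.ratio()
second = store.backup(b"AAAABBBBXXXXDDDD")  # one chunk changed since the first save
ratio_after_second = store.ratio()
```

The first backup gives a 1:1 ratio, as the text notes, and the ratio rises as later versions repeat existing chunks while every version remains restorable. `reduction_percent` reproduces the chart's right-hand column, for example 2:1 gives 50% and 10:1 gives 90%.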

As can be seen, these technologies can drastically reduce the amount of data that you need to move over the wire to provide data consistency as well as greatly reduce the storage requirements for maintaining that consistency. The result is an ROI that is unprecedented and simply cannot be found in traditional storage and networking investments.

In reality, the reduction ratios in data de-duplication occur in ranges. More active data will show lower reduction ratios than data that is largely historical. As a data set matures and goes into archive status the data reduction ratio becomes quite high, because there is no change to the data pattern within the file. This leads to the point that data de-duplication is best done at various locations, not only across its end to end path but from a life cycle perspective as well. For instance, de-duplication provides great value in WAN usage reductions for remote site backups if the function is performed at the remote site. It would also find value within the replication and archive process, particularly to a VTL or tape store, given that what goes onto this medium can typically be viewed as static and is for archive purposes.

Some of the newer research in the industry is around the management of the flow of data through its life cycle. As new data is created its usage factor is high as well as the amount of change that it undergoes. Imagine a new document that is created at the beginning of a standards project. As the team moves through the flow of the project the document is modified. There may even be multiple versions of the same document at the same time which would be considered valid to the overall project.

Figure 8. Project Data Life Cycle

As the project matures and the standard solidifies, however, more and more of these documents will become ‘historical’ and will no longer change. Even the final valid document that the project delivers as its end product will not change without due process and notification. At that time the whole parade begins anew. The main point is that as these pieces of data age, and as their de-duplication hit rate gets higher, they should be moved to more cost effective storage. Eventually, a piece of data would end up in a VTL where it would act as a template for de-duplication against all further input to those final archives. The end result is a reduction in the amount of data stored as well as a lower overall retention cost.
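A life cycle policy of this kind could be expressed as a simple rule that maps a data set's observed de-dupe ratio to a storage tier. The sketch below is an assumption throughout: the thresholds are borrowed from the ratio chart earlier in the article purely for illustration, and the function and tier names are hypothetical.

```python
def choose_tier(dedup_ratio):
    """Map an observed N:1 de-dupe ratio to a storage tier (illustrative thresholds)."""
    if dedup_ratio >= 30:
        return "vtl-archive"     # static, archived data
    if dedup_ratio >= 10:
        return "secondary-disk"  # active but slowly changing data
    return "primary-disk"        # new, frequently changing data


for ratio in (2, 10, 30):
    print(ratio, choose_tier(ratio))
```

In practice such a policy would likely also weigh age, access frequency and regulatory retention rules, not the ratio alone.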

While it may be true that data storage is the key foundation, and consequently the Achilles heel, of Cloud Computing services, there are technologies available that enable data storage infrastructures to step up to the added requirements of a true Cloud service environment. This is why the term Cloud Storage makes me uneasy when I hear it used without any qualification. After all, any exposed disk in a server that is attached to a cloud could be called ‘cloud storage’. Just because it is ‘storage in the cloud’ does not mean that it is resilient, robust, or cost effective. Consequently, I would prefer to differentiate between ‘Cloud Storage’ (i.e. storage as a cloud service) and ‘Storage architectures for Cloud Services’, which are the technologies and practices of data storage management that support all cloud services (of which Cloud Storage is one).

The technologies reviewed in this article enable storage infrastructures to provide the resiliency and scale that are required for truly secure and robust data storage solutions for cloud service infrastructures. Additionally, they help optimize the IT cost profile from both capital and operational expense perspectives. These technologies also work towards vastly improving the RPO (Recovery Point Objective) and RTO (Recovery Time Objective) of any Business Continuity and Disaster Recovery plan.

As cloud computing moves into the future its fate will depend upon the integrity of the data on which it operates. Cloud Service environments, and perhaps the companies that provide or use them, will succeed or fail based on whether or not they are built upon truly solid data management practices and solutions. The technology exists for these practices to be implemented. As always, it is up to those who deploy the service to make sure that they consider secure and dependable storage in the overall plan for Business Continuity and Disaster Recovery as well as business regulatory compliance.