Tuesday, March 25, 2008

Data Deduplication - Eases Storage Requirements

Data is flooding the enterprise. Storage administrators are struggling to handle a spiraling volume of documents, audio, video and images, along with an alarming proliferation of large email attachments. Adding storage is not always the best solution; storage costs money and the sheer number of files eventually burdens the company’s backup and disaster recovery (DR) plans. Rather than finding ways to store more data, companies are turning to data reduction technologies such as data deduplication. This article explains the basic principles of data deduplication and looks at some of the implementation issues for data deduplication technology.


Understanding data deduplication

Data deduplication is a means of reducing storage space. It works by eliminating redundant data and ensuring that only one unique instance of the data is actually retained on storage media, such as disk or tape. Redundant data is replaced with a pointer to the unique data copy. Data deduplication, sometimes called intelligent compression or single-instance storage, is often used in conjunction with other forms of data reduction. Traditional compression has been around for about three decades, applying mathematical algorithms to data in order to simplify large or repetitious parts of a file, effectively making the file smaller. Similarly, delta differencing reduces the total volume of stored data by comparing the new and old iterations of a file and saving only the data that has changed. Together, these techniques can optimize the use of storage space.
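To make the pointer idea concrete, here is a minimal, hypothetical Python sketch of single-instance storage: each file's content is identified by a hash, the bytes are kept only once, and later copies merely record a pointer to the existing object. The class and method names are illustrative and not taken from any vendor's product.

    import hashlib

    class SingleInstanceStore:
        """Toy single-instance store: identical content is kept only once."""

        def __init__(self):
            self.objects = {}   # content hash -> actual bytes (stored once)
            self.catalog = {}   # file path    -> content hash (the "pointer")

        def put(self, path: str, data: bytes) -> str:
            digest = hashlib.sha1(data).hexdigest()
            if digest not in self.objects:      # first time we see this content
                self.objects[digest] = data
            self.catalog[path] = digest         # duplicates just add a pointer
            return digest

        def get(self, path: str) -> bytes:
            return self.objects[self.catalog[path]]

    # Two identical attachments consume the space of one:
    store = SingleInstanceStore()
    store.put("/mail/alice/report.pdf", b"quarterly report ...")
    store.put("/mail/bob/report.pdf",   b"quarterly report ...")
    assert len(store.objects) == 1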


Benefits of data deduplication

When properly implemented, data deduplication lowers the amount of storage space required, which reduces disk expenditures. More efficient use of disk space also allows for longer disk-based retention periods, which provides a better recovery time objective (RTO) over a longer window and reduces the need for tape backups. Data deduplication also reduces the data that must be sent across a WAN for remote backups, replication and disaster recovery.

Data deduplication can operate at the file, block and even the bit level. File-level deduplication is relatively easy to understand: if two files are exactly alike, one copy of the file is stored and subsequent iterations receive pointers to the saved file. However, file deduplication is not very efficient, because a change of even a single bit results in an entirely new copy of the file being stored. By comparison, block and bit deduplication look within a file and save unique iterations of each block. If a file is updated, only the changed data is saved, which makes block and bit deduplication far more efficient. “It’s an order of magnitude difference in terms of the amount of storage that it [block deduplication] saves in a typical environment,” says W. Curtis Preston, vice president of data protection at GlassHouse Technologies Inc. Other analysts note that deduplication can achieve compression ratios ranging from 10-to-1 to 50-to-1. However, block and bit deduplication take more processing power and use a much larger index to track the individual blocks.

Data deduplication platforms must also contend with the issue of “hash collisions.” Each chunk of data is processed using a hash algorithm, such as MD5 or SHA-1, generating a unique number for each piece. The resulting hash number is then compared against an index of existing hash numbers. If the hash number is already in the index, the piece of data is a duplicate and does not need to be stored again. Otherwise, the new hash number is added to the index and the new data is stored. In rare cases, the hash algorithm may produce the same hash number for two different chunks of data. When such a hash collision occurs, the system fails to store the new data because it sees that hash number already in the index. This is called a false positive and can result in data loss. Some vendors combine hash algorithms to reduce the possibility of a hash collision, while others also examine metadata to identify data and prevent collisions.
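The hash-index mechanism described above can be sketched in a few lines of Python. This is an illustrative toy, not any vendor's implementation: it uses fixed-size chunks, a SHA-256 index (a stand-in for the MD5 or SHA-1 mentioned above), and a byte-for-byte comparison as a simple collision guard in place of the combined-hash or metadata checks some vendors use.

    import hashlib

    CHUNK_SIZE = 8192  # fixed-size chunking for simplicity (assumption)

    class BlockDedupStore:
        """Toy block-level deduplication with a hash index and a collision guard."""

        def __init__(self):
            self.index = {}   # chunk digest -> chunk bytes

        def write(self, data: bytes) -> list[str]:
            """Split data into chunks; store only chunks not already in the index.
            Returns the recipe (list of digests) needed to rebuild the data."""
            recipe = []
            for off in range(0, len(data), CHUNK_SIZE):
                chunk = data[off:off + CHUNK_SIZE]
                digest = hashlib.sha256(chunk).hexdigest()
                if digest in self.index:
                    # Collision guard: verify the bytes really match before
                    # treating the chunk as a duplicate.
                    if self.index[digest] != chunk:
                        raise RuntimeError("hash collision detected")
                else:
                    self.index[digest] = chunk
                recipe.append(digest)
            return recipe

        def read(self, recipe: list[str]) -> bytes:
            return b"".join(self.index[d] for d in recipe)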

Implementing data deduplication

The data deduplication process is usually implemented in hardware within the actual storage system, but it is also appearing in backup software. Hardware-based implementations are usually easier to deploy and are geared toward reducing storage at the disk level within the appliance or storage system. Software-based implementations also reduce data, but the reduction is performed at the backup server. This minimizes the bandwidth between the backup server and the backup system, which is particularly handy if the backup system is located remotely. “Users get end-to-end benefits when deduplicating data at the source—less data traverses the WAN, LAN and SAN.” However, deploying deduplication in a new backup application is more disruptive, because it involves installing lightweight agents on the systems that must be backed up, in addition to installing the new backup engine.
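The source-side approach can be illustrated with a hypothetical sketch: the client computes chunk hashes locally, asks the backup server which chunks it lacks, and ships only those, so unchanged data never crosses the WAN. The names and the chunking scheme here are assumptions for illustration only.

    import hashlib

    CHUNK_SIZE = 8192  # illustrative chunk size

    def chunk_digests(data: bytes) -> dict[str, bytes]:
        """Map digest -> chunk for every fixed-size chunk of the data."""
        return {hashlib.sha256(data[o:o + CHUNK_SIZE]).hexdigest():
                data[o:o + CHUNK_SIZE]
                for o in range(0, len(data), CHUNK_SIZE)}

    class BackupServer:
        def __init__(self):
            self.store = {}  # digest -> chunk, kept on the backup target

        def missing(self, digests: list[str]) -> list[str]:
            return [d for d in digests if d not in self.store]

        def upload(self, chunks: dict[str, bytes]) -> None:
            self.store.update(chunks)

    def source_side_backup(data: bytes, server: BackupServer) -> int:
        """Send only chunks the server does not already have; return bytes sent."""
        local = chunk_digests(data)
        needed = server.missing(list(local))
        server.upload({d: local[d] for d in needed})
        return sum(len(local[d]) for d in needed)

On the second and subsequent backups, most chunks are already in the server's store, so only the changed data crosses the network.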


Caveats regarding data deduplication

There is no universal approach to data deduplication; results can vary dramatically depending on factors such as the storage environment and, of course, which dedupe product is selected. Data deduplication only makes sense when long-term retention is involved, usually for backup and archive tasks. Short-term retention sees little benefit because there is nothing to deduplicate against.

Preston cautions against the misinformation circulating among deduplication vendors and suggests focusing on issues of performance, capacity and cost. With due diligence, you can identify potential performance and compression issues in your environment. “Let’s say you’re backing up seismic data or medical imaging data—this data tends to not dedupe very well,” he says. He also advises users to test a prospective data deduplication platform with various types of backups and restores to see how it functions under actual circumstances.

Scalability is another issue for data deduplication deployment, especially in terms of performance as the data deduplication system grows. Performance might have been an issue as early hash indexes grew large and additional time was needed to look up each block, but Preston now dismisses that as FUD (fear, uncertainty and doubt) marketing. “All of the vendors that I am aware of that are currently shipping or about to ship have addressed this [scaling issue] in one way or another,” he says. Nevertheless, he recommends checking with your data deduplication vendor on the matter.

From a management perspective, data deduplication should not present any noticeable increase in overhead. “It [management] shouldn’t be any more or less than just a standard VTL [virtual tape library].” When multiple deduplication devices are needed, however, there could be an incremental increase in management effort.


Impact of data deduplication

The Appalachian and coastal areas of South Carolina are enticing attractions for tourists and regional industry. Advertising, communication and literature have emerged as key assets for the Department of Parks, Recreation and Tourism, the agency responsible for promoting tourism as an industry and maintaining an extensive park system throughout the state. The agency originally had an EMC Corp. storage area network (SAN) hosting a total of 4 terabytes (TB), of which 1.2 TB comprised the actual working data set of databases and files, while 2 TB was allocated for disk backups before data was relegated to DLT tape. Like many IT organizations, the agency sought ways to mitigate the increasing storage demands of its media and other data.

After investigating numerous data deduplication vendors, the agency settled on Data Domain Inc.’s 430 appliance for disk backup tasks. With 2 TB of onboard storage, the 430 replaced the 2 TB that had previously been set aside on the SAN. The reduction in space with bit-level deduplication was dramatic. “With the compression and deduplication, I think we’re using about 900 MB,” says Bernie Robichau, the agency’s systems administrator and security officer. The space reduction was a welcome cost savings, but it also allowed much longer backup retention on disk. “If someone had requested a two-week-old file, I would have never been able to get that from a disk-based backup because I couldn’t keep two sets of backups on our allocated 2 TB of hard drive [SAN] storage,” Robichau notes. “Now someone can request a file from three weeks ago or six weeks ago, and it’s immediately available.”

Robichau says that installation of the data deduplication platform was relatively quick and easy, requiring only about four hours of onsite engineering work and minimal configuration. The agency’s existing CommVault Systems Inc. backup infrastructure proved to be fully compatible; backup agents were simply pointed to the new appliance rather than the EMC SAN. “The backups worked just as they always did, but we’re consuming far less disk space and much more retention than we ever did before,” he says. While the deduplication appliance requires almost no management time, Robichau notes as much as a 75% labor savings in tape overhead, such as cartridge rotation, cleaning and storage. The only remaining tape effort involves full backups on weekends and systematic cartridge rotation to an offsite location.

Although there are no immediate plans to upgrade storage on the 430 appliance, attention is clearly focused on disaster recovery. Earlier disaster recovery plans had been put on hold because of their complexity, but the 430 supports replication easily, and Robichau expects to replicate the 430 to a duplicate appliance and eliminate backup tapes entirely sometime in the next fiscal year or beyond. “There’s no planning beyond synchronizing an identical appliance on site and putting it in one of our remote locations.”
A Denver-based IT hosting provider was drowning in customer data. Its challenge was to keep its data protection business running smoothly alongside other services, such as managed server hosting, managed firewalls and load balancing. Its backup environment was formidable, handling 20,000 backups per month, with each customer protecting 20 GB to 100 GB. Even with 4.5 TB of protected storage, the company could only keep two weeks of retention. To make matters even more challenging, its StorageTek L700 and L11000 tape libraries were managed by an outsourced provider, requiring a full-time engineer at the hosting provider. But it was ongoing restoration problems that forced the company into action. “Our success rate from backups, at the lowest point, was roughly 70%,” says its senior systems engineer. “And far too often, we couldn’t hit [restore] the exact day they wanted.” Poor performance of the tape backup process also plagued the organization, with full backup windows often exceeding 18 hours. These problems also translated into significant customer support costs. It became clear that disk storage was the key to beating the reliability and performance woes, and that data deduplication would be essential to reduce the total volume of storage needed for full and incremental customer backups. The company opted for Avamar Technologies’ Axion software running on a cluster of 11 Dell 2850s offering about 10 TB of total storage. The actual deployment involved a forklift upgrade, but he reports that the system was up and running in just a few days after installing agents on almost 400 backup servers and migrating the necessary data.

The move to data deduplication brought several significant benefits, most notably a reduction in storage requirements. While it might have taken 350 GB to protect 100 GB of customer data without deduplication (full and incremental backups), with data deduplication it actually takes less storage than the data it’s protecting. “I’m using about 7 TB of storage to protect roughly 8 TB of data,” he says. “That includes anywhere from two weeks to one year of retention [daily full backups].” Backup time was also slashed; in some cases, an 18-hour backup window fell to 1-1/2 hours, while the backup and restoration success rate improved to 98% or more. Before, two full-time engineers were needed; after the deployment, that requirement fell to one half-time engineer. “We wanted to have an ROI [return on investment] of 24 months, and we hit payback at 20 months,” he says.

Today, the 4.5 TB of protected data has grown to about 7.6 TB protected by data deduplication. About 2 TB of that protected data is replicated to a smaller Avamar deployment at a disaster recovery site in St. Louis. The company continues to use tape for long-term archival backups. He expects the amount of protected data to double in the foreseeable future, though less storage will be required to handle the growth.

The future of data deduplication

In the near term, industry experts see data deduplication filling an important role in disaster recovery: saving disk storage space by replicating the data of one deduplication platform to another located off site. This reduces the need to move tapes back and forth, which can be particularly valuable when replicating hundreds of terabytes of data. Other analysts note that the separate “point products,” like VTL, will address backup window performance, while data deduplication addresses the issue of storage capacity. Whitehouse says, “Next-generation backup solutions fix both, deduplicating data as it’s sourced from the backup target and improving the efficiency of its transfer across a LAN/WAN to the central disk repository.” Deduplication is now common in VTLs and will appear as a feature of traditional backup products.

Tuesday, March 18, 2008

Key considerations in developing a storage area network design

Storage area networks (SANs) let several servers share storage resources and are often used in situations that require high performance or shared storage with block-level access, like virtualized servers and clustered databases. Although SANs started out as a high-end technology used only in large enterprises, cheaper SANs are now affordable even for small and medium-sized businesses (SMBs). In earlier installments of this Hot Spot Tutorial, we examined what benefits SANs offer over other storage architectural choices, as well as the two main storage networking protocols, Fibre Channel and iSCSI. In this installment, we'll look at the main considerations you should keep in mind when putting together a storage area network design.

Uptime and availability
Because several servers will rely on a SAN for all of their data, it's important to make the system very reliable and eliminate any single points of failure. Most SAN hardware vendors offer redundancy within each unit -- like dual power supplies, internal controllers and emergency batteries -- but you should make sure that redundancy extends all the way to the server.
In a typical storage area network design, each storage device connects to a switch that then connects to the servers that need to access the data. To make sure this path isn't a point of failure, your client should buy two switches for the SAN network. Each storage unit should connect to both switches, as should each server. If either path fails, software can fail over to the other. Some programs will handle that failover automatically, but cheaper software may require you to enable the failover manually. You can also configure the program to use both paths if they're available, for load balancing.
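The failover and load-balancing behavior described above is normally handled by the operating system's multipathing software, but a toy sketch of the path-selection logic may help clarify what that software does. The path names here are made up for illustration.

    class MultipathIO:
        """Toy path selector: load-balance across healthy paths and fail over
        automatically when a path is marked down."""

        def __init__(self, paths):
            self.health = {p: True for p in paths}  # path -> healthy?
            self._turn = 0

        def mark_failed(self, path):
            self.health[path] = False

        def mark_restored(self, path):
            self.health[path] = True

        def next_path(self):
            healthy = [p for p, ok in self.health.items() if ok]
            if not healthy:
                raise RuntimeError("no path to the SAN is available")
            path = healthy[self._turn % len(healthy)]  # round-robin load balancing
            self._turn += 1
            return path

    # Server with redundant connections through two switches:
    io = MultipathIO(["hba0->switchA", "hba1->switchB"])
    print(io.next_path())          # alternates between the two paths
    io.mark_failed("hba0->switchA")
    print(io.next_path())          # traffic fails over to the surviving path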
But you should also consider how the drives themselves are configured, Franco said. RAID technology spreads data among several disks -- a technique called striping -- and can add parity checks so that if any one disk fails, its content can be rebuilt from the others. There are several types of RAID, but the most common in SAN designs are levels 5, 6 and 1+0.
RAID 5 stripes data and parity information across all of the disks in the unit, using the equivalent of one disk's capacity for parity that can be used to rebuild any drive that needs to be replaced. RAID 6 adds a second disk's worth of redundant parity. This protects your client's data in case a second drive breaks during the first disk's rebuild, which can take up to 24 hours for a terabyte, Franco said. RAID 1+0 stripes data across a series of disks without any parity checks, which is very fast, but mirrors each of those disks to a second set of striped disks for redundancy.
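To illustrate how parity lets a RAID set rebuild a failed disk, here is a minimal sketch of XOR parity over one stripe. Real controllers distribute parity across the disks and, for RAID 6, add a second, more sophisticated parity code; this toy only shows the basic principle.

    from functools import reduce

    def parity(blocks: list[bytes]) -> bytes:
        """XOR a stripe's data blocks together to produce the parity block."""
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    def rebuild(surviving: list[bytes], parity_block: bytes) -> bytes:
        """Recover the missing block of a stripe from the survivors plus parity."""
        return parity(surviving + [parity_block])

    # One stripe across three data disks plus parity:
    d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xaa\x55"
    p = parity([d0, d1, d2])
    assert rebuild([d0, d2], p) == d1   # disk 1 failed; its contents come back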

Capacity and scalability
A good storage area network design should not only accommodate your client's current storage needs, but it should also be scalable so that your client can upgrade the SAN as needed throughout the expected lifespan of the system. You should consider how scalable the SAN is in terms of storage capacity, number of devices it supports and speed.
Because a SAN's switch connects storage devices on one side and servers on the other, its number of ports can affect both storage capacity and speed, Schulz said. By allowing enough ports to support multiple, simultaneous connections to each server, switches can multiply the bandwidth to servers. On the storage device side, you should make sure you have enough ports for redundant connections to existing storage units, as well as units your client may want to add later.
One feature of storage area network design that you should consider is thin provisioning of storage. Thin provisioning tricks servers into thinking a given volume within a SAN, known as a logical unit number (LUN), has more space than it physically does. For instance, an operating system (OS) that connects to a given LUN may think the LUN is 2 TB, even though you have only allocated 250 GB of physical storage for it.
Thin provisioning allows you to plan for future growth without your client having to buy all of its expected storage hardware up front. In a typical "fat provisioning" model, each LUN's capacity corresponds to physical storage. That means that your client will have to buy as much space as it anticipates needing for the next few years. While it's possible to allocate a smaller amount of space for now and transfer its data to a larger provision as needed, that process is slow and could result in downtime for your client.
Thin provisioning allows you to essentially overbook a SAN's storage, promising a total capacity to the LUNs that is greater than the SAN physically has. As those LUNs fill up and start to reach the system's physical capacity, you can add more units to the SAN -- often in a hot-swappable way. But because this approach to storage area network design requires more maintenance down the road, it's best for stable environments where a client can fairly accurately predict how each LUN's storage needs will grow.
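A small, hypothetical sketch may clarify the overbooking idea: the pool holds far fewer physical blocks than the LUN advertises, and physical space is consumed only when a block is first written. Names and block counts are illustrative.

    class ThinPool:
        """Toy thin-provisioning pool: LUNs advertise more capacity than the
        pool physically holds; blocks are allocated only on first write."""

        def __init__(self, physical_blocks: int):
            self.free = physical_blocks

        def allocate(self, n: int) -> None:
            if n > self.free:
                raise RuntimeError("pool exhausted -- add physical storage")
            self.free -= n

    class ThinLUN:
        def __init__(self, pool: ThinPool, advertised_blocks: int):
            self.pool = pool
            self.advertised = advertised_blocks  # what the OS is told
            self.mapped = set()                  # blocks actually backed by disk

        def write(self, block: int) -> None:
            if block >= self.advertised:
                raise ValueError("beyond advertised capacity")
            if block not in self.mapped:         # allocate on first write only
                self.pool.allocate(1)
                self.mapped.add(block)

    # A 250-block pool can back a LUN that claims 2,000 blocks -- until it fills:
    pool = ThinPool(physical_blocks=250)
    lun = ThinLUN(pool, advertised_blocks=2000)
    lun.write(0)                                 # consumes one physical block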

Security
With several servers able to share the same physical hardware, it should be no surprise that security plays an important role in a storage area network design. Your client will want to know that servers can only access data if they're specifically allowed to. If your client is using iSCSI, which runs on a standard Ethernet network, it's also crucial to make sure outside parties won't be able to hack into the network and have raw access to the SAN.
Most of this security work is done at the SAN's switch level. Zoning allows you to give only specific servers access to certain LUNs, much as a firewall allows communication on specific ports for a given IP address. If any outward-facing application needs to access the SAN, like a website, you should configure the switch so that only that server's IP address can access it.
If your client is using virtual servers, the storage area network design will also need to make sure that each virtual machine (VM) has access only to its LUNs. Virtualization complicates SAN security because you cannot limit access to LUNs by physical controllers anymore -- a given controller on a physical server may now be working for several VMs, each with its own permissions. To restrict each server to only its LUNs, set up a virtual adapter for each virtual server. This will let your physical adapter present itself as a different adapter for each VM, with access to only those LUNs that the virtualized server should see.
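Conceptually, zoning and LUN masking boil down to an access table consulted for every initiator, whether a physical HBA or a virtual adapter assigned to a VM: an initiator reaches only the LUNs explicitly granted to it and is denied by default. The identifiers in this sketch are invented for illustration.

    # Illustrative access-control table in the spirit of zoning / LUN masking.
    ACCESS = {
        "vm-web-01":  {"lun-12"},
        "vm-db-01":   {"lun-20", "lun-21"},
        "backup-srv": {"lun-12", "lun-20", "lun-21", "lun-30"},
    }

    def may_access(initiator: str, lun: str) -> bool:
        """Return True only if the initiator is explicitly granted the LUN."""
        return lun in ACCESS.get(initiator, set())

    assert may_access("vm-db-01", "lun-20")
    assert not may_access("vm-web-01", "lun-20")   # denied by default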

Replication and disaster recovery
With so much data stored on a SAN, your client will likely want you to build disaster recovery into the system. SANs can be set up to automatically mirror data to another site, which could be a failsafe SAN a few meters away or a disaster recovery (DR) site hundreds or thousands of miles away.
If your client wants to build mirroring into the storage area network design, one of the first considerations is whether to replicate synchronously or asynchronously. Synchronous mirroring means that as data is written to the primary SAN, each change is sent to the secondary and must be acknowledged before the next write can happen.
While this ensures that both SANs are true mirrors, synchronization introduces a bottleneck. If the secondary site has a latency as high as even 100 to 200 milliseconds (msec), your system will slow down as the primary SAN has to wait for each confirmation. Although there are other factors, latency is often related to distance; synchronous replication is generally possible up to about 6 miles.
The alternative is to asynchronously mirror changes to the secondary site. You can configure this replication to happen as quickly as every second, or as infrequently as every few minutes or hours. Because this means that your client could permanently lose some data if the primary SAN goes down before it has a chance to copy its data to the secondary, your client should make calculations based on its recovery point objective (RPO) to determine how often it needs to mirror.
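As a rough illustration of how the RPO drives the replication schedule, here is a toy asynchronous mirror that acknowledges writes locally and ships batches to the secondary site no less often than the RPO allows. The class and the ship callback are assumptions for illustration, not any vendor's API.

    import time

    class AsyncMirror:
        """Toy asynchronous mirror: writes complete locally right away and are
        shipped to the secondary site in batches no farther apart than the RPO."""

        def __init__(self, rpo_seconds: float, ship):
            self.rpo = rpo_seconds
            self.ship = ship            # callable that sends a batch off site
            self.pending = []
            self.last_flush = time.monotonic()

        def write(self, record: bytes) -> None:
            self.pending.append(record)            # primary write is acknowledged
            if time.monotonic() - self.last_flush >= self.rpo:
                self.flush()                       # keep exposure within the RPO

        def flush(self) -> None:
            if self.pending:
                self.ship(list(self.pending))      # replicate the batch
                self.pending.clear()
            self.last_flush = time.monotonic()

    # An RPO of 60 seconds means at most the last minute of writes is at risk:
    mirror = AsyncMirror(rpo_seconds=60, ship=lambda batch: None)
    mirror.write(b"order #1001")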

Wednesday, March 12, 2008

Disk users looking to add tape back into their storage infrastructure

OVER TWO THIRDS OF DISK-ONLY USERS LOOK TO ADD TAPE BACK INTO STORAGE INFRASTRUCTURE ACCORDING TO RECENT SURVEY

Survey Data Suggests that Most Companies Surveyed Are Migrating to a Tiered Storage Infrastructure of Disk and Tape Deployments

SILICON VALLEY, CALIF. — (March 12, 2008) — HP, IBM Corporation and Quantum Corporation, the three technology provider companies for the Linear Tape-Open (LTO) Program, today released survey results that strongly suggest that storage customers using a disk-only infrastructure are now looking at tape storage technology as part of a tiered storage infrastructure to support backup and archiving. Over two thirds of surveyed businesses said they were looking to add tape storage back into their overall network infrastructure, and of those respondents, over 80 percent plan to add tape storage solutions within the next 12 months.

The survey, which was taken in the fourth quarter of 2007, focused on the views of more than 200 network administrators and mid-level tech specialists at mid-size to large companies throughout the United States.


"The integration of tape storage into a tiered information infrastructure is highly strategic for customers, due to its low cost of ownership, low energy consumption and portability for data protection," said Cindy Grossman, Vice President of Tape Storage Systems, IBM. "LTO tape technology is a perfect choice for enterprise and mid-sized customer with its proven reliability, high capacity, high performance and ability to address data security with built-in encryption and data retention requirements for the evolving data center."

According to the survey, 58 percent of the respondents use a combination of disk and tape for long-term archiving, 24 percent use tape exclusively, and 18 percent employ a disk-only approach. Among the current disk-only users, 68 percent plan to start using tape for long-term archiving, and over half (58 percent) plan to add tape for short-term data protection.

"The survey findings suggest that disk-only users may be experiencing a bit of buyer’s remorse," said David Geddes, senior vice president at Fleishman-Hillard Research, who oversaw the study. "We found that a wide majority of companies that employ purely disk-based approaches are looking to quickly include tape in their backup and archiving strategies.

LTO tape technology delivers the backup and archiving features needed by today’s storage administrators, including high capacity, blazing performance, 256-bit drive-level encryption for data security and WORM cartridge support to address data retention needs. With low energy consumption, tape technology can also provide organizations with a green alternative for the data center. Studies have shown that tape-based backup and archiving solutions can deliver substantial TCO benefits and energy savings.

The LTO format is a powerful, scalable, adaptable open tape format developed and continuously enhanced by technology providers HP, IBM Corporation and Quantum Corporation (and their predecessors) to help address the growing demands of data protection in midrange to enterprise-class server environments. This ultra-high-capacity generation of tape storage products is designed to deliver outstanding performance, capacity and reliability, combining the advantages of linear multi-channel, bi-directional formats with enhancements in servo technology, data compression, track layout and error correction.

The LTO Ultrium format has a well-defined roadmap for growth and scalability.