Data Deduplication

Data deduplication definition

Data deduplication, also called “dedup”, eliminates duplicate or redundant information in a dataset. In short, dedup is a process that ensures only one copy of data exists in a particular dataset or block.

The process reclaims storage capacity by eliminating redundancies, without compromising the fidelity or integrity of the data. References, or pointers, to the single saved copy replace the deleted redundant information. Data deduplication is very often coupled with data compression for greater storage savings.

Deduplication can be classified in two ways based on where it takes place. In source-side deduplication, dedup occurs at the source, where the data originates. In target-side deduplication, dedup occurs on the target storage, where the data is ultimately stored.

What is data deduplication?

Data deduplication is the process of removing duplicate copies of datasets to optimize storage resources and enhance their performance. By eliminating redundant information, the system frees storage space and reduces the size of datasets. The changes lower the cost of storage and improve the performance of applications that can carry out tasks using smaller sets of data.

Data deduplication generally makes the most sense for secondary storage locations that hold backup data. These backup repositories tend to have high duplication rates, which makes dedup especially worthwhile. Primary storage used for production environments, by contrast, prioritizes performance over other factors and may forgo dedup.

When deploying deduplication, be aware that some relational databases, such as Oracle and Microsoft SQL Server, do not benefit from dedup. These databases often assign a unique key to each record, which prevents deduplication engines from recognizing duplicate records as duplicates.

Let us look at a common example.

The CEO of a company sends an email to all 100 employees with an updated organization chart as an attachment. All 100 employees receive the same file and save it to their desktops so they can refer to it later. When the backup runs later that night, it processes all 100 copies of the file, but because of deduplication only one actual file is saved, along with 99 references, or pointers, to the original.

In this example, we saw a deduplication ratio of 100:1. When paired with data compression, only the saved unique instance of the data is compressed, further enhancing storage capacity through efficient encoding.
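
To make the arithmetic concrete, here is a minimal sketch in Python. The 2MB attachment size is a hypothetical figure; only the ratio of logical to physical size matters.

```python
# Back-of-the-envelope dedup ratio for the email example above.
file_size_mb = 2   # assumed size of the org-chart attachment (hypothetical)
copies = 100       # one saved copy per employee

logical_size = file_size_mb * copies   # what backup would store without dedup
physical_size = file_size_mb * 1       # the single unique copy actually written
# The 99 pointers are negligibly small, so they are ignored here.

print(f"Deduplication ratio: {logical_size / physical_size:.0f}:1")  # -> 100:1
```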

What is the data deduplication process, and how does it work?

Deduplication involves analyzing data to identify its unique blocks before storing them. When duplicates of an individual data block are encountered, the redundant copies are deleted and replaced with references, or pointers, to the saved unique data. The agent performing the deduplication assigns a unique identifier number, typically a hash or fingerprint of the block’s contents, to each stored block of data.

As part of the deduplication process, incoming blocks or chunks of data are compared against stored data by scanning these identifier numbers. When an incoming chunk’s identifier matches one already on record, the chunk is deemed a redundant copy and deleted; the deduplication agent assumes that matching identifiers mean identical data blocks.
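
A minimal sketch of this bookkeeping, using SHA-256 as the unique identifier and fixed-size blocks for simplicity. The class and method names are illustrative, not any particular product’s API:

```python
import hashlib

class DedupStore:
    """Toy block-level dedup store: each unique block is kept exactly once."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks: dict[str, bytes] = {}  # identifier -> unique block

    def write(self, data: bytes) -> list[str]:
        """Split data into fixed-size blocks; return the identifiers
        (the 'pointers') that now stand in for the original stream."""
        pointers = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            ident = hashlib.sha256(block).hexdigest()  # the unique identifier
            if ident not in self.blocks:
                self.blocks[ident] = block  # first sighting: store the block
            # A matching identifier is assumed to mean an identical block,
            # so a duplicate contributes only a pointer, never a second copy.
            pointers.append(ident)
        return pointers

    def read(self, pointers: list[str]) -> bytes:
        """Re-establish the original stream from the saved pointers."""
        return b"".join(self.blocks[p] for p in pointers)

data = bytes(4096) * 100            # 100 identical 4 KB blocks
store = DedupStore()
refs = store.write(data)
assert store.read(refs) == data
print(len(refs), "pointers,", len(store.blocks), "unique block(s)")  # 100, 1
```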

Data deduplication can be performed either inline, as the data flows, or “post-process,” after the information has already been written to storage devices.

Data deduplication runs in the background and follows prescribed optimization policies to identify files that need work. The dedup agent breaks the target files into variable-size data chunks, then identifies and stores the unique ones under their assigned identifier numbers. Duplicate data chunks are removed and replaced with references to the identical saved data. Finally, the dedup agent re-establishes the original file stream from the optimized chunks.
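
Variable-size chunking is usually content-defined: a rolling fingerprint over the bytes decides where each chunk ends, so an edit early in a file does not shift every later boundary the way fixed-size blocks would. Below is a simplified sketch; the additive fingerprint stands in for the stronger rolling hashes (such as Rabin fingerprints) that real systems use, and the window, mask, and size limits are arbitrary illustration values.

```python
def content_defined_chunks(data: bytes, window: int = 48, mask: int = 0x0FFF,
                           min_size: int = 2048,
                           max_size: int = 16384) -> list[bytes]:
    """Cut chunk boundaries where a rolling fingerprint of the last `window`
    bytes hits a target pattern, so boundaries track content, not offsets."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]        # slide the window forward
        size = i - start + 1
        # Cut when the fingerprint matches (and the chunk is big enough),
        # or force a cut once the chunk reaches the maximum size.
        if (size >= min_size and (rolling & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])            # trailing partial chunk
    return chunks
```

Because boundaries follow content, most chunks of a lightly edited file remain byte-identical to the previous version, keep the same identifiers, and deduplicate against the chunks already stored.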

As the email example above illustrated, it is not uncommon (with certain data types) for deduplication ratios to reach 100:1, which magnifies the benefits of deduplication. Further storage efficiencies are realized by coupling deduplication with data compression, which shrinks the stored data and its storage footprint even more.
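
The ordering matters when coupling the two techniques: deduplicate first, then compress only the unique chunks that actually get written, so the two savings stack. A sketch of that idea, with zlib standing in for whatever codec a real product would use:

```python
import hashlib
import zlib

def store_chunk(blocks: dict, chunk: bytes) -> str:
    """Dedup first, then compress: only the single saved copy of each
    unique chunk is compressed and written."""
    ident = hashlib.sha256(chunk).hexdigest()  # identify the raw chunk
    if ident not in blocks:
        blocks[ident] = zlib.compress(chunk)   # compress the unique copy only
    return ident

def load_chunk(blocks: dict, ident: str) -> bytes:
    return zlib.decompress(blocks[ident])

blocks: dict[str, bytes] = {}
ref = store_chunk(blocks, b"highly compressible payload " * 1000)
assert load_chunk(blocks, ref) == b"highly compressible payload " * 1000
```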

Data deduplication is a simple concept that dramatically reduces storage resources and the costs associated with them. When coupled with data compression, the savings increase, and both storage performance and application performance improve.

Why is data deduplication important?

Unchecked exponential data growth, coupled with shrinking backup windows, data retention SLAs, and regulatory requirements, strains IT resources and constrains business growth. Data deduplication technology reduces physical disk capacity requirements while still meeting data retention requirements.

The amount of data created daily is staggering, and all of it must be optimized for better utilization of storage capacity and protected against loss. On average, in 2020, humans generated 1.7MB of data every second for every person on earth.[1] By 2025, it is estimated that 463 exabytes of data will be created each day.[2]

Unfortunately, much of this stored data is duplicated, which increases customers’ storage costs and lowers cloud systems’ performance and utilization. For example, file sharing among users results in many duplicate copies of the same file. In virtualized environments, the guest operating systems of several virtual machines (VMs) may be almost identical. Even backup snapshots may differ only slightly from one day to the next. Deduplication removes these inefficiencies from storage systems.

According to Microsoft’s estimates, general-purpose file servers may see typical space savings of up to 50%, while virtualization libraries might realize up to 95% space savings.[3] The actual savings vary by dataset.

Benefits of data deduplication

Applications and the data they generate drive business analytics and are a determining factor in successful growth. The benefits of managing data growth through deduplication are not limited to storage systems; they extend to the entire IT infrastructure and to application performance. The smaller storage footprint produced by dedup and compression lowers storage costs and reduces pressure on network bandwidth. It also improves application performance at endpoints, which enhances remote workers’ productivity.

Data deduplication solutions deliver significant business benefits that extend beyond areas directly related to storage devices:

  • Lower costs. Smaller storage capacity requirements translate into lower expenses that ripple through IT operations: less infrastructure to manage, fewer admin and management resources, and lower cloud-provider charges for data movement and network traffic. For on-premises storage, the savings include Capex for the storage devices and the space they occupy, and Opex for storage management, power, cooling, and other utilities.

  • Longer data retention. With data deduplication, enterprises can afford to retain their smaller datasets for more extended periods and to meet more stringent retention requirements.

  • Higher overall performance. Because cloud providers base many of their charges on data movement, it makes sense for customers to optimize their datasets. Less data traffic between cloud locations reduces incurred costs and frees network bandwidth for more users and faster delivery of services. It is wise for businesses to deduplicate their data before sending it to the cloud, and prudent to dedup data already stored there.

Common use cases for data deduplication

Organizations of all sizes need deduplication for these everyday use cases:

  • Virtual machines. VMs with application deployments result in duplicate guests and associated data. Dedup can help VMs work more efficiently.

  • Endpoints. Endpoint clients, including desktops and laptops, are prone to accumulating duplicate data that needs regular dedup for better performance and more efficient backup operations.

  • Cloud storage. As more data moves to the cloud, deduplication offers outsized benefits there, because unchecked data build-up can be costlier than in on-premises storage. Some customers find that applying dedup to data before sending it to the cloud for long-term retention helps reduce cost. In general, reducing a dataset’s size lowers costs and improves networking and performance.

Does Commvault offer data deduplication?

Yes! Commvault offers enterprise-grade deduplication in our portfolio of data protection offerings. Through our sophisticated dedup capabilities, customers enjoy fast performance, storage optimization, and cost savings. Get a demo today to learn how it can help your organization!

Sources

  1. Domo, 2020. “Data Never Sleeps 6.0.”
  2. Raconteur. “Data in a Day” infographic.
  3. Microsoft, 2021. “Why is Data Deduplication useful?”
  4. WinPure, January 2021. “Data Deduplication Benefits: Improve Your Company’s ROI.”
