No one likes to duplicate work. In the e-discovery world, the daunting term of “de-duplication” embodies the act of removing duplicate documents from a collection of data to be reviewed. It would be a waste of time and money for someone to review five exact copies of an e-mail or document.

While the concept sounds simple enough, the practical implications of de-duplication can have a significant impact on your litigation costs.

De-Duplication Basics

The term “de-duplication” is not new. The concept has been used in the IT world for many years as a convenient and resourceful way to conserve precious storage space.

Consider this example: You receive a memo attached to an e-mail from your outside law firm. You need to respond, but first you forward the memo to 4 or 5 co-workers for their feedback. Each person you forward it to now has a copy of that memo in their e-mail. Instead of backing up 4 or 5 duplicate copies of that memo, your IT department likely uses a process to identify the duplicates so they only back up one copy of that file while maintaining an index of where each file was found. The index allows the IT department to restore that file for each user if that is ever necessary … in response to a production request, for example.
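To make that concrete, here is a minimal sketch (in Python, with made-up file paths) of how a backup process might store only one copy of each unique file while keeping an index of every place the file was found. The function name and the paths are purely illustrative, not any particular vendor's implementation.

import hashlib
from collections import defaultdict
from pathlib import Path

def backup_with_dedup(paths):
    stored = {}                    # hash -> file contents (stands in for the backup store)
    index = defaultdict(list)      # hash -> every location where that file was found
    for path in paths:
        data = Path(path).read_bytes()
        digest = hashlib.sha1(data).hexdigest()
        if digest not in stored:
            stored[digest] = data  # only the first copy is actually kept
        index[digest].append(str(path))
    return stored, index

# Hypothetical locations of the forwarded memo and its copies
stored, index = backup_with_dedup([
    "mail/you/memo.doc",
    "mail/coworker1/memo.doc",
    "mail/coworker2/memo.doc",
])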

To compare two (or more) documents, each file is assigned a unique identifier based on its content. This identifier is called a “hash” value, probably because it looks like an indecipherable mess of letters and numbers. The hash value is generated by a mathematical algorithm, so two files will have the same hash value only if they are completely identical. If a comma or space is added to one document, the hash value of that file will be completely different. (For more information on hash values, see the e-Discovery Team blog.)
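For the curious, here is a quick illustration using Python's built-in hashlib module: two byte-for-byte identical messages produce the same hash value, while changing a single character produces an entirely different one. The sample text is invented for the example.

import hashlib

original = b"Please review the attached memo by Friday."
copy     = b"Please review the attached memo by Friday."
edited   = b"Please review the attached memo by Friday,"   # period swapped for a comma

print(hashlib.md5(original).hexdigest())   # identical content ...
print(hashlib.md5(copy).hexdigest())       # ... produces an identical hash
print(hashlib.md5(edited).hexdigest())     # one character off, completely different hash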

E-discovery vendors apply this technology when they process electronically stored information in preparation for review. This process can be invaluable in culling down a large document collection, and obviously saves time on the actual review. The degree to which de-duplication shrinks the relevant universe of documents can be surprising.

Duplicates on the Horizon

De-duplication is usually applied in two different scenarios. The first is “vertical” de-duplication, sometimes called “custodian de-duplication.” In this scenario, the de-duplication technique is only applied to a single custodian's data collection. This would catch two duplicate documents that existed on that person's computer hard drive, but it would not eliminate duplicates of that same document that existed on other people's computers.

Vertical de-duplication is commonly used with backup tapes. If an individual kept an e-mail or a document on their system for 6 months, and there were 6 monthly backup tapes, then de-duplication would eliminate 5 redundant copies of that e-mail or document.

The other scenario is called “horizontal” de-duplication, which is applied across all the custodians or data sets involved in a matter. Horizontal de-duplication would catch every copy of the e-mailed memo from our example above, no matter whose e-mail collection it sat in. The trick here is to ensure that your e-discovery processor keeps track of the source of each copy in case that information becomes necessary.
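Here is a rough sketch, again in Python with made-up custodian names and documents, of the difference between the two approaches. The horizontal version keeps a source index so you can still say whose collection each copy came from.

import hashlib
from collections import defaultdict

# Hypothetical custodians and (very short) document contents
collections = {
    "alice": [b"memo v1", b"memo v1", b"status report"],
    "bob":   [b"memo v1", b"budget"],
}

def dedupe_vertical(collections):
    # Remove duplicates within each custodian's collection only
    result = {}
    for custodian, docs in collections.items():
        seen, unique = set(), []
        for doc in docs:
            digest = hashlib.sha1(doc).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        result[custodian] = unique
    return result

def dedupe_horizontal(collections):
    # Keep one copy across all custodians, but remember every source
    unique, sources = {}, defaultdict(list)
    for custodian, docs in collections.items():
        for doc in docs:
            digest = hashlib.sha1(doc).hexdigest()
            unique.setdefault(digest, doc)
            sources[digest].append(custodian)
    return list(unique.values()), sources

# Vertical de-duplication still leaves "memo v1" in both alice's and bob's sets;
# horizontal keeps a single copy and records that it came from both custodians.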

Exact Duplicates vs. Near-Duplicates

If those were the only factors that had to be covered in de-duplication, the e-discovery issue would be easy. In the real world, however, the same Microsoft Word document could also exist as a PDF file. The two files may contain exactly the same text, but because they are stored in different formats (one ends in .doc and the other in .pdf), their hash values would be completely different, and both files would be included in the review database.

One way to solve this issue would be to create the hash value based on only certain characteristics of the files. Instead of establishing the hash value based on the content of the document, you could create the hash value from just the title of the document, or other specific properties of the files. This can be helpful, but it is also risky unless you really know your document collection and understand how de-duplication will cull it down.
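As a simple illustration of that idea, the sketch below hashes only a document's title and author rather than its full content, so a Word file and its PDF counterpart would be treated as duplicates. The field names and documents are hypothetical, not any vendor's actual method.

import hashlib

def metadata_hash(doc):
    # Hash only selected properties (here: title and author) instead of the full content
    key = (doc.get("title", "") + "|" + doc.get("author", "")).lower()
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

word_version = {"title": "Q3 Budget Memo", "author": "Smith", "path": "memo.doc"}
pdf_version  = {"title": "Q3 Budget Memo", "author": "Smith", "path": "memo.pdf"}

# Same metadata hash even though the underlying bytes differ, so the two
# versions would be treated as duplicates under this looser definition.
assert metadata_hash(word_version) == metadata_hash(pdf_version)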

If the idea of de-duplication is leaving you a bit duped, perhaps you'll be a little more comfortable with the idea of “near-duplicates.” A company called Equivio has popularized the phrase “near-duplicates” based on its technology that identifies documents that are mostly similar to each other. Examples of near-duplicates include various drafts of the same Word document or an e-mail forward that contains the entire original message. Neither of these examples is an exact duplicate, but the files are so similar that a reviewer may only need to see the final version of the document to determine whether the entire group of near-duplicates is relevant.
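Equivio's actual technology is proprietary, but a toy word-overlap (Jaccard) score is enough to illustrate the general idea of flagging documents that are mostly similar. The sample sentences below are invented for the example.

import re

def jaccard(a, b):
    # Word-overlap similarity: 1.0 means identical word sets, 0.0 means nothing in common
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    return len(words_a & words_b) / len(words_a | words_b)

draft = "The merger closes in March pending board approval."
final = "The merger closes in April pending board and shareholder approval."
other = "Quarterly travel expense policy update."

print(jaccard(draft, final))   # high score: candidates for a near-duplicate group
print(jaccard(draft, other))   # low score: unrelated documents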

De-duplication techniques and technology are not perfect, but they go a long way in effectively culling down a large document collection in preparation for review. Many of the important decisions regarding how documents should be de-duplicated depend greatly on what you know about the documents and how they were collected. If you are familiar with the schedule of your company's backup tapes, for example, that can help you decide whether vertical or horizontal de-duplication will be the best method for a matter.
