Dispelling Doubts about De-Duplication
The practical implications of de-duplication can have a significant impact on your litigation costs.
July 16, 2008 at 08:00 PM
The original version of this story was published on Law.com
No one likes to duplicate work. In the e-discovery world, the daunting term “de-duplication” refers to the act of removing duplicate documents from a collection of data to be reviewed. It would be a waste of time and money for someone to review five exact copies of an e-mail or document.
While the concept sounds simple enough, the practical implications of de-duplication can have a significant impact on your litigation costs.
De-Duplication Basics
The term “de-duplication” is not new. The concept has been used in the IT world for many years as a convenient and resourceful way to conserve precious storage space.
Consider this example: You receive a memo attached to an e-mail from your outside law firm. You need to respond, but first you forward the memo to 4 or 5 co-workers for their feedback. Each person you forward it to now has a copy of that memo in their e-mail. Instead of backing up 4 or 5 duplicate copies of that memo, your IT department likely uses a process to identify the duplicates so they only back up one copy of that file while maintaining an index of where each file was found. The index allows the IT department to restore that file for each user if that is ever necessary … in response to a production request, for example.
To compare two (or more) documents, each file is assigned an identifier based on its content. This identifier is called a “hash” value, probably because it looks like an indecipherable mess of letters and numbers. The hash value is generated by a precise mathematical function, so two completely identical files will always have the same hash value and, for practical purposes, two different files will not. If a comma or space is added to one document, the hash value of that file will be completely different. (For more information on hash values, see the e-Discovery Team blog.)
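As a rough illustration of how a hash value behaves, consider the short Python sketch below. It assumes the SHA-256 algorithm purely for demonstration; the article does not say which algorithm a given vendor uses, and the sample text and the function name file_hash are made up for this example.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """Return the SHA-256 hash of the content as a string of hex characters."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical document contents, used only for illustration.
original = b"Please review the attached memo."
exact_copy = b"Please review the attached memo."
edited = b"Please review the attached memo. "  # one trailing space added

# Identical content always produces the same hash value.
print(file_hash(original) == file_hash(exact_copy))  # True

# A single added space produces a different hash value.
print(file_hash(original) == file_hash(edited))  # False
```

In practice the bytes would be read from the file itself rather than typed in, but the comparison works the same way.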
E-discovery vendors apply this technology when they process electronically stored information in preparation for review. This process can be invaluable in culling down a large document collection, and it saves time on the actual review. The extent to which de-duplication shrinks the universe of documents to be reviewed can be surprising.
Duplicates on the Horizon
De-duplication is usually applied in two different scenarios. The first is “vertical” de-duplication, sometimes called “custodian de-duplication.” In this scenario, the de-duplication technique is only applied to a single custodian's data collection. This would catch two duplicate documents that existed on that person's computer hard drive, but it would not eliminate duplicates of that same document that existed on other people's computers.
Vertical de-duplication is commonly used with backup tapes. If an individual kept an e-mail or a document on their system for 6 months, and there were 6 monthly backup tapes, then de-duplication would eliminate 5 redundant copies of that e-mail or document.
The other scenario is called “horizontal” de-duplication, which is applied across all the custodians or data sets involved in a matter. Horizontal de-duplication would catch every copy of the e-mailed memo in our example above, across each person's e-mail collection. The trick here is to ensure that your e-discovery processor keeps track of the source of each copy in case that information becomes necessary.
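The difference between the two scopes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the custodian names, file locations, and the deduplicate function are hypothetical, SHA-256 is assumed as the hash, and a real processing tool will differ in its details.

```python
import hashlib
from collections import defaultdict


def digest(content: bytes) -> str:
    """Hash the content (SHA-256 is an assumption made for this sketch)."""
    return hashlib.sha256(content).hexdigest()


def deduplicate(documents, scope):
    """Group documents by hash and record where every copy was found.

    documents: list of (custodian, location, content) tuples.
    scope: "vertical" compares copies only within one custodian's data;
           "horizontal" compares copies across all custodians.
    Returns a dict whose keys are the unique documents kept for review and
    whose values list the (custodian, location) source of each copy.
    """
    if scope not in ("vertical", "horizontal"):
        raise ValueError("scope must be 'vertical' or 'horizontal'")
    index = defaultdict(list)
    for custodian, location, content in documents:
        h = digest(content)
        key = (custodian, h) if scope == "vertical" else h
        index[key].append((custodian, location))
    return dict(index)


# Hypothetical collection: one memo forwarded around, plus one unrelated file.
memo = b"Memo from outside counsel"
documents = [
    ("alice", "Inbox/memo.doc", memo),
    ("alice", "Sent/memo.doc", memo),
    ("bob", "Inbox/memo.doc", memo),
    ("carol", "Inbox/memo.doc", memo),
    ("carol", "Inbox/notes.doc", b"Carol's own notes"),
]

vertical = deduplicate(documents, "vertical")
horizontal = deduplicate(documents, "horizontal")

print(len(vertical))    # 4: the memo once per custodian, plus Carol's notes
print(len(horizontal))  # 2: the memo once overall, plus Carol's notes
print(horizontal[digest(memo)])  # every place the memo was found
```

Note that nothing is lost in the horizontal case: the list stored under each hash still records every custodian and location, which is the source tracking described above.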
Exact Duplicates vs. Near-Duplicates
If those were the only factors that had to be covered in de-duplication, the e-discovery issue would be easy. In the real world, however, the same Microsoft Word document could exist as a PDF file as well. To a reader, these two files may look exactly alike except that one ends in .doc and the other ends in .pdf, but the underlying data in each file is stored differently. Because of that, the hash values would be completely different, and both of these files would be included in the review database.
One way to solve this issue would be to create the hash value based only on certain characteristics of the files. Instead of establishing the hash value based on the content of the document, you could create the hash value from just the title of the document, or other specific properties of the files. This could be helpful, but it is a little tricky unless you really know your document collection and understand how de-duplication will cull it down.
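A minimal Python sketch of hashing selected properties, rather than content, might look like the following. The property names (title, author, extension), the sample values, and the function property_hash are all hypothetical; actual tools expose their own fields and options.

```python
import hashlib


def property_hash(doc: dict, fields: tuple) -> str:
    """Hash only the chosen properties of a document, ignoring its content."""
    # Normalize each value so that case and stray spaces do not matter.
    parts = [str(doc.get(field, "")).strip().lower() for field in fields]
    joined = "\x1f".join(parts)  # separator keeps adjacent fields distinct
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical documents described by a few properties.
word_version = {"title": "Settlement Memo", "author": "J. Smith", "extension": ".doc"}
pdf_version = {"title": "Settlement Memo", "author": "J. Smith", "extension": ".pdf"}
unrelated = {"title": "Settlement Memo", "author": "A. Jones", "extension": ".doc"}

title_only = ("title",)
title_and_author = ("title", "author")

# The Word and PDF versions now match, which is the goal.
print(property_hash(word_version, title_only) == property_hash(pdf_version, title_only))  # True

# But a different author's memo with the same title also matches.
print(property_hash(word_version, title_only) == property_hash(unrelated, title_only))  # True

# Adding a second property separates the unrelated document again.
print(property_hash(word_version, title_and_author) == property_hash(unrelated, title_and_author))  # False
```

The second comparison shows why this approach calls for caution: with too few properties, documents that merely share a title would be treated as duplicates and one of them would drop out of the review.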
If the idea of de-duplication is leaving you a bit duped, perhaps you'll be a little more comfortable with the idea of “near-duplicates.” A company called Equivio has popularized the phrase “near-duplicates” based on its technology, which identifies documents that are highly similar to each other. Examples of near-duplicates include various drafts of the same Word document or an e-mail forward that contains the entire original message. Neither of these examples is an exact duplicate, but the files are so similar that a reviewer may only need to see the final version of the document to determine if the entire group of near-duplicates is relevant.
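To give a feel for what “mostly similar” can mean, the Python sketch below scores two texts with the standard library's difflib module. This is not Equivio's technology, whose method the article does not describe; it is a generic text comparison, and the 0.9 threshold, the sample sentences, and the function names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score from 0.0 (nothing in common) to 1.0 (identical text)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two texts as near-duplicates when their score meets the threshold.

    The default threshold of 0.9 is an arbitrary value for this sketch.
    """
    return similarity(a, b) >= threshold


# Hypothetical drafts of the same sentence, plus an unrelated message.
draft_v1 = "The parties agree to settle all claims for $50,000 within 30 days."
draft_v2 = "The parties agree to settle all claims for $55,000 within 30 days."
unrelated = "Lunch is at noon on Friday in the main conference room."

print(is_near_duplicate(draft_v1, draft_v2))   # True: only one character differs
print(is_near_duplicate(draft_v1, unrelated))  # False: different text entirely
```

Unlike a hash comparison, which is all or nothing, a similarity score forces a choice of where to draw the line, and that choice determines how many documents are grouped together.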
De-duplication techniques and technology are not perfect, but they go a long way in effectively culling down a large document collection in preparation for review. Many of the important decisions about how documents should be de-duplicated depend greatly on knowledge of the documents and how they have been collected. If you are familiar with the schedule of your company's backup tapes, for example, that can help you decide whether vertical or horizontal de-duplication will be the best method for a matter.