Deja Vu in Discovery: Didn't I Just Review That?
Near-dupe detection and email threading can greatly reduce the number of "duplicate" documents that must be reviewed.
April 07, 2020 at 12:23 PM
"I thought I just reviewed that document!" is something every document reviewer since the dawn of time has thought. Duplicate documents, and documents that seem like duplicates but aren't, can slow down document review, increase costs and lead to inconsistent document coding, which can be extremely problematic during a privilege review. This article explores why document reviewers who claim to see duplicates are right and wrong at the same time by explaining how deduplication works and the inherent shortcomings of that process. It also offers a solution: two technologies, near-dupe detection and email threading, that can greatly reduce the number of "duplicate" documents that must be reviewed.
Deduplication sounds simple in theory: remove all of the duplicate documents when you load documents into your document review platform. It isn't simple in practice. Are a PDF and a Word document that have the exact same text duplicates? (No.) What about a document that is attached to an email and also saved on someone's hard drive? (Yes.) The slightest differences between documents, which might not be perceptible to the document reviewer, or documents saved in different formats, can mean the documents are not exact duplicates. If they aren't exact duplicates, they won't generate the same hash value (which is what answers the questions posed above), and you'll be stuck looking at what is practically the same document multiple times. Hash values are the standard method document review platforms use to deduplicate documents. When documents are processed into the platform, a hash value is generated for each document and compared with the hash values of all documents already processed, and any document with a matching hash value will not be made available for review. (How hash values are generated is beyond the scope of this article.)
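To make the comparison step concrete, here is a minimal sketch in Python. It is not how any particular review platform works: the choice of SHA-256, the function names and the sample byte strings are all assumptions made for illustration. It shows only that byte-identical files share a hash, while the same text stored in a different format does not.

```python
import hashlib


def file_hash(content: bytes) -> str:
    # SHA-256 is an assumption here; a given platform may use another algorithm.
    return hashlib.sha256(content).hexdigest()


def deduplicate(documents):
    """Keep the first document seen for each hash value.

    `documents` is a list of (name, content_bytes) pairs (hypothetical shape).
    """
    seen = set()
    kept = []
    for name, content in documents:
        digest = file_hash(content)
        if digest in seen:
            continue  # exact duplicate: withheld from review
        seen.add(digest)
        kept.append(name)
    return kept


# Stand-in byte strings: two identical files, plus the "same" text in another format.
docs = [
    ("contract.docx", b"DOCX-bytes: Payment is due in 30 days."),
    ("contract_copy.docx", b"DOCX-bytes: Payment is due in 30 days."),
    ("contract.pdf", b"PDF-bytes: Payment is due in 30 days."),
]

print(deduplicate(docs))  # ['contract.docx', 'contract.pdf']
```

The copy is dropped, but the PDF survives even though a reviewer would read the same words in it.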
Deduplication using hash values can be done several different ways; the two most common are globally and by custodian. Email helps explain the difference between the two. If two people email each other and their emails are collected, each will have a copy of all the emails sent back and forth. If you deduplicate globally, only one copy of each email will remain in your document review database. If you deduplicate by custodian, you'll be left with two copies of each email, one per custodian, but any extra copies within a single custodian's files will not get into the database. When you deduplicate documents globally, it is imperative that you be mindful of the metadata. For example, a field should be created that lists all of the custodians of the document.
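The difference between the two scopes can be sketched in a few lines of Python. This is an illustrative sketch only: the record layout, the `scope` parameter, the custodian names and the use of SHA-256 are assumptions, not a description of any vendor's tool.

```python
import hashlib


def dedupe(records, scope="global"):
    """Deduplicate (custodian, content_bytes) records.

    scope="global": one copy per hash across all custodians.
    scope="custodian": one copy per hash within each custodian.
    Also returns every custodian seen for each hash, which is the kind of
    "all custodians" metadata field worth keeping when deduplicating globally.
    """
    seen = set()
    kept = []
    all_custodians = {}
    for custodian, content in records:
        digest = hashlib.sha256(content).hexdigest()
        holders = all_custodians.setdefault(digest, [])
        if custodian not in holders:
            holders.append(custodian)
        key = digest if scope == "global" else (custodian, digest)
        if key in seen:
            continue
        seen.add(key)
        kept.append((custodian, digest))
    return kept, all_custodians


email = b"Are you available to meet today?"
records = [
    ("alice", email),  # Alice's sent copy
    ("bob", email),    # Bob's received copy
    ("alice", email),  # a second copy in another of Alice's folders
]

global_kept, custodians = dedupe(records, scope="global")
custodian_kept, _ = dedupe(records, scope="custodian")

print(len(global_kept))     # 1
print(len(custodian_kept))  # 2
print(list(custodians.values()))  # [['alice', 'bob']]
```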
Deduplication is going to prevent a lot of documents from entering your review platform, but even when done globally, it still has shortcomings. One major shortcoming is that the way hash values are created leads to some quirks, as demonstrated by the questions posed above. The other major shortcoming is simply how people operate: they save draft upon draft of the same document before finalizing it. These are obviously different documents, but they feel like duplicates. The solution to these problems is near-dupe detection.
Near-dupe detection should be an option with every document review platform worth its salt. What most versions of it do, generally, is come up with a score or percentage for how similar documents are to one another. A Word document and the PDF generated from it should have an extremely high score. Multiple drafts of the same letter will not score as high, but, depending on your needs, having them clustered together can help you cut through a lot of documents quickly. How you implement this tool might vary. If you need to get to the most interesting documents and content quickly, you could skip reviewing all documents that are at least a 75% match to a document you've already reviewed. If it's important to know who made what edits to a draft report and when they made them, you'd obviously want to study each different version. In this latter example, near-dupe detection can still be helpful because it will cluster all the different drafts together. So don't view it only as a tool to eliminate documents from the review process; it can help with much more.
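As a rough illustration of scoring and clustering, the sketch below uses `difflib.SequenceMatcher` from Python's standard library to compare documents word by word. Commercial near-dupe engines use their own (often proprietary) methods, so treat the scoring function, the greedy clustering and the sample texts as assumptions; only the 75% threshold comes from the example above.

```python
from difflib import SequenceMatcher


def similarity(text_a: str, text_b: str) -> float:
    # Word-level similarity between 0.0 and 1.0 (an illustrative measure only).
    return SequenceMatcher(None, text_a.lower().split(), text_b.lower().split()).ratio()


def cluster(docs, threshold=0.75):
    """Greedy clustering: a document joins the first cluster whose first
    member it matches at or above the threshold, else it starts a new one."""
    clusters = []
    for name, text in docs.items():
        for group in clusters:
            if similarity(docs[group[0]], text) >= threshold:
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical extracted text from four files.
docs = {
    "letter_v1.docx": "Payment is due within 30 days of the invoice date",
    "letter_v2.docx": "Payment is due within 45 days of the invoice date",
    "letter_v1.pdf": "Payment is due within 30 days of the invoice date",
    "party_memo.docx": "Lunch menu for the holiday party on Friday",
}

for group in cluster(docs):
    print(group)
```

With these sample texts, the two drafts and the PDF land in one cluster and the unrelated memo in another. A reviewer could then code the cluster together or, if the edit history matters, walk through its members in order.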
Email threading is the other technology that can drastically reduce the number of similar documents that need to be reviewed. It works by combining back-and-forth email exchanges, which would otherwise create as many separate documents as there are individual emails, into a single document. Without email threading, the following exchange would create three documents that have to be reviewed: Email 1: "Are you available to meet today?" Email 2: "Yes. After 2:00 I'm free." Email 3: "Okay, let's meet at 2:30 to discuss next year's marketing budget." With email threading, those three emails would appear as one document. The obvious benefit is fewer documents to review. The less obvious benefit is that the whole exchange is coded once, which leads to more consistent coding of documents and is particularly helpful during privilege review.
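The grouping idea can be sketched as follows. This is a deliberately simplified sketch: it groups messages by subject line after stripping reply and forward prefixes, then orders them by sent time. Real threading tools typically rely on richer signals, and the subject lines, timestamps and field names here are hypothetical, since the exchange above gives only the message bodies.

```python
import re
from collections import defaultdict


def normalize_subject(subject: str) -> str:
    # Strip any leading "RE:", "FW:" or "FWD:" prefixes, then lowercase.
    stripped = re.sub(r"^(?:(?:re|fw|fwd)\s*:\s*)+", "", subject.strip(), flags=re.IGNORECASE)
    return stripped.lower()


def build_threads(emails):
    """Group emails by normalized subject and sort each group by sent time."""
    threads = defaultdict(list)
    for email in emails:
        threads[normalize_subject(email["subject"])].append(email)
    return {
        subject: sorted(messages, key=lambda m: m["sent"])
        for subject, messages in threads.items()
    }


# Subjects and timestamps are invented for illustration.
emails = [
    {"subject": "RE: Meeting today?", "sent": "2020-04-01T09:05",
     "body": "Yes. After 2:00 I'm free."},
    {"subject": "Meeting today?", "sent": "2020-04-01T09:00",
     "body": "Are you available to meet today?"},
    {"subject": "RE: RE: Meeting today?", "sent": "2020-04-01T09:10",
     "body": "Okay, let's meet at 2:30 to discuss next year's marketing budget."},
]

threads = build_threads(emails)
for subject, messages in threads.items():
    print(f"Thread '{subject}' ({len(messages)} emails, reviewed as one document)")
    for message in messages:
        print("  " + message["body"])
```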
This article has barely skimmed the surface of how these two tools can make document review easier, but with this basic understanding, you can hopefully see the utility of implementing them on your next document review project. Whichever company provides your document review platform will have much more information on how its particular tools work and can discuss best practices for implementing them with you.
Todd Heffner is a construction litigator and eDiscovery specialist with Jones Walker in Atlanta.
NOT FOR REPRINT
© 2024 ALM Global, LLC, All Rights Reserved. Request academic re-use from www.copyright.com. All other uses, submit a request to [email protected]. For more information visit Asset & Logo Licensing.