Deja Vu in Discovery: Didn't I Just Review That?

Near-dupe detection and email threading can greatly reduce the number of "duplicate" documents that must be reviewed.

April 07, 2020 at 12:23 PM


Todd Heffner of Jones Walker. (Courtesy photo)

"I thought I just reviewed that document!" is something every document reviewer since the dawn of time has thought. Duplicate documents, and documents that seem like duplicates but aren't, can slow down document review, increase costs and lead to inconsistent document coding, which is especially problematic during a privilege review. This article explores why document reviewers who claim to see duplicates are right and wrong at the same time by explaining how deduplication works and the inherent shortcomings of that process. This article also offers a solution: two technologies, near-dupe detection and email threading, which can greatly reduce the number of "duplicate" documents that must be reviewed.

Deduplication sounds simple in theory: Remove all of the duplicate documents when you load documents into your document review platform. Deduplication isn't simple in practice: Are a PDF and a Word document that have the exact same text duplicates? (No.) What about a document that is attached to an email and also saved on someone's hard drive? (Yes.) The slightest of differences between documents, which might not be perceptible to the document reviewer, or documents saved in different formats, can mean the documents are not exact duplicates. If they aren't exact duplicates, they won't generate the same hash value, and you'll be stuck looking at what is practically the same document multiple times. Hash values, which were used to answer the questions posed above, are the standard method document review platforms use to deduplicate documents. When documents are processed into the platform, a hash value is generated for each document and compared with the hash values of all the documents already processed, and any document with a matching hash value will not be made available for review. (How hash values are generated is beyond the scope of this article.)
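As a rough illustration of the comparison step described above, here is a minimal Python sketch. It assumes SHA-256 computed over a document's raw bytes; a real review platform may use a different algorithm or hash only selected fields, and the file names and contents are hypothetical.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def deduplicate(documents: dict) -> dict:
    """Keep the first document seen for each distinct hash value."""
    seen = set()
    kept = {}
    for name, content in documents.items():
        digest = file_hash(content)
        if digest in seen:
            continue  # exact duplicate: not made available for review
        seen.add(digest)
        kept[name] = content
    return kept


# Hypothetical documents; the third differs only by one trailing space.
docs = {
    "contract.docx": b"Payment is due in 30 days.",
    "contract_copy.docx": b"Payment is due in 30 days.",
    "contract_v2.docx": b"Payment is due in 30 days. ",
}
kept = deduplicate(docs)
print(sorted(kept))  # ['contract.docx', 'contract_v2.docx']
```

The byte-for-byte copy is dropped, but the version with a single extra space survives because its hash value is different, which is exactly why a reviewer can still see what looks like the same document twice.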

Deduplication using hash values can be done several different ways; the two most common are globally and by custodian. Email helps explain the difference between the two. If two people email each other and their emails are collected, each will have a copy of all the emails sent back and forth. If you deduplicate globally, only one copy of each email will remain in your document review database. If you deduplicate by custodian, you'll be left with two copies of each email, one for each custodian, but any extra copies within a single custodian's files will not get into the database. When you deduplicate documents globally, it is imperative that you be mindful of the metadata. For example, a field should be created that indicates all of the custodians of the document.
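The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the copies are treated as byte-identical, SHA-256 stands in for whatever hash a platform actually uses, and the custodian names and the `dedupe` helper are hypothetical.

```python
import hashlib
from collections import defaultdict

# Hypothetical collection: (custodian, raw document bytes).
collected = [
    ("alice", b"Are you available to meet today?"),
    ("bob", b"Are you available to meet today?"),
    ("alice", b"Are you available to meet today?"),  # extra copy in Alice's files
]


def dedupe(records, by_custodian=False):
    """Return one entry per kept document, mapped to its list of custodians.

    Global: the key is the hash alone, so one copy survives overall.
    By custodian: the key is (custodian, hash), so one copy survives per person.
    """
    kept = defaultdict(set)
    for custodian, content in records:
        digest = hashlib.sha256(content).hexdigest()
        key = (custodian, digest) if by_custodian else digest
        kept[key].add(custodian)
    return {key: sorted(custodians) for key, custodians in kept.items()}


global_result = dedupe(collected)
custodian_result = dedupe(collected, by_custodian=True)

print(len(global_result), list(global_result.values()))  # 1 [['alice', 'bob']]
print(len(custodian_result))  # 2
```

In the global case, the list stored with the surviving copy plays the role of the "all custodians" metadata field, so the fact that both people held the email is not lost.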

Deduplication is going to prevent a lot of documents from entering your review platform, but even when done globally, it still has shortcomings. One major shortcoming is that the way hash values are created leads to some quirks, as demonstrated by the questions posed above. The other major shortcoming is simply how people operate: They save draft upon draft of the same document before finalizing it. These are obviously different documents, but they feel like duplicates. The solution to these problems is near-dupe detection.

Near-dupe detection should be an option with every document review platform worth its salt. What most versions of it do, generally, is come up with a score or percentage for how alike certain documents are to each other. A Word document and the PDF generated from it should have an extremely high score. Multiple drafts of the same letter will not score as high, but, depending on your needs, clustering them together can help you cut through a lot of documents quickly. How you implement this tool might vary. If you need to get to the most interesting documents and content quickly, you could skip reviewing all documents that are at least a 75% match to a document you've already reviewed. If it's important to know who made what edits to a draft report and when they made them, you'd obviously want to study each different version. In this latter example, near-dupe detection can still be helpful because it will cluster all the different drafts together. So don't just view it as a tool to eliminate documents from the review process; it can help with much more.
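To make the idea of a similarity score concrete, the sketch below compares documents word by word using `difflib.SequenceMatcher` from Python's standard library. This is a stand-in technique chosen for illustration, not the method any particular review platform uses, and the sample texts, function names and 0.75 threshold are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Score two texts from 0.0 (nothing shared) to 1.0 (identical), word by word."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def near_duplicates(reviewed: str, candidates: dict, threshold: float = 0.75) -> list:
    """Return the names of candidates scoring at or above the threshold."""
    return [
        name
        for name, text in candidates.items()
        if similarity(reviewed, text) >= threshold
    ]


# Hypothetical texts: a reviewed final version, an earlier draft, an unrelated email.
final = "Payment is due within 30 days of the invoice date."
candidates = {
    "draft": "Payment is due within 45 days of the invoice date.",
    "unrelated": "Okay, let's meet at 2:30 to discuss next year's marketing budget.",
}

for name, text in candidates.items():
    print(name, round(similarity(final, text), 2))

print(near_duplicates(final, candidates))  # ['draft']
```

Depending on the goal, the documents at or above the threshold could be skipped, or simply grouped so that all drafts are reviewed side by side.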

Email threading is the other technology that can drastically reduce the number of similar documents that need to be reviewed. It works by combining back-and-forth email exchanges, which would otherwise create as many separate documents as there are individual emails, into a single document. Without email threading, the following exchange would create three documents that have to be reviewed: Email 1: "Are you available to meet today?" Email 2: "Yes. After 2:00 I'm free." Email 3: "Okay, let's meet at 2:30 to discuss next year's marketing budget." With email threading, those three emails would appear as one document. The obvious benefit is fewer documents to review. The less obvious benefit is that the whole exchange is coded once, which leads to more consistent coding and is particularly helpful during privilege review.
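One simple way to picture threading is to follow each reply back to the message that started the exchange. The sketch below assumes every email records which message it replies to, in the way the standard Message-ID and In-Reply-To email headers do; how a given platform actually builds threads is not covered here, and the identifiers and the fourth email are hypothetical.

```python
from collections import defaultdict

# Hypothetical emails; "id" and "reply_to" stand in for the
# Message-ID and In-Reply-To headers.
emails = [
    {"id": "m1", "reply_to": None, "body": "Are you available to meet today?"},
    {"id": "m2", "reply_to": "m1", "body": "Yes. After 2:00 I'm free."},
    {"id": "m3", "reply_to": "m2",
     "body": "Okay, let's meet at 2:30 to discuss next year's marketing budget."},
    {"id": "m4", "reply_to": None, "body": "Lunch on Friday?"},
]


def thread_root(email_id, parent_of):
    """Follow reply links back to the first message in the exchange."""
    while parent_of.get(email_id) is not None:
        email_id = parent_of[email_id]
    return email_id


def build_threads(messages):
    """Group message ids under the id of the message that started each thread."""
    parent_of = {m["id"]: m["reply_to"] for m in messages}
    threads = defaultdict(list)
    for m in messages:
        threads[thread_root(m["id"], parent_of)].append(m["id"])
    return dict(threads)


threads = build_threads(emails)
print(threads)  # {'m1': ['m1', 'm2', 'm3'], 'm4': ['m4']}
```

Four emails become two review units: the three-message exchange from the example above, and the unrelated fourth message on its own.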

This article barely skimmed the surface of how these two tools can make document review easier, but hopefully with this basic understanding, you can see the utility of implementing them on your next document review project. Whichever company provides your document review platform will have a lot more information on how their particular tools work and can discuss best practices for implementing them with you.

Todd Heffner is a construction litigator and eDiscovery specialist with Jones Walker in Atlanta.
