"I thought I just reviewed that document!" is something every document reviewer since the dawn of time has thought. Duplicate documents, and documents that seem like duplicates but aren't, can slow down document review, increase costs and lead to inconsistent document coding—which can be extremely problematic when occurring during a privilege review. This article explores why document reviewers who claim to see duplicates are right and wrong at the same time by explaining how deduplication works and the inherent shortcomings with that process. This article also offers a solution: two technologies—near-dupe detection and email threading—which can greatly reduce the number of "duplicate" documents that must be reviewed.

Deduplication sounds simple in theory: remove all of the duplicate documents when you load documents into your document review platform. It isn't simple in practice. Are a PDF and a Word document that contain the exact same text duplicates? (No.) What about a document that is attached to an email and also saved on someone's hard drive? (Yes.) The slightest differences between documents, which might not be perceptible to the document reviewer, or the same document saved in different formats, can mean the documents are not exact duplicates. If they aren't exact duplicates, you'll be stuck looking at what is practically the same document multiple times, because the documents won't generate the same hash value. Hash values are the standard method document review platforms use to deduplicate documents, and they are what determined the answers to the two questions above. When documents are processed into the platform, a hash value is generated for each document and compared with the hash values of all the documents already processed, and any document with a matching hash value will not be made available for review. (How hash values are generated is beyond the scope of this article.)
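For readers who want to see the idea in concrete terms, here is a minimal sketch in Python of hash-based deduplication. It is an illustration only, not how any particular review platform works: it assumes the hash is SHA-256 computed over the raw bytes of each file, and the file names and contents are invented for the example. Real platforms may use a different hash algorithm and may hash emails on selected fields rather than on the whole file.

```python
import hashlib

def file_hash(content: bytes) -> str:
    """Return a hex digest that acts as the document's fingerprint.

    Assumption: SHA-256 over the raw bytes; a given platform may differ.
    """
    return hashlib.sha256(content).hexdigest()

def deduplicate(documents):
    """Keep the first document seen for each hash value and drop the rest."""
    seen = set()
    kept = []
    for name, content in documents:
        digest = file_hash(content)
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept

# Hypothetical documents: the second is a byte-for-byte copy of the first,
# and the third differs only by one trailing space.
documents = [
    ("contract.docx", b"This Agreement is made on 1 May."),
    ("contract_copy.docx", b"This Agreement is made on 1 May."),
    ("contract_v2.docx", b"This Agreement is made on 1 May. "),
]

kept = deduplicate(documents)
print(kept)
```

The exact copy is dropped, but the version with a single extra space survives, because even an imperceptible difference produces a different hash value.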

Deduplication using hash values can be done several different ways; the two most common are globally and by custodian. Email helps explain the difference between the two. If two people email each other and their emails are collected, each will have a copy of all the emails sent back and forth. If you deduplicate globally, only one copy of each email will remain in your document review database. If you deduplicate by custodian, you'll be left with two copies of each email, one for each custodian, but any extra copies held by the same custodian will not get into the database. When you deduplicate documents globally, it is imperative that you be mindful of the metadata. For example, a field should be created that lists all of the custodians of the document, so that the record of who held a copy is not lost when the extra copies are removed.
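The difference between the two approaches can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each collected item is reduced to a (custodian, hash value) pair, the custodian names and hash strings are made up, and the function names are hypothetical rather than taken from any platform.

```python
from collections import defaultdict

def dedupe_global(records):
    """Keep one copy per hash value across everyone, and record
    every custodian who held a copy (the 'all custodians' field)."""
    custodians = defaultdict(list)
    for custodian, digest in records:
        if custodian not in custodians[digest]:
            custodians[digest].append(custodian)
    return dict(custodians)

def dedupe_by_custodian(records):
    """Keep one copy per (custodian, hash value) pair."""
    seen = set()
    kept = []
    for custodian, digest in records:
        if (custodian, digest) not in seen:
            seen.add((custodian, digest))
            kept.append((custodian, digest))
    return kept

# Hypothetical collection: two emails exchanged between Alice and Bob,
# plus one extra copy of the first email found in Alice's files.
records = [
    ("Alice", "hash-email-1"),
    ("Bob", "hash-email-1"),
    ("Alice", "hash-email-1"),
    ("Bob", "hash-email-2"),
    ("Alice", "hash-email-2"),
]

global_result = dedupe_global(records)
custodian_result = dedupe_by_custodian(records)
print(len(global_result), "documents after global deduplication")
print(len(custodian_result), "documents after custodian deduplication")
```

In this example, global deduplication leaves two documents, each carrying a list of both custodians, while custodian-level deduplication leaves four, with only Alice's extra copy removed.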

Deduplication is going to prevent a lot of documents from entering your review platform, but even when done globally, it still has shortcomings. One major shortcoming is that the way hash values are created leads to some quirks, as demonstrated by the questions posed above. The other major shortcoming is simply how people operate: they save draft upon draft of the same document before finalizing it. These are obviously different documents, but they feel like duplicates. The solution to these problems is near-dupe detection.

Near-dupe detection should be an option with every document review platform worth its salt. What most versions of it do, generally, is come up with a score or percentage for how similar documents are to each other. A Word document and the PDF generated from it should have an extremely high score. Multiple drafts of the same letter will not score as high, but, depending on your needs, clustering them together can still help you cut through a lot of documents quickly. How you implement this tool might vary. If you need to get to the most interesting documents and content quickly, you could skip reviewing all documents that are at least a 75% match to a document you've already reviewed. If it's important to know who made what edits to a draft report and when they made them, you'd obviously want to study each different version. In this latter example, near-dupe detection can still be helpful because it will cluster all the different drafts together. So don't view it only as a tool to eliminate documents from the review process; it can help with much more.
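To make the idea of a similarity score concrete, here is a rough sketch in Python. It is not the algorithm any vendor uses; it assumes that Python's standard `difflib.SequenceMatcher`, applied to the words of each document, is an acceptable stand-in for a platform's own scoring. The sample texts and the function names are invented, and the 0.75 threshold simply mirrors the 75% example above.

```python
from difflib import SequenceMatcher

def similarity(text_a: str, text_b: str) -> float:
    """Score two texts from 0.0 (nothing in common) to 1.0
    (the same words in the same order), ignoring capitalization."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()

def near_duplicates(reviewed: str, candidates: dict, threshold: float = 0.75) -> list:
    """Return the names of candidates scoring at or above the threshold
    against a document that has already been reviewed."""
    return [
        name
        for name, text in candidates.items()
        if similarity(reviewed, text) >= threshold
    ]

# Hypothetical documents: a reviewed draft, a later draft with one word
# changed, and an unrelated note.
reviewed = "We propose to meet on Monday to discuss the marketing budget"
candidates = {
    "draft_2": "We propose to meet on Tuesday to discuss the marketing budget",
    "invoice_note": "Please send the invoice for the March delivery",
}

matches = near_duplicates(reviewed, candidates)
print(matches)  # draft_2 clears the threshold; invoice_note does not
```

A hash comparison would treat the two drafts as entirely different documents. A similarity score recognizes that they differ by a single word, which is what lets a platform cluster them or let you skip the second one.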

Email threading is the other technology that can drastically reduce the number of similar documents that need to be reviewed. It works by combining back-and-forth email exchanges, which would otherwise create as many separate documents as there are individual emails, into a single document. Without email threading, the following exchange would create three documents that have to be reviewed: Email 1: "Are you available to meet today?" Email 2: "Yes. After 2:00 I'm free." Email 3: "Okay, let's meet at 2:30 to discuss next year's marketing budget." With email threading, those three emails would appear as one document. The obvious benefit is fewer documents to review. The less obvious benefit is that it leads to more consistent coding of documents, because the entire exchange is coded once rather than email by email, which is particularly helpful when privilege review takes place.
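One simple way to picture threading is as following each reply back to the email that started the exchange. The Python sketch below does that for the three-email example above, plus one unrelated email. It assumes each email records the identifier of the message it replies to, much as the standard Message-ID and In-Reply-To email headers do. The field names, identifiers, and function names are hypothetical, and real platforms typically rely on more signals than this.

```python
from collections import defaultdict

def thread_root(email_id, parents):
    """Follow reply links back to the first email in the exchange."""
    visited = set()
    while parents.get(email_id) is not None and email_id not in visited:
        visited.add(email_id)  # guard against malformed, circular links
        email_id = parents[email_id]
    return email_id

def build_threads(emails):
    """Group email bodies under the identifier of the thread's first email."""
    parents = {e["id"]: e["in_reply_to"] for e in emails}
    threads = defaultdict(list)
    for e in emails:
        threads[thread_root(e["id"], parents)].append(e["body"])
    return dict(threads)

# Hypothetical collection, listed in the order the emails were sent.
emails = [
    {"id": "m1", "in_reply_to": None, "body": "Are you available to meet today?"},
    {"id": "m2", "in_reply_to": "m1", "body": "Yes. After 2:00 I'm free."},
    {"id": "m3", "in_reply_to": "m2",
     "body": "Okay, let's meet at 2:30 to discuss next year's marketing budget."},
    {"id": "m4", "in_reply_to": None, "body": "Reminder: timesheets are due Friday."},
]

threads = build_threads(emails)
print(len(emails), "emails become", len(threads), "documents to review")
```

The three emails about the meeting collapse into one thread, so a reviewer sees and codes that exchange once, while the unrelated reminder remains its own document.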

This article only skims the surface of how these two tools can make document review easier, but hopefully, with this basic understanding, you can see the utility of implementing them on your next document review project. Whichever company provides your document review platform will have a lot more information on how its particular tools work and can discuss best practices for implementing them with you.

Todd Heffner is a construction litigator and eDiscovery specialist with Jones Walker in Atlanta.