The litigator’s toolbelt: Predictive Coding 101
Understanding how predictive coding technology and other TAR tools can be used together as part of a defensible e-discovery process can help organizations reduce risk and cost simultaneously.
February 07, 2014 at 03:00 AM
12 minute read
The original version of this story was published on Law.com
Query your average litigation attorney about the difference between predictive coding technology and other more traditional litigation tools, and you're likely to get a wide range of responses. The fact that “predictive coding” goes by many names, including “computer-assisted review” (CAR) and “technology-assisted review” (TAR), illustrates a fundamental problem: What is predictive coding, and how is it different from other tools in the litigator's technology toolbelt?
Predictive coding is a type of technology that enables a computer to “predict” how documents should be classified based on input or “training” from human reviewers. The technology can expedite the document review process by finding key documents faster, potentially saving organizations thousands of hours of time. And in a profession where time is money, narrowing days, weeks, or even months of tedious document review into more reasonable time frames means organizations can save big bucks by keeping litigation expenditures in check.
Despite the promise of predictive coding, widespread adoption among the legal community has been slower than expected, due in part to the confusion about how it differs from other types of TAR tools that have been available for years. Unlike TAR tools that automatically extract patterns and identify relationships between documents with minimal human intervention, predictive coding requires significant reliance on humans to train and fine-tune the system through an iterative process. Some common TAR tools used in e-discovery that do not include this same level of interaction are described below:
- Keyword search: In its simplest form, a word is entered into a computer system, which then retrieves the documents in the collection that contain that word. Keyword search tools typically include enhanced capabilities to identify word combinations (an approach commonly referred to as Boolean searching) and derivatives of root words, among other things.
- Concept search: Typically involves the use of linguistic and statistical algorithms to determine whether a document is conceptually related to a particular keyword search query. The technology considers variables such as the proximity and frequency of words that appear in relation to the keywords used. Concept search tools retrieve more documents than keyword searches because documents containing concepts related to the query are retrieved in addition to documents that contain the search terms themselves.
- Discussion threading: Utilizes algorithms to dynamically link related documents (most commonly e-mail messages) into chronological threads that reveal entire discussions. This technology simplifies the process of identifying the participants in a conversation and understanding its substance.
- Clustering: Involves the use of linguistic algorithms that automatically organize a large collection of documents into different topical groupings based on similarity.
- Find similar: Enables the retrieval of documents related to a particular document of interest. Reviewing similar documents together can simplify the review process, provide broader context, and help increase coding accuracy.
- Near-duplicate identification: Allows reviewers to easily identify, retrieve, and code documents that are very similar but not exact duplicates. Some systems can highlight discrepancies between near-duplicate documents, which makes subtle differences easier to spot.
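For readers who want a concrete feel for how two of the tools above work, here is a minimal, purely illustrative Python sketch of keyword search and near-duplicate identification. The sample documents, function names, and the word-overlap measure are hypothetical stand-ins; commercial e-discovery platforms use far more sophisticated linguistic and statistical models.

```python
# Illustrative sketch only: simple keyword search and a crude
# near-duplicate check based on word overlap (Jaccard similarity).

def keyword_search(documents, term):
    """Return indices of documents containing the term (case-insensitive)."""
    term = term.lower()
    return [i for i, doc in enumerate(documents) if term in doc.lower()]

def jaccard_similarity(a, b):
    """Word-level overlap between two documents, from 0.0 to 1.0."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def near_duplicates(documents, threshold=0.7):
    """Return index pairs of documents whose word overlap meets the threshold."""
    pairs = []
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            if jaccard_similarity(documents[i], documents[j]) >= threshold:
                pairs.append((i, j))
    return pairs

docs = [
    "Please review the merger agreement before Friday",
    "Please review the merger agreement before Monday",
    "Lunch at noon?",
]
print(keyword_search(docs, "merger"))  # [0, 1]
print(near_duplicates(docs))           # [(0, 1)]
```

The first two documents differ by a single word, so a simple overlap measure flags them as near-duplicates, while the keyword search retrieves only the documents that literally contain the term.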
Because humans train the system on the front end, predictive coding technology requires them to review only a small fraction of the documents, at a fraction of the review cost. The process typically begins with document reviewers using a computer system to review and classify a small sample of case documents as either responsive or non-responsive. These classification decisions are logged by the computer, which uses the information gleaned from this “training set” to construct a model for distinguishing between responsive and non-responsive documents. The model is then applied to the remaining documents to predict how they should be classified and to rank them by degree of responsiveness. Although predictive coding technology is often viewed as a tool for segregating responsive and non-responsive documents, it can also be used to classify case documents based on other criteria, such as attorney-client privilege or key issues relevant to the case.
Training the predictive coding system is an iterative process that requires attorneys and their legal teams to evaluate the accuracy of the computer's document prediction scores throughout multiple training stages. A prediction score is simply a percentage value assigned to each document that is used to rank all the documents by degree of responsiveness. If the accuracy of the computer-generated predictions is insufficient, additional training documents can be selected and reviewed to help improve the system's performance. Multiple training sets are commonly reviewed and coded until the desired performance levels are achieved, at which point informed decisions can be made about which documents to produce to the requesting party.
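As a rough illustration of this train-score-retrain loop, the following toy Python sketch assigns each document a percentage-style prediction score based on word counts in a hand-coded training set. The scoring model, function names, and sample documents are invented for illustration; real predictive coding systems rely on far more sophisticated statistical classifiers.

```python
# Toy sketch of iterative predictive coding training (illustrative only).
from collections import Counter

def train(training_set):
    """training_set: list of (document_text, is_responsive) pairs
    coded by human reviewers."""
    responsive, nonresponsive = Counter(), Counter()
    for text, is_responsive in training_set:
        target = responsive if is_responsive else nonresponsive
        target.update(text.lower().split())
    return responsive, nonresponsive

def prediction_score(model, text):
    """Percentage-style score: share of a document's words seen more often
    in responsive training documents than in non-responsive ones."""
    responsive, nonresponsive = model
    words = text.lower().split()
    if not words:
        return 0
    hits = sum(1 for w in words if responsive[w] > nonresponsive[w])
    return round(100 * hits / len(words))

# Round 1: reviewers hand-code a small training set.
model = train([("merger agreement draft", True), ("lunch menu", False)])
print(prediction_score(model, "merger draft review"))  # 67
print(prediction_score(model, "lunch today"))          # 0
# If accuracy is insufficient, reviewers code another training set,
# retrain, and re-score until performance is acceptable.
```

The scores can then be used to rank the full document population, with the legal team sampling the results to decide whether another training round is needed.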
For example, if the legal team's analysis of the computer's predictions reveals that within a population of one million documents only those with prediction scores in the 70 percent range and higher appear to be responsive, the team may elect to review and then produce only those 300,000 documents to the requesting party. The financial consequences of this approach are significant, considering a recent RAND Corporation study estimates the cost of reviewing a single gigabyte of data is approximately $18,000. That means excluding 700,000 documents (around 10 to 15 gigabytes of data, depending on file sizes) from review can save organizations significant time and money.
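Using the article's own figures (the RAND estimate of roughly $18,000 per reviewed gigabyte, applied to the 10 to 15 gigabytes excluded in the example above), the savings arithmetic works out as follows; the function name is ours:

```python
# Back-of-the-envelope review savings using the figures cited above.
COST_PER_GB = 18_000  # approximate RAND per-gigabyte review cost, in USD

def review_savings(gigabytes_excluded):
    """Estimated dollars saved by excluding data from manual review."""
    return gigabytes_excluded * COST_PER_GB

print(review_savings(10))  # 180000
print(review_savings(15))  # 270000
```

In other words, excluding 10 to 15 gigabytes from review translates to roughly $180,000 to $270,000 in avoided review costs under the RAND estimate.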
Predictive coding technology is still relatively new to the legal community, and the multiple names are just one source of confusion. Adding to that is the number of competing solutions, consultants, and “experts” in the market that perpetuate misinformation. One claim that causes confusion is the notion that predictive coding renders other TAR tools obsolete. On the contrary, predictive coding technology should be viewed as one of many tools in the litigator's toolbelt that can and should be used independently or in combination with other tools, depending on the needs of the case. Understanding how predictive coding technology and other TAR tools can be used together as part of a defensible e-discovery process can help organizations reduce risk and cost simultaneously. Providing the industry with this basic level of understanding will help ensure that predictive coding technology and related best practices standards evolve in a manner that is fair to all parties, ultimately expediting rather than slowing broader adoption of this promising new technology. To learn more, download a free copy of Predictive Coding for Dummies.