Technology: The role of probability & certainty in developing document review strategies
The introduction of probability and certainty allows human reviewers to accurately identify responsive data in large populations while examining only a small percentage of the total data population.
January 24, 2014 at 03:00 AM
The original version of this story was published on Law.com
With corporate legal budgets under continued pressure, inside counsel is often faced with answering three difficult questions:
- What will the pursuit of this matter cost?
- How long will it take?
- If I work to control my costs, what will that do to my risk?
In pondering the implications of these questions, one's focus quickly centers on document review. Document review now represents 70 percent of e-discovery costs. That figure is not surprising when you consider that collecting, processing and hosting a document for review costs roughly five cents ($0.05), while reviewing a document costs roughly $1 using a contract attorney (and many times that amount if a more senior attorney does the review).
Aside from using less expensive attorneys to do the work, the only way to control review costs is to review fewer documents. Experience tells us that most of the time, 80 percent of the documents collected for review are not responsive to the issues. Reading these documents is a waste of time and money. How can we avoid reviewing non-relevant documents?
As an industry, we have been looking for a solution to this problem for a long time. Over the past 10 years, our attention has turned to various forms of artificial intelligence (AI) algorithms to address the time and costs associated with document review. AI algorithms are intriguing, but they are hardly a panacea, and they are certainly not for everyone or every matter. More to the point, AI is not the only way to dramatically reduce document review costs.
In this article, we will demonstrate how to reduce the cost of document review by 85 percent without AI. No, that isn't a typo: it is indeed possible to find every relevant document in a collection by reading just 15 percent of the collection, and that is 15 percent after de-duplication and metadata filtering. You can get these results using mathematics and technology that have been around for more than 50 years (or, in the case of the math, hundreds of years).
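To make the arithmetic concrete, here is a minimal sketch of what reading 15 percent of a collection means in dollars. It uses the roughly $1 per document contract-attorney figure quoted above; the collection size of 100,000 documents and the function name `review_cost` are assumptions chosen only for illustration.

```python
def review_cost(num_documents: int, fraction_read: float,
                cost_per_document: float = 1.00) -> float:
    """Cost of having reviewers read a given fraction of the collection."""
    if not 0.0 <= fraction_read <= 1.0:
        raise ValueError("fraction_read must be between 0 and 1")
    return num_documents * fraction_read * cost_per_document


# Hypothetical collection size (after de-duplication and metadata filtering).
collection_size = 100_000

full_cost = review_cost(collection_size, 1.0)      # read every document
reduced_cost = review_cost(collection_size, 0.15)  # read 15 percent
savings = full_cost - reduced_cost

print(f"Full review:    ${full_cost:,.0f}")
print(f"Reduced review: ${reduced_cost:,.0f}")
print(f"Savings:        ${savings:,.0f} ({savings / full_cost:.0%})")
```

The sketch covers review cost only; the per-document cost to collect, process and host is the same under either approach.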
The process is built on three mathematical principles: certainty, probability and sampling. Taken step by step, they work to dramatically cut the cost of document review.
The process starts by determining certainty. How certain must we be that all relevant documents have been found (or produced)? From a production perspective, the goal is “reasonableness.” If we are 85 percent certain that we have produced every relevant (non-privileged) document, is that good enough? Maybe the sensitivity of the matter requires that we be 95 percent or even 98 percent certain. Identifying the certainty level is the first step of the process.
Next, probability enters the process. Probability determines how much work we will have to do to uncover the relevant documents.
There are two probabilistic events to consider. First, how long will it take reviewers to come across the first relevant document? And, having seen the first one, how long before the appearance of the next one? Another way of saying this is: How many non-relevant documents will reviewers have to slog through before they see the next relevant document? To save time and money, we want to do everything we can to shorten the distance between the appearance of relevant documents.
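The "distance" between relevant documents can be estimated with a simple model. The sketch below assumes documents are drawn at random and that each draw is relevant with a fixed probability (the prevalence), which makes the gap follow a geometric distribution. That model and the function name `expected_gap` are assumptions for illustration, not something the workflow prescribes.

```python
def expected_gap(prevalence: float) -> float:
    """Expected number of non-relevant documents read before the next
    relevant one, assuming independent random draws in which each
    document is relevant with probability `prevalence`."""
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    return (1.0 - prevalence) / prevalence


# 0.20 matches the rule of thumb that 80 percent of collected documents
# are not responsive; the lower values are hypothetical.
for prevalence in (0.20, 0.05, 0.01):
    gap = expected_gap(prevalence)
    print(f"prevalence {prevalence:.0%}: about {gap:.0f} "
          f"non-relevant documents between relevant ones")
```

Under this model the gap grows quickly as relevant documents become rarer, which is why a lengthening gap carries information about how many relevant documents remain.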
The second probabilistic event concerns the likelihood that a given document is unique. When we see a relevant document, what is the probability that other documents in the collection are relevant for exactly the same reason as this one? Or put another way: How many other documents in the collection contain the same relevant language as the document under review?
For the third and final step in the process we introduce sampling. A review is complete when we can prove — to the level of certainty agreed upon — that all relevant documents have been identified. This is a simple but hugely consequential statement.
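One common way to size such a test is a zero-hit sample: draw a random sample from the documents not tagged as relevant and confirm that none of them is relevant. The sketch below is an illustration under stated assumptions, not the article's prescribed test. In particular, the `tolerated_rate` parameter (the highest rate of missed relevant documents we are willing to leave undetected) is an assumption introduced here, because a sample can bound the rate of missed documents but cannot show that it is exactly zero.

```python
import math


def sample_size_for_zero_hits(certainty: float, tolerated_rate: float) -> int:
    """Smallest sample size n such that, if relevant documents made up at
    least `tolerated_rate` of the untagged pile, a random sample of n
    documents containing no relevant document would occur with probability
    at most 1 - certainty. Assumes independent draws from a large pile."""
    if not 0.0 < certainty < 1.0 or not 0.0 < tolerated_rate < 1.0:
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - certainty) / math.log(1.0 - tolerated_rate))


# The certainty levels are the ones discussed above; the 1 percent
# tolerated rate is a hypothetical choice.
for certainty in (0.85, 0.95, 0.98):
    n = sample_size_for_zero_hits(certainty, tolerated_rate=0.01)
    print(f"{certainty:.0%} certainty: sample {n} untagged documents")
```

Note that the sample size depends on the certainty level and the tolerated rate, not on the size of the collection, so the test stays cheap even for very large populations.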
Given these principles, the workflow runs as follows:
- Determine an acceptable level of certainty.
- Organize the review so that the “Next Document” selection is driven by a random algorithm. (This is 19th-century mathematics for finding needles in a haystack.)
- When a reviewer sees a relevant document, have the reviewer highlight the language that makes the document relevant.
- Using a simple Boolean search engine, find and tag every document in the collection that contains the highlighted language. (Boolean search technology is more than 50 years old.)
- Stop the review when the distance between appearances of relevant documents has grown long enough to indicate that the certainty level has been reached.
- Formally test the pile of documents not tagged as relevant to prove that all relevant documents have been identified to the desired level of certainty. (Sampling has its roots in the Bible, where it was called “drawing lots.”)
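The steps above can be sketched as a small simulation. Everything in it is an assumption made for illustration: the toy collection, the `is_relevant` and `key_phrase` callables that stand in for the human reviewer, the substring match that stands in for a Boolean search engine, and the stopping rule, which reads "distance between relevant documents" as a run of consecutive non-relevant documents long enough to be unlikely if relevant documents still made up `tolerated_rate` of the remainder.

```python
import math
import random


def run_review(documents, is_relevant, key_phrase,
               certainty=0.95, tolerated_rate=0.01, seed=0):
    """Return (tagged_ids, documents_read) for a randomized review."""
    # Run of consecutive non-relevant documents that triggers the stop.
    stop_after = math.ceil(math.log(1.0 - certainty)
                           / math.log(1.0 - tolerated_rate))
    order = list(documents)
    random.Random(seed).shuffle(order)  # random "Next Document" selection

    tagged, read_count, misses = set(), 0, 0
    for doc_id in order:
        if doc_id in tagged:
            continue  # already tagged by a phrase search, no need to read
        read_count += 1
        text = documents[doc_id]
        if is_relevant(text):
            phrase = key_phrase(text).lower()  # the "highlighted" language
            # Stand-in for a Boolean search: tag every document that
            # contains the highlighted phrase.
            for other_id, other_text in documents.items():
                if phrase in other_text.lower():
                    tagged.add(other_id)
            tagged.add(doc_id)
            misses = 0
        else:
            misses += 1
            if misses >= stop_after:
                break
    return tagged, read_count


# Toy collection: 5,000 messages, one in five containing a hypothetical
# relevant phrase.
phrases = ["price fixing", "side letter", "backdated invoice",
           "off the books", "destroy the memo"]
documents = {}
for i in range(5000):
    if i % 5 == 0:
        phrase = phrases[(i // 5) % len(phrases)]
        documents[i] = f"Message {i}: please review the {phrase} issue."
    else:
        documents[i] = f"Message {i}: lunch menu and parking reminders."

tagged, read_count = run_review(
    documents,
    is_relevant=lambda text: any(p in text for p in phrases),
    key_phrase=lambda text: next(p for p in phrases if p in text),
)
print(f"Tagged {len(tagged)} documents after reading {read_count} "
      f"of {len(documents)} ({read_count / len(documents):.1%})")
```

The toy data is far more repetitive than a real collection, so the fraction read here says nothing about real matters. The point is only to show how random selection, phrase tagging and a stopping rule fit together; the formal sample of the untagged pile would follow as a separate step.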
The introduction of probability and certainty allows human reviewers to accurately identify responsive data in large populations while examining only a small percentage of the total data population. This workflow has been executed on many dozens of cases and has never failed to deliver.