Technology: The role of probability & certainty in developing document review strategies
The introduction of probability and certainty allows human reviewers to accurately identify responsive data in large populations while examining only a small percentage of the total data population.
January 24, 2014 at 03:00 AM
10 minute read
The original version of this story was published on Law.com
With corporate legal budgets under continued pressure, inside counsel is often faced with three difficult questions:
- What will the pursuit of this matter cost?
- How long will it take?
- If I work to control my costs, what will that do to my risk?
In pondering the implications of these questions, one's focus quickly centers on document review. Document review now represents 70 percent of e-discovery costs: not a surprising figure when you consider that collecting, processing and hosting a document for review costs roughly five cents ($0.05), while reviewing that document costs roughly $1 using a contract attorney (and many times that amount if a more senior attorney does the review).
Aside from using less expensive attorneys to do the work, the only way to control review costs is to review fewer documents. Experience tells us that most of the time, 80 percent of the documents collected for review are not responsive to the issues. Reading these documents is a waste of time and money. How can we avoid reviewing non-relevant documents?
As an industry, we have been looking for a solution to this problem for a long time. Over the past 10 years, our attention has turned to various forms of artificial intelligence (AI) algorithms to address the time and costs associated with document review. AI algorithms are intriguing, but they are hardly a panacea and they are certainly not for everyone or every matter. Fortunately, AI is not the only way to dramatically reduce document review costs.
In this article, we will demonstrate how to reduce the cost of document review by 85 percent without AI. No, that isn't a typo — it is indeed possible to find every relevant document in a collection by reading just 15 percent of the collection — and that's 15 percent after de-duplication and metadata filtering. You can get these results using mathematics and technology that have been around for more than 50 years (or in the case of the math, hundreds of years).
The process is built on three mathematical principles: certainty, probability and sampling. Taken step by step, they work to dramatically cut the cost of document review.
The process starts by determining certainty. How certain must we be that all relevant documents have been found (or produced)? From a production perspective, the goal is "reasonableness." If we are 85 percent certain to have produced every relevant (non-privileged) document, is that good enough? Maybe the sensitivity of the matter requires that we be 95 percent or even 98 percent certain. Identifying the certainty level is the first step of the process.
Next probability enters the process. Probability determines how much work we will have to do to uncover the relevant documents.
There are two probabilistic events to consider. First, how long will it take reviewers to come across the first relevant document? And, having seen the first one, how long before the appearance of the next one? Another way of saying this is: How many non-relevant documents will reviewers have to slog through before they see the next relevant document? To save time and money, we want to do everything we can to shorten the distance between the appearance of relevant documents.
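The first of these events can be modeled with textbook probability (this framing is mine, not the article's): if a fraction p of the collection is relevant and the "Next Document" is chosen at random, the number of documents read to reach each relevant one follows a geometric distribution with mean 1/p. A minimal sketch:

```python
# Sketch (illustrative, not from the article): expected review effort
# under random document ordering. With richness p, the gap between
# relevant documents is geometrically distributed with mean 1/p.

def expected_docs_per_hit(p: float) -> float:
    """Mean number of documents reviewed per relevant document found,
    assuming a fraction p of the collection is relevant and review
    order is random."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p

# At 20 percent richness (the article's "80 percent not responsive"
# figure), a reviewer reads 5 documents on average per relevant hit.
print(expected_docs_per_hit(0.20))  # 5.0
```

This is why shortening the distance between relevant documents matters: every tactic that raises the effective richness of the remaining pile cuts the expected reading effort proportionally.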
The second probabilistic event concerns the likelihood that a given document is unique. When we see a relevant document, what is the probability that other documents in the collection are relevant for exactly the same reason as this one? Or put another way: How many other documents in the collection contain the same relevant language as the document under review?
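The second event is what makes the workflow pay off: one reviewed document can tag many unreviewed ones. Here is a minimal sketch of that propagation step, with a plain substring match standing in for the Boolean search engine the article describes (the document IDs and text are hypothetical):

```python
# Sketch: propagate a reviewer's highlighted language across the
# collection. A case-insensitive substring match stands in for a real
# Boolean search engine; in practice the highlighted text would be
# turned into a Boolean query.

def tag_matching_docs(collection: dict[str, str], highlighted: str) -> set[str]:
    """Return the IDs of every document containing the highlighted language."""
    needle = highlighted.lower()
    return {doc_id for doc_id, text in collection.items()
            if needle in text.lower()}

docs = {
    "doc1": "The shipment was delayed due to the recall notice.",
    "doc2": "Quarterly earnings exceeded expectations.",
    "doc3": "Please forward the recall notice to all distributors.",
}
print(tag_matching_docs(docs, "recall notice"))  # {'doc1', 'doc3'}
```

Every document tagged this way is removed from the reviewers' queue, which is how a single reviewed document can eliminate dozens of near-identical ones from the reading pile.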
For the third and final step in the process we introduce sampling. A review is complete when we can prove — to the level of certainty agreed upon — that all relevant documents have been identified. This is a simple but hugely consequential statement.
Given these principles, the workflow runs as follows:
- Determine an acceptable level of certainty
- Organize the review so that the “Next Document” selection is driven by a random algorithm (19th century mathematics for finding needles in a haystack)
- When a reviewer sees a relevant document, have them highlight the language that makes the document relevant.
- Using a simple Boolean search engine, find and tag every document in the collection that contains the highlighted language. (Boolean search technology is more than 50 years old.)
- Stop the review when the distance between appearances of relevant documents indicates that the certainty level has been reached.
- Formally test the pile of documents not tagged as relevant to prove that all relevant documents have been identified to the desired level of certainty. (Sampling has its roots in the Bible, where it was called "drawing lots.")
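The final validation step can be quantified with standard zero-acceptance sampling (the formula is textbook statistics, not spelled out in the article): to be C confident that the untagged pile's relevance rate is below p, draw n = ln(1 - C) / ln(1 - p) random documents from it and confirm that none is relevant.

```python
import math

def validation_sample_size(confidence: float, max_miss_rate: float) -> int:
    """Number of documents to sample from the untagged pile, finding
    zero relevant, to show at the given confidence level that the pile's
    relevance rate is below max_miss_rate.
    Solves the zero-acceptance condition (1 - p)**n <= 1 - C for n."""
    if not (0 < confidence < 1 and 0 < max_miss_rate < 1):
        raise ValueError("confidence and max_miss_rate must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_miss_rate))

# 95 percent certainty that fewer than 1 percent of the untagged
# documents are relevant: sample 299 documents and find none relevant.
print(validation_sample_size(0.95, 0.01))  # 299
```

Note how the sample size depends only on the chosen certainty and tolerable miss rate, not on the size of the pile, which is why this test stays cheap even for very large collections.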
The introduction of probability and certainty allows human reviewers to accurately identify responsive data in large populations while examining only a small percentage of the total data population. This workflow has been executed on many dozens of cases and has never failed to deliver.