Are People the Weak Link in Technology-Assisted Review?
February 01, 2019 at 12:53 PM
In a word, yes. But it's not what you might think.
The weak link preventing technology-assisted review (TAR) from achieving its true potential is a lack of clarity surrounding the technology—the components, the development and the distinctions. No doubt, TAR is seeing greater acceptance and refinement in the legal space. But with a deeper understanding of the technology, TAR can be even more useful and effective.
Understanding the Technology
To start, TAR is a process by which reviewers code documents for some target criteria (e.g., responsiveness), and an algorithm uses those coding decisions to efficiently manage the review of the unseen documents—known as “supervised machine learning.” Some TAR processes manage review by categorizing the remaining documents, others manage by ranking the collection. Either way, the goal is to effectively train the algorithm and minimize the number of documents that need to be reviewed to achieve recall objectives for the target criteria.
If coding decisions are not being used to train the algorithm (known as “unsupervised machine learning”), the process simply is not a TAR process. Therefore, while clustering, near-duplicate analysis and email threading all use technology to aid in the review process, they are not TAR for purposes of this discussion.
A true TAR application has three layers. The base layer consists of feature extraction, where the documents are decomposed into the elements, or “features,” that will be used by the algorithm to evaluate coding decisions, and compare and make decisions about unreviewed documents. On top of feature extraction sits the supervised machine learning algorithm layer. And the entire TAR operation is directed by the “process” layer, which controls all aspects of the training protocol.
Contemporary feature extraction techniques typically focus on the text in the body of individual documents. Features most often consist of individual words or word fragments. However, expanding the feature set to include two- and three-word segments has been found to improve performance. Conversely, feature reduction techniques such as latent semantic indexing, which consolidate multiple words into a single proxy feature, have been shown to degrade performance with most TAR algorithms.
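To make the idea of an expanded feature set concrete, here is a minimal sketch of extracting one-, two- and three-word features from a document's body text. The function name `extract_features`, the simple tokenizer and the sample sentence are assumptions made for illustration; they do not describe any particular TAR product.

```python
import re
from collections import Counter


def extract_features(text, max_n=3):
    """Count word n-grams for n = 1..max_n (hypothetical helper, not a product API)."""
    # Assumption: lowercase and keep runs of letters/digits; real tokenizers are richer.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    features = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the token list.
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


features = extract_features("Breach of contract claim")
print(sorted(features))
```

With `max_n=1` the same document yields only its four individual words; the two- and three-word segments such as "breach of contract" are the additional features that the expanded set contributes.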
As research and development continue, the feature extraction layer is likely to see expansion beyond the body text, and continued refinement to improve TAR efficiency. See, e.g., Jones, Amanda, et al., “The Role of Metadata in Machine Learning for Technology Assisted Review,” DESI VI Workshop, June 8, 2015.
At the next level, the consistent emphasis on identifying the specific TAR algorithm is a prime example of the educational weak link that inhibits progress. With a few exceptions, the supervised machine learning algorithms used in TAR applications will, all other things being equal, produce roughly equivalent results. Whether it's SVM (support vector machine), logistic regression, Naïve Bayes or even a proprietary algorithm, operational differences typically do not depend on the specific algorithm being used.
Certainly, however, there are a few exceptions. The 1-nearest neighbor algorithm has been shown to be somewhat ineffective in e-discovery review applications. And there is simply not enough training data to take advantage of deep learning algorithms in e-discovery. Conversely, incorporating reinforcement learning may well improve the effectiveness of a TAR algorithm.
As an aside to clarify messaging, the fact that TAR applications rely on supervised machine learning algorithms means that TAR is, by definition, using artificial intelligence or AI, since supervised machine learning is indeed one form of AI.
Differentiating TAR 1.0 from TAR 2.0
Perhaps the most significant distinction between TAR applications is found at the process layer, which can be broken down into two principal categories that are most often referred to as TAR 1.0 and TAR 2.0. The primary distinction stems from the protocol for training the algorithm.
In a TAR 1.0 application, documents are reviewed and coded to train the algorithm only until either (1) the algorithm shows no further improvement (referred to as stabilization) or (2) the production metrics of recall and precision appear to be sufficient, typically by reference to a random, representative control set designed to monitor progress. Training usually consists of a few thousand documents. The algorithm will then automatically classify the remaining documents or, alternatively, rank them to facilitate a manual classification. Once classified, the presumptively positive documents may or may not be reviewed and coded, but will not further train the algorithm.
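The control set check described above comes down to two ratios. As a minimal sketch, assuming the control set is represented as two parallel lists of true/false values (the algorithm's classification and the reviewer's coding), recall and precision can be computed as follows. The function name and the ten-document sample are hypothetical.

```python
def recall_precision(predicted, actual):
    """Return (recall, precision) for parallel lists of booleans from a control set."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    # Recall: share of truly positive documents the algorithm found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # Precision: share of documents the algorithm called positive that really are.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    return recall, precision


# Hypothetical ten-document control set: four documents are actually responsive.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]
recall, precision = recall_precision(predicted, actual)
print(recall, precision)  # 0.75 0.75
```

In practice a control set would hold far more than ten documents, since the estimates are only as reliable as the sample behind them.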
TAR 1.0 applications can be further divided into simple passive learning (SPL) and simple active learning (SAL) protocols, depending upon the manner in which training documents are selected. With an SPL protocol, training documents are selected at random. The protocol is “simple” because there is a discrete training phase, after which training ceases regardless of further coding. It is passive because the algorithm plays no part in selecting the training documents. With a SAL protocol, the algorithm typically selects training documents from those about which the algorithm is the least certain. This is known as “uncertainty sampling,” and it is considered an active protocol because the algorithm actively selects the training documents.
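The difference between the two selection methods can be sketched in a few lines. The example below assumes each unreviewed document already carries a score between 0 and 1 representing the algorithm's estimate that it is positive, and treats a score near 0.5 as "least certain." The function names, document identifiers and scores are all hypothetical.

```python
import random


def select_random(doc_ids, k, seed=0):
    """SPL-style selection: a random draw that ignores the algorithm entirely."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), k)


def select_uncertain(scores, k):
    """SAL-style selection: the k documents whose score is closest to 0.5."""
    return sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))[:k]


# Hypothetical scores from a partially trained algorithm.
scores = {"doc1": 0.95, "doc2": 0.52, "doc3": 0.10, "doc4": 0.45, "doc5": 0.70}

uncertain = select_uncertain(scores, 2)
print(uncertain)  # ['doc2', 'doc4']
print(select_random(scores, 2))
```

Uncertainty sampling skips doc1 and doc3, which the algorithm already scores confidently, and asks the reviewer about the borderline documents instead.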
With TAR 2.0, documents are continuously reviewed and coded to train the algorithm until enough positive documents have been located, reviewed, and coded to achieve production objectives. Training documents are primarily selected through relevance feedback, which focuses on documents the algorithm sees as most likely to be positive. This protocol is called continuous active learning (CAL). The protocol is “continuous” because every coding decision is used to train the algorithm. And again, it is active because the algorithm actively selects the training documents. This is typically accomplished by ranking the entire collection so the most likely positive documents at the top can be reviewed first.
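The CAL loop can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "algorithm" is a toy word-weighting scheme standing in for a real supervised learner, the six documents and their coding are invented, and the stopping rule relies on knowing the total number of positive documents, which is possible only in a simulation.

```python
import math
from collections import Counter


def train(coded):
    """Toy stand-in learner: log-odds word weights from (tokens, label) pairs."""
    pos, neg = Counter(), Counter()
    for tokens, label in coded:
        (pos if label else neg).update(set(tokens))
    # Add-one smoothing keeps the ratio defined for words seen on one side only.
    return {t: math.log((pos[t] + 1) / (neg[t] + 1)) for t in set(pos) | set(neg)}


def score(weights, tokens):
    return sum(weights.get(t, 0.0) for t in set(tokens))


def cal_review(docs, truth, seed_id, batch_size=1, target_recall=1.0):
    """Simulate CAL: review the top-ranked documents, retraining after every batch."""
    total_positive = sum(truth.values())  # known only because this is a simulation
    reviewed = [seed_id]
    while True:
        found = sum(truth[d] for d in reviewed)
        unreviewed = [d for d in docs if d not in reviewed]
        if found >= target_recall * total_positive or not unreviewed:
            return reviewed
        # Every coding decision so far feeds the next round of training.
        weights = train([(docs[d], truth[d]) for d in reviewed])
        # Relevance feedback: the most likely positive documents are reviewed first.
        ranked = sorted(unreviewed, key=lambda d: score(weights, docs[d]), reverse=True)
        reviewed.extend(ranked[:batch_size])


# Hypothetical collection; truth plays the role of the reviewer's coding decisions.
docs = {
    "a": "merger agreement draft".split(),
    "b": "merger pricing agreement".split(),
    "c": "lunch menu friday".split(),
    "d": "merger closing schedule".split(),
    "e": "parking lot notice".split(),
    "f": "holiday lunch schedule".split(),
}
truth = {"a": True, "b": True, "c": False, "d": True, "e": False, "f": False}

order = cal_review(docs, truth, seed_id="a")
print(order)  # ['a', 'b', 'd']
```

In this toy run, all three positive documents are located after reviewing three of the six documents, because each coding decision re-ranks what remains. A real review cannot see `truth` in advance and would need a separate validation step to decide when to stop.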
Studies show that CAL (TAR 2.0) is typically more efficient than either TAR 1.0 protocol when the presumptively positive (e.g., responsive) documents will be reviewed. That is simply because, while TAR 1.0 training is very limited, the resulting presumptively positive set contains more negative documents than would be reviewed with CAL.
CAL also overcomes many of the practical obstacles to adoption that are inherent in the operation of TAR 1.0. A control set is not required, making it easier to handle rolling collections. There is no need for a subject matter expert (SME) to train the algorithm to avoid propagating erroneous decisions—CAL is noise tolerant, and our studies have shown that contract review attorneys train the algorithm as well as, and in some cases better than, an SME. Eliminating the SME also means that document review can start immediately, rather than waiting for an SME to code the control set and the training set. And the review can focus on the documents most likely to be positive (i.e., the best or most relevant documents), rather than the random or uncertain documents used to train TAR 1.0 applications.
Advances at the process level are most likely to come from operational refinements and workflow improvements to the CAL protocol. For example, studies show that more frequent ranking tends to improve CAL efficiency. And, since TAR operates at the document level, eliminating family batching will reduce the number of negative documents reviewed. J. Pickens, et al. “Break up the Family: Protocols for Efficient Recall-Oriented Retrieval Under Legally-Necessitated Dual Constraints.” Proceedings of the Second Annual Workshop on Big Data Analytics in the Legal Industry, IEEE Big Data 2018 (Seattle).
Advancing Legal Application
TAR is certainly moving in the direction of greater acceptance by the judiciary. Indeed, the court in Winfield v. City of New York, 2017 WL 5664852 (S.D.N.Y. 2017) essentially directed the use of TAR to improve the pace of discovery. And the New York Commercial Division adopted as Rule 11-e(f) the goal of using the most efficient review techniques, expressly including TAR. This trend will only continue as ESI collections grow, technical familiarity with TAR improves, and proportionality considerations prescribe efficiency.
Courts are necessarily refining the boundaries of cooperation and transparency surrounding TAR protocols, with particular emphasis on demonstrable production deficiencies. See Entrata v. Yardi Systems, No. 2:15-cv-00102 (D. Utah 2018) (rejecting a post hoc demand for sweeping disclosures); Winfield (directing production of a sample of nonresponsive documents to “increase transparency”).
As parties become more sophisticated, there is a greater emphasis on the negotiation and use of TAR protocols in litigation. These protocols can be very comprehensive, addressing a wide range of issues such as keyword culling procedures, transparency obligations, and validation parameters. See In re Broiler Chicken Antitrust Litigation, No. 1:16-cv-08637 (N.D. Ill.) (No. 586).
Sophisticated parties are also taking maximum advantage of TAR techniques both inside and outside the courthouse. When comprehensive review may be unnecessary, such as second requests and subpoena responses, respondents may resort to TAR 1.0 protocols. Conversely, given that review begins immediately, CAL protocols are expanding into early case assessment, investigations and compliance monitoring.
Ultimately, with a clear understanding of the technology, TAR promises to see increasing utility, and significantly enhance document review on any number of fronts. Technological advances and workflow optimization will incrementally improve TAR efficiencies. And knowledgeable innovation will lead to ever-expanding application opportunities.
Thomas Gricks is managing director, Professional Services, Catalyst. Gricks advises corporations and law firms on best practices for applying TAR technology.