Using CAL and SAL: Matching Learning Strategies to Predictive Coding Use Cases
The ASU-Arkfeld eDiscovery and Digital Evidence Conference called for papers addressing the progress, challenges, and future of e-discovery, digital evidence and data analytics. Here's the winner.
February 06, 2018 at 08:00 AM
The following article was written by the winners of the “Call for Papers” associated with the Seventh Annual ASU-Arkfeld eDiscovery and Digital Evidence Conference. The conference hosts a competitive annual “Call for Papers” addressing the progress, challenges, and future of e-discovery, digital evidence, and data analytics. The article below was accepted as 2018's winning paper.
The authors of this winning paper will be presenters at the conference on March 6-8, 2018 in Phoenix, Arizona, at ASU's Sandra Day O'Connor College of Law. Those interested may register for the conference at discounted early bird rates until Friday, Feb. 9, and use discount code LTNArkfeld2018 for an extra 15% off: http://events.asucollegeoflaw.com/ediscovery/register/
One of the most common things we hear as Predictive Coding specialists is “We want to use Continuous Active Learning (CAL) on this project.” The term “CAL” has come to signify the ultimate method of TAR training, and also implies that other methods, such as Simple Active Learning (SAL), are things of the past that should be immediately discarded. While we agree CAL can be a very useful training strategy to employ, it is helpful to remember the purpose for which you are using Predictive Coding on your particular matter, and choose a learning strategy that fits that goal more closely.
The term “Continuous Active Learning,” or “CAL,” derives from a study by Gordon Cormack and Maura Grossman demonstrating its superiority over other training methods at improving recall. Since then, CAL has developed a reputation as the most effective method of training Predictive Coding, and it has been heralded as “TAR 2.0” amid a slew of other marketing-driven terms singing its praises. Despite being pitched as “the next big thing” by vendor marketing departments, the approach has actually been around since the onset of machine learning years ago.
The logic of a CAL training strategy is very simple: Continue to prioritize high scoring documents for training until no more relevant documents remain. Every time system learning occurs, the system refreshes the document rankings so documents you train are the highest scoring available at that time, and therefore, most likely relevant.
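That loop can be sketched in a few lines of code. This is our own toy illustration, not any vendor's implementation: the centroid-based scorer stands in for a real trained classifier, and the `oracle` function stands in for human reviewer decisions.

```python
# A minimal sketch of a CAL loop (illustrative only).

def relevance_scores(docs, labeled):
    """Toy scorer: dot-product similarity to the centroid of documents
    already labeled relevant (stands in for a real classifier's scores)."""
    relevant = [docs[i] for i, lab in labeled.items() if lab]
    if not relevant:
        return {i: 0.0 for i in docs}
    dim = len(relevant[0])
    centroid = [sum(v[k] for v in relevant) / len(relevant) for k in range(dim)]
    return {i: sum(a * b for a, b in zip(v, centroid)) for i, v in docs.items()}

def cal_review(docs, oracle, seed, batch_size=2, max_rounds=50):
    """Continuous Active Learning: always review the highest-ranked
    unreviewed documents, refreshing the rankings after every batch."""
    labeled = {i: oracle(i) for i in seed}              # initial seed set
    for _ in range(max_rounds):
        unreviewed = [i for i in docs if i not in labeled]
        if not unreviewed:
            break
        scores = relevance_scores(docs, labeled)        # refresh rankings
        # review the top-ranked remaining documents (rank, not score, matters)
        batch = sorted(unreviewed, key=lambda i: scores[i], reverse=True)[:batch_size]
        for i in batch:
            labeled[i] = oracle(i)                      # reviewer's decision
    return labeled

docs = {0: (1.0, 0.0), 1: (0.9, 0.1), 2: (0.0, 1.0), 3: (0.1, 0.9), 4: (0.8, 0.2)}
truth = {0: True, 1: True, 2: False, 3: False, 4: True}
labeled = cal_review(docs, lambda i: truth[i], seed=[0, 2])
# the first refreshed batch surfaces documents 1 and 4 (both relevant)
```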
We do not dispute that CAL is an efficient way to train Predictive Coding; in fact, we agree wholeheartedly, and use it often. However, it is important to remember that this method takes an ordinal approach to review: documents are trained and reviewed in descending order of rank, and it is the rank (not necessarily the score value) that matters. CAL also has a tendency to collaterally elevate the scores of irrelevant documents whose content is similar to relevant ones, increasing the volume of false positives and therefore reducing precision. These erroneously escalated documents are trained as irrelevant only as they come up in the review.
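To make that precision/recall trade-off concrete, here is a small helper (again our own illustration, with made-up document ID sets) computing both measures over a set of system-suggested documents:

```python
def precision_recall(suggested, truly_relevant):
    """Precision: fraction of suggested documents that are truly relevant.
    Recall: fraction of truly relevant documents that were suggested."""
    tp = len(suggested & truly_relevant)                # true positives
    precision = tp / len(suggested) if suggested else 0.0
    recall = tp / len(truly_relevant) if truly_relevant else 0.0
    return precision, recall

# Collateral elevation in miniature: suggesting two false positives (8, 9)
# alongside all four relevant documents keeps recall at 1.0 but drops
# precision to 4/6.
p, r = precision_recall({1, 2, 3, 4, 8, 9}, {1, 2, 3, 4})
```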
While CAL may improve the ordering of documents by likely relevance relative to one another, the objective scores themselves make it hard to tell what is relevant, what is not, and where to draw that line. Unfortunately, that means unless you spend the time to train everything down to the point where you would consider ending review, it is difficult to estimate how many documents you will need to review. Initial sampling can help in this regard, but it is still difficult to plan staffing and deadlines when the location of the finish line is uncertain. From a practical standpoint, this can be frustrating, particularly since many prefer to use Predictive Coding as a culling tool, determining what to send to review or production, rather than as a feature designed to make the review more efficient. If you are relying on system suggestions prospectively, such as in high-volume, tight-deadline situations (second requests, large multi-jurisdiction litigation, etc.) where you cannot review or train every document, you may have to work on improving the system's suggestions more broadly instead. That use case requires the more traditional Predictive Coding approach, since a CAL review method may be impractical.
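On the point about initial sampling: a simple random sample can estimate richness (the proportion of relevant documents) and thereby project roughly how large the review will be. A sketch under stated assumptions — the labeling function here is an artificial stand-in for reviewer judgments, and the margin uses the standard normal approximation:

```python
import math
import random

def estimate_prevalence(population_ids, label_fn, sample_size=400, seed=1):
    """Estimate the proportion of relevant documents from a simple random
    sample, with a normal-approximation 95% margin of error."""
    rng = random.Random(seed)
    sample = rng.sample(population_ids, sample_size)
    hits = sum(1 for doc_id in sample if label_fn(doc_id))
    p = hits / sample_size
    margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

# A 100,000-document population in which every tenth document is relevant:
p, margin = estimate_prevalence(range(100_000), lambda i: i % 10 == 0)
# p lands near 0.10, so roughly p * 100,000 relevant documents to expect.
```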
The traditional Predictive Coding training method is often referred to as Simple Active Learning (SAL) or “TAR 1.0,” but can also be described as “training the unknowns,” “focus document training,” or “uncertainty sampling.” SAL is about building a model rather than an order. With this method, the system tries to find the cut-off between relevant and not relevant and to refine that threshold more clearly. Here, we train the documents near the suggestion threshold, as these are tied to the concepts the machine learning finds most questionable. SAL thus tries to carve out false positives preemptively, instead of at the point of review based on rank order, as CAL does. Viewed from the perspective of scores in the database: whereas CAL drags relevant (and similar irrelevant) documents upward in score, increasing recall to the initial detriment of precision, SAL separates the irrelevant from the relevant, increasing precision while maintaining recall.
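The contrast in selection strategy can be shown side by side. Given the same hypothetical model scores, CAL queues the top-ranked unreviewed documents while SAL queues those nearest the suggestion threshold (a sketch only; real platforms wrap this selection step inside the full training loop):

```python
def cal_batch(scores, reviewed, batch_size=2):
    """CAL: the highest-scoring documents not yet reviewed."""
    pool = [i for i in scores if i not in reviewed]
    return sorted(pool, key=lambda i: scores[i], reverse=True)[:batch_size]

def sal_batch(scores, reviewed, threshold=0.5, batch_size=2):
    """SAL (uncertainty sampling): documents nearest the suggestion
    threshold, where the model is least certain."""
    pool = [i for i in scores if i not in reviewed]
    return sorted(pool, key=lambda i: abs(scores[i] - threshold))[:batch_size]

scores = {0: 0.95, 1: 0.80, 2: 0.52, 3: 0.40, 4: 0.10}
reviewed = {0}
# cal_batch picks documents 1 and 2 (highest remaining scores);
# sal_batch picks documents 2 and 3 (closest to the 0.5 threshold).
```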
Considering the positive attributes of both methods, we tend to utilize SAL earlier in the discovery process, on very large projects to help determine the review or production set, or in situations where project duration or training resources are very limited. We tend to see CAL used more often in a generic review setting, where eyes-on review training is more feasible.
This does not preclude applying either model on a limited basis to improve results in a particular use case; using a limited CAL approach on a second request to help improve recall, or SAL on an ongoing review to help improve precision, can, does, and should happen when warranted. Understanding how best to utilize each goes a long way toward understanding that a one-size-fits-all solution (as CAL is often heralded) is not always the best solution.
In conclusion, both CAL and SAL training strategies can be helpful in a TAR review. As skilled e-discovery practitioners understand, knowing when to use each method should depend on the circumstances of the project, not on the marketing.