E-discovery: The unfulfilled last chapter in the story of keyword searching
E-discovery seismic shifts have become so common that we are often quick to forget what we haven't yet learned.
June 18, 2013 at 05:15 AM
The original version of this story was published on Law.com
E-discovery seismic shifts have become so common that we are often quick to forget what we haven't yet learned.
What I mean is that the e-discovery technical revolution throws so many new tools at us that we can fixate on them before we have fully understood and adopted earlier, effective process improvements. Keyword searching is a classic example. In the rush toward the varieties of “predictive coding,” “computer-assisted review” and machine learning, we bid keyword searching an abrupt “adieu” before we had gotten it right.
Of course, we can't really say goodbye to keyword searching; it remains a critical part of e-discovery. Just the same, keyword searching hadn't been fully developed and deployed before the e-discovery world moved on to what George Socha of the Electronic Discovery Reference Model (EDRM) calls “the almost irresistible allure of the next bright and shiny ornament.”
What did we miss in getting keyword searching right before moving on? The story starts with the “good old days” of document production before electronic discovery. Back then, after the receipt of a request for production, a round of motion practice would ensue over “general objections” that the document production requests were ambiguous and impossible to understand, overbroad and burdensome, and irrelevant and not calculated to lead to the discovery of admissible evidence. Once these objections were resolved (and following some scant discussions about what documents should be included), the client would send the responsive paper documents to retained counsel. Retained counsel would then review, categorize, redact, stamp, number and deliver the documents to the opposition, with a modest photocopying bill attached.
This model followed retained counsel into the e-discovery era. The only problem is that the documents changed from plain old paper to electronically stored information (ESI), giving rise to a host of complications. The new “see-it-only-on-a-screen” (ah, horrors) digital data didn't lend itself to the old procedures. Leaving aside for the moment the “structured data” in databases, most ESI is disorganized and voluminous, often saved and stored in a careless, chaotic manner that would have gotten any file clerk fired with extreme prejudice in the good old days. For example, email arrives, day in and day out, one after another. Hundreds of messages pile up in the inbox and outbox, week after week. Some of them may be placed in folders, and some “rules” may be applied, but the vast majority of potentially relevant emails sit in mailboxes, mixed in with mostly non-relevant email. Other emails find their way into multiple folders in multiple locations, heedless of their common subjects. In this new environment, retained counsel have to collect entire email mailboxes and folders, which then need to be laboriously searched for relevant email.
And in case you weren't already enjoying the process, here's where the fun and frustration begin. Search terms invariably identify a large number of non-responsive documents (poor precision) and miss a large number of responsive documents (poor recall). So while you're digging through a hundred folders, you have to wonder both whether countless documents in them have anything to do with the case and what important document you may have missed because it somehow ended up in the “Cat Pictures” folder.
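For readers who like to see the arithmetic, precision and recall can be sketched in a few lines of Python. This is a minimal illustration only: the document IDs and the set of "truly responsive" documents are made up, and in a real matter the responsive set is exactly what you do not know in advance.

```python
def precision_recall(retrieved, responsive):
    """Return (precision, recall) for a set of retrieved document IDs.

    precision = share of retrieved documents that are responsive
    recall    = share of responsive documents that were retrieved
    """
    retrieved = set(retrieved)
    responsive = set(responsive)
    hits = retrieved & responsive
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(responsive) if responsive else 0.0
    return precision, recall


# Hypothetical collection: documents 1-4 are the truly responsive ones.
responsive_ids = {1, 2, 3, 4}
# A hypothetical keyword search returned five documents.
search_hits = {3, 4, 7, 8, 9}

precision, recall = precision_recall(search_hits, responsive_ids)
print(f"precision = {precision:.2f}")  # 2 of the 5 hits are responsive
print(f"recall    = {recall:.2f}")     # 2 of the 4 responsive documents were found
```

A search can score well on one measure and badly on the other, which is why both numbers matter when you evaluate a set of search terms.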
How did we make keyword searches better? The answer is multifaceted. Some effective solutions had already been deployed before we became transfixed, ogling the new big thing. For example, Boolean searching has been available to help exclude misidentified documents, while “near” and “within” search capabilities have helped capture phrases. There's also fuzzy logic at your disposal to help identify misspellings, and stemming to catch words with the same root. In short, keyword search has been there all along, functioning nicely as an iterative process of refinement rather than a one-shot game of good guessing. Search terms have evolved into developed “search strings” and “search expressions.” Help was on the way and had arrived.
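To make those refinements concrete, here is a toy Python sketch of three of them: a Boolean exclusion, a “within N words” proximity test and a fuzzy match for misspellings. It assumes nothing about any review platform; the three documents, the function names and the 0.8 similarity threshold are all invented for illustration, and commercial tools use their own syntax and far more capable engines.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within(text, term_a, term_b, distance):
    """True if term_a and term_b occur within `distance` words of each other."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


def fuzzy_contains(text, term, threshold=0.8):
    """True if any word is at least `threshold` similar to `term` (catches typos)."""
    return any(SequenceMatcher(None, w, term).ratio() >= threshold
               for w in tokens(text))


# Hypothetical three-document collection.
docs = {
    "A": "The merger agreement was signed on Friday.",
    "B": "Our fantasy football league needs a merger of the two divisions.",
    "C": "Please review the attached agreemnt before the merger closes.",
}

# Boolean: merger AND NOT football (excludes the misidentified document B).
boolean_hits = sorted(k for k, t in docs.items()
                      if "merger" in tokens(t) and "football" not in tokens(t))

# Proximity: "merger" within 3 words of "agreement".
proximity_hits = sorted(k for k, t in docs.items()
                        if within(t, "merger", "agreement", 3))

# Fuzzy: also catches the misspelled "agreemnt" in document C.
fuzzy_hits = sorted(k for k, t in docs.items()
                    if fuzzy_contains(t, "agreement"))

print(boolean_hits, proximity_hits, fuzzy_hits)
```

Each refinement changes which documents come back, which is the point of treating the search as an iterative process: run it, look at what it caught and missed, and adjust.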
But before we mastered all of these utilities, predictive coding arrived on the heels of some interesting studies, which suggested that even though we thought keyword practice was improving, we still had a long way to go to beat the latest machine-learning algorithms. After all, “Watson” won at Jeopardy. In a flash, the e-discovery world was chasing the new “predictive coding” technologies as if they were robo-heroes and keyword searching a limping dinosaur. Don't get me wrong! These are wonderfully powerful technologies that have been increasingly deployed in significant cases, and they promise even more excitement for the everyday case in the near future. But they are only a few technologies in the big e-discovery toolbox, shiny and new as they may be.
Not to sound curmudgeonly, but whatever happened to that old war horse, the keyword search? Remember, we said that keyword search practice was really improving. Unfortunately, it got sidetracked just as the real capstone to keyword searching was emerging: verification by statistical random sampling. As Craig Ball and others have often noted, you can't just look wide-eyed at the search results; you have to test them. Or as Reagan put it, “Trust, but verify.”
We know, we know: “But attorneys never did that in the good old days. Why now?”
E-discovery is different. We could only look at paper (one sheet at a time) in the good old days, and there really wasn't much of it compared to the now-normal gigabytes and terabytes of litigation data. Now, with all that data to sift through, random sampling is the only practical alternative to a brute-force manual audit. And because we are engaged in searching data (a.k.a. “information retrieval”), with random sampling you can actually know with reasonable, legally defensible confidence that the responsive documents have been produced. Indeed, without random sampling, you cannot know with reasonable confidence whether you are turning over non-responsive documents. And better yet, random sampling is immensely cost-effective. Due to the magic of statistical science, the number of randomly selected documents (in ESI speak, “files”) that need to be reviewed to reach a given confidence level and margin of error tops out quickly and stays essentially flat as the volume of data increases.
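The way the sample size levels off can be checked with a short calculation. The sketch below uses a standard textbook approach (Cochran's sample-size formula with a finite population correction) rather than any method prescribed by a court or vendor; the 95 percent confidence level, the plus-or-minus 5 percent margin of error and the worst-case assumption that half the documents are responsive are illustrative choices, and the function name is ours.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, proportion=0.5):
    """Documents to review in a simple random sample.

    Cochran's formula with a finite population correction. `proportion`
    is the assumed share of responsive documents; 0.5 is the worst case
    and gives the largest sample.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided z-score
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Hypothetical collection sizes, from a small matter to a very large one.
for size in (10_000, 1_000_000, 100_000_000):
    print(f"{size:>11,} documents -> review {sample_size(size)}")
```

Under these assumptions the required sample stays below 400 documents no matter how large the collection grows. Tightening the margin of error raises that ceiling, but the ceiling is still there.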
So let's complete the arrested development of keyword searching. We need the final chapter in its storied development. Statistically sample both your produced and non-produced documents. You might have gotten it right with a stab in the dark (which is, ahem, statistically unlikely), but do you want to take that chance? With statistical sampling, you can be reasonably confident that your productions are legally defensible. And then you can sleep easily at night. While you're at it, make sure the opposition statistically samples the culled documents they are not producing so you can be reasonably confident that you got all the goodies.
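Sampling the non-produced documents might look like the following sketch. It draws a random sample from a hypothetical “null set” of culled documents and turns the reviewers' count of responsive documents into an estimated miss rate with a confidence interval. Everything specific here is an assumption for illustration: the 250,000-document null set, the document ID format, the fixed seed, the sample of 385 and the four responsive documents found. The Wilson score interval is one common way to compute such an interval, not the only defensible one.

```python
import math
import random
from statistics import NormalDist


def wilson_interval(found, sample_size, confidence=0.95):
    """Wilson score interval for the share of responsive documents."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = found / sample_size
    denom = 1 + z ** 2 / sample_size
    center = (p_hat + z ** 2 / (2 * sample_size)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / sample_size
                         + z ** 2 / (4 * sample_size ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical null set: documents the search terms did NOT select.
null_set_ids = [f"DOC-{i:06d}" for i in range(250_000)]

# A fixed seed (arbitrary here) lets the draw be reproduced and documented.
rng = random.Random(2013)
sample_ids = rng.sample(null_set_ids, 385)

# Hypothetical review result: reviewers found 4 responsive documents.
responsive_found = 4
low, high = wilson_interval(responsive_found, len(sample_ids))

print(f"Estimated miss rate: {responsive_found / len(sample_ids):.2%}")
print(f"95% interval: {low:.2%} to {high:.2%}")
print(f"Implied missed documents: about {low * len(null_set_ids):,.0f} "
      f"to {high * len(null_set_ids):,.0f}")
```

If the upper end of that range is more than the case can tolerate, that is the signal to go back and refine the search terms before producing.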
And you can even keep playing with the new predictive coding technologies, which, we note with some irony, depend on statistical sampling! We'll visit that in the next article.
© 2025 ALM Global, LLC, All Rights Reserved. Request academic re-use from www.copyright.com. All other uses, submit a request to [email protected]. For more information visit Asset & Logo Licensing.