Know Your Index!
Indexing can be effective, but counsel should be aware of the technical aspects.
November 17, 2010 at 07:00 PM
The original version of this story was published on Law.com
In its simplest form, indexing is the process of scanning the text of multiple electronic documents to build a database table of search terms that correspond to those documents. It's what Google does for Internet search, and it's also a key component in e-discovery. Indexing enables optimized search and retrieval, and it works best on captive, centralized repositories of static data. In the enterprise, that means data archives and business records repositories, which is what “enterprise search” tools were originally designed to manage.
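The "database table of search terms" described above is commonly implemented as an inverted index. The following is a minimal sketch in Python, not a depiction of any particular e-discovery product; the function names (`build_index`, `search`) and the sample documents are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Split the text into simple alphanumeric terms.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical sample collection.
documents = {
    "memo.txt": "Quarterly revenue forecast for the merger",
    "email.txt": "Please review the merger agreement",
    "notes.txt": "Lunch order for Friday",
}
index = build_index(documents)
```

A search consults only the table, never the documents themselves, which is why it is fast and also why anything the indexer failed to parse can never appear in the results.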
Indexes are used in e-discovery as well, both at the point of data collection (desktops, laptops, file and email servers, etc.) and at the back end during the processing, analysis and review phase of the Electronic Discovery Reference Model (EDRM) process. Indexing during processing and review is a highly effective way to optimize a search, evaluate a collection and facilitate review.
Although an index-based collection can be an effective collection strategy, it can also be risky if counsel is not aware of how an index operates.
To properly defend a point-of-collection indexed search, counsel will need to be keenly aware of all indexing rules and limitations, many of which are highly technical and often not readily available to the non-expert. Here are the major technical points about indexes that counsel should understand:
- Indexing engines run at the operating system (OS) level to extract and parse the source ESI based on predefined rules. Therefore, indexes may not include all ESI because they are limited by the operating system, the indexing rules and other technology living within the OS. For example, indexes may not capture legacy files from applications the index was not designed to recognize.
- The bigger the index, the more likely it is to be corrupted. Indexes can easily grow very large, especially if they hold both the parsed data tables and a copy of the actual document.
- Indexes have become a discovery source themselves since they often hold copies of the documents as well as valuable ESI within the index tables.
- Many types of encoded text, such as nested data and foreign-language data, are excluded from an indexed search.
- Corrupted files, ESI that is not text-based, and encrypted or password-protected files will normally not be indexed and are thus excluded from the search results.
- Indexing engines are programmed to ignore file types deemed unlikely to hold text. For example, an index will not see ESI that has been hidden under a different file extension or stored under an extension that is unknown to the index.
- Index engines normally don't capture non-text files. Facsimile or TIFF images are classic examples of text-laden documents not captured by an index. Those types of documents must be identified by extension and force-collected.
- Because indexes consist of archived or migrated copies of original data, they quickly become stale as the original documents change. In the event of a large search, this can often happen before the index on the initial data set is built. Documents held by the custodians continue to be modified, added and deleted, which can also cause inconsistencies with document retention policies.
- Indexes may change file metadata or fail to collect all associated file metadata.
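Several of the limitations above (unrecognized file types, encrypted files and stale entries) can be surfaced with an exception report that lists what an index-based search is likely to miss. The sketch below is illustrative only: the `INDEXED_EXTENSIONS` allow-list, the `FileRecord` fields and the sample files are assumptions, not the behavior of any specific indexing engine.

```python
import os
from dataclasses import dataclass
from datetime import datetime

# Hypothetical allow-list of extensions the indexing engine parses for text.
INDEXED_EXTENSIONS = {".txt", ".doc", ".docx", ".pdf", ".msg", ".xls"}


@dataclass
class FileRecord:
    path: str
    modified: datetime
    encrypted: bool = False


def exception_report(files, index_built_at):
    """Group files that an index-based search might miss, by first matching reason."""
    report = {"unindexed_type": [], "encrypted": [], "stale": []}
    for f in files:
        ext = os.path.splitext(f.path)[1].lower()
        if ext not in INDEXED_EXTENSIONS:
            # Extension is not on the allow-list, so the file is skipped entirely.
            report["unindexed_type"].append(f.path)
        elif f.encrypted:
            # The engine cannot read the text of an encrypted file.
            report["encrypted"].append(f.path)
        elif f.modified > index_built_at:
            # The file changed after the index was built, so the entry is stale.
            report["stale"].append(f.path)
    return report


# Hypothetical custodian files and index build time.
files = [
    FileRecord("fax_0412.tif", datetime(2010, 1, 5)),
    FileRecord("contract.docx", datetime(2010, 3, 1)),
    FileRecord("budget.xls", datetime(2010, 1, 2), encrypted=True),
    FileRecord("memo.txt", datetime(2010, 1, 1)),
]
report = exception_report(files, index_built_at=datetime(2010, 2, 1))
```

In this sample, the TIFF fax, the encrypted spreadsheet and the contract edited after the index was built would each need separate handling, while only `memo.txt` is fully covered by the index.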
To this point, industry research from Forrester questions the “archive everything” approach to the preservation and collection of unmanaged data for e-discovery:
Do Not Believe In The “Archive Everything” Approach . . . Efficient e-discovery finds information quickly without any system downtime. Managing large archives results in slower search queries and the need for more advanced culling tools to minimize the responsive information set. In addition, many clients report significant problems managing the indexes required with large archives. If your index breaks every six months and you need to continually re-index, you have erased the risk mitigation benefit that you thought you were achieving.
e-Discovery Best Practices, Forrester Research, Sept. 24, 2007
On the other hand, there are highly effective search tools that do not require an upfront index. A classic example is forensic-based technology like EnCase (TM) eDiscovery, which can search the target computers within an enterprise at the disk level. This approach has several advantages: it does not rely on the operating system of the target computer, so it can see everything on the drive; searches can start immediately, since no index needs to be built; it forensically preserves all ESI, including metadata; and it can rapidly cull files at the point of collection.
By far, the largest legal risk and exposure in the e-discovery process is at the collection and preservation stage. And this exposure is increasing. According to a Gibson Dunn survey, the number of cases in the first five months of 2009 where sanctions were considered and awarded for preservation failures increased two-fold over those considered in the first 10 months of 2008. In 2009, the number of cases where sanctions were awarded related to e-discovery collections totaled 36 percent of cases.
Unless counsel completely understands indexing rules and limitations at the collection stage of the EDRM process, they are exposing their clients to a significant risk that important ESI will be omitted from production and that they will not be able to properly defend the search before a judge.