In its simplest form, indexing is the process of scanning the text of multiple electronic documents to build a database table of search terms that correspond to those documents. It's what Google does for Internet search, and it's also a key component in e-discovery. Indexing enables optimized search and retrieval and is particularly advantageous for captive, centralized repositories of static data. In the enterprise, that means data archives and business records repositories, which is what “enterprise search” tools were originally designed to manage.
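The idea of a table of search terms that point back to documents can be illustrated with a minimal sketch of an inverted index. The document names, the `tokenize`, `build_index` and `search` helpers, and the simple alphanumeric tokenizing rule are all assumptions made for illustration; commercial indexing engines apply far more elaborate rules.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical documents, used only to exercise the sketch.
documents = {
    "memo.txt": "Quarterly revenue forecast for the board",
    "email_01.txt": "Please review the revenue numbers before Friday",
    "notes.txt": "Board meeting moved to Friday",
}
index = build_index(documents)
print(sorted(search(index, "revenue")))
print(sorted(search(index, "board friday")))
```

Note that the search consults only the table, never the documents themselves. Anything the tokenizer did not extract when the index was built cannot be found later, which is the root of the limitations discussed in this article.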

Indexes are used in e-discovery as well, both at the point of data collection (desktops, laptops, file and email servers, etc.) and at the back end during the processing, analysis and review phase of the Electronic Discovery Reference Model (EDRM) process. Indexing during processing and review is a highly effective way to optimize a search, evaluate a collection and facilitate review.

Although an index-based collection can be an effective collection strategy, it can also be risky if counsel is not aware of how an index operates.

To properly defend a point-of-collection indexed search, counsel will need to be keenly aware of all indexing rules and limitations, many of which are highly technical and often not readily available to the non-expert. Here are the major technical points concerning indexes that counsel should be cognizant of to properly defend an index-based collection:

  • Indexing engines run at the operating system (OS) level to extract and parse the source ESI based on predefined rules. Therefore, indexes may not include all ESI because they are limited by the operating system, the indexing rules and other technology living within the OS. For example, indexes may not capture legacy files from applications the index was not designed to recognize.
  • The bigger the index, the more likely it is to be corrupted. Indexes can easily grow very large, especially if they hold both the parsed data tables and a copy of the actual document.
  • Indexes have become a discovery source themselves since they often hold copies of the documents as well as valuable ESI within the index tables.
  • Many types of encoded text, such as nested data and foreign-language data, are excluded from an indexed search.
  • Corrupted files, ESI that is not text-based, and encrypted or password-protected files will normally not be indexed and are thus excluded from the search results.
  • Indexing engines are programmed to ignore file types deemed unlikely to hold text. For example, an index will not see ESI that has been hidden under a different file extension or stored under extensions that are unknown to the index.
  • Index engines normally don't capture non-text files. Facsimile or TIFF images are classic examples of text-laden documents not captured by an index. Those types of documents must be identified by extension and force-collected.
  • Because indexes consist of archived or migrated copies of original data, they quickly become stale as the original documents change. In the event of a large search, this can often happen before the index on the initial data set is built. Documents held by the custodians continue to be modified, added and deleted, which can also cause inconsistencies with document retention policies.
  • Indexes may change file metadata or fail to collect all associated file metadata.
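Several of the points above (skipped file types, encrypted files and stale entries) can be made concrete with a short sketch of how a point-of-collection indexer might triage files, and of the exception report counsel would want to see. The allow-list of extensions, the `FileRecord` type and the `triage` function are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical allow-list; real indexing engines ship their own rules.
INDEXABLE_EXTENSIONS = {".txt", ".doc", ".eml", ".htm"}


@dataclass
class FileRecord:
    path: str
    modified: float          # last-modified time, seconds since epoch
    encrypted: bool = False  # assumed to be known from a prior check


def triage(files, indexed_at):
    """Split files into indexed, skipped (with a reason), and stale."""
    indexed, skipped, stale = [], {}, []
    for f in files:
        ext = PurePosixPath(f.path).suffix.lower()
        if f.encrypted:
            skipped[f.path] = "encrypted or password-protected"
        elif ext not in INDEXABLE_EXTENSIONS:
            skipped[f.path] = f"extension {ext or '(none)'} not recognized"
        else:
            indexed.append(f.path)
            # Modified after the index was built: the entry is out of date.
            if f.modified > indexed_at:
                stale.append(f.path)
    return indexed, skipped, stale


# Hypothetical custodian files; the index is assumed built at time 200.0.
files = [
    FileRecord("custodian/report.doc", 100.0),
    FileRecord("custodian/fax_0412.tif", 100.0),
    FileRecord("custodian/budget.doc", 100.0, encrypted=True),
    FileRecord("custodian/plan.txt", 250.0),
]
indexed, skipped, stale = triage(files, indexed_at=200.0)
print("indexed:", indexed)
print("skipped:", skipped)
print("stale:", stale)
```

In this sketch the skipped list is the set of files that would have to be force-collected, and the stale list is the set that would have to be re-indexed before a search could be relied upon.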

To this point, industry research from Forrester questions the “archive everything” approach to the preservation and collection of unmanaged data for e-discovery:

Do Not Believe In The “Archive Everything” Approach . . . Efficient e-discovery finds information quickly without any system downtime. Managing large archives results in slower search queries and the need for more advanced culling tools to minimize the responsive information set. In addition, many clients report significant problems managing the indexes required with large archives. If your index breaks every six months and you need to continually re-index, you have erased the risk mitigation benefit that you thought you were achieving.

e-Discovery Best Practices, Forrester Research, Sept. 24, 2007

On the other hand, there are highly effective search tools that do not require an upfront index. A classic example is forensic-based technology like EnCase (TM) eDiscovery that can search the target computers within an enterprise at the disk level. This approach has several advantages: it does not rely on the operating system of the target computer, so it can see everything on the drive; searches can start immediately, since no index needs to be built; it forensically preserves all ESI, including metadata; and it can rapidly cull files at the point of collection.
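To show what searching without an upfront index means in principle, here is a minimal sketch that scans a raw byte stream for a keyword and reports the byte offsets of each hit. It is an illustration only and makes no claim about how EnCase or any other forensic product is implemented; real tools read disks at a much lower level and handle file systems, encodings and deleted space. The `scan_stream` function and the sample data are assumptions.

```python
import io


def scan_stream(stream, keyword, chunk_size=4096):
    """Yield the byte offsets at which keyword occurs in a binary stream."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    # Carry the last len(keyword) - 1 bytes forward so that a match
    # straddling two chunks is still found, and never reported twice.
    overlap = len(keyword) - 1
    tail = b""
    base = 0  # stream offset of the first byte of the current buffer
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer = tail + chunk
        pos = buffer.find(keyword)
        while pos != -1:
            yield base + pos
            pos = buffer.find(keyword, pos + 1)
        tail = buffer[-overlap:] if overlap else b""
        base += len(buffer) - len(tail)


# Hypothetical raw data: the keyword is surrounded by non-text bytes.
data = b"\x00" * 10 + b"merger" + b"\xff" * 5 + b"merger"
hits = list(scan_stream(io.BytesIO(data), b"merger", chunk_size=8))
print(hits)
```

The trade-off is visible even at this scale: nothing has to be built before the search starts and no extraction rule can hide content from it, but every search reads all of the data again.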

By far, the largest legal risk and exposure in the e-discovery process is at the collection and preservation stage. And this exposure is increasing. According to a Gibson Dunn survey, the number of cases in the first five months of 2009 where sanctions were considered and awarded for preservation failures increased two-fold over those considered in the first 10 months of 2008. In 2009, the number of cases where sanctions were awarded related to e-discovery collections totaled 36 percent of cases.

Unless counsel completely understands indexing rules and limitations at the collection stage of the EDRM process, they are exposing their clients to a significant risk that important ESI will be omitted from production and that they will not be able to properly defend the search before a judge.
