It has become increasingly common for both in-house and defense counsel to be confronted with the task of analyzing a large incoming document production. Incoming collections present special challenges. The team may be less familiar with the language of the other side's documents, and many of the original players in the events may no longer be readily available for questioning. In spite of these difficulties, counsel must answer four fundamental questions:

  1. Did we receive the documents we asked for?
  2. Based on what we received, did we ask for the right things?
  3. What do these documents actually say about the issues?
  4. Which documents will become exhibits?

Examining large incoming collections can be very time-consuming and expensive, and most attorneys tend to have a low tolerance for these costs. This task is also not well suited to the various AI solutions that focus on determining what subjects a document is about, but not what the document actually says about the subject. In this article, I want to demonstrate a highly effective workflow for examining incoming collections. The workflow is based on the mathematical consequences of the Poisson distribution as applied to the observation of rare events.

The good news is that attorneys don't need to understand the “why” of the mathematics to take advantage of the “how.” Leveraging Poisson mathematics can dramatically reduce the number of documents that need to be reviewed to understand what a collection actually says about the issues; in the example worked through in Step 5, the reduction is from 100,000 documents to 15,000. The method is easy to deploy, easy to understand and very useful. Here's how it works.

Step 1 – Create an organizational structure

Define an organizational schema for the incoming collection or, put more simply, define a set of categories that will be used to organize the incoming documents. Most likely your team has already submitted a document request to the other side, and the categories defined in that request can form the basis of your schema. It's always good to review the request to make sure that there is as little overlap as possible between the categories.

Step 2 – Populate the categories

Identify the keywords and phrases that will be associated with each category. You will use these words and phrases to place the documents into the appropriate categories. I usually extract the vocabulary of the collection and arrange the root words by parts of speech to help with this task.
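
For readers who want to see what this step might look like in practice, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular review platform: the category names, the keywords, and the helper names build_patterns and categorize are all hypothetical. Keywords are matched as root words, so "discount" also catches "discounted," and any document that matches nothing lands in an "uncategorized" bucket.

```python
import re
from collections import defaultdict

# Hypothetical categories and root-word keywords, for illustration only.
CATEGORY_KEYWORDS = {
    "pricing": ["price list", "discount", "rebate"],
    "safety testing": ["test report", "failure rate", "recall"],
}


def build_patterns(category_keywords):
    """Compile one case-insensitive pattern per category.

    The leading word boundary with no trailing boundary means a root word
    such as "discount" also matches "discounts" and "discounted".
    """
    patterns = {}
    for category, keywords in category_keywords.items():
        alternatives = "|".join(re.escape(kw) for kw in keywords)
        patterns[category] = re.compile(r"\b(?:" + alternatives + r")", re.IGNORECASE)
    return patterns


def categorize(documents, patterns):
    """Map each category name to the list of document ids that match it.

    `documents` is a dict of document id -> extracted text. A document can
    land in several categories; one that matches none goes to "uncategorized".
    """
    buckets = defaultdict(list)
    for doc_id, text in documents.items():
        matched = [cat for cat, pattern in patterns.items() if pattern.search(text)]
        for category in matched or ["uncategorized"]:
            buckets[category].append(doc_id)
    return dict(buckets)
```

The "uncategorized" bucket matters because it is the set of documents that did not fit any category, which is what Step 3 examines.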

Step 3 – Examine the categories

The first question is: Did we get what we expected? The second question is: What else did we get? To answer these questions, review a random sample of the documents that did not fit into any category. Can you prove with a 95 percent confidence level that all of these documents are not relevant? If you can, then you are ready to move on to Step 4.

If you can't, then one of two things must be true: the uncategorized documents contain some unexpected keywords that you must now add, or the collection contains documents about topics that you hadn't thought of as relevant and must now reconsider. Continue your cleanup until you can prove with 95 percent certainty that none of the uncategorized documents are relevant to your case.
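
The sampling check in this step can be sketched in a few lines of Python. The function names draw_sample and zero_hit_upper_bound are hypothetical, and the sketch assumes a simple random sample in which reviewers found no relevant documents. Strictly speaking, a clean sample does not show that the relevant share is zero; it shows, at the chosen confidence level, that the share is below a small bound, and the second function computes that bound.

```python
import random


def draw_sample(doc_ids, sample_size, seed=None):
    """Pick a simple random sample of document ids, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    pool = list(doc_ids)
    return rng.sample(pool, min(sample_size, len(pool)))


def zero_hit_upper_bound(sample_size, confidence=0.95):
    """Upper bound on the relevant share when the sample had zero relevant hits.

    Solves (1 - p) ** sample_size = 1 - confidence for p. It treats the
    pile as large relative to the sample, which is an assumption.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / sample_size)
```

With a clean sample of 300 uncategorized documents, the bound comes out just under 1 percent of the uncategorized pile. If the sample does turn up relevant documents, the keywords or categories need another pass, as described above.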

Step 4 – Identify a document strategy for each category

Part of the reason for reviewing the incoming collection is to identify documents that will be used as exhibits. For each category, there are three strategic possibilities:

  1. We are seeking enough good documents to make our point.
  2. We need to find every possible example document.
  3. We hope to find a smoking gun.

It is important to define a strategy for each category so that you know when you have accomplished your goal and can move on to the next category.

Step 5 – Examine the documents

This is where the Poisson mathematics comes in. To explain the process, let's consider an example. Assume that the organizational schema consists of 50 categories and that each category has been populated with 2,000 documents.

Query: Do you need to read all 100,000 documents to understand what the collection says about each of the 50 issues? Poisson says “no.” You need only read 15,000.

The gist of the mathematics is as follows: To be 95 percent certain of seeing any given piece of relevant language that appears in more than 1 percent of the documents in the category (a “rare event”), you need only read 300 randomly selected documents in that category. In other words, if you read 300 randomly selected documents from each category, the only relevant language you are likely to miss is language that appears in fewer than 20 (1 percent) of the 2,000 documents in that category.
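
For those who do want a glimpse of the "why," the 300-document figure can be reproduced with a short calculation. The Python sketch below is my own illustration, and the function names are hypothetical. It assumes documents are drawn at random and that the language in question appears in a fixed share of the category. Under the Poisson approximation, the chance of seeing no example in n documents is about exp(-n × share); setting that equal to 5 percent and solving for n gives the sample size.

```python
import math


def sample_size_poisson(min_share, confidence=0.95):
    """Documents to read so that language present in at least `min_share`
    of a category is seen at least once with the given confidence.

    Poisson approximation: P(no sighting) = exp(-n * min_share).
    """
    return math.ceil(-math.log(1.0 - confidence) / min_share)


def sample_size_binomial(min_share, confidence=0.95):
    """Same question without the approximation: P(no sighting) = (1 - min_share) ** n."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - min_share))


def total_review(categories, min_share, confidence=0.95):
    """Total documents to read across all categories."""
    return categories * sample_size_poisson(min_share, confidence)
```

With a 1 percent threshold and 95 percent confidence, the Poisson form gives 300 documents per category and the exact binomial form gives 299, so the round figure of 300 is slightly conservative. Fifty categories at 300 documents each is the 15,000 quoted above. Note that the sample size depends on the threshold and the confidence level, not on the 2,000 documents in the category.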

That is certainly enough language to both understand what the documents say about the issues and to find your exhibits, unless you are looking for a smoking gun. In that case, you may have to do a bit more work to reach the same certainty level.
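
To get a feel for how much more work a smoking gun can require, the sketch below computes the chance that a random sample contains at least one of a handful of target documents. It is an illustration only: the function names are hypothetical, and it assumes the sample is drawn at random without replacement from a category of known size. Because the targets may be very few, it uses the exact count-based (hypergeometric) calculation rather than the Poisson approximation.

```python
import math


def chance_of_seeing(category_size, matching_docs, sample_size):
    """Probability that a random sample (without replacement) contains at
    least one of the `matching_docs` documents in the category."""
    # Ways to draw a sample that avoids every matching document, divided by
    # all possible samples. math.comb returns 0 when the sample is too large
    # to avoid them all, so the probability of a miss is then 0.
    missed = math.comb(category_size - matching_docs, sample_size) / math.comb(
        category_size, sample_size
    )
    return 1.0 - missed


def sample_size_for(category_size, matching_docs, confidence=0.95):
    """Smallest sample that reaches the confidence level, found by brute force."""
    for n in range(category_size + 1):
        if chance_of_seeing(category_size, matching_docs, n) >= confidence:
            return n
    return category_size
```

If the smoking gun is a single document in a 2,000-document category, a 300-document sample has only about a 15 percent chance of containing it, and reaching 95 percent requires reading 1,900 documents. In practice that usually argues for narrowing the category with more specific keywords before sampling rather than reading nearly everything.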

In summary, examining large incoming productions can be time-consuming and expensive. Moreover, it is not a task that is well suited to the various AI solutions. Fortunately, there is a simple method available for reducing the number of documents that must be reviewed in order to understand very accurately what the collection says about each topic. The method relies on information already known in the case and is easily executed. I encourage you to try it.