Text-mining software, which finds patterns and insights in collections of documents, has one powerful capability: identifying related words. The software models the words in a corpus (the full set of documents) as topics. Latent Dirichlet Allocation (LDA) is one of the algorithms that carries out such topic modeling. In the legal industry, LDA can work its wonders on almost any set of documents: client satisfaction comments, documents obtained in discovery or due diligence, annual reports, hotline replies, survey answers, and other sources.


An Example of LDA

Let's start with an actual set of documents and see how LDA performs. The author gathered self-descriptions from thirty U.S. law firms, the kind of text they might use in recruitment brochures. Each self-description runs at least 150 words. After removing trivial words, we ran LDA from an open-source R package and told it to model five topics. The table below lays out the ten words the software most closely associated with each topic, in declining order of weight within each topic.

Topic 1 appears to address client service and value (“provide,” “providing,” “value”); Topic 2 suggests depth of experience (“services,” “years,” “leading”); Topic 3 is the bragging topic (“recognize,” “top,” “best,” “ranked”); Topic 4 focuses on substantive practices (“litigation,” “real,” “estate,” “regulation”); and Topic 5 on engagement of lawyers (“pro,” “bono”). Obviously, readers might pick alternative themes for the topics, but at a minimum the software assembles large amounts of text and isolates the key words. The software does not suggest a concept or label for the topics it creates; that interpretation falls to the reader.