The use of predictive coding in eDiscovery document review is increasingly popular, but how long before the algorithm overtakes the human? Mark Schroeder analyzes whether the technology is a best friend or a job thief for lawyers.
Predictive coding, also known as algorithm-assisted “text categorization”, refers to the use of a software program to identify documents that are relevant or responsive to a particular case or issue, based on a review of test documents (or a population of seed sets, validation sets or training sets) by lawyers and subject matter experts. The computer-assisted methodology involves a machine learning process and a combination of different algorithmic tools.
This method of assisting counsel in searching, culling and categorizing documents is considered one of the most important developments in the eDiscovery industry. In fact, it is so significant that some insiders believe the technology will eventually replace the jobs of lawyers executing document review.
While using algorithms can in many situations find the proverbial needle in a haystack much more efficiently, it is the author’s position that the methodology will continue to be more of a super-charged assistant to, rather than a replacement for, the lawyer review team.
[ihc-hide-content ihc_mb_type=”show” ihc_mb_who=”1″ ihc_mb_template=”2″ ]
The more likely future for eDiscovery, as technology-assisted review (TAR) spreads, is one where the standards of document review are raised and, thanks to greater efficiencies, the amount of eDiscovery considered rational and proportionate to a case increases significantly.
Support for the position that TAR is an enhancement rather than a replacement comes from Da Silva Moore v Publicis Groupe (2012), the first and most-cited case on the use of TAR. In it, Judge Andrew Peck validates lawyers being part of the process, stating: “[lawyers] … can help cull extraneous documents from a set for review and thus enrich the set of documents used to train predictive coding technology.” However, Peck further endorses TAR, explaining that “[It] … can help target specific concepts that might not turn up in lawyer random sampling, which can ensure a more comprehensive review.”
Since 2012, other cases have emerged that provide more reasons for lawyer reviewers to fear for their jobs. The most significant was Federal Housing Finance Agency v HSBC (2014), where Judge Denise Cote stated: “Predictive coding had a better track record in production of responsive documents than human review.” Further support came from Good v American Water Works, published at the end of 2014, where Judge John Copenhaver stated that predictive coding may be used in determining privileged documents and/or content.
While the above are all US decisions, in early 2016 the UK courts began to support the technology as well. In Pyrrho Investments v MWB Property, Master Matthews turned to the disclosure rules set out in Practice Direction 31B, which support the use of “automated methods of searching if a full review of each and every document would be unreasonable”. He also noted that “whether it would be right for approval to be given in other cases will, of course, depend upon the particular [case circumstances]”. However, in the Pyrrho case, the consensus of the parties regarding proportionality, efficacy and suitability was the key consideration.
Finally, in May 2016, TAR was again supported in a Commonwealth case, according to a report by Berwin Leighton Paisner (BLP). The petitioner sought a buyout of his minority shareholding, and the respondents contested both the allegations and the petitioner’s suggested valuation. Nevertheless, the parties reached agreement on most directions in advance of the first Case Management Conference. The respondent possessed the vast majority of the potentially relevant documents, approximately 500,000.
The sticking point, according to BLP, was the most proportionate and appropriate approach to disclosure. The petitioner’s solicitors wanted to adopt a linear review approach using an agreed-upon list of custodians and search terms. BLP, which represented the respondent, asserted that the costs of this approach would be excessive and that TAR could achieve “super results … at a more proportionate cost”. In making that argument, the respondent’s solicitors referred the court to the relevant passages and factors outlined by Master Matthews in the Pyrrho case. The court agreed and ordered that the respondent use TAR.
As shown above, in the past few years the use of predictive coding has been increasingly advocated for by corporate counsel and supported by judicial cases, primarily for its efficiencies. That said, it is also well established that a substantial amount of linear document review by lawyers or subject matter experts is needed to effectively and accurately train the predictive coding algorithm.
Unfortunately, it is not possible to proclaim a simple “general rule” on how large the “training set” needs to be. While there is a relationship between the training set size and the total number of documents in the population of a given case, the more relevant determinant of size is the “complexity of the categorization problem at hand”, according to Ali Hadjarian, a senior manager at Deloitte Financial Advisory Services. Stated another way, the more relevancy issues there are, and the more words and phrases within each issue that are coded as relevant or not relevant, the larger the training sets will need to be.
So, when should you consider using predictive coding? There is no simple answer. However, some broad parameters are emerging in terms of total document volume and training set sizes that provide general guidance.
From a technical standpoint, a training set can be as small as 500 documents and still provide very precise results if the categorization is exceptionally simple. On the other hand, according to Hadjarian, if the categorization is exceptionally complex, a training set of 30,000 may still be too small to provide the desired level of confidence.
It should be noted, though, that predictive coding is not a one-step process, and even the most senior experts have difficulty agreeing on “determination of seed sets (random, judgmental, mix), layering search terms, and the best or most accurate analytic and coding methodology”, wrote Laura Ewing-Pearle, a certified eDiscovery specialist recognized by the Association of Certified eDiscovery Specialists (ACEDS), in an article. She said developing the training set is an iterative process that can take “as few as three generations or as many as 45”.
Furthermore, each relevancy issue has a binary decision tree. Therefore, if there are many separate issues, that increases complexity, and thus training-set sizes need to be significantly increased.
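To make the iterative nature of training more concrete, the following is a minimal sketch in Python rather than a description of any commercial product. It assumes a single relevancy issue, a toy naive Bayes word-count model, and a hypothetical `ask_lawyer` function standing in for human review; the sample documents, the `margin` threshold and the batch size are all illustrative assumptions. In each generation the model is retrained, the documents it is least sure about are sent to the lawyers, and the loop stops once nothing remains in the uncertain band.

```python
import math
from collections import Counter

def train(labelled):
    """Count words per class from (text, is_relevant) pairs."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, label in labelled:
        counts[label].update(text.lower().split())
        docs[label] += 1
    return counts, docs

def relevance_probability(model, text):
    """Naive Bayes estimate of P(relevant | text) with add-one smoothing."""
    counts, docs = model
    vocab_size = len(set(counts[True]) | set(counts[False])) + 1  # +1 for unseen words
    total_rel = sum(counts[True].values()) + vocab_size
    total_not = sum(counts[False].values()) + vocab_size
    log_odds = math.log((docs[True] + 1) / (docs[False] + 1))
    for word in text.lower().split():
        p_rel = (counts[True][word] + 1) / total_rel
        p_not = (counts[False][word] + 1) / total_not
        log_odds += math.log(p_rel / p_not)
    log_odds = max(-30.0, min(30.0, log_odds))  # keep exp() well within range
    return 1 / (1 + math.exp(-log_odds))

def iterative_training(seed, pool, ask_lawyer, batch_size=2, margin=0.3, max_rounds=45):
    """Retrain each generation, sending the least certain documents to reviewers."""
    labelled, pool = list(seed), list(pool)
    rounds = 0
    while rounds < max_rounds:
        model = train(labelled)
        scored = [(abs(relevance_probability(model, d) - 0.5), d) for d in pool]
        uncertain = sorted(s for s in scored if s[0] < margin)
        if not uncertain:
            break  # the model is confident about everything left in the pool
        for _, doc in uncertain[:batch_size]:
            labelled.append((doc, ask_lawyer(doc)))
            pool.remove(doc)
        rounds += 1
    return train(labelled), rounds

# Hypothetical seed set coded by lawyers, and an unreviewed pool.
seed = [("merger price negotiation draft", True), ("lunch menu friday", False)]
pool = ["merger valuation draft", "friday football tickets",
        "price of shares in merger", "office lunch order"]

# Stand-in for human judgment: here a lawyer treats anything about the merger as relevant.
model, rounds = iterative_training(seed, pool, ask_lawyer=lambda doc: "merger" in doc)
print("generations of lawyer review:", rounds)
print("score for 'merger price draft':", round(relevance_probability(model, "merger price draft"), 3))
print("score for 'lunch menu':", round(relevance_probability(model, "lunch menu"), 3))
```

With several relevancy issues, a loop of this kind would in principle run once per issue, which is one way to see why training-set sizes grow with the number of issues.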
With that said, from a practical standpoint cases that utilize predictive coding typically involve total document volumes of more than 500,000, with the average case involving more than one million documents. This large number has historically been the case due at least in part to the significant cost of the predictive coding software. The average training set will range from about 7,000 to 12,000 documents, but due to the iterative nature of the process the size of the training set could potentially be much larger.
In any case, the average cost savings from the use of predictive coding range from 30% to 80%, based on the ability of lawyer reviewers to forgo the review of hundreds or even thousands of non-responsive documents.
Further, the time saved by delegating review to the trained algorithm can free lawyer reviewers to perform more sophisticated analyses and higher-order tasks to ensure sound case strategy. For example, lawyers are still required to perform the higher-level review of documents prioritized in the system. Typically, these are the documents most likely to be critical in the dispute, or protected by attorney-client or work-product privilege.
Subject matter experts and lawyers also help choose the right keywords to maximize the return of responsive results while minimizing the likelihood of overlooking important variants or other related terms. Additionally, statisticians can play a role in validating the reliability and quality of search results by sampling throughout the process to demonstrate that the process is consistent, as necessary to ensure its defensibility. Finally, forensic technology specialists can help guide lawyers in using the most effective types of review (search and cull) platforms.
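The kind of sampling a statistician might perform can be sketched briefly. The example below is an illustration under stated assumptions, not a prescribed validation protocol: it draws on one common check, in which a random sample is taken from the documents the algorithm marked non-responsive, lawyers review that sample, and the share of responsive documents found (often called the elusion rate) is reported with a confidence interval. The Wilson score interval is one standard choice; the sample size, the number of hits and the size of the discard pile are invented figures.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical figures: 1,500 documents sampled at random from the set the
# algorithm marked non-responsive; reviewers found 12 that were responsive.
low, high = wilson_interval(hits=12, n=1500)
discard_pile = 400_000  # assumed size of the non-responsive set

print(f"Elusion rate: {low:.2%} to {high:.2%}")
print(f"Responsive documents possibly missed: {low * discard_pile:,.0f} to {high * discard_pile:,.0f}")
```

Repeating a check like this at intervals during the review is one way to show that the process is behaving consistently.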
In summary, TAR, or more accurately “predictive coding”, may be able to replicate, and in some cases overtake, basic first-pass document review functions that entail tagging vast quantities of documents for relevancy and sorting them into legal issue categories. However, it is doubtful that this technology can, in the foreseeable future, replace the knowledge, frame of reference and expertise of seasoned lawyers and legal technology professionals, who continue to be required to help manage the discovery process to a successful result.
[/ihc-hide-content]
















