Text-Mining for Lawyers
Arthur Dyevre
Abstract
Many questions facing legal scholars and practitioners can only be answered
by analysing and interrogating large collections of legal documents: statutes,
treaties, judicial decisions and law review articles. I survey a range of novel
techniques in machine learning and natural language processing – including
topic modelling, word embeddings and transfer learning – that can be applied
to the large-scale investigation of legal texts.
I am grateful to Dr. Nicolas Lampach, Dr. Timothy Yu-Cheung Yeung, Monika Glavina, Kyra Wigard and Nusret
Ipek for invaluable research assistance. I acknowledge financial support from European Research Council
Horizon 2020 Starting Grant #638154 (EUTHORITY).
Much of the information of interest to lawyers and legal scholars comes in the form of texts.
Whether briefs, contracts, court rulings, law review articles, legislative acts, treaties,
newspapers or blog posts, all are either legal documents themselves or documents about the
law. Retrieving, analyzing, commenting, relating and expounding these documents has been
the bread and butter of legal practice and legal scholarship alike for centuries.
Lawyers deal in words and the law can be viewed as a vast and complex network of interrelated
texts, as illustrated in Figure 1. The function of this discourse is not only to announce legal
rules and how they apply to a particular set of facts but also to explain or summarise them in
more succinct or more accessible language – which is understood to be one of the core
functions of traditional, doctrinal scholarship.
While the study of legal texts is at least as old as academic legal scholarship, what is new is
that a whole range of text-mining techniques have emerged to assist the legal community in
navigating and analyzing the ever-expanding sea of legal and law-related documents. These
techniques rely on recent advances in machine learning and natural language processing.
The media hype about Artificial Intelligence (AI) occasionally leads to exaggerated claims
about the capabilities of these techniques. Except for the simplest legal tasks, robot lawyers
are not yet around the corner. Nor are fully automated robot judges (provided that robot
judges are even desirable, which is, at least, questionable). However, even if the media hype
(sometimes amplified by legal scholars) paints a misleading picture of what AI can achieve, it
would be at least equally wrong to dismiss these techniques as irrelevant to legal practice or
legal scholarship. This is true even for those who see themselves as hardcore black-letter law
scholars. The now famous Gartner Hype Cycles tell us that perceptions of AI advances oscillate
between peaks of inflated expectations and troughs of disillusionment, before reaching a
plateau of productivity.1
Researchers with experience in text-mining applications in the legal domain recognize that
text-mining techniques cannot (yet) fully replace careful human reading. Yet these
technologies are already sufficiently mature and progressing at a breakneck pace to deliver
substantial advances. While increasingly popular in the interdisciplinary fields of Law &
Economics, Empirical Legal Studies and Law & Politics,2 text-mining methods are also directly
relevant to the work of doctrinal legal scholars. Indeed, one way to view them is as
augmented doctrinal reality.
1
See https://www.gartner.com/smarterwithgartner/5-trends-drive-the-gartner-hype-cycle-for-emerging-
technologies-2020/ (Accessed 2 March 2021).
2
For a review see Jens Frankenreiter & Michael A. Livermore, Computational methods in legal analysis, 16
ANNU. REV. LAW SOC. SCI. 39–57 (2020); for reflections and illustrations of the use of machine learning and
natural language processing methods in empirical legal studies see MICHAEL A. LIVERMORE & DANIEL NAHUM
ROCKMORE, LAW AS DATA: COMPUTATION, TEXT, & THE FUTURE OF LEGAL ANALYSIS (2019).
Figure 1. Law as text.
The present contribution aims to introduce these techniques to jurists unfamiliar with
machine learning and natural language processing or who may only have a faint notion of the
uses to which these tools can be put. To that end, I shall first describe how data-harvesting
methods can be deployed to collect large collections of legal documents. I will then proceed
to explain how text is transformed into input data for text-mining tasks. Next, I will offer an
overview of the text-mining techniques themselves, distinguishing supervised and
unsupervised methods and walking the reader through a number of examples from the
EUTHORITY Project (www.euthority.eu). Finally, because I expect that many readers might be
interested in learning some of the reviewed techniques, I will say a few words about practical
software implementation.
The target audience of the present paper are continental European legal scholars. Because of
this deliberate focus, the discussion excludes tasks and questions – such as
contract review or document assembly – that are important in legal practice3, but of lesser
relevance to academic legal research, as traditionally understood in continental Europe. Nor
do I engage with matters such as causal inference that are central to the integration of text-mining
and machine learning approaches in empirical legal studies and Law & Economics. 4 Also, my
aim is to introduce text-mining methods in terms that my target audience (hopefully) will find
understandable. For that reason, I eschew mathematical notation and technical jargon to
focus on the underlying conceptual intuitions with the help of concrete illustrations.
Obviously, this comes at the price of precision. But I hope that this sacrifice comes with the
benefit of lowering the barrier to access. It is also worth mentioning at the outset that the
scope of the present review is, by its very nature, limited. Text-mining and natural language
processing have become vast fields, currently progressing at a breakneck pace that is
possibly unmatched in any other field of scientific inquiry. So to pretend that this survey is, in
any sense, comprehensive would be silly.
3
Efforts to automate these tasks have been an important focus of the emerging Legal Tech scene, see Robert
Dale, Law and word order: NLP in legal tech, 25 NAT. LANG. ENG. 211–217 (2019).
4
See LIVERMORE AND ROCKMORE, supra note 2.
The present contribution assumes that legal scholars, with or without prior training in
statistics or empirical methods, can become not just intelligent consumers but also active
users of this panoply of powerful techniques. Readers interested in applying computational
textual methods will find some pointers in the Section on “Learning Text-Mining Methods”.
Harvesting Legal Data
Computerised text-mining methods require that texts be in digital form. Luckily, millions of
legal documents are now available at a few clicks in electronic repositories and legal databases.
The degree of exhaustiveness of these repositories varies widely from jurisdiction to
jurisdiction. At best, judicial databases offer access to all published decisions. Often it will only
be to a subset of these decisions, with older rulings typically less likely to make the cut.
Because the universe of documents is somewhat smaller, legislative databases usually fare
better, although, here too, there are jurisdictional and cross-national disparities.5 As official
gazettes are increasingly published digitally, they potentially represent a treasure trove of
legal data.
When documents are not available in digital format, it is still possible to convert them to this
format using scanning combined with Optical Character Recognition (OCR). OCR works better
with more recent, undamaged texts than with old dusty casebooks or well-worn legal
treatises. However, the technology has made huge strides, thanks in large part to machine
learning (which helps guess semi-erased words or phrases). It is now even possible to
digitalise hand-written documents,6 opening up new possibilities for legal historians to scour
old manuscripts.
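To make this concrete, the snippet below sketches a minimal OCR pipeline in Python using the open-source Tesseract engine via the pytesseract library; the input file name and language setting are hypothetical placeholders:

```python
# Minimal OCR sketch: convert a scanned page into machine-readable text.
# Assumes the Tesseract engine plus the pytesseract and Pillow libraries;
# "scanned_ruling.png" is a hypothetical input file.
from PIL import Image
import pytesseract

page = Image.open("scanned_ruling.png")
text = pytesseract.image_to_string(page, lang="deu")  # "deu" for a German-language document
print(text[:500])  # inspect the first 500 recognised characters
```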
When done manually, assembling a large collection of legal documents for a text-mining
project can be excruciatingly time-consuming (try to download all European Court of Justice
decisions since 1954). However, data-harvesting techniques can make this step considerably
easier. Using libraries designed for this purpose in popular programming languages like R and
Python, it is possible to download the entire content of EUR-Lex (the EU law database) with
less than five lines of code. Web-scraping, as the method is commonly referred to, is now the
chief technique to collect data in social scientific disciplines.7 As scientists, including in physics
and medicine, have turned to text-mining methods to summarise vast collections of peer-
reviewed papers, publishers (notably Oxford University Press and Elsevier) have made their
journal collections available. The terms and conditions of commercial and non-commercial
databases may sometimes explicitly prohibit web-scraping and there remain some
uncertainties about when web-scraping may be prohibited even for non-profit, purely
academic research purposes.
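By way of illustration, the following sketch shows what a basic harvesting script might look like in Python, using the widely used requests and BeautifulSoup libraries. The URL and the page structure are hypothetical placeholders, and any real scraping exercise should of course comply with the terms of the database being scraped:

```python
# Minimal web-scraping sketch: download every ruling linked from an index page.
# The URL and the HTML structure are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

index = requests.get("https://example-court-database.eu/rulings")
soup = BeautifulSoup(index.text, "html.parser")

for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.endswith(".html"):                        # keep only links to ruling pages
        ruling = requests.get(href)
        with open(href.split("/")[-1], "w", encoding="utf-8") as f:
            f.write(ruling.text)                      # save the raw page locally
```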
5
At European Union level, EUR-Lex is fairly comprehensive, with regard to both legislative acts and case law.
EU law – EUR-Lex, https://eur-lex.europa.eu/homepage.html (last visited Nov 9, 2020). National databases
are typically less complete.
6
Digitize Your Notes With Microsoft Computer Vision API, NORDIC APIS (2017),
https://nordicapis.com/digitize-your-notes-with-microsoft-vision-api/ (last visited Nov 9, 2020).
7
Nicholas J. DeVito, Georgia C. Richards & Peter Inglesby, How we learnt to stop worrying and love web
scraping, 585 NATURE 621–622 (2020).
From Text to Data
When we read a text our brain parses it applying our knowledge of semantics, syntax and
context. In any language, the stock of words is finite, but syntactic rules allow the construction
of infinitely many sentences from this finite vocabulary. Moreover, humans are able to
communicate more than they say or write by taking the context into account. That is why we
ascribe different meanings to the sentence “I would like a table” when uttered in a restaurant
and when uttered in a furniture shop.8 While the language of legal documents – including
contracts, statutes and judicial opinions – can diverge, sometimes significantly (``Any proviso
to the contrary notwithstanding’’), from everyday language, these basic principles of linguistic
cognition and interpersonal communication are equally valid in the legal domain as in other
areas of human activity.
Text-mining methods do not parse texts quite the same way the human brain does. Instead,
these methods typically involve a good deal of complexity reduction. This may seem
surprising to those less well-versed in machine learning. But even the most advanced natural
language processing algorithms are still based on statistical principles. Texts are represented
as numbers and the algorithms look for patterns in these numbers. The ability to detect
patterns depends on the amount of textual data and the sophistication of the algorithm, but
the basic principle remains the same, including for the most cutting-edge techniques. In that
sense, it is not entirely wrong to say that machine learning algorithms are still quite dumb.
Yet their power stems from their ability to leverage the brute force of computing to arrive at
useful (and sometimes surprisingly good) approximations.
Figure 2. From texts to a document-term matrix.

Text 1: “The law shall apply to agreements between non-resident firms.”
Text 2: “Firms are required to abide by the terms of the present law.”

Document-term matrix:

         the  firms  shall  are  apply  …
Text 1    1     1      1     0     1    …
Text 2    2     1      0     1     0    …
Until recently, most text-mining methods relied on what is known as the bag-of-words (BOW)
approach. To see what this amounts to, let’s assume, for the sake of exposition, that we have
a corpus with two texts, Text 1 and Text 2, as in Figure 2. The BOW approach involves
converting texts to sequences of word counts and corpora to document-term matrices. The
sequence of word counts representing a text is called a “vector”. This vector contains counts
of all the words occurring in that text and zeros for the words occurring in the other texts but
not in that particular text. For Text 1 the zeros will represent all the words that appear in Text
8
DAN SPERBER & DEIRDRE WILSON, RELEVANCE: COMMUNICATION AND COGNITION (1996).
2 but not in Text 1, and vice-versa. In a large corpus spanning a vocabulary of millions of
words, the vector of word counts representing a text will contain mostly zeros – accounting
for all the words that occur in other texts but not in the one under consideration.
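Readers who wish to see the BOW representation in action can reproduce the document-term matrix of Figure 2 in a few lines of Python; the sketch below assumes the scikit-learn and pandas libraries:

```python
# Rebuild the document-term matrix of Figure 2 with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = [
    "The law shall apply to agreements between non-resident firms.",   # Text 1
    "Firms are required to abide by the terms of the present law.",    # Text 2
]

vectorizer = CountVectorizer()            # default: lower-cased unigram bag-of-words
dtm = vectorizer.fit_transform(texts)     # sparse document-term matrix

df = pd.DataFrame(dtm.toarray(),
                  columns=vectorizer.get_feature_names_out(),
                  index=["Text 1", "Text 2"])
print(df[["the", "firms", "shall", "are", "apply"]])  # matches the counts in Figure 2
```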
To keep some phrases such as “European Union” or “Court of Justice” together instead of
treating their component words as distinct lexemes, it is possible to throw some bigrams or
trigrams into the document-term matrix. Think of an n-gram as a contiguous sequence of
words. A bigram is a sequence of two words; a trigram a sequence of three words, and so on.
Turned into a bigram “European Union”, for example, becomes “European_Union” whereas
“Court of Justice” becomes the trigram “Court_of_Justice”. These n-grams can then be
processed just as individual words (unigrams).
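In the scikit-learn implementation used above, for instance, n-grams can be added by setting a single parameter (a sketch; note that the library joins n-grams with a space rather than an underscore):

```python
# Adding bigrams and trigrams to the document-term matrix.
from sklearn.feature_extraction.text import CountVectorizer

ngram_vec = CountVectorizer(ngram_range=(1, 3))  # unigrams, bigrams and trigrams
ngram_vec.fit(["The Court of Justice interprets European Union law."])
print(ngram_vec.get_feature_names_out())
# alongside the unigrams, the vocabulary now contains 'european union' (a bigram)
# and 'court of justice' (a trigram)
```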
This modus operandi may strike many as a crude simplification. Yet, crude as it may be, this
simplification can nonetheless produce useful results, as we shall see.
It is easy to see, however, that progress in modelling language and improvements in the
performance of downstream applications – in law just as in other fields – ultimately required
moving beyond the BOW paradigm to develop richer representations of
vocabularies while capturing more of the context and rules of syntax.
As we will see, static word embedding models such as Word2Vec have taken one significant step
in that direction by representing words by their co-occurrence associations. These methods
reflect the emergence of a new paradigm building on notions from distributional linguistics,
notably the intuition that a word is defined by the company it keeps.
Cutting-edge methods like transformers have taken the field several steps further into this
new paradigm. Pre-trained on giant corpora, transformer models like Google’s BERT rely on
a contextualised representation of word usage, enabling them to handle polysemy and to
parse the reference of pronouns – a remarkable achievement that constitutes a major
milestone in the development of AI language models.
Note, though, that while these novel techniques do not require converting raw texts to a
document-term matrix, they still require texts to be in digitalised, machine-readable format.
Unsupervised Techniques
Computer scientists and machine learning scholars typically speak in terms of tasks –
information retrieval, clustering, summarizing, forecasting, etc. – or in terms of whether the
9
Some text-mining tasks such as authorship identification require a distinct approach to pre-processing.
Indeed, because pronouns and prepositions are markers of personal style, it is common to restrict the
document-term matrix to this class of words and to exclude nouns, verbs and adjectives.
method or algorithm operates with human-labelled documents or not – supervised vs
unsupervised.
Translated into more familiar language, information retrieval is what jurists do when they
search a document collection for a specific set of documents: e.g. entering a list of keywords
into a database search engine to retrieve all judicial rulings addressing a particular issue.
Similarly, clustering is what lawyers do when they try to sort out documents into categories:
e.g. the themes to which law review articles relate or the topics coming up in judicial rulings.
Turning long documents into more easily digestible summaries is also something that lawyers
do on a routine basis. Prediction is something that one may not intuitively associate with
texts. Yet words, whether from legal briefs or other textual inputs, can also serve to
predict events or behaviours.
Some techniques and machine learning algorithms have been specifically designed for
particular tasks. Yet several methods, some supervised, others unsupervised, may sometimes
come into consideration for the same task, in which case the optimal choice should ultimately
depend on the specific research question of interest to the legal analyst.
Word Clouds
Word cloud plots are arguably one of the most familiar and simplest text-mining methods. A
word cloud simply plots words according to their aggregate frequency in the document-term
matrix. Illustrated in Figure 3 is a slightly more sophisticated word cloud, known as a
“comparison cloud”.10 It is based on a corpus compiling all European Court of Justice rulings
up to 2015 (over 12,000 documents). Plotted are not the most frequent words in the overall
corpus but the words that are most distinctive of the three main procedures: annulments
(Article 263 Treaty on the Functioning of the European Union (TFEU)); infringements (Article
258 TFEU) and preliminary rulings (Article 267 TFEU).
10
The size of a word reflects the deviation of its rate of occurrence in a document from its average rate
across documents. Suppose p(i,j) is the rate at which word i occurs in document j and
p̄(i) = Σ_j p(i,j) / (number of documents) its average rate across documents. Word size is then
proportional to p(i,j) − p̄(i).
Word clouds are popular and easy to interpret, but are rather crude tools when it comes to
detecting more granular patterns. In some applications pre-processing steps, such as
restricting the document-term matrix to certain parts of speech (e.g. nouns or adjectives) may
help make them more informative. But limitations remain.
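For readers who want to experiment, the sketch below generates a basic frequency word cloud with the Python wordcloud package on a toy stand-in text; the comparison cloud of Figure 3 requires more specialised plotting routines:

```python
# Basic frequency word cloud with the wordcloud and matplotlib packages.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# toy stand-in for the concatenated text of a corpus
text = "regulation directive court member state court regulation taxation court"

cloud = WordCloud(background_color="white").generate(text)  # word size ~ frequency
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```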
A notch more advanced are Principal Component Analysis (PCA) and Latent Semantic Analysis
(LSA). Both are closely related and relatively old statistical techniques to arrange large arrays
of data into more interpretable patterns. In the field of text-mining, they fundamentally serve
as unsupervised clustering methods to explore how texts and their words relate to each other.
PCA and LSA both work by seeking to compress the high-dimensional variations in word usage
– a corpus and the documents it comprises vary in as many ways as the number of words in
its vocabulary – into something more easily interpretable (and cognitively manageable) for
the human brain. The output of both statistical procedures is a smaller number of
dimensions on which words and documents are arrayed to facilitate the identification of
meaningful patterns of relatedness.
The patterns of interest and the words expressing them depend on the specific task. PCA, for
example, has been used to identify the authorship of The Federalist Papers.11 But both
methods can also be used to cluster legal documents around themes if one of the generated
dimensions allows such an interpretation.
11
David I. Holmes, Authorship attribution, 28 COMPUT. HUMANIT. 87–106 (1994).
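Computationally, LSA boils down to a truncated singular value decomposition of the (weighted) document-term matrix. The following sketch, on toy stand-ins for actual rulings, shows how documents and words can be arrayed on two dimensions with scikit-learn:

```python
# LSA as truncated singular value decomposition of a tf-idf weighted matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

rulings = [
    "The principle of conferral limits the powers of the Union.",
    "Member States retain sovereignty and constitutional identity.",
    "Goods exported within the internal market enjoy free movement.",
]  # toy stand-ins for the actual decisions

dtm = TfidfVectorizer().fit_transform(rulings)

lsa = TruncatedSVD(n_components=2)        # keep two dimensions
doc_positions = lsa.fit_transform(dtm)    # each ruling's position on the dimensions
word_loadings = lsa.components_           # each word's loading on the same dimensions
print(doc_positions)
```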
Figures 4 and 5 illustrate the use of LSA to explore oscillations in the position of the German
Federal Constitutional Court over European integration from the 1960s up to 2020. The corpus
comprises 26 rulings, with length varying from a little over 1,000 to more than 20,000
tokens.12
Here the basic interpretive assumption with which the output of the algorithm was
approached is that variations in jurisprudential stance should be reflected in the use of words
related to statehood and the internal market, with greater divergence in vocabulary
manifesting greater jurisprudential divergence.
Figure 4 shows the extent to which selected clusters of words tend to appear in the same
decisions. Unlike in word cloud plots such as the one depicted in Figure 3, the position of
words has a precise meaning here. Vertical and horizontal axes denote separate dimensions
while the position of words is itself related to the documents in which they occur. If two
documents share many words that are close to each other on a dimension, these documents
will also be close to each other on that particular dimension. For example, “sovereignty of the
people” (Volkssouveränität), “constitutional identity” (Verfassungsidentität), “enumerated
powers” (Einzelermächtigung) and “ultra vires” are close to each other on Dimension 1. These
words are also more closely associated with the Court’s more Eurosceptic judgments, like
Maastricht and Lisbon. “Duty to refer” (Vorlagepflicht), “direct” (unmittelbar), “effect”
(Wirkung), “export” (Ausfuhr), “good” (Ware) form another separate cluster on the same
dimension on the right-hand side. These words are also more closely associated with
integration-friendly rulings, like Kloppenburg, Banana or Lütticke.
12
For a discussion and assessment of the performance of LSA and other text-mining methods to map
jurisprudential change see Arthur Dyevre, The promise and pitfall of automated text-scaling techniques for the
analysis of jurisprudential change, ARTIF. INTELL. LAW 1–31 (2020).
Figure 4. Frames and phraseology of German constitutional rulings on Europe
A recent paper compared the performance of eight algorithms, including LSA, in mapping the
evolution of the German Court’s case law on European integration.13 The positions ascribed to
the decisions by the algorithms were evaluated against scholarly accounts and legal expert
13
Id.
ratings. A variant of LSA (Correspondence Analysis) performed best against scholarly accounts
in law journals, achieving a 75% pairwise correlation.14
Topic Modelling
A more recent technique specifically designed for clustering and automated classification is
topic modelling.15 Suppose you have a large collection of legal texts and you want to get a sense
of the themes and topics they pertain to. Instead of asking you to come up with a list of
categories or a classificatory scheme, topic modelling generates the categories and sorts out
the documents accordingly after you specified how many topics you wanted. At least this is
how the method is supposed to work.
In topic modelling, topics are modelled as probability distributions over words and documents
as probability distributions over topics. To generate the topics, the algorithm tries to find which probabilities
are most likely to have generated the observed documents.
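The workhorse topic modelling algorithm is Latent Dirichlet Allocation (LDA). The sketch below, on toy stand-ins for the rulings, shows how a model of this kind can be fitted with scikit-learn and how the most distinctive words of each topic can be extracted:

```python
# Latent Dirichlet Allocation with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "tax dividends corporate income member state taxation",
    "residence permit family member free movement worker",
    "customs duty import export tariff goods",
]  # toy stand-ins; the model behind Figure 6 was fitted on roughly 8,000 rulings

vec = CountVectorizer()
dtm = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)  # 25 topics in the text
doc_topics = lda.fit_transform(dtm)      # each document as a mixture of topics

words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-10:][::-1]]  # ten most distinctive words
    print(f"Topic {k}:", ", ".join(top))
```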
Figure 6 illustrates the output of a topic model of preliminary rulings (approximately 8,000
rulings). The number of topics was set at 25. What Figure 6 displays is one of these topics
represented by its 10 most distinctive words (note that the higher the beta value, the more
characteristic of the topic the word is). Looking at these “keywords” – which, it is essential to
understand, are not chosen by the researcher but emerge from the analysis – we may
plausibly summarize this topic as corporate taxation.
14
Id.
15
For a non-technical introduction see David M. Blei, Probabilistic topic models, 55 COMMUN. ACM 77–84
(2012).
The rulings most strongly associated with this topic bear out the interpretation. One of them is
a 2012 Grand Chamber ruling, which according to the model is 99% about topic 14. Here is a
quote from that ruling:
The High Court of Justice of England and Wales, Chancery Division, seeks, first, to
obtain clarification regarding paragraph 56 of the judgment in Test Claimants in the
FII Group Litigation and point 1 of its operative part. It recalls that the Court of Justice
held, in paragraphs 48 to 53, 57 and 60 of that judgment, that national legislation
which applies the exemption method to nationally-sourced dividends and the
imputation method to foreign-sourced dividends is not contrary to Articles 49 TFEU
and 63 TFEU, provided that the tax rate applied to foreign-sourced dividends is not
higher than the rate applied to nationally-sourced dividends and that the tax credit is
at least equal to the amount paid in the Member State of the company making the
distribution, up to the limit of the tax charged in the Member State of the company
receiving the dividends.
Topics can be visualised in various ways. In Figure 7, they are represented as a network in
which node size represents overall topic proportion in the overall document collection while
edge thickness corresponds to the weighted number of shared words. This way we can see
themes emerging from the topics.
Among other things, Figure 7 suggests that internal market and tax issues represent a big
chunk of what the CJEU does. However, social rights, residence rights and the recognition of
foreign judgments (private international law) also make for a substantial share of the cases
on which the Luxembourg judges sit.
If you think that 25 categories is too few to get a good sense of issue prevalence in the
Court’s case law, how about a topic model with 100 categories? Looking at Figure 8, we see
that such a model provides a more detailed picture, although we find the same themes
(internal market in the lower-right region, social and immigration issues in the left region).
It is also possible to construct dynamic topic models to study the evolution of case law over
time or “litigant” topic models to study how issues vary across litigant types.
Recent work has applied topic modelling to study relative issue emphasis across infringement,
annulment and preliminary rulings, highlighting how the CJEU’s case law is influenced by the
litigation agenda of case initiators (like the European Commission)16; to compare topic
salience in European Union legislation, CJEU rulings and contributions to the Common Market
Law Review17; to explore Dutch supreme court decisions18; and to demonstrate the lingering
centrality of market regulation in European Union law-making in the twenty-first century.19
While significant efforts have been expended on manually classifying the legal areas
addressed by US Supreme Court rulings, some authors have proposed topic modelling as a
more efficient and more accurate alternative.20 Work by Peter Grajzl and Peter Murrell further
illustrates how topic modelling can assist in exploring large collections of old legal texts. They
16
Arthur Dyevre & Nicolas Lampach, Issue attention on international courts: Evidence from the European Court
of Justice, REV. INT. ORGAN. 1–23 (2020).
17
Arthur Dyevre, Michal Ovadek & Monika Glavina, The Voices of European Law: Legislators, Judges and Law
Professors, forthcoming GER. LAW J. (2021).
18
Ylja Remmits, Finding the Topics of Case Law: Latent Dirichlet Allocation on Supreme Court Decisions (2017).
19
Nicolas Lampach, Wessel Wijtvliet & Arthur Dyevre, Merchant Hubs and Spatial Disparities in the Private
Enforcement of International Trade Regimes, INT. REV. LAW ECON. 105946 (2020).
20
Douglas Rice, Measuring the issue content of Supreme Court opinions, 7 J. LAW COURTS 107–127 (2019).
apply topic modelling to reports of cases heard by English courts between the fourteenth and
eighteenth centuries (N = 52,949).21
Word Embeddings
Tools like LSA, PCA and topic modelling are typical of the BOW paradigm. Word embeddings,
by contrast, are part of a new text-mining paradigm inspired by the defining principle of
distributional linguistics – “a word is defined by the company it keeps”.22
To explain how word embeddings work, the best is, again, to start with an example. Suppose
you want to investigate variations in attention to a particular phenomenon, e.g. politics and
populism in posts on a major legal blog. To measure attention to this concept, we might first
try to come up with a list of keywords (e.g. “politics”, “party”, “populism”...) capturing
attention to this phenomenon and then determine the extent to which our keywords are
actually matched in the document collection. However, this approach often delivers poor
results because the same phenomenon can be characterised in many different ways, leading
exact matches to either over- or underestimate the true number of relevant instances of
attention to the phenomenon in question. (The frustrating feeling is surely one that many
jurists have experienced when trying to retrieve documents via a keyword search in some
legal database.)
21
Peter Grajzl & Peter Murrell, A machine-learning history of English caselaw and legal ideas prior to the
Industrial Revolution I: generating and interpreting the estimates, 17 J. INSTITUTIONAL ECON. 1–19 (2021).
22
Tomas Mikolov et al., Distributed representations of words and phrases and their compositionality, in
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 3111–3119 (2013).
Table 1. Top 40 occurrence similarity scores for vector Politik + Parteien + Populismus from
embeddings (Word2Vec) trained on German-language contributions to the
Verfassungsblog
Word embeddings help deal with this problem by representing words not as frequencies –
the BOW approach – but as sequences (i.e. vectors) of occurrence similarities, generated
via a shallow neural network. For example, Table 1 displays the first 40 items in the vector of
occurrence similarities yielded by a word embeddings model trained on the German-
language contributions to the Verfassungsblog (a leading constitutional law blog) using the
Word2Vec algorithm. The vector corresponds to the words “Politik” (politics), “Parteien”
(parties) and “Populismus” (populism).23
Numbers next to the words in Table 1 indicate the cosine occurrence similarity. The closer it is
to 1, the more similar is the word’s context of occurrence to that of Politik + Parteien +
Populismus. Here the word exhibiting the highest cosine similarity score is “Eliten” (elites),
which makes sense since elite-bashing is a defining feature of populist discourse. Other terms,
including “Presse” (press), “Medien” (media), “Bürger” (citizen), frequently come up in populist
rhetoric, too. “Verwaltung” (administration) and “Instrumente” (instruments), though, are
less intuitively associated with politics, populism or partisan organisations.
Co-occurrence similarity here refers to the words that tend to appear around the target word.
How many words before and after the target word should be considered – the window size –
is one of the parameters that have to be set by the researcher before training an embedding
model. A window size of 5 means that two words before and two after the target word will be
considered; a window size of 9, four words before and four words after; and so on. The neural
network is then trained to predict either the target word from the surrounding words
(continuous bag-of-words method) or the surrounding words given the target word (skip-
gram method).
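A minimal training sketch with the gensim library is shown below; the tokenised sentences are toy stand-ins for the blog posts and the parameter values are purely illustrative. Note that gensim's window parameter counts the words on each side of the target, so window=2 corresponds to what the text calls a window size of 5:

```python
# Training a Word2Vec model with gensim on a tokenised corpus.
from gensim.models import Word2Vec

sentences = [
    ["politik", "parteien", "populismus", "eliten"],
    ["medien", "presse", "politik", "bürger"],
    ["verfassung", "gericht", "urteil", "senat"],
] * 50  # toy stand-ins, repeated so there is enough data to train on

model = Word2Vec(sentences,
                 vector_size=100,  # dimensionality of the embeddings
                 window=2,         # two words before and two after the target
                 sg=1,             # skip-gram; sg=0 gives continuous bag-of-words
                 min_count=1)

print(model.wv.most_similar(positive=["politik", "parteien", "populismus"], topn=10))
```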
As with machine learning and neural networks in general, the more data (texts) the better.
This is why instead of training embeddings from scratch on a relatively small collection of blog
posts, it may be preferable to use a pre-trained model built on a much larger corpus. Table 2
shows the words associated with Politik + Parteien + Populismus from a pre-trained
embeddings model constructed from all German Wikipedia pages, with a vocabulary of nearly
five million words.24 Pre-trained embeddings constructed from legal documents also exist.25
The cosine similarity scores are generally higher in Table 2 than in Table 1, which suggests
that the pre-trained model better captures contextual similarity. In fact, it assigns high
cosine similarity scores to typos like “poltk” (cosine = 0.846). This is because typos appear in
the same context as the word with the correct spelling. The similarity scores assigned to typos
23
Several word embedding algorithms exist, including Word2Vec, Fasttext and Glove. Here we relied on the
Word2Vec approach.
24
A wide range of word embedding models spanning multiple languages can be downloaded from a repository
made available by the Language Technology Group at University of Oslo, see http://vectors.nlpl.eu/repository
(accessed 12 November 2020).
25
Ilias Chalkidis & Dimitrios Kampas, Deep learning in law: early adaptation and legal word embeddings
trained on large corpora, 27 ARTIF. INTELL. LAW 171–198 (2019).
highlight how word embedding models handle synonymy, which represents a major advance
for legal information retrieval tasks.
That pre-trained embeddings can deliver better results than locally-trained embeddings (i.e.
embeddings trained on the corpus one actually wants to investigate) illustrates the notion of
transfer learning. What a model learns about language use from a very large corpus is often
transferable to smaller text collections.
Table 2. Top 40 occurrences for vector Politik + Parteien + Populismus from embeddings
trained on German Wikipedia pages
One powerful application of word embeddings is to generate weighted lexicons, which can
be utilised to detect attention to a particular phenomenon or concept of interest. Figure 9
plots the variation in attention to politics, parties and populism in German-language
contributions to the Verfassungsblog using the words contained in the vector Politik +
Parteien + Populismus to measure average attention to the underlying phenomenon.
Over time, the Verfassungsblog has been posting a growing number of English-language
contributions. Figure 10 charts attention to the same phenomenon in English-language posts
using the vector politics + parties + populism generated by Google’s pre-trained word
embedding model (Word2Vec) for English – which boasts a vocabulary of three million words
and phrases, trained on a Google News corpus of roughly 100 billion words.
Figure 10. Relative incidence of words relating to “politics”, “parties” and “populism” in
English-language contributions to the Verfassungsblog, 2012-2019
These are potentially interesting results for scholars interested in the evolution of European
constitutional law scholarship and a possible shift from a legalistic, narrowly doctrinal
conception of legal scholarship to one that pays greater heed to political behaviours and
social dynamics.26
To further illustrate the potential of word embeddings for attention detection and document
retrieval, note that we can vary the specification of vectors to improve results or to capture
conceptual nuances. The vector generated for politics alone will be different from the vector
generated for politics + parties + populism. But if we wanted to generate a vector for terms
associated with politics and political parties but not with populism, we could specify a vector
like politics + parties - populism. Remarkably, in the Google pre-trained model, specifying king
- man generates a vector in which the word with highest cosine similarity score is queen.
So, by comparison with document search engines based on exact keyword matching, word
embeddings provide a considerably more powerful tool to capture attention to concepts.
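With the gensim library, loading a pre-trained model and performing this kind of vector arithmetic takes a few lines (a sketch; the file name is a placeholder for whichever pre-trained embeddings one has downloaded):

```python
# Loading pre-trained embeddings and doing vector arithmetic with gensim.
from gensim.models import KeyedVectors

# placeholder path: any pre-trained word2vec-format file, e.g. the Google News vectors
wv = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

# politics + parties, with the populism component subtracted out
print(wv.most_similar(positive=["politics", "parties"], negative=["populism"], topn=40))

# the example from the text: king - man
print(wv.most_similar(positive=["king"], negative=["man"], topn=5))
```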
Word embeddings can also serve to trace semantic change and to measure stereotypes and
biases in large corpora.27 Elliott Ash, Daniel Chen and Arianna Ornaghi, for instance, use
embeddings to detect stereotypes in the opinions of US circuit court judges.28 A related study
by Douglas Rice, Jesse Rhodes and Tatishe Nteta has examined racial biases in a corpus
comprising over one million state and federal court opinions. The authors find
stereotypically African-American names to be systematically associated with more negative
words compared to stereotypically European-American names.29
26
Bart Caiepo & Federico Benetti, How Political Turmoil is Changing European Constitutional Law: Evidence
from the Verfassungsblog, VERFASSUNGSBLOG (2020), https://verfassungsblog.de/how-political-turmoil-is-
changing-european-constitutional-law-evidence-from-the-verfassungsblog/ (last visited Nov 9, 2020).
27
Studies adopting this approach have revealed the evolution of gender and ethnic stereotypes or the
changing connotations of the word “gay” see William L. Hamilton, Jure Leskovec & Dan Jurafsky, Diachronic
word embeddings reveal statistical laws of semantic change, ARXIV PREPR. ARXIV160509096 (2016); Nikhil Garg
et al., Word embeddings quantify 100 years of gender and ethnic stereotypes, 115 PROC. NATL. ACAD. SCI. E3635–
E3644 (2018).
28
ELLIOTT ASH, DANIEL L. CHEN & ARIANNA ORNAGHI, Stereotypes in High-Stakes Decisions: Evidence from US Circuit
Courts (2020).
29
Douglas Rice, Jesse H. Rhodes & Tatishe Nteta, Racial bias in legal language, 6 RES. POLIT. 2053168019848930
(2019).
Document Clustering with Word Embeddings: Doc2Vec
Closely related to the word embedding approach just described is a document clustering
technique known as Doc2Vec. It relies on the same representation of words. But instead of
training the neural network to predict only the target word or the surrounding terms, the network is also
trained to predict the documents in which they occur. Documents thus become associated
with word vectors.
Doc2Vec is similar to PCA/LSA in that it simultaneously relates words and documents. The
principal difference, though, is that Doc2Vec draws on a much more sophisticated word
representation.
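A minimal gensim sketch, with toy stand-ins for the tokenised rulings, is shown below; the two-dimensional t-SNE projection used for plots like Figure 11 can then be computed on the resulting document vectors with scikit-learn:

```python
# Doc2Vec document vectors with gensim, projected to 2-D with t-SNE.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.manifold import TSNE

tokenised = [
    ["annulment", "regulation", "council"],
    ["preliminary", "reference", "national", "court"],
    ["infringement", "commission", "member", "state"],
] * 30  # toy stand-ins for tokenised rulings

corpus = [TaggedDocument(words, [i]) for i, words in enumerate(tokenised)]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)

vectors = np.array([model.dv[i] for i in range(len(corpus))])
coords = TSNE(n_components=2, perplexity=5).fit_transform(vectors)  # 2-D map as in Figure 11
```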
Figure 11 shows a t-SNE plot of a Doc2Vec model of European Court of Justice rulings, with
colours denoting the procedure. The horizontal and vertical axes of a t-SNE plot are not
amenable to substantive interpretation. But spatial proximity reflects similarity in word
usage. Here the plot suggests some degree of overlap across procedures, but greater
heterogeneity in rulings originating in preliminary references.
Figure 11. T-SNE plot of Doc2Vec model of European Court of Justice rulings (colour
denotes procedure).
Looking at a large corpus of US Court of Appeals rulings, Daniel Chen and Elliott Ash have
explored a variety of possible uses of Doc2Vec for the analysis of judicial opinions.30
Because precedents play an important role in legal argumentation, several studies have
proposed Doc2Vec as a methodology to identify and measure case similarity.31
30
Elliott Ash & Daniel L. Chen, Case vectors: Spatial representations of the law using document embeddings,
LAW DATA ST. FE INST. PRESS ED M LIVERMORE ROCKMORE (2019).
31
Tereza Novotná, Document Similarity of Czech Supreme Court Decisions, 14 MASARYK UNIV. J. LAW TECHNOL.
105–122 (2020); Lorenz Timothy Barco Ranera, Geoffrey A. Solano & Nathaniel Oco, Retrieval of Semantically
Similar Philippine Supreme Court Case Decisions using Doc2Vec, in 2019 INTERNATIONAL SYMPOSIUM ON MULTIMEDIA
AND COMMUNICATION TECHNOLOGY (ISMAC) 1–6 (2019); Paheli Bhattacharya et al., Methods for computing legal
document similarity: A comparative study, ARXIV PREPR. ARXIV200412307 (2020).
Supervised Classification Methods
Unsupervised approaches produce models and output without human input, which may seem
to be a great advantage. However, the models and output generated by unsupervised
methods always require ex post human interpretation. There is no absolute guarantee that the
topics generated by a topic model will make sense or that the dimensions produced by an LSA
model will be interpretable. This is not necessarily a problem if unsupervised techniques are
primarily used for exploratory purposes. However, if one purports to rest an empirical
assertion on the results of unsupervised methods, some human validation of at least a subset
of these results may be required in order to demonstrate intersubjective validity.
Supervised methods, by contrast, do not require ex post validation because they seek to
“emulate” what humans do by discovering patterns in documents labelled by human
annotators prior to training.
Supervised approaches all require labelled documents. There are only two avenues to obtain
labelled data. The first is to rely on documents that other researchers have already annotated.
To measure the ideological direction of US federal court opinions, Carina Hausladen, Marcel
Schubert and Elliott Ash were able to leverage an existing database (the Songer Database)
where ideological direction had been hand-coded for a subset (5%) of federal cases. These
annotated opinions were then used to train and test a range of algorithms.32 Using labelled
data sets from outside the legal domain can be tempting. But results may then have to be
interpreted with caution. One study, for example, has sought to leverage academic papers
from moral philosophy which had been labelled either as “deontological” or
“consequentialist” to train machine learning classifiers to detect modes of moral reasoning in
US federal opinions.33 However, given the risk of low domain adaptation (the language of
academic papers and judicial opinions may diverge too much), the results of studies adopting
this strategy should be taken with a grain of salt.
When no labelled data set exists, the only way to obtain labelled data is to build it from scratch.
In many areas, supervised machine learning projects rely on crowdsourcing platforms such as
Amazon Mechanical Turk, where annotators recruited online tag documents for a modest
compensation (and so at a low cost for the researcher).34 However, crowdsourcing works best
when a task is simple, quick and straightforward. So the specificity, technicality and
complexity of legal language mean that crowdsourcing is not a well-suited approach for legal
projects.
32
Carina I. Hausladen, Marcel H. Schubert & Elliott Ash, Text classification of ideological direction in judicial
opinions, 62 INT. REV. LAW ECON. 105903 (2020).
33
Nischal Mainali et al., Automated classification of modes of moral reasoning in judicial decisions, in
COMPUTATIONAL LEGAL STUDIES (2020).
34
Catherine Grady & Matthew Lease, Crowdsourcing document relevance assessment with mechanical turk, in
PROCEEDINGS OF THE NAACL HLT 2010 WORKSHOP ON CREATING SPEECH AND LANGUAGE DATA WITH AMAZON’S MECHANICAL
TURK 172–179 (2010).
Law students potentially offer a solution to the document annotation challenge. Without
being accomplished legal experts, they tend to be more comfortable with legalistic language
and better at parsing judicial prose or statutory provisions. It is possible to integrate legal
annotation tasks in tutorials and interactive classes. In fact, annotating legal documents can
be viewed as an excellent exercise for students to practice and perfect legal analytic skills.35
Conducting annotation tasks with large groups of students demands a good workflow.
Annotators must first be trained to recognise the concepts and information which the labels
have been designed to capture. Documents must then be distributed to annotators and
annotated documents collected. To produce high-quality annotations, it is recommended to
have two or more annotators independently annotating the same document. Discrepancies
then have to be identified to compute inter-annotator agreement metrics. To bolster quality,
a reconciliation procedure can also be put in place. Software and online platforms have been
developed to facilitate the completion of document annotation tasks by teams of annotators.
One such annotation project, conducted in Leuven with law students annotating Belgian high court rulings, relied on the
cloud version of the TagTog37 platform (which the project was allowed to use free of charge
in exchange for making the labelled data public). Some software solutions, such as
WebAnno38, are open source and can thus be used free of charge, but require local server
installation – which can be technically involved, unless technical support is provided.
Bag-of-words Methods
As with unsupervised techniques, supervised techniques, too, initially all relied on the BOW
paradigm. Supervised BOW methods involved the same data preparation steps, including
converting texts into document-term matrix format. In a supervised setup, the document-term
matrix will look very similar, except that it will contain one or more additional columns for
the labels produced by human annotators.
Before trying out some classification algorithms, the next stage will be to divide the data into
train and test data. As its name suggests, the train set will serve to train many versions of the
algorithm, whereas the test set will serve to measure their performance and select the best
35
Conducting legal AI projects also help bring greater awareness of the potential of new technologies for legal
research and legal practice while contributing to the modernization of legal education, see Arthur Dyevre,
Fixing European Law Schools, EUR. REV. PRIV. LAW (2017). Formatted: English (United States)
36
See http://www.image-net.org/ (accessed 9 November 2020).
37
See https://www.tagtog.net/ (last accessed 3 March 2021).
38
https://webanno.github.io/webanno/ (last accessed 4 March 2021).
one. Dividing the data into train and test sets is called the “holdout” procedure and is only
one of many sampling procedures. When the number of annotated documents is small (less
than 1000), it is recommended to use some “cross-validation” procedure. Cross-validation
procedures begin by dividing the annotated documents into several folds (e.g. 10). One of the
folds then serves as test set while the algorithms are trained on the remaining folds. This
process is then iterated with a different fold until every fold has served as test set.
Performance is evaluated by looking at the average across test folds. This way cross-validation
ensures that as much data as possible is used for training.
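The sketch below illustrates the whole procedure with scikit-learn, using toy texts and labels as stand-ins for an annotated corpus: a logistic regression classifier over a document-term matrix, evaluated with 10-fold cross-validation:

```python
# Supervised classification with 10-fold cross-validation in scikit-learn.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

texts = ["The directive has direct effect.",
         "Damages under national tort law.",
         "Preliminary reference under Article 267 TFEU.",
         "A purely domestic contract dispute."] * 25
labels = [1, 0, 1, 0] * 25                 # 1 = "EU law", 0 = "no EU law" (toy annotations)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(clf, texts, labels, cv=10)  # train/test on rotating folds
print(scores.mean())                        # average performance across the ten test folds
```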
When we say that the train set serves to train “many versions” of an algorithm, we mean
many combinations of words correlated with the labels. How many versions of the algorithm
are fitted to the train data is for the researcher to decide in light of time and computational
constraints (fitting a broader range of possible combinations obviously takes more time).
All these competing versions of the algorithm are then tested against the test data. The
version that best predicts the human annotations in that set is then selected as the winner.
By way of illustration, we trained several algorithms to predict the labels “EU law” and “no
EU law” in the aforementioned student-annotated corpus of Belgian high court rulings.
Because this data set is relatively small (519 documents), we employed a cross-validation
procedure. We then fitted thousands of versions of a handful of popular algorithms: logistic
regression, Support Vector Machine (SVM), Random Forest and Sequential Neural Network.39
While explaining the technical specifications of these algorithms is beyond the scope of the
present paper, Table 3 reports the performance of the “best version” of each of these
algorithms.
39
For a concise explanation of these algorithms I refer the reader to Soner Yıldırım, 11 Most Common Machine
Learning Algorithms Explained in a Nutshell, MEDIUM (2020), https://towardsdatascience.com/11-most-
common-machine-learning-algorithms-explained-in-a-nutshell-cc6e98df93be (last visited Nov 9, 2020). For a
survey from the perspective of econometrics see Matthew Gentzkow, Bryan Kelly & Matt Taddy, Text as data,
57 J. ECON. LIT. 535–74 (2019).
The metrics reported in Table 3 are the ones typically used in supervised text-mining
classification tasks. Precision indicates the proportion of documents predicted to contain
references to EU law that truly do so. On this metric, logistic regression and sequential neural
network did best, achieving a precision of 95%. Recall measures the proportion of documents
human annotators labelled as featuring EU law which the algorithm was able to retrieve. Here
Random Forest did best, retrieving 88% of the documents thus labelled. F1 is a metric that
combines precision and recall into a single number. The Matthews Correlation Coefficient
(MCC) is yet another performance metric. It is recognized as the most reliable metric to
evaluate a binary classifier because it takes into account the proportion of true negatives
(documents predicted to feature no EU law and do not), false negatives (documents predicted
to feature no EU law but that actually do), true positives (documents predicted to feature EU law that
really do) and false positives (documents predicted to feature EU law but which do not). Here
Random Forest performs best with MCC = 0.77.
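All four metrics are readily computed from the true and predicted labels; a sketch with scikit-learn:

```python
# Computing the reported metrics from true and predicted labels.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # human annotations (1 = features EU law)
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # a classifier's predictions

print("precision:", precision_score(y_true, y_pred))  # share of predicted 1s that are correct
print("recall:   ", recall_score(y_true, y_pred))     # share of true 1s that were retrieved
print("F1:       ", f1_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
```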
Similar BOW supervised approaches have variously been used to predict the outcome of ECHR
cases40; the ideological direction of US federal opinions41 and to detect unfair clauses in online
terms of service.42
The new state-of-the-art in supervised document classification draws its strength from
several advances. The first is a revolutionary self-attention mechanism, known as
“transformer”, which supports rich, contextualized representations of lexical and sentence
meaning.43 The second is a new set of training methods: models are trained to predict target words
and whether two sentences appear next to each other. The third is greater leverage of
transfer learning. Models are pre-trained, without human supervision, on vast repositories of
texts. This knowledge can then be transferred to “local” supervised tasks with additional fine-
tuning steps.
Based on a deep neural network architecture, BERT is able to focus attention on a given word
in a sentence while simultaneously identifying the context of all the other words in relation
to that word. The “static”, type-based, word embeddings discussed in the previous section
represent a word as a vector of co-occurrences with cosine similarity scores reflecting co-
occurrence frequencies. This permits static word embedding to handle synonymy (if, for
instance, “car” and “vehicle” are used to mean the same thing they will have high cosine
40
Masha Medvedeva, Michel Vols & Martijn Wieling, Using machine learning to predict decisions of the
European Court of Human Rights, 28 ARTIF. INTELL. LAW 237–266 (2020).
41
Hausladen, Schubert, and Ash, supra note 32.
42
Ranera, Solano, and Oco, supra note 31.
43
Ashish Vaswani et al., Attention is all you need, in ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 5998–
6008 (2017).
44
Jacob Devlin et al., Bert: Pre-training of deep bidirectional transformers for language understanding, ARXIV
PREPR. ARXIV181004805 (2018).
similarity score), but not polysemy or co-reference resolution (to determine what a pronoun
refers to). The vector representing the word “party”, for instance, will not differentiate
between party as in “political party” and the party to a legal case. In a large and relatively
diverse corpus, the vector is thus liable to assign high cosine similarity to words associated
with both usages (e.g. “political” and “court”). By contrast, transformer models like BERT go
beyond generalizing across contexts. They represent words as dynamic, token-based vector
embeddings, thereby coming much closer to capturing the particular, sentence-specific
context of occurrence of a word. This, in turn, enables BERT to handle polysemy and co-
reference resolution much better than previous language models.
The original BERT was trained on a giant corpus of books (the 800-million-word BooksCorpus) and
Wikipedia pages (2.5 billion words) without human supervision by simply feeding it raw texts.
Yet the power of BERT for supervised classification lies in the possibility to further fine-tune
the pre-trained BERT on a “local” data set. What has been learned from the giant corpus can
thus be transferred to the local, smaller data set of direct interest to the researcher.
Obviously, there are many linguistic patterns that no algorithm will be able to learn from a
small data set. But a small data set may also instantiate specific patterns absent in the giant
data set. In short, transfer learning helps combine the strengths of both data sets. Technically,
local fine-tuning adds an additional layer of neurons to the neural network, thereby
incorporating the local knowledge into the larger model.
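A minimal fine-tuning sketch using the Hugging Face transformers library is shown below; the model name, texts and labels are illustrative placeholders for the researcher's own annotated data:

```python
# Fine-tuning a pre-trained BERT classifier with the transformers library.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # adds a fresh classification layer

texts = ["The directive was annulled.", "The applicant seeks damages."]  # placeholders
labels = [1, 0]                                                          # placeholders
enc = tokenizer(texts, truncation=True, padding=True)

class LocalDataset(torch.utils.data.Dataset):
    def __init__(self, enc, labels):
        self.enc, self.labels = enc, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="out", num_train_epochs=3),
                  train_dataset=LocalDataset(enc, labels))
trainer.train()   # updates the pre-trained weights on the local, labelled data
```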
BERT has been shown to outperform other algorithms on a wide range of natural language
processing tasks.45 One study has shown BERT to perform well at predicting the issue area
codes of EU legislative acts.46
Table 4 reports the confusion matrix and performance metrics of a BERT model trained to
predict whether EU legislative acts will be litigated. The data set was built by matching EU
legislative acts in the EUR-Lex database to CJEU rulings. Only 3% of all EU legislative acts are
ever litigated and the probability that a given piece of legislation will be litigated in a particular
case is very low.
When an outcome is a rare event, as with our example, it is important to think carefully about
what the bar for a good model should be for the task at hand. Indeed, inexperienced lawyers
and laypeople are often impressed by headline metrics like “90% accuracy”. However,
achieving 90% correct classifications may, in many settings, be indicative of a poor
performance. In fact, it all depends on the task and data set. With our EU litigation data set,
it would have been easy to achieve 97% accuracy, since a model predicting that EU legislative
acts are never litigated would be right 97% of the time. So here accuracy is a misleading metric
and precision and recall for the rare outcome provide a better gauge of performance.
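The point is easily verified: a “classifier” that always predicts the majority class scores 97% accuracy on such data while never identifying a single litigated act. A sketch with scikit-learn's DummyClassifier:

```python
# A majority-class baseline on a data set with 3% positive outcomes.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([1] * 300 + [0] * 9700)   # 10,000 toy acts, 3% litigated
X = np.zeros((10_000, 1))              # features are irrelevant to the dummy

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

print(accuracy_score(y, pred))   # 0.97 -- looks impressive
print(recall_score(y, pred))     # 0.0  -- yet not a single litigated act is found
```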
Table 4 reports results for a sub-set of the data, where CJEU decisions featuring EU legislative
acts have been deliberately oversampled. Oversampling the rare outcome is important to
ensure that the algorithm has enough information to learn the patterns associated with this
outcome.
45
Id.
46
Ilias Chalkidis et al., Large-scale multi-label text classification on eu legislation, ARXIV PREPR. ARXIV190602192
(2019).
While it is certainly possible to improve on these results through further local fine-tuning, a
precision of 0.65 (i.e. out of 1884 predicted to be litigated 1220 actually are) and a recall of
0.81 (i.e. out of 1500 litigated 1220 were predicted to be so) are encouraging results.
Table 4. Confusion matrix and performance metrics of a BERT model trained to predict
whether an EU legislative act will be litigated before the CJEU
Since the release of the first BERT, new variants of BERT have appeared, pre-trained on a wide
range of general (RoBERTa) or domain-specific corpora (BioBERT, sciBERT…) in a variety of
languages (e.g. RobBERT in Dutch, FlauBERT in French, etc.). Multi-lingual BERT models,
simultaneously pre-trained on multiple languages, have been shown to support zero-shot
cross-lingual model transfer in which task-specific annotations in one language are used to fine-
tune a model subsequently applied to classify documents in another language.47 BERT models
pre-trained on large collections of legal documents have also been released to assist with
legal classification and prediction tasks.48
The arrival of BERT has triggered an AI race where research teams at big tech firms are vying
to attain ever-higher performance with increasingly complex transformer language models:
RoBERTa (Facebook), XLNET (Google), GPT-2 (OpenAI), Turing NLG (Microsoft)… The latest such
model to outperform its rivals, GPT-3 from OpenAI, boasts 175 billion parameters (by
comparison BERT has only 110 million parameters). The pace of technological development
holds out great promise for the future of legal text-mining research and natural legal language
understanding.
While transformers have just come along and applications to the legal domain are only
starting to appear in publications and conference proceedings, a paper by Evan Gretok, David
Langerman and Wesley Oliver provides an interesting illustration of the application of
transformer models to the study of legal doctrines. The authors trained transformer-based
algorithms to classify rulings pertaining to the Fourth Amendment of the US Constitution
47
Telmo Pires, Eva Schlinger & Dan Garrette, How multilingual is multilingual bert?, ARXIV PREPR.
ARXIV190601502 (2019); Sean MacAvaney, Luca Soldaini & Nazli Goharian, Teaching a New Dog Old Tricks:
Resurrecting Multilingual Retrieval Using Zero-shot Learning, in EUROPEAN CONFERENCE ON INFORMATION RETRIEVAL
246–254 (2020).
48
Chalkidis et al., supra note 46.
depending on whether they applied a bright-line or a totality-of-the-circumstances rule. The
best model (based on BERT) achieves accuracy of 92 per cent.49
As researchers begin to realize the potential of Natural Language Processing for large-scale
doctrinal analysis, we should expect to see many studies along these lines in the near future.
In the multi-lingual context of continental Europe, researchers may further seek to leverage
the power of multilingual transformers to develop legal documents classifiers or predictors
that can be deployed across multiple jurisdictions.
Learning Text-Mining Methods
How can lawyers with no prior training in machine learning or data science get started?
One answer (at least for the motivated reader) is to learn a programming language like Python
either by following one of the many free online tutorials or by taking a class at a nearby
university campus. Python50 is the language of choice for machine learning, text-mining and
data harvesting tasks and the most popular among researchers and developers. Its ecosystem
of libraries supports the latest models and algorithms. While some lawyers may find
the mention of “programming” off-putting, Python is actually a very intuitive programming
language. Moreover, the libraries provide many shortcuts which make it possible to complete
a task with very few lines of code.
Both Python and R, along with their libraries, are entirely open source. They all can be
downloaded and installed from the internet. The same goes for the pre-trained embeddings
and transformers mentioned in this paper (except for GPT-3).
Finally, for those who would prefer to avoid any kind of programming, RapidMiner comes
with a graphical user interface to carry out end-to-end text-mining tasks without writing
code.51 Unlike Python and R, RapidMiner is a commercial platform. Yet its free version
supports a wide range of supervised as well as unsupervised methods for data sets with up to
10,000 rows.
Conclusion
Text-mining and natural language understanding have been making great strides in recent
years. Some of these techniques are at the heart of the hyper-hyped “AI revolution” and are
fueling the development of legaltech.
49
Evan Gretok, David Langerman & Wesley M. Oliver, Transformers for Classifying Fourth Amendment
Elements and Factors Tests, LEG. KNOWL. INF. SYST. JURIX 63–72 (2020). Formatted: English (United States)
50
https://www.python.org (accessed 9 November 2020).
51
https://rapidminer.com (accessed 9 November 2020).
To be sure, anyone who has actually attempted to use the techniques surveyed here will have
realized that algorithms do not process language the way humans do. All techniques, even
the most advanced ones, have limitations. Yet, thanks to their scalability, they open up new
possibilities for legal research to explore and canvass vast repositories of legal documents.
There exist many variants of the techniques reviewed in this paper and many more tasks to
which they either already have or may potentially be applied. However, I hope that the
illustrations I gave and the techniques I surveyed give the reader a sense of the potential that
these techniques offer for academic legal research.