Text-Mining for Lawyers
Arthur Dyevre
Abstract
Many questions facing legal scholars and practitioners can only be answered
by analysing and interrogating large collections of legal documents: statutes,
treaties, judicial decisions and law review articles. I survey a range of novel
techniques in machine learning and natural language processing – including
topic modelling, word embeddings and transfer learning – that can be applied
to the large-scale investigation of legal texts.
I am grateful to Dr. Nicolas Lampach, Dr. Timothy Yu-Cheung Yeung, Monika Glavina, Kyra Wigard and Nusret
Ipek for invaluable research assistance. I acknowledge financial support from European Research Council
Horizon 2020 Starting Grant #638154 (EUTHORITY).
Much of the information of interest to lawyers and legal scholars comes in the form of texts.
Whether briefs, contracts, court rulings, law review articles, legislative acts, treaties,
newspapers or blog posts, all are either legal documents themselves or documents about the
law. Retrieving, analyzing, commenting, relating and expounding these documents has been
the bread and butter of legal practice and legal scholarship alike for centuries.
Lawyers deal in words and the law can be viewed as a vast and complex network of interrelated
texts, as illustrated in Figure 1. The function of this discourse is not only to announce legal
rules and how they apply to a particular set of facts but also to explain or summarise them in
more succinct or more accessible language – which is understood to be one of the core
functions of traditional, doctrinal scholarship.
While the study of legal texts is at least as old as academic legal scholarship, what is new is
that a whole range of text-mining techniques have emerged to assist the legal community in
navigating and analyzing the ever-expanding sea of legal and law-related documents. These
techniques rely on recent advances in machine learning and natural language processing.
The media hype about Artificial Intelligence (AI) occasionally leads to exaggerated claims
about the capabilities of these techniques. Except for the simplest legal tasks, robot lawyers
are not yet around the corner. Nor are fully automated robot judges (provided that robot
judges are even desirable, which is, at least, questionable). However, even if the media hype
(sometimes amplified by legal scholars) paints a misleading picture of what AI can achieve, it
would be at least equally wrong to dismiss these techniques as irrelevant to legal practice or
legal scholarship. This is true even for those who see themselves as hardcore black-letter law
scholars. The now famous Gartner Hype Cycles tell us that perceptions of AI advances oscillate
between peaks of inflated expectations and troughs of disillusionment, before reaching a
plateau of productivity.1
Researchers with experience in text-mining applications in the legal domain recognize that
text-mining techniques cannot (yet) fully replace careful human reading. Yet these
technologies are already sufficiently mature and progressing at a breakneck pace to deliver
substantial advances. While increasingly popular in the interdisciplinary fields of Law &
Economics, Empirical Legal Studies and Law & Politics,2 text-mining methods are also directly
relevant to the work of doctrinal legal scholars. Indeed, one way to view them is as
augmented doctrinal reality.
1
See https://www.gartner.com/smarterwithgartner/5-trends-drive-the-gartner-hype-cycle-for-emerging-
technologies-2020/ (Accessed 2 March 2021).
2
For a review see Jens Frankenreiter & Michael A. Livermore, Computational methods in legal analysis, 16
ANNU. REV. LAW SOC. SCI. 39–57 (2020); for reflections and illustrations of the use of machine learning and
natural language processing methods in empirical legal studies see MICHAEL A. LIVERMORE & DANIEL NAHUM
ROCKMORE, LAW AS DATA: COMPUTATION, TEXT, & THE FUTURE OF LEGAL ANALYSIS (2019).
Figure 1. Law as text.
The present contribution aims to introduce these techniques to jurists unfamiliar with
machine learning and natural language processing or who may only have a faint notion of the
uses to which these tools can be put. To that end, I shall first describe how data-harvesting
methods can be deployed to collect large collections of legal documents. I will then proceed
to explain how text is transformed into input data for text-mining tasks. Next, I will offer an
overview of the text-mining techniques themselves, distinguishing supervised and
unsupervised methods and walking the reader through a number of examples from the
EUTHORITY Project (www.euthority.eu). Finally, because I expect that many readers might be
interested in learning some of the reviewed techniques, I will say a few words about practical
software implementation.
The target audience of the present paper are continental European legal scholars. Because of
this deliberate focus, the discussion excludes tasks and questions – such as
contract review or document assembly – that are important in legal practice3, but of lesser
relevance to academic legal research, as traditionally understood in continental Europe. Nor
do I engage with matters such as causal inference that are central to the integration of text-mining
and machine learning approaches in empirical legal studies and Law & Economics. 4 Also, my
aim is to introduce text-mining methods in terms that my target audience (hopefully) will find
understandable. For that reason, I eschew mathematical notation and technical jargon to
focus on the underlying conceptual intuitions with the help of concrete illustrations.
Obviously, this comes at the price of precision. But I hope that this sacrifice comes with the
benefit of lowering the barrier to access. It is also worth mentioning at the outset that the
scope of the present review is, by its very nature, limited. Text-mining and natural language
processing have become vast fields, currently progressing at a breakneck pace that is
possibly unmatched in any other field of scientific inquiry. So to pretend that this survey is, in
any sense, comprehensive would be silly.
3
Efforts to automate these tasks have been an important focus of the emerging Legal Tech scene, see Robert
Dale, Law and word order: NLP in legal tech, 25 NAT. LANG. ENG. 211–217 (2019).
4
See LIVERMORE AND ROCKMORE, supra note 2.
The present contribution assumes that legal scholars, with or without prior training in
statistics or empirical methods, can become not just intelligent consumers but also active
users of this panoply of powerful techniques. Readers interested in applying computational
textual methods will find some pointers in the Section on “Learning Text-Mining Methods”.
Harvesting Legal Data
Computerised text-mining methods require that texts be in digital form. Luckily, millions of
legal documents are now available at a few clicks in electronic repositories and legal databases.
The degree of exhaustiveness of these repositories varies widely from jurisdiction to
jurisdiction. At best, judicial databases offer access to all published decisions. Often it will only
be to a subset of these decisions, with older rulings typically less likely to make the cut.
Because the universe of documents is somewhat smaller, legislative databases usually fare
better, although, here too, there are jurisdictional and cross-national disparities.5 As official
gazettes are increasingly published digitally, they potentially represent a treasure trove of
legal data.
When documents are not available in digital format, it is still possible to convert them to this
format using scanning combined with Optical Character Recognition (OCR). OCR works better
with more recent, undamaged texts than with old dusty casebooks or well-worn legal
treatises. However, the technology has made huge strides, thanks in large part to machine
learning (which helps guess semi-erased words or phrases). It is now even possible to
digitalise hand-written documents,6 opening up new possibilities for legal historians to scour
old manuscripts.
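To make this concrete, the snippet below sketches a minimal OCR pipeline in Python using the open-source Tesseract engine via the pytesseract library; the input file name and language setting are hypothetical placeholders:

```python
# Minimal OCR sketch: convert a scanned page into machine-readable text.
# Assumes the Tesseract engine plus the pytesseract and Pillow libraries;
# "scanned_ruling.png" is a hypothetical input file.
from PIL import Image
import pytesseract

page = Image.open("scanned_ruling.png")
text = pytesseract.image_to_string(page, lang="deu")  # "deu" for a German-language document
print(text[:500])  # inspect the first 500 recognised characters
```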
When done manually, assembling a large collection of legal documents for a text-mining
project can be excruciatingly time-consuming (try to download all European Court of Justice
decisions since 1954). However, data-harvesting techniques can make this step considerably
easier. Using libraries designed for this purpose in popular programming languages like R and
Python, it is possible to download the entire content of EUR-Lex (the EU law database) with
less than five lines of code. Web-scraping, as the method is commonly referred to, is now the
chief technique to collect data in social scientific disciplines.7 As scientists, including in physics
and medicine, have turned to text-mining methods to summarise vast collections of peer-
reviewed papers, publishers (notably Oxford University Press and Elsevier) have made their
journal collections available. The terms and conditions of commercial and non-commercial
databases may sometimes explicitly prohibit web-scraping and there remain some
uncertainties about when web-scraping may be prohibited even for non-profit, purely
academic research purposes.
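By way of illustration, the following sketch shows what a basic harvesting script might look like in Python, using the widely used requests and BeautifulSoup libraries. The URL and the page structure are hypothetical placeholders, and any real scraping exercise should of course comply with the terms of the database being scraped:

```python
# Minimal web-scraping sketch: download every ruling linked from an index page.
# The URL and the HTML structure are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

index = requests.get("https://example-court-database.eu/rulings")
soup = BeautifulSoup(index.text, "html.parser")

for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.endswith(".html"):                        # keep only links to ruling pages
        ruling = requests.get(href)
        with open(href.split("/")[-1], "w", encoding="utf-8") as f:
            f.write(ruling.text)                      # save the raw page locally
```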
5
At European Union level, EUR-Lex is fairly comprehensive, with regard to both legislative acts and case law.
EU law – EUR-Lex, https://eur-lex.europa.eu/homepage.html (last visited Nov 9, 2020). National databases
are typically less complete.
6
Digitize Your Notes With Microsoft Computer Vision API, NORDIC APIS (2017),
https://nordicapis.com/digitize-your-notes-with-microsoft-vision-api/ (last visited Nov 9, 2020).
7
Nicholas J. DeVito, Georgia C. Richards & Peter Inglesby, How we learnt to stop worrying and love web
scraping, 585 NATURE 621–622 (2020).
From Text to Data
When we read a text our brain parses it applying our knowledge of semantics, syntax and
context. In any language, the stock of words is finite, but syntactic rules allow the construction
of infinitely many sentences from this finite vocabulary. Moreover, humans are able to
communicate more than they say or write by taking the context into account. That is why we
ascribe different meanings to the sentence “I would like a table” when uttered in a restaurant
and when uttered in a furniture shop.8 While the language of legal documents – including
contracts, statutes and judicial opinions – can diverge, sometimes significantly (``Any proviso
to the contrary notwithstanding’’), from everyday language, these basic principles of linguistic
cognition and interpersonal communication are equally valid in the legal domain as in other
areas of human activity.
Text-mining methods do not parse texts quite the same way the human brain does. Instead,
these methods typically involve a good deal of complexity reduction. This may seem
surprising to those less well-versed in machine learning. But even the most advanced natural
language processing algorithms are still based on statistical principles. Texts are represented
as numbers and the algorithms look for patterns in these numbers. The ability to detect
patterns depends on the amount of textual data and the sophistication of the algorithm, but
the basic principle remains the same, including for the most cutting-edge techniques. In that
sense, it is not entirely wrong to say that machine learning algorithms are still quite dumb.
Yet their power stems from their ability to leverage the brute force of computing to arrive at
useful (and sometimes surprisingly good) approximations.
Figure 2. From texts to a document-term matrix.

Text 1: “The law shall apply to agreements between non-resident firms.”
Text 2: “Firms are required to abide by the terms of the present law.”

Document-term matrix:

         the  firms  shall  are  apply  …
Text 1    1     1      1     0     1    …
Text 2    2     1      0     1     0    …
Until recently, most text-mining methods relied on what is known as the bag-of-words (BOW)
approach. To see what this amounts to, let’s assume, for the sake of exposition, that we have
a corpus with two texts, Text 1 and Text 2, as in Figure 2. The BOW approach involves
converting texts to sequences of word counts and corpora to document-term matrices. The
sequence of word counts representing a text is called a “vector”. This vector contains counts
of all the words occurring in that text and zeros for the words occurring in the other texts but
not in that particular text. For Text 1 the zeros will represent all the words that appear in Text
8
DAN SPERBER & DEIRDRE WILSON, RELEVANCE: COMMUNICATION AND COGNITION (1996).
2 but not in Text 1, and vice-versa. In a large corpus spanning a vocabulary of millions of
words, the vector of word counts representing a text will contain mostly zeros – accounting
for all the words that occur in other texts but not in the one under consideration.
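Readers who wish to see the BOW representation in action can reproduce the document-term matrix of Figure 2 in a few lines of Python; the sketch below assumes the scikit-learn and pandas libraries:

```python
# Rebuild the document-term matrix of Figure 2 with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = [
    "The law shall apply to agreements between non-resident firms.",   # Text 1
    "Firms are required to abide by the terms of the present law.",    # Text 2
]

vectorizer = CountVectorizer()            # default: lower-cased unigram bag-of-words
dtm = vectorizer.fit_transform(texts)     # sparse document-term matrix

df = pd.DataFrame(dtm.toarray(),
                  columns=vectorizer.get_feature_names_out(),
                  index=["Text 1", "Text 2"])
print(df[["the", "firms", "shall", "are", "apply"]])  # matches the counts in Figure 2
```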
To keep some phrases such as “European Union” or “Court of Justice” together instead of
treating their component words as distinct lexemes, it is possible to throw some bigrams or
trigrams into the document-term matrix. Think of an n-gram as a contiguous sequence of
words. A bigram is a sequence of two words; a trigram a sequence of three words, and so on.
Turned into a bigram “European Union”, for example, becomes “European_Union” whereas
“Court of Justice” becomes the trigram “Court_of_Justice”. These n-grams can then be
processed just as individual words (unigrams).
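In the scikit-learn implementation used above, for instance, n-grams can be added by setting a single parameter (a sketch; note that the library joins n-grams with a space rather than an underscore):

```python
# Adding bigrams and trigrams to the document-term matrix.
from sklearn.feature_extraction.text import CountVectorizer

ngram_vec = CountVectorizer(ngram_range=(1, 3))  # unigrams, bigrams and trigrams
ngram_vec.fit(["The Court of Justice interprets European Union law."])
print(ngram_vec.get_feature_names_out())
# alongside the unigrams, the vocabulary now contains 'european union' (a bigram)
# and 'court of justice' (a trigram)
```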
This modus operandi may strike many as a crude simplification. Yet, crude as it may be, this
simplification can nonetheless produce useful results, as we shall see.
It is easy to see, however, that progress in modelling language and improvements in the
performance of downstream applications – in law just as in other fields – ultimately required
moving beyond the BOW paradigm to develop richer representations of
vocabularies while capturing more of the context and rules of syntax.
As we will see, static word embedding models such as Word2Vec have taken one significant step
in that direction by representing words by their co-occurrence associations. These methods
reflect the emergence of a new paradigm building on notions from distributional linguistics,
notably the intuition that a word is defined by the company it keeps.
Cutting-edge methods like transformers have taken the field several steps further into this
new paradigm. Pre-trained on giant corpora, transformer models like Google’s BERT rely on
a contextualised representation of word usage, enabling them to handle polysemy and to
parse the reference of pronouns – a remarkable achievement that constitutes a major
milestone in the development of AI language models.
Note, though, that while these novel techniques do not require converting raw texts to a
document-term matrix, they still require texts to be in digitalised, machine-readable format.
Unsupervised Techniques
Computer scientists and machine learning scholars typically speak in terms of tasks –
information retrieval, clustering, summarizing, forecasting, etc. – or in terms of whether the
9
Some text-mining tasks such as authorship identification require a distinct approach to pre-processing.
Indeed, because pronouns and prepositions are markers of personal style, it is common to restrict the
document-term matrix to this class of words and to exclude nouns, verbs and adjectives.
method or algorithm operates with human-labelled documents or not – supervised vs
unsupervised.
Translated into more familiar language, information retrieval is what jurists do when they
search a document collection for a specific set of documents: e.g. entering a list of keywords
into a database search engine to retrieve all judicial rulings addressing a particular issue.
Similarly, clustering is what lawyers do when they try to sort out documents into categories:
e.g. the themes to which law review articles relate or the topics coming up in judicial rulings.
Turning long documents into more easily digestible summaries is also something that lawyers
do on a routine basis. Prediction is something that one may not intuitively associate with
texts. Yet words, whether from legal briefs or other textual inputs, can also serve to
predict events or behaviours.
Some techniques and machine learning algorithms have been specifically designed for
particular tasks. Yet several methods, some supervised, others unsupervised, may sometimes
come into consideration for the same task, in which case the optimal choice should ultimately
depend on the specific research question of interest to the legal analyst.
Word Clouds
Word cloud plots are arguably one of the most familiar and simplest text-mining methods. A
word cloud simply plots words according to their aggregate frequency in the document-term
matrix. Illustrated in Figure 3 is a slightly more sophisticated word cloud, known as a
“comparison cloud”.10 It is based on a corpus compiling all European Court of Justice rulings
up to 2015 (over 12,000 documents). Plotted are not the most frequent words in the overall
corpus but the words that are most distinctive of the three main procedures: annulments
(Article 263 Treaty on the Functioning of the European Union (TFEU)); infringements (Article
258 TFEU) and preliminary rulings (Article 267 TFEU).
10
The size of a word reflects the deviation of its rate of occurrence in a document from its average rate
across documents. Suppose p(i,j) is the rate at which word i occurs in document j and
p̄(i) = Σ_j p(i,j) / (number of documents) its average rate across documents. Word size is then
proportional to p(i,j) − p̄(i).
Word clouds are popular and easy to interpret, but are rather crude tools when it comes to
detecting more granular patterns. In some applications pre-processing steps, such as
restricting the document-term matrix to certain parts of speech (e.g. nouns or adjectives) may
help make them more informative. But limitations remain.
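For readers who want to experiment, the sketch below generates a basic frequency word cloud with the Python wordcloud package on a toy stand-in text; the comparison cloud of Figure 3 requires more specialised plotting routines:

```python
# Basic frequency word cloud with the wordcloud and matplotlib packages.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# toy stand-in for the concatenated text of a corpus
text = "regulation directive court member state court regulation taxation court"

cloud = WordCloud(background_color="white").generate(text)  # word size ~ frequency
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```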
A notch more advanced are Principal Component Analysis (PCA) and Latent Semantic Analysis
(LSA). Both are closely related and relatively old statistical techniques to arrange large arrays
of data into more interpretable patterns. In the field of text-mining, they fundamentally serve
as unsupervised clustering methods to explore how texts and their words relate to each other.
PCA and LSA both work by seeking to compress the high-dimensional variations in word usage
– a corpus and the documents it comprises vary in as many ways as the number of words in
its vocabulary – into something more easily interpretable (and cognitively manageable) for
the human brain. The output of both statistical procedures is a smaller number of
dimensions on which words and documents are arrayed to facilitate the identification of
meaningful patterns of relatedness.
The patterns of interest and the words expressing them depend on the specific task. PCA, for
example, has been used to identify the authorship of The Federalist Papers.11 But both
methods can also be used to cluster legal documents around themes if one of the generated
dimensions allows such an interpretation.
11
David I. Holmes, Authorship attribution, 28 COMPUT. HUMANIT. 87–106 (1994).
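Computationally, LSA boils down to a truncated singular value decomposition of the (weighted) document-term matrix. The following sketch, on toy stand-ins for actual rulings, shows how documents and words can be arrayed on two dimensions with scikit-learn:

```python
# LSA as truncated singular value decomposition of a tf-idf weighted matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

rulings = [
    "The principle of conferral limits the powers of the Union.",
    "Member States retain sovereignty and constitutional identity.",
    "Goods exported within the internal market enjoy free movement.",
]  # toy stand-ins for the actual decisions

dtm = TfidfVectorizer().fit_transform(rulings)

lsa = TruncatedSVD(n_components=2)        # keep two dimensions
doc_positions = lsa.fit_transform(dtm)    # each ruling's position on the dimensions
word_loadings = lsa.components_           # each word's loading on the same dimensions
print(doc_positions)
```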
Figures 4 and 5 illustrate the use of LSA to explore oscillations in the position of the German
Federal Constitutional Court over European integration from the 1960s up to 2020. The corpus
comprises 26 rulings, with length varying from a little over 1,000 to more than 20,000
tokens.12
Here the basic interpretive assumption with which the output of the algorithm was
approached is that variations in jurisprudential stance should be reflected in the use of words
related to statehood and the internal market, with greater divergence in vocabulary
manifesting greater jurisprudential divergence.
Figure 4 shows the extent to which selected clusters of words tend to appear in the same
decisions. Unlike in word cloud plots such as the one depicted in Figure 3, the position of
words has a precise meaning here. Vertical and horizontal axes denote separate dimensions
while the position of words is itself related to the documents in which they occur. If two
documents share many words that are close to each other on a dimension, these documents
will also be close to each other on that particular dimension. For example, “sovereignty of the
people” (Volkssouveränität), “constitutional identity” (Verfassungsidentität), “enumerated
powers” (Einzelermächtigung) and “ultra vires” are close to each other on Dimension 1. These
words are also more closely associated with the Court’s more Eurosceptic judgments, like
Maastricht and Lisbon. “Duty to refer” (Vorlagepflicht), “direct” (unmittelbar), “effect”
(Wirkung), “export” (Ausfuhr), “good” (Ware) form another separate cluster on the same
dimension on the right-hand side. These words are also more closely associated with
integration-friendly rulings, like Kloppenburg, Banana or Lütticke.
12
For a discussion and assessment of the performance of LSA and other text-mining methods to map
jurisprudential change see Arthur Dyevre, The promise and pitfall of automated text-scaling techniques for the
analysis of jurisprudential change, ARTIF. INTELL. LAW 1–31 (2020).
Figure 4. Frames and phraseology of German constitutional rulings on Europe
A recent paper compared the performance of eight algorithms, including LSA, in mapping the
evolution of the German Court’s case law on European integration.13 The positions ascribed to
the decisions by the algorithms were evaluated against scholarly accounts and legal expert
13
Id.
ratings. A variant of LSA (Correspondence Analysis) performed best against scholarly accounts
in law journals, achieving a 75% pairwise correlation.14
Topic Modelling
A more recent technique specifically designed for clustering and automated classification is
topic modelling.15 Suppose you have a large collection of legal texts and you want to get a sense
of the themes and topics they pertain to. Instead of asking you to come up with a list of
categories or a classificatory scheme, topic modelling generates the categories and sorts out
the documents accordingly after you specified how many topics you wanted. At least this is
how the method is supposed to work.
In topic modelling, topics are modelled as probability distributions over words and documents
as probability distributions over topics. To generate the topics, the algorithm tries to find which probabilities
are most likely to have generated the observed documents.
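The workhorse topic modelling algorithm is Latent Dirichlet Allocation (LDA). The sketch below, on toy stand-ins for the rulings, shows how a model of this kind can be fitted with scikit-learn and how the most distinctive words of each topic can be extracted:

```python
# Latent Dirichlet Allocation with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "tax dividends corporate income member state taxation",
    "residence permit family member free movement worker",
    "customs duty import export tariff goods",
]  # toy stand-ins; the model behind Figure 6 was fitted on roughly 8,000 rulings

vec = CountVectorizer()
dtm = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)  # 25 topics in the text
doc_topics = lda.fit_transform(dtm)      # each document as a mixture of topics

words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-10:][::-1]]  # ten most distinctive words
    print(f"Topic {k}:", ", ".join(top))
```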
Figure 6 illustrates the output of a topic model of preliminary rulings (approximately 8,000
rulings). The number of topics was set at 25. What Figure 6 displays is one of these topics
represented by its 10 most distinctive words (note that the higher the beta value, the more
characteristic of the topic the word is). Looking at these “keywords” – which, it is essential to
understand, are not chosen by the researcher but emerge from the analysis – we may
plausibly summarize this topic as corporate taxation.
14
Id.
15
For a non-technical introduction see David M. Blei, Probabilistic topic models, 55 COMMUN. ACM 77–84
(2012).
The rulings most strongly associated with this topic bear out the interpretation. One of them is
a 2012 Grand Chamber ruling, which according to the model is 99% about topic 14. Here is a
quote from that ruling:
The High Court of Justice of England and Wales, Chancery Division, seeks, first, to
obtain clarification regarding paragraph 56 of the judgment in Test Claimants in the
FII Group Litigation and point 1 of its operative part. It recalls that the Court of Justice
held, in paragraphs 48 to 53, 57 and 60 of that judgment, that national legislation
which applies the exemption method to nationally-sourced dividends and the
imputation method to foreign-sourced dividends is not contrary to Articles 49 TFEU
and 63 TFEU, provided that the tax rate applied to foreign-sourced dividends is not
higher than the rate applied to nationally-sourced dividends and that the tax credit is
at least equal to the amount paid in the Member State of the company making the
distribution, up to the limit of the tax charged in the Member State of the company
receiving the dividends.
Topics can be visualised in various ways. In Figure 7, they are represented as a network in
which node size represents overall topic proportion in the overall document collection while
edge thickness corresponds to the weighted number of shared words. This way we can see
themes emerging from the topics.
Among other things, Figure 7 suggests that internal market and tax issues represent a big
chunk of what the CJEU does. However, social rights, residence rights and the recognition of
foreign judgments (private international law) also make for a substantial share of the cases
on which the Luxembourg judges sit.
If you think that 25 categories is too few to get a good sense of issue prevalence in the
Court’s case law, how about a topic model with 100 categories? Looking at Figure 8, we see
that such a model provides a more detailed picture, although we find the same themes
(internal market in the lower-right region, social and immigration issues in the left region).
It is also possible to construct dynamic topic models to study the evolution of case law over
time or “litigant” topic models to study how issues vary across litigant types.
Recent work has applied topic modelling to study relative issue emphasis across infringement,
annulment and preliminary rulings, highlighting how the CJEU’s case law is influenced by the
litigation agenda of case initiators (like the European Commission)16; to compare topic
salience in European Union legislation, CJEU rulings and contributions to the Common Market
Law Review17; to explore Dutch supreme court decisions18; and to demonstrate the lingering
centrality of market regulation in European Union law-making in the twenty-first century.19
While significant efforts have been expended on manually classifying the legal areas
addressed by US Supreme Court rulings, some authors have proposed topic modelling as a
more efficient and more accurate alternative.20 Work by Peter Grajzl and Peter Murrell further
illustrates how topic modelling can assist in exploring large collections of old legal texts. They
16
Arthur Dyevre & Nicolas Lampach, Issue attention on international courts: Evidence from the European Court
of Justice, REV. INT. ORGAN. 1–23 (2020).
17
Arthur Dyevre, Michal Ovadek & Monika Glavina, The Voices of European Law: Legislators, Judges and Law
Professors, forthcoming GER. LAW J. (2021).
18
Ylja Remmits, Finding the Topics of Case Law: Latent Dirichlet Allocation on Supreme Court Decisions (2017).
19
Nicolas Lampach, Wessel Wijtvliet & Arthur Dyevre, Merchant Hubs and Spatial Disparities in the Private
Enforcement of International Trade Regimes, INT. REV. LAW ECON. 105946 (2020).
20
Douglas Rice, Measuring the issue content of Supreme Court opinions, 7 J. LAW COURTS 107–127 (2019).
apply topic modelling to reports of cases heard by English courts between the fourteenth and
eighteenth centuries (N = 52,949).21
Word Embeddings
Tools like LSA, PCA and topic modelling are typical of the BOW paradigm. Word embeddings,
by contrast, are part of a new text-mining paradigm inspired by the defining principle of
distributional linguistics – “a word is defined by the company it keeps”.22
To explain how word embeddings work, the best is, again, to start with an example. Suppose
you want to investigate variations in attention to a particular phenomenon, e.g. politics and
populism in posts on a major legal blog. To measure attention to this concept, we might first
try to come up with a list of keywords (e.g. “politics”, “party”, “populism”...) capturing
attention to this phenomenon and then determine the extent to which our keywords are
actually matched in the document collection. However, this approach often delivers poor
results because the same phenomenon can be characterised in many different ways, leading
exact matches to either over- or underestimate the true number of relevant instances of
attention to the phenomenon in question. (The frustrating feeling is surely one that many
jurists have experienced when trying to retrieve documents via a keyword search in some
legal database.)
21
Peter Grajzl & Peter Murrell, A machine-learning history of English caselaw and legal ideas prior to the
Industrial Revolution I: generating and interpreting the estimates, 17 J. INSTITUTIONAL ECON. 1–19 (2021).
22
Tomas Mikolov et al., Distributed representations of words and phrases and their compositionality, in
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 3111–3119 (2013).
Table 1. Top 40 occurrence similarity scores for vector Politik + Parteien + Populismus from
embeddings (Word2Vec) trained on German-language contributions to the
Verfassungsblog
Word embeddings help deal with this problem by representing words not as frequencies –
the BOW approach – but as sequences (i.e. vectors) of occurrence similarities, generated
via a shallow neural network. For example, Table 1 displays the first 40 items in the vector of
occurrence similarities yielded by a word embeddings model trained on the German-
language contributions to the Verfassungsblog (a leading constitutional law blog) using the
Word2Vec algorithm. The vector corresponds to the words “Politik” (politics), “Parteien”
(parties) and “Populismus” (populism).23
Numbers next to the words in Table 1 indicate the cosine occurrence similarity. The closer it is
to 1, the more similar is the word’s context of occurrence to that of Politik + Parteien +
Populismus. Here the word exhibiting the highest cosine similarity score is “Eliten” (elites),
which makes sense since elite-bashing is a defining feature of populist discourse. Other terms,
including “Presse” (press), “Medien” (media), “Bürger” (citizen), frequently come up in populist
rhetoric, too. “Verwaltung” (administration) and “Instrumente” (instruments), though, are
less intuitively associated with politics, populism or partisan organisations.
Co-occurrence similarity here refers to the words that tend to appear around the target word.
How many words before and after the target word should be considered – the window size –
is one of the parameters that have to be set by the researcher before training an embedding
model. A window size of 5 means that two words before and two after the target word will be
considered; a window size of 9, four words before and four words after; and so on. The neural
network is then trained to predict either the target word from the surrounding words
(continuous bag-of-words method) or the surrounding words given the target word (skip-
gram method).
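A minimal training sketch with the gensim library is shown below; the tokenised sentences are toy stand-ins for the blog posts and the parameter values are purely illustrative. Note that gensim's window parameter counts the words on each side of the target, so window=2 corresponds to what the text calls a window size of 5:

```python
# Training a Word2Vec model with gensim on a tokenised corpus.
from gensim.models import Word2Vec

sentences = [
    ["politik", "parteien", "populismus", "eliten"],
    ["medien", "presse", "politik", "bürger"],
    ["verfassung", "gericht", "urteil", "senat"],
] * 50  # toy stand-ins, repeated so there is enough data to train on

model = Word2Vec(sentences,
                 vector_size=100,  # dimensionality of the embeddings
                 window=2,         # two words before and two after the target
                 sg=1,             # skip-gram; sg=0 gives continuous bag-of-words
                 min_count=1)

print(model.wv.most_similar(positive=["politik", "parteien", "populismus"], topn=10))
```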
As with machine learning and neural networks in general, the more data (texts) the better.
This is why instead of training embeddings from scratch on a relatively small collection of blog
posts, it may be preferable to use a pre-trained model built on a much larger corpus. Table 2
shows the words associated with Politik + Parteien + Populismus from a pre-trained
embeddings model constructed from all German Wikipedia pages, with a vocabulary of nearly
five million words.24 Pre-trained embeddings constructed from legal documents also exist.25
The cosine similarity scores are generally higher in Table 2 than in Table 1, which suggests
that the pre-trained model better captures contextual similarity. In fact, it assigns high
cosine similarity scores to typos like “poltk” (cosine = 0.846). This is because typos appear in
the same context as the word with the correct spelling. The similarity scores assigned to typos
23
Several word embedding algorithms exist, including Word2Vec, Fasttext and Glove. Here we relied on the
Word2Vec approach.
24
A wide range of word embedding models spanning multiple languages can be downloaded from a repository
made available by the Language Technology Group at University of Oslo, see http://vectors.nlpl.eu/repository
(accessed 12 November 2020).
25
Ilias Chalkidis & Dimitrios Kampas, Deep learning in law: early adaptation and legal word embeddings
trained on large corpora, 27 ARTIF. INTELL. LAW 171–198 (2019).
highlight how word embedding models handle synonymy, which represents a major advance
for legal information retrieval tasks.
That pre-trained embeddings can deliver better results than locally-trained embeddings (i.e.
embeddings trained on the corpus one actually wants to investigate) illustrates the notion of
transfer learning. What a model learns about language use from a very large corpus is often
transferable to smaller text collections.
Table 2. Top 40 occurrences for vector Politik + Parteien + Populismus from embeddings
trained on German Wikipedia pages
One powerful application of word embeddings is to generate weighted lexicons, which can
be utilised to detect attention to a particular phenomenon or concept of interest. Figure 9
plots the variation in attention to politics, parties and populism in German-language
contributions to the Verfassungsblog using the words contained in the vector Politik +
Parteien + Populismus to measure average attention to the underlying phenomenon.
Over time, the Verfassungsblog has been posting a growing number of English-language
contributions. Figure 10 charts attention to the same phenomenon in English-language posts
using the vector politics + parties + populism generated by Google’s pre-trained word
embedding model (Word2Vec) for English – which boasts a vocabulary of three million words
and phrases, trained on a Google News corpus of roughly 100 billion words.
Figure 10. Relative incidence of words relating to “politics”, “parties” and “populism” in
English-language contributions to the Verfassungsblog, 2012-2019
These are potentially interesting results for scholars interested in the evolution of European
constitutional law scholarship and a possible shift from a legalistic, narrowly doctrinal
conception of legal scholarship to one that pays greater heed to political behaviours and
social dynamics.26
To further illustrate the potential of word embeddings for attention detection and document
retrieval, note that we can vary the specification of vectors to improve results or to capture
conceptual nuances. The vector generated for politics alone will be different from the vector
generated for politics + parties + populism. But if we wanted to generate a vector for terms
associated with politics and political parties but not with populism, we could specify a vector
like politics + parties - populism. Remarkably, in the Google pre-trained model, specifying king
- man generates a vector in which the word with highest cosine similarity score is queen.
So, by comparison with document search engines based on exact keyword matching, word
embeddings provide a considerably more powerful tool to capture attention to concepts.
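With the gensim library, loading a pre-trained model and performing this kind of vector arithmetic takes a few lines (a sketch; the file name is a placeholder for whichever pre-trained embeddings one has downloaded):

```python
# Loading pre-trained embeddings and doing vector arithmetic with gensim.
from gensim.models import KeyedVectors

# placeholder path: any pre-trained word2vec-format file, e.g. the Google News vectors
wv = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

# politics + parties, with the populism component subtracted out
print(wv.most_similar(positive=["politics", "parties"], negative=["populism"], topn=40))

# the example from the text: king - man
print(wv.most_similar(positive=["king"], negative=["man"], topn=5))
```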
Word embeddings can also serve to trace semantic change and to measure stereotypes and
biases in large corpora.27 Elliott Ash, Daniel Chen and Arianna Ornaghi, for instance, use
embeddings to detect stereotypes in the opinions of US circuit court judges.28 A related study
by Douglas Rice, Jesse Rhodes and Tatishe Nteta has examined racial biases in a corpus
comprising over one million state and federal court opinions. The authors find
stereotypically African-American names to be systematically associated with more negative
words compared to stereotypically European-American names.29
26
Bart Caiepo & Federico Benetti, How Political Turmoil is Changing European Constitutional Law: Evidence
from the Verfassungsblog, VERFASSUNGSBLOG (2020), https://verfassungsblog.de/how-political-turmoil-is-
changing-european-constitutional-law-evidence-from-the-verfassungsblog/ (last visited Nov 9, 2020).
27
Studies adopting this approach have revealed the evolution of gender and ethnic stereotypes or the
changing connotations of the word “gay” see William L. Hamilton, Jure Leskovec & Dan Jurafsky, Diachronic
word embeddings reveal statistical laws of semantic change, ARXIV PREPR. ARXIV160509096 (2016); Nikhil Garg
et al., Word embeddings quantify 100 years of gender and ethnic stereotypes, 115 PROC. NATL. ACAD. SCI. E3635–
E3644 (2018).
28
ELLIOTT ASH, DANIEL L. CHEN & ARIANNA ORNAGHI, Stereotypes in High-Stakes Decisions: Evidence from US Circuit
Courts (2020).
29
Douglas Rice, Jesse H. Rhodes & Tatishe Nteta, Racial bias in legal language, 6 RES. POLIT. 2053168019848930
(2019).
Document Clustering with Word Embeddings: Doc2Vec
Closely related to the word embedding approach just described is a document clustering
technique known as Doc2Vec. It relies on the same representation of words. But instead of
training the neural network to predict only the target word or the surrounding terms, the network is also
trained to predict the documents in which they occur. Documents thus become associated
with word vectors.
Doc2Vec is similar to PCA/LSA in that it simultaneously relates words and documents. The
principal difference, though, is that Doc2Vec draws on a much more sophisticated word
representation.
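A minimal gensim sketch, with toy stand-ins for the tokenised rulings, is shown below; the two-dimensional t-SNE projection used for plots like Figure 11 can then be computed on the resulting document vectors with scikit-learn:

```python
# Doc2Vec document vectors with gensim, projected to 2-D with t-SNE.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.manifold import TSNE

tokenised = [
    ["annulment", "regulation", "council"],
    ["preliminary", "reference", "national", "court"],
    ["infringement", "commission", "member", "state"],
] * 30  # toy stand-ins for tokenised rulings

corpus = [TaggedDocument(words, [i]) for i, words in enumerate(tokenised)]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)

vectors = np.array([model.dv[i] for i in range(len(corpus))])
coords = TSNE(n_components=2, perplexity=5).fit_transform(vectors)  # 2-D map as in Figure 11
```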
Figure 11 shows a t-SNE plot of a Doc2Vec model of European Court of Justice rulings, with
colours denoting the procedure. The horizontal and vertical axes of a t-SNE plot are not
amenable to substantive interpretation. But spatial proximity reflects similarity in word
usage. Here the plot suggests some degree of overlap across procedures, but greater
heterogeneity in rulings originating in preliminary references.
Figure 11. T-SNE plot of Doc2Vec model of European Court of Justice rulings (colour
denotes procedure).
Looking at a large corpus of US Court of Appeals rulings, Daniel Chen and Elliott Ash have
explored a variety of possible uses of Doc2Vec for the analysis of judicial opinions.30
Because precedents play an important role in legal argumentation, several studies have
proposed Doc2Vec as a methodology to identify and measure case similarity.31
30
Elliott Ash & Daniel L. Chen, Case vectors: Spatial representations of the law using document embeddings,
LAW DATA ST. FE INST. PRESS ED M LIVERMORE ROCKMORE (2019).
31
Tereza Novotná, Document Similarity of Czech Supreme Court Decisions, 14 MASARYK UNIV. J. LAW TECHNOL.
105–122 (2020); Lorenz Timothy Barco Ranera, Geoffrey A. Solano & Nathaniel Oco, Retrieval of Semantically
Similar Philippine Supreme Court Case Decisions using Doc2Vec, in 2019 INTERNATIONAL SYMPOSIUM ON MULTIMEDIA
AND COMMUNICATION TECHNOLOGY (ISMAC) 1–6 (2019); Paheli Bhattacharya et al., Methods for computing legal
document similarity: A comparative study, ARXIV PREPR. ARXIV200412307 (2020).
Supervised Classification Methods
Unsupervised approaches produce models and output without human input, which may seem
to be a great advantage. However, the models and output generated by unsupervised
methods always require ex post human interpretation. There is no absolute guarantee that the
topics generated by a topic model will make sense or that the dimensions produced by an LSA
model will be interpretable. This is not necessarily a problem if unsupervised techniques are
primarily used for exploratory purposes. However, if one purports to rest an empirical
assertion on the results of unsupervised methods, some human validation of at least a subset
of these results may be required in order to demonstrate intersubjective validity.
Supervised methods, by contrast, do not require ex post validation because they seek to
“emulate” what humans do by discovering patterns in documents labelled by human
annotators prior to training.
Supervised approaches all require labelled documents. There are only two avenues to obtain
labelled data. The first is to rely on documents that other researchers have already annotated.
To measure the ideological direction of US federal court opinions, Carina Hausladen, Marcel
Schubert and Elliott Ash were able to leverage an existing database (the Songer Database)
where ideological direction had been hand-coded for a subset (5%) of federal cases. These
annotated opinions were then used to train and test a range of algorithms.32 Using labelled
data sets from outside the legal domain can be tempting. But results may then have to be
interpreted with caution. One study, for example, has sought to leverage academic papers
from moral philosophy which had been labelled either as “deontological” or
“consequentialist” to train machine learning classifiers to detect modes of moral reasoning in
US federal opinions.33 However, given the risk of low domain adaptation (the language of
academic papers and judicial opinions may diverge too much), the results of studies adopting
this strategy should be taken with a grain of salt.
When no labelled data set exists, the only way to obtain labelled data is to build it from scratch.
In many areas, supervised machine learning projects rely on crowdsourcing platforms such as
Amazon Mechanical Turk, where annotators recruited online tag documents for a modest
compensation (and so at a low cost for the researcher).34 However, crowdsourcing works best
when a task is simple, quick and straightforward. So the specificity, technicality and
complexity of legal language mean that crowdsourcing is not a well-suited approach for legal
projects.
32
Carina I. Hausladen, Marcel H. Schubert & Elliott Ash, Text classification of ideological direction in judicial
opinions, 62 INT. REV. LAW ECON. 105903 (2020).
33
Nischal Mainali et al., Automated classification of modes of moral reasoning in judicial decisions, in
COMPUTATIONAL LEGAL STUDIES (2020).
34
Catherine Grady & Matthew Lease, Crowdsourcing document relevance assessment with mechanical turk, in
PROCEEDINGS OF THE NAACL HLT 2010 WORKSHOP ON CREATING SPEECH AND LANGUAGE DATA WITH AMAZON’S MECHANICAL
TURK 172–179 (2010).
Law students potentially offer a solution to the document annotation challenge. Without
being accomplished legal experts, they tend to be more comfortable with legalistic language
and better at parsing judicial prose or statutory provisions. It is possible to integrate legal
annotation tasks in tutorials and interactive classes. In fact, annotating legal documents can
be viewed as an excellent exercise for students to practice and perfect legal analytic skills.35
Conducting annotation tasks with large groups of students demands a good workflow.
Annotators must first be trained to recognise the concepts and information which the labels
have been designed to capture. Documents must then be distributed to annotators and
annotated documents collected. To produce high-quality annotations, it is recommended to
have two or more annotators independently annotating the same document. Discrepancies
then have to be identified to compute inter-annotator agreement metrics. To bolster quality,
a reconciliation procedure can also be put in place. Software and online platforms have been
developed to facilitate the completion of document annotation tasks by teams of annotators.
One such annotation project, conducted in Leuven with law students annotating Belgian high court rulings, relied on the
cloud version of the TagTog37 platform (which the project was allowed to use free of charge
in exchange for making the labelled data public). Some software solutions, such as
WebAnno38, are open source and can thus be used free of charge, but require local server
installation – which can be technically involved, unless technical support is provided.
Bag-of-words Methods
As with unsupervised techniques, supervised techniques, too, initially all relied on the BOW
paradigm. Supervised BOW methods involved the same data preparation steps, including
converting texts into document-term matrix format. In a supervised setup, the document-term
matrix will look very similar, except that it will contain one or more additional columns for
the labels produced by human annotators.
Before trying out some classification algorithms, the next stage will be to divide the data into
train and test data. As its name suggests, the train set will serve to train many versions of the
algorithm, whereas the test set will serve to measure their performance and select the best
35
Conducting legal AI projects also help bring greater awareness of the potential of new technologies for legal
research and legal practice while contributing to the modernization of legal education, see Arthur Dyevre,
Fixing European Law Schools, EUR. REV. PRIV. LAW (2017). Formatted: English (United States)
36
See http://www.image-net.org/ (accessed 9 November 2020).
37
See https://www.tagtog.net/ (last accessed 3 March 2021).
38
https://webanno.github.io/webanno/ (last accessed 4 March 2021).
one. Dividing the data into train and test sets is called the “holdout” procedure and is only
one of many sampling procedures. When the number of annotated documents is small (less
than 1000), it is recommended to use some “cross-validation” procedure. Cross-validation
procedures begin by dividing the annotated documents into several folds (e.g. 10). One of the
folds then serves as test set while the algorithms are trained on the remaining folds. This
process is then iterated with a different fold until every fold has served as test set.
Performance is evaluated by looking at the average across test folds. This way cross-validation
ensures that as much data as possible is used for training.
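The sketch below illustrates the whole procedure with scikit-learn, using toy texts and labels as stand-ins for an annotated corpus: a logistic regression classifier over a document-term matrix, evaluated with 10-fold cross-validation:

```python
# Supervised classification with 10-fold cross-validation in scikit-learn.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

texts = ["The directive has direct effect.",
         "Damages under national tort law.",
         "Preliminary reference under Article 267 TFEU.",
         "A purely domestic contract dispute."] * 25
labels = [1, 0, 1, 0] * 25                 # 1 = "EU law", 0 = "no EU law" (toy annotations)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(clf, texts, labels, cv=10)  # train/test on rotating folds
print(scores.mean())                        # average performance across the ten test folds
```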
When we say that the train set serves to train “many versions” of an algorithm, we mean
many combinations of words correlated with the labels. How many versions of the algorithm
are fitted to the train data is for the researcher to decide in light of time and computational
constraints (fitting a broader range of possible combinations obviously takes more time).
All these competing versions of the algorithm are then tested against the test data. The
version that best predicts the human annotations in that set is then selected as the winner.
By way of illustration, we trained several algorithms to predict the labels “EU law” and “no
EU law” in the aforementioned student-annotated corpus of Belgian high court rulings.
Because this data set is relatively small (519 documents), we employed a cross-validation
procedure. We then fitted thousands of versions of a handful of popular algorithms: logistic
regression, Support Vector Machine (SVM), Random Forest and Sequential Neural Network.39
While explaining the technical specifications of these algorithms is beyond the scope of the
present paper, Table 3 reports the performance of the “best version” of each of these
algorithms.
39
For a concise explanation of these algorithms I refer the reader to Soner Yıldırım, 11 Most Common Machine
Learning Algorithms Explained in a Nutshell, MEDIUM (2020), https://towardsdatascience.com/11-most-
common-machine-learning-algorithms-explained-in-a-nutshell-cc6e98df93be (last visited Nov 9, 2020). For a
survey from the perspective of econometrics see Matthew Gentzkow, Bryan Kelly & Matt Taddy, Text as data,
57 J. ECON. LIT. 535–74 (2019).
The metrics reported in Table 3 are the ones typically used in supervised text-mining
classification tasks. Precision indicates the proportion of documents predicted to contain
references to EU law that truly do so. On this metric, logistic regression and sequential neural
network did best, achieving a precision of 95%. Recall measures the proportion of documents
human annotators labelled as featuring EU law which the algorithm was able to retrieve. Here
Random Forest did best, retrieving 88% of the documents thus labelled. F1 is a metric that
combines precision and recall into a single number. The Matthews Correlation Coefficient
(MCC) is yet another performance metric. It is recognized as the most reliable metric to
evaluate a binary classifier because it takes into account the proportion of true negatives
(documents predicted to feature no EU law and do not), false negatives (documents predicted
to feature no EU law but that actually do), true positives (documents predicted to feature EU law that
really do) and false positives (documents predicted to feature EU law but which do not). Here
Random Forest performs best with MCC = 0.77.
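All four metrics are readily computed from the true and predicted labels; a sketch with scikit-learn:

```python
# Computing the reported metrics from true and predicted labels.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # human annotations (1 = features EU law)
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # a classifier's predictions

print("precision:", precision_score(y_true, y_pred))  # share of predicted 1s that are correct
print("recall:   ", recall_score(y_true, y_pred))     # share of true 1s that were retrieved
print("F1:       ", f1_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
```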
Similar BOW supervised approaches have variously been used to predict the outcome of ECHR
cases40; the ideological direction of US federal opinions41 and to detect unfair clauses in online
terms of service.42
The new state-of-the-art in supervised document classification draws its strength from
several advances. The first is a revolutionary self-attention mechanism, known as
“transformer”, which supports rich, contextualized representations of lexical and sentence
meaning.43 The second is a new set of training methods: models are trained to predict target words
and whether two sentences appear next to each other. The third is greater leverage of
transfer learning. Models are pre-trained, without human supervision, on vast repositories of
texts. This knowledge can then be transferred to “local” supervised tasks with additional fine-
tuning steps.
Based on a deep neural network architecture, BERT is able to focus attention on a given word
in a sentence while simultaneously identifying the context of all the other words in relation
to that word. The “static”, type-based, word embeddings discussed in the previous section
represent a word as a vector of co-occurrences with cosine similarity scores reflecting co-
occurrence frequencies. This permits static word embedding to handle synonymy (if, for
instance, “car” and “vehicle” are used to mean the same thing they will have high cosine
40
Masha Medvedeva, Michel Vols & Martijn Wieling, Using machine learning to predict decisions of the
European Court of Human Rights, 28 ARTIF. INTELL. LAW 237–266 (2020).
41
Hausladen, Schubert, and Ash, supra note 32.
42
Ranera, Solano, and Oco, supra note 31.
43
Ashish Vaswani et al., Attention is all you need, in ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 5998–
6008 (2017).
44
Jacob Devlin et al., Bert: Pre-training of deep bidirectional transformers for language understanding, ARXIV
PREPR. ARXIV181004805 (2018).
similarity score), but not polysemy or co-reference resolution (to determine what a pronoun
refers to). The vector representing the word “party”, for instance, will not differentiate
between party as in “political party” and the party to a legal case. In a large and relatively
diverse corpus, the vector is thus liable to assign high cosine similarity to words associated
with both usages (e.g. “political” and “court”). By contrast, transformer models like BERT go
beyond generalizing across contexts. They represent words as dynamic, token-based vector
embeddings, thereby coming much closer to capturing the particular, sentence-specific
context of occurrence of a word. This, in turn, enables BERT to handle polysemy and co-
reference resolution much better than previous language models.
The original BERT was trained on a giant corpus of books (the 800-million-word BooksCorpus) and
Wikipedia pages (2.5 billion words) without human supervision by simply feeding it raw texts.
Yet the power of BERT for supervised classification lies in the possibility to further fine-tune
the pre-trained BERT on a “local” data set. What has been learned from the giant corpus can
thus be transferred to the local, smaller data set of direct interest to the researcher.
Obviously, there are many linguistic patterns that no algorithm will be able to learn from a
small data set. But a small data set may also instantiate specific patterns absent in the giant
data set. In short, transfer learning helps combine the strengths of both data sets. Technically,
local fine-tuning adds an additional layer of neurons to the neural network, thereby
incorporating the local knowledge into the larger model.
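A minimal fine-tuning sketch using the Hugging Face transformers library is shown below; the model name, texts and labels are illustrative placeholders for the researcher's own annotated data:

```python
# Fine-tuning a pre-trained BERT classifier with the transformers library.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # adds a fresh classification layer

texts = ["The directive was annulled.", "The applicant seeks damages."]  # placeholders
labels = [1, 0]                                                          # placeholders
enc = tokenizer(texts, truncation=True, padding=True)

class LocalDataset(torch.utils.data.Dataset):
    def __init__(self, enc, labels):
        self.enc, self.labels = enc, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="out", num_train_epochs=3),
                  train_dataset=LocalDataset(enc, labels))
trainer.train()   # updates the pre-trained weights on the local, labelled data
```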
BERT has been shown to outperform other algorithms on a wide range of natural language
processing tasks.45 One study has shown BERT to perform well at predicting the issue area
codes of EU legislative acts.46
Table 4 reports the confusion matrix and performance metrics of a BERT model trained to
predict whether EU legislative acts will be litigated. The data set was built by matching EU
legislative acts in the EUR-Lex database to CJEU rulings. Only 3% of all EU legislative acts are
ever litigated and the probability that a given piece of legislation will be litigated in a particular
case is very low.
When an outcome is a rare event, as with our example, it is important to think carefully about
what the bar for a good model should be for the task at hand. Indeed, inexperienced lawyers
and laypeople are often impressed by headline metrics like “90% accuracy”. However,
achieving 90% correct classifications may, in many settings, be indicative of a poor
performance. In fact, it all depends on the task and data set. With our EU litigation data set,
it would have been easy to achieve 97% accuracy, since a model predicting that EU legislative
acts are never litigated would be right 97% of the time. So here accuracy is a misleading metric
and precision and recall for the rare outcome provide a better gauge of performance.
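The point is easily verified: a “classifier” that always predicts the majority class scores 97% accuracy on such data while never identifying a single litigated act. A sketch with scikit-learn's DummyClassifier:

```python
# A majority-class baseline on a data set with 3% positive outcomes.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([1] * 300 + [0] * 9700)   # 10,000 toy acts, 3% litigated
X = np.zeros((10_000, 1))              # features are irrelevant to the dummy

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

print(accuracy_score(y, pred))   # 0.97 -- looks impressive
print(recall_score(y, pred))     # 0.0  -- yet not a single litigated act is found
```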
Table 4 reports results for a sub-set of the data, where CJEU decisions featuring EU legislative
acts have been deliberately oversampled. Oversampling the rare outcome is important to
ensure that the algorithm has enough information to learn the patterns associated with this
outcome.
45
Id.
46
Ilias Chalkidis et al., Large-scale multi-label text classification on eu legislation, ARXIV PREPR. ARXIV190602192
(2019).
While it is certainly possible to improve on these results through further local fine-tuning, a
precision of 0.65 (i.e. out of 1884 predicted to be litigated 1220 actually are) and a recall of
0.81 (i.e. out of 1500 litigated 1220 were predicted to be so) are encouraging results.
Table 4. Confusion matrix and performance metrics of a BERT model trained to predict
whether an EU legislative act will be litigated before the CJEU
Since the release of the first BERT, new variants of BERT have appeared, pre-trained on a wide
range of general (RoBERTa) or domain-specific corpora (BioBERT, sciBERT…) in a variety of
languages (e.g. RobBERT in Dutch, FlauBERT in French, etc.). Multi-lingual BERT models,
simultaneously pre-trained on multiple languages, have been shown to support zero-shot
cross-lingual model transfer in which task-specific annotations in one language are used to fine-
tune a model subsequently applied to classify documents in another language.47 BERT models
pre-trained on large collections of legal documents have also been released to assist with
legal classification and prediction tasks.48
The arrival of BERT has triggered an AI race where research teams at big tech firms are vying
to attain ever-higher performance with increasingly complex transformer language models:
RoBERTa (Facebook), XLNET (Google), GPT-2 (OpenAI), Turing NLG (Microsoft)… The latest such
model to outperform its rivals, GPT-3 from OpenAI, boasts 175 billion parameters (by
comparison BERT has only 110 million parameters). The pace of technological development
holds out great promise for the future of legal text-mining research and natural legal language
understanding.
While transformers have just come along and applications to the legal domain are only
starting to appear in publications and conference proceedings, a paper by Evan Gretok, David
Langerman and Wesley Oliver provides an interesting illustration of the application of
transformer models to the study of legal doctrines. The authors trained transformer-based
algorithms to classify rulings pertaining to the Fourth Amendment of the US Constitution
47
Telmo Pires, Eva Schlinger & Dan Garrette, How multilingual is multilingual bert?, ARXIV PREPR.
ARXIV190601502 (2019); Sean MacAvaney, Luca Soldaini & Nazli Goharian, Teaching a New Dog Old Tricks:
Resurrecting Multilingual Retrieval Using Zero-shot Learning, in EUROPEAN CONFERENCE ON INFORMATION RETRIEVAL
246–254 (2020).
48
Chalkidis et al., supra note 46.
depending on whether they applied a bright-line or a totality-of-the-circumstances rule. The
best model (based on BERT) achieves accuracy of 92 per cent.49
As researchers begin to realize the potential of Natural Language Processing for large-scale
doctrinal analysis, we should expect to see many studies along these lines in the near future.
In the multi-lingual context of continental Europe, researchers may further seek to leverage
the power of multilingual transformers to develop legal documents classifiers or predictors
that can be deployed across multiple jurisdictions.
Learning Text-Mining Methods
How can lawyers with no prior training in machine learning or data science get started?
One answer (at least for the motivated reader) is to learn a programming language like Python
either by following one of the many free online tutorials or by taking a class at a nearby
university campus. Python50 is the language of choice for machine learning, text-mining and
data harvesting tasks and the most popular among researchers and developers. Its ecosystem
of libraries supports the latest models and algorithms. While some lawyers may find
the mention of “programming” off-putting, Python is actually a very intuitive programming
language. Moreover, the libraries provide many shortcuts which make it possible to complete
a task with very few lines of code.
Both Python and R, along with their libraries, are entirely open source. They all can be
downloaded and installed from the internet. The same goes for the pre-trained embeddings
and transformers mentioned in this paper (except for GPT-3).
Finally, for those who would prefer to avoid any kind of programming, RapidMiner comes
with a graphical user interface to carry out end-to-end text-mining tasks without writing
code.51 Unlike Python and R, RapidMiner is a commercial platform. Yet its free version
supports a wide range of supervised as well as unsupervised methods for data sets with up to
10,000 rows.
Conclusion
Text-mining and natural language understanding have been making great strides in recent
years. Some of these techniques are at the heart of the hyper-hyped “AI revolution” and are
fueling the development of legaltech.
49
Evan Gretok, David Langerman & Wesley M. Oliver, Transformers for Classifying Fourth Amendment
Elements and Factors Tests, LEG. KNOWL. INF. SYST. JURIX 63–72 (2020). Formatted: English (United States)
50
https://www.python.org (accessed 9 November 2020).
51
https://rapidminer.com (accessed 9 November 2020).
To be sure, anyone who has actually attempted to use the techniques surveyed here will have
realized that algorithms do not process language the way humans do. All techniques, even
the most advanced ones, have limitations. Yet, thanks to their scalability, they open up new
possibilities for legal research to explore and canvass vast repositories of legal documents.
There exist many variants of the techniques reviewed in this paper and many more tasks to
which they either already have or may potentially be applied. However, I hope that the
illustrations I gave and the techniques I surveyed give the reader a sense of the potential that
these techniques offer for academic legal research.