[This is a draft text of the talk presented at the 13th Annual Schoenberg Symposium on Manuscript Studies in the Digital Age, 18 November 2020. As such, it is a close but not verbatim transcript of the actual presentation.]
Digital and Computational Palaeography: Some Promises and
Problems
Peter Stokes, EPHE – PSL
As those of us here today know very well, the last fifteen or twenty years or so have seen an
enormous increase in the availability of digital images of manuscripts, as well as tools for
working with them. This has been enabled by technological developments, but also by what
one might call ‘social’ changes, such as the increasing prevalence of more or less open access
to images and indeed to software; investments at local, national and international level in
the acquisition and publication of digital images as well as in open platforms and software;
and concerted work on international standards for images, particularly the International
Image Interoperability Framework, or IIIF. [Slide] Here, for instance, we have the Biblissima
Portal, which provides searching across digital images of around 75,000 different Western
manuscripts, from a dozen or so different libraries, all of which have images that are
immediately accessible through IIIF.
I think it’s clear that this is (or at least can be) an enormous boon. As noted in the overview
to this conference, in many ways we were indeed ‘ready’ when most of the world’s
manuscript libraries became inaccessible in March, meaning that I can sit at home during
what is now France’s second lockdown and consult images of tens of thousands of
manuscripts scattered across Europe and beyond. [Slide] It also poses interesting questions and challenges, some of which I will simply raise here for now and return to in a minute.
But there are of course limitations here as well, and others (including many in the audience)
have spoken repeatedly about this. We know, I think, that availability of digital images is not
access to the object, however easy it can be to forget this. I think we all agree that to
understand manuscripts requires not only the digital but also direct physical access to the
original objects. There is a real risk that more resources in digitisation can mean fewer
resources for librarians, cataloguing, conservation and so on (and I’m thinking here
particularly of examples like the National Library of Israel which suspended all activities and
put 300 staff on temporary leave, citing COVID and budget cuts). I don’t think it’s a zero-sum
game, by the way, as there’s no question that greater exposure to these materials has led to
greater interest in palaeography, codicology and manuscript studies in general, although just
how this translates into funding is much less clear at this point. None of this is new to COVID,
of course, but it does seem that the current situation is accelerating this process. One sign of
that already is that the new Horizon Europe programme, which sets European funding
priorities for the next seven years, is expected to include calls aimed at increasing digital
approaches to cultural heritage, with explicit reference to COVID as a motivating factor.
Still, we have long been talking about digitisation as ‘democratisation’, and this is a question
that concerns me a lot. I am very aware that I am extremely privileged, having spent the
whole of my career in Cambridge (England), London and Paris, and I am extremely grateful
for that. It has meant that I have had relatively easy access to medieval manuscripts, and
also that it has been relatively easy for me to meet the ‘big names in the field’ as I simply
had to sit in the manuscript libraries and, in effect, people would come to me. So it’s easy for
me to say that access to manuscripts is essential, but what does that say about people who
don’t have this privilege? One of the ideals of the digital humanities and of digitisation has
been to try to share this more widely by increasing access, at least in some form.
Palaeography teaching is very different from when I started twenty years ago: now students can go and find hundreds or even thousands of high-quality colour images of manuscripts, a situation very different from the highly selective handful of black-and-white photocopies that we used to have. I teach Master’s students in DH, and [Slide] it’s well within the capability of good students to write Python scripts that harvest hundreds of images of manuscript pages, process the images to automatically estimate basic statistics
such as the dimensions or average number of lines of text on a page, and then look for
developments over time, space and so on. There have been very interesting projects around
crowdsourcing and gamesourcing to raise interest and teach skills, and one might hope that
this in turn leads to more investment in posts and training, but I’m by no means sure about
this yet.
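To give a concrete idea of the kind of script I have in mind, here is a minimal sketch in Python (using the requests, Pillow and NumPy libraries). It assumes a IIIF Presentation 2.x manifest, the manifest URL is only a placeholder, and the line count is estimated very crudely from a horizontal projection of dark pixels; a real analysis would of course need to be much more careful.

    # A minimal sketch of the kind of student script described above.
    # Assumptions: the manifest follows the IIIF Presentation API 2.x layout,
    # and the manifest URL below is a placeholder to be replaced.
    import io
    import requests
    import numpy as np
    from PIL import Image

    MANIFEST_URL = "https://example.org/iiif/manuscript/manifest.json"  # placeholder

    def canvas_image_urls(manifest):
        """Yield one image URL per canvas of a IIIF Presentation 2.x manifest."""
        for canvas in manifest["sequences"][0]["canvases"]:
            yield canvas["images"][0]["resource"]["@id"]

    def estimate_line_count(img, threshold=0.5):
        """Very rough line count: light-to-dark transitions in a horizontal projection."""
        grey = np.asarray(img.convert("L"), dtype=float) / 255.0
        ink_per_row = (grey < threshold).mean(axis=1)   # proportion of 'ink' in each row
        dark_rows = ink_per_row > ink_per_row.mean()    # rows darker than the page average
        return int(np.sum(dark_rows[1:] & ~dark_rows[:-1]))

    manifest = requests.get(MANIFEST_URL).json()
    for url in canvas_image_urls(manifest):
        img = Image.open(io.BytesIO(requests.get(url).content))
        print(url, img.size, estimate_line_count(img))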
At the same time, though, another encouraging trend that I have seen is that there has been
more and more interest from the Computer Science community in the analysis of historical
documents. [Slide] When Arianna Ciula first published her seminal article on digital
palaeography about fifteen years ago, there were very few people indeed from Computer
Science who were interested in these questions (with a couple of notable exceptions such as
Lambert Schomaker from Groningen who is still a leader in the field). Fifteen years ago,
many Computer Scientists were telling me that historical documents were not interesting
and a ‘solved problem’, but now the International Conference on Document Analysis and
Recognition has a regular dedicated section on historical documents; the International
Conference on Frontiers in Handwriting Recognition is growing rapidly; and in these
conferences there are very open discussions of how important it is to involve people from
the Humanities in this work.
So from here I want to change direction a little and consider some new areas that are
emerging in ‘digital palaeography’, in part again as a result of increasing ‘access’ or
‘democratisation’, if you will. I think the area where this is most immediately evident, and
changing extremely quickly, is in artificial intelligence and Deep Learning. Those of you who
know me well may be surprised by this, as I consider myself relatively averse to fashionable
buzzwords, and indeed I spent many years arguing against many uses of AI in palaeography.
To be clear, I still do: but not against all uses, and things are changing so fast that I do think
now that we are already at a time where we have clear examples that work extremely well.
In fact, AI and palaeography have in a sense had a long history together: Douglas Hofstadter,
a famous writer and researcher in AI, was very much preoccupied by writing right back in the
80s, and he once stated that [Slide] ‘the central problem of AI is the question: what is the
letter a?’. (In fact this is a question which has obsessed me for many years now. What is the
letter a? Really? And what does that say for palaeography that we can’t even answer a
question as fundamental as this?) Today, though, people are throwing AI at these questions
in their spare time. [Slide] Here, Erik Bernhardsson apparently decided more or less on a
whim to download 50,000 computer fonts, train a neural network on them, carry out various
statistical analyses, and then program the network to generate new characters in different
styles. Now, Mr Bernhardsson is clearly a skilled programmer, but still, this is something he
did at home in his spare time.
Another example is a project that my team in Paris is working on, called
Kraken/eScriptorium, [Slide] which uses machine learning to automatically transcribe
manuscripts. This software is available online, and you can download it and use it yourself (if
you have the necessary technical skills). As you probably know, this isn’t magic and someone
needs to teach the computer how to do the transcription, by showing it many lines of
material that has already been transcribed. This requires a lot of work, as someone needs to
sit down and prepare the transcription, so an active area of research is how to reduce this as
we will see in a minute. Now, one of the advantages of Kraken/eScriptorium is that the
trained models themselves are also open and can be freely exported and shared, and as far
as I know this is pretty much the only platform where this is the case. So I can train a model on
my manuscripts from the eleventh century, say, and then export that model, and give it to
you, and you can then retrain it for your manuscripts which are similar to mine but not quite
the same. This saves a huge amount of time, and indeed energy and environmental
resources, compared to us all retraining our models from scratch.
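To make that retraining scenario a little more concrete, here is a minimal sketch of how one might fine-tune a shared model from a Python script by calling kraken’s ketos command-line tool. The file names are placeholders and the exact options can differ between versions of kraken, so treat the flags as indicative and check ketos train --help on your own installation.

    import glob
    import subprocess

    # Fine-tune a model that someone else has shared ('shared_model.mlmodel' is a
    # placeholder) on your own ground truth, exported from eScriptorium as
    # ALTO/PAGE XML files in the gt/ directory. The flags are indicative and may
    # differ between kraken versions.
    subprocess.run(
        [
            "ketos", "train",
            "--load", "shared_model.mlmodel",  # start from the shared model, not from scratch
            "-f", "xml",                       # ground truth supplied as ALTO/PAGE XML
            "-o", "my_finetuned_model",        # output prefix for the retrained model
            *sorted(glob.glob("gt/*.xml")),
        ],
        check=True,
    )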
This again opens up all sorts of interesting opportunities. Think about it: we now have access
to tens of thousands (probably hundreds of thousands, if not millions) of images of
manuscripts. We are starting to get trained models capable of transcribing those images. It
won’t be perfect, sure, but even if it’s, say, 80% correct, that’s more than enough to
automatically identify the texts, and even probably which variants or versions of texts there
are. It may be enough to do some basic linguistic analysis, such as identification of dialect.
It’s certainly more than enough to do things like check the number of lines per page (a useful
complement to data in the Schoenberg database, for instance). It may be enough to identify
ownership inscriptions, glosses, and other information. It’s not quite true that we can do all
this on our home computers in practice, because for the moment AI requires fairly high-
powered computing to run in a reasonable time, but even this is changing rapidly as well
(NVIDIA for instance has released an ultra-cheap hobbyist version of a GPU as a sort of
Raspberry Pi for AI).
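As a small illustration of why 80 per cent accuracy can already be enough to identify a text, here is a sketch using nothing but the Python standard library: even a noisy automatic transcription usually matches the right known text far better than the wrong one. The strings here are invented purely for the example.

    import difflib

    # Invented example data: a noisy automatic transcription and two candidate
    # known texts. Even with errors, the correct candidate scores far higher.
    noisy = "pater nostr qui es in celis sanctificetur nomen tuum"
    candidates = {
        "Pater noster": "pater noster qui es in caelis sanctificetur nomen tuum",
        "Ave Maria": "ave maria gratia plena dominus tecum",
    }

    for name, text in candidates.items():
        score = difflib.SequenceMatcher(None, noisy, text).ratio()
        print(f"{name}: {score:.2f}")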
In fact I mentioned that there is active work on speeding up this process, and, believe it or
not, this leads to another area of ‘democratisation’ in AI: Deep Fakes. I’m sure you’ve heard
of this: it’s a fairly new technology based again on neural networks, that these days seems
mostly used to put your face onto that of a famous celebrity in ways that can be genuinely
difficult to detect. But this technology is also being used by Computer Scientists for
document analysis. Yes, you heard that right, [Slide] people are Deep Faking manuscripts!
This raises an obvious question: why would you want to do that? Well, I can think of very
interesting if disturbing applications for outreach and engagement, though as far as I know
this hasn’t been done in practice. But there is also a very valid reason for Computer Science.
I mentioned that we need examples of transcribed manuscripts in order to teach the
computer to do the transcription itself, and obviously this transcription takes a lot of work to
produce. But another option, which works surprisingly well, is that we can Deep Fake images
of handwriting in the style of our document, and in this case we already know what the text
says so we don’t need to prepare our own transcription for training purposes. Instead, we
can ‘teach’ our machine to read the Deep Faked manuscripts, and then apply it to real
manuscripts where we don’t already know the text, and it turns out that the results
are surprisingly good!
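The real systems use neural networks (typically generative adversarial networks) to imitate a specific hand, which is far beyond a few lines of code; but the underlying trick, generating training lines whose text is already known, can be sketched in a much simpler form by rendering known text with a typeface and then degrading the image. The font path below is an assumption you would need to adapt, and this is a deliberately naive stand-in for the deep-fake approach, not the approach itself.

    from PIL import Image, ImageDraw, ImageFilter, ImageFont

    # Deliberately simple stand-in for the GAN-based approach described above:
    # render a line of known text and degrade it, so that image and transcription
    # are both known in advance and can be used for training. The font path is a
    # placeholder; real systems instead learn to imitate a specific scribal hand.
    def synthetic_line(text, font_path="some_font.ttf", height=64):
        font = ImageFont.truetype(font_path, size=int(height * 0.6))
        width = int(font.getlength(text)) + 40
        img = Image.new("L", (width, height), color=230)       # parchment-ish background
        ImageDraw.Draw(img).text((20, height // 4), text, font=font, fill=20)
        return img.filter(ImageFilter.GaussianBlur(radius=1))  # mild degradation

    synthetic_line("pater noster qui es in caelis").save("synthetic_line.png")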
So, where does this leave us? In fact, where in all this is the palaeography? Donald Knuth
once wrote that [Slide] ‘the best way to understand something is to know it so well that you
can teach it to a computer’ (and he added that ‘the process of seeking such explanations will
surely be instructive for all concerned’). But have we not solved what Hofstadter identified
as ‘the central problem of AI’: have we not taught a computer to successfully identify a in
thousands of different fonts and handwriting styles? [Slide] If this commentary on Reddit is
anything to go by then yes: according to them, we are done, the problem is solved. But what
does this tell us really? Do we now understand this so well that we can teach it to a
computer? The answer in fact is clearly ‘no’, I think: we have shown the computer a series of
images and told it what we think, but it hasn’t yet done much for our understanding, and
this is one of the main reasons why I have always been very wary of highly complex
algorithmic analyses such as Deep Learning: because of the enormous problem of
algorithmic transparency and understanding. This is now, finally, largely recognised among
the Computer Science community [Slide x 2], and there are now very good people working
very hard on these questions. But I still think these questions apply just as much to the
Digital Humanities, and that the real question from our point of view is not so much ‘what is
the answer?’ as ‘what does this mean?’.
Indeed, these systems can give us an answer, but we will always need many, different
answers. As Colette Sirat has noted, Karl Popper’s point is very relevant also to
palaeography: [Slide] ‘Two things which are similar are always similar in certain respects. …
Generally, similarity, and with it repetition, always presupposes the adoption of a point of
view: some similarities or repetitions will strike us if we are interested in one problem or
another’. In order for our digital methods to be useful, then, we need to find ways of
allowing for these different points of view, and this includes contexts different from our
own.
One of the many privileges of being at the EPHE is that I have colleagues working on pretty
much every historical script you can imagine, and many more that you can’t. At the moment,
though, the vast majority of online tools and models are designed at least implicitly and
often explicitly for Western, often English, documents. So when we’re talking about wider
access in a world of COVID lockdown, how are those working on less well-represented
materials managing? The answer is often ‘not very well’. We need to allow for different
characters, different scripts, different directions of writing, different conventions for
transcription and scholarly presentation. We need to be able to handle cases where we
don’t already have millions of words available, where there may not already be
trained models at all, or where the corpus may be very small. We need to be able to treat
pages like this [Slide], where the writing is in Arabic, so from right to left, but with glosses
radiating out in this star-like form.
To take another example, for sharing transcription data, you need to specify the baseline.
But what is the baseline, exactly? [Slide] And these are the easy cases: we then have ones
like this [Slide]. Yes, I could model Hebrew as being right to left written on a baseline, but if I
then pool my data with yours, did you model it this way, or did you do it right to left from a
topline? How can I tell?
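To see why this matters in practice, consider how a transcription platform typically stores a line: in effect a list of points plus the text, with nothing to say whether the letters sit on that line or hang from it. The little sketch below is purely illustrative and does not reproduce any particular platform’s format.

    # Purely illustrative: two people encode 'the line' of the same Hebrew text,
    # one as a baseline under the letters, one as a topline the letters hang from.
    # The data structures look identical, so nothing tells a consumer which is which.
    line_from_me = {
        "text": "בראשית ברא אלהים",
        "line": [(980, 210), (120, 212)],  # right to left; intended as a baseline
    }
    line_from_you = {
        "text": "בראשית ברא אלהים",
        "line": [(975, 168), (118, 170)],  # right to left; intended as a topline
    }
    # Any model trained or evaluated on both sets together silently mixes two
    # incompatible ideas of where 'the line' actually is.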
Even very basic definitions start to fall apart very quickly once you move outside an Anglo-
European context: what exactly is a grapheme, for instance? What exactly is a word?
I talked a bit already about sharing trained models, but sharing any models and software
(whether trained AI, or hand-built databases, or whatever) requires a huge amount of
communication, understanding, and making knowledge explicit, and this is still one of the
real challenges, I think. This, I think, is what Knuth meant when he wrote about teaching the
computer. [Slide] (For what it’s worth, here is the start of my answer to the question ‘what
is the letter a’.) There are a couple of small groups working on these questions, and here is
one of my contributions to the subject, but relatively speaking there is much less going on
here, I think.
So COVID has accelerated a process of digitisation, and this, combined with current trends, looks likely to mean more funds for digital work on cultural heritage, at least in Europe, but with the economic crisis probably meaning less for so-called ‘traditional’ palaeography, librarianship and so on. So I
think the old questions still apply: We need to think how to be relevant without
oversimplifying, how to engage but in the right way, how to pool our resources and be as
efficient as possible without privileging a few points of view that are already dominant.
These challenges are not new, but they look likely to become more urgent than ever.
[Slide] Thank you