Quantitative Corpus Linguistics with R, Second Edition

List of Contents

5.3.3 The Reduction of to be Before Verbs 212


5.3.4 Verb Collexemes After Must 215
5.3.5 Noun Collocates After Speed Adjectives in COCA (Fiction) 218
5.3.6 Collocates of Will and Shall in COHA (1810–1890) 221
5.3.7 Split Infinitives 225
5.4 Other Applications 228
5.4.1 Corpus Conversion: the ICE-GB 228
5.4.2 Three Indexing Applications 231
5.4.3 Playing With CELEX 235
5.4.4 Match All Numbers 237
5.4.5 Retrieving Adjective Sequences From Untagged Corpora 237
5.4.6 Type-Token Ratios/Vocabulary Growth: Hamlet vs. Macbeth 242
5.4.7 Hyphenated Forms and Their Alternative Spellings 248
5.4.8 Lexical Frequency Profiles 251
5.4.9 CHAT Files 1: Eve’s MLUs and ttrs 257
5.4.10 CHAT Files 2: Merging Multiple Files 263

6 Next Steps . . .  269

Appendix 271
Index 272
Figures

2.1 Representational format of corpus files and data frames 15


3.1 Representational format of corpus files and data frames 25
3.2 The contents of <_qclwr2/_inputfiles/dat_vector-a.txt> 37
3.3 The contents of <_qclwr2/_inputfiles/dat_vector-b.txt> 38
3.4 An example data frame 52
3.5 A few words from the BNC World Edition (SGML format) 108
3.6 The XML representation of “It” in the BNC World Edition 117
3.7 The XML representation of the number of sentence tags in the
BNC World Edition 118
3.8 A hypothetical XML representation of “It” in the BNC World Edition 118
3.9 The topmost hierarchical parts of BNC World Edition:
<corp_D94_part.xml> 119
3.10 The header of <corp_D94_part.xml> 120
3.11 The stext part of <corp_D94_part.xml> 122
3.12 The teiHeader/profileDesc part of <corp_H00.xml> 126
4.1 Interaction plot for GramRelation × ClauseType 1 146
4.2 Interaction plot for GramRelation × ClauseType 2 146
4.3 Interaction plot for GramRelation × ClauseType 3 147
4.4 Mosaic plot of the distribution of verb-particle constructions
in Gries (2003a) 156
4.5 Association plot of the distribution of verb-particle constructions
in Gries (2003a) 159
4.6 Average fictitious temperatures of two cities 162
4.7 Plots of the temperatures of the two cities 165
4.8 Plots of the lengths of subjects and objects 168
4.9 Scatterplots of the lengths of words in syllables and words 172
5.1 Dispersion results for perl in its Wikipedia entry 183
5.2 The annotation of even when as a multi-word unit 204
5.3 The first few lines of <wlp_fic_1990.txt> 218
5.4 Desired result of transforming Figure 5.2 into the COCA format
of Figure 5.3 221
5.5 A 3L-3R collocate display of shall as a modal verb (1810–1819
in COHA) 223
5.6 Three sentences from ICE-GB Release 2 229
5.7 The same three sentences after processing 230
5.8 Problematic (for us) uses of brackets in the ICE-GB Release 2 230
5.9 Three lines from CELEX, <EPL.CD> 235
5.10 Three lines from CELEX, <ESL.CD> 235
5.11 Thirty-nine character strings representing numbers to match and
one not to match 237
5.12 The first six lines of <BROWN1_A.txt> 241
5.13 The vocabulary-growth curve of tokens 243
5.14 The first ten lines of <baseword1.txt> 252
5.15 Excerpt of a file from the CHILDES database annotated in
the CHAT format 258
5.16 Intended output of the case study 263
Tables

2.1 Examples of differently ordered frequency lists 12


2.2 A collocate display of alphabetic based on the BNC 16
2.3 A collocate display of alphabetical based on the BNC 17
2.4 An example display of a concordance of before and after (sentence display) 18
2.5 An example display of a concordance of before and after (tabular) 19
4.1 Fictitious data set for a study on constituent lengths 145
4.2 A bad data frame 149
4.3 A better data frame 150
4.4 Observed distribution of verb-particle constructions in Gries (2003a) 152
4.5 Expected distribution of verb-particle constructions in Gries (2003a) 152
4.6 Observed distribution of alphabetical and order in the BNC 159
4.7 Structure of a quantitative corpus-linguistic paper 174
5.1 The frequency of w (=“perl”) and all other words in two ‘corpus files’ 197
5.2 Frequencies of sentences with and without alphabetical and order 208
5.3 Frequencies of must/other admit/other in BNC K 215
5.4 Desired co-occurrence results to be extracted from COCA: fiction 218
Acknowledgments

This book is dedicated to the people who have been so kind as to be part of what I might
self-deprecatingly call my ‘support network’; they are in alphabetical order of last names:
PMC, SCD, MIF, BH, S[LW], MN, H[RW], and DS – I am very grateful for all you’ve
done and all your tolerance over the last year or so! I wish to thank the team at Routledge
for their interest in, and support of, a second edition of this textbook; also, I am grateful
to the members of my corpus linguistics and statistics newsgroups for their questions,
suggestions, and feedback on various issues and topics that have now made it into this
second edition. Finally, I am grateful to many students and participants of classes, summer
schools, and workshops/bootcamps where parts of this book were used.
1 Introduction

1.1 Why Another Introduction to Corpus Linguistics?


In some sense at least, this book is an introduction to corpus linguistics. If you are a little familiar with the field, this probably immediately triggers the question “Why yet another introduction to corpus linguistics?” This is a valid question because, given the upsurge of studies using corpus data in linguistics, there are already quite a few very good introductions available. Do we really need another one? Predictably, I think the answer is still “yes” and “yes, even a second edition,” and the reason is that this introduction is radically different from every other introduction to corpus linguistics out there. For example, there are a lot of things that are regularly dealt with at length in introductions to corpus linguistics that I will not talk about much:

• the history of corpus linguistics: Kaeding, Fries, early 1m-word corpora, up to the contemporary giga corpora and the still lively web-as-corpus discussion;
• how to compile corpora: size, sampling, balancedness, representativity;
• how to create corpus markup and annotation: lemmatization, tagging, parsing;
• kinds and examples of corpora: synchronic vs. diachronic, annotated vs. unannotated;
• what kinds of corpus-linguistic research have been done.

That is to say, rather than telling you about the discipline of corpus linguistics – its history,
its place in linguistics, its contributions to different fields, etc. – with this book, I will ‘only’
teach you how to do corpus-linguistic data processing with the programming language
R (see McEnery and Hardie 2011 for an excellent recent introduction). In other words,
this book presupposes that you know what you would like to explore but gives you tools
to do it that go beyond what most commonly used tools can offer and, thus, hopefully
also open up your minds about how to approach your corpus-linguistic questions. This
is important since, to me, corpus linguistics is a method of analysis, so talking about how
to do things should enjoy a high priority (see Gries 2010 and the rest of that special issue,
as well as Gries 2011 for my subjective takes on this matter). Therefore, I will mostly be
concerned with:

• aspects of how exactly data are retrieved from corpora to be used in linguistically informed analyses, specifically how to obtain from corpora frequency lists, dispersion information, collocation displays, concordances, etc. (see Chapter 2 for explanation and exemplification of these terms);
• aspects of data manipulation and evaluation: how to process and convert corpus data; how to save various kinds of results; how to import them into a spreadsheet program for further annotation; how to analyze results statistically; how to represent the results graphically; and how to report your results.
A second important characteristic of this book is that it only uses freely available software:

• R, the corpus linguist’s all-purpose tool (cf. R Core Team 2016): software that is a calculator, a statistics program, a (statistical) graphics program, and a programming language at the same time. The versions used in this book are R (www.r-project.org) and the freely available Microsoft R Open 3.3.1 (https://mran.revolutionanalytics.com/open; the versions for Ubuntu 16.04 LTS (or Mint 18) and Microsoft Windows 10);
• RStudio 0.99.1294 (www.rstudio.com);
• LibreOffice 5.2.0.4 (www.libreoffice.org).

The choice of these software tools, especially the decision to use R, has a number of important implications, which should be mentioned early on. As I just mentioned, R is a full-fledged multi-purpose programming language and, thus, a very powerful tool. However, this degree of power does come at a cost: In the beginning, it is undoubtedly more difficult to do things with R than with ready-made (free or commercial) concordancing software that has been written specifically for corpus-linguistic applications. For example, if you want to generate a frequency list of a corpus or a concordance of a word in a corpus with R, you must write a small script or a little bit of code in a programming language, which is the technical way of saying you write lines of text that are instructions to R. If you do not need pretty output, this script may consist of just a few lines, but it will often also be longer than that. On the other hand, if you have a ready-made concordancer, you click a few buttons (and enter a search term) to get the job done. One may therefore ask: why go through the trouble of learning R? There is a variety of very good reasons for this, some of them related to corpus linguistics, some more general.
First, let me address this very argument, which is often made against using R (or other programming languages): why use a lot of time and effort to learn a programming language if you can get results from ready-made software within minutes? With regard to the time that goes into learning R, yes, there is a learning curve. However, that time may not be as long as you think: Many participants in my bootcamps and other workshops develop a first good understanding of R that allows them to begin to proceed on their own within just a few days. Plus, being able to program is an extremely useful skill for academic purposes, but also for jobs outside of academia; I would go so far as to say that learning to program is extremely useful in how it develops, or hones, a particular way of analytical and rigorous thinking that is useful in general. With regard to the time that goes into writing a script, much of that usually needs to be undertaken only once. As you will see below, once you have written your first few scripts while going through this book, you can usually reuse (parts of) them for many different tasks and corpora, and the amount of time that is required to perform a particular task becomes very similar to that of using a ready-made program. In fact, nearly all corpus-linguistic tasks in my own research are done with (somewhat adjusted) scripts or small snippets of code from this book. In addition, once you explore how to write your own functions (see Section 3.10), you can easily write versatile or specialized functions yourself; I will make several of those available in subsequent chapters. This way, the actual effort of generating a frequency list, a collocate display, a dispersion plot, etc. often reduces to about the time you need with a concordance program. In fact, R may even be faster than competing applications: For example, some concordance programs read in the corpus files once before they are processed and then again for performing the actual task – R requires only one pass and may, therefore, outperform some competitors in terms of processing time.
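To give a first impression of how little code such a reusable function can involve, here is a minimal sketch (the function name count.matches and the toy data are invented for this illustration, not taken from the book’s code files):

    # a small function that counts how often a regular expression
    # (pattern) matches in a character vector (x)
    count.matches <- function(pattern, x) {
       # gregexpr returns the starting positions of all matches
       # (or -1 when there is no match); we count the positives
       sum(unlist(gregexpr(pattern, x))>0)
    }
    count.matches("the", c("the cat", "on the mat, the")) # returns 3

Once defined (e.g., in a file you load at start-up), such a function can be called like any built-in function, which is exactly how the reusability just mentioned pays off.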
Another point related to the notion that programming knowledge is useful: The knowledge you will acquire by working through this book is quite general, and I mean that in a good way. This is because you will not be restricted to just one particular software application (or even one version of one particular software application) and its restricted set of features. Rather, you will acquire knowledge of a programming language and regular expressions which will allow you to use many different utilities and to understand scripts in other programming languages, such as Perl or Python. (At the same time, I think R is simpler than Perl or Python, but can also interface with them via RSPerl and RSPython, respectively; see www.omegahat.org.) For example, if you ever come across scripts by other people or decide to turn to these languages yourself, you will benefit from knowing R in a way that no ready-made concordancing software would allow for. If you are already a bit familiar with corpus-linguistic work, you may now think “but why turn to R and not use Perl or Python (especially since you say Perl and Python are similar anyway and many people already use one of these languages)?” This is a good question, and I myself used Perl for corpus processing before I turned to R. However, I think I also have a good answer to why to use R instead. First, the issue of speed is much less of a problem than one may think. R is fast enough and stable enough for most applications (especially if you heed some of the advice given in Sections 3.6.3 and 3.10). Thus, if a script takes a bit of time, you can simply run it over lunch, while you are in class, or even overnight, and collect the results afterwards. Second, R has other advantages. The main one is probably that, in addition to text-processing capabilities, R offers a large number of ready-made functions for the statistical evaluation and graphical representation of data, which allows you to perform just about all corpus-linguistic tasks within only one programming environment. You can do your data processing, data retrieval, annotation, statistical evaluation, graphical representation . . . everything within just one environment, whereas if you wanted to do all these things in Perl or Python, you would require a huge amount of separate programming. Consider a very simple example: R has a function called table that generates a frequency table. To perform the same in Perl you would either have to have a small loop counting elements in an array and in a stepwise fashion increment their frequencies in a hash or, later and more cleverly, program a subroutine which you would then always call upon. While this is no problem with a one-dimensional frequency list, it is much harder with multidimensional frequency tables: Perl’s arrays of arrays or hashes of arrays etc. are not for the faint-hearted, whereas R’s table is easy to handle, and additional functions (xtabs, ftable, etc.) allow you to handle such tables very easily. I believe learning one environment can be sufficiently hard for beginners, and therefore recommend using the more comprehensive environment with the greater number of simpler functions, which to me clearly is R. And, once you have mastered the fundamentals of R and face situations in which you need maximal computational power, switching to Perl or Python in a limited number of cases will be easier for you anyway, especially since much of these programming languages’ syntax is similar and the regular expressions used in this book are all Perl-compatible. (Let me tell you, though, that in all my years using R, there were a mere two instances where I had to switch to Perl, and that was only because I didn’t yet know how to solve a particular problem in R.)
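To make the comparison concrete, here is a minimal sketch (the toy vector some.words is invented for this illustration):

    # a toy vector of word tokens
    some.words <- c("the", "word", "and", "the", "phrase")
    # one call tabulates the types and their token frequencies ...
    table(some.words)
    # ... and one more call sorts that table by descending frequency
    sort(table(some.words), decreasing=TRUE)
    # cross-tabulating two vectors (e.g., words and invented POS tags)
    # is just as short
    some.tags <- c("DET", "N", "CONJ", "DET", "N")
    table(some.words, some.tags)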
Second, by learning to do your analyses with a programming language, you usually have more control over what you are actually doing: Different concordance programs have different settings or different ways of handling searches that are not always obvious to the (inexperienced) user. For instance, ready-made concordance tools often have slightly different settings that specify what ‘a word’ is, which means you can get different results if you have different programs perform the same search on the same corpus. Yes, those settings can usually be tweaked, but that means that, actually, such a ready-made application requires the same attention to detail as R, and with a programming language all of your methodological choices are right there in the code for everyone to see and replicate.
Third, if you use a particular concordancing software, you are at the mercy of its developers. If the developers change its behavior, its results output, or its default settings, you can only hope that this is documented well and/or does not affect your results. There have been cases where even silent over-the-internet updates have changed the output of such software from one day to the next. Worse, developers might even discontinue the development of a tool altogether – and let us not even consider how sorry the state of the discipline of corpus linguistics would be if a majority of its practitioners were dependent on not even a handful of ready-made corpus tools and websites that allow you to search a corpus online. Somewhat polemically speaking, being able to enter a URL and type in a search word shouldn’t make you a corpus linguist.
The fourth and maybe most important reason for learning a programming language such as R is that a programming language is a much more versatile tool than any ready-made software application. For instance, many ready-made corpus tools can only offer the functionality they aim to provide for corpora with particular formats, and then can only provide a small number of kinds of output. R, as a programming language, can handle pretty much any input and can generate pretty much any output you want – in fact, in my bootcamps, I tell participants on day 1 that I don’t want to hear any questions that begin with “Can R . . . ?” because the answer is “Yes”. For instance, with R you can readily use the CELEX database, CHAT files from language acquisition corpora, the very hierarchically layered annotation of XML corpora, previously generated frequency lists for corpora you no longer have access to, literature files from Project Gutenberg or similar sites, tabular corpus files such as those from the Corpus of Contemporary American English (http://corpus.byu.edu/coca) or the Corpus of Historical American English (http://corpus.byu.edu/coha), and so on and so forth. You can use files of whatever encoding, meaning that data from any language/writing system can be straightforwardly processed, and R’s general data-processing capabilities are mostly only limited by your working memory and abilities (rather than, for instance, the number of rows your spreadsheet software can handle). With very few exceptions, R works identically on all three major operating systems: Linux/Unix, Windows, and Mac OS X. In a way, once you have mastered the basic mechanisms, there is basically no limit to what you can do with it, both in terms of linguistic processing and statistical evaluation.
But there are also additional important advantages in the fact that R is an open-source tool/programming language. For instance, there is a large number of functions and packages that are contributed by users all over the world. These often allow effective shortcuts that are not, or hardly, possible with ready-made applications, which you cannot tweak as you wish. Also, contrary to commercial concordance software, bug-fixes are usually available very quickly. And a final, obvious, and very down-to-earth advantage of using open-source software is of course that it comes free of charge. Any student or any department’s computer lab can afford it without expensive licenses, temporally limited or functionally restricted licenses, or irritating ads and nag screens. All this makes a strong case for the choice of software made here.

1.2 Outline of the Book


This book has changed quite a bit from the first edition; it is now structured as follows. Chapter 2 defines the notion of a corpus and provides a brief overview of what I consider to be the most central corpus-linguistic methods, namely frequency lists, dispersion, collocations, and concordances; in addition, I briefly mention different kinds of annotation. The main change here is the addition of some discussion of the important notion of dispersion.
Chapter 3 introduces the fundamentals of R, covering a variety of functions from different domains, but the area which receives most consideration is that of text processing. There are many small changes in the code and the examples (for instance, I now introduce free-spacing), but the main differences from the first edition consist of: (1) a revision of the section on Unicode, which is now more comprehensive; (2) the addition of a new section specifically discussing how to get the most out of XML data using dedicated packages that can parse the hierarchical structure of XML documents; (3) an improved version of my exact.matches function; and (4) a new section on how to write your own functions for text processing and other things – this is taken up a lot in Chapter 5.
Chapter 4 is what used to be Chapter 5 in the first edition. It introduces you to
some fundamental aspects of statistical thinking and testing. The questions to be covered
in this chapter include: What are hypotheses? How do I check whether my results are
noteworthy? How might I visualize results? Given considerations of space and focus, this
chapter is informative, I hope, but still short.
The main chapter of this edition, Chapter 5, is brand new and, in a sense, brings it all together: More than 30 case studies in 27 sections illustrate various aspects of how the methods introduced in Chapters 3 and 4 can be applied to corpus data. Using a variety of different kinds of corpora, corpus-derived data, and other data, you will learn how to write your own programs in R for corpus-linguistic analyses, text processing, and some statistical analysis and visualization in detailed step-by-step instructions. Every single analysis is discussed on multiple levels of abstraction, and altogether more than 6,000 lines of code, nearly every one of them commented, help you delve deeply into how powerful a tool R can be for your work.
Finally, Chapter 6 is a very brief conclusion that points you to a handful of useful R
packages that you might consider exploring next.
Before we begin, a few short comments on the nature of this book are necessary. This book is kind of a sister publication to my introduction to statistics for linguists (Gries 2013), and shares with it multiple characteristics. For instance, and as already mentioned, this introduction to corpus linguistics is different from every other introduction to corpus linguistics I know in how it doesn’t even attempt to survey the discipline, but focuses on R programming for corpus linguists. This has two consequences. On the one hand, this book is not a book that requires much previous knowledge: It presupposes only basic (corpus-)linguistic knowledge and no mathematical or any programming knowledge.
On the other hand, this book is an attempt to teach you a lot about how to be a good corpus linguist. As a good corpus linguist, you have to combine many different methodological skills (and many equally important analytical skills that I will not be concerned with here). Many of these methodological skills are addressed here, such as some very basic knowledge of computers (operating systems, file types, etc.), data management, regular expressions, some elementary programming skills, some elementary knowledge of statistics, etc. What you must know, therefore, is that (1) nobody has ever learned all of this just by reading – you must do things – and (2) this is not an easy book that you can read for ten minutes at a time in bed before you fall asleep. What these two things mean is that you really must read this book while you are sitting at your computer so you can run the code, see what it does, and work on the examples. This is particularly important because the code file from the companion website contains more than 6,500 lines of code and a huge amount of extra commentary to help you understand the code much better than you can understand it from just reading the book; this is particularly relevant for Chapter 5! You will need practice to master all the concepts introduced here, but you will be rewarded by acquiring skills that give you access to a variety of data and approaches you may not have considered accessible to you – at least that’s what happened to me when I at one point decided to leave behind the ready-made tools I had become used to. Undergrads in my corpus classes without prior programming experience have quickly learned to write small programs that do things better than much commercial concordancing software, and you can do the same.
In order to facilitate your learning process, there are four different ways in which I try to help you get more out of this book. First, there are small Think Breaks. These are small assignments which you should try to complete before you read on; answers to them follow immediately in the text. Second, there are exercise boxes with small assignments. Ideally, you should complete these and check your answers in the answer key before you read any further, but it is not always necessary to complete them right away to understand what follows, so you can also return to them later at your own leisure. Third, there are many boxes with recommendations for further study/exploration, which typically mention functions that you do not need for the section in which they are mentioned the first time, but many are used at a later stage (often this will be Chapter 5), which means I really encourage you to follow up on those soon after you have encountered them. Fourth, and in addition to the above, I would like to encourage you to go to the companion website for this book at http://tinyurl.com/QuantCorpLingWithR, as well as the Google group “CorpLing with R”, which I created and maintain. You need to go to the companion website to get all the files that belong with this book, but if you also become a member of the Google group:

• you can send questions about corpus linguistics with R to the list and, hopefully, get useful responses from some kind soul(s);
• post suggestions for revisions of this book there;
• inform me and the other readers of errors you find and, of course, be informed when other people or I find errata.

Thus, while this is not an easy book, I hope these aids help you to become a good corpus linguist. If you work through the whole book, you will be able to do a large number of things you could not even do with commercial concordancing software; many of the scripts you find here are taken from actual research, and are in fact simplified versions of scripts I have used myself for published papers. In addition, if you also take up the many recommendations for further exploration that are scattered throughout the book, you will probably find ever new and more efficient ways of application.

References
Gries, Stefan Th. (2010). Corpus linguistics and theoretical linguistics: A love–hate relationship?
Not necessarily . . . International Journal of Corpus Linguistics 15(3), 327–343.
Gries, Stefan Th. (2011). Methodological and interdisciplinary stance in corpus linguistics. In
Geoffrey Barnbrook, Vander Viana, & Sonia Zyngier (Eds.), Perspectives on corpus linguistics:
Connections and controversies (pp. 81–98). Amsterdam: John Benjamins.
Gries, Stefan Th. (2013). Statistics for linguistics with R. 2nd rev. and ext. ed. Berlin: De Gruyter
Mouton.
McEnery, Tony, & Andrew Hardie. (2011). Corpus linguistics: Method, theory, and practice.
Cambridge: Cambridge University Press.
R Core Team. (2016). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. Retrieved from www.R-project.org.
2 The Four Central Corpus-Linguistic Methods

This last point leads me, with some slight trepidation, to make a comment on our field in general, an informal observation based largely on a number of papers I have read as submissions in recent months. In particular, we seem to be witnessing as well a shift in the way some linguists find and utilize data – many papers now use corpora as their primary data, and many use internet data.
(Joseph 2004: 382)

In this chapter you will learn what a corpus is (plural: corpora) and what the four methods
are to which nearly all aspects of corpus-linguistic work can be reduced in some way.

2.1 Corpora
Before we start to actually look at corpus linguistics, we have to clarify our terminology a
little. While the actual programming tasks do not differ between them, in this book I will
distinguish between a corpus, a text archive, and an example collection.

2.1.1 What Is a Corpus?


In this book, the notion of a corpus refers to a machine-readable collection of (spoken or written) texts that were produced in a natural communicative setting, and in which the collection of texts is compiled with the intention (1) to be representative and balanced with respect to a particular language, variety, register, or genre and (2) to be analyzed linguistically. The parts of this definition need some further clarification themselves:

• “Machine-readable” refers to the fact that nowadays virtually all corpora are stored in the form of plain ASCII or Unicode text files that can be loaded, manipulated, and processed platform-independently. This does not mean, however, that corpus linguists only deal with raw text files – quite the contrary: some corpora are shipped with sophisticated retrieval software that makes it possible to look for precisely defined lexical, syntactic, or other patterns. It does mean, however, that you would have a hard time finding corpora on paper, in the form of punch cards, or digitally in HTML or Microsoft Word document formats; probably the most widely used format consists of text files with a Unicode UTF-8 encoding and XML annotation.
• “Produced in a natural communicative setting” means that the texts were spoken or written for some authentic communicative purpose, but not for the purpose of putting them into a corpus. For example, many corpora consist to a large degree of newspaper articles. These meet the criterion of having been produced in a natural setting because journalists write the articles to be published in newspapers and to communicate something to their readers, not because they want to fill a linguist’s corpus. Similarly, if I obtained permission to record all of a particular person’s conversations in one week, then hopefully, while the person and his interlocutors usually are aware of their conversations being recorded, I will obtain authentic conversations rather than conversations produced only for the sake of my corpus.
• I use “representative [ . . . ] with respect to a particular language, variety . . . ” here to refer to the fact that the different parts of the linguistic variety I am interested in should all be manifested in the corpus (at least if you want to generalize much beyond your sample, e.g., to the language in general). For example, if I was interested in phonological reduction patterns in the speech of adolescent Californians and recorded only parts of their conversations with several people from their peer group, my corpus would not be representative in the above sense because it would not reflect the fact that some sizable proportion of the speech of adolescent Californians may also consist of dialogs with a parent, a teacher, etc., which would therefore also have to be included.
• I use “balanced with respect to a particular language, variety . . . ” to mean that not only should all parts of which a variety consists be sampled into the corpus, but also that the proportion with which a particular part is represented in a corpus should reflect the proportion the part makes up in this variety and/or the importance of the part in this variety (at least if you want to generalize much beyond your sample, e.g., to the language in general). For example, if I know that dialogs make up 65 percent of the speech of adolescent Californians, approximately 65 percent of my corpus should consist of dialog recordings. This example already shows that this criterion is more of a theoretical ideal: How would one even measure the proportion that dialogs make up of the speech of adolescent Californians? We can only record a tiny sample of all adolescent Californians, and how would we measure the proportion of dialogs? In terms of time? In terms of sentences? In terms of words? And how would we measure the importance of a particular linguistic variety? The implicit assumption that conversational speech is somehow the primary object of interest in linguistics also prevails in corpus linguistics, which is why corpora often aim at including as much spoken language as possible, but on the other hand a single newspaper headline read by millions of people may have a much larger influence on every reader’s linguistic system than 20 hours of dialog. In sum, balanced corpora are a theoretical ideal corpus compilers constantly bear in mind, but the ultimate and exact way of compiling a balanced corpus has remained mysterious so far.

It is useful to point out, however, that the above definition of a corpus is perhaps the prototype, which implies that there are many other kinds of corpora that differ from it along a variety of dimensions. For instance, the TIMIT Acoustic-Phonetic Continuous Speech Corpus is made up of audio recordings of 630 speakers of eight major dialects of American English, where each speaker read phonetically rich sentences, a setting which is not exactly a natural communicative setting. Or consider the DCIEM Map Task Corpus, which consists of unscripted dialogs in which one interlocutor describes a route on a map to the other after both interlocutors were subjected to 60 hours of sleep deprivation and one of three drug treatments – again, hardly a normal situation. Even a genre as widely used as newspaper text – journalese – is not necessarily close to being a prototypical corpus, given how newspaper writing is created much more deliberately and consciously than many other texts – plus such texts often come with linguistically arbitrary restrictions regarding their length, are often not written by a single person, and are heavily edited, etc. Thus, the notion of corpus is really a rather diverse one.
Many people would prefer to consider newspaper data not corpora, but text archives. Those would be databases of texts which

• may not have been produced in a natural setting;
• have often not been compiled for the purposes of linguistic analysis; and
• have often not been intended to be representative and/or balanced with respect to a particular linguistic variety or speech community.

As the above discussion already indicated, however, the distinction between corpora and text archives is often blurred. It is theoretically easy to make, but in practice often not adhered to very strictly and, again, has very few implications for the kinds of (R) programming they require. For example, if a publisher of a popular computing periodical makes all the issues of the previous year available on their website, then the first criterion is met, but not the last three. However, because of their availability and size, many corpus linguists use them as resources, and as long as one bears their limitations in mind in terms of representativity etc., there is little reason not to.
Finally, an example collection is just what the name says it is – a collection of examples that, typically, the person who compiled the examples came across and noted down. For example, much psycholinguistic research in the 1970s was based on collections of speech errors compiled by the researchers themselves and/or their helpers. Occasionally, people refer to such collections as error corpora, but we will not use the term corpus for these. It is easy to see how such collections compare to corpora. On the one hand, for example, some errors – while occurring frequently in authentic speech – are more difficult to perceive than others and thus hardly ever make it into a collection. This would be an analog to the balancedness problem outlined above. On the other hand, the perception of errors is contingent on the acuity of the researcher while, with corpus research, the corpus compilation would not be contingent on a particular person’s perceptual skills. Finally, because of the scarcity of speech errors, usually all speech errors perceived (in a particular amount of time) are included in the collection, whereas, at least usually and ideally, corpus compilers are more picky and select the material to be included with an eye to the criteria of representativity and balancedness outlined above.1 Be that as it may, if only for the sake of terminological clarity, it is useful to distinguish the notions of corpora, text archives, and example collections.

2.1.2 What Kinds of Corpora Are There?


Corpora differ in a variety of ways. There are a few distinctions you should be familiar with, if only to be able to find the right corpus for what you want to investigate. The most basic distinction is that between general corpora and specific corpora. The former intend to be representative and balanced for a language as a whole – within the above-mentioned limits, that is – while the latter are by design restricted to a particular variety, register, genre, etc.
Another important distinction is that between raw corpora and annotated corpora. Raw corpora consist of files containing only the corpus material (see (1) in the example below), while annotated corpora also contain additional information. Annotated corpora are very often annotated according to the standards of the Text Encoding Initiative (TEI, www.tei-c.org/index.xml) or the Corpus Encoding Standard (CES, www.cs.vassar.edu/CES), and have two parts. The first part is called the header, which provides information that is typically characterized as markup. This is information about (1) the text itself, e.g., where the corpus data come from, which language is represented in the file, which (part of a) newspaper or book has been included, who recorded whom, where, and when, who has the copyright, what annotation comes with the file; and information about (2) its formatting, printing, processing, etc. Markup refers to objectively codable information – the fact that there is a paragraph in a text or that a particular speaker is female can typically be established without doubt; this is different from annotation, which is usually specifically linguistic information – e.g., part-of-speech (POS) tagging, semantic information, pragmatic information, etc. – and which is less objective (for instance, because linguists may disagree about POS tags for specific words). This information helps users to quickly determine, e.g., whether a particular file is part of the register one wishes to investigate or not.
The second part is called the body and contains the corpus data proper – i.e., what people actually said or wrote – as well as linguistic information that is usually based on some linguistic theory: Parts of speech or syntactic patterns, for example, can be matters of debate. In what follows I will briefly (and non-exhaustively!) discuss and exemplify a few common annotation schemes (see Wynne 2005; McEnery, Xiao, & Tono 2006: A.3 and A.4; Beal, Corrigan, & Hermann 2007a, 2007b; Gries & Newman 2013 for more discussion).
First, a corpus may be lemmatized such that each word in the corpus is followed (or preceded) by its lemma, i.e., the form under which you would look it up in a dictionary (see (2)). A corpus may have so-called part-of-speech tags so that each word in the corpus is followed by an abbreviation giving the word’s POS and sometimes also some morphological information (see (3)). A corpus may also be phonologically annotated (see (4)). Then, a corpus may be syntactically parsed, i.e., contain information about the syntactic structures of the text/utterances (see (5)). Finally, and as a last example, a corpus may contain several different annotations on different lines (or tiers) at the same time, a format especially common in language acquisition corpora (see (6)).

(1) I did get a postcard from him.


(2) I_I did_do get_get a_a postcard_postcard from_from him_he._punct
(3) I<PersPron> did<VerbPast> get<VerbInf> a<Det> postcard<NounSing>
from<Prep> him<PersPron>.<punct>
(4) [@:]·I·^did·get·a·!p\ostcard·fr/om·him#·-·-
(5) <Subject,·NP>
I<PersPron>
<Predicate,·VP>
did<Verb>
get<Verb>
<DirObject,·NP>
a<Det>
postcard<NounSing>
<Adverbial,·PP>
from<Prep>
him<PersPron>.
(6) *CHI: I did get a postcard from him
%mor: pro|I·v|do&PAST·v|get·det|a·n|postcard·prep|from·
pro|him·.
%lex: get
%syn: trans

Other annotation includes that with regard to semantic characteristics, stylistic aspects, anaphoric relations (co-reference annotation), etc. Nowadays, most corpora come in the form of XML files, and we will explore many examples involving XML annotation in the chapters to come. As is probably obvious from the above, annotation can sometimes be done completely automatically (possibly with human error-checking), semi-automatically, or must be done completely manually. POS tagging, probably the most frequent kind of annotation, is usually done automatically, and for English, taggers are claimed to achieve accuracy rates of 97 percent – a number that I sometimes find hard to believe when I look at corpora, but that is a different story.
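As a brief preview of the text processing introduced in Chapter 3, this is one way such tags could be stripped again in R (a minimal sketch only; the regular expression is one of several possibilities):

    # the POS-tagged sentence from example (3) above
    tagged <- "I<PersPron> did<VerbPast> get<VerbInf> a<Det> postcard<NounSing> from<Prep> him<PersPron>.<punct>"
    # delete every tag, i.e. everything between angle brackets,
    # which recovers the raw text of example (1)
    gsub("<[^>]+>", "", tagged)
    # [1] "I did get a postcard from him."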
Then, there is a difference between diachronic corpora and synchronic corpora. The former aim at representing how a language/variety changes over time, while the latter provide, so to speak, a snapshot of a language/variety at one particular point in time. Yet another distinction is that between monolingual corpora and parallel corpora. As you might already guess from the names, the former have been compiled to provide information about one particular language/variety, whereas the latter ideally provide the same text in several different languages. Examples include translations from EU Parliament debates into the 23 languages of the European Union, or the Canadian Hansard corpus, containing Canadian Parliament debates in English and French. Again, ideally, a parallel corpus does not just have the translations in different languages, but has the translations sentence-aligned, such that for every sentence in language L1, you can automatically retrieve its translation in the languages L2 to Ln.
The next distinction to be mentioned here is that of static corpora vs. dynamic/monitor corpora. Static corpora have a fixed size (e.g., the Brown corpus, the LOB corpus, the British National Corpus), whereas dynamic corpora do not, since they may be constantly extended with new material (e.g., the Bank of English).
The final distinction I would like to mention at least briefly involves the encoding of the corpus files. Given especially the predominance of work on English in corpus linguistics, until rather recently many corpora came in the so-called ASCII (American Standard Code for Information Interchange) character encoding, an encoding scheme that encodes 2^7 = 128 characters as numbers and that is largely based on the Western alphabet. With these characters, special characters that were not part of the ASCII character inventory were often paraphrased, e.g., “é” was paraphrased as “&eacute;”. However, the number of corpora for many more languages has been increasing steadily, and given the large number of characters that writing systems such as Chinese have, this is not a practical approach. As such, language-specific character encodings were developed (e.g., ISO 8859-1 for Western European languages vs. ISO 2022 for Chinese/Japanese/Korean languages). However, in the interest of overcoming compatibility problems that arose due to how different languages used different character encodings, the field of corpus linguistics has been moving towards using only one unified (i.e., not language-specific) multilingual character encoding in the form of Unicode (most notably UTF-8). This development is in tandem with the move toward XML corpus annotation and, more generally, UTF-8 becoming the most widely used character encoding on the internet.
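In R, handling this mostly amounts to declaring a file’s encoding when reading it in (a minimal sketch; the file name <corpusfile.txt> is hypothetical):

    # read a hypothetical UTF-8-encoded corpus file line by line;
    # fileEncoding declares the encoding of the file on disk
    corpus.file <- scan("corpusfile.txt", what=character(), sep="\n",
       fileEncoding="UTF-8", quote="", comment.char="")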
Now that you know a bit about the kinds of corpora that exist, there is one other really important point to be made. While we will see below that corpus linguistics has a lot to offer to the analyst, it is worth pointing out that, strictly speaking at least, the only thing corpora can provide is information on frequencies. Put differently, there is no meaning in corpora, and no functions, only:

• frequencies of occurrence of items – i.e., how often do morphemes, words, grammatical patterns, etc. occur in (parts of) a corpus?; and
• frequencies of co-occurrence of items – i.e., how often do morphemes occur with particular words? How often do particular words occur in a certain grammatical construction? etc.
It is up to the researcher to interpret these frequencies of occurrence and co-occurrence in meaningful or functional terms. The assumption underlying basically all corpus-based analyses, however, is that formal differences reflect functional differences: Different frequencies of (co-)occurrences of formal elements are supposed to reflect functional regularities, where functional is understood here in a very broad sense as anything – be it semantic, discourse-pragmatic, etc. – that is intended to perform a particular communicative function. On a very general level, the frequency information a corpus offers is exploited in four different ways, which will be the subject of this chapter: frequency lists (Section 2.2), dispersion (Section 2.3), lexical co-occurrence lists/collocations (Section 2.4), and concordances (Section 2.5).

2.2 Frequency Lists


The most basic corpus-linguistic tool is the frequency list. You generate a frequency list when you want to know how often something – usually words – occurs in a corpus. Thus, a frequency list of a corpus is usually a two-column table with all words occurring in the corpus in one column and the frequency with which they occur in the corpus in the other column. Since the notion of word is a little ambiguous here, it is useful to introduce a common distinction between (word) type and (word) token. The string “the word and the phrase” contains five (word) tokens (“the”, “word”, “and”, “the”, and “phrase”), but only four (word) types (“the”, “word”, “and”, and “phrase”), of which one (“the”) occurs twice. In this parlance, a frequency list lists the types in one column and their token frequencies in the other; often you will find the expression type frequency referring to the number of different types attested in a corpus (or in a ‘slot’ such as a syntactically defined slot in a grammatical construction).
Typically, one out of three different sorting styles is used: frequency order (ascending
or, more typically, descending; see the left panel of Table 2.1), alphabetical (ascending or
descending), and occurrence (each word occurs in a position reflecting its first occurrence
in the corpus).
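A minimal sketch of how such a frequency-ordered list might be generated in R (this is developed properly in Chapters 3 and 5; the file name is hypothetical):

    # load a hypothetical corpus file and switch to lower case
    corpus.file <- tolower(scan("corpusfile.txt", what=character(),
       sep="\n", quote="", comment.char=""))
    # split the lines at everything that is not a letter and
    # flatten the result into one long vector of word tokens
    words <- unlist(strsplit(corpus.file, "[^a-z]+"))
    words <- words[nchar(words)>0] # discard empty strings
    # tabulate the types and sort by descending token frequency
    sort(table(words), decreasing=TRUE)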
Apart from this simple form in the leftmost panel, there are other varieties of frequency lists that are sometimes found. First, a frequency list may provide the frequencies of all words together with the words with their letters reversed. This may not seem particularly useful at first, but even a brief look at the second panel of Table 2.1 clarifies that this kind of display can sometimes be very helpful because it groups together words that share a particular suffix – here the adverb marker -ly. Second, a frequency list may not list each individual word token and its frequency, but so-called n-grams, i.e., sequences of n words; the third panel of Table 2.1 exemplifies this with bigrams, i.e., sequences of two words (a small sketch of how such bigrams can be generated in R follows the table).
Table 2.1 Examples of differently ordered frequency lists

Words    Freq.     Words (reversed)   Freq.   Bigrams     Freq.   Words   Tags   Freq.
the      62,580    yllufdaerd         80      of the      4,892   the     AT0    6,069
of       35,958    yllufecaep         1       in the      3,006   of      PRF    4,106
and      27,789    yllufecarg         5       to the      1,751   a       AT0    2,823
to       25,600    yllufecruoser      8       on the      1,228   and     CJC    2,602
a        21,843    yllufeelg          1       and the     1,114   in      PRP    2,449
in       19,446    yllufeow           1       for the     906     to      TO0    1,678
that     10,296    ylluf              2       at the      832     is      VBZ    1,589
is       9,938     yllufepoh          8       to be       799     to      PRP    1,135
was      9,740     ylluferac          87      with the    783     for     PRP    916
for      8,799     yllufesoprup       1       from the    720     be      VBI    874
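A bigram list of the kind shown in the third panel can be derived from the words vector of the earlier frequency-list sketch with one more step (again only a sketch; note the simplification that this ignores line and sentence boundaries):

    # pair every word token with its right neighbor: head(words, -1)
    # drops the last token, tail(words, -1) drops the first
    bigrams <- paste(head(words, -1), tail(words, -1))
    sort(table(bigrams), decreasing=TRUE)[1:10] # the ten most frequent bigrams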
Random documents with unrelated
content Scribd suggests to you:
BLACK CURRANT JAM AND MARMALADE.

No fruit jellies so easily as black currants when they are ripe; and
their juice is so rich and thick that it will bear the addition of a very
small quantity of water sometimes, without causing the preserve to
mould. When the currants have been very dusty, we have
occasionally had them washed and drained before they were used,
without any injurious effects. Jam boiled down in the usual manner
with this fruit is often very dry. It may be greatly improved by taking
out nearly half the currants when it is ready to be potted, pressing
them well against the side of the preserving-pan to extract the juice:
this leaves the remainder far more liquid and refreshing than when
the skins are all retained. Another mode of making fine black currant
jam—as well as that of any other fruit—is to add one pound at least
of juice, extracted as for jelly, to two pounds of the berries, and to
allow sugar for it in the same proportion as directed for each pound
of them.
For marmalade or paste, which is most useful in affections of the
throat and chest, the currants must be stewed tender in their own
juice, and then rubbed through a sieve. After ten minutes’ boiling,
sugar in fine powder must be stirred gradually to the pulp, off the fire,
until it is dissolved: a few minutes more of boiling will then suffice to
render the preserve thick, and it will become quite firm when cold.
More or less sugar can be added to the taste, but it is not generally
liked very sweet.
Best black currant jam.—Currants, 4 lbs.; juice of currants, 2 lbs.:
15 to 20 minutes’ gentle boiling. Sugar, 3 to 4 lbs.: 10 minutes.
Marmalade, or paste of black currants.—Fruit, 4 lbs.: stewed in its
own juice 15 minutes, or until quite soft. Pulp boiled 10 minutes.
Sugar, from 7 to 9 oz. to the lb.: 10 to 14 minutes.
Obs.—The following are the receipts originally inserted in this
work, and which we leave unaltered.
To six pounds of the fruit, stripped carefully from the stalks, add
four pounds and a half of sugar. Let them heat gently, but as soon as
the sugar is dissolved boil the preserve rapidly for fifteen minutes. A
more common kind of jam may be made by boiling the fruit by itself
from ten to fifteen minutes, and for ten minutes after half its weight of
sugar has been added to it.
Black currants, 6 lbs.; sugar, 4-1/2 lbs.: 15 minutes. Or: fruit, 6 lbs.:
10 to 15 minutes. Sugar, 3 lbs.: 10 minutes.
Obs.—There are few preparations of fruit so refreshing and so
useful in illness as those of black currants, and it is therefore
advisable always to have a store of them, and to have them well and
carefully made.
NURSERY PRESERVE.

Take the stones from a couple of pounds of Kentish cherries, and


boil them twenty minutes; then add to them a pound and a half of
raspberries, and an equal quantity of red and of white currants, all
weighed after they have been cleared from their stems. Boil these
together quickly for twenty minutes; mix with them three pounds and
a quarter of common sugar, and give the preserve fifteen minutes
more of quick boiling. A pound and a half of gooseberries may be
substituted for the cherries; but they will not require any stewing
before they are added to the other fruits. The jam must be well
stirred from the beginning, or it will burn to the pan.
Kentish cherries, 2 lbs.: 20 minutes. Raspberries, red currants,
and white currants, of each 1-1/2 lb.: 20 minutes. Sugar, 3-1/4 lbs.:
15 minutes.
ANOTHER GOOD COMMON PRESERVE.

Boil together, in equal or unequal portions (for this is immaterial),


any kinds of early fruit, until they can be pressed through a sieve;
weigh, and then boil the pulp over a brisk fire for half an hour; add
half a pound of sugar for each pound of fruit, and again boil the
preserve quickly, keeping it well stirred and skimmed, from fifteen to
twenty minutes. Cherries, unless they be morellas, must first be
stewed tender apart, as they will require a much longer time to make
them so than any other of the first summer fruits.
A GOOD MÉLANGE, OR MIXED PRESERVE.

Boil for three-quarters of an hour in two pounds of clear red


gooseberry juice, one pound of very ripe greengages, weighed after
they have been pared and stoned; then stir to them one pound and a
half of good sugar, and boil them quickly again for twenty minutes. If
the quantity of preserve be much increased, the time of boiling it
must be so likewise: this is always better done before the sugar is
added.
Juice of ripe gooseberries, 2 lbs.; greengages, pared and stoned,
1 lb.: 3/4 hour. Sugar, 1-1/2 lb.: 20 minutes.
GROSEILLÉE.

(Another good preserve.)
Cut the tops and stalks from a gallon or more of well-flavoured ripe
gooseberries, throw them into a large preserving-pan, boil them for
ten minutes, and stir them often with a wooden spoon; then pass
both the juice and pulp through a fine sieve, and to every three
pounds’ weight of these add half a pint of raspberry-juice, and boil
the whole briskly for three-quarters of an hour; draw the pan aside,
stir in for the above portion of fruit, two pounds of sugar, and when it
is dissolved renew the boiling for fifteen minutes longer.
Ripe gooseberries, boiled 10 minutes. Pulp and juice of gooseberries,
6 lbs.; raspberry-juice, 1 pint: 3/4 hour. Sugar, 4 lbs.: 15 minutes.
Obs.—When more convenient, a portion of raspberries can be
boiled with the gooseberries at first.
SUPERIOR PINE-APPLE MARMALADE.

(A New Receipt.)
The market-price of our English pines is generally too high to
permit their being very commonly used for preserve; and though
some of those imported from the West Indies are sufficiently well-
flavoured to make excellent jam, they must be selected with
judgment for the purpose, or they will possibly not answer for it. They
should be fully ripe, but perfectly sound: should the stalk end appear
mouldy or discoloured, the fruit should be rejected. The degree of
flavour which it possesses may be ascertained with tolerable
accuracy by its odour; for if of good quality, and fit for use, it will be
very fragrant. After the rinds have been pared off, and every dark
speck taken from the flesh, the pines may be rasped on a fine and
delicately clean grater, or sliced thin, cut up quickly into dice, and
pounded in a stone or marble mortar; or a portion may be grated,
and the remainder reduced to pulp in the mortar. Weigh, and then
heat and boil it gently for ten minutes; draw it from the fire, and stir to
it by degrees fourteen ounces of sugar to the pound of fruit; boil it
until it thickens and becomes very transparent, which it will be in
about fifteen minutes, should the quantity be small: it will require a
rather longer time if it be large. The sugar ought to be of the best
quality and beaten quite to powder; and for this, as well as for every
other kind of preserve, it should be dry. A remarkably fine
marmalade may be compounded of English pines only, or even with
one English pine of superior growth, and two or three of the West
Indian mixed with it; but all when used should be fully ripe, without at
all verging on decay; for in no other state will their delicious flavour
be in its perfection.
In making the jam always avoid placing the preserving-pan flat
upon the fire, as this of itself will often convert what would otherwise
be excellent preserve, into a strange sort of compound, for which it is
difficult to find a name, and which results from the sugar being
subjected—when in combination with the acid of the fruit—to a
degree of heat which converts it into caramel or highly-boiled barley-
sugar. When there is no regular preserving-stove, a flat trivet should
be securely placed across the fire of the kitchen-range to raise the
pan from immediate contact with the burning coals, or charcoal. It is
better to grate down than to pound the fruit for the present receipt,
should any parts of it be ever so slightly tough; and it should then be
slowly stewed until quite tender before any sugar is added to it; or
with only a very small quantity stirred in should it become too dry. A
superior marmalade even to this, might probably be made by adding
to the rasped pines a little juice drawn by a gentle heat, or expressed
cold, from inferior portions of the fruit; but this is only supposition.
A FINE PRESERVE OF THE GREEN ORANGE PLUM.

(Sometimes called the Stonewood plum.)
This fruit, which is very insipid when ripe, makes an excellent
preserve if used when at its full growth, but while it is still quite hard
and green. Take off the stalks, weigh the plums, then gash them well
(with a silver knife, if convenient) as they are thrown into the
preserving-pan, and keep them gently stirred without ceasing over a
moderate fire, until they have yielded sufficient juice to prevent their
burning; after this, boil them quickly until the stones are entirely
detached from the flesh of the fruit. Take them out as they appear on
the surface, and when the preserve looks quite smooth and is well
reduced, stir in three-quarters of a pound of sugar beaten to a
powder, for each pound of the plums, and boil the whole very quickly
for half an hour or more. Put it, when done, into small moulds or
pans, and it will be sufficiently firm when cold to turn out well: it will
also be transparent, of a fine green colour, and very agreeable in
flavour.
Orange plums, when green, 6 lbs.: 40 to 60 minutes. Sugar, 4-1/2
lbs.: 30 to 50 minutes.
Obs.—The blanched kernels of part of the fruit should be added to
this preserve a few minutes before it is poured out: if too long boiled
in it they will become tough. They should always be wiped very dry
after they are blanched.
GREENGAGE JAM, OR MARMALADE.

When the plums are thoroughly ripe, take off the skins, stone,
weigh, and boil them quickly without sugar for fifty minutes, keeping
them well stirred; then to every four pounds add three of good sugar
reduced quite to powder, boil the preserve from five to eight minutes
longer, and clear off the scum perfectly before it is poured into the
jars. When the flesh of the fruit will not separate easily from the
stones, weigh and throw the plums whole into the preserving-pan,
boil them to a pulp, pass them through a sieve, and deduct the
weight of the stones from them when apportioning the sugar to the
jam. The Orleans plum may be substituted for greengages in this
receipt.
Greengages, stoned and skinned, 6 lbs.: 50 minutes. Sugar, 4-1/2
lbs.: 5 to 8 minutes.
PRESERVE OF THE MAGNUM BONUM, OR MOGUL PLUM.

Prepare, weigh, and boil the plums for forty minutes; stir to them
half their weight of good sugar beaten fine, and when it is dissolved
continue the boiling for ten additional minutes, and skim the preserve
carefully during the time. This is an excellent marmalade, but it may
be rendered richer by increasing the proportion of sugar. The
blanched kernels of a portion of the fruit stones will much improve its
flavour, but they should be mixed with it only two or three minutes
before it is taken from the fire. When the plums are not entirely ripe,
it is difficult to free them from the stones and skins: they should then
be boiled down and pressed through a sieve, as directed for
greengages, in the receipt above.
Mogul plums, skinned and stoned, 6 lbs.: 40 minutes. Sugar, 3
lbs.: 10 minutes.
TO DRY OR PRESERVE MOGUL PLUMS IN SYRUP.

Pare the plums, but do not remove the stalks or stones; take their
weight of dry sifted sugar, lay them into a deep dish or bowl, and
strew it over them; let them remain thus for a night, then pour them
gently into a preserving-pan with all the sugar, heat them slowly, and
let them just simmer for five minutes; in two days repeat the process,
and do so again and again at an interval of two or three days, until
the fruit is tender and very clear; put it then into jars, and keep it in
the syrup, or drain and dry the plums very gradually, as directed for
other fruit. When they are not sufficiently ripe for the skin to part from
them readily, they must be covered with spring water, placed over a
slow fire, and just scalded until it can be stripped from them easily.
They may also be entirely prepared by the receipt for dried apricots
which follows, a page or two from this.
MUSSEL PLUM CHEESE AND JELLY.

Fill large stone jars with the fruit, which should be ripe, dry, and
sound; set them into an oven from which the bread has been drawn
several hours, and let them remain all night; or, if this cannot
conveniently be done, place them in pans of water, and boil them
gently until the plums are tender, and have yielded their juice to the
utmost. Pour this from them, strain it through a jelly bag, weigh, and
then boil it rapidly for twenty-five minutes. Have ready, broken small,
three pounds of sugar for four of the juice, stir them together until it is
dissolved, and then continue the boiling quickly for ten minutes
longer, and be careful to remove all the scum. Pour the preserve into
small moulds or pans, and turn it out when it is wanted for table: it
will be very fine, both in colour and in flavour.
Juice of plums, 4 lbs.: 25 minutes. Sugar, 3 lbs.: 10 minutes.
The cheese.—Skin and stone the plums from which the juice has
been poured, and after having weighed, boil them an hour and a
quarter over a brisk fire, and stir them constantly; then to three
pounds of fruit add one of sugar, beaten to powder; boil the preserve
for another half hour, and press it into shallow pans or moulds.
Plums, 3 lbs.: 1-1/4 hour. Sugar, 1 lb.: 30 minutes.
APRICOT MARMALADE.

This may be made either by the receipt for greengage, or Mogul
plum marmalade; or the fruit may first be boiled quite tender, then
rubbed through a sieve, and mixed with three-quarters of a pound of
sugar to the pound of apricots: from twenty to thirty minutes will boil
it in this case. A richer preserve still is produced by taking off the
skins, and dividing the plums in halves or quarters, and leaving them
for some hours with their weight of fine sugar strewed over them
before they are placed on the fire; they are then heated slowly and
gently simmered for about half an hour.
TO DRY APRICOTS.

(A quick and easy method.)
Wipe gently, split, and stone some fine apricots which are not
over-ripe; weigh, and arrange them evenly in a deep dish or bowl,
and strew in fourteen ounces of sugar in fine powder, to each pound
of fruit; on the following day turn the whole carefully into a
preserving-pan, let the apricots heat slowly, and simmer them very
softly for six minutes, or for an instant longer, should they not in that
time be quite tender. Let them remain in the syrup for a day or two,
then drain and spread them singly on dishes to dry.
To each pound of apricots, 14 oz. of sugar; to stand 1 night, to be
simmered from 6 to 8 minutes, and left in syrup 2 or 3 days.
DRIED APRICOTS.

(French Receipt.)
Take apricots which have attained their full growth and colour, but
before they begin to soften; weigh, and wipe them lightly; make a
small incision across the top of each plum, pass the point of a knife
through the stalk end, and gently push out the stones without
breaking the fruit; next, put the apricots into a preserving-pan, with
sufficient cold water to float them easily; place it over a moderate
fire, and when it begins to boil, should the apricots be quite tender,
lift them out and throw them into more cold water, but simmer them,
otherwise, until they are so. Take the same weight of sugar that there
was of the fruit before it was stoned, and boil it for ten minutes with a
quart of water to the four pounds; skim the syrup carefully, throw in
the apricots (which should previously be well drained on a soft cloth,
or on a sieve), simmer them for one minute, and set them by in it
until the following day, then drain it from them, boil it for ten minutes,
and pour it on them the instant it is taken from the fire; in forty-eight
hours repeat the process, and when the syrup has boiled ten
minutes, put in the apricots, and simmer them from two to four
minutes, or until they look quite clear. They may be stored in the
syrup until wanted for drying, or drained from it, laid separately on
slates or dishes, and dried very gradually: the blanched kernels may
be put inside the fruit, or added to the syrup.
Apricots, 4 lbs., scalded until tender; sugar 4 lbs.; water, 1 quart:
10 minutes. Apricots, in syrup, 1 minute; left 24 hours. Syrup, boiled
again, 10 minutes, and poured on fruit: stand 2 days. Syrup, boiled
again, 10 minutes, and apricots 2 to 4 minutes, or until clear.
Obs.—The syrup should be quite thick when the apricots are put in
for the last time; but both fruit and sugar vary so much in quality and
in the degree of boiling which they require, that no invariable rule can
be given for the latter. The apricot syrup strained very clear, and
mixed with twice its measure of pale French brandy, makes an
agreeable liqueur, which is much improved by infusing in it for a few
days half an ounce of the fruit-kernels, blanched and bruised, to the
quart of liquor.
We have found that cherries prepared by either of the receipts
which we have given for preserving them with sugar, if thrown into
the apricot syrup when partially dried, just scalded in it, and left for a
fortnight, then drained and dried as usual, become a delicious
sweetmeat. Mussel, imperatrice, or any other plums, when quite ripe,
if simmered in it very gently until they are tender, and left for a few
days to imbibe its flavour, then drained and finished as usual, are
likewise excellent.
PEACH JAM, OR MARMALADE.

The fruit for this preserve, which is a very delicious one, should be
finely flavoured, and quite ripe, though perfectly sound. Pare, stone,
weigh, and boil it quickly for three-quarters of an hour, and do not fail
to stir it often during the time; draw it from the fire, and mix with it ten
ounces of well-refined sugar, rolled or beaten to powder, for each
pound of the peaches; clear it carefully from scum, and boil it briskly
for five minutes; throw in the strained juice of one or two good
lemons; continue the boiling for three minutes only, and pour out the
marmalade. Two minutes after the sugar is stirred to the fruit, add
the blanched kernels of part of the peaches.
Peaches, stoned and pared, 4 lbs.: 3/4 hour. Sugar, 2-1/2 lbs.: 2
minutes. Blanched peach-kernels: 3 minutes. Juice of 2 small
lemons: 3 minutes.
Obs.—This jam, like most others, is improved by pressing the fruit
through a sieve after it has been partially boiled. Nothing can be finer
than its flavour, which would be injured by adding the sugar at first;
and a larger proportion renders it cloyingly sweet. Nectarines and
peaches mixed, make an admirable preserve.
TO PRESERVE, OR TO DRY PEACHES OR NECTARINES.

(An easy and excellent Receipt.)
The fruit should be fine, freshly gathered, and fully ripe, but still in
its perfection. Pare, halve, and weigh it after the stones are removed;
lay it into a deep dish, and strew over it an equal weight of highly
refined pounded sugar; let it remain until this is nearly dissolved,
then lift the fruit gently into a preserving-pan, pour the juice and
sugar to it, and heat the whole over a very slow fire; let it just simmer
for ten minutes, then turn it softly into a bowl, and let it remain for two
days; repeat the slow heating and simmering at intervals of two or
three days, until the fruit is quite clear, when it may be potted in the
syrup, or drained from it, and dried upon large clean slates or dishes,
or upon wire-sieves. The flavour will be excellent. The strained juice
of a lemon may be added to the syrup, with good effect, towards the
end of the process, and an additional ounce or two of sugar allowed
for it.
DAMSON JAM. (VERY GOOD.)

The fruit for this jam should be freshly gathered and quite ripe.
Split, stone, weigh, and boil it quickly for forty minutes; then stir in
half its weight of good sugar roughly powdered, and when it is
dissolved, give the preserve fifteen minutes additional boiling,
keeping it stirred, and thoroughly skimmed.
Damsons, stoned, 6 lbs.: 40 minutes. Sugar, 3 lbs.: 15 minutes.
Obs.—A more refined preserve is made by pressing the fruit
through a sieve after it is boiled tender; but the jam is excellent
without.