BNC & Ice
2.1 The British National Corpus
[Table body not recovered. Its two parts, Speech and Writing, each had the
columns Type, Number of texts, Number of words, and % of spoken (or written)
corpus.]
Even though the creators of the BNC had a very definite plan for the composi-
tion of the corpus, it is important to realize, as Biber (1993: 256) observes, that
the creation of a corpus is a “cyclical” process, requiring constant re-evaluation
as the corpus is being compiled. Consequently, the compiler of a corpus should
be willing to change his or her initial corpus design if circumstances arise
requiring such changes to be made. For instance, in planning the part of the
BNC containing demographically sampled speech, it was initially thought that
obtaining samples of speech from 100 individuals would provide a sufficiently
representative sample of this type of speech (Crowdy 1993: 259). Ultimately,
however, the speech of 124 individuals was required to create a corpus bal-
anced by age, gender, social class, and region of origin (Aston and Burnard
1998: 32).
While changes in corpus design are inevitable, it is crucial that a number
of decisions concerning the composition of a corpus be made before texts are
collected for it. The remainder of this chapter considers the factors that
influenced the decisions, detailed above, that the creators of the BNC made,
and the general methodological issues that these decisions raise.
Earlier corpora, such as Brown and LOB, were relatively short, primar-
ily because of the logistical difficulties that computerizing a corpus created. For
instance, all of the written texts for the Brown Corpus had to be keyed in by hand,
a process requiring a tremendous amount of very tedious and time-consuming
typing. This process has been greatly eased with the invention of optical scan-
ners (cf. section 3.7), which can automatically computerize printed texts with
fairly high rates of accuracy. However, technology has not progressed to the
point where it can greatly expedite the collection and transcription of speech:
there is much work involved in going out and recording speech (cf. section 3.2),
and the transcription of spoken texts still has to be done manually with either
a cassette transcription machine or special software that can replay segments
of speech until the segments are accurately transcribed (cf. section 3.6). These
logistical realities explain why 90 percent of the BNC is writing and only 10 per-
cent speech.
To determine how long a corpus should be, it is first of all important to
compare the resources that will be available to create it (e.g. funding, research
assistants, computing facilities) with the amount of time it will take to collect
texts for inclusion, computerize them, annotate them, and tag and parse them.
Chafe, Du Bois, and Thompson (1991: 70–1) calculate that it will require “six
person-hours” to transcribe one minute of speech for inclusion in the Santa
Barbara Corpus of Spoken American English, a corpus containing not just
orthographic transcriptions of speech but prosodic transcriptions as well (cf.
section 4.1). For the American component of ICE, it was calculated that it
would take eight hours of work to process a typical 2,000-word sample of
writing, and ten to twenty hours to process a 2,000-word spoken text (with
spontaneous dialogues containing numerous speech overlaps requiring more
time than carefully articulated monologues). Since the American component of
ICE contains 300 samples of speech and 200 samples of writing, the entire
corpus would take a full-time research assistant working forty hours a week
more than three years to complete. If one applies these calculations to
the BNC, which is one hundred times the length of the American component
of ICE, it becomes obvious that creating a corpus requires a large commitment
of resources. In fact, the creation of the BNC required the collaboration of a
number of universities and publishers in Great Britain, as well as considerable
financial resources from funding agencies.
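The arithmetic behind this estimate can be sketched as follows. This is a minimal illustration, not a planning tool: the per-sample hours and sample counts come from the figures above, while the 52-week working year and the function name are assumptions introduced here.

```python
# Per-sample processing times quoted above for the American component of ICE.
HOURS_PER_WRITTEN_SAMPLE = 8
HOURS_PER_SPOKEN_SAMPLE = (10, 20)  # low and high estimates
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52  # assumption: no allowance for holidays or leave


def processing_years(n_spoken, n_written):
    """Return (low, high) estimates, in years of full-time work."""
    written_hours = n_written * HOURS_PER_WRITTEN_SAMPLE
    low_hours = n_spoken * HOURS_PER_SPOKEN_SAMPLE[0] + written_hours
    high_hours = n_spoken * HOURS_PER_SPOKEN_SAMPLE[1] + written_hours
    hours_per_year = HOURS_PER_WEEK * WEEKS_PER_YEAR
    return low_hours / hours_per_year, high_hours / hours_per_year


low, high = processing_years(n_spoken=300, n_written=200)
print(f"{low:.1f} to {high:.1f} years")
```

Under these assumptions the range runs from roughly 2.2 to 3.7 years, so the three-year mark is passed once spoken samples average more than about fifteen and a half hours each.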
After determining how large a corpus one’s resources will permit, the next
step is to consider how long the corpus needs to be to permit the kinds of studies
one envisions for it. One reason that the BNC was so long was that one planned
use of the corpus was to create dictionaries, an enterprise requiring a much
larger database than is available in shorter corpora, such as Brown or LOB. On
the other hand, all of the regional components of the ICE Corpus are only one
million words in length, since the goal of the ICE project is to permit not
in-depth lexical studies but grammatical analyses of the different regional varieties
of English – a task that can be accomplished in a shorter corpus.
In general, the lengthier the corpus, the better. However, it is possible to
estimate the required size of a corpus more precisely by using statistical for-
mulas that take the frequency with which linguistic constructions are likely
to occur in text samples and calculate how large the corpus will have to be
to study the distribution of the constructions validly. Biber (1993: 253–4)
carried out such a study (based on information in Biber 1988) on 481 text
samples that occurred in twenty-three different genres of speech and writing.
He found that reliable information could be obtained on frequently occurring
linguistic items such as nouns in as few as 59.8 text samples. On the other
hand, infrequently occurring grammatical constructions such as conditional
clauses required a much larger number of text samples (1,190) for valid in-
formation to be obtained. Biber (1993: 254) concludes that the “most con-
servative approach” would be to base the ultimate size of a corpus on “the
most widely varying feature”: those linguistic constructions, such as condi-
tional clauses, that require the largest sample size for reliable studies to be
conducted. A corpus of 1,190 samples would therefore be 2,380,000 words in
length (if text samples were 2,000 words in length, the standard length in many
corpora).3
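The reasoning above can be sketched in two steps. The text does not reproduce Biber's own formula, so the first function below uses a common textbook formula for estimating a mean, n = (z * s / E)^2, purely as an assumed stand-in; its inputs (30 and 5) are hypothetical. The second function applies the "most conservative approach": it takes the largest sample requirement across features and multiplies by the sample length.

```python
import math


def samples_needed(std_dev, tolerable_error, z=1.96):
    """Textbook sample size for estimating a mean: n = (z * s / E) ** 2.

    An assumed, generic formula; not necessarily the one Biber used.
    """
    return math.ceil((z * std_dev / tolerable_error) ** 2)


def conservative_corpus_size(samples_by_feature, words_per_sample=2000):
    """Size the corpus on the feature that needs the most text samples."""
    n_samples = math.ceil(max(samples_by_feature.values()))
    return n_samples, n_samples * words_per_sample


# Sample requirements reported above for two features.
required = {"nouns": 59.8, "conditional clauses": 1190}
n_samples, n_words = conservative_corpus_size(required)
print(n_samples, n_words)

# Hypothetical inputs, for illustration only.
print(samples_needed(std_dev=30, tolerable_error=5))
```

With the figures reported above, the second function returns 1,190 samples and 2,380,000 words, matching the calculation in the text.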
Unfortunately, such calculations presuppose that one knows precisely what
linguistic constructions will be studied in a corpus. The ultimate length of a
corpus might therefore be better determined not by focusing too intently on the
overall length of the corpus but by focusing more on the internal structure of the
corpus: the range of genres one wishes to include in the corpus, the length and
number of individual text samples required to validly represent the genres that
will make up the corpus, and the demographic characteristics of the individuals
whose speech and writing will be chosen for inclusion in the corpus.
3 Sánchez and Cantos (1997) describe a formula they developed, based on a corpus of Spanish they
created, to predict the length of corpus necessary to provide a reliable and accurate representation
of Spanish vocabulary. However, because their calculations are based on the Spanish language,
they caution that “future research” (p. 265) needs to be conducted to determine whether the
formula can be applied to other languages, such as English, as well.
4 The types of individuals whose speech is represented in the demographically sampled part of the
BNC are discussed in greater detail in section 2.5.
Table 2.2 Composition of the ICE (adapted from Greenbaum 1996a: 29–30)
[Table body not recovered. Its two parts, Speech and Writing, each had the
columns Type, Number of texts, Length, and % of spoken (or written) corpus.]
type of writing in the BNC, in the ICE Corpus it is not as prominent. More
prominent were learned and popular examples of informational writing: writing
from the humanities, social and natural sciences, and technology (40 percent of
the written texts). These categories are also represented in the BNC, although
the BNC makes a distinction between the natural, applied, and social sciences
and, unlike the ICE Corpus, does not include equal numbers of texts in each
of these categories. The ICE Corpus also contains a fairly significant number
(25 percent) of non-printed written genres (such as letters and student writing),
while only 5–10 percent of the BNC contains these types of texts.
To summarize, while there are differences in the composition of the ICE
Corpus and BNC, overall the two corpora represent similar genres of spoken
and written English. The selection of these genres raises an important method-
ological question: why these genres and not others?
This question can be answered by considering the two main types of corpora
that have been created, and the particular types of studies that can be carried out
on them. The ICE Corpus and BNC are multi-purpose corpora, that is, they are
intended to be used for a variety of different purposes, ranging from studies of
vocabulary, to studies of the differences between various national varieties of
English, to studies whose focus is grammatical analysis, to comparisons of the
various genres of English. For this reason, each of these corpora contains a broad
range of genres. But in striving for breadth of coverage, some compromises had
to be made in each corpus. For instance, while the spoken part of the ICE
Corpus contains legal cross-examinations and legal presentations, the written
part of the corpus contains no written legal English. Legal written English was
excluded from the ICE Corpus on the grounds that it is a highly specialized
type of English firmly grounded in a tradition dating back hundreds of years,
and thus does not truly represent English as it is written in the 1990s (the years
during which texts for the ICE Corpus were collected). The ICE Corpus also
contains two types of newspaper English: press reportage and press editorials.
However, as Biber (1988: 180–96) notes, newspaper English is a diverse genre,
containing not just reportage and editorials but, for instance, feature writing as
well – a type of newspaper English not represented in the ICE Corpus.
Because general-purpose corpora such as the ICE Corpus and BNC do not
always contain a full representation of a genre, it is now becoming quite common
to see special-purpose corpora being developed. For instance, the Michigan
Corpus of Academic Spoken English (MICASE) was created to study the type
of speech used by individuals conversing in an academic setting: class lectures,
class discussions, student presentations, tutoring sessions, dissertation defenses,
and many other kinds of academic speech (Powell and Simpson 2001: 34–
40). By restricting the scope of the corpus, energy can be directed towards
assembling a detailed collection of texts that fully represent the kind of academic
language one is likely to encounter at an American university. As of June 2001,
seventy-one texts (813,684 words) could be searched on the MICASE website:
http://www.hti.umich.edu/micase/.
2.3 The types of genres to include in a corpus
certain periods: histories in Old English and laws and correspondence in Middle
English. Correspondence in the form of personal letters is not available until the
fifteenth century. Other types of genres exist in earlier periods but are defined
differently than they are in the modern period. The Helsinki Corpus includes
the genre of science in the Old English period, but the texts that are included
focus only on astronomy, a reflection of the fact that science played a much
more minimal role in medieval culture than it does today.
Corpora vary in terms of the length of the individual text samples that
they contain. The Lampeter Corpus of Early Modern English Tracts consists
of complete rather than composite texts. In the Helsinki Corpus, text samples
range from 2,000 to 10,000 words in length. Samples within the BNC are equally
varied but are as long as 40,000 words. This is considerably lengthier than earlier
corpora, such as Brown or LOB, which contained 2,000-word samples, and the
London–Lund Corpus, which contained 5,000-word samples. The ICE Corpus
follows in the tradition of Brown and LOB and contains 2,000-word samples.5
Because most corpora contain relatively short stretches of text, they tend to
contain text fragments rather than complete texts. Ideally, it would be desirable
to include complete texts in corpora, since even if one is studying grammatical
constructions, it is most natural to study these constructions within the context
of a complete text rather than only part of that text. However, there are numerous
logistical obstacles that make the inclusion of complete texts in corpora nearly
impossible. For instance, many texts, such as books, are quite lengthy, and to
include a complete text in a corpus would not only take up a large part of the
corpus but require the corpus compiler to obtain permission to use not just a
text excerpt, a common practice, but an entire text, a very uncommon practice.
Experience with the creation of the ICE Corpus has shown that, in general, it
is quite difficult to obtain permission to use copyrighted material, even if only
part of a text is requested, and this difficulty will only be compounded if one
seeks permission to use an entire text (cf. section 3.3).
5 All of these lengths are approximate. For instance, some texts in the ICE Corpus exceed 2,000
words to avoid truncating a text sample in mid-word or mid-paragraph.
Of course, just because only text samples are included in a corpus does not
mean that sections of texts ought to be randomly selected for inclusion in a
corpus. It is possible to take excerpts that themselves form a coherent unit. In
the American component of ICE (and many earlier corpora as well), sections of
spoken texts are included that form coherent conversations themselves: many
conversations consist of subsections that form coherent units, and that have their
own beginnings, middles, and ends. Likewise, for written texts, one can include
the first 2,000 words of an article, which contains the introduction and part of
the body of the article, or one can take the middle of an article, which contains a
significant amount of text developing the main point made in the article, or even
its end. Many samples in the ICE Corpus also consist of composite texts: a series
of complete short texts that total 2,000 words in length. For instance, personal
letters are often shorter than 2,000 words, and a text sample can consist of
complete letters totalling 2,000 words. For both the spoken and written parts of
the corpus, not all samples are exactly 2,000 words: a sample is not broken off in
mid-sentence but at a point (often over or just under the 2,000-word limit) where
a natural break occurs. But even though it is possible to include coherent text
samples in a corpus, creators and users of corpora simply have to acknowledge
that corpora are not always suitable for many types of discourse studies, and
that those wishing to carry out such studies will simply have to assemble their
own corpora for their own personal use.
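The practice of ending a sample at a natural break rather than at exactly 2,000 words can be sketched as below. This is only an illustration: it treats sentence-final punctuation as the "natural break", which is an assumption (a compiler might prefer paragraph or topic boundaries), and the function name is hypothetical.

```python
import re


def cut_at_natural_break(text, target_words=2000):
    """Return the run of whole sentences whose length is closest to the target.

    Sentence boundaries are approximated by ., ! or ? followed by whitespace.
    The result may fall just under or just over target_words.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    best_count, best_end = 0, 0
    running = 0
    for i, sentence in enumerate(sentences, start=1):
        running += len(sentence.split())
        closer = abs(running - target_words) < abs(best_count - target_words)
        if best_end == 0 or closer:
            best_count, best_end = running, i
        if running >= target_words:
            break  # later sentences can only move further from the target
    return " ".join(sentences[:best_end]), best_count


sample, n_words = cut_at_natural_break(
    "One two three. Four five six seven. Eight nine.", target_words=6
)
print(n_words, "|", sample)
```

In the toy call, the cut falls after the second sentence (seven words) because that boundary is closer to the six-word target than the boundary after the first sentence (three words).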
In including short samples from many different texts, corpus compilers are
assuming that it is better to include more texts from many different speakers and
writers than fewer texts from a smaller number of speakers and writers. And
there is some evidence to suggest that this is the appropriate approach to take in
creating a corpus. Biber (1990 and 1993: 248–52) conducted an experiment in
which he used the LOB and London–Lund Corpora to determine whether text
excerpts provide a valid representation of the structure of a particular genre.
He divided the LOB and London–Lund corpora into 1,000-word samples: he
took two 1,000-word samples from each 2,000-word written sample of the LOB
Corpus and four 1,000-word samples from each 5,000-word sample of speech
from the London–Lund Corpus. In 110 of these 1,000-word samples,
Biber calculated the frequency of a range of different linguistic items, such as
nouns, prepositions, present- and past-tense verb forms, passives, and so forth.
He concluded that 1,000-word excerpts are lengthy enough to provide valid
and reliable information on the distribution of frequently occurring linguistic
items. That is, if one studied the distribution of prepositions in the first thousand
words of a newspaper article totalling 10,000 words, for instance, studying
the distribution of prepositions in the entire article would not yield different
distributions: the point of diminishing returns is reached after 1,000 words. On
the other hand, Biber found that infrequently occurring linguistic items (such as
relative clauses) cannot be reliably studied in a short excerpt; longer excerpts
are required.
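The design of this experiment (splitting a text into 1,000-word excerpts and comparing the frequency of an item across them) can be approximated with a short sketch. It is not Biber's procedure: it matches words against a small, illustrative preposition list rather than using part-of-speech tags, and all names are hypothetical.

```python
# Illustrative, non-exhaustive word list; a real study would use tagged text.
PREPOSITIONS = {"of", "in", "for", "with", "on", "at", "by", "from"}


def split_into_chunks(tokens, size=1000):
    """Split a token list into consecutive full chunks; a short tail is dropped."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def rate_per_thousand(tokens, items):
    """Occurrences of the given items per 1,000 tokens."""
    hits = sum(1 for token in tokens if token.lower() in items)
    return 1000 * hits / len(tokens)


def chunk_rates(tokens, items, size=1000):
    """Rate of the items in each successive excerpt of the text."""
    return [rate_per_thousand(chunk, items) for chunk in split_into_chunks(tokens, size)]


# Artificial 2,000-token text in which every second token is a preposition.
tokens = ["of", "cat"] * 1000
print(chunk_rates(tokens, PREPOSITIONS))
```

If the rates in successive excerpts of a real text are close to one another, a single excerpt is a reasonable stand-in for the whole text with respect to that item; widely differing rates suggest that longer excerpts are needed.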
In addition to studying the distribution of word categories, such as nouns or
prepositions, Biber (1993: 250) calculated the frequency with which new words
are added to a sample as the number of words in the sample increases. He found,
for instance, that humanities texts are more lexically diverse than technical
prose texts (p. 252), that is, that as a humanities text progresses, there is higher
likelihood that new words will be added as the length of the text increases than
there will be in a technical prose text. This is one reason why lexicographers
need such large corpora to study vocabulary trends, since so much vocabulary
(in particular, open-class items such as nouns and verbs) occurs so rarely. And
as more text is considered, there is a greater chance (particularly in humanities
texts) that new words will be encountered.
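The measure described here, the rate at which new words appear as a sample grows, can be sketched as a simple growth curve. This is a minimal illustration that counts distinct lowercased word forms; the treatment of tokens and the function name are assumptions, not details taken from Biber's study.

```python
def vocabulary_growth(tokens, step=100):
    """Return (tokens read so far, distinct word forms seen so far) every `step` tokens."""
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


# Toy input: after 2, 4 and 6 tokens there are 2, 3 and 4 distinct forms.
toy = ["a", "b", "a", "c", "a", "d"]
print(vocabulary_growth(toy, step=2))
```

A curve that keeps rising steeply indicates a lexically diverse text in which further reading continues to turn up new words; a curve that flattens early indicates the opposite.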
Biber’s (1993) findings would seem to suggest that corpus compilers ought
to greatly increase the length of text samples to permit the study of infrequently
occurring grammatical constructions and vocabulary. However, Biber (1993:
252) concludes just the opposite, arguing that “Given a finite effort invested
in developing a corpus, broader linguistic representation can be achieved by
focusing on diversity across texts and text types rather than by focusing on
longer samples from within texts.” In other words, corpus compilers should
strive to include more different kinds of texts in corpora rather than lengthier
text samples. Moreover, those using corpora to study infrequently occurring
grammatical constructions will need to go beyond currently existing corpora
and look at additional material on their own.6
6 It is also possible to use elicitation tests to study infrequently occurring grammatical items – tests
that ask native speakers to comment directly on particular linguistic items. See Greenbaum (1973)
for a discussion of how elicitation tests can be used to supplement corpus studies.
2.5 Determining the number of texts and range of speakers and writers
amply represented. The reason for this sentiment is obvious: while only a small
segment of the speakers of English create scripted broadcast news reports, all
speakers of English engage in spontaneous dialogues.
Although it is quite easy to determine the relative importance of sponta-
neous dialogues in English, it is far more difficult to go through every potential
genre to be included in a corpus and rank its relative importance and frequency
to determine how much of the genre should be included in the corpus. And
if one did take a purely “proportional” approach to creating a corpus, Biber
(1993: 247) notes, the resultant corpus “might contain roughly 90% conversa-
tion and 3% letters and notes, with the remaining 7% divided among registers
such as press reportage, popular magazines, academic prose, fiction, lectures,
news broadcasts, and unpublished writings” since these figures provide a rough
estimate of the actual percentages of individuals that create texts in each of
these genres. To determine how much of a given genre needs to be included
in a corpus, Biber (1993) argues that it is more desirable to focus only on
linguistic considerations, specifically how much internal variation there is in
the genre. As Biber (1988: 170f.) has demonstrated in his pioneering work on
genre studies, some genres have more internal variation than others and, con-
sequently, more examples of these genres need to be included in a corpus to
ensure that the genres are adequately represented. For instance, Biber (1988:
178) observes that even though the genres of official documents and academic
prose contain many subgenres, the subgenres in official documents (e.g. gov-
ernment reports, business reports, and university documents) are much more
linguistically similar than the subgenres of academic prose (e.g. the natural and
social sciences, medicine, and the humanities). That is, if the use of a con-
struction such as the passive is investigated in the various subgenres of official
documents and academic prose, there will be less variation in the use of the
passive in the official documents than in the academic prose. This means that a
corpus containing official documents and academic prose will need to include
more academic prose to adequately represent the amount of variation present
in this genre.
In general, Biber’s (1988) study indicates that a high rate of internal varia-
tion occurs not just in academic prose but in spontaneous conversation (even
though there are no clearly delineated subgenres in spontaneous conversation),
spontaneous speeches, journalistic English (though Biber analyzed only press
reportage and editorials), general fiction, and professional letters. Less inter-
nal variation occurs in official documents (as described above), science fiction,
scripted speeches, and personal letters.
Because the BNC is a very lengthy corpus, it provides a sufficient number
of samples of genres to enable generalizations to be made about the genres.
However, with the much shorter ICE Corpus (and with other million-word
corpora, such as Brown and LOB, as well), it is an open question whether the
forty 2,000-word samples of academic prose contained in the ICE Corpus, for
instance, are enough to adequately represent this genre. And given the range
of variation that Biber (1988) documents in academic prose, forty samples are
probably not enough. Does this mean, then, that million-word corpora are not
useful for genre studies?
The answer is no: while these corpora are too short for some studies, for fre-
quently occurring grammatical constructions they are quite adequate for making
generalizations about a genre. For instance, Meyer (1992) used 120,000-word
sections of the Brown, the London–Lund, and the Survey of English Usage
(SEU) Corpora to study the usage of appositions in four genres: spontaneous
conversation, press reportage, academic prose, and fiction. In analyzing only
twenty 2,000-word samples from the press genre of the Brown Corpus, he
found ninety-seven examples of appositions of the type political neophyte Steve
Forbes in which the first unit lacks a determiner and occurs before a proper noun
(p. 12); in eight 5,000-word samples from the press genre of the SEU Corpus,
he found only thirty-one examples of such appositions. These findings allowed
Meyer (1992) to claim quite confidently that not only were such appositions
confined to the press genre (the other genres contained virtually no examples
of this type of apposition) but they were more common in American English
than in British English.
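Because the two sets of samples differ in number and length, the raw counts are easier to compare once they are expressed as rates. The sketch below normalizes the counts reported above to occurrences per 1,000 words; the function name is an assumption introduced for illustration.

```python
def per_thousand_words(count, n_samples, words_per_sample):
    """Occurrences per 1,000 words across a set of equal-length samples."""
    return 1000 * count / (n_samples * words_per_sample)


# Counts and sample sizes as reported above for the press genre.
brown_rate = per_thousand_words(count=97, n_samples=20, words_per_sample=2000)
seu_rate = per_thousand_words(count=31, n_samples=8, words_per_sample=5000)
print(f"Brown: {brown_rate:.3f}  SEU: {seu_rate:.3f}  ratio: {brown_rate / seu_rate:.2f}")
```

Both sets of samples total 40,000 words, so the rates are about 2.4 and 0.8 per thousand words respectively, a roughly threefold difference.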
Studying the amount of internal linguistic variation in a genre is one way
to determine how many samples of a genre need to be included in a corpus;
applying the principles of sampling methodology is another way. Although
sampling methodology can be used to determine the number of samples needed
to represent a genre, this methodology, as was demonstrated earlier in this
section, is not all that useful for this purpose. It is best used to select the
individuals whose speech and writing will be included in a corpus.
Social scientists have developed a sophisticated methodology based on math-
ematical principles that enables a researcher to determine how many “elements”
from a “sampling frame” need to be selected to produce a “representative” and
therefore “valid” sample. A sampling frame is determined by identifying a spe-
cific population that one wishes to make generalizations about. For instance, in
creating the Tampere Corpus, Norri and Kytö (1996) decided that their sampling
frame would not be just scientific writing but different types of scientific writ-
ing written at varying levels of technicality. After deciding what their sampling
frame was, Norri and Kytö (1996) then had to determine which elements from
this frame they needed to include in their corpus to produce a representative
sampling of scientific writing, that is, enough writing from the social sciences,
for instance, so that someone using the corpus could analyze the writing and
be able to claim validly that their results were true not just about the specific
examples of social science writing included in the Tampere Corpus but all so-
cial science writing. Obviously, Norri and Kytö (1996) could not include all
examples of social science writing published in, say, a given year; they had to
therefore narrow the range of texts that they selected for inclusion.
Social scientists have developed mathematical formulas that enable a re-
searcher to calculate the number of samples they will need to take from a
sampling frame to produce a representative sample of the frame. Kretzschmar,
Meyer, and Ingegneri (1997) used one of Kalton’s (1983: 82) formulas to cal-
culate the number of books published in 1992 that would have to be sampled to
create a representative sample. Bowker’s Publishers Weekly lists 49,276 books
as having been published in the United States in 1992. Depending on the level
of confidence desired, samples from 2,168 to 2,289 books would have to be
included in the corpus to produce a representative sample. If each sample from
each book was 2,000 words in length, the corpus of books would be between
4,336,000 and 4,578,000 words in length. Kalton’s (1983) formula could be
applied to any sampling frame that a corpus compiler identifies, provided that
the size of the frame can be precisely determined.
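A calculation of this general kind can be sketched as follows. The text does not reproduce Kalton's formula, so the sketch uses a widely taught formula for estimating a proportion with a finite population correction; that choice, and the default margin of error, z value, and proportion, are assumptions made here and are not taken from Kretzschmar, Meyer, and Ingegneri.

```python
import math


def sample_size(population, margin=0.02, z=1.96, p=0.5):
    """Sample size for a proportion, with finite population correction.

    n0 = z^2 * p * (1 - p) / margin^2, then n = n0 / (1 + (n0 - 1) / N).
    All default parameter values are illustrative assumptions.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def corpus_words(n_samples, words_per_sample=2000):
    """Total corpus length if every sampled book contributes one sample."""
    return n_samples * words_per_sample


books_1992 = 49276  # size of the sampling frame reported above
n_books = sample_size(books_1992)
print(n_books, corpus_words(n_books))

# Corpus lengths implied by the sample counts reported above.
print(corpus_words(2168), corpus_words(2289))
```

The formula makes visible why the size of the frame must be known: the correction term depends on the population size, and without it only the uncorrected (larger) figure can be computed.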
Sampling methodology can also be used to select the particular individuals
whose speech and writing will be included in a corpus. For instance, in plan-
ning the collection of demographically sampled speech for the BNC, “random
location sampling procedures” were used to select individuals whose speech
would be recorded (Crowdy 1993: 259). Great Britain was divided into twelve
regions. Within these regions, “thirty sampling points” were selected: locations
at which recordings would be made and that were selected based on “their
ACORN profile . . . (A Classification of Regional Neighbourhoods)” (Crowdy
1993: 260). This profile provides demographic information about the types of
people likely to live in certain regions of Great Britain, and the profile helped
creators of the BNC select speakers of various social classes to be recorded.
In selecting potential speakers, creators of the BNC also controlled for other
variables, such as age and gender.
In using sampling methodology to select texts and speakers and writers for
inclusion in a corpus, a researcher can employ two general types of sampling:
probability sampling and non-probability sampling (Kalton 1983). In probabil-
ity sampling (employed above in selecting speakers for the BNC), the researcher
very carefully pre-selects the population to be sampled, using statistical formu-
las and other demographic information to ensure that the number and type of
people being surveyed are truly representative. In non-probability sampling, on
the other hand, this careful pre-selection process is not employed. For instance,
one can select the population to be surveyed through the process of “haphazard,
convenience, or accidental sampling” (Kalton 1983: 90); that is, one samples in-
dividuals who happen to be available. Alternatively, one can employ “judgment,
or purposive, or expert choice” sampling (Kalton 1983: 91); that is, one decides
beforehand who would be best qualified to be sampled (e.g. native rather than
non-native speakers of a language, educated vs. non-educated speakers of a
language, etc.). Finally, one can employ “quota sampling” (Kalton 1983: 91),
and sample certain percentages of certain populations. For instance, one could
create a corpus of American English by including in it samples reflecting ac-
tual percentages of ethnic groups residing in the United States (e.g. 10 percent
African Americans, 15 percent Hispanic Americans, etc.).
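Quota sampling of this kind amounts to dividing a fixed number of samples according to population percentages. The sketch below does this with a largest-remainder rule so that the quotas always sum to the total. The 10 and 15 percent shares are the illustrative figures given above; the 75 percent "other groups" entry is a placeholder added here so that the shares sum to 100, and the 500-sample total is likewise hypothetical.

```python
def quota_allocation(total_samples, percent_shares):
    """Split total_samples across groups in proportion to whole-number percentages."""
    if sum(percent_shares.values()) != 100:
        raise ValueError("percentage shares must sum to 100")
    counts, remainders = {}, {}
    for group, percent in percent_shares.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        counts[group], remainders[group] = divmod(total_samples * percent, 100)
    leftover = total_samples - sum(counts.values())
    # Hand any leftover samples to the groups with the largest remainders.
    for group in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[group] += 1
    return counts


shares = {
    "African American": 10,
    "Hispanic American": 15,
    "other groups (placeholder)": 75,  # hypothetical filler, not from the text
}
quotas = quota_allocation(500, shares)
print(quotas)
```

Note that the allocation only fixes how many samples each group contributes; how individuals within each group are chosen is left open, which is what distinguishes quota sampling from probability sampling.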
Although probability sampling is the most reliable type of sampling, lead-
ing to the least amount of bias, for the corpus linguist this kind of sampling
presents considerable logistical challenges. The mathematical formulas used
in probability sampling often produce very large sample sizes, as the example
with books cited above illustrated. Moreover, to utilize the sampling techniques
employed by creators of the BNC in the United States would require consider-
able resources and funding, given the size of the United States and the number
of speakers of American English. Consequently, it is quite common for corpus
linguists to use non-probability sampling techniques in compiling their corpora.
In creating the Brown Corpus, for instance, a type of “judgement” sampling
was used. That is, prior to the creation of the Brown Corpus, it was decided
that the writing to be included in the corpus would be randomly selected from
collections of edited writing at four locations:
(a) for newspapers, the microfilm files of the New York Public Library;
(b) for detective and romantic fiction, the holdings of the Providence Athen-
aeum, a private library;
(c) for various ephemeral and popular periodical material, the stock of a large
secondhand magazine store in New York City;
(d) for everything else, the holdings of the Brown University Library. (Quoted
from Francis 1979: 195)
Other corpora have employed other non-probability sampling techniques. The
American component of the International Corpus of English used a combina-
tion of “judgement” and “convenience” sampling: every effort was made to
collect speech and writing from a balanced group of constituencies (e.g. equal
numbers of males and females), but what was ultimately included in
the corpus was a consequence of whose speech or writing could be most eas-
ily obtained. For instance, much fiction is published in the form of novels or
short stories. However, many publishers require royalty payments from those
seeking reproduction rights, a charge that can sometimes involve hundreds of
dollars. Consequently, most of the fiction in the American component of ICE
consists of unpublished samples of fiction taken from the Internet: fiction for
which permission can usually be obtained for no cost and which is available in
computerized form. The Corpus of Spoken American English has employed a
variation of “quota” sampling, making every effort to collect samples of speech
from a representative sampling of men and women, and various ethnicities and
regions of the United States.
The discussion thus far in this chapter has focused on the composition of the
BNC to identify the most important factors that need to be considered prior to
collecting texts for inclusion in a corpus. This discussion has demonstrated that:
1. Lengthier corpora are better than shorter corpora. However, even more im-
portant than the sheer length of a corpus is the range of genres included
within it.
2. The range of genres to be included in a corpus is determined by whether it
will be a multi-purpose corpus (a corpus intended to have wide uses) or a
special-purpose corpus (a corpus intended for more specific uses, such as
the analysis of a particular genre like scientific writing).
2.6 The time-frame for selecting speakers and texts
Most corpora contain samples of speech or writing that have been writ-
ten or recorded within a specific time-frame. Synchronic corpora (i.e. corpora
containing samples of English as it is presently spoken and written) contain
texts created within a relatively narrow time-frame. Although most of the writ-
ten texts used in the BNC were written or published between 1975 and 1993, a
few texts date back as far as 1960 (Aston and Burnard 1998: 30). The Brown
and LOB corpora contain texts published in 1961. The ICE Corpus contains
texts spoken or written (or published) in the years 1990-present.
In creating a synchronic corpus, the corpus compiler wants to be sure that
the time-frame is narrow enough to provide an accurate view of contemporary
English undisturbed by language change. However, linguists disagree about
whether purely synchronic studies are even possible: new words, for instance,
come into the language every day, indicating that language change is a con-
stant process. Moreover, even grammatical constructions can change subtly in
a rather short period of time. For instance, Mair (1995) analyzed the types of
verb complements that followed the verb help in parts of the press sections
of the LOB Corpus (composed of texts published in 1961) and the FLOB
(Freiburg–Lancaster–Oslo–Bergen) Corpus, a corpus comparable to LOB but
containing texts published thirty years later in 1991. Mair (1995) found some
differences in the complementation patterns, even over a period as short as
thirty years. While help with the infinitive to was the norm in the 1961 LOB
Corpus (e.g. I will help you to lift the books), this construction has been re-
placed in the 1991 FLOB Corpus by the bare infinitive (e.g. I will help you lift
the books).
Mair’s (1995) study indicates that language change can occur over a relatively
short period of time. Therefore, if one wishes to create a truly synchronic corpus,
then the texts included in the corpus should reflect as narrow a time-frame
as possible. Although this time-frame does not need to be as narrow as the
one-year frame in Brown and LOB, a time-frame of five to ten years seems
reasonable.
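A compiler who wants to check a planned text list against this guideline could do so with a sketch like the following. The ten-year threshold is simply the upper end of the range suggested above, and the year lists are illustrative values built from the dates mentioned in this section, not actual corpus metadata.

```python
def span_in_years(years):
    """Number of calendar years covered, counting both endpoints."""
    return max(years) - min(years) + 1

def is_synchronic(years, max_span=10):
    """True if all texts fall inside a window of at most max_span years."""
    return span_in_years(years) <= max_span

# Illustrative publication years only.
one_year_corpus = [1961, 1961, 1961]   # a Brown/LOB-style one-year frame
wide_corpus = [1960, 1975, 1993]       # dates as far apart as those cited for the BNC

narrow_ok = is_synchronic(one_year_corpus)
wide_ok = is_synchronic(wide_corpus)
```

Counting both endpoints means that a corpus drawn entirely from 1961 has a span of one year, which matches the "one-year frame" described for Brown and LOB.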
With diachronic corpora (i.e. corpora used to study historical periods of
English), the time-frame for texts is somewhat easier to determine, since the
various historical periods of English are fairly well defined. However, compli-
cations can still arise. For instance, Rissanen (1992: 189) notes that one goal
of a diachronic corpus is “to give the student an opportunity to map and com-
pare variant fields or variant paradigms in successive diachronic stages in the
past.” To enable this kind of study in the Helsinki Corpus, Rissanen (1992)
remarks that the Old and Early Middle English sections of the corpus were
divided into 100-year subperiods, the Late Middle and Early Modern English
periods into seventy- to eighty-year periods. The lengths of subperiods were
shorter in the later periods of the corpus “to include the crucial decades of
the gradual formation of the fifteenth-century Chancery standard within one
and the same period” (Rissanen 1992: 189). The process that was used to de-
termine the time-frames included in the Helsinki Corpus indicates that it is
important in diachronic corpora not just to cover predetermined historical
periods of English but to think through how significant events occurring during
those periods can be best covered in the particular diachronic corpus being
created.
had lived in the United States and spoken American English since adolescence.
Adolescence (10–12 years of age) was chosen as the cut-off point, because when
acquisition of a language occurs after this age, foreign accents tend to appear
(one marker of non-native speaker status). In non-native varieties of English, the
level of fluency among speakers will vary considerably, and as Schmied (1996:
187) observes, “it can be difficult to determine where an interlanguage ends
and educated English starts.” Nevertheless, in selecting speakers for inclusion
in a corpus of non-native speech or writing, it is important to define specific
criteria for selection, such as how many years an individual has used English,
in what contexts they have used it, how much education in English they have
had, and so forth.
To determine whether an individual’s speech or writing is appropriate for
inclusion in a corpus (and also to keep track of the sociolinguistic variables
described in section 2.8), one can have individuals contributing texts to a corpus
fill out a biographical form in which they supply the information necessary for
determining whether their native or non-native speaker status meets the criteria
for inclusion in the corpus. For the American component of ICE, individuals
are asked to supply their residence history: the places they have lived over their
lives and the dates they have lived there. If, for instance, an individual has spent
the first four years of life outside the United States, it can probably be
safely assumed that the individual came to
the United States at an early enough age to be considered a native speaker of
English. On the other hand, if an individual spent the first fifteen years of life
outside the United States, he or she is probably a non-native speaker of English
and has learned English as a second language. It is generally best not to ask
individuals directly on the biographical form whether they are a native speaker
or not, since they may not understand (in the theoretical sense) precisely what a
native speaker of a language is. If the residence history on the biographical form
is unclear, it is also possible to interview the individual afterwards, provided that
he or she can be located; if the individual does not fit the criteria for inclusion,
his or her text can be discarded.
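The reasoning applied to a residence history can be sketched as follows. This is a simplified illustration, not the procedure actually used for the American component of ICE: the record layout and function names are hypothetical, the 10 to 12 band is the adolescence cut-off mentioned above, and the sketch ignores later moves out of the country.

```python
from dataclasses import dataclass

@dataclass
class Residence:
    country: str
    start_year: int
    end_year: int  # inclusive

def age_at_arrival(birth_year, history, country="United States"):
    """Age at which the person first lived in `country`, or None if never."""
    starts = [r.start_year for r in history if r.country == country]
    return min(starts) - birth_year if starts else None

def classify(birth_year, history, lower=10, upper=12):
    """Rough native-speaker judgement based on age of arrival."""
    age = age_at_arrival(birth_year, history)
    if age is None or age > upper:
        return "probably non-native"
    if age < lower:
        return "probably native"
    # Arrival within the adolescence band: the form alone cannot decide.
    return "unclear: follow up with an interview"

# Hypothetical residence histories for three people born in 1970.
early = [Residence("Canada", 1970, 1973), Residence("United States", 1974, 2000)]
late = [Residence("Germany", 1970, 1984), Residence("United States", 1985, 2000)]
borderline = [Residence("Mexico", 1970, 1980), Residence("United States", 1981, 2000)]
```

The first two cases mirror the examples above (arrival at age four and at age fifteen); the third falls inside the cut-off band and is flagged for a follow-up interview rather than decided automatically.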
Determining native or non-native speaker status from authors of published
writing can be considerably more difficult, since it is often not possible to
locate authors and have them fill out biographical forms. In addition, it can
be misleading to use an individual’s name alone to determine native speaker
status, since someone with a non-English-sounding name may have immigrant
parents and nevertheless be a native speaker of English, and many individuals
with English-sounding names may not be native speakers: one of the written
samples of the American component of ICE had to be discarded when the au-
thor of one of the articles called on the telephone and explained that he was
not a native speaker of American English but of Australian English. Schmied
(1996: 187) discovered that a major newspaper in Kenya had a chief editor
and training editor who were both Irish and who exerted considerable editorial
influence over the final form of articles written by native Kenyan reporters.
This discovery led Schmied (1996) to conclude that there are real questions
about the authenticity of the English used in African newspapers. When deal-
ing with written texts from earlier periods, additional problems can be en-
countered: English texts were often translations from Latin or French, and a
text published in the United States, for instance, might have been written by
a speaker of British English. Ultimately, however, when dealing with written
texts, one simply has to acknowledge that in compiling a corpus of American
English, for instance, there is a chance that the corpus will contain some
writing by non-native speakers, despite one’s best efforts to collect only native-speaker
writing.
In spoken dialogues, one may find out that one or more of the speakers in a
conversation do not meet the criteria for inclusion because they are not native
speakers of the variety being collected. However, this does not necessarily mean
that the text has to be excluded from the corpus, since there is annotation that
can be included in a corpus indicating that certain sections of a sample are
“extra-corpus” material (cf. section 4.1), that is, material not considered part
of the corpus for purposes of word counts, generating KWIC (key word in
context) concordances, and so forth.
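The effect of such annotation on word counts can be illustrated with a short sketch. It assumes, purely for the purposes of the example, that extra-corpus stretches are enclosed in a pair of tags written here as <X> and </X>; the annotation actually used is the subject of section 4.1 and may differ. The sample text is invented.

```python
import re

# Assumed tag pair marking extra-corpus material (an illustration only).
EXTRA_CORPUS = re.compile(r"<X>.*?</X>", flags=re.DOTALL)

def corpus_word_count(text):
    """Count whitespace-separated words after dropping extra-corpus spans."""
    kept = EXTRA_CORPUS.sub(" ", text)
    return len(kept.split())

# Invented sample: the middle stretch is spoken by a non-native speaker.
sample = "did you see the film <X> no I did not </X> well it was very good"
count = corpus_word_count(sample)
```

The same filtering step would be applied before generating a concordance, so that the excluded stretch remains in the transcript for context but never appears as a hit.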
2.8.2 Age
Although there are special-purpose corpora such as CHILDES
(cf. section 1.3.8) that contain the language of children, adolescents, and adults,
most corpora contain the speech and writing of “adults.” The notable exception
to this trend is the British National Corpus. In the spoken section of the cor-
pus containing demographically sampled speech, there is a balanced grouping
of texts representing various age groups, ranging from the youngest group-
ing (0–14 years) to the oldest grouping (60+) (“Composition of the BNC”:
http://info.ox.ac.uk/bnc/what/balance.html). In addition, there were texts in-
cluded in the BNC taken from the Bergen Corpus of London Teenager English
(COLT), which contains the speech of adolescents up to age 16 (Aston and
Burnard 1998: 32; also cf. section 1.3.4). However, the written part of the
BNC contains a sparser selection of texts from younger age groups, largely
because younger individuals simply do not produce the kinds of written texts
(e.g. press reportage or technical reports) represented in the genres typically
included in corpora. This is one reason why corpora have tended to contain
mainly adult language. Another reason is that to collect the speech of children
and adolescents, one often has to obtain the permission not just of the individual
being recorded but of his or her parents as well, a complicating factor in an
already complicated endeavor.
If it is decided that a corpus is to contain only adult language, it then becomes
necessary to determine some kind of age cut-off for adult speech. The ICE
project has set the age of 18 as a cut-off point between adolescent and adult, and
thus has included in the main body of the corpus only the speech and writing of
individuals aged 18 or over. However, even though one may decide to include
only adult speech in a corpus, to get a full range of adult speaking styles, it is
desirable to include adults conversing with adolescents or children as well. In
the transcribed corpus itself, markup can be included to set off the speech of
adolescents or children as “extra corpus” material.
In most corpora attempting to collect texts from individuals of varying ages,
the ages of those in the corpus are collapsed into various age groups. For
instance, the British component of ICE has five groups: 18–25, 26–45, 46–65,
66+, and a category of “co-authored” for written texts with multiple authors.
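Collapsing individual ages into groups of this kind is a simple lookup, sketched below with the four age bands just listed. The function name and the decision to return None for speakers under 18 are choices made for this illustration; the "co-authored" category is not handled, because it applies to texts rather than to individual ages.

```python
from bisect import bisect_right

# Lower bound of each age band, with the matching labels.
LOWER_BOUNDS = [18, 26, 46, 66]
LABELS = ["18-25", "26-45", "46-65", "66+"]

def age_group(age):
    """Return the age band for `age`, or None if the speaker is under 18."""
    index = bisect_right(LOWER_BOUNDS, age)
    return LABELS[index - 1] if index > 0 else None

groups = [age_group(a) for a in (17, 18, 25, 26, 65, 66, 90)]
```

Storing the group label alongside the exact age, where the exact age is known, leaves open the option of regrouping the speakers differently at a later stage.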
represent regional and social variation is that creators of these corpora have had
unrealistic expectations. As Chafe, Du Bois, and Thompson (1991: 69) note,
the thought of a corpus of “American English” to some individuals conjures
up images of “a body of data that would document the full sweep of the lan-
guage, encompassing dialectal diversity across regions, social classes, ethnic
groups ... [enabling] massive correlations of social attributes with linguistic
features.” But to enable studies of this magnitude, the corpus creator would
have to have access to resources far beyond those that are currently available –
resources that would enable the speech of thousands of individuals to be recorded
and then transcribed.
Because it is not logistically feasible in large countries such as the United
States or Great Britain to create corpora that are balanced by region and social
class, it is more profitable for corpus linguists interested in studying social
and regional variation to devote their energies to the creation of corpora that
focus on smaller dialect regions. Tagliamonte (1998) and Tagliamonte and
Lawrence (2000: 325–6), for instance, contain linguistic discussions based on
a 1.5-million-word corpus of York English that has been subjected to extensive
analysis and that has yielded valuable information on dialect patterns (both
social and regional) particular to this region of England. Kirk (1992) has created
the Northern Ireland Transcribed Corpus of Speech containing transcriptions
of interviews of speakers of Hiberno-English. Once a number of more focused
corpora like these are created, they can be compared with one another and valid
comparisons of larger dialect areas can then be conducted.
2.9 Conclusions
Study questions