BNC & Ice

The document discusses the careful planning required for constructing a corpus, emphasizing the importance of decisions regarding its size, text types, and population sampling based on intended uses. It highlights the British National Corpus (BNC) as a significant example, detailing its composition of written and spoken British English and the methodological considerations involved in its creation. Additionally, it addresses the logistical challenges and resource commitments necessary for corpus compilation, as well as the diversity of genres included in the BNC.

2 Planning the construction of a corpus

Before the texts to be included in a corpus are collected, annotated, and
analyzed, it is important to plan the construction of the corpus carefully: what
size it will be, what types of texts will be included in it, and what population
will be sampled to supply the texts that will comprise the corpus. Ultimately,
decisions concerning the composition of a corpus will be determined by the
planned uses of the corpus. If, for instance, the corpus is to be used primarily for
grammatical analysis (e.g. the analysis of relative clauses or the structure of
noun phrases), it can consist simply of text excerpts rather than complete texts.
On the other hand, if the corpus is intended to permit the study of discourse
features, then it will have to contain complete texts.
Deciding how lengthy text samples within a corpus should be is but one of the
many methodological considerations that must be addressed before one begins
collecting data for inclusion in a corpus. To explore the process of planning a
corpus, this chapter will consider the methodological assumptions that guided
the compilation of the British National Corpus. Examining the British National
Corpus reveals how current corpus planners have overcome the methodolog-
ical deficiencies of earlier corpora, and raises more general methodological
considerations that anyone planning to create a corpus must address.

2.1 The British National Corpus

At approximately 100 million words in length, the British National


Corpus (BNC) (see table 2.1) is one of the largest corpora ever created. Most
of the corpus (about 90 percent) consists of various types of written British
English, with the remainder (about 10 percent) composed of different types of
spoken British English. Even though writing dominates in the BNC, the amount
of speech the corpus contains is the largest ever made available in a corpus.1
In planning the collection of texts for the BNC, a number of decisions were
made beforehand:
1. Even though the corpus would contain both speech and writing, more writing
would be collected than speech.
1 Prior to the creation of the British National Corpus, the London–Lund Corpus contained the most
speech available in a corpus: approximately 500,000 words (see Svartvik 1990).


Table 2.1 The composition of the British National Corpus


(adapted from Aston and Burnard 1998: 29–33 and
http://info.ox.ac.uk/bnc/what/balance.html)

Speech
Type Number of texts Number of words % of spoken corpus

Demographically sampled 153 4,211,216 41%
Educational 144 1,265,318 12%
Business 136 1,321,844 13%
Institutional 241 1,345,694 13%
Leisure 187 1,459,419 14%
Unclassified 54 761,973 7%
Total 915 10,365,464 100%

Writing
Type Number of texts Number of words % of written corpus

Imaginative 625 19,664,309 22%
Natural science 144 3,752,659 4%
Applied science 364 7,369,290 8%
Social science 510 13,290,441 15%
World affairs 453 16,507,399 18%
Commerce 284 7,118,321 8%
Arts 259 7,253,846 8%
Belief & thought 146 3,053,672 3%
Leisure 374 9,990,080 11%
Unclassified 50 1,740,527 2%
Total 3,209 89,740,544 99%2

2. A variety of different genres of speech and writing would be gathered for
inclusion in the corpus.
3. Each genre would be divided into text samples, and each sample would not
exceed 40,000 words in length.
4. A number of variables would be controlled for in the entire corpus, such as
the age and gender of speakers and writers.
5. For the written part of the corpus, a careful record of a variety of variables
would be kept, including when and where the texts were written or published;
whether they were books, articles, or manuscripts; and who their target
audience was.
6. For the demographically sampled spoken samples, texts would be collected
from individuals representing the major dialect regions of Great Britain and
the various social classes existing within these regions.
2 Because fractions were rounded up or down to the nearest whole number, cumulative frequencies
in this and subsequent tables do not always add up to exactly 100 percent.
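The rounding effect the footnote describes is easy to reproduce. The following sketch uses the word counts from the written portion of table 2.1; rounding each genre's percentage to a whole number yields a total of 99 percent rather than 100:

```python
# Percentages rounded to whole numbers need not sum to exactly 100.
# Word counts per written genre, taken from table 2.1.
word_counts = [19664309, 3752659, 7369290, 13290441, 16507399,
               7118321, 7253846, 3053672, 9990080, 1740527]
total = sum(word_counts)                        # 89,740,544 words
rounded = [round(100 * n / total) for n in word_counts]
print(rounded, sum(rounded))                    # the rounded shares sum to 99
```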

Even though the creators of the BNC had a very definite plan for the composi-
tion of the corpus, it is important to realize, as Biber (1993: 256) observes, that
the creation of a corpus is a “cyclical” process, requiring constant re-evaluation
as the corpus is being compiled. Consequently, the compiler of a corpus should
be willing to change his or her initial corpus design if circumstances arise
requiring such changes to be made. For instance, in planning the part of the
BNC containing demographically sampled speech, it was initially thought that
obtaining samples of speech from 100 individuals would provide a sufficiently
representative sample of this type of speech (Crowdy 1993: 259). Ultimately,
however, the speech of 124 individuals was required to create a corpus bal-
anced by age, gender, social class, and region of origin (Aston and Burnard
1998: 32).
While changes in corpus design are inevitable, it is crucial that a number
of decisions concerning the composition of a corpus are made prior to the
collection of texts for the corpus, and the remainder of this chapter considers
the factors that influenced the decisions detailed above that the creators of the
BNC made, and the general methodological issues that these decisions raise.

2.2 The overall length of a corpus

Earlier corpora, such as Brown and LOB, were relatively short, primar-
ily because of the logistical difficulties that computerizing a corpus created. For
instance, all of the written texts for the Brown Corpus had to be keyed in by hand,
a process requiring a tremendous amount of very tedious and time-consuming
typing. This process has been greatly eased with the invention of optical scan-
ners (cf. section 3.7), which can automatically computerize printed texts with
fairly high rates of accuracy. However, technology has not progressed to the
point where it can greatly expedite the collection and transcription of speech:
there is much work involved in going out and recording speech (cf. section 3.2),
and the transcription of spoken texts still has to be done manually with either
a cassette transcription machine or special software that can replay segments
of speech until the segments are accurately transcribed (cf. section 3.6). These
logistical realities explain why 90 percent of the BNC is writing and only 10 per-
cent speech.
To determine how long a corpus should be, it is first of all important to
compare the resources that will be available to create it (e.g. funding, research
assistants, computing facilities) with the amount of time it will take to collect
texts for inclusion, computerize them, annotate them, and tag and parse them.
Chafe, Du Bois, and Thompson (1991: 70–1) calculate that it will require “six
person-hours” to transcribe one minute of speech for inclusion in the Santa
Barbara Corpus of Spoken American English, a corpus containing not just
orthographic transcriptions of speech but prosodic transcriptions as well (cf.
section 4.1). For the American component of ICE, it was calculated that it
would take eight hours of work to process a typical 2,000-word sample of
writing, and ten to twenty hours to process a 2,000-word spoken text (with
spontaneous dialogues containing numerous speech overlaps requiring more
time than carefully articulated monologues). Since the American component of
ICE contains 300 samples of speech and 200 samples of writing, a full-time
research assistant working forty hours a week would need more than three years
to complete the entire corpus. If one applies these calculations to
the BNC, which is one hundred times the length of the American component
of ICE, it becomes obvious that creating a corpus requires a large commitment
of resources. In fact, the creation of the BNC required the collaboration of a
number of universities and publishers in Great Britain, as well as considerable
financial resources from funding agencies.
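The ICE estimates above reduce to back-of-the-envelope arithmetic. The per-sample hours below are the figures cited in the text; taking the low and high ends of the spoken-text range gives a rough bracket on the total commitment (the working-year figure of fifty-two forty-hour weeks is an assumption):

```python
# Rough processing-time estimate for the American component of ICE,
# using the per-sample figures cited above: 8 hours per 2,000-word
# written text, 10-20 hours per 2,000-word spoken text.
written_texts, spoken_texts = 200, 300
hours_written = written_texts * 8              # 1,600 hours
total_low = hours_written + spoken_texts * 10  # 4,600 hours
total_high = hours_written + spoken_texts * 20 # 7,600 hours

# Assume a full-time assistant: 40 hours/week, ~52 weeks/year.
years_low = total_low / (40 * 52)    # ~2.2 person-years
years_high = total_high / (40 * 52)  # ~3.7 person-years
print(f"{years_low:.1f}-{years_high:.1f} person-years")
```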
After determining how large a corpus one’s resources will permit, the next
step is to consider how long the corpus needs to be to permit the kinds of studies
one envisions for it. One reason that the BNC was so long was that one planned
use of the corpus was to create dictionaries, an enterprise requiring a much
larger database than is available in shorter corpora, such as Brown or LOB. On
the other hand, all of the regional components of the ICE Corpus are only one
million words in length, since the goal of the ICE project is not to permit in-
depth lexical studies but grammatical analyses of the different regional varieties
of English – a task that can be accomplished in a shorter corpus.
In general, the lengthier the corpus, the better. However, it is possible to
estimate the required size of a corpus more precisely by using statistical for-
mulas that take the frequency with which linguistic constructions are likely
to occur in text samples and calculate how large the corpus will have to be
to study the distribution of the constructions validly. Biber (1993: 253–4)
carried out such a study (based on information in Biber 1988) on 481 text
samples that occurred in twenty-three different genres of speech and writing.
He found that reliable information could be obtained on frequently occurring
linguistic items such as nouns in as few as 59.8 text samples. On the other
hand, infrequently occurring grammatical constructions such as conditional
clauses required a much larger number of text samples (1,190) for valid in-
formation to be obtained. Biber (1993: 254) concludes that the “most con-
servative approach” would be to base the ultimate size of a corpus on “the
most widely varying feature”: those linguistic constructions, such as condi-
tional clauses, that require the largest sample size for reliable studies to be
conducted. A corpus of 1,190 samples would therefore be 2,380,000 words in
length (if text samples were 2,000 words in length, the standard length in many
corpora).3
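Biber's "most conservative" sizing rule is simple arithmetic once the limiting feature is known. A minimal sketch, using the sample counts reported above:

```python
# Corpus size implied by Biber's (1993) conservative approach: size the
# corpus for the most widely varying feature (here, conditional clauses),
# which required 1,190 text samples for reliable study.
samples_needed = 1190   # samples needed for the rarest feature of interest
sample_length = 2000    # words per sample, the standard in many corpora
corpus_size = samples_needed * sample_length
print(corpus_size)      # 2,380,000 words
```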
Unfortunately, such calculations presuppose that one knows precisely what
linguistic constructions will be studied in a corpus. The ultimate length of a

3 Sánchez and Cantos (1997) describe a formula they developed, based on a corpus of Spanish they
created, to predict the length of corpus necessary to provide a reliable and accurate representation
of Spanish vocabulary. However, because their calculations are based on the Spanish language,
they caution that “future research” (p. 265) needs to be conducted to determine whether the
formula can be applied to other languages, such as English, as well.
corpus might therefore be better determined not by focusing too intently on the
overall length of the corpus but by focusing more on the internal structure of the
corpus: the range of genres one wishes to include in the corpus, the length and
number of individual text samples required to validly represent the genres that
will make up the corpus, and the demographic characteristics of the individuals
whose speech and writing will be chosen for inclusion in the corpus.

2.3 The types of genres to include in a corpus

The BNC, as table 2.1 indicates, contains a diverse range of spoken
and written genres. A plurality of the spoken texts (41 percent) were
“demographically sampled”: they consisted of a variety of spontaneous dialogues and
monologues recorded from individuals living in various parts of Great Britain.4
The remainder of the spoken samples contained a fairly even sampling (12–
14 percent) of monologues and dialogues organized by purpose. For instance,
those texts in the genre of education included not just classroom dialogues and
tutorials but lectures and news commentaries as well (Crowdy 1993: 263). A
plurality of the written texts (22 percent) were “imaginative”: they represented
various kinds of fictional and creative writing. Slightly less frequent were sam-
ples of writing from world affairs (18 percent), the social sciences (15 percent),
and a number of other written genres, such as the arts (8 percent) and the natural
sciences (4 percent).
If the BNC is compared with the International Corpus of English (ICE), it
turns out that while the two corpora contain the same range of genres, the genres
are much more specifically delineated in the ICE Corpus (see table 2.2) than
they are in the BNC. For instance, in both corpora, 60 percent of the spoken
texts are dialogues and 40 percent are monologues. In the ICE Corpus, this
division is clearly reflected in the types of genres making up the spoken part
of the corpus. In the BNC, on the other hand, dialogues and monologues are
interspersed among the various genres (e.g. business, leisure) making up the
spoken part of the corpus (Crowdy 1993: 263).
Likewise, the other spoken genres in the ICE Corpus (e.g. scripted and un-
scripted speeches) are included in the BNC but within the general areas outlined
in table 2.1. In both corpora, there is a clear bias towards spontaneous dialogues:
in the ICE Corpus, 33 percent of the spoken texts consist of direct conversations
or telephone calls; in the BNC, 41 percent of the spoken texts are of this type
(although some of the texts in the category are spontaneous monologues as
well).
While the amount of writing in the BNC greatly exceeded the amount of
speech, just the opposite is true in the ICE Corpus: only 40 percent of the
texts are written. While creative (or imaginative) writing was the most common

4 The types of individuals whose speech is represented in the demographically sampled part of the
BNC are discussed in greater detail in section 2.5.
Table 2.2 Composition of the ICE (adapted from Greenbaum 1996a: 29–30)

Speech
Type Number of texts Length % of spoken corpus

Dialogues 180 360,000 59%
Private 100 200,000 33%
direct conversations 90 180,000 30%
distanced conversations 10 20,000 3%
Public 80 160,000 26%
class lessons 20 40,000 7%
broadcast discussions 20 40,000 7%
broadcast interviews 10 20,000 3%
parliamentary debates 10 20,000 3%
legal cross-examinations 10 20,000 3%
business transactions 10 20,000 3%
Monologues 120 240,000 40%
Unscripted 70 140,000 23%
spontaneous commentaries 20 40,000 7%
speeches 30 60,000 10%
demonstrations 10 20,000 3%
legal presentations 10 20,000 3%
Scripted 50 100,000 17%
broadcast news 20 40,000 7%
broadcast talks 20 40,000 7%
speeches (not broadcast) 10 20,000 3%
Total 300 600,000 99%

Writing
Type Number of texts Length % of written corpus

Non-printed 50 100,000 26%
student untimed essays 10 20,000 5%
student examination essays 10 20,000 5%
social letters 15 30,000 8%
business letters 15 30,000 8%
Printed 150 300,000 75%
informational (learned): 40 80,000 20%
humanities, social sciences,
natural sciences, technology
informational (popular): 40 80,000 20%
humanities, social sciences,
natural sciences, technology
informational (reportage) 20 40,000 10%
instructional: administrative, 20 40,000 10%
regulatory, skills, hobbies
persuasive (press editorials) 10 20,000 5%
creative (novels, stories) 20 40,000 10%
Total 200 400,000 101%
type of writing in the BNC, in the ICE Corpus it is not as prominent. More
prominent were learned and popular examples of informational writing: writing
from the humanities, social and natural sciences, and technology (40 percent of
the written texts). These categories are also represented in the BNC, although
the BNC makes a distinction between the natural, applied, and social sciences
and, unlike the ICE Corpus, does not include equal numbers of texts in each
of these categories. The ICE Corpus also contains a fairly significant number
(25 percent) of non-printed written genres (such as letters and student writing),
while only 5–10 percent of the BNC contains these types of texts.
To summarize, while there are differences in the composition of the ICE
Corpus and BNC, overall the two corpora represent similar genres of spoken
and written English. The selection of these genres raises an important method-
ological question: why these genres and not others?
This question can be answered by considering the two main types of corpora
that have been created, and the particular types of studies that can be carried out
on them. The ICE Corpus and BNC are multi-purpose corpora, that is, they are
intended to be used for a variety of different purposes, ranging from studies of
vocabulary, to studies of the differences between various national varieties of
English, to studies whose focus is grammatical analysis, to comparisons of the
various genres of English. For this reason, each of these corpora contains a broad
range of genres. But in striving for breadth of coverage, some compromises had
to be made in each corpus. For instance, while the spoken part of the ICE
Corpus contains legal cross-examinations and legal presentations, the written
part of the corpus contains no written legal English. Legal written English was
excluded from the ICE Corpus on the grounds that it is a highly specialized
type of English firmly grounded in a tradition dating back hundreds of years,
and thus does not truly represent English as it is written in the 1990s (the years
during which texts for the ICE Corpus were collected). The ICE Corpus also
contains two types of newspaper English: press reportage and press editorials.
However, as Biber (1988: 180–96) notes, newspaper English is a diverse genre,
containing not just reportage and editorials but, for instance, feature writing as
well – a type of newspaper English not represented in the ICE Corpus.
Because general-purpose corpora such as the ICE Corpus and BNC do not
always contain a full representation of a genre, it is now becoming quite common
to see special-purpose corpora being developed. For instance, the Michigan
Corpus of Academic Spoken English (MICASE) was created to study the type
of speech used by individuals conversing in an academic setting: class lectures,
class discussions, student presentations, tutoring sessions, dissertation defenses,
and many other kinds of academic speech (Powell and Simpson 2001: 34–
40). By restricting the scope of the corpus, energy can be directed towards
assembling a detailed collection of texts that fully represent the kind of academic
language one is likely to encounter at an American university. As of June 2001,
seventy-one texts (813,684 words) could be searched on the MICASE website:
http://www.hti.umich.edu/micase/.

At the opposite extreme of special-purpose corpora like MICASE are those
which have specialized uses but not for genre studies. The Penn Treebank
consists of a heterogeneous collection of texts totalling 100 million words,
ranging from Dow Jones newswire articles to selections from the King James
version of the Bible (Marcus, Santorini, and Marcinkiewicz 1993). The reason
that a carefully selected range of genres is not important in this corpus is that
the corpus is not intended to permit genre studies but to “train” taggers and
parsers to analyze English: to present them with a sufficient amount of data so
that they can “learn” the structure of numerous constructions and thus produce
a more accurate analysis of the parts of speech and syntactic structures present
in English (cf. section 4.5 for more on parsers). To accomplish this goal, what
matters most is that a considerable number of texts are available; the genres
from which the texts were taken are less important.
Because of the wide availability of written and spoken material, it is relatively
easy to collect material for modern-day corpora such as the BNC and ICE; the
real work is in recording and transcribing spoken material (cf. section 3.2), for
instance, or obtaining copyright clearance for written material (cf. section 3.3).
With historical corpora, however, collecting texts from the various genres ex-
isting in earlier periods is a much more complicated undertaking.
In selecting genres for inclusion in historical corpora, the goals are similar
to those for modern-day corpora: to find a range of genres representative of
the types of English that existed during various historical periods of English.
Consequently, there exist multi-purpose corpora, such as Helsinki, which con-
tains a range of different genres (sermons, travelogues, fiction, drama, etc.;
cf. section 1.3.5 for more details), as well as specialized corpora, such as the
Corpus of Early English Correspondence, a corpus of letters written during
the Middle English period. In gathering material for corpora such as these, the
corpus compiler must deal with the fact that many of the genres that existed in
earlier periods are either unavailable or difficult to find. For instance, even though
spontaneous dialogues were as common and prominent in earlier periods as they
are today, there were no tape recorders around to record speech. Therefore, there
exists no direct record of speech. However, this does not mean that we cannot get
at least a sense of what speech was like in earlier periods. In her study of early
American English, Kytö (1991: 29) assembled a range of written texts that are
second-hand representations of speech: various “verbatim reports,” such as the
proceedings from trials and meetings and depositions obtained from witnesses,
and transcriptions of sermons. Of course, it can never be known with any great
certainty exactly how close to the spoken word these kinds of written texts are.
Nevertheless, texts of this type give us the closest approximation of the speech
of earlier periods that we will ever be able to obtain.
In other situations, a given genre may exist but be underrepresented in a
given period. In his analysis of personal pronouns across certain periods in the
Helsinki Corpus, Rissanen (1992: 194) notes that certain genres were diffi-
cult to compare across periods because they were “poorly represented” during
certain periods: histories in Old English and laws and correspondence in Middle
English. Correspondence in the form of personal letters is not available until the
fifteenth century. Other types of genres exist in earlier periods but are defined
differently than they are in the modern period. The Helsinki Corpus includes
the genre of science in the Old English period, but the texts that are included
focus only on astronomy, a reflection of the fact that science played a much
more minimal role in medieval culture than it does today.

2.4 The length of individual text samples to be included in a corpus

Corpora vary in terms of the length of the individual text samples that
they contain. The Lampeter Corpus of Early Modern English Tracts consists
of complete rather than composite texts. In the Helsinki Corpus, text samples
range from 2,000–10,000 words in length. Samples within the BNC are equally
varied but can be as long as 40,000 words. This is considerably lengthier than
the samples in earlier corpora, such as Brown and LOB, which contained
2,000-word samples, and the London–Lund Corpus, which contained 5,000-word
samples. The ICE Corpus
follows in the tradition of Brown and LOB and contains 2,000-word samples.5
Because most corpora contain relatively short stretches of text, they tend to
contain text fragments rather than complete texts. Ideally, it would be desirable
to include complete texts in corpora, since even if one is studying grammatical
constructions, it is most natural to study these constructions within the context
of a complete text rather than only part of that text. However, there are numerous
logistical obstacles that make the inclusion of complete texts in corpora nearly
impossible. For instance, many texts, such as books, are quite lengthy, and to
include a complete text in a corpus would not only take up a large part of the
corpus but require the corpus compiler to obtain permission to use not just a
text excerpt, a common practice, but an entire text, a very uncommon practice.
Experience with the creation of the ICE Corpus has shown that, in general, it
is quite difficult to obtain permission to use copyrighted material, even if only
part of a text is requested, and this difficulty will only be compounded if one
seeks permission to use an entire text (cf. section 3.3).
Of course, just because only text samples are included in a corpus does not
mean that sections of texts ought to be randomly selected for inclusion in a
corpus. It is possible to take excerpts that themselves form a coherent unit. In
the American component of ICE (and many earlier corpora as well), sections of
spoken texts are included that form coherent conversations themselves: many
conversations consist of subsections that form coherent units, and that have their
own beginnings, middles, and ends. Likewise, for written texts, one can include

5 All of these lengths are approximate. For instance, some texts in the ICE Corpus exceed 2,000
words to avoid truncating a text sample in mid-word or mid-paragraph.
the first 2,000 words of an article, which contains the introduction and part of
the body of the article, or one can take the middle of an article, which contains a
significant amount of text developing the main point made in the article, or even
its end. Many samples in the ICE Corpus also consist of composite texts: a series
of complete short texts that total 2,000 words in length. For instance, personal
letters are often less than 2,000 words, and a text sample can be comprised of
complete letters totalling 2,000 words. For both the spoken and written parts of
the corpus, not all samples are exactly 2,000 words: a sample is not broken off in
mid-sentence but at a point (often over or just under the 2,000-word limit) where
a natural break occurs. But even though it is possible to include coherent text
samples in a corpus, creators and users of corpora simply have to acknowledge
that corpora are not always suitable for many types of discourse studies, and
that those wishing to carry out such studies will simply have to assemble their
own corpora for their own personal use.
In including short samples from many different texts, corpus compilers are
assuming that it is better to include more texts from many different speakers and
writers than fewer texts from a smaller number of speakers and writers. And
there is some evidence to suggest that this is the appropriate approach to take in
creating a corpus. Biber (1990 and 1993: 248–52) conducted an experiment in
which he used the LOB and London–Lund Corpora to determine whether text
excerpts provide a valid representation of the structure of a particular genre.
He divided the LOB and London–Lund corpora into 1,000-word samples: he
took two 1,000-word samples from each 2,000-word written sample of the LOB
Corpus, and he divided each 5,000-word sample of speech from the London–
Lund Corpus into four 1,000-word samples. In 110 of these 1,000-word samples,
Biber calculated the frequency of a range of different linguistic items, such as
nouns, prepositions, present- and past-tense verb forms, passives, and so forth.
He concluded that 1,000-word excerpts are lengthy enough to provide valid
and reliable information on the distribution of frequently occurring linguistic
items. That is, if one studied the distribution of prepositions in the first thousand
words of a newspaper article totalling 10,000 words, for instance, studying
the distribution of prepositions in the entire article would not yield different
distributions: the law of diminishing returns is reached after 1,000 words. On
the other hand, Biber found that infrequently occurring linguistic items (such as
relative clauses) cannot be reliably studied in a short excerpt; longer excerpts
are required.
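Biber's procedure of dividing a corpus into fixed-length samples and counting items per sample can be sketched as follows. This is a simplified illustration: the actual studies worked with tagged corpora, whereas here a small closed set of prepositions (my choice, for illustration only) stands in for a tagged word class:

```python
def split_into_samples(words, sample_size=1000):
    """Divide a tokenized text into consecutive fixed-length samples,
    discarding any final fragment shorter than sample_size."""
    return [words[i:i + sample_size]
            for i in range(0, len(words) - sample_size + 1, sample_size)]

# A small closed set standing in for a tagged word class (illustrative only).
PREPOSITIONS = {"in", "on", "at", "of", "to", "by", "with", "for", "from"}

def count_per_sample(samples, targets=PREPOSITIONS):
    """Frequency of the target items in each sample."""
    return [sum(1 for w in sample if w.lower() in targets)
            for sample in samples]

# A toy 2,400-word "text"; real studies would load corpus files instead.
text = ("the results of the survey were sent to the committee by post " * 200).split()
samples = split_into_samples(text)
print(count_per_sample(samples))
```

Comparing such per-sample counts across samples of increasing length is what reveals the point at which longer excerpts stop changing the observed distribution.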
In addition to studying the distribution of word categories, such as nouns or
prepositions, Biber (1993: 250) calculated the frequency with which new words
are added to a sample as the number of words in the sample increases. He found,
for instance, that humanities texts are more lexically diverse than technical
prose texts (p. 252): as a humanities text increases in length, new words are
more likely to keep appearing than they are in a technical prose text. This is
one reason why lexicographers
need such large corpora to study vocabulary trends, since so much vocabulary
(in particular, open-class items such as nouns and verbs) occurs so rarely. And
as more text is considered, there is a greater chance (particularly in humanities
texts) that new words will be encountered.
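The effect described here, new word types continuing to accrue as a text lengthens, can be illustrated with a simple vocabulary growth curve. This is a generic sketch, not the measure Biber used:

```python
def vocabulary_growth(words, step=100):
    """Number of distinct word types seen after each `step` tokens.
    A lexically diverse text yields a curve that keeps rising;
    a repetitive one flattens out almost immediately."""
    seen, curve = set(), []
    for i, word in enumerate(words, start=1):
        seen.add(word.lower())
        if i % step == 0:
            curve.append(len(seen))
    return curve

# Two artificial extremes for contrast (real texts fall in between).
diverse = [f"word{i}" for i in range(500)]       # every token is a new type
repetitive = ["the", "cat", "sat", "mat"] * 125  # 4 types, 500 tokens
print(vocabulary_growth(diverse))     # keeps rising: [100, 200, 300, 400, 500]
print(vocabulary_growth(repetitive))  # flattens: [4, 4, 4, 4, 4]
```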
Biber’s (1993) findings would seem to suggest that corpus compilers ought
to greatly increase the length of text samples to permit the study of infrequently
occurring grammatical constructions and vocabulary. However, Biber (1993:
252) concludes just the opposite, arguing that “Given a finite effort invested
in developing a corpus, broader linguistic representation can be achieved by
focusing on diversity across texts and text types rather than by focusing on
longer samples from within texts.” In other words, corpus compilers should
strive to include more different kinds of texts in corpora rather than lengthier
text samples. Moreover, those using corpora to study infrequently occurring
grammatical constructions will need to go beyond currently existing corpora
and look at additional material on their own.6

2.5 Determining the number of texts and range of speakers and writers to include in a corpus

Related to the issue of how long text samples should be in a corpus
is precisely how many text samples are necessary to provide a representative
sampling of a genre, and what types of individuals ought to be selected to sup-
ply the speech and writing used to represent a genre. These two issues can be
approached from two perspectives: from a purely linguistic perspective, and
from the perspective of sampling methodology, a methodology developed by
social scientists to enable researchers to determine how many “elements” from
a “population” need to be selected to provide a valid representation of the popu-
lation being studied. For corpus linguists, this involves determining how many
text samples need to be included in a corpus to ensure that valid generalizations
can be made about a genre, and what range of individuals need to be selected
so that the text samples included in a corpus provide a valid representation of
the population supplying the texts.
There are linguistic considerations that need to be taken into account in deter-
mining the number of samples of a genre to include in a corpus, considerations
that are quite independent of general sampling issues. If the number of samples
included in the various genres of the BNC and ICE Corpus is surveyed,
it is immediately obvious that both of these corpora place a high value on
spontaneous dialogues, and thus contain more samples of this type of speech
than, say, scripted broadcast news reports. This bias is a simple reflection of
the fact that those creating the BNC and ICE Corpus felt that spontaneous di-
alogues are a very important type of spoken English and should therefore be

6 It is also possible to use elicitation tests to study infrequently occurring grammatical items – tests
that ask native speakers to comment directly on particular linguistic items. See Greenbaum (1973)
for a discussion of how elicitation tests can be used to supplement corpus studies.
amply represented. The reason for this sentiment is obvious: while only a small
segment of the speakers of English create scripted broadcast news reports, all
speakers of English engage in spontaneous dialogues.
Although it is quite easy to determine the relative importance of sponta-
neous dialogues in English, it is far more difficult to go through every potential
genre to be included in a corpus and rank its relative importance and frequency
to determine how much of the genre should be included in the corpus. And
if one did take a purely “proportional” approach to creating a corpus, Biber
(1993: 247) notes, the resultant corpus “might contain roughly 90% conversa-
tion and 3% letters and notes, with the remaining 7% divided among registers
such as press reportage, popular magazines, academic prose, fiction, lectures,
news broadcasts, and unpublished writings” since these figures provide a rough
estimate of the actual percentages of individuals that create texts in each of
these genres. To determine how much of a given genre needs to be included
in a corpus, Biber (1993) argues that it is more desirable to focus only on
linguistic considerations, specifically how much internal variation there is in
the genre. As Biber (1988: 170f.) has demonstrated in his pioneering work on
genre studies, some genres have more internal variation than others and, con-
sequently, more examples of these genres need to be included in a corpus to
ensure that the genres are adequately represented. For instance, Biber (1988:
178) observes that even though the genres of official documents and academic
prose contain many subgenres, the subgenres in official documents (e.g. gov-
ernment reports, business reports, and university documents) are much more
linguistically similar than the subgenres of academic prose (e.g. the natural and
social sciences, medicine, and the humanities). That is, if the use of a con-
struction such as the passive is investigated in the various subgenres of official
documents and academic prose, there will be less variation in the use of the
passive in the official documents than in the academic prose. This means that a
corpus containing official documents and academic prose will need to include
more academic prose to adequately represent the amount of variation present
in this genre.
In general, Biber’s (1988) study indicates that a high rate of internal varia-
tion occurs not just in academic prose but in spontaneous conversation (even
though there are no clearly delineated subgenres in spontaneous conversation),
spontaneous speeches, journalistic English (though Biber analyzed only press
reportage and editorials), general fiction, and professional letters. Less inter-
nal variation occurs in official documents (as described above), science fiction,
scripted speeches, and personal letters.
Because the BNC is a very lengthy corpus, it provides a sufficient number
of samples of genres to enable generalizations to be made about the genres.
However, with the much shorter ICE Corpus (and with other million-word
corpora, such as Brown and LOB, as well), it is an open question whether the
forty 2,000-word samples of academic prose contained in the ICE Corpus, for
instance, are enough to adequately represent this genre. And given the range
of variation that Biber (1988) documents in academic prose, forty samples are
probably not enough. Does this mean, then, that million-word corpora are not
useful for genre studies?
The answer is no: while these corpora are too short for some studies, for fre-
quently occurring grammatical constructions they are quite adequate for making
generalizations about a genre. For instance, Meyer (1992) used 120,000-word
sections of the Brown, the London–Lund, and the Survey of English Usage
(SEU) Corpora to study the usage of appositions in four genres: spontaneous
conversation, press reportage, academic prose, and fiction. In analyzing only
twenty 2,000-word samples from the press genre of the Brown Corpus, he
found ninety-seven examples of appositions of the type political neophyte Steve
Forbes in which the first unit lacks a determiner and occurs before a proper noun
(p. 12); in eight 5,000-word samples from the press genre of the SEU Corpus,
he found only thirty-one examples of such appositions. These findings allowed
Meyer (1992) to claim quite confidently that not only were such appositions
confined to the press genre (the other genres contained virtually no examples
of this type of apposition) but they were more common in American English
than in British English.
Studying the amount of internal linguistic variation in a genre is one way
to determine how many samples of a genre need to be included in a corpus;
applying the principles of sampling methodology is another way. Although
sampling methodology can be used to determine the number of samples needed
to represent a genre, this methodology, as was demonstrated earlier in this
section, is not all that useful for this purpose. It is best used to select the
individuals whose speech and writing will be included in a corpus.
Social scientists have developed a sophisticated methodology based on math-
ematical principles that enables a researcher to determine how many “elements”
from a “sampling frame” need to be selected to produce a “representative” and
therefore “valid” sample. A sampling frame is determined by identifying a spe-
cific population that one wishes to make generalizations about. For instance, in
creating the Tampere Corpus, Norri and Kytö (1996) decided that their sampling
frame would not be just scientific writing but different types of scientific writ-
ing written at varying levels of technicality. After deciding what their sampling
frame was, Norri and Kytö (1996) then had to determine which elements from
this frame they needed to include in their corpus to produce a representative
sampling of scientific writing, that is, enough writing from the social sciences,
for instance, so that someone using the corpus could analyze the writing and
be able to claim validly that their results were true not just about the specific
examples of social science writing included in the Tampere Corpus but all so-
cial science writing. Obviously, Norri and Kytö (1996) could not include all
examples of social science writing published in, say, a given year; they had to
therefore narrow the range of texts that they selected for inclusion.
Social scientists have developed mathematical formulas that enable a re-
searcher to calculate the number of samples they will need to take from a
sampling frame to produce a representative sample of the frame. Kretzschmar,
Meyer, and Ingegneri (1997) used one of Kalton’s (1983: 82) formulas to cal-
culate the number of books published in 1992 that would have to be sampled to
create a representative sample. Bowker’s Publisher’s Weekly lists 49,276 books
as having been published in the United States in 1992. Depending on the level
of confidence desired, samples from 2,168–2,289 books would have to be in-
cluded in the corpus to produce a representative sample. If each sample from
each book was 2,000 words in length, the corpus of books would be between
4,336,000 and 4,578,000 words in length. Kalton’s (1983) formula could be
applied to any sampling frame that a corpus compiler identifies, provided that
the size of the frame can be precisely determined.
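The arithmetic behind these figures can be sketched as follows. The text does not reproduce Kalton's (1983: 82) formula itself, so the function below uses a standard textbook sample-size formula for a proportion with a finite-population correction as a stand-in; with a 95 percent confidence level and a two percent margin of error it lands close to the upper figure cited above, and the word totals follow directly from the 2,000-word sample length.

```python
import math

def sample_size(N, z=1.96, p=0.5, e=0.02):
    """Sample size for estimating a proportion, with finite-population
    correction. A standard textbook formula, used here as a stand-in:
    the text cites Kalton (1983: 82) but does not reproduce his formula."""
    n0 = (z ** 2) * p * (1 - p) / e ** 2        # infinite-population estimate
    return math.ceil(n0 / (1 + (n0 - 1) / N))   # correct for finite N

BOOKS_1992 = 49276        # books published in the US in 1992 (per the text)
WORDS_PER_SAMPLE = 2000   # one 2,000-word excerpt per sampled book

print(sample_size(BOOKS_1992))      # ~2,290 books at 95% confidence, +/-2%
print(2168 * WORDS_PER_SAMPLE)      # 4,336,000 words (lower figure above)
print(2289 * WORDS_PER_SAMPLE)      # 4,578,000 words (upper figure above)
```

Note how quickly the required corpus grows: even a two percent margin of error over a single year's book output demands a corpus of over four million words.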
Sampling methodology can also be used to select the particular individuals
whose speech and writing will be included in a corpus. For instance, in plan-
ning the collection of demographically sampled speech for the BNC, “random
location sampling procedures” were used to select individuals whose speech
would be recorded (Crowdy 1993: 259). Great Britain was divided into twelve
regions. Within these regions, “thirty sampling points” were selected: locations
at which recordings would be made and that were selected based on “their
ACORN profile . . . (A Classification of Regional Neighbourhoods)” (Crowdy
1993: 260). This profile provides demographic information about the types of
people likely to live in certain regions of Great Britain, and the profile helped
creators of the BNC select speakers of various social classes to be recorded.
In selecting potential speakers, creators of the BNC also controlled for other
variables, such as age and gender.
In using sampling methodology to select texts and speakers and writers for
inclusion in a corpus, a researcher can employ two general types of sampling:
probability sampling and non-probability sampling (Kalton 1983). In probabil-
ity sampling (employed above in selecting speakers for the BNC), the researcher
very carefully pre-selects the population to be sampled, using statistical formu-
las and other demographic information to ensure that the number and type of
people being surveyed are truly representative. In non-probability sampling, on
the other hand, this careful pre-selection process is not employed. For instance,
one can select the population to be surveyed through the process of “haphazard,
convenience, or accidental sampling” (Kalton 1983: 90); that is, one samples in-
dividuals who happen to be available. Alternatively, one can employ “judgment,
or purposive, or expert choice” sampling (Kalton 1983: 91); that is, one decides
beforehand who would be best qualified to be sampled (e.g. native rather than
non-native speakers of a language, educated vs. non-educated speakers of a
language, etc.). Finally, one can employ “quota sampling” (Kalton 1983: 91),
and sample certain percentages of certain populations. For instance, one could
create a corpus of American English by including in it samples reflecting ac-
tual percentages of ethnic groups residing in the United States (e.g. 10 percent
African Americans, 15 percent Hispanic Americans, etc.).
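Quota sampling of this kind can be sketched as a simple greedy selection. The group shares below are the text's illustrative percentages, not real census figures, and the function and speaker identifiers are invented for illustration.

```python
from collections import Counter

def fill_quotas(candidates, total, quotas):
    """Greedy quota sampling: walk through available contributors and
    accept each one only while his or her group's share of `total` is
    not yet filled. A simplified illustration, not the procedure of
    any actual corpus project."""
    limits = {group: round(share * total) for group, share in quotas.items()}
    counts, chosen = Counter(), []
    for group, speaker_id in candidates:
        if counts[group] < limits.get(group, 0):
            counts[group] += 1
            chosen.append((group, speaker_id))
    return chosen

# Illustrative shares from the text (10% African American, 15% Hispanic
# American); the remainder is lumped together here for simplicity.
quotas = {"African American": 0.10, "Hispanic American": 0.15, "Other": 0.75}
pool = ([("Other", f"s{i}") for i in range(30)]
        + [("African American", f"a{i}") for i in range(5)]
        + [("Hispanic American", f"h{i}") for i in range(6)])
selected = fill_quotas(pool, total=20, quotas=quotas)
print(len(selected))   # 20 speakers, in the target proportions
```

The design choice worth noting is that quota sampling caps each group's contribution but does not randomize within groups, which is one reason it counts as non-probability sampling.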
Although probability sampling is the most reliable type of sampling, lead-
ing to the least amount of bias, for the corpus linguist this kind of sampling
presents considerable logistical challenges. The mathematical formulas used
in probability sampling often produce very large sample sizes, as the example
with books cited above illustrated. Moreover, to utilize the sampling techniques
employed by creators of the BNC in the United States would require consider-
able resources and funding, given the size of the United States and the number
of speakers of American English. Consequently, it is quite common for corpus
linguists to use non-probability sampling techniques in compiling their corpora.
In creating the Brown Corpus, for instance, a type of “judgement” sampling
was used. That is, prior to the creation of the Brown Corpus, it was decided
that the writing to be included in the corpus would be randomly selected from
collections of edited writing at four locations:
(a) for newspapers, the microfilm files of the New York Public Library;
(b) for detective and romantic fiction, the holdings of the Providence Athen-
aeum, a private library;
(c) for various ephemeral and popular periodical material, the stock of a large
secondhand magazine store in New York City;
(d) for everything else, the holdings of the Brown University Library. (Quoted
from Francis 1979: 195)
Other corpora have employed other non-probability sampling techniques. The
American component of the International Corpus of English used a combina-
tion of “judgement” and “convenience” sampling: every effort was made to
collect speech and writing from a balanced group of constituencies (e.g. equal
numbers of males and females), but ultimately what was included in
the corpus was a consequence of whose speech or writing could be most easily
obtained. For instance, much fiction is published in the form of novels or
short stories. However, many publishers require royalty payments from those
seeking reproduction rights, a charge that can sometimes involve hundreds of
dollars. Consequently, most of the fiction in the American component of ICE
consists of unpublished samples of fiction taken from the Internet: fiction for
which permission can usually be obtained for no cost and which is available in
computerized form. The Corpus of Spoken American English has employed a
variation of “quota” sampling, making every effort to collect samples of speech
from a representative sampling of men and women, and various ethnicities and
regions of the United States.
The discussion thus far in this chapter has focused on the composition of the
BNC to identify the most important factors that need to be considered prior to
collecting texts for inclusion in a corpus. This discussion has demonstrated that:
1. Lengthier corpora are better than shorter corpora. However, even more im-
portant than the sheer length of a corpus is the range of genres included
within it.
2. The range of genres to be included in a corpus is determined by whether it
will be a multi-purpose corpus (a corpus intended to have wide uses) or a
special-purpose corpus (a corpus intended for more specific uses, such as
the analysis of a particular genre like scientific writing).
3. It is more practical to include text fragments in a corpus rather than complete
texts. These fragments can be as short as 2,000 words, especially if the focus
of study is frequently occurring grammatical constructions.
4. The number of samples of a genre needed to ensure valid studies of the genre
is best determined by how much internal variation exists in the genre: the
more variation, the more samples needed.
5. Probability sampling techniques can also be used to determine the number
of samples of a genre necessary for inclusion in a corpus. However, such
techniques are better used for selecting the individuals from whom text
samples will be taken.
6. If it is not possible to select individuals using probability sampling tech-
niques, then non-probability sampling techniques can be used, provided that
a number of variables are considered and controlled for as well as possible.
These variables will be the focus of the remainder of this chapter.

2.6 The time-frame for selecting speakers and texts

Most corpora contain samples of speech or writing that have been writ-
ten or recorded within a specific time-frame. Synchronic corpora (i.e. corpora
containing samples of English as it is presently spoken and written) contain
texts created within a relatively narrow time-frame. Although most of the writ-
ten texts used in the BNC were written or published between 1975 and 1993, a
few texts date back as far as 1960 (Aston and Burnard 1998: 30). The Brown
and LOB corpora contain texts published in 1961. The ICE Corpus contains
texts spoken or written (or published) from 1990 to the present.
In creating a synchronic corpus, the corpus compiler wants to be sure that
the time-frame is narrow enough to provide an accurate view of contemporary
English undisturbed by language change. However, linguists disagree about
whether purely synchronic studies are even possible: new words, for instance,
come into the language every day, indicating that language change is a con-
stant process. Moreover, even grammatical constructions can change subtly in
a rather short period of time. For instance, Mair (1995) analyzed the types of
verb complements that followed the verb help in parts of the press sections
of the LOB Corpus (comprised of texts published in 1961) and the FLOB
(Freiburg–Lancaster–Oslo–Bergen) Corpus, a corpus comparable to LOB but
containing texts published thirty years later in 1991. Mair (1995) found some
differences in the complementation patterns, even over a period as short as
thirty years. While help with the infinitive to was the norm in the 1961 LOB
Corpus (e.g. I will help you to lift the books), this construction has been re-
placed in the 1991 FLOB Corpus by the bare infinitive (e.g. I will help you lift
the books).
Mair’s (1995) study indicates that language change can occur over a relatively
short period of time. Therefore, if one wishes to create a truly synchronic corpus,
then the texts included in the corpus should reflect as narrow a time-frame
as possible. Although this time-frame does not need to be as narrow as the
one-year frame in Brown and LOB, a time-frame of five to ten years seems
reasonable.
With diachronic corpora (i.e. corpora used to study historical periods of
English), the time-frame for texts is somewhat easier to determine, since the
various historical periods of English are fairly well defined. However, compli-
cations can still arise. For instance, Rissanen (1992: 189) notes that one goal
of a diachronic corpus is “to give the student an opportunity to map and com-
pare variant fields or variant paradigms in successive diachronic stages in the
past.” To enable this kind of study in the Helsinki Corpus, Rissanen (1992)
remarks that the Old and Early Middle English sections of the corpus were
divided into 100-year subperiods, the Late Middle and Early Modern English
periods into seventy- to eighty-year periods. The lengths of subperiods were
shorter in the later periods of the corpus “to include the crucial decades of
the gradual formation of the fifteenth-century Chancery standard within one
and the same period” (Rissanen 1992: 189). The process that was used to de-
termine the time-frames included in the Helsinki Corpus indicates that it is
important in diachronic corpora not to just cover predetermined historical pe-
riods of English but to think through how significant events occurring during
those periods can be best covered in the particular diachronic corpus being
created.

2.7 Sampling native vs. non-native speakers of English

If one is setting out to create a corpus of American English, it is best
to include only native speakers of American English in the corpus, since the
United States is a country in which English is predominantly a language spoken
by native speakers of English. On the other hand, if the goal is to create a corpus
of Nigerian English, it makes little sense to include in the corpus native speakers
of English residing in Nigeria, largely because Nigerian English is a variety of
English spoken primarily as a second (or additional) language. As these two
types of English illustrate, the point is not simply whether one obtains texts from
native or non-native speakers but rather that the texts selected for inclusion are
obtained from individuals who accurately reflect actual users of the particular
language variety that will make up the corpus.
Prior to selecting texts for inclusion in a corpus, it is crucial to establish
criteria to be used for selecting the individuals whose speech or writing will be
included in the corpus. Since the American component of ICE was to include
native speakers of American English, criteria were established precisely defining
a native speaker of American English. Although language theorists will differ in
their definitions of a native speaker of a language, for the American component
of ICE, a native speaker of American English was defined as someone who
had lived in the United States and spoken American English since adolescence.
Adolescence (10–12 years of age) was chosen as the cut-off point, because when
acquisition of a language occurs after this age, foreign accents tend to appear
(one marker of non-native speaker status). In non-native varieties of English, the
level of fluency among speakers will vary considerably, and as Schmied (1996:
187) observes, “it can be difficult to determine where an interlanguage ends
and educated English starts.” Nevertheless, in selecting speakers for inclusion
in a corpus of non-native speech or writing, it is important to define specific
criteria for selection, such as how many years an individual has used English,
in what contexts they have used it, how much education in English they have
had, and so forth.
To determine whether an individual’s speech or writing is appropriate for
inclusion in a corpus (and also to keep track of the sociolinguistic variables
described in section 2.8), one can have individuals contributing texts to a corpus
fill out a biographical form in which they supply the information necessary for
determining whether their native or non-native speaker status meets the criteria
for inclusion in the corpus. For the American component of ICE, individuals
are asked to supply their residence history: the places they have lived over their
lives and the dates they have lived there. If, for instance, an individual has spent
the first four years of life living outside the United States, it can probably be
safely assumed that the individual is a native speaker of English and came to
the United States at an early enough age to be considered a native speaker of
English. On the other hand, if an individual spent the first fifteen years of life
outside the United States, he or she is probably a non-native speaker of English
and has learned English as a second language. It is generally best not to ask
individuals directly on the biographical form whether they are a native speaker
or not, since they may not understand (in the theoretical sense) precisely what a
native speaker of a language is. If the residence history on the biographical form
is unclear, it is also possible to interview the individual afterwards, provided that
he or she can be located; if the individual does not fit the criteria for inclusion,
his or her text can be discarded.
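The screening rule described above can be sketched as a simple check on the residence history supplied on the biographical form. The function name and the tuple format are invented for illustration, and the cut-off of 12 follows the 10-12-year adolescence threshold mentioned earlier in this section.

```python
def meets_native_speaker_criteria(residence_history, cutoff_age=12):
    """Decide eligibility from a biographical form's residence history.

    residence_history: list of (country, age_from, age_to) tuples.
    Returns True if the contributor has lived in the United States since
    before `cutoff_age` (the 10-12-year adolescence threshold described
    in the text). A deliberately simplified rule for illustration."""
    arrivals = [age_from for country, age_from, _ in residence_history
                if country == "United States"]
    return bool(arrivals) and min(arrivals) < cutoff_age

# Abroad for the first four years, then in the US: accepted
print(meets_native_speaker_criteria(
    [("France", 0, 4), ("United States", 4, 30)]))   # True
# First fifteen years outside the US: screened out
print(meets_native_speaker_criteria(
    [("Japan", 0, 15), ("United States", 15, 30)]))  # False
```

A rule of this kind encodes only the residence criterion; borderline forms would still be followed up by interview, as the text describes.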
Determining native or non-native speaker status from authors of published
writing can be considerably more difficult, since it is often not possible to
locate authors and have them fill out biographical forms. In addition, it can
be misleading to use an individual’s name alone to determine native speaker
status, since someone with a non-English-sounding name may have immigrant
parents and nevertheless be a native speaker of English, and many individuals
with English-sounding names may not be native speakers: one of the written
samples of the American component of ICE had to be discarded when the au-
thor of one of the articles called on the telephone and explained that he was
not a native speaker of American English but Australian English. Schmied
(1996: 187) discovered that a major newspaper in Kenya had a chief editor
and training editor who were both Irish and who exerted considerable editorial
influence over the final form of articles written by native Kenyan reporters.
This discovery led Schmied (1996) to conclude that there are real questions
about the authenticity of the English used in African newspapers. When deal-
ing with written texts from earlier periods, additional problems can be en-
countered: English texts were often translations from Latin or French, and a
text published in the United States, for instance, might have been written by
a speaker of British English. Ultimately, however, when dealing with written
texts, one simply has to acknowledge that in compiling a corpus of American
English, for instance, there is a chance that the corpus will contain some
non-native speakers, despite one’s best efforts to collect only native-speaker
speech.
In spoken dialogues, one may find out that one or more of the speakers in a
conversation do not meet the criteria for inclusion because they are not native
speakers of the variety being collected. However, this does not necessarily mean
that the text has to be excluded from the corpus, since there is annotation that
can be included in a corpus indicating that certain sections of a sample are
“extra-corpus” material (cf. section 4.1), that is, material not considered part
of the corpus for purposes of word counts, generating KWIC (key word in
context) concordances, and so forth.
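One way such annotation plays out in practice can be sketched as below. The `<X> ... </X>` tag pair is a hypothetical marker for extra-corpus spans (actual annotation schemes are discussed in section 4.1); the point is simply that marked material is skipped when computing word counts.

```python
import re

# Hypothetical markup: extra-corpus spans wrapped in <X> ... </X>
# (the tag name is invented here; see section 4.1 for actual schemes)
EXTRA_CORPUS = re.compile(r"<X>.*?</X>", re.DOTALL)

def corpus_word_count(text):
    """Word count that skips extra-corpus material, so an ineligible
    speaker's turns do not inflate totals for the sampled variety."""
    return len(EXTRA_CORPUS.sub(" ", text).split())

sample = ("A: lovely weather today "
          "<X>B: I reckon so mate it is brilliant</X> "
          "A: shall we begin the meeting")
print(corpus_word_count(sample))   # counts only speaker A's words
```

The same filtering step would be applied before generating KWIC concordances, so that extra-corpus turns never surface as hits.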

2.8 Controlling for sociolinguistic variables

There are a variety of sociolinguistic variables that will need to be
considered before selecting the speakers and writers whose texts are being con-
sidered for inclusion in a corpus. Some of these variables apply to the collection
of both spoken and written texts; others are more particular to spoken texts. In
general, when selecting individuals whose texts will be included in a corpus,
it is important to consider the implications that their gender, age, and level of
education will have on the ultimate composition of the corpus. For the spoken
parts of a corpus, a number of additional variables need to be considered: the
dialects the individuals speak, the contexts in which they speak, and the rela-
tionships they have with those they are speaking with. The potential influences
on a corpus that these variables will have are summarized below.

2.8.1 Gender balance


It is relatively easy when collecting speech and writing to keep track
of the number of males and females from whom texts are being collected.
Information on gender can be requested on a biographical form, and in written
texts, one can usually tell the gender of an individual by his or her first name.
Achieving gender balance in a corpus involves more than simply ensuring
that half the speakers and writers in a corpus are female and half male. In certain
written genres, such as scientific writing, it is often difficult to achieve gender
balance because writers in these genres are predominantly male – an unfortunate
reality of modern society. To attempt to collect an equal proportion of writing
from males and females might actually misrepresent the kind of writing found
in these genres. Likewise, in earlier periods, men were more likely to be literate
than women and thus to produce more writing than women. To introduce more
writing by females into a corpus of an earlier period distorts the linguistic reality
of the period. A further complication is that much writing, particularly scientific
writing, is co-written, and if males and females collaborate, it will be difficult
to determine precisely whose writing is actually represented in a sample. One
could collect only articles written by a single author, but this again might lead
to a misrepresentation of the type of writing typically found in a genre. Finally,
even though an article may be written by a female or a male, there is no way of
determining how much an editor has intervened in the writing of an article and
thus distorted the effect that the gender of the author has had on the language
used in the article.
In speech, other complications concerning gender arise. Research has shown
that gender plays a crucial role in language usage. For instance, women will
speak differently with other women than they will with men. Consequently, to
adequately reflect gender differences in language usage, it is best to include
in a corpus a variety of different types of conversations involving males and
females: women speaking only with other women, men speaking only with
other men, two women speaking with a single man, two women and two men
speaking, and so forth.
To summarize, there is no one way to deal with all of the variables affecting
the gender balance of a corpus. The best the corpus compiler can do is be
aware of the variables, confront them head on, and deal with them as much as
is possible during the construction of a corpus.

2.8.2 Age
Although there are special-purpose corpora such as CHILDES
(cf. section 1.3.8) that contain the language of children, adolescents, and adults,
most corpora contain the speech and writing of “adults.” The notable exception
to this trend is the British National Corpus. In the spoken section of the cor-
pus containing demographically sampled speech, there is a balanced grouping
of texts representing various age groups, ranging from the youngest group-
ing (0–14 years) to the oldest grouping (60+) (“Composition of the BNC”:
http://info.ox.ac.uk/bnc/what/balance.html). In addition, there were texts in-
cluded in the BNC taken from the Bergen Corpus of London Teenager English
(COLT), which contains the speech of adolescents up to age 16 (Aston and
Burnard 1998: 32; also cf. section 1.3.4). However, the written part of the
BNC contains a sparser selection of texts from younger age groups, largely
because younger individuals simply do not produce the kinds of written texts
(e.g. press reportage or technical reports) represented in the genres typically
included in corpora. This is one reason why corpora have tended to contain
mainly adult language. Another reason is that to collect the speech of children
and adolescents, one often has to obtain the permission not just of the individual
being recorded but of his or her parents as well, a complicating factor in an
already complicated endeavor.
If it is decided that a corpus is to contain only adult language, it then becomes
necessary to determine some kind of age cut-off for adult speech. The ICE
project has set the age of 18 as a cut-off point between adolescent and adult, and
thus has included in the main body of the corpus only the speech and writing of
individuals over the age of 18. However, even though one may decide to include
only adult speech in a corpus, to get a full range of adult speaking styles, it is
desirable to include adults conversing with adolescents or children as well. In
the transcribed corpus itself, markup can be included to set off the speech of
adolescents or children as “extra corpus” material.
In most corpora attempting to collect texts from individuals of varying ages,
the ages of those in the corpus are collapsed into various age groups. For
instance, the British component of ICE has five groups: 18–25, 26–45, 46–65,
66+, and a category of “co-authored” for written texts with multiple authors.

2.8.3 Level of education


It is also important to consider the level of education of those whose
speech or writing is to be included in the corpus. One of the earlier corpora,
the London Corpus, set as its goal the collection of speech and writing from
educated speakers of British English. This goal presupposes not only that one
can define an educated speaker but that it is desirable to include only the lan-
guage of this group in a corpus. The ICE project defined an educated speaker
very precisely, restricting the texts included in all of the components of ICE
to those produced by individuals with at least a high school education. Arguably,
such a restriction is arbitrary and excludes from the ICE Corpus a significant
range of speakers whose
language is a part of what we consider Modern English. Moreover, restricting
a corpus to educated speech and writing is elitist and seems to imply that only
educated speakers are capable of producing legitimate instances of Modern
English.
However, even though these criticisms have merit, it is methodologically
defensible to restrict a corpus to the speech and writing of whatever level of
education one wishes, provided that no claims are made that such a corpus is truly
representative of all speakers of a language. Consequently, the more represen-
tative one wishes a corpus to be, the more diverse the educational levels one
will want to include in the corpus. In fact, a corpus compiler may decide to
place no restrictions on the education level of those to be included in a corpus.
However, if this approach is taken, it is important to keep careful records of the
educational levels of those included in the corpus, since research has shown
that language usage varies by educational level, and future users of the corpus
may wish to use it to investigate language usage by level of education.
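The record-keeping recommended above can be as simple as a structured metadata entry per speaker. The field names in this sketch are illustrative rather than any standard header format; the point is that education level is recorded even when it is not used to restrict who enters the corpus, so that later users can break analyses down by it.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical speaker-metadata record; field names are illustrative only.
@dataclass
class Speaker:
    speaker_id: str
    age: int
    gender: str
    education: str  # e.g. "primary", "secondary", "university"

speakers = [
    Speaker("S001", 34, "F", "university"),
    Speaker("S002", 51, "M", "secondary"),
    Speaker("S003", 22, "F", "secondary"),
]

# A later user of the corpus can then stratify results by education level.
print(Counter(s.education for s in speakers))
```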

2.8.4 Dialect variation


The issue of the validity of including only educated speech in a cor-
pus raises a more general consideration, namely the extent to which a corpus
should contain the range of dialects, both social and regional, that exist in any
language.
In many respects, those creating historical corpora have been more success-
ful in representing regional variation than those creating modern-day corpora:
the regional dialect boundaries in Old and Middle English are fairly well estab-
lished, and in the written documents of these periods, variant spellings reflecting
differences in pronunciation can be used to posit regional dialect boundaries.
For instance, Rissanen (1992: 190–2) is able to describe regional variation in
the distribution of (n)aught (meaning “anything,” “something,” or “nothing”)
in the Helsinki Corpus because in Old and Early Middle English this word
had variant spellings reflecting different pronunciations: a spelling with <a> in
West-Saxon (e.g. Old English (n)awuht) and a spelling with <o> in Anglian
or Mercian (e.g. Old English (n)owiht). Social variation is more difficult to
document because, as Nevalainen (2000: 40) notes, “Early modern people’s
ability to write was confined to the higher social ranks and professional men.”
In addition, more detailed information on writers is really not available until the
fifteenth century. Therefore, it is unavoidable that sociolinguistic information
in a historical corpus will either be unavailable or skewed towards a particular
social class.
Because writing is now quite standardized, it no longer contains traces of
regional pronunciations. However, even though the modern-day corpus linguist
has access to individuals speaking many different regional and social varieties of
English, it is an enormous undertaking to create a spoken corpus that is balanced
by region and social class. If one considers only American English, a number
of different regional dialects can be identified, and within these major dialect
regions, one can isolate numerous subdialects (e.g. Boston English within the
coastal New England dialect). If social dialects are added to the mix of regional
dialects, even more variation can be found, as a social dialect such as African-
American Vernacular English can be found in all major urban areas of the
United States. In short, there are numerous dialects in the United States, and
to attempt to include representative samplings of each of these dialects in the
spoken part of a corpus is nothing short of a methodological nightmare.
What does one do, then, to ensure that the spoken part of a corpus contains a
balance of different dialects? In selecting speakers for inclusion in the British
National Corpus, twelve dialect regions were identified in Great Britain, and
from these dialect regions, 124 adults of varying social classes were randomly
selected as those whose speech would be included in the corpus (Crowdy 1993:
259–60). Unfortunately, the speech of only 124 individuals can hardly be ex-
pected to represent the diversity of social and regional variation in a country the
size of Great Britain. Part of the failure of modern-day corpora to adequately
represent regional and social variation is that creators of these corpora have had
unrealistic expectations. As Chafe, Du Bois, and Thompson (1991: 69) note,
the thought of a corpus of “American English” to some individuals conjures
up images of “a body of data that would document the full sweep of the lan-
guage, encompassing dialectal diversity across regions, social classes, ethnic
groups . . . [enabling] massive correlations of social attributes with linguistic
features.” But to enable studies of this magnitude, the corpus creator would
have to have access to resources far beyond those that are currently available –
resources that would enable the speech of thousands of individuals to be recorded
and then transcribed.
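The BNC-style demographic selection described above amounts to stratified random sampling: divide the speaker pool into strata (dialect region crossed with social class) and draw at random within each stratum. The sketch below illustrates the idea; the region and class labels are invented placeholders, not the BNC's actual categories.

```python
import random

def stratified_sample(pool, regions, classes, per_stratum, seed=0):
    """Draw per_stratum speakers at random from each region x class cell."""
    rng = random.Random(seed)
    chosen = []
    for region in regions:
        for social_class in classes:
            stratum = [s for s in pool
                       if s["region"] == region and s["class"] == social_class]
            chosen.extend(rng.sample(stratum, min(per_stratum, len(stratum))))
    return chosen

# A toy pool: 10 candidate speakers in each of 4 strata (labels invented).
pool = [{"id": (i, r, c), "region": r, "class": c}
        for i in range(10)
        for r in ("north", "south")
        for c in ("AB", "C1")]
picked = stratified_sample(pool, ("north", "south"), ("AB", "C1"), per_stratum=2)
print(len(picked))  # 8: two speakers from each of the four strata
```

Stratifying guarantees that every region-by-class cell is represented, but, as the text notes, a small total (the BNC's 124 adults) still leaves each cell too thinly populated to capture real regional and social diversity.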
Because it is not logistically feasible in large countries such as the United
States or Great Britain to create corpora that are balanced by region and social
class, it is more profitable for corpus linguists interested in studying social
and regional variation to devote their energies to the creation of corpora that
focus on smaller dialect regions. Tagliamonte (1998) and Tagliamonte and
Lawrence (2000: 325–6), for instance, contain linguistic discussions based on
a 1.5-million-word corpus of York English that has been subjected to extensive
analysis and that has yielded valuable information on dialect patterns (both
social and regional) particular to this region of England. Kirk (1992) has created
the Northern Ireland Transcribed Corpus of Speech containing transcriptions
of interviews of speakers of Hiberno English. Once a number of more focused
corpora like these are created, they can be compared with one another and valid
comparisons of larger dialect areas can then be conducted.

2.8.5 Social contexts and social relationships


Speech takes place in many different social contexts and among speak-
ers between whom many different social relationships exist. When we work,
for instance, our conversations take place in a specific and very common social
context – the workplace – and among speakers of varying types: equals (e.g.
co-workers), between whom a balance of power exists, and disparates (e.g. an
employer and an employee), between whom an imbalance of power exists. Be-
cause the employer has more power, he or she is considered a “superordinate”
in contrast to the employee, who would be considered a “subordinate.” At home
(another social context), other social relationships exist: a mother and her child
are not simply disparates but intimates as well.
There is a vast amount of research that has documented that both the social
context in which speech occurs and the social relationships between speakers
have a tremendous influence on the structure of speech. As Biber and Burges
(2000) note, to study the influence of gender on speech, one needs to consider
not just the gender of the individual speaking but the gender of the individual(s)
to whom a person is speaking. Because spontaneous dialogues will constitute a
large section of any corpus containing speech, it is important to make provisions
for collecting these dialogues in as many different social contexts as possible.
The London–Lund Corpus contains an extensive collection of spoken texts
representing various social relationships between speakers: spontaneous
conversations or discussions between equals and between disparates; radio discussions
and conversations between equals; interviews and conversations between dis-
parates; and telephone conversations between equals and between disparates
(Greenbaum and Svartvik 1990: 20–40).
The American component of ICE contains spontaneous conversations taking
place in many different social contexts: there are recordings of family mem-
bers conversing over dinner, friends engaging in informal conversation as they
drive in a car, co-workers speaking at work about work-related matters, teach-
ers and their students discussing class work at a university, individuals talk-
ing over the phone, and so forth. The Michigan Corpus of Academic Spoken
English (MICASE) collected academic speech in just about every conceivable
academic context, from lectures given by professors to students conversing in
study groups, to ensure that the ultimate corpus created represented the broad
range of speech contexts in which academic speech occurs (Simpson, Lucka,
and Ovens 2000: 48). In short, the greater the diversity of social contexts
from which speech is collected, the more confident one can be that the corpus
covers the full range of contexts in which language occurs.

2.9 Conclusions

To create a valid and representative corpus, it is important, as this
chapter has shown, to plan the construction carefully before the collection of
data even begins. This process is guided by the ultimate use of the corpus. If one
is planning to create a multi-purpose corpus, for instance, it will be important
to consider the types of genres to be included in the corpus; the length not
just of the corpus but of the samples to be included in it; the proportion of
speech vs. writing that will be included; the educational level, gender, and
dialect backgrounds of speakers and writers included in the corpus; and the
types of contexts from which samples will be taken. However, because it is
virtually impossible for the creators of corpora to anticipate what their corpora
will ultimately be used for, it is also the responsibility of the corpus user to
make sure that the corpus he or she plans to conduct a linguistic analysis of
is valid for the particular analysis being conducted. This shared responsibility
will ensure that corpora become the most effective tools possible for linguistic
research.

Study questions

1. Is a lengthier corpus necessarily better than a shorter corpus? What kinds
of linguistic features can be studied in a lengthier corpus (such as the Bank