Biber 2012
DOUGLAS BIBER
Abstract
Over the last two decades, corpus analysis has been used as the basis for
several important reference grammars and dictionaries of English. While these
reference works have made major contributions to our understanding of English
lexis and grammar, most of them share a major limitation: the failure to
consider register differences. Instead, most reference works describe lexico-
grammatical patterns as if they applied generally to English.
The main goal of the present paper is to challenge this practice and the
underlying assumption that the patterns of lexical-grammatical use in English
can be described in general/global terms. Specifically, I argue that descriptions
of the average patterns of use in a general corpus do not accurately describe
any register. Rather, the patterns of use in speech are dramatically different
from the patterns in writing (especially academic writing), and so minimally
an adequate description must recognize the two major poles in this continuum
(i.e., conversation versus informational written prose).
The paper begins by comparing two general corpus approaches to the study
of language use: variationist and text-linguistic. Although both approaches can
be used to investigate the use of words, grammatical features, and registers, the
two approaches differ in their bases: the first gives primacy to each linguistic
token, while the second gives primacy to each text. This difference has important
consequences for the overall research design, the kinds of variables that can be
measured, the statistical techniques that can be applied, and the particular
research questions that can be asked. As a result, the importance of register has
been more apparent in text-linguistic studies than in studies of linguistic variation.
The bulk of the paper, then, argues for the importance of register at all
linguistic levels: lexical, grammatical, and lexico-grammatical. Analyses
comparing conversation and academic writing are discussed for each level,
showing how a general ‘average’ description includes some characteristics that are
not applicable to one or the other register, while also omitting other important
patterns of use found in particular registers.
1. Introduction
One major contribution of corpus research over the past 40 years has been
the increasing awareness that lexis and grammar are intimately intertwined.
Numerous studies have exploited corpus resources to describe the systematic
lexical associations of a target grammatical construction (cf. the survey of
studies in Kennedy 1998: 121–154). For example, some of the most common
verbs in English occur most of the time in the simple present tense (e.g., think,
know, want, mean), while others occur more often in the simple past tense (e.g.,
said, came, took) (Kennedy 1998: 123). The modal would usually occurs with
a simple verb in academic writing (the expense would fall), while the
modal can more often occurs with be + past participle (the procedures can be
applied to . . .) (Kennedy 1998: 133). The preposition between most commonly
occurs as a noun modifier in written English (e.g., difference between,
relationship between, agreement between), in contrast to the preposition through, which
more often has an adverbial function (e.g., go through, pass through, come
through) (Kennedy 1998: 142–143; cf. Kennedy 1991). A more recent study of
this type is Römer’s (2005) detailed description of the verbs associated with
progressive aspect.
Several major reference grammars have also employed corpus investiga-
tions to identify the words associated with grammatical constructions (e.g., lists
of the verbs and adjectives that can control a that-clause or a to-clause). One of
the earliest grammars to include extensive lexical information of this type is
the Comprehensive Grammar of the English Language (Quirk et al. 1985; see
also Quirk et al. 1972), while the Collins COBUILD English Grammar (1990),
the Longman Grammar of Spoken and Written English (Biber et al. 1999), and
the Cambridge Grammar of English (Carter and McCarthy 2006) are more
recent examples.
These grammars all take a deductive (‘corpus-based’) approach: grammatical
constructions are distinguished on the basis of traditional linguistic criteria,
and then the set of words associated with those constructions is identified
through corpus analysis. In contrast, the ‘pattern grammar’ reference books
(Francis et al. 1996, 1998) take the opposite approach, beginning with words
and then identifying the (grammatical) “phraseology frequently associated
with (a sense of) a word” (Hunston and Francis 2000: 3). These books show
that there are systematic regularities in the associations between sets of words,
grammatical frames, and particular meanings on a much larger scale than it
Register as a predictor of linguistic variation 11
be used to investigate the use of words, grammatical features, and registers, the
two approaches differ in their bases: the first gives primacy to each linguistic
token, while the second gives primacy to each text. This difference has impor-
tant consequences for the overall research design, the kinds of variables that
can be measured, the statistical techniques that can be applied, and the
particular research questions that can be asked. As a result, the importance of
register has been more apparent in text-linguistic studies than in studies of
linguistic variation.
The bulk of the paper, then, argues for the importance of register at all lin-
guistic levels: lexical, grammatical, and lexico-grammatical. Analyses com-
paring conversation and academic writing are discussed for each level, show-
ing how a general ‘average’ description includes some characteristics that are
not applicable to one or the other register, while also omitting other important
patterns of use found in particular registers. (The data for several of these case
studies are taken from the Longman Grammar of Spoken and Written English,
referred to as LGSWE below.)
The units of analysis are the ‘observations’ that are described in a study. For the
most part, the observations in Type A studies do not have quantitative charac-
teristics, while the observations in Type B studies are analyzed in terms of
quantitative characteristics. (Type C studies differ from both of the others in
that there are actually very few observations – usually only 2 or 3 observations
– because each sub-corpus is treated as an observation.)
For example, a Type A study of relative clauses might have the goal of
predicting the choice of relative pronoun (who, which, that). Each relative
clause would be an observation, coded for restrictive versus non-restrictive
function and for the animacy of the head noun. All three variables (relative
pronoun, clause type, head noun type) in this study would be nominal rather
than numeric.2
In this case, descriptive statistics give the frequencies for each combination
of categories (e.g., how many occurrences of the relative pronoun that are used
with animate head nouns). However, it is difficult to document variation (or
dispersion) across the corpus, and the distribution of linguistic variants across
texts is not considered in this type of analysis (see also Gries 2006). Most Type
A studies obtain frequencies for the corpus as a whole, but give no consider-
ation to variation among the texts in a corpus.
Statistically, this type of design must be analyzed using non-parametric
techniques, such as chi-squared or log-likelihood. As a result, a Type A study
is ideal for studying the proportional preference for one or another variant, or
the proportional extent to which a linguistic variant occurs with particular
contextual factors. However, this design type is not well suited to studying the
overall extent to which a linguistic feature is used in texts.
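As a rough illustration of the Type A design just described, the sketch below cross-tabulates relative pronoun by head-noun animacy and computes row proportions and a Pearson chi-squared statistic. The counts, the table layout, and the function names are invented for illustration and come from no corpus. Note that nothing in the table records how much text was sampled, which is why a design of this kind yields proportional preferences but not rates of occurrence.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two nominal variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def row_proportions(table):
    """Proportional preference for each variant within each row category."""
    return [[cell / sum(row) for cell in row] for row in table]


# Hypothetical counts, invented for illustration only.
# Rows: animate head noun, inanimate head noun. Columns: who, which, that.
counts = [
    [60, 5, 35],
    [2, 48, 50],
]

print("chi-squared:", round(chi_squared(counts), 2))
for label, row in zip(["animate", "inanimate"], row_proportions(counts)):
    print(label, [round(p, 2) for p in row])
```

Each relative clause contributes exactly one observation to one cell, so the output describes the choice among variants rather than how often relative clauses occur in texts.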
The important point here is that Type A research designs do not provide the
basis for determining rates of occurrence, so they cannot be used to determine
if a feature or variant occurs more often in one register or another. This is po-
tentially confusing, and even published research studies sometimes make this
mistake in interpreting statistical analyses: describing proportional preference
for one variant over another as if it were the same as a higher rate of occurrence.
Type A studies can tell us what the preferred variant is in a register, and how
registers differ in their reliance on a particular variant. For example, Figure 1
shows the proportional use of that versus 0 complementizer in verb + that-
clause constructions, as in:
The commission agreed that this solution . . .
versus
I thought [0] you did it
Figure 1 compares conversation, newspaper writing, and academic prose, based
on a sample of 1,000 that-clauses taken from each of the registers. (The actual
proportions are given in Table 1; cf. LGSWE 1999: 680).
14 D. Biber
Figure 1. Proportional preference for Verb + that versus Verb + 0 in three registers
Table 1. Proportional preference for Verb + that versus Verb + 0, based on a sample of 1,000
that-clauses taken from conversation, newspaper writing, and academic prose
Figure 2. Rates of occurrence for Verb + that and Verb + 0 in three registers
writing than in academic writing, and the rate in conversation is only some-
what lower than in academic writing. This is because that-clauses overall are
much more frequent in conversation and newspaper writing than in academic
writing. As a result, both variants (with that and 0) occur with much higher
rates in newspaper writing than in academic writing.
It is surprisingly common for researchers to confuse these two perspectives
on variation, or to at least use statements that are misleading to the naïve reader.
The main problem here has to do with claims that a linguistic feature is “fre-
quent.” Consider, for example, the following statement from Szmrecsanyi and
Hinrichs (2008: 297): “The s-genitive is, on the whole, more frequent in spo-
ken data than in written . . .” It would be natural to interpret this statement to
mean that a speaker will produce more s-genitives than a writer will. Or put
another way, a listener will encounter more s-genitives in a conversation than
a reader would in a written text. However, that interpretation is not intended by
Szmrecsanyi and Hinrichs, and in fact, it is not accurate.
The pattern being described by Szmrecsanyi and Hinrichs is one of propor-
tional preference, not frequency of occurrence in texts. That is, s-genitives are
proportionally preferred over of-genitives in speech, while of-genitives are
proportionally preferred over s-genitives in writing: “FRED [i.e. a spoken cor-
pus] exhibits the highest percentage of the s-genitive (59.6%), Brown [i.e., a
written corpus] (36.2%) the lowest.” (Szmrecsanyi and Hinrichs 2008: 297).
But, from a text-linguistic perspective, s-genitives are actually much more fre-
quent in writing than in speech. Thus, Figure 4.6 in LGSWE (1999: 302) shows
that there are only c. 800 s-genitives per million words of conversation, in
contrast to c. 2,300 s-genitives per million words in academic writing, and
c. 9,000 s-genitives per million words in newspaper writing.
The pattern here is the same as the case study for that-retention presented
above. Thus, when a speaker uses a genitive construction, it is most likely to be
an s-genitive. So, considering only conversation, s-genitives are more frequent
than of-genitives. However, genitives overall are much more frequent in writ-
ing than in speech. As a result, when speech is compared to writing, both of-
genitives and s-genitives occur more frequently in writing. That is, even though
the s-genitive is proportionally preferred in conversation, it still is much less
frequent than the s-genitive in writing.
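The contrast between the two perspectives can be made concrete with a small calculation. The sketch below is only an illustration: the corpus sizes and token counts are hypothetical, chosen to mimic the direction of the pattern described above, and the function names are not drawn from any published tool. It computes both the proportional preference for the s-genitive (the variationist measure) and its normalized rate per million words (the text-linguistic measure).

```python
def proportion(variant, alternative):
    """Share of one variant among all tokens of the variable (variationist view)."""
    return variant / (variant + alternative)


def rate_per_million(count, corpus_words):
    """Occurrences normalized to one million words of text (text-linguistic view)."""
    return count * 1_000_000 / corpus_words


# Hypothetical token counts and corpus sizes, invented for illustration only.
corpora = {
    "speech": {"words": 2_000_000, "s_genitive": 1_600, "of_genitive": 1_000},
    "writing": {"words": 2_000_000, "s_genitive": 8_000, "of_genitive": 14_000},
}

for name, c in corpora.items():
    share = proportion(c["s_genitive"], c["of_genitive"])
    rate = rate_per_million(c["s_genitive"], c["words"])
    print(f"{name}: s-genitive share = {share:.1%}, rate = {rate:.0f} per million words")
```

With these invented figures the s-genitive has the larger share in speech but the higher rate in writing, so the two measures rank the registers in opposite orders.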
The point here is not to criticize the Szmrecsanyi and Hinrichs (2008) study,
which is an exemplary study employing a carefully considered research design
and sophisticated statistical methods to analyze this aspect of grammatical
variation. Rather, the point is to emphasize how easy it is to confound the
variationist and the text-linguistic perspectives when reporting frequency re-
sults. This is not an obscure methodological quibble. Rather, the two perspec-
tives are completely different in their practical implications. The variationist
perspective has the goal of comparing linguistic variants: whether one or the
other variant is preferred. These preferences can be compared across registers,
but that analysis cannot tell us the actual extent to which a variant is used in
texts. In contrast, the text-linguistic perspective (a Type B study) has the goal
of providing a linguistic description of texts, by describing the density of gram-
matical features in texts. These studies directly tell us the density of occurrence
of a feature (or variant) in texts from different registers.
There are a few general points worth emphasizing here. The first is the
importance of distinguishing between variationist research designs and text-
linguistic designs: variationist designs investigate proportional preferences,
while text-linguistic designs investigate the rates of occurrence in texts. But a
more general point is a cautionary one: the text-linguistic perspective is often
the more natural interpretation, and thus it is easy for authors (and readers) to
misleadingly use the language of ‘frequency’. When a linguistic feature is
described as occurring ‘frequently’, we expect to encounter numerous occur-
rences of the feature in texts (the text-linguistic perspective). The proportional
perspective is more difficult to describe: that when a linguistic feature does
occur, it usually has certain characteristics – even if the feature itself occurs
infrequently. Thus, it is essential to be explicit about the nature of the patterns
in variationist studies: that they represent proportional preferences but not nec-
essarily frequent occurrence in texts.
In summary, a Type A research design – studies of linguistic variation where
each linguistic token is an observation – cannot describe the rates of occur-
rence in texts and registers. This design type can identify register influences on
have
Conversation:
dinner, lunch, a drink, fun, a good time, trouble, a hard time, a/no problem
with, kids, children, a baby, a/the chance, an/no idea, a question
Informational writing:
an/no/little effect/impact/influence on, the advantage of, a range of, a wide
variety of, little/no evidence of, no knowledge of, the potential for, little
sympathy for, implications for, an interest in, a role in
make
Conversation:
the bed, a phone call, a joke, (a) noise, a sound, an appointment, a deal, plans
to, a living, money, a difference, (a) decision(s), an effort, a mistake, (no)
sense, fun of, time for, sure
Informational writing:
assumptions, comparisons, judgments, choices, decisions, predictions,
recommendations, (no) sense, use of, reference to
take
Conversation:
a photo/picture, a bath/shower, a nap, a break, it easy, place, a minute, time,
classes, a test, a message, notes, a car, the bus/train, a ride, a right/left [turn],
a look at, care of, charge of, responsibility for, advantage of, forever, turns
Informational writing:
action, the initiative, the lead in, steps to, the position that, the view that,
account of, into account, part in, advantage of, precedence over, the form of,
the shape of
From a text-linguistic perspective, the verbs have, make, and take are consider-
ably more frequent in conversation than in informational writing. However, the
perspective of linguistic variation asks different questions: When these verbs
are used in conversation, what are the most common collocates? When these
verbs are used in informational writing, what are the most common collocates?
Are the preferred collocates in conversation the same as those in informational
writing?5
The answers to these questions indicate that register is a fundamentally im-
portant organizing factor for studies of collocation. All three verbs have strong
collocational associations in both conversation and informational writing.
However, those associations are almost entirely non-overlapping.
A collocational analysis of these verbs in a general multi-register corpus (cf.
BNC) might identify many of these combinations. But such an analysis would
miss the point that these are not general collocations. In informational writing,
it is rare to find uses like have lunch, have fun, make a phone call, make a deal,
take a break, take care of, etc. Similarly, in conversation, it is rare to find uses
like have implications for, make assumptions about, take precedence over.
These are all strong collocations, but they are tied to a particular domain of use,
and thus an essential component of their analysis should be documentation of
the register where they typically occur.
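One simple way to keep register in view during a collocational analysis is to tabulate collocates separately for each register and then inspect the overlap. The following is a minimal sketch under stated assumptions: the observations are a handful of invented (register, verb, collocate) tokens that reuse collocates from the lists above, and the helper names are hypothetical. A real study would extract such tokens from a tagged corpus.

```python
from collections import Counter, defaultdict

# Invented tokens for illustration; the collocates echo the lists above,
# but the counts carry no empirical weight.
observations = [
    ("conversation", "have", "lunch"),
    ("conversation", "have", "lunch"),
    ("conversation", "have", "fun"),
    ("informational writing", "have", "implications for"),
    ("informational writing", "have", "an effect on"),
    ("conversation", "take", "a break"),
    ("informational writing", "take", "precedence over"),
]


def collocates_by_register(tokens, verb):
    """Count the collocates of one verb separately within each register."""
    counts = defaultdict(Counter)
    for register, token_verb, collocate in tokens:
        if token_verb == verb:
            counts[register][collocate] += 1
    return counts


def shared_collocates(counts, register_a, register_b):
    """Collocates attested in both registers."""
    return set(counts[register_a]) & set(counts[register_b])


have_counts = collocates_by_register(observations, "have")
print(have_counts["conversation"].most_common())
print(shared_collocates(have_counts, "conversation", "informational writing"))
```

Pooling the two registers into one counter would report all of these combinations as collocates of have without showing that, in this toy data, none of them is shared.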
intervening variable slot filled by many different possible content words (e.g.,
the * of the, in the * of, to the * of).
These different patterns also have strong grammatical correlates. For ex-
ample, the continuous fixed sequences in conversation consist of both function
words and content words. In contrast, the fixed slots in the academic writing
patterns are usually function words, while the variable slots are usually content
words. However, there are other differences having to do with the structural
correlates of these lexical patterns: verb phrase and clause fragments in the
case of conversational lexical sequences, but noun phrase and prepositional
phrase fragments in the case of the academic writing patterns. I return to these
associations in Section 5 below.
Figure 3. Proportional preference for Verb + that versus Verb + 0: The influence of
co-referential subjects, in conversation vs. newspaper prose
the matrix verb and the that-clause. One factor that has been hypothesized to
favor that-omission is co-referential subjects in the matrix clause and the that-
clause, as in:
The Secretary argued that armed intervention was not the answer.
This factor is interesting for our purposes here because its influence is
dramatically different in different registers. Thus, Figure 3 above (based on Biber
et al. 1999: 681) shows that the choice of grammatical subject has little
influence in conversation: the complementizer that is omitted over 80% of the time,
regardless of whether the grammatical subjects are co-referential or not. How-
ever, we see a dramatically different pattern in newspaper writing: when the
construction has co-referential subjects, the complementizer that is omitted
over 70% of the time, while that is omitted only c. 15% of the time when
the subjects are not co-referential. Thus, the strength of this factor is mediated
by register: essentially no influence in conversation (because that-omission is
already the norm), versus a very strong influence in newspaper writing (because
that-retention is the general register norm).
Figure 4. Proportional preference for Verb + that versus Verb + 0: The influence of an
intervening NP, in conversation vs. newspaper prose
Other grammatical factors favor the retention of that, such as the presence of
a noun phrase intervening between the matrix verb and that-clause (e.g., They
told him that it’s dangerous). These grammatical factors show the opposite
interaction with register, having a very strong influence in conversation but
little influence in informational writing. Thus, Figure 4 above (based on Biber
et al. 1999: 682) shows that the presence of an intervening noun phrase in
conversation results in the complementizer that almost always being retained,
while it is retained less than 15% of the time when there is no intervening noun
phrase. In contrast, the presence of an intervening noun phrase has little influ-
ence in newspaper writing, because the norm is already to retain the comple-
mentizer that. Thus, even without an intervening noun phrase, the complemen-
tizer that is retained c. 75% of the time.
In sum, similar to the collocational patterns for individual words, register is
a strong predicting factor for studies of grammatical variation. This factor
interacts with other contextual influences: a contextual factor with a strong
influence in one register might have a minimal influence in another register. As
the following section shows, grammatical differences across registers are even
more notable when approached from a text-linguistic perspective.
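The interaction between register and a contextual factor can be expressed as a difference between proportions computed within each register. The sketch below is illustrative only: the counts are hypothetical, chosen to resemble the approximate percentages reported above for co-referential subjects, and the names that_counts, omission_proportion, and factor_effect are assumptions introduced here.

```python
# Hypothetical (omitted, retained) counts of the complementizer "that",
# keyed by register and by whether the two subjects are co-referential.
that_counts = {
    ("conversation", "coreferential"): (85, 15),
    ("conversation", "non-coreferential"): (82, 18),
    ("newspaper", "coreferential"): (72, 28),
    ("newspaper", "non-coreferential"): (15, 85),
}


def omission_proportion(omitted, retained):
    """Proportion of that-clauses in which the complementizer is omitted."""
    return omitted / (omitted + retained)


def factor_effect(counts, register):
    """Difference in omission proportion attributable to co-reference, within one register."""
    coref = omission_proportion(*counts[(register, "coreferential")])
    non_coref = omission_proportion(*counts[(register, "non-coreferential")])
    return coref - non_coref


for register in ("conversation", "newspaper"):
    print(register, round(factor_effect(that_counts, register), 2))
```

Averaging the four cells into a single register-neutral estimate would report a moderate effect that, on these invented counts, matches neither register.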
1. structural type:
a. clauses, especially finite dependent clauses, are preferred in speech
b. (non-verbal) phrases are preferred in academic writing
2. syntactic function:
a. clausal constituents (adverbials and complement clauses) are preferred
in speech
b. noun phrase constituents (noun modifiers and noun complements) are
preferred in academic writing
Tables 2 and 3 provide selected details of these general trends, comparing the
mean scores for many of the most important complexity features.
In sum, from a text-linguistic perspective, the grammar of conversation is
dramatically different from the grammar of informational writing (cf. Biber,
Gray, and Poonpon 2011). This does not represent an absolute difference be-
tween speech and writing, because written registers like email can employ the
grammatical discourse styles typical of speech (e.g., Biber and Conrad 2009:
Ch. 7). However, informational writing is fundamentally different from con-
versation (and spoken registers generally) in its heavy reliance on phrasal
the verbs that are most strongly attracted to active versus passive voice and the
verbs that are most strongly attracted to the ditransitive versus to-dative (cf.
Stefanowitsch and Gries 2003; Gries and Stefanowitsch 2004).
Collostructional analysis – similar to previous statistical measures of collo-
cational association – assumes the desirability of a single statistical measure of
linguistic importance (lexico-grammatical association). However, there are
theoretical and practical reasons why it might be preferable to distinguish
between the variationist and text-linguistic perspectives: the words that are
proportionally associated with a grammatical construction (even if they rarely
occur), versus the words that most frequently occur with a grammatical
construction (even if those words also frequently have other grammatical associations).
The two types of analysis produce results that are to a large extent comple-
mentary. For example, from the text-linguistic perspective, passive voice con-
structions in academic writing are most frequently found with verbs like be
made, be given, be used, be seen, even though those verbs are also commonly
used in the active voice (compare LGSWE pp. 367–369 and p. 375 with
p. 478). In contrast, the variationist perspective identifies a completely
different set of verbs that have the strongest proportional association with passive voice
in academic writing, occurring as passives over 90% of the time, even though
none of them is especially frequent: be aligned (with), based on, coupled with,
deemed, effected, situated, subjected (to), etc. (LGSWE p. 479). Thus, a reader
will most often encounter passive verbs like be made, be given, be used, and
she/he will develop the association that when passive voice is employed, it
often incorporates those verbs. In contrast, a reader will less often encounter
passive verbs like be aligned (with) and based on, but because those verbs are
almost never encountered in the active voice, an association in the opposite
direction is developed: when those verbs are used, they are almost always in
the passive voice. The two perspectives are distinct but both are important.
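The complementarity of the two perspectives can be seen by ranking the same verbs in two ways. In the sketch below, the verbs are taken from the discussion above but the (active, passive) token counts are invented for illustration; the variable and function names are likewise assumptions. One ranking orders verbs by how often they occur as passives, the other by the proportion of their tokens that are passive.

```python
# Hypothetical (active, passive) token counts; not taken from LGSWE.
verb_counts = {
    "make": (900, 400),
    "use": (700, 500),
    "give": (800, 300),
    "align": (2, 40),
    "situate": (1, 30),
}


def passive_proportion(active, passive):
    """Share of a verb's tokens that occur in the passive voice."""
    return passive / (active + passive)


# Text-linguistic view: which verbs are most often encountered as passives?
by_frequency = sorted(verb_counts, key=lambda v: verb_counts[v][1], reverse=True)

# Variationist view: which verbs are most strongly attracted to the passive?
by_proportion = sorted(
    verb_counts, key=lambda v: passive_proportion(*verb_counts[v]), reverse=True
)

print("most frequent passives:", by_frequency[:3])
print("highest passive proportion:", by_proportion[:2])
```

On these counts the two rankings share no verbs at the top, which is the kind of divergence a single combined association score can obscure.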
The omission of register analyses in the pattern grammar books can be justi-
fied by the magnitude of that project, compiling lists of ‘patterns’ for thousands
of verbs, nouns, and adjectives. Given finite resources, the authors chose to
increase their coverage of words, rather than investigating the possibility of
different patterns occurring in different registers. However, this omission is
also typical of most lexico-grammatical studies that are more restricted in scope.
One exception to this generalization is Stefanowitsch and Gries (2008), who
investigate the influence of spoken versus written mode on the associations
between lexical items and ‘constructions’. As described above, this study uses
‘collostructional analysis’, a statistical procedure that combines proportional
preference and frequency rates of occurrence to produce a single measure of
association between words and grammatical constructions.
For example, one case study in that paper investigates the verbs that are
most strongly associated with passive voice in speech and writing. Three dif-
ferent statistical comparisons are used:
1) for each verb, contrasting the association with active voice versus passive
voice, carried out separately for the spoken and written modes;
2) for each verb, contrasting the association with the spoken mode versus
written mode, carried out separately for active voice and then passive voice;
3) for each verb, checking for cross-over effects: when a verb is associated
with passive voice in the spoken mode but active voice in the written
mode, or vice versa.
The first type of analysis identifies the set of verbs associated with passive
(versus active) voice in speech, and then separately identifies the verbs
associated with passive (versus active) voice in writing (Stefanowitsch and Gries
2008: 140). Several verbs have associations with passive voice in both modes,
such as: BE + concerned, based, published, associated, confined, designed.
However, other verbs are associated with passive (versus active) voice only in
one of the two modes.
In several cases, the results of this type of analysis are surprising, identifying
verbs that would not normally be associated with the target register. In fact,
many of these findings are difficult to interpret in terms of spoken versus writ-
ten discourse, calling into question the value of this statistical measure for this
application (cf. below). For example, the verbs most strongly associated with
passive voice in speech (but not associated with passive voice in writing)
include ‘literate’ verbs like: BE + involved, used, engaged, enclosed, aimed,
distributed, compared, entitled. In contrast, the verbs most strongly associated
with passive voice in writing (but not associated with passive voice in speech)
include ‘colloquial’ verbs like BE + thought, done, made.
The second type of analysis identifies the set of verbs that are more strongly
associated with passive voice in speech than in writing, and vice versa
(Stefanowitsch and Gries 2008: 142). Here again, we find several seemingly
‘literate’ verbs in the ‘spoken-passive’ list, such as BE + concerned, involved,
cross-examined, readmitted, adduced, extruded, while several ‘colloquial’
verbs are in the ‘written-passive’ list, such as BE + had, known, got/gotten,
thought, wanted.
The third type of analysis identified only two verbs with cross-over effects:
find and work, which are both attracted to spoken/active versus written/passive.
However, these verbs are claimed to involve different meaning senses in the
two channels, and thus to not represent a genuine register difference. Thus, the
general conclusion that Stefanowitsch and Gries draw is that register differ-
ences are not important for this type of investigation: “the results of channel-
sensitive collostructional analysis are essentially identical to those yielded by
a ‘channel-ignorant’ analysis as far as constructional meaning in the narrow
sense is concerned: we found no interaction between channel and semantics at
all.” (2008: 143)
Before addressing the claimed lack of a register effect, it is useful to discuss
three other important issues that arise in the Stefanowitsch and Gries study.
The first is to emphasize yet again the need to distinguish between the varia-
tionist and text-linguistic perspectives, and specifically to avoid claims about
‘frequency’ in variationist studies that investigate proportional preference. As
noted in Section 2 above, this confusion arises even in some of the most care-
fully designed corpus-based studies. Stefanowitsch and Gries (2008) some-
times seem to fall into this same trap; for example, “the passive construction
occurs relatively frequently with formal vocabulary in both channels.” (p. 143)
The normal interpretation of this statement is that a speaker would frequently
produce passive constructions with formal vocabulary in speech: a text-
linguistic perspective. However, Stefanowitsch and Gries (2008) did not actu-
ally consider this perspective and provide no findings to support such a conclu-
sion. In fact, passives are not frequent at all in conversation. For example, the
LGSWE (Figure 6.7; p. 476) shows that passive verbs occur only c. 2,000
times per million words in conversation, contrasted with c. 18,000 times per
million words in academic writing. Further, the verbs that most frequently
occur with passive voice in conversation are not formal vocabulary; rather,
they are mostly everyday, colloquial verbs like BE + made, done, called, put,
told, born, paid (cf. LGSWE pp. 478–479).
In contrast, the Stefanowitsch and Gries (2008) study apparently reflects the
fact that when a ‘formal’ verb (like engaged, enclosed, adduced, or extruded)
is used in conversation, it is usually used in the passive voice. But this does not
at all mean that such combinations occur frequently in conversation.
The second general issue here is that surprising findings require interpretation
and explanation; it is not sufficient to simply report surprising statistical results
with no discussion. This is especially the case when we are relying on a complex
statistical measure, with results that are not necessarily well-understood. Such
an analysis raises two possibilities: 1) that there are completely unanticipated
linguistic patterns identified by the analysis, requiring a radical change in our
understanding of spoken and written discourse; or 2) that the statistical
approach is not measuring what we think it is, and thus the approach itself
requires further analysis and interpretation.
For example, Table 6 in the Stefanowitsch and Gries (2008: 142) study lists
the verbs that are most strongly associated with passive voice in speech con-
trasted with writing. The expectation of the reader is that these verbs are some-
how especially typical of speech. The verbs identified by this method, though,
are extremely surprising, including be readmitted, be adduced, be extruded –
and no discussion is provided to explain how these passive verbs are typical of
speech in contrast to writing. Similarly, no discussion is offered to explain
what it means for verbs like BE had, BE got/gotten, BE wanted to be among the
verbs most strongly associated with passive voice in writing (in contrast to
speech).
Finally, these findings point to the need for triangulation of methodological
approaches, and the risks of relying exclusively on any single measure. For
example, multivariate statistics, simple descriptive statistics, and consideration
of linguistic examples in textual contexts should all be considered and recon-
ciled. Similarly, considering quantitative findings from both variationist and
text-linguistic perspectives can provide a more complete description than any
single perspective. Most importantly, analyses based on a single methodological
approach should be presented with explicit discussion of the limitations of that
approach.
For example, descriptive statistics on the use of passive verbs can be used to
illustrate the complementary kinds of information found from the variationist
versus text-linguistic analytical approaches. At the same time, these descrip-
tive statistics identify some completely different patterns from those identified
in the collostructional analysis. For example, an analysis of text-linguistic rate-
of-occurrence identifies five verbs that are especially frequent with the passive
voice in academic writing: BE + found, given, made, seen, used (LGSWE
p. 478). However, three of those verbs are not identified at all in the first
collostructional analysis (verbs associated with passive voice versus active voice
in academic writing): BE + given, found, seen. Similarly, BE + given, found
are not included on the list of passive verbs associated with writing versus
speech (even though they occur c. 100 times more frequently as passive verbs
in academic writing than in speech; cf., LGSWE p. 478).
Other verbs that did not make it to the lists in the Stefanowitsch and Gries
(2008) study have strong proportional use with passive voice in academic
writing. For example, the following verbs were not identified in any of the
collostructional analyses, but descriptive statistics show that they all occur as
Table 4. Selected verbs showing different patterns for frequency rates of occurrence (per 1 million
words) and proportional use of passive voice, comparing conversation (AmE) and
academic writing (based on analysis of c. 5-million word corpora for each register,
taken from the LSWE Corpus).
6. Conclusion
The present paper argues for the importance of register differences at all lin-
guistic levels. However, as background, the paper first distinguishes between
two major approaches to the study of linguistic variation and use: variationist
versus text-linguistic. The variationist approach has been widely used since
the 1960s for the quantitative study of the ‘sociolinguistic variable’ (see, e.g.,
Labov 1972; Lavandera 1978). More recently, functional linguists from sev-
eral different sub-disciplines have employed similar approaches to study gram-
matical variation, exploring the contextual factors that influence the choice
among related grammatical variants. Variationist studies differ in their linguistic
focus (phonological versus grammatical features) and in the statistical techniques
that they employ (e.g., Varbrul, logistic regression). However, all variationist
studies share certain characteristics: 1) the research goal is to describe a
linguistic feature, rather than the characteristics of texts; 2) each occurrence of
the target linguistic feature constitutes an observation in the research design;
and 3) the quantitative findings represent proportional preference for one lin-
guistic variant in comparison to other variants.
The text-linguistic approach to linguistic variation differs in all three re-
spects: 1) the research goal is to describe the characteristics of texts, rather than
the characteristics of a linguistic feature; 2) each text constitutes an observa-
tion; and 3) the quantitative findings represent the rates of occurrence of lin-
guistic features in texts rather than the proportional preference for a linguistic
variant in comparison to other variants. Thus, the text-linguistic approach de-
scribes language use from the perspective of a conversational participant or a
normal reader of a text: what features will they encounter most commonly in
spoken interactions or written texts?
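The contrast between the two research designs can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the three texts, their word counts, and the variant counts are invented, and the names `variationist_proportion` and `text_linguistic_rates` are hypothetical helpers rather than part of any published analysis. The first function treats each token of the variable as an observation and reports a pooled proportion; the second treats each text as an observation and reports a normalized rate of occurrence per text, which also makes dispersion across texts measurable.

```python
from statistics import mean, stdev

# Hypothetical toy data (invented for illustration): each text has a word
# count and counts of two competing variants of one linguistic variable.
texts = [
    {"words": 2000, "variant_a": 4, "variant_b": 16},
    {"words": 1000, "variant_a": 1, "variant_b": 1},
    {"words": 4000, "variant_a": 0, "variant_b": 12},
]

def variationist_proportion(texts):
    """Token as observation: share of variant A among all tokens of the variable."""
    a = sum(t["variant_a"] for t in texts)
    b = sum(t["variant_b"] for t in texts)
    return a / (a + b)

def text_linguistic_rates(texts, per=1000):
    """Text as observation: occurrences of variant A per `per` words in each text."""
    return [t["variant_a"] * per / t["words"] for t in texts]

proportion = variationist_proportion(texts)
rates = text_linguistic_rates(texts)

print(f"Proportional preference for variant A: {proportion:.3f}")
print(f"Rates per 1,000 words, one per text: {rates}")
print(f"Mean rate: {mean(rates):.2f}; standard deviation: {stdev(rates):.2f}")
```

Because the second design yields one value per text, it supports a mean and a standard deviation across texts. The pooled proportion in the first design carries no information about how often a reader or listener encounters the feature.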
The description of passive voice verbs (cf. Section 5.2) illustrates the practical
consequences of this distinction. From a variationist perspective, analyzing
proportional preference, verbs like BE + concerned, involved, cross-examined,
readmitted, adduced, extruded are especially associated with passive voice in
speech. That is, when one of these verbs is used in spoken discourse (or at least
in the particular corpus of speech analyzed for the Stefanowitsch and Gries (2008) study), it is likely to
occur as a passive rather than active voice verb. (In this case, these findings are
in contrast to written discourse: these particular verbs are proportionally more
likely to occur as passives in spoken discourse than in written discourse.)
The text-linguistic approach provides a dramatically different perspective,
because most of these verbs are simply not common at all in spoken texts. Thus,
a conversational participant will rarely encounter these verbs in text, whether
they are in the active or passive voice. However, there are other verbs that do
frequently occur with passive voice in conversation, like BE + made, done,
called, put, told, paid. Proportionally, these verbs occur most of the time as
active voice verbs, so they are not associated with the passive from a variationist
perspective. But a conversational participant will encounter these passive forms
much more frequently than proportionally-preferred verbs like BE adduced or
BE extruded.
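A small worked example may make the divergence concrete. The sketch below uses invented counts, not figures from any corpus: `rare_verb` stands in for a verb like BE adduced that is almost always passive but hardly ever occurs, and `common_verb` stands in for a verb like BE made that is mostly active but very frequent. The corpus size of 5 million words is likewise an assumption chosen for the arithmetic.

```python
# Hypothetical counts, invented for illustration only (not from any corpus).
CORPUS_WORDS = 5_000_000  # assumed size of the conversation corpus

verbs = {
    "rare_verb": {"passive": 9, "active": 1},
    "common_verb": {"passive": 500, "active": 9500},
}

def proportion_passive(counts):
    """Variationist measure: share of a verb's tokens that are passive."""
    return counts["passive"] / (counts["passive"] + counts["active"])

def passive_rate_per_million(counts, corpus_words=CORPUS_WORDS):
    """Text-linguistic measure: passive tokens per 1 million words of text."""
    return counts["passive"] * 1_000_000 / corpus_words

# Rank the verbs by each measure, strongest first.
by_proportion = sorted(verbs, key=lambda v: proportion_passive(verbs[v]), reverse=True)
by_rate = sorted(verbs, key=lambda v: passive_rate_per_million(verbs[v]), reverse=True)

for name, counts in verbs.items():
    print(name, proportion_passive(counts), passive_rate_per_million(counts))
print("Ranked by proportion:", by_proportion)
print("Ranked by rate:", by_rate)
```

With these invented counts the two rankings are reversed: the rare verb leads on proportional preference, while the common verb leads on rate of occurrence, which is the quantity a conversational participant actually experiences.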
Similar contrasts between the types of information provided by the two
perspectives can be given for most linguistic features. The point here is not to
argue that one or the other perspective is correct. However, studies often fail
to distinguish between the two, slipping into descriptions that suggest a text-
linguistic perspective when the data are strictly proportional or variationist.
The secondary goal of the present paper has thus been to emphasize the differ-
ence between these two perspectives, and the need to explicitly characterize
findings as relating to one or the other.
The primary goal, though, has been to argue for the importance of register
differences – in both variationist and text-linguistic studies. That is, the pat-
terns of linguistic variation and use are dramatically different across registers.
The descriptions here have focused mostly on the spoken/written contrast
(especially face-to-face conversation versus academic writing), but systematic
differences exist across the full range of registers (see, e.g., Biber and Conrad
2009). These register differences exist across all linguistic levels, including
lexical patterns, grammatical patterns, and lexico-grammatical associations.
Traditionally, most general-purpose corpora were designed to include mul-
tiple registers, and thus many descriptive studies have adopted a text-linguistic
approach and include some information on register differences.6 In recent
years, some variationist studies have also begun to include analysis of register
differences interacting with other factors of the linguistic context (see, e.g.,
Riordan 2007; Szmrecsanyi and Hinrichs 2008). However, it is still the norm
in most studies of collocation and lexico-grammatical associations to disregard
the possible influence of register differences. The main point of the present
paper is that we should instead treat this possible influence as a likelihood: that
the patterns of linguistic variation and use are usually strikingly different in
spoken versus written registers. Thus, the practice advocated here is to begin a
research study with the hypothesis that such register differences exist, and to
include analysis of those differences unless they are empirically shown to be
unimportant.
Notes
1. In a few cases, corpus linguists have actively argued against the need for quantitative analysis.
One of the best known linguists to take this position is Sinclair, stating that: “some numbers
are more important than others. Certainly the distinction between 0 and 1 is fundamental, being
the occurrence or non-occurrence of a phenomenon. The distinction between 1 and more than
one is also of great importance . . .” [because even two unconnected tokens constitute] the
recurrence of a linguistic event . . . , [which] permits the reasonable assumption that the event
can be systematically related to a unit of meaning. In the study of meaning it is not usually
necessary to go much beyond the recognition of recurrence [i.e. two independent tokens]. . . .
(Sinclair 2001: 343–344).
2. It is also possible to include quantitative variables in a Type A study. For example, each occur-
rence of a relative clause could be coded for the number of words separating the relative pro-
noun from the gap position. However, studies of linguistic variation generally do not include
such variables.
3. If the three sub-corpora had been exactly the same size, and if the analysis had been based on
a complete sample of all that-clauses, then conclusions of this type would be appropriate.
However, most Type A studies do not meet these two requirements.
4. Type C studies are also designed for text-linguistic research questions, but they do not permit
the use of inferential statistics. The results reported in the LGSWE are actually based on Type
C designs rather than Type B designs.
5. The operational definition of ‘collocate’ can also be approached from both variationist and
text-linguistic perspectives. The variationist perspective uses statistics like mutual informa-
tion and log-likelihood, which are based on the proportion of both words that co-occur. The
text-linguistic perspective uses simple rate of occurrence, measuring how often the combina-
tion of words is found in texts.
6. Most of these studies use a ‘Type C’ design, treating each sub-corpus as an observation rather
than analyzing each text as an observation. While this still results in a text-linguistic perspec-
tive, it does not permit analysis of dispersion or the extent to which register differences hold
for individual texts.
References
Biber, Douglas. 1988. Variation across speech and writing. Cambridge: Cambridge University
Press.
Biber, Douglas. 1995. Dimensions of register variation: A cross-linguistic comparison. Cam-
bridge: Cambridge University Press.
Biber, Douglas. 2006. University language: A corpus-based study of spoken and written registers.
Amsterdam: John Benjamins.
Biber, Douglas. 2009a. Are there linguistic consequences of literacy? Comparing the potentials of
language use in speech and writing. In David R. Olson & Nancy Torrance (eds.), Cambridge
Handbook of Literacy, 75–91. Cambridge: Cambridge University Press.
Biber, Douglas. 2009b. A corpus-driven approach to formulaic language: Multi-word patterns in
speech and writing. International Journal of Corpus Linguistics 14. 381–417.
Biber, Douglas & Victoria Clark. 2002. Historical shifts in modification patterns with complex
noun phrase structures: How long can you go without a verb? In Teresa Fanego, Maria Jose
Lopez-Couso & Javier Perez-Guerra (eds.), English historical syntax and morphology, 43–66.
Amsterdam: John Benjamins.
Biber, Douglas & Susan Conrad. 2009. Register, genre, and style. Cambridge: Cambridge University Press.
Biber, Douglas, Susan Conrad & Viviana Cortes. 2004. If you look at . . . : Lexical bundles in
university teaching and textbooks. Applied Linguistics 25. 371–405.
Biber, Douglas, Susan Conrad & Randi Reppen. 1998. Corpus linguistics: Investigating language
structure and use. Cambridge: Cambridge University Press.
Biber, Douglas & Bethany Gray. 2010. Challenging stereotypes about academic writing: Com-
plexity, elaboration, explicitness. Journal of English for Academic Purposes 9. 2–20.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad & Edward Finegan. 1999. The
Longman grammar of spoken and written English. London: Longman.
Biber, Douglas & James K. Jones. 2009. Quantitative methods in corpus linguistics. In Anke
Lüdeling & Merja Kytö (eds.), Corpus linguistics: An international handbook, 1286–1304.
Berlin: Walter de Gruyter.
Biber, Douglas, Bethany Gray & Kornwipa Poonpon. 2011. Should we use characteristics of
conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly
45. 5–35.
Carter, Ronald & Michael McCarthy. 2006. Cambridge grammar of English. Cambridge: Cambridge University Press.
Collins COBUILD English Grammar. 1990. London: Collins.
Conrad, Susan & Douglas Biber. 2009. Real grammar: A corpus-based approach to English. Pear-
son Longman.
Francis, Gill, Susan Hunston & Elizabeth Manning (eds.). 1996. Collins COBUILD grammar patterns
1: Verbs. London: HarperCollins.
Francis, Gill, Susan Hunston & Elizabeth Manning (eds.). 1998. Collins COBUILD grammar patterns
2: Nouns and adjectives. London: HarperCollins.
Gries, Stefan Th. 2006. Exploring variability within and between corpora: some methodological
considerations. Corpora 1. 109–151.
Gries, Stefan Th. & Anatol Stefanowitsch. 2004. Extending collostructional analysis: a corpus-
based perspective on ‘alternations’. International Journal of Corpus Linguistics 9. 97–129.
Hinrichs, Lars & Benedikt Szmrecsanyi. 2007. Recent changes in the function and frequency of
standard English genitive constructions: a multivariate analysis of tagged corpora. English Language
and Linguistics 11. 335–378.
Hunston, Susan. 2002. Corpora in applied linguistics. Cambridge: Cambridge University Press.
Hunston, Susan & Gill Francis. 2000. Pattern grammar: A corpus-driven approach to the lexical
grammar of English. Amsterdam: John Benjamins.
Kennedy, Graeme. 1991. Between and through: The company they keep and the functions they
serve. In Karin Aijmer & Bengt Altenberg (eds.), English Corpus Linguistics, 95–110. London:
Longman.
Kennedy, Graeme. 1998. An introduction to corpus linguistics. London: Longman.
Labov, William. 1972. Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.
Lavandera, Beatriz R. 1978. Where Does the Sociolinguistic Variable Stop? Language in Society
7. 171–82.
Longman Dictionary of Contemporary English. 2009. London: Longman.