Chap 2 Part 1

CHAPTER 2

Presented by: Eman Mohamed Yousef


2.5 What Is a Corpus?
The Traditional Notion of a Linguistic Corpus    The Notion of the Web as Corpus
Finite Size                                      Non-Finiteness
Balance                                          Flexibility
Part-whole Relationship                          Decentering / Recentering
Permanence                                       Provisionality
In a corpus such as the BNC, we know             No such certainty exists.
precisely what kinds and types of English
are being analyzed.
The Study of Biber, Egbert, and Davies (2015)
Function: An empirical study to determine the kinds of texts that exist on the Web,
based on a collection of texts taken from the Corpus of Global Web-Based English
(GloWbE), a corpus that is 1.9 billion words in length and that contains samples of
English from 20 different countries in which English is used.

Steps:
1. They developed a very carefully planned methodology for extracting a
   representative body of texts in the corpus from the Web.
2. They trained a group of evaluators to categorize the texts into specific registers.

Findings:
• They found that three registers predominated:
  1. The Narrative Register
  2. The Informational Description/Explanation Register
  3. The Opinion Register
• They also discovered a number of texts that were hybrid in nature: “documents that
  combine multiple communicative purposes in a single text”.
2.6 Corpus Size
First Generation Corpora                         Second Generation Corpora
Such as Brown and LOB: relatively short          Such as the BNC: regularly 100 million
(each one million words in length).              words in length or even longer.
Keyed in by hand: a tremendous amount of         Optical scanners made it easier to
very tedious and time-consuming typing.          convert printed texts into digital formats.
The Study of Davies:
Function: His goal was to determine the extent to which the length of a corpus
could provide valid information on 10 different linguistic constructions,
including individual lexical items; frequently occurring grammatical
structures, such as modal verbs and passives; collocations; and an
assortment of other commonly studied grammatical items.

Steps:
1. He provides a useful guide for determining how lengthy a corpus
   needs to be to accurately describe particular linguistic structures.
2. He analyzed three different corpora of varying length: the Brown
   Corpus (one million words), the BNC (100 million words), and COCA
   (at that time 500 million+ words).

Findings:
• Individual lexical items were better studied in larger corpora than in
  shorter corpora.
  - Adjectives such as fun or tender are among the group of adjectives
    that are most common in COCA, but in the Brown Corpus they
    occurred five times or fewer.
  - In contrast, certain types of syntactic structures, such as modal
    verbs, have more even distributions across the three corpora, thus
    being one of the few areas “where Brown provides sufficient data”.
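The effect of corpus length on lexical items can be illustrated with a little arithmetic. The sketch below is only an illustration: the corpus sizes come from the slide, but the rate of 5 occurrences per million words is a hypothetical value chosen to mirror the "five times or fewer" figure, and the function names are made up for this example.

```python
# Corpus sizes in words, as given in the slide (COCA at the time of the study).
CORPUS_SIZES = {
    "Brown": 1_000_000,
    "BNC": 100_000_000,
    "COCA": 500_000_000,
}


def per_million(raw_count, corpus_words):
    """Normalize a raw count to a frequency per million words."""
    return raw_count * 1_000_000 / corpus_words


def expected_hits(rate_per_million, corpus_words):
    """Expected raw count for a word that occurs at a fixed rate per million words."""
    return rate_per_million * corpus_words / 1_000_000


if __name__ == "__main__":
    hypothetical_rate = 5  # assumption: a word occurring 5 times per million words
    for name, size in CORPUS_SIZES.items():
        hits = expected_hits(hypothetical_rate, size)
        print(f"{name}: about {hits:.0f} expected occurrences")
```

A word at this hypothetical rate would yield only a handful of examples in a one-million-word corpus but thousands in a 500-million-word corpus, which is why rare lexical items call for larger corpora.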
The Study of Biber:
Function: Biber (1993) provides a different mechanism for estimating the
necessary size of a corpus for the study of particular linguistic
constructions.

Steps:
• His approach employs statistical formulas that
  1. take the frequency with which linguistic constructions are likely to
     occur in a corpus
  2. then calculate how large the corpus will have to be to validly study
     the distribution of the constructions.

Findings:
• Reliable information could be obtained on frequently occurring
  linguistic items such as nouns in as few as 59.8 text samples.
• Infrequently occurring grammatical constructions such as
  conditional clauses required a much larger number of text samples
  (1,190) for valid information to be obtained.
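The kind of calculation behind these figures can be sketched as a standard sample-size estimate: required samples = (standard deviation / standard error) squared, where the standard error is the tolerable error divided by a t-value. The slides do not give the formula or the input statistics, so everything below is an assumption: the tolerable error of 5 percent of the mean, the t-value of 1.96, and the per-1,000-word means and standard deviations, which are recalled from Biber's (1993) table and should be checked against the original. The function name is made up for this example.

```python
def required_samples(mean, std_dev, tolerance=0.05, t_value=1.96):
    """Estimate how many text samples are needed to study a feature.

    mean, std_dev: frequency of the feature per 1,000 words across texts.
    tolerance: tolerable error as a fraction of the mean (assumed 5%).
    t_value: t-value for the desired confidence level (assumed 1.96, i.e. 95%).
    """
    tolerable_error = tolerance * mean
    standard_error = tolerable_error / t_value
    return (std_dev / standard_error) ** 2


if __name__ == "__main__":
    # Assumed input statistics (mean, standard deviation per 1,000 words),
    # not stated on the slides.
    features = {
        "nouns": (180.5, 35.6),
        "conditional clauses": (2.5, 2.2),
    }
    for name, (mean, std_dev) in features.items():
        print(f"{name}: {required_samples(mean, std_dev):.1f} text samples")
```

Under these assumptions the sketch gives roughly 59.8 samples for nouns and about 1,190 for conditional clauses, in line with the findings above: a rare feature that varies a great deal relative to its mean needs far more samples.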
2.7 The Internal Structure of a Corpus
BNC and ICE

While the two corpora contain the same range of genres, the genres are much more
specifically delineated in the ICE Corpora than they are in the BNC.
For instance, in both corpora, 60 percent of the spoken texts are dialogues and
40 percent are monologues.

BNC: Dialogues and monologues are interspersed among the various genres
(e.g. business, leisure) making up the spoken part of the corpus.

ICE: In the category of speech, there are dialogues and monologues. Dialogues can
be either private (e.g. direct conversations) or public (e.g. broadcast discussions).

In both corpora, there is a clear bias towards spontaneous dialogues.

While the amount of writing in the BNC greatly exceeds the amount of speech, just
the opposite is true in the ICE Corpus.

The BNC makes a distinction between the natural, applied, and social sciences but,
unlike the ICE, does not include equal numbers of texts in each of these categories.
• Both the ICE and BNC are multi-purpose corpora.
  They are intended to be used for a variety of different purposes, ranging from studies of
  vocabulary, to studies of the differences between various national varieties of English, to
  studies whose focus is grammatical analysis, to comparisons of the various genres of English.
  For this reason, each of these corpora contains a broad range of genres.

COCA
It represents five major registers: spoken (transcripts of dialogical speech taken from various
television and radio shows), fiction, newspapers, magazines, and academic writing.

The overall size of the corpus is one billion words.

COCA has a diachronic component because it contains text collected over a span of 19 years.
