Project Proposal

Corpus linguistics

Uploaded by

Sadiya Ghafar
Copyright
© All Rights Reserved

Corpus Linguistics

05.11.2024

University of Education, Bank Road Campus, Lahore.


By:
Sadia Ghaffar, Samra Akram, Kiran Ilyas, Aliya Sarfraz.

Introduction to Corpus Linguistics

The word corpus comes from the Latin word “corpus”, meaning “body”.

A corpus is a body, or large collection, of real-world text (books, novels, magazines, speeches, journals).

What is a Corpus?
“A large collection of machine-readable text is known as a corpus (plural corpora). It can be spoken or written.”
or
“A large collection of language text in electronic form, selected to be representative of a particular language for linguistic research.” (Sinclair, 2005)

What is Corpus Linguistics?

“Systematic study of language through large collections of machine-readable text (corpora) is known as corpus linguistics.”
OR
“Study of language that involves the analysis of a large collection of real-world language data is known as corpus linguistics.”
The term corpus linguistics first appeared only in the early 1980s.
John Sinclair and J. R. Firth are leading linguists in this field.

John McHardy Sinclair


John McHardy Sinclair (14 June 1933 – 13 March 2007) was a professor of Modern English
Language at Birmingham University from 1965 to 2000. He pioneered work in corpus
linguistics, discourse analysis, lexicography, and language teaching.

Examples of Corpora

● The Brown Corpus of Standard American English: The first modern, electronically readable corpus was the Brown Corpus, prepared by Nelson Francis and Henry Kucera in the early 1960s. This corpus consists of one million words of American English text.
● The BNC (British National Corpus): A 100-million-word corpus of British English, compiled in the early 1990s.
● The Kolhapur Corpus (Indian English).
● The London-Lund Corpus (LLC) of spoken British English.

Types of Corpus


1. General: A corpus of general text, that is, text that does not belong to a single text type or subject field.
2. Monolingual: A single-language corpus, often used to obtain lexical and grammatical information about a specific language.
3. Bilingual: A corpus containing texts from two languages, enabling comparative analysis.
4. Parallel: A corpus containing texts in one language together with their translations into one or more other languages, useful for studying translation strategies.

Features of Corpus linguistics:


I. Using authentic language data
II. Evidence-based study
III. Investigating qualitative as well as quantitative aspects of language
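
The quantitative side of corpus study can be illustrated with a simple word-frequency count. The following is only a minimal sketch: the sample sentence, the function name word_frequencies and the very simple tokenization rule are assumptions made for illustration, not part of any corpus mentioned in these notes.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word form occurs in a text.

    Hypothetical helper: tokenization is deliberately naive
    (lowercase runs of letters and apostrophes).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Tiny invented "corpus" used only for demonstration.
sample = "The cat sat on the mat. The dog sat too."
freq = word_frequencies(sample)

# List the two most frequent word forms with their counts.
for word, count in freq.most_common(2):
    print(word, count)
```

On a real corpus the same count would be run over millions of words, and the resulting frequency list would then be interpreted qualitatively.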

Uses of Corpus Linguistics:


Lexicography: creating dictionaries
Discourse analysis
Language Acquisition
Analysis of Political Discourse

History of Corpus Linguistics


1. Early corpus linguistics (before 1950s)
2. Chomsky's criticism (1957)
3. Modern corpus linguistics (1950s-1990s)
4. Quirk's Survey of English Usage
5. Brown Corpus (1961)
6. London-Lund Corpus (LLC)
7. British National Corpus (BNC) (1995)
8. Work by the Neo-Firthians
9. Sinclair's contribution

Early corpus linguistics (before 1950s)


Although the term corpus linguistics first appeared in the 1980s, corpus-based language study has a substantial history.
Corpus linguistics before the advent of Chomsky (1957) is known as early corpus linguistics.
In the pre-1950s era, corpus linguistics emerged as a methodological tool for linguistic research, emphasizing the systematic analysis of large collections of text to uncover language patterns. Pioneering efforts were made by scholars such as Zellig Harris, Boas (1940) and Bloomfield, who used early forms of corpus analysis to explore language structures and patterns.

Chomsky criticism (1957)


However, corpus linguistics faced criticism from Noam Chomsky and others who favored a more theoretical, rule-based approach, which marginalized corpus-based methods.
I. Chomsky, as a rationalist, rejected the corpus as a source of evidence in linguistic enquiry.
II. Linguists must study competence rather than performance, and a corpus is a record of performance. Performance itself is a poor mirror of competence.
III. The possible sentences of a language are infinite, so no corpus can contain them all. Any corpus is therefore incomplete and skewed.

Modern Corpus linguistics (1960s)


• Quirk's Survey of English Usage (1960s-1970s): The development of modern corpus linguistics began with the pioneering work of Randolph Quirk and his colleagues in the 1960s and 1970s.
• The "Survey of English Usage" at University College London played a pivotal role in collecting and analyzing large amounts of linguistic data.
• This survey marked a significant shift towards empirical approaches, using spoken and written texts to investigate language patterns.

The Brown Corpus


• The first modern, electronically readable corpus was the Brown Corpus of Standard American English.
• It was published in 1963-1964 by Nelson Francis and Henry Kucera at Brown University, USA.
• The corpus consists of one million words of American English.
• To make the corpus a good standard reference, the texts were sampled in different proportions from 15 different text categories: Press (reportage, editorial, reviews), Skills and Hobbies, Religious, Learned/scientific, Fiction (various subcategories), etc.

The London-Lund Corpus (LLC)


• The LLC was the first computer-readable corpus of spoken language. It consists of 100 spoken texts of approximately 5,000 words each, about half a million words in total.
• The texts are classified into different categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration, etc. The texts are orthographically transcribed and have been provided with detailed prosodic marking.

British National Corpus 1995


• In 1995 another large corpus was released: the British National Corpus (BNC).
• This corpus consists of some 100 million words. It contains both written and spoken material.
• The texts have been encoded with mark-up providing information about the texts, authors and speakers.

J. R. Firth's contribution
J. R. Firth, a key figure in the development of corpus linguistics, emphasized the importance of studying language in its natural context. His concept of "context of situation" influenced the field, encouraging researchers to explore language within its broader communicative setting.

Sinclair and the COBUILD Project


COBUILD, or Collins Birmingham University International Language Database, was a project led by the linguist John Sinclair at the University of Birmingham in the 1980s. Its primary purpose was to create a comprehensive corpus of English language texts, analyzing and cataloging words based on their actual usage rather than prescriptive rules.
The goal was to develop more accurate and descriptive dictionaries and language resources that reflected how words and expressions were used in real contexts.
COBUILD revolutionized lexicography by emphasizing the importance of language as it is actually spoken and written.
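
Studying a word through its real contexts is usually done with a concordance, which lists every occurrence of a keyword together with the words around it (keyword in context, or KWIC). The sketch below is an illustration only and does not reproduce the COBUILD software. The function name kwic, the window size and the sample sentences are assumptions.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Hypothetical helper: tokens are lowercase word characters, and
    the context window ignores sentence boundaries for simplicity.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Invented sample text used only for demonstration.
sample = "She made a strong case. He drinks strong tea every morning."
for left, word, right in kwic(sample, "strong"):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} [{word}] {right}")
```

Reading down the aligned column shows which words tend to occur next to the keyword, which is the kind of evidence a usage-based dictionary entry can draw on.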
