WHAT IS CORPUS LINGUISTICS?
The main field that makes use of corpora for the analysis of
language is called Corpus Linguistics.
Corpus Linguistics is both a method for linguistic analysis and a
theoretical framework.
More specifically, Corpus Linguistics is the analysis of
language based on examples of real-life language use.
Corpora are essentially bodies of electronically encoded
text; they are electronically encoded because corpora are
digital files.
Corpus linguistics is based on a quantitative methodology,
because:
- corpora consist of thousands or even millions of words, serving
as samples of naturally occurring language;
- it deals with large quantities of data, where the software
programme does the counting, so the process is electronic rather
than manual;
- calculations are carried out electronically by software
programmes, allowing for complex analysis of extensive texts.
There are many software programmes for the analysis of
corpora, e.g. Sketch Engine.
Corpora are large samples of a language, used as a standard
reference for analysing frequent patterns in that language.
(frequent pattern: the frequent use of specific words).
This methodology, Corpus Analysis or Corpus Linguistics,
is useful for spotting particular linguistic phenomena.
So, through the analysis of examples of real-life language use,
that is, through the analysis of corpora, it is possible to spot
particular linguistic phenomena and to see how people use
language in real life.
ANNOTATION
So, a corpus is a collection of texts (a corpus of news stories
contains news articles; a corpus of political speeches contains
political speeches).
But, in addition to the text itself, corpora can embed other
information, for example additional linguistic information about
who produced the text, or about whether a word is a noun or a
verb. This process is called Annotation: annotation means adding
extra linguistic information to the corpus, which can be useful
to the researcher when analysing the corpus, but also to the
software programme, in order to process the corpus more
efficiently.
Annotations are especially useful in spoken corpora
(conversations and interviews transcribed and then collected into
a corpus), because they may include information about the gender
of the person who is speaking, their age, their socio-economic
status, and so on. This is relevant if we are conducting an
analysis, for example, on the way teenagers speak. Additional
linguistic information is included in order to help the researcher
in the analysis.
Corpus-based methods are not new: they have been used since
the 19th century, although in a different, manual way, since they
were not electronic. Another interesting fact about Corpus
Linguistics is that corpora have been employed in different
fields, for example for the creation of dictionaries, for forensic
linguistics, or for foreign language description.
CORPUS BASED APPROACH
Now, the approach that we're interested in, is the Corpus-Based
Approach to Discourse Analysis, which depends on both
quantitative and qualitative techniques.
Corpus Linguistics can be helpful in discourse analysis because
it helps analysts identify discourses in language use, revealing
hidden ideologies through a wider sample of language from real
contexts.
The combination of these two approaches is strategic, because:
Corpus Analysis on its own is a quantitative method, so it has to do
with numbers and statistics, but it is limited because it does not
connect linguistic patterns to ideological assumptions in discourse.
On the other hand, CDA on its own is a qualitative method, which
means that it allows the unveiling of ideologies in language, but
it is a manual, limited methodology, because the researcher can
only focus on one or two texts at a time.
The combination of the two approaches allows for a quantitative
and qualitative approach, the numbers and the data brought in by
Corpus linguistics and the analysis of discourse coming from CDA.
They complete each other: corpus analysis provides
quantitative data supporting the analysis of language in use,
while discourse analysis offers the interpretation of these data,
provided by the researcher.
However, there are some advantages and disadvantages of
the Corpus-Based Approach to Discourse Analysis.
Using a corpus provides an objective basis for analysing
language, because researchers can access large sets of data
and they can find results based on real language examples,
rather than just anecdotal evidence.
However, even if the presence of the researcher is important for
the interpretation of data, that interpretation can still be
biased, and this is one limit of corpus-based approaches: the
researcher's perspective may not be objective, since it is an
interpretation. Of course, there are data and numbers to be
interpreted, but the perspective of the researcher may be biased,
so there may be more than one possible analysis.
Other advantages:
- Corpus data are interesting to discourse analysts because they
can reveal counter-hegemonic discourses.
- Discourses are not static; a corpus allows for the comparison of
different time periods, through the diachronic corpus.
- Triangulation, the use of multiple methods of analysis, is made
easier.
Tognini-Bonelli (2001) makes a useful distinction between:
Corpus-based investigations: this approach uses a corpus
as a source of examples, to check researcher
intuition/ideas.
Corpus-driven investigations: it’s a more inductive way -
the corpus itself is the data and the patterns in it are
noted as a way of expressing regularities (and exceptions)
in language.
Triangulation, the use of multiple methods of analysis, is now
accepted by most researchers because it makes it easier to verify
hypotheses, and it allows researchers to adapt to unexpected
problems during their research.
Different Types of Corpora
Researchers create different types of corpora (big collections of
texts), based on the goals they have in mind.
1) Specialized Corpora: this type of corpus looks at a particular
variety or genre of language, like the language used in
newspapers or on a particular topic. For instance, researchers
like Johnson and others created a corpus of British newspaper
articles that included references to the concept of political
correctness. The criterion for inclusion in this corpus was that
each article had to contain terms like "politically correct," "PC,"
or "political incorrectness."
2) The difference between CORPORA and TEXT ARCHIVES OR
DATABASE should be pointed out:
- Archives or databases simply collect texts; they are
repositories of texts.
- A corpus is different because it is a balanced collection of
texts/data, built according to explicit design criteria, for a
specific purpose. Explicit design criteria means the rules,
established in advance, that guide the researcher in the
collection of data.
Sampling: Another aspect of traditional corpus building is in
sampling.
When building a corpus, researchers often take samples from
different texts. For example, if they are creating a corpus of
English literature, they might select parts of Pride and Prejudice or
Wuthering Heights. This helps to avoid having too much text from
just one source, making the corpus more balanced/representative.
Frequency of Topics: when creating a specialized corpus, it’s
important to consider how often the topic appears. For example, if a
researcher is interested in articles about unmarried mothers, a
smaller corpus with frequent mentions of this topic can
provide useful insights. So, in this case, the quality of the texts is
more important than the quantity.
3) Diachronic Corpus: This type of corpus allows researchers to
study how language changes over time, this is because
discourses are not static. Since language evolves, using a
diachronic corpus helps researchers check these changes
across different periods. Moreover, the use of diachronic
corpus enables researchers to see how social changes
influence language changes.
4) Reference Corpus: consists of a large corpus which is
representative of a particular language variety. An
example of Reference Corpus is the British National Corpus
(BNC), which has about one hundred million words from
newspapers, academic books, letters, and spoken data.
In summary, researchers use different types of corpora based on
their specific needs, focusing on quality, topic frequency, and the
evolution of language over time.
HOW TO BUILD YOUR OWN CORPUS
First of all, we can decide to use an existing corpus or we can
decide to design our own corpus.
Of course, this decision depends on the purpose of our analysis.
Popular existing corpora are usually very big, and some of these
are:
the British National Corpus (BNC);
the International Corpus of English;
the Brown Corpus,
or the Corpus of Global Web-based English (GloWbE).
If we want to look at something very specific, we cannot use any
existing corpus, but we must build our own corpus.
EX: “the representation of women in news stories of violence”.
In order to build a corpus, we need a starting point: a database
collecting news articles about women. We can use one of these
databases: LexisNexis, ProQuest, or Nexis UK.
Otherwise, we can use Google News, which is free.
The first stages of corpus building are: finding and selecting
texts, obtaining permissions, transferring them to electronic
format, and checking them; these stages will also shape the
researcher's initial hypotheses.
How big should a corpus be?
About the length, a standard size hasn’t been established so
corpora can have different sizes according to your research goal.
The size of a corpus should in fact be related to its function. If
we are interested in how Indian English has changed in the last 20
years, we will need a huge corpus because we need a lot of data.
If, instead, we are interested in the analysis of discourse, then
the corpus can be big or small.
The timeframe is our own decision, because we have to decide
which time span we're interested in. Let's say that we want to
have a look at news stories of violence between the years
2020-2023.
Permissions
Before you use texts, you have to get permission from the authors
or publishers, especially if you want to share your corpora with
others. This can be slow and complicated, especially for large
corpora. If it's for your own use and only from a few sources, it’s
usually easier to get permission. Letting them know you won’t sell
or share the work can help.
There are also ethical issues when analyzing texts that may show
negative sides, like racist or sexist content. In some cases,
getting permission may not be possible.
TOOLS FOR THE ANALYSIS OF CORPORA
Corpus linguistics doesn’t provide a single way of analysing data,
but there are different techniques: collocation, keywords,
frequency lists, clusters, dispersion plots etc.
1) FREQUENCY
Frequency is one type of analysis: the analysis of how often
words occur. Words tend to occur with other words, with a
certain degree of predictability.
Sometimes frequencies, and language in general, follow certain
grammar rules.
However, some other times people make some choices about the
words they use, which can show their ideological standpoints or
intentions. For instance, there is an ideological purpose in
representing refugees in combination with numbers, because
numbers give the idea of a mass invasion, so it’s ideological.
The idea of frequency is significant because it shows the
tension between language as a system of rules and
language as free choice.
There is an interesting study conducted by Zwicky (1997), who
investigated the grammatical choice between the use of the word
gay as a noun or an adjective:
Saying "he is gay" seems less negative than "he is a gay." Here,
using "gay" as an adjective suggests that it's just one part of their
personality, it doesn’t reduce the person to their sexuality; while
using it as a noun reduces the person to their sexuality.
FREQUENCY COUNTS: WORD LIST
Frequencies can be analysed using different methods, one of which
is called Word List.
These word/frequency lists are lists of the most frequent words
in a given corpus, and they provide:
- the number of occurrences;
- the percentage frequency.
By looking at these Lists, we have an idea of what the corpus is
about or whether there are any specific trends.
The representation of a word list made by Sketch Engine consists
of different columns:
- The frequency column shows the number of times the word
occurs in our corpus (e.g. the article the).
- The relative frequency column shows the number of occurrences
per million tokens.
- The DOCF (document frequency) column shows the number of texts
containing that word, and also shows the same number converted
into a percentage.
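As a rough sketch, the arithmetic behind these three columns can be reproduced by hand over a tiny invented corpus (the three documents below are made up for illustration; real tools like Sketch Engine use proper tokenisation and far larger data):

```python
from collections import Counter

# Three invented mini-documents standing in for a corpus.
docs = [
    "the refugees fled the camp",
    "thousands of refugees crossed the border",
    "the minister visited the camp",
]

tokens_per_doc = [d.split() for d in docs]          # naive whitespace tokenisation
all_tokens = [t for doc in tokens_per_doc for t in doc]
total = len(all_tokens)

freq = Counter(all_tokens)                           # raw frequency column
rel_freq = {w: c / total * 1_000_000 for w, c in freq.items()}  # per million tokens
docf = {w: sum(w in doc for doc in tokens_per_doc) for w in freq}  # document frequency

for word, count in freq.most_common(3):
    print(word, count, round(rel_freq[word]), docf[word])
```

With only 16 tokens the per-million figures are inflated, but the relationship between the three columns is the same as in a real word list.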
From the corpus textbook:
The representation of a word list made by WordSmith consists of
three windows:
The (F) window shows the most frequent words first.
The (A) window re-orders the word list alphabetically
The (S) window gives statistical information about the
corpus, including the total number of words (tokens), the
number of original words (types) and the type/token ratio,
which is simply the number of types divided by the number of
tokens expressed as a percentage.
For example, the word you may occur several times in the
corpus, although it only counts as one type.
A corpus with a low type/token ratio will contain many
repeated words, whereas a high type/token ratio suggests
that a more diverse form of language is being employed. Low
ratios are common in large texts, since common words like
"the" and "to" repeat often.
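The type/token ratio is simple to compute; a minimal sketch, assuming whitespace tokenisation and lower-casing (real tools also strip punctuation), over an invented sentence:

```python
text = "the cat sat on the mat and the dog sat too"  # invented example
tokens = text.lower().split()          # every running word is a token
types = set(tokens)                    # each distinct word counts once as a type
ttr = len(types) / len(tokens) * 100   # type/token ratio as a percentage
print(len(tokens), len(types), round(ttr, 1))
```

Here the repeated the and sat lower the ratio, exactly as repeated function words do in a large corpus.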
2) CLUSTER
Definition taken from the internet: according to Paul Baker, a
cluster is a group of words that are semantically related
and often occur together in a text. This creates a pattern or
"cluster" of related vocabulary, e.g. GLOBAL WARMING.
From Baker's book: the notion of clusters is important
because it begins to take into account the context that a
single word is placed in.
Another useful function of WordSmith is to search frequency lists
for clusters of words. So, carrying out his analysis, Paul Baker
searched for clusters which contained the words BAR/BARS.
The research revealed that some of the most common patterns
were: bars and clubs, loads of bars, the best bars, bar serving
snacks, pool and bar, well stocked bar, 24-hour open bar. So the
clusters mainly had the function of describing bars.
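Baker's cluster search can be approximated by extracting n-grams and keeping those that contain the node word; the sentences below are invented stand-ins for his holiday-brochure data:

```python
from collections import Counter

# Invented brochure-like snippets, joined into one token stream.
text = ("the best bars and clubs near the hotel ; "
        "a well stocked bar serving snacks all day ; "
        "loads of bars and clubs in town")
tokens = text.split()

# All 3-word sequences (trigrams) in the text.
trigrams = [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

# Keep only clusters containing the node word bar/bars, with counts.
bar_clusters = Counter(g for g in trigrams
                       if "bar" in g.split() or "bars" in g.split())
print(bar_clusters.most_common())
```

Even on this toy input the recurring pattern "bars and clubs" surfaces, which is the kind of regularity a cluster search is meant to reveal.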
3) DISPERSION PLOT
A dispersion plot is a concept linked to frequency: it is
important to know how frequently something occurs in a text, but
it is even more important to know where exactly in the text it
occurs. Another way of looking at a word is therefore to analyse
where in the text it occurs. For this reason, it can be useful to
produce a dispersion plot, which gives a visual representation
of where in the text the search term occurs.
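The computation behind a dispersion plot is just this: record each position of the search term and normalise it to the 0..1 range that the plot visualises. A minimal sketch over an invented text:

```python
# Invented text; a real analysis would load a whole corpus file.
tokens = "refugees arrived and more refugees were expected as refugees fled".split()
term = "refugees"

# Normalised positions: 0.0 = start of text, 1.0 = end of text.
positions = [i / (len(tokens) - 1) for i, t in enumerate(tokens) if t == term]
print([round(p, 2) for p in positions])
```

A plotting library would then draw a tick mark at each position; evenly spread ticks mean the term runs through the whole text, clustered ticks mean it is concentrated in one part.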
4) CONCORDANCES
The analysis of concordances is another tool in Corpus analysis.
Concordances are all the occurrences of a search term
within the context in which they occur. Context is given by
showing a few words on the left and on the right of the search
term.
The search term is also known as a "keyword in context," which
means a word that is under examination.
The aim of creating concordances is to conduct a close
examination of our search term.
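A keyword-in-context display can be sketched in a few lines; this assumes a fixed window of three words on each side, whereas real concordancers make the window configurable:

```python
def kwic(tokens, term, window=3):
    """Return keyword-in-context lines for every occurrence of term."""
    lines = []
    for i, t in enumerate(tokens):
        if t == term:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Right-align the left context so the search term lines up.
            lines.append(f"{left:>30} | {term} | {right}")
    return lines

# Invented sentence standing in for a newspaper corpus.
tokens = "thousands of refugees fled across the border as more refugees arrived".split()
for line in kwic(tokens, "refugees"):
    print(line)
```

Each output line shows the left context, the search term, and the right context, vertically aligned so that recurring patterns around the term are easy to scan.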
In order to show us how we can use concordances in a discourse
analysis, Paul Baker made a case study with a corpus of
newspaper articles.
First of all, we need to take into consideration that newspaper
articles are a very interesting corpus to analyse, because
journalists have the power to influence their readers by sharing
their own opinions or those of powerful people.
Becker calls this the hierarchy of credibility: powerful people
will have their opinions accepted easily because they are believed
to be better informed about things.
Investigating discourses of refugees
The topic of Baker’s analysis is refugees: a topic particularly
interesting to analyse in terms of discourse, because refugees are
among the most powerless groups in society.
Van Dijk also points out that minority groups, such as refugees, are
frequent topics of political talk, but they have very little control over
the way they are represented.
A. BUILD THE CORPUS
In order to construct the corpus, Paul Baker used Newsbank,
a digital archive where a wide range of British newspapers
can be found. This allowed him to create a corpus with different
newspaper articles, containing a wide range of political and
ideological positions. Newsbank allows researchers to search
for articles containing a particular word or phrase.
Paul Baker built his corpus by searching for the word
refugee/refugees separately in newspaper articles from 2003
on. The results gave 140 examples to examine.
B. CREATE CONCORDANCE
Once we establish the corpus we are working on, it is necessary to
create a concordance. This can be done by loading the corpus
into a concordancing program. There are different software
programmes that serve this purpose; Paul Baker chose WordSmith
Tools.
The next step at this point is to look for patterns of
language use
C. FINDING PATTERNS
The first pattern identified is that of quantification: people
talk about refugees in terms of quantity, not only using exact
numbers but also expressions like "thousands" or "more
and more." This emphasis on quantification suggests that
the number of refugees is troublesome, a source of concern.
The second pattern identified was the tendency to describe
refugees in terms of movement, using verb phrases such
as "fleeing refugees", making refugees seem like victims. The
movement of refugees is constructed as a natural force, as
something out of control.
Sometimes refugees are even compared to water, as in a
"flood of refugees" or "overflowing refugee camps", which
describes them as a natural disaster: unexpected, unwanted
and hard to control.
Another pattern is the occurrence of "for refugees" with words
such as help, assistance, shelter, rescue. This refers to all
the actions taken in order to try to help refugees.
Another pattern involves the phrases the plight of X refugee(s),
the despair of its refugees, the tragedy of 22 million
refugees. The words plight, tragedy and despair all have
similar meanings, and they suggest that refugees are a
problem that needs to be solved.
Semantic preference and discourse prosody
An important distinction in Corpus Based Approach is that between
semantic preference and discourse prosody.
A semantic preference is the relation between a word-form
and a set of semantically related words.
Semantic preference also occurs with multi-word units. For
example, glass of co-occurs with a lexical set of semantically
related words for drinks and liquids: e.g. sherry, lemonade,
water, wine.
Semantic preference is not confined to individual words but
extends to semantic categories; moreover, it denotes meanings that
do not depend on the speaker but are dictated by language.
This is the main difference: discourse prosody has to do with
speaker’s attitude towards linguistic choices.
Discourse prosody is when patterns can be found between a
word or a phrase and a set of related words that suggests a
discourse (the presence of an ideology, evaluation..)
An example of discourse prosody is the pattern quantifier +
refugees, which is not dictated by grammar rules but depends on
the writer's or speaker's linguistic attitude: it is a decision.
The intent is to describe refugees as a mass invasion and a
source of concern.
5) COLLOCATES (the words that co-occur)
Collocation (the phenomenon) is when a word regularly
occurs with another word and their relationship is
statistically significant. Statistically significant means that this
co-occurrence of one word + another is very frequent in our corpus,
so it must mean something.
Blommaert points out that much of how we communicate isn't just
based on choices but is shaped by social rules related to
inequality. This is where language corpora can help: when two
words appear together often in everyday language, this provides
stronger evidence for understanding communication than a single
example.
For example, in the British magazine Outdoor Action, there's a
sentence that says: "Diana, herself a keen sailor despite being
confined to a wheelchair..." While this sentence shows a positive
view of people with disabilities, it also raises some questions about
language use.
When we look at a large corpus of general British English (maybe
BNC), we find that the words "confined" and "wheelchair" have
strong patterns of co-occurrence with each other.
The phrase confined to a wheelchair occurs 45 times in the
corpus. This supports existing beliefs about people in
wheelchairs: confined suggests people limited to the wheelchair,
not able to do things.
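The raw material for such claims is a count of which words occur near the node word. A minimal sketch of window-based collocate counting (the sentence is invented; a real analysis would run over the whole BNC):

```python
from collections import Counter

# Invented text echoing the "confined to a wheelchair" example.
tokens = ("confined to a wheelchair she remained a keen sailor "
          "despite being confined to a wheelchair").split()

node = "wheelchair"
window = 3  # look 3 words to the left and 3 to the right
collocates = Counter()
for i, t in enumerate(tokens):
    if t == node:
        # Collect the words in the window around this occurrence,
        # excluding the node word itself.
        span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        collocates.update(span)
print(collocates.most_common(3))
```

These window counts are what the statistical measures discussed later (such as Mutual Information) take as input.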
Collocation is a way of understanding meanings and
associations between words. In order to illustrate the
phenomenon, we now consider the subject of people who
have never been married.
PAUL BAKER COLLOCATION ANALYSIS: BACHELOR(S) AND
SPINSTER(S)
The topic of Paul Baker’s analysis is people who never got
married.
In British English, we can refer to never-married men as
BACHELORS and unmarried women as SPINSTERS.
In this case, Paul Baker didn't build a corpus but used an
existing one, the British National Corpus (BNC), because it is
quite representative of British English. The concordance
programme used was BNCweb, which allows researchers to carry out
concordance and collocation analyses.
A bachelor is often seen positively, associated with youth,
attractiveness, wealth, personal choice, and sexual activity. In
contrast, a spinster carries negative connotations, suggesting an
unattractive, older woman who is poor, unchosen, and sexually
inactive. This lexical inequality highlights gender bias where
being unmarried is seen as positive for men but undesirable for
women.
Paul Baker examines these associations and assumptions
through the analysis of collocations.
1. FIRST ANALYSIS
Baker carried out his analysis combining singular and plural forms
of the two search terms; the results showed that
bachelor/bachelors occurred more frequently than
spinster/spinsters.
The high frequency of "bachelor" can be attributed to its dual
meaning: bachelor is a case of polysemy (it has more than one
meaning), since it also refers to the bachelor's degree,
particularly in scientific and arts contexts.
The BNC showed that these words are more frequent in written text
than in spoken conversation, because bachelor(s) often occurs in
the science domain, where the term refers to a type of academic
degree.
2. CALCULATION OF COLLOCATES
To investigate collocates, Baker applied several statistical
methods. The basic approach involves counting the occurrences of
words within a specified window of n-words surrounding the search
terms. However, function words dominate initial frequency counts,
necessitating a shift in focus toward grammatical and lexical
collocates.
Baker highlights the use of Mutual Information (MI) as a statistical
test, which calculates the probability of two words co-occurring
compared to their general frequencies in the corpus. While MI
effectively identifies potential collocates, it can overestimate the
importance of words that are used less often.
As a solution, Baker recommends complementary statistical
methods, like the log-log technique, which emphasizes more
common lexical words.
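The MI test Baker describes uses the standard formula MI = log2((f(x,y) * N) / (f(x) * f(y))). A hedged sketch of the calculation; the counts below are invented for illustration, not taken from the BNC:

```python
import math

# Invented counts standing in for corpus frequencies.
N = 1_000_000        # total corpus size in tokens
f_node = 150         # frequency of the node word (e.g. "spinster")
f_collocate = 900    # frequency of the candidate collocate (e.g. "elderly")
f_pair = 20          # co-occurrences within the chosen window

# Mutual Information: how much more often the pair occurs
# than expected from the two words' independent frequencies.
mi = math.log2((f_pair * N) / (f_node * f_collocate))
print(round(mi, 2))
```

An MI score of 3 or higher is often treated as evidence of collocation; note how a rare collocate makes the denominator small and can inflate the score, which is exactly the overestimation problem the log-log technique is meant to correct.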
3. IDENTIFYING DISCOURSES FROM COLLOCATES
Using the log-log algorithm, we find a list of collocates for each
search term.
Baker found that bachelor/bachelors has more collocates than
spinster/spinsters. This is due to the fact that:
- bachelor occurs more frequently in the corpus;
- it is a case of polysemy: as a matter of fact, many
occurrences of the word bachelor have nothing to do with
unmarried men, but refer to the academic degree.
The list of bachelor’s collocates includes:
ELIGIBLE, DEGREE, MALES, EDUCATION, ELDERLY, ARTS,
BROTHER, PARTY, DAYS, STATUS
ELIGIBLE reflects a positive view of single men;
BROTHER and SON: occurrences relating to inheritance or family
trees;
DAYS and PARTY: have a positive discourse prosody connected to
nostalgia;
ELDERLY: suggests a negative prosody as age increases;
MALES: a science category used to talk about animals and their
sexual behaviour;
STATUS: describes the unmarried state of a man, and was found
with both positive and negative connotations.
So the collocates of bachelor(s) suggest a double picture. Young
bachelors have a positive discourse prosody, because bachelor
days are described as happy. However, bachelorhood starts to have
a negative discourse prosody as age increases, as the collocate
elderly suggests.
Spinster(s) only has four collocates: ELDERLY, WIDOWS,
SISTERS and THREE.
The collocates of spinster suggest that there is a characterization
of spinster as victim and widow and suggest an image of an
older and unattractive woman being unmarried.
The term goes back to the 13th century, when it referred to women
who spun wool for a living. This historical reference helps
explain the collocation of spinster with elderly and widows. The
collocation with three is particularly interesting: analysing the
collocations of this word in the BNC, we can see that it occurs
with a wide range of different terms, including sisters. It could
be a reference to Shakespeare’s play Macbeth.
4. RESISTANT DISCOURSES
Collocates may also contain traces of resistant discourses. In
this case, we notice that the concept of a young spinster is totally
absent from the corpus. Males, on the other hand, have a double
discourse prosody according to the age. Young bachelors are happy
while older bachelors tend to be lonely. This suggests that the
view of being unmarried is constructed differently
according to gender. Spinster is used in association with a
negative evaluation of being unmarried for women, while bachelor
has positively connoted collocates, so the condition of being
unmarried for men is more socially acceptable.
6) KEYNESS
Keyness is the degree of saliency that a word has within a
corpus A compared to the same word in corpus B, which is
the reference corpus.
Frequency gives you the number of times that word appears, while
Keyness is the degree of saliency. The words that are very important
in one corpus are called keywords. Keyness is automatically
calculated by the software programme.
Investigating the reasons why a particular word appears so
frequently in a corpus can help to reveal the presence of discourses,
especially those of a hegemonic nature. Creating a list of these
frequent words is a good starting point for research, but these lists
have their own shortcomings.
PAUL BAKER ANALYSIS
The chosen topic is the political debates on fox-hunting in the
British House of Commons. The ban was finally approved in
2005, so the corpus contains the transcripts of three debates held
in the House of Commons between 2002 and 2003. Most of those
speaking in these debates belonged to the political party that
wanted the ban to be approved.
A first analysis shows that this is one of the cases in which a
simple frequency-list search is not enough. The list of the most
frequent words, in fact, doesn't give any information about the
fox-hunting debate: the words don't have a specific connotation
related to our topic.
To better understand the important words related to the debate, we
can compare more than one frequency list: one for those who
supported the ban and one for those who opposed it. By analyzing
these two groups, we notice that some words appear with similar
frequencies on both lists, which can lead to interesting
questions, such as why "ban" is mentioned less frequently in the
speech of anti-ban speakers.
Keyness: A New Way to Analyze Words
In order to analyze the degree of saliency of words, we can use a
method called "Keyness."
Using WordSmith, you can compare the frequencies of words from
different lists to see which words from list A appear more often
compared to list B.
Every word that occurs more often than expected in list A, when
compared to list B, is added to a keyword list. This keyword list
provides information about saliency, while a basic frequency
list only shows information about frequency.
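Keyword comparisons of this kind typically use the log-likelihood statistic. A minimal sketch with invented corpus sizes, using the 38-versus-2 counts for "criminal" from the hunting debate as the word frequencies:

```python
import math

def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood keyness for a word occurring a times in corpus A
    (total_a tokens) and b times in corpus B (total_b tokens)."""
    # Expected frequencies if the word were evenly spread across both corpora.
    e_a = total_a * (a + b) / (total_a + total_b)
    e_b = total_b * (a + b) / (total_a + total_b)
    ll = 0.0
    for obs, exp in ((a, e_a), (b, e_b)):
        if obs > 0:
            ll += obs * math.log(obs / exp)
    return 2 * ll

# "criminal": 38 occurrences on the pro-hunting side, 2 on the anti-hunting
# side; the 50,000-token corpus sizes are assumed for illustration.
ll = log_likelihood(38, 2, 50_000, 50_000)
print(round(ll, 1))
```

A log-likelihood value above 3.84 corresponds to significance at p < 0.05, so a word with this score would comfortably make the keyword list for the pro-hunting side.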
Analyzing Keywords
The analysis of keywords from the debate on hunting reveals
significant linguistic patterns. The term "criminal" emerges as a
dominant keyword used by pro-hunters, appearing 38 times
compared to its two occurrences among anti-hunters.
Without a context, such information doesn’t say a lot. For this
reason, it is necessary to examine individual keywords in more
detail, using concordance and collocation analysis.
CONCORDANCE ANALYSIS: when a concordance analysis of criminal was
carried out, it was found that common phrases contained words like
criminal law, criminal sanctions or criminal act.
COLLOCATION ANALYSIS: the collocation analysis of criminal gave
more information. The modal verbs would and should were present,
as were various forms of the verb make. The lemma MAKE seemed an
important collocate, so a concordance analysis was carried out,
which showed that the anti-ban side's strategy was to frame the
fox-hunting ban as something that would criminalize the people who
were against it.
The concordance analysis of criminal also showed the use of the
verb invoke. In the BNC, invoke collocates strongly with two sets
of words: legal terms and supernatural forces. So when pro-hunters
talk about the invocation of the criminal law, the word also
evokes the invocation of spirits, something commonly considered
dangerous.