Annotation = codes used within a corpus that add information about, for example,
grammatical category. Also refers to the process of adding such information to a corpus.
Balance = a property of a corpus (or, more precisely, a sampling frame). A corpus is said to be
balanced if the relative sizes of each of its subsections have been chosen with the aim of adequately
representing the range of language that exists in the population of texts being sampled.
Colligation = more generally, colligation is co-occurrence between grammatical categories (e.g. verbs
colligate with adverbs) but can also mean a co-occurrence relationship between a word and a
grammatical category.
Collocation = a co-occurrence relationship between words or phrases. Words are said to collocate
with one another if one is more likely to occur in the presence of the other than elsewhere.
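The idea of one word being "more likely to occur in the presence of the other" can be sketched with pointwise mutual information (PMI) over adjacent word pairs. PMI is only one of several association measures and is an assumption here, not something the glossary prescribes; the function name bigram_pmi and the toy sentence are invented for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return a PMI score for every adjacent word pair in a token list."""
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams          # observed probability of the pair
        p_w1 = unigrams[w1] / n_tokens      # probability of each word on its own
        p_w2 = unigrams[w2] / n_tokens
        # A positive score means the pair occurs more often than chance predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy data, far too small for real conclusions.
tokens = "strong tea and strong coffee but powerful computers and strong tea".split()
scores = bigram_pmi(tokens)
```

A positive score for a pair such as ("strong", "tea") suggests the two words attract one another in this sample. Real studies use much larger corpora and usually a wider window than adjacent words.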
Comparability = two corpora or subcorpora are said to be comparable if their sampling frames are
similar or identical.
Concordance = a display of every instance of a specified word or other search term in a corpus,
together with a given amount of preceding and following context for each result or "hit".
Concordancer = a computer program that can produce a concordance from a specified text or corpus.
Modern concordance software can also facilitate more advanced analyses.
Corpus = from the Latin for "body" (plural corpora), a corpus is a body of language representative of a
particular variety of language or genre which is collected and stored in electronic form for analysis
using concordance software.
Corpus-based = where corpora are used to test preformed hypotheses or exemplify existing linguistic
theories. Can mean either (1) any approach to language that uses corpus data and methods, or (2) an
approach to linguistics that uses corpus methods but does not subscribe to corpus-driven principles.
Corpus-driven = an inductive process where corpora are investigated from the bottom up and
patterns found therein are used to explain linguistic regularities and exceptions of the language
variety/genre exemplified by those corpora.
Diachronic = diachronic corpora sample texts across a span of time or from different periods in time
in order to study the changes in the use of language over time. Compare: synchronic.
Frequency list = a list of all the items of a given type in a corpus (e.g. all words, all nouns, all four-
word sequences) together with a count of how often each occurs.
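A frequency list of this kind can be sketched in a few lines of Python. The function name frequency_list and the sample sentence are assumptions made for illustration; splitting on whitespace stands in for proper tokenisation.

```python
from collections import Counter

def frequency_list(tokens, n=1):
    """Count every n-token sequence; n=1 gives a plain word frequency list."""
    # Line up n shifted copies of the token list to form the sequences.
    sequences = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(seq) for seq in sequences)
    # most_common() returns (item, count) pairs, most frequent first.
    return counts.most_common()

tokens = "the cat sat on the mat and the dog sat on the rug".split()
word_freq = frequency_list(tokens)        # all words
four_grams = frequency_list(tokens, n=4)  # all four-word sequences
```

Passing a list of part-of-speech tags or lemmas instead of word forms would give a frequency list for those item types in the same way.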
Key word in context (KWIC) = a way of displaying a node word or search term in relation to its context
within a text. This usually means the node is displayed centrally in a table with co-text displayed in
columns to its left and right. Here, "key word" means "search term" and is distinguished from
keyword.
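A minimal sketch of a KWIC display, assuming a whitespace-tokenised text and a case-insensitive match on a single node word. The function name kwic and the width parameter (number of context words on each side) are invented for this example.

```python
def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every hit of `node`."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

tokens = "The cat sat on the mat while the dog slept".split()
rows = kwic(tokens, "the")

# Right-align the left context so the node lines up in a central column.
for left, hit, right in rows:
    print(f"{left:>20} | {hit} | {right}")
```

Dedicated concordance software adds sorting, wildcard searches and larger context windows on top of this basic layout.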
Keyword = a word that is more frequent in a text or corpus under study than it is in some (larger)
reference corpus. The difference in the word's frequency between the two corpora must be
statistically significant for it to count as a keyword.
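One common way to test whether a frequency difference is statistically significant is the log-likelihood statistic. The sketch below assumes that measure and a critical value of 3.84 (p < 0.05 with one degree of freedom); the glossary itself does not name a particular test. The function names and all counts are invented for illustration.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood score for one word in a study and a reference corpus."""
    total = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    score = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

def is_keyword(freq_study, size_study, freq_ref, size_ref, critical=3.84):
    """True if the word is relatively more frequent in the study corpus
    and the difference exceeds the chosen critical value."""
    overused = freq_study / size_study > freq_ref / size_ref
    score = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    return overused and score > critical

# Invented counts: 50 hits in 10,000 tokens vs 100 hits in 1,000,000 tokens.
result = is_keyword(50, 10_000, 100, 1_000_000)
```

Because the score depends on corpus sizes as well as raw counts, the sizes of both corpora have to be supplied alongside the word frequencies.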
Lemma = a group of words related to the same base word differing only by inflection.
- Walked, walking, and walks are all part of the verb lemma walk
Lemmatisation = a form of annotation where every token is labelled to indicate its lemma.
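As a toy sketch of lemmatisation, each token can be paired with a lemma taken from a lookup table. The table LEMMA_TABLE and the function lemmatise are hypothetical and cover only the walk example; real lemmatisers rely on large lexicons and part-of-speech information.

```python
# Hypothetical lookup table covering only the "walk" example.
LEMMA_TABLE = {"walked": "walk", "walking": "walk", "walks": "walk"}

def lemmatise(tokens, table=LEMMA_TABLE):
    """Label every token with its lemma; unknown words map to their lowercase form."""
    return [(token, table.get(token.lower(), token.lower())) for token in tokens]

annotated = lemmatise(["Walked", "walks", "home"])
```

Each token keeps its original form alongside the label, so the annotation adds information without altering the text.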
Lexis = the words and other meaningful units (such as idioms) in a language. The lexis or vocabulary
of a language is usually viewed as being stored in a kind of mental dictionary, the lexicon.
Metadata = the texts that make up a corpus are data. Metadata is data about that data – it gives
information about things such as the author, publication date, and title for a written text.
Monitor corpus = a corpus that grows continually, with new texts being added over time so that the
dataset continues to represent the most recent state of the language as well as earlier periods.
Node = in the study of collocation – and when looking at a key word in context (KWIC) – the node
word is the word whose co-occurrence patterns are being studied.
Reference corpus = a corpus which, rather than being representative of a particular language variety,
attempts to represent the general nature of a language by using a sampling frame emphasizing
representativeness.
Representativeness = a representative corpus is one sampled in such a way that it contains all the
types of text, in the correct proportions, that are needed to make the contents of the corpus an
accurate reflection of the whole of the language or variety of language that is sampled.
Sample corpus = a corpus that aims for balance and representativeness within a specified sampling
frame.
Sampling frame = a definition, or set of instructions, for samples to be included in a corpus. A
sampling frame specifies how samples are to be chosen from the population of text, what types of
texts are to be chosen, the time they come from and other such features. The number and length of
the samples may also be specified.
Statistical significance = a quantitative result is considered statistically significant if there is a low
probability (usually lower than 5%) that the figures extracted from the data are simply the result of
chance. A variety of statistical procedures can be used to test statistical significance.
Synchronic = relating to the study of language or languages as they exist at a particular moment in
time, without reference to how they might change over time. A synchronic corpus contains texts
drawn from a single period – typically the present or very recent past.
Tagging = an informal term for annotation, especially forms of annotation that assign an analysis to
every word in a corpus (such as part-of-speech or semantic tagging).
Text = as a count noun it is any artefact containing language usage – typically a written document or a
recorded and/or transcribed spoken text. As a non-count noun it is collected discourse, on any scale.
Token = any single, particular instance of an individual word in a text or corpus.
Type = a single particular wordform. Any difference of form (e.g. spelling) makes a word a different
type. All tokens comprising the same characters are considered to be examples of the same type.
Can also be used when discussing text types.
Type-token ratio = a measure of vocabulary diversity in a corpus, equal to the total number of types
divided by the total number of tokens. The closer the ratio is to 1 (or 100%), the more varied the
vocabulary is. This statistic is not comparable between corpora of different sizes.
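The ratio can be computed directly from a token list, as in the sketch below. The function name type_token_ratio and the sample sentence are assumptions for illustration; tokens are compared exactly as written, so differently spelled or capitalised forms count as different types.

```python
def type_token_ratio(tokens):
    """Number of distinct types divided by the total number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty text")
    return len(set(tokens)) / len(tokens)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
ttr_full = type_token_ratio(tokens)       # 8 types over 13 tokens
ttr_start = type_token_ratio(tokens[:4])  # first four tokens only
```

The first four tokens are all different, so their ratio is higher than that of the full sentence. This is the size effect in miniature: as a text grows, words repeat and the ratio tends to fall.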