Words: Collocations
Lecturer: Ana Ozaki
Semester 01/2019
Free University of Bozen-Bolzano
Collocations
Our topic today is a linguistic phenomenon called
collocation.
‘Natural way’ of using words.
Applications: information retrieval, support text editors,
artificial text generation, etc.
Note: this material is based on Chapter 5 of (Manning
and Schuetze, 1999).
Collocations
‘A collocation is an expression consisting of two or more
words that correspond to some conventional way of
saying things.’ (Firth 1957)
E.g. we say ‘strong tea’
instead of ‘powerful tea’.
E.g. we say ‘broad daylight’
instead of ‘bright daylight’.
Collocations: Characteristics
Collocations are characterised by limited compositionality.
A language expression is compositional if the meaning of the
expression can be predicted from the meaning of the parts.
Example of a collocation: white wine (the wine is not literally white, so the meaning is only partly predictable from the parts)
Not a collocation: expensive wine (the meaning follows directly from ‘expensive’ and ‘wine’)
Collocations: Characteristics
Limited substitutability: cannot substitute by other words, even if in the
context they have the same meaning.
E.g., cannot say yellow wine instead of white wine
Limited modifiability: cannot be freely modified with additional lexical
material or through grammatical transformations.
Difficulty in translating the expression into another language.
(Italian) Fame da lupo ‘wolf's hunger’, (German) Bärenhunger ‘bear's hunger’, (Portuguese) Fome de leão ‘lion's hunger’: all three mean being very hungry.
Collocations
Other examples of collocations:
Idioms ‘Can't judge a book by its cover’
For more examples and details on this topic, please see
Chapter 5 of (Manning and Schuetze, 1999).
Finding Collocations
Frequency: if two (or more) words occur together a lot, that is
evidence that they form a collocation, by the limited
substitutability and the limited modifiability principles.
However, the selection of the most frequent bigrams is usually
not very effective.
Lots of function words: ‘of the’, ‘in the’, ‘to the’, …
A part-of-speech filter can greatly improve the results.
Adjective Noun (e.g., strong tea)
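The frequency-plus-filter idea can be sketched in a few lines of Python. This is only an illustration: the toy corpus, its hand-assigned tags (D, N, P, A, C) and the function name `bigram_counts` are assumptions made for this example, not part of the lecture material.

```python
from collections import Counter

# Hypothetical hand-tagged toy corpus: (word, tag) pairs.
# D = determiner, N = noun, P = preposition, A = adjective, C = conjunction.
tagged = [
    ("a", "D"), ("cup", "N"), ("of", "P"), ("the", "D"), ("strong", "A"),
    ("tea", "N"), ("and", "C"), ("a", "D"), ("pot", "N"), ("of", "P"),
    ("the", "D"), ("strong", "A"), ("tea", "N"), ("of", "P"), ("the", "D"),
    ("house", "N"),
]

def bigram_counts(tagged_tokens, patterns=None):
    """Count adjacent word pairs; keep only tag pairs in `patterns` if given."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if patterns is None or (t1, t2) in patterns:
            counts[(w1, w2)] += 1
    return counts

raw = bigram_counts(tagged)                               # no filter
filtered = bigram_counts(tagged, patterns={("A", "N")})   # Adjective Noun only

print(raw.most_common(1))       # the top raw bigram is a function-word pair
print(filtered.most_common(1))  # the filter leaves the adjective-noun pair
```

Without the filter, the most frequent pair in this toy corpus is ‘of the’; with the Adjective Noun pattern only ‘strong tea’ remains.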
Finding Collocations
TODO add experiment
Collocational Window
Collocations may not be fixed phrases.
Many collocations can stand in a more flexible structure than a fixed
sequence of words.
E.g., the verb ‘to knock’ forms a collocation with the noun ‘door’.
She knocked on the metal door.
A man knocked on the wooden front door.
They knocked at the door.
Other verbs such as ‘hit’, ‘beat’ or ‘rap’ do not form a collocation with ‘door’.
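A minimal sketch of a collocational window, assuming whitespace tokenisation, lower-cased text without punctuation, and a window of 5 words on each side. The function name `distances` and the window size are choices made for this example only.

```python
def distances(tokens, first, second, window=5):
    """Signed distances from each `first` to every `second` within `window` tokens."""
    found = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == second:
                found.append(j - i)
    return found

sentences = [
    "she knocked on the metal door",
    "a man knocked on the wooden front door",
    "they knocked at the door",
]

# Collect the distance between 'knocked' and 'door' in each sentence.
all_distances = [
    d for s in sentences for d in distances(s.split(), "knocked", "door")
]
print(all_distances)
```

The pair is found in all three sentences even though the words in between differ, which is what a fixed-bigram count would miss.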
Mean and Variance
One way of discovering the relationship between ‘knock’
and ‘door’ is by computing the mean and variance of the
offsets (i.e., the distance between the two words).
She knocked on the metal door. (offset 4)
A man knocked on the wooden front door. (offset 5)
They knocked at the door. (offset 3)
Mean: \mu = \frac{1}{3}(4 + 5 + 3) = 4
Mean and Variance
The variance measures how much the offsets deviate
from the mean µ:
s^2 = \frac{\sum_{i=1}^{n} (d_i - \mu)^2}{n - 1}
where d_i is the i-th offset and n is the number of offsets.
For the three example sentences (offsets 4, 5, 3):
Mean: \mu = \frac{1}{3}(4 + 5 + 3) = 4
Standard deviation: s = \sqrt{\frac{1}{2}\left((4 - 4)^2 + (5 - 4)^2 + (3 - 4)^2\right)} = 1
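The arithmetic for the three ‘knocked … door’ offsets can be checked with Python's standard library. The sketch assumes the sample (n - 1) form of the variance, which is what `statistics.stdev` computes; the variable names are chosen for this example.

```python
import statistics

# Offsets between 'knocked' and 'door' in the three example sentences.
offsets = [4, 5, 3]

mean = statistics.mean(offsets)          # (4 + 5 + 3) / 3
variance = statistics.variance(offsets)  # sample variance, divides by n - 1
std = statistics.stdev(offsets)          # square root of the sample variance

print(mean, variance, std)
```

A low standard deviation means the two words tend to occur at about the same distance, which is evidence for a collocation; a high one suggests no stable relationship.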
Mean and Variance
TODO Experiment
Mean and Variance
TODO
Hypothesis Testing
High frequency and low variance may be accidental.
To know whether words occur together (or at a close
distance) more than by chance, one can resort to a
technique from statistics called hypothesis testing.
The key point here is that we are not simply considering
the frequency but also the amount of data, so that we
can rule out events that occur by chance.
Hypothesis Testing
Null hypothesis: Assume that co-occurrences between
words are by chance.
Compute the probability p that the event would occur if
the null hypothesis H is true; reject H if p is too low.
Typically, H is rejected if p is lower than 0.05.
Hypothesis Testing
The null hypothesis states that two words v, w
do not form a collocation.
This means that v, w are generated independently, and
P(vw) = P(v) P(w)
In this model, the probability of co-occurrence is the
product of the probability of the occurrence of each word.
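One possible way to test the independence model is a t-test, as discussed in Chapter 5 of (Manning and Schuetze, 1999). The sketch below is an assumption about how such a test could be set up, not necessarily the test used in this lecture: each bigram position is treated as a Bernoulli trial, the number of bigrams is approximated by the number of tokens, and the counts in the usage example are hypothetical.

```python
import math

def t_statistic(count_v, count_w, count_vw, n_tokens):
    """t statistic: observed bigram rate vs. the rate expected under independence.

    Requires count_vw > 0, otherwise the variance estimate is zero.
    """
    expected = (count_v / n_tokens) * (count_w / n_tokens)  # P(v) P(w) under H
    observed = count_vw / n_tokens                          # sample mean
    variance = observed * (1 - observed)                    # Bernoulli variance
    return (observed - expected) / math.sqrt(variance / n_tokens)

# Hypothetical counts: v occurs 200 times, w 150 times, 'v w' 30 times
# in a corpus of 1,000,000 tokens.
t = t_statistic(200, 150, 30, 1_000_000)

# 1.645 is the one-sided critical value at the 0.05 level for large samples.
reject_independence = t > 1.645
print(t, reject_independence)
```

If t exceeds the critical value, the null hypothesis of independence is rejected and the pair is a candidate collocation.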
Mutual Information
Take Home Message
Collocations are language expressions consisting of two
or more words that correspond to the usual way of
expressing ideas.
It is a linguistic phenomenon shared by practically all
human languages (it is easier to form sentences using
‘chunks’ of words than word by word).
Take Home Message
Collocations are characterised by:
limited compositionality,
limited modifiability,
limited substitutability.
Example: ‘Fame da lupo’.
Take Home Message
Finding collocations:
Frequency plus part-of-speech filter.
Mean and variance.
Hypothesis testing.
Mutual information.