Collocation
Collocation:
• Frequency
• Mean and Variance
• Hypothesis Testing
• The t test
• Pearson’s chi-square test
• Mutual Information
8/30/2023 2
Collocation (Contd)
• A collocation is an expression consisting of two or more words that
correspond to some conventional way of saying things
• Collocations include noun phrases like strong tea and weapons of mass
destruction, phrasal verbs like to make up, and other strong phrases like the
rich and powerful
• Collocations are characterized by limited compositionality
• Compositionality: a natural language expression is compositional if the meaning of the
expression can be predicted from the meanings of its parts
• Collocations are not fully compositional in that there is usually an element
of meaning added to the combination
• In the case of strong tea, “strong” has acquired the meaning rich in some active
agent, which is closely related to, but slightly different from, the basic sense
“having great physical strength”
Collocation (Contd)
Definition of a Collocation:
(Choueka, 1988)
• [A collocation is defined as] “a sequence of two or more consecutive words, that
has characteristics of a syntactic and semantic unit, and whose exact and
unambiguous meaning or connotation cannot be derived directly from the
meaning or connotation of its components.”
Criteria:
– non-compositionality
– non-substitutability
– non-modifiability
– non-translatable word for word
Collocation (Contd)
Non- Compositionality
• A phrase is compositional if its meaning can be predicted from the meaning of
its parts
• Collocations have limited compositionality: there is usually an element of
meaning added to the combination (Ex: strong tea)
• Idioms are the most extreme examples of non-compositionality
• Ex: The ball is in your court
Non-Substitutability
• Cannot substitute near-synonyms for the components of a collocation
• strong is a near-synonym of powerful, yet: strong tea --> ?powerful tea
• yellow is a good description of the color of white wines, yet: white wine --> ?yellow wine
Collocation (Contd)
Non-modifiability
• Many collocations cannot be freely modified with additional lexical material or
through grammatical transformations
• weapons of mass destruction --> ?weapons of massive destruction
• to be fed up to the back teeth --> ?to be fed up to the teeth in the back
Non-translatable (word for word)
• English: make a decision (not ?take a decision)
• French: prendre une décision (not ?faire une décision)
• To test whether a group of words is a collocation:
• translate it into another language; if we cannot translate it word by word, then it probably is
a collocation
Collocation (Contd)
Linguistic Subclasses of Collocations
Phrases with light verbs
Verbs with little semantic content in the collocation
make, take, do…
Verb particle/phrasal verb constructions
to go down, to check out,…
Proper nouns
John Smith
Terminological expressions
concepts and objects in technical domains
hydraulic oil filter
Collocation (Contd)
• A collocation is an expression of two or more words that tend to appear together
frequently and have a specific meaning or syntactic pattern
• stiff breeze but not ?stiff wind (while either strong breeze or strong wind is okay)
• broad daylight (but not ?bright daylight or ?narrow darkness)
• big mistake but not ?large mistake
• Collocations overlap with the concepts of:
• terms, technical terms & terminological phrases
• Collocations extracted from technical domains
• Ex: hydraulic oil filter, file transfer protocol
Collocation (Contd)
Examples of Collocation:
•strong tea
•weapons of mass destruction
•to make up
•to check in
•heard it through the grapevine
•he knocked at the door
•I made it all up
Collocation (Contd)
Why are collocations needed?
In NLG (Natural Language Generation)
• The output should be natural
• make a decision vs ?take a decision
In lexicography
• Identify collocations to list them in a dictionary
• To distinguish the usage of synonyms or near-synonyms
In parsing
• To give preference to most natural attachments
• plastic (can opener) vs ?(plastic can) opener
In corpus linguistics and psycholinguistics
• Ex: To study social attitudes towards different types of substances
• strong cigarettes/tea/coffee
• powerful drug
Collocation (Contd)
A note on (near-)synonyms
To determine if two words are synonyms — true synonyms are rare...
Principle of substitutability:
• Two words are synonyms if they can be substituted for one another in some/any sentence
without changing the meaning or acceptability of the sentence
Whether substitution works may depend on:
• shades of meaning: words may share a central core meaning but have different sense accents
• register/social factors: speaking to a 4-yr old vs to graduate students!
• collocations: conventional ways of saying something / fixed expressions
Ex:
• How big/large is this plane?
• Would I be flying on a big/large or small plane?
• Miss Nelson became a kind of big / ??large sister to Tom.
• I think I made a big / ??large mistake.
Collocation (Contd)
Approaches to finding collocations
• Frequency
• Mean and Variance
• Hypothesis Testing
• t-test
• χ2-test
Collocation (Contd)
Approaches to finding collocations
1. Frequency (Justeson & Katz, 1995)
Hypothesis: if two words occur together very often, they must be interesting candidates for a collocation
Method: select the most frequently occurring bigrams (sequences of 2 adjacent words)
• Except for “New York”, all the top bigrams are pairs of function words
• So, let’s pass the results through a part-of-speech filter
• Part-of-speech tag patterns for collocation filtering
Collocation (Contd)
Frequency: finding collocations with the Justeson and Katz part-of-speech filter patterns
Frequency: the nouns W occurring most often in the patterns “strong W” vs “powerful W”
Collocation (Contd)
Frequency - Conclusion
Advantages:
• works well for fixed phrases
• Simple method & accurate result
• Requires little linguistic knowledge
But: many collocations consist of two words in more flexible
relationships
• she knocked on his door
• they knocked at the door
• 100 women knocked on Donaldson’s door
• a man knocked on the metal front door
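The frequency method above can be sketched in a few lines of Python. This is a toy illustration over a hand-tagged mini-corpus with a hypothetical simplified tagset (A = adjective, N = noun, etc.); it is not the exact Justeson & Katz pattern set, just a filter of the same shape:

```python
from collections import Counter

# Toy tagged corpus: (word, POS) pairs; the tagset here is a simplified
# assumption (A = adjective, N = noun, V = verb, P = preposition, D = determiner).
tagged = [("the", "D"), ("new", "A"), ("york", "N"), ("stock", "N"),
          ("exchange", "N"), ("fell", "V"), ("in", "P"), ("new", "A"),
          ("york", "N"), ("on", "P"), ("the", "D"), ("new", "A"), ("york", "N")]

# Justeson & Katz-style filtering: keep only bigrams whose tag pattern looks
# like a phrase (adjective-noun or noun-noun), then rank by raw frequency.
PATTERNS = {("A", "N"), ("N", "N")}

counts = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (t1, t2) in PATTERNS:
        counts[(w1, w2)] += 1

print(counts.most_common(3))
```

The filter discards function-word bigrams like “the new” and “in new” before any counting, which is exactly why it rescues the raw-frequency method.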
Collocation (Contd)
Mean and Variance
• The words that appear between knocked and door vary, and the distance between
the two words is not constant, so a fixed-phrase approach would not work here
• But there is enough regularity in the patterns to allow us to determine that
knock is the right verb to use in English for this situation, not hit, beat or rap
• One way of discovering the relationship between knocked and door is to compute the mean and
variance of the offsets (signed distances) between the two words in the corpus
• The mean is simply the average offset
• Compute the mean offset between knocked and door in the four sentences above as follows:
(3 + 3 + 5 + 5) / 4 = 4.0
• This assumes a tokenization of Donaldson’s as three words:
Donaldson, apostrophe, and s, which is what we actually did
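The mean-offset computation can be sketched as follows, using the four example sentences and the tokenization of Donaldson’s that the slide assumes:

```python
import statistics

# The four corpus snippets from the previous slide; "Donaldson's" is
# tokenized as three tokens (Donaldson, ', s), as the slide assumes.
sentences = [
    "she knocked on his door",
    "they knocked at the door",
    "100 women knocked on Donaldson ' s door",
    "a man knocked on the metal front door",
]

# Signed offset = position(door) - position(knocked) in each sentence.
offsets = [s.split().index("door") - s.split().index("knocked")
           for s in sentences]
print(offsets)                      # [3, 3, 5, 5]

mean = statistics.mean(offsets)     # 4.0
sd = statistics.stdev(offsets)      # sample standard deviation, ~1.15
print(mean, round(sd, 2))
```

A mean offset of about 4 with a small deviation says that door reliably appears a few words after knocked, even though no fixed phrase connects them.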
Collocation (Contd)
• (Smadja et al., 1993) looks at the distribution of distances between two words in a corpus,
looking for pairs of words with low variance
• A low variance means that the two words usually occur at about the same distance
• A low variance --> good candidate for collocation
• Need a collocational window to capture collocations of variable distances
To capture 2-word collocations in “this is an example of a three word window”
(each word paired with the next two words):
✔ this is, this an
✔ is an, is example
✔ an example, an of
✔ example of, example a
✔ of a, of three
✔ a three, a word
✔ three word, three window
✔ word window
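The pair extraction above can be sketched as a small helper (the function name `window_pairs` is illustrative, not from the slides):

```python
# Extract candidate 2-word collocations with a collocational window:
# every pair (w, w') where w' follows w within `window` positions.
def window_pairs(tokens, window=2):
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

tokens = "this is an example of a three word window".split()
pairs = window_pairs(tokens, window=2)
print(pairs[:4])   # [('this', 'is'), ('this', 'an'), ('is', 'an'), ('is', 'example')]
```

With a window of 2 on a 9-word sentence this yields the 15 pairs listed above; mean and variance of offsets can then be computed per word pair.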
Collocation (Contd)
• Sample mean of the offsets d1, ..., dn: d̄ = (d1 + ... + dn) / n
• Sample variance: s² = Σi (di − d̄)² / (n − 1); standard deviation s = √s²
• Ex: for the offsets 3, 3, 5, 5 between knocked and door: d̄ = 4.0, s ≈ 1.15
Collocation (Contd)
“strong … opposition”: variance is low --> interesting collocation
“strong … support”: variance is low --> interesting collocation
“strong … for”: variance is high --> not an interesting collocation
Collocation (Contd)
Interpreting the std. dev. and mean offset of a word pair:
• low std. dev. & mean offset near 1 --> the pair would also be found by the frequency method
• std. dev. ~0 & high mean offset --> very interesting (missed by the frequency method)
• high std. dev. --> not interesting
Collocation (Contd)
Hypothesis Testing
• If two words are frequent, they will frequently occur together even by chance
• Frequent bigrams: two words can co-occur by chance
• We want to determine whether the co-occurrence is random or whether it occurs
more often than chance
• This is a classical problem in statistics called Hypothesis Testing
• When two words co-occur, hypothesis testing measures how confident we can be
that this was, or was not, due to chance
Collocation (Contd)
• We formulate a null hypothesis H0: the two words w1 and w2 are independent,
i.e. they occur together only by chance: P(w1 w2) = P(w1) P(w2)
• If we can reject H0 with high confidence, the bigram is a good candidate for a collocation
Collocation (Contd)
Hypothesis Testing – the t-Statistic
t = (x̄ − μ) / √(s² / N)
where x̄ is the observed (sample) mean, μ is the expected mean under H0,
s² is the sample variance, and N is the sample size
Collocation (Contd)
Hypothesis Testing – the t-Statistic
• The numerator of t is the difference between the observed
mean and the expected mean
The higher the value of t, the greater the confidence that:
• There is a significant difference
• It’s not due to chance
• The two words are not independent
Collocation (Contd)
Hypothesis Testing – t-test : Example with collocations
✔ In a corpus of N = 14,307,668 tokens: c(new) = 15,828 and c(companies) = 4,675
✔ Under H0 (independence): μ = P(new) × P(companies) ≈ 3.615 × 10⁻⁷
Collocation (Contd)
Hypothesis Testing – t-test :
Example with collocations
✔ But we counted 8 occurrences of the
bigram new companies
✔ So the observed mean is x̄ = 8 / N ≈ 5.591 × 10⁻⁷
✔ By applying the t-test, we have: t ≈ 0.9999
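A minimal sketch of this t computation in Python. Only the 8 bigram occurrences appear explicitly in the slides; the unigram counts c(new) = 15,828, c(companies) = 4,675 and N = 14,307,668 are assumed here as the values consistent with the worked result:

```python
import math

# Assumed corpus counts (consistent with the slide's worked example).
N = 14_307_668
c_new, c_companies, c_bigram = 15_828, 4_675, 8

# Null hypothesis H0: the words are independent.
mu = (c_new / N) * (c_companies / N)   # expected bigram probability, ~3.615e-7
x_bar = c_bigram / N                   # observed bigram probability, ~5.591e-7

# For a Bernoulli variable, s^2 = p(1-p) ~ p when p is small.
t = (x_bar - mu) / math.sqrt(x_bar / N)
print(round(t, 4))   # ~0.9999
```

Since t ≈ 0.9999 is well below the critical value 2.576 (α = 0.005), the co-occurrence of new and companies is consistent with chance.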
Collocation (Contd)
✔ t ≈ 0.9999 is below the critical value 2.576 (for α = 0.005)
✔ So we cannot reject H0: the co-occurrence of new and companies is consistent with chance,
and new companies is not a good candidate for a collocation
Collocation (Contd)
Hypothesis Testing – Some Intuition
• The t-test assigns a probability to the observed data under the null hypothesis:
how likely is it that we would see these counts if the two words were independent?
Collocation (Contd)
✔ t test applied to 10 bigrams that all occur with frequency = 20
▪ bigrams that pass the t-test (t > 2.576): we can reject the null hypothesis, so they form collocations
▪ bigrams that fail the t-test (t < 2.576): we cannot reject the null hypothesis, so they do not form collocations
✔ A frequency-based method could not have seen the difference between these bigrams,
because they all have the same frequency
✔ The t test takes into account the frequency of a bigram relative to the frequencies of
its component words
• If a high proportion of the occurrences of both words occur in the bigram, then
its t is high
✔ The t test is mostly used to rank collocations
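That ranking effect can be illustrated with two hypothetical bigrams that both occur 20 times; the unigram counts below are invented for illustration, with N taken from the earlier example:

```python
import math

N = 14_307_668  # corpus size assumed from the earlier example

def t_score(c1, c2, c12, N):
    """t statistic for a bigram w1 w2 (s^2 approximated by the observed mean)."""
    x_bar = c12 / N
    mu = (c1 / N) * (c2 / N)
    return (x_bar - mu) / math.sqrt(x_bar / N)

# Two hypothetical bigrams, both occurring 20 times: one whose component
# words are rare (most of their occurrences are inside the bigram), one
# whose component words are individually very frequent.
t_rare = t_score(30, 25, 20, N)          # high t -> good collocation candidate
t_freq = t_score(50_000, 60_000, 20, N)  # low t -> chance-level co-occurrence
print(round(t_rare, 2), round(t_freq, 2))
```

Raw frequency cannot distinguish these two bigrams, but t separates them sharply, which is why t is used to rank equal-frequency candidates.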
Collocation (Contd)
Hypothesis testing: the χ2-test
✔ A problem with the t test is that it assumes that probabilities are
approximately normally distributed…
✔the χ2-test does not make this assumption
✔The essence of the χ2-test is the same as the t-test
✔ Compare observed frequencies and expected frequencies for
independence
✔ if the difference is large
✔ then we can reject the null hypothesis of independence
Collocation (Contd)
χ2-test
✔In its simplest form, it is applied to a 2x2 table of observed
frequencies
✔The χ2 statistic:
✔ sums the squared differences between observed frequencies (in the table)
and expected values for independence
✔ scaled by the magnitude of the expected values:
X² = Σi,j (Obsij − Expij)² / Expij
Collocation (Contd)
χ2-test - Example
✔ Observed frequencies Obsij for new companies:

Observed    companies    ~companies     TOTAL
new         8            15,820         15,828
~new        4,667        14,287,173     14,291,840
TOTAL       4,675        14,302,993     14,307,668
Collocation (Contd)
χ2-test - Example
✔ Expected frequencies under independence: Expij = (row i total × column j total) / N
✔ Ex: Exp11 = (15,828 × 4,675) / 14,307,668 ≈ 5.2
✔ X² = Σi,j (Obsij − Expij)² / Expij ≈ 1.55
Collocation (Contd)
χ2-test- Example
✔ But is the difference significant?
✔ df in an nxc table = (n-1)(c-1) = (2-1)(2-1) =1 (degrees of freedom)
✔ At the probability level α = 0.05, the critical value is 3.84
✔ Since 1.55 < 3.84:
✔ So we cannot reject H0 (that new and companies occur independently of each other)
✔ So new companies is not a good candidate for a collocation
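The 2x2 χ2 computation can be sketched in Python. Only the 8 bigram occurrences are explicit in the slides; the unigram counts and N are assumed here as the values consistent with the reported X² ≈ 1.55:

```python
# 2x2 chi-square test for the bigram "new companies" (assumed counts).
N = 14_307_668
c_new, c_companies, c_bigram = 15_828, 4_675, 8

# Observed table: rows = (new, ~new), columns = (companies, ~companies).
o11 = c_bigram
o12 = c_new - c_bigram
o21 = c_companies - c_bigram
o22 = N - o11 - o12 - o21

obs = [[o11, o12], [o21, o22]]
row = [o11 + o12, o21 + o22]          # row totals
col = [o11 + o21, o12 + o22]          # column totals

# X^2 = sum over cells of (observed - expected)^2 / expected,
# with expected_ij = row_i * col_j / N under independence.
chi2 = sum((obs[i][j] - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
           for i in range(2) for j in range(2))
print(round(chi2, 2))   # ~1.55
```

Since 1.55 < 3.84, the χ2 test agrees with the t-test here: independence cannot be rejected.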
Collocation (Contd)
χ2-test: Conclusion
✔Differences between the t statistic and χ2 statistic do not seem
to be large
✔But:
✔ the χ2 test is appropriate for large probabilities
• where t test fails because of the normality assumption
✔ the χ2 is not appropriate with sparse data (if numbers in the 2 by 2
tables are small)
✔χ2 test has been applied to a wider range of problems
✔ Machine translation
✔ Corpus similarity
Collocation (Contd)
χ2-test for machine translation
✔ (Church & Gale, 1991)
✔ To identify translation word pairs in aligned corpora
✔ Ex: number of aligned sentence pairs containing “cow” in English and “vache” in French:

Observed freq.   “cow”    ~“cow”      TOTAL
“vache”          59       6           65
~“vache”         8        570,934     570,942
TOTAL            67       570,940     571,007

✔ χ2 = 456,400 >> 3.84 (with α = 0.05)
✔ So “vache” and “cow” are not independent… and so they are likely translations of each other
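As a check, the χ2 for this table can be computed with the standard closed-form formula for 2x2 tables:

```python
# Closed-form chi-square for a 2x2 table:
# X^2 = N * (O11*O22 - O12*O21)^2
#       / ((O11+O12) * (O11+O21) * (O12+O22) * (O21+O22)),
# applied to the cow/vache counts from the slide.
o11, o12, o21, o22 = 59, 6, 8, 570_934
N = o11 + o12 + o21 + o22   # 571,007

chi2 = (N * (o11 * o22 - o12 * o21) ** 2 /
        ((o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)))
print(round(chi2))   # ~456,400, far above the 3.84 critical value
```

The huge statistic reflects how nearly all “cow” sentences align with “vache” sentences and vice versa.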
Collocation (Contd)
χ2-test for corpus similarity
✔ (Kilgarriff & Rose, 1998)
✔ Ex:
✔ Compute χ2 for the 2 populations (corpus1 and corpus2)
✔ H0: the 2 corpora are drawn from the same underlying word distribution
Collocation (Contd)
Collocations across corpora
✔ Ratios of relative frequencies between two or more different corpora
✔ can be used to discover collocations that are characteristic of a corpus when compared to another
corpus
✔ most useful for the discovery of subject-specific collocations
✔ Compare a general text with a subject-specific text
✔ words and phrases that (on a relative basis) occur most often in the subject-specific text are
likely to be part of the vocabulary that is specific to the domain
Collocation (Contd)
Pointwise Mutual Information
✔Uses a measure from information-theory
✔Pointwise mutual information between 2 events x and y (in our
case the occurrence of 2 words):
I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
✔ a measure of how much one event (word) tells us about the other
✔ or a measure of the (in)dependence of 2 events (or 2 words)
• If 2 events x and y are independent, then I(x, y) = 0
Collocation (Contd)
Example
✔ Assume:
✔ c(Ayatollah) = 42
✔ c(Ruhollah) = 20
✔ c(Ayatollah, Ruhollah) = 20
✔ N = 14,307,668
✔ Then: I(Ayatollah, Ruhollah) = log2 [ (20/N) / ((42/N) × (20/N)) ] = log2 (N/42) ≈ 18.38
✔ So? The amount of information we have about the occurrence of “Ruhollah” at position i+1
increases by 18.38 bits if we are told that “Ayatollah” occurs at position i
✔ But pointwise mutual information works particularly badly with sparse data
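The PMI value can be reproduced directly from the counts; N is assumed here to be 14,307,668, the corpus size that yields the 18.38-bit result:

```python
import math

# PMI for the bigram "Ayatollah Ruhollah" with the slide's counts
# (N assumed to be the 14,307,668-token corpus of the earlier examples).
N = 14_307_668
c_x, c_y, c_xy = 42, 20, 20

# I(x, y) = log2( P(x, y) / (P(x) * P(y)) )
pmi = math.log2((c_xy / N) / ((c_x / N) * (c_y / N)))
print(round(pmi, 2))   # 18.38
```

Because every occurrence of Ruhollah is preceded by Ayatollah, the expression simplifies to log2(N / c(Ayatollah)), which is why such small counts can still produce a very large PMI — the sparse-data weakness noted above.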
Collocation (Contd)
Pointwise Mutual Information
✔ Ranking the example bigrams with pointwise mutual information gives the same ranking as with the t-test