Collocation
Collocation:
• Frequency
• Mean and Variance
• Hypothesis Testing
• The t test
• Pearson’s chi-square test
• Mutual Information
8/30/2023 2
Collocation (Contd)
• A collocation is an expression consisting of two or more words that
correspond to some conventional way of saying things
• Collocations include noun phrases like strong tea and weapons of mass
destruction, phrasal verbs like to make up, and other strong phrases like the
rich and powerful
• Collocations are characterized by limited compositionality
• Compositionality: a natural language expression is compositional if the meaning of the
expression can be predicted from the meanings of its parts
• Collocations are not fully compositional in that there is usually an element
of meaning added to the combination
• In the case of strong tea, “strong” has acquired the meaning rich in some active
agent, which is closely related to, but slightly different from, the basic sense
“having great physical strength”
Collocation (Contd)
Definition of a Collocation:
(Choueka, 1988)
• [A collocation is defined as] “a sequence of two or more consecutive words, that
has characteristics of a syntactic and semantic unit, and whose exact and
unambiguous meaning or connotation cannot be derived directly from the
meaning or connotation of its components.”
Criteria:
– non-compositionality
– non-substitutability
– non-modifiability
– non-translatable word for word
Collocation (Contd)
Non- Compositionality
• A phrase is compositional if its meaning can be predicted from the meaning of
its parts
• Collocations have limited compositionality: there is usually an element of
meaning added to the combination (Ex: strong tea)
• Idioms are the most extreme examples of non-compositionality
• Ex: The ball is in your court
Non-Substitutability
• Cannot substitute near-synonyms for the components of a collocation
• strong is a near-synonym of powerful, yet: strong tea --> ?powerful tea
• yellow is a good description of the color of white wines, yet: white wine --> ?yellow wine
Collocation (Contd)
Non-modifiability
• Many collocations cannot be freely modified with additional lexical material or
through grammatical transformations
• weapons of mass destruction --> ?weapons of massive destruction
• to be fed up to the back teeth --> ?to be fed up to the teeth in the back
Non-translatable (word for word)
• English: make a decision (not ?take a decision)
• French: prendre une décision (not ?faire une décision)
• To test whether a group of words is a collocation:
• translate it into another language; if we cannot translate it word by word, then it probably is
a collocation
Collocation (Contd)
Linguistic Subclasses of Collocations
Phrases with light verbs
Verbs with little semantic content in the collocation
make, take, do…
Verb particle/phrasal verb constructions
to go down, to check out,…
Proper nouns
John Smith
Terminological expressions
concepts and objects in technical domains
hydraulic oil filter
Collocation (Contd)
• A collocation is an expression of two or more words that tend to appear together
frequently and have a specific meaning or syntactic pattern
• stiff breeze but not ?stiff wind (while either strong breeze or strong wind is okay)
• broad daylight (but not ?bright daylight or ?narrow darkness)
• big mistake but not ?large mistake
• Collocations overlap with the concepts of:
• terms, technical terms & terminological phrases
• Collocations extracted from technical domains
• Ex: hydraulic oil filter, file transfer protocol
Collocation (Contd)
Examples of Collocation:
•strong tea
•weapons of mass destruction
•to make up
•to check in
•heard it through the grapevine
•he knocked at the door
•I made it all up
Collocation (Contd)
Why are collocations needed?
In NLG (Natural Language Generation)
• The output should be natural
• make a decision vs ?take a decision
In lexicography
• Identify collocations to list them in a dictionary
• To distinguish the usage of synonyms or near-synonyms
In parsing
• To give preference to most natural attachments
• plastic (can opener) vs ?(plastic can) opener
In corpus linguistics and psycholinguistics
• Ex: To study social attitudes towards different types of substances
• strong cigarettes/tea/coffee
• powerful drug
Collocation (Contd)
A note on (near-)synonyms
To determine if two words are synonyms — true synonyms are rare...
Principle of substitutability:
• Two words are synonyms if they can be substituted for one another in some/any sentence
without changing the meaning or acceptability of the sentence
Whether substitution works may depend on:
• shades of meaning: words may share a central core meaning but have different sense accents
• register/social factors: speaking to a 4-yr old vs to graduate students!
• collocations: conventional ways of saying something / fixed expressions
Ex:
• How big/large is this plane?
• Would I be flying on a big/large or small plane?
• Miss Nelson became a kind of big / ??large sister to Tom.
• I think I made a big / ??large mistake.
Collocation (Contd)
Approaches to finding collocations
• Frequency
• Mean and Variance
• Hypothesis Testing
• t-test
• χ2-test
Collocation (Contd)
Approaches to finding collocations
1. Frequency (Justeson & Katz, 1995)
Hypothesis: if two words occur together very often, they must be interesting candidates for a collocation
Method: select the most frequently occurring bigrams (sequences of 2 adjacent words)
• Except for “New York”, all the top bigrams are pairs of function words
• So, let’s pass the results through a part-of-speech filter
• Part-of-speech tag patterns for collocation filtering
Collocation (Contd)
Frequency: finding collocations with the Justeson and Katz part-of-speech filter patterns
Frequency: the nouns W occurring most often in the patterns “strong W” vs “powerful W”
Collocation (Contd)
Frequency - Conclusion
Advantages:
• works well for fixed phrases
• Simple method & accurate result
• Requires little linguistic knowledge
But: many collocations consist of two words in more flexible
relationships
• she knocked on his door
• they knocked at the door
• 100 women knocked on Donaldson’s door
• a man knocked on the metal front door
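The frequency method above can be sketched in a few lines of Python. This is a toy illustration over a hand-tagged mini-corpus with a hypothetical simplified tagset (A = adjective, N = noun, etc.); it is not the exact Justeson & Katz pattern set, just a filter of the same shape:

```python
from collections import Counter

# Toy tagged corpus: (word, POS) pairs; the tagset here is a simplified
# assumption (A = adjective, N = noun, V = verb, P = preposition, D = determiner).
tagged = [("the", "D"), ("new", "A"), ("york", "N"), ("stock", "N"),
          ("exchange", "N"), ("fell", "V"), ("in", "P"), ("new", "A"),
          ("york", "N"), ("on", "P"), ("the", "D"), ("new", "A"), ("york", "N")]

# Justeson & Katz-style filtering: keep only bigrams whose tag pattern looks
# like a phrase (adjective-noun or noun-noun), then rank by raw frequency.
PATTERNS = {("A", "N"), ("N", "N")}

counts = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (t1, t2) in PATTERNS:
        counts[(w1, w2)] += 1

print(counts.most_common(3))
```

The filter discards function-word bigrams like “the new” and “in new” before any counting, which is exactly why it rescues the raw-frequency method.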
Collocation (Contd)
Mean and Variance
• The words that appear between knocked and door vary, and the distance between
the two words is not constant, so a fixed-phrase approach would not work here
• But there is enough regularity in the patterns to allow us to determine that
knock is the right verb to use in English for this situation, not hit, beat or rap
• One way of discovering the relationship between knocked and door is to compute the mean and
variance of the offsets (signed distances) between the two words in the corpus
• The mean is simply the average offset
• Compute the mean offset between knocked and door in the four sentences above as follows:
(3 + 3 + 5 + 5) / 4 = 4.0
• This assumes a tokenization of Donaldson’s as three words:
Donaldson, apostrophe, and s, which is what we actually did
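The mean-offset computation can be sketched as follows, using the four example sentences and the tokenization of Donaldson’s that the slide assumes:

```python
import statistics

# The four corpus snippets from the previous slide; "Donaldson's" is
# tokenized as three tokens (Donaldson, ', s), as the slide assumes.
sentences = [
    "she knocked on his door",
    "they knocked at the door",
    "100 women knocked on Donaldson ' s door",
    "a man knocked on the metal front door",
]

# Signed offset = position(door) - position(knocked) in each sentence.
offsets = [s.split().index("door") - s.split().index("knocked")
           for s in sentences]
print(offsets)                      # [3, 3, 5, 5]

mean = statistics.mean(offsets)     # 4.0
sd = statistics.stdev(offsets)      # sample standard deviation, ~1.15
print(mean, round(sd, 2))
```

A mean offset of about 4 with a small deviation says that door reliably appears a few words after knocked, even though no fixed phrase connects them.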
Collocation (Contd)
• (Smadja et al., 1993) looks at the distribution of distances between two words in a corpus,
looking for pairs of words with low variance
• A low variance means that the two words usually occur at about the same distance
• A low variance --> good candidate for collocation
• Need a collocational window to capture collocations of variable distances
To capture 2-word collocations in “this is an example of a three word window”
(each word paired with the next two words):
✔ this is, this an
✔ is an, is example
✔ an example, an of
✔ example of, example a
✔ of a, of three
✔ a three, a word
✔ three word, three window
✔ word window
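The pair extraction above can be sketched as a small helper (the function name `window_pairs` is illustrative, not from the slides):

```python
# Extract candidate 2-word collocations with a collocational window:
# every pair (w, w') where w' follows w within `window` positions.
def window_pairs(tokens, window=2):
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

tokens = "this is an example of a three word window".split()
pairs = window_pairs(tokens, window=2)
print(pairs[:4])   # [('this', 'is'), ('this', 'an'), ('is', 'an'), ('is', 'example')]
```

With a window of 2 on a 9-word sentence this yields the 15 pairs listed above; mean and variance of offsets can then be computed per word pair.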
Collocation (Contd)
• Sample mean of the offsets d1, ..., dn: d̄ = (d1 + ... + dn) / n
• Sample variance: s² = Σi (di − d̄)² / (n − 1); standard deviation s = √s²
• Ex: for the offsets 3, 3, 5, 5 between knocked and door: d̄ = 4.0, s ≈ 1.15
Collocation (Contd)
“strong … opposition”: variance is low --> interesting collocation
“strong … support”: variance is low --> interesting collocation
“strong … for”: variance is high --> not an interesting collocation
Collocation (Contd)
Interpreting the std. dev. and mean offset of a word pair:
• low std. dev. & mean offset near 1 --> the pair would also be found by the frequency method
• std. dev. ~0 & high mean offset --> very interesting (missed by the frequency method)
• high std. dev. --> not interesting
Collocation (Contd)
Hypothesis Testing
• If two words are frequent, they will frequently occur together even by chance
• Frequent bigrams: two words can co-occur by chance
• We want to determine whether the co-occurrence is random or whether it occurs
more often than chance
• This is a classical problem in statistics called Hypothesis Testing
• When two words co-occur, hypothesis testing measures how confident we can be
that this was, or was not, due to chance
Collocation (Contd)
• We formulate a null hypothesis H0: the two words w1 and w2 are independent,
i.e. they occur together only by chance: P(w1 w2) = P(w1) P(w2)
• If we can reject H0 with high confidence, the bigram is a good candidate for a collocation
Collocation (Contd)
Hypothesis Testing – the t-Statistic
t = (x̄ − μ) / √(s² / N)
where x̄ is the observed (sample) mean, μ is the expected mean under H0,
s² is the sample variance, and N is the sample size
Collocation (Contd)
Hypothesis Testing – the t-Statistic
• The numerator of t is the difference between the observed
mean and the expected mean
The higher the value of t, the greater the confidence that:
• There is a significant difference
• It’s not due to chance
• The two words are not independent
Collocation (Contd)
Hypothesis Testing – t-test : Example with collocations
✔ In a corpus of N = 14,307,668 tokens: c(new) = 15,828 and c(companies) = 4,675
✔ Under H0 (independence): μ = P(new) × P(companies) ≈ 3.615 × 10⁻⁷
Collocation (Contd)
Hypothesis Testing – t-test :
Example with collocations
✔ But we counted 8 occurrences of the
bigram new companies
✔ So the observed mean is x̄ = 8 / N ≈ 5.591 × 10⁻⁷
✔ By applying the t-test, we have: t ≈ 0.9999
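A minimal sketch of this t computation in Python. Only the 8 bigram occurrences appear explicitly in the slides; the unigram counts c(new) = 15,828, c(companies) = 4,675 and N = 14,307,668 are assumed here as the values consistent with the worked result:

```python
import math

# Assumed corpus counts (consistent with the slide's worked example).
N = 14_307_668
c_new, c_companies, c_bigram = 15_828, 4_675, 8

# Null hypothesis H0: the words are independent.
mu = (c_new / N) * (c_companies / N)   # expected bigram probability, ~3.615e-7
x_bar = c_bigram / N                   # observed bigram probability, ~5.591e-7

# For a Bernoulli variable, s^2 = p(1-p) ~ p when p is small.
t = (x_bar - mu) / math.sqrt(x_bar / N)
print(round(t, 4))   # ~0.9999
```

Since t ≈ 0.9999 is well below the critical value 2.576 (α = 0.005), the co-occurrence of new and companies is consistent with chance.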
Collocation (Contd)
✔ t ≈ 0.9999 is below the critical value 2.576 (for α = 0.005)
✔ So we cannot reject H0: the co-occurrence of new and companies is consistent with chance,
and new companies is not a good candidate for a collocation
Collocation (Contd)
Hypothesis Testing – Some Intuition
• The t-test assigns a probability to the observed data under the null hypothesis:
how likely is it that we would see these counts if the two words were independent?
Collocation (Contd)
✔ t test applied to 10 bigrams that all occur with frequency = 20
▪ bigrams that pass the t-test (t > 2.576): we can reject the null hypothesis, so they form collocations
▪ bigrams that fail the t-test (t < 2.576): we cannot reject the null hypothesis, so they do not form collocations
✔ A frequency-based method could not have seen the difference between these bigrams,
because they all have the same frequency
✔ The t test takes into account the frequency of a bigram relative to the frequencies of
its component words
• If a high proportion of the occurrences of both words occur in the bigram, then
its t is high
✔ The t test is mostly used to rank collocations
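That ranking effect can be illustrated with two hypothetical bigrams that both occur 20 times; the unigram counts below are invented for illustration, with N taken from the earlier example:

```python
import math

N = 14_307_668  # corpus size assumed from the earlier example

def t_score(c1, c2, c12, N):
    """t statistic for a bigram w1 w2 (s^2 approximated by the observed mean)."""
    x_bar = c12 / N
    mu = (c1 / N) * (c2 / N)
    return (x_bar - mu) / math.sqrt(x_bar / N)

# Two hypothetical bigrams, both occurring 20 times: one whose component
# words are rare (most of their occurrences are inside the bigram), one
# whose component words are individually very frequent.
t_rare = t_score(30, 25, 20, N)          # high t -> good collocation candidate
t_freq = t_score(50_000, 60_000, 20, N)  # low t -> chance-level co-occurrence
print(round(t_rare, 2), round(t_freq, 2))
```

Raw frequency cannot distinguish these two bigrams, but t separates them sharply, which is why t is used to rank equal-frequency candidates.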
Collocation (Contd)
Hypothesis testing: the χ2-test
✔ A problem with the t test is that it assumes that probabilities are
approximately normally distributed…
✔the χ2-test does not make this assumption
✔The essence of the χ2-test is the same as the t-test
✔ Compare observed frequencies and expected frequencies for
independence
✔ if the difference is large
✔ then we can reject the null hypothesis of independence
Collocation (Contd)
χ2-test
✔In its simplest form, it is applied to a 2x2 table of observed
frequencies
✔The χ2 statistic:
✔ sums the squared differences between observed frequencies (in the table)
and expected values for independence
✔ scaled by the magnitude of the expected values:
X² = Σi,j (Obsij − Expij)² / Expij
Collocation (Contd)
χ2-test - Example
✔ Observed frequencies Obsij for new companies:

Observed    companies    ~companies     TOTAL
new         8            15,820         15,828
~new        4,667        14,287,173     14,291,840
TOTAL       4,675        14,302,993     14,307,668
Collocation (Contd)
χ2-test - Example
✔ Expected frequencies under independence: Expij = (row i total × column j total) / N
✔ Ex: Exp11 = (15,828 × 4,675) / 14,307,668 ≈ 5.2
✔ X² = Σi,j (Obsij − Expij)² / Expij ≈ 1.55
Collocation (Contd)
χ2-test- Example
✔ But is the difference significant?
✔ df in an nxc table = (n-1)(c-1) = (2-1)(2-1) =1 (degrees of freedom)
✔ At the probability level α = 0.05, the critical value is 3.84
✔ Since 1.55 < 3.84:
✔ So we cannot reject H0 (that new and companies occur independently of each other)
✔ So new companies is not a good candidate for a collocation
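The 2x2 χ2 computation can be sketched in Python. Only the 8 bigram occurrences are explicit in the slides; the unigram counts and N are assumed here as the values consistent with the reported X² ≈ 1.55:

```python
# 2x2 chi-square test for the bigram "new companies" (assumed counts).
N = 14_307_668
c_new, c_companies, c_bigram = 15_828, 4_675, 8

# Observed table: rows = (new, ~new), columns = (companies, ~companies).
o11 = c_bigram
o12 = c_new - c_bigram
o21 = c_companies - c_bigram
o22 = N - o11 - o12 - o21

obs = [[o11, o12], [o21, o22]]
row = [o11 + o12, o21 + o22]          # row totals
col = [o11 + o21, o12 + o22]          # column totals

# X^2 = sum over cells of (observed - expected)^2 / expected,
# with expected_ij = row_i * col_j / N under independence.
chi2 = sum((obs[i][j] - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
           for i in range(2) for j in range(2))
print(round(chi2, 2))   # ~1.55
```

Since 1.55 < 3.84, the χ2 test agrees with the t-test here: independence cannot be rejected.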
Collocation (Contd)
χ2-test: Conclusion
✔Differences between the t statistic and χ2 statistic do not seem
to be large
✔But:
✔ the χ2 test is appropriate for large probabilities
• where t test fails because of the normality assumption
✔ the χ2 is not appropriate with sparse data (if numbers in the 2 by 2
tables are small)
✔χ2 test has been applied to a wider range of problems
✔ Machine translation
✔ Corpus similarity
Collocation (Contd)
χ2-test for machine translation
✔ (Church & Gale, 1991)
✔ To identify translation word pairs in aligned corpora
✔ Ex: number of aligned sentence pairs containing “cow” in English and “vache” in French:

Observed freq.   “cow”    ~“cow”      TOTAL
“vache”          59       6           65
~“vache”         8        570,934     570,942
TOTAL            67       570,940     571,007

✔ χ2 = 456,400 >> 3.84 (with α = 0.05)
✔ So “vache” and “cow” are not independent… and so they are likely translations of each other
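As a check, the χ2 for this table can be computed with the standard closed-form formula for 2x2 tables:

```python
# Closed-form chi-square for a 2x2 table:
# X^2 = N * (O11*O22 - O12*O21)^2
#       / ((O11+O12) * (O11+O21) * (O12+O22) * (O21+O22)),
# applied to the cow/vache counts from the slide.
o11, o12, o21, o22 = 59, 6, 8, 570_934
N = o11 + o12 + o21 + o22   # 571,007

chi2 = (N * (o11 * o22 - o12 * o21) ** 2 /
        ((o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)))
print(round(chi2))   # ~456,400, far above the 3.84 critical value
```

The huge statistic reflects how nearly all “cow” sentences align with “vache” sentences and vice versa.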
Collocation (Contd)
χ2-test for corpus similarity
✔ (Kilgarriff & Rose, 1998)
✔ Ex:
✔ Compute χ2 for the 2 populations (corpus1 and corpus2)
✔ H0: the 2 corpora are drawn from the same underlying word distribution
Collocation (Contd)
Collocations across corpora
✔ Ratios of relative frequencies between two or more different corpora
✔ can be used to discover collocations that are characteristic of a corpus when compared to another
corpus
✔ most useful for the discovery of subject-specific collocations
✔ Compare a general text with a subject-specific text
✔ words and phrases that (on a relative basis) occur most often in the subject-specific text are
likely to be part of the vocabulary that is specific to the domain
Collocation (Contd)
Pointwise Mutual Information
✔Uses a measure from information-theory
✔Pointwise mutual information between 2 events x and y (in our
case the occurrence of 2 words):
I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
✔ a measure of how much one event (word) tells us about the other
✔ or a measure of the (in)dependence of 2 events (or 2 words)
• If 2 events x and y are independent, then I(x, y) = 0
Collocation (Contd)
Example
✔ Assume:
✔ c(Ayatollah) = 42
✔ c(Ruhollah) = 20
✔ c(Ayatollah, Ruhollah) = 20
✔ N = 14,307,668
✔ Then: I(Ayatollah, Ruhollah) = log2 [ (20/N) / ((42/N) × (20/N)) ] = log2 (N/42) ≈ 18.38
✔ So? The amount of information we have about the occurrence of “Ruhollah” at position i+1
increases by 18.38 bits if we are told that “Ayatollah” occurs at position i
✔ But pointwise mutual information works particularly badly with sparse data
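The PMI value can be reproduced directly from the counts; N is assumed here to be 14,307,668, the corpus size that yields the 18.38-bit result:

```python
import math

# PMI for the bigram "Ayatollah Ruhollah" with the slide's counts
# (N assumed to be the 14,307,668-token corpus of the earlier examples).
N = 14_307_668
c_x, c_y, c_xy = 42, 20, 20

# I(x, y) = log2( P(x, y) / (P(x) * P(y)) )
pmi = math.log2((c_xy / N) / ((c_x / N) * (c_y / N)))
print(round(pmi, 2))   # 18.38
```

Because every occurrence of Ruhollah is preceded by Ayatollah, the expression simplifies to log2(N / c(Ayatollah)), which is why such small counts can still produce a very large PMI — the sparse-data weakness noted above.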
Collocation (Contd)
Pointwise Mutual Information
✔ Ranking the example bigrams with pointwise mutual information gives the same ranking as with the t-test