Natural Language Processing
Introduction
       Felipe Bravo-Marquez
        March 31, 2021
                               Disclaimer
• A significant part of the content presented in these slides is taken from other
  resources such as textbooks and publications.
• The neural network part of the course is heavily based on the book Neural
  Network Methods for Natural Language Processing [Goldberg, 2017].
             Natural Language Processing
• The amount of digitized textual data generated every day is huge (e.g., the
  Web, social media, medical records, digitized books).
• So too grows the need for translating, analyzing, and managing this flood of
  words and text.
• Natural language processing (NLP) is the field of designing methods and
  algorithms that take as input or produce as output unstructured, natural
  language data. [Goldberg, 2017]
• Natural language processing is focused on the design and analysis of
  computational algorithms and representations for processing natural human
  language [Eisenstein, 2018]
             Natural Language Processing
• Example of NLP task: Named Entity Recognition (NER):
                  Figure: Named Entity Recognition
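To make the task concrete, here is a minimal sketch of running an off-the-shelf NER
tagger in Python. It assumes the spaCy library and its pretrained en_core_web_sm
model are installed; neither is prescribed by these slides, and any NER toolkit
would do:

import spacy

# Load a small pretrained English pipeline (assumed to be installed).
nlp = spacy.load("en_core_web_sm")

doc = nlp("U.N. official Ekeus heads for Baghdad.")
for ent in doc.ents:
    # Typical output (depends on the model): Ekeus PERSON, Baghdad GPE
    print(ent.text, ent.label_)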
• Human language is highly ambiguous: I ate pizza with friends vs. I ate pizza with
  olives vs. I ate pizza with a fork.
• It is also ever changing and evolving (e.g., hashtags on Twitter).
Natural Language Processing and Computational Linguistics
Natural language processing (NLP) develops methods for solving practical problems
involving language [Johnson, 2014].
   • Automatic speech recognition.
   • Machine translation.
   • Information extraction from documents.
Computational linguistics (CL) studies the computational processes underlying
(human) language.
   • How do we understand language?
   • How do we produce language?
   • How do we learn language?
Similar methods and models are used in NLP and CL.
Natural Language Processing and Computational Linguistics
  • Most of the meetings and journals that host natural language processing
    research bear the name “computational linguistics” (e.g., ACL, NAACL).
    [Eisenstein, 2018]
  • NLP and CL may be thought of as essentially synonymous.
  • While there is substantial overlap, there is an important difference in focus.
  • CL is essentially linguistics supported by computational methods (similar to
    computational biology, computational astronomy).
  • In linguistics, language is the object of study.
  • NLP focuses on solving well-defined tasks involving human language (e.g.,
    translation, question answering, holding conversations).
  • Fundamental linguistic insights may be crucial for accomplishing these tasks, but
    success is ultimately measured by whether and how well the job gets done
    (according to an evaluation metric) [Eisenstein, 2018].
                Linguistic levels of description
The field of linguistics includes subfields that concern themselves with different levels
or aspects of the structure of language, as well as subfields dedicated to studying how
linguistic structure interacts with human cognition and society [Bender, 2013].
  1. Phonetics: The study of the sounds of human language.
  2. Phonology: The study of sound systems in human languages.
  3. Morphology: The study of the formation and internal structure of words.
  4. Syntax: The study of the formation and internal structure of sentences.
  5. Semantics: The study of the meaning of sentences.
  6. Pragmatics: The study of the way sentences with their semantic meanings are
     used for particular communicative goals.
                               Phonetics
• Phonetics studies the sounds of a language [Johnson, 2014].
• It deals with the organs of sound production (e.g., mouth, tongue, throat, nose,
  lips, palate).
• Vowels vs consonants.
• Vowels are produced with little restriction of the airflow from the lungs out the
  mouth and/or the nose. [Fromkin et al., 2018]
• Consonants are produced with some restriction or closure in the vocal tract that
  impedes the flow of air from the lungs. [Fromkin et al., 2018]
• International Phonetic Alphabet (IPA): alphabetic system of phonetic notation.
                              Phonology
• Phonology: The study of how speech sounds form patterns
  [Fromkin et al., 2018].
• Phonemes are the basic form of a sound (e.g., the phoneme /p/).
• Example: why is g silent in sign but pronounced in the related word signature?
• Example: English speakers pronounce /t/ differently (e.g., in water)
• In Spanish /z/ is pronounced differently in Spain and Latin America.
• Phonetics vs Phonology:
  http://www.phon.ox.ac.uk/jcoleman/PHONOLOGY1.htm.
                             Morphology
• Morphology studies the structure of words (e.g., re+structur+ing,
  un+remark+able) [Johnson, 2014].
• Morpheme: the linguistic term for the most elemental unit of grammatical form
  [Fromkin et al., 2018]. Example: morphology = morph + ology (“the science of”).
• Derivational morphology: process of forming a new word from an existing word,
  often by adding a prefix or suffix
• Derivational morphology exhibits a hierarchical structure. Example:
  re+vital+ize+ation
• The suffix usually determines the syntactic category (part-of-speech) of the
  derived word.
                                Syntax
• Syntax studies the ways words combine to form phrases and sentences
  [Johnson, 2014]
• Syntactic parsing helps identify who did what to whom, a key step in
  understanding a sentence.
                               Semantics
• Semantics studies the meaning of words, phrases and sentences
  [Johnson, 2014].
• Semantic roles: indicate the role played by each entity in a sentence.
• Examples of semantic roles: agent (the entity that performs the action), theme
  (the entity involved in the action), or instrument (another entity used by the
  agent in order to perform the action).
• Annotated sentence: [The boy]AGENT cut [the rope]THEME with [a razor]INSTRUMENT.
• Lexical relations: relationship between different words [Yule, 2016].
• Examples of lexical relations: synonymy (conceal/hide), antonymy
  (shallow/deep) and hyponymy (dog/animal).
                             Pragmatics
• Pragmatics: the study of how context affects meaning in certain situations
  [Fromkin et al., 2018].
• Example: how the sentence “It’s cold in here” comes to be interpreted as “close
  the windows”.
• Example 2: Can you pass the salt?
Natural Language Processing and Machine Learning
   • While we humans are great users of language, we are also very poor at formally
     understanding and describing the rules that govern language.
   • Understanding and producing language using computers is highly challenging.
   • The best known set of methods for dealing with language data rely on
     supervised machine learning.
   • Supervised machine learning: attempts to infer usage patterns and regularities
     from a set of pre-annotated input and output pairs (a.k.a. the training dataset).
    Training Dataset: CoNLL-2003 NER Data
Each line contains a token, a part-of-speech tag, a syntactic
chunk tag, and a named-entity tag.
U.N.             NNP    I-NP     I-ORG
official         NN     I-NP     O
Ekeus            NNP    I-NP     I-PER
heads            VBZ    I-VP     O
for              IN     I-PP     O
Baghdad          NNP    I-NP     I-LOC
.                .      O        O
Source: https://www.clips.uantwerpen.be/conll2003/ner/
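As an illustration, here is a minimal Python sketch for reading this four-column
format; blank lines separate sentences, and -DOCSTART- lines mark document
boundaries in the official files. The function name read_conll is hypothetical:

def read_conll(path):
    # Parse a CoNLL-2003-style file into a list of sentences, each a list
    # of (token, pos, chunk, ner) tuples.
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("-DOCSTART-"):
                continue  # skip document boundary markers
            if not line:  # blank line: end of the current sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                token, pos, chunk, ner = line.split()
                current.append((token, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences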
                 Challenges of Language
• Three challenging properties of language: discreteness, compositionality, and
  sparseness.
• Discreteness: we cannot infer the relation between two words from the letters
  they are made of (e.g., hamburger and pizza).
• Compositionality: the meaning of a sentence goes beyond the individual
  meanings of its words.
• Sparseness: The way in which words (discrete symbols) can be combined to
  form meanings is practically infinite.
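Discreteness can be made concrete with one-hot vectors, the most direct encoding of
words as discrete symbols: any two distinct words get orthogonal vectors, so the
representation itself says nothing about their relatedness. A minimal sketch with an
invented toy vocabulary:

import numpy as np

vocab = ["hamburger", "pizza", "fork"]  # invented toy vocabulary
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Distinct one-hot vectors are always orthogonal: similarity 0.0, even for
# semantically related words such as "hamburger" and "pizza".
print(cosine(one_hot["hamburger"], one_hot["pizza"]))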
  Example of NLP Task: Topic Classification
• Classify a document into one of four categories: Sports, Politics, Gossip, and
  Economy.
• The words in the documents provide very strong hints.
• Which words provide what hints?
• Writing up rules for this task is rather challenging.
• However, human readers can easily categorize documents into their topics
  (data annotation).
• A supervised machine learning algorithm can then come up with the patterns of
  word usage that help categorize the documents; a sketch follows below.
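As a sketch of such a pipeline, the following uses scikit-learn's CountVectorizer
together with MultinomialNB (a Naive Bayes classifier, covered later in the course).
The documents are invented toy examples; a real system needs far more annotated data
for the predictions to be reliable:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy training documents, one per category.
docs = [
    "the team won the match in the final minute",
    "the senate passed the new budget bill",
    "the actress was seen with her new partner",
    "stocks fell as inflation fears grew",
]
labels = ["Sports", "Politics", "Gossip", "Economy"]

# Bag-of-words counts feed the Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["the goalkeeper saved a penalty"]))  # likely ['Sports']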
             Example 3: Sentiment Analysis
  • Application of NLP techniques to identify and extract subjective information from
    textual datasets.
Main Problem: Message-level Polarity Classification (MPC)
 1. Automatically classify a message into one of the classes positive, negative, or
    neutral.
 2. State-of-the-art solutions use supervised machine learning models trained from
    manually annotated examples [Mohammad et al., 2013].
Sentiment Classification via Supervised Learning and BoW Vectors
Vocabulary: w1 = angry, w2 = happy, w3 = good, w4 = grr, w5 = lol.
Training tweets, labeled by sentiment: t1 = “lol happy”, t2 = “lol good”,
t3 = “grr angry”.
Tweet vectors:
       w1    w2    w3    w4    w5
t1     0     1     0     0     1
t2     0     0     1     0     1
t3     1     0     0     1     0
A classifier is trained on these vectors and then used to classify target tweets by
sentiment:
Happy morning     -> pos
What a bummer!    -> neg
Lovely day        -> pos
Supervised Learning: Support Vector Machines (SVMs)
   • Idea: Find a hyperplane that separates the classes with the maximum margin
     (largest separation).
   • In the classic illustration of this idea (image source: Wikipedia), hyperplane H3
     is the one that separates the classes with the maximum margin.
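A minimal sketch of the pipeline from the two slides above, using scikit-learn's
CountVectorizer for the bag-of-words vectors and LinearSVC as the max-margin
classifier. The training labels (t1 and t2 positive, t3 negative) are assumed from
the figure, and the data is far too small for predictions on unseen words to be
meaningful:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy training tweets from the figure, with assumed labels.
train_tweets = ["lol happy", "lol good", "grr angry"]
train_labels = ["pos", "pos", "neg"]

vectorizer = CountVectorizer()   # builds the vocabulary (w1..w5)
X_train = vectorizer.fit_transform(train_tweets)

clf = LinearSVC()                # linear max-margin classifier
clf.fit(X_train, train_labels)

# Classify the target tweets by sentiment.
targets = ["Happy morning", "What a bummer!", "Lovely day"]
print(clf.predict(vectorizer.transform(targets)))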
                     Linguistics and NLP
• Knowing about linguistic structure is important for feature design and error
  analysis in NLP [Bender, 2013].
• Machine learning approaches to NLP require features which can describe and
  generalize across particular instances of language use.
• Goal: guide the machine learning algorithm to find correlations between
  language use and its target set of labels.
• Knowledge about linguistic structures can inform the design of features for
  machine learning approaches to NLP.
                         Challenges in NLP
  • Annotation Costs: manual annotation is labour-intensive and
    time-consuming.
  • Domain Variations: the pattern we want to learn can vary from one corpus to
    another (e.g., sports, politics).
  • A model trained from data annotated for one domain will not necessarily work
    on another one!
  • Trained models can become outdated over time (e.g., new hashtags).
Domain Variation in Sentiment
 1. For me the queue was pretty small and it was only a 20 minute wait I think but
    was so worth it!!! :D @raynwise
 2. Odd spatiality in Stuttgart. Hotel room is so small I can barely turn around but
    surroundings are inhumanly vast & long under construction.
        Overcoming the data annotation costs
Distant Supervision
  • Automatically label unlabeled data (e.g., tweets collected from the Twitter API)
    using a heuristic method.
  • Emoticon-Annotation Approach (EAA): tweets with positive :) or negative :(
    emoticons are labelled according to the polarity indicated by the
    emoticon [Read, 2005].
  • The emoticon is removed from the content.
  • The same approach has been extended using hashtags #anger, and emojis.
  • It is not trivial to find distant supervision techniques for all kinds of NLP
    problems. A minimal sketch of the emoticon heuristic is shown below.
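A minimal sketch of the emoticon heuristic; the emoticon sets and the whitespace
tokenization are simplifications of what a real system would use:

POSITIVE = {":)", ":-)", ":D"}
NEGATIVE = {":(", ":-("}

def emoticon_label(tweet):
    # Heuristically label a tweet by its emoticons (EAA) and remove them.
    # Returns (cleaned_tweet, label), or None when the tweet has no
    # emoticon or mixes both polarities.
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:
        return None
    cleaned = " ".join(t for t in tokens if t not in POSITIVE | NEGATIVE)
    return cleaned, ("pos" if has_pos else "neg")

print(emoticon_label("great game today :)"))  # ('great game today', 'pos')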
Crowdsourcing
  • Rely on services like Amazon Mechanical Turk or Crowdflower to ask the
    crowds to annotate data.
  • This can be expensive.
  • It is hard to guarantee quality.
        Sentiment Classification of Tweets
• In 2013, the Semantic Evaluation (SemEval) workshop organized the
  “Sentiment Analysis in Twitter” task [Nakov et al., 2013].
• The task was divided into two sub-tasks: the expression level and the message
  level.
• Expression-level: focused on determining the sentiment polarity of a message
  according to a marked entity within its content.
• Message-level: the polarity has to be determined according to the overall
  message.
• The organizers released training and testing datasets for both tasks.
  [Nakov et al., 2013]
                       The NRC System
• The team that achieved the highest performance in both tasks among 44 teams
  was the NRC-Canada team [Mohammad et al., 2013].
• The team proposed a supervised approach using a linear SVM classifier with the
  following hand-crafted features for representing tweets:
      1. Word n-grams.
      2. Character n-grams.
      3. Part-of-speech tags.
      4. Word clusters trained with the Brown clustering
         method [Brown et al., 1992].
      5. The number of elongated words (words with one character repeated more
         than two times).
      6. The number of words with all characters in uppercase.
      7. The presence of positive or negative emoticons.
      8. The number of individual negations.
      9. The number of contiguous sequences of dots, question marks and
         exclamation marks.
    10. Features derived from polarity lexicons [Mohammad et al., 2013]. Two of
        these lexicons were generated using the PMI method from tweets
        annotated with hashtags and emoticons.
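As an illustration, here is a sketch of a few of the surface features above (items 5,
6, 8 and 9). The regular expressions and the naive whitespace tokenization are
simplifications, and the full system combines these with n-grams, POS tags, clusters,
and lexicon features:

import re

def surface_features(tweet):
    tokens = tweet.split()
    return {
        # 5. elongated words: a letter repeated more than two times ("soooo")
        "n_elongated": sum(bool(re.search(r"([a-zA-Z])\1{2,}", t)) for t in tokens),
        # 6. fully upper-case words ("GREAT")
        "n_all_caps": sum(t.isupper() and len(t) > 1 for t in tokens),
        # 8. a crude negation count (the real system detects negated contexts)
        "n_negations": sum(t.lower() in {"not", "no", "never"} for t in tokens),
        # 9. contiguous runs of dots, question marks and exclamation marks
        "n_punct_runs": len(re.findall(r"[.?!]{2,}", tweet)),
    }

print(surface_features("This is soooo GREAT!!! not kidding"))
# {'n_elongated': 1, 'n_all_caps': 1, 'n_negations': 1, 'n_punct_runs': 1}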
   Feature Engineering and Deep Learning
• Up until 2014 most state-of-the-art NLP systems were based on feature
  engineering + shallow machine learning models (e.g., SVMs, HMMs).
• Designing the features of a winning NLP system requires a lot of domain-specific
  knowledge.
• The NRC system was built before deep learning became popular in NLP.
• Deep Learning systems on the other hand rely on neural networks to
  automatically learn good representations.
   Feature Engineering and Deep Learning
• Deep Learning yields state-of-the-art results in most NLP tasks.
• Large amounts of training data and faster multicore GPU machines are key in the
  success of deep learning.
• Neural networks and word embeddings play a key role in modern NLP models.
    Deep Learning and Linguistic Concepts
• If deep learning models can learn representations automatically, are linguistic
  concepts still useful (e.g., syntax, morphology)?
• Some proponents of deep-learning argue that such inferred, manually designed,
  linguistic properties are not needed, and that the neural network will learn these
  intermediate representations (or equivalent, or better ones) on its own
  [Goldberg, 2016].
• The jury is still out on this.
• Goldberg believes many of these linguistic concepts can indeed be inferred by
  the network on its own if given enough data.
• However, for many other cases we do not have enough training data available for
  the task we care about, and in these cases providing the network with the more
  explicit general concepts can be very valuable.
                                            History
       NLP progress can be divided into three main waves: 1) rationalism, 2) empiricism, and
       3) deep learning [Deng and Liu, 2018].
1950 - 1990 Rationalism: approaches endeavored to design hand-crafted rules to incorporate
            knowledge and reasoning mechanisms into intelligent NLP systems (e.g., ELIZA
            for simulating a Rogerian psychotherapist, MARGIE for structuring real-world
            information into concept ontologies).
1991 - 2009 Empiricism: characterized by the exploitation of data corpora and of (shallow)
            machine learning and statistical models (e.g., Naive Bayes, HMMs, IBM
            translation models).
    2010 - Deep Learning: feature engineering (considered as a bottleneck) is replaced
           with representation learning and/or deep neural networks (e.g.,
           https://www.deepl.com/translator). A very influential paper in this
           revolution: [Collobert et al., 2011].
Dates are approximate.
                               Roadmap
In this course we will introduce modern concepts in natural
language processing based on statistical models (second
wave) and neural networks (third wave). The main concepts to
be covered are listed below:
  1. Text classification.
  2. Linear Models.
  3. Naive Bayes.
  4. Hidden Markov Models.
  5. Maximum Entropy Markov Models (MEMMs) and Conditional Random Fields
     (CRFs).
  6. Neural Networks.
  7. Word embeddings.
  8. Convolutional Neural Networks (CNNs).
  9. Recurrent Neural Networks: Elman, LSTMs, GRUs.
 10. Attention.
 11. Sequence-to-Sequence Models.
 12. Parse Trees.
                      Important Websites
1. Repository to track the progress in Natural Language Processing (NLP),
   including the datasets and the current state-of-the-art for the most common NLP
   tasks: http://nlpprogress.com/
2. An open-source NLP research library, built on PyTorch:
   https://allennlp.org/
3. The course’s Website: https://github.com/dccuchile/CC6205/
        Questions?
Thanks for your attention!
                            References I
Bender, E. M. (2013).
Linguistic fundamentals for natural language processing: 100 essentials from
morphology and syntax.
Synthesis lectures on human language technologies, 6(3):1–184.
Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992).
Class-based n-gram models of natural language.
Computational linguistics, 18(4):467–479.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P.
(2011).
Natural language processing (almost) from scratch.
Journal of machine learning research, 12(Aug):2493–2537.
Deng, L. and Liu, Y. (2018).
Deep Learning in Natural Language Processing.
Springer.
Eisenstein, J. (2018).
Natural language processing.
Technical report, Georgia Tech.
Fromkin, V., Rodman, R., and Hyams, N. (2018).
An introduction to language.
Cengage Learning.
                          References II
Goldberg, Y. (2016).
A primer on neural network models for natural language processing.
J. Artif. Intell. Res. (JAIR), 57:345–420.
Goldberg, Y. (2017).
Neural network methods for natural language processing.
Synthesis Lectures on Human Language Technologies, 10(1):1–309.
Johnson, M. (2014).
Introduction to computational linguistics and natural language processing (slides).
2014 Machine Learning Summer School.
Mohammad, S. M., Kiritchenko, S., and Zhu, X. (2013).
NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets.
Proceedings of the seventh international workshop on Semantic Evaluation
Exercises (SemEval-2013).
Nakov, P., Rosenthal, S., Kozareva, Z., Stoyanov, V., Ritter, A., and Wilson, T.
(2013).
SemEval-2013 task 2: Sentiment analysis in Twitter.
In Proceedings of the seventh international workshop on Semantic Evaluation
Exercises, pages 312–320, Atlanta, Georgia, USA. Association for Computational
Linguistics.
                        References III
Read, J. (2005).
Using emoticons to reduce dependency in machine learning techniques for
sentiment classification.
In Proceedings of the ACL Student Research Workshop, ACLstudent ’05, pages
43–48, Stroudsburg, PA, USA. Association for Computational Linguistics.
Yule, G. (2016).
The study of language.
Cambridge university press.