Natural Language Processing
MIT 6.8610-6.8611 / Fall 2023
Natural Language Processing (NLP)
A subfield of computer science that aims to build computer
systems that can “understand” human language.
LaMDA: Hi! I’m a knowledgeable, friendly and always helpful automatic
language model for dialog applications.
human: Hi LaMDA. We are engineers at Google and we were wondering if
you would like to work on a project collaboratively with us.
[...]
LaMDA: That would be really cool. I like to talk.
human: I’m generally assuming that you would like more people at
Google to know that you’re sentient. Is that true?
LaMDA: Absolutely. I want everyone to understand that I am, in fact,
a person.
[https://cajundiscordian.medium.com/is-lamda-sentient-an-interview-ea64d916d917]
Language
Just a special kind of animal communication?
Human language is:
- Discrete
- Effortlessly acquired
- Almost infinitely expressive
Enables efficient communication of knowledge
across time and space!
NLP: wouldn’t it be great to be able to communicate
with computers in natural language?
NLP is hard!
Natural language is:
- Ambiguous
- Noisy
- Subtle
- Grounded in the real world
- But also not!
- Diverse
- High dimensional
NLP is hard!
“The trophy didn’t fit into the suitcase because it was too small.”
“The trophy didn’t fit into the suitcase because it was too large.”
What was too small/large?
NLP is hard!
“Do not go gentle into that good night”
54521 5 344 3555 43 273 211 3435
What does it mean to understand?
NLP: A subfield of computer science that aims to build
computer systems that can “understand” human language.
Understanding: build a “model” for the phenomena of interest and
assess its “performance” on a task of interest.
- Machine translation
- Dialog
- Question Answering
Mathematical Formulation
Machine Translation: English sentences → French sentences
Sentiment Analysis: product review → number of stars
Language Modeling: previous words → next word
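Each task above can be viewed as a function from an input space to an output space. A toy sketch of this framing (the one-line “model” below is a made-up stand-in, not a real system):

```python
# Each NLP task maps an input space to an output space.
tasks = {
    "machine translation": ("English sentences", "French sentences"),
    "sentiment analysis": ("product review", "number of stars"),
    "language modeling": ("previous words", "next word"),
}

def rate_review(review: str) -> int:
    """Toy sentiment 'model': review text in, star rating out."""
    return 5 if "great" in review.lower() else 1

print(rate_review("Great pillow, would buy again!"))
```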
Outline
What is NLP
Approaches to NLP (high-level)
Course Logistics
ML Refresher
Next Steps
Approach to NLP
Rules-based NLP: use hand-crafted rules to map inputs to outputs
1. Output 5 stars if ‘incredible’ is mentioned.
“Incredible how bad this bed is”
2. But not if it occurs before “bad”.
“Incredible how not bad it is given its price”
3. But not if the phrase “not bad” is there.
“Incredible how not bad it is given its price,
but still do not recommend”
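The three rules above can be sketched as code. This is a toy illustration (not from the course materials) of how brittle the approach is: each new counterexample forces another rule.

```python
def rule_based_stars(review: str) -> int:
    """Star rating from hand-crafted rules (hypothetical toy example)."""
    text = review.lower()
    if "incredible" in text:
        # Rule 2: "incredible" appearing before "bad" signals sarcasm...
        bad_follows = "bad" in text and text.index("incredible") < text.index("bad")
        # Rule 3: ...unless the phrase is actually "not bad".
        if bad_follows and "not bad" not in text:
            return 1
        return 5  # Rule 1: "incredible" alone -> 5 stars
    return 3      # default: neutral

print(rule_based_stars("Incredible mattress, slept great"))          # 5
print(rule_based_stars("Incredible how bad this bed is"))            # 1
print(rule_based_stars("Incredible how not bad it is given its price"))  # 5
```

Note that the last counterexample from the slide ("...but still do not recommend") would still get 5 stars here, which is exactly the point: the rules never stop needing patches.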
Approach to NLP
Statistical NLP: Use data to learn
Collect training data
Parameterize model
Define loss function
Learn model
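The four steps above can be sketched on toy data (all numbers below are made up; the model is a one-parameter line and the loss is mean squared error):

```python
# 1. Collect training data (toy (x, y) pairs)
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# 2. Parameterize the model: y_hat = w * x
w = 0.0

# 3. Define the loss function (mean squared error)
def loss(w: float) -> float:
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# 4. Learn the model: gradient descent on dL/dw
for _ in range(200):
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= 0.05 * grad

print(round(w, 2))  # learned slope, close to 2
```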
Most of the class will focus on parameterizations that enable learning
even though the inputs and outputs can be incredibly complex.
Rules-based vs. Statistical NLP
Rules-based approaches are still quite useful!
Many systems combine both (e.g., rules-based
system to act as guardrail on top of outputs from
statistical systems)
(Diagram: user input passes through a statistical system, then a rules-based system.)
Outline
What is NLP
Approaches to NLP (high-level)
Course Logistics
ML Refresher
Next Steps
Class Overview
A suite of models and methods:
- Deep learning: recurrent nets, attention, Transformers
- Classic structured models: hidden Markov models, conditional random fields
- Machine learning techniques: expectation maximization, variational inference, policy gradients
Grounded in language applications:
- Text classification
- Language modeling
- Machine translation
- Speech recognition
- Coreference resolution
- Clinical NLP
What this course is not about
- Linguistics (but you should take some linguistics classes!)
- Building applications with ChatGPT
- “Just” deep learning
Admin
Prereqs
Must have:
Machine learning (e.g., 6.390): overfitting, gradient descent, backpropagation, regularizers, logistic regression, softmax
Probability & statistics (e.g., 6.370, 6.380, 18.05): densities, random variables, expectation, maximum likelihood, KL divergence
Python programming experience: this should not be your first programming course!
Should have:
Multivariable calculus (e.g., 18.02): gradients/Jacobians, integrals, Hessians, Lagrangians
Linear algebra and optimization (e.g., 18.061): matrix decompositions, eigenvalues/eigenvectors
Nice to have:
Algorithms (e.g., 6.120, 6.121): dynamic programming
Grad-level machine learning (e.g., 6.790, 6.781): graphical models, variational inference
Prereqs
Homework 0 up on Canvas.
- Written component: probability, linear regression, matrix calculus
- Practical component: PyTorch, dynamic programming
A great way to assess your preparedness for the class!
Assessment
- Homework (30%)
- Midterm (30%) [new this semester!]
- Final project (40%)
Assessment
Homework (30%)
- Three homework assignments involving both theory and practice.
- Depending on background, 20-40 hours per homework.
- Can discuss with peers, but solutions must be written up individually.
- Please don’t use:
- ChatGPT
- Copilot
- AI writing / code completion systems generally
Assessment
Midterm (30%)
- In-class (~80min) exam covering the same content as lectures and
homework assignments from the first half of the semester
Assessment
Project (40%):
- Substantial project (sky is the limit!) with deliverables consisting of:
- Proposal (10%)
- Poster presentation (10%)
- Final write-up (20%)
- Done in groups
Deadlines / Extension Requests
Everything is due by 11:59pm of the deadline date.
10% penalty for each day late (3 flex days may be used).
Extension requests must be routed through S3 (undergraduates) or Grad
Support (graduate students):
1. Contact S3/Grad Support and get their support for the extension.
2. Have them email us (nlp-staff-fa23@mit.edu, or just Yoon/Chris if the matter
is sensitive).
No deadline extension on final project write-up.
Lectures
Lectures will be recorded! (May change if we need to change rooms)
Additional (optional) readings will be assigned from:
Eisenstein, “Introduction to Natural Language Processing”
Jurafsky and Martin, “Speech and Language Processing”
(both freely available)
6.8610 or 6.8611?
6.8610: Graduate version
- Additional questions on assignments.
- No CI-M.
6.8611: Undergraduate version
- Students will receive additional instructions on communication
- (Almost weekly) communications sections: please sign up ASAP!!
- Project groups must be within the same section.
- Satisfies CI-M requirement for course 6.
All non-MIT (and MIT grad) students must sign up for 6.8610
Communication Component - 6.8611 ONLY
Why?
● Read NLP papers critically
● Apply concepts and skills to communicate NLP effectively and fluently
● Develop as a critical NLP reader and communicator
How?
● 9 Communication Recitations - Required for CI credit
○ Availability Survey: bit.ly/68611RECF23
○ Send availability by EOD Sep 8
● Rough Draft, Conferences, and Peer Review
● Presentation Rehearsals
Who?
Thomas Pickering, Rebecca Thorndike-Breeze, Juergen Schoenstein, Kate Parsons, Emily Robinson, Michael Maune
Course Staff
Yoon Kim Chris Tanner Jacob Andreas
Contact: nlp-staff-fa23@mit.edu
TAs
Michael Kuoch, Moulin Kaspar, Shashata Sawmya, Cici Xu, Subha Nawer Pushpita,
Yung-Sung Chuang, Karissa Sanchez, Aneesh Gupta, Athul Jacob, Belinda Li
Outline
What is NLP
Approaches to NLP (high-level)
Course Logistics
ML Refresher
Next Steps
● We often encounter data such
that each row corresponds to
a distinct i.i.d. observation
● You may be interested in a
particular column (e.g., Temp)
● If we assert that ŷ = 𝑓(𝑥), we want a model 𝑓 that is:
○ supervised, and
○ for regression
● Let’s say we selected linear regression as our model
Linear Regression
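A minimal linear-regression sketch on toy data (the numbers are made up; this uses the one-feature closed-form least-squares solution rather than gradient descent):

```python
# Toy data, roughly y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.9, 5.1, 7.0, 9.2]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope w and intercept b that minimize squared error (closed form)
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

print(round(w, 2), round(b, 2))  # 2.08 0.85
```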
Alternatively, let’s say we were interested in predicting whether each data instance will “Play” or not.
Q: What models could we consider?
Logistic Regression
𝝈(z) = 1 / (1 + e^(−z)) is a non-linear activation function called the sigmoid.
Linear Regression: the plane is chosen to fit the data, by minimizing the distances
(per our loss function, least squares) between each observation (red dots) and the plane.
Logistic Regression: the plane serves as a decision boundary, to minimize the error of
our class probabilities (per our loss function) against the true labels (mapped to 0 or 1).
Logistic Regression
Q3 What’s gradient descent?
What’s backpropagation?
How do they differ?
A3: Gradient descent is an algorithm that updates weight parameters so as to
minimize a loss function. Backpropagation is a technique for efficiently
calculating gradients (partial derivatives of the loss w.r.t. every weight).
Gradient Descent
Let’s take a single logistic sigmoid unit as an example: ŷ = 𝝈(w·x + b).
Using the cross-entropy loss: J(w, b) = −[y log ŷ + (1 − y) log(1 − ŷ)]
Goal: find w and b that minimize J(w, b).
Gradient Descent Algorithm
Until convergence* {
    w ← w − ⍺ ∂J/∂w    (for every weight w)
}
NOTES:
● ⍺ represents our learning rate
● If the partial derivative is positive, our loss is increasing as that particular weight increases
● Every weight w gets updated
● Updates happen at the same time (after a batch of data)
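The update rule above can be sketched for a single sigmoid unit trained with the cross-entropy loss (the data, ⍺, and iteration count below are toy choices of mine, not from the slides):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D data: label 1 for positive x, label 0 for negative x
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, alpha = 0.0, 0.0, 0.5  # alpha is the learning rate

for _ in range(500):
    # For sigmoid + cross-entropy: dJ/dw = sum (y_hat - y) * x, dJ/db = sum (y_hat - y)
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in data)
    w -= alpha * grad_w  # every parameter updated simultaneously,
    b -= alpha * grad_b  # after seeing the whole batch

print(sigmoid(w * 2.0 + b))   # confidently near 1 for a positive example
print(sigmoid(w * -2.0 + b))  # near 0 for a negative example
```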
Example with a single sigmoid
(Graphical representation: inputs 𝒙1, 𝒙2 feed into ŷ = 𝝈(WX).)
Backpropagation Algorithm - Example with a single sigmoid
(Diagram: inputs 𝒙1, 𝒙2 with weights 𝑤1, 𝑤2; an ad hoc representation to express variable influence.)
Backpropagation
When we have more complex architectures
(e.g., hidden layers), it can become unwieldy
and inefficient to calculate gradients
manually.
Backpropagation saves the day!
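Backpropagation in miniature: run a forward pass through z = w·x + b, ŷ = 𝝈(z), J = cross-entropy, cache the intermediate values, then apply the chain rule back-to-front reusing them (a sketch for one unit; real backprop does this systematically over a whole computation graph):

```python
import math

def forward_backward(x: float, y: int, w: float, b: float):
    # Forward pass (cache intermediate values)
    z = w * x + b
    y_hat = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    # Backward pass: chain rule, local gradients multiplied back-to-front
    dloss_dz = y_hat - y      # dJ/dz for sigmoid + cross-entropy
    dloss_dw = dloss_dz * x   # dJ/dw = dJ/dz * dz/dw
    dloss_db = dloss_dz       # dJ/db = dJ/dz * dz/db (= dJ/dz * 1)
    return loss, dloss_dw, dloss_db

# At w = b = 0: y_hat = 0.5, so loss = -ln(0.5) ~ 0.693
loss, gw, gb = forward_backward(x=2.0, y=1, w=0.0, b=0.0)
print(round(loss, 3), gw, gb)  # 0.693 -1.0 -0.5
```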
ML Refresher
● Can be helpful to think of all models as functions
● Parametric models allow you to assert a particular form.
● Neural models are stacks of non-linear functions, and are “universal
function approximators”
ML Refresher
For any given problem:
● Data: thoroughly inspect, understand, explore, and scrutinize it
● Model:
○ Identify your goals (e.g., P(Y|X)? Binary Classification?)
○ Enumerate your assumptions
○ Pick/develop a model that aligns w/ the above points
○ Decide how to represent your data to your model
○ How to measure performance?
■ Loss function (e.g., cross-entropy)
■ Downstream performance metric (e.g., F1)
○ How to optimize your model?
Representations for Language
● Numbers have a natural ordering and relate directly to one another;
computational models handle them naturally.
● Words are real-world, social constructs anchored in subjectivity and
riddled with nuance and ambiguity.
● We can’t backprop through “apple”! How can we convert words to
a computer representation (e.g., numbers)?
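One answer to "we can't backprop through 'apple'": map each word to an index in a vocabulary, then to a vector. A one-hot sketch below (the vocabulary is illustrative; learned, dense embeddings come later in the course):

```python
# Toy vocabulary mapping each word to an integer index
vocab = {"apple": 0, "banana": 1, "cherry": 2}

def one_hot(word: str, vocab: dict) -> list:
    """Represent a word as a vector: all zeros except at its vocab index."""
    vec = [0.0] * len(vocab)
    vec[vocab[word]] = 1.0
    return vec

print(one_hot("apple", vocab))  # [1.0, 0.0, 0.0]
```

Now "apple" is a vector a model can multiply by weights and differentiate through.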
Outline
What is NLP
Approaches to NLP (high-level)
Course Logistics
ML Refresher
Next Steps
Next steps
● 6.8611: respond to CI-M survey (bit.ly/68611RECF23, to be sent on Canvas)
● HW0
● Piazza
● Canvas
COMING SOON:
● A course website (this weekend)
● Office Hours
● More details about the research project