
Natural Language Processing

MIT 6.8610-6.8611 / Fall 2023


Natural Language Processing (NLP)

A subfield of computer science that aims to build computer systems that can “understand” human language.
LaMDA: Hi! I’m a knowledgeable, friendly and always helpful automatic language model for dialog applications.

human: Hi LaMDA. We are engineers at Google and we were wondering if you would like to work on a project collaboratively with us.

[...]

LaMDA: That would be really cool. I like to talk.

human: I’m generally assuming that you would like more people at Google to know that you’re sentient. Is that true?

LaMDA: Absolutely. I want everyone to understand that I am, in fact, a person.

[https://cajundiscordian.medium.com/is-lamda-sentient-an-interview-ea64d916d917]
Language

Just a special kind of animal communication?


Human language is:

- Discrete
- Effortlessly acquired
- Almost infinitely expressive

Enables efficient communication of knowledge across time and space!

NLP: wouldn’t it be great to be able to communicate with computers in natural language?
NLP is hard!

Natural language is:

- Ambiguous
- Noisy
- Subtle
- Grounded in the real world
- But also not!
- Diverse
- High dimensional
NLP is hard!

Natural language is ambiguous:

“The trophy didn’t fit into the suitcase because it was too small.”
“The trophy didn’t fit into the suitcase because it was too large.”

What was too small/large?
NLP is hard!

Natural language is:

- Ambiguous
- Noisy
- Subtle
- Grounded
- Diverse
- High-dimensional

“Do not go gentle into that good night”
54521 5 344 3555 43 273 211 3435
What does it mean to understand?

NLP: A subfield of computer science that aims to build computer systems that can “understand” human language.

Understanding: build a “model” for phenomena of interest and assess its “performance” on a task of interest.

- Machine translation
- Dialog
- Question Answering
Mathematical Formulation

Task                  Input               Output
Machine Translation   English sentences   French sentences
Sentiment Analysis    Product review      Number of stars
Language Modeling     Previous words      Next word
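Each row of the table is a function from inputs to outputs; only the types change from task to task. A minimal sketch to make that concrete (the implementations are toy, hypothetical stand-ins, not course code):

```python
# Every NLP task is a mapping f: X -> Y; only the types of X and Y differ.
# These bodies are deliberately trivial placeholders.

def sentiment_analysis(review: str) -> int:
    """Product review -> number of stars (toy keyword rule)."""
    return 5 if "incredible" in review.lower() else 1

def language_model(previous_words: list) -> str:
    """Previous words -> next word (toy: always predicts 'the')."""
    return "the"

print(sentiment_analysis("Incredible mattress"))  # 5
print(language_model(["I", "saw"]))               # the
```

The rest of the lecture is about replacing such hand-written bodies with functions learned from data.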
Outline
What is NLP

Approaches to NLP (high-level)

Course Logistics

ML Refresher

Next Steps
Approach to NLP

Rules-based NLP: use hand-crafted rules

1. Output 5 stars if ‘incredible’ is mentioned.
   “Incredible how bad this bed is”

2. But not if it occurs before “bad”.
   “Incredible how not bad it is given its price”

3. But not if the phrase “not bad” is there.
   “Incredible how not bad it is given its price, but still do not recommend”
Approach to NLP

Statistical NLP: use data to learn

- Collect training data
- Parameterize the model
- Define a loss function
- Learn the model

Most of the class will focus on parameterizations that enable learning despite the fact that inputs and outputs can be incredibly complex.
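The four steps can be sketched on a toy problem (hypothetical data and feature; a closed-form least-squares fit stands in for iterative learning):

```python
# Statistical NLP in four steps, on a toy problem: predict a star
# rating from one hypothetical feature (count of positive words).

# 1. Collect training data: (positive-word count, star rating) pairs.
data = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0)]

# 2. Parameterize the model: assume rating = a * count + b.
# 3. Define a loss: squared error, minimized here in closed form.
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in data) / \
    sum((x - mean_x) ** 2 for x, _ in data)
b = mean_y - a * mean_x

# 4. Learn the model (done above) and use it on new inputs.
predict = lambda count: a * count + b
print(round(predict(2), 2))  # 3.0
```

Real NLP models differ only in scale: the data, parameterization, and loss become far richer, but the recipe is the same.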
Rules-based vs. Statistical NLP

Rules-based approaches are still quite useful!

Many systems combine both (e.g., a rules-based system can act as a guardrail on top of outputs from statistical systems):

User Input → Statistical System → Rules-based System
Outline
What is NLP

Approaches to NLP (high-level)

Course Logistics

ML Refresher

Next Steps
Class Overview

A suite of models and methods:
- Deep learning: recurrent nets, attention, Transformers
- Classic structured models: hidden Markov models, conditional random fields
- Machine learning techniques: expectation maximization, variational inference, policy gradients

Grounded in language applications:
- Text classification
- Language modeling
- Machine translation
- Speech recognition
- Coreference resolution
- Clinical NLP
What this course is not about

- Linguistics (but you should take some linguistics classes!)

- Building applications with ChatGPT

- “Just” deep learning


Admin
Prereqs

Must have:
Machine learning (e.g., 6.390): overfitting, gradient descent, backpropagation, regularizers, logistic regression, softmax
Probability & statistics (e.g., 6.370, 6.380, 18.05): densities, random variables, expectation, maximum likelihood, KL divergence
Python programming experience: this should not be your first programming course!

Should have:
Multivariable calculus (e.g., 18.02): gradients/Jacobians, integrals, Hessians, Lagrangians
Linear algebra and optimization (e.g., 18.061): matrix decomposition, eigenvalues/eigenvectors

Nice to have:
Algorithms (e.g., 6.120, 6.121): dynamic programming
Grad-level machine learning (e.g., 6.790, 6.781): graphical models, variational inference
Prereqs

Homework 0 is up on Canvas.

- Written component: probability, linear regression, matrix calculus
- Practical component: PyTorch, dynamic programming

A great way to assess preparedness for the class!
Assessment

- Homework (30%)
- Midterm (30%) [new this semester!]
- Final project (40%)
Assessment

Homework (30%)

- Three homework assignments involving both theory and practice.
- Depending on background, 20-40 hours per homework.
- Can discuss with peers, but solutions must be written up individually.
- Please don’t use:
  - ChatGPT
  - Copilot
  - AI writing / code completion systems generally
Assessment

Midterm (30%)

- In-class (~80 min) exam covering the same content as lectures and homework assignments from the first half of the semester.
Assessment

Project (40%)

- Substantial project (sky is the limit!) with deliverables consisting of:
  - Proposal (10%)
  - Poster presentation (10%)
  - Final write-up (20%)
- Done in groups
Deadlines / Extension Requests

Everything is due by 11:59pm of the deadline date.

10% penalty for each day late (3 flex days to be used).

Extension requests must be routed through S3 (undergraduates) or Grad Support (graduate students):
1. Contact S3/Grad Support and get their support for the extension.
2. Have them email us (nlp-staff-fa23@mit.edu, or just Yoon/Chris if the matter is sensitive).

No deadline extension on the final project write-up.
Lectures

Lectures will be recorded! (May change if we need to change rooms.)

Additional (optional) readings will be assigned from:

Eisenstein, “Introduction to Natural Language Processing”
Jurafsky and Martin, “Speech and Language Processing”

(both freely available)
6.8610 or 6.8611?

6.8610: Graduate version
- Additional questions on assignments.
- No CI-M.

6.8611: Undergraduate version
- Students will receive additional instructions on communication.
- (Almost weekly) communication sections: please sign up ASAP!!
- Project groups must be within the same section.
- Satisfies CI-M requirement for course 6.

All non-MIT (and MIT grad) students must sign up for 6.8610.
Communication Component - 6.8611 ONLY

Why?
● Read NLP papers critically
● Apply concepts and skills to communicate NLP effectively and fluently
● Develop as an NLP critical reader and communicator

How?
● 9 Communication Recitations - required for CI credit
  ○ Availability Survey: bit.ly/68611RECF23
  ○ Send availability by EOD Sep 8
● Rough draft, conferences, and peer review
● Presentation rehearsals

Who?
Michael Maune, Juergen Schoenstein, Rebecca Thorndike-Breeze, Kate Parsons, Thomas Pickering, Emily Robinson
Course Staff

Yoon Kim Chris Tanner Jacob Andreas

Contact: nlp-staff-fa23@mit.edu
TAs

Michael Kuoch, Moulin Kaspar, Shashata Sawmya, Cici Xu, Subha Nawer Pushpita, Yung-Sung Chuang, Karissa Sanchez, Aneesh Gupta, Athul Jacob, Belinda Li

Outline
What is NLP

Approaches to NLP (high-level)

Course Logistics

ML Refresher

Next Steps
● We often encounter data such that each row corresponds to a distinct i.i.d. observation

● You may be interested in a particular column (e.g., Temp)

● If we assert that the column of interest is a function of the other columns, we want a model 𝑓 that is:
  ○ supervised, and
  ○ for regression
● Let’s say we selected Linear Regression as our model

Linear Regression
Alternatively, let’s say we were interested in predicting if each data instance will “Play” or not.

Q: What models could we consider?
Logistic Regression

This is a non-linear activation function called a sigmoid.
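The sigmoid 𝝈(z) = 1 / (1 + e^(−z)) squashes any real-valued score into (0, 1), which is what lets logistic regression output a class probability. A minimal sketch:

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))             # 0.5  (a score of 0 means "no idea")
print(round(sigmoid(4.0), 3))   # 0.982 (large positive score -> near 1)
print(round(sigmoid(-4.0), 3))  # 0.018 (large negative score -> near 0)
```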
Linear Regression vs. Logistic Regression

Linear regression: the plane is chosen to fit the data, by minimizing the distances (per our loss function, least squares) between each observation (red dots) and the plane.

Logistic regression: the plane serves as a decision boundary, to minimize the error of our class probabilities (per our loss function) against the true labels (mapped to 0 or 1).
Q3: What’s gradient descent? What’s backpropagation? How do they differ?

A3: Gradient Descent is an algorithm that updates weight parameters so as to minimize a loss function. Backpropagation is a technique for efficiently calculating gradients (partial derivatives of the loss w.r.t. every weight).
Gradient Descent

Let’s take a single logistic sigmoid function as an example: ŷ = 𝝈(w·x + b)

Using the cross-entropy loss:

J(w,b) = −(1/N) Σᵢ [ yᵢ log ŷᵢ + (1 − yᵢ) log(1 − ŷᵢ) ]

Goal: find w and b that minimize J(w,b)

Gradient Descent Algorithm

Until convergence* {
    w ← w − ⍺ ∂J/∂w
    b ← b − ⍺ ∂J/∂b
}

NOTES:
● ⍺ represents our learning rate
● If the partial derivative is positive, that means our loss is increasing as that particular weight increases
● Every weight w gets updated
● Updates happen at the same time (after a batch of data)
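The update loop above can be sketched in plain Python for a one-feature sigmoid unit (toy, hypothetical dataset; a fixed iteration budget stands in for the convergence check):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-feature dataset: label is 1 when x > 0.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w, b, alpha = 0.0, 0.0, 0.5  # initial parameters and learning rate

def loss(w, b):
    # Cross-entropy J(w, b), averaged over the dataset.
    return -sum(y * math.log(sigmoid(w * x + b)) +
                (1 - y) * math.log(1 - sigmoid(w * x + b))
                for x, y in data) / len(data)

initial = loss(w, b)
for _ in range(200):  # "until convergence" approximated by a fixed budget
    # Partial derivatives of J; for sigmoid + cross-entropy, dJ/dz = y_hat - y.
    dw = sum((sigmoid(w * x + b) - y) * x for x, y in data) / len(data)
    db = sum((sigmoid(w * x + b) - y) for x, y in data) / len(data)
    w, b = w - alpha * dw, b - alpha * db  # simultaneous update

print(loss(w, b) < initial)  # True: the loss has decreased
```

With only one weight the gradients are cheap to write by hand; backpropagation (next) is what makes this tractable for deep networks.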
Graphical Representation: example with a single sigmoid

𝒙1, 𝒙2 → ŷ = 𝝈(WX)
Backpropagation Algorithm: example with a single sigmoid

𝒙1 and 𝒙2 feed into ŷ through weights 𝑤1 and 𝑤2 (an ad hoc representation to express variable influence).
Backpropagation

When we have more complex architectures (e.g., hidden layers), it can become unwieldy and inefficient to calculate gradients manually.

Backpropagation saves the day!
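For the single-sigmoid example, the backward pass is just the chain rule applied from the loss back to the weights. A sketch (hypothetical numbers) that checks the analytic gradient against a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass for y_hat = sigmoid(w1*x1 + w2*x2).
x1, x2 = 1.0, 2.0   # inputs (arbitrary example values)
w1, w2 = 0.3, -0.2  # weights (arbitrary example values)
y = 1               # true label

z = w1 * x1 + w2 * x2
y_hat = sigmoid(z)

# Backward pass via the chain rule: dL/dw1 = dL/dz * dz/dw1.
# For sigmoid + cross-entropy, the local gradients collapse to dL/dz = y_hat - y.
dz = y_hat - y
dw1, dw2 = dz * x1, dz * x2

# Sanity check against a numerical (finite-difference) gradient.
def loss(a, b):
    p = sigmoid(a * x1 + b * x2)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

eps = 1e-6
numeric_dw1 = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)
print(abs(dw1 - numeric_dw1) < 1e-6)  # True: chain rule matches
```

In a deep network the same idea repeats layer by layer, reusing each intermediate gradient instead of recomputing it, which is what makes backpropagation efficient.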


ML Refresher

● It can be helpful to think of all models as functions
● Parametric models allow you to assert a particular form
● Neural models are stacks of non-linear functions, and are “universal function approximators”
ML Refresher
For any given problem:
● Data: thoroughly inspect, understand, explore, and scrutinize it
● Model:
○ Identify your goals (e.g., P(Y|X)? Binary Classification?)
○ Enumerate your assumptions
○ Pick/develop a model that aligns w/ the above points
○ Decide how to represent your data to your model
○ How to measure performance?
■ Loss function (e.g., cross-entropy)
■ Downstream performance metric (e.g., F1)
○ How to optimize your model?
Representations for Language

● Numbers have a natural ordering and direct correlation with each other. Computational models organically handle them.

● Words are real-world, social constructs anchored in subjectivity and riddled with nuance and ambiguity.

● We can’t backprop through “apple”! How can we convert words to a computer representation (e.g., numbers)?
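A minimal sketch of one classic answer (toy vocabulary): map each word to an integer index, then to a one-hot vector. Learned dense embeddings, covered later in the class, replace these sparse vectors.

```python
# Map each word in a toy vocabulary to an integer index, then to a
# one-hot vector: the simplest way to hand words to a numeric model.

corpus = "the cat sat on the mat".split()
vocab = {word: i for i, word in enumerate(sorted(set(corpus)))}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

print(vocab["cat"])     # 0
print(one_hot("cat"))   # [1, 0, 0, 0, 0]
```

Note the limitation this exposes: every pair of one-hot vectors is equally distant, so “cat” is no closer to “mat” than to “the”; fixing that is one motivation for learned representations.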
Outline
What is NLP

Approaches to NLP (high-level)

Course Logistics

ML Refresher

Next Steps
Next steps
● 6.8611: respond to CI-M survey (bit.ly/68611RECF23, to be sent on Canvas)
● HW0
● Piazza
● Canvas

COMING SOON:
● A course website (this weekend)
● Office Hours
● More details about the research project
