
Lecture 1 Intro To NLP

The course on Natural Language Processing (NLP) aims to familiarize students with common NLP techniques, focusing on methodology and application, particularly in finance. It includes four homeworks, three quizzes, and a final project, with a grading breakdown of 50% homework, 15% quizzes, 30% project, and 5% attendance. The course will cover various NLP tasks, challenges, and applications, emphasizing practical aspects and Python tool-kits for implementation.


Lecture 1: Introduction to Natural Language Processing

Ron Yurko
NLP: 46-924

August 28, 2023

Goals of this course

▶ Become familiar with common NLP techniques with a focus
on methodology and application
▶ Understand and construct different representations of text
data that are relevant across a variety of problems
▶ Gain experience in implementing and evaluating NLP
techniques that are leveraged in machine learning and deep
learning methods
▶ Become comfortable with Python toolkits for implementing
NLP techniques in finance problems

This course: practical aspects

▶ Four homeworks, due on Wednesday evenings starting in week 2
▶ Final homework is due Wednesday, Sept 27th

▶ There will be three quizzes instead of a final exam
▶ Mix of true/false and multiple-choice questions on Gradescope,
testing concepts

▶ You will submit a final project, due Friday, Oct 13th
▶ Similar in style to the ML2 project; you’ll start this with HW3
(because I’m so generous...)
▶ Project description is available on Canvas.
▶ Submit your groups by Friday, Sept 8th on Gradescope

▶ Grades: 50% HW, 15% quizzes, 30% project, 5% attendance

Homeworks

▶ Homeworks will primarily be applications of methods to real
data examples. They help us understand the benefits and
limitations of working with text.

▶ Collaboration: you can talk about homeworks and help each other
learn. But you must:
▶ Write your own code: making debugging suggestions is okay,
but writing one version of the code together is not.
▶ Regarding ChatGPT: You cannot simply ask ChatGPT to do
the homework for you.
▶ You should be able to explain anything that you submit.

▶ This is a brand new course that I just built from scratch...

Discussion

▶ There is a Piazza discussion board through Canvas. Use
Piazza instead of email for any questions about the
material or homework.
▶ This way there is a fair distribution of information, and you
can also help each other learn.
▶ I also tend to be faster on Piazza than over email. However, I
make no promises with either very close to homework
deadlines...

Subject matter

▶ This is an NLP class first, but I will try to only use finance
data examples for all of the considered applications. I may
mention non-finance data if I think it helps illustrate an
interesting concept, but all demos and homework problems
will be with finance examples.

▶ This is NOT a web-scraping course. This is an ML course
about NLP methods: what they’re used for, when they work, and
when they break.

▶ All programming in this course will be in Python. The majority of
lectures will be accompanied by demos that I will walk
through during class. Jupyter notebooks will be available on
Canvas before every lecture.

Announcements

▶ There will be an announcement slide (like this!) at the
beginning of each lecture. You are responsible for being
aware of this content, even if you miss class.

▶ Homework 1 will be posted Wednesday morning and is due
on Wednesday, September 6th

▶ My office hours are 8:30 to 9:30 AM ET over Zoom

▶ I will be in NYC on Wednesday Aug 30th

Goals for today

▶ Introduce natural language processing (NLP)

▶ Discuss examples of NLP in finance

▶ Discuss text as data

NLP in the real world

▶ Natural language processing (NLP): methods to analyze,
model, and understand human language

▶ Wide range of NLP applications you’ve likely come across
▶ Email platforms
▶ Voice-based assistants
▶ Search engines
▶ Machine translation services
▶ Social media feed analysis
▶ Jeopardy!
▶ ChatGPT...

Common NLP Tasks

▶ Spell Checking
▶ Keyword-Based Information Retrieval
▶ Text Classification - bucketing text into categories based on
content
▶ Topic Modeling - uncover topical structure of documents
▶ Text Summarization - create short summaries of longer
documents while retaining core content and meaning of text
▶ Question Answering
▶ Machine Translation
▶ Language modeling - predict the next word in a sentence based
on previous words, e.g., for a conversational agent
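
As a rough illustration of the language modeling task in the list above, here is a minimal sketch of a bigram model: it counts which word follows which in a tiny corpus and predicts the most frequent successor. The toy corpus and the function names `train_bigram_counts` and `predict_next` are assumptions made for this example, not course material, and real language models are far more sophisticated.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent successor of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus, made up for illustration.
corpus = [
    "the fed raised interest rates",
    "the fed raised rates again",
    "the market fell after the fed raised rates",
]
bigram_counts = train_bigram_counts(corpus)
print(predict_next(bigram_counts, "fed"))     # "raised" follows "fed" every time
print(predict_next(bigram_counts, "raised"))  # "rates" follows more often than "interest"
```

A word that never appears in the corpus returns `None`, which hints at why sparse counts are a practical problem for this kind of model.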

NLP examples relevant in finance

▶ Financial sentiment analysis - how do news and social media
relate to market movements?
▶ Risk assessments - leverage information in documents as an
additional source of data
▶ Monetary policy stance - using text to indicate hawkish or
dovish stance by FOMC (see paper)
▶ Post-Earnings-Announcement Drift based on earnings calls
(see paper)
▶ Financial word embeddings based on EDGAR-CORPUS
(see paper)

Thinking about human language

▶ Four major building blocks of language:

▶ phonemes - speech and sounds

▶ morphemes and lexemes - words

▶ syntax - phrases and sentences

▶ context - meaning

Morphemes and lexemes

▶ morpheme - smallest unit of language that has meaning,
formed by combination of phonemes
▶ Not all morphemes are words, but all prefixes and suffixes are
morphemes, e.g., multi- in multimedia

▶ lexemes - structural variations of morphemes related by
meaning, e.g., run and running belong to the same lexeme
form
▶ Morphological analysis - analyzes structure of words by
studying morphemes and lexemes
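
To make the run/running example concrete, the sketch below uses a deliberately crude suffix-stripping heuristic to map word forms to a shared base. The suffix list and the `crude_stem` function are hypothetical and written only for this illustration; this is not a real stemmer or lemmatizer, and the last call shows how easily such a rule goes wrong.

```python
# Hypothetical, tiny suffix list, checked in this order.
SUFFIXES = ("ning", "ing", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# All three forms collapse to the same base.
for form in ["run", "runs", "running"]:
    print(form, "->", crude_stem(form))

# The same rule mangles a word it was not designed for.
print("learning", "->", crude_stem("learning"))  # gives "lear", not "learn"
```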

Syntax
▶ Syntax - set of rules to construct grammatically correct
sentences out of words/phrases
▶ Represent sentences with a parse tree
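
One simple way to hold a parse tree in code is as nested tuples, sketched below. The example sentence, the Penn Treebank-style labels, and the helper names `leaves` and `bracketed` are assumptions chosen for illustration; an actual parser would produce the tree rather than having it typed in by hand.

```python
# A parse tree as nested tuples: (label, child, child, ...); leaves are plain strings.
tree = (
    "S",
    ("NP", ("DT", "The"), ("NN", "bank")),
    ("VP", ("VBD", "raised"), ("NP", ("NNS", "rates"))),
)


def leaves(node):
    """Return the words at the leaves of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def bracketed(node):
    """Render the tree in bracketed notation, e.g. (NP (DT The) (NN bank))."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(bracketed(c) for c in children) + ")"


print(leaves(tree))
print(bracketed(tree))
```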

Why is NLP challenging?

▶ Context - various parts of language come together to convey
a particular meaning
▶ Semantics - direct meaning without external context
▶ Pragmatics - adds domain knowledge and external context

▶ Ambiguity - uncertainty of meaning
▶ e.g., “I made her duck”

▶ Common knowledge - we use it all the time, but how do we
encode this?

▶ Creativity
▶ Diversity across languages
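
The ambiguity in “I made her duck” can be sketched by listing the parts of speech each word could take and enumerating the combinations. The toy `LEXICON` and the `tag_sequences` function are hypothetical, and not every combination printed is a grammatical reading; the point is only that even a four-word sentence yields several candidate analyses that a system has to choose among.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the parts of speech it could take.
LEXICON = {
    "i": ["PRON"],
    "made": ["VERB"],
    "her": ["PRON", "DET"],    # object pronoun vs. possessive determiner
    "duck": ["NOUN", "VERB"],  # the bird vs. the action of ducking
}


def tag_sequences(sentence):
    """Enumerate every combination of candidate tags for the sentence."""
    words = sentence.lower().split()
    candidates = [LEXICON[w] for w in words]
    return [tuple(zip(words, tags)) for tags in product(*candidates)]


readings = tag_sequences("I made her duck")
for reading in readings:
    print(" ".join(f"{word}/{tag}" for word, tag in reading))
print(len(readings), "candidate taggings")
```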

Heuristics-Based NLP
▶ Rules-based approaches for NLP systems
▶ Use pre-existing dictionaries and thesauruses
▶ Example: lexicon-based sentiment analysis, counts of
positive and negative words to infer sentiment of text

▶ CMU English Dept DocuScope Project

▶ Regular expressions (regex): sequences of characters that
specify a match pattern in text (use a cheatsheet!)
▶ e.g., finding email IDs
▶ Powerful tools for pattern detection, location, extraction, and
replacement when working with strings as data
▶ Popular for rule-based systems; regexes are used for
deterministic matches
Rules/heuristics can be useful as features for
machine-learning-based NLP systems
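
The heuristics on this slide can be sketched as a small feature extractor that combines lexicon counts with a regex match. The word lists, the email pattern, and the `heuristic_features` function are all assumptions for illustration: a real system would use a curated dictionary, and the pattern is intentionally simple and will not cover every valid address.

```python
import re

# Hypothetical mini-lexicons; a real system would use a curated dictionary.
POSITIVE_WORDS = {"gain", "growth", "profit", "strong", "beat"}
NEGATIVE_WORDS = {"loss", "decline", "weak", "miss", "risk"}

# Deliberately simple email pattern (an assumption, not a full address grammar).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WORD_PATTERN = re.compile(r"[a-z]+")


def heuristic_features(text):
    """Return lexicon-based sentiment counts and any email-like strings."""
    words = WORD_PATTERN.findall(text.lower())
    n_pos = sum(w in POSITIVE_WORDS for w in words)
    n_neg = sum(w in NEGATIVE_WORDS for w in words)
    return {
        "n_positive": n_pos,
        "n_negative": n_neg,
        "sentiment": n_pos - n_neg,  # positive minus negative word count
        "emails": EMAIL_PATTERN.findall(text),
    }


example = (
    "Strong revenue growth offset a small loss in Europe. "
    "Contact ir@example.com for details."
)
features = heuristic_features(example)
print(features)
```

The returned dictionary is the kind of hand-built signal that could be passed to a machine learning model as extra features.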
How do we represent images as data?

What about sound?

How do we represent text?

Recap and next time

▶ There are many use cases for NLP, but they are challenging,
open-ended problems

▶ Course will proceed as follows:


▶ Text representation
▶ Classification
▶ Clustering and Topic Models
▶ Deep Learning

Next time: Bag of Words representation of text
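
As a preview of the idea, a bag-of-words representation can be sketched as word counts over a shared vocabulary. This is a minimal sketch under assumed choices (lowercasing, a simple regex tokenizer, the `bag_of_words` name, and two made-up documents); the treatment in the next lecture may differ.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a document to word counts, ignoring word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Two made-up documents that share most of their words.
docs = [
    "Rates rose and stocks fell",
    "Stocks rose as rates fell",
]
bags = [bag_of_words(d) for d in docs]

# Shared vocabulary, sorted so every document uses the same column order.
vocabulary = sorted(set().union(*bags))
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```

The two vectors differ only in the columns for "and" and "as", even though the word order of the two sentences is different, which is exactly the information this representation discards.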
