UNIT-I
A Primer: NLP in the Real World, NLP Tasks, NLP Levels, What Is Language, Building
Blocks of Language, Why Is NLP Challenging, Machine Learning and Overview of Approaches
to NLP: Heuristics-Based, Machine Learning, Deep Learning for NLP.
NLP Pipeline: Data Acquisition, Pre-Processing (Preliminaries, Frequent Steps, Advanced
Processing), Feature Engineering, Classical NLP/ML Pipeline, DL Pipeline, Modelling,
Evaluation of Models, Post-Modelling Phases.
Natural Language Processing (NLP) focuses on enabling computers to understand,
process, and analyze human language. It involves programming computers to
process large volumes of natural language data.
NLP in the Real World
● Email Platforms (Gmail, Outlook): Spam detection, auto-complete, priority inbox.
● Voice Assistants (Siri, Alexa, Google Assistant): Understanding and responding
to user commands.
● Search Engines (Google, Bing): Query understanding, ranking, question
answering.
● Machine Translation (Google Translate, Microsoft Translator): Real-time
language translation.
● Social Media & E-commerce: Sentiment analysis, product insights.
● Grammar & Spelling Tools (Grammarly, MS Word): Auto-correction.
NLP Tasks
● Language Modeling: Predicts the next word in a sentence based on previous
words. Used in speech recognition, handwriting recognition, and spell checking.
● Text Classification: Categorizes text into predefined labels, e.g., spam detection,
sentiment analysis.
● Information Extraction: Identifies key details from text, like names or events in
emails and social media.
● Information Retrieval: Finds relevant documents for user queries, e.g., Google
Search.
● Conversational Agents: Builds dialogue systems like Alexa and Siri for human-like
interactions.
● Text Summarization: Generates concise summaries while preserving meaning.
● Question Answering: Develops systems to answer natural language queries.
● Machine Translation: Converts text between languages, e.g., Google Translate.
● Topic Modeling: Identifies underlying themes in large text datasets, useful in text
mining.
NLP Levels
In Natural Language Processing (NLP), language understanding and generation are typically
analyzed across five primary levels or layers. Each level represents a different aspect of
language processing:
1. Phonological/Phonetic Level (Speech-based NLP)
● Concerned with: Sounds of speech.
● Tasks include: Speech recognition, phoneme identification.
● Example: Converting spoken words to text (ASR – Automatic Speech Recognition).
2. Morphological Level
● Concerned with: The structure of words and their meaningful components
(morphemes).
● Tasks include: Lemmatization, stemming, part-of-speech tagging.
● Example: "Running" → "run" (root word) + "ing" (suffix).
3. Syntactic Level
● Concerned with: Grammar and sentence structure.
● Tasks include: Parsing, POS tagging, grammar checking.
● Example: "The cat sits on the mat." → Subject-Verb-Object relationships.
4. Semantic Level
● Concerned with: Meaning of words and sentences.
● Tasks include: Word sense disambiguation, named entity recognition (NER),
semantic role labeling.
● Example: Understanding that "bank" can mean a financial institution or the side of a
river depending on context.
5. Pragmatic Level
● Concerned with: Contextual meaning and real-world knowledge.
● Tasks include: Discourse analysis, coreference resolution, intention detection.
● Example: Understanding that "Can you pass the salt?" is a request, not a question
about ability.
What is Language?
Language is a structured system of communication that enables humans to convey
thoughts, emotions, and information. It consists of various components that work together to
form meaningful expressions.
Key Building Blocks of Language
1. Phonemes
● The smallest units of sound in a language.
● Essential for speech recognition and text-to-speech (TTS) systems in NLP.
● Example: The sounds /p/, /t/, and /k/ in "pat" and "cat".
2. Morphemes & Lexemes
Morphemes
● The smallest meaningful units of language.
● Can be roots, prefixes, or suffixes that alter meaning.
● Example:
○ Un- in "undo" (negation)
○ -ed in "walked" (past tense)
Lexemes
● The base words or vocabulary elements.
● A lexeme represents a set of related word forms.
● Example:
○ Run, running, and ran belong to the same lexeme (run).
3. Syntax
● The set of rules that determine sentence structure.
● Defines how words are arranged for proper meaning.
● Example:
○ "She eats apples." ✅ (Correct syntax)
○ "Eats she apples." ❌ (Incorrect syntax)
4. Context
Context determines the meaning of words and sentences based on usage and
surrounding text.
Example "Bank" can mean:
○ A financial institution ("He deposited money in the bank.").
○ The side of a river ("He sat on the river bank.").
Key Aspects of Context:
● Words and phrases can have multiple meanings depending on context.
● Two Main Components:
1. Semantics: The direct, literal meaning of words and sentences.
2. Pragmatics: The implied meaning based on context and external knowledge.
Why Is NLP Challenging
Natural Language Processing (NLP) faces several challenges due to the complexity of
human language.
Key Challenges:
1. Ambiguity:
○ Words and sentences can have multiple meanings based on context.
○ Example: "I saw a man with a telescope."
○ Could mean: I used a telescope to see a man.
○ Or: I saw a man who had a telescope.
○ Understanding the correct meaning requires contextual awareness.
2. Common Knowledge:
○ Humans rely on implicit knowledge that is not explicitly stated in
conversations.
○ Example: "Man bit dog" vs. "Dog bit man"
■ The second is more likely based on common sense.
○ Encoding such knowledge in a computational model is difficult.
3. Creativity in Language:
○ Humans use metaphors, idioms, sarcasm, and humor, which are hard for
machines to interpret.
○ Example: "It's raining cats and dogs" doesn’t mean actual animals are falling
from the sky it actually means it's raining very heavily
4. Diversity Across Languages:
○ Direct word-to-word translation is not always possible.
○ Syntax, grammar, and idioms vary, making multilingual NLP complex.
○ A model trained for one language may not work for another without
adaptation.
NOTE : WordNet
● A structured database of words and their meanings.
● Key Relationships:
○ Synonyms: Words with similar meanings (happy ↔ joyful).
○ Hyponyms: “Is-a” relationships (tennis is a hyponym of sports).
○ Meronyms: “Part-of” relationships (hand is a meronym of body).
Overview of Machine Learning Approaches in NLP
Machine Learning plays a crucial role in Natural Language Processing (NLP) by enabling
computers to understand, classify, and generate human language. Different approaches
include Heuristics-Based, Machine Learning-Based, and Deep Learning-Based
techniques.
1. Heuristics-Based Approach
● Relies on predefined rules and patterns for language processing.
● Useful for simple tasks where rules can be explicitly defined.
Advantages:
✅ Easy to implement
✅ Works well for structured problems
Examples:
● Spam detection → Using keyword-based filtering (e.g., detecting words like "free
money" or "win a prize" in emails).
● Named Entity Recognition (NER) → Using dictionaries to identify names of places,
people, or organizations.
● Regular Expressions (Regex) → Finding patterns in text (e.g., extracting email
addresses or phone numbers).
Limitation: Struggles with complex language structures and large-scale datasets.
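A minimal sketch of the regex-based heuristic mentioned above, using Python's built-in re module. The sample text and patterns are illustrative assumptions, not a production-quality rule set.

import re

# Illustrative sample text (assumed for this sketch)
text = "Contact us at support@example.com or sales@example.org, phone: 987-654-3210."

# Heuristic patterns: a simple email regex and a simple phone-number regex
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"
phone_pattern = r"\d{3}-\d{3}-\d{4}"

emails = re.findall(email_pattern, text)   # ['support@example.com', 'sales@example.org']
phones = re.findall(phone_pattern, text)   # ['987-654-3210']

print(emails, phones)

Such rules are easy to write and interpret, but every new text pattern needs a new rule, which is exactly the limitation noted above.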
2. Machine Learning-Based Approach
Machine Learning models learn from data and patterns rather than relying on predefined
rules.
2.1 Naïve Bayes Classifier
● A probabilistic algorithm used for text classification.
● Based on Bayes' Theorem, assuming features (words) are independent.
● Often used for sentiment analysis, spam filtering, and document classification.
Example: A spam filter classifies an email as spam or non-spam based on the probability
of certain words appearing (e.g., "free offer," "lottery winner", etc.).
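A minimal sketch of Naïve Bayes spam classification with scikit-learn. The tiny training set and its labels are made-up assumptions purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumed toy training set: 1 = spam, 0 = not spam
emails = [
    "win a free prize now", "free money offer just for you",
    "meeting agenda for tomorrow", "project report attached",
]
labels = [1, 1, 0, 0]

# Bag-of-words counts as features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train the Naïve Bayes classifier
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new email
test = vectorizer.transform(["claim your free prize"])
print(clf.predict(test))  # expected: [1] (spam), since "free" and "prize" appear in spam training mails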
2.2 Support Vector Machine (SVM)
● A supervised learning algorithm that finds a decision boundary to separate
different text categories.
● Can model both linear and nonlinear relationships.
Example: Classifying news articles as sports, politics, or entertainment based on word
distributions.
2.3 Hidden Markov Model (HMM) is a statistical model used to predict hidden states
based on observable clues.
Example: Weather Prediction
● Imagine you're in a room with no windows, and you want to know the weather
outside. You can't see the weather directly (hidden state) but can guess it by
observing what people wear (observable state).
● If someone has an umbrella, it's likely rainy.
● If they wear sunglasses, it's probably sunny.
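The umbrella/sunglasses intuition above can be made concrete with a tiny Viterbi decoder. All probabilities below are made-up assumptions chosen only to illustrate how hidden states are recovered from observations.

# Minimal Viterbi sketch for the weather example (all probabilities are assumed)
states = ["Rainy", "Sunny"]                            # hidden states
observations = ["umbrella", "sunglasses", "umbrella"]  # what we can see

start_p = {"Rainy": 0.5, "Sunny": 0.5}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"umbrella": 0.8, "sunglasses": 0.2},
          "Sunny": {"umbrella": 0.1, "sunglasses": 0.9}}

# V[t][s] = probability of the best path ending in state s at time t
V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
path = {s: [s] for s in states}

for obs in observations[1:]:
    V.append({})
    new_path = {}
    for s in states:
        prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][obs], p) for p in states)
        V[-1][s] = prob
        new_path[s] = path[prev] + [s]
    path = new_path

best_state = max(V[-1], key=V[-1].get)
print(path[best_state])  # most likely hidden weather sequence: ['Rainy', 'Sunny', 'Rainy']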
3. Deep Learning-Based Approach
● Uses neural networks to capture complex relationships in text.
● Excels at learning context, meaning, and semantics without extensive manual
feature engineering.
Advantages:
✅ Minimal feature engineering required (learns features automatically).
✅ Effective for complex NLP tasks like machine translation and conversational AI.
Examples of Deep Learning in NLP:
● ChatGPT → Conversational AI for answering queries.
● Google Translate → Uses sequence-to-sequence models for language translation.
● BERT, GPT, and Transformers → Advanced deep learning models that improve text
understanding and generation.
NLP Pipeline
An NLP (Natural Language Processing) pipeline is a sequence of steps used to process
raw text data and transform it into a structured format for machine learning models. It
ensures efficient text analysis, classification, translation, sentiment analysis, and other NLP
tasks.
1. Data Acquisition
This is the first step where raw text data is collected. Sources can include:
● Web scraping
● Social media posts
● News articles
● Research papers
● Speech transcripts
● Open-source datasets (Kaggle, Hugging Face, etc.)
Challenges:
● Data availability
● Licensing and ethical concerns
● Imbalanced datasets
2. Text Cleaning
Before processing, text needs cleaning to remove noise. Common steps include:
● Lowercasing
Converts text to lowercase for consistency.
Example: "The Quick Brown FOX." → "the quick brown fox."
● Removing Special Characters & Punctuation
Eliminates unnecessary symbols like @, #, !, and emojis.
Example: "Hello!!! How's it going? 😊" → "Hello Hows it going"
● Tokenization
Splits text into words or sentences.
Example: "NLP is fun!" → ["NLP", "is", "fun"]
● Stopword Removal
Removes common words (e.g., "the," "is," "and").
Example: "The sun is shining." → "sun shining."
● Spelling Correction
Fixes misspelled words.
Example: "Langage Processng is amzing!" → "Language Processing is amazing!"
● Lemmatization & Stemming
Both techniques reduce words to their base forms, but they differ in approach.
(a) Stemming:
● Reduces words to their root form by removing suffixes.
● Uses rules (may not always return valid words).
Example:
● "running" → "run"
● "flies" → "fli"
● "better" → "bet"
(b) Lemmatization:
● Uses a dictionary-based approach to find the correct root form.
● More accurate than stemming.
Example:
● "running" → "run"
● "flies" → "fly"
● "better" → "good" (uses context-based meaning)
3. Pre-Processing
Pre-processing is a crucial step in NLP to clean and structure text data before feeding it into
machine learning or deep learning models. It ensures consistency, removes noise, and
enhances the efficiency of NLP algorithms.
Key Pre-Processing Steps
1. Handling Missing Data
○ Imputation: Replacing missing values using statistical methods (e.g., mean,
mode, or predicting missing words using NLP techniques).
○ Deletion: Removing incomplete records if they are too noisy or sparse.
2. Text Normalization
○ Expanding contractions: ("I'm" → "I am")
○ Correcting abbreviations: ("u" → "you", "thx" → "thanks")
○ Lowercasing: ("HELLO" → "hello") to ensure consistency.
○ Lemmatization/Stemming: Reducing words to their base form ("running" →
"run").
3. Handling Outliers
○ Removing irrelevant or excessively noisy data such as special characters,
HTML tags, or random symbols.
4. POS (Part-of-Speech) Tagging
○ Assigning grammatical categories to words (e.g., noun, verb, adjective).
○ Helps in syntactic analysis and understanding sentence structure.
○ Example: "The cat sat on the mat." →
cat (NOUN), sat (VERB), mat (NOUN)
5. Named Entity Recognition (NER)
○ Identifies proper nouns such as names, places, and organizations.
○ Example: "Elon Musk founded Tesla in California."
■ Elon Musk (PERSON), Tesla (ORG), California (LOCATION)
Preliminaries of Pre-Processing – To perform these tasks, different approaches are used:
1. Rule-Based Methods
○ Use regular expressions (regex) and heuristics to clean text.
○ Example: Regex for removing special characters → [^a-zA-Z0-9]
○ Simple and interpretable but lacks adaptability.
2. Statistical Methods
○ Use TF-IDF (Term Frequency-Inverse Document Frequency) to weight
important words.
○ N-grams capture phrase-level patterns (bigrams: "New York", trigrams: "San
Francisco Bay").
○ Useful for keyword extraction and feature engineering (a short sketch follows this list).
3. Deep Learning-Based Methods
○ Word Embeddings (Word2Vec, GloVe, FastText) map words into dense
vector spaces.
○ Capture semantic meanings (e.g., "king" - "man" + "woman" ≈ "queen").
○ Used in advanced NLP tasks like sentiment analysis, translation, and
chatbots.
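A brief sketch of the TF-IDF weighting and n-gram features described in the statistical methods above, using scikit-learn. The toy corpus is an assumption for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed toy corpus
corpus = [
    "I visited New York last summer",
    "San Francisco Bay is beautiful",
    "New York and San Francisco are large cities",
]

# Unigrams + bigrams, weighted by TF-IDF
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
X = vectorizer.fit_transform(corpus)

print(X.shape)                                   # (documents, unigram + bigram features)
print(vectorizer.get_feature_names_out())        # includes bigram features such as 'new york', 'san francisco'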
4. Feature Engineering - Feature engineering extracts useful information from text. This
can be done using:
Classical NLP/ML Methods
● Bag of Words (BoW) – Count-based representation
● TF-IDF (Term Frequency-Inverse Document Frequency) – Weighted word
importance
● n-grams – Consecutive word sequences (bigrams, trigrams)
Deep Learning-Based Methods
● Word embeddings (Word2Vec, GloVe, FastText)
● Contextual embeddings (BERT, GPT, ELMo)
● Transformer-based representations (T5, XLNet)
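A minimal Word2Vec training sketch with gensim (assuming gensim 4.x, where the dimensionality parameter is vector_size). The tiny corpus is far too small to produce meaningful vectors and is only an illustrative assumption.

from gensim.models import Word2Vec

# Assumed toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# Train dense word vectors (100 dimensions, context window of 2 words)
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, epochs=50)

# Cosine similarity between two word vectors
print(model.wv.similarity("king", "queen"))

# Analogy-style query: king - man + woman ≈ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

In practice, pretrained embeddings (trained on billions of words) are loaded instead of training on a handful of sentences.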
5. Modeling - Once features are extracted, models are trained for text classification,
sentiment analysis, etc.
Classical ML Approaches
● Naïve Bayes
● Logistic Regression
● Support Vector Machines (SVM)
● Random Forest
Deep Learning Pipeline
● Recurrent Neural Networks (RNNs)
● Long Short-Term Memory (LSTM)
● Gated Recurrent Units (GRU)
● Transformer-based models (BERT, GPT)
6. Model Evaluation
Evaluating model performance using:
● Accuracy, Precision, Recall, F1-score (for classification tasks)
● BLEU, ROUGE (for text generation tasks)
● Perplexity (for language models)
7. Deployment
Once the model is trained and evaluated, it is deployed using:
● APIs (Flask, FastAPI)
● Cloud services (AWS, GCP, Azure)
8. Monitoring and Updating
● Detecting model drift (performance degradation over time)
● Updating the model with new data and Continual learning and fine-tuning
Evaluation Metrics help measure the performance of models in terms of accuracy,
efficiency, and real-world effectiveness.
1. Evaluation Metrics for Classification
In text classification tasks (such as spam detection, sentiment analysis, and topic
classification), we primarily use the following metrics:
1.1 Accuracy: The proportion of correctly classified instances over the total instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
● TP (True Positive): Correctly predicted positive samples
● TN (True Negative): Correctly predicted negative samples
● FP (False Positive): Incorrectly predicted positive samples
● FN (False Negative): Incorrectly predicted negative samples
1.2 Precision (Positive Predictive Value): The proportion of correctly predicted positive
samples out of all predicted positive samples.
Precision = TP / (TP + FP)
Application: Useful in applications like fake news detection, where false positives are costly.
1.3 Recall (Sensitivity): The proportion of correctly predicted positive samples out of all
actual positive samples.
Recall = TP / (TP + FN)
Application: Used in medical text classification (e.g., disease detection from clinical notes),
where missing a positive case is critical.
1.4 F1-Score: The harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
1.5 ROC-AUC (Receiver Operating Characteristic - Area Under Curve):
● ROC Curve → Plots True Positive Rate (TPR) vs. False Positive Rate (FPR).
● AUC (Area Under Curve) → Measures how well a model separates classes.
● Higher AUC (close to 1.0) → Better model performance.
● AUC = 0.5 → Random guessing (bad model)
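A small sketch computing the classification metrics above with scikit-learn. The label and score arrays are made-up assumptions standing in for real model outputs.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Assumed true labels and model predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]  # predicted probabilities, used for ROC-AUC

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))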
2. Evaluation Metrics for Regression
In NLP, regression models are used in tasks such as sentiment intensity prediction and
readability score prediction.
2.1 Mean Squared Error (MSE)
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
● yᵢ = Actual value
● ŷᵢ = Predicted value
● n = Total number of samples
2.2 Mean Absolute Error (MAE)
MAE = (1/n) Σ |yᵢ − ŷᵢ|
2.3 R-squared (R²)
R² = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)²
where ȳ is the mean of the actual values.
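The regression metrics above in a short scikit-learn sketch; the actual and predicted values are assumed for illustration (e.g., sentiment-intensity scores).

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Assumed actual vs. predicted sentiment-intensity scores
y_true = [0.9, 0.1, 0.5, 0.7]
y_pred = [0.8, 0.2, 0.4, 0.9]

print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R² :", r2_score(y_true, y_pred))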
3. Evaluation Metrics for Language Models and Embeddings
3.1 Perplexity measures how uncertain a language model is when predicting the next
word. It is the exponential of the average negative log-probability of the words:
Perplexity = exp( −(1/N) Σ log P(wᵢ | w₁, …, wᵢ₋₁) ), where N is the number of words.
Example: If a model predicts "The sky is blue" with high confidence → Low PPL (better
model). If it assigns similar probability to many possible words (blue, happy, running) → High
PPL (worse model).
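A tiny sketch of the perplexity formula above, computed from assumed next-word probabilities a language model might assign to each actual word of "The sky is blue".

import math

# Assumed probabilities the model assigns to each actual next word
word_probs = [0.4, 0.5, 0.6, 0.7]

# Perplexity = exp( -(1/N) * sum(log p_i) )
ppl = math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))
print(ppl)  # lower perplexity → the model is less "surprised" by the text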
3.2 BLEU Score (Bilingual Evaluation Understudy)
Measures how similar a machine-generated text is to a human reference translation.
BLEU = BP × exp( Σ wn · log pn ), where BP is the Brevity Penalty, pn is the modified n-gram
precision, and wn are the n-gram weights.
● Compares matching n-grams (word sequences).
● Score range: 0 to 1 (higher = better).
Example:
● Reference: "The cat is on the mat."
● Generated: "The cat is sitting on the mat."
● More matches → Higher BLEU Score → Better translation!
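A minimal BLEU computation with NLTK for the reference/candidate pair above. Smoothing is applied because the sentences are very short; that is an implementation choice for this sketch, not part of the original example.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]           # list of reference token lists
candidate = ["the", "cat", "is", "sitting", "on", "the", "mat"]  # generated translation tokens

smooth = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(score)  # closer to 1.0 means a closer match to the reference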
3.3 Word Embedding Evaluation (Cosine Similarity)
Measures how similar two word vectors A and B are:
cosine(A, B) = (A · B) / (||A|| × ||B||)
Values close to 1 indicate semantically similar words; values near 0 indicate unrelated words.
3.4 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Used for: Evaluating text summarization by comparing generated text with human-written
references.
● ROUGE-N → Measures n-gram overlap (e.g., ROUGE-1 for unigrams, ROUGE-2 for
bigrams).
● ROUGE-L → Measures Longest Common Subsequence (LCS) overlap.
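A hand-rolled ROUGE-1 recall sketch (unigram overlap against the reference). Real evaluations typically use a dedicated ROUGE package; this is only an illustration of the idea.

from collections import Counter

reference = "the cat is on the mat".split()
generated = "the cat sat on the mat".split()

ref_counts, gen_counts = Counter(reference), Counter(generated)

# ROUGE-1 recall: overlapping unigrams / total unigrams in the reference
overlap = sum(min(ref_counts[w], gen_counts[w]) for w in ref_counts)
rouge_1_recall = overlap / sum(ref_counts.values())
print(rouge_1_recall)  # 5 overlapping unigrams out of 6 → ≈ 0.83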
Post-Modelling Phases in NLP
After model training and evaluation, several steps ensure the model remains effective in
real-world applications. These include:
1. Hyperparameter Tuning – Optimizing parameters (learning rate, batch size) to
improve performance.
2. Error Analysis – Identifying misclassified samples to refine the model.
3. Model Deployment – Hosting the model using APIs, cloud services (AWS, GCP), or
edge devices.
4. Monitoring & Maintenance – Tracking performance, re-training with fresh data, and
handling bias.
5. User Feedback & Iteration – Collecting real-world feedback to refine and improve
the model.
NLP Modelling: Classical ML Pipeline vs Deep Learning Pipeline
The classical ML pipeline relies on manually engineered features (BoW, TF-IDF, n-grams) fed
into models such as Naïve Bayes or SVM, whereas the deep learning pipeline learns features
automatically through embeddings and neural architectures (RNNs, LSTMs, Transformers).
The following sections illustrate common NLP preprocessing concepts with clear examples:
1. Tokenization
Tokenization is the process of splitting text into smaller units called tokens, such as words,
subwords, or sentences.
Example - "Natural Language Processing is fun!"
Word Tokenization Output: - ['Natural', 'Language', 'Processing', 'is', 'fun', '!']
2. Stemming
Stemming reduces a word to its base/root form, which may not be a valid word. It uses
heuristic rules (often just chopping off suffixes).
Example:
Words: "playing", "played", "plays"
Stemmed: "play", "play", "play"
Note: "relational" → "relat", which is not a real word.
3. Lemmatization
Lemmatization reduces a word to its lemma (dictionary form), considering the part of
speech (POS).
Example:
Words: "am", "are", "is", "was"
Lemma: "be"
Words: "better"
Lemma: "good"
✅ Lemmatization is more accurate than stemming but computationally heavier.
4. POS Tagging (Part of Speech Tagging)
POS tagging assigns each word a part of speech (e.g., noun, verb, adjective) based on
context.
Example:
Sentence: "The dog barks loudly."
POS Tags: [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB')]
Common POS Tags:
● NN: Noun
● VBZ: Verb (3rd person singular)
● JJ: Adjective
● RB: Adverb
● DT: Determiner
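A short NLTK sketch of POS tagging for the sentence above. It assumes the tokenizer and tagger data have been downloaded; exact tags may vary slightly by NLTK version.

import nltk

# One-time downloads (assumed already available):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The dog barks loudly.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB'), ('.', '.')]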
5. NER (Named Entity Recognition) – NER identifies and classifies named entities in text
into predefined categories such as person, organization, location, etc.
Sentence: "Barack Obama was born in Hawaii and served as the president of the USA."
NER Output:
[
('Barack Obama', PERSON),
('Hawaii', LOCATION),
('USA', LOCATION)
]
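A minimal spaCy sketch for the NER output above. It assumes the small English model en_core_web_sm has been installed; note that spaCy labels locations such as "Hawaii" and "USA" as GPE rather than LOCATION.

import spacy

# Assumes the model was installed with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Barack Obama was born in Hawaii and served as the president of the USA.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Barack Obama PERSON / Hawaii GPE / USA GPE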
EXAMPLE: Let's analyze the sentence "Children love to play in a garden."
for both POS tagging and NER.
1. POS Tagging (Part-of-Speech Tagging)
POS Tags: [('Children', 'NNS'), ('love', 'VBP'), ('to', 'TO'), ('play', 'VB'), ('in', 'IN'), ('a', 'DT'), ('garden', 'NN')]
POS Interpretation: "Children" is the subject (plural noun), "love" is the main verb, "to play"
is an infinitive verb phrase, "in a garden" is a prepositional phrase indicating location.
2. NER (Named Entity Recognition)
NER attempts to find named entities like:
● PERSON, LOCATION ,ORGANIZATION ,DATE ,TIME, GPE (Geo-political Entity),
etc.
NER Output: No named entities detected.
Explanation:
● "Children", "garden" are common nouns, not proper nouns or named entities.
● There are no people, places (like "London"), organizations, or dates here.