UNIT-I
A Primer: NLP in the Real World, NLP Tasks, NLP Levels, What Is Language, Building
Blocks of Language, Why Is NLP Challenging, Machine Learning and Overview of Approaches
to NLP: Heuristics-Based, Machine Learning, Deep Learning for NLP.
NLP Pipeline: Data Acquisition, Pre-Processing (Preliminaries, Frequent Steps, Advanced
Processing), Feature Engineering, Classical NLP/ML Pipeline, DL Pipeline, Modelling,
Evaluation of Models, Post-Modelling Phases.
Natural Language Processing (NLP) focuses on enabling computers to understand,
process, and analyze human language. It involves programming computers to
process large volumes of natural language data.
NLP in the Real World
● Email Platforms (Gmail, Outlook): Spam detection, auto-complete, priority inbox.
● Voice Assistants (Siri, Alexa, Google Assistant): Understanding and responding
to user commands.
● Search Engines (Google, Bing): Query understanding, ranking, question
answering.
● Machine Translation (Google Translate, Microsoft Translator): Real-time
language translation.
● Social Media & E-commerce: Sentiment analysis, product insights.
● Grammar & Spelling Tools (Grammarly, MS Word): Auto-correction.
NLP Tasks
● Language Modeling: Predicts the next word in a sentence based on previous
words. Used in speech recognition, handwriting recognition, and spell checking.
● Text Classification: Categorizes text into predefined labels, e.g., spam detection,
sentiment analysis.
● Information Extraction: Identifies key details from text, like names or events in
emails and social media.
● Information Retrieval: Finds relevant documents for user queries, e.g., Google
Search.
● Conversational Agents: Builds dialogue systems like Alexa and Siri for human-like
interactions.
● Text Summarization: Generates concise summaries while preserving meaning.
● Question Answering: Develops systems to answer natural language queries.
● Machine Translation: Converts text between languages, e.g., Google Translate.
● Topic Modeling: Identifies underlying themes in large text datasets, useful in text
mining.
NLP Levels
In Natural Language Processing (NLP), language understanding and generation are typically
analyzed across five primary levels or layers. Each level represents a different aspect of
language processing:
1. Phonological/Phonetic Level (Speech-based NLP)
● Concerned with: Sounds of speech.
● Tasks include: Speech recognition, phoneme identification.
● Example: Converting spoken words to text (ASR – Automatic Speech Recognition).
2. Morphological Level
● Concerned with: The structure of words and their meaningful components
(morphemes).
● Tasks include: Lemmatization, stemming, part-of-speech tagging.
● Example: "Running" → "run" (root word) + "ing" (suffix).
3. Syntactic Level
● Concerned with: Grammar and sentence structure.
● Tasks include: Parsing, POS tagging, grammar checking.
● Example: "The cat sits on the mat." → Subject-Verb-Object relationships.
4. Semantic Level
● Concerned with: Meaning of words and sentences.
● Tasks include: Word sense disambiguation, named entity recognition (NER),
semantic role labeling.
● Example: Understanding that "bank" can mean a financial institution or the side of a
river depending on context.
5. Pragmatic Level
● Concerned with: Contextual meaning and real-world knowledge.
● Tasks include: Discourse analysis, coreference resolution, intention detection.
● Example: Understanding that "Can you pass the salt?" is a request, not a question
about ability.
What is Language?
Language is a structured system of communication that enables humans to convey
thoughts, emotions, and information. It consists of various components that work together to
form meaningful expressions.
Key Building Blocks of Language
1. Phonemes
● The smallest units of sound in a language.
● Essential for speech recognition and text-to-speech (TTS) systems in NLP.
● Example: The sounds /p/, /t/, and /k/ in "pat" and "cat".
2. Morphemes & Lexemes
Morphemes
● The smallest meaningful units of language.
● Can be roots, prefixes, or suffixes that alter meaning.
● Example:
○ Un- in "undo" (negation)
○ -ed in "walked" (past tense)
Lexemes
● The base words or vocabulary elements.
● A lexeme represents a set of related word forms.
● Example:
○ Run, running, and ran belong to the same lexeme (run).
3. Syntax
● The set of rules that determine sentence structure.
● Defines how words are arranged for proper meaning.
● Example:
○ "She eats apples." ✅ (Correct syntax)
○ "Eats she apples." ❌ (Incorrect syntax)
4. Context
Context determines the meaning of words and sentences based on usage and
surrounding text.
Example "Bank" can mean:
○ A financial institution ("He deposited money in the bank.").
○ The side of a river ("He sat on the river bank.").
Key Aspects of Context:
● Words and phrases can have multiple meanings depending on context.
● Two Main Components:
1. Semantics: The direct, literal meaning of words and sentences.
2. Pragmatics: The implied meaning based on context and external knowledge.
Why Is NLP Challenging
Natural Language Processing (NLP) faces several challenges due to the complexity of
human language.
Key Challenges:
1. Ambiguity:
○ Words and sentences can have multiple meanings based on context.
○ Example: "I saw a man with a telescope."
○ Could mean: I used a telescope to see a man.
○ Or: I saw a man who had a telescope.
○ Understanding the correct meaning requires contextual awareness.
2. Common Knowledge:
○ Humans rely on implicit knowledge that is not explicitly stated in
conversations.
○ Example: "Man bit dog" vs. "Dog bit man"
■ The second is more likely based on common sense.
○ Encoding such knowledge in a computational model is difficult.
3. Creativity in Language:
○ Humans use metaphors, idioms, sarcasm, and humor, which are hard for
machines to interpret.
○ Example: "It's raining cats and dogs" doesn’t mean actual animals are falling
from the sky it actually means it's raining very heavily
4. Diversity Across Languages:
○ Direct word-to-word translation is not always possible.
○ Syntax, grammar, and idioms vary, making multilingual NLP complex.
○ A model trained for one language may not work for another without
adaptation.
NOTE : WordNet
● A structured database of words and their meanings.
● Key Relationships:
○ Synonyms: Words with similar meanings (happy ↔ joyful).
○ Hyponyms: “Is-a” relationships (tennis is a hyponym of sports).
○ Meronyms: “Part-of” relationships (hand is a meronym of body).
Overview of Machine Learning Approaches in NLP
Machine Learning plays a crucial role in Natural Language Processing (NLP) by enabling
computers to understand, classify, and generate human language. Different approaches
include Heuristics-Based, Machine Learning-Based, and Deep Learning-Based
techniques.
1. Heuristics-Based Approach
● Relies on predefined rules and patterns for language processing.
● Useful for simple tasks where rules can be explicitly defined.
Advantages:
✅ Easy to implement
✅ Works well for structured problems
Examples:
● Spam detection → Using keyword-based filtering (e.g., detecting words like "free
money" or "win a prize" in emails).
● Named Entity Recognition (NER) → Using dictionaries to identify names of places,
people, or organizations.
● Regular Expressions (Regex) → Finding patterns in text (e.g., extracting email
addresses or phone numbers).
Limitation: Struggles with complex language structures and large-scale datasets.
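A minimal sketch of the regex-based heuristic mentioned above, using Python's built-in re module. The sample text and patterns are illustrative assumptions, not a production-quality rule set.

import re

# Illustrative sample text (assumed for this sketch)
text = "Contact us at support@example.com or sales@example.org, phone: 987-654-3210."

# Heuristic patterns: a simple email regex and a simple phone-number regex
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"
phone_pattern = r"\d{3}-\d{3}-\d{4}"

emails = re.findall(email_pattern, text)   # ['support@example.com', 'sales@example.org']
phones = re.findall(phone_pattern, text)   # ['987-654-3210']

print(emails, phones)

Such rules are easy to write and interpret, but every new text pattern needs a new rule, which is exactly the limitation noted above.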
2. Machine Learning-Based Approach
Machine Learning models learn from data and patterns rather than relying on predefined
rules.
2.1 Naïve Bayes Classifier
● A probabilistic algorithm used for text classification.
● Based on Bayes' Theorem, assuming features (words) are independent.
● Often used for sentiment analysis, spam filtering, and document classification.
Example: A spam filter classifies an email as spam or non-spam based on the probability
of certain words appearing (e.g., "free offer," "lottery winner", etc.).
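A minimal sketch of Naïve Bayes spam classification with scikit-learn. The tiny training set and its labels are made-up assumptions purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumed toy training set: 1 = spam, 0 = not spam
emails = [
    "win a free prize now", "free money offer just for you",
    "meeting agenda for tomorrow", "project report attached",
]
labels = [1, 1, 0, 0]

# Bag-of-words counts as features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train the Naïve Bayes classifier
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new email
test = vectorizer.transform(["claim your free prize"])
print(clf.predict(test))  # expected: [1] (spam), since "free" and "prize" appear in spam training mails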
2.2 Support Vector Machine (SVM)
● A supervised learning algorithm that finds a decision boundary to separate
different text categories.
● Can model both linear and nonlinear relationships.
Example: Classifying news articles as sports, politics, or entertainment based on word
distributions.
2.3 Hidden Markov Model (HMM) is a statistical model used to predict hidden states
based on observable clues.
Example: Weather Prediction
● Imagine you're in a room with no windows, and you want to know the weather
outside. You can't see the weather directly (hidden state) but can guess it by
observing what people wear (observable state).
● If someone has an umbrella, it's likely rainy.
● If they wear sunglasses, it's probably sunny.
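The umbrella/sunglasses intuition above can be made concrete with a tiny Viterbi decoder. All probabilities below are made-up assumptions chosen only to illustrate how hidden states are recovered from observations.

# Minimal Viterbi sketch for the weather example (all probabilities are assumed)
states = ["Rainy", "Sunny"]                            # hidden states
observations = ["umbrella", "sunglasses", "umbrella"]  # what we can see

start_p = {"Rainy": 0.5, "Sunny": 0.5}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"umbrella": 0.8, "sunglasses": 0.2},
          "Sunny": {"umbrella": 0.1, "sunglasses": 0.9}}

# V[t][s] = probability of the best path ending in state s at time t
V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
path = {s: [s] for s in states}

for obs in observations[1:]:
    V.append({})
    new_path = {}
    for s in states:
        prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][obs], p) for p in states)
        V[-1][s] = prob
        new_path[s] = path[prev] + [s]
    path = new_path

best_state = max(V[-1], key=V[-1].get)
print(path[best_state])  # most likely hidden weather sequence: ['Rainy', 'Sunny', 'Rainy']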
3. Deep Learning-Based Approach
● Uses neural networks to capture complex relationships in text.
● Excels at learning context, meaning, and semantics without extensive manual
feature engineering.
Advantages:
✅ Minimal feature engineering required (learns features automatically).
✅ Effective for complex NLP tasks like machine translation and conversational AI.
Examples of Deep Learning in NLP:
● ChatGPT → Conversational AI for answering queries.
● Google Translate → Uses sequence-to-sequence models for language translation.
● BERT, GPT, and Transformers → Advanced deep learning models that improve text
understanding and generation.
NLP Pipeline
An NLP (Natural Language Processing) pipeline is a sequence of steps used to process
raw text data and transform it into a structured format for machine learning models. It
ensures efficient text analysis, classification, translation, sentiment analysis, and other NLP
tasks.
1. Data Acquisition
This is the first step where raw text data is collected. Sources can include:
● Web scraping
● Social media posts
● News articles
● Research papers
● Speech transcripts
● Open-source datasets (Kaggle, Hugging Face, etc.)
Challenges:
● Data availability
● Licensing and ethical concerns
● Imbalanced datasets
2. Text Cleaning
Before processing, text needs cleaning to remove noise. Common steps include:
● Lowercasing
Converts text to lowercase for consistency.
Example: "The Quick Brown FOX." → "the quick brown fox."
● Removing Special Characters & Punctuation
Eliminates unnecessary symbols like @, #, !, and emojis.
Example: "Hello!!! How's it going? 😊" → "Hello Hows it going"
● Tokenization
Splits text into words or sentences.
Example: "NLP is fun!" → ["NLP", "is", "fun"]
● Stopword Removal
Removes common words (e.g., "the," "is," "and").
Example: "The sun is shining." → "sun shining."
● Spelling Correction
Fixes misspelled words.
Example: "Langage Processng is amzing!" → "Language Processing is amazing!"
● Lemmatization & Stemming
Both techniques reduce words to their base forms, but they differ in approach.
(a) Stemming:
● Reduces words to their root form by removing suffixes.
● Uses rules (may not always return valid words).
Example:
● "running" → "run"
● "flies" → "fli"
● "better" → "bet"
(b) Lemmatization:
● Uses a dictionary-based approach to find the correct root form.
● More accurate than stemming.
Example:
● "running" → "run"
● "flies" → "fly"
● "better" → "good" (uses context-based meaning)
3. Pre-Processing
Pre-processing is a crucial step in NLP to clean and structure text data before feeding it into
machine learning or deep learning models. It ensures consistency, removes noise, and
enhances the efficiency of NLP algorithms.
Key Pre-Processing Steps
1. Handling Missing Data
○ Imputation: Replacing missing values using statistical methods (e.g., mean,
mode, or predicting missing words using NLP techniques).
○ Deletion: Removing incomplete records if they are too noisy or sparse.
2. Text Normalization
○ Expanding contractions: ("I'm" → "I am")
○ Correcting abbreviations: ("u" → "you", "thx" → "thanks")
○ Lowercasing: ("HELLO" → "hello") to ensure consistency.
○ Lemmatization/Stemming: Reducing words to their base form ("running" →
"run").
3. Handling Outliers
○ Removing irrelevant or excessively noisy data such as special characters,
HTML tags, or random symbols.
4. POS (Part-of-Speech) Tagging
○ Assigning grammatical categories to words (e.g., noun, verb, adjective).
○ Helps in syntactic analysis and understanding sentence structure.
○ Example: "The cat sat on the mat." →
cat (NOUN), sat (VERB), mat (NOUN)
5. Named Entity Recognition (NER)
○ Identifies proper nouns such as names, places, and organizations.
○ Example: "Elon Musk founded Tesla in California."
■ Elon Musk (PERSON), Tesla (ORG), California (LOCATION)
Preliminaries of Pre-Processing – To perform these tasks, different approaches are used:
1. Rule-Based Methods
○ Use regular expressions (regex) and heuristics to clean text.
○ Example: Regex for removing special characters → [^a-zA-Z0-9]
○ Simple and interpretable but lacks adaptability.
2. Statistical Methods
○ Use TF-IDF (Term Frequency-Inverse Document Frequency) to weight
important words.
○ N-grams capture phrase-level patterns (bigrams: "New York", trigrams: "San
Francisco Bay").
○ Useful for keyword extraction and feature engineering (a short sketch follows this list).
3. Deep Learning-Based Methods
○ Word Embeddings (Word2Vec, GloVe, FastText) map words into dense
vector spaces.
○ Capture semantic meanings (e.g., "king" - "man" + "woman" ≈ "queen").
○ Used in advanced NLP tasks like sentiment analysis, translation, and
chatbots.
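A brief sketch of the TF-IDF weighting and n-gram features described in the statistical methods above, using scikit-learn. The toy corpus is an assumption for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed toy corpus
corpus = [
    "I visited New York last summer",
    "San Francisco Bay is beautiful",
    "New York and San Francisco are large cities",
]

# Unigrams + bigrams, weighted by TF-IDF
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
X = vectorizer.fit_transform(corpus)

print(X.shape)                                   # (documents, unigram + bigram features)
print(vectorizer.get_feature_names_out())        # includes bigram features such as 'new york', 'san francisco'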
4. Feature Engineering - Feature engineering extracts useful information from text. This
can be done using:
Classical NLP/ML Methods
● Bag of Words (BoW) – Count-based representation
● TF-IDF (Term Frequency-Inverse Document Frequency) – Weighted word
importance
● n-grams – Consecutive word sequences (bigrams, trigrams)
Deep Learning-Based Methods
● Word embeddings (Word2Vec, GloVe, FastText)
● Contextual embeddings (BERT, GPT, ELMo)
● Transformer-based representations (T5, XLNet)
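A minimal Word2Vec training sketch with gensim (assuming gensim 4.x, where the dimensionality parameter is vector_size). The tiny corpus is far too small to produce meaningful vectors and is only an illustrative assumption.

from gensim.models import Word2Vec

# Assumed toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# Train dense word vectors (100 dimensions, context window of 2 words)
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, epochs=50)

# Cosine similarity between two word vectors
print(model.wv.similarity("king", "queen"))

# Analogy-style query: king - man + woman ≈ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

In practice, pretrained embeddings (trained on billions of words) are loaded instead of training on a handful of sentences.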
5. Modeling - Once features are extracted, models are trained for text classification,
sentiment analysis, etc.
Classical ML Approaches
● Naïve Bayes
● Logistic Regression
● Support Vector Machines (SVM)
● Random Forest
Deep Learning Pipeline
● Recurrent Neural Networks (RNNs)
● Long Short-Term Memory (LSTM)
● Gated Recurrent Units (GRU)
● Transformer-based models (BERT, GPT)
6. Model Evaluation
Evaluating model performance using:
● Accuracy, Precision, Recall, F1-score (for classification tasks)
● BLEU, ROUGE (for text generation tasks)
● Perplexity (for language models)
7. Deployment
Once the model is trained and evaluated, it is deployed using:
● APIs (Flask, FastAPI)
● Cloud services (AWS, GCP, Azure)
8. Monitoring and Updating
● Detecting model drift (performance degradation over time)
● Updating the model with new data and Continual learning and fine-tuning
Evaluation Metrics help measure the performance of models in terms of accuracy,
efficiency, and real-world effectiveness.
1. Evaluation Metrics for Classification
In text classification tasks (such as spam detection, sentiment analysis, and topic
classification), we primarily use the following metrics:
1.1 Accuracy: The proportion of correctly classified instances over the total instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
● TP (True Positive): Correctly predicted positive samples
● TN (True Negative): Correctly predicted negative samples
● FP (False Positive): Incorrectly predicted positive samples
● FN (False Negative): Incorrectly predicted negative samples
1.2 Precision (Positive Predictive Value): The proportion of correctly predicted positive
samples out of all predicted positive samples.
Precision = TP / (TP + FP)
Application: Useful in applications like fake news detection, where false positives are costly.
1.3 Recall (Sensitivity): The proportion of correctly predicted positive samples out of all
actual positive samples.
Recall = TP / (TP + FN)
Application: Used in medical text classification (e.g., disease detection from clinical notes),
where missing a positive case is critical.
1.4 F1-Score: The harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
1.5 ROC-AUC (Receiver Operating Characteristic - Area Under Curve):
● ROC Curve → Plots True Positive Rate (TPR) vs. False Positive Rate (FPR).
● AUC (Area Under Curve) → Measures how well a model separates classes.
● Higher AUC (close to 1.0) → Better model performance.
● AUC = 0.5 → Random guessing (bad model)
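A small sketch computing the classification metrics above with scikit-learn. The label and score arrays are made-up assumptions standing in for real model outputs.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Assumed true labels and model predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]  # predicted probabilities, used for ROC-AUC

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))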
2. Evaluation Metrics for Regression
In NLP, regression models are used in tasks such as sentiment intensity prediction and
readability score prediction.
2.1 Mean Squared Error (MSE)
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
● yᵢ = Actual value
● ŷᵢ = Predicted value
● n = Total number of samples
2.2 Mean Absolute Error (MAE)
MAE = (1/n) Σ |yᵢ − ŷᵢ|
2.3 R-squared (R²)
R² = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)²
where ȳ is the mean of the actual values.
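The regression metrics above in a short scikit-learn sketch; the actual and predicted values are assumed for illustration (e.g., sentiment-intensity scores).

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Assumed actual vs. predicted sentiment-intensity scores
y_true = [0.9, 0.1, 0.5, 0.7]
y_pred = [0.8, 0.2, 0.4, 0.9]

print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R² :", r2_score(y_true, y_pred))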
3. Evaluation Metrics for Language Models and Embeddings
3.1 Perplexity measures how uncertain a language model is when predicting the next
word. It is the exponential of the average negative log-probability of the words:
Perplexity = exp( −(1/N) Σ log P(wᵢ | w₁, …, wᵢ₋₁) ), where N is the number of words.
Example: If a model predicts "The sky is blue" with high confidence → Low PPL (better
model). If it assigns similar probability to many possible words (blue, happy, running) → High
PPL (worse model).
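A tiny sketch of the perplexity formula above, computed from assumed next-word probabilities a language model might assign to each actual word of "The sky is blue".

import math

# Assumed probabilities the model assigns to each actual next word
word_probs = [0.4, 0.5, 0.6, 0.7]

# Perplexity = exp( -(1/N) * sum(log p_i) )
ppl = math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))
print(ppl)  # lower perplexity → the model is less "surprised" by the text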
3.2 BLEU Score (Bilingual Evaluation Understudy)
Measures how similar a machine-generated text is to a human reference translation.
BLEU = BP × exp( Σ wn · log pn ), where BP is the Brevity Penalty, pn is the modified n-gram
precision, and wn are the n-gram weights.
● Compares matching n-grams (word sequences).
● Score range: 0 to 1 (higher = better).
Example:
● Reference: "The cat is on the mat."
● Generated: "The cat is sitting on the mat."
● More matches → Higher BLEU Score → Better translation!
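A minimal BLEU computation with NLTK for the reference/candidate pair above. Smoothing is applied because the sentences are very short; that is an implementation choice for this sketch, not part of the original example.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]           # list of reference token lists
candidate = ["the", "cat", "is", "sitting", "on", "the", "mat"]  # generated translation tokens

smooth = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(score)  # closer to 1.0 means a closer match to the reference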
3.3 Word Embedding Evaluation (Cosine Similarity)
Measures how similar two word vectors A and B are:
cosine(A, B) = (A · B) / (||A|| × ||B||)
Values close to 1 indicate semantically similar words; values near 0 indicate unrelated words.
3.4 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Used for: Evaluating text summarization by comparing generated text with human-written
references.
● ROUGE-N → Measures n-gram overlap (e.g., ROUGE-1 for unigrams, ROUGE-2 for
bigrams).
● ROUGE-L → Measures Longest Common Subsequence (LCS) overlap.
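A hand-rolled ROUGE-1 recall sketch (unigram overlap against the reference). Real evaluations typically use a dedicated ROUGE package; this is only an illustration of the idea.

from collections import Counter

reference = "the cat is on the mat".split()
generated = "the cat sat on the mat".split()

ref_counts, gen_counts = Counter(reference), Counter(generated)

# ROUGE-1 recall: overlapping unigrams / total unigrams in the reference
overlap = sum(min(ref_counts[w], gen_counts[w]) for w in ref_counts)
rouge_1_recall = overlap / sum(ref_counts.values())
print(rouge_1_recall)  # 5 overlapping unigrams out of 6 → ≈ 0.83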
Post-Modelling Phases in NLP
After model training and evaluation, several steps ensure the model remains effective in
real-world applications. These include:
1. Hyperparameter Tuning – Optimizing parameters (learning rate, batch size) to
improve performance.
2. Error Analysis – Identifying misclassified samples to refine the model.
3. Model Deployment – Hosting the model using APIs, cloud services (AWS, GCP), or
edge devices.
4. Monitoring & Maintenance – Tracking performance, re-training with fresh data, and
handling bias.
5. User Feedback & Iteration – Collecting real-world feedback to refine and improve
the model.
NLP Modelling: Classical ML Pipeline vs Deep Learning Pipeline
The classical ML pipeline relies on manually engineered features (BoW, TF-IDF, n-grams) fed
into models such as Naïve Bayes or SVM, whereas the deep learning pipeline learns features
automatically through embeddings and neural architectures (RNNs, LSTMs, Transformers).
The following sections illustrate common NLP preprocessing concepts with clear examples:
1. Tokenization
Tokenization is the process of splitting text into smaller units called tokens, such as words,
subwords, or sentences.
Example - "Natural Language Processing is fun!"
Word Tokenization Output: - ['Natural', 'Language', 'Processing', 'is', 'fun', '!']
2. Stemming
Stemming reduces a word to its base/root form, which may not be a valid word. It uses
heuristic rules (often just chopping off suffixes).
Example:
Words: "playing", "played", "plays"
Stemmed: "play", "play", "play"
Note: "relational" → "relat", which is not a real word.
3. Lemmatization
Lemmatization reduces a word to its lemma (dictionary form), considering the part of
speech (POS).
Example:
Words: "am", "are", "is", "was"
Lemma: "be"
Words: "better"
Lemma: "good"
✅ Lemmatization is more accurate than stemming but computationally heavier.
4. POS Tagging (Part of Speech Tagging)
POS tagging assigns each word a part of speech (e.g., noun, verb, adjective) based on
context.
Example:
Sentence: "The dog barks loudly."
POS Tags: [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB')]
Common POS Tags:
● NN: Noun
● VBZ: Verb (3rd person singular)
● JJ: Adjective
● RB: Adverb
● DT: Determiner
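A short NLTK sketch of POS tagging for the sentence above. It assumes the tokenizer and tagger data have been downloaded; exact tags may vary slightly by NLTK version.

import nltk

# One-time downloads (assumed already available):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The dog barks loudly.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB'), ('.', '.')]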
5. NER (Named Entity Recognition) – NER identifies and classifies named entities in text
into predefined categories such as person, organization, location, etc.
Sentence: "Barack Obama was born in Hawaii and served as the president of the USA."
NER Output:
[
('Barack Obama', PERSON),
('Hawaii', LOCATION),
('USA', LOCATION)
]
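A minimal spaCy sketch for the NER output above. It assumes the small English model en_core_web_sm has been installed; note that spaCy labels locations such as "Hawaii" and "USA" as GPE rather than LOCATION.

import spacy

# Assumes the model was installed with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Barack Obama was born in Hawaii and served as the president of the USA.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Barack Obama PERSON / Hawaii GPE / USA GPE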
EXAMPLE: Let's analyze the sentence "Children love to play in a garden."
for both POS tagging and NER.
1. POS Tagging (Part-of-Speech Tagging)
POS Tags: [('Children', 'NNS'), ('love', 'VBP'), ('to', 'TO'), ('play', 'VB'), ('in', 'IN'), ('a', 'DT'), ('garden', 'NN')]
POS Interpretation: "Children" is the subject (plural noun), "love" is the main verb, "to play"
is an infinitive verb phrase, "in a garden" is a prepositional phrase indicating location.
2. NER (Named Entity Recognition)
NER attempts to find named entities like:
● PERSON, LOCATION ,ORGANIZATION ,DATE ,TIME, GPE (Geo-political Entity),
etc.
NER Output: No named entities detected.
Explanation:
● "Children", "garden" are common nouns, not proper nouns or named entities.
● There are no people, places (like "London"), organizations, or dates here.