
Experiment No. 2
Aim: To study and implement preprocessing using tokenization, stop word removal,
stemming, and lemmatization.
Theory:
1. Token:
A token is a single, meaningful unit derived from text. It could be a word, a punctuation mark,
or even a number. Tokenization is the process of breaking down a text into these tokens. For
example, the sentence "I love natural language processing!" would be tokenized into: ["I",
"love", "natural", "language", "processing", "!"].
2. Tokenization:
Tokenization is the task of splitting a text into tokens. This is a crucial initial step in most NLP
tasks because it transforms raw text into manageable units for further analysis. Tokenization
can be done at different levels: word level, subword level (useful for languages with complex
word formation), or character level.
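As an illustration, a minimal word-level tokenization sketch using Python's NLTK library (assuming NLTK is installed and its tokenizer models, the 'punkt' resource in most versions, have been downloaded; the sample sentence is just an example):

import nltk
nltk.download('punkt')                # one-time download of the tokenizer models

from nltk.tokenize import word_tokenize

text = "I love natural language processing!"
tokens = word_tokenize(text)          # split the sentence into word and punctuation tokens
print(tokens)                         # ['I', 'love', 'natural', 'language', 'processing', '!']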
3. Stopwords:
Stopwords are common words that are often filtered out during text preprocessing because
they occur frequently in a language and do not carry much semantic meaning on their own.
Examples in English include articles ("the", "a"), prepositions ("at", "on"), conjunctions
("and", "but"), and some common verbs ("is", "are").
4. List of Stopwords:
A list of stopwords is a predefined set of words recognized as stopwords for a particular
language. These lists are used during text preprocessing to efficiently filter out these common
words from the text data. Different libraries and NLP tools provide their own lists of stopwords
tailored for specific tasks or applications.
A list of common stopwords in the English language includes:
[i, me, my, myself, we, our, ours, ourselves, you, your, yours, yourself, yourselves, he, him,
his, himself, she, her, hers, herself, it, its, itself, they, them, their, theirs, themselves, what,
which, who, whom, this, that, these, those, am, is, are, was, were, be, been, being, have, has,
had, having, do, does, did, doing, a, an, the, etc.]
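NLTK is one library that ships such a predefined list. A small sketch (assuming the 'stopwords' corpus has been downloaded) that loads and inspects it:

import nltk
nltk.download('stopwords')                      # one-time download of the stopword lists

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))    # NLTK's predefined English stopword list
print(len(stop_words))                          # number of stopwords in this particular list
print(sorted(stop_words)[:10])                  # a few sample entries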
5. How to remove stopwords:
To remove stopwords from text data:
o Tokenize the text to break it down into individual words or tokens.
o Compare each token against a predefined list of stopwords.
o Exclude tokens that match stopwords from the tokenized list.
The result is a cleaned text with stopwords removed, containing only meaningful words that
carry more semantic value.
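A minimal sketch of these three steps using NLTK (stopword list and tokenizer as above; the input sentence is only an example):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

text = "This is a simple example showing the removal of stopwords"
tokens = word_tokenize(text)

# keep only tokens that are not in the stopword list (case-insensitive comparison)
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)    # ['simple', 'example', 'showing', 'removal', 'stopwords']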
6. Stemming and Lemmatization:
Stemming: Stemming is the process of reducing words to their root or base form, even if the
result is not a real word. It works by chopping off prefixes or suffixes according to heuristic
rules; the Porter and Snowball algorithms are common examples.
Lemmatization: Lemmatization aims to reduce words to their canonical form (lemma), which
is a valid word present in the language. This involves using vocabulary and morphological
analysis of words, considering their meaning and context. Lemmatization typically uses
dictionaries and word structure rules to achieve accurate transformation.
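A minimal sketch of both techniques with NLTK's PorterStemmer and WordNetLemmatizer (assuming the 'wordnet' corpus has been downloaded; the word list is illustrative):

import nltk
nltk.download('wordnet')                 # one-time download needed by the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies", "running", "better"]

print([stemmer.stem(w) for w in words])                    # ['studi', 'run', 'better']
print([lemmatizer.lemmatize(w, pos='v') for w in words])   # as verbs: ['study', 'run', 'better']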
7. Key differences between Stemming and Lemmatization:
Goal: Stemming aims for word reduction to a base form, while lemmatization aims for
reduction to a valid word form.
Output: Stemming can sometimes generate non-words (for example, the Porter stemmer reduces
"goes" to "goe"), while lemmatization always produces valid words (like "goes" to "go").
Accuracy: Lemmatization is generally more accurate because it considers the context and
meaning of the word. Stemming, being a simpler rule-based process, may not always produce
accurate results but is faster.
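The contrast is easy to see by running both on the same words (a sketch with the same NLTK classes as above; the chosen words are only examples):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# stemming may produce non-words; lemmatization maps to valid dictionary forms
for word in ["goes", "studies", "caring"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos='v'))
# goes -> goe | go
# studies -> studi | study
# caring -> care | care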

Conclusion: Hence we studied and implemented preprocessing using tokenization, stop word
removal, stemming, and lemmatization.
