Lab 2

The document provides a step-by-step guide on text preprocessing using the Natural Language Toolkit (nltk) in Python, focusing on tokenization, stop word removal, and stemming. It explains the importance of libraries like stopwords and PorterStemmer, and demonstrates the process with a sample text. The output includes filtered and stemmed words, highlighting the transformation of text data for improved analysis.

Uploaded by

Srihari Rao Panuganti


import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time resource downloads: uncomment on a new system.
# nltk.download('punkt_tab')
# nltk.download('stopwords')

# Sample text
text = "This is an example of text preprocessing using stop word removal and stemming."

# Tokenize the lowercased text
words = word_tokenize(text.lower())

# Remove punctuation tokens and stop words
filtered_words = [word for word in words if word.isalnum() and word not in stopwords.words('english')]

# Apply stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]

print("Filtered Words:", filtered_words)
print("Stemmed Words:", stemmed_words)

Output:

Filtered Words: ['example', 'text', 'preprocessing', 'using', 'stop', 'word', 'removal', 'stemming']
Stemmed Words: ['exampl', 'text', 'preprocess', 'use', 'stop', 'word', 'remov', 'stem']

Step-by-Step Breakdown
🔹 Step 1: Import Required Libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

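• nltk: The Natural Language Toolkit, a Python library for text processing.
• stopwords: A corpus reader that provides stop word lists for several languages; this lab uses the English list.
• PorterStemmer: An implementation of the Porter stemming algorithm, which strips suffixes to reduce related words to a common stem.
• word_tokenize: Splits text into individual tokens (words and punctuation marks).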

🔹 Step 2: (Optional) Download Necessary Resources

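nltk.download('punkt_tab')  # use 'punkt' on older NLTK releases
nltk.download('stopwords')

• punkt_tab: Pre-trained Punkt tokenizer data that word_tokenize relies on. Recent NLTK releases load 'punkt_tab'; older releases use 'punkt'.
• stopwords: Provides the lists of stop words in various languages.

✅ You only need to run this once on a new system; the files are saved to the local nltk_data directory.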
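
One way to avoid repeating the download on every run is to check whether a resource is already installed before fetching it. The sketch below relies on nltk.data.find, which raises LookupError when a resource is missing. ensure_resource is a hypothetical helper name, not part of NLTK, and the resource paths mentioned afterwards assume NLTK's usual data layout.

```python
import nltk


def ensure_resource(resource_path, package_name):
    """Download an NLTK package only if it is not already installed.

    ensure_resource is a hypothetical helper, not an NLTK function.
    Returns True if a download was attempted, False if the resource was found.
    """
    try:
        # Raises LookupError when the resource is absent from every nltk_data directory.
        nltk.data.find(resource_path)
        return False
    except LookupError:
        nltk.download(package_name, quiet=True)
        return True
```

For this lab the calls would likely be ensure_resource('tokenizers/punkt_tab', 'punkt_tab') and ensure_resource('corpora/stopwords', 'stopwords'); on older NLTK releases the tokenizer resource is 'tokenizers/punkt' with package name 'punkt'.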

🔹 Step 3: Provide Sample Text

text = "This is an example of text preprocessing using stop word removal and stemming."

This is the raw input sentence to be cleaned.

🔹 Step 4: Tokenization
words = word_tokenize(text.lower())

• Converts the text to lowercase.
• Splits it into tokens; the final period becomes a token of its own.

Output:

['this', 'is', 'an', 'example', 'of', 'text', 'preprocessing', 'using',
 'stop', 'word', 'removal', 'and', 'stemming', '.']
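
To make the tokenization step concrete, here is a rough standard-library approximation of it. This is not how word_tokenize works internally (NLTK uses trained Punkt data and more careful rules for contractions and abbreviations). simple_tokenize and its regular expression are assumptions chosen only to show the lowercase-then-split idea on the sample sentence.

```python
import re


def simple_tokenize(text):
    """Lowercase the text, then split it into word tokens and single punctuation tokens.

    A simplified stand-in for word_tokenize, for illustration only.
    """
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text.lower())


text = "This is an example of text preprocessing using stop word removal and stemming."
tokens = simple_tokenize(text)
print(tokens)
```

On the sample sentence this produces the same 14 tokens as the output shown for this step. On text with contractions or abbreviations, the two approaches will generally differ.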

🔹 Step 5: Stop Word Removal

filtered_words = [word for word in words
                  if word.isalnum() and word not in stopwords.words('english')]

• word.isalnum(): Keeps only tokens made up entirely of letters and digits, which drops punctuation tokens such as ".".
• word not in stopwords.words('english'): Removes common stop words such as "this", "is", "an", "of", and "and".

Output:

['example', 'text', 'preprocessing', 'using', 'stop', 'word', 'removal',
'stemming']
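
The list comprehension in this step evaluates stopwords.words('english') again for every alphanumeric token and tests membership against a list. A common refinement is to build the stop word collection once, as a set. The sketch below shows the idea with a small hand-written stop set so that it runs without any NLTK data. STAND_IN_STOP_WORDS and remove_stop_words are hypothetical names, and in the lab the set would instead be built with set(stopwords.words('english')).

```python
# Stand-in stop list for illustration only; it is NOT NLTK's English list.
STAND_IN_STOP_WORDS = {"this", "is", "an", "of", "and"}


def remove_stop_words(tokens, stop_words):
    """Keep alphanumeric tokens that are not in the given stop word set."""
    return [tok for tok in tokens if tok.isalnum() and tok not in stop_words]


words = ['this', 'is', 'an', 'example', 'of', 'text', 'preprocessing', 'using',
         'stop', 'word', 'removal', 'and', 'stemming', '.']
filtered_words = remove_stop_words(words, STAND_IN_STOP_WORDS)
print(filtered_words)
```

Passing the stop words as a parameter keeps the filter independent of where the list comes from, so the same function works with NLTK's list or a custom one.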

🔹 Step 6: Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]

• Initializes the Porter stemmer.
• Applies stemming to each word in filtered_words.

Output:

['exampl', 'text', 'preprocess', 'use', 'stop', 'word', 'remov', 'stem']

Note: Stemming may generate non-standard English words (e.g., exampl, remov).

🔹 Step 7: Output Results

print("Filtered Words:", filtered_words)
print("Stemmed Words:", stemmed_words)

What are Stop Words?

Stop words are commonly used words in a language (such as "is", "the", "and", "in") that
carry little meaningful information on their own. They are usually removed to reduce the
dimensionality of the data and to keep later analysis focused on content-bearing words.

What are Stemmed Words?

Stemmed words are the shortened forms produced by stemming, a rule-based process that strips
suffixes so that related words map to the same stem. The stem may not be a real word in the
language, but it groups together words that share a core meaning.

Original Word    Stemmed Word
running          run
runs             run
studies          studi
studying         studi
removal          remov
stemming         stem
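
The table can be reproduced directly with PorterStemmer, which needs no downloaded resources. This is a small sketch using the same stemmer class as the lab code; the word list is taken from the table, and the variable names are illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
originals = ["running", "runs", "studies", "studying", "removal", "stemming"]

# Map each original word to its Porter stem.
stems = {word: stemmer.stem(word) for word in originals}

for word, stem in stems.items():
    print(f"{word:<10} {stem}")
```

Note that "running" and "runs" collapse to one stem, as do "studies" and "studying", which is what lets later analysis treat them as the same term.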
