Lab 2

The document provides a step-by-step guide on text preprocessing using the Natural Language Toolkit (nltk) in Python, focusing on tokenization, stop word removal, and stemming. It explains the importance of libraries like stopwords and PorterStemmer, and demonstrates the process with a sample text. The output includes filtered and stemmed words, highlighting the transformation of text data for improved analysis.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Uncomment and run once if the tokenizer data is not installed yet:
# nltk.download('punkt_tab')

# Sample text
text = "This is an example of text preprocessing using stop word removal and stemming."

# Tokenize (lowercase first so tokens match the lowercase stop word list)
words = word_tokenize(text.lower())

# Remove punctuation tokens and stop words
filtered_words = [word for word in words if word.isalnum() and word not in stopwords.words('english')]

# Apply stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]

print("Filtered Words:", filtered_words)
print("Stemmed Words:", stemmed_words)

Output:

Filtered Words: ['example', 'text', 'preprocessing', 'using', 'stop', 'word', 'removal', 'stemming']
Stemmed Words: ['exampl', 'text', 'preprocess', 'use', 'stop', 'word', 'remov', 'stem']

Step-by-Step Breakdown
🔹 Step 1: Import Required Libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

• nltk: The Natural Language Toolkit, a Python library for text processing.
• stopwords: A corpus reader that provides lists of common words to remove, for English and other languages.
• PorterStemmer: An implementation of the Porter stemming algorithm, which strips suffixes to reduce words to a common stem.
• word_tokenize: Splits text into individual tokens (words and punctuation marks).

🔹 Step 2: (Optional) Download Necessary Resources


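nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load this resource instead of 'punkt'
nltk.download('stopwords')

• punkt: Pre-trained sentence tokenizer model. word_tokenize uses it to split text into sentences before splitting each sentence into words. Recent NLTK releases load the same model from the punkt_tab resource.
• stopwords: Provides the lists of stop words for various languages.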

✅ You only need to run this once on a new system.
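
Because the download is only needed once, one possible approach is to check for each resource first and download it only when it is missing. The sketch below is illustrative: the helper name ensure_resources and the RESOURCES mapping are hypothetical, and the data paths assume the usual nltk_data layout.

```python
import nltk

# Resource name -> path inside the NLTK data directory.
# These paths are assumptions based on the usual nltk_data layout.
RESOURCES = {
    "punkt_tab": "tokenizers/punkt_tab",
    "stopwords": "corpora/stopwords",
}

def ensure_resources(resources=RESOURCES):
    """Download each resource only if nltk.data.find cannot locate it.

    Returns the list of resource names that had to be downloaded.
    """
    downloaded = []
    for name, path in resources.items():
        try:
            nltk.data.find(path)          # raises LookupError when missing
        except LookupError:
            nltk.download(name, quiet=True)
            downloaded.append(name)
    return downloaded
```

Calling ensure_resources() once at the top of the lab script would then replace the manual download step. On an older NLTK release that still uses 'punkt', the mapping would need the entry "punkt": "tokenizers/punkt" instead.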

🔹 Step 3: Provide Sample Text


text = "This is an example of text preprocessing using stop word removal and stemming."

This is the raw input sentence to be cleaned.

🔹 Step 4: Tokenization
words = word_tokenize(text.lower())

• Converts the text to lowercase.
• Splits it into tokens. The trailing period becomes a token of its own.

Output:

['this', 'is', 'an', 'example', 'of', 'text', 'preprocessing', 'using', 'stop', 'word', 'removal', 'and', 'stemming', '.']
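
To see why a tokenizer is used rather than a plain whitespace split, the sketch below compares the two on the sample sentence. It swaps in wordpunct_tokenize, NLTK's regex-based tokenizer, only because it needs no downloaded data; it is not the word_tokenize call used in this lab, although for this particular sentence it yields the same tokens as the output shown above.

```python
from nltk.tokenize import wordpunct_tokenize

text = "This is an example of text preprocessing using stop word removal and stemming."

# Plain whitespace split: the period stays glued to the last word.
split_tokens = text.lower().split()

# Regex-based tokenizer: punctuation becomes a separate token.
regex_tokens = wordpunct_tokenize(text.lower())

print(split_tokens[-1])    # stemming.
print(regex_tokens[-2:])   # ['stemming', '.']
```

With the whitespace split, the later isalnum() check would discard "stemming." entirely, so the word itself would be lost.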

🔹 Step 5: Stop Word Removal


filtered_words = [word for word in words if word.isalnum() and word not in stopwords.words('english')]

• word.isalnum(): Keeps only tokens made up entirely of letters and digits, which removes punctuation tokens such as ".".
• word not in stopwords.words('english'): Removes common stop words such as "this", "is", "an", "of", and "and".

Output:

['example', 'text', 'preprocessing', 'using', 'stop', 'word', 'removal', 'stemming']
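
The comprehension above calls stopwords.words('english') again for every token and searches a list each time. A common alternative is to build a set once and reuse it. The sketch below assumes a hypothetical helper remove_stop_words and uses a small hand-picked DEMO_STOP_WORDS list so that it runs without any downloads; in the lab, the second argument would be stopwords.words('english').

```python
def remove_stop_words(tokens, stop_words):
    """Keep alphanumeric tokens that are not stop words."""
    stop_set = set(stop_words)  # built once; set lookups are fast
    return [t for t in tokens if t.isalnum() and t not in stop_set]

# Hand-picked stand-in for stopwords.words('english') (an assumption for this demo).
DEMO_STOP_WORDS = ["this", "is", "an", "of", "and"]

words = ['this', 'is', 'an', 'example', 'of', 'text', 'preprocessing', 'using',
         'stop', 'word', 'removal', 'and', 'stemming', '.']

filtered = remove_stop_words(words, DEMO_STOP_WORDS)
print(filtered)
# ['example', 'text', 'preprocessing', 'using', 'stop', 'word', 'removal', 'stemming']

# isalnum() also rejects tokens that contain a hyphen or apostrophe.
print("stop-word".isalnum())  # False
```

The last line is a reminder that isalnum() is a blunt filter: hyphenated tokens are dropped along with punctuation, which may or may not be what an analysis needs.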

🔹 Step 6: Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]

• Initializes the Porter Stemmer.
• Applies stemming to each word in filtered_words.

Output:

['exampl', 'text', 'preprocess', 'use', 'stop', 'word', 'remov', 'stem']

Note: Stemming may generate non-standard English words (e.g., exampl, remov).
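
To make the effect of stemming easier to inspect, each word can be printed next to its stem. This is a small illustrative sketch that reuses the filtered word list from Step 5; the variable name changed is introduced here only for the demonstration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
filtered_words = ['example', 'text', 'preprocessing', 'using',
                  'stop', 'word', 'removal', 'stemming']

# Print each word beside its stem and flag the ones left untouched.
for word in filtered_words:
    stem = stemmer.stem(word)
    note = "(unchanged)" if stem == word else ""
    print(f"{word:<15} -> {stem} {note}")

# Words whose stem differs from the original form.
changed = [w for w in filtered_words if stemmer.stem(w) != w]
print(changed)
# ['example', 'preprocessing', 'using', 'removal', 'stemming']
```

PorterStemmer needs no downloaded data, so this runs with only the nltk package installed.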

🔹 Step 7: Output Results


print("Filtered Words:", filtered_words)
print("Stemmed Words:", stemmed_words)
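
The steps above can also be collected into one reusable function. The following is a minimal sketch, not part of the original lab: the function name preprocess and its parameters are hypothetical, and the stop word list and tokenizer are passed in so they can be swapped.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(text, stop_words, tokenize=word_tokenize):
    """Lowercase, tokenize, drop punctuation and stop words, then stem.

    Returns a (filtered_words, stemmed_words) pair.
    """
    stop_set = set(stop_words)
    stemmer = PorterStemmer()
    tokens = tokenize(text.lower())
    filtered = [t for t in tokens if t.isalnum() and t not in stop_set]
    stemmed = [stemmer.stem(t) for t in filtered]
    return filtered, stemmed
```

With the resources from Step 2 installed, calling preprocess(text, stopwords.words('english')) should return the same two lists that the print statements above display.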

What are Stop Words?


Stop words are commonly used words in a language (such as "is", "the", "and", "in", etc.) that carry little meaningful information on their own. They are usually removed to shrink the vocabulary, which reduces the dimensionality of the data and the amount of text that later steps must process.
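
As a rough illustration of the dimensionality point, the sketch below counts the distinct tokens in a tiny made-up corpus before and after stop word removal. Both the three sentences and the DEMO_STOP_WORDS set are invented for this example and are not NLTK's list.

```python
# Tiny made-up corpus (an assumption for this demo).
docs = [
    "the cat sat on the mat",
    "the dog is in the house",
    "a cat and a dog",
]

# Hand-picked stop words, not NLTK's English list.
DEMO_STOP_WORDS = {"the", "on", "is", "in", "a", "and"}

# Vocabulary = set of distinct tokens across all documents.
vocab_all = {tok for doc in docs for tok in doc.split()}
vocab_kept = vocab_all - DEMO_STOP_WORDS

print(len(vocab_all), len(vocab_kept))  # 11 5
```

In a bag-of-words representation each distinct token is one feature, so here the feature count drops from 11 to 5.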

What are Stemmed Words?

Stemmed words are the reduced forms of words produced by a process called stemming. Stemming strips suffixes to reduce a word to its stem, a form that may not be a real word in the language but that lets related word forms be treated as the same term.

Original Word    Stemmed Word
running          run
runs             run
studies          studi
studying         studi
removal          remov
stemming         stem
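
The table can be reproduced, and the grouping effect made explicit, by collecting words under their stem. This is a small sketch using the same six words; the groups dictionary is introduced only for illustration.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
groups = defaultdict(list)

# Collect each word under the stem that PorterStemmer produces for it.
for word in ["running", "runs", "studies", "studying", "removal", "stemming"]:
    groups[stemmer.stem(word)].append(word)

print(dict(groups))
# {'run': ['running', 'runs'], 'studi': ['studies', 'studying'],
#  'remov': ['removal'], 'stem': ['stemming']}
```

Words that share a stem, such as "studies" and "studying", end up counted as one term in later analysis.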
