import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
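# Uncomment on first run to fetch the required data
# (newer NLTK releases use 'punkt_tab' in place of 'punkt'):
# nltk.download('punkt_tab')
# nltk.download('stopwords')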
# Sample text
text = "This is an example of text preprocessing using stop word removal and stemming."
# Tokenize
words = word_tokenize(text.lower())
# Remove stop words
filtered_words = [word for word in words if word.isalnum() and word not in stopwords.words('english')]
# Apply stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print("Filtered Words:", filtered_words)
print("Stemmed Words:", stemmed_words)
Output:
Filtered Words: ['example', 'text', 'preprocessing', 'using', 'stop', 'word', 'removal', 'stemming']
Stemmed Words: ['exampl', 'text', 'preprocess', 'use', 'stop', 'word', 'remov', 'stem']
Step-by-Step Breakdown
🔹 Step 1: Import Required Libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk: The Natural Language Toolkit, a Python library for text processing.
stopwords: A corpus reader that provides stop-word lists for several languages, including English.
PorterStemmer: A rule-based stemming algorithm that strips suffixes to reduce words to a common stem.
word_tokenize: Splits text into individual tokens (words and punctuation marks).
🔹 Step 2: (Optional) Download Necessary Resources
nltk.download('punkt')
nltk.download('stopwords')
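punkt: Pre-trained sentence tokenizer model that word_tokenize relies on. Newer NLTK releases load this resource as 'punkt_tab' instead; if word_tokenize raises a LookupError naming punkt_tab, run nltk.download('punkt_tab').
stopwords: Provides the stop-word lists for various languages.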
✅ You only need to run this once on a new system.
🔹 Step 3: Provide Sample Text
text = "This is an example of text preprocessing using stop word removal and stemming."
This is the raw input sentence to be cleaned.
🔹 Step 4: Tokenization
words = word_tokenize(text.lower())
Converts the text to lowercase, then splits it into tokens.
Punctuation becomes a token of its own, as the final '.' shows.
Output:
['this', 'is', 'an', 'example', 'of', 'text', 'preprocessing', 'using',
'stop', 'word', 'removal', 'and', 'stemming', '.']
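To build intuition for what tokenization does, here is a rough stand-in written with the standard library only. It is not NLTK's algorithm: the regular expression is an assumption chosen for illustration, and it happens to give the same tokens as word_tokenize for this particular sentence. The two diverge on harder input such as contractions, which word_tokenize splits as "do" and "n't".

```python
import re

# Same sample sentence as in the walkthrough.
text = "This is an example of text preprocessing using stop word removal and stemming."

# \w+ matches a run of letters, digits, or underscores;
# [^\w\s] matches a single punctuation character.
tokens = re.findall(r"\w+|[^\w\s]", text.lower())

print(tokens)
```

For real text, word_tokenize remains the better choice; the sketch only shows why the trailing period ends up as a separate token.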
🔹 Step 5: Stop Word Removal
filtered_words = [word for word in words if word.isalnum() and word not in stopwords.words('english')]
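word.isalnum(): Keeps only tokens made up entirely of letters and digits, which drops punctuation tokens such as ".".
word not in stopwords.words('english'): Removes common stop words such as "this", "is", "an", "of", and "and".
stopwords.words('english') returns a list, so each membership test scans it; for larger inputs, convert it to a set once and test against that.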
Output:
['example', 'text', 'preprocessing', 'using', 'stop', 'word', 'removal',
'stemming']
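The isalnum() filter is stricter than it may look, so a quick check on a few hand-picked tokens can help. The token list below is made up for illustration and does not come from the sample sentence.

```python
# Hypothetical tokens chosen to show what str.isalnum() keeps and drops.
tokens = ["text", "pre-processing", "n't", "2024", "3.14", "word2vec", "."]

# isalnum() is True only when every character is a letter or a digit.
kept = [t for t in tokens if t.isalnum()]
dropped = [t for t in tokens if not t.isalnum()]

print(kept)     # ['text', '2024', 'word2vec']
print(dropped)  # ['pre-processing', "n't", '3.14', '.']
```

Hyphenated words, contraction pieces, and decimal numbers are removed along with punctuation, while pure numbers survive. If numbers should go as well, str.isalpha() is one alternative to consider.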
🔹 Step 6: Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
Initializes the Porter Stemmer.
Applies stemming to each word in filtered_words.
Output:
['exampl', 'text', 'preprocess', 'use', 'stop', 'word', 'remov', 'stem']
Note: Stemming may generate non-standard English words (e.g., exampl, remov).
🔹 Step 7: Output Results
print("Filtered Words:", filtered_words)
print("Stemmed Words:", stemmed_words)
What are Stop Words?
Stop words are commonly used words in a language (such as "is", "the", "and", "in", etc.) that
carry little meaningful information on their own. They are usually removed to shrink the vocabulary,
and with it the dimensionality of the data, which can speed up processing and reduce noise in later steps.
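The effect on vocabulary size can be seen on a toy corpus. The three sentences and the small stop-word set below are invented for illustration; a real pipeline would use a full list such as NLTK's.

```python
# Toy corpus and a hand-written stop-word set (both are illustrative assumptions).
docs = [
    "the cat sat on the mat",
    "the dog is in the house",
    "a cat and a dog",
]
STOP_WORDS = {"the", "on", "is", "in", "a", "and"}

# Flatten the corpus into one token list, then filter out stop words.
tokens = [word for doc in docs for word in doc.split()]
kept = [word for word in tokens if word not in STOP_WORDS]

# The vocabulary is the set of distinct tokens.
vocab_before = set(tokens)
vocab_after = set(kept)

print(len(tokens), len(vocab_before))  # 17 11
print(len(kept), len(vocab_after))     # 7 5
```

In a bag-of-words representation each distinct token is one feature, so here the feature count drops from 11 to 5.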
What are Stemmed Words?
Stemmed words are the reduced forms of words produced by a process called stemming.
Stemming strips suffixes to reduce a word to its stem, a form that may not be a real word in the language
but that groups related word forms together.
Original Word    Stemmed Word
running          run
runs             run
studies          studi
studying         studi
removal          remov
stemming         stem
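To see why stems such as "studi" arise, the sketch below implements a deliberately naive suffix stripper. It is not the Porter algorithm: the rule list, the length guard, and the name toy_stem are all assumptions made for illustration, tuned only so that the six words in the table come out the same.

```python
# A naive suffix stripper for illustration only (NOT the Porter algorithm).
# Rules are tried in order; the first matching suffix is replaced.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("al", ""), ("s", "")]
VOWELS = set("aeiou")


def toy_stem(word: str) -> str:
    stem = word
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)] + replacement
            break
    # Collapse a doubled final consonant: "runn" -> "run".
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        stem = stem[:-1]
    # Rewrite a final "y" as "i" so "study" and "studies" share a stem.
    if stem.endswith("y"):
        stem = stem[:-1] + "i"
    return stem


for word in ["running", "runs", "studies", "studying", "removal", "stemming"]:
    print(word, "->", toy_stem(word))
```

A handful of rules like these will mis-stem many other words, which is why PorterStemmer applies its rules in several phases with extra conditions on the remaining stem. For actual preprocessing, use PorterStemmer as shown in Step 6.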