
# 🔹 Lab 1: Introduction to NLP

# 🎯 Objective: Learn basic text processing

# Step 1: Import libraries


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required resources (only first time)


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Step 2: Example text


text = "I am learning Natural Language Processing in class."

print("Original Text:")
print(text)

[nltk_data] Downloading package punkt to C:\Users\Sujay Kumar\AppData\Roaming\nltk_data...
[nltk_data] Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to C:\Users\Sujay Kumar\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to C:\Users\Sujay Kumar\AppData\Roaming\nltk_data...

Original Text:
I am learning Natural Language Processing in class.


Tokenization (Breaking text into words)


In simple words:

word_tokenize(text) → cuts the sentence into small pieces (words/punctuation).

tokens = ... → saves those words into a variable called tokens.


print("After Tokenization:") → shows a label so students know what’s coming.

print(tokens) → shows the actual list of words.

👉 Example: Input → "I am learning NLP in class." Output → ['I', 'am', 'learning', 'NLP', 'in', 'class', '.']

# Step 3: Tokenization (Breaking text into words)


tokens = word_tokenize(text)   # takes the sentence in 'text' and splits it into individual words (tokens)
print("After Tokenization:")   # prints a heading so the output looks clear
print(tokens)                  # prints the list of tokens (words) after splitting

After Tokenization:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'in', 'class', '.']
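
A quick extra check: word_tokenize also splits punctuation and contractions into separate tokens. A minimal sketch below, using a sentence of my own (not from the lab):

from nltk.tokenize import word_tokenize   # already imported in Step 1

print(word_tokenize("Don't stop! NLP is fun."))
# Expected output (Treebank-style splitting):
# ['Do', "n't", 'stop', '!', 'NLP', 'is', 'fun', '.']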

Removing Stopwords (like "am", "in", "the")


In simple words: We are cleaning the sentence by throwing away unnecessary words (like “is”,
“the”, “in”) that don’t add much meaning.

👉 Example: Input Tokens → ['I', 'am', 'learning', 'NLP', 'in', 'class'] Output → ['learning', 'NLP', 'class'] ('I' is removed too, since its lowercase form "i" is a stopword).

# Step 4: Removing Stopwords (like "am", "in", "the")

stop_words = set(stopwords.words('english'))
# Loads the list of common English words (like "am", "is", "the") from NLTK
# and stores them inside a set called stop_words.
# A set is used because membership checks are faster.

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Creates a new list: it goes through each word in tokens one by one,
# converts the word to lowercase (word.lower()),
# and keeps it ONLY if it is NOT in stop_words.

print("After Removing Stopwords:")
# Prints a heading so the output looks clear.

print(filtered_tokens)
# Prints the list of words after removing stopwords.
After Removing Stopwords:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
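
A detail worth a quick look: the filter uses word.lower() because NLTK's English stopword list is all lowercase. A tiny sketch (my own illustrative snippet) confirming why 'I' disappeared from the output above:

from nltk.corpus import stopwords   # already imported in Step 1

stop_words = set(stopwords.words('english'))
print('i' in stop_words)   # True  -> "I".lower() matches and is filtered out
print('I' in stop_words)   # False -> without .lower(), "I" would survive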

Stemming (cutting words to their root form)


In simple words: Stemming is like chopping words down to their base. The result may not always be a real word, but it captures the core root.

👉 Example: Input → ['learning', 'NLP', 'classes'] Output → ['learn', 'nlp', 'class']

# Step 5: Stemming (cutting words to their root form)


from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Creates a "stemmer" object using the Porter algorithm.

stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply stemmer.stem(word), which cuts the word down to its root form.
# Example: "learning" → "learn", "classes" → "class"

print("After Stemming:")
print(stemmed_tokens)
# Prints the list of words after stemming.

After Stemming:
['learn', 'natur', 'languag', 'process', 'class', '.']
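
The chopped forms 'natur' and 'languag' above are normal for the Porter algorithm: it strips suffixes mechanically, so the result is often not a dictionary word. A short sketch (example words chosen by me, not from the lab):

from nltk.stem import PorterStemmer   # already imported in Step 1

stemmer = PorterStemmer()
for word in ["studies", "university", "happily", "natural"]:
    print(word, "->", stemmer.stem(word))
# Expected: studies -> studi, university -> univers,
#           happily -> happili, natural -> natur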

# Lemmatization (finding correct base word using grammar rules)

In simple words: Lemmatization is smarter than stemming because it uses grammar + dictionary
to find the correct base word.

👉 Example: Input → ['better', 'running', 'classes'] Output → ['good', 'run', 'class'] (note: 'better' → 'good' and 'running' → 'run' only happen when the word's part of speech is passed in; see the sketch after the output below).

# Step 6: Lemmatization (finding correct base word using grammar rules)


from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Creates a "lemmatizer" object that uses the WordNet dictionary.

lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply lemmatizer.lemmatize(word), which returns the meaningful base form.
# By default every word is treated as a noun; pass pos='v' or pos='a'
# to get results like "running" → "run" or "better" → "good".

print("After Lemmatization:")
print(lemmatized_tokens)
# Prints the list of words after lemmatization.

After Lemmatization:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
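
Notice that 'learning' came through unchanged here, unlike in the stemming step. That is because lemmatize() treats every word as a noun unless you pass a part of speech. A hedged sketch (my own examples) of the pos parameter:

from nltk.stem import WordNetLemmatizer   # already imported in Step 1

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("learning"))            # 'learning' (treated as a noun)
print(lemmatizer.lemmatize("learning", pos='v'))   # 'learn' (treated as a verb)
print(lemmatizer.lemmatize("better", pos='a'))     # 'good' (treated as an adjective)
print(lemmatizer.lemmatize("classes"))             # 'class' (plural noun -> singular)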

