
# 🔹 Lab 1: Introduction to NLP

# 🎯 Objective: Learn basic text processing

# Step 1: Import libraries


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required resources (only first time)


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Step 2: Example text


text = "I am learning Natural Language Processing in class."

print("Original Text:")
print(text)

[nltk_data] Downloading package punkt to C:\Users\Sujay Kumar\AppData\Roaming\nltk_data...
[nltk_data] Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to C:\Users\Sujay Kumar\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to C:\Users\Sujay Kumar\AppData\Roaming\nltk_data...

Original Text:
I am learning Natural Language Processing in class.


Tokenization (Breaking text into words)


In simple words:

word_tokenize(text) → cuts the sentence into small pieces (words/punctuation).

tokens = ... → saves those words into a variable called tokens.


print("After Tokenization:") → shows a label so students know what’s coming.

print(tokens) → shows the actual list of words.

👉 Example: Input → "I am learning NLP in class." Output → ['I', 'am', 'learning', 'NLP', 'in', 'class', '.']

# Step 3: Tokenization (Breaking text into words)


tokens = word_tokenize(text)   # takes the sentence in 'text' and splits it into individual words (tokens)
print("After Tokenization:")   # prints a heading so the output looks clear
print(tokens)                  # prints the list of tokens (words) after splitting

After Tokenization:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'in', 'class', '.']
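
A quick extra check: word_tokenize also splits punctuation and contractions into separate tokens. A minimal sketch below, using a sentence of my own (not from the lab):

from nltk.tokenize import word_tokenize   # already imported in Step 1

print(word_tokenize("Don't stop! NLP is fun."))
# Expected output (Treebank-style splitting):
# ['Do', "n't", 'stop', '!', 'NLP', 'is', 'fun', '.']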

Removing Stopwords (like "am", "in", "the")


In simple words: We are cleaning the sentence by throwing away unnecessary words (like “is”,
“the”, “in”) that don’t add much meaning.

👉 Example: Input Tokens → ['I', 'am', 'learning', 'NLP', 'in', 'class'] Output → ['learning', 'NLP', 'class'] ('I' is removed too, since its lowercase form "i" is a stopword).

# Step 4: Removing Stopwords (like "am", "in", "the")

stop_words = set(stopwords.words('english'))
# Loads the list of common English words (like "am", "is", "the") from NLTK
# and stores them inside a set called stop_words.
# A set is used because membership checks are faster.

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Creates a new list: it goes through each word in tokens one by one,
# converts the word to lowercase (word.lower()),
# and keeps it ONLY if it is NOT in stop_words.

print("After Removing Stopwords:")
# Prints a heading so the output looks clear.

print(filtered_tokens)
# Prints the list of words after removing stopwords.
After Removing Stopwords:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
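
A detail worth a quick look: the filter uses word.lower() because NLTK's English stopword list is all lowercase. A tiny sketch (my own illustrative snippet) confirming why 'I' disappeared from the output above:

from nltk.corpus import stopwords   # already imported in Step 1

stop_words = set(stopwords.words('english'))
print('i' in stop_words)   # True  -> "I".lower() matches and is filtered out
print('I' in stop_words)   # False -> without .lower(), "I" would survive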

Stemming (cutting words to their root form)


In simple words: Stemming is like chopping words down to their base. The result may not always be a real word, but it captures the core root.

👉 Example: Input → ['learning', 'NLP', 'classes'] Output → ['learn', 'nlp', 'class']

# Step 5: Stemming (cutting words to their root form)


from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Creates a "stemmer" object using the Porter algorithm.

stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply stemmer.stem(word), which cuts the word down to its root form.
# Example: "learning" → "learn", "classes" → "class"

print("After Stemming:")
print(stemmed_tokens)
# Prints the list of words after stemming.

After Stemming:
['learn', 'natur', 'languag', 'process', 'class', '.']
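
The chopped forms 'natur' and 'languag' above are normal for the Porter algorithm: it strips suffixes mechanically, so the result is often not a dictionary word. A short sketch (example words chosen by me, not from the lab):

from nltk.stem import PorterStemmer   # already imported in Step 1

stemmer = PorterStemmer()
for word in ["studies", "university", "happily", "natural"]:
    print(word, "->", stemmer.stem(word))
# Expected: studies -> studi, university -> univers,
#           happily -> happili, natural -> natur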

# Lemmatization (finding correct base word using grammar rules)

In simple words: Lemmatization is smarter than stemming because it uses grammar + dictionary
to find the correct base word.

👉 Example: Input → ['better', 'running', 'classes'] Output → ['good', 'run', 'class'] (note: 'better' → 'good' and 'running' → 'run' only happen when the word's part of speech is passed in; see the sketch after the output below).

# Step 6: Lemmatization (finding correct base word using grammar rules)


from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Creates a "lemmatizer" object that uses the WordNet dictionary.

lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply lemmatizer.lemmatize(word), which returns the meaningful base form.
# By default every word is treated as a noun; pass pos='v' or pos='a'
# to get results like "running" → "run" or "better" → "good".

print("After Lemmatization:")
print(lemmatized_tokens)
# Prints the list of words after lemmatization.

After Lemmatization:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
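
Notice that 'learning' came through unchanged here, unlike in the stemming step. That is because lemmatize() treats every word as a noun unless you pass a part of speech. A hedged sketch (my own examples) of the pos parameter:

from nltk.stem import WordNetLemmatizer   # already imported in Step 1

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("learning"))            # 'learning' (treated as a noun)
print(lemmatizer.lemmatize("learning", pos='v'))   # 'learn' (treated as a verb)
print(lemmatizer.lemmatize("better", pos='a'))     # 'good' (treated as an adjective)
print(lemmatizer.lemmatize("classes"))             # 'class' (plural noun -> singular)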

