School of Computer Science Engineering and Technology
Assignment-4
Course: B. Tech.    Type: Specialization Elective
Course Code: CSET246    Course Name: Natural Language Processing
Year: 2025    Semester: Even
Date:    Batch: All
Text Normalization and Preprocessing
Objective:
This assignment focuses on performing essential text preprocessing steps including lowercasing,
punctuation removal, tokenization, stopword elimination, and basic text analysis. Through
hands-on tasks, students will learn how to clean and prepare raw English text for downstream
NLP tasks like classification or sentiment analysis.
Q1. Perform basic text normalization on English text.
Input sentence: "The Examination's RESULT was Declared!!"
Normalize:
o Lowercase
o Remove punctuation
o Remove stopwords
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK data (only needed once)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer from this resource
nltk.download('stopwords')

# Input sentence
sentence = "The Examination's RESULT was Declared!!"

# Step 1: Lowercase
sentence = sentence.lower()

# Step 2: Remove punctuation
# Note: this also strips the apostrophe, so "examination's" becomes "examinations"
sentence = sentence.translate(str.maketrans('', '', string.punctuation))

# Step 3: Tokenize the sentence
tokens = word_tokenize(sentence)

# Step 4: Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]

# Final output
print("Normalized Tokens:", filtered_tokens)
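To see what each normalization step contributes, the intermediate values can be traced without any NLTK downloads. The sketch below is only an illustration: `normalize` is a hypothetical helper, `str.split` stands in for `word_tokenize`, and `STOP_WORDS` is a small hand-written set standing in for NLTK's English stopword list.

```python
import string

# Hypothetical mini stopword set, standing in for stopwords.words('english')
STOP_WORDS = {"the", "was", "is", "a", "an", "of"}


def normalize(sentence):
    """Return the result of every normalization step, in order."""
    lowered = sentence.lower()
    # Delete every character listed in string.punctuation
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Whitespace split, standing in for word_tokenize
    tokens = no_punct.split()
    filtered = [t for t in tokens if t not in STOP_WORDS]
    return lowered, no_punct, tokens, filtered


lowered, no_punct, tokens, filtered = normalize("The Examination's RESULT was Declared!!")
print("Lowercased:        ", lowered)
print("Punctuation removed:", no_punct)
print("Tokens:            ", tokens)
print("Without stopwords: ", filtered)
```

Because the apostrophe is removed before tokenization, the possessive "examination's" is merged into "examinations". If that is not wanted, one option is to tokenize first and discard punctuation tokens afterwards.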
Q2. Tokenize sentences and words using NLTK.
Input: 2–3 sentences
Use nltk.sent_tokenize() and nltk.word_tokenize()
Show sentence and word tokens
Q3. Identify if characters in the text are alphabets, digits, or special characters.
Input: "Student123 scored 95%! Great job!!"
Output:
o Alphabets: Student, scored, Great, job
o Digits: 123, 95
o Special Characters: %, !, !!
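A minimal sketch of one way to produce this output uses regular expressions from the standard library. The function name `classify_characters` is hypothetical. The sketch assumes ASCII letters and digits, and it groups a special character with its immediate repeats so that "!!" is reported as one item while "%!" is reported as two.

```python
import re


def classify_characters(text):
    """Group the text into runs of letters, runs of digits, and special characters."""
    alphabets = re.findall(r"[A-Za-z]+", text)
    digits = re.findall(r"[0-9]+", text)
    # One special character followed by zero or more repeats of the same character
    specials = [m.group(0) for m in re.finditer(r"([^A-Za-z0-9\s])\1*", text)]
    return {"Alphabets": alphabets, "Digits": digits, "Special Characters": specials}


result = classify_characters("Student123 scored 95%! Great job!!")
for category, items in result.items():
    print(f"{category}: {', '.join(items)}")
```

For a character-by-character check instead of grouped runs, `str.isalpha()` and `str.isdigit()` can be applied to each character in a loop.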