School of Computer Science Engineering and Technology
Assignment-2
Course-B. Tech. Type- Specialization Elective
Course Code- CSET246 Course Name-Natural Language Processing
Year- 2025 Semester- Even
Date- Batch-All
Objective: The objective of this assignment is to familiarize students with the
essential data cleaning steps in Natural Language Processing (NLP). Students will work with a
challenging uncleaned paragraph, applying various data cleaning techniques to
preprocess the text and make it suitable for further NLP tasks.
Task: You are provided with an uncleaned paragraph. Your task is to perform a
series of data cleaning steps to preprocess the text and make it ready for NLP tasks.
Data Cleaning Steps:
1. Lowercasing: Convert the entire paragraph to lowercase to ensure consistent
capitalization.
2. Removing Extra Whitespace: Remove extra spaces and ensure that only a single
space separates words.
3. Handling Contractions: Expand contractions and informal shorthand to their full
forms (e.g., "can't" to "cannot", "I <3 nlp" to "I love nlp").
4. Removing Special Characters: Remove special characters such as @, #, $, %, &,
*, etc.
5. Reducing Duplicate Letters: Normalize repeated letters (e.g., "soooo" to "so",
"loooong" to "long").
6. Fixing Punctuation: Correct the excessive use of punctuation marks and
normalize them.
7. Removing URL Artifacts: Clean up any remaining artifacts from URLs (e.g.,
"www.example.com////").
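The steps above can be sketched with Python's standard `re` module. This is a minimal illustration under stated assumptions, not a reference solution: the `CONTRACTIONS` mapping, the function names, and the set of characters treated as "special" are all hypothetical choices. The sketch also applies the steps in a slightly different order than listed, because "<3" and the trailing "////" have to be handled before special characters are stripped.

```python
import re

# Hypothetical, deliberately tiny mapping; a real solution would need a fuller list.
CONTRACTIONS = {"can't": "cannot", "gonna": "going to", "<3": "love"}

URL_PREFIXES = ("www.", "http://", "https://")


def lowercase(text):
    return text.lower()


def normalize_whitespace(text):
    # Collapse any run of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def expand_contractions(text):
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text


def clean_url_artifacts(text):
    # Drop slashes left dangling at the end of a URL-like token.
    return re.sub(r"((?:https?://|www\.)\S*?)/+(?=\s|$)", r"\1", text)


def remove_special_characters(text):
    # Assumption: keep letters, digits, whitespace and basic punctuation only.
    return re.sub(r"""[^A-Za-z0-9\s.,!?'":/-]""", "", text)


def reduce_repeated_letters(text):
    # Collapse three or more identical letters to one, leaving URLs untouched
    # so that "www" is not squeezed.
    def squeeze(token):
        if token.startswith(URL_PREFIXES):
            return token
        return re.sub(r"([a-z])\1{2,}", r"\1", token)

    return " ".join(squeeze(token) for token in text.split(" "))


def fix_punctuation(text):
    text = re.sub(r"([!?.,])\1+", r"\1", text)  # "!!!" -> "!"
    return re.sub(r"\s+([!?.,])", r"\1", text)  # no space before punctuation


def clean_text(text):
    # Order is an assumption of this sketch (see the note above the block).
    text = lowercase(text)
    text = expand_contractions(text)
    text = clean_url_artifacts(text)
    text = remove_special_characters(text)
    text = reduce_repeated_letters(text)
    text = fix_punctuation(text)
    return normalize_whitespace(text)
```

Collapsing repeated letters with a regular expression alone is lossy: this sketch turns "gooood" into "god", so a dictionary lookup or spell-checking pass would be needed to recover words that legitimately contain double letters.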
Paragraphs:
1. OMG!! I can't believe I found this aWesoMe article about AI &
machine leanring!! It was soooo gooood lol. I <3 nlp butttt i hate spelng
errors in textttt. This is gonna be a looong paragraph with looooots of
spacesss and weirddddd symbols @#$%. The website's link is
www.example.com//// Check it out ASAP!!! #excited
2. Oh my gosh!!! Like, I can't even believe what I just stumbled upon on the
world wide web. This article, "The Marvels of Artificial General Intelligence
& the Future of Humanity," totally blew my mindddd!! It was, like,
sooo mind-blowingly awesome, lolz. I mean, I <3 NLP butttt those annoying
typos in texts drive me nuts. Brace yourselves, this is gonna be one seriously
long paragraph with tons and tons of extraterrestrial spaces and some
seriously weird symbols like @#$%. And guess what? The link to the
website is www.incrediblenews.com//// So, um, you better check it out like
ASAP!!! #excitedmuch
Q2. Visualize Word Frequency After Tokenization
Task: Read a paragraph from a file, perform tokenization using NLTK, and then
visualize the frequency of words using a bar graph.
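One possible shape for this task is sketched below. It assumes NLTK and matplotlib are installed, and it uses NLTK's `wordpunct_tokenize` because that tokenizer needs no extra model download; `nltk.word_tokenize` could be swapped in once its tokenizer data has been downloaded. The helper names and the `top_n` parameter are hypothetical.

```python
from collections import Counter

from nltk.tokenize import wordpunct_tokenize


def read_paragraph(path):
    # Read the whole file as one string.
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def word_frequencies(text):
    # Tokenize with NLTK, then keep only alphabetic tokens so that
    # punctuation does not show up in the counts.
    tokens = wordpunct_tokenize(text.lower())
    return Counter(token for token in tokens if token.isalpha())


def plot_top_words(freqs, top_n=10):
    # Imported here so the counting helpers work without matplotlib installed.
    import matplotlib.pyplot as plt

    most_common = freqs.most_common(top_n)
    words = [word for word, _ in most_common]
    counts = [count for _, count in most_common]

    fig, ax = plt.subplots()
    ax.bar(words, counts)
    ax.set_xlabel("Word")
    ax.set_ylabel("Frequency")
    ax.set_title(f"Top {len(words)} words")
    ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    plt.show()


# Example usage (the file name is a placeholder):
# freqs = word_frequencies(read_paragraph("paragraph.txt"))
# plot_top_words(freqs, top_n=10)
```

Filtering out stop words before counting is a common refinement, since words such as "the" and "is" otherwise dominate the graph.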
Q3. Compare Root Words from Stemming vs Lemmatization Visually
Task: Take a sentence, apply both Porter Stemming and WordNet Lemmatization,
then plot a comparison showing how the two methods reduce words differently.
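The comparison could be approached along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the lemmatizer is called with the verb part of speech by default, and the plot encodes each form's length as a bar height with the form itself written above the bar. Other visual encodings would be equally valid. The WordNet lemmatizer needs the WordNet corpus, which is fetched once with `nltk.download("wordnet")`.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer


def stem_words(words):
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in words]


def lemmatize_words(words, pos="v"):
    # Requires the WordNet corpus: run nltk.download("wordnet") once beforehand.
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=pos) for word in words]


def plot_comparison(words, stems, lemmas):
    # Imported here so the text helpers work without matplotlib installed.
    import matplotlib.pyplot as plt

    width = 0.25
    groups = [("Original", words), ("Porter stem", stems), ("WordNet lemma", lemmas)]

    fig, ax = plt.subplots(figsize=(10, 5))
    for offset, (label, forms) in enumerate(groups):
        positions = [index + (offset - 1) * width for index in range(len(words))]
        bars = ax.bar(positions, [len(form) for form in forms], width, label=label)
        # Write each resulting form above its bar so differences are readable.
        for bar, form in zip(bars, forms):
            ax.annotate(
                form,
                (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                ha="center", va="bottom", rotation=90, fontsize=8,
            )

    ax.set_xticks(list(range(len(words))))
    ax.set_xticklabels(words, rotation=45)
    ax.set_ylabel("Length in characters")
    ax.set_title("Porter stemming vs WordNet lemmatization")
    ax.legend()
    fig.tight_layout()
    plt.show()


# Example usage (the sentence is a placeholder):
# words = "the children were running and studying".split()
# plot_comparison(words, stem_words(words), lemmatize_words(words))
```

Stemming strips suffixes by rule and can produce non-words such as "studi", while lemmatization looks words up in WordNet and returns dictionary forms, so the two sets of bars and labels tend to diverge most on inflected words.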