Lab 2 NLP

School of Computer Science Engineering and Technology

Assignment-2
Course- B. Tech.          Type- Specialization Elective
Course Code- CSET246      Course Name- Natural Language Processing
Year- 2025                Semester- Even
Date-                     Batch- All

Objective: The objective of this assignment is to familiarize students with the
essential data cleaning steps in Natural Language Processing (NLP). Students will
work with a challenging uncleaned paragraph, applying various data cleaning
techniques to preprocess the text and make it suitable for further NLP tasks.

Task: You are provided with an uncleaned paragraph. Your task is to perform a
series of data cleaning steps to preprocess the text and make it ready for NLP tasks.

Data Cleaning Steps (a Python sketch implementing these steps follows the list):

1. Lowercasing: Convert the entire paragraph to lowercase to ensure consistent
capitalization.
2. Removing Extra Whitespace: Remove extra spaces and ensure that only a single
space separates words.
3. Handling Contractions: Expand contractions and shorthand to their full forms
(e.g., "can't" to "cannot", "I <3 nlp" to "I love NLP").
4. Removing Special Characters: Remove special characters such as @, #, $, %, &,
*, etc.
5. Reducing Duplicate Letters: Normalize runs of repeated letters (e.g., "soooo" to
"soo"); a dictionary check can further recover words such as "long" from "loooong".
6. Fixing Punctuation: Correct the excessive use of punctuation marks and
normalize them.
7. Removing URL Artifacts: Clean up any remaining artifacts from URLs (e.g.,
"www.example.com////").

Paragraph:

1. OMG!! I can't believe I found this aWesoMe article about AI &amp;
machine leanring!! It was soooo gooood lol. I <3 nlp butttt i hate spelng
errors in textttt. This is gonna be a looong paragraph with looooots of
spacesss and weirddddd symbols @#$%. The website's link is
www.example.com//// Check it out ASAP!!! #excited
2. Oh my gosh!!! Like, I can't even believe what I just stumbled upon on the
world wide web. This article, "The Marvels of Artificial General Intelligence
&amp; the Future of Humanity," totally blew my mindddd!! It was, like,
sooo mind-blowingly awesome, lolz. I mean, I <3 NLP butttt those annoying
typos in texts drive me nuts. Brace yourselves, this is gonna be one seriously
long paragraph with tons and tons of extraterrestrial spaces and some
seriously weird symbols like @#$%. And guess what? The link to the
website is www.incrediblenews.com//// So, um, you better check it out like
ASAP!!! #excitedmuch

Q2. Visualize Word Frequency After Tokenization


Task: Read a paragraph from a file, perform tokenization using NLTK, and then
visualize the frequency of words using a bar graph.
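
A sketch for Q2, assuming the paragraph has been saved to a file named
paragraph.txt (the filename is an assumption). It uses NLTK for tokenization and
matplotlib for the bar graph; newer NLTK versions may also need the "punkt_tab"
resource.

from collections import Counter
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)                   # tokenizer models (first run only)

with open("paragraph.txt", encoding="utf-8") as f:   # assumed filename
    text = f.read()

tokens = word_tokenize(text.lower())
words = [t for t in tokens if t.isalpha()]           # keep alphabetic tokens only

top = Counter(words).most_common(15)                 # 15 most frequent words
labels, counts = zip(*top)

plt.bar(labels, counts)
plt.xticks(rotation=45, ha="right")
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.title("Word Frequency After Tokenization")
plt.tight_layout()
plt.show()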

Q3. Compare Root Words from Stemming vs Lemmatization Visually

Task: Take a sentence, apply both Porter Stemming and WordNet Lemmatization,
then plot a comparison showing how the two methods reduce words differently.
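
A sketch for Q3. The sample sentence is an assumption; the plot compares the
length of each word after Porter stemming and after WordNet lemmatization, and
the raw outputs are printed for inspection.

import matplotlib.pyplot as plt
import nltk
import numpy as np
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)                 # required by WordNetLemmatizer

sentence = "The striped bats were hanging on their feet and eating flies"  # assumed sample
tokens = word_tokenize(sentence.lower())

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stems = [stemmer.stem(t) for t in tokens]
lemmas = [lemmatizer.lemmatize(t) for t in tokens]   # default POS is noun

print(list(zip(tokens, stems, lemmas)))              # inspect how each method reduces a word

x = np.arange(len(tokens))
width = 0.35
plt.bar(x - width / 2, [len(s) for s in stems], width, label="Porter stem")
plt.bar(x + width / 2, [len(l) for l in lemmas], width, label="WordNet lemma")
plt.xticks(x, tokens, rotation=45, ha="right")
plt.ylabel("Characters remaining after reduction")
plt.title("Porter Stemming vs WordNet Lemmatization")
plt.legend()
plt.tight_layout()
plt.show()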
