
Experiment No. 2
Aim: To study and implement preprocessing using tokenization, stop word removal,
stemming, and lemmatization.
Theory:
1. Token:
A token is a single, meaningful unit derived from text. It could be a word, a punctuation mark,
or even a number. Tokenization is the process of breaking down a text into these tokens. For
example, the sentence "I love natural language processing!" would be tokenized into: ["I",
"love", "natural", "language", "processing", "!"].
2. Tokenization:
Tokenization is the task of splitting a text into tokens. This is a crucial initial step in most NLP
tasks because it transforms raw text into manageable units for further analysis. Tokenization
can be done at different levels: word level, subword level (useful for languages with complex
word formation), or character level.
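As an illustration, a minimal word-level tokenization sketch using Python's NLTK library (assuming NLTK is installed and its tokenizer models, the 'punkt' resource in most versions, have been downloaded; the sample sentence is just an example):

import nltk
nltk.download('punkt')                # one-time download of the tokenizer models

from nltk.tokenize import word_tokenize

text = "I love natural language processing!"
tokens = word_tokenize(text)          # split the sentence into word and punctuation tokens
print(tokens)                         # ['I', 'love', 'natural', 'language', 'processing', '!']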
3. Stopwords:
Stopwords are common words that are often filtered out during text preprocessing because
they occur frequently in a language and do not carry much semantic meaning on their own.
Examples in English include articles ("the", "a"), prepositions ("at", "on"), conjunctions
("and", "but"), and some common verbs ("is", "are").
4. List of Stopwords:
A list of stopwords is a predefined set of words recognized as stopwords for a particular
language. These lists are used during text preprocessing to efficiently filter out these common
words from the text data. Different libraries and NLP tools provide their own lists of stopwords
tailored for specific tasks or applications.
A list of common stopwords in the English language includes:
[i, me, my, myself, we, our, ours, ourselves, you, your, yours, yourself, yourselves, he, him,
his, himself, she, her, hers, herself, it, its, itself, they, them, their, theirs, themselves, what,
which, who, whom, this, that, these, those, am, is, are, was, were, be, been, being, have, has,
had, having, do, does, did, doing, a, an, the, etc.]
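NLTK is one library that ships such a predefined list. A small sketch (assuming the 'stopwords' corpus has been downloaded) that loads and inspects it:

import nltk
nltk.download('stopwords')                      # one-time download of the stopword lists

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))    # NLTK's predefined English stopword list
print(len(stop_words))                          # number of stopwords in this particular list
print(sorted(stop_words)[:10])                  # a few sample entries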
5. How to remove stopwords:
To remove stopwords from text data:
o Tokenize the text to break it down into individual words or tokens.
o Compare each token against a predefined list of stopwords.
o Exclude tokens that match stopwords from the tokenized list.
The result is a cleaned text with stopwords removed, containing only meaningful words that
carry more semantic value.
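A minimal sketch of these three steps using NLTK (stopword list and tokenizer as above; the input sentence is only an example):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

text = "This is a simple example showing the removal of stopwords"
tokens = word_tokenize(text)

# keep only tokens that are not in the stopword list (case-insensitive comparison)
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)    # ['simple', 'example', 'showing', 'removal', 'stopwords']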
6. Stemming and Lemmatization:
Stemming: Stemming is the process of reducing words to their root or base form, even if the
result is not a real word. It works by chopping off prefixes or suffixes according to heuristic
rules; the Porter and Snowball algorithms are common examples.
Lemmatization: Lemmatization aims to reduce words to their canonical form (lemma), which
is a valid word present in the language. This involves using vocabulary and morphological
analysis of words, considering their meaning and context. Lemmatization typically uses
dictionaries and word structure rules to achieve accurate transformation.
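A minimal sketch of both techniques with NLTK's PorterStemmer and WordNetLemmatizer (assuming the 'wordnet' corpus has been downloaded; the word list is illustrative):

import nltk
nltk.download('wordnet')                 # one-time download needed by the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies", "running", "better"]

print([stemmer.stem(w) for w in words])                    # ['studi', 'run', 'better']
print([lemmatizer.lemmatize(w, pos='v') for w in words])   # as verbs: ['study', 'run', 'better']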
7. Key differences between Stemming and Lemmatization:
Goal: Stemming aims for word reduction to a base form, while lemmatization aims for
reduction to a valid word form.
Output: Stemming can sometimes generate non-words (for example, the Porter stemmer reduces
"goes" to "goe"), while lemmatization always produces valid words (like "goes" to "go").
Accuracy: Lemmatization is generally more accurate because it considers the context and
meaning of the word. Stemming, being a simpler rule-based process, may not always produce
accurate results but is faster.
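The contrast is easy to see by running both on the same words (a sketch with the same NLTK classes as above; the chosen words are only examples):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# stemming may produce non-words; lemmatization maps to valid dictionary forms
for word in ["goes", "studies", "caring"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos='v'))
# goes -> goe | go
# studies -> studi | study
# caring -> care | care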

Conclusion: Hence we studied and implemented preprocessing using tokenization, stop word
removal, stemming, and lemmatization.
