Natural Language Processing (NLP)
NLTK installation
pip3 install nltk
#import nltk
#nltk.download()
Tokenization of words:
We use the method word_tokenize() to split a sentence into words.
Word tokenization is a crucial first step in converting text (strings) into numeric data.
In [1]: import nltk
#nltk.download()
In [3]: import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Out[3]: True
Word Tokenizer
In [5]: import nltk
from nltk.tokenize import word_tokenize
text = "Welcome to the Python Programming at Indeed Insprining Infotech"
print(word_tokenize(text))
['Welcome', 'to', 'the', 'Python', 'Programming', 'at', 'Indeed', 'Insprinin
g', 'Infotech']
Sentence Tokenizer
In [6]: from nltk.tokenize import sent_tokenize
text = "Hello Everyone. Welcome to the Python Programming"
print(sent_tokenize(text))
['Hello Everyone.', 'Welcome to the Python Programming']
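Once text is tokenized, each token can be mapped to an integer ID, which is the text-to-numeric conversion mentioned above. A minimal sketch (the toy vocabulary built here is only for illustration, not part of NLTK):
In [ ]: # Toy example: map tokens to integer IDs
from nltk.tokenize import word_tokenize
text = "Hello Everyone. Welcome to the Python Programming"
tokens = word_tokenize(text)
# hypothetical toy vocabulary: each distinct token gets an integer ID
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
print(vocab)
print([vocab[tok] for tok in tokens])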
Stemming
Often we have many variations of the same word: for example, the word "dance" has
the variations "dancing", "dances", and "danced".
A stemming algorithm works by cutting the suffix from the word.
In [7]: from nltk.stem import PorterStemmer
# words = ['Wait','Waiting','Waited','Waits']
words = ['clean','cleaning','cleans','cleaned']
ps = PorterStemmer()
for w in words:
    # print the stem of each word
    print(ps.stem(w))
clean
clean
clean
clean
Lemmatization
Why is Lemmatization better than Stemming?
A stemming algorithm works by simply cutting the suffix from the word, whereas
lemmatization is a more powerful operation because it performs a morphological
analysis of the word.
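A quick side-by-side sketch of the difference (assuming the 'wordnet' corpus is downloaded, as done a few cells below); for instance, "feet" is left untouched by the stemmer but lemmatized to "foot":
In [ ]: # Sketch: suffix stripping vs. morphological analysis
from nltk.stem import PorterStemmer, WordNetLemmatizer
ps = PorterStemmer()
wnl = WordNetLemmatizer()
for w in ['studies', 'feet', 'dancing']:
    print(w, '-> stem:', ps.stem(w), '| lemma:', wnl.lemmatize(w))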
Stemming Code:
In [8]: import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
text = "studies studying floors cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print('Stemming for ', w, 'is', porter_stemmer.stem(w))
Stemming for studies is studi
Stemming for studying is studi
Stemming for floors is floor
Stemming for cry is cri
Lemmatization Code:
In [9]: import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
Out[9]: True
In [10]: import nltk
from nltk.stem import WordNetLemmatizer
Wordnet_lemmatizer = WordNetLemmatizer()
text = "studies study floors cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print('Lemma for ', w, 'is', Wordnet_lemmatizer.lemmatize(w))
Lemma for studies is study
Lemma for study is study
Lemma for floors is floor
Lemma for cry is cry
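By default WordNetLemmatizer treats every word as a noun; passing a part-of-speech hint often gives a better lemma. A small sketch:
In [ ]: # The lemmatizer assumes nouns by default; a POS hint changes the result
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(wnl.lemmatize('studying'))           # treated as a noun -> 'studying'
print(wnl.lemmatize('studying', pos='v'))  # treated as a verb -> 'study'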
NLTK stop words
Text may contain stop words like 'the', 'is', 'are', 'a'. Stop words can be
filtered out of the text before it is processed.
In [11]: nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
Out[11]: True
In [12]: from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
data = 'AI was introduced in the year 1956 but it gained popularity recently.'
stopwords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
for w in words:
    if w not in stopwords:
        wordsFiltered.append(w)
print(wordsFiltered)
['AI', 'introduced', 'year', '1956', 'gained', 'popularity', 'recently',
'.']
In [13]: print(len(stopwords))
print(stopwords)
179
{'were', 'are', "isn't", 'this', 'herself', 'until', 'under', 'that', 'here', 'over',
'by', 'while', 'was', 'not', 'most', 'had', 'be', 'can', 'because', "mustn't", 'just',
'hadn', 'yours', 'haven', 'is', 'do', 'a', 'about', 'should', 'above', 'm', 'if', 'how',
'hers', 'yourselves', 'her', 'themselves', 'as', 'of', 'at', 'ourselves', 'has', 'won',
've', "wasn't", 'all', 'it', 'itself', 't', "you'd", 'which', 'what', "doesn't", 'there',
'y', "you've", "needn't", 'their', 'been', 'does', 'myself', 'out', 'when', "hasn't",
"wouldn't", 'ain', 'each', 'then', 'ours', 'we', 'its', 'up', 'such', 'ma', "aren't",
'his', "she's", "you'll", "shouldn't", 'whom', 'on', 'before', 'some', 'they', 'down',
'an', 'again', 'him', 'he', 'am', 'wasn', 'into', 'nor', 'you', 'after', 'our', 'other',
'them', 'no', 'so', 'don', "that'll", 'from', 'between', 'in', 'during', 'have', 'mustn',
'both', 'to', 'isn', 'yourself', 'mightn', 'own', 'further', 'through', 'didn', 'but',
"weren't", 'd', 'will', "mightn't", 'or', 'shouldn', 'your', 'did', 'me', "you're", 'the',
'aren', 'these', "it's", "couldn't", 'hasn', "didn't", 'my', 'few', 'very', 'why', 'below',
'than', 'doesn', 'she', 'doing', "should've", 'same', 'more', 'i', 'couldn', 'and', 'those',
'being', 're', "haven't", "don't", 'shan', 'only', 'for', 'once', "shan't", 'any', 'weren',
'theirs', 'now', "won't", 'who', 'with', 'needn', 'wouldn', 'o', 'against', 'himself', 's',
'off', 'too', 'll', 'having', "hadn't", 'where'}
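The stopwords corpus also ships lists for several other languages; a brief sketch (assuming the corpus has been downloaded as above):
In [ ]: # Stop word lists exist for languages other than English
from nltk.corpus import stopwords
print(stopwords.fileids()[:5])           # a few of the available languages
print(stopwords.words('german')[:10])    # sample German stop words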
Text Analytics
1. Extract a sample document and apply the following document preprocessing
methods: Tokenization, POS Tagging, stop word removal, Stemming, and
Lemmatization.
2. Create a representation of the document by calculating Term Frequency and
Inverse Document Frequency.
In [14]: import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
In [15]: # Example document
document = "This is an example document that we will use to demonstrate docu
In [16]: # Tokenization
tokens = word_tokenize(document)
In [17]: tokens
Out[17]: ['This',
'is',
'an',
'example',
'document',
'that',
'we',
'will',
'use',
'to',
'demonstrate',
'document',
'preprocessing',
'.']
In [18]: import nltk
nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
Out[18]: True
In [19]: # POS tagging
# These tags indicate whether a word is a noun, verb, adjective, adverb, etc.
pos_tags = pos_tag(tokens)
In [20]: pos_tags
# Common Penn Treebank tags:
# DT: determiner (e.g. "the", "an")
# NN: noun, singular or mass (e.g. "cat", "dog", "water")
# NNS: noun, plural (e.g. "cats", "dogs")
# VB: verb, base form (e.g. "run")
# VBZ: verb, 3rd person singular present tense (e.g. "runs")
# VBD: verb, past tense (e.g. "ran")
# PRP: personal pronoun (e.g. "he", "she", "it", "they")
# MD: modal verb (e.g. "can", "should", "will")
# TO: "to" (e.g. "to run", "to go")
# JJ: adjective (e.g. "happy", "blue")
# RB: adverb (e.g. "quickly", "very")
# IN: preposition or subordinating conjunction (e.g. "in", "on", "because")
Out[20]: [('This', 'DT'),
('is', 'VBZ'),
('an', 'DT'),
('example', 'NN'),
('document', 'NN'),
('that', 'IN'),
('we', 'PRP'),
('will', 'MD'),
('use', 'VB'),
('to', 'TO'),
('demonstrate', 'VB'),
('document', 'NN'),
('preprocessing', 'NN'),
('.', '.')]
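These POS tags can also feed the lemmatizer: mapping Penn Treebank tags to WordNet categories usually improves the lemmas. A minimal sketch (the penn_to_wordnet helper below is a hypothetical convenience, not part of NLTK):
In [ ]: # Feed the POS tags to the lemmatizer via a Penn-to-WordNet mapping
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # hypothetical helper: fall back to noun for unfamiliar tags
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

wnl = WordNetLemmatizer()
print([wnl.lemmatize(w, penn_to_wordnet(t)) for w, t in pos_tags])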
In [21]: # Stopwords removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
In [22]: # Stemming
ps = PorterStemmer()
stemmed_tokens = [ps.stem(word) for word in filtered_tokens]
In [23]: # Lemmatization
wnl = WordNetLemmatizer()
lemmatized_tokens = [wnl.lemmatize(word) for word in filtered_tokens]
In [24]: # Print the results
print("Tokens: ", tokens)
# print("POS tags: ", pos_tags)
print("Filtered tokens: ", filtered_tokens)
print("Stemmed tokens: ", stemmed_tokens)
print("Lemmatized tokens: ", lemmatized_tokens)
# NLTK is capable of performing all the document preprocessing methods that you need.
Tokens: ['This', 'is', 'an', 'example', 'document', 'that', 'we', 'will',
'use', 'to', 'demonstrate', 'document', 'preprocessing', '.']
Filtered tokens: ['example', 'document', 'use', 'demonstrate', 'document',
'preprocessing', '.']
Stemmed tokens: ['exampl', 'document', 'use', 'demonstr', 'document',
'preprocess', '.']
Lemmatized tokens: ['example', 'document', 'use', 'demonstrate', 'document',
'preprocessing', '.']
Text Analytics
Create a representation of the document by calculating Term Frequency and Inverse
Document Frequency (TF-IDF).
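The representation built below follows the usual definitions: tf(t, d) is the raw count of term t in document d, and idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The weight of each term is then tfidf(t, d) = tf(t, d) * idf(t), which is exactly what the code computes.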
In [ ]: import math
from collections import Counter
In [ ]: # Sample corpus of documents
corpus = [
'The quick brown fox jumps over the lazy dog',
'The brown fox is quick',
'The lazy dog is sleeping'
]
In [ ]: # Tokenize the documents
tokenized_docs = [doc.lower().split() for doc in corpus]
In [ ]: # Count the term frequency for each document
tf_docs = [Counter(tokens) for tokens in tokenized_docs]
In [ ]: # Calculate the inverse document frequency for each term
n_docs = len(corpus)
idf = {}
for tokens in tokenized_docs:
    for token in set(tokens):
        idf[token] = idf.get(token, 0) + 1
for token in idf:
    idf[token] = math.log(n_docs / idf[token])
In [ ]: # Calculate the TF-IDF weights for each document
tfidf_docs = []
for tf_doc in tf_docs:
    tfidf_doc = {}
    for token, freq in tf_doc.items():
        tfidf_doc[token] = freq * idf[token]
    tfidf_docs.append(tfidf_doc)
In [ ]: # Print the resulting TF-IDF representation for each document
for i, tfidf_doc in enumerate(tfidf_docs):
    print(f"Document {i+1}: {tfidf_doc}")
Document 1: {'the': 0.0, 'quick': 0.4054651081081644, 'brown': 0.4054651081081644,
'fox': 0.4054651081081644, 'jumps': 1.0986122886681098, 'over': 1.0986122886681098,
'lazy': 0.4054651081081644, 'dog': 0.4054651081081644}
Document 2: {'the': 0.0, 'brown': 0.4054651081081644, 'fox': 0.4054651081081644,
'is': 0.4054651081081644, 'quick': 0.4054651081081644}
Document 3: {'the': 0.0, 'lazy': 0.4054651081081644, 'dog': 0.4054651081081644,
'is': 0.4054651081081644, 'sleeping': 1.0986122886681098}
This code uses the Counter class from the collections module to count the term
frequency for each document. It then calculates the inverse document frequency
by iterating over the tokenized documents and keeping track of the number of
documents that each term appears in. Finally, it multiplies the term frequency of
each term in each document by its corresponding inverse document frequency
to get the TF-IDF weight for each term in each document. The resulting TF-IDF
representation for each document is printed to the console.
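For comparison, scikit-learn's TfidfVectorizer computes a smoothed and L2-normalised variant of the same representation, so its numbers will differ slightly from the hand-rolled version above. A sketch, assuming scikit-learn (1.0 or newer) is installed:
In [ ]: # Same corpus through scikit-learn's TfidfVectorizer (smoothed + L2-normalised)
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'The quick brown fox jumps over the lazy dog',
    'The brown fox is quick',
    'The lazy dog is sleeping'
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # vocabulary (alphabetical order)
print(X.toarray().round(3))                 # one row of TF-IDF weights per document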