Natural Language Processing (NLP)
NLTK installation
pip3 install nltk
#import nltk
#nltk.download()
Tokenization of words:
We use the method word_tokenize() to split a sentence into words.
Word tokenization is a crucial first step in converting text (strings) into numeric data.
In [1]: import nltk
#nltk.download()
In [3]: import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Out[3]: True
Word Tokenizer
In [5]: import nltk
from nltk.tokenize import word_tokenize
text = "Welcome to the Python Programming at Indeed Insprining Infotech"
print(word_tokenize(text))
['Welcome', 'to', 'the', 'Python', 'Programming', 'at', 'Indeed', 'Insprinin
g', 'Infotech']
Sentence Tokenizer
In [6]: from nltk.tokenize import sent_tokenize
text = "Hello Everyone. Welcome to the Python Programming"
print(sent_tokenize(text))
['Hello Everyone.', 'Welcome to the Python Programming']
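Once text is tokenized, each token can be mapped to an integer ID, which is the text-to-numeric conversion mentioned above. A minimal sketch (the toy vocabulary built here is only for illustration, not part of NLTK):
In [ ]: # Toy example: map tokens to integer IDs
from nltk.tokenize import word_tokenize
text = "Hello Everyone. Welcome to the Python Programming"
tokens = word_tokenize(text)
# hypothetical toy vocabulary: each distinct token gets an integer ID
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
print(vocab)
print([vocab[tok] for tok in tokens])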
Stemming
Often we have many variations of the same word: for example, the word "dance" has
the variations "dancing", "dances", and "danced".
A stemming algorithm works by cutting the suffix from the word.
In [7]: from nltk.stem import PorterStemmer
# words = ['Wait','Waiting','Waited','Waits']
words = ['clean','cleaning','cleans','cleaned']
ps = PorterStemmer()
for w in words:
    # print the stem of each word
    print(ps.stem(w))
clean
clean
clean
clean
Lemmatization
Why is Lemmatization better than Stemming?
A stemming algorithm works by simply cutting the suffix from the word, whereas
lemmatization is a more powerful operation because it performs a morphological
analysis of the word.
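A quick side-by-side sketch of the difference (assuming the 'wordnet' corpus is downloaded, as done a few cells below); for instance, "feet" is left untouched by the stemmer but lemmatized to "foot":
In [ ]: # Sketch: suffix stripping vs. morphological analysis
from nltk.stem import PorterStemmer, WordNetLemmatizer
ps = PorterStemmer()
wnl = WordNetLemmatizer()
for w in ['studies', 'feet', 'dancing']:
    print(w, '-> stem:', ps.stem(w), '| lemma:', wnl.lemmatize(w))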
Stemming Code:
In [8]: import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
text = "studies studying floors cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print('Stemming for ', w, 'is', porter_stemmer.stem(w))
Stemming for studies is studi
Stemming for studying is studi
Stemming for floors is floor
Stemming for cry is cri
Lemmatization Code:
In [9]: import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
Out[9]: True
In [10]: import nltk
from nltk.stem import WordNetLemmatizer
Wordnet_lemmatizer = WordNetLemmatizer()
text = "studies study floors cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print('Lemma for ', w, 'is', Wordnet_lemmatizer.lemmatize(w))
Lemma for studies is study
Lemma for study is study
Lemma for floors is floor
Lemma for cry is cry
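By default WordNetLemmatizer treats every word as a noun; passing a part-of-speech hint often gives a better lemma. A small sketch:
In [ ]: # The lemmatizer assumes nouns by default; a POS hint changes the result
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(wnl.lemmatize('studying'))           # treated as a noun -> 'studying'
print(wnl.lemmatize('studying', pos='v'))  # treated as a verb -> 'study'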
NLTK stop words
Text may contain stop words like 'the', 'is', 'are', 'a'. Stop words can be
filtered out of the text before it is processed.
In [11]: nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
Out[11]: True
In [12]: from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
data = 'AI was introduced in the year 1956 but it gained popularity recently.'
stopwords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
for w in words:
    if w not in stopwords:
        wordsFiltered.append(w)
print(wordsFiltered)
['AI', 'introduced', 'year', '1956', 'gained', 'popularity', 'recently',
'.']
In [13]: print(len(stopwords))
print(stopwords)
179
{'were', 'are', "isn't", 'this', 'herself', 'until', 'under', 'that', 'here', 'over',
'by', 'while', 'was', 'not', 'most', 'had', 'be', 'can', 'because', "mustn't", 'just',
'hadn', 'yours', 'haven', 'is', 'do', 'a', 'about', 'should', 'above', 'm', 'if', 'how',
'hers', 'yourselves', 'her', 'themselves', 'as', 'of', 'at', 'ourselves', 'has', 'won',
've', "wasn't", 'all', 'it', 'itself', 't', "you'd", 'which', 'what', "doesn't", 'there',
'y', "you've", "needn't", 'their', 'been', 'does', 'myself', 'out', 'when', "hasn't",
"wouldn't", 'ain', 'each', 'then', 'ours', 'we', 'its', 'up', 'such', 'ma', "aren't",
'his', "she's", "you'll", "shouldn't", 'whom', 'on', 'before', 'some', 'they', 'down',
'an', 'again', 'him', 'he', 'am', 'wasn', 'into', 'nor', 'you', 'after', 'our', 'other',
'them', 'no', 'so', 'don', "that'll", 'from', 'between', 'in', 'during', 'have', 'mustn',
'both', 'to', 'isn', 'yourself', 'mightn', 'own', 'further', 'through', 'didn', 'but',
"weren't", 'd', 'will', "mightn't", 'or', 'shouldn', 'your', 'did', 'me', "you're", 'the',
'aren', 'these', "it's", "couldn't", 'hasn', "didn't", 'my', 'few', 'very', 'why', 'below',
'than', 'doesn', 'she', 'doing', "should've", 'same', 'more', 'i', 'couldn', 'and', 'those',
'being', 're', "haven't", "don't", 'shan', 'only', 'for', 'once', "shan't", 'any', 'weren',
'theirs', 'now', "won't", 'who', 'with', 'needn', 'wouldn', 'o', 'against', 'himself', 's',
'off', 'too', 'll', 'having', "hadn't", 'where'}
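The stopwords corpus also ships lists for several other languages; a brief sketch (assuming the corpus has been downloaded as above):
In [ ]: # Stop word lists exist for languages other than English
from nltk.corpus import stopwords
print(stopwords.fileids()[:5])           # a few of the available languages
print(stopwords.words('german')[:10])    # sample German stop words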
Text Analytics
1. Extract a sample document and apply the following document preprocessing
methods: Tokenization, POS Tagging, stop word removal, Stemming, and
Lemmatization.
2. Create a representation of the document by calculating Term Frequency and
Inverse Document Frequency.
In [14]: import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
In [15]: # Example document
document = "This is an example document that we will use to demonstrate docu
In [16]: # Tokenization
tokens = word_tokenize(document)
In [17]: tokens
Out[17]: ['This',
'is',
'an',
'example',
'document',
'that',
'we',
'will',
'use',
'to',
'demonstrate',
'document',
'preprocessing',
'.']
In [18]: import nltk
nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
Out[18]: True
In [19]: # POS tagging
# These tags indicate whether a word is a noun, verb, adjective, adverb, etc.
pos_tags = pos_tag(tokens)
In [20]: pos_tags
# Common Penn Treebank tags:
# DT: determiner (e.g. "the", "an")
# NN: noun, singular or mass (e.g. "cat", "dog", "water")
# NNS: noun, plural (e.g. "cats", "dogs")
# VB: verb, base form (e.g. "run")
# VBZ: verb, 3rd person singular present tense (e.g. "runs")
# VBD: verb, past tense (e.g. "ran")
# PRP: personal pronoun (e.g. "he", "she", "it", "they")
# MD: modal verb (e.g. "can", "should", "will")
# TO: "to" (e.g. "to run", "to go")
# JJ: adjective (e.g. "happy", "blue")
# RB: adverb (e.g. "quickly", "very")
# IN: preposition or subordinating conjunction (e.g. "in", "on", "because")
Out[20]: [('This', 'DT'),
('is', 'VBZ'),
('an', 'DT'),
('example', 'NN'),
('document', 'NN'),
('that', 'IN'),
('we', 'PRP'),
('will', 'MD'),
('use', 'VB'),
('to', 'TO'),
('demonstrate', 'VB'),
('document', 'NN'),
('preprocessing', 'NN'),
('.', '.')]
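These POS tags can also feed the lemmatizer: mapping Penn Treebank tags to WordNet categories usually improves the lemmas. A minimal sketch (the penn_to_wordnet helper below is a hypothetical convenience, not part of NLTK):
In [ ]: # Feed the POS tags to the lemmatizer via a Penn-to-WordNet mapping
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # hypothetical helper: fall back to noun for unfamiliar tags
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

wnl = WordNetLemmatizer()
print([wnl.lemmatize(w, penn_to_wordnet(t)) for w, t in pos_tags])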
In [21]: # Stopwords removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
In [22]: # Stemming
ps = PorterStemmer()
stemmed_tokens = [ps.stem(word) for word in filtered_tokens]
In [23]: # Lemmatization
wnl = WordNetLemmatizer()
lemmatized_tokens = [wnl.lemmatize(word) for word in filtered_tokens]
In [24]: # Print the results
print("Tokens: ", tokens)
# print("POS tags: ", pos_tags)
print("Filtered tokens: ", filtered_tokens)
print("Stemmed tokens: ", stemmed_tokens)
print("Lemmatized tokens: ", lemmatized_tokens)
# NLTK is capable of performing all the document preprocessing methods that you need.
Tokens: ['This', 'is', 'an', 'example', 'document', 'that', 'we', 'will',
'use', 'to', 'demonstrate', 'document', 'preprocessing', '.']
Filtered tokens: ['example', 'document', 'use', 'demonstrate', 'document',
'preprocessing', '.']
Stemmed tokens: ['exampl', 'document', 'use', 'demonstr', 'document',
'preprocess', '.']
Lemmatized tokens: ['example', 'document', 'use', 'demonstrate', 'document',
'preprocessing', '.']
Text Analytics
Create a representation of the document by calculating Term Frequency and Inverse
Document Frequency (TF-IDF).
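The representation built below follows the usual definitions: tf(t, d) is the raw count of term t in document d, and idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The weight of each term is then tfidf(t, d) = tf(t, d) * idf(t), which is exactly what the code computes.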
In [ ]: import math
from collections import Counter
In [ ]: # Sample corpus of documents
corpus = [
'The quick brown fox jumps over the lazy dog',
'The brown fox is quick',
'The lazy dog is sleeping'
]
In [ ]: # Tokenize the documents
tokenized_docs = [doc.lower().split() for doc in corpus]
In [ ]: # Count the term frequency for each document
tf_docs = [Counter(tokens) for tokens in tokenized_docs]
In [ ]: # Calculate the inverse document frequency for each term
n_docs = len(corpus)
idf = {}
for tokens in tokenized_docs:
    for token in set(tokens):
        idf[token] = idf.get(token, 0) + 1
for token in idf:
    idf[token] = math.log(n_docs / idf[token])
In [ ]: # Calculate the TF-IDF weights for each document
tfidf_docs = []
for tf_doc in tf_docs:
    tfidf_doc = {}
    for token, freq in tf_doc.items():
        tfidf_doc[token] = freq * idf[token]
    tfidf_docs.append(tfidf_doc)
In [ ]: # Print the resulting TF-IDF representation for each document
for i, tfidf_doc in enumerate(tfidf_docs):
    print(f"Document {i+1}: {tfidf_doc}")
Document 1: {'the': 0.0, 'quick': 0.4054651081081644, 'brown': 0.4054651081081644,
'fox': 0.4054651081081644, 'jumps': 1.0986122886681098, 'over': 1.0986122886681098,
'lazy': 0.4054651081081644, 'dog': 0.4054651081081644}
Document 2: {'the': 0.0, 'brown': 0.4054651081081644, 'fox': 0.4054651081081644,
'is': 0.4054651081081644, 'quick': 0.4054651081081644}
Document 3: {'the': 0.0, 'lazy': 0.4054651081081644, 'dog': 0.4054651081081644,
'is': 0.4054651081081644, 'sleeping': 1.0986122886681098}
This code uses the Counter class from the collections module to count the term
frequency for each document. It then calculates the inverse document frequency
by iterating over the tokenized documents and keeping track of the number of
documents that each term appears in. Finally, it multiplies the term frequency of
each term in each document by its corresponding inverse document frequency
to get the TF-IDF weight for each term in each document. The resulting TF-IDF
representation for each document is printed to the console.
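For comparison, scikit-learn's TfidfVectorizer computes a smoothed and L2-normalised variant of the same representation, so its numbers will differ slightly from the hand-rolled version above. A sketch, assuming scikit-learn (1.0 or newer) is installed:
In [ ]: # Same corpus through scikit-learn's TfidfVectorizer (smoothed + L2-normalised)
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'The quick brown fox jumps over the lazy dog',
    'The brown fox is quick',
    'The lazy dog is sleeping'
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # vocabulary (alphabetical order)
print(X.toarray().round(3))                 # one row of TF-IDF weights per document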