0% found this document useful (0 votes)

66 views20 pages

14 NLP

Uploaded by

manasvii2006

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

66 views20 pages

14 NLP

Uploaded by

manasvii2006

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

Natural Language Processing

Introduction
Till now, we have explored two domains of AI: Data Science and Computer Vision. Both these domains
differ from each other in terms of the data on which they work. Data Science works around numbers
and tabular data while Computer Vision is all about visual data like images and videos. The third
domain, Natural Language Processing (commonly called NLP) takes in the data of Natural Languages
which humans use in their daily lives and operates on this.

Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling computers to
understand and process human languages. AI is a subfield of Linguistics, Computer Science,
Information Engineering, and Artificial Intelligence concerned with the interactions between
computers and human (natural) languages, in particular how to program computers to process and
analyse large amounts of natural language data.

But how do computers do that? How do they understand what we say in our language? This chapter
is all about demystifying the Natural Language Processing domain and understanding how it works.

Before we get deeper into NLP, let us experience it with the help of this AI Game:

Identify the mystery animal: http://bit.ly/iai4yma

Go to this link on Google Chrome, launch the experiment and try to identify the Mystery Animal by
asking the machine 20 Yes or No questions.

Were you able to guess the animal?

__________________________________________________________________________________
__________________________________________________________________________________

If yes, in how many questions were you able to guess it?

__________________________________________________________________________________
__________________________________________________________________________________

If no, how many times did you try playing this game?

__________________________________________________________________________________
__________________________________________________________________________________

What according to you was the task of the machine?

* Images shown here are the property of individual organisations and are used here for reference purpose only.
Were there any challenges that you faced while playing this game? If yes, list them down.

What approach must one follow to win this game?

Applications of Natural Language Processing

Since Artificial Intelligence nowadays is becoming an integral part of our lives, its applications are very
commonly used by the majority of people in their daily lives. Here are some of the applications of
Natural Language Processing which are used in the real-life scenario:

Automatic Summarization: Information overload is a real

problem when we need to access a specific, important piece
of information from a huge knowledge base. Automatic
summarization is relevant not only for summarizing the
meaning of documents and information, but also to
understand the emotional meanings within the information,
such as in collecting data from social media. Automatic
summarization is especially relevant when used to provide an
overview of a news item or blog post, while avoiding
redundancy from multiple sources and maximizing the
diversity of content obtained.

Sentiment Analysis: The goal of sentiment

analysis is to identify sentiment among several
posts or even in the same post where emotion is
not always explicitly expressed. Companies use
Natural Language Processing applications, such as
sentiment analysis, to identify opinions and
sentiment online to help them understand what
customers think about their products and services
(i.e., “I love the new iPhone” and, a few lines later
“But sometimes it doesn’t work well” where the
person is still talking about the iPhone) and overall

* Images shown here are the property of individual organisations and are used here for reference purpose only.
indicators of their reputation. Beyond determining simple polarity, sentiment analysis understands
sentiment in context to help better understand what’s behind an expressed opinion, which can be
extremely relevant in understanding and driving purchasing decisions.

Text classification: Text classification makes it possible to assign

predefined categories to a document and organize it to help you
find the information you need or simplify some activities. For
example, an application of text categorization is spam filtering in
email.

Virtual Assistants: Nowadays Google Assistant, Cortana,

Siri, Alexa, etc have become an integral part of our lives. Not
only can we talk to them but they also have the abilities to
make our lives easier. By accessing our data, they can help
us in keeping notes of our tasks, make calls for us, send
messages and a lot more. With the help of speech
recognition, these assistants can not only detect our speech
but can also make sense out of it. According to recent
researches, a lot more advancements are expected in this
field in the near future.

Natural Language Processing: Getting Started

Natural Language Processing is all about how machines try to understand and interpret human
language and operate accordingly. But how can Natural Language Processing be used to solve the
problems around us? Let us take a look.

Revisiting the AI Project Cycle

Let us try to understand how we can develop a project in Natural Language processing with the help
of an example.

The Scenario
The world is competitive nowadays. People face
competition in even the tiniest tasks and are expected to
give their best at every point in time. When people are
unable to meet these expectations, they get stressed and
could even go into depression. We get to hear a lot of cases
where people are depressed due to reasons like peer
pressure, studies, family issues, relationships, etc. and they
eventually get into something that is bad for them as well
as for others. So, to overcome this, cognitive behavioural
therapy (CBT) is considered to be one of the best methods
to address stress as it is easy to implement on people and
also gives good results. This therapy includes

* Images shown here are the property of individual organisations and are used here for reference purpose only.
understanding the behaviour and mindset of a person in their normal life. With the help of CBT,
therapists help people overcome their stress and live a happy life.
To understand more about the concept of this therapy, visit this link:
https://en.wikipedia.org/wiki/Cognitive_behavioral_therapy

Problem Scoping
CBT is a technique used by most therapists to cure patients out of stress and depression. But it has
been observed that people do not wish to seek the help of a psychiatrist willingly. They try to avoid
such interactions as much as possible. Thus, there is a need to bridge the gap between a person who
needs help and the psychiatrist. Let us look at various factors around this problem through the 4Ws
problem canvas.

Who Canvas – Who has the problem?

Who are the

o People who suffer from stress and are at the onset of depression.
stakeholders?

What do we know
o People who are going through stress are reluctant to consult a psychiatrist.
about them?

What Canvas – What is the nature of the problem?

What is the o People who need help are reluctant to consult a psychiatrist and hence live
problem? miserably.

How do you know o Studies around mental stress and depression available on various authentic
it is a problem? sources.

Where Canvas – Where does the problem arise?

What is the context/situation

o When they are going through a stressful period of time
in which the stakeholders
o Due to some unpleasant experiences
experience this problem?

Why Canvas – Why do you think it is a problem worth solving?

o People get a platform where they can talk and vent out their
What would be of key feelings anonymously
value to the stakeholders? o People get a medium that can interact with them and applies
primitive CBT on them and can suggest help whenever needed

How would it improve their o People would be able to vent out their stress
situation? o They would consider going to a psychiatrist whenever required
Now that we have gone through all the factors around the problem, the problem statement templates
go as follows:

Our People undergoing stress Who?

Have a problem of Not being able to share their feelings What?
While They need help in venting out their emotions Where?
Provide them a platform to share their thoughts
An ideal solution would Why
anonymously and suggest help whenever required

This leads us to the goal of our project which is:

“To create a chatbot which can interact with people, help them
to vent out their feelings and take them through primitive CBT.”

Data Acquisition
To understand the sentiments of people, we need to collect their conversational data so the machine
can interpret the words that they use and understand their meaning. Such data can be collected from
various means:

1. Surveys 2. Observing the therapist’s sessions

3. Databases available on the internet 4. Interviews, etc.
Data Exploration
Once the textual data has been collected, it needs to be processed and cleaned so that an easier
version can be sent to the machine. Thus, the text is normalised through various steps and is lowered
to minimum vocabulary since the machine does not require grammatically correct statements but the
essence of it.

Modelling
Once the text has been normalised, it is then fed to an NLP based AI model. Note that in NLP, modelling
requires data pre-processing only after which the data is fed to the machine. Depending upon the type
of chatbot we try to make, there are a lot of AI models available which help us build the foundation of
our project.

Evaluation
The model trained is then evaluated and the accuracy for the same is generated on the basis of the
relevance of the answers which the machine gives to the user’s responses. To understand the
efficiency of the model, the suggested answers by the chatbot are compared to the actual answers.
As you can see in the above diagram, the blue line talks about the model’s output while the green one
is the actual output along with the data samples.

The model’s output does not match the true function at all. Hence the model is said
Figure 1 to be underfitting and its accuracy is lower.

In the second one, the model’s performance matches well with the true function
Figure 2 which states that the model has optimum accuracy and the model is called a
perfect fit.

In the third case, model performance is trying to cover all the data samples even if
Figure 3 they are out of alignment to the true function. This model is said to be overfitting
and this too has a lower accuracy.

Once the model is evaluated thoroughly, it is then deployed in the form of an app which people can
use easily.

Chatbots
As we have seen earlier, one of the most common applications of Natural Language Processing is a
chatbot. There are a lot of chatbots available and many of them use the same approach as we used in
the scenario above.. Let us try some of the chatbots and see how they work.

• Mitsuku Bot*
https://www.pandorabots.com/mitsuku/

• CleverBot*
https://www.cleverbot.com/

• Jabberwacky*
http://www.jabberwacky.com/

• Haptik*
https://haptik.ai/contact-us

* Images shown here are the property of individual organisations and are used here for reference purpose only.
• Rose*
http://ec2-54-215-197-164.us-west-1.compute.amazonaws.com/speech.php

• Ochatbot*
https://www.ometrics.com/blog/list-of-fun-chatbots/

Let us discuss!
• Which chatbot did you try? Name any one.
• What is the purpose of this chatbot?
• How was the interaction with the chatbot?
• Did the chat feel like talking to a human or a robot? Why do you think so?
• Do you feel that the chatbot has a certain personality?
As you interact with more and more chatbots, you would realise that some of them are scripted or in
other words are traditional chatbots while others were AI-powered and had more knowledge. With
the help of this experience, we can understand that there are 2 types of chatbots around us: Script-
bot and Smart-bot. Let us understand what each of them mean in detail:

Script-bot Smart-bot
Script bots are easy to make Smart-bots are flexible and powerful
Script bots work around a script which is Smart bots work on bigger databases and other
programmed in them resources directly
Mostly they are free and are easy to integrate Smart bots learn with more data
to a messaging platform
No or little language processing skills Coding is required to take this up on board
Limited functionality Wide functionality

The story speaker activity which was done in class 9 can be considered as a script-bot as in that activity
we used to create a script around which the interactive story revolved. As soon as the machine got
triggered by the person, it used to follow the script and answer accordingly. Other examples of script
bot may include the bots which are deployed in the customer care section of various companies. Their
job is to answer some basic queries that they are coded for and connect them to human executives
once they are unable to handle the conversation.

On the other hand, all the assistants like Google Assistant, Alexa, Cortana, Siri, etc. can be taken as
smart bots as not only can they handle the conversations but can also manage to do other tasks which
makes them smarter.

Human Language VS Computer Language

Humans communicate through language which we process all the time. Our brain keeps on processing
the sounds that it hears around itself and tries to make sense out of them all the time. Even in the
classroom, as the teacher delivers the session, our brain is continuously processing everything and
storing it in some place. Also, while this is happening, when your friend whispers something, the focus
of your brain automatically shifts from the teacher’s speech to your friend’s conversation. So now, the
brain is processing both the sounds but is prioritising the one on which our interest lies.

* Images shown here are the property of individual organisations and are used here for reference purpose only.
The sound reaches the brain through a long channel. As a person speaks, the sound travels from his
mouth and goes to the listener’s eardrum. The sound striking the eardrum is converted into neuron
impulse, gets transported to the brain and then gets processed. After processing the signal, the brain
gains understanding around the meaning of it. If it is clear, the signal gets stored. Otherwise, the
listener asks for clarity to the speaker. This is how human languages are processed by humans.

On the other hand, the computer understands the language of numbers. Everything that is sent to the
machine has to be converted to numbers. And while typing, if a single mistake is made, the computer
throws an error and does not process that part. The communications made by the machines are very
basic and simple.

Now, if we want the machine to understand our language, how should this happen? What are the
possible difficulties a machine would face in processing natural language? Let us take a look at some
of them here:

Arrangement of the words and meaning

There are rules in human language. There are nouns, verbs, adverbs, adjectives. A word can be a noun
at one time and an adjective some other time. There are rules to provide structure to a language.

This is the issue related to the syntax of the language. Syntax refers to the grammatical structure of a
sentence. When the structure is present, we can start interpreting the message. Now we also want to
have the computer do this. One way to do this is to use the part-of-speech tagging. This allows the
computer to identify the different parts of a speech.

Besides the matter of arrangement, there’s also meaning behind the language we use. Human
communication is complex. There are multiple characteristics of the human language that might be
easy for a human to understand but extremely difficult for a computer to understand.

Analogy with programming language:

Different syntax, same semantics: 2+3 = 3+2

Here the way these statements are written is different, but their meanings are the same that is 5.

Different semantics, same syntax: 2/3 (Python 2.7) ≠ 2/3 (Python 3)

Here the statements written have the same syntax but their meanings are different. In Python 2.7,
this statement would result in 1 while in Python 3, it would give an output of 1.5.

Think of some other examples of different syntax and same semantics and vice-versa.

__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Multiple Meanings of a word
Let’s consider these three sentences:

His face turned red after he found out that he took the wrong bag
What does this mean? Is he feeling ashamed because he took another person’s bag instead of his? Is
he feeling angry because he did not manage to steal the bag that he has been targeting?

The red car zoomed past his nose

Probably talking about the color of the car

His face turns red after consuming the medicine

Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?

Here we can see that context is important. We understand a sentence almost intuitively, depending
on our history of using the language, and the memories that have been built within. In all three
sentences, the word red has been used in three different ways which according to the context of the
statement changes its meaning completely. Thus, in natural language, it is important to understand
that a word can have multiple meanings and the meanings fit into the statement according to the
context of it.

Think of some other words which can have multiple meanings and use them in sentences.

__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________

Perfect Syntax, no Meaning

Sometimes, a statement can have a perfectly correct syntax but it does not mean anything. For
example, take a look at this statement:

Chickens feed extravagantly while the moon drinks tea.

This statement is correct grammatically but does this make any sense? In Human language, a perfect
balance of syntax and semantics is important for better understanding.
Think of some other sentences having correct syntax and incorrect semantics.

__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________

These are some of the challenges we might have to face if we try to teach computers how to
understand and interact in human language. So how does Natural Language Processing do this magic?

Data Processing
Humans interact with each other very easily. For us, the natural languages that we use are so
convenient that we speak them easily and understand them well too. But for computers, our
languages are very complex. As you have already gone through some of the complications in human
languages above, now it is time to see how Natural Language Processing makes it possible for the
machines to understand and speak in the Natural Languages just like humans.

Since we all know that the language of computers is Numerical, the very first step that comes to our
mind is to convert our language to numbers. This conversion takes a few steps to happen. The first
step to it is Text Normalisation. Since human languages are complex, we need to first of all simplify
them in order to make sure that the understanding becomes possible. Text Normalisation helps in
cleaning up the textual data in such a way that it comes down to a level where its complexity is lower
than the actual data. Let us go through Text Normalisation in detail.

Text Normalisation
In Text Normalisation, we undergo several steps to normalise the text to a lower level. Before we
begin, we need to understand that in this section, we will be working on a collection of written text.
That is, we will be working on text from multiple documents and the term used for the whole textual
data from all the documents altogether is known as corpus. Not only would we go through all the
steps of Text Normalisation, we would also work them out on a corpus. Let us take a look at the steps:

Sentence Segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is taken as
a different data so now the whole corpus gets reduced to sentences.
Tokenisation
After segmenting the sentences, each sentence is then further divided into tokens. Tokens is a term
used for any word or number or special character occurring in a sentence. Under tokenisation, every
word, number and special character is considered separately and each of them is now a separate
token.

Removing Stopwords, Special Characters and Numbers

In this step, the tokens which are not necessary are removed from the token list. What can be the
possible words which we might not require?

Stopwords are the words which occur very frequently in the corpus but do not add any value to it.
Humans use grammar to make their sentences meaningful for the other person to understand. But
grammatical words do not add any essence to the information which is to be transmitted through the
statement hence they come under stopwords. Some examples of stopwords are:

* Images shown here are the property of individual organisations and are used here for reference purpose only.
These words occur the most in any given corpus but talk very little or nothing about the context or the
meaning of it. Hence, to make it easier for the computer to focus on meaningful terms, these words
are removed.

Along with these words, a lot of times our corpus might have special characters and/or numbers. Now
it depends on the type of corpus that we are working on whether we should keep them in it or not.
For example, if you are working on a document containing email IDs, then you might not want to
remove the special characters and numbers whereas in some other textual data if these characters do
not make sense, then you can remove them along with the stopwords.

Converting text to a common case

After the stopwords removal, we convert the whole text into a similar case, preferably lower case.
This ensures that the case-sensitivity of the machine does not consider same words as different just
because of different cases.

Here in this example, the all the 6 forms of hello would be converted to lower case and hence would
be treated as the same word by the machine.

Stemming
In this step, the remaining words are reduced to their root words. In other words, stemming is the
process in which the affixes of words are removed and the words are converted to their base form.

* Images shown here are the property of individual organisations and are used here for reference purpose only.
Note that in stemming, the stemmed words (words which are we get after removing the affixes) might
not be meaningful. Here in this example as you can see: healed, healing and healer all were reduced
to heal but studies was reduced to studi after the affix removal which is not a meaningful word.
Stemming does not take into account if the stemmed word is meaningful or not. It just removes the
affixes hence it is faster.

Lemmatization
Stemming and lemmatization both are alternative processes to each other as the role of both the
processes is same – removal of affixes. But the difference between both of them is that in
lemmatization, the word we get after affix removal (also known as lemma) is a meaningful one.
Lemmatization makes sure that lemma is a word with meaning and hence it takes a longer time to
execute than stemming.

As you can see in the same example, the output for studies after affix removal has become study
instead of studi.

* Images shown here are the property of individual organisations and are used here for reference purpose only.
Difference between stemming and lemmatization can be summarized by this example:

With this we have normalised our text to tokens which are the simplest form of words present in the
corpus. Now it is time to convert the tokens into numbers. For this, we would use the Bag of Words
algorithm

Bag of Words
Bag of Words is a Natural Language Processing model which helps in extracting features out of the
text which can be helpful in machine learning algorithms. In bag of words, we get the occurrences of
each word and construct the vocabulary for the corpus.

This image gives us a brief overview about how bag of words works. Let us assume that the text on
the left in this image is the normalised corpus which we have got after going through all the steps of
text processing. Now, as we put this text into the bag of words algorithm, the algorithm returns to us
the unique words out of the corpus and their occurrences in it. As you can see at the right, it shows us
a list of words appearing in the corpus and the numbers corresponding to it shows how many times
the word has occurred in the text body. Thus, we can say that the bag of words gives us two things:

1. A vocabulary of words for the corpus

2. The frequency of these words (number of times it has occurred in the whole corpus).

Here calling this algorithm “bag” of words symbolises that the sequence of sentences or tokens does
not matter in this case as all we need are the unique words and their frequency in it.

* Images shown here are the property of individual organisations and are used here for reference purpose only.
Here is the step-by-step approach to implement bag of words algorithm:

1. Text Normalisation: Collect data and pre-process it

2. Create Dictionary: Make a list of all the unique words occurring in the corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times the
word from the unique list of words has occurred.
4. Create document vectors for all the documents.
Let us go through all the steps with an example:

Step 1: Collecting data and pre-processing it.

Document 1: Aman and Anil are stressed

Document 2: Aman went to a therapist

Document 3: Anil went to download a health chatbot

Here are three documents having one sentence each. After text normalisation, the text becomes:

Document 1: [aman, and, anil, are, stressed]

Document 2: [aman, went, to, a, therapist]

Document 3: [anil, went, to, download, a, health, chatbot]

Note that no tokens have been removed in the stopwords removal step. It is because we have very
little data and since the frequency of all the words is almost the same, no word can be said to have
lesser value than the other.

Step 2: Create Dictionary

Go through all the steps and create a dictionary i.e., list down all the words which occur in all three
documents:

Dictionary:

aman and anil are stressed went

download health chatbot therapist a to

Note that even though some words are repeated in different documents, they are all written just once
as while creating the dictionary, we create the list of unique words.

Step 3: Create document vector

In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches
with the vocabulary, put a 1 under it. If the same word appears again, increment the previous value
by 1. And if the word does not occur in that document, put a 0 under it.

* Images shown here are the property of individual organisations and are used here for reference purpose only.
Since in the first document, we have words: aman, and, anil, are, stressed. So, all these words get a
value of 1 and rest of the words get a 0 value.

Step 4: Repeat for all documents

Same exercise has to be done for all the documents. Hence, the table becomes:

In this table, the header row contains the vocabulary of the corpus and three rows correspond to three
different documents. Take a look at this table and analyse the positioning of 0s and 1s in it.

Finally, this gives us the document vector table for our corpus. But the tokens have still not converted
to numbers. This leads us to the final steps of our algorithm: TFIDF.

TFIDF: Term Frequency & Inverse Document Frequency

Suppose you have a book. Which characters or words do you think would occur the most in it?

__________________________________________________________________________________
__________________________________________________________________________________

Bag of words algorithm gives us the frequency of words in each document we have in our corpus. It
gives us an idea that if the word is occurring more in a document, its value is more for that document.
For example, if I have a document on air pollution, air and pollution would be the words which occur
many times in it. And these words are valuable too as they give us some context around the document.
But let us suppose we have 10 documents and all of them talk about different issues. One is on women
empowerment, the other is on unemployment and so on. Do you think air and pollution would still be
one of the most occurring words in the whole corpus? If not, then which words do you think would
have the highest frequency in all of them?

And, this, is, the, etc. are the words which occur the most in almost all the documents. But these words
do not talk about the corpus at all. Though they are important for humans as they make the
statements understandable to us, for the machine they are a complete waste as they do not provide
us with any information regarding the corpus. Hence, these are termed as stopwords and are mostly
removed at the pre-processing stage only.

* Images shown here are the property of individual organisations and are used here for reference purpose only.
Take a look at this graph. It is a plot of occurrence of words versus their value. As you can see, if the
words have highest occurrence in all the documents of the corpus, they are said to have negligible
value hence they are termed as stop words. These words are mostly removed at the pre-processing
stage only. Now as we move ahead from the stopwords, the occurrence level drops drastically and the
words which have adequate occurrence in the corpus are said to have some amount of value and are
termed as frequent words. These words mostly talk about the document’s subject and their
occurrence is adequate in the corpus. Then as the occurrence of words drops further, the value of
such words rises. These words are termed as rare or valuable words. These words occur the least but
add the most value to the corpus. Hence, when we look at the text, we take frequent and rare words
into consideration.

Let us now demystify TFIDF. TFIDF stands for Term Frequency and Inverse Document Frequency. TFIDF
helps un in identifying the value for each word. Let us understand each term one by one.

Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can easily be found from
the document vector table as in that table we mention the frequency of each word of the vocabulary
in each document.

* Images shown here are the property of individual organisations and are used here for reference purpose only.
Here, you can see that the frequency of each word for each document has been recorded in the table.
These numbers are nothing but the Term Frequencies!

Inverse Document Frequency

Now, let us look at the other half of TFIDF which is Inverse Document Frequency. For this, let us first
understand what does document frequency mean. Document Frequency is the number of documents
in which the word occurs irrespective of how many times it has occurred in those documents. The
document frequency for the exemplar vocabulary would be:

Here, you can see that the document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2 as they have
occurred in two documents. Rest of them occurred in just one document hence the document
frequency for them is one.

Talking about inverse document frequency, we need to put the document frequency in the
denominator while the total number of documents is the numerator. Here, the total number of
documents are 3, hence inverse document frequency becomes:

Finally, the formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log( IDF(W) )

Here, log is to the base of 10. Don’t worry! You don’t need to calculate the log values by yourself.
Simply use the log function in the calculator and find out!

Now, let’s multiply the IDF values to the TF values. Note that the TF values are for each document
while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values to each row
of the document vector table.

* Images shown here are the property of individual organisations and are used here for reference purpose only.
Here, you can see that the IDF values for Aman in each row is the same and similar pattern is followed
for all the words of the vocabulary. After calculating all the values, we get:

Finally, the words have been converted to numbers. These numbers are the values of each for each
document. Here, you can see that since we have less amount of data, words like ‘are’ and ‘and’ also
have a high value. But as the IDF value increases, the value of that word decreases. That is, for
example:

Total Number of documents: 10

Number of documents in which ‘and’ occurs: 10

Therefore, IDF(and) = 10/10 = 1

Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.

On the other hand, number of documents in which ‘pollution’ occurs: 3

IDF(pollution) = 10/3 = 3.3333…

Which means: log(3.3333) = 0.522; which shows that the word ‘pollution’ has considerable value in
the corpus.

Summarising the concept, we can say that:

1. Words that occur in all the documents with high term frequencies have the least values and
are considered to be the stopwords.
2. For a word to have high TFIDF value, the word needs to have a high term frequency but less
document frequency which shows that the word is important for one document but is not a
common word for all documents.
3. These values help the computer understand which words are to be considered while
processing the natural language. The higher the value, the more important the word is for a
given corpus.

* Images shown here are the property of individual organisations and are used here for reference purpose only.
Applications of TFIDF
TFIDF is commonly used in the Natural Language Processing domain. Some of its applications are:

Document Information
Topic Modelling Stop word filtering
Classification Retrieval System

Helps in classifying the To extract the Helps in removing the

It helps in predicting
type and genre of a important information unnecessary words
the topic for a corpus.
document. out of a corpus. out of a text body.

DIY – Do It Yourself!
Here is a corpus for you to challenge yourself with the given tasks. Use the knowledge you have
gained in the above sections and try completing the whole exercise by yourself.

The Corpus
Document 1: We can use health chatbots for treating stress.

Document 2: We can use NLP to create chatbots and we will be making health chatbots now!

Document 3: Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y

Accomplish the following challenges on the basis of the corpus given above. You can use the tools
available online for these challenges. Link for each tool is given below:

1. Sentence Segmentation: https://tinyurl.com/y36hd92n

2. Tokenisation: https://text-processing.com/demo/tokenize/

3. Stopwords removal: https://demos.datasciencedojo.com/demo/stopwords/

4. Lowercase conversion: https://caseconverter.com/

5. Stemming: http://textanalysisonline.com/nltk-porter-stemmer

6. Lemmatisation: http://textanalysisonline.com/spacy-word-lemmatize

7. Bag of Words: Create a document vector table for all documents.

8. Generate TFIDF values for all the words.

9. Find the words having highest value.

10. Find the words having the least value.

Class X Unit VI Natural Language Processing
No ratings yet
Class X Unit VI Natural Language Processing
42 pages
Natural Language Processing
No ratings yet
Natural Language Processing
70 pages
Harambe University
No ratings yet
Harambe University
8 pages
NLP Grade 10 2023-2024
No ratings yet
NLP Grade 10 2023-2024
72 pages
Unit 6 Natural Language Processing
No ratings yet
Unit 6 Natural Language Processing
48 pages
Natural Language Processing GRADE 10 - 2021
No ratings yet
Natural Language Processing GRADE 10 - 2021
82 pages
Screenshot 2023-10-23 at 6.50.44 AM
No ratings yet
Screenshot 2023-10-23 at 6.50.44 AM
48 pages
Natural Language Processing
100% (1)
Natural Language Processing
12 pages
Asm1 Artificial Intelligence 314384
No ratings yet
Asm1 Artificial Intelligence 314384
9 pages
Unit 1
No ratings yet
Unit 1
26 pages
AI-Natural Language Processing
No ratings yet
AI-Natural Language Processing
49 pages
Grade 10 Unit 6 - Natural Language Processing
No ratings yet
Grade 10 Unit 6 - Natural Language Processing
33 pages
Artificial Intelligence - NLP
No ratings yet
Artificial Intelligence - NLP
32 pages
Natural Language Processing
No ratings yet
Natural Language Processing
73 pages
Class10 Facilitator Handbook Removed
No ratings yet
Class10 Facilitator Handbook Removed
31 pages
AI-Natural Language Processing
No ratings yet
AI-Natural Language Processing
51 pages
NLP Presentation
No ratings yet
NLP Presentation
20 pages
Natural Language Processing (NLP)
No ratings yet
Natural Language Processing (NLP)
27 pages
NLP Unit-1
No ratings yet
NLP Unit-1
20 pages
Applications of Natural Language Processing
No ratings yet
Applications of Natural Language Processing
61 pages
What Is Natural Language Processing?
No ratings yet
What Is Natural Language Processing?
5 pages
Natural Language Processing Guide
No ratings yet
Natural Language Processing Guide
16 pages
Ai NLP
No ratings yet
Ai NLP
21 pages
NLP Applications and Chatbots Guide
No ratings yet
NLP Applications and Chatbots Guide
71 pages
Lect 01
No ratings yet
Lect 01
28 pages
Natural Languag-Wps Office
No ratings yet
Natural Languag-Wps Office
24 pages
Unit 3&4
No ratings yet
Unit 3&4
10 pages
NLP 01
No ratings yet
NLP 01
7 pages
PDF Document 4
No ratings yet
PDF Document 4
5 pages
A.I Assignment
No ratings yet
A.I Assignment
4 pages
Text Mining 2
No ratings yet
Text Mining 2
4 pages
NLP for Tech Enthusiasts
No ratings yet
NLP for Tech Enthusiasts
35 pages
Natural Language Processing Report (By Sandeep Kumar Dash)
No ratings yet
Natural Language Processing Report (By Sandeep Kumar Dash)
25 pages
NLP Basics for Computer Science Students
No ratings yet
NLP Basics for Computer Science Students
87 pages
01 - Intro NLP
No ratings yet
01 - Intro NLP
13 pages
A Beginner's Introduction To Natural Language Processing (NLP)
100% (1)
A Beginner's Introduction To Natural Language Processing (NLP)
15 pages
Table of Content
No ratings yet
Table of Content
13 pages
Basic NLP To End-To-End Pipeline .PPTX - Removed
No ratings yet
Basic NLP To End-To-End Pipeline .PPTX - Removed
35 pages
Introduction To Natural Language Processing
No ratings yet
Introduction To Natural Language Processing
14 pages
Application of NLP
No ratings yet
Application of NLP
10 pages
Application of NLP in Big Data
No ratings yet
Application of NLP in Big Data
10 pages
What Is NLP?
No ratings yet
What Is NLP?
5 pages
Introduction To Natural Language Processing
No ratings yet
Introduction To Natural Language Processing
11 pages
NLP for Tech Enthusiasts
No ratings yet
NLP for Tech Enthusiasts
40 pages
NLP Notes
No ratings yet
NLP Notes
9 pages
Nlput-Unit1 Notes
No ratings yet
Nlput-Unit1 Notes
29 pages
NLP Notes
No ratings yet
NLP Notes
90 pages
NLP in Everyday Tech
No ratings yet
NLP in Everyday Tech
11 pages
1.2 Chap NLP Intro-2
No ratings yet
1.2 Chap NLP Intro-2
46 pages
Chapter 5 Ai
No ratings yet
Chapter 5 Ai
16 pages
AI & NLP Trends in Industry 2024
No ratings yet
AI & NLP Trends in Industry 2024
16 pages
Intro to NLP: Concepts & Applications
No ratings yet
Intro to NLP: Concepts & Applications
80 pages
Natural Language Processing (NLP)
No ratings yet
Natural Language Processing (NLP)
9 pages
Getting Started With Artificial Intelligence - Preview - Final 1 - KUO12425USEN PDF
No ratings yet
Getting Started With Artificial Intelligence - Preview - Final 1 - KUO12425USEN PDF
18 pages
NLP Lecture 1
No ratings yet
NLP Lecture 1
3 pages
Natural Language Processing: Bachelor of Technology Computer Science and Engineering
No ratings yet
Natural Language Processing: Bachelor of Technology Computer Science and Engineering
7 pages
AI Unit-5
No ratings yet
AI Unit-5
10 pages
Spesifikasi CAT 385C/385C L
100% (1)
Spesifikasi CAT 385C/385C L
28 pages
Siemens Building Technologies: HVAC Products
No ratings yet
Siemens Building Technologies: HVAC Products
299 pages
Sherwood M410
No ratings yet
Sherwood M410
29 pages
3M Novec 1230 Fire Protection Fluid: Frequently Asked Questions (Faqs)
No ratings yet
3M Novec 1230 Fire Protection Fluid: Frequently Asked Questions (Faqs)
6 pages
Deep Learning - 2
No ratings yet
Deep Learning - 2
21 pages
Assignment 3 CBNST
No ratings yet
Assignment 3 CBNST
23 pages
Barco Alchemy ICMP Datasheet
No ratings yet
Barco Alchemy ICMP Datasheet
2 pages
Operate Electrical, Electronic and Control Systems
100% (3)
Operate Electrical, Electronic and Control Systems
129 pages
Sex XXX 18 New Videos Sex XXX Sex XNXX Porn Sexy BF Videos Original Oficial
No ratings yet
Sex XXX 18 New Videos Sex XXX Sex XNXX Porn Sexy BF Videos Original Oficial
3 pages
Aphex 828M 2009
No ratings yet
Aphex 828M 2009
52 pages
Apcn-S Iom
No ratings yet
Apcn-S Iom
88 pages
Jenkins Automation for Developers
No ratings yet
Jenkins Automation for Developers
52 pages
Series 701 Pin Diode Fundamentals: Micronote
No ratings yet
Series 701 Pin Diode Fundamentals: Micronote
5 pages
Geotechnical and Structural Instrumentation
No ratings yet
Geotechnical and Structural Instrumentation
3 pages
Rta For Vis2515 - bcd112 (FTTH Overlay)
No ratings yet
Rta For Vis2515 - bcd112 (FTTH Overlay)
35 pages
Source Code For Insurance Project
No ratings yet
Source Code For Insurance Project
32 pages
Diesel Generator Set: Prime Model:LP90P Standby Model:LP100S
No ratings yet
Diesel Generator Set: Prime Model:LP90P Standby Model:LP100S
4 pages
Simscape Lang
No ratings yet
Simscape Lang
362 pages
AI Insights for Tech Enthusiasts
No ratings yet
AI Insights for Tech Enthusiasts
2 pages
Solar-Mobile-Charger Project Work 1
No ratings yet
Solar-Mobile-Charger Project Work 1
34 pages
Geophysical Log Interpretation
No ratings yet
Geophysical Log Interpretation
4 pages
History of Operating Systems
100% (1)
History of Operating Systems
5 pages
Principal Component Analysis PCA in Machine Learning
No ratings yet
Principal Component Analysis PCA in Machine Learning
20 pages
Advanced Integrity Assessment of Pipeline Dents Using ILI Data
No ratings yet
Advanced Integrity Assessment of Pipeline Dents Using ILI Data
27 pages
Stereo Optical Product Catalog 09 2024 Reduced
No ratings yet
Stereo Optical Product Catalog 09 2024 Reduced
15 pages
OS Practical Notes
No ratings yet
OS Practical Notes
21 pages
Inverted Ladder DAC
No ratings yet
Inverted Ladder DAC
11 pages
India in The Persian World of Letters Hni RZ Among The EighteenthCentury Philologists Arthur Dudney Instructor Test Bank
No ratings yet
India in The Persian World of Letters Hni RZ Among The EighteenthCentury Philologists Arthur Dudney Instructor Test Bank
329 pages
Assignment 9-KS
No ratings yet
Assignment 9-KS
3 pages
Linkedin Checklist
No ratings yet
Linkedin Checklist
7 pages

14 NLP

Uploaded by

14 NLP

Uploaded by

Natural Language Processing

Identify the mystery animal: http://bit.ly/iai4yma

Were you able to guess the animal?

If yes, in how many questions were you able to guess it?

What according to you was the task of the machine?

What approach must one follow to win this game?

Applications of Natural Language Processing

Automatic Summarization: Information overload is a real

Sentiment Analysis: The goal of sentiment

Text classification: Text classification makes it possible to assign

Virtual Assistants: Nowadays Google Assistant, Cortana,

Natural Language Processing: Getting Started

Revisiting the AI Project Cycle

Who Canvas – Who has the problem?

Who are the

What Canvas – What is the nature of the problem?

Where Canvas – Where does the problem arise?

What is the context/situation

Why Canvas – Why do you think it is a problem worth solving?

Our People undergoing stress Who?

This leads us to the goal of our project which is:

1. Surveys 2. Observing the therapist’s sessions

Human Language VS Computer Language

Arrangement of the words and meaning

Analogy with programming language:

Different syntax, same semantics: 2+3 = 3+2

Different semantics, same syntax: 2/3 (Python 2.7) ≠ 2/3 (Python 3)

The red car zoomed past his nose

His face turns red after consuming the medicine

Perfect Syntax, no Meaning

Chickens feed extravagantly while the moon drinks tea.

Removing Stopwords, Special Characters and Numbers

Converting text to a common case

1. A vocabulary of words for the corpus

1. Text Normalisation: Collect data and pre-process it

Step 1: Collecting data and pre-processing it.

Document 1: Aman and Anil are stressed

Document 2: Aman went to a therapist

Document 3: Anil went to download a health chatbot

Document 1: [aman, and, anil, are, stressed]

Document 2: [aman, went, to, a, therapist]

Document 3: [anil, went, to, download, a, health, chatbot]

Step 2: Create Dictionary

aman and anil are stressed went

download health chatbot therapist a to

Step 3: Create document vector

Step 4: Repeat for all documents

TFIDF: Term Frequency & Inverse Document Frequency

Inverse Document Frequency

Finally, the formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log( IDF(W) )

Total Number of documents: 10

Number of documents in which ‘and’ occurs: 10

Therefore, IDF(and) = 10/10 = 1

Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.

On the other hand, number of documents in which ‘pollution’ occurs: 3

IDF(pollution) = 10/3 = 3.3333…

Summarising the concept, we can say that:

Helps in classifying the To extract the Helps in removing the

1. Sentence Segmentation: https://tinyurl.com/y36hd92n

3. Stopwords removal: https://demos.datasciencedojo.com/demo/stopwords/

4. Lowercase conversion: https://caseconverter.com/

7. Bag of Words: Create a document vector table for all documents.

8. Generate TFIDF values for all the words.

9. Find the words having highest value.

10. Find the words having the least value.

You might also like