NLP Unit 1 PDF
i. Sentiment Analysis: This involves analyzing text to determine the sentiment behind it—
whether it is positive, negative, or neutral. It is widely used in social media
monitoring, product reviews, and customer feedback analysis.
ii. Named Entity Recognition (NER): NER identifies and categorizes entities in text such as
names of people, organizations, locations, dates, and other proper nouns. It is used in
information extraction, news categorization, and search engines. The input to such a
model is generally text, and the output is the various named entities along with their
start and end positions, as illustrated in the short sketch after this list. Named entity
recognition is useful in applications such as summarizing news articles and combating
disinformation.
iii. Machine Translation: NLP powers automatic translation tools like Google Translate by
converting text from one language to another while preserving meaning and context.
iv. Speech Recognition: Converts spoken language into text. This is used in voice assistants
like Siri, Alexa, and Google Assistant, as well as in transcription services.
v. Text Classification: Involves categorizing text into predefined labels, such as spam
detection in emails or topic classification in articles.
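
Below is a minimal NER sketch in Python using spaCy, one common library choice for this task; the small English model "en_core_web_sm" is an assumption about the environment and must be downloaded separately (python -m spacy download en_core_web_sm).

# Minimal NER sketch: text in, entities with labels and start/end positions out.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google opened a new office in London on 5 May 2023.")

# Each detected entity carries its text, label, and character offsets.
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)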
8. What is Chunking?
i. Chunking (also called shallow parsing) is the process of grouping individual tokens into
larger, meaningful phrases, such as noun phrases or verb phrases, based on their part-of-
speech (POS) tags.
ii. It is usually defined by a chunk grammar, i.e., patterns over POS tags (for example, an
optional determiner followed by adjectives and a noun forms a noun phrase chunk).
iii. Chunking sits between POS tagging and full parsing: it identifies phrase boundaries
without building a complete parse tree.
9. What is Chinking?
i. While chunking groups a sequence of words into a chunk, chinking removes specific
words or POS patterns from an already chunked phrase.
ii. It’s useful when the chunk includes unwanted words that should be excluded based on
their POS tags.
iii. For example, suppose a chunk includes all words between a determiner and a noun, but
you want to exclude verbs from that group. You would use chinking to "cut out" those
verbs.
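
A small chunking-plus-chinking sketch using NLTK's RegexpParser is shown below; the grammar and example sentence are illustrative, and the NLTK tokenizer and POS-tagger resources are assumed to be already downloaded.

# Chunk everything between a determiner and a noun, then chink out verbs.
import nltk

sentence = "The quick brown fox jumped over the lazy dog"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

grammar = r"""
  NP:
    {<DT><.*>*<NN.*>}   # chunk: determiner ... noun
    }<VB.*>{            # chink: remove any verbs from the chunk
"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)   # the verb "jumped" is cut out, splitting the chunk in two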
i. Breaking sentences into tokens: Splitting text into smaller units like words or phrases for
easier analysis.
ii. Tagging parts of speech (POS): Assigning grammatical roles (noun, verb, adjective, etc.)
to each token based on context (steps i and ii are sketched after this list).
iii. Building an appropriate vocabulary: Creating a set of unique words or tokens present in
the text corpus.
iv. Linking the components of a created vocabulary: Mapping tokens to indices or
embeddings for computational processing.
v. Understanding the context: Analyzing surrounding words and structure to grasp the
meaning of tokens accurately.
vi. Extracting semantic meaning: Identifying the deeper meaning or intent behind words and
phrases.
vii. Named Entity Recognition (NER): Detecting and classifying proper nouns like names,
places, dates, etc., in text.
viii. Transforming unstructured data into structured data: Converting raw text into organized
formats suitable for machine learning.
ix. Ambiguity in speech: Addressing words or phrases that have multiple meanings
depending on the context.
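
A minimal sketch of steps i and ii using NLTK, assuming the required tokenizer and tagger resources have been downloaded; the sentence is purely illustrative.

# Step i: break the text into tokens; step ii: assign POS tags to each token.
import nltk

text = "NLP transforms unstructured text into structured data."
tokens = nltk.word_tokenize(text)   # tokenization
tags = nltk.pos_tag(tokens)         # POS tagging

print(tokens)
print(tags)   # e.g. pairs like ('NLP', 'NNP'), ('transforms', 'VBZ'), ...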
vi. During training, Word2Vec uses optimization techniques such as Negative Sampling or
Hierarchical Softmax to improve efficiency when working with large vocabularies.
vii. One of the most powerful features of Word2Vec is its ability to capture linguistic
regularities and vector arithmetic. For example:
viii. Vector("King") - Vector("Man") + Vector("Woman") ≈ Vector("Queen")
ix. This means that the relationship between "king" and "man" is similar to the relationship
between "queen" and "woman".
x. These vector operations reflect real-world relationships and make Word2Vec highly useful
in tasks requiring semantic understanding; a short sketch of this arithmetic follows this
list.
xi. Word2Vec has several applications in NLP, including sentiment analysis, document
classification, machine translation, text clustering, question answering, and
recommendation systems.
xii. By converting unstructured text into structured vector representations, it enables
machines to understand word similarity, context, and meaning in a human-like way.
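
As a hedged illustration of this vector arithmetic, the sketch below uses gensim's downloader API with the pretrained "word2vec-google-news-300" vectors; the model name and the large download (roughly 1.6 GB) are assumptions about your environment.

# king - man + woman should rank "queen" at or near the top.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # typically something like [('queen', 0.71...)]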
i. Continuous Bag of Words (CBOW) is one of the two main architectures used in the
Word2Vec model for learning word embeddings — the other being Skip-gram.
ii. The CBOW model aims to predict a target word based on its surrounding context words
within a given window size.
iii. It is called "Bag of Words" because the order of context words is ignored, and only their
presence matters.
iv. For example, consider the sentence: "The cat sits on the mat."
If we choose the context window size = 2, and the target word is "sits", then the
context words are ["The", "cat", "on", "the"]. The CBOW model will try to predict the
word "sits" using these four surrounding words.
v. How it Works:
a) Input Layer: The model takes the context words as input, which are first converted into
one-hot encoded vectors. These vectors are then mapped to dense representations
using a shared weight matrix (also known as the embedding matrix).
b) Hidden Layer: The embeddings of the context words are averaged to produce a single
vector representation. This step captures the general meaning of the context.
c) Output Layer: This averaged vector is passed through another weight matrix followed by a
softmax function to produce a probability distribution over the entire vocabulary.
d) Prediction: The model selects the word with the highest probability as the predicted target
word. During training, the model adjusts its weights to minimize the error between the
predicted and actual target word.
vi. Key Features:
a) CBOW is faster to train than Skip-gram because the context words are averaged into a
single input, so each training example requires only one prediction (Skip-gram makes a
separate prediction for every context word), and it handles frequent words well.
b) It is best suited for large corpora where most words appear often.
c) It captures the overall meaning of surrounding words rather than focusing on individual
pairwise relationships.
vii. Advantages:
a) Efficient and quick for large datasets.
viii. Limitations:
a) Less effective for learning representations of rare words.
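
A minimal sketch of training CBOW embeddings with gensim 4.x's Word2Vec class, where sg=0 selects CBOW and sg=1 would select Skip-gram; the toy corpus and hyperparameters are purely illustrative, not a recommended setup.

# Train tiny CBOW embeddings on a toy corpus.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sits", "on", "the", "rug"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the embeddings
    window=2,         # context window size, as in the "sits" example above
    min_count=1,      # keep even rare words in this toy corpus
    sg=0,             # 0 = CBOW, 1 = Skip-gram
)

print(model.wv["cat"][:5])            # first few dimensions of the "cat" vector
print(model.wv.most_similar("cat"))   # nearest neighbours in the toy space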
i. Noise Removal -
Involves removing irrelevant information such as HTML tags, special characters,
emojis, URLs, or metadata that do not contribute to the meaning of the text.
ii. Tokenization -
Splits the text into smaller units called tokens (such as words, sentences, or
characters), which form the basis for further processing.
iii. Lowercasing -
Converts all characters in the text to lowercase to avoid treating the same words in
different cases (e.g., "Apple" and "apple") as separate tokens.
iv. Normalization (Stemming and Lemmatization) -
Stemming reduces words to their root form by chopping off suffixes (e.g., "playing"
→ "play").
Lemmatization returns the base or dictionary form of a word using linguistic
knowledge (e.g., "better" → "good").
v. Stop Word Removal -
Removes commonly used words (like is, the, and, a) that do not carry significant
meaning and are often considered irrelevant for analysis.
vi. Object Standardization -
Converts different forms of the same concept into a standard format (e.g., converting
"₹", "Rs.", and "INR" all to "rupees").
vii. Removing Punctuation -
Eliminates punctuation marks (like ., !, ?, ", etc.) which are generally not useful in
most text processing tasks.
viii. Removing Extra Whitespaces -
Trims unnecessary spaces, tabs, or newlines to ensure uniformity in the text and avoid
misleading tokenization.
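
A rough sketch tying several of these steps together (noise removal, lowercasing, punctuation and whitespace cleanup, tokenization, stop-word removal, lemmatization), using the standard library plus NLTK; the example string is made up, and the NLTK "punkt", "stopwords", and "wordnet" resources are assumed to be downloaded.

# A simple text preprocessing pipeline.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "Check out https://example.com!!!   NLP is AMAZING :) <br>"

text = re.sub(r"https?://\S+", " ", text)   # noise removal: URLs
text = re.sub(r"<[^>]+>", " ", text)        # noise removal: HTML tags
text = text.lower()                         # lowercasing
text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation
text = re.sub(r"\s+", " ", text).strip()    # remove extra whitespace

tokens = nltk.word_tokenize(text)           # tokenization
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]   # stop word removal

lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatization

print(tokens)   # e.g. ['check', 'nlp', 'amazing']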
17. Explain what tokenization is and what the types of tokenization are.
Tokenization is the process of splitting raw text into smaller units called tokens (words,
sentences, characters, or subwords), which are then processed individually. The common
types are:
a) Word Tokenization -
Splits a sentence into individual words.
Example:
"I love NLP." → ["I", "love", "NLP", "."]
b) Sentence Tokenization -
Splits a paragraph or document into individual sentences.
Example:
"NLP is interesting. It has many applications."
→ ["NLP is interesting.", "It has many applications."]
c) Character Tokenization -
Splits text into individual characters.
Example:
"Chat" → ['C', 'h', 'a', 't']
d) Subword Tokenization -
Splits words into meaningful subword units (like prefixes, suffixes, or roots).
Used in advanced NLP models like BERT and GPT.
Example:
"unhappiness" → ["un", "happi", "ness"]
e) Whitespace Tokenization -
Splits tokens wherever there is a space or tab character.
Simple but can break on punctuation or special symbols.
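
Short sketches of these tokenization types: NLTK for word and sentence tokenization, plain Python for character and whitespace tokenization, and a Hugging Face tokenizer ("bert-base-uncased") for subword tokenization; the installed packages and model name are assumptions about the environment.

# One example per tokenization type.
import nltk
from transformers import AutoTokenizer

text = "NLP is interesting. It has many applications."

print(nltk.word_tokenize(text))   # word tokenization
print(nltk.sent_tokenize(text))   # sentence tokenization
print(list("Chat"))               # character tokenization
print(text.split())               # whitespace tokenization

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("unhappiness"))   # subword (WordPiece) tokenization;
                                           # exact pieces depend on the vocabulary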
i. A bag of words is a representation of text that describes the occurrence of words within a
document.
ii. It keeps track of word counts and disregards grammatical details and word order, which is
why it is called a "bag" of words.
iii. It is concerned with whether known words occur in the document or not, rather than
where they occur.
iv. Bag of words is a simple form of text modelling that represents a document purely by the
words it contains.
v. How it works – a vocabulary of all unique words in the corpus is built, and each document
is then represented as a vector of word counts (or binary occurrences) over that
vocabulary, as sketched below.
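
A minimal bag-of-words sketch using scikit-learn's CountVectorizer, one common implementation choice (not the only one); the two documents are illustrative.

# Build a vocabulary and represent each document as a count vector.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # builds vocabulary + count vectors

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one count vector per document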
i. GloVe, which stands for Global Vectors for Word Representation, is an unsupervised
learning algorithm used to generate word embeddings—dense vector representations
of words that capture their semantic relationships.
ii. It is similar in purpose to Word2Vec but differs significantly in the way it learns these
representations.
iii. Unlike Word2Vec, which relies on local context windows (i.e., predicting a word from its
neighbors or vice versa), GloVe is based on global word co-occurrence statistics.
iv. It constructs a large matrix from a corpus that counts how often pairs of words occur
together in different contexts.
v. For example, if two words frequently appear in the same context across the entire corpus
(like "doctor" and "hospital"), they are likely to have similar vector representations.
vi. The key idea behind GloVe is that the ratio of co-occurrence probabilities between words
can reveal meaningful relationships.
vii. For instance, the ratio of how often "ice" co-occurs with "solid" compared to how often
"steam" co-occurs with "solid" helps the model learn that "ice" is more closely related
to cold or solidity, while "steam" is not. GloVe uses these relationships to train word
vectors so that the dot product of two word vectors approximates their co-occurrence
probability.
viii. GloVe uses a log-bilinear regression model to minimize a cost function that captures the
difference between the actual co-occurrence of words and the dot product of their
corresponding vectors (a common form of this objective is shown after this list).
ix. This allows it to produce word vectors that capture both semantic similarity (words used
in similar contexts) and linear relationships (e.g., vector("king") - vector("man") +
vector("woman") ≈ vector("queen")), just like Word2Vec.
x. One of the strengths of GloVe is that it produces a single embedding per word, trained on
the entire corpus, and performs well even with rare words if enough global co-
occurrence data is available.
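
For reference, the weighted least-squares objective from the original GloVe paper (Pennington et al., 2014) can be written as:

J = sum over word pairs (i, j) of f(X_ij) × (w_i · w̃_j + b_i + b̃_j − log X_ij)²

where X_ij is the number of times word j occurs in the context of word i, w_i and w̃_j are the word and context vectors, b_i and b̃_j are bias terms, and f is a weighting function that limits the influence of very frequent word pairs.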
In a bigram model, the probability of each word is conditioned only on the immediately
preceding word, so the probability of a sentence w1 w2 ... wn is approximated as:
P(w1, w2, ..., wn) ≈ P(w1) × P(w2 | w1) × P(w3 | w2) × ... × P(wn | wn-1)
Then, each bigram probability is estimated from corpus counts:
P(wn | wn-1) = Count(wn-1, wn) / Count(wn-1)
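
A tiny sketch of estimating one bigram probability from counts, using only the Python standard library; the corpus is a toy assumption.

# Count unigrams and bigrams, then estimate P("cat" | "the").
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

p = bigram_counts[("the", "cat")] / unigram_counts["the"]
print(p)   # 2 / 3 ≈ 0.67 for this toy corpus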
i. A neural network is a computational model inspired by the human brain's structure and
function.
ii. It is made up of layers of interconnected units known as neurons or nodes. These neurons
are organized into three main types of layers: the input layer, one or more hidden
layers, and the output layer.
iii. The design and flow of data through these layers is what defines the architecture of the
neural network.
iv. The input layer is the entry point for data into the network.
v. Each neuron in this layer represents a feature of the input data. For example, if we are
working with images of 28×28 pixels, the input layer will have 784 neurons, one for
each pixel.
vi. The input layer does not perform any computation; it simply passes the raw feature values
to the next layer.
vii. The hidden layers are where the majority of computation occurs. Each neuron in a hidden
layer takes inputs from all neurons in the previous layer, multiplies them by weights,
adds a bias, and passes the result through an activation function.
viii. These activation functions introduce non-linearity into the model, enabling it to learn
complex and non-linear patterns in data.
ix. Common activation functions include ReLU (Rectified Linear Unit), sigmoid, and tanh.
ReLU is widely used because it is simple and helps prevent the vanishing gradient
problem during training.
x. The output layer is the final layer of the network. It produces the network’s prediction.
xi. The number of neurons in the output layer depends on the task at hand. For binary
classification, a single output neuron with a sigmoid activation function is typically
used, producing values between 0 and 1, interpreted as probabilities.
xii. For multi-class classification, a softmax activation function is applied. The softmax
function converts raw output scores (logits) into probabilities that sum up to 1,
ensuring that no output exceeds 1.
xiii. This makes it easier to interpret the output as a probability distribution over different
classes, which is crucial for decision-making in classification problems.
xiv. The network learns using a process called forward propagation and backpropagation. In
forward propagation, the input is passed through the network layer by layer, and
predictions are made.
xv. These predictions are compared to the actual labels using a loss function (such as cross-
entropy for classification or mean squared error for regression), which calculates the
error in prediction.
xvi. This error is then used in backpropagation to update the weights and biases using
gradients computed via the chain rule of calculus.
xvii. To optimize the network parameters (weights and biases), algorithms such as Stochastic
Gradient Descent (SGD), Adam, or RMSprop are used.
xviii. These algorithms iteratively update the parameters to minimize the loss function,
improving the network’s accuracy over time.
xix. Training is usually done over multiple epochs, where the entire dataset is repeatedly fed
into the network, and may be divided into smaller batches to make the process more
efficient and stable.
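
A compact training sketch in PyTorch illustrating the loop described above (forward propagation, a cross-entropy loss, backpropagation, and an optimizer update with Adam); the layer sizes, dummy data, and hyperparameters are illustrative assumptions, not a prescribed architecture.

# One hidden layer, ReLU activation, 10 output classes.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),   # input layer -> hidden layer (e.g. 28x28 pixel images)
    nn.ReLU(),             # non-linear activation
    nn.Linear(128, 10),    # hidden layer -> 10 output logits
)

loss_fn = nn.CrossEntropyLoss()   # applies softmax internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(64, 784)            # a dummy mini-batch of inputs
y = torch.randint(0, 10, (64,))     # dummy integer class labels

for epoch in range(5):
    logits = model(X)               # forward propagation
    loss = loss_fn(logits, y)       # compare predictions to labels
    optimizer.zero_grad()
    loss.backward()                 # backpropagation via the chain rule
    optimizer.step()                # update weights and biases
    print(epoch, loss.item())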
OR
i. The feedforward process is the fundamental mechanism by which a neural network makes
predictions or computes outputs based on input data.
ii. It refers to the unidirectional flow of information from the input layer, through the hidden
layers, to the output layer—without any cycles or loops.
iii. This is why such networks are also called Feedforward Neural Networks (FNNs).
iv. In feedforward, each neuron in a layer receives input only from the previous layer,
performs a calculation, and passes its output to the next layer.
v. The process starts when raw input data, such as an image, text, or numerical values, is fed
into the input layer.
vi. Each input neuron simply passes its data to the first hidden layer without applying any
transformation.
vii. Within the hidden layers, each neuron takes a weighted sum of its inputs, adds a bias, and
then passes the result through an activation function.
viii. This activation function introduces non-linearity, which allows the network to learn
complex patterns.
ix. For example, the ReLU (Rectified Linear Unit) function outputs 0 for negative inputs and
the input itself for positive values, helping the network to avoid problems like
vanishing gradients.
x. This process continues layer by layer until the output layer is reached.
xi. The output neurons also compute weighted sums and apply activation functions like
sigmoid (for binary classification) or softmax (for multi-class classification). The final
output represents the prediction made by the network.
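
A bare-bones feedforward pass in NumPy following the steps above (weighted sum plus bias, ReLU in the hidden layer, softmax at the output); the layer sizes and random weights are illustrative assumptions.

# Forward propagation only: input -> hidden (ReLU) -> output (softmax).
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4,))            # 4 input features
W1 = rng.normal(size=(4, 3))         # weights: input -> hidden (3 neurons)
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 2))         # weights: hidden -> output (2 classes)
b2 = np.zeros(2)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

h = relu(x @ W1 + b1)                # hidden layer: weighted sum + bias + ReLU
y = softmax(h @ W2 + b2)             # output layer: probabilities summing to 1

print(y, y.sum())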
i. The set diagram illustrates the relationship between Artificial Intelligence (AI),
Machine Learning (ML), Deep Learning (DL), and Natural Language Processing
(NLP).
ii. At the broadest level, AI encompasses all technologies that aim to simulate human
intelligence in machines. This includes everything from rule-based systems to
learning-based models. Within this broad AI domain lies Machine Learning,
which refers to the subset of AI that enables systems to learn from data and
improve their performance over time without being explicitly programmed for
every specific task.
iii. Within Machine Learning lies a further specialization known as Deep Learning, which
involves the use of artificial neural networks with multiple layers (deep
architectures). Deep learning has shown exceptional performance in handling
large-scale data and complex tasks like image recognition, speech processing, and
most notably, sophisticated NLP tasks. This part of the diagram highlights that
while all deep learning is machine learning, not all machine learning is deep
learning.
iv. The NLP (Natural Language Processing) section overlaps both ML and DL areas of
the diagram, indicating that NLP makes use of techniques from both subfields.
NLP is the discipline concerned with enabling machines to understand, interpret,
generate, and interact using human languages.
v. This overlapping region emphasizes how NLP benefits from both traditional ML
approaches and cutting-edge DL models.
vi. NLP is a crucial application area of both ML and DL, sitting at the intersection of
language and computation.
26. What are the advantages and disadvantages of NLP?