NLP Record Mid-2
CountVectorizer
Description:
Tokenization
Before representing text numerically, it must be broken down into smaller units
called tokens.
In the code, tokenizer=lambda x: x.split() splits each sentence by whitespace.
Example:
"apple and banana" → ["apple", "and", "banana"]
The vectorizer builds a vocabulary of all unique tokens in the corpus.
One-Hot Encoding (Binary Representation)
This method represents whether each token from the vocabulary is present
(1) or absent (0) in a sentence.
• binary=True in CountVectorizer ensures that we use a binary vector, not raw
counts.
• Each vector’s length = size of the vocabulary.
• Each sentence becomes a vector indicating the presence of each word.
Example Corpus:
1. "apple and banana"
2. "banana and orange"
3. "grape apple banana"
Vocabulary: ['and', 'apple', 'banana', 'grape', 'orange']
Binary Matrix Output:
• [1, 1, 1, 0, 0] # "apple and banana"
• [1, 0, 1, 0, 1] # "banana and orange"
• [0, 1, 1, 1, 0] # "grape apple banana"
This matrix numerically represents the text and is suitable for downstream ML tasks.
Code:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "apple and banana",
    "banana and orange",
    "grape apple banana"
]

# binary=True gives presence/absence (one-hot style) vectors instead of raw counts
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), binary=True)
X = vectorizer.fit_transform(corpus).toarray()

print("Vocabulary:", vectorizer.get_feature_names_out())
print("One-Hot Encoded Matrix:\n", X)
Output:
Vocabulary: ['and' 'apple' 'banana' 'grape' 'orange']
One-Hot Encoded Matrix:
[[1 1 1 0 0]
[1 0 1 0 1]
[0 1 1 1 0]]
Program 14: Write a python code for demonstrating Count Vectorization, also
known as the Bag-of-Words (BoW) model — a foundational text representation
technique in NLP.
Description:
This code demonstrates Count Vectorization, also known as the Bag-of-Words
(BoW) model — a foundational text representation technique in NLP.
• It breaks each sentence into tokens (typically words), builds a vocabulary of all
unique tokens, and creates vectors indicating the frequency of each word in a
sentence.
How it works:
1. Tokenization:
The text is split into words (tokens) using default rules (like splitting by spaces
and removing punctuation).
2. Vocabulary Creation:
A set of all unique words across the corpus is created.
3. Vectorization:
Each sentence is converted into a numerical vector where:
o Each dimension corresponds to a word in the vocabulary.
o The value represents how many times that word appears in the sentence (see the sketch below).
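The three numbered steps can be sketched in plain Python before handing the work to scikit-learn (a toy illustration of the idea, not the library's internals):
from collections import Counter

corpus = ["apple and banana", "banana and orange"]

# Steps 1-2: tokenize by whitespace and build the vocabulary of unique tokens
vocab = sorted({token for sentence in corpus for token in sentence.split()})
# vocab == ['and', 'apple', 'banana', 'orange']

# Step 3: count each vocabulary word per sentence
for sentence in corpus:
    counts = Counter(sentence.split())
    print([counts[word] for word in vocab])
# prints [1, 1, 1, 0] and then [1, 0, 1, 1]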
Example Corpus:
1. "Natural language processing is a field of artificial intelligence and language
processing."
2. "Machine learning and deep learning are parts of AI."
3. "Natural language techniques are used in chatbots and translation."
4. "The future of AI depends on advances in NLP."
• A vocabulary is extracted: All unique words from all sentences.
• The Count Vector Matrix shows how often each vocabulary word appears in each
sentence.
This Bag-of-Words model captures word frequency information but ignores grammar,
word order, and semantics. It is commonly used for:
• Text classification
• Spam detection
• Sentiment analysis
• Document similarity tasks
Code:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus (can be sentences, paragraphs, documents)
corpus = [
    "Natural language processing is a field of artificial intelligence and language processing.",
    "Machine learning and deep learning are parts of AI.",
    "Natural language techniques are used in chatbots and translation.",
    "The future of AI depends on advances in NLP."
]

# Fit the vectorizer: tokenizes, builds the vocabulary, and counts words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("📚 Vocabulary:")
print(vectorizer.get_feature_names_out())
print("Count Vector Matrix:\n", X.toarray())
Output:
📚 Vocabulary:
['advances' 'ai' 'and' 'are' 'artificial' 'chatbots' 'deep' 'depends'
'field' 'future' 'in' 'intelligence' 'is' 'language' 'learning' 'machine'
'natural' 'nlp' 'of' 'on' 'parts' 'processing' 'techniques' 'the'
'translation' 'used']
Program 14 A: Write a python code for TF-IDF (Term Frequency-Inverse Document
Frequency) to understand which words are important, not just frequent.
Description:
1. Count Vectorizer (Bag of Words)
• Converts each document into a vector based on word frequency.
• Ignores grammar and word order; only counts occurrences of words.
• Commonly used for basic text classification and NLP tasks.
Limitation:
Frequent but less meaningful words (like "data" or "systems") may dominate the
representation, even if they don’t carry much information.
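To see why TF-IDF helps, here is a hand computation on a toy corpus (a minimal sketch of the classic formula; scikit-learn's TfidfVectorizer additionally applies smoothing and L2 normalization):
import math

# Toy corpus: "data" appears in every document, "neural" in only one
docs = [["ai", "data"], ["data", "systems"], ["neural", "data"]]

def tf_idf(term, doc, all_docs):
    tf = doc.count(term) / len(doc)             # term frequency
    df = sum(1 for d in all_docs if term in d)  # document frequency
    idf = math.log(len(all_docs) / df)          # inverse document frequency
    return tf * idf

print(tf_idf("data", docs[2], docs))    # 0.0   -> common word gets no weight
print(tf_idf("neural", docs[2], docs))  # ~0.55 -> rare word gets high weight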
In this example:
• Corpus: A small collection of sentences related to AI.
• CountVectorizer: Creates a matrix with raw word counts (excluding stop words).
• TfidfVectorizer: Creates a matrix of weighted scores based on word importance.
This helps in tasks like:
• Document classification
• Keyword extraction
• Search relevance ranking
Code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

corpus = [
    "Artificial Intelligence is transforming industries and daily life through automation and smart systems.",
    "Machine Learning, as a subset of AI, enables systems to learn from data without being explicitly programmed.",
    "Deep Learning techniques use neural networks with many layers to model complex patterns in data such as images and speech.",
    "Applications of AI include self-driving cars, medical diagnosis, financial forecasting, and personalized recommendations.",
    "Natural Language Processing helps computers understand, interpret, and generate human language using linguistic and statistical techniques.",
    "With rapid advancements in computing power and data availability, the future of AI continues to grow exponentially."
]

# Count Vectorizer
count_vectorizer = CountVectorizer(stop_words='english')
count_matrix = count_vectorizer.fit_transform(corpus)
count_df = pd.DataFrame(count_matrix.toarray(),
                        columns=count_vectorizer.get_feature_names_out())

# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=tfidf_vectorizer.get_feature_names_out())

# Display
print("📊 Count Vector (Bag of Words):")
print(count_df)

print("\n🌟 TF-IDF Vector:")
print(tfidf_df)
Output:
📊 Count Vector (Bag of Words):
   advancements  ai  applications  artificial  automation  availability  cars  \
0             0   0             0           1           1             0     0
1             0   1             0           0           0             0     0
2             0   0             0           0           0             0     0
3             0   1             1           0           0             0     1
4             0   0             0           0           0             0     0
5             1   1             0           0           0             1     0

[6 rows x 61 columns]

🌟 TF-IDF Vector:
   advancements        ai  applications  artificial  automation  availability  \
0      0.000000  0.000000       0.00000     0.33957     0.33957      0.000000
1      0.000000  0.240255       0.00000     0.00000     0.00000      0.000000
2      0.000000  0.000000       0.00000     0.00000     0.00000      0.000000
3      0.000000  0.204336       0.29515     0.00000     0.00000      0.000000
4      0.000000  0.000000       0.00000     0.00000     0.00000      0.000000
5      0.316885  0.219383       0.00000     0.00000     0.00000      0.316885

        use     using
0  0.000000  0.000000
1  0.000000  0.000000
2  0.290814  0.000000
3  0.000000  0.000000
4  0.000000  0.252599
5  0.000000  0.000000

[6 rows x 61 columns]
Program 15: Write a python code to implement word2vec word-embedding
technique
Code:
import gensim
import pandas as pd

# Load the reviews dataset (one JSON object per line)
df = pd.read_json("Sports_and_Outdoors_5.json", lines=True)

# Tokenize and lowercase each review
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

# Initialize the Word2Vec model
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

# Build the vocabulary, train, and save the model
model.build_vocab(review_text, progress_per=1000)
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)
model.save("./word2vec-outdoor-reviews-short.model")

print(model.wv.most_similar("awful"))
Output:
[('terrible', 0.7352169156074524),
('horrible', 0.6891771554946899),
('overwhelming', 0.6227911710739136),
('impossibility', 0.5835400819778442),
('horrendous', 0.5827057957649231),
('enormous', 0.5721909999847412),
('ugly', 0.567825436592102),
('unusual', 0.566750705242157),
('isolated', 0.5588798522949219),
('unfortunate', 0.5560564994812012)]
• model.wv.similarity(w1="good", w2="great") ð
output: 0.7870506
• model.wv.similarity(w1="slow", w2="steady")
ð output: 0.3472042
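The similarity scores above are cosine similarities between the learned embedding vectors. The raw vector behind any vocabulary word can also be inspected directly (gensim's Word2Vec uses vector_size=100 by default):
vec = model.wv["awful"]   # the learned embedding for "awful"
print(vec.shape)          # (100,) with the default vector_size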
Program 16 A: Write a python program to create a sample list of at least 5 words with multiple senses and display their word senses.
Description:
A word sense is a specific meaning of a word, especially when the word has multiple meanings depending on the context. This concept is central to understanding and processing natural language correctly.
Word Sense in Lexical Databases
In WordNet, each word sense (synset) provides (see the sketch below):
• A definition (gloss)
• Examples
• Synonyms (grouped into synsets)
• Relations (hypernyms, hyponyms, etc.)
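As a minimal sketch (assuming the WordNet corpus has already been downloaded with nltk.download('wordnet')), each of these components can be read directly off a synset:
from nltk.corpus import wordnet as wn

sense = wn.synsets("bank")[0]   # first sense: bank.n.01
print(sense.definition())       # the gloss
print(sense.examples())         # usage examples
print(sense.lemma_names())      # synonyms grouped in the synset
print(sense.hypernyms())        # relation: more general synsets
print(sense.hyponyms())         # relation: more specific synsets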
Code:
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

def disambiguate(word, context):
    senses = wn.synsets(word)
    print(f"WSD for the word: {word}")
    print(f"Senses of the word '{word}':")
    for i, sense in enumerate(senses, 1):
        print(f"{i}. {sense.name()}: {sense.definition()}")
    # A simple approach to disambiguate: check if any context word
    # appears in one of the sense's example sentences
    for i, sense in enumerate(senses, 1):
        for example in sense.examples():
            if any(context_word in example for context_word in context):
                print(f"\nContext matched with sense {i}: {sense.name()}")
                print(f"Example: {example}")
                return sense.name()  # Return the sense name based on context
    return "No match found"
Output:
WSD for the word: bank
Senses of the word 'bank':
1. bank.n.01: sloping land (especially the slope beside a body of water)
2. depository_financial_institution.n.01: a financial institution that accepts deposits and channels the money into lending activities
3. bank.n.03: a long ridge or pile
4. bank.n.04: an arrangement of similar objects in a row or in tiers
5. bank.n.05: a supply or stock held in reserve for future use (especially in emergencies)
6. bank.n.06: the funds held by a gambling house or the dealer in some gambling games
7. bank.n.07: a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
8. savings_bank.n.02: a container (usually with a slot in the top) for keeping money at home
9. bank.n.09: a building in which the business of banking transacted
10. bank.n.10: a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
11. bank.v.01: tip laterally
12. bank.v.02: enclose with a bank
13. bank.v.03: do business with a bank or keep an account at a bank
14. bank.v.04: act as the banker in a game or in gambling
15. bank.v.05: be in the banking business
16. deposit.v.02: put into a bank account
17. bank.v.07: cover with ashes so to control the rate of burning
18. trust.v.01: have confidence or faith in
9. bat.v.04: use a bat
10. cream.v.02: beat thoroughly and conclusively in a competition or fight
14. lead.n.14: thin strip of metal used to separate lines of type in printing
15. lead.n.15: mixture of graphite with clay in different degrees of
hardness; the marking substance in a pencil
16. jumper_cable.n.01: a jumper that consists of a short piece of wire
17. lead.n.17: the playing of a card to start a trick in bridge
18. lead.v.01: take somebody somewhere
19. leave.v.07: have as a result or residue
20. lead.v.03: tend to or result in
21. lead.v.04: travel in front of; go in advance of others
22. lead.v.05: cause to undertake a certain action
23. run.v.03: stretch out over a distance, space, time, or scope; run or
extend between two points or beyond a certain point
24. head.v.02: be in charge of
25. lead.v.08: be ahead of others; be the first
26. contribute.v.03: be conducive to
27. conduct.v.02: lead, as in the performance of a composition
28. go.v.25: lead, extend, or afford access
29. precede.v.04: move ahead (of others) in time or space
30. run.v.23: cause something to pass or lead somewhere
31. moderate.v.01: preside over
Disambiguated sense of 'lead': No match found
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
Program 16 B: Write a python program to implement Lesk's algorithm for word sense disambiguation.
Description:
Word Sense Disambiguation (WSD) is the process of identifying the correct meaning
(sense) of a word based on its context, especially when the word has multiple
meanings.
The correct sense of an ambiguous word is the one whose definition overlaps the
most with the definitions of the surrounding words in the sentence.
Working of algorithm:
• Identify all possible senses of the ambiguous word using a dictionary like
WordNet.
• For each sense, take its definition (gloss).
• Compare it with the glosses or words in the context (surrounding words).
• Count overlapping words between glosses and context.
• Choose the sense with the maximum overlap (a toy illustration follows).
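The overlap count at the heart of the algorithm is just a set intersection. A toy illustration with made-up glosses (not WordNet's actual glosses):
context = {"money", "deposit", "loan"}
gloss_financial = {"financial", "institution", "accepts", "deposit", "money"}
gloss_river = {"sloping", "land", "beside", "body", "water"}

print(len(context & gloss_financial))  # 2 overlapping words -> this sense wins
print(len(context & gloss_river))      # 0 overlapping words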
Code:
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

def lesk_algorithm(word, context):
    senses = wn.synsets(word)
    if not senses:
        return None
    context_tokens = set(word_tokenize(" ".join(context).lower()))
    max_overlap = 0
    best_sense = None
    for sense in senses:
        definition = set(word_tokenize(sense.definition().lower()))
        examples = set(word_tokenize(" ".join(sense.examples()).lower()))
        # Compute the overlap between the context and the sense definition + examples
        overlap = len(context_tokens.intersection(definition.union(examples)))
        # Keep track of the sense with the maximum overlap
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense
    return best_sense
# Example usage:
context = ["The bark of the tree is rough and textured."]
word = "bark"
best_sense = lesk_algorithm(word, context)
if best_sense:
    print(f"Disambiguated sense for '{word}': {best_sense.name()}")
    print(f"Definition: {best_sense.definition()}")
else:
    print(f"No sense found for '{word}'")

# Second example with a dog-related context (illustrative sentence)
context = ["The dog let out a loud bark, a sharp noise in the night."]
best_sense = lesk_algorithm(word, context)
if best_sense:
    print(f"Disambiguated sense for '{word}': {best_sense.name()}")
    print(f"Definition: {best_sense.definition()}")
else:
    print(f"No sense found for '{word}'")
Output:
Disambiguated sense for 'bark': bark.v.03
Definition: remove the bark of a tree
Disambiguated sense for 'bark': bark.n.02
Definition: a noise resembling the bark of a dog
Program 17: Write a python program using NLTK package to convert audio file to text
Description:
NLTK does not handle audio files directly. We will use the speech_recognition and gTTS libraries for audio, and use NLTK to process the text in between.
1. speech_recognition:
speech_recognition is a Python library that helps you convert spoken audio into written text using speech recognition engines.
Code:
import speech_recognition as sr
from gtts import gTTS
from nltk.tokenize import word_tokenize

def audio_to_text(audio_file_path):
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_file_path) as source:
        audio_data = recognizer.record(source)
    try:
        text = recognizer.recognize_google(audio_data)
        print("Recognized Text:\n", text)
        return text
    except sr.UnknownValueError:
        print("Speech Recognition could not understand the audio")
    except sr.RequestError:
        print("Could not request results from Google Speech Recognition service")

audio_path = "Sports.wav"
output_audio_path = "output.mp3"

text = audio_to_text(audio_path)
if text:
    tokens = word_tokenize(text)
    print("Tokenized Text:\n", tokens)
Output:
Tokenized Text:
['good', 'evening', 'ladies', 'and', 'gentlemen', 'we', 'like', 'to',
'welcome', 'you', 'to', 'play', 'the', 'new', 'videos', 'Broadcast']
Program 18: Write a python program using NLTK package to explore FrameNet frames, frame elements, and lexical units
Description:
FrameNet is a linguistic database that organizes words based on the situations (called frames) they describe, showing how words are connected to roles and events in real-world experiences.
FrameNet helps computers understand not just words, but meanings and relationships.
The key elements of FrameNet are Frames, Frame Elements (FEs), and Lexical Units (LUs).
Frames:
These are scripts or conceptual structures that describe specific types of situations,
events, or objects. For example, a "Cooking" frame would describe the situation of
preparing food, including the roles of a cook, the food, and the heating instrument.
Frame Elements (FEs):
These are the roles and participants in a frame. For example, the "Cooking" frame includes the cook, the food, and the heating instrument as frame elements.
Lexical Units (LUs):
A lexical unit is a word (or phrase) in a specific sense, linked to a particular frame. For
example, "bake" in its sense of preparing food would be an LU associated with the
"Cooking" frame.
Code:
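The listing below is a minimal sketch of code that would produce output of this shape, using NLTK's FrameNet corpus reader (fn.frames(), fn.frame(), and the lexUnit and FE attributes):
import nltk
nltk.download('framenet_v17')
from nltk.corpus import framenet as fn

# List all frames and print the first 20 names alphabetically
frames = fn.frames()
print("Total number of frames:", len(frames))
for i, frame in enumerate(sorted(frames, key=lambda f: f.name)[:20], start=1):
    print(f"{i}. {frame.name}")

# Inspect one frame: its definition, lexical units, and frame elements
frame = fn.frame('Awareness')
print("📌 Frame Name:", frame.name)
print("📝 Definition:", frame.definition)

print("🔤 Lexical Units (LUs):")
for lu_name in frame.lexUnit:
    print(" -", lu_name)

print("Frame Elements (FEs):")
for fe_name, fe in frame.FE.items():
    print(f" - {fe_name}: {fe.coreType} — {fe.definition}")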
Output:
[nltk_data] Downloading package framenet_v17 to /root/nltk_data...
[nltk_data] Unzipping corpora/framenet_v17.zip.
Total number of frames: 1221
1. Abandonment
2. Abounding_with
3. Absorb_heat
4. Abundance
5. Abusing
6. Access_scenario
7. Accompaniment
8. Accomplishment
9. Accoutrements
10. Accuracy
11. Achieving_first
12. Active_substance
13. Activity
14. Activity_abandoned_state
15. Activity_done_state
16. Activity_finish
17. Activity_ongoing
18. Activity_pause
19. Activity_paused_state
20. Activity_prepare
Output:
[nltk_data] Downloading package framenet_v17 to /root/nltk_data...
📌 Frame Name: Awareness
📝 Definition: A Cognizer has a piece of Content in their model of the world.
The Content is not necessarily present due to immediate perception, but
usually, rather, due to deduction from perceivables. In some cases, the
deduction of the Content is implicitly based on confidence in sources of
information (believe), in some cases based on logic (think), and in other
cases the source of the deduction is deprofiled (know). 'Your boss is aware
of your commitment.' '' Note that this frame is undergoing some degree of
reconsideration. Many of the targets will be moved to the Opinion frame.
That frame indicates that the Cognizer considers something as true, but the
Opinion (compare to Content) is not presupposed to be true; rather it is
something that is considered a potential point of difference, as in the
following: 'I think that you are awesome.' In the uses that will remain
in the Awareness frame, however, the Content is presupposed. '' This frame
is also distinct from the Certainty frame, in that it does not profile the
relationship of the Cognizer to the Content, but rather presupposes it. In
Certainty, the Degree of confidence or certainty is expressible as a separate
frame element, as in the following: 'She absolutely knew that he would be
there .'
🔤 Lexical Units (LUs):
- aware.a
- awareness.n
- believe.v
- comprehend.v
- comprehension.n
- conceive.v
- conception.n
- conscious.a
- hunch.n
- imagine.v
- know.v
- knowledge.n
- knowledgeable.a
- presume.v
- presumption.n
- reckon.v
- supposition.n
- suspect.v
- suspicion.n
- think.v
- thought.n
- understand.v
- understanding.n
- ignorance.n
- consciousness.n
- cognizant.a
- unknown.a
- idea.n
Frame Elements (FEs):
- Degree: Peripheral — This FE identifies the Degree to which an event
occurs.
- Manner: Peripheral — This FE identifies the Manner in which the Cognizer
knows or thinks something.
- Expressor: Core — Expressor is the body part that reveals the Cognizer's
state to the observer. 'Bob's eyes were overly aware'
- Role: Peripheral — Role is the category within which an element of the
Content is considered. 'He understood her remark as an insult.'
- Paradigm: Extra-Thematic — This frame element identifies the Paradigm
which serves as the basis for the Cognizer's awareness. 'The formation of
black holes should be understood in astrophysic terms.'
- Time: Peripheral — The time interval during which the Cognizer is aware of
the Content. 'Yet there is no evidence that Mr. Parrish was cognizant at
the time of the signing of the notes that the clauses in issue were
present.'
- Explanation: Extra-Thematic — The reason why or how it came to be that the
Cognizer has awareness of the Topic or Content.
[nltk_data] Package framenet_v17 is already up-to-date!
Program to invoke a particular frame based on a lexical unit in the given sentence
Code:
import re
from nltk.corpus import framenet as fn

def find_frames_for_sentence(sentence):
    frames_found = {}
    for word in re.findall(r"\w+", sentence.lower()):
        lus = fn.lus(r"^" + word + r"\.")  # lexical units whose name starts with this word
        if lus:
            frames_found[word] = sorted({lu.frame.name for lu in lus})
    return frames_found

# Example sentence
sentence = "We believe it is a fair and generous price."
frames_invoked = find_frames_for_sentence(sentence)

# Display result
print(f"Sentence: {sentence}")
print("\nInvoked FrameNet Frames:")
for word, frames in frames_invoked.items():
    print(f"- {word}: {', '.join(frames)}")
Output:
Program 19: Write a python program using NLTK package to find the synonyms, definitions, and hypernyms of a given word using WordNet
Code:
1. Finding Synonyms of a given word:
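The listing for this step is a minimal sketch built on WordNet's synsets() (assuming nltk.download('wordnet') has been run); it produces the senses shown in the output below:
from nltk.corpus import wordnet as wn

# Print every sense (synset) of the word "car"
for synset in wn.synsets('car'):
    print("Synset:", synset.name())
    print("Definition:", synset.definition())
    print("Examples:", synset.examples())
    print()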
Output:
Synset: car.n.01
Definition: a motor vehicle with four wheels; usually propelled by an internal combustion engine
Examples: ['he needs a car to get to work']

Synset: car.n.02
Definition: a wheeled vehicle adapted to the rails of railroad
Examples: ['three cars had jumped the rails']

Synset: car.n.03
Definition: the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
Examples: []

Synset: car.n.04
Definition: where passengers ride up and down
Examples: ['the car was on the top floor']

Synset: cable_car.n.01
Definition: a conveyance for passengers or freight on a cable railway
Examples: ['they took a cable car to the top of the mountain']
print(f"Hypernym: {hypernym.name()}")
Program 20: Write a Python code to generate n-grams using NLTK n-gram library
Description:
An n-gram is a contiguous sequence of n items from a given text. It is a fundamental concept in Natural Language Processing (NLP) and is used in many tasks, including language modeling, text analysis, speech recognition, and machine translation. n represents the number of items (usually words) in the sequence.
Types of N-Grams:
1. Unigram (1-gram):
o A unigram is simply a single word (or character) from the text.
o Example: "I love NLP"
§ Unigrams: ['I', 'love', 'NLP']
2. Bigram (2-gram):
o A bigram is a sequence of two consecutive words.
o Example: "I love NLP"
§ Bigrams: [('I', 'love'), ('love', 'NLP')]
3. Trigram (3-gram):
o A trigram is a sequence of three consecutive words.
o Example: "I love NLP"
§ Trigrams: [('I', 'love', 'NLP')]
4. Tetragram (4-gram):
o A tetragram is a sequence of four consecutive words.
o Example: "I love NLP very much"
§ Tetragrams: [('I', 'love', 'NLP', 'very'), ('love', 'NLP', 'very', 'much')]
5. Higher-order n-grams (n > 3):
o You can generate n-grams for any n. For instance, you can create 5-grams, 6-grams, etc., depending on how much context you want to capture.
o Example: the 5-grams of "I love NLP very much" are [('I', 'love', 'NLP', 'very', 'much')].
Code:
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
nltk.download('punkt')

def get_ngrams(text, n):
    """Tokenize the text and return its n-grams as tuples."""
    tokens = word_tokenize(text.lower())
    return list(ngrams(tokens, n))

text = "Sample list of words"

# Generate uni-grams
print("List of Unigram")
for ngram in get_ngrams(text, 1):
    print(ngram)

# Generate bi-grams
print("List of Bigrams")
for ngram in get_ngrams(text, 2):
    print(ngram)

# Generate tri-grams
print("List of Trigrams")
for ngram in get_ngrams(text, 3):
    print(ngram)
Output:
List of Unigram
('sample',)
('list',)
('of',)
('words',)
List of Bigrams
('sample', 'list')
('list', 'of')
('of', 'words')
List of Trigrams
('sample', 'list', 'of')
('list', 'of', 'words')
Program 21: Write a python program to train bi-gram model for a given corpus of
text to predict the next probable word given the previous two words of a
sentence.
Code:
"My passion is developing real world problem solving applications"
]
# Step 4: Predict the next word given a context (bigram prediction) def
predict_next_word(model, context): if context[-1] in model:
next_word_probs = model[context[-1]] next_word =
max(next_word_probs, key=next_word_probs.get) return
next_word else:
return None
22261A6634 60
# Predict the next word based on the user input predicted_word =
predict_next_word(bigram_model, context)
Output:
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
Enter a sentence or context (e.g., 'The bank'): I love
Program 22: Write a python program to train a bi-gram model for a given corpus of text and compute the bigram probabilities with add-one (Laplace) smoothing.
Steps are:
1. Tokenize the text and generate the bigrams.
2. Count the occurrences of each bigram and each unigram.
3. Build the vocabulary of unique tokens.
4. Apply add-one (Laplace) smoothing: P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size.
Code:
import collections
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
nltk.download('punkt')

def count_ngrams(ngrams_iter):
    """Count the occurrences of each n-gram."""
    return collections.Counter(ngrams_iter)

def laplace_smoothing(ngram_counts, unigram_counts, vocab_size, n):
    """Add-one smoothing: P(ngram) = (count(ngram) + 1) / (count(context) + V)."""
    smoothed_probs = {}
    for ngram in ngram_counts:
        context = ngram[:-1]
        context_count = unigram_counts[context] if n > 1 else sum(unigram_counts.values())
        smoothed_probs[ngram] = (ngram_counts[ngram] + 1) / (context_count + vocab_size)
    return smoothed_probs

def build_vocabulary(text):
    """Build a vocabulary from the given text."""
    tokens = word_tokenize(text.lower())
    return set(tokens)

# Example text
text = "this is a sample text with several words this is another sample text with some different words"

vocab = build_vocabulary(text)
tokens = word_tokenize(text.lower())
unigram_counts = count_ngrams(ngrams(tokens, 1))
bigram_counts = count_ngrams(ngrams(tokens, 2))

smoothed = laplace_smoothing(bigram_counts, unigram_counts, len(vocab), n=2)
for bigram, prob in smoothed.items():
    print(f"{bigram}: {prob:.6f}")
Output:
('is', 'a'): 0.153846
('a', 'sample'): 0.166667
('sample', 'text'): 0.230769
('text', 'with'): 0.230769
('with', 'several'): 0.153846
('several', 'words'): 0.166667
('words', 'this'): 0.153846
('is', 'another'): 0.153846
('another', 'sample'): 0.166667
('with', 'some'): 0.153846
('some', 'different'): 0.166667
('different', 'words'): 0.166667
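As a quick sanity check of the add-one formula with the 11-word vocabulary of the example text: for the bigram ('is', 'a'), count(('is', 'a')) = 1 and count(('is',)) = 2, so the smoothed probability is (1 + 1) / (2 + 11) = 2/13 ≈ 0.153846, which matches the first line above.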