NLP Record Mid-2
CountVectorizer
       Description:
       Tokenization
       Before representing text numerically, it must be broken down into smaller units
       called tokens.
       In the code, tokenizer=lambda x: x.split() splits each sentence by whitespace.
       Example:
       "apple and banana" → ["apple", "and", "banana"]
       The vectorizer builds a vocabulary of all unique tokens in the corpus.
       One-Hot Encoding (Binary Representation)
       This method represents whether each token from the vocabulary is present
       (1) or absent (0) in a sentence.
             •   binary=True in CountVectorizer ensures that we use a binary vector, not raw
                 counts.
             •   Each vector’s length = size of the vocabulary.
             •   Each sentence becomes a vector indicating the presence of each word.
       Example Corpus:
             1. "apple and banana"
             2. "banana and orange"
             3. "grape apple banana"
       Vocabulary: ['and', 'apple', 'banana', 'grape', 'orange']
       Binary Matrix Output:
             •   [1, 1, 1, 0, 0] # "apple and banana"
             •   [1, 0, 1, 0, 1] # "banana and orange"
             •   [0, 1, 1, 1, 0] # "grape apple banana"
This matrix numerically represents the text and is suitable for downstream ML tasks.
Code:
       from sklearn.feature_extraction.text import CountVectorizer

       corpus = [
           "apple and banana",
           "banana and orange",
           "grape apple banana"
       ]

       # Whitespace tokenizer, binary (presence/absence) features
       vectorizer = CountVectorizer(binary=True, tokenizer=lambda x: x.split())
       X = vectorizer.fit_transform(corpus).toarray()

       print("Vocabulary:", vectorizer.get_feature_names_out())
       print("One-Hot Encoded Matrix:\n", X)
       Output:
       Vocabulary: ['and' 'apple' 'banana' 'grape' 'orange']
       One-Hot Encoded Matrix:
        [[1 1 1 0 0]
        [1 0 1 0 1]
        [0 1 1 1 0]]
       Program 14: Write a python code for demonstrating Count Vectorization, also
       known as the Bag-of-Words (BoW) model — a foundational text representation
       technique in NLP.
       Description:
       This code demonstrates Count Vectorization, also known as the Bag-of-Words
       (BoW) model — a foundational text representation technique in NLP.
             •       It breaks each sentence into tokens (typically words), builds a vocabulary of all
                     unique tokens, and creates vectors indicating the frequency of each word in a
                     sentence.
       How it works:
             1. Tokenization:
                     The text is split into words (tokens) using default rules (like splitting by spaces
                     and removing punctuation).
             2. Vocabulary Creation:
                     A set of all unique words across the corpus is created.
             3. Vectorization:
                     Each sentence is converted into a numerical vector where:
                 o    Each dimension corresponds to a word in the vocabulary.
                 o    The value represents how many times that word appears in the sentence.
       Example Corpus:
             1. "Natural language processing is a field of artificial intelligence and language
                     processing."
             2. "Machine learning and deep learning are parts of AI."
             3. "Natural language techniques are used in chatbots and translation."
             4. "The future of AI depends on advances in NLP."
             •   A vocabulary is extracted: All unique words from all sentences.
             •   The Count Vector Matrix shows how often each vocabulary word appears in each
                 sentence.
       This Bag-of-Words model captures word frequency information but ignores grammar,
       word order, and semantics. It is commonly used for:
             •   Text classification
             •   Spam detection
             •   Sentiment analysis
             •   Document similarity tasks
       Code:
       from sklearn.feature_extraction.text import CountVectorizer

       # Sample corpus (can be sentences, paragraphs, documents)
       corpus = [
           "Natural language processing is a field of artificial intelligence and language processing.",
           "Machine learning and deep learning are parts of AI.",
           "Natural language techniques are used in chatbots and translation.",
           "The future of AI depends on advances in NLP."
       ]

       vectorizer = CountVectorizer()
       X = vectorizer.fit_transform(corpus)

       print("📚 Vocabulary:")
       print(vectorizer.get_feature_names_out())
       print("\nCount Vector Matrix:")
       print(X.toarray())
     Output:
     📚 Vocabulary:
     ['advances' 'ai' 'and' 'are' 'artificial' 'chatbots' 'deep' 'depends'
      'field' 'future' 'in' 'intelligence' 'is' 'language' 'learning' 'machine'
      'natural' 'nlp' 'of' 'on' 'parts' 'processing' 'techniques' 'the'
      'translation' 'used']
       Program 14 A: Write a python code for TF-IDF (Term Frequency-Inverse Document
       Frequency) to understand which words are important, not just frequent.
       Description:
       1. Count Vectorizer (Bag of Words)
             •   Converts each document into a vector based on word frequency.
             •   Ignores grammar and word order; only counts occurrences of words.
             •   Commonly used for basic text classification and NLP tasks.
       Limitation:
       Frequent but less meaningful words (like "data" or "systems") may dominate the
       representation, even if they don't carry much information.
       2. TF-IDF Vectorizer
             •   Weights each word by its frequency in a document and by how rare it is
                 across the corpus, so frequent-but-uninformative words are down-weighted.
       In this example:
             •   Corpus: A small collection of sentences related to AI.
             •   CountVectorizer: Creates a matrix with raw word counts (excluding stop words).
             •   TfidfVectorizer: Creates a matrix of weighted scores based on word importance.
       This helps in tasks like:
             •   Document classification
             •   Keyword extraction
             •   Search relevance ranking
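       For reference, scikit-learn's TfidfVectorizer with its default settings
       (smooth_idf=True, followed by L2 row normalization; stated here as the library
       default, not something shown in this record) computes the weights as:

       \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad
       \mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1

       where n is the number of documents in the corpus and df(t) is the number of
       documents containing term t. Rare terms therefore receive a higher idf weight
       than terms that appear in nearly every document.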
       Code:
       from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
       import pandas as pd

       corpus = [
           "Artificial Intelligence is transforming industries and daily life through automation and smart systems.",
           "Machine Learning, as a subset of AI, enables systems to learn from data without being explicitly programmed.",
           "Deep Learning techniques use neural networks with many layers to model complex patterns in data such as images and speech.",
           "Applications of AI include self-driving cars, medical diagnosis, financial forecasting, and personalized recommendations.",
           "Natural Language Processing helps computers understand, interpret, and generate human language using linguistic and statistical techniques.",
           "With rapid advancements in computing power and data availability, the future of AI continues to grow exponentially."
       ]

       # Count Vectorizer
       count_vectorizer = CountVectorizer(stop_words='english')
       count_matrix = count_vectorizer.fit_transform(corpus)
       count_df = pd.DataFrame(count_matrix.toarray(),
                               columns=count_vectorizer.get_feature_names_out())

       # TF-IDF Vectorizer
       tfidf_vectorizer = TfidfVectorizer(stop_words='english')
       tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
       tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                               columns=tfidf_vectorizer.get_feature_names_out())

       # Display
       print("📊 Count Vector (Bag of Words):")
       print(count_df)
       print("\n🌟 TF-IDF Vector:")
       print(tfidf_df)
       Output:
       📊 Count Vector (Bag of Words):
          advancements  ai  applications  artificial  automation  availability  cars
       0             0   0             0           1           1             0     0
       1             0   1             0           0           0             0     0
       2             0   0             0           0           0             0     0
       3             0   1             1           0           0             0     1
       4             0   0             0           0           0             0     0
       5             1   1             0           0           0             1     0

       [6 rows x 61 columns]

       🌟 TF-IDF Vector:
          advancements        ai  applications  artificial  automation  availability
       0      0.000000  0.000000       0.00000     0.33957     0.33957      0.000000
       1      0.000000  0.240255       0.00000     0.00000     0.00000      0.000000
       2      0.000000  0.000000       0.00000     0.00000     0.00000      0.000000
       3      0.000000  0.204336       0.29515     0.00000     0.00000      0.000000
       4      0.000000  0.000000       0.00000     0.00000     0.00000      0.000000
       5      0.316885  0.219383       0.00000     0.00000     0.00000      0.316885

               use     using
       0  0.000000  0.000000
       1  0.000000  0.000000
       2  0.290814  0.000000
       3  0.000000  0.000000
       4  0.000000  0.252599
       5  0.000000  0.000000

       [6 rows x 61 columns]
       Program 15: Write a python code to implement word2vec word-embedding
       technique
       Code:
       import gensim
       import pandas as pd

       # Load the Amazon Sports & Outdoors reviews dataset (JSON lines format)
       df = pd.read_json("Sports_and_Outdoors_5.json", lines=True)

       # Tokenize and lowercase each review with gensim's simple_preprocess
       review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

       # Build and train the Word2Vec model
       model = gensim.models.Word2Vec(window=10, min_count=2, workers=4)
       model.build_vocab(review_text, progress_per=1000)
       model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

       model.save("./word2vec-outdoor-reviews-short.model")
       print(model.wv.most_similar("awful"))
       Output:
       [('terrible', 0.7352169156074524),
       ('horrible', 0.6891771554946899),
       ('overwhelming', 0.6227911710739136),
       ('impossibility', 0.5835400819778442),
       ('horrendous', 0.5827057957649231),
       ('enormous', 0.5721909999847412),
       ('ugly', 0.567825436592102),
       ('unusual', 0.566750705242157),
       ('isolated', 0.5588798522949219),
       ('unfortunate', 0.5560564994812012)]
         •     model.wv.similarity(w1="good", w2="great") → output: 0.7870506
         •     model.wv.similarity(w1="slow", w2="steady") → output: 0.3472042
Program 16 A: Write a python program to create a sample list for at least 5 words with
       multiple senses and display their word senses using WordNet.
       Description:
       A word sense is a specific meaning of a word, especially when the word has multiple
       meanings depending on the context. This concept is central to understanding and
       processing natural language correctly.
       Word Sense in Lexical Databases
       In lexical databases such as WordNet, each word sense is represented by:
             •     A definition (gloss)
             •     Examples
             •     Synonyms (grouped into synsets)
             •     Relations (hypernyms, hyponyms, etc.)
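       As a quick illustration of these four pieces (a minimal sketch; "bank" is used
       only as an example word), NLTK's WordNet interface exposes each of them
       directly on a synset:

       from nltk.corpus import wordnet as wn

       synset = wn.synsets("bank")[0]   # first sense of "bank"
       print(synset.definition())       # the gloss
       print(synset.examples())         # example sentences
       print(synset.lemma_names())      # synonyms in this synset
       print(synset.hypernyms())        # related, more general synsets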
       Code:
       import nltk
       from nltk.corpus import wordnet as wn
       nltk.download('wordnet')

       word = "bank"
       context = ["money", "deposit", "loan"]  # illustrative context words (assumed)

       senses = wn.synsets(word)
       print(f"WSD for the word: {word}")
       print(f"Senses of the word '{word}':")
       for i, sense in enumerate(senses):
           print(f"{i+1}. {sense.name()}: {sense.definition()}")

       # A simple approach to disambiguate: check if any context words match with the
       # word's sense examples
       for i, sense in enumerate(senses):
           for example in sense.examples():
               if any(context_word in example for context_word in context):
                   print(f"\nContext matched with sense {i+1}: {sense.name()}")
                   print(f"Example: {example}")
       Output:
       WSD for the word: bank
       Senses of the word 'bank':
       1. bank.n.01: sloping land (especially the slope beside a body of water)
       2. depository_financial_institution.n.01: a financial institution that accepts
          deposits and channels the money into lending activities
       3. bank.n.03: a long ridge or pile
       4. bank.n.04: an arrangement of similar objects in a row or in tiers
       5. bank.n.05: a supply or stock held in reserve for future use (especially in
          emergencies)
       6. bank.n.06: the funds held by a gambling house or the dealer in some
          gambling games
       7. bank.n.07: a slope in the turn of a road or track; the outside is higher
          than the inside in order to reduce the effects of centrifugal force
       8. savings_bank.n.02: a container (usually with a slot in the top) for
          keeping money at home
       9. bank.n.09: a building in which the business of banking transacted
       10. bank.n.10: a flight maneuver; aircraft tips laterally about its
           longitudinal axis (especially in turning)
       11. bank.v.01: tip laterally
       12. bank.v.02: enclose with a bank
       13. bank.v.03: do business with a bank or keep an account at a bank
       14. bank.v.04: act as the banker in a game or in gambling
       15. bank.v.05: be in the banking business
       16. deposit.v.02: put into a bank account
       17. bank.v.07: cover with ashes so to control the rate of burning
       18. trust.v.01: have confidence or faith in
       9. bat.v.04: use a bat
       10. cream.v.02: beat thoroughly and conclusively in a competition or fight
       14. lead.n.14: thin strip of metal used to separate lines of type in printing
       15. lead.n.15: mixture of graphite with clay in different degrees of
           hardness; the marking substance in a pencil
       16. jumper_cable.n.01: a jumper that consists of a short piece of wire
       17. lead.n.17: the playing of a card to start a trick in bridge
       18. lead.v.01: take somebody somewhere
       19. leave.v.07: have as a result or residue
       20. lead.v.03: tend to or result in
       21. lead.v.04: travel in front of; go in advance of others
       22. lead.v.05: cause to undertake a certain action
       23. run.v.03: stretch out over a distance, space, time, or scope; run or
           extend between two points or beyond a certain point
       24. head.v.02: be in charge of
       25. lead.v.08: be ahead of others; be the first
       26. contribute.v.03: be conducive to
       27. conduct.v.02: lead, as in the performance of a composition
       28. go.v.25: lead, extend, or afford access
       29. precede.v.04: move ahead (of others) in time or space
       30. run.v.23: cause something to pass or lead somewhere
       31. moderate.v.01: preside over
       Disambiguated sense of 'lead': No match found
       [nltk_data] Downloading package wordnet to /root/nltk_data...
       [nltk_data]   Package wordnet is already up-to-date!
       Program 16 B: Write a python program to implement Lesk's algorithm for word sense
       disambiguation.
Description:
       Word Sense Disambiguation (WSD) is the process of identifying the correct meaning
       (sense) of a word based on its context, especially when the word has multiple
       meanings.
       The correct sense of an ambiguous word is the one whose definition overlaps the
       most with the definitions of the surrounding words in the sentence.
Working of algorithm:
             •   Identify all possible senses of the ambiguous word using a dictionary like
                 WordNet.
             •     For each sense, take its definition (gloss).
             •     Compare it with the glosses or words in the context (surrounding words).
             •     Count overlapping words between glosses and context.
             •     Choose the sense with the maximum overlap.
       Code:
       import nltk
       from nltk.corpus import wordnet as wn
       from nltk.tokenize import word_tokenize
       nltk.download('wordnet')
       nltk.download('punkt')

       def lesk_algorithm(word, context):
           senses = wn.synsets(word)
           if not senses:
               return None
           max_overlap = 0
           best_sense = None
           context_tokens = set(word_tokenize(" ".join(context).lower()))
           for sense in senses:
               # Compute the overlap between the context and the sense definition + examples
               definition = set(word_tokenize(sense.definition().lower()))
               examples = set(word_tokenize(" ".join(sense.examples()).lower()))
               overlap = len(context_tokens.intersection(definition.union(examples)))
               # Keep track of the sense with the maximum overlap
               if overlap > max_overlap:
                   max_overlap = overlap
                   best_sense = sense
           return best_sense

       # Example usage:
       context = ["The bark of the tree is rough and textured."]
       word = "bark"
       best_sense = lesk_algorithm(word, context)
       if best_sense:
           print(f"Disambiguated sense for '{word}': {best_sense.name()}")
           print(f"Definition: {best_sense.definition()}")
       else:
           print(f"No sense found for '{word}'")
       Output:
       Disambiguated sense for 'bark': bark.v.03
       Definition: remove the bark of a tree
       Disambiguated sense for 'bark': bark.n.02
       Definition: a noise resembling the bark of a dog
Program 17: Write a python program using NLTK package to convert audio file to text
       Description:
       NLTK does not handle audio files directly. We will use the speech_recognition and
       gTTS libraries for the audio side, and NLTK to process the text in between.
             1. speech_recognition:
                speech_recognition is a Python library that helps you convert spoken audio
                into written text using speech recognition engines.
Code:
       import nltk
       import speech_recognition as sr
       from gtts import gTTS
       from nltk.tokenize import word_tokenize
       nltk.download('punkt')

       def audio_to_text(audio_file_path):
           recognizer = sr.Recognizer()
           with sr.AudioFile(audio_file_path) as source:
               audio_data = recognizer.record(source)
           try:
               text = recognizer.recognize_google(audio_data)
               print("Recognized Text:\n", text)
               return text
           except sr.UnknownValueError:
               print("Speech Recognition could not understand the audio")
           except sr.RequestError:
               print("Could not request results from Google Speech Recognition service")

       audio_path = "Sports.wav"
       output_audio_path = "output.mp3"   # target file for the gTTS text-to-audio step

       text = audio_to_text(audio_path)
       if text:
           tokens = word_tokenize(text)
           print("Tokenized Text:\n", tokens)
Output:
       Tokenized Text:
        ['good', 'evening', 'ladies', 'and', 'gentlemen', 'we', 'like', 'to',
       'welcome', 'you', 'to', 'play', 'the', 'new', 'videos', 'Broadcast']
       Program 18: Write a python program using NLTK package to explore FrameNet
       frames, frame elements, and lexical units
Description:
       FrameNet is a linguistic database that organizes words based on the situations (called
       frames) they describe, showing how words are connected to roles and events in
       real-world experiences. It helps computers understand not just words, but meanings
       and relationships.
              The key elements of FrameNet are Frames, Frame Elements (FEs), and Lexical
              Units (LUs).
       Frames:
       These are scripts or conceptual structures that describe specific types of situations,
       events, or objects. For example, a "Cooking" frame would describe the situation of
       preparing food, including the roles of a cook, the food, and the heating instrument.
       Frame Elements (FEs):
       These are the roles that participants play within a frame; in the "Cooking" frame, for
       example, the cook, the food, and the heating instrument are frame elements.
       Lexical Units (LUs):
       A lexical unit is a word (or phrase) in a specific sense, linked to a particular frame. For
       example, "bake" in its sense of preparing food would be an LU associated with the
       "Cooking" frame.
Code:
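       The original code block is not preserved at this point in the record; the
       following is a minimal sketch that reproduces the output below using NLTK's
       FrameNet corpus reader.

       import nltk
       nltk.download('framenet_v17')
       from nltk.corpus import framenet as fn

       # List all FrameNet frames and print the first 20 names
       frames = fn.frames()
       print("Total number of frames:", len(frames))
       for i, frame in enumerate(frames[:20], start=1):
           print(f"{i}. {frame.name}")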
       Output:
               [nltk_data] Downloading package framenet_v17 to /root/nltk_data...
               [nltk_data]   Unzipping corpora/framenet_v17.zip.
               Total number of frames: 1221
               1.    Abandonment
               2.    Abounding_with
               3.    Absorb_heat
               4.    Abundance
               5.    Abusing
               6.    Access_scenario
               7.    Accompaniment
               8.    Accomplishment
               9.    Accoutrements
               10.   Accuracy
               11.   Achieving_first
               12.   Active_substance
               13.   Activity
               14.   Activity_abandoned_state
               15.   Activity_done_state
               16.   Activity_finish
               17.   Activity_ongoing
              18. Activity_pause
              19. Activity_paused_state
              20. Activity_prepare
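       The frame-detail output that follows would come from code along these lines
       (again a sketch, since the original block is not preserved; fn.frame('Awareness')
       and the lexUnit/FE attributes are standard NLTK FrameNet APIs):

       from nltk.corpus import framenet as fn

       frame = fn.frame('Awareness')
       print("📌 Frame Name:", frame.name)
       print("📝 Definition:", frame.definition)
       print("🔤 Lexical Units (LUs):")
       for lu_name in frame.lexUnit:
           print("-", lu_name)
       print("Frame Elements (FEs):")
       for fe_name, fe in frame.FE.items():
           print(f"- {fe_name}: {fe.coreType}: {fe.definition}")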
       Output:
       [nltk_data] Downloading package framenet_v17 to /root/nltk_data...
       📌 Frame Name: Awareness
       📝 Definition: A Cognizer has a piece of Content in their model of the world.
       The Content is not necessarily present due to immediate perception, but
       usually, rather, due to deduction from perceivables. In some cases, the
       deduction of the Content is implicitly based on confidence in sources of
       information (believe), in some cases based on logic (think), and in other
       cases the source of the deduction is deprofiled (know). 'Your boss is aware
       of your commitment.' '' Note that this frame is undergoing some degree of
       reconsideration. Many of the targets will be moved to the Opinion frame.
       That frame indicates that the Cognizer considers something as true, but the
       Opinion (compare to Content) is not presupposed to be true; rather it is
       something that is considered a potential point of difference, as in the
       following:    'I think that you are awesome.' In the uses that will remain
       in the Awareness frame, however, the Content is presupposed. '' This frame
       is also distinct from the Certainty frame, in that it does not profile the
       relationship of the Cognizer to the Content, but rather presupposes it. In
       Certainty, the Degree of confidence or certainty is expressible as a separate
       frame element, as in the following: 'She absolutely knew that he would be
       there .'
       🔤   Lexical Units (LUs):
       -   aware.a
       -   awareness.n
       -   believe.v
       -   comprehend.v
       -   comprehension.n
       -   conceive.v
       -   conception.n
       -   conscious.a
       -   hunch.n
       -   imagine.v
       -   know.v
       -   knowledge.n
       -   knowledgeable.a
       -   presume.v
       -   presumption.n
       -   reckon.v
       -   supposition.n
       -   suspect.v
        -   suspicion.n
        -   think.v
       -   thought.n
       -   understand.v
       -   understanding.n
       -   ignorance.n
       -   consciousness.n
       -   cognizant.a
       -   unknown.a
       -   idea.n
       - Degree: Peripheral — This FE identifies the Degree to which an event
         occurs.
       - Manner: Peripheral — This FE identifies the Manner in which the Cognizer
         knows or thinks something.
       - Expressor: Core — Expressor is the body part that reveals the Cognizer's
         state to the observer. 'Bob's eyes were overly aware'
       - Role: Peripheral — Role is the category within which an element of the
       Content is considered. 'He understood her remark as an insult.'
       - Paradigm: Extra-Thematic — This frame element identifies the Paradigm
         which serves as the basis for the Cognizer's awareness. 'The formation of
         black holes should be understood in astrophysic terms.'
       - Time: Peripheral — The time interval during which the Cognizer is aware of
         the Content. 'Yet there is no evidence that Mr. Parrish was cognizant at
         the time of the signing of the notes that the clauses in issue were
         present.'
       - Explanation: Extra-Thematic — The reason why or how it came to be that the
       Cognizer has awareness of the Topic or Content.
       [nltk_data]   Package framenet_v17 is already up-to-date!
Program to invoke a particular Frame based on Lexical unit in the given sentence
       Code:
       from nltk.corpus import framenet as fn
       from nltk.tokenize import word_tokenize

       def find_frames_for_sentence(sentence):
           # Reconstructed body (the original was not preserved): collect the frames
           # evoked by lexical units whose names begin with each word in the sentence.
           frames_found = {}
           for w in word_tokenize(sentence.lower()):
               lus = fn.lus(r'(?i)^%s\.' % w)
               if lus:
                   frames_found[w] = sorted({lu.frame.name for lu in lus})
           return frames_found

       # Example sentence
       sentence = "We believe it is a fair and generous price."
       frames_invoked = find_frames_for_sentence(sentence)

       # Display result
       print(f"Sentence: {sentence}")
       print("\nInvoked FrameNet Frames:")
       for word, frames in frames_invoked.items():
           print(f"- {word}: {', '.join(frames)}")
Output:
       Program 19: Write a python program using NLTK package to find the synonyms,
       definitions, examples, and hypernyms of a given word using WordNet
       Code:
       1. Finding Synonyms of a given word:
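       The code itself is not preserved here; a minimal sketch consistent with the
       output below (the word "car" is taken from that output):

       from nltk.corpus import wordnet as wn

       # Print every WordNet sense of "car" with its gloss and example sentences
       for synset in wn.synsets("car"):
           print("Synset:", synset.name())
           print("Definition:", synset.definition())
           print("Examples:", synset.examples())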
                 Output:
                   Synset: car.n.01
              Definition: a motor vehicle with four wheels; usually
              propelled by an internal combustion engine Examples:
              ['he needs a car to get to work']
              Synset: car.n.02
              Definition: a wheeled vehicle adapted to the rails of
              railroad
              Examples: ['three cars had jumped the rails']
              Synset: car.n.03
              Definition: the compartment that is suspended from an
              airship and that carries personnel and the cargo and the
              power plant Examples: []
              Synset: car.n.04
              Definition: where passengers ride up and down
              Examples: ['the car was on the top floor']
              Synset: cable_car.n.01
              Definition: a conveyance for passengers or freight on a
              cable railway
              Examples: ['they took a cable car to the top of the
              mountain']
                print(f"Hypernym: {hypernym.name()}")
Program 20: Write a Python code to generate n-grams using NLTK n-gram Library
       Description:
       An n-gram is a contiguous sequence of n items from a given text. It is a fundamental
       concept in Natural Language Processing (NLP) and is used in many tasks, including
       language modeling, text analysis, speech recognition, and machine translation. n
       represents the number of items (usually words) in the sequence.
Types of N-Grams:
             1. Unigram (1-gram):
                    o   A unigram is simply a single word (or character) from the text.
                    o   Example: "I love NLP"
                           §   Unigrams: ['I', 'love', 'NLP']
             2. Bigram (2-gram):
                    o   A bigram is a sequence of two consecutive words.
                    o   Example: "I love NLP"
                           §   Bigrams: [('I', 'love'), ('love', 'NLP')]
             3. Trigram (3-gram):
                    o   A trigram is a sequence of three consecutive words.
                    o   Example: "I love NLP"
                           §   Trigrams: [('I', 'love', 'NLP')]
             4. Tetragram (4-gram):
                    o   A tetragram is a sequence of four consecutive words.
                    o   Example: "I love NLP very much"
                           §   Tetragrams: [('I', 'love', 'NLP', 'very'), ('love', 'NLP', 'very', 'much')]
             5. Higher-order n-grams (n > 3):
                    o   You can generate n-grams for any n. For instance, you can create
                        5-grams, 6-grams, etc., depending on how much context you want to
                        capture.
                    o   Example: "I love NLP very much"
                           §   5-grams: [('I', 'love', 'NLP', 'very', 'much')]
       Code:
       import nltk
       from nltk.util import ngrams
       from nltk.tokenize import word_tokenize
       nltk.download('punkt')

       def get_ngrams(text, n):
           """Tokenize the text and return its n-grams as tuples."""
           tokens = word_tokenize(text.lower())
           return list(ngrams(tokens, n))

       # Example text
       text1 = "this is a sample text with several words this is another sample text with some different words"
       text = "Sample list of words"

       # Generate uni-grams
       print("List of Unigram")
       for ngram in get_ngrams(text, 1):
           print(ngram)

       # Generate bi-grams
       print("List of Bigrams")
       for ngram in get_ngrams(text, 2):
           print(ngram)

       # Generate tri-grams
       print("List of Trigrams")
       for ngram in get_ngrams(text, 3):
           print(ngram)
       Output:
                 List of Unigram
                 ('sample',)
                 ('list',)
                 ('of',)
                 ('words',)
                 List of Bigrams
                 ('sample', 'list')
                 ('list', 'of')
                 ('of', 'words')
                 List of Trigrams
                 ('sample', 'list', 'of')
                 ('list', 'of', 'words')
       Program 21: Write a python program to train bi-gram model for a given corpus of
       text to predict the next probable word given the previous two words of a
       sentence.
Code:
       corpus = [
           # ... (the earlier corpus sentences are not preserved in this record)
           "My passion is developing real world problem solving applications"
       ]

       # Step 4: Predict the next word given a context (bigram prediction)
       def predict_next_word(model, context):
           if context[-1] in model:
               next_word_probs = model[context[-1]]
               next_word = max(next_word_probs, key=next_word_probs.get)
               return next_word
           else:
               return None
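       Steps 1-3 (tokenization and model construction) are not preserved in this
       record. A minimal sketch, assuming a maximum-likelihood bigram model stored as
       the dict-of-dicts that predict_next_word above expects (build_bigram_model is a
       hypothetical helper name):

       import collections
       from nltk.tokenize import word_tokenize

       def build_bigram_model(corpus):
           # Count bigram transitions, then normalize into P(next | previous)
           counts = collections.defaultdict(collections.Counter)
           for sentence in corpus:
               tokens = word_tokenize(sentence.lower())
               for w1, w2 in zip(tokens, tokens[1:]):
                   counts[w1][w2] += 1
           return {w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
                   for w1, nxt in counts.items()}

       bigram_model = build_bigram_model(corpus)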
       # Predict the next word based on the user input
       context = input("Enter a sentence or context (e.g., 'The bank'): ").lower().split()
       predicted_word = predict_next_word(bigram_model, context)
       print("Predicted next word:", predicted_word)
       Output:
       [nltk_data] Downloading package punkt_tab to /root/nltk_data...
       [nltk_data]   Package punkt_tab is already up-to-date!
       Enter a sentence or context (e.g., 'The bank'): I love
Program 22: Write a python program to train bi-gram model for a given corpus of
text to predict the next probable word given the previous two words of a sentence,
using Laplace (add-one) smoothed bi-gram probabilities.
       Steps are:
             1. Tokenize the text and build a vocabulary.
             2. Generate the n-grams and count their occurrences.
             3. Apply Laplace (add-one) smoothing to estimate the n-gram probabilities.
       Code:
       import collections
       import nltk
       from nltk.util import ngrams
       from nltk.tokenize import word_tokenize
       nltk.download('punkt')

       def count_ngrams(ngrams_list):
           """Count the occurrences of each n-gram."""
           return collections.Counter(ngrams_list)

       def laplace_smoothing(ngram_counts, unigram_counts, vocab_size, n):
           """Apply add-one smoothing: P = (count + 1) / (context_count + vocab_size)."""
           smoothed_probs = {}
           for ngram in ngram_counts:
               context = ngram[:-1]
               context_count = unigram_counts[context] if n > 1 else sum(unigram_counts.values())
               smoothed_probs[ngram] = (ngram_counts[ngram] + 1) / (context_count + vocab_size)
           return smoothed_probs

       def build_vocabulary(text):
           """Build a vocabulary from the given text."""
           tokens = word_tokenize(text.lower())
           return set(tokens)

       # Example text
       text = "this is a sample text with several words this is another sample text with some different words"
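       The glue code that produced the output below is not preserved; a short sketch,
       assuming bigrams and the functions defined above:

       # Count unigrams and bigrams, then print smoothed bigram probabilities
       tokens = word_tokenize(text.lower())
       vocab = build_vocabulary(text)
       unigram_counts = count_ngrams(list(ngrams(tokens, 1)))
       bigram_counts = count_ngrams(list(ngrams(tokens, 2)))
       smoothed = laplace_smoothing(bigram_counts, unigram_counts, len(vocab), n=2)
       for bigram, prob in smoothed.items():
           print(f"{bigram}: {prob:.6f}")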
Output:
       ('is', 'a'): 0.153846
       ('a', 'sample'): 0.166667
       ('sample', 'text'): 0.230769
       ('text', 'with'): 0.230769
       ('with', 'several'): 0.153846
       ('several', 'words'): 0.166667
       ('words', 'this'): 0.153846
       ('is', 'another'): 0.153846
       ('another', 'sample'): 0.166667
       ('with', 'some'): 0.153846
       ('some', 'different'): 0.166667
       ('different', 'words'): 0.166667