Sma U-4
Introduction to Text Summarization, Text extraction, classification and clustering, Anomaly and
Trend Detection, Text Processing, N-gram Frequency Count and Phrase Mining, Page Rank
and Text Rank Algorithm, LDA Topic Modelling, Machine-Learned Classification and Semantic
Topic Tagging, Python libraries for Text Summarization. (NumPy, Pandas, NLTK, Matplotlib).
Automatic Text Summarization is one of the most challenging and interesting problems in the
field of Natural Language Processing (NLP). It is a process of generating a concise and
meaningful summary of text from multiple text resources such as books, news articles, blog
posts, research papers, emails, and tweets.
The demand for automatic text summarization systems is spiking these days thanks to the
availability of large amounts of textual data.
Text summarization is the process of condensing a piece of text while retaining its most
important information. It can be achieved using techniques such as extraction-based
summarization or abstraction-based summarization.
Extraction-based Summarization
The extractive summarization technique involves extracting important words and phrases from the
input text. The underlying idea is to create a summary by selecting the most important
words and sentences from the input text.
It doesn't involve creating new sentences or rephrasing the original content.
Typically, this approach relies on techniques such as ranking sentences based on importance
(using algorithms like TF-IDF or TextRank) and selecting the top-ranked sentences for the
summary.
Example: Suppose you have a news article about a recent scientific discovery. An
extraction-based summary would extract the key sentences from the article that convey the
main findings and essential information without altering the wording. For instance, it might
include sentences like "Scientists have discovered a new species of plant in the Amazon
rainforest" and "The new species exhibits unique characteristics that could have significant
implications for biodiversity conservation."
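A minimal sketch of extraction-based summarization, scoring sentences by word frequency with NLTK; the article text is a made-up placeholder built around the example sentences above:

from collections import Counter
from nltk import sent_tokenize, word_tokenize   # requires nltk.download('punkt')
from nltk.corpus import stopwords                # requires nltk.download('stopwords')

article = ("Scientists have discovered a new species of plant in the Amazon rainforest. "
           "The new species exhibits unique characteristics that could have significant "
           "implications for biodiversity conservation. The team spent two years in the field.")

stop_words = set(stopwords.words("english"))
words = [w.lower() for w in word_tokenize(article)
         if w.isalpha() and w.lower() not in stop_words]
freq = Counter(words)

# Score each sentence by the summed frequency of its words, then keep the top 2
sentences = sent_tokenize(article)
scores = {s: sum(freq.get(w.lower(), 0) for w in word_tokenize(s)) for s in sentences}
summary = sorted(sentences, key=scores.get, reverse=True)[:2]
print(" ".join(summary))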
Abstraction-based Summarization
The abstractive summarization technique involves generating entirely new phrases that
capture the meaning of the input text. The emphasis is on producing a grammatical, fluent
summary, which requires advanced language modeling techniques.
This approach involves understanding the meaning of the text and generating new sentences
that capture the essence of the content while potentially using different words and sentence
structures.
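A minimal sketch of abstractive summarization, assuming the Hugging Face transformers library and a pretrained sequence-to-sequence model are available (neither is part of the unit's library list):

from transformers import pipeline

# Load a pretrained abstractive summarization model (downloaded on first use)
summarizer = pipeline("summarization")

text = ("Scientists have discovered a new species of plant in the Amazon rainforest. "
        "The new species exhibits unique characteristics that could have significant "
        "implications for biodiversity conservation.")

# The model generates new sentences rather than copying them from the input
print(summarizer(text, max_length=40, min_length=10, do_sample=False)[0]["summary_text"])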
Pre-processing and cleaning is an important step because a model built on unclean, messy data
will in turn produce messy results. Standard cleaning techniques such as lowercasing, punctuation
removal, and stopword removal are applied before feeding the data to the model, as sketched below.
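A minimal text-cleaning sketch with NLTK, assuming these three steps (lowercasing, punctuation removal, stopword removal) are the ones applied:

import re
from nltk.corpus import stopwords      # requires nltk.download('stopwords')

stop_words = set(stopwords.words("english"))

def clean_text(text):
    # Lowercase, strip punctuation and digits, then drop stopwords
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_text("The customer service of Rocketz is terrible!"))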
Text Extraction
In recent years there has been a surge in unstructured data in the form of text, videos, audio
and photos. NLU aids in extracting valuable information from text such as social media data,
customer surveys, and complaints.
Consider the text snippet below from a customer review of a fictional insurance company called
Rocketz Auto Insurance Company:
The customer service of Rocketz is terrible. I must call the call center multiple times before I
get a decent reply. The call center guys are extremely rude and totally ignorant. Last month I
called with a request to update my correspondence address from Brooklyn to Manhattan. I
spoke with about a dozen representatives – Lucas Hayes, Ethan Gray, Nora Diaz, Sofia
Parker to name a few. Even after writing multiple emails and filling out numerous forms, the
address has still not been updated. Even my agent John is useless. The policy details he
gave me were wrong. The only good thing about the company is the pricing. The premium is
reasonable compared to the other insurance companies in the United States. There has not
been any significant increase in my premium since 2015.
1. Named Entity Recognition (NER)
Named Entity Recognition identifies and classifies entities such as persons, organizations, and
locations mentioned in the text. Applying NER to the review above gives:
Person: Lucas Hayes, Ethan Gray, Nora Diaz, Sofia Parker, John
Organization: Rocketz
NER is generally based on grammar rules and supervised models. However, there are NLP
platforms such as Apache OpenNLP that provide pre-trained, built-in NER models.
2. Sentiment Analysis
The most widely used technique in NLP is sentiment analysis. Sentiment analysis is most useful
in cases such as customer surveys, reviews and social media comments where people express
their opinions and feedback. The simplest output of sentiment analysis is a 3-point scale:
positive/negative/neutral. In more complex cases the output can be a numeric score that can be
bucketed into as many categories as required.
In the case of our text snippet, the customer clearly expresses different sentiments in various
parts of the text. Because of this, the output is not very useful. Instead, we can find the
sentiment of each sentence and separate out the negative and positive parts of the review.
Sentiment score can also help us pick out the most negative and positive parts of the review:
Most negative comment: The call center guys are extremely rude and totally ignorant.
Most positive comment: The premium is reasonable compared to the other insurance
companies in the United States.
Sentiment Analysis can be done using supervised as well as unsupervised techniques. The
most popular supervised model used for sentiment analysis is naïve Bayes. It requires a training
corpus with sentiment labels, upon which a model is trained which is then used to identify the
sentiment. Naive Bayes is not the only option; other machine learning techniques such as
random forest or gradient boosting can also be used.
The unsupervised techniques also known as the lexicon-based methods require a corpus of
words with their associated sentiment and polarity. The sentiment score of the sentence is
calculated using the polarities of the words in the sentence.
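A minimal sketch of lexicon-based (unsupervised) sentiment scoring with NLTK's VADER lexicon, applied to two sentences taken from the review above:

from nltk.sentiment.vader import SentimentIntensityAnalyzer   # requires nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
sentences = [
    "The call center guys are extremely rude and totally ignorant.",
    "The premium is reasonable compared to the other insurance companies in the United States.",
]

# The compound score runs from -1 (most negative) to +1 (most positive)
for s in sentences:
    print(round(sia.polarity_scores(s)["compound"], 3), s)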
3. Text Summarization
Extraction methods create a summary by extracting parts from the text. Abstraction methods
create a summary by generating fresh text that conveys the crux of the original text. There are
various algorithms that can be used for text summarization like LexRank, TextRank, and Latent
Semantic Analysis. To take the example of LexRank, this algorithm ranks the sentences using
similarity between them. A sentence is ranked higher when it is similar to more sentences, and
these sentences are in turn similar to other sentences.
Using LexRank, the sample text is summarized as: I must call the call center multiple times
before I get a decent reply. The premium is reasonable compared to the other insurance
companies in the United States.
4. Aspect Mining
Aspect mining identifies the different aspects in the text. When used in conjunction with
sentiment analysis, it extracts complete information from the text. One of the easiest methods of
aspect mining is using part-of-speech tagging.
When aspect mining is used along with sentiment analysis on the sample text, the output
conveys the complete intent of the text: for example, the aspect of customer service carries
negative sentiment, while the aspect of pricing carries positive sentiment, as sketched below.
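A minimal sketch of the POS-tagging approach to aspect mining, pairing the nouns in each sentence (candidate aspects) with that sentence's VADER sentiment score; the two sentences come from the review above:

from nltk import pos_tag, word_tokenize          # requires nltk.download('averaged_perceptron_tagger')
from nltk.sentiment.vader import SentimentIntensityAnalyzer   # requires nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
sentences = [
    "The call center guys are extremely rude and totally ignorant.",
    "The premium is reasonable compared to the other insurance companies.",
]

for s in sentences:
    # Nouns (tags starting with NN) serve as candidate aspects for the sentence
    aspects = [w for w, tag in pos_tag(word_tokenize(s)) if tag.startswith("NN")]
    print(aspects, "->", sia.polarity_scores(s)["compound"])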
5. Topic Modeling
Topic modeling is one of the more complicated methods to identify natural topics in the text. A
prime advantage of topic modeling is that it is an unsupervised technique. Model training and a
labeled training dataset are not required.
One of the most popular methods is Latent Dirichlet Allocation (LDA). The premise of LDA is that
each text document comprises several topics and each topic comprises several words. The only
inputs LDA requires are the text documents and the expected number of topics.
Latent Dirichlet allocation (LDA) is a topic modeling technique that is used to discover hidden
topics in text such as long documents or news articles. It does this by representing each
document as a mixture of topics, and each topic is represented as a mixture of words. LDA is an
unsupervised learning algorithm, which means that it does not require labeled training
data. This makes it a powerful tool for quickly discovering hidden structure in data.
LDA allows you to find out what topics are being talked about in a document, and how
often those topics are mentioned. It can also be used to find out what words are associated with
each topic.
Using the sample text and assuming two inherent topics, the topic modeling output will identify
the dominant words for each topic.
For our example, the main theme of topic 1 includes words like call, center, and service, while
the main theme of topic 2 includes words like premium, reasonable, and price. This implies
that topic 1 corresponds to customer service and topic 2 corresponds to pricing.
6. Part of Speech (POS) Tagging using spaCy
spaCy has a POS tagging model that can be used in an NLP pipeline for quick information
extraction. The model is pretrained on a large corpus of text, and it uses that training data to
learn how to POS tag words. spaCy POS tagging also allows for custom training data, which
means that you can train the model to POS tag words in a specific domain such as medical
texts or legal documents. We've used the POS tagging model as a standalone to write entity
extraction rules that enhance the ability of our NER or deep learning models.
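A minimal sketch of spaCy POS tagging, assuming the small English model en_core_web_sm has been downloaded:

import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm
doc = nlp("The policy details he gave me were wrong.")

# Print each token with its coarse POS tag and fine-grained tag
for token in doc:
    print(token.text, token.pos_, token.tag_)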
Scikit-Learn is a machine learning library that can be used for a variety of tasks, including text
classification. It offers a number of different text classification algorithms, and it also allows for
the creation of custom algorithms and pipelines.
Two of the most common text classification algorithms are support vector machines (SVMs)
and naïve Bayes. Both of these algorithms are based on the idea of using a training set of data
to learn the classification rules. The training set is a collection of documents that have been
labeled with the correct class label. For example, in a text classification task, the training set
would be a collection of documents/sentences/paragraphs that have been labeled as a specific
class. The classification algorithm would then learn a relationship between the classes and the
examples that maps the two together.
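A minimal sketch of one of the two algorithms named above (a linear SVM) inside a scikit-learn pipeline, trained on a tiny made-up labeled set:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative training set: documents labeled with their class
train_texts = ["win a free prize now", "limited offer click here",
               "meeting rescheduled to monday", "please review the attached report"]
train_labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features plus a linear SVM classifier in one pipeline
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["click here to claim your free prize"]))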
Classification
Classification is a supervised machine learning task where the goal is to categorize or label
items into predefined classes or categories based on their features. The model is trained using
a dataset that includes both the features of the data points and their corresponding class labels.
Features are the characteristics or attributes of the data points that influence the classification
outcome.
Text classification involves categorizing text documents into predefined topics or categories.
This is achieved using machine learning algorithms such as Naive Bayes, Support Vector
Machines (SVM), or deep learning models like Convolutional Neural Networks (CNNs) or
Recurrent Neural Networks (RNNs).
The goal is to assign a label or category to each document based on its content.
Feature extraction: Extracting relevant features from the text such as word frequencies,
n-grams, or word embeddings.
Model training: Training machine learning models on labeled data to learn the patterns and
relationships between features and categories.
Evaluation: Assessing the performance of the classification model using metrics like accuracy,
precision, recall, and F1-score.
Clustering
Clustering is a technique used to group similar documents together based on their content.
Unlike topic classification, clustering does not require predefined categories and allows for the
discovery of hidden patterns and structures in the data.
Common clustering algorithms include K-means, Hierarchical Clustering, and DBSCAN.
Text representation: Converting text documents into numerical vectors using techniques like
TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings.
Similarity measures: Calculating the similarity between documents using metrics like cosine
similarity or Jaccard similarity.
Clustering algorithms: Applying clustering algorithms to group similar documents together
based on their similarity scores.
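A minimal sketch of the three steps above (TF-IDF representation, cosine-style similarity, K-means clustering) with scikit-learn, on a tiny made-up document set:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the call center service was rude",
        "customer service never answers the phone",
        "the premium price is very reasonable",
        "affordable pricing compared to other insurers"]

# Text representation: TF-IDF vectors (one row per document, L2-normalized)
X = TfidfVectorizer().fit_transform(docs)

# Clustering: K-means groups similar documents; on normalized TF-IDF vectors
# Euclidean K-means approximates grouping by cosine similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)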
Topic classification can be used to identify the main topics or themes present in a collection of
documents.
Clustering can be employed to group similar documents together, making it easier to generate
summaries for each cluster.
By combining these techniques, text summarization systems can generate concise and
informative summaries that capture the key topics and ideas present in the original text.
Anomaly Detection
Anomaly detection, also known as outlier detection, is the process of identifying patterns or
instances that deviate significantly from the norm or expected behavior.
Anomaly detection in text refers to the identification of unusual or unexpected patterns or
instances within a text corpus. These anomalies could be indicative of errors, outliers, or
significant deviations from normal behavior. Detecting anomalies is crucial for maintaining the
quality and accuracy of text summarization systems.
Types of anomalies include:
Outliers: Documents or sentences that significantly differ from the majority of the text in terms
of content or structure.
Novelty: Newly emerging topics or trends that have not been observed before in the dataset.
Noise: Irrelevant or erroneous information that does not contribute to the main themes or ideas
present in the text.
Evaluation:
Evaluation Metrics: Metrics such as precision, recall, F1-score, or Area Under the ROC Curve
(AUC-ROC) can be used to assess the performance of anomaly detection models.
Cross-Validation: Employing techniques like k-fold cross-validation to evaluate the
generalization ability of the models on unseen data.
Domain-Specific Evaluation: Considering domain-specific factors and requirements when
evaluating the effectiveness of anomaly detection models in real-world applications.
Applications:
Quality Control: Identifying low-quality or spammy documents that may degrade the overall
quality of the summarization output.
Trend Detection: Detecting emerging topics or trends that may be of interest to users or
stakeholders.
Error Detection: Flagging erroneous or misleading information that could mislead readers or
affect decision-making processes.
Fraud Detection: Anomaly detection in textual data can help in identifying fraudulent activities,
such as fake reviews, spam emails, or deceptive content.
Security: Detection of anomalous patterns in communication data can assist in cybersecurity by
flagging suspicious activities or potential threats.
Healthcare: Anomaly detection in medical reports or patient records can aid in identifying
unusual symptoms or patterns, leading to early detection of diseases or medical conditions.
Finance: Analyzing anomalies in financial documents or transaction records can help in
detecting financial fraud or irregularities.
Challenges:
Data Sparsity: Dealing with sparse or imbalanced datasets where anomalies are rare
occurrences.
Interpretability: Ensuring that the detected anomalies are interpretable and actionable for
users or domain experts.
Scalability: Scaling anomaly detection techniques to handle large volumes of text data
efficiently without sacrificing performance.
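A minimal sketch of one way to flag outlier documents, using TF-IDF vectors with scikit-learn's IsolationForest; the documents and contamination level are illustrative assumptions:

from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the claim was processed quickly",
        "my premium increased slightly this year",
        "the agent updated my address without issues",
        "BUY CHEAP WATCHES NOW CLICK THIS LINK!!!"]   # likely spam/outlier

X = TfidfVectorizer().fit_transform(docs).toarray()

# IsolationForest labels inliers as 1 and anomalies as -1
detector = IsolationForest(contamination=0.25, random_state=0).fit(X)
print(detector.predict(X))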
Trend Detection
Trend detection in text summarization involves identifying significant patterns or changes in the
content of a text corpus over time. Understanding trends is essential for making informed
decisions, tracking developments, and staying updated with evolving topics or events.
Techniques:
Topic Modeling: Topic modeling algorithms like Latent Dirichlet Allocation (LDA) or Dynamic
Topic Models (DTM) can uncover latent topics within a corpus and track their evolution over
time, revealing emerging trends.
Sentiment Analysis: Sentiment analysis can be employed to analyze the sentiment associated
with textual content over time, helping in detecting shifts in public opinion or sentiment trends.
Word Embeddings: Word embedding techniques like Word2Vec or GloVe can capture
semantic relationships between words and phrases, enabling the detection of trending topics
based on changes in word usage patterns.
Applications:
News Analysis: Trend detection in news articles can help in identifying emerging topics,
tracking public interest, or monitoring the spread of information over time.
Social Media Monitoring: Analyzing trends in social media content allows businesses and
organizations to understand user behavior, identify popular topics, or detect viral content.
Financial Markets: Trend detection in financial news or social media discussions can assist
investors in making investment decisions, predicting market trends, or assessing market
sentiment.
Epidemiology: Monitoring trends in health-related text data can aid in disease surveillance,
outbreak detection, or tracking public health concerns.
Challenges:
Data Volume: Handling large volumes of textual data and processing it in real-time poses
challenges for trend detection systems, requiring scalable algorithms and efficient data
processing techniques.
Data Noise: Noisy or irrelevant data can affect the accuracy of trend detection algorithms,
necessitating preprocessing steps to filter out noise and extract relevant information.
Data Representation: Choosing appropriate representations for textual data, such as word
embeddings or topic models, can impact the effectiveness of trend detection methods.
Interpretability: Interpreting detected trends and understanding the underlying factors driving
them requires domain knowledge and context-specific analysis.
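A minimal sketch of simple frequency-based trend detection with pandas, counting how often a term is mentioned per month in a tiny made-up set of timestamped documents:

import pandas as pd

# Illustrative timestamped documents
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03",
                            "2024-02-11", "2024-02-25", "2024-03-02"]),
    "text": ["service complaint", "premium question", "premium increase",
             "premium increase again", "another premium increase", "service praise"],
})

# Count monthly mentions of the term "premium"; a rising count suggests an emerging trend
df["mentions"] = df["text"].str.lower().str.count("premium")
trend = df.set_index("date")["mentions"].resample("M").sum()
print(trend)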
The PageRank algorithm, developed by Larry Page and Sergey Brin, is a key algorithm used by
search engines to rank web pages in search results. It assigns a numerical weight to each
element of a hyperlinked set of documents, with the purpose of measuring its relative
importance within the set. PageRank forms the basis of many web ranking algorithms and has
also been adapted for use in text summarization.
Suppose we have 4 web pages: w1, w2, w3, and w4. These pages contain links pointing to
one another. Some pages might have no outgoing links; these are called dangling pages.
In order to rank these pages, we would have to compute a score called the PageRank score.
This score is the probability of a user visiting that page.
To capture the probabilities of users navigating from one page to another, we will create a
square matrix M, having n rows and n columns, where n is the number of web pages.
Each element of this matrix denotes the probability of a user transitioning from one web page to
another. For example, the cell in row w1 and column w2 contains the probability of a transition
from w1 to w2.
Finally, the values in this matrix will be updated in an iterative fashion to arrive at the web page
rankings.
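A minimal sketch of this iterative update (the power-iteration form of PageRank), assuming the 4-page example above and an illustrative transition matrix M:

import numpy as np

# Illustrative transition matrix for pages w1..w4 (rows sum to 1);
# M[i][j] is the assumed probability of moving from page i to page j.
M = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
])

d = 0.85                      # damping factor
n = M.shape[0]
ranks = np.full(n, 1.0 / n)   # start with equal PageRank for every page

# Iteratively update the scores until they stop changing noticeably
for _ in range(100):
    new_ranks = (1 - d) / n + d * M.T.dot(ranks)
    if np.allclose(new_ranks, ranks, atol=1e-8):
        break
    ranks = new_ranks

print(dict(zip(["w1", "w2", "w3", "w4"], ranks.round(3))))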
Applications
Web Search: PageRank forms the foundation of Google's search algorithm, influencing the
ranking of search results based on the importance of web pages.
Text Summarization: PageRank has been adapted for use in text summarization algorithms,
where it helps in identifying the most important sentences or phrases in a document based on
their connectivity and relevance to other parts of the text.
Network Analysis: PageRank is used in network analysis tasks beyond the web, such as
ranking scientific papers, social network analysis, or citation analysis.
TextRank Algorithm
Let’s understand the TextRank algorithm, now that we have a grasp on PageRank. The
similarities between the two algorithms are:
● In place of web pages, we use sentences.
● The similarity between any two sentences is used in place of the web page transition probability.
● The similarity scores are stored in a square matrix, similar to the matrix M used for PageRank.
Text Preprocessing
Get rid of the stopwords (commonly used words of a language – is, am, the, of, in, etc.)
present in the sentences. If you have not downloaded nltk-stopwords, then execute the following
line of code:
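import nltk
nltk.download('stopwords')   # one-time download of the NLTK stopword list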
We will use clean_sentences to create vectors for sentences in our data with the help of the
GloVe word vectors.
Vector Representation of Sentences
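A minimal sketch of how the sentence vectors can be built, assuming the GloVe embeddings have been loaded into a dictionary named word_embeddings (mapping each word to a 100-dimensional vector) and the cleaned sentences are stored in clean_sentences; both names are assumptions for illustration:

import numpy as np

# Build a fixed-length vector for each sentence by averaging the GloVe
# vectors of its words; unseen words default to a zero vector.
sentence_vectors = []
for sentence in clean_sentences:
    words = sentence.split()
    if words:
        v = sum(word_embeddings.get(w, np.zeros((100,))) for w in words) / (len(words) + 0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)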
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# Similarity matrix: use cosine similarity to compute the similarity
# between every pair of sentences.
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1, 100),
                                              sentence_vectors[j].reshape(1, 100))[0, 0]

# Applying the PageRank algorithm: convert the similarity matrix sim_mat
# into a graph and score each sentence (node) with PageRank.
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

# Summary extraction: pick the top N sentences based on their rankings.
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
N = 2
for score, sentence in ranked_sentences[:N]:
    print(sentence)
N-Grams
N-grams are contiguous sequences of n items from a given sample of text or speech. The items
can be phonemes, syllables, letters, words, or base pairs according to the application. N-grams
are used in various areas of computational linguistics and text analysis. They are a simple and
effective method for text mining and natural language processing (NLP) tasks, such as text
prediction, spelling correction, language modeling, and text classification.
Consider the sentence "The quick brown fox jumps over the lazy dog."
Unigrams (N=1): "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"
Bigrams (N=2): "The quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the
lazy", "lazy dog"
Trigrams (N=3): "The quick brown", "quick brown fox", "brown fox jumps", "fox jumps over",
"jumps over the", "over the lazy", "the lazy dog"
As you can see, unigrams do not contain any context, bigrams contain a minimal context, and
trigrams start to form more coherent and contextually relevant phrases.
Applications of N-Grams
N-grams are widely used in various NLP tasks. Here are a few examples:
1. Language Modeling:
N-grams can be used to predict the next item in a sequence, making them useful for language
models in speech recognition, typing prediction, and other generative tasks.
2. Text Classification:
They can serve as features for algorithms that classify documents into categories, such as spam
filters or sentiment analysis.
3. Machine Translation:
N-grams help in statistical machine translation systems by providing probabilities of sequences
of words appearing together.
4. Spell Checking and Correction:
They can be used to suggest corrections for misspelled words based on the context provided by
surrounding words.
5. Information Retrieval:
Search engines use n-grams to index texts and provide search results based on the likelihood
of n-gram sequences.
N-gram frequency count:
An n-gram frequency count tallies how often each contiguous fragment of a given length occurs
in a text. The most commonly used is the bigram count, using fragments of length 2.
For the example sentence, the bigrams are:
"The quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the lazy", "lazy dog"
Number of bigrams (fragments of length 2) = 8; each bigram occurs once in this sentence.
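A minimal sketch of an n-gram frequency count with NLTK and Python's collections.Counter, using the example sentence above:

from collections import Counter
from nltk import ngrams, word_tokenize   # requires nltk.download('punkt')

text = "The quick brown fox jumps over the lazy dog."
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

# Count how often each bigram (n = 2) occurs in the text
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(5))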
Phrase Mining
Phrase mining is a natural language processing (NLP) technique that involves extracting
meaningful phrases or multi-word expressions from a corpus of text data. Instead of focusing
solely on individual words, phrase mining aims to identify combinations of words that convey a
specific meaning or represent a single concept. This can be particularly useful in tasks such as
text summarization, sentiment analysis, information retrieval, and document categorization.
1. Tokenization: The first step in phrase mining involves breaking down the text into its
constituent units, typically words or tokens. This process may also involve removing punctuation
marks, special characters, and stopwords (commonly occurring words like "the", "and", "is",
etc.).
2. Candidate Phrase Extraction: Once the text is tokenized, candidate phrases are extracted
from the text based on certain criteria. These criteria could include:
● Frequency-based: Phrases that occur frequently in the text corpus are identified as
potential candidates.
● Part-of-Speech Patterns: Phrases that follow specific grammatical patterns, such as noun
phrases or verb phrases, are extracted.
● Dependency Parsing: Phrases that are connected by syntactic dependencies in the
sentence structure are considered as candidate phrases.
3. Scoring and Ranking: After extracting candidate phrases, they are scored based on various
metrics to determine their significance. Common metrics used for scoring include:
● Frequency: Phrases that occur more frequently in the text are often considered more
important.
● Pointwise Mutual Information (PMI): Measures the association between words in a
phrase, indicating how likely they occur together compared to their individual probabilities.
● Tf-idf (Term Frequency-Inverse Document Frequency): Weighs the importance of a
phrase in a document relative to its frequency across the entire corpus.
4. Filtering and Pruning: To improve the quality of extracted phrases, certain filtering and
pruning techniques may be applied to remove noise and irrelevant phrases. This could involve:
● Length-based Filtering: Removing phrases that are too short or too long.
● Stopword Removal: Eliminating phrases containing stopwords or common words.
● Domain-specific Filtering: Removing phrases that are not relevant to the specific domain
or context of the text corpus.
5. Post-processing: Finally, post-processing steps may be performed to refine the extracted
phrases further. This could include:
● Merging: Combining related phrases or hyphenated words into single phrases.
● Normalization: Converting phrases into a standard format or canonical representation.
● Contextual Analysis: Considering the context in which phrases occur to disambiguate and
refine their meaning.
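As one illustration of the candidate extraction, frequency filtering, and PMI scoring steps above, here is a minimal sketch using NLTK's bigram collocation finder; the corpus string is a made-up placeholder:

from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

corpus = ("text summarization systems extract key phrases; "
          "phrase mining finds multi word expressions such as customer service "
          "and customer service quality in large text collections")
tokens = [t.lower() for t in word_tokenize(corpus) if t.isalpha()]

# Find candidate two-word phrases and score them
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)                     # frequency-based pruning
measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 5))            # top candidates by PMI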
Latent Dirichlet Allocation, a statistical and graphical model, is used to find the latent word-distribution
connections between the documents in a corpus. The Variational Expectation Maximization
(VEM) technique is used to obtain the maximum likelihood estimate from the full corpus of text.
Important Libraries in Topic Modeling Project
● Gensim
● NLTK
● Matplotlib
● Scikit-learn
● Pandas
Example:
# Importing necessary libraries
import gensim
from gensim import corpora
from pprint import pprint

# Sample documents
documents = [
    "Machine learning is the future of technology",
    "Natural language processing is a key component of AI",
    "Data science is an interdisciplinary field",
    "Artificial intelligence has many applications",
    "Deep learning is a subset of machine learning"
]

# Tokenize, build a dictionary, and convert each document to a bag-of-words vector
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA with a chosen number of topics (here 2) and print the discovered topics
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary,
                                   num_topics=2, passes=10, random_state=42)
pprint(lda_model.print_topics())
Machine-Learned Classification
Text classification is the task of assigning predefined categories or labels to textual documents
automatically.
Machine-learned classification involves training algorithms to learn patterns in text data and
make predictions based on those patterns.
1. Data Preparation:
● Collect and preprocess the text data.
● Preprocessing includes tasks such as tokenization, removing stopwords, and
stemming/lemmatization.
2. Feature Extraction:
● Convert text documents into numerical feature vectors.
● Common techniques include bag-of-words, TF-IDF (Term Frequency-Inverse Document
Frequency), and word embeddings.
3. Model Training:
● Select a classification algorithm such as Naive Bayes, Support Vector Machines (SVM),
or Neural Networks.
● Split the dataset into training and testing sets.
● Train the model using the training data.
4. Model Evaluation:
● Evaluate the trained model's performance using metrics such as accuracy, precision,
recall, and F1-score on the testing set.
● Adjust hyperparameters if necessary to improve performance.
5. Prediction:
● Use the trained model to predict the class labels of new, unseen text documents.
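A minimal end-to-end sketch of these five steps with scikit-learn, using a tiny made-up labeled dataset:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# 1-2. Tiny illustrative dataset and TF-IDF feature extraction
texts = ["the premium is reasonable", "great pricing and service",
         "the call center is rude", "terrible customer service",
         "pricing is fair and affordable", "support was unhelpful and slow"]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 3. Model training
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

# 4. Model evaluation
print(classification_report(y_test, clf.predict(X_test_vec)))

# 5. Prediction on new, unseen text
print(clf.predict(vectorizer.transform(["the agent was very rude"])))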
Applications: spam filtering, sentiment analysis, and semantic topic tagging (discussed next).
Semantic Topic Tagging
Semantic topic tagging aims to assign meaningful semantic labels or tags to text documents
based on their content.
1. Keyword-based Tagging:
● Assign tags to documents based on the presence of specific keywords or phrases.
● Keywords are manually selected or extracted from the document.
2. Topic Modeling:
● Utilize unsupervised machine learning techniques such as Latent Dirichlet Allocation
(LDA) to discover latent topics in a corpus of documents.
● Each topic is represented by a distribution over words, and documents are assigned
to topics based on their word distributions.
3. Named Entity Recognition (NER):
● Identify and classify named entities such as persons, organizations, locations, etc.,
mentioned in the text.
● Named entities can serve as semantic tags for the document.
Challenges:
Applications
Example
import spacy
# Load the small English model (requires: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# Sample text
text = ("Apple Inc. is a technology company headquartered in Cupertino, "
        "California. Tim Cook is the CEO of Apple.")
# Run the pipeline and print the recognized entities
doc = nlp(text)
print("Named Entities:")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
Output:
Named Entities:
Entity: Apple Inc., Label: ORG
Entity: Cupertino, Label: GPE
Entity: California, Label: GPE
Entity: Tim Cook, Label: PERSON
Entity: Apple, Label: ORG