Sma U-4
Introduction to Text Summarization, Text extraction, classification and clustering, Anomaly and
Trend Detection, Text Processing, N-gram Frequency Count and Phrase Mining, Page Rank
and Text Rank Algorithm, LDA Topic Modelling, Machine-Learned Classification and Semantic
Topic Tagging, Python libraries for Text Summarization. (NumPy, Pandas, NLTK, Matplotlib).
Automatic Text Summarization is one of the most challenging and interesting problems in the
field of Natural Language Processing (NLP). It is a process of generating a concise and
meaningful summary of text from multiple text resources such as books, news articles, blog
posts, research papers, emails, and tweets.
The demand for automatic text summarization systems is spiking these days thanks to the
availability of large amounts of textual data.
Text summarization is the process of condensing a piece of text while retaining its most
important information. It can be achieved using techniques such as extraction-based
summarization or abstraction-based summarization.
Extraction-based Summarization
The extractive summarization technique involves extracting important words and phrases from the
input text. The underlying idea is to create a summary by selecting the most important
words and sentences from the input text.
It doesn't involve creating new sentences or rephrasing the original content.
Typically, this approach relies on techniques such as ranking sentences based on importance
(using algorithms like TF-IDF or TextRank) and selecting the top-ranked sentences for the
summary.
Example: Suppose you have a news article about a recent scientific discovery. An
extraction-based summary would extract the key sentences from the article that convey the
main findings and essential information without altering the wording. For instance, it might
include sentences like "Scientists have discovered a new species of plant in the Amazon
rainforest" and "The new species exhibits unique characteristics that could have significant
implications for biodiversity conservation."
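A minimal sketch of extraction-based summarization, scoring sentences by word frequency with NLTK; the article text is a made-up placeholder built around the example sentences above:

from collections import Counter
from nltk import sent_tokenize, word_tokenize   # requires nltk.download('punkt')
from nltk.corpus import stopwords                # requires nltk.download('stopwords')

article = ("Scientists have discovered a new species of plant in the Amazon rainforest. "
           "The new species exhibits unique characteristics that could have significant "
           "implications for biodiversity conservation. The team spent two years in the field.")

stop_words = set(stopwords.words("english"))
words = [w.lower() for w in word_tokenize(article)
         if w.isalpha() and w.lower() not in stop_words]
freq = Counter(words)

# Score each sentence by the summed frequency of its words, then keep the top 2
sentences = sent_tokenize(article)
scores = {s: sum(freq.get(w.lower(), 0) for w in word_tokenize(s)) for s in sentences}
summary = sorted(sentences, key=scores.get, reverse=True)[:2]
print(" ".join(summary))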
Abstraction-based Summarization
The abstractive summarization technique involves generating entirely new phrases that
capture the meaning of the input text. The emphasis is on producing a grammatical, fluent
summary, which requires advanced language modeling techniques.
This approach involves understanding the meaning of the text and generating new sentences
that capture the essence of the content while potentially using different words and sentence
structures.
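A minimal sketch of abstractive summarization, assuming the Hugging Face transformers library and a pretrained sequence-to-sequence model are available (neither is part of the unit's library list):

from transformers import pipeline

# Load a pretrained abstractive summarization model (downloaded on first use)
summarizer = pipeline("summarization")

text = ("Scientists have discovered a new species of plant in the Amazon rainforest. "
        "The new species exhibits unique characteristics that could have significant "
        "implications for biodiversity conservation.")

# The model generates new sentences rather than copying them from the input
print(summarizer(text, max_length=40, min_length=10, do_sample=False)[0]["summary_text"])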
Pre-processing and cleaning is an important step because a model built on unclean, messy data
will in turn produce messy results. Standard cleaning techniques such as lowercasing, punctuation
removal, and stopword removal are applied before feeding the data to the model, as sketched below.
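A minimal text-cleaning sketch with NLTK, assuming these three steps (lowercasing, punctuation removal, stopword removal) are the ones applied:

import re
from nltk.corpus import stopwords      # requires nltk.download('stopwords')

stop_words = set(stopwords.words("english"))

def clean_text(text):
    # Lowercase, strip punctuation and digits, then drop stopwords
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_text("The customer service of Rocketz is terrible!"))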
Text Extraction
In recent years there has been a surge in unstructured data in the form of text, videos, audio
and photos. NLU aids in extracting valuable information from text such as social media data,
customer surveys, and complaints.
Consider the text snippet below from a customer review of a fictional insurance company called
Rocketz Auto Insurance Company:
The customer service of Rocketz is terrible. I must call the call center multiple times before I
get a decent reply. The call center guys are extremely rude and totally ignorant. Last month I
called with a request to update my correspondence address from Brooklyn to Manhattan. I
spoke with about a dozen representatives – Lucas Hayes, Ethan Gray, Nora Diaz, Sofia
Parker to name a few. Even after writing multiple emails and filling out numerous forms, the
address has still not been updated. Even my agent John is useless. The policy details he
gave me were wrong. The only good thing about the company is the pricing. The premium is
reasonable compared to the other insurance companies in the United States. There has not
been any significant increase in my premium since 2015.
1. Named Entity Recognition (NER)
Named Entity Recognition identifies and classifies entities such as persons, organizations, and
locations mentioned in the text. Applying NER to the review above gives:
Person: Lucas Hayes, Ethan Gray, Nora Diaz, Sofia Parker, John
Organization: Rocketz
NER is generally based on grammar rules and supervised models. However, there are NLP
platforms such as Apache OpenNLP that provide pre-trained, built-in NER models.
2. Sentiment Analysis
The most widely used technique in NLP is sentiment analysis. Sentiment analysis is most useful
in cases such as customer surveys, reviews and social media comments where people express
their opinions and feedback. The simplest output of sentiment analysis is a 3-point scale:
positive/negative/neutral. In more complex cases the output can be a numeric score that can be
bucketed into as many categories as required.
In the case of our text snippet, the customer clearly expresses different sentiments in various
parts of the text. Because of this, the output is not very useful. Instead, we can find the
sentiment of each sentence and separate out the negative and positive parts of the review.
Sentiment score can also help us pick out the most negative and positive parts of the review:
Most negative comment: The call center guys are extremely rude and totally ignorant.
Most positive comment: The premium is reasonable compared to the other insurance
companies in the United States.
Sentiment Analysis can be done using supervised as well as unsupervised techniques. The
most popular supervised model used for sentiment analysis is naïve Bayes. It requires a training
corpus with sentiment labels, upon which a model is trained which is then used to identify the
sentiment. Naive Bayes is not the only option; other machine learning techniques such as
random forest or gradient boosting can also be used.
The unsupervised techniques also known as the lexicon-based methods require a corpus of
words with their associated sentiment and polarity. The sentiment score of the sentence is
calculated using the polarities of the words in the sentence.
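A minimal sketch of lexicon-based (unsupervised) sentiment scoring with NLTK's VADER lexicon, applied to two sentences taken from the review above:

from nltk.sentiment.vader import SentimentIntensityAnalyzer   # requires nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
sentences = [
    "The call center guys are extremely rude and totally ignorant.",
    "The premium is reasonable compared to the other insurance companies in the United States.",
]

# The compound score runs from -1 (most negative) to +1 (most positive)
for s in sentences:
    print(round(sia.polarity_scores(s)["compound"], 3), s)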
3. Text Summarization
Extraction methods create a summary by extracting parts from the text. Abstraction methods
create a summary by generating fresh text that conveys the crux of the original text. There are
various algorithms that can be used for text summarization like LexRank, TextRank, and Latent
Semantic Analysis. To take the example of LexRank, this algorithm ranks the sentences using
similarity between them. A sentence is ranked higher when it is similar to more sentences, and
these sentences are in turn similar to other sentences.
Using LexRank, the sample text is summarized as: I must call the call center multiple times
before I get a decent reply. The premium is reasonable compared to the other insurance
companies in the United States.
4. Aspect Mining
Aspect mining identifies the different aspects in the text. When used in conjunction with
sentiment analysis, it extracts complete information from the text. One of the easiest methods of
aspect mining is using part-of-speech tagging.
When aspect mining is used along with sentiment analysis on the sample text, the output
conveys the complete intent of the text: for example, the aspect of customer service carries
negative sentiment, while the aspect of pricing carries positive sentiment, as sketched below.
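A minimal sketch of the POS-tagging approach to aspect mining, pairing the nouns in each sentence (candidate aspects) with that sentence's VADER sentiment score; the two sentences come from the review above:

from nltk import pos_tag, word_tokenize          # requires nltk.download('averaged_perceptron_tagger')
from nltk.sentiment.vader import SentimentIntensityAnalyzer   # requires nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
sentences = [
    "The call center guys are extremely rude and totally ignorant.",
    "The premium is reasonable compared to the other insurance companies.",
]

for s in sentences:
    # Nouns (tags starting with NN) serve as candidate aspects for the sentence
    aspects = [w for w, tag in pos_tag(word_tokenize(s)) if tag.startswith("NN")]
    print(aspects, "->", sia.polarity_scores(s)["compound"])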
5. Topic Modeling
Topic modeling is one of the more complicated methods to identify natural topics in the text. A
prime advantage of topic modeling is that it is an unsupervised technique. Model training and a
labeled training dataset are not required.
One of the most popular methods is Latent Dirichlet Allocation (LDA). The premise of LDA is that
each text document comprises several topics and each topic comprises several words. The only
inputs LDA requires are the text documents and the expected number of topics.
Latent Dirichlet allocation (LDA) is a topic modeling technique that is used to discover hidden
topics in text such as long documents or news articles. It does this by representing each
document as a mixture of topics, and each topic is represented as a mixture of words. LDA is an
unsupervised learning algorithm, which means that it does not require labeled training
data. This makes it a powerful tool for quickly discovering hidden structure in data.
LDA allows you to find out what topics are being talked about in a document, and how
often those topics are mentioned. It can also be used to find out what words are associated with
each topic.
Using the sample text and assuming two inherent topics, the topic modeling output will identify
the dominant words for each topic.
For our example, the main theme of topic 1 includes words like call, center, and service, while
the main theme of topic 2 includes words like premium, reasonable, and price. This implies
that topic 1 corresponds to customer service and topic 2 corresponds to pricing.
6. Part of Speech (POS) Tagging using spaCy
spaCy has a POS tagging model that can be used in an NLP pipeline for quick information
extraction. The model is pretrained on a large corpus of text, and it uses that training data to
learn how to POS tag words. spaCy POS tagging also allows for custom training data, which
means that you can train the model to POS tag words in a specific domain such as medical
texts or legal documents. We've used the POS tagging model as a standalone to write entity
extraction rules that enhance the ability of our NER or deep learning models.
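A minimal sketch of spaCy POS tagging, assuming the small English model en_core_web_sm has been downloaded:

import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm
doc = nlp("The policy details he gave me were wrong.")

# Print each token with its coarse POS tag and fine-grained tag
for token in doc:
    print(token.text, token.pos_, token.tag_)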
Scikit-Learn is a machine learning library that can be used for a variety of tasks, including text
classification. It offers a number of different text classification algorithms, and it also allows for
the creation of custom algorithms and pipelines.
Two of the most common text classification algorithms are support vector machines (SVMs)
and naïve Bayes. Both of these algorithms are based on the idea of using a training set of data
to learn the classification rules. The training set is a collection of documents that have been
labeled with the correct class label. For example, in a text classification task, the training set
would be a collection of documents/sentences/paragraphs that have been labeled as a specific
class. The classification algorithm would then learn a relationship between the classes and the
examples that maps the two together.
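A minimal sketch of one of the two algorithms named above (a linear SVM) inside a scikit-learn pipeline, trained on a tiny made-up labeled set:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative training set: documents labeled with their class
train_texts = ["win a free prize now", "limited offer click here",
               "meeting rescheduled to monday", "please review the attached report"]
train_labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features plus a linear SVM classifier in one pipeline
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["click here to claim your free prize"]))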
Classification
Classification is a supervised machine learning task where the goal is to categorize or label
items into predefined classes or categories based on their features. The model is trained using
a dataset that includes both the features of the data points and their corresponding class labels.
Features are the characteristics or attributes of the data points that influence the classification
outcome.
Text classification involves categorizing text documents into predefined topics or categories.
This is achieved using machine learning algorithms such as Naive Bayes, Support Vector
Machines (SVM), or deep learning models like Convolutional Neural Networks (CNNs) or
Recurrent Neural Networks (RNNs).
The goal is to assign a label or category to each document based on its content.
Feature extraction: Extracting relevant features from the text such as word frequencies,
n-grams, or word embeddings.
Model training: Training machine learning models on labeled data to learn the patterns and
relationships between features and categories.
Evaluation: Assessing the performance of the classification model using metrics like accuracy,
precision, recall, and F1-score.
Clustering
Clustering is a technique used to group similar documents together based on their content.
Unlike topic classification, clustering does not require predefined categories and allows for the
discovery of hidden patterns and structures in the data.
Common clustering algorithms include K-means, Hierarchical Clustering, and DBSCAN.
Text representation: Converting text documents into numerical vectors using techniques like
TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings.
Similarity measures: Calculating the similarity between documents using metrics like cosine
similarity or Jaccard similarity.
Clustering algorithms: Applying clustering algorithms to group similar documents together
based on their similarity scores.
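A minimal sketch of the three steps above (TF-IDF representation, cosine-style similarity, K-means clustering) with scikit-learn, on a tiny made-up document set:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the call center service was rude",
        "customer service never answers the phone",
        "the premium price is very reasonable",
        "affordable pricing compared to other insurers"]

# Text representation: TF-IDF vectors (one row per document, L2-normalized)
X = TfidfVectorizer().fit_transform(docs)

# Clustering: K-means groups similar documents; on normalized TF-IDF vectors
# Euclidean K-means approximates grouping by cosine similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)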
Topic classification can be used to identify the main topics or themes present in a collection of
documents.
Clustering can be employed to group similar documents together, making it easier to generate
summaries for each cluster.
By combining these techniques, text summarization systems can generate concise and
informative summaries that capture the key topics and ideas present in the original text.
Anomaly Detection
Anomaly detection, also known as outlier detection, is the process of identifying patterns or
instances that deviate significantly from the norm or expected behavior.
Anomaly detection in text refers to the identification of unusual or unexpected patterns or
instances within a text corpus. These anomalies could be indicative of errors, outliers, or
significant deviations from normal behavior. Detecting anomalies is crucial for maintaining the
quality and accuracy of text summarization systems.
Types of anomalies include:
Outliers: Documents or sentences that significantly differ from the majority of the text in terms
of content or structure.
Novelty: Newly emerging topics or trends that have not been observed before in the dataset.
Noise: Irrelevant or erroneous information that does not contribute to the main themes or ideas
present in the text.
Evaluation:
Evaluation Metrics: Metrics such as precision, recall, F1-score, or Area Under the ROC Curve
(AUC-ROC) can be used to assess the performance of anomaly detection models.
Cross-Validation: Employing techniques like k-fold cross-validation to evaluate the
generalization ability of the models on unseen data.
Domain-Specific Evaluation: Considering domain-specific factors and requirements when
evaluating the effectiveness of anomaly detection models in real-world applications.
Applications:
Quality Control: Identifying low-quality or spammy documents that may degrade the overall
quality of the summarization output.
Trend Detection: Detecting emerging topics or trends that may be of interest to users or
stakeholders.
Error Detection: Flagging erroneous or misleading information that could mislead readers or
affect decision-making processes.
Fraud Detection: Anomaly detection in textual data can help in identifying fraudulent activities,
such as fake reviews, spam emails, or deceptive content.
Security: Detection of anomalous patterns in communication data can assist in cybersecurity by
flagging suspicious activities or potential threats.
Healthcare: Anomaly detection in medical reports or patient records can aid in identifying
unusual symptoms or patterns, leading to early detection of diseases or medical conditions.
Finance: Analyzing anomalies in financial documents or transaction records can help in
detecting financial fraud or irregularities.
Challenges:
Data Sparsity: Dealing with sparse or imbalanced datasets where anomalies are rare
occurrences.
Interpretability: Ensuring that the detected anomalies are interpretable and actionable for
users or domain experts.
Scalability: Scaling anomaly detection techniques to handle large volumes of text data
efficiently without sacrificing performance.
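A minimal sketch of one way to flag outlier documents, using TF-IDF vectors with scikit-learn's IsolationForest; the documents and contamination level are illustrative assumptions:

from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the claim was processed quickly",
        "my premium increased slightly this year",
        "the agent updated my address without issues",
        "BUY CHEAP WATCHES NOW CLICK THIS LINK!!!"]   # likely spam/outlier

X = TfidfVectorizer().fit_transform(docs).toarray()

# IsolationForest labels inliers as 1 and anomalies as -1
detector = IsolationForest(contamination=0.25, random_state=0).fit(X)
print(detector.predict(X))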
Trend Detection
Trend detection in text summarization involves identifying significant patterns or changes in the
content of a text corpus over time. Understanding trends is essential for making informed
decisions, tracking developments, and staying updated with evolving topics or events.
Techniques:
Topic Modeling: Topic modeling algorithms like Latent Dirichlet Allocation (LDA) or Dynamic
Topic Models (DTM) can uncover latent topics within a corpus and track their evolution over
time, revealing emerging trends.
Sentiment Analysis: Sentiment analysis can be employed to analyze the sentiment associated
with textual content over time, helping in detecting shifts in public opinion or sentiment trends.
Word Embeddings: Word embedding techniques like Word2Vec or GloVe can capture
semantic relationships between words and phrases, enabling the detection of trending topics
based on changes in word usage patterns.
Applications:
News Analysis: Trend detection in news articles can help in identifying emerging topics,
tracking public interest, or monitoring the spread of information over time.
Social Media Monitoring: Analyzing trends in social media content allows businesses and
organizations to understand user behavior, identify popular topics, or detect viral content.
Financial Markets: Trend detection in financial news or social media discussions can assist
investors in making investment decisions, predicting market trends, or assessing market
sentiment.
Epidemiology: Monitoring trends in health-related text data can aid in disease surveillance,
outbreak detection, or tracking public health concerns.
Challenges:
Data Volume: Handling large volumes of textual data and processing it in real-time poses
challenges for trend detection systems, requiring scalable algorithms and efficient data
processing techniques.
Data Noise: Noisy or irrelevant data can affect the accuracy of trend detection algorithms,
necessitating preprocessing steps to filter out noise and extract relevant information.
Data Representation: Choosing appropriate representations for textual data, such as word
embeddings or topic models, can impact the effectiveness of trend detection methods.
Interpretability: Interpreting detected trends and understanding the underlying factors driving
them requires domain knowledge and context-specific analysis.
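A minimal sketch of simple frequency-based trend detection with pandas, counting how often a term is mentioned per month in a tiny made-up set of timestamped documents:

import pandas as pd

# Illustrative timestamped documents
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03",
                            "2024-02-11", "2024-02-25", "2024-03-02"]),
    "text": ["service complaint", "premium question", "premium increase",
             "premium increase again", "another premium increase", "service praise"],
})

# Count monthly mentions of the term "premium"; a rising count suggests an emerging trend
df["mentions"] = df["text"].str.lower().str.count("premium")
trend = df.set_index("date")["mentions"].resample("M").sum()
print(trend)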
The PageRank algorithm, developed by Larry Page and Sergey Brin, is a key algorithm used by
search engines to rank web pages in search results. It assigns a numerical weight to each
element of a hyperlinked set of documents, with the purpose of measuring its relative
importance within the set. PageRank forms the basis of many web ranking algorithms and has
also been adapted for use in text summarization.
Suppose we have 4 web pages: w1, w2, w3, and w4. These pages contain links pointing to
one another. Some pages might have no outgoing links; these are called dangling pages.
In order to rank these pages, we would have to compute a score called the PageRank score.
This score is the probability of a user visiting that page.
To capture the probabilities of users navigating from one page to another, we will create a
square matrix M, having n rows and n columns, where n is the number of web pages.
Each element of this matrix denotes the probability of a user transitioning from one web page to
another. For example, the cell in row w1 and column w2 contains the probability of a transition
from w1 to w2.
Finally, the values in this matrix will be updated in an iterative fashion to arrive at the web page
rankings.
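A minimal sketch of this iterative update (the power-iteration form of PageRank), assuming the 4-page example above and an illustrative transition matrix M:

import numpy as np

# Illustrative transition matrix for pages w1..w4 (rows sum to 1);
# M[i][j] is the assumed probability of moving from page i to page j.
M = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
])

d = 0.85                      # damping factor
n = M.shape[0]
ranks = np.full(n, 1.0 / n)   # start with equal PageRank for every page

# Iteratively update the scores until they stop changing noticeably
for _ in range(100):
    new_ranks = (1 - d) / n + d * M.T.dot(ranks)
    if np.allclose(new_ranks, ranks, atol=1e-8):
        break
    ranks = new_ranks

print(dict(zip(["w1", "w2", "w3", "w4"], ranks.round(3))))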
Applications
Web Search: PageRank forms the foundation of Google's search algorithm, influencing the
ranking of search results based on the importance of web pages.
Text Summarization: PageRank has been adapted for use in text summarization algorithms,
where it helps in identifying the most important sentences or phrases in a document based on
their connectivity and relevance to other parts of the text.
Network Analysis: PageRank is used in network analysis tasks beyond the web, such as
ranking scientific papers, social network analysis, or citation analysis.
TextRank Algorithm
Let’s understand the TextRank algorithm, now that we have a grasp on PageRank. The
similarities between the two algorithms are:
● In place of web pages, we use sentences.
● The similarity between any two sentences is used in place of the web page transition probability.
● The similarity scores are stored in a square matrix, similar to the matrix M used for PageRank.
Text Preprocessing
Get rid of the stopwords (commonly used words of a language – is, am, the, of, in, etc.)
present in the sentences. If you have not downloaded nltk-stopwords, then execute the following
line of code:
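import nltk
nltk.download('stopwords')   # one-time download of the NLTK stopword list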
We will use clean_sentences to create vectors for sentences in our data with the help of the
GloVe word vectors.
Vector Representation of Sentences
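A minimal sketch of how the sentence vectors can be built, assuming the GloVe embeddings have been loaded into a dictionary named word_embeddings (mapping each word to a 100-dimensional vector) and the cleaned sentences are stored in clean_sentences; both names are assumptions for illustration:

import numpy as np

# Build a fixed-length vector for each sentence by averaging the GloVe
# vectors of its words; unseen words default to a zero vector.
sentence_vectors = []
for sentence in clean_sentences:
    words = sentence.split()
    if words:
        v = sum(word_embeddings.get(w, np.zeros((100,))) for w in words) / (len(words) + 0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)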
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# Similarity matrix: use cosine similarity to compute the similarity
# between every pair of sentences.
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1, 100),
                                              sentence_vectors[j].reshape(1, 100))[0, 0]

# Applying the PageRank algorithm: convert the similarity matrix sim_mat
# into a graph and score each sentence (node) with PageRank.
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

# Summary extraction: pick the top N sentences based on their rankings.
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
N = 2
for score, sentence in ranked_sentences[:N]:
    print(sentence)
N-Grams
N-grams are contiguous sequences of n items from a given sample of text or speech. The items
can be phonemes, syllables, letters, words, or base pairs according to the application. N-grams
are used in various areas of computational linguistics and text analysis. They are a simple and
effective method for text mining and natural language processing (NLP) tasks, such as text
prediction, spelling correction, language modeling, and text classification.
Consider the sentence "The quick brown fox jumps over the lazy dog."
Unigrams (N=1): "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"
Bigrams (N=2): "The quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the
lazy", "lazy dog"
Trigrams (N=3): "The quick brown", "quick brown fox", "brown fox jumps", "fox jumps over",
"jumps over the", "over the lazy", "the lazy dog"
As you can see, unigrams do not contain any context, bigrams contain a minimal context, and
trigrams start to form more coherent and contextually relevant phrases.
Applications of N-Grams
N-grams are widely used in various NLP tasks. Here are a few examples:
1. Language Modeling:
N-grams can be used to predict the next item in a sequence, making them useful for language
models in speech recognition, typing prediction, and other generative tasks.
2. Text Classification:
They can serve as features for algorithms that classify documents into categories, such as spam
filters or sentiment analysis.
3. Machine Translation:
N-grams help in statistical machine translation systems by providing probabilities of sequences
of words appearing together.
4. Spell Checking and Correction:
They can be used to suggest corrections for misspelled words based on the context provided by
surrounding words.
5. Information Retrieval:
Search engines use n-grams to index texts and provide search results based on the likelihood
of n-gram sequences.
N-gram frequency count:
An n-gram frequency count tallies how often each contiguous fragment of a given length occurs
in a text. The most commonly used is the bigram count, using fragments of length 2.
For the example sentence, the bigrams are:
"The quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the lazy", "lazy dog"
Number of bigrams (fragments of length 2) = 8; each bigram occurs once in this sentence.
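A minimal sketch of an n-gram frequency count with NLTK and Python's collections.Counter, using the example sentence above:

from collections import Counter
from nltk import ngrams, word_tokenize   # requires nltk.download('punkt')

text = "The quick brown fox jumps over the lazy dog."
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

# Count how often each bigram (n = 2) occurs in the text
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(5))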
Phrase Mining
Phrase mining is a natural language processing (NLP) technique that involves extracting
meaningful phrases or multi-word expressions from a corpus of text data. Instead of focusing
solely on individual words, phrase mining aims to identify combinations of words that convey a
specific meaning or represent a single concept. This can be particularly useful in tasks such as
text summarization, sentiment analysis, information retrieval, and document categorization.
1. Tokenization: The first step in phrase mining involves breaking down the text into its
constituent units, typically words or tokens. This process may also involve removing punctuation
marks, special characters, and stopwords (commonly occurring words like "the", "and", "is",
etc.).
2. Candidate Phrase Extraction: Once the text is tokenized, candidate phrases are extracted
from the text based on certain criteria. These criteria could include:
● Frequency-based: Phrases that occur frequently in the text corpus are identified as
potential candidates.
● Part-of-Speech Patterns: Phrases that follow specific grammatical patterns, such as noun
phrases or verb phrases, are extracted.
● Dependency Parsing: Phrases that are connected by syntactic dependencies in the
sentence structure are considered as candidate phrases.
3. Scoring and Ranking: After extracting candidate phrases, they are scored based on various
metrics to determine their significance. Common metrics used for scoring include:
● Frequency: Phrases that occur more frequently in the text are often considered more
important.
● Pointwise Mutual Information (PMI): Measures the association between words in a
phrase, indicating how likely they occur together compared to their individual probabilities.
● Tf-idf (Term Frequency-Inverse Document Frequency): Weighs the importance of a
phrase in a document relative to its frequency across the entire corpus.
4. Filtering and Pruning: To improve the quality of extracted phrases, certain filtering and
pruning techniques may be applied to remove noise and irrelevant phrases. This could involve:
● Length-based Filtering: Removing phrases that are too short or too long.
● Stopword Removal: Eliminating phrases containing stopwords or common words.
● Domain-specific Filtering: Removing phrases that are not relevant to the specific domain
or context of the text corpus.
5. Post-processing: Finally, post-processing steps may be performed to refine the extracted
phrases further. This could include:
● Merging: Combining related phrases or hyphenated words into single phrases.
● Normalization: Converting phrases into a standard format or canonical representation.
● Contextual Analysis: Considering the context in which phrases occur to disambiguate and
refine their meaning.
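As one illustration of the candidate extraction, frequency filtering, and PMI scoring steps above, here is a minimal sketch using NLTK's bigram collocation finder; the corpus string is a made-up placeholder:

from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

corpus = ("text summarization systems extract key phrases; "
          "phrase mining finds multi word expressions such as customer service "
          "and customer service quality in large text collections")
tokens = [t.lower() for t in word_tokenize(corpus) if t.isalpha()]

# Find candidate two-word phrases and score them
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)                     # frequency-based pruning
measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 5))            # top candidates by PMI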
Latent Dirichlet Allocation, a statistical and graphical model, is used to find the latent word-distribution
connections between the documents in a corpus. The Variational Expectation Maximization
(VEM) technique is used to obtain the maximum likelihood estimate from the full corpus of text.
Important Libraries in Topic Modeling Project
● Gensim
● NLTK
● Matplotlib
● Scikit-learn
● Pandas
Example:
# Importing necessary libraries
import gensim
from gensim import corpora
from pprint import pprint

# Sample documents
documents = [
    "Machine learning is the future of technology",
    "Natural language processing is a key component of AI",
    "Data science is an interdisciplinary field",
    "Artificial intelligence has many applications",
    "Deep learning is a subset of machine learning"
]

# Tokenize, build a dictionary, and convert each document to a bag-of-words vector
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA with a chosen number of topics (here 2) and print the discovered topics
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary,
                                   num_topics=2, passes=10, random_state=42)
pprint(lda_model.print_topics())
Machine-Learned Classification
Text classification is the task of assigning predefined categories or labels to textual documents
automatically.
Machine-learned classification involves training algorithms to learn patterns in text data and
make predictions based on those patterns.
1. Data Preparation:
● Collect and preprocess the text data.
● Preprocessing includes tasks such as tokenization, removing stopwords, and
stemming/lemmatization.
2. Feature Extraction:
● Convert text documents into numerical feature vectors.
● Common techniques include bag-of-words, TF-IDF (Term Frequency-Inverse Document
Frequency), and word embeddings.
3. Model Training:
● Select a classification algorithm such as Naive Bayes, Support Vector Machines (SVM),
or Neural Networks.
● Split the dataset into training and testing sets.
● Train the model using the training data.
4. Model Evaluation:
● Evaluate the trained model's performance using metrics such as accuracy, precision,
recall, and F1-score on the testing set.
● Adjust hyperparameters if necessary to improve performance.
5. Prediction:
● Use the trained model to predict the class labels of new, unseen text documents.
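A minimal end-to-end sketch of these five steps with scikit-learn, using a tiny made-up labeled dataset:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# 1-2. Tiny illustrative dataset and TF-IDF feature extraction
texts = ["the premium is reasonable", "great pricing and service",
         "the call center is rude", "terrible customer service",
         "pricing is fair and affordable", "support was unhelpful and slow"]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 3. Model training
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

# 4. Model evaluation
print(classification_report(y_test, clf.predict(X_test_vec)))

# 5. Prediction on new, unseen text
print(clf.predict(vectorizer.transform(["the agent was very rude"])))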
Applications: spam filtering, sentiment analysis, and semantic topic tagging (discussed next).
Semantic Topic Tagging
Semantic topic tagging aims to assign meaningful semantic labels or tags to text documents
based on their content.
1. Keyword-based Tagging:
● Assign tags to documents based on the presence of specific keywords or phrases.
● Keywords are manually selected or extracted from the document.
2. Topic Modeling:
● Utilize unsupervised machine learning techniques such as Latent Dirichlet Allocation
(LDA) to discover latent topics in a corpus of documents.
● Each topic is represented by a distribution over words, and documents are assigned
to topics based on their word distributions.
3. Named Entity Recognition (NER):
● Identify and classify named entities such as persons, organizations, locations, etc.,
mentioned in the text.
● Named entities can serve as semantic tags for the document.
Challenges:
Applications
Example
import spacy
# Load the small English model (requires: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# Sample text
text = ("Apple Inc. is a technology company headquartered in Cupertino, "
        "California. Tim Cook is the CEO of Apple.")
# Run the pipeline and print the recognized entities
doc = nlp(text)
print("Named Entities:")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
Output:
Named Entities:
Entity: Apple Inc., Label: ORG
Entity: Cupertino, Label: GPE
Entity: California, Label: GPE
Entity: Tim Cook, Label: PERSON
Entity: Apple, Label: ORG