Data Science Module 5
🔹 Example: Converting a 3D object into a 2D image while keeping its essential details.
● Two main types:
○ Feature Selection: Selecting a subset of the original features.
○ Feature Extraction: Creating new features from existing ones.
🔹 PCA (Principal Component Analysis): A feature extraction technique that transforms the data into a new set of uncorrelated variables (principal components) that maximize variance in the data.
● Key Idea: Reduce complexity while retaining as much information as possible.
Example: Converting a high-resolution image to a lower-resolution version
without losing key details.
What is PCA?
PCA projects the data onto a small number of new, uncorrelated axes (principal components), ordered by how much of the variance in the data they capture. The steps below apply PCA to a small example dataset.
1. Create the Dataset
import pandas as pd

data = pd.DataFrame({
'Feature1': [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
'Feature2': [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
'Feature3': [1.8, 0.5, 2.4, 1.5, 2.7, 2.4, 2.2, 1.3, 1.8, 1.0],
'Feature4': [0.5, 0.3, 0.7, 0.6, 1.0, 0.9, 0.8, 0.2, 0.4, 0.3]
})
2. Standardize the Data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
3. Compute PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # Reduce to 2 dimensions
principal_components = pca.fit_transform(data_scaled)
# Convert to DataFrame (the column names 'PC1' and 'PC2' are illustrative)
pc_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
4. Explained Variance
explained_variance = pca.explained_variance_ratio_
If PC1 explains 70% of variance and PC2 explains 20%, keeping two components retains 90% of
information.
Complete program: Principal Component Analysis (PCA) using sklearn. This program takes a dataset, standardizes it, applies PCA, and visualizes the results.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: Create and standardize the dataset
data = pd.DataFrame({
    'Feature1': [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    'Feature2': [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
    'Feature3': [1.8, 0.5, 2.4, 1.5, 2.7, 2.4, 2.2, 1.3, 1.8, 1.0],
    'Feature4': [0.5, 0.3, 0.7, 0.6, 1.0, 0.9, 0.8, 0.2, 0.4, 0.3]
})
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Step 2: Apply PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(data_scaled)
# Convert to DataFrame
pc_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])

# Step 3: Explained variance
explained_variance = pca.explained_variance_ratio_
print("Explained variance ratio:", explained_variance)

# Step 4: Visualization
plt.figure(figsize=(8, 6))
plt.scatter(pc_df['PC1'], pc_df['PC2'])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection')
plt.grid()
plt.show()
Data preprocessing is an essential step in the data science workflow that ensures data is clean,
consistent, and ready for analysis. By applying techniques like handling missing data, encoding
categorical variables, normalizing data, and reducing dimensionality, we improve the accuracy
and efficiency of machine learning models. Effective data preprocessing leads to better insights
and more reliable predictions.
Types of Anomalies:
1. Point Anomalies
Definition: A single data point that significantly differs from the rest of the data.
Example:
● Credit Card Fraud: A customer who usually makes small purchases suddenly makes an
expensive purchase from a foreign country.
Example Flow: Credit Card Fraud Detection
1. Normal Behavior:
A customer, let's call them John, usually makes small purchases. For instance, he buys
groceries, coffee, and gas—items that are typical and within a certain price range. His
usual buying pattern is predictable and local.
2. Suspicious Activity:
One day, John's credit card is used to make a large purchase for an expensive item, like
a high-end watch, from a foreign country. This is completely different from his usual
spending habits.
3. Flagging the Transaction:
The bank's fraud detection system notices this sudden change in John's behavior. The
system has learned what typical spending patterns look like and flags this transaction as
suspicious because it doesn't fit John's usual profile.
4. Verification:
The bank or credit card company sends a notification to John, asking if he made the
purchase. It could be through a phone call, email, or a mobile app alert.
5. Action Taken:
○ If John confirms the purchase, everything proceeds as normal.
○ If John denies the purchase, the bank can freeze the account and begin
investigating further to prevent any potential fraud.
2. Collective Anomalies
Definition: A group of related data points that is anomalous as a whole, even though each individual point may appear normal.
Example:
● Server Downtime: A series of server failures occurring one after another, indicating a
possible cyber attack.
Example Flow: Server Downtime and Cyber Attack Detection
1. Normal Server Operation:
Normally, servers work smoothly, handling requests and providing services without
interruption. Occasionally, a server might experience a brief glitch or downtime, but it is
usually resolved quickly.
2. Unexpected Series of Failures:
One day, a series of server failures start happening one after another. Servers that were
previously stable begin to go down in quick succession. This pattern is unusual and
suggests something is not quite right.
3. Flagging Suspicious Activity:
The system monitoring the servers notices that these failures are occurring too frequently
and in a pattern that’s unlike typical server issues. This could indicate that something
intentional is happening, such as a cyber attack trying to overload or disrupt the servers.
4. Detecting a Possible Cyber Attack:
The sequence of failures might look like a Denial of Service (DoS) attack, where
multiple servers are targeted in an attempt to bring down the entire system. The
monitoring system flags this behavior as a potential cyber attack.
5. Action Taken:
○ The system sends an immediate alert to IT security teams.
○ The security team investigates further, potentially isolating affected servers or
implementing countermeasures (like blocking suspicious traffic).
○ The servers may be restored to normal operation once the attack is mitigated.
1. Statistical Approaches
● Z-Score: Measures how many standard deviations a data point is from the mean, Z = (x - μ) / σ. Data points with a Z-score above a certain threshold (commonly |Z| > 3) are flagged as anomalies.
● Example: A user’s purchase behavior on an e-commerce site is compared to the average
to detect abnormal spending.
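To make the Z-score idea concrete, here is a minimal sketch in Python; the synthetic purchase amounts and the threshold of 3 are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
purchases = np.append(rng.normal(22, 3, 200), 500.0)   # typical purchases plus one very large one

z_scores = (purchases - purchases.mean()) / purchases.std()

threshold = 3                                           # common rule of thumb
anomalies = purchases[np.abs(z_scores) > threshold]
print(anomalies)                                        # only the 500 purchase is flagged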
2. Isolation Forest: An algorithm that detects anomalies by isolating rare data points. Because anomalies are few and different, they are easier to isolate, which makes the algorithm efficient and effective for identifying outliers, especially in large and high-dimensional datasets (a minimal code sketch follows the list below).
● Random Partitioning: The algorithm randomly selects a feature and then a value to
split the data into two parts. The process of partitioning continues recursively,
generating isolation trees.
● Tree Building: Each tree is constructed by repeating the random partitioning
process. The depth of a tree determines how quickly a point can be isolated.
Anomalies tend to be isolated closer to the root (shallower depth).
● Anomaly Scoring: The average path length to isolate a point is calculated across all
trees. If a point has a short average path length, it is an anomaly (isolated quickly).
If a point has a long path length, it is normal (isolated slowly).
● Other machine learning approaches for anomaly detection include:
○ One-Class SVM
○ DBSCAN (Density-Based Clustering)
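Below is a minimal sketch of the Isolation Forest approach using scikit-learn's IsolationForest; the synthetic dataset, contamination value, and number of trees are assumptions for illustration.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0, scale=1, size=(200, 2))      # typical behaviour
outliers = rng.uniform(low=6, high=8, size=(5, 2))      # rare, far-away points
X = np.vstack([normal, outliers])

model = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
labels = model.fit_predict(X)    # -1 = anomaly, 1 = normal

print("Detected anomalies:", X[labels == -1])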
How Autoencoders Work
1. Input Data: The network takes in data, such as images, text, or numerical data.
2. Encoding: The encoder reduces the dimensionality of the data by mapping it into a
compact, low-dimensional space.
3. Reconstruction: The decoder attempts to rebuild the original input data from the
compressed representation.
4. Optimization: The network is trained using backpropagation to minimize the
reconstruction error, thus learning an efficient representation of the data.
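As a rough illustration of these four steps, here is a minimal autoencoder sketch assuming TensorFlow/Keras is available; the layer sizes, synthetic data, and training settings are arbitrary choices, and the reconstruction error at the end shows how such a network can flag unusual inputs.

import numpy as np
from tensorflow import keras

input_dim, encoding_dim = 20, 4

inputs = keras.Input(shape=(input_dim,))
encoded = keras.layers.Dense(encoding_dim, activation="relu")(inputs)    # encoder: compress
decoded = keras.layers.Dense(input_dim, activation="linear")(encoded)    # decoder: reconstruct
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")                        # minimize reconstruction error

X = np.random.rand(500, input_dim).astype("float32")                     # synthetic data for illustration
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)               # input is also the target

reconstruction = autoencoder.predict(X, verbose=0)
errors = np.mean((X - reconstruction) ** 2, axis=1)                      # large error -> unusual input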
Let’s imagine you want to generate realistic images of cats using GANs.
1. Data Collection: First, you gather a large dataset of real cat images.
2. Generator Creation: The Generator takes random noise and tries to create an image of a
cat.
3. Discriminator Creation: The Discriminator evaluates whether an image is a real cat
image or a fake one generated by the Generator.
4. Training: Both the Generator and the Discriminator are trained together. The Generator
tries to create better and more realistic cat images to fool the Discriminator, while the
Discriminator gets better at telling the difference between real and fake images.
5. Outcome: After many iterations, the Generator learns to produce cat images that are
nearly indistinguishable from real ones.
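The adversarial loop above can be sketched in code. The example below assumes TensorFlow/Keras and uses synthetic 2-D points as a stand-in for real cat images; the network sizes, learning rates, and number of steps are arbitrary.

import numpy as np
import tensorflow as tf
from tensorflow import keras

latent_dim = 8
generator = keras.Sequential([          # noise -> fake sample
    keras.Input(shape=(latent_dim,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(2),
])
discriminator = keras.Sequential([      # sample -> probability it is real
    keras.Input(shape=(2,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

g_opt, d_opt = keras.optimizers.Adam(1e-3), keras.optimizers.Adam(1e-3)
bce = keras.losses.BinaryCrossentropy()

real_data = np.random.normal(3.0, 0.5, size=(1000, 2)).astype("float32")  # stand-in for "real" images
batch = 64

for step in range(200):
    # 1) Train the discriminator on real vs. generated samples
    noise = tf.random.normal((batch, latent_dim))
    fake = generator(noise)
    real = real_data[np.random.randint(0, len(real_data), batch)]
    with tf.GradientTape() as tape:
        d_loss = bce(tf.ones_like(discriminator(real)), discriminator(real)) + \
                 bce(tf.zeros_like(discriminator(fake)), discriminator(fake))
    grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))

    # 2) Train the generator to make the discriminator output "real"
    noise = tf.random.normal((batch, latent_dim))
    with tf.GradientTape() as tape:
        g_loss = bce(tf.ones_like(discriminator(generator(noise))), discriminator(generator(noise)))
    grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(grads, generator.trainable_variables))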
Evaluation Metrics
1. Precision
Precision is the ratio of correctly predicted positive observations to the total predicted positives. It answers the question: "Of all the instances the model predicted as positive, how many are actually positive?"
Formula:
Precision = TP / (TP + FP), where TP = true positives and FP = false positives.
Example:
If a model predicts 100 positive cases, and 90 of them are actually positive, the precision is 90%.
2. Recall
Recall is the ratio of correctly predicted positive observations to all actual positives. It answers the question: "Of all the actual positive instances, how many did the model correctly identify?"
Formula:
Recall = TP / (TP + FN), where FN = false negatives.
Example:
If the model identifies 90 out of 100 actual positive cases, the recall is 90%.
● High recall means fewer false negatives, and the model is better at identifying positive
cases.
3. F1-Score
F1-Score is the harmonic mean of Precision and Recall. It balances both metrics and is a good
measure when there is an uneven class distribution (imbalanced dataset).
Formula:
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Example:
If the model has high precision but low recall, the F1-score will help highlight this balance.
● A higher F1-score indicates that both precision and recall are reasonably balanced.
4. ROC Curve and AUC
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate against the False Positive Rate at different classification thresholds.
● True Positive Rate (Recall): The proportion of actual positives correctly identified by the model.
● False Positive Rate: The proportion of actual negatives that are incorrectly classified as
positives.
AUC is the area under the ROC curve. It is a single value that summarizes the performance of
the classifier across all possible thresholds.
Example:
Imagine a binary classification problem where we want to predict if an email is spam (positive)
or not spam (negative). The dataset is highly imbalanced, with 90% non-spam emails and 10%
spam emails.
1. Precision: Out of all the emails predicted as spam, how many were actually spam?
2. Recall: Out of all the actual spam emails, how many did the model correctly identify?
3. F1-Score: A balance between precision and recall.
4. ROC Curve & AUC: A curve showing how the true positive rate varies with the false
positive rate at different thresholds, and the area under the curve tells how well the model
distinguishes between spam and non-spam.
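A minimal sketch of computing these metrics with scikit-learn follows; the true labels, predictions, and predicted probabilities are made up for illustration.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]                       # 1 = spam, 0 = not spam
y_pred  = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]                       # hard predictions from the model
y_score = [0.9, 0.2, 0.1, 0.8, 0.3, 0.4, 0.2, 0.6, 0.7, 0.1]   # predicted spam probabilities

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))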
3. Text Mining
● Text Preprocessing
● Text Classification
● Topic Modeling
How Does the Text Mining Process Work? Key Steps Involved
Have you ever wondered how businesses uncover hidden insights from massive piles of text
data? Industries like healthcare, finance, e-commerce, and even entertainment heavily rely on
text mining in data mining to transform unstructured data into actionable intelligence.
Across these sectors, the process involves structured steps that consistently convert raw text into
meaningful insights. To understand this transformative journey, here are the key steps involved in
text mining, explained in detail.
The first step is gathering raw, unstructured data from multiple sources. The objective here is to
compile text data relevant to the problem or task.
After gathering data, preprocessing ensures the text is clean and ready for analysis. This step
removes noise and standardizes the text.
● Cleanup: Removes unnecessary elements like HTML tags, advertisements, and binary
formats from the text.
● Tokenization: Splits text into smaller units, such as words or sentences, for analysis.
● Stop-word Removal: Eliminates common but contextually irrelevant words like "the" or
"and."
● Stemming and Lemmatization: Converts words to their root forms (e.g., "running" to
"run") to reduce complexity.
After cleaning the data, the next step is representing the data in a way algorithms can interpret.
The objective of this step is to convert clean text into numerical or symbolic formats that are
usable by machine learning models.
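A minimal sketch of this step with scikit-learn is shown below, turning a few made-up documents into Bag-of-Words counts and TF-IDF weights.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the delivery was fast", "fast delivery great service", "terrible service slow delivery"]

bow = CountVectorizer()
X_counts = bow.fit_transform(docs)        # Bag-of-Words: raw term counts
print(bow.get_feature_names_out())
print(X_counts.toarray())

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)       # TF-IDF: counts weighted by how rare a term is
print(X_tfidf.toarray().round(2))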
This step involves applying analytical techniques to derive insights and patterns from the data.
The objective is to uncover hidden knowledge and actionable insights.
● Text Classification: Assigns predefined categories, such as spam vs. non-spam emails.
● Clustering: Groups similar documents without predefined categories.
● Sentiment Analysis: Identifies emotional tones, such as positive, negative, or neutral
sentiments.
● Named Entity Recognition (NER): Detects entities like names, locations, and dates
within text.
● Topic Modeling: Identifies hidden themes in text collections using algorithms like LDA.
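As one concrete example of these techniques, here is a minimal topic-modeling sketch with scikit-learn's LatentDirichletAllocation; the tiny corpus and the choice of two topics are assumptions for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the bank approved the loan and the interest rate",
    "loan interest payments at the bank increased",
    "the team won the football match last night",
    "the match was decided by a late goal from the team",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words for each discovered topic
terms = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-4:]]
    print(f"Topic {idx}: {top_words}")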
Once analysis is complete, the next step is evaluating how effective the results are. The objective
here is to measure the accuracy and relevance of the results from text analysis.
A commonly used metric here is Precision: the proportion of relevant results among the retrieved ones.
After evaluating, the goal of this step is to present findings in a visually intuitive format that
stakeholders can easily understand.
This final step focuses on improving the accuracy and relevance of results through
experimentation and fine-tuning.
By following these structured steps, text mining in data mining transforms unstructured text into
impactful insights, empowering businesses to make smarter decisions.
Industries today depend on text mining in data mining to extract valuable insights from vast
amounts of text data. From uncovering customer sentiments to identifying fraudulent activities,
text mining techniques are at the core of transforming unstructured text into actionable
intelligence.
These techniques work by breaking down text into structured forms and applying advanced
algorithms to find patterns, relationships, and meanings.
Below are the most important techniques used in text mining, explained in detail.
Information retrieval focuses on extracting relevant information from large text datasets. It
enables users to find the most relevant content based on queries or predefined parameters.
● Tokenization: Breaks text into smaller units like words or sentences for easier processing.
● Stemming and Lemmatization: Reduces words to their root forms, ensuring consistency
in analysis.
● Pattern Matching: Identifies specific terms or phrases using algorithms, such as keyword
searches.
Information retrieval is extensively used in search engines and library catalog systems to provide
relevant results.
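A minimal retrieval sketch follows, ranking a few made-up documents against a query with TF-IDF vectors and cosine similarity.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine learning improves fraud detection in banking",
    "The library catalog lists books on data mining",
    "Search engines rank pages by relevance to the query",
]
query = "data mining books"

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(documents)      # vectorize the collection
query_vec = vec.transform([query])             # vectorize the query in the same space

scores = cosine_similarity(query_vec, doc_matrix).ravel()
best = scores.argmax()
print("Most relevant document:", documents[best], "(score:", round(scores[best], 2), ")")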
NLP enables computers to understand, interpret, and respond to human language. It bridges the
gap between raw text and semantic meaning.
● Part-of-Speech (POS) Tagging: Assigns grammatical tags like nouns and verbs to tokens,
adding semantic depth.
● Named Entity Recognition (NER): Identifies entities such as names, dates, and locations
for context-specific insights.
● Sentiment Analysis: Determines emotional tones (positive, negative, or neutral) in text,
useful for customer feedback.
● Text Summarization in NLP: Generates concise summaries of long texts for quick
consumption.
NLP powers chatbots, virtual assistants, and customer service automation.
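Below is a minimal NLP sketch assuming NLTK; it shows POS tagging and VADER sentiment analysis on a made-up sentence, and the required NLTK resources (punkt, averaged_perceptron_tagger, vader_lexicon) must be downloaded once with nltk.download.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

sentence = "The new phone is amazing but the battery life is disappointing."

tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))              # part-of-speech tags, e.g. ('phone', 'NN')

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores(sentence))     # positive/negative/neutral/compound sentiment scores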
Information extraction identifies specific pieces of information from text and transforms them
into structured data for analysis.
● Feature Extraction: Generates new dimensions or variables from text, such as extracting
keywords from reviews.
● Feature Selection: Reduces dimensionality by keeping only the most significant features
for analysis.
IE is widely used in extracting data from legal documents, research papers, or social media posts.
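As a simple illustration, the sketch below uses regular expressions to extract dates and email addresses from made-up free text and store them as structured data.

import re

text = ("The contract was signed on 2023-05-14 by the supplier. "
        "Questions can be sent to legal@example.com before 2023-06-01.")

dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)          # extract ISO-style dates
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)   # extract email addresses

record = {"dates": dates, "emails": emails}             # structured output ready for analysis
print(record)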
How do businesses, hospitals, and social platforms unlock the secrets buried in text data? From
improving customer experiences to diagnosing diseases, what is text mining, if not a gateway to
revolutionary solutions?
Below, you’ll explore how text mining is transforming industries through its diverse applications.
In customer service, companies use text mining to analyze support tickets and customer feedback at scale.
In healthcare, text mining processes clinical notes, medical records, and research papers. It aids
in diagnosing conditions, predicting disease outbreaks, and discovering new treatments.
Social media platforms generate mountains of unstructured data. Text mining uncovers trends,
tracks brand sentiment, and predicts market shifts from this data. It’s a key tool for digital
marketers and strategists.
While text mining in data mining opens doors to analyzing vast amounts of unstructured data, it
comes with its share of complexities. On one hand, it offers scalability, automation, and valuable
insights.
On the other, challenges like data quality issues, processing costs, and ethical concerns demand
careful attention. This balance of benefits and limitations defines what is text mining and how
organizations approach its adoption.
The advantages of text mining demonstrate its immense potential across industries. Here are
some advantages of text mining.
● Efficient data analysis: Text mining automates the extraction of insights from massive
datasets, saving time. For example, customer service platforms use it to analyze millions
of support tickets for recurring issues.
● Enhanced decision-making: Text mining helps businesses make informed choices by
uncovering trends. Predicting customer behavior from product reviews is one such
impactful application.
● Cost savings: Automating labor-intensive tasks, like sorting through legal contracts,
reduces manual effort and lowers costs for companies in finance and law.
● Improved accuracy: Advanced models ensure precision in sentiment analysis, such as
identifying customer satisfaction in feedback surveys.
● Cross-industry application: From predicting disease outbreaks in healthcare to detecting
fraud in banking, text mining adapts to various sectors.
Despite its advantages, text mining faces limitations that organizations must tackle carefully.
Here are some of its disadvantages.
● Data quality issues: Unstructured data often includes errors or inconsistencies. For
instance, misspellings in social media posts can skew sentiment analysis.
● High processing costs: Implementing advanced models, especially for large datasets,
demands significant computational power, which can strain budgets.
● Ethical concerns: Mining sensitive data, like patient records or private messages, raises
serious privacy issues and risks non-compliance with regulations.
● Dependence on domain knowledge: Text mining requires industry-specific expertise to
interpret results accurately, such as understanding medical terminology in healthcare.
● Complexity of interpretation: The insights generated are not always straightforward. For
example, topic modeling outputs often need expert review to contextualize results.
● Time Series Analysis involves analyzing data points collected or recorded at specific time
intervals. Unlike random observations, time series data maintains an inherent temporal
order, making it essential for trend analysis, forecasting, and anomaly detection.
● Applications of Time Series Analysis:
○ Weather Forecasting: Predicting future weather patterns based on historical data.
○ Stock Market Prediction: Forecasting stock prices and trends using past market
data.
○ Economic and Sales Forecasting: Estimating revenue, demand, and economic
indicators.
○ Healthcare Monitoring: Analyzing patient vitals over time for anomaly
detection.
○ IoT and Sensor Data Analysis: Monitoring smart devices and industrial systems.
Time series data consists of various components that help in identifying patterns:
1. Trend:
○ The long-term upward or downward movement in the data.
○ Example: A company’s revenue growth over years.
2. Seasonality:
○ Repeating patterns or cycles over a fixed period (e.g., daily, weekly, yearly).
○ Example: Retail sales increase during festive seasons.
3. Cyclic Patterns:
○ Irregular fluctuations that do not follow a fixed time period.
○ Example: Business cycles influenced by economic conditions.
4. Noise:
○ Random variations that do not follow any pattern.
○ Example: Sudden spikes due to external factors like news or events.
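These components can be separated programmatically. The sketch below assumes the statsmodels library and uses a synthetic monthly series built from a trend, a yearly seasonal pattern, and noise.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")        # 4 years of monthly data
trend = np.linspace(100, 160, 48)                                # upward trend
season = 10 * np.sin(2 * np.pi * np.arange(48) / 12)             # yearly seasonality
noise = np.random.default_rng(0).normal(0, 2, 48)                # random noise
series = pd.Series(trend + season + noise, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())      # estimated trend component
print(result.seasonal.head())            # estimated seasonal component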
Popular modeling approaches range from classical statistical methods to tools such as Facebook Prophet for forecasting and autoencoders for anomaly detection.
Evaluation Metrics
To assess the accuracy of time series models, the following metrics are commonly used:
● Mean Absolute Error (MAE)
○ MAE = (1/n) Σ |actual_i - predicted_i|
○ Measures the average absolute difference between actual and predicted values.
● Root Mean Squared Error (RMSE)
○ RMSE = √[ (1/n) Σ (actual_i - predicted_i)² ]
○ Penalizes larger errors more than MAE.
● Mean Absolute Percentage Error (MAPE)
○ MAPE = (100/n) Σ |(actual_i - predicted_i) / actual_i|
○ Expresses error as a percentage, making it scale-independent.
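A minimal sketch of computing MAE, RMSE, and MAPE with NumPy and scikit-learn follows; the actual and predicted values are made up.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual    = np.array([100, 110, 120, 130, 140])
predicted = np.array([102, 108, 123, 128, 137])

mae  = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
mape = np.mean(np.abs((actual - predicted) / actual)) * 100

print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  MAPE: {mape:.2f}%")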
Time series analysis is an essential tool in data science for making data-driven predictions and
identifying patterns over time. With traditional statistical methods and advanced AI-based
models, time series forecasting has become a crucial element in fields like finance, healthcare,
and IoT.