Data Science Module 5
🔹 Example: Converting a 3D object into a 2D image while keeping its essential details.
● Two main types:
○ Feature Selection: Selecting a subset of the original features.
○ Feature Extraction: Creating new features from existing ones.
🔹 PCA (Principal Component Analysis): A feature extraction technique that transforms the data into a new set of uncorrelated variables (principal components) that maximize variance in the data.
● Key Idea: Reduce complexity while retaining as much information as possible.
Example: Converting a high-resolution image to a lower-resolution version
without losing key details.
What is PCA?
PCA projects the data onto a small number of new, uncorrelated axes (principal components), ordered by how much of the variance in the data they capture. The steps below apply PCA to a small example dataset.
1. Create the Dataset
import pandas as pd

data = pd.DataFrame({
'Feature1': [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
'Feature2': [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
'Feature3': [1.8, 0.5, 2.4, 1.5, 2.7, 2.4, 2.2, 1.3, 1.8, 1.0],
'Feature4': [0.5, 0.3, 0.7, 0.6, 1.0, 0.9, 0.8, 0.2, 0.4, 0.3]
})
2. Standardize the Data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
3. Compute PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # Reduce to 2 dimensions
principal_components = pca.fit_transform(data_scaled)
# Convert to DataFrame (the column names 'PC1' and 'PC2' are illustrative)
pc_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
4. Explained Variance
explained_variance = pca.explained_variance_ratio_
If PC1 explains 70% of variance and PC2 explains 20%, keeping two components retains 90% of
information.
Complete program: Principal Component Analysis (PCA) using sklearn. This program takes a dataset, standardizes it, applies PCA, and visualizes the results.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: Create and standardize the dataset
data = pd.DataFrame({
    'Feature1': [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    'Feature2': [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
    'Feature3': [1.8, 0.5, 2.4, 1.5, 2.7, 2.4, 2.2, 1.3, 1.8, 1.0],
    'Feature4': [0.5, 0.3, 0.7, 0.6, 1.0, 0.9, 0.8, 0.2, 0.4, 0.3]
})
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Step 2: Apply PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(data_scaled)
# Convert to DataFrame
pc_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])

# Step 3: Explained variance
explained_variance = pca.explained_variance_ratio_
print("Explained variance ratio:", explained_variance)

# Step 4: Visualization
plt.figure(figsize=(8, 6))
plt.scatter(pc_df['PC1'], pc_df['PC2'])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection')
plt.grid()
plt.show()
Data preprocessing is an essential step in the data science workflow that ensures data is clean,
consistent, and ready for analysis. By applying techniques like handling missing data, encoding
categorical variables, normalizing data, and reducing dimensionality, we improve the accuracy
and efficiency of machine learning models. Effective data preprocessing leads to better insights
and more reliable predictions.
Types of Anomalies:
1. Point Anomalies
Definition: A single data point that significantly differs from the rest of the data.
Example:
● Credit Card Fraud: A customer who usually makes small purchases suddenly makes an
expensive purchase from a foreign country.
Example Flow: Credit Card Fraud Detection
1. Normal Behavior:
A customer, let's call them John, usually makes small purchases. For instance, he buys
groceries, coffee, and gas—items that are typical and within a certain price range. His
usual buying pattern is predictable and local.
2. Suspicious Activity:
One day, John's credit card is used to make a large purchase for an expensive item, like
a high-end watch, from a foreign country. This is completely different from his usual
spending habits.
3. Flagging the Transaction:
The bank's fraud detection system notices this sudden change in John's behavior. The
system has learned what typical spending patterns look like and flags this transaction as
suspicious because it doesn't fit John's usual profile.
4. Verification:
The bank or credit card company sends a notification to John, asking if he made the
purchase. It could be through a phone call, email, or a mobile app alert.
5. Action Taken:
○ If John confirms the purchase, everything proceeds as normal.
○ If John denies the purchase, the bank can freeze the account and begin
investigating further to prevent any potential fraud.
2. Collective Anomalies
Definition: A group of related data points that is anomalous as a whole, even though each individual point may appear normal.
Example:
● Server Downtime: A series of server failures occurring one after another, indicating a
possible cyber attack.
Example Flow: Server Downtime and Cyber Attack Detection
1. Normal Server Operation:
Normally, servers work smoothly, handling requests and providing services without
interruption. Occasionally, a server might experience a brief glitch or downtime, but it is
usually resolved quickly.
2. Unexpected Series of Failures:
One day, a series of server failures start happening one after another. Servers that were
previously stable begin to go down in quick succession. This pattern is unusual and
suggests something is not quite right.
3. Flagging Suspicious Activity:
The system monitoring the servers notices that these failures are occurring too frequently
and in a pattern that’s unlike typical server issues. This could indicate that something
intentional is happening, such as a cyber attack trying to overload or disrupt the servers.
4. Detecting a Possible Cyber Attack:
The sequence of failures might look like a Denial of Service (DoS) attack, where
multiple servers are targeted in an attempt to bring down the entire system. The
monitoring system flags this behavior as a potential cyber attack.
5. Action Taken:
○ The system sends an immediate alert to IT security teams.
○ The security team investigates further, potentially isolating affected servers or
implementing countermeasures (like blocking suspicious traffic).
○ The servers may be restored to normal operation once the attack is mitigated.
1. Statistical Approaches
● Z-Score: Measures how many standard deviations a data point is from the mean, Z = (x - μ) / σ. Data points with a Z-score above a certain threshold (commonly |Z| > 3) are flagged as anomalies.
● Example: A user’s purchase behavior on an e-commerce site is compared to the average
to detect abnormal spending.
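To make the Z-score idea concrete, here is a minimal sketch in Python; the synthetic purchase amounts and the threshold of 3 are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
purchases = np.append(rng.normal(22, 3, 200), 500.0)   # typical purchases plus one very large one

z_scores = (purchases - purchases.mean()) / purchases.std()

threshold = 3                                           # common rule of thumb
anomalies = purchases[np.abs(z_scores) > threshold]
print(anomalies)                                        # only the 500 purchase is flagged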
2. Isolation Forest: An algorithm that detects anomalies by isolating rare data points. Because anomalies are few and different, they are easier to isolate, which makes the algorithm efficient and effective for identifying outliers, especially in large and high-dimensional datasets (a minimal code sketch follows the list below).
● Random Partitioning: The algorithm randomly selects a feature and then a value to
split the data into two parts. The process of partitioning continues recursively,
generating isolation trees.
● Tree Building: Each tree is constructed by repeating the random partitioning
process. The depth of a tree determines how quickly a point can be isolated.
Anomalies tend to be isolated closer to the root (shallower depth).
● Anomaly Scoring: The average path length to isolate a point is calculated across all
trees. If a point has a short average path length, it is an anomaly (isolated quickly).
If a point has a long path length, it is normal (isolated slowly).
● Other machine learning approaches for anomaly detection include:
○ One-Class SVM
○ DBSCAN (Density-Based Clustering)
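Below is a minimal sketch of the Isolation Forest approach using scikit-learn's IsolationForest; the synthetic dataset, contamination value, and number of trees are assumptions for illustration.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0, scale=1, size=(200, 2))      # typical behaviour
outliers = rng.uniform(low=6, high=8, size=(5, 2))      # rare, far-away points
X = np.vstack([normal, outliers])

model = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
labels = model.fit_predict(X)    # -1 = anomaly, 1 = normal

print("Detected anomalies:", X[labels == -1])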
How Autoencoders Work
1. Input Data: The network takes in data, such as images, text, or numerical data.
2. Encoding: The encoder reduces the dimensionality of the data by mapping it into a
compact, low-dimensional space.
3. Reconstruction: The decoder attempts to rebuild the original input data from the
compressed representation.
4. Optimization: The network is trained using backpropagation to minimize the
reconstruction error, thus learning an efficient representation of the data.
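As a rough illustration of these four steps, here is a minimal autoencoder sketch assuming TensorFlow/Keras is available; the layer sizes, synthetic data, and training settings are arbitrary choices, and the reconstruction error at the end shows how such a network can flag unusual inputs.

import numpy as np
from tensorflow import keras

input_dim, encoding_dim = 20, 4

inputs = keras.Input(shape=(input_dim,))
encoded = keras.layers.Dense(encoding_dim, activation="relu")(inputs)    # encoder: compress
decoded = keras.layers.Dense(input_dim, activation="linear")(encoded)    # decoder: reconstruct
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")                        # minimize reconstruction error

X = np.random.rand(500, input_dim).astype("float32")                     # synthetic data for illustration
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)               # input is also the target

reconstruction = autoencoder.predict(X, verbose=0)
errors = np.mean((X - reconstruction) ** 2, axis=1)                      # large error -> unusual input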
Let’s imagine you want to generate realistic images of cats using GANs.
1. Data Collection: First, you gather a large dataset of real cat images.
2. Generator Creation: The Generator takes random noise and tries to create an image of a
cat.
3. Discriminator Creation: The Discriminator evaluates whether an image is a real cat
image or a fake one generated by the Generator.
4. Training: Both the Generator and the Discriminator are trained together. The Generator
tries to create better and more realistic cat images to fool the Discriminator, while the
Discriminator gets better at telling the difference between real and fake images.
5. Outcome: After many iterations, the Generator learns to produce cat images that are
nearly indistinguishable from real ones.
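The adversarial loop above can be sketched in code. The example below assumes TensorFlow/Keras and uses synthetic 2-D points as a stand-in for real cat images; the network sizes, learning rates, and number of steps are arbitrary.

import numpy as np
import tensorflow as tf
from tensorflow import keras

latent_dim = 8
generator = keras.Sequential([          # noise -> fake sample
    keras.Input(shape=(latent_dim,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(2),
])
discriminator = keras.Sequential([      # sample -> probability it is real
    keras.Input(shape=(2,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

g_opt, d_opt = keras.optimizers.Adam(1e-3), keras.optimizers.Adam(1e-3)
bce = keras.losses.BinaryCrossentropy()

real_data = np.random.normal(3.0, 0.5, size=(1000, 2)).astype("float32")  # stand-in for "real" images
batch = 64

for step in range(200):
    # 1) Train the discriminator on real vs. generated samples
    noise = tf.random.normal((batch, latent_dim))
    fake = generator(noise)
    real = real_data[np.random.randint(0, len(real_data), batch)]
    with tf.GradientTape() as tape:
        d_loss = bce(tf.ones_like(discriminator(real)), discriminator(real)) + \
                 bce(tf.zeros_like(discriminator(fake)), discriminator(fake))
    grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))

    # 2) Train the generator to make the discriminator output "real"
    noise = tf.random.normal((batch, latent_dim))
    with tf.GradientTape() as tape:
        g_loss = bce(tf.ones_like(discriminator(generator(noise))), discriminator(generator(noise)))
    grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(grads, generator.trainable_variables))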
Evaluation Metrics
1. Precision
Precision is the ratio of correctly predicted positive observations to the total predicted positives. It answers the question: "Of all the instances the model predicted as positive, how many are actually positive?"
Formula:
Precision = TP / (TP + FP), where TP = true positives and FP = false positives.
Example:
If a model predicts 100 positive cases, and 90 of them are actually positive, the precision is 90%.
2. Recall
Recall is the ratio of correctly predicted positive observations to all actual positives. It answers the question: "Of all the actual positive instances, how many did the model correctly identify?"
Formula:
Recall = TP / (TP + FN), where FN = false negatives.
Example:
If the model identifies 90 out of 100 actual positive cases, the recall is 90%.
● High recall means fewer false negatives, and the model is better at identifying positive
cases.
3. F1-Score
F1-Score is the harmonic mean of Precision and Recall. It balances both metrics and is a good
measure when there is an uneven class distribution (imbalanced dataset).
Formula:
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Example:
If the model has high precision but low recall, the F1-score will help highlight this balance.
● A higher F1-score indicates that both precision and recall are reasonably balanced.
4. ROC Curve and AUC
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate against the False Positive Rate at different classification thresholds.
● True Positive Rate (Recall): The proportion of actual positives correctly identified by the model.
● False Positive Rate: The proportion of actual negatives that are incorrectly classified as
positives.
AUC is the area under the ROC curve. It is a single value that summarizes the performance of
the classifier across all possible thresholds.
Example:
Imagine a binary classification problem where we want to predict if an email is spam (positive)
or not spam (negative). The dataset is highly imbalanced, with 90% non-spam emails and 10%
spam emails.
1. Precision: Out of all the emails predicted as spam, how many were actually spam?
2. Recall: Out of all the actual spam emails, how many did the model correctly identify?
3. F1-Score: A balance between precision and recall.
4. ROC Curve & AUC: A curve showing how the true positive rate varies with the false
positive rate at different thresholds, and the area under the curve tells how well the model
distinguishes between spam and non-spam.
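A minimal sketch of computing these metrics with scikit-learn follows; the true labels, predictions, and predicted probabilities are made up for illustration.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]                       # 1 = spam, 0 = not spam
y_pred  = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]                       # hard predictions from the model
y_score = [0.9, 0.2, 0.1, 0.8, 0.3, 0.4, 0.2, 0.6, 0.7, 0.1]   # predicted spam probabilities

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))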
3. Text Mining
● Text Preprocessing
● Text Classification
● Topic Modeling
How Does the Text Mining Process Work? Key Steps Involved
Have you ever wondered how businesses uncover hidden insights from massive piles of text
data? Industries like healthcare, finance, e-commerce, and even entertainment heavily rely on
text mining in data mining to transform unstructured data into actionable intelligence.
Across these sectors, the process involves structured steps that consistently convert raw text into
meaningful insights. To understand this transformative journey, here are the key steps involved in
text mining, explained in detail.
The first step is gathering raw, unstructured data from multiple sources. The objective here is to
compile text data relevant to the problem or task.
After gathering data, preprocessing ensures the text is clean and ready for analysis. This step
removes noise and standardizes the text.
● Cleanup: Removes unnecessary elements like HTML tags, advertisements, and binary
formats from the text.
● Tokenization: Splits text into smaller units, such as words or sentences, for analysis.
● Stop-word Removal: Eliminates common but contextually irrelevant words like "the" or
"and."
● Stemming and Lemmatization: Converts words to their root forms (e.g., "running" to
"run") to reduce complexity.
After cleaning the data, the next step is representing the data in a way algorithms can interpret.
The objective of this step is to convert clean text into numerical or symbolic formats that are
usable by machine learning models.
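A minimal sketch of this step with scikit-learn is shown below, turning a few made-up documents into Bag-of-Words counts and TF-IDF weights.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the delivery was fast", "fast delivery great service", "terrible service slow delivery"]

bow = CountVectorizer()
X_counts = bow.fit_transform(docs)        # Bag-of-Words: raw term counts
print(bow.get_feature_names_out())
print(X_counts.toarray())

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)       # TF-IDF: counts weighted by how rare a term is
print(X_tfidf.toarray().round(2))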
This step involves applying analytical techniques to derive insights and patterns from the data.
The objective is to uncover hidden knowledge and actionable insights.
● Text Classification: Assigns predefined categories, such as spam vs. non-spam emails.
● Clustering: Groups similar documents without predefined categories.
● Sentiment Analysis: Identifies emotional tones, such as positive, negative, or neutral
sentiments.
● Named Entity Recognition (NER): Detects entities like names, locations, and dates
within text.
● Topic Modeling: Identifies hidden themes in text collections using algorithms like LDA.
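As one concrete example of these techniques, here is a minimal topic-modeling sketch with scikit-learn's LatentDirichletAllocation; the tiny corpus and the choice of two topics are assumptions for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the bank approved the loan and the interest rate",
    "loan interest payments at the bank increased",
    "the team won the football match last night",
    "the match was decided by a late goal from the team",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words for each discovered topic
terms = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-4:]]
    print(f"Topic {idx}: {top_words}")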
Once analysis is complete, the next step is evaluating how effective the results are. The objective
here is to measure the accuracy and relevance of the results from text analysis.
A commonly used metric here is Precision: the proportion of relevant results among the retrieved ones.
After evaluating, the goal of this step is to present findings in a visually intuitive format that
stakeholders can easily understand.
This final step focuses on improving the accuracy and relevance of results through
experimentation and fine-tuning.
By following these structured steps, text mining in data mining transforms unstructured text into
impactful insights, empowering businesses to make smarter decisions.
Industries today depend on text mining in data mining to extract valuable insights from vast
amounts of text data. From uncovering customer sentiments to identifying fraudulent activities,
text mining techniques are at the core of transforming unstructured text into actionable
intelligence.
These techniques work by breaking down text into structured forms and applying advanced
algorithms to find patterns, relationships, and meanings.
Below are the most important techniques used in text mining, explained in detail.
Information retrieval focuses on extracting relevant information from large text datasets. It
enables users to find the most relevant content based on queries or predefined parameters.
● Tokenization: Breaks text into smaller units like words or sentences for easier processing.
● Stemming and Lemmatization: Reduces words to their root forms, ensuring consistency
in analysis.
● Pattern Matching: Identifies specific terms or phrases using algorithms, such as keyword
searches.
Information retrieval is extensively used in search engines and library catalog systems to provide
relevant results.
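A minimal retrieval sketch follows, ranking a few made-up documents against a query with TF-IDF vectors and cosine similarity.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine learning improves fraud detection in banking",
    "The library catalog lists books on data mining",
    "Search engines rank pages by relevance to the query",
]
query = "data mining books"

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(documents)      # vectorize the collection
query_vec = vec.transform([query])             # vectorize the query in the same space

scores = cosine_similarity(query_vec, doc_matrix).ravel()
best = scores.argmax()
print("Most relevant document:", documents[best], "(score:", round(scores[best], 2), ")")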
NLP enables computers to understand, interpret, and respond to human language. It bridges the
gap between raw text and semantic meaning.
● Part-of-Speech (POS) Tagging: Assigns grammatical tags like nouns and verbs to tokens,
adding semantic depth.
● Named Entity Recognition (NER): Identifies entities such as names, dates, and locations
for context-specific insights.
● Sentiment Analysis: Determines emotional tones (positive, negative, or neutral) in text,
useful for customer feedback.
● Text Summarization in NLP: Generates concise summaries of long texts for quick
consumption.
NLP powers chatbots, virtual assistants, and customer service automation.
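Below is a minimal NLP sketch assuming NLTK; it shows POS tagging and VADER sentiment analysis on a made-up sentence, and the required NLTK resources (punkt, averaged_perceptron_tagger, vader_lexicon) must be downloaded once with nltk.download.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

sentence = "The new phone is amazing but the battery life is disappointing."

tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))              # part-of-speech tags, e.g. ('phone', 'NN')

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores(sentence))     # positive/negative/neutral/compound sentiment scores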
Information extraction identifies specific pieces of information from text and transforms them
into structured data for analysis.
● Feature Extraction: Generates new dimensions or variables from text, such as extracting
keywords from reviews.
● Feature Selection: Reduces dimensionality by keeping only the most significant features
for analysis.
IE is widely used in extracting data from legal documents, research papers, or social media posts.
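As a simple illustration, the sketch below uses regular expressions to extract dates and email addresses from made-up free text and store them as structured data.

import re

text = ("The contract was signed on 2023-05-14 by the supplier. "
        "Questions can be sent to legal@example.com before 2023-06-01.")

dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)          # extract ISO-style dates
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)   # extract email addresses

record = {"dates": dates, "emails": emails}             # structured output ready for analysis
print(record)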
How do businesses, hospitals, and social platforms unlock the secrets buried in text data? From
improving customer experiences to diagnosing diseases, what is text mining, if not a gateway to
revolutionary solutions?
Below, you’ll explore how text mining is transforming industries through its diverse applications.
In customer service, companies use text mining to analyze support tickets and customer feedback at scale.
In healthcare, text mining processes clinical notes, medical records, and research papers. It aids
in diagnosing conditions, predicting disease outbreaks, and discovering new treatments.
Social media platforms generate mountains of unstructured data. Text mining uncovers trends,
tracks brand sentiment, and predicts market shifts from this data. It’s a key tool for digital
marketers and strategists.
While text mining in data mining opens doors to analyzing vast amounts of unstructured data, it
comes with its share of complexities. On one hand, it offers scalability, automation, and valuable
insights.
On the other, challenges like data quality issues, processing costs, and ethical concerns demand
careful attention. This balance of benefits and limitations defines what is text mining and how
organizations approach its adoption.
The advantages of text mining demonstrate its immense potential across industries. Here are
some advantages of text mining.
● Efficient data analysis: Text mining automates the extraction of insights from massive
datasets, saving time. For example, customer service platforms use it to analyze millions
of support tickets for recurring issues.
● Enhanced decision-making: Text mining helps businesses make informed choices by
uncovering trends. Predicting customer behavior from product reviews is one such
impactful application.
● Cost savings: Automating labor-intensive tasks, like sorting through legal contracts,
reduces manual effort and lowers costs for companies in finance and law.
● Improved accuracy: Advanced models ensure precision in sentiment analysis, such as
identifying customer satisfaction in feedback surveys.
● Cross-industry application: From predicting disease outbreaks in healthcare to detecting
fraud in banking, text mining adapts to various sectors.
Despite its advantages, text mining faces limitations that organizations must tackle carefully.
Here are some of its disadvantages.
● Data quality issues: Unstructured data often includes errors or inconsistencies. For
instance, misspellings in social media posts can skew sentiment analysis.
● High processing costs: Implementing advanced models, especially for large datasets,
demands significant computational power, which can strain budgets.
● Ethical concerns: Mining sensitive data, like patient records or private messages, raises
serious privacy issues and risks non-compliance with regulations.
● Dependence on domain knowledge: Text mining requires industry-specific expertise to
interpret results accurately, such as understanding medical terminology in healthcare.
● Complexity of interpretation: The insights generated are not always straightforward. For
example, topic modeling outputs often need expert review to contextualize results.
● Time Series Analysis involves analyzing data points collected or recorded at specific time
intervals. Unlike random observations, time series data maintains an inherent temporal
order, making it essential for trend analysis, forecasting, and anomaly detection.
● Applications of Time Series Analysis:
○ Weather Forecasting: Predicting future weather patterns based on historical data.
○ Stock Market Prediction: Forecasting stock prices and trends using past market
data.
○ Economic and Sales Forecasting: Estimating revenue, demand, and economic
indicators.
○ Healthcare Monitoring: Analyzing patient vitals over time for anomaly
detection.
○ IoT and Sensor Data Analysis: Monitoring smart devices and industrial systems.
Time series data consists of various components that help in identifying patterns:
1. Trend:
○ The long-term upward or downward movement in the data.
○ Example: A company’s revenue growth over years.
2. Seasonality:
○ Repeating patterns or cycles over a fixed period (e.g., daily, weekly, yearly).
○ Example: Retail sales increase during festive seasons.
3. Cyclic Patterns:
○ Irregular fluctuations that do not follow a fixed time period.
○ Example: Business cycles influenced by economic conditions.
4. Noise:
○ Random variations that do not follow any pattern.
○ Example: Sudden spikes due to external factors like news or events.
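These components can be separated programmatically. The sketch below assumes the statsmodels library and uses a synthetic monthly series built from a trend, a yearly seasonal pattern, and noise.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")        # 4 years of monthly data
trend = np.linspace(100, 160, 48)                                # upward trend
season = 10 * np.sin(2 * np.pi * np.arange(48) / 12)             # yearly seasonality
noise = np.random.default_rng(0).normal(0, 2, 48)                # random noise
series = pd.Series(trend + season + noise, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())      # estimated trend component
print(result.seasonal.head())            # estimated seasonal component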
Popular modeling approaches range from classical statistical methods to tools such as Facebook Prophet for forecasting and autoencoders for anomaly detection.
Evaluation Metrics
To assess the accuracy of time series models, the following metrics are commonly used:
● Mean Absolute Error (MAE)
○ MAE = (1/n) Σ |actual_i - predicted_i|
○ Measures the average absolute difference between actual and predicted values.
● Root Mean Squared Error (RMSE)
○ RMSE = √[ (1/n) Σ (actual_i - predicted_i)² ]
○ Penalizes larger errors more than MAE.
● Mean Absolute Percentage Error (MAPE)
○ MAPE = (100/n) Σ |(actual_i - predicted_i) / actual_i|
○ Expresses error as a percentage, making it scale-independent.
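A minimal sketch of computing MAE, RMSE, and MAPE with NumPy and scikit-learn follows; the actual and predicted values are made up.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual    = np.array([100, 110, 120, 130, 140])
predicted = np.array([102, 108, 123, 128, 137])

mae  = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
mape = np.mean(np.abs((actual - predicted) / actual)) * 100

print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  MAPE: {mape:.2f}%")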
Time series analysis is an essential tool in data science for making data-driven predictions and
identifying patterns over time. With traditional statistical methods and advanced AI-based
models, time series forecasting has become a crucial element in fields like finance, healthcare,
and IoT.