Introduction to Data Mining
Data mining is the process of discovering useful patterns, relationships, and
trends from large amounts of data. It helps businesses, researchers, and
organizations make better decisions by analyzing past data.
There are two main types of data mining techniques:
1. Predictive Data Mining
2. Descriptive Data Mining
1. Predictive Data Mining Techniques
Predictive data mining is used to predict future events or unknown values
based on past data. It works by learning from historical data and applying this
knowledge to new data.
It is called "supervised learning" because the model is trained using labeled
data, where both input (features) and output (target) values are known.
Common Predictive Data Mining Techniques
1. Classification
Classification is a technique that categorizes data into predefined classes based
on patterns. It builds models that can classify new data based on learned
relationships. It is used in spam detection, medical diagnosis, and customer
churn prediction. Common algorithms include Decision Trees, Naïve Bayes, and
Support Vector Machines (SVM).
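As a concrete illustration, the sketch below trains a decision tree classifier on scikit-learn's bundled Iris dataset; the dataset, split ratio, and tree depth are assumptions chosen only to keep the example small.

```python
# Minimal classification sketch (assumed setup: Iris data stands in for labeled business data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                       # features and known class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)                               # learn class boundaries from labeled data
pred = clf.predict(X_test)                              # classify unseen records
print("Accuracy:", accuracy_score(y_test, pred))
```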
2. Regression
Regression is a statistical technique used to predict continuous numerical
values. It identifies relationships between dependent and independent
variables to estimate unknown values. It is used in house price estimation,
sales forecasting, and stock price prediction. Popular techniques include
Linear Regression and Polynomial Regression.
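A minimal regression sketch is shown below; the house sizes and prices are invented toy values used only to illustrate fitting a line to a continuous target.

```python
# Minimal linear regression sketch (toy house-size/price figures, assumed for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[50], [80], [120], [160], [200]])     # independent variable (square metres)
prices = np.array([110, 160, 230, 310, 390])            # dependent variable (thousands of dollars)

model = LinearRegression()
model.fit(sizes, prices)                                # estimate slope and intercept
print(model.predict([[100]]))                           # predicted price for a 100 m² house
```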
3. Time Series Analysis
Time series analysis is used to study data points collected over time to identify
trends and patterns. It helps in forecasting future values based on past
observations. It is widely used in weather forecasting, stock market analysis,
and economic predictions. Models like ARIMA and LSTMs help in forecasting
time-based data.
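The sketch below fits an ARIMA model with the statsmodels library, one common way to implement this idea; the short synthetic series and the (1, 1, 1) order are illustrative assumptions rather than tuned choices.

```python
# Hedged ARIMA sketch: synthetic monthly observations, untuned (1, 1, 1) order.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic upward-trending series standing in for historical observations.
history = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
                   dtype=float)

model = ARIMA(history, order=(1, 1, 1))   # autoregressive, differencing, and moving-average terms
fitted = model.fit()
print(fitted.forecast(steps=3))           # forecast the next three periods
```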
4. Neural Networks & Deep Learning
These models are inspired by the human brain and can learn complex patterns.
They are used in image recognition, speech recognition, and customer behaviour
prediction.
Example: Facial recognition in smartphones.
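For illustration, the sketch below trains a small feed-forward network with scikit-learn's MLPClassifier on the bundled digits dataset; the single hidden layer and iteration count are arbitrary assumptions kept small for readability.

```python
# Small neural-network sketch (assumed setup: digits data, one hidden layer of 64 units).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)                     # 8x8 grayscale digit images
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
net.fit(X_train, y_train)                               # learn non-linear patterns from labeled images
print("Test accuracy:", net.score(X_test, y_test))
```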
Advantages of Predictive Data Mining
1. Future Trend Prediction – Helps organizations forecast future trends and
make data-driven decisions.
2. Risk Management – Identifies potential risks in finance, healthcare, and
business operations.
3. Improved Decision-Making – Assists in strategic planning by analyzing past
patterns.
4. Enhanced Customer Targeting – Helps businesses personalize marketing
strategies based on predicted behaviours.
5. Fraud Detection – Detects anomalies in transactions, reducing financial
losses due to fraud.
2. Descriptive Data Mining Techniques
Descriptive data mining is used to analyze past data and find hidden patterns,
relationships, or trends. Unlike predictive mining, it does not try to make
future predictions.
It is called "unsupervised learning" because the model is not given predefined
labels or output categories. Instead, it finds patterns on its own.
Common Descriptive Data Mining Techniques
1. Clustering
Clustering is a technique that groups similar data points together based on
shared characteristics. It helps in identifying natural structures within a dataset.
It is used in customer segmentation, image recognition, and fraud detection.
Common algorithms include K-Means, DBSCAN, and Hierarchical Clustering.
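The sketch below groups synthetic points with K-Means; the number of clusters and the blob-generation parameters are assumptions made for the example.

```python
# Minimal K-Means sketch on synthetic, unlabeled points (parameters assumed for illustration).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # unlabeled data points

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)                  # assign each point to one of three clusters
print(labels[:10])
print(km.cluster_centers_)                  # coordinates of the discovered group centres
```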
2. Association Rule Mining
Association rule mining discovers relationships between variables in large
datasets. It identifies patterns that frequently occur together within the data. It
is commonly used in market basket analysis, where businesses discover
product purchase patterns. Apriori and FP-Growth are popular algorithms.
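The core idea can be shown without any library: the sketch below computes support and confidence for a hypothetical rule {bread} → {butter} over a toy list of baskets.

```python
# Support and confidence for the rule {bread} -> {butter}; the baskets are invented toy data.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n            # how often bread and butter are bought together
confidence = both / bread     # how often butter is bought, given that bread is bought
print(f"support={support:.2f}, confidence={confidence:.2f}")
```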
3. Summarization
Summarization provides a compact and meaningful representation of large
datasets. It reduces complexity by extracting key insights and essential
features. It is used in business intelligence, text mining, and report generation
to extract key insights from large data volumes.
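A very small summarization sketch using pandas is shown below; the sales table is an invented example, and describe() plus a grouped total stand in for richer business-intelligence summaries.

```python
# Minimal summarization sketch (toy sales table assumed for illustration).
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "sales":  [250, 310, 190, 420, 280],
})

print(df["sales"].describe())                  # count, mean, std, min, quartiles, max
print(df.groupby("region")["sales"].sum())     # compact per-region totals
```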
4. Anomaly Detection
Anomaly detection identifies unusual patterns or outliers in data. It helps in
recognizing deviations that differ significantly from the expected behaviour. It is
widely used in fraud detection, cybersecurity, and medical diagnosis to spot
suspicious or abnormal behaviour. Common techniques include Isolation
Forests and One-Class SVM.
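The sketch below flags an outlier in a toy list of transaction amounts with an Isolation Forest; the amounts and the contamination rate are illustrative assumptions.

```python
# Anomaly-detection sketch: invented transaction amounts with one obvious outlier.
import numpy as np
from sklearn.ensemble import IsolationForest

amounts = np.array([[25], [30], [27], [22], [31], [29], [26], [950]])   # last value is anomalous

iso = IsolationForest(contamination=0.15, random_state=0)
flags = iso.fit_predict(amounts)    # -1 marks anomalies, 1 marks normal points
print(flags)
```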
Advantages of Descriptive Data Mining
1. Pattern Identification – Discovers hidden patterns and relationships in large
datasets.
2. Better Data Understanding – Summarizes complex datasets, making them
easier to interpret.
3. Improved Business Insights – Helps organizations understand customer
behaviour and market trends.
4. Efficient Data Segmentation – Groups similar data points for better
organization and analysis.
5. Anomaly Detection – Identifies unusual patterns that could indicate errors
or fraud.
Supervised Learning
Supervised Learning is a machine learning technique where a model is trained
using labeled data, meaning each input has a corresponding correct output. The
algorithm learns patterns from this data and applies them to make predictions
for new inputs.
Types of Supervised Learning
Supervised learning is classified into two categories of algorithms:
• Regression: A regression problem is when the output variable is a real
value, such as “dollars” or “weight”.
• Classification: A classification problem is when the output variable is a
category, such as “red” or “blue”, or “disease” or “no disease”.
1- Regression
Regression is a type of supervised learning that is used to predict continuous
values, such as house prices, stock prices, or sales figures. Regression
algorithms learn a function that maps from the input features to the output
value.
Some common regression algorithms include:
• Linear Regression
• Polynomial Regression
• Support Vector Machine Regression
• Decision Tree Regression
• Random Forest Regression
2- Classification
Classification is a type of supervised learning that is used to predict categorical
values, such as whether a customer will churn or not, whether an email is spam
or not, or whether a medical image shows a tumor or not. Classification
algorithms learn a function that maps from the input features to a probability
distribution over the output classes.
Some common classification algorithms include:
1. Logistic Regression
2. Support Vector Machines
3. Decision Trees
4. Random Forests
5. Naive Bayes
Applications of Supervised Learning
1. Spam Detection – Email services use supervised learning to classify emails
as spam or not spam based on keywords, sender details, and patterns.
2. Medical Diagnosis – AI models analyze patient data to classify diseases like
diabetes, cancer, or heart disease, aiding doctors in early detection.
3. Fraud Detection – Banks and financial institutions use supervised learning
to identify fraudulent transactions by analyzing spending patterns.
4. Sentiment Analysis – Companies use supervised learning to analyze
customer reviews and classify them as positive, neutral, or negative for
brand insights.
5. Speech Recognition – Virtual assistants like Siri, Alexa, and Google
Assistant use supervised learning to convert speech into text and
understand user commands.
6. Stock Market Prediction – Supervised learning models predict stock prices
based on historical data, trends, and market conditions.
7. Image Recognition – Used in facial recognition and object detection for
security, authentication, and autonomous vehicles.
8. Weather Forecasting – Predicts temperature, rainfall, and storms using
past weather data and supervised learning models.
Advantages of Supervised Learning
1. High Accuracy – Since the model learns from labeled data, predictions are
generally more accurate compared to unsupervised learning.
2. Predictive Power – It is useful for both classification (e.g., spam detection)
and regression (e.g., house price prediction) tasks.
3. Well-Defined Problem Solving – Works well when there is a clear
relationship between input and output.
4. Real-World Applications – Used in medical diagnosis, fraud detection,
recommendation systems, and more.
5. Easy Performance Evaluation – The accuracy of the model can be measured
using metrics like accuracy, precision, recall, and F1-score.
Disadvantages of Supervised Learning
1. Requires Labeled Data – Training requires a large amount of labeled data,
which can be expensive and time-consuming to collect.
2. Limited Generalization – The model may not perform well on unseen or
real-world data if the training data is not diverse enough.
3. Overfitting – The model may learn patterns too well and fail on new data,
reducing its effectiveness.
4. Computationally Expensive – Some supervised learning algorithms (e.g.,
deep learning) require high computational power and memory.
5. Not Ideal for Complex Patterns – It struggles with datasets where patterns
are not well-defined or structured.
Unsupervised Learning
Unsupervised learning is a type of machine learning that learns from unlabeled
data. This means that the data does not have any pre-existing labels or
categories. The goal of unsupervised learning is to discover patterns and
relationships in the data without any explicit guidance.
Types of Unsupervised Learning
Unsupervised learning is mainly classified into two types: Clustering and
Association Rule Learning.
1. Clustering
Clustering is an unsupervised learning technique that groups similar data points
based on patterns and similarities. It is used when there are no predefined
labels, and the model identifies natural groupings in the data.
Example: Customer segmentation in marketing, where businesses categorize
customers based on their purchasing behavior.
Common Clustering Algorithms:
• K-Means Clustering
• Hierarchical Clustering
• DBSCAN (Density-Based Clustering)
2. Association Rule Learning
Association rule learning discovers hidden relationships between data items in
large datasets. It identifies patterns like "if X happens, then Y is likely to
happen," helping in decision-making. Example: Market Basket Analysis, where
stores identify products that are frequently bought together.
Common Association Algorithms:
• Apriori Algorithm
• FP-Growth Algorithm
Applications of Unsupervised Learning
1. Customer Segmentation – It is used to group customers based on shopping
habits, helping businesses send personalized offers.
2. Product Recommendation – It is used by platforms like Amazon and Netflix
to suggest products, movies, or music based on user preferences.
3. Fraud Detection – It is used by banks to identify unusual transactions that
may indicate fraudulent activity.
4. Social Media Analysis – It is used to detect trends, group similar users, and
identify fake or spam accounts.
5. Self-Driving Cars – It is used to recognize pedestrians, traffic signs, and other
vehicles for safe autonomous driving.
6. Medical Research & Diagnosis – It is used to find patterns in diseases and
symptoms, helping in early detection and treatment.
7. Stock Market Analysis – It is used to group stocks with similar price
movements for better investment strategies.
8. Image Recognition – It is used in applications like facial recognition, object
detection, and medical imaging.
9. Document Clustering – It is used to automatically categorize news articles,
research papers, or emails into relevant topics.
10. Cybersecurity – It is used to detect unusual network activities and prevent
cyberattacks.
Advantages of Unsupervised Learning
1. No Need for Labeled Data – It works with unlabeled data, reducing the
time and cost of data labeling.
2. Identifies Hidden Patterns – It finds unknown relationships in data that
might not be obvious.
3. Useful for Complex Data – Works well with large and high-dimensional
datasets, such as images or text.
4. Can Adapt to New Data – Since it doesn’t rely on predefined labels, it can
detect new trends and patterns over time.
5. Helps in Data Exploration – It is useful for data analysis and preprocessing,
like finding clusters or anomalies before applying other models.
Disadvantages of Unsupervised Learning
1. Less Accuracy – Since there are no predefined labels, the results may not
always be precise.
2. Difficult to Interpret – The model might find patterns, but understanding
their meaning can be challenging.
3. Risk of Overfitting – Sometimes, the algorithm creates too many groups or
finds false patterns that don’t exist in reality.
4. Computationally Expensive – Some algorithms, like clustering, require high
processing power for large datasets.
5. No Direct Output Labels – Unlike supervised learning, the model does not
provide clear answers but rather groups or associations that need further
interpretation.
Process of Knowledge Discovery in Databases (KDD)
Knowledge Discovery in Databases (KDD) is the systematic process of
extracting useful knowledge from large datasets. It involves collecting relevant
data, cleaning it to remove errors and inconsistencies, and transforming it into
a suitable format for analysis. Data mining techniques are then applied to find
patterns, trends, or relationships hidden in the data. These patterns are then
evaluated and interpreted for their significance. The final goal is to use these
insights to support decision-making and gain actionable knowledge from the
data.
KDD Process Flow:
1 Data Selection → 2 Data Preprocessing → 3 Data Transformation → 4 Data
Mining → 5 Interpretation & Evaluation
1. Data Selection
In this step, relevant data is chosen from various sources such as databases,
files, or external sources. Since datasets are usually large and contain
unnecessary information, only the data that is useful for the analysis is
selected. This helps in focusing on important data and making the analysis
more efficient.
2. Data Preprocessing
Raw data often has missing values, duplicates, or errors. In this step, the data is
cleaned by fixing these issues. Missing values can be filled in, duplicates are
removed, and errors are corrected. Preprocessing ensures that the data is
accurate and ready for further analysis.
3. Data Transformation
Here, the data is changed into a suitable format for analysis. This can include
normalizing values, where numbers are scaled to a common range, or selecting
important features and removing unnecessary ones. The goal is to make the
data easier to analyze and more relevant for mining techniques.
4. Data Mining
This is the core step where analytical techniques are applied to extract
patterns, trends, and relationships from the data. Algorithms such as
classification, clustering, association rule mining, and regression are used to
uncover hidden insights. The choice of technique depends on whether the goal
is prediction, pattern discovery, or anomaly detection. Data mining helps
organizations make informed decisions based on discovered patterns.
5. Interpretation & Evaluation
In this final step, the discovered patterns are evaluated for their usefulness and
accuracy. The results are checked to ensure that they are valid and meaningful.
Once validated, the insights are presented using reports or visualizations and
used for decision-making or further analysis.
Data Preprocessing Methods
Data preprocessing is a sub-step of the KDD process that improves the quality
of raw data before analysis. It ensures the data is clean, accurate, and
structured for efficient processing.
1. Data Cleaning
Data cleaning is the process of detecting and correcting errors in the dataset.
Raw data often contains missing values, duplicate entries, inconsistencies, and
noise. Cleaning involves:
• Handling Missing Data: Filling missing values using statistical methods
(mean, median) or removing incomplete records.
• Removing Duplicates: Eliminating redundant entries to avoid data
distortion.
• Correcting Inconsistencies: Standardizing different formats (e.g., date
formats, units).
• Handling Noise & Outliers: Removing extreme values that can mislead
analysis.
Data cleaning ensures that the dataset is accurate and reliable for further
processing.
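A minimal pandas sketch of these cleaning steps is shown below; the column names, the missing value, the duplicate row, and the outlier threshold are all invented for illustration.

```python
# Minimal data-cleaning sketch (toy table with a missing value, a duplicate, and an outlier).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":  [25, np.nan, 31, 31, 120],     # one missing value and one implausible outlier
    "city": ["NY", "LA", "LA", "LA", "NY"],
})

df = df.drop_duplicates()                          # remove redundant rows
df["age"] = df["age"].fillna(df["age"].median())   # fill missing values with the median
df = df[df["age"] < 100]                           # drop the extreme outlier (assumed threshold)
print(df)
```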
2. Data Integration
Data integration combines data from multiple sources into a single, unified
dataset. This is important when data is collected from different databases,
applications, or formats. Key aspects include:
• Schema Matching: Aligning similar fields from different datasets (e.g.,
"Customer_ID" vs. "Client_ID").
• Removing Redundancies: Eliminating duplicate or overlapping data.
• Resolving Conflicts: Addressing inconsistencies in units, formats, or naming
conventions.
A well-integrated dataset ensures smooth analysis and reduces errors in
processing.
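The sketch below joins two hypothetical sources whose key columns are named differently (Customer_ID vs. Client_ID), echoing the schema-matching example above; the tables themselves are invented.

```python
# Minimal data-integration sketch: aligning differently named keys from two toy sources.
import pandas as pd

orders = pd.DataFrame({"Customer_ID": [1, 2, 3], "amount": [100, 250, 80]})
crm    = pd.DataFrame({"Client_ID":   [1, 2, 4], "segment": ["A", "B", "A"]})

merged = orders.merge(crm, left_on="Customer_ID", right_on="Client_ID", how="left")
merged = merged.drop(columns=["Client_ID"])        # remove the now-redundant duplicate key
print(merged)
```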
3. Data Transformation
Data transformation converts data into a suitable format for analysis. This step
improves consistency and compatibility. Common techniques include:
• Normalization: Scaling numerical data to a common range (e.g., 0 to 1) to
ensure equal weightage.
• Encoding Categorical Data: Converting non-numeric values (e.g., Gender:
Male/Female) into numerical format (e.g., 0/1).
• Feature Engineering: Creating new variables from existing ones to enhance
data insights (e.g., extracting age from date of birth).
Transformation improves the efficiency and performance of machine learning
algorithms.
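A small sketch of normalization and categorical encoding with pandas and scikit-learn is shown below; the income and gender columns are assumptions used only for illustration.

```python
# Minimal transformation sketch: min-max normalization plus one-hot encoding (toy columns).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"income": [30000, 52000, 75000],
                   "gender": ["Male", "Female", "Male"]})

df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()  # scale to 0..1
df = pd.get_dummies(df, columns=["gender"])                                 # encode categories as 0/1
print(df)
```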
4. Data Reduction
Data reduction minimizes the size of a dataset while preserving important
information. This improves processing speed and efficiency. Common
techniques include:
• Dimensionality Reduction: Removing unnecessary features using
techniques like Principal Component Analysis (PCA).
• Sampling: Selecting a subset of data instead of using the full dataset,
reducing computation time.
• Data Compression: Storing data in a compact form without losing significant
details.
Data reduction is essential for handling large datasets efficiently.
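The sketch below reduces scikit-learn's 64-feature digits dataset to 10 principal components; the dataset and the number of components are illustrative assumptions.

```python
# Dimensionality-reduction sketch with PCA (digits data stands in for any high-dimensional table).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)         # 64 features per record
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)            # keep the 10 most informative components

print(X.shape, "->", X_reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum().round(3))
```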
5. Data Discretization
Data discretization converts continuous numerical values into categorical ones,
making them easier to analyze. It is useful in classification and decision-making
processes. Common techniques include:
• Equal-Width Binning: Dividing data into intervals of equal size.
• Equal-Frequency Binning: Grouping data into bins containing an equal
number of values.
• Clustering-Based Discretization: Using clustering techniques to form natural
groups in data.
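The sketch below shows equal-width and equal-frequency binning with pandas; the ages, bin counts, and labels are toy assumptions.

```python
# Discretization sketch: equal-width bins (pd.cut) and equal-frequency bins (pd.qcut) on toy ages.
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 60, 67])

equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])   # equal-width intervals
equal_freq  = pd.qcut(ages, q=4)                                           # quartile-based bins
print(equal_width.tolist())
print(equal_freq.value_counts())
```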