UNIT II
Introduction to Data Mining Techniques
Data mining is the process of extracting valuable patterns and knowledge from large datasets. To
achieve this, various methods and techniques are used, each suited for different types of data
structures and problem-solving needs. These techniques are grounded in different algorithmic
approaches and modeling strategies.
One fundamental distinction in data mining techniques is between parametric and
nonparametric models. Let’s explore these two approaches:
Parametric Models
Parametric models describe the relationship between input and output through algebraic
equations, where some parameters remain unspecified. These unknown parameters are typically
determined through training on input examples or data. While this approach is a well-established
and useful theoretical framework, it often proves too simplistic for complex, real-world data.
Additionally, parametric models assume a specific structure or form for the model in advance,
which may not always reflect the intricacies of the data being analyzed. For real-world problems,
the assumptions inherent in parametric models can limit their usefulness, particularly when the
exact nature of the data is not fully understood or cannot be easily modeled.
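As a concrete illustration (not from the text), simple linear regression is a parametric model: the form y ≈ w·x + b is fixed in advance, and only the parameters w and b are learned from the data. A minimal sketch using NumPy's least-squares solver, on made-up synthetic data:

```python
import numpy as np

# Synthetic data: the "true" relationship is y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 1, size=100)

# Parametric model: assume the form y = w*x + b in advance, then fit w and b.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"fitted parameters: w={w:.2f}, b={b:.2f}")
```

If the real relationship is not close to a straight line, this model cannot adapt, which is exactly the limitation described above.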
Nonparametric Models
Nonparametric models, by contrast, are more flexible and adaptable. These models do not rely
on predefined algebraic equations to describe relationships between input and output. Instead, the
model is built directly from the data itself, allowing for a data-driven approach. This means that
the model adapts dynamically as new data is introduced, making nonparametric techniques
particularly valuable in situations where data is complex, dynamic, or evolving.
Nonparametric methods excel in environments where large amounts of data are available. These
techniques require substantial datasets to create a robust model, and the modeling process itself
involves "sifting" through the data to detect patterns, relationships, or trends. This ability to build
models directly from data is crucial in database applications with vast amounts of continuously
changing data.
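By contrast, a nonparametric technique such as k-nearest-neighbours regression assumes no equation form at all; predictions are computed directly from the stored data. The sketch below is illustrative only (the data and the choice of k are made up):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5):
    """Predict y at x_query as the mean of the k nearest training points."""
    distances = np.abs(x_train - x_query)   # distance to every stored example
    nearest = np.argsort(distances)[:k]     # indices of the k closest points
    return y_train[nearest].mean()

# The model *is* the data: adding new examples to x_train/y_train
# immediately changes future predictions, with no refitting step.
rng = np.random.default_rng(1)
x_train = rng.uniform(0, 10, size=200)
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=200)
print(knn_predict(x_train, y_train, x_query=3.0))
```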
Advantages of Nonparametric Models
Adaptability: Nonparametric models don’t assume a specific form or structure for the
data, making them more flexible and applicable to a wider range of problems.
Scalability: These models can scale well with increasing amounts of data, improving in
accuracy as more data is input.
Dynamic Learning: Recent advances in machine learning have made nonparametric
techniques capable of dynamically learning as new data becomes available. This
continuous learning process enables the model to evolve in real time, making it
particularly suitable for applications involving large, dynamic datasets.
Examples of Nonparametric Techniques
Neural Networks: These are computational models inspired by the human brain, capable
of learning from large datasets by adjusting weights based on input data.
Decision Trees: A decision tree is a flowchart-like structure that makes decisions based on
input data, splitting the dataset at each node on the feature that best separates the records (a minimal sketch follows this list).
Genetic Algorithms: These algorithms use evolutionary principles such as selection,
mutation, and crossover to iteratively improve models and find optimal solutions.
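As a small, hedged illustration of one of these techniques, the sketch below fits a shallow decision tree with scikit-learn (assumed to be available); the feature values and class labels are invented for the example:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset: two numeric features (age, income) and a binary class label.
X = [[25, 40_000], [35, 60_000], [45, 80_000], [20, 20_000],
     [50, 90_000], [30, 30_000], [40, 75_000], [28, 52_000]]
y = [0, 1, 1, 0, 1, 0, 1, 1]

# Each internal node splits the records on the feature/threshold that best
# separates the classes; leaves hold the predicted class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[33, 55_000]]))
```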
A Statistical Perspective on Data Mining
In data mining, statistical concepts provide the foundation for many of the techniques used to
analyze and extract insights from data. These concepts are essential for understanding how data
can be processed, modeled, and interpreted. Below is a review of some important statistical
concepts that are commonly applied in data mining.
Point Estimation
Point estimation refers to the process of estimating a population parameter, denoted by θ,
using an estimate θ̂. For instance, we may want to estimate parameters such as the
mean, variance, or standard deviation of a population. These estimates are typically made based
on a sample of data from the population.
Bias of an Estimator: Bias is the difference between the expected value of the estimator
and the true value of the population parameter. The formula for bias is:
Bias = E(θ̂) − θ
An unbiased estimator is one where the bias is 0, meaning that the estimator accurately
predicts the true value on average. However, in larger datasets, most estimators tend to be
biased due to the complexities of real-world data.
Mean Squared Error (MSE): A key measure of the effectiveness of an estimate is the
Mean Squared Error (MSE), which quantifies the difference between the estimated
value and the true value. It is defined as the expected value of the squared difference
between the estimate and the actual value:
MSE(θ̂) = E[(θ̂ − θ)²]
The MSE gives a way to measure the accuracy of a prediction. For example, if the true
value is 10 and the predicted value is 5, the squared error would be:
(5 − 10)² = 25.
The squaring ensures that errors are always positive and that larger errors are weighted
more heavily.
In data mining, MSE is commonly used to evaluate the accuracy of prediction models,
particularly in machine learning applications.
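The following sketch (an illustration, not from the text) estimates the bias and MSE of the "divide by n" variance estimator by simulation; the true variance is known here because the samples are drawn from a normal distribution with σ² = 4:

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0                      # population variance (sigma = 2)
n, trials = 30, 20_000

estimates = np.empty(trials)
for t in range(trials):
    sample = rng.normal(loc=0.0, scale=2.0, size=n)
    estimates[t] = np.mean((sample - sample.mean()) ** 2)   # "divide by n" estimator

bias = estimates.mean() - true_var                 # Bias = E(theta_hat) - theta
mse = np.mean((estimates - true_var) ** 2)         # MSE = E[(theta_hat - theta)^2]
print(f"bias ~ {bias:.3f}, MSE ~ {mse:.3f}")       # bias should be near -true_var/n
```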
Confidence Intervals
Sometimes, rather than providing a single point estimate, it is more useful to estimate a range of
possible values within which the true parameter might lie. This range is called a confidence
interval. A confidence interval provides a measure of uncertainty about the estimate and is often
expressed with a certain confidence level (e.g., 95% confidence).
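As a hedged sketch (not from the text), an approximate 95% confidence interval for a population mean can be computed from a sample with the normal approximation x̄ ± 1.96·s/√n:

```python
import numpy as np

def mean_confidence_interval(data, z=1.96):
    """Approximate 95% CI for the mean using the normal approximation."""
    data = np.asarray(data, dtype=float)
    n = data.size
    mean = data.mean()
    se = data.std(ddof=1) / np.sqrt(n)      # standard error of the mean
    return mean - z * se, mean + z * se

sample = np.random.default_rng(0).normal(loc=10, scale=2, size=50)
print(mean_confidence_interval(sample))     # range likely to contain the true mean of 10
```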
Root Mean Square (RMS)
The Root Mean Square (RMS) is another statistic used to describe the magnitude of a set of
values. It is calculated by taking the square root of the average of the squared values in a dataset.
For a set of n values X = {x₁, x₂, …, xₙ}, the RMS is calculated as:
RMS = √((x₁² + x₂² + … + xₙ²) / n)
The RMS can be useful for estimating the overall magnitude of the values, particularly when
dealing with data that includes both positive and negative values.
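A minimal sketch of this calculation (illustrative, with made-up values):

```python
import numpy as np

def rms(values):
    """Root mean square: square root of the average of the squared values."""
    values = np.asarray(values, dtype=float)
    return np.sqrt(np.mean(values ** 2))

print(rms([3, -4, 3, -4]))   # ~3.54; the sign of each value is irrelevant after squaring
```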
Root Mean Square Error (RMSE)
In the context of data mining and machine learning, the Root Mean Square Error (RMSE) is
frequently used as an alternative way to estimate the error in a prediction. It is calculated by
taking the square root of the Mean Squared Error (MSE):
RMSE(θ̂) = √MSE(θ̂) = √(E[(θ̂ − θ)²])
RMSE provides a more interpretable measure of error, as it brings the unit of measurement back
to the same scale as the original data.
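A short illustrative sketch (not from the text) computing RMSE for a set of predictions against known true values:

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean square error: sqrt of the mean squared prediction error."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

print(rmse([5, 12, 9], [10, 10, 10]))   # errors -5, 2, -1 -> sqrt(30/3) ~ 3.16
```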
Jackknife Estimation
The jackknife is a popular resampling technique used to estimate the bias and variance of a
statistical estimate. The basic idea is to generate multiple estimates by systematically leaving out
one data point at a time from the dataset and calculating the estimate for each of these subsets.
Given a set of n values X = {x₁, x₂, …, xₙ}, the jackknife estimate is computed by omitting one
value at a time and applying the estimator to the remaining n − 1 data points. For example, when
estimating the mean, leaving out xᵢ gives:
θ̂ᵢ = (x₁ + … + xᵢ₋₁ + xᵢ₊₁ + … + xₙ) / (n − 1)
This produces a set of jackknife estimates θ̂₁, θ̂₂, …, θ̂ₙ. An overall estimate of the parameter can then
be obtained by averaging these jackknife estimates.
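The sketch below (illustrative, not from the text) applies this leave-one-out idea to the sample mean and also computes the standard jackknife bias estimate, (n − 1)·(mean of leave-one-out estimates − full-sample estimate); the mean is chosen as the statistic only for simplicity:

```python
import numpy as np

def jackknife(data, estimator=np.mean):
    """Leave-one-out estimates plus a jackknife bias estimate for `estimator`."""
    data = np.asarray(data, dtype=float)
    n = data.size
    full = estimator(data)
    # theta_hat_i: the estimate computed with the i-th value left out
    loo = np.array([estimator(np.delete(data, i)) for i in range(n)])
    bias = (n - 1) * (loo.mean() - full)
    return loo, loo.mean(), bias

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
loo_estimates, overall, bias = jackknife(x)
print(overall, bias)     # for the sample mean, the jackknife bias estimate is 0
```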
Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm is a powerful iterative technique used to find
Maximum Likelihood Estimates (MLE) of parameters when the data is incomplete. It is widely
used in situations where we have missing data or latent (hidden) variables, and we need to
estimate the parameters of a model based on the observed data.
Basic Concept of the EM Algorithm
The EM algorithm works in two alternating steps:
1. Expectation (E-step): In this step, we estimate the missing or unobserved data (latent variables)
given the observed data and the current estimates of the parameters.
2. Maximization (M-step): In the maximization step, we calculate the parameter estimates that
maximize the likelihood function based on the observed data and the newly estimated values
for the missing data.
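As a concrete, hedged sketch (not from the text), the code below runs EM for a two-component one-dimensional Gaussian mixture, where the missing information is each point's component membership. The E-step computes each point's responsibility (probability of belonging to each component) under the current parameters, and the M-step re-estimates the means, variances, and mixing weights from those responsibilities:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(x, iters=50):
    # Crude initial guesses for means, variances, and mixing weights.
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each data point.
        dens = np.stack([pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate parameters using the responsibilities as weights.
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk
        pi = nk / x.size
    return mu, var, pi

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])
print(em_two_gaussians(data))   # estimated means should come out near 0 and 5
```

The two steps are repeated until the parameter estimates stop changing appreciably; each iteration is guaranteed not to decrease the likelihood of the observed data.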