UNIT II
Introduction to Data Mining Techniques
Data mining is the process of extracting valuable patterns and knowledge from large datasets. To
achieve this, various methods and techniques are used, each suited for different types of data
structures and problem-solving needs. These techniques are grounded in different algorithmic
approaches and modeling strategies.
One fundamental distinction in data mining techniques is between parametric and
nonparametric models. Let’s explore these two approaches:
Parametric Models
Parametric models describe the relationship between input and output through algebraic
equations, where some parameters remain unspecified. These unknown parameters are typically
determined through training on input examples or data. While this approach is a well-established
and useful theoretical framework, it often proves too simplistic for complex, real-world data.
Additionally, parametric models assume a specific structure or form for the model in advance,
which may not always reflect the intricacies of the data being analyzed. For real-world problems,
the assumptions inherent in parametric models can limit their usefulness, particularly when the
exact nature of the data is not fully understood or cannot be easily modeled.
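As a concrete illustration (not from the text), simple linear regression is a parametric model: the form y ≈ w·x + b is fixed in advance, and only the parameters w and b are learned from the data. A minimal sketch using NumPy's least-squares solver, on made-up synthetic data:

```python
import numpy as np

# Synthetic data: the "true" relationship is y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 1, size=100)

# Parametric model: assume the form y = w*x + b in advance, then fit w and b.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"fitted parameters: w={w:.2f}, b={b:.2f}")
```

If the real relationship is not close to a straight line, this model cannot adapt, which is exactly the limitation described above.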
Nonparametric Models
Nonparametric models, by contrast, are more flexible and adaptable. These models do not rely
on predefined algebraic equations to describe relationships between input and output. Instead, the
model is built directly from the data itself, allowing for a data-driven approach. This means that
the model adapts dynamically as new data is introduced, making nonparametric techniques
particularly valuable in situations where data is complex, dynamic, or evolving.
Nonparametric methods excel in environments where large amounts of data are available. These
techniques require substantial datasets to create a robust model, and the modeling process itself
involves "sifting" through the data to detect patterns, relationships, or trends. This ability to build
models directly from data is crucial in database applications with vast amounts of continuously
changing data.
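By contrast, a nonparametric technique such as k-nearest-neighbours regression assumes no equation form at all; predictions are computed directly from the stored data. The sketch below is illustrative only (the data and the choice of k are made up):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5):
    """Predict y at x_query as the mean of the k nearest training points."""
    distances = np.abs(x_train - x_query)   # distance to every stored example
    nearest = np.argsort(distances)[:k]     # indices of the k closest points
    return y_train[nearest].mean()

# The model *is* the data: adding new examples to x_train/y_train
# immediately changes future predictions, with no refitting step.
rng = np.random.default_rng(1)
x_train = rng.uniform(0, 10, size=200)
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=200)
print(knn_predict(x_train, y_train, x_query=3.0))
```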
Advantages of Nonparametric Models
Adaptability: Nonparametric models don’t assume a specific form or structure for the
data, making them more flexible and applicable to a wider range of problems.
Scalability: These models can scale well with increasing amounts of data, improving in
accuracy as more data is input.
Dynamic Learning: Recent advances in machine learning have made nonparametric
techniques capable of dynamically learning as new data becomes available. This
continuous learning process enables the model to evolve in real time, making it
particularly suitable for applications involving large, dynamic datasets.
Examples of Nonparametric Techniques
Neural Networks: These are computational models inspired by the human brain, capable
of learning from large datasets by adjusting weights based on input data.
Decision Trees: A decision tree is a flowchart-like structure that makes decisions based on
input data, splitting the dataset at each node on the feature that best separates the records (a minimal sketch follows this list).
Genetic Algorithms: These algorithms use evolutionary principles such as selection,
mutation, and crossover to iteratively improve models and find optimal solutions.
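As a small, hedged illustration of one of these techniques, the sketch below fits a shallow decision tree with scikit-learn (assumed to be available); the feature values and class labels are invented for the example:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset: two numeric features (age, income) and a binary class label.
X = [[25, 40_000], [35, 60_000], [45, 80_000], [20, 20_000],
     [50, 90_000], [30, 30_000], [40, 75_000], [28, 52_000]]
y = [0, 1, 1, 0, 1, 0, 1, 1]

# Each internal node splits the records on the feature/threshold that best
# separates the classes; leaves hold the predicted class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[33, 55_000]]))
```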
A Statistical Perspective on Data Mining
In data mining, statistical concepts provide the foundation for many of the techniques used to
analyze and extract insights from data. These concepts are essential for understanding how data
can be processed, modeled, and interpreted. Below is a review of some important statistical
concepts that are commonly applied in data mining.
Point Estimation
Point estimation refers to the process of estimating a population parameter, denoted by θ,
using an estimate θ̂. For instance, we may want to estimate parameters such as the
mean, variance, or standard deviation of a population. These estimates are typically made based
on a sample of data from the population.
Bias of an Estimator: Bias is the difference between the expected value of the estimator
and the true value of the population parameter. The formula for bias is:
Bias = E(θ̂) − θ
An unbiased estimator is one where the bias is 0, meaning that the estimator accurately
predicts the true value on average. However, in larger datasets, most estimators tend to be
biased due to the complexities of real-world data.
Mean Squared Error (MSE): A key measure of the effectiveness of an estimate is the
Mean Squared Error (MSE), which quantifies the difference between the estimated
value and the true value. It is defined as the expected value of the squared difference
between the estimate and the actual value:
MSE(θ̂) = E[(θ̂ − θ)²]
The MSE gives a way to measure the accuracy of a prediction. For example, if the true
value is 10 and the predicted value is 5, the squared error would be:
(5 − 10)² = 25.
The squaring ensures that errors are always positive and that larger errors are weighted
more heavily.
In data mining, MSE is commonly used to evaluate the accuracy of prediction models,
particularly in machine learning applications.
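The following sketch (an illustration, not from the text) estimates the bias and MSE of the "divide by n" variance estimator by simulation; the true variance is known here because the samples are drawn from a normal distribution with σ² = 4:

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0                      # population variance (sigma = 2)
n, trials = 30, 20_000

estimates = np.empty(trials)
for t in range(trials):
    sample = rng.normal(loc=0.0, scale=2.0, size=n)
    estimates[t] = np.mean((sample - sample.mean()) ** 2)   # "divide by n" estimator

bias = estimates.mean() - true_var                 # Bias = E(theta_hat) - theta
mse = np.mean((estimates - true_var) ** 2)         # MSE = E[(theta_hat - theta)^2]
print(f"bias ~ {bias:.3f}, MSE ~ {mse:.3f}")       # bias should be near -true_var/n
```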
Confidence Intervals
Sometimes, rather than providing a single point estimate, it is more useful to estimate a range of
possible values within which the true parameter might lie. This range is called a confidence
interval. A confidence interval provides a measure of uncertainty about the estimate and is often
expressed with a certain confidence level (e.g., 95% confidence).
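As a hedged sketch (not from the text), an approximate 95% confidence interval for a population mean can be computed from a sample with the normal approximation x̄ ± 1.96·s/√n:

```python
import numpy as np

def mean_confidence_interval(data, z=1.96):
    """Approximate 95% CI for the mean using the normal approximation."""
    data = np.asarray(data, dtype=float)
    n = data.size
    mean = data.mean()
    se = data.std(ddof=1) / np.sqrt(n)      # standard error of the mean
    return mean - z * se, mean + z * se

sample = np.random.default_rng(0).normal(loc=10, scale=2, size=50)
print(mean_confidence_interval(sample))     # range likely to contain the true mean of 10
```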
Root Mean Square (RMS)
The Root Mean Square (RMS) is another statistic used to describe the magnitude of a set of
values. It is calculated by taking the square root of the average of the squared values in a dataset.
For a set of n values X = {x₁, x₂, …, xₙ}, the RMS is calculated as:
RMS = √((x₁² + x₂² + … + xₙ²) / n)
The RMS can be useful for estimating the overall magnitude of the values, particularly when
dealing with data that includes both positive and negative values.
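A minimal sketch of this calculation (illustrative, with made-up values):

```python
import numpy as np

def rms(values):
    """Root mean square: square root of the average of the squared values."""
    values = np.asarray(values, dtype=float)
    return np.sqrt(np.mean(values ** 2))

print(rms([3, -4, 3, -4]))   # ~3.54; the sign of each value is irrelevant after squaring
```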
Root Mean Square Error (RMSE)
In the context of data mining and machine learning, the Root Mean Square Error (RMSE) is
frequently used as an alternative way to estimate the error in a prediction. It is calculated by
taking the square root of the Mean Squared Error (MSE):
RMSE(θ̂) = √MSE(θ̂) = √(E[(θ̂ − θ)²])
RMSE provides a more interpretable measure of error, as it brings the unit of measurement back
to the same scale as the original data.
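A short illustrative sketch (not from the text) computing RMSE for a set of predictions against known true values:

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean square error: sqrt of the mean squared prediction error."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

print(rmse([5, 12, 9], [10, 10, 10]))   # errors -5, 2, -1 -> sqrt(30/3) ~ 3.16
```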
Jackknife Estimation
The jackknife is a popular resampling technique used to estimate the bias and variance of a
statistical estimate. The basic idea is to generate multiple estimates by systematically leaving out
one data point at a time from the dataset and calculating the estimate for each of these subsets.
Given a set of n values X = {x₁, x₂, …, xₙ}, the jackknife estimate is computed by omitting one
value at a time and applying the estimator to the remaining n − 1 data points. For example, when
estimating the mean, leaving out xᵢ gives:
θ̂ᵢ = (x₁ + … + xᵢ₋₁ + xᵢ₊₁ + … + xₙ) / (n − 1)
This produces a set of jackknife estimates θ̂₁, θ̂₂, …, θ̂ₙ. An overall estimate of the parameter can then
be obtained by averaging these jackknife estimates.
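The sketch below (illustrative, not from the text) applies this leave-one-out idea to the sample mean and also computes the standard jackknife bias estimate, (n − 1)·(mean of leave-one-out estimates − full-sample estimate); the mean is chosen as the statistic only for simplicity:

```python
import numpy as np

def jackknife(data, estimator=np.mean):
    """Leave-one-out estimates plus a jackknife bias estimate for `estimator`."""
    data = np.asarray(data, dtype=float)
    n = data.size
    full = estimator(data)
    # theta_hat_i: the estimate computed with the i-th value left out
    loo = np.array([estimator(np.delete(data, i)) for i in range(n)])
    bias = (n - 1) * (loo.mean() - full)
    return loo, loo.mean(), bias

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
loo_estimates, overall, bias = jackknife(x)
print(overall, bias)     # for the sample mean, the jackknife bias estimate is 0
```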
Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm is a powerful iterative technique used to find
Maximum Likelihood Estimates (MLE) of parameters when the data is incomplete. It is widely
used in situations where we have missing data or latent (hidden) variables, and we need to
estimate the parameters of a model based on the observed data.
Basic Concept of the EM Algorithm
The EM algorithm works in two alternating steps:
1. Expectation (E-step): In this step, we estimate the missing or unobserved data (latent variables)
given the observed data and the current estimates of the parameters.
2. Maximization (M-step): In the maximization step, we calculate the parameter estimates that
maximize the likelihood function based on the observed data and the newly estimated values
for the missing data.
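As a concrete, hedged sketch (not from the text), the code below runs EM for a two-component one-dimensional Gaussian mixture, where the missing information is each point's component membership. The E-step computes each point's responsibility (probability of belonging to each component) under the current parameters, and the M-step re-estimates the means, variances, and mixing weights from those responsibilities:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(x, iters=50):
    # Crude initial guesses for means, variances, and mixing weights.
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each data point.
        dens = np.stack([pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate parameters using the responsibilities as weights.
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk
        pi = nk / x.size
    return mu, var, pi

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])
print(em_two_gaussians(data))   # estimated means should come out near 0 and 5
```

The two steps are repeated until the parameter estimates stop changing appreciably; each iteration is guaranteed not to decrease the likelihood of the observed data.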