DMDW-UNIT1
Data Mining - This is the heart of the KDD process and involves applying
various data mining techniques to the transformed data to discover hidden
patterns, trends, relationships, and insights. A few of the most common data
mining techniques include clustering, classification, association rule mining,
and anomaly detection.
Pattern Evaluation - After the data mining, the next step is to evaluate the
discovered patterns to determine their usefulness and relevance. This
involves assessing the quality of the patterns, evaluating their significance,
and selecting the most promising patterns for further analysis.
Knowledge Representation - This step involves representing the knowledge
extracted from the data in a way humans can easily understand and use.
This can be done through visualizations, reports, or other forms of
communication that provide meaningful insights into the data.
Deployment - The final step in the KDD process is to deploy the knowledge
and insights gained from the data mining process to practical applications.
This involves integrating the knowledge into decision-making processes or
other applications to improve organizational efficiency and effectiveness.
Database Data
A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set
of software programs to manage and access the data.
A relational database is a collection of tables, each of which is assigned a
unique name. Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows).
Each tuple in a relational table represents an object identified by a unique
key and described by a set of attribute values.
Unsupervised Learning:
In unsupervised learning, the machine uses unlabeled data and learns on its own, without supervision. The machine tries to find patterns in the unlabeled data and produces a response.
Semi-supervised learning
Semi-supervised learning is a class of machine learning techniques that
make use of both labeled and unlabeled examples when learning a model. In
one approach, labeled examples are used to learn class models and
unlabeled examples are used to refine the boundaries between classes.
Data Warehouses
A data warehouse integrates data originating from multiple sources and
various timeframes. It consolidates data in multidimensional space to form
partially materialized data cubes. The data cube model not only facilitates
OLAP in multidimensional databases but also promotes multidimensional
data mining.
Information Retrieval
Information retrieval (IR) is the science of searching for documents or
information in documents. Documents can be text or multimedia, and may
reside on the Web.
Q) Explain the data mining task primitives?
A data mining task can be specified in the form of a data mining query, which is the input to the data mining system.
A data mining query is defined in terms of data mining task primitives.
The data mining primitives specify the following:
1. The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest.
2. The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
Example:
The search for association rules is confined to those matching a given metarule, such as
age(X, "30..39") ^ income(X, "40K..49K") => buys(X, "VCR") [2.2%, 60%]
and
occupation(X, "student") ^ age(X, "20..29") => buys(X, "computer") [1.4%, 70%]
The former rule states that customers in their thirties, with an annual income of between 40K and 49K, are likely (with 60% confidence) to purchase a VCR, and such cases represent about 2.2% of the total number of transactions.
4. Interestingness measures:
Different kinds of knowledge may have different interestingness measures. These measures may be used to guide the mining process and to evaluate the discovered patterns. For example, interestingness measures for association rules include support and confidence.
Rules whose support and confidence values are below user-specified
thresholds are considered uninteresting.
Simplicity: A factor contributing to the interestingness of a pattern is the
pattern’s overall simplicity for human comprehension.
Certainty (Confidence): A certainty measure for association rules of the form A => B, where A and B are sets of items, is confidence. Confidence is an estimate of the conditional probability P(B | A), that is, the probability that a transaction containing A also contains B.
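As a rough illustration, the following sketch computes support and confidence for a rule A => B over a handful of made-up transactions; the item names and the helper function are hypothetical, not from these notes:

# Hypothetical transactions; each one is the set of items bought together.
transactions = [
    {"VCR", "TV"},
    {"VCR", "computer"},
    {"TV"},
    {"computer", "printer"},
    {"VCR", "computer", "TV"},
]

def support_confidence(antecedent, consequent):
    """Support: fraction of all transactions containing both A and B.
    Confidence: fraction of transactions containing A that also contain B."""
    with_a = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_a if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence

print(support_confidence({"VCR"}, {"TV"}))   # (0.4, 0.666...)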
 2. Loose coupling: Loose coupling means that a DM system will use some
facilities of a DB or DW system, fetching data from a data repository managed by
these systems, performing data mining, and then storing the mining results either
in a file or in a designated place in a database or data Warehouse.
 3. Semitight coupling: Semitight coupling means that besides linking a DM
system to a DB/DW system, efficient implementations of a few essential data
mining primitives (identified by the analysis of frequently encountered data mining
functions) can be provided in the DB/DW system.
4. Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system.
3. Data Preparation:
This phase has five tasks:
      a. Select data: Determine which data sets will be used for the analysis.
      b. Clean data: Often this is the lengthiest task. Without it, you'll likely fall victim to garbage-in, garbage-out. A common practice during this task is to correct, impute, or remove erroneous values.
      c. Construct data: Derive new attributes that will be helpful. For
example, derive someone’s body mass index from height and weight fields.
      d. Integrate data: Create new data sets by combining data from
multiple sources.
      e. Format data: Re-format data as necessary. For example, you might
convert string values that store numbers to numeric values so that you can
perform mathematical operations.
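A minimal sketch of these tasks with pandas, assuming a hypothetical data set with height_m, weight_kg, and price columns (none of these names come from the notes):

import pandas as pd

# Hypothetical raw data: one missing height and numbers stored as strings.
df = pd.DataFrame({
    "height_m": [1.75, 1.62, None],
    "weight_kg": [70.0, 55.0, 80.0],
    "price": ["100", "250", "80"],
})

# b. Clean data: impute the missing height with the column mean.
df["height_m"] = df["height_m"].fillna(df["height_m"].mean())

# c. Construct data: derive body mass index from height and weight.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# d. Integrate data would merge this frame with other sources (e.g. pd.merge).

# e. Format data: convert string values to numeric so math can be performed.
df["price"] = pd.to_numeric(df["price"])

print(df)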
4. Modeling:
This phase has four tasks:
a. Select modeling techniques: Determine which algorithms to try.
b. Generate test design: You might need to split the data into training, test, and validation sets.
c. Build model: This might just be executing a few lines of code.
d. Assess model: Generally, multiple models are competing against each
other, and the data scientist needs to interpret the model results based on
domain knowledge, the pre-defined success criteria, and the test design.
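As a rough sketch of these tasks with scikit-learn (the data here is simulated with make_classification, and the choice of algorithm is arbitrary, not prescribed by the notes):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Simulated data standing in for a prepared data set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# b. Generate test design: hold out part of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# a./c. Select a modeling technique and build the model.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# d. Assess the model against the held-out test set.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))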
5. Evaluation
This phase has three tasks:
a. Evaluate results: Do the models meet the business success criteria?
Which one(s) should we approve for the business?
b. Review process: Review the work accomplished.
c. Determine next steps: Based on the previous three tasks, determine whether to proceed to deployment or to iterate further.
6. Deployment
end of the month. For a period of time following each month, the data stored in the database are incomplete. When month-end data are not updated in a timely fashion, data quality is negatively affected.
Two other factors affecting data quality are believability and interpretability. Believability reflects how much the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a database at one point had several errors, all of which have since been corrected. The past errors caused many problems, however, so users may no longer trust the data.
Q) What are the Data cleaning methods in Data mining?
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning methods attempt to fill in missing values, smooth out noise by identifying outliers, and correct inconsistencies in the data.
In this section we discuss the basic methods for data cleaning.
Data cleaning methods:
Missing values:
The following are the basic methods for filling the missing values.
1. Ignore the tuple: This is usually done when the class label is missing.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "unknown" or ∞.
4. Use the attribute mean to fill in the missing value.
5. Use the attribute mean for all samples belonging to the same class to fill in the missing value.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
For example, a minimal sketch (using pandas on a hypothetical income attribute, not data from these notes) fills missing values with the attribute mean, as in method 4 above:
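import pandas as pd

# Hypothetical attribute with two missing values.
income = pd.Series([40000.0, None, 52000.0, None, 47000.0])

# Method 4: fill each missing value with the attribute mean.
income_filled = income.fillna(income.mean())
print(income_filled)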
Noisy data
Noise is a random error or variance in a measured variable. The following methods can be used to handle noisy data.
a. Binning
The binning method can be used for smoothing the data. Real data are often noisy, and data smoothing is a preprocessing technique that applies an algorithm such as binning to reduce the noise in the data set.
Unsorted data for price in dollars
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First of all, sort the data
After Sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smoothing the data by equal frequency bins
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
Smoothing by bin means
For Bin 1:
(8 + 9 + 15 + 16) / 4 = 12
Bin 1 = 12, 12, 12, 12
For Bin 2:
(21 + 21 + 24 + 26) / 4 = 23
Bin 2 = 23, 23, 23, 23
For Bin 3:
(27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30
Bin 3 = 30, 30, 30, 30
Smoothing by bin boundaries
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
For each bin, pick the minimum and maximum values; the minimum goes on the left side and the maximum on the right side. Each middle value is then replaced by whichever boundary (minimum or maximum) it is closer to.
Smoothed data after applying bin boundaries
Before bin boundaries: Bin 1: 8, 9, 15, 16
Here, 8 is the minimum value and 16 is the maximum value. 9 is closer to 8, so 9 is treated as 8; 15 is closer to 16 than to 8, so 15 is treated as 16.
After bin boundaries:  Bin 1: 8, 8, 16, 16
Before bin boundaries: Bin 2: 21, 21, 24, 26
After bin boundaries:  Bin 2: 21, 21, 26, 26
Before bin boundaries: Bin 3: 27, 30, 30, 34
After bin boundaries:  Bin 3: 27, 27, 27, 34
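The same walkthrough can be sketched in plain Python using the price values above; the variable names are made up for the example:

prices = sorted([8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34])
bin_size = 4

# Equal-frequency bins of four values each.
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the nearer boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # [[12, 12, 12, 12], [23, 23, 23, 23], [30, 30, 30, 30]]
print(by_bounds)   # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]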
b. Regression:
Data smoothing can also be done by regression, a technique that conforms
data values to a function. Linear regression involves finding the “best” line to
fit two attributes (or variables) so that one attribute can be used to predict
the other. Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit to a
multidimensional surface.
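As an illustration, a simple linear regression smoother can be sketched with NumPy; the x and y values here are invented for the example:

import numpy as np

# Hypothetical attribute pairs: use x to predict (smooth) y.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the "best" line y ≈ slope * x + intercept (least squares).
slope, intercept = np.polyfit(x, y, deg=1)

# Replace each y value by the value predicted from the fitted line.
y_smoothed = slope * x + intercept
print(y_smoothed)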
c. Outlier analysis
Outliers may be detected by clustering, for example, where similar values
are organized into groups, or “clusters.” Intuitively, values that fall outside
of the set of clusters may be considered outliers.
d. Data cleaning as a process:
To handle inconsistent data, data cleaning is carried out as a process. The first step in this process is discrepancy detection.
Discrepancies can occur due to badly designed data entry forms, human errors, or data decay (outdated values). They can be detected with the help of metadata, and also by examining measures of central tendency.
The data should also be examined with respect to "unique rules", "consecutive rules", and "null rules".
A "unique rule" says that each value of the given attribute must be different from all other values for that attribute.
A "consecutive rule" says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique.
A "null rule" specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition.
Q) Write about data integration?
Data integration merges data from several heterogeneous sources to obtain meaningful data. The sources may include several databases, multiple files, or other data stores.
table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj, that is, (A = ai, B = bj). Each possible (Ai, Bj) joint event has its own cell (or slot) in the table.
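To make the layout concrete, such a contingency table can be built with pandas; the attributes and values below are hypothetical, not from the notes:

import pandas as pd

# Hypothetical observations of two nominal attributes A and B.
df = pd.DataFrame({
    "A": ["male", "female", "male", "female", "male"],
    "B": ["fiction", "fiction", "non-fiction", "fiction", "non-fiction"],
})

# Contingency table: values of A as columns, values of B as rows;
# each cell counts the joint event (A = ai, B = bj).
table = pd.crosstab(index=df["B"], columns=df["A"])
print(table)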
Tuple Duplication:
In addition to detecting redundancies between attributes, duplication should
also be detected at the tuple level (e.g., where there are two or more
identical tuples for a given unique data entry). Inconsistencies often arise
between various duplicates, due to inaccurate data entry or updating some
but not all data occurrences.
Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value
conflicts. For example, for the same real-world entity, attribute values from different sources may differ.
This may be due to differences in representation. For instance, a weight attribute may be stored in metric units in one system and in British imperial units in another; likewise, the price of a hotel room may be represented in different currencies in different cities. Such issues are detected and resolved during data integration.
When the resolution level reaches 1, we stop the recursive process and note the DWT coefficients:
                  [35, -3, 16, 16, 8, -8, 0, 12]
A user-specified threshold (constraint) indicates that negative values are not needed, so a 0 is put in place of each negative value. The final coefficients are
                  [35, 0, 16, 16, 8, 0, 0, 12]
Attribute subset selection
The data set may have a large number of attributes, but some of those attributes can be irrelevant or redundant. The goal of attribute subset selection is to find a minimum set of attributes such that dropping the irrelevant attributes does not significantly affect the utility of the data, while the cost of data analysis is reduced.
Methods of Attribute Subset Selection-
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction.
All the above methods are greedy approaches for attribute subset selection.
1. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute (having the minimum p-value) is chosen and added to the minimal set. In each iteration, one attribute is added to the reduced set.
2. Stepwise Backward Elimination: Here, all the attributes are considered in the initial set. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set of attributes.
3. Combination of Forward Selection and Backward Elimination: Stepwise forward selection and backward elimination can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
The above figure shows heuristic methods for attribute subset selection.
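A minimal sketch of stepwise forward selection with scikit-learn's SequentialFeatureSelector on simulated data (it adds attributes greedily by cross-validated score rather than by p-value, and the estimator and feature counts are arbitrary choices, not from the notes):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Simulated data: 8 attributes, only some of which are informative.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

# Start from an empty set and greedily add the most relevant attributes.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",   # use "backward" for stepwise backward elimination
)
selector.fit(X, y)
print("selected attribute indices:", selector.get_support(indices=True))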
Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters.
Similarity is defined in terms of how "close" the objects are in space, based on a distance function. The quality of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, defined as the average distance of each cluster object from the cluster centroid.
For example, a minimal sketch (scikit-learn KMeans on made-up two-dimensional points, not data from these notes) groups objects into two clusters and reports each cluster's diameter and average centroid distance:
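import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

# Hypothetical two-dimensional objects (tuples) to be clustered.
X = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for label in range(2):
    members = X[kmeans.labels_ == label]
    diameter = pdist(members).max()  # maximum distance between any two objects
    centroid_dist = np.linalg.norm(members - kmeans.cluster_centers_[label], axis=1).mean()
    print(f"cluster {label}: diameter={diameter:.2f}, avg centroid distance={centroid_dist:.2f}")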
Data cube aggregation
Consider a company's data consisting of the AllElectronics sales per quarter for the years 2008 to 2010. Generally, we are interested in the annual sales (the total per year) rather than the total per quarter.
Thus the data can be reduced so that the resulting data summarize the total sales per year instead of per quarter.
E.g., a rough sketch with pandas (the sales figures below are invented) aggregates quarterly totals into annual totals:
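import pandas as pd

# Hypothetical AllElectronics quarterly sales (the figures are invented).
sales = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 330, 416, 398, 600],
})

# Reduce the data: total sales per year instead of per quarter.
annual = sales.groupby("year")["sales"].sum()
print(annual)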
Histograms:
Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute A partitions the data distribution of A into disjoint subsets, referred to as buckets or bins.
If each bucket represents a single attribute value/frequency pair, the buckets are called "singleton buckets".
A histogram for such data using singleton buckets has one bar per distinct value. To further reduce the data, it is common to have each bucket denote a continuous value range for the given attribute, for example, a different $10 range for price.
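A small sketch with NumPy (the price values are invented) builds equal-width buckets, each covering a $10 range of the price attribute:

import numpy as np

# Hypothetical price data in dollars.
prices = np.array([5, 8, 12, 14, 15, 18, 21, 21, 24, 25, 28, 30, 30, 34])

# Equal-width buckets, each covering a $10 range.
counts, edges = np.histogram(prices, bins=np.arange(0, 41, 10))
for lo, hi, count in zip(edges[:-1], edges[1:], counts):
    print(f"${lo}-${hi}: {count}")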