Data Mining
Syllabus
Module 1:
Introduction: Data, Information, Knowledge, KDD, Types of data for mining, Application
domains, Data mining functionalities/tasks.
Data Processing: Understanding data, Pre-processing data, Form of data processing, Data
cleaning (definition and phases only), Need for data integration, Steps in data
transformation, Need of data reduction.
Module II:
Module III:
Module IV:
Module 1:
1.1 INTRODUCTION:
Data Mining is defined as “the process of analyzing data from different perspectives and
summarizing it into useful information”.
Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the Web,
other information repositories, or data that are streamed into the system dynamically.
Data Mining is defined as extracting information from huge sets of data. In other
words, data mining is the procedure of mining knowledge from data.
Data Mining is also known as Knowledge Discovery from Data (KDD), or Knowledge
Extraction.
The overall goal of the data mining process is to extract information from a data set and transform
it into an understandable structure for further use.
Nowadays, data mining is used in almost all places where a large amount of data is
stored and processed.
For example, banks typically use data mining to find prospective customers who could
be interested in credit cards, personal loans, or insurance. Since banks have the
transaction details and detailed profiles of their customers, they analyze all this
data and try to find patterns which help them predict that certain customers could be
interested in personal loans, etc.
The volume of information we must handle, from business transactions, scientific data,
sensor data, pictures, videos, etc., is increasing every day. So, we need a system
capable of extracting the essence of the available information and automatically
generating reports, views, or summaries of the data for better decision-making.
Data
Data are raw facts, numbers, or text that can be processed by a computer.
Data comes in different types:
Operational or transactional data (examples: sales, cost, inventory, payroll, and
accounting)
Non-operational data (examples: industry sales, forecast data)
Metadata, i.e., data about the data itself (logical database design or data dictionary
definitions)
Data is meaningless in itself, but once processed and interpreted, it becomes
information, which is filled with meaning.
Information
Information is the set of data that has already been processed, analyzed, and structured
in a meaningful way to become useful. Once data is processed and gains relevance, it
becomes information that is fully reliable, certain, and useful.
Ultimately, the purpose of processing data and turning it into information is to help
organizations make better, more informed decisions that lead to successful outcomes.
To collect and process data, organizations use Information Systems (IS), which are a
combination of technologies, procedures, and tools that assemble and distribute
information needed to make decisions.
Knowledge
Information can be converted into knowledge about historical patterns and future trends.
For example, summary information on retail supermarket sales can be analyzed in light
of promotional efforts to provide knowledge of consumer buying behaviour. Thus, a
manufacturer or retailer could determine which items are most susceptible to
promotional efforts.
1.1.2 KDD
Data mining is often treated as a synonym for Knowledge Discovery from Data (KDD),
the overall process of turning raw data into useful knowledge. The KDD process
typically includes data cleaning, data integration, data selection, data
transformation, data mining, pattern evaluation, and knowledge presentation.
1.1.3 Types of Data for Mining
Data mining is not specific to any one media or data. Data mining should be applicable to
any kind of information repository. Algorithms and approaches may differ when applied
to different types of data.
1. Flat Files
2. Database Data
3. Data Warehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web (WWW)
Flat Files:
The most common data source for data mining algorithms.
Simple data files in text or binary format with a structure that can be easily
extracted by data mining algorithms.
The data in these files can be transactions, time-series data, scientific
measurements and others.
Flat files are described by a data dictionary. E.g., a CSV file.
Application: Used in Data Warehousing to store data.
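As a small illustration, here is a minimal sketch of reading a flat file with Python's
standard csv module (the file name sales.csv and its columns are hypothetical):

    # Read records from a CSV flat file; the header row plays the role of a
    # simple data dictionary describing the fields.
    import csv

    with open("sales.csv", newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)   # each row is one record, e.g. {"trans_id": "T100", "item": "I1"}

    print(len(rows), "records read")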
Database Data
A database system, also called a database management system (DBMS), consists
of a collection of interrelated data, known as a database, and a set of software
programs to manage and access the data.
A relational database is a collection of tables, each of which is assigned a unique
name.
Each table consists of a set of attributes (columns or fields) and usually stores a
large set of tuples (records or rows).
Each tuple in a relational table represents an object identified by a unique key and
described by a set of attribute values.
A semantic data model, such as an entity-relationship (ER) data model, is often
constructed for relational databases.
An ER data model represents the database as a set of entities and their
relationships.
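As an illustration, a relational table can be sketched with Python's built-in sqlite3
module (the customer table and its columns are made up for this example):

    # A relational table: each tuple is identified by a unique key (cust_id)
    # and described by a set of attribute values.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    con.execute("INSERT INTO customer VALUES (1, 'Smith', 'Chicago'), (2, 'Jones', 'Boston')")
    for row in con.execute("SELECT cust_id, name, city FROM customer"):
        print(row)   # one tuple of attribute values per row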
Data Warehouse
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site.
Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing.
Transactional Databases
In general, each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web page.
A transaction typically includes a unique transaction identity number (trans ID) and a list
of the items making up the transaction, such as the items purchased in the transaction.
A transactional database may have additional tables, which contain other information
related to the transactions, such as item description, information about the salesperson
or the branch, and so on.
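For concreteness, a transactional record can be pictured as a transaction identifier
plus an item list; the sketch below uses hypothetical IDs and items:

    # Each record holds a unique trans_ID and the list of items in that transaction.
    transactions = [
        {"trans_ID": "T100", "items": ["I1", "I3", "I8", "I16"]},
        {"trans_ID": "T200", "items": ["I2", "I8"]},
    ]
    for t in transactions:
        print(t["trans_ID"], "->", t["items"])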
Multimedia Databases
Multimedia databases contain audio, video, image, and text media.
They can be stored in object-oriented databases.
They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on-demand, news-on-demand, music
databases, etc.
Spatial Database
Store geographical information.
Stores data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.
Time-series Databases
Time-series databases contain stock exchange data and user-logged activities.
They handle arrays of numbers indexed by time, date, etc.
They often require real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
WWW
The World Wide Web (WWW) is a collection of documents and resources such as
audio, video, and text, identified by Uniform Resource Locators (URLs), linked
by HTML pages, and accessed via web browsers over the Internet.
It is the most heterogeneous repository, as it collects data from multiple resources.
It is dynamic in nature, as the volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, studying, etc.
1.1.4 Application Domains
Data Mining Applications in Banking and Finance
Credit card spending by customer groups can be identified by using data mining.
Hidden correlations between different financial indicators can be discovered
by using data mining.
From historical market data, data mining makes it possible to identify stock-trading
rules.
Data mining is used to identify customer loyalty by analyzing the data of
customers' purchasing activities.
Data Mining Applications in Health Care and Insurance
Healthcare: Data mining has the potential to greatly improve the healthcare
industry. Data mining approaches like machine learning, statistics, and data
visualization are used by analysts to forecast patterns or predict future illnesses.
Insurance
Data mining is applied in claims analysis such as identifying which medical
procedures are claimed together.
Data mining makes it possible to predict which customers will potentially
purchase new policies.
Data mining allows insurance companies to detect risky customer behaviour
patterns.
Data mining helps detect fraudulent behaviour.
1.1.5 Data Mining Functionalities
Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks.
In general, data mining tasks can be classified into two categories:
Descriptive
Descriptive mining tasks characterize the general properties of the data in the
database.
Predictive
Predictive mining tasks perform inference on the current data in order to make
predictions.
Data mining Functionalities:
1. Class / concept description
2. Association Analysis
3. Classification
4. Prediction
5. Cluster Analysis
6. Outlier Analysis
1. Class/concept description
Data can be associated with classes or concepts
For example, in the AllElectronics store,
o Classes of items for sale include computers and printers
o Concepts of customers include bigSpenders and budgetSpenders
It can be useful to describe individual classes and concepts in summarized,
concise, and precise terms.
Such descriptions of a class or a concept are called class/concept descriptions
These descriptions can be derived via data characterization and discrimination
Data Characterization
Data characterization is a summarization of the general characteristics or
features of a target class of data.
For example, a user may want to study the characteristics of software products
with sales that increased by 10% in the previous year.
Data Discrimination
Data discrimination is a comparison of the general features of the target class data
objects against the general features of objects from one or multiple contrasting
classes.
For example, a user may want to compare the general features of software
products with sales that increased by 10% last year against those with sales that
decreased by at least 30% during the same period.
2. Association Analysis
Association analysis discovers association rules, i.e., attribute-value conditions
that occur frequently together in the data.
For example, the rule buys(X, "computer") => buys(X, "software") [support = 1%,
confidence = 50%] says that customers who buy a computer also tend to buy software.
3. Classification
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, so that the model can be used to predict
the class label of objects whose label is unknown.
4. Prediction
Prediction is one of the most valuable data mining techniques, since it’s used to
project the types of data you’ll see in the future.
In many cases, just recognizing and understanding historical trends is enough to
chart a somewhat accurate prediction of what will happen in the future.
For example, you might review consumers' credit histories and past purchases to
predict whether they'll be a credit risk in the future.
5. Clustering
Clustering analyzes data objects without consulting class labels. The objects are
grouped so that similarity is high within a cluster and low between clusters.
6. Outlier Analysis
A data set may contain objects that do not comply with the general behaviour of
the data; these data objects are outliers. Outlier analysis is useful in
applications such as fraud detection.
Attribute
An attribute (also called a dimension, feature, or variable) is a data field that
represents a characteristic or feature of a data object.
For a customer object, attributes can be customer ID, address, etc.
Types of attributes
Nominal Attributes
The values of a nominal attribute are names of things or symbols.
Values of nominal attributes represent some category or state.
Also referred to as categorical attributes.
There is no order among the values of a nominal attribute.
Attribute           Values
Colors              Black, Brown, White
Categorical Data    Lecturer, Professor, Asst Professor
Binary Attributes
A binary attribute is a nominal attribute.
Binary data has only two values or states: 0 or 1, where 0 means that the
attribute is absent and 1 means that it is present.
Also referred to as Boolean if the two states correspond to yes and no.
Types:
Symmetric: Both values are equally important
Attribute    Values
Gender       Male, Female
Asymmetric: Both values are not equally important
Attribute          Values
Cancer detected    Yes, No
Result             Pass, Fail
Ordinal Attributes
The ordinal attributes contain values that have a meaningful order but the
magnitude between values is not known.
Attribute    Values
Grade        A, B, C, D, E, F
Numeric attributes:
A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values.
Numerical attributes are of two types, interval and ratio.
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in ˚C or ˚F, calendar dates
No true zero-point
Ratio
A ratio-scaled attribute has an inherent zero-point, so a value can be described
as a multiple (or ratio) of another value.
E.g., length, weight, and monetary quantities.
Discrete
A discrete attribute has a finite or countably infinite set of values.
E.g., zip codes or the number of words in a document.
Continuous
Continuous data have an infinite number of states.
E.g., temperature, height, or weight.
Continuous data is typically represented as floating-point values.
Attribute    Values
Height       5.4, 6.2, ... etc.
Weight       55.8, 67, 34, ... etc.
1.2 DATA PREPROCESSING
Data quality: data have quality if they satisfy the requirements of the intended
use. Measures of data quality include:
Accuracy
Completeness
Consistency
Timeliness
Believability
Accessibility
1.2.3 Major Tasks in Data Preprocessing (Form of data Preprocessing)
Data Cleaning
Data Integration
Data Transformation
Data Reduction
1. Missing Values
E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data.
Methods for handling missing values:
Fill in the missing value manually: when the database contains a large number of
missing values, this method is not feasible. In general, this approach is
time-consuming and may not be feasible given a large data set with many
missing values.
Use a global constant: replace all missing attribute values by the same constant,
such as a label like "Unknown" or -∞.
Use the attribute mean: use the mean of all samples belonging to the same class
as the given tuple.
Use the most probable value: fill in with the most probable value, determined
using regression, Bayesian formulation, or decision tree induction.
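The attribute-mean method can be sketched with pandas, assuming pandas is available
(the column names and values are hypothetical):

    # Fill a missing income with the mean income of the tuple's own class.
    import pandas as pd

    df = pd.DataFrame({
        "class":  ["gold", "gold", "silver", "silver"],
        "income": [50000, None, 30000, 32000],
    })
    df["income"] = df["income"].fillna(
        df.groupby("class")["income"].transform("mean")
    )
    print(df)   # the missing gold income is filled with 50000.0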
2. Noisy Data
Noise is a random error or variance in a measured variable.
Incorrect attribute values may be due to faulty data collection instruments,
data entry problems, data transmission problems, technology limitations, or
inconsistency in naming conventions.
Binning
Binning method:
First sort data and partition into (equi-depth) bins
Then smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
Smoothing by bin means: Each value in a bin is replaced by the mean value
of the bin.
Smoothing by bin median: Each bin value is replaced by its bin median value.
Smoothing by bin boundaries: The minimum and maximum values in a given
bin are identified as the bin boundaries. Each value in the bin is then replaced
by the closest boundary value.
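A minimal sketch of equi-depth binning, using an illustrative sorted price list
(example data, not taken from this text):

    # Partition sorted data into 3 equi-depth bins, then smooth.
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
    n_bins = 3
    size = len(prices) // n_bins
    bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

    # Smoothing by bin means: every value becomes its bin's mean.
    by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value snaps to the nearer of min/max.
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

    print(by_means)    # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
    print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]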
Regression
Data smoothing can be done by regression, a technique that conforms data values
to a function, i.e., by fitting the data to a regression function.
Outlier analysis
Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters.”
Values that fall outside of the set of clusters may be considered outliers.
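To make the idea concrete, here is a minimal sketch of clustering-based outlier
detection on one-dimensional data; the values and the gap threshold are assumptions
chosen for illustration:

    # Group sorted values into clusters separated by large gaps;
    # values left in tiny clusters are treated as outliers.
    values = sorted([12, 13, 13, 14, 14, 15, 16, 90])
    GAP = 5   # assumed gap threshold

    clusters, current = [], [values[0]]
    for v in values[1:]:
        if v - current[-1] <= GAP:
            current.append(v)
        else:
            clusters.append(current)
            current = [v]
    clusters.append(current)

    outliers = [v for c in clusters if len(c) == 1 for v in c]
    print(outliers)   # [90]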
Data Integration
Data Integration is a data preprocessing technique that involves merging of
data from multiple data sources and provides a unified view of the data.
These sources may include multiple data cubes, databases or flat files.
Careful integration can help reduce and avoid redundancies and inconsistencies in the
resulting data set.
This can help improve the accuracy and speed of the subsequent data mining process.
The data integration approach is formally defined as a triple <G, S, M>,
where:
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between queries over the source and global schemas.
1. Tight Coupling
2. Loose Coupling
Tight Coupling:
In tight coupling, data from the different sources is combined into a single
physical location through a process of Extraction, Transformation, and Loading
(ETL), as in a data warehouse.
Loose Coupling:
In this approach, an interface is provided that takes the query from the user,
transforms it into a form the source databases can understand, and then sends the
query directly to the source databases to obtain the result.
In loose coupling, the data remains only in the actual source databases.
Schema Integration:
Integrate metadata from different sources.
Matching real-world entities from multiple data sources is referred to as the
entity identification problem.
Redundancy:
An attribute may be redundant if it can be derived or obtained from another
attribute or set of attributes.
Inconsistencies in attributes can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis.
Careful integration can help reduce or avoid redundancies and inconsistencies and
improve mining speed and quality.
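As a sketch of correlation analysis for two numeric attributes (illustrative data;
a Pearson coefficient near +1 or -1 suggests one attribute may be derivable from
the other):

    # Pearson correlation coefficient between two numeric attributes.
    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
        sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
        sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
        return cov / (sx * sy)

    price          = [10, 20, 30, 40]
    price_with_tax = [11, 22, 33, 44]   # derivable from price
    print(pearson(price, price_with_tax))   # 1.0 -> candidate redundant attribute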
Data Transformation
Data transformation is a technique used to convert raw data into a suitable format
that enables data mining to retrieve strategic information efficiently and quickly.
In the data transformation process, data are transformed from one format to
another format that is more appropriate for data mining.
So that the resulting mining process may be more efficient, and the patterns found may
be easier to understand.
1. Smoothing
Smoothing is a process of removing noise from the dataset. Techniques include
binning, regression, and clustering.
2. Aggregation
Aggregation is a process where summary or aggregation operations are applied to
the data.
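A small sketch of aggregation, rolling hypothetical daily sales up to annual totals:

    # Aggregate daily sales amounts into annual totals.
    daily = {"2023-01-01": 120, "2023-01-02": 90, "2024-01-01": 150}
    annual = {}
    for date, amount in daily.items():
        year = date[:4]
        annual[year] = annual.get(year, 0) + amount
    print(annual)   # {'2023': 210, '2024': 150}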
3. Generalization
In generalization, low-level or "primitive" data are replaced with high-level data
by using concept hierarchies.
For example, an attribute like street can be generalized to a higher-level concept
such as city or country.
4. Normalization
In normalization, the attribute data are scaled so as to fall within a small
specified range.
Data normalization involves converting all data values into a given range, such
as 0.0 to 1.0 or -1.0 to 1.0.
Methods
Min-max normalization
This transforms the original data linearly.
Z-score normalization
In z-score normalization (or zero-mean normalization), the values of an
attribute A are normalized based on the mean of A and its standard
deviation.
Normalization by decimal scaling
It normalizes the values of an attribute by changing the position of their
decimal points
The number of places by which the decimal point is moved is determined by the
maximum absolute value of attribute A.
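The three methods can be sketched as follows (illustrative data; for min-max the
target range [0.0, 1.0] is assumed):

    # Min-max, z-score, and decimal-scaling normalization of one attribute.
    data = [200, 300, 400, 600, 1000]

    # Min-max: v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    mn, mx = min(data), max(data)
    new_min, new_max = 0.0, 1.0
    min_max = [(v - mn) / (mx - mn) * (new_max - new_min) + new_min for v in data]

    # Z-score: v' = (v - mean) / std_dev
    mean = sum(data) / len(data)
    std = (sum((v - mean) ** 2 for v in data) / len(data)) ** 0.5
    z_scored = [(v - mean) / std for v in data]

    # Decimal scaling: v' = v / 10^j for the smallest j with max(|v'|) < 1
    j = 0
    while max(abs(v) for v in data) / 10 ** j >= 1:
        j += 1
    decimal_scaled = [v / 10 ** j for v in data]

    print(min_max)          # [0.0, 0.125, 0.25, 0.5, 1.0]
    print(decimal_scaled)   # [0.02, 0.03, 0.04, 0.06, 0.1]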
5. Attribute Construction
In attribute construction, new attributes are constructed and added from the
given set of attributes. This simplifies the original data and makes the mining more
efficient.
6. Discretization
It is a process of transforming continuous data into a set of small intervals.
For example, numeric values can be mapped to intervals such as (1-10, 11-20), and
an age attribute can be mapped to concept labels (young, middle aged, senior).
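A minimal sketch of discretizing a continuous age attribute into concept labels
(the cut-points 30 and 55 are assumptions for illustration):

    # Map continuous ages into the labels young / middle aged / senior.
    def discretize(age):
        if age <= 30:       # assumed cut-point
            return "young"
        elif age <= 55:     # assumed cut-point
            return "middle aged"
        return "senior"

    print([discretize(a) for a in (18, 42, 67)])   # ['young', 'middle aged', 'senior']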
Data Reduction
Data reduction techniques are used to obtain a reduced representation of the dataset
that is much smaller in volume while maintaining the integrity of the original data.
Data reduction does not affect the result obtained from data mining. That means the
result obtained from data mining before and after data reduction is the same or almost
the same.
Numerosity reduction
Parametric methods
Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers).
Example: Regression and log-linear models
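The parametric idea can be sketched with simple least-squares line fitting
(illustrative points): the whole data set is replaced by just two stored parameters.

    # Fit y ~ a*x + b and keep only the parameters a, b instead of the points.
    xs = [1, 2, 3, 4, 5]
    ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # hypothetical measurements

    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    print(round(a, 2), round(b, 2))   # 1.99 0.09 -- the stored model parameters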
Nonparametric methods
Do not assume a model; major families include histograms, clustering, and sampling.
Data compression
The data compression technique reduces the size of the files using different
encoding mechanisms.
Lossless Compression: Encoding techniques (e.g., run-length encoding) allow simple
and minimal data-size reduction. Lossless data compression uses algorithms to
restore the precise original data from the compressed data.
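A minimal sketch of run-length encoding, showing the lossless round trip:

    # Run-length encoding: store (symbol, run length) pairs.
    def rle_encode(s):
        out, i = [], 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1
            out.append((s[i], j - i))
            i = j
        return out

    def rle_decode(pairs):
        return "".join(ch * n for ch, n in pairs)

    encoded = rle_encode("AAAABBBCCD")
    print(encoded)                              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
    assert rle_decode(encoded) == "AAAABBBCCD"  # lossless: the exact original is restored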
Lossy Compression: In lossy-data compression, the decompressed data may differ from
the original data but are useful enough to retrieve information from them.
For example, the JPEG image format uses lossy compression, but we can find meaning
equivalent to the original image. Methods such as the discrete wavelet transform
(DWT) and principal components analysis (PCA) are examples of this kind of
compression.