CS 1641 DATA MINING AND WAREHOUSING

Syllabus

Module 1:

Introduction: Data, Information, Knowledge, KDD, Types of data for mining, Application
domains, Data mining functionalities/tasks.

Data Processing: Understanding data, Pre-processing data, Form of data processing, Data
cleaning (definition and phases only), Need for data integration, Steps in data
transformation, Need of data reduction.

Module II:

Data Warehouse: Databases, Data Warehouses, Data marts, Databases vs Data
warehouses, Data warehouses vs Data marts, OLTP vs OLAP, OLAP operations/functions,
OLAP multidimensional models, Data cubes, Star, Snowflake, and Fact constellation
schemas, Association rules, Market basket analysis, Criteria for classifying frequent
pattern mining, Mining single-dimensional Boolean association rules, Apriori algorithm.

Module III:

Classification: Classification vs Prediction, Issues, Decision trees, Bayes classification,
Bayes Theorem, Naive Bayesian classifier, K-Nearest Neighbour method, Rule-based
classification, Using IF...THEN rules for classification.

Module IV:

Cluster analysis: Definition and Requirements, Characteristics of clustering techniques,
Types of data in cluster analysis, Categories of clustering, Partitioning methods (K-Means
and K-Medoids methods only), Outlier detection in clustering.

Module 1:

1.1 INTRODUCTION:
Data Mining is defined as “the process of analyzing data from different perspectives and
summarizing it into useful information”.

Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the Web,
other information repositories, or data that are streamed into the system dynamically.

Data Mining is defined as extracting information from huge sets of data. In other
words, data mining is the procedure of mining knowledge from data.

Data Mining is also known as Knowledge Discovery from Data (KDD) or Knowledge
Extraction.

The overall goal of the data mining process is to extract information from a data set and transform
it into an understandable structure for further use.

Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” in 1989.


However, the term ‘data mining’ became more popular in the business and press
communities. Currently, Data Mining and Knowledge Discovery are used
interchangeably.

Nowadays, data mining is used in almost all places where large amounts of data are
stored and processed.

For example, banks typically use data mining to find prospective customers who could be
interested in credit cards, personal loans, or insurance. Since banks have the transaction
details and detailed profiles of their customers, they analyze all this data and try to find
patterns which help them predict which customers could be interested in personal loans,
etc.

Why do we need Data Mining?

The volume of information that we have to handle from business transactions, scientific
data, sensor data, pictures, videos, etc. is increasing every day. So, we need a system that
is capable of extracting the essence of the information available and that can
automatically generate reports, views, or summaries of the data for better decision-making.

Why is Data Mining used in Business?

Data mining is used in business to make better managerial decisions by:

 Automatic summarization of data.
 Extracting the essence of the information stored.
 Discovering patterns in raw data.

Data

Data are raw facts, numbers, or text that can be processed by a computer.
Data comes in different types:
 Operational or transactional data (Examples: sales, cost, inventory, payroll, and
accounting)
 Non-operational data (Examples: industry sales, forecast data)
 Metadata: data about the data itself (Examples: logical database design or data
dictionary definitions).
Data is meaningless in itself, but once processed and interpreted, it becomes
information which is filled with meaning.
Information

Information is the set of data that has already been processed, analyzed, and structured
in a meaningful way to become useful. Once data is processed and gains relevance, it
becomes information that is fully reliable, certain, and useful.

Ultimately, the purpose of processing data and turning it into information is to help
organizations make better, more informed decisions that lead to successful outcomes.

To collect and process data, organizations use Information Systems (IS), which are a
combination of technologies, procedures, and tools that assemble and distribute
information needed to make decisions.

Knowledge

Information can be converted into knowledge about historical patterns and future trends.

For example, summary information on retail supermarket sales can be analyzed in light
of promotional efforts to provide knowledge of consumer buying behaviour. Thus, a
manufacturer or retailer could determine which items are most susceptible to
promotional efforts.

1.1.2 KDD

 The term KDD stands for Knowledge Discovery in Databases.
 The main objective of the KDD process is extracting useful knowledge from
volumes of data.
KDD refers to the overall process of discovering useful knowledge from data.
Data mining plays an essential role in the knowledge discovery process.

Steps Involved in KDD Process:


1. Data Cleaning:
In this step, noise and inconsistent data are removed.
2. Data Integration:
In this step, multiple data sources are combined.
3. Data Selection:
In this step, data relevant to the analysis task are retrieved from the database.
4. Data Transformation:
In this step, data are transformed into the appropriate form required by the mining
procedure.
5. Data Mining:
In this step, intelligent methods are applied in order to extract data patterns.
6. Pattern Evaluation:
In this step, data patterns are evaluated to identify those representing knowledge,
based on given measures.
7. Knowledge Representation:
In this step, visualization and knowledge representation techniques are used to
present the mined knowledge to users.
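
A minimal sketch of the cleaning, selection, and transformation steps in Python with pandas is shown below (the column names and values are invented purely for illustration):

import pandas as pd

# Hypothetical customer records standing in for an integrated data source.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 41, 41, None, 29],
    "income": [52000, None, None, 61000, 48000],
    "amount_spent": [120.0, 80.0, 80.0, 200.0, 40.0],
})

# Data cleaning: remove duplicate records and fill in missing values with the mean.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].mean())

# Data selection: keep only the attributes relevant to the analysis task.
df = df[["age", "income", "amount_spent"]].copy()

# Data transformation: scale amount_spent into the range [0, 1].
spent = df["amount_spent"]
df["amount_spent"] = (spent - spent.min()) / (spent.max() - spent.min())

print(df)   # the data is now ready for the data mining step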

1.1.3 Types of data for mining

Data mining is not specific to any one medium or type of data. Data mining should be
applicable to any kind of information repository. However, algorithms and approaches
may differ when applied to different types of data.

Type of data can be mined:

1. Flat Files
2. Database Data
3. Data Warehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web (WWW)

Flat Files:
 The most common data source for data mining algorithms.
 Simple data files in text or binary format with a structure that can be easily
extracted by data mining algorithms.
 The data in these files can be transactions, time-series data, scientific
measurements and others.
 Flat files are described by a data dictionary, e.g., a CSV file.
 Application: Used in Data Warehousing to store data.

Database Data
 A database system, also called a database management system (DBMS), consists
of a collection of interrelated data, known as a database, and a set of software
programs to manage and access the data.
 A relational database is a collection of tables, each of which is assigned a unique
name.
 Each table consists of a set of attributes (columns or fields) and usually stores a
large set of tuples (records or rows)
 Each tuple in a relational table represents an object identified by a unique key and
described by a set of attribute values.
 A semantic data model, such as an entity-relationship (ER) data model, is often
constructed for relational databases.
 An ER data model represents the database as a set of entities and their
relationships.
Data Warehouse

 A data warehouse is a repository of information collected from multiple sources,


stored under a unified schema, and usually residing at a single site.
 Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing.
 To facilitate decision making, the data in a data warehouse are organized around
major subjects (e.g., customer, item, supplier, and activity).
 The data are stored to provide information from a historical perspective, such as
in the past 6 to 12 months, and are typically summarized.
 A data warehouse is usually modeled by a multidimensional data structure, called
a data cube.
 A data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.

Transactional Databases
 In general, each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web page.
 A transaction typically includes a unique transaction identity number (trans ID) and a list
of the items making up the transaction, such as the items purchased in the transaction.
 A transactional database may have additional tables, which contain other information
related to the transactions, such as item description, information about the salesperson
or the branch, and so on.

Multimedia Databases
 Multimedia databases consist of audio, video, image, and text media.
 They can be stored on Object-Oriented Databases.
 They are used to store complex information in pre-specified formats.
 Application: Digital libraries, video-on-demand, news-on-demand, music
databases, etc.

Spatial Database
 Store geographical information.
 Stores data in the form of coordinates, topology, lines, polygons, etc.
 Application: Maps, Global positioning, etc.

Time-series Databases
 Time series databases contain stock exchange data and user logged activities.
 Handles array of numbers indexed by time, date, etc.
 It requires real-time analysis.
 Application: eXtremeDB, Graphite, InfluxDB, etc.

WWW
 The World Wide Web (WWW) is a collection of documents and resources such as
audio, video, text, etc., which are identified by Uniform Resource Locators (URLs),
linked by HTML pages, and accessible via the Internet.
 It is the most heterogeneous repository as it collects data from multiple resources.
 It is dynamic in nature as Volume of data is continuously increasing and changing.
 Application: Online shopping, Job search, Research, studying, etc.
1.1.4 Application Domains

Data mining is widely used in diverse areas.

1. Data Mining Applications in Sales/Marketing
2. Data Mining Applications in Banking / Finance
3. Data Mining Applications in Health Care and Insurance
4. Data Mining Application in Medicine
5. Data Mining Applications in Education
6. Data Mining Applications in Manufacturing Engineering
7. Research analysis

Data Mining Applications in Sales/Marketing


 Data mining enables businesses to understand the hidden patterns inside
historical purchasing transaction data.
 Data mining is used for Market Basket Analysis to provide information on what
product combinations were purchased together, when they were bought, and in
what sequence.
 This information helps businesses promote their most profitable products and
maximize the profit.
 Retail companies use data mining to identify customers' buying behaviour
patterns.

Data Mining Applications in Banking / Finance

 Fraud Detection
 Credit card spending by customer groups can be identified by using data mining.
 The hidden correlations between different financial indicators can be discovered
by using data mining.
 From historical market data, data mining makes it possible to identify stock
trading rules.
 Data mining is used to identify customer loyalty by analyzing the data of
customers' purchasing activities.
Data Mining Applications in Health Care and Insurance

Healthcare: Data mining has the potential to greatly improve the healthcare industry.
Data mining approaches like machine learning, statistics, and data visualization are used
by analysts to forecast patterns or predict future illnesses.

Insurance
 Data mining is applied in claims analysis such as identifying which medical
procedures are claimed together.
 Data mining enables insurers to predict which customers will potentially purchase
new policies.
 Data mining allows insurance companies to detect risky customers’ behaviour
patterns.
 Data mining helps detect fraudulent behaviour.

Data Mining Application in Medicine


 Data mining makes it possible to characterize patient activities and anticipate
incoming office visits.
 Data mining helps identify the patterns of successful medical therapies for
different illnesses.
 Example:- Smart Health Prediction in Data Mining

Data Mining Applications in Education

 There is a new emerging field called Educational Data Mining (EDM).
 It is concerned with developing methods that discover knowledge from data
originating from educational environments.
 The goals of EDM are identified as predicting students’ future learning behaviour,
studying the effects of educational support, and advancing scientific knowledge
about learning.
 Data mining can be used by an institution to make accurate decisions and also to
predict the results of its students.
 With these results, the institution can focus on what to teach and how to teach it.
 Learning patterns of the students can be captured and used to develop techniques
to teach them.
Data Mining Applications in Manufacturing Engineering
 Knowledge of the various factors that determine product success is critical for
manufacturing companies.
 Data mining can be used to forecast product development, customer
expectations, cost and other tasks.
Research analysis

 Data mining is helpful in data cleaning, data pre-processing, and integration of
databases.
 The researchers can find any similar data from the database that might bring any
change in the research.
 Data visualization and visual data mining provide a clear view of the data.

1.1.5 DATA MINING FUNCTIONALITIES/ TASKS

Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks.
In general, data mining tasks can be classified into two categories:
 Descriptive
Descriptive mining tasks characterize the general properties of the data in the
database.
 Predictive
Predictive mining tasks perform inference on the current data in order to make
predictions.
Data mining Functionalities:
1. Class / concept description
2. Association Analysis
3. Classification
4. Cluster Analysis
5. Prediction
6. Outlier Analysis

1. Class/concept description
 Data can be associated with classes or concepts.
For example, in the AllElectronics store,
o Classes of items for sale include computers and printers.
o Concepts of customers include bigSpenders and budgetSpenders.
 It can be useful to describe individual classes and concepts in summarized,
concise, and precise terms.
 Such descriptions of a class or a concept are called class/concept descriptions.
 These descriptions can be derived via data characterization and discrimination.

Data Characterization

Data characterization is a summarization of the general characteristics or features
of a target class of data.

For example, one may study the characteristics of software products whose sales
increased by 10% in the previous year.

Data Discrimination

Data discrimination is a comparison of the general features of the target class data
objects against the general features of objects from one or multiple contrasting
classes.

For example, a user may want to compare the general features of software
products with sales that increased by 10% last year against those with sales that
decreased by at least 30% during the same period.

2.Association Analysis

 Association analysis is the discovery of association rules.
 It examines the frequency of items occurring together in transactional databases
and, based on a threshold called support, identifies the frequent item sets.
 Another threshold, confidence, which is the conditional probability that an item
appears in a transaction when another item appears, is used to pinpoint association
rules.
 Association analysis is commonly used for market basket analysis.
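
As a rough illustration, the support and confidence of one candidate rule can be computed directly from a handful of invented market baskets (a sketch only; real systems use the Apriori algorithm covered in Module II):

# Toy transactional data; each set is one (hypothetical) market basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Candidate rule: bread -> milk
sup = support({"bread", "milk"})      # support of the rule
conf = sup / support({"bread"})       # confidence = P(milk | bread)
print(f"support = {sup:.2f}, confidence = {conf:.2f}")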

3. Classification

 Classification analysis is the organization of data into classes.
 It is also known as supervised classification.
 The classification uses given class labels to order the objects in the data collection.
 Classification approaches normally use a set of training data, where all objects are
already associated with known class labels.
 The classification algorithm learns from the training set and builds a model.
 The model is used to classify all new objects.
 The derived model may be represented in various forms, such as classification
rules (i.e., IF-THEN rules), decision trees, or neural networks.
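
The train/learn/classify cycle described above can be sketched with scikit-learn's decision tree on its bundled iris data set (assuming scikit-learn is available; this is an illustration, not part of the syllabus text):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Training data: objects already associated with known class labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The classification algorithm learns from the training set and builds a model.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# The derived model is then used to classify new (unseen) objects.
print("accuracy on unseen objects:", model.score(X_test, y_test))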
4. Prediction

 Prediction is one of the most valuable data mining techniques, since it’s used to
project the types of data you’ll see in the future.
 In many cases, just recognizing and understanding historical trends is enough to
chart a somewhat accurate prediction of what will happen in the future.
 For example, you might review consumers' credit histories and past purchases to
predict whether they will be a credit risk in the future.
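
A minimal sketch of this credit-risk idea with logistic regression from scikit-learn (every feature name and value below is made up for illustration):

from sklearn.linear_model import LogisticRegression

# Hypothetical history: [late payments, past purchases] and a risk label (1 = risky).
X = [[0, 12], [5, 3], [1, 20], [7, 1], [0, 30], [6, 2]]
y = [0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)

# Predict whether a new customer with 4 late payments and 5 purchases is a risk.
print(model.predict([[4, 5]]))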

5. Clustering

 Clustering can be used to generate class labels for a group of data.
 The objects are clustered or grouped based on the principle of maximizing the
intra-class similarity and minimizing the inter-class similarity.
 That is, clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are rather dissimilar to objects in
other clusters.
 Each cluster so formed can be viewed as a class of objects, from which rules can
be derived.
 Clustering is also called unsupervised classification.
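
A short sketch of the idea with k-means from scikit-learn (the two-dimensional points are invented):

from sklearn.cluster import KMeans

# Toy 2-D objects; no class labels are given (unsupervised).
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

# Group the objects so that intra-cluster similarity is high.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster label generated for each object
print(km.cluster_centers_)  # representative centre of each cluster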

Outlier analysis

 Outliers are data that do not comply with the general behavior or model of the data.
 Noisy data or exceptional data are also called outlier data.
 The analysis of outlier data is referred to as outlier mining
 Useful for fraud detection.
E.g. Detect purchases of extremely large amounts
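
One simple way to flag such exceptional values is an interquartile-range rule, sketched below on invented purchase amounts:

import numpy as np

# Hypothetical purchase amounts; the last one is an extremely large purchase.
amounts = np.array([120, 90, 110, 130, 95, 105, 9800])

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Values far outside the usual range are treated as outliers (possible fraud).
outliers = amounts[(amounts > q3 + 1.5 * iqr) | (amounts < q1 - 1.5 * iqr)]
print(outliers)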

1.2 DATA PROCESSING

1.2.1 Understanding Data

 Data is a collection of data objects and their attributes.
 An attribute is a property or characteristic of an object.
 Attribute is also known as variable, field, characteristic, or feature.
 A collection of attributes describes an object.
 Object is also known as record, point, case, sample, entity, or instance.
 Attribute values are numbers or symbols.
 Same attribute can be mapped to different attribute values.
 For instance, height can be measured in feet or meters.
 Different attributes can be mapped to the same set of values, like attribute values
for employee-id and age are integers.
 Properties of attribute values can be different, employee-id has no limit but age
has a maximum and minimum value.
Data objects

 Data sets (databases) are made up of data objects.
 A data object represents an entity.
For example:
Sales database: customer, sales, store items
Medical database: patients, treatments

University database: students, professors, courses

 Also called samples, examples, instances, data points, objects, tuples.


 Data objects are described by attributes.
 If data objects are stored in a database, they are called data tuples.
 Database rows -> data objects; columns -> attributes.

Attribute
 An attribute (also called a dimension, feature, or variable) is a data field that
represents a characteristic or feature of a data object.
 For a customer object, attributes can be customer ID, address, etc.

Types of attribute

1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Numeric, Discrete, Continuous)

Nominal Attributes
 The values of a nominal attribute are names of things or symbols.
 Values of nominal attributes represent some category or state.
 Also referred to as categorical attributes.
 There is no order among the values of a nominal attribute.

Attribute           Values
Colors              Black, Brown, White
Categorical data    Lecturer, Professor, Asst Professor

Fig: Nominal Attributes

Binary Attributes
 Binary attribute is a nominal attribute.
 Binary data has only two values or states: 0 or 1, where 0 means that the attribute
is absent and 1 means that it is present.
 Also referred to as Boolean if the two states correspond to yes and no.
Types:
Symmetric: Both values are equally important.

Attribute    Value
Gender       Male, Female

Asymmetric: Both values are not equally important.

Attribute          Value
Cancer detected    Yes, No
Result             Pass, Fail

Ordinal Attributes
 The ordinal attributes contain values that have a meaningful order but the
magnitude between values is not known.

Attribute          Value
Grade              A, B, C, D, E, F
Basic pay scale    16, 17, 18

Fig: Ordinal Attributes

Numeric attributes:
 A numeric attribute is quantitative because it is a measurable quantity,
represented in integer or real values.
 Numerical attributes are of two types, interval and ratio.

Interval
 Measured on a scale of equal-sized units
 Values have order
E.g., temperature in °C or °F, calendar dates
 No true zero-point

Ratio

 A ratio attribute is a numeric attribute with a fixed zero point.
 The values are ordered, and we can also compute the difference between
values, as well as the mean, median, mode, and quantile range.
e.g., temperature in Kelvin, length, counts, monetary quantities

Discrete

 Has only a finite or countably infinite set of values.
 It can be numerical and can also be in categorical form.
E.g., zip codes, professions, or the set of words in a collection of documents.

Attribute     Value
Profession    Teacher, Peon, Businessman
ZIP code      600190, 654783

Fig: Discrete Attributes

Continuous
 Continuous data have an infinite number of possible values.
E.g., temperature, height, or weight.
 Continuous data is of float type.

Attribute    Value
Height       5.4, 6.2, etc.
Weight       55.8, 67, 34, etc.

Fig: Continuous Attributes

1.2.2 Pre-Processing Data


Data preprocessing

It is a data mining technique that involves transforming raw data into an
understandable format.

 Data in the real world is usually incomplete, inconsistent, and noisy.


o incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
o noisy: containing errors or outliers
o inconsistent: lacking compatibility or similarity between two or more facts

No quality data, no quality mining results!

 Quality decisions must be based on quality data.
 A data warehouse needs consistent integration of quality data.

Measures of data quality

 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Accessibility
1.2.3 Major Tasks in Data Preprocessing (Form of data Preprocessing)

Major steps involved in Data preprocessing:

 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction

Fig: Forms of Data Preprocessing


1.2.4 Data Cleaning

Data cleaning routines work to "clean" the data by:

 Filling in missing values
 Identifying outliers and smoothing out noisy data
 Correcting inconsistencies in the data

Data Cleaning Process

1. Missing Values

 Data is not always available

E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data

Methods to fill the missing values:

Ignore the tuple:

 This is usually done when the class label is missing.
 This method is not very effective, unless the tuple contains several attributes
with missing values.

Enter the missing value manually

 In general, this approach is time consuming and may not be feasible when the
data set is large and has many missing values.

Use the global constant to fill in the missing value

 Replace all missing attributes values by the same constant such as a label like
“Unknown” or -∞.

Use the attribute mean of all samples belonging to the same class as the given
tuple.

Use the attribute mean to fill in the missing values

Using the most probable value to fill in the missing value:

 Filling in with the most probable value can use regression, a Bayesian
formulation, or decision tree induction.
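
The mean-filling strategies above can be sketched with pandas (the column names and values are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "income": [40000, None, 55000, None, 61000],
    "class": ["low", "low", "high", "high", "high"],
})

# Fill every missing income with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the mean of all samples belonging to the same class as the given tuple.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)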

2. Noisy Data
Noise is a random error or variance in a measured variable.
Incorrect attribute values may be due to

– faulty data collection instruments


– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention

Methods to handle noisy data

Binning
Binning method:
 First sort data and partition into (equi-depth) bins
 Then smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
Smoothing by bin means: Each value in a bin is replaced by the mean value
of the bin.
Smoothing by bin median: Each bin value is replaced by its bin median value.
Smoothing by bin boundaries: The minimum and maximum values in a given
bin are identified as the bin boundaries. Each bin value is then replaced
with the closest boundary value.
Regression
 Smooth by fitting the data to a regression function.
 Data smoothing can be done by regression, a technique that conforms the data
values to a function.
Outlier analysis
 Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters.”
 Values that fall outside of the set of clusters may be considered outliers.
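
Of the methods above, smoothing by bin means is the easiest to sketch in code. The nine sorted values below are invented and are split into equi-depth bins of three values each:

import numpy as np

# Sorted toy values to be smoothed (hypothetical prices).
values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into equi-depth bins of three values each.
bins = values.reshape(-1, 3)

# Smoothing by bin means: replace each value by the mean of its bin.
smoothed = np.repeat(bins.mean(axis=1), 3)
print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]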

1.2.5 Need for Data Integration

Data Integration
Data Integration is a data preprocessing technique that involves merging of
data from multiple data sources and provides a unified view of the data.

These sources may include multiple data cubes, databases or flat files.

Careful integration can help reduce and avoid redundancies and inconsistencies in the
resulting data set.
This can help improve the accuracy and speed of the subsequent data mining process.
The data integration approach is formally defined as a triple <G, S, M>,
where:
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between the queries over the source and global schemas.

There are two major approaches to data integration:

1. Tight Coupling
2. Loose Coupling

Tight Coupling:

 In this approach, a data warehouse is treated as an information retrieval
component.
 In this coupling, data is combined from different sources into a single physical
location through the process of ETL - Extraction, Transformation, and Loading.

Loose Coupling:

 In this approach, an interface is provided that takes the query from the user,
transforms it in a way the source database can understand and then sends the
query directly to the source databases to obtain the result.
 In loose coupling, the data remains only in the actual source databases.

Issues in Data Integration:

Schema Integration:
 Integrate metadata from different sources.
 Matching equivalent real-world entities from multiple sources is referred to as the
entity identification problem.

Redundancy:
 An attribute may be redundant if it can be derived or obtained from another
attribute or set of attributes.
 Inconsistencies in attributes can also cause redundancies in the resulting data set.
 Some redundancies can be detected by correlation analysis.

Detection and resolution of data value conflicts:

 This is the third important issue in data integration.
 Attribute values from different sources may differ for the same real-world entity.
 An attribute in one system may be recorded at a lower level of abstraction than the
"same" attribute in another.

Careful integration can help reduce or avoid redundancies and inconsistencies and
improve mining speed and quality.

1.2.6 Steps in Data Transformation

Data Transformation

Data transformation is a technique used to convert raw data into a suitable format that
allows data mining to retrieve strategic information efficiently and quickly.

In the data transformation process, data are transformed from one format into a different
format that is more appropriate for data mining, so that the resulting mining process may
be more efficient and the patterns found may be easier to understand.

Steps involved in data transformation:

1. Smoothing
Smoothing is a process of removing noise from the dataset. Techniques include
binning, regression, and clustering.

2. Aggregation
Aggregation is a process where summary or aggregation operations are applied to
the data.

3. Generalization
In generalization, low-level or "primitive" data are replaced with high-level data
by using concept hierarchies.
For example, an attribute like street can be generalized to a higher-level concept such as
city or country.
4. Normalization
In normalization, the attribute data are scaled so as to fall within a small specified range.
Data normalization involves converting all data variables into a given range, such
as 0.0 to 1.0 or -1.0 to 1.0.
Methods

Min-max normalization
 This transforms the original data linearly.

Z-score normalization
 In z-score normalization (or zero-mean normalization) the values of an
attribute (A), are normalized based on the mean of A and its standard
deviation
Normalization by decimal scaling
 It normalizes the values of an attribute by changing the position of their
decimal points
 The number of points by which the decimal point is moved can be
determined by the absolute maximum value of attribute A.
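
The three methods can be sketched in a few lines of NumPy (the sample values are invented; this is an illustration only):

import numpy as np

a = np.array([200.0, 300.0, 400.0, 600.0, 986.0])

# Min-max normalization: linear mapping of A onto a new range, here [0, 1].
min_max = (a - a.min()) / (a.max() - a.min())

# Z-score (zero-mean) normalization: based on the mean and standard deviation of A.
z_score = (a - a.mean()) / a.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# the largest absolute value falls below 1.
j = int(np.floor(np.log10(np.abs(a).max()))) + 1
decimal_scaled = a / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")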

5. Attribute Construction
In attribute construction, new attributes are constructed and added from the
given set of attributes. This simplifies the original data and makes the mining more
efficient.

6. Discretization
It is the process of transforming continuous data into a set of small intervals.
For example, numeric ranges such as (1-10, 11-20), or age groups such as (young,
middle age, senior).
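
A sketch of discretizing a continuous age attribute with pandas (the cut points and labels are illustrative choices, not prescribed ones):

import pandas as pd

ages = pd.Series([13, 25, 37, 48, 62, 71])

# Transform the continuous attribute into a small set of labelled intervals.
age_group = pd.cut(ages, bins=[0, 20, 40, 60, 120],
                   labels=["child", "young", "middle age", "senior"])
print(age_group)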

1.2.7 Data Reduction

Data reduction is a technique for obtaining a reduced representation of the data set
that is much smaller in volume yet produces the same (or almost the same) analytical
results.

Need for data reduction

Data reduction techniques are used to obtain a reduced representation of the dataset that
is much smaller in volume by maintaining the integrity of the original data.

Data reduction does not affect the result obtained from data mining. That means the
result obtained from data mining before and after data reduction is the same or almost
the same.

Data Reduction Strategies

Data reduction strategies include:

 Data cube aggregation
 Dimensionality reduction
 Numerosity reduction
 Data compression

Data cube aggregation

 This technique is used to aggregate data in a simpler form.
 Data cube aggregation is a multidimensional aggregation that uses aggregation at
various levels of a data cube to represent the original data set, thus achieving data
reduction.
Dimensionality reduction

 The process of reducing the number of random variables or attributes under
consideration.
 Refers to methods where encoding techniques are used to minimize the size of
the data set.
 Dimensionality reduction methods include wavelet transforms and principal
components analysis, which transform or project the original data onto a smaller
space.
 Attribute subset selection is a method of dimensionality reduction in which
irrelevant, weakly relevant, or redundant attributes or dimensions are detected
and removed.
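
A small sketch of projecting data onto a smaller space with principal components analysis in scikit-learn (using its bundled iris data as a stand-in data set):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # 4 original attributes
pca = PCA(n_components=2)           # keep only 2 principal components
X_reduced = pca.fit_transform(X)    # project onto the smaller space

print(X.shape, "->", X_reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())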

Numerosity reduction

 Reduces data volume by choosing alternative, smaller forms of data representation.
 These techniques may be parametric or nonparametric.

Parametric methods
 Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers).
Example: Regression and log-linear models

Nonparametric methods

 Do not assume a model.
 Store reduced representations of the data; these include histograms, clustering,
and sampling.
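
Simple random sampling is one nonparametric way to keep a reduced representation, sketched here with pandas on an invented data frame:

import pandas as pd

# Hypothetical data set of 100,000 rows.
df = pd.DataFrame({"value": range(100000)})

# Keep a 1% simple random sample as the reduced representation.
sample = df.sample(frac=0.01, random_state=0)
print(len(df), "->", len(sample))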

Data compression

The data compression technique reduces the size of the files using different
encoding mechanisms.

Lossless compression: Encoding techniques (e.g., Run Length Encoding) allow a simple and
minimal reduction in data size. Lossless data compression uses algorithms to restore the
precise original data from the compressed data.
Lossy compression: In lossy data compression, the decompressed data may differ from
the original data but are still useful enough to retrieve information from them.
For example, the JPEG image format uses lossy compression, but we can find meaning
equivalent to the original image. Methods such as the Discrete Wavelet Transform and
PCA (principal component analysis) are examples of this type of compression.
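
Lossless compression can be sketched with Python's standard zlib module; the exact original bytes are recovered after decompression (the text itself is arbitrary):

import zlib

data = b"data mining and warehousing " * 1000

compressed = zlib.compress(data)          # lossless encoding
restored = zlib.decompress(compressed)    # exact original data is restored

print(len(data), "->", len(compressed), "bytes; identical:", restored == data)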
