What is Data Validation?
Data validation is the process of verifying and validating data that is
collected before it is used. Any type of data handling task, whether it is
gathering data, analyzing it, or structuring it for presentation, must
include data validation to ensure accurate results. Sometimes it can be
tempting to skip validation since it takes time. However, it is an essential
step toward garnering the best results possible.
Several checks are built into a system to ensure that the data being
entered and stored is logically consistent. With advances in
technology, data validation today is much faster. Most of the data
integration platforms incorporate and automate the data validation step
so that it becomes an inherent step in the entire workflow rather than an
additional one. In such automated systems, there is little human
intervention required. Data validation becomes essential because poor-
quality data causes issues downstream, and there are higher costs
attached to cleansing data if done later in the process.
The data validation process has gained significant importance within
organizations involved with data and its collection, processing, and
analysis. It is considered to be the foundation for efficient data
management since it facilitates analytics based on meaningful and valid
datasets.
What is Data Validation?
Data validation refers to the process of ensuring the accuracy and
quality of data. It is implemented by building several checks into
a system or report to ensure the logical consistency of input and
stored data.
In automated systems, data is entered with minimal or no human
supervision. Therefore, it is necessary to ensure that the data that
enters the system is correct and meets the desired quality
standards. Data that is not entered properly will be of little use
and can create bigger downstream reporting issues. Unstructured
data, even if entered correctly, will incur additional costs for
cleaning, transformation, and storage.
Types of Data Validation
There are many types of data validation. Most data validation
procedures will perform one or more of these checks to ensure
that the data is correct before storing it in the database. Common
types of data validation checks include:
1. Data Type Check
A data type check confirms that the data entered has the correct
data type. For example, a field might only accept numeric data. If
this is the case, then any data containing other characters such
as letters or special symbols should be rejected by the system.
2. Code Check
A code check ensures that a field is selected from a valid list of
values or follows certain formatting rules. For example, a postal
code can be verified by checking it against a list of
valid codes. The same concept can be applied to other items such
as country codes and NAICS industry codes.
3. Range Check
A range check will verify whether input data falls within a
predefined range. For example, latitude and longitude are
commonly used in geographic data. A latitude value should be
between -90 and 90, while a longitude value must be between -180
and 180. Any values out of this range are invalid.
4. Format Check
Many data types follow a certain predefined format. A common
use case is date columns that are stored in a fixed format like
“YYYY-MM-DD” or “DD-MM-YYYY.” A data validation procedure
that ensures dates are in the proper format helps maintain
consistency across data and through time.
5. Consistency Check
A consistency check is a type of logical check that confirms the
data’s been entered in a logically consistent way. An example is
checking if the delivery date is after the shipping date for a
parcel.
6. Uniqueness Check
Some data, such as IDs or e-mail addresses, are unique by nature, and a
database should enforce uniqueness on these fields. A
uniqueness check ensures that an item is not entered multiple
times into a database.
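As a rough illustration of how these checks fit together in practice, the sketch below runs a record through type, code, range, format, consistency, and uniqueness checks in Python. The field names (order_id, ship_date, delivery_date, and so on) and the valid-code list are hypothetical, chosen only for this example.

```python
# Illustrative sketch only: combines the six validation checks described above.
from datetime import date

VALID_COUNTRIES = {"US", "IN", "DE", "JP"}   # hypothetical code list

def validate_record(record, seen_ids):
    errors = []

    # 1. Data type check: quantity must be an integer
    if not isinstance(record.get("quantity"), int):
        errors.append("quantity must be an integer")

    # 2. Code check: country must come from the list of valid codes
    if record.get("country") not in VALID_COUNTRIES:
        errors.append("unknown country code")

    # 3. Range check: latitude in [-90, 90], longitude in [-180, 180]
    if not -90 <= record.get("lat", 0) <= 90:
        errors.append("latitude out of range")
    if not -180 <= record.get("lon", 0) <= 180:
        errors.append("longitude out of range")

    # 4. Format check: dates must be stored as "YYYY-MM-DD" strings
    try:
        ship = date.fromisoformat(record["ship_date"])
        delivery = date.fromisoformat(record["delivery_date"])
    except (KeyError, ValueError):
        errors.append("dates must use the YYYY-MM-DD format")
    else:
        # 5. Consistency check: delivery cannot precede shipping
        if delivery < ship:
            errors.append("delivery date is before ship date")

    # 6. Uniqueness check: the order ID must not already exist
    if record.get("order_id") in seen_ids:
        errors.append("duplicate order_id")

    return errors
```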
Analyzing information requires structured and accessible data for best results. Data
transformation enables organizations to alter the structure and format of raw data as needed.
Learn how your enterprise can transform its data to perform analytics efficiently.
What is data transformation?
Data transformation is the process of changing the format, structure, or values of data. For
data analytics projects, data may be transformed at two stages of the data pipeline.
Organizations that use on-premises data warehouses generally use an ETL (extract,
transform, load) process, in which data transformation is the middle step. Today, most
organizations use cloud-based data warehouses, which can scale compute and storage
resources with latency measured in seconds or minutes. The scalability of the cloud platform
lets organizations skip preload transformations and load raw data into the data warehouse,
then transform it at query time — a model called ELT ( extract, load, transform).
Processes such as data integration, data migration, data warehousing, and data
wrangling all may involve data transformation.
Data transformation may be constructive (adding, copying, and replicating data), destructive
(deleting fields and records), aesthetic (standardizing salutations or street names), or
structural (renaming, moving, and combining columns in a database).
An enterprise can choose among a variety of ETL tools that automate the process of data
transformation. Data analysts, data engineers, and data scientists also transform data
using scripting languages such as Python or domain-specific languages like SQL.
Benefits and challenges of data transformation
Transforming data yields several benefits:
Data is transformed to make it better-organized. Transformed data may be easier for both
humans and computers to use.
Properly formatted and validated data improves data quality and protects applications from
potential landmines such as null values, unexpected duplicates, incorrect indexing, and
incompatible formats.
Data transformation facilitates compatibility between applications, systems, and types of
data. Data used for multiple purposes may need to be transformed in different ways.
However, there are challenges to transforming data effectively:
Data transformation can be expensive. The cost is dependent on the specific infrastructure,
software, and tools used to process data. Expenses may include those related to licensing,
computing resources, and hiring necessary personnel.
Data transformation processes can be resource-intensive. Performing transformations in an
on-premises data warehouse after loading, or transforming data before feeding it into
applications, can create a computational burden that slows down other operations. If you
use a cloud-based data warehouse, you can do the transformations after loading because
the platform can scale up to meet demand.
Lack of expertise and carelessness can introduce problems during transformation. Data
analysts without appropriate subject matter expertise are less likely to notice typos or
incorrect data because they are less familiar with the range of accurate and permissible
values. For example, someone working on medical data who is unfamiliar with relevant
terms might fail to flag disease names that should be mapped to a singular value or notice
misspellings.
Enterprises can perform transformations that don't suit their needs. A business might
change information to a specific format for one application only to then revert the
information back to its prior format for a different application.
How to transform data
Data transformation can increase the efficiency of analytic and business processes and enable
better data-driven decision-making. The first phase of data transformations should include
things like data type conversion and flattening of hierarchical data. These operations shape
data to increase compatibility with analytics systems. Data analysts and data scientists can
implement further transformations additively as necessary as individual layers of
processing. Each layer of processing should be designed to perform a specific set of tasks
that meet a known business or technical requirement.
Data transformation serves many functions within the data analytics stack.
Extraction and parsing
In the modern ELT process, data ingestion begins with extracting information from a data
source, followed by copying the data to its destination. Initial transformations are focused on
shaping the format and structure of data to ensure its compatibility with both the destination
system and the data already there. Parsing fields out of comma-delimited log data for loading
to a relational database is an example of this type of data transformation.
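As a small, hypothetical sketch of this step, the snippet below parses comma-delimited log lines (assumed here to follow a timestamp,level,message layout) into structured rows that could be loaded into a relational table:

```python
# Illustrative only: parse comma-delimited log lines into structured rows.
import csv
import io

raw_log = (
    "2023-04-01T12:00:00,INFO,job started\n"
    "2023-04-01T12:05:00,ERROR,disk full\n"
)

rows = []
for timestamp, level, message in csv.reader(io.StringIO(raw_log)):
    rows.append({"timestamp": timestamp, "level": level, "message": message})

print(rows[0])   # ready to be inserted into a relational table
```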
Translation and mapping
Some of the most basic data transformations involve the mapping and translation of data. For
example, a column containing integers representing error codes can be mapped to the relevant
error descriptions, making that column easier to understand and more useful for display in a
customer-facing application.
Translation converts data from formats used in one system to formats appropriate for a
different system. Even after parsing, web data might arrive in the form of hierarchical JSON
or XML files, but need to be translated into row and column data for inclusion in a relational
database.
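The sketch below illustrates this kind of translation on a hypothetical JSON order document, flattening it into one row per line item for a relational table:

```python
# Illustrative only: flatten hierarchical JSON into row-and-column records.
import json

doc = json.loads("""
{
  "order_id": 42,
  "customer": {"id": 7, "name": "Acme"},
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B9", "qty": 1}
  ]
}
""")

# One output row per line item, with parent fields repeated on each row.
rows = [
    {
        "order_id": doc["order_id"],
        "customer_id": doc["customer"]["id"],
        "sku": item["sku"],
        "qty": item["qty"],
    }
    for item in doc["items"]
]
print(rows)
```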
Filtering, aggregation, and summarization
Data transformation is often concerned with whittling data down and making it more
manageable. Data may be consolidated by filtering out unnecessary fields, columns, and
records. Omitted data might include numerical indexes in data intended for graphs and
dashboards or records from business regions that aren’t of interest in a particular study.
Data might also be aggregated or summarized by, for instance, transforming a time series of
customer transactions to hourly or daily sales counts.
BI tools can do this filtering and aggregation, but it can be more efficient to do the
transformations before a reporting tool accesses the data.
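As one possible sketch of such a summarization (using pandas, with illustrative column names), the snippet below rolls a time series of transactions up into daily transaction counts and revenue:

```python
# Illustrative only: summarize a transaction time series into daily figures.
import pandas as pd

transactions = pd.DataFrame({
    "ts": pd.to_datetime(["2023-04-01 09:15", "2023-04-01 17:40", "2023-04-02 11:05"]),
    "amount": [19.99, 5.00, 42.50],
})

daily = (
    transactions
    .set_index("ts")
    .resample("D")["amount"]                 # one bucket per calendar day
    .agg(["count", "sum"])
    .rename(columns={"count": "sales_count", "sum": "revenue"})
)
print(daily)
```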
Enrichment and imputation
Data from different sources can be merged to create denormalized, enriched information. A
customer’s transactions can be rolled up into a grand total and added into a customer
information table for quicker reference or for use by customer analytics systems. Long or
freeform fields may be split into multiple columns, and missing values can be imputed or
corrupted data replaced as a result of these kinds of transformations.
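The sketch below illustrates both ideas with pandas, using hypothetical customer and transaction tables: per-customer totals are rolled up and merged into the customer table, and missing values are imputed with simple defaults:

```python
# Illustrative only: enrich a customer table and impute missing values.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "segment": ["retail", None, "wholesale"]})
transactions = pd.DataFrame({"customer_id": [1, 1, 3],
                             "amount": [10.0, 15.0, 99.0]})

# Roll transactions up to one grand total per customer.
totals = (transactions.groupby("customer_id", as_index=False)["amount"].sum()
          .rename(columns={"amount": "lifetime_value"}))

enriched = customers.merge(totals, on="customer_id", how="left")
enriched["lifetime_value"] = enriched["lifetime_value"].fillna(0.0)  # no purchases yet
enriched["segment"] = enriched["segment"].fillna("unknown")          # simple imputation
print(enriched)
```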
Indexing and ordering
Data can be transformed so that it's ordered logically or to suit a data storage scheme. In
relational database management systems, for example, creating indexes can improve
performance or improve the management of relationships between different tables.
Anonymization and encryption
Data containing personally identifiable information, or other information that could
compromise privacy or security, should be anonymized before propagation. Encryption of
private data is a requirement in many industries, and systems can perform encryption at
multiple levels, from individual database cells to entire records or fields.
Modeling, typecasting, formatting, and renaming
Finally, a whole set of transformations can reshape data without changing content. This
includes casting and converting data types for compatibility, adjusting dates and times with
offsets and format localization, and renaming schemas, tables, and columns for clarity.
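A brief pandas sketch of this kind of content-preserving reshaping, with illustrative column names:

```python
# Illustrative only: typecast, normalize timestamps, and rename columns
# without changing the underlying content.
import pandas as pd

df = pd.DataFrame({"uid": ["1", "2"],
                   "signup": ["2023-04-01 12:00:00", "2023-04-02 08:30:00"]})

df["uid"] = df["uid"].astype(int)                      # cast string -> int
df["signup"] = pd.to_datetime(df["signup"], utc=True)  # parse and normalize to UTC
df = df.rename(columns={"uid": "user_id", "signup": "signup_at"})
print(df.dtypes)
```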
Refining the data transformation process
Before your enterprise can run analytics, and even before you transform the data, you must
replicate it to a data warehouse architected for analytics. Most organizations today choose a
cloud data warehouse, allowing them to take full advantage of ELT.
What is Feature Extraction?
Feature extraction is a process of dimensionality reduction by which an initial set of raw data
is reduced to more manageable groups for processing. A characteristic of these large data
sets is a large number of variables that require a lot of computing resources to process.
Feature extraction is the name for methods that select and/or combine variables into
features, effectively reducing the amount of data that must be processed, while still
accurately and completely describing the original data set.
Why is this Useful?
The process of feature extraction is useful when you need to reduce the number of
resources needed for processing without losing important or relevant information.
Feature extraction can also reduce the amount of redundant data for a given
analysis. Also, reducing the data and letting the machine build variable
combinations (features) speeds up the learning and generalization steps in
the machine learning process.
Practical Uses of Feature Extraction
Autoencoders
– The purpose of autoencoders is unsupervised learning of efficient data
codings. By learning to reconstruct the original data from a compressed
representation, the network identifies key features that serve as a new encoding of the data set.
Bag-of-Words
– A technique for natural language processing that extracts the words
(features) used in a sentence, document, website, etc. and classifies them by
frequency of use (a minimal sketch appears after this list). This technique can also be applied to image processing.
Image Processing – Algorithms are used to detect features such as shapes, edges,
or motion in a digital image or video.
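Here is the minimal bag-of-words sketch referenced above, written in plain Python. Real projects usually rely on a library vectorizer, but the idea is the same: each document becomes a word-frequency vector over a shared vocabulary.

```python
# Illustrative only: a tiny bag-of-words representation.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

vocab = sorted({word for doc in docs for word in doc.split()})
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)     # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)   # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```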
In machine learning, pattern recognition, and image processing, feature extraction starts from
an initial set of measured data and builds derived values (features) intended to be informative
and non-redundant, facilitating the subsequent learning and generalization steps, and in some
cases leading to better human interpretations. Feature extraction is related to dimensionality
reduction.[1]
When the input data to an algorithm is too large to be processed and it is suspected to be
redundant (e.g. the same measurement in both feet and meters, or the repetitiveness of images
presented as pixels), then it can be transformed into a reduced set of features (also named
a feature vector). Determining a subset of the initial features is called feature selection.[2] The
selected features are expected to contain the relevant information from the input data, so that the
desired task can be performed by using this reduced representation instead of the complete
initial data.
General
Feature extraction involves reducing the number of resources required to describe a large set of
data. When performing analysis of complex data one of the major problems stems from the
number of variables involved. Analysis with a large number of variables generally requires a
large amount of memory and computation power, and it may also cause a classification algorithm
to overfit to training samples and generalize poorly to new samples. Feature extraction is a
general term for methods of constructing combinations of the variables to get around these
problems while still describing the data with sufficient accuracy. Many machine
learning practitioners believe that properly optimized feature extraction is the key to effective
model construction.[3]
Results can be improved using constructed sets of application-dependent features, typically built
by an expert. One such process is called feature engineering. Alternatively, general
dimensionality reduction techniques are used such as:
Independent component analysis
Isomap
Kernel PCA
Latent semantic analysis
Partial least squares
Principal component analysis
Multifactor dimensionality reduction
Nonlinear dimensionality reduction
Semidefinite embedding
Autoencoder
Data reduction
Data reduction is the process of reducing the amount of capacity required
to store data. Data reduction can increase storage efficiency and reduce
costs. Storage vendors will often describe storage capacity in terms
of raw capacity and effective capacity, which refers to data after the
reduction.
Data reduction can be achieved in several ways. The main types are data
deduplication, compression and single-instance storage. Data
deduplication, also known as data dedupe, eliminates redundant segments
of data on storage systems. It only stores redundant segments once and
uses that one copy whenever a request is made to access that piece of
data. Data dedupe is more granular than single-instance storage. Single-
instance storage finds files such as email attachments sent to multiple
people and only stores one copy of that file. As with dedupe, single-
instance storage replaces duplicates with pointers to the one saved copy.
Some storage arrays track which blocks are the most heavily shared.
Those blocks that are shared by the largest number of files may be moved
to a memory- or flash storage-based cache so they can be read as
efficiently as possible.
Data compression also works at the file level. It is performed natively in
storage systems and refers to a data reduction method by which files are
shrunk at the bit level. Compression uses algorithms or formulas to reduce the
number of bits needed to represent the data, usually by representing a
repeating string of bits with a smaller string of bits and using a dictionary to
convert between them.
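As a quick illustration of this dictionary-style idea, the snippet below uses Python's zlib module to compress a highly repetitive byte string; the exact compressed sizes will vary, but repetitive data shrinks dramatically while random data barely shrinks at all.

```python
# Illustrative only: repeated byte sequences compress far better than random bytes.
import os
import zlib

repetitive = b"ABCABCABC" * 1000
random_bytes = os.urandom(9000)

print(len(repetitive), "->", len(zlib.compress(repetitive)))       # shrinks dramatically
print(len(random_bytes), "->", len(zlib.compress(random_bytes)))   # barely shrinks
```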
Common techniques of data reduction
There are also ways to reduce the amount of data that has to be stored without
actually shrinking the sizes of blocks and files. These techniques include thin
provisioning and data archiving.
Thin provisioning is achieved by dynamically allocating storage space in a flexible
manner. This method keeps reserved space just a little ahead of actual written
space, enabling more unreserved space to be used by other applications.
Traditional thick provisioning allocates fixed amounts of storage space as soon as a
disk is created, regardless of whether that entire capacity will be filled.
Differences between thin and thick provisioning
Archiving data also reduces data on storage systems, but the approach is quite
different. Rather than reducing data within files or databases, archiving removes
older, infrequently accessed data from expensive storage and moves it to low-cost,
high-capacity storage. Archive storage can be on disk, tape or cloud.
Data reduction for primary storage
Although data deduplication was first developed for backup data on secondary
storage, it is possible to deduplicate primary storage. Primary storage deduplication
can occur as a function of the storage hardware or operating system
(OS). Windows Server 2012 and Windows Server 2012 R2, for instance, have
built-in data deduplication capabilities. The deduplication engine uses post-
processing deduplication, which means deduplication does not occur in real time.
Instead, a scheduled process periodically deduplicates primary storage data.
Primary storage deduplication is a common feature of many all-flash storage
systems. Because flash storage is expensive, deduplication is used to make the
most of flash storage capacity. Also, because flash storage offers such high
performance, the overhead of performing deduplication has less of an impact than
it would on a disk system.
DATA SAMPLING
In data analysis, sampling is the practice of analyzing a subset of all data
in order to uncover the meaningful information in the larger data set.
What is Sampling?
It is the practice of selecting a subset of individuals from a population in
order to draw conclusions about the whole population.
Let’s say we want to know the percentage of people who use iPhones in a
city, for example. One way to do this is to call up everyone in the city and
ask them what type of phone they use. The other way would be to get a
smaller subgroup of individuals and ask them the same question, and
then use this information as an approximation of the total population.
However, this process is not as simple as it sounds. Whenever you follow
this method, your sample size has to be ideal - it should not be too large
or too small. Once you have decided on the size of your sample, you
must use the right sampling technique to collect the sample from
the population. Ultimately, every sampling technique falls under two broad
categories:
Probability sampling - Random selection techniques are used to select
the sample.
Non-probability sampling - Non-random selection techniques based on
certain criteria are used to select the sample.
Types of Sampling Techniques in Data Analytics
Now, let’s discuss the types of sampling in data analytics. First, let us start
with the Probability Sampling techniques.
Probability Sampling Techniques
Probability sampling techniques are one of the important categories of
sampling techniques. Probability sampling gives every member of the
population a chance of being selected. It is mainly used in quantitative
research when you want to produce results representative of the whole
population.
1. Simple Random Sampling
In simple random sampling, the researcher selects the participants
randomly. Tools such as random number
generators and random number tables are used so that selection is based
entirely on chance.
Example: The researcher assigns every member in a company database a
number from 1 to 1000 (depending on the size of the company) and then
uses a random number generator to select 100 members.
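A minimal Python sketch of this example (the population and sample sizes are illustrative):

```python
# Illustrative only: simple random sampling without replacement.
import random

population = list(range(1, 1001))          # members numbered 1 to 1000
sample = random.sample(population, k=100)  # 100 members chosen purely by chance
```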
2. Systematic Sampling
In systematic sampling, every member of the population is given a number,
as in simple random sampling. However, instead of randomly generating
numbers, the samples are chosen at regular intervals.
Example: The researcher assigns every member in the company database
a number. Instead of randomly generating numbers, a random starting
point (say 5) is selected. From that number onwards, the researcher
selects every, say, 10th person on the list (5, 15, 25, and so on) until the
sample is obtained.
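A minimal Python sketch of systematic sampling, assuming an interval of 10 as in the example:

```python
# Illustrative only: a random start, then every 10th member of the list.
import random

population = list(range(1, 1001))
step = 10
start = random.randrange(step)    # random starting point within the first interval
sample = population[start::step]  # every 10th member from the random starting point
```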
3. Stratified Sampling
In stratified sampling, the population is subdivided into subgroups, called
strata, based on some characteristics (age, gender, income, etc.). After
forming a subgroup, you can then use random or systematic sampling to
select a sample for each subgroup. This method allows you to draw more
precise conclusions because it ensures that every subgroup is properly
represented.
Example: If a company has 500 male employees and 100 female
employees, the researcher wants to ensure that the sample reflects the
gender as well. So the population is divided into two subgroups based on
gender.
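A minimal pandas sketch of stratified sampling, drawing the same fraction from each gender stratum (the column names and counts are illustrative):

```python
# Illustrative only: sample 10% of each stratum so both groups are represented.
import pandas as pd

employees = pd.DataFrame({
    "id": range(600),
    "gender": ["male"] * 500 + ["female"] * 100,
})

sample = employees.groupby("gender").sample(frac=0.1, random_state=0)
print(sample["gender"].value_counts())   # 50 male, 10 female
```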
4. Cluster Sampling
In cluster sampling, the population is divided into subgroups, but each
subgroup has characteristics similar to the whole population. Instead of
selecting a sample from each subgroup, you randomly select an entire
subgroup. This method is helpful when dealing with large and diverse
populations.
Example: A company has over a hundred offices in ten cities across the
world, each with roughly the same number of employees in similar job
roles. The researcher randomly selects 2 to 3 offices and uses them as the
sample.
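A minimal Python sketch of cluster sampling: whole offices (clusters) are selected at random and every employee in them joins the sample (the data is illustrative):

```python
# Illustrative only: randomly select entire clusters, then take all their members.
import random

offices = {f"office_{i}": [f"emp_{i}_{j}" for j in range(50)] for i in range(100)}

chosen = random.sample(list(offices), k=3)                      # pick 3 whole offices
sample = [emp for office in chosen for emp in offices[office]]
```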
Next come the non-probability sampling techniques.
Non-Probability Sampling Techniques
Non-probability sampling techniques are the other important category of
sampling techniques. In non-probability sampling, not every individual has
a chance of being included in the sample. This sampling method is easier
and cheaper but also has high risks of sampling bias. It is often used in
exploratory and qualitative research with the aim to develop an initial
understanding of the population.
1. Convenience Sampling
In this sampling method, the researcher simply selects the individuals
which are most easily accessible to them. This is an easy way to gather
data, but there is no way to tell if the sample is representative of the
entire population. The only criterion involved is that people are available
and willing to participate.
Example: The researcher stands outside a company and asks the
employees coming in to answer questions or complete a survey.
2. Voluntary Response Sampling
Voluntary response sampling is similar to convenience sampling, in the
sense that the only criterion is people are willing to participate. However,
instead of the researcher choosing the participants, the participants
volunteer themselves.
Example: The researcher sends out a survey to every employee in a
company and gives them the option to take part in it.
3. Purposive Sampling
In purposive sampling, the researcher uses their expertise and judgment
to select a sample that they think is the best fit. It is often used when the
population is very small and the researcher only wants to gain knowledge
about a specific phenomenon rather than make statistical inferences.
Example: The researcher wants to know about the experiences of disabled
employees at a company. So the sample is purposefully selected from this
population.
4. Snowball Sampling
In snowball sampling, the research participants recruit other participants
for the study. It is used when participants required for the research are
hard to find. It is called snowball sampling because like a snowball, it picks
up more participants along the way and gets larger and larger.
Example: The researcher wants to know about the experiences of
homeless people in a city. Since there is no detailed list of homeless
people, a probability sample is not possible. The only way to get the
sample is to get in touch with one homeless person who will then put you
in touch with other homeless people in a particular area.
Which Sampling Technique to Use?
In this article on types of sampling techniques in Data Analytics, we
covered everything about probability and non-probability sampling
techniques. For any type of research, it is necessary that you choose the
right sampling techniques before diving into the study. The effectiveness
of your research is hugely dependent on the sample that you choose.
These are just the top types of sampling techniques and there are still lots
more that you can choose from to refine your research. In order
to become a data analyst, you have to be exactly sure of what sampling
techniques you should use and when.
Types and Sources of Data
Data types and sources can be represented in a variety of ways. The two
primary data types are:
1. Quantitative data, represented as numerical figures - interval and ratio level
measurements.
2. Qualitative data, such as text, images, audio/video, etc.
Although scientific disciplines differ in their preference for one type over
another, some investigators utilize information from both quantitative and
qualitative sources with the expectation of developing a richer understanding of a
targeted phenomenon.
Researchers collect information from human beings that can be
qualitative (ex. observing child-rearing practices) or quantitative
(recording biochemical markers, anthropometric measurements). Data
sources can include field notes, journals, laboratory notes/specimens, or
direct observations of humans, animals, plants. Interactions between data
type and source are not infrequent.
Determining appropriate data is discipline-specific and is primarily driven
by the nature of the investigation, existing literature, and accessibility to
data sources. Questions to consider when selecting data types and
sources are given below:
o What is the research question?
o What is the scope of the investigation? (This defines the parameters of
any study. Selected data should not extend beyond the scope of the
study).
o What has the literature (previous research) determined to be the most
appropriate data to collect?
o What type of data should be considered: quantitative, qualitative, or a
composite of both?
What is Feature Selection in Data Mining?
Feature selection has been an active research area in pattern recognition,
statistics, and data mining communities. The main idea of feature
selection is to choose a subset of input variables by eliminating features
with little or no predictive information. Feature selection can significantly
improve the comprehensibility of the resulting classifier models and often
build a model that generalizes better to unseen points. Further, it is often
the case that finding the correct subset of predictive features is an
important problem in its own right.
For example, a physician may decide based on the selected features
whether a dangerous surgery is necessary for treatment or not. Feature
selection in supervised learning has been well studied, where the main
goal is to find a feature subset that produces higher classification
accuracy.
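As one common illustration of supervised feature selection (not a specific method from the studies mentioned here), the sketch below uses scikit-learn's SelectKBest to keep the features with the strongest univariate association with the class label:

```python
# Illustrative only: univariate filter-style feature selection with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)        # 30 input features
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)          # keep the 5 highest-scoring features
print(X.shape, "->", X_reduced.shape)             # (569, 30) -> (569, 5)
```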
Recently, several researchers have studied feature selection and
clustering together with a single or unified criterion. For feature selection
in unsupervised learning, learning algorithms are designed to find a
natural grouping of the examples in the feature space. Thus feature
selection in unsupervised learning aims to find a good subset of features
that forms high-quality clusters for a given number of clusters.
However, the traditional approaches to feature selection with a single
evaluation criterion have shown limited capability in terms of knowledge
discovery and decision support. This is because decision-makers should
take into account multiple, conflicting objectives simultaneously. In
particular, no single criterion for unsupervised feature selection is best for
every application, and only the decision-maker can determine the relative
weights of criteria for her application.
Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm that is
used for the dimensionality reduction in machine learning. It is a statistical
process that converts the observations of correlated features into a set of
linearly uncorrelated features with the help of orthogonal transformation.
These new transformed features are called the Principal Components. It
is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique for drawing out strong patterns from a
given dataset by reducing the number of dimensions while retaining as much variance as possible.
PCA generally tries to find a lower-dimensional surface onto which to project
the high-dimensional data.
PCA works by considering the variance of each attribute: directions with high
variance carry the most information, so projecting onto them reduces
the dimensionality with little loss. Some real-world applications of PCA are image
processing, movie recommendation systems, and optimizing the power
allocation in various communication channels. It is a feature
extraction technique, so it retains the most informative combinations of variables and
drops the least important ones.
The PCA algorithm is based on some mathematical concepts such as:
o Variance and Covariance
o Eigenvalues and eigenvectors
Some common terms used in PCA algorithm:
o Dimensionality: It is the number of features or variables present in the
given dataset. More easily, it is the number of columns present in the
dataset.
o Correlation: It signifies how strongly two variables are related to
each other; if one changes, the other variable also tends to change.
The correlation value ranges from -1 to +1. Here, -1 occurs if variables are
inversely proportional to each other, and +1 indicates that variables are
directly proportional to each other.
o Orthogonal: It means that variables are not correlated with each other,
and hence the correlation between the pair of variables is zero.
o Eigenvectors: Given a square matrix M and a non-zero vector v, v is an
eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair
of variables is called the Covariance Matrix.
Principal Components in PCA
As described above, the transformed new features or the output of PCA
are the Principal Components. The number of these PCs is either equal
to or less than the number of original features present in the dataset. Some
properties of these principal components are given below:
o The principal component must be the linear combination of the original
features.
o These components are orthogonal, i.e., the correlation between a pair of
variables is zero.
o The importance of each component decreases when going from 1 to n,
meaning the 1st PC has the most importance and the nth PC has the least
importance.
Steps for PCA algorithm
1. Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X
and Y, where X is the training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset in a structured form, such as a
two-dimensional matrix of the independent variables X. Here
each row corresponds to a data item, and each column corresponds to
a feature. The number of columns is the dimensionality of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. Within a particular
column, features with high variance would otherwise dominate
features with lower variance.
If the importance of features should be independent of their variance,
then we divide each data item in a column by the standard deviation
of that column. We will call the resulting matrix Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will
transpose it. After transpose, we will multiply it by Z. The output matrix
will be the Covariance matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors for the
resultant covariance matrix. The eigenvectors of the covariance matrix are
the directions of the axes carrying the most information (variance), and the
corresponding eigenvalues give the amount of variance along each direction.
6. Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and will sort them in
decreasing order, which means from largest to smallest, and
simultaneously sort the eigenvectors accordingly into a matrix P.
The resultant sorted matrix is named P*.
7. Calculating the new features or principal components
Here we will calculate the new features. To do this, we multiply Z by the P*
matrix. In the resultant matrix Z*, each observation is a linear
combination of the original features, and the columns of the Z* matrix are
uncorrelated with each other.
8. Removing less important features from the new dataset
With the new feature set in hand, we decide what to keep and
what to remove: we keep only the relevant or important
components in the new dataset, and the unimportant ones are removed.
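The steps above can be followed almost line for line with NumPy. The sketch below is a minimal, illustrative implementation on random data, not a production PCA routine:

```python
# Illustrative only: PCA following the steps listed above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # 100 observations, 5 features

# Steps 2-3: arrange the data as a matrix and standardize each column
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of Z
cov = np.cov(Z, rowvar=False)

# Step 5: eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 6: sort eigenvectors by decreasing eigenvalue (eigh returns ascending order)
order = np.argsort(eigenvalues)[::-1]
P_star = eigenvectors[:, order]

# Step 7: project Z onto the sorted eigenvectors to obtain the principal components
Z_star = Z @ P_star

# Step 8: keep only the most important components, e.g. the top 2
explained = eigenvalues[order] / eigenvalues.sum()
X_reduced = Z_star[:, :2]
print(explained.round(3), X_reduced.shape)
```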
Applications of Principal Component Analysis
o PCA is mainly used as the dimensionality reduction technique in various AI
applications such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns if data has high dimensions.
Some fields where PCA is used are Finance, data mining, Psychology, etc.
Discretization in data mining
Data discretization refers to a method of converting a large number of
data values into a smaller number of values so that the evaluation and
management of data become easy. In other words, data discretization is a
method of converting the values of continuous attributes into a finite set of
intervals with minimal data loss. There are two forms of data discretization:
the first is supervised discretization, and the second is unsupervised
discretization. Supervised discretization refers to a method in which the
class data is used. Unsupervised discretization refers to a method that is
characterized by the way the operation proceeds, that is, whether it uses a
top-down splitting strategy or a bottom-up merging strategy.
Now, we can understand this concept with the help of an example
Suppose we have an attribute Age with the given values:
Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77
Before discretization (Age values):  1, 5, 4, 9, 7  |  11, 14, 17, 13, 18, 19  |  31, 33, 36, 42, 44, 46
After discretization:                Child          |  Young                   |  Mature
Another example is analytics, where we gather data about website visitors.
For example, all visitors who visit the site from an IP address in India are
grouped together under the country level "India".
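A minimal pandas sketch of the Age example above; the bin edges are illustrative choices that reproduce the three intervals shown in the table:

```python
# Illustrative only: discretize the Age values from the table into three intervals.
import pandas as pd

ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46])
labels = pd.cut(ages, bins=[0, 10, 30, 50], labels=["Child", "Young", "Mature"])
print(labels.value_counts())
```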
Some Famous techniques of data discretization
Histogram analysis
A histogram is a plot used to represent the underlying frequency
distribution of a continuous data set. Histograms assist in inspecting the
data distribution, for example for outliers, skewness, or how close it is to a
normal distribution.
Binning
Binning refers to a data smoothing technique that groups a large number of
continuous values into a smaller number of bins. This technique can also be
used for data discretization and the development of concept hierarchies.
Cluster Analysis
Cluster analysis is a form of data discretization: a clustering algorithm is
applied to a numeric attribute x, partitioning its values into clusters, and
each cluster then becomes one discrete interval of x.
Data discretization using decision tree analysis
Discretization by decision tree analysis uses a top-down splitting technique
and is a supervised procedure. To discretize a numeric attribute, first select
the attribute value that gives the least entropy as a split point, and then
apply this step recursively. The recursive process divides the attribute into
various discretized disjoint intervals, from top to bottom, using the same
splitting criterion.
Data discretization using correlation analysis
Discretization by correlation analysis is a supervised, bottom-up procedure:
a measure of correlation (statistical similarity) between neighboring intervals
is used to find the best neighboring intervals, which are then merged
recursively to form larger intervals.