What is Data Validation?
Data validation is the process of verifying and validating data that is
collected before it is used. Any type of data handling task, whether it is
gathering data, analyzing it, or structuring it for presentation, must
include data validation to ensure accurate results. Sometimes it can be
tempting to skip validation since it takes time. However, it is an essential
step toward garnering the best results possible.
Several checks are built into a system to ensure that the data being
entered and stored is logically consistent. With advances in
technology, data validation today is much faster. Most of the data
integration platforms incorporate and automate the data validation step
so that it becomes an inherent step in the entire workflow rather than an
additional one. In such automated systems, there is little human
intervention required. Data validation becomes essential because poor-
quality data causes issues downstream, and there are higher costs
attached to cleansing data if done later in the process.
The data validation process has gained significant importance within
organizations involved with data and its collection, processing, and
analysis. It is considered to be the foundation for efficient data
management since it facilitates analytics based on meaningful and valid
datasets.
What is Data Validation?
Data validation refers to the process of ensuring the accuracy and
quality of data. It is implemented by building several checks into
a system or report to ensure the logical consistency of input and
stored data.
In automated systems, data is entered with minimal or no human
supervision. Therefore, it is necessary to ensure that the data that
enters the system is correct and meets the desired quality
standards. Data that is not entered properly will be of little use
and can create bigger downstream reporting issues. Unstructured
data, even if entered correctly, will incur additional costs for
cleaning, transformation, and storage.
Types of Data Validation
There are many types of data validation. Most data validation
procedures will perform one or more of these checks to ensure
that the data is correct before storing it in the database. Common
types of data validation checks include:
1. Data Type Check
A data type check confirms that the data entered has the correct
data type. For example, a field might only accept numeric data. If
this is the case, then any data containing other characters such
as letters or special symbols should be rejected by the system.
2. Code Check
A code check ensures that a field is selected from a valid list of
values or follows certain formatting rules. For example, a postal
code can be verified by checking it against a list of
valid codes. The same concept can be applied to other items such
as country codes and NAICS industry codes.
3. Range Check
A range check will verify whether input data falls within a
predefined range. For example, latitude and longitude are
commonly used in geographic data. A latitude value should be
between -90 and 90, while a longitude value must be between -180
and 180. Any values out of this range are invalid.
4. Format Check
Many data types follow a certain predefined format. A common
use case is date columns that are stored in a fixed format like
“YYYY-MM-DD” or “DD-MM-YYYY.” A data validation procedure
that ensures dates are in the proper format helps maintain
consistency across data and through time.
5. Consistency Check
A consistency check is a type of logical check that confirms the
data’s been entered in a logically consistent way. An example is
checking if the delivery date is after the shipping date for a
parcel.
6. Uniqueness Check
Some data, such as IDs or e-mail addresses, are unique by nature, and a
database should enforce uniqueness on these fields. A
uniqueness check ensures that an item is not entered multiple
times into a database.
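As a rough illustration of how these checks fit together in practice, the sketch below runs a record through type, code, range, format, consistency, and uniqueness checks in Python. The field names (order_id, ship_date, delivery_date, and so on) and the valid-code list are hypothetical, chosen only for this example.

```python
# Illustrative sketch only: combines the six validation checks described above.
from datetime import date

VALID_COUNTRIES = {"US", "IN", "DE", "JP"}   # hypothetical code list

def validate_record(record, seen_ids):
    errors = []

    # 1. Data type check: quantity must be an integer
    if not isinstance(record.get("quantity"), int):
        errors.append("quantity must be an integer")

    # 2. Code check: country must come from the list of valid codes
    if record.get("country") not in VALID_COUNTRIES:
        errors.append("unknown country code")

    # 3. Range check: latitude in [-90, 90], longitude in [-180, 180]
    if not -90 <= record.get("lat", 0) <= 90:
        errors.append("latitude out of range")
    if not -180 <= record.get("lon", 0) <= 180:
        errors.append("longitude out of range")

    # 4. Format check: dates must be stored as "YYYY-MM-DD" strings
    try:
        ship = date.fromisoformat(record["ship_date"])
        delivery = date.fromisoformat(record["delivery_date"])
    except (KeyError, ValueError):
        errors.append("dates must use the YYYY-MM-DD format")
    else:
        # 5. Consistency check: delivery cannot precede shipping
        if delivery < ship:
            errors.append("delivery date is before ship date")

    # 6. Uniqueness check: the order ID must not already exist
    if record.get("order_id") in seen_ids:
        errors.append("duplicate order_id")

    return errors
```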
Analyzing information requires structured and accessible data for best results. Data
transformation enables organizations to alter the structure and format of raw data as needed.
Learn how your enterprise can transform its data to perform analytics efficiently.
What is data transformation?
Data transformation is the process of changing the format, structure, or values of data. For
data analytics projects, data may be transformed at two stages of the data pipeline.
Organizations that use on-premises data warehouses generally use an ETL (extract,
transform, load) process, in which data transformation is the middle step. Today, most
organizations use cloud-based data warehouses, which can scale compute and storage
resources with latency measured in seconds or minutes. The scalability of the cloud platform
lets organizations skip preload transformations and load raw data into the data warehouse,
then transform it at query time — a model called ELT ( extract, load, transform).
Processes such as data integration, data migration, data warehousing, and data
wrangling all may involve data transformation.
Data transformation may be constructive (adding, copying, and replicating data), destructive
(deleting fields and records), aesthetic (standardizing salutations or street names), or
structural (renaming, moving, and combining columns in a database).
An enterprise can choose among a variety of ETL tools that automate the process of data
transformation. Data analysts, data engineers, and data scientists also transform data
using scripting languages such as Python or domain-specific languages like SQL.
Benefits and challenges of data transformation
Transforming data yields several benefits:
Data is transformed to make it better-organized. Transformed data may be easier for both
humans and computers to use.
Properly formatted and validated data improves data quality and protects applications from
potential landmines such as null values, unexpected duplicates, incorrect indexing, and
incompatible formats.
Data transformation facilitates compatibility between applications, systems, and types of
data. Data used for multiple purposes may need to be transformed in different ways.
However, there are challenges to transforming data effectively:
Data transformation can be expensive. The cost is dependent on the specific infrastructure,
software, and tools used to process data. Expenses may include those related to licensing,
computing resources, and hiring necessary personnel.
Data transformation processes can be resource-intensive. Performing transformations in an
on-premises data warehouse after loading, or transforming data before feeding it into
applications, can create a computational burden that slows down other operations. If you
use a cloud-based data warehouse, you can do the transformations after loading because
the platform can scale up to meet demand.
Lack of expertise and carelessness can introduce problems during transformation. Data
analysts without appropriate subject matter expertise are less likely to notice typos or
incorrect data because they are less familiar with the range of accurate and permissible
values. For example, someone working on medical data who is unfamiliar with relevant
terms might fail to flag disease names that should be mapped to a singular value or notice
misspellings.
Enterprises can perform transformations that don't suit their needs. A business might
change information to a specific format for one application only to then revert the
information back to its prior format for a different application.
How to transform data
Data transformation can increase the efficiency of analytic and business processes and enable
better data-driven decision-making. The first phase of data transformations should include
things like data type conversion and flattening of hierarchical data. These operations shape
data to increase compatibility with analytics systems. Data analysts and data scientists can
implement further transformations additively as necessary as individual layers of
processing. Each layer of processing should be designed to perform a specific set of tasks
that meet a known business or technical requirement.
Data transformation serves many functions within the data analytics stack.
Extraction and parsing
In the modern ELT process, data ingestion begins with extracting information from a data
source, followed by copying the data to its destination. Initial transformations are focused on
shaping the format and structure of data to ensure its compatibility with both the destination
system and the data already there. Parsing fields out of comma-delimited log data for loading
to a relational database is an example of this type of data transformation.
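As a small, hypothetical sketch of this step, the snippet below parses comma-delimited log lines (assumed here to follow a timestamp,level,message layout) into structured rows that could be loaded into a relational table:

```python
# Illustrative only: parse comma-delimited log lines into structured rows.
import csv
import io

raw_log = (
    "2023-04-01T12:00:00,INFO,job started\n"
    "2023-04-01T12:05:00,ERROR,disk full\n"
)

rows = []
for timestamp, level, message in csv.reader(io.StringIO(raw_log)):
    rows.append({"timestamp": timestamp, "level": level, "message": message})

print(rows[0])   # ready to be inserted into a relational table
```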
Translation and mapping
Some of the most basic data transformations involve the mapping and translation of data. For
example, a column containing integers representing error codes can be mapped to the relevant
error descriptions, making that column easier to understand and more useful for display in a
customer-facing application.
Translation converts data from formats used in one system to formats appropriate for a
different system. Even after parsing, web data might arrive in the form of hierarchical JSON
or XML files, but need to be translated into row and column data for inclusion in a relational
database.
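The sketch below illustrates this kind of translation on a hypothetical JSON order document, flattening it into one row per line item for a relational table:

```python
# Illustrative only: flatten hierarchical JSON into row-and-column records.
import json

doc = json.loads("""
{
  "order_id": 42,
  "customer": {"id": 7, "name": "Acme"},
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B9", "qty": 1}
  ]
}
""")

# One output row per line item, with parent fields repeated on each row.
rows = [
    {
        "order_id": doc["order_id"],
        "customer_id": doc["customer"]["id"],
        "sku": item["sku"],
        "qty": item["qty"],
    }
    for item in doc["items"]
]
print(rows)
```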
Filtering, aggregation, and summarization
Data transformation is often concerned with whittling data down and making it more
manageable. Data may be consolidated by filtering out unnecessary fields, columns, and
records. Omitted data might include numerical indexes in data intended for graphs and
dashboards or records from business regions that aren’t of interest in a particular study.
Data might also be aggregated or summarized by, for instance, transforming a time series of
customer transactions to hourly or daily sales counts.
BI tools can do this filtering and aggregation, but it can be more efficient to do the
transformations before a reporting tool accesses the data.
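As one possible sketch of such a summarization (using pandas, with illustrative column names), the snippet below rolls a time series of transactions up into daily transaction counts and revenue:

```python
# Illustrative only: summarize a transaction time series into daily figures.
import pandas as pd

transactions = pd.DataFrame({
    "ts": pd.to_datetime(["2023-04-01 09:15", "2023-04-01 17:40", "2023-04-02 11:05"]),
    "amount": [19.99, 5.00, 42.50],
})

daily = (
    transactions
    .set_index("ts")
    .resample("D")["amount"]                 # one bucket per calendar day
    .agg(["count", "sum"])
    .rename(columns={"count": "sales_count", "sum": "revenue"})
)
print(daily)
```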
Enrichment and imputation
Data from different sources can be merged to create denormalized, enriched information. A
customer’s transactions can be rolled up into a grand total and added into a customer
information table for quicker reference or for use by customer analytics systems. Long or
freeform fields may be split into multiple columns, and missing values can be imputed or
corrupted data replaced as a result of these kinds of transformations.
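The sketch below illustrates both ideas with pandas, using hypothetical customer and transaction tables: per-customer totals are rolled up and merged into the customer table, and missing values are imputed with simple defaults:

```python
# Illustrative only: enrich a customer table and impute missing values.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "segment": ["retail", None, "wholesale"]})
transactions = pd.DataFrame({"customer_id": [1, 1, 3],
                             "amount": [10.0, 15.0, 99.0]})

# Roll transactions up to one grand total per customer.
totals = (transactions.groupby("customer_id", as_index=False)["amount"].sum()
          .rename(columns={"amount": "lifetime_value"}))

enriched = customers.merge(totals, on="customer_id", how="left")
enriched["lifetime_value"] = enriched["lifetime_value"].fillna(0.0)  # no purchases yet
enriched["segment"] = enriched["segment"].fillna("unknown")          # simple imputation
print(enriched)
```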
Indexing and ordering
Data can be transformed so that it's ordered logically or to suit a data storage scheme. In
relational database management systems, for example, creating indexes can improve
performance or improve the management of relationships between different tables.
Anonymization and encryption
Data containing personally identifiable information, or other information that could
compromise privacy or security, should be anonymized before propagation. Encryption of
private data is a requirement in many industries, and systems can perform encryption at
multiple levels, from individual database cells to entire records or fields.
Modeling, typecasting, formatting, and renaming
Finally, a whole set of transformations can reshape data without changing content. This
includes casting and converting data types for compatibility, adjusting dates and times with
offsets and format localization, and renaming schemas, tables, and columns for clarity.
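A brief pandas sketch of this kind of content-preserving reshaping, with illustrative column names:

```python
# Illustrative only: typecast, normalize timestamps, and rename columns
# without changing the underlying content.
import pandas as pd

df = pd.DataFrame({"uid": ["1", "2"],
                   "signup": ["2023-04-01 12:00:00", "2023-04-02 08:30:00"]})

df["uid"] = df["uid"].astype(int)                      # cast string -> int
df["signup"] = pd.to_datetime(df["signup"], utc=True)  # parse and normalize to UTC
df = df.rename(columns={"uid": "user_id", "signup": "signup_at"})
print(df.dtypes)
```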
Refining the data transformation process
Before your enterprise can run analytics, and even before you transform the data, you must
replicate it to a data warehouse architected for analytics. Most organizations today choose a
cloud data warehouse, allowing them to take full advantage of ELT.
What is Feature Extraction?
Feature extraction is a process of dimensionality reduction by which an initial set of raw data
is reduced to more manageable groups for processing. A characteristic of these large data
sets is a large number of variables that require a lot of computing resources to process.
Feature extraction is the name for methods that select and/or combine variables into
features, effectively reducing the amount of data that must be processed, while still
accurately and completely describing the original data set.
Why is this Useful?
The process of feature extraction is useful when you need to reduce the number of
resources needed for processing without losing important or relevant information.
Feature extraction can also reduce the amount of redundant data for a given
analysis. Also, reducing the data and letting the machine build variable
combinations (features) speeds up the learning and generalization steps in
the machine learning process.
Practical Uses of Feature Extraction
Autoencoders
– The purpose of autoencoders is unsupervised learning of efficient data
codings. By learning to reconstruct the original data from a compressed
representation, the network identifies key features that serve as a new encoding of the data set.
Bag-of-Words
– A technique for natural language processing that extracts the words
(features) used in a sentence, document, website, etc. and classifies them by
frequency of use (a minimal sketch appears after this list). This technique can also be applied to image processing.
Image Processing – Algorithms are used to detect features such as shapes, edges,
or motion in a digital image or video.
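Here is the minimal bag-of-words sketch referenced above, written in plain Python. Real projects usually rely on a library vectorizer, but the idea is the same: each document becomes a word-frequency vector over a shared vocabulary.

```python
# Illustrative only: a tiny bag-of-words representation.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

vocab = sorted({word for doc in docs for word in doc.split()})
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)     # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)   # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```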
In machine learning, pattern recognition, and image processing, feature extraction starts from
an initial set of measured data and builds derived values (features) intended to be informative
and non-redundant, facilitating the subsequent learning and generalization steps, and in some
cases leading to better human interpretations. Feature extraction is related to dimensionality
reduction.[1]
When the input data to an algorithm is too large to be processed and it is suspected to be
redundant (e.g. the same measurement in both feet and meters, or the repetitiveness of images
presented as pixels), then it can be transformed into a reduced set of features (also named
a feature vector). Determining a subset of the initial features is called feature selection.[2] The
selected features are expected to contain the relevant information from the input data, so that the
desired task can be performed by using this reduced representation instead of the complete
initial data.
General
Feature extraction involves reducing the number of resources required to describe a large set of
data. When performing analysis of complex data one of the major problems stems from the
number of variables involved. Analysis with a large number of variables generally requires a
large amount of memory and computation power, and it may also cause a classification algorithm
to overfit to training samples and generalize poorly to new samples. Feature extraction is a
general term for methods of constructing combinations of the variables to get around these
problems while still describing the data with sufficient accuracy. Many machine
learning practitioners believe that properly optimized feature extraction is the key to effective
model construction.[3]
Results can be improved using constructed sets of application-dependent features, typically built
by an expert. One such process is called feature engineering. Alternatively, general
dimensionality reduction techniques are used such as:
Independent component analysis
Isomap
Kernel PCA
Latent semantic analysis
Partial least squares
Principal component analysis
Multifactor dimensionality reduction
Nonlinear dimensionality reduction
Semidefinite embedding
Autoencoder
Data reduction
Data reduction is the process of reducing the amount of capacity required
to store data. Data reduction can increase storage efficiency and reduce
costs. Storage vendors will often describe storage capacity in terms
of raw capacity and effective capacity, which refers to data after the
reduction.
Data reduction can be achieved in several ways. The main types are data
deduplication, compression and single-instance storage. Data
deduplication, also known as data dedupe, eliminates redundant segments
of data on storage systems. It only stores redundant segments once and
uses that one copy whenever a request is made to access that piece of
data. Data dedupe is more granular than single-instance storage. Single-
instance storage finds files such as email attachments sent to multiple
people and only stores one copy of that file. As with dedupe, single-
instance storage replaces duplicates with pointers to the one saved copy.
Some storage arrays track which blocks are the most heavily shared.
Those blocks that are shared by the largest number of files may be moved
to a memory- or flash storage-based cache so they can be read as
efficiently as possible.
Data compression also works at the file level. It is performed natively in
storage systems and refers to a data reduction method by which files are
shrunk at the bit level. Compression uses algorithms or formulas to reduce the
number of bits needed to represent the data, usually by representing a
repeating string of bits with a smaller string of bits and using a dictionary to
convert between them.
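As a quick illustration of this dictionary-style idea, the snippet below uses Python's zlib module to compress a highly repetitive byte string; the exact compressed sizes will vary, but repetitive data shrinks dramatically while random data barely shrinks at all.

```python
# Illustrative only: repeated byte sequences compress far better than random bytes.
import os
import zlib

repetitive = b"ABCABCABC" * 1000
random_bytes = os.urandom(9000)

print(len(repetitive), "->", len(zlib.compress(repetitive)))       # shrinks dramatically
print(len(random_bytes), "->", len(zlib.compress(random_bytes)))   # barely shrinks
```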
Common techniques of data reduction
There are also ways to reduce the amount of data that has to be stored without
actually shrinking the sizes of blocks and files. These techniques include thin
provisioning and data archiving.
Thin provisioning is achieved by dynamically allocating storage space in a flexible
manner. This method keeps reserved space just a little ahead of actual written
space, enabling more unreserved space to be used by other applications.
Traditional thick provisioning allocates fixed amounts of storage space as soon as a
disk is created, regardless of whether that entire capacity will be filled.
Differences between thin and thick provisioning
Archiving data also reduces data on storage systems, but the approach is quite
different. Rather than reducing data within files or databases, archiving removes
older, infrequently accessed data from expensive storage and moves it to low-cost,
high-capacity storage. Archive storage can be on disk, tape or cloud.
Data reduction for primary storage
Although data deduplication was first developed for backup data on secondary
storage, it is possible to deduplicate primary storage. Primary storage deduplication
can occur as a function of the storage hardware or operating system
(OS). Windows Server 2012 and Windows Server 2012 R2, for instance, have
built-in data deduplication capabilities. The deduplication engine uses post-
processing deduplication, which means deduplication does not occur in real time.
Instead, a scheduled process periodically deduplicates primary storage data.
Primary storage deduplication is a common feature of many all-flash storage
systems. Because flash storage is expensive, deduplication is used to make the
most of flash storage capacity. Also, because flash storage offers such high
performance, the overhead of performing deduplication has less of an impact than
it would on a disk system.
DATA SAMPLING
In data analysis, sampling is the practice of analyzing a subset of all data
in order to uncover the meaningful information in the larger data set.
What is Sampling?
It is the practice of selecting a subset of individuals from a population in
order to draw conclusions about the whole population.
Let’s say we want to know the percentage of people who use iPhones in a
city, for example. One way to do this is to call up everyone in the city and
ask them what type of phone they use. The other way would be to get a
smaller subgroup of individuals and ask them the same question, and
then use this information as an approximation of the total population.
However, this process is not as simple as it sounds. Whenever you follow
this method, your sample size has to be ideal - it should not be too large
or too small. Once you have decided on the size of your sample, you
must use the right sampling technique to collect the sample from
the population. Ultimately, every sampling technique falls under two broad
categories:
Probability sampling - Random selection techniques are used to select
the sample.
Non-probability sampling - Non-random selection techniques based on
certain criteria are used to select the sample.
Types of Sampling Techniques in Data Analytics
Now, let’s discuss the types of sampling in data analytics. First, let us start
with the Probability Sampling techniques.
Probability Sampling Techniques
Probability sampling techniques are one of the important categories of
sampling techniques. Probability sampling gives every member of the
population a chance of being selected. It is mainly used in quantitative
research when you want to produce results representative of the whole
population.
1. Simple Random Sampling
In simple random sampling, the researcher selects the participants
randomly. Tools such as random number
generators and random number tables are used so that selection is based
entirely on chance.
Example: The researcher assigns every member in a company database a
number from 1 to 1000 (depending on the size of the company) and then
uses a random number generator to select 100 members.
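A minimal Python sketch of this example (the population and sample sizes are illustrative):

```python
# Illustrative only: simple random sampling without replacement.
import random

population = list(range(1, 1001))          # members numbered 1 to 1000
sample = random.sample(population, k=100)  # 100 members chosen purely by chance
```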
2. Systematic Sampling
In systematic sampling, every member of the population is given a number,
as in simple random sampling. However, instead of randomly generating
numbers, the samples are chosen at regular intervals.
Example: The researcher assigns every member in the company database
a number. Instead of randomly generating numbers, a random starting
point (say 5) is selected. From that number onwards, the researcher
selects every, say, 10th person on the list (5, 15, 25, and so on) until the
sample is obtained.
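A minimal Python sketch of systematic sampling, assuming an interval of 10 as in the example:

```python
# Illustrative only: a random start, then every 10th member of the list.
import random

population = list(range(1, 1001))
step = 10
start = random.randrange(step)    # random starting point within the first interval
sample = population[start::step]  # every 10th member from the random starting point
```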
3. Stratified Sampling
In stratified sampling, the population is subdivided into subgroups, called
strata, based on some characteristics (age, gender, income, etc.). After
forming a subgroup, you can then use random or systematic sampling to
select a sample for each subgroup. This method allows you to draw more
precise conclusions because it ensures that every subgroup is properly
represented.
Example: If a company has 500 male employees and 100 female
employees, the researcher wants to ensure that the sample reflects the
gender as well. So the population is divided into two subgroups based on
gender.
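A minimal pandas sketch of stratified sampling, drawing the same fraction from each gender stratum (the column names and counts are illustrative):

```python
# Illustrative only: sample 10% of each stratum so both groups are represented.
import pandas as pd

employees = pd.DataFrame({
    "id": range(600),
    "gender": ["male"] * 500 + ["female"] * 100,
})

sample = employees.groupby("gender").sample(frac=0.1, random_state=0)
print(sample["gender"].value_counts())   # 50 male, 10 female
```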
4. Cluster Sampling
In cluster sampling, the population is divided into subgroups, but each
subgroup has characteristics similar to the whole population. Instead of
selecting a sample from each subgroup, you randomly select an entire
subgroup. This method is helpful when dealing with large and diverse
populations.
Example: A company has over a hundred offices in ten cities across the
world, each with roughly the same number of employees in similar job
roles. The researcher randomly selects 2 to 3 offices and uses them as the
sample.
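A minimal Python sketch of cluster sampling: whole offices (clusters) are selected at random and every employee in them joins the sample (the data is illustrative):

```python
# Illustrative only: randomly select entire clusters, then take all their members.
import random

offices = {f"office_{i}": [f"emp_{i}_{j}" for j in range(50)] for i in range(100)}

chosen = random.sample(list(offices), k=3)                      # pick 3 whole offices
sample = [emp for office in chosen for emp in offices[office]]
```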
Next come the non-probability sampling techniques.
Non-Probability Sampling Techniques
Non-probability sampling techniques are the other important category of
sampling techniques. In non-probability sampling, not every individual has
a chance of being included in the sample. This sampling method is easier
and cheaper but also has high risks of sampling bias. It is often used in
exploratory and qualitative research with the aim to develop an initial
understanding of the population.
1. Convenience Sampling
In this sampling method, the researcher simply selects the individuals
which are most easily accessible to them. This is an easy way to gather
data, but there is no way to tell if the sample is representative of the
entire population. The only criterion involved is that people are available
and willing to participate.
Example: The researcher stands outside a company and asks the
employees coming in to answer questions or complete a survey.
2. Voluntary Response Sampling
Voluntary response sampling is similar to convenience sampling, in the
sense that the only criterion is people are willing to participate. However,
instead of the researcher choosing the participants, the participants
volunteer themselves.
Example: The researcher sends out a survey to every employee in a
company and gives them the option to take part in it.
3. Purposive Sampling
In purposive sampling, the researcher uses their expertise and judgment
to select a sample that they think is the best fit. It is often used when the
population is very small and the researcher only wants to gain knowledge
about a specific phenomenon rather than make statistical inferences.
Example: The researcher wants to know about the experiences of disabled
employees at a company. So the sample is purposefully selected from this
population.
4. Snowball Sampling
In snowball sampling, the research participants recruit other participants
for the study. It is used when participants required for the research are
hard to find. It is called snowball sampling because like a snowball, it picks
up more participants along the way and gets larger and larger.
Example: The researcher wants to know about the experiences of
homeless people in a city. Since there is no detailed list of homeless
people, a probability sample is not possible. The only way to get the
sample is to get in touch with one homeless person who will then put you
in touch with other homeless people in a particular area.
Which Sampling Technique to Use?
In this article on types of sampling techniques in Data Analytics, we
covered everything about probability and non-probability sampling
techniques. For any type of research, it is necessary that you choose the
right sampling techniques before diving into the study. The effectiveness
of your research is hugely dependent on the sample that you choose.
These are just the top types of sampling techniques and there are still lots
more that you can choose from to refine your research. In order
to become a data analyst, you have to be exactly sure of what sampling
techniques you should use and when.
Types and Sources of Data
Data types and sources can be represented in a variety of ways. The two
primary data types are:
1. Quantitative data, represented as numerical figures - interval and ratio level
measurements.
2. Qualitative data, such as text, images, audio/video, etc.
Although scientific disciplines differ in their preference for one type over
another, some investigators utilize information from both quantitative and
qualitative sources with the expectation of developing a richer understanding of a
targeted phenomenon.
Researchers collect information from human beings that can be
qualitative (ex. observing child-rearing practices) or quantitative
(recording biochemical markers, anthropometric measurements). Data
sources can include field notes, journals, laboratory notes/specimens, or
direct observations of humans, animals, plants. Interactions between data
type and source are not infrequent.
Determining appropriate data is discipline-specific and is primarily driven
by the nature of the investigation, existing literature, and accessibility to
data sources. Questions to consider when selecting data types and
sources are given below:
o What is the research question?
o What is the scope of the investigation? (This defines the parameters of
any study. Selected data should not extend beyond the scope of the
study).
o What has the literature (previous research) determined to be the most
appropriate data to collect?
o What type of data should be considered: quantitative, qualitative, or a
composite of both?
What is Feature Selection in Data Mining?
Feature selection has been an active research area in pattern recognition,
statistics, and data mining communities. The main idea of feature
selection is to choose a subset of input variables by eliminating features
with little or no predictive information. Feature selection can significantly
improve the comprehensibility of the resulting classifier models and often
build a model that generalizes better to unseen points. Further, it is often
the case that finding the correct subset of predictive features is an
important problem in its own right.
For example, a physician may decide based on the selected features
whether a dangerous surgery is necessary for treatment or not. Feature
selection in supervised learning has been well studied, where the main
goal is to find a feature subset that produces higher classification
accuracy.
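As one common illustration of supervised feature selection (not a specific method from the studies mentioned here), the sketch below uses scikit-learn's SelectKBest to keep the features with the strongest univariate association with the class label:

```python
# Illustrative only: univariate filter-style feature selection with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)        # 30 input features
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)          # keep the 5 highest-scoring features
print(X.shape, "->", X_reduced.shape)             # (569, 30) -> (569, 5)
```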
Recently, several researchers have studied feature selection and
clustering together with a single or unified criterion. For feature selection
in unsupervised learning, learning algorithms are designed to find a
natural grouping of the examples in the feature space. Thus feature
selection in unsupervised learning aims to find a good subset of features
that forms high-quality clusters for a given number of clusters.
However, the traditional approaches to feature selection with a single
evaluation criterion have shown limited capability in terms of knowledge
discovery and decision support. This is because decision-makers should
take into account multiple, conflicting objectives simultaneously. In
particular, no single criterion for unsupervised feature selection is best for
every application, and only the decision-maker can determine the relative
weights of criteria for her application.
Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm that is
used for the dimensionality reduction in machine learning. It is a statistical
process that converts the observations of correlated features into a set of
linearly uncorrelated features with the help of orthogonal transformation.
These new transformed features are called the Principal Components. It
is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique for drawing out strong patterns from a
given dataset by reducing the number of dimensions while retaining as much variance as possible.
PCA generally tries to find a lower-dimensional surface onto which to project
the high-dimensional data.
PCA works by considering the variance of each attribute: directions with high
variance carry the most information, so projecting onto them reduces
the dimensionality with little loss. Some real-world applications of PCA are image
processing, movie recommendation systems, and optimizing the power
allocation in various communication channels. It is a feature
extraction technique, so it retains the most informative combinations of variables and
drops the least important ones.
The PCA algorithm is based on some mathematical concepts such as:
o Variance and Covariance
o Eigenvalues and eigenvectors
Some common terms used in PCA algorithm:
o Dimensionality: It is the number of features or variables present in the
given dataset. More easily, it is the number of columns present in the
dataset.
o Correlation: It signifies how strongly two variables are related to
each other; if one changes, the other variable also tends to change.
The correlation value ranges from -1 to +1. Here, -1 occurs if variables are
inversely proportional to each other, and +1 indicates that variables are
directly proportional to each other.
o Orthogonal: It means that variables are not correlated with each other,
and hence the correlation between the pair of variables is zero.
o Eigenvectors: Given a square matrix M and a non-zero vector v, v is an
eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair
of variables is called the Covariance Matrix.
Principal Components in PCA
As described above, the transformed new features or the output of PCA
are the Principal Components. The number of these PCs is either equal
to or less than the number of original features present in the dataset. Some
properties of these principal components are given below:
o The principal component must be the linear combination of the original
features.
o These components are orthogonal, i.e., the correlation between a pair of
variables is zero.
o The importance of each component decreases when going from 1 to n,
meaning the 1st PC has the most importance and the nth PC has the least
importance.
Steps for PCA algorithm
1. Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X
and Y, where X is the training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset in a structured form, such as a
two-dimensional matrix of the independent variables X. Here
each row corresponds to a data item, and each column corresponds to
a feature. The number of columns is the dimensionality of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. Within a particular
column, features with high variance would otherwise dominate
features with lower variance.
If the importance of features should be independent of their variance,
then we divide each data item in a column by the standard deviation
of that column. We will call the resulting matrix Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will
transpose it. After transpose, we will multiply it by Z. The output matrix
will be the Covariance matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors for the
resultant covariance matrix. The eigenvectors of the covariance matrix are
the directions of the axes carrying the most information (variance), and the
corresponding eigenvalues give the amount of variance along each direction.
6. Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and will sort them in
decreasing order, which means from largest to smallest, and
simultaneously sort the eigenvectors accordingly into a matrix P.
The resultant sorted matrix is named P*.
7. Calculating the new features or principal components
Here we will calculate the new features. To do this, we multiply Z by the P*
matrix. In the resultant matrix Z*, each observation is a linear
combination of the original features, and the columns of the Z* matrix are
uncorrelated with each other.
8. Removing less important features from the new dataset
With the new feature set in hand, we decide what to keep and
what to remove: we keep only the relevant or important
components in the new dataset, and the unimportant ones are removed.
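The steps above can be followed almost line for line with NumPy. The sketch below is a minimal, illustrative implementation on random data, not a production PCA routine:

```python
# Illustrative only: PCA following the steps listed above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # 100 observations, 5 features

# Steps 2-3: arrange the data as a matrix and standardize each column
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of Z
cov = np.cov(Z, rowvar=False)

# Step 5: eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 6: sort eigenvectors by decreasing eigenvalue (eigh returns ascending order)
order = np.argsort(eigenvalues)[::-1]
P_star = eigenvectors[:, order]

# Step 7: project Z onto the sorted eigenvectors to obtain the principal components
Z_star = Z @ P_star

# Step 8: keep only the most important components, e.g. the top 2
explained = eigenvalues[order] / eigenvalues.sum()
X_reduced = Z_star[:, :2]
print(explained.round(3), X_reduced.shape)
```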
Applications of Principal Component Analysis
o PCA is mainly used as the dimensionality reduction technique in various AI
applications such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns if data has high dimensions.
Some fields where PCA is used are Finance, data mining, Psychology, etc.
Discretization in data mining
Data discretization refers to a method of converting a large number of
data values into a smaller number of values so that the evaluation and
management of data become easy. In other words, data discretization is a
method of converting the values of continuous attributes into a finite set of
intervals with minimal data loss. There are two forms of data discretization:
the first is supervised discretization, and the second is unsupervised
discretization. Supervised discretization refers to a method in which the
class data is used. Unsupervised discretization refers to a method that is
characterized by the way the operation proceeds, that is, whether it uses a
top-down splitting strategy or a bottom-up merging strategy.
Now, we can understand this concept with the help of an example
Suppose we have an attribute Age with the given values:
Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77
Before discretization (Age values):  1, 5, 4, 9, 7  |  11, 14, 17, 13, 18, 19  |  31, 33, 36, 42, 44, 46
After discretization:                Child          |  Young                   |  Mature
Another example is analytics, where we gather data about website visitors.
For example, all visitors who visit the site from an IP address in India are
grouped together under the country level "India".
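A minimal pandas sketch of the Age example above; the bin edges are illustrative choices that reproduce the three intervals shown in the table:

```python
# Illustrative only: discretize the Age values from the table into three intervals.
import pandas as pd

ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46])
labels = pd.cut(ages, bins=[0, 10, 30, 50], labels=["Child", "Young", "Mature"])
print(labels.value_counts())
```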
Some Famous techniques of data discretization
Histogram analysis
A histogram is a plot used to represent the underlying frequency
distribution of a continuous data set. Histograms assist in inspecting the
data distribution, for example for outliers, skewness, or how close it is to a
normal distribution.
Binning
Binning refers to a data smoothing technique that groups a large number of
continuous values into a smaller number of bins. This technique can also be
used for data discretization and the development of concept hierarchies.
Cluster Analysis
Cluster analysis is a form of data discretization: a clustering algorithm is
applied to a numeric attribute x, partitioning its values into clusters, and
each cluster then becomes one discrete interval of x.
Data discretization using decision tree analysis
Discretization by decision tree analysis uses a top-down splitting technique
and is a supervised procedure. To discretize a numeric attribute, first select
the attribute value that gives the least entropy as a split point, and then
apply this step recursively. The recursive process divides the attribute into
various discretized disjoint intervals, from top to bottom, using the same
splitting criterion.
Data discretization using correlation analysis
Discretization by correlation analysis is a supervised, bottom-up procedure:
a measure of correlation (statistical similarity) between neighboring intervals
is used to find the best neighboring intervals, which are then merged
recursively to form larger intervals.