What is Data Science
Data Science is an interdisciplinary field that allows you to
extract knowledge from structured or unstructured data.
Data science enables you to translate a business problem
into a research project and then translate it back into a
practical solution.
SRM Institute of Science and Technology 2
Data science is the process of deriving knowledge and
insights from a huge and diverse set of data by
organizing, processing, and analyzing that data.
Statistics:
Statistics is a core component of data science. It is the
science of collecting and analyzing numerical data in large
quantities to obtain useful insights.
Visualization:
Visualization techniques help you grasp huge amounts of
data through easy-to-understand, digestible visuals.
Machine Learning:
Machine learning explores the building and study of
algorithms that learn to make predictions about
unseen or future data.
Deep Learning:
Deep learning is a newer area of machine learning research
in which algorithms are applied to handle huge amounts of
data.
Data Science Process
• Defining data science project roles
• Understanding the stages of a data science project
• Setting expectations for a new data science project
The roles in a data science project
Project sponsor: represents the business interests;
champions the project
Client: represents the end users' interests
Data scientist: sets and executes the analytic strategy;
communicates with the sponsor and client
Data architect: manages data and data storage; sometimes
manages data collection
Operations: manages infrastructure; deploys final project
results
PROJECT SPONSOR
The sponsor is the person who wants the data science
result.
The sponsor is responsible for deciding whether the
project is a success or failure.
CLIENT
The client is the role that represents the model’s end
users’ interests.
Generally, the client belongs to a different group in the
organization and has other responsibilities beyond your
project.
DATA SCIENTIST
The data scientist is responsible for taking all necessary
steps to make the project succeed.
This includes setting the project strategy and design, and
picking the data sources, tools, and techniques to be used.
The data scientist is also responsible for project planning
and tracking, testing and procedures, applying machine
learning models, and evaluating results.
DATA ARCHITECT
The data architect is responsible for all of the data and its
storage.
Data architects often manage data warehouses for many
different projects
Stages of a data science project
Defining the goal
The first task in a data science project is to define a
measurable and quantifiable goal.
Why do the sponsors want the project in the first place?
What do they lack, and what do they need?
What are they doing to solve the problem now, and why isn’t
that good enough?
What resources will you need: what kind of data and how
much staff?
Will you have domain experts to collaborate with, and what
are the computational resources?
How do the project sponsors plan to deploy your results?
What are the constraints that have to be met for successful
deployment?
Example: The ultimate business goal is to
reduce the bank's losses due to bad loans.
Data collection and management
This step encompasses identifying the data you need,
exploring it, and conditioning it to be suitable for analysis.
This stage is often the most time-consuming step in the
process.
What data is available to me?
Will it help me solve the problem?
Is there enough data?
Is the data quality good enough?
Modeling
Statistics and machine learning come into play during the
modeling, or analysis, stage.
The aim is to extract useful insights from the data in order
to achieve your goals.
Example: the loan application problem is a classification problem.
The most common data science modeling tasks are these:
Classification—Deciding if something belongs to one
category or another
Scoring—Predicting or estimating a numeric value, such as
a price or probability
Ranking—Learning to order items by preferences
Clustering—Grouping items into most-similar groups
Finding relations—Finding correlations or potential causes
of effects seen in the data
Characterization—Very general plotting and report
generation from data
Model evaluation and critique
Once you have a model, you need to determine if it meets
your goals.
Is it accurate enough for your needs? Does it generalize
well?
Does it perform better than "the obvious guess"? Better
than whatever estimate you currently use?
Do the results of the model (coefficients, clusters, rules)
make sense in the context of the problem domain?
CONFUSION MATRIX
                     Actual positive   Actual negative
Predicted positive   TP                FP
Predicted negative   FN                TN
Example:
                     Actual positive   Actual negative
Predicted positive   5                 0
Predicted negative   1                 9
Accuracy
Accuracy is defined as the ratio of the total number of correct
predictions to the total number of samples.
Accuracy = (True Positives + True Negatives) / (True Positives
+ True Negatives + False Positives + False Negatives)
Precision
Precision is the fraction of the predicted positives that are
actually positive.
Precision = True Positives / (True Positives + False
Positives)
Recall
Recall, or True Positive Rate, is defined as the ratio of true
positives to the total number of true positives and false negatives.
Recall = True Positives / (True Positives + False
Negatives)
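The three formulas can be checked by hand against the example confusion matrix. The following is a minimal R sketch, assuming the example counts read as TP = 5, FP = 0, FN = 1, TN = 9; the variable names are illustrative and do not come from the slides.

```r
# Counts assumed from the example confusion matrix
# (first row: TP, FP; second row: FN, TN)
tp <- 5; fp <- 0
fn <- 1; tn <- 9

accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision <- tp / (tp + fp)                   # share of predicted positives that are truly positive
recall    <- tp / (tp + fn)                   # share of actual positives that were found

print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

With no false positives the precision is perfect, while the single false negative lowers both recall and accuracy; which of these matters more depends on the business goal.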
Presentation and documentation
Once you have a model that meets your success criteria,
you’ll present your results to your project sponsor and other
stakeholders.
You must also document the model for those in the
organization who are responsible for using, running, and
maintaining the model once it has been deployed.
Model deployment and maintenance
Finally, the model is put into operation.
In many organizations this means the data scientist no
longer has primary responsibility for the day-to-day operation
of the model.
WORKING WITH DATA FROM FILES
The most common ready-to-go data format is a family of tabular
formats called structured values.
Working with well-structured data from files
Loading a file:
uciCar <- read.table(
  'http://www.win-vector.com/dfiles/car.data.csv',
  sep=',', header=TRUE)
This loads the data and stores it in a new R data frame object called
uciCar.
EXAMINING OUR DATA
class()— tells us the object uciCar is of class data.frame.
help()—Gives you the documentation for a class.
summary()—Gives you a summary of almost any R object.
summary(uciCar) shows us a lot about the distribution of the
UCI car data.
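The same loading and inspection steps can be tried without a network connection. This is a minimal sketch, assuming a hypothetical four-row extract in the same comma-separated layout; the object names csv_text and cars are made up for illustration.

```r
# Hypothetical four-row extract in a comma-separated layout (not the real file)
csv_text <- "buying,maint,doors
vhigh,vhigh,2
high,low,4
med,med,3
low,high,5more"

# text= reads from a string instead of a file or URL;
# stringsAsFactors=TRUE makes summary() report per-level counts
cars <- read.table(text = csv_text, sep = ",", header = TRUE,
                   stringsAsFactors = TRUE)

print(class(cars))    # the class of the loaded object
print(summary(cars))  # distribution of each column
```

From R 4.0.0 onward, read.table() keeps text columns as character vectors by default, so the per-level counts only appear when the columns are converted to factors, as done here.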
Exploring the car data
> summary(uciCar)
   buying       maint
 high :432   high :432
 low  :432   low  :432
 med  :432   med  :432
 vhigh:432   vhigh:432
The summary() command shows us the distribution of each
variable in the dataset.
Structured data
Data containing a defined data type, format, and structure
(for example, transaction data, online analytical processing
[OLAP] data cubes, traditional RDBMS, CSV files, and
even simple spreadsheets).
WORKING WITH OTHER DATA FORMATS
XLS/XLSX: http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets
JSON: http://cran.r-project.org/web/packages/rjson/index.html
XML: http://cran.r-project.org/web/packages/XML/index.html
MongoDB: http://cran.r-project.org/web/packages/rmongodb/index.html
SQL: http://cran.r-project.org/web/packages/DBI/index.html
TRANSFORMING DATA IN R
Building a map to interpret loan use codes
mapping <- list(
'A40'='car (new)',
'A41'='car (used)',
'A42'='furniture/equipment',
'A43'='radio/television',
'A44'='domestic appliances', ...)
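A map like this can be applied to a column of codes by indexing the list with the code values. The following is a minimal sketch, assuming a shortened three-entry map and a hypothetical vector of codes; the names codes and purpose are illustrative only.

```r
# Shortened version of the map, for illustration only
mapping <- list(
  'A40' = 'car (new)',
  'A41' = 'car (used)',
  'A42' = 'furniture/equipment')

# Hypothetical loan-purpose codes as they might appear in the raw data
codes <- c('A41', 'A40', 'A42', 'A40')

# Index the list by code name, flatten to character, then store as a factor
purpose <- as.factor(as.character(mapping[codes]))

print(table(purpose))
```

The codes must be a character vector here; indexing the list with a factor would use the factor's integer codes rather than its labels.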
EXAMINING OUR NEW DATA
> table(d$Purpose, d$Good.Loan)
                      BadLoan GoodLoan
  business                 34       63
  car (new)                89      145
  car (used)               17       86
  domestic appliances       4        8
  education                22       28
  furniture/equipment      58      123
  others                    5        7
  radio/television         62      218
  Repairs                   8       14
WORKING WITH RELATIONAL DATABASES
RMySQL Package:
The RMySQL package, available from CRAN, provides native
connectivity between R and a MySQL database. You can
install this package in the R environment using the following
command.
install.packages("RMySQL")
WORKING WITH RELATIONAL DATABASES: STAGING THE DATA
A staging area is an intermediate storage area used for
data processing during the extract, transform, and load
(ETL) process.
The data staging area sits between the data source(s) and
the data target(s), which are often data warehouses, data
marts, or other data repositories.
WORKING WITH RELATIONAL DATABASES: CURATING THE DATA
Data curation is the organization and integration
of data collected from various sources.
It involves annotation, publication and presentation of the
data such that the value of the data is maintained over time,
and the data remains available for reuse and preservation.
Connecting R to MySQL
Create a connection object in R to connect to the database.
dbConnect() takes the username, password, database name, and
host name as input.
library(RMySQL)
mysqlconnection = dbConnect(MySQL(), user = 'root',
  password = '', dbname = 'sakila', host = 'localhost')
# List the tables available in this database.
dbListTables(mysqlconnection)
You can query the database tables in MySQL using the
function dbSendQuery().
The query is executed in MySQL, and the result set is
retrieved with the R fetch() function, which returns it
as a data frame in R.
# Create table
actor <- "CREATE TABLE actor(actor_id INT, first_name TEXT, last_name TEXT, last_update TEXT)"
dbSendQuery(mysqlconnection, actor)
# Insert the data into the table
dbSendQuery(mysqlconnection, "insert into
  actor(actor_id,first_name,last_name,last_update)
  values(188,'Jeba','raj','13/08/2019')")
# Query the "actor" table to get all the rows.
result = dbSendQuery(mysqlconnection, "select * from actor")
# Store the result in an R data frame object. n = 5 fetches
# the first 5 rows.
data.frame = fetch(result, n = 5)
print(data.frame)
Query with Filter Clause:
result = dbSendQuery(mysqlconnection, "select * from
  actor where last_name = 'TORN'")
# Fetch all the records (with n = -1) and store them as a data frame.
data.frame = fetch(result, n = -1)
print(data.frame)
Output:
  actor_id first_name last_name         last_update
1       18        DAN      TORN 2006-02-15 04:34:33
2       94    KENNETH      TORN 2006-02-15 04:34:33
3      102     WALTER      TORN 2006-02-15 04:34:33
Updating Rows in the Tables
dbSendQuery(mysqlconnection, "update mtcars set disp =
168.5 where hp = 110")
Dropping Tables in MySql
dbSendQuery(mysqlconnection, 'drop table if exists student ')
18CSE396T: DATA SCIENCE
Unit I: Session 5: SLO-1 & SLO-2
EXPLORING THE DATA
Using summary statistics to spot problems
Missing values
Invalid values and outliers
Data range
Units
> summary(custdata)
     custid         sex
 Min.   :   2068   F:440
 1st Qu.: 345667   M:560
 Median : 693403
 Mean   : 698500
 3rd Qu.:1044606
 Max.   :1414286
Missing Values
is.employed
Mode :logical
FALSE:73
TRUE :599
NA's :328 (missing values)
housing.type
Homeowner free and clear    :157
Homeowner with mortgage/loan:412
Occupied with no rent       : 11
Rented                      :364
NA's                        : 56 (missing values)
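The NA counts that summary() reports can also be computed directly for every column. This is a minimal sketch on a hypothetical five-row stand-in for custdata; the values and the name toy are invented for illustration.

```r
# Hypothetical five-row stand-in for custdata (values are made up)
toy <- data.frame(
  is.employed  = c(TRUE, NA, FALSE, NA, TRUE),
  housing.type = c("Rented", NA, "Rented",
                   "Homeowner with mortgage/loan", "Rented"),
  stringsAsFactors = TRUE
)

# is.na() marks each missing cell; colSums() counts them per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

A per-column count like this makes it easy to see which variables need attention before deciding how to treat the gaps.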
Invalid values and Outliers
Examples of invalid values include negative values in what
should be a non-negative numeric data field (like age or
income), or text where you expect numbers.
Outliers are data points that fall well out of the range of where
you expect the data to be.
> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000
(Negative values of income represent bad data.)
> summary(custdata$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    38.0    50.0    51.7    64.0   146.7
(Customers of age zero, or of an age greater than about
110, are outliers.)
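Once summary() has revealed suspicious values, the affected rows can be flagged explicitly. The following is a minimal sketch on hypothetical values chosen to mimic the problems described above; the flag column names and the age bound of 110 are assumptions for illustration.

```r
# Hypothetical values mimicking the problems seen in the summaries
toy <- data.frame(
  income = c(-8700, 14600, 35000, 615000),
  age    = c(0, 38, 50, 146.7)
)

# Negative income is treated as invalid data
toy$income.invalid <- toy$income < 0

# Age of zero or above an assumed plausibility bound is flagged as suspect
toy$age.suspect <- toy$age <= 0 | toy$age > 110

print(toy)
```

Flagging rather than deleting keeps the decision about what to do with these rows (drop, correct, or treat as missing) separate from the detection step.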
DATA RANGE
> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000
UNITS
Does the income data represent hourly wages, or yearly wages
in units of $1000?
> summary(Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -8.7    14.6    35.0    53.5    67.0   615.0
18CSE396T: DATA SCIENCE
Unit I: Session 6: SLO-1 & SLO-2
MANAGING DATA: CLEANING DATA
Treating missing values
To drop or not to drop
Missing data in categorical variables
Missing values in numeric data
> summary(custdata$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0   25000   45000   66200   82000  615000     328
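For numeric columns, one common treatment (reasonable only when the values can be assumed to be missing at random) is to fill the gaps with the mean of the known values and keep an indicator of which entries were filled. This is a minimal sketch on a hypothetical income vector; it illustrates the idea and is not necessarily the treatment the course prescribes.

```r
# Hypothetical incomes with gaps (values are made up)
income <- c(25000, NA, 45000, 82000, NA)

# Mean of the known values only
mean_income <- mean(income, na.rm = TRUE)

# Remember which entries were missing before filling them in
income.missing <- is.na(income)
income.fix <- ifelse(income.missing, mean_income, income)

print(data.frame(income, income.missing, income.fix))
```

Keeping the income.missing indicator lets a later model learn whether missingness itself carries information.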
TO DROP OR NOT TO DROP?
summary(custdata[is.na(custdata$housing.type),
c("recent.move","num.vehicles")])
The c() function in R creates a vector from the values you
provide explicitly; here it lists the two columns to summarize
for the rows where housing.type is missing.
Missing data in categorical variables
custdata$is.employed.fix <- ifelse(is.na(custdata$is.employed),
                                   "missing",
                                   ifelse(custdata$is.employed == TRUE,
                                          "employed",
                                          "not employed"))
Here is.employed.fix is simply the name of the new column; it is
unrelated to R's interactive fix() function, which opens an editor
on an object and saves the edited version in the user's workspace.
> summary(as.factor(custdata$is.employed.fix))
    employed      missing not employed
         599          328           73
Factors are variables in R which take on a limited
number of different values; such variables are often
referred to as categorical variables. ...
SAMPLING FOR MODELING AND VALIDATION
Sampling is the process of selecting a subset of a
population to represent the whole during analysis
and modeling.
It's important that the dataset you use is an
accurate representation of your population as a
whole.
For example, your customers might come from all
over the United States.
When you collect your custdata dataset, it might be
tempting to use all the customers from one state to
train the model.
It is better to pick customers randomly from all
the states.
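Picking customers at random regardless of state can be sketched with sample(). The data frame below is hypothetical (twelve customers across three states), and the seed value is arbitrary and only there to make the sketch repeatable.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical customer list: four customers in each of three states
customers <- data.frame(
  custid = 1:12,
  state  = rep(c("CA", "NY", "TX"), each = 4)
)

# Draw 6 distinct row numbers uniformly at random, ignoring state
picked <- customers[sample(nrow(customers), 6), ]

print(table(picked$state))
```

Because rows are drawn without regard to state, every customer has the same chance of selection, which is what keeps the sample representative of the whole population.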
VALIDATION
Data validation refers to the process of ensuring the accuracy
and quality of data.
It is implemented by building several checks into a system or
report to ensure the logical consistency of the data.
TYPES OF DATA VALIDATION
Data Type Check
A data type check confirms that the data entered has the
correct data type. For example, a field might only accept
numeric data.
Code Check
A code check ensures that a field is selected from a valid list of
values or follows certain formatting rules
Range Check
A range check will verify whether input data falls within a
predefined range.
Format Check
Many data types follow a certain predefined format. A common
use case is date columns that are stored in a fixed format like
“YYYY-MM-DD” or “DD-MM-YYYY.”
Consistency Check
A consistency check is a type of logical check that confirms the
data’s been entered in a logically consistent way.
Uniqueness Check
A uniqueness check ensures that an item is not entered multiple
times into a database.
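Several of these checks can be expressed as simple vectorized tests in R. The following is a minimal sketch on hypothetical records; the column names, the age bounds, and the chosen date format (YYYY-MM-DD) are assumptions made for illustration.

```r
# Hypothetical records; column names and values are illustrative only
records <- data.frame(
  id     = c(1, 2, 2, 4),
  age    = c(34, -5, 51, 130),
  joined = c("2019-08-13", "13/08/2019", "2020-01-02", "2021-12-31"),
  stringsAsFactors = FALSE
)

# Data type check: the age column must be numeric
type_ok <- is.numeric(records$age)

# Range check: age must fall within an assumed plausible range
range_ok <- records$age >= 0 & records$age <= 110

# Format check: dates must look like YYYY-MM-DD
format_ok <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", records$joined)

# Uniqueness check: flag every row whose id occurs more than once
unique_ok <- !(duplicated(records$id) | duplicated(records$id, fromLast = TRUE))

print(data.frame(records, range_ok, format_ok, unique_ok))
```

Each check yields one logical value per row, so failing rows can be inspected or filtered without altering the original data.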
TEST AND TRAINING SPLITS
The training set is the data that you feed to the
model-building algorithm—regression, decision
trees, and so on.
The test set is the data that you feed into the
resulting model, to verify that the model’s
predictions are accurate.
CREATING A SAMPLE GROUP COLUMN
A convenient way to manage random sampling is to
add a sample group column to the data frame.
The sample group column contains a number
generated uniformly from zero to one, using the
runif function.
> custdata$gp <- runif(dim(custdata)[1])
> testSet <- subset(custdata, custdata$gp <= 0.1)
> trainingSet <- subset(custdata, custdata$gp > 0.1)
> dim(testSet)[1]
[1] 93
> dim(trainingSet)[1]
[1] 907
RECORD GROUPING
hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh, gp =
runif(length(hh)))
hhdata <- merge(hhdata, households, by="household_id")
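The point of grouping by household is that all records of one household receive the same sample group value, so a household never ends up split between training and test sets. This is a minimal sketch on hypothetical data (three households, six customers) that applies the same steps and then verifies that property; the cust_id column and the seed are assumptions.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical data: six customers in three households
hhdata <- data.frame(
  household_id = c(1, 1, 2, 3, 3, 3),
  cust_id      = 1:6
)

# One random group value per household, then join it back to every record
hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh, gp = runif(length(hh)))
hhdata <- merge(hhdata, households, by = "household_id")

# Count the distinct gp values within each household (should be exactly 1)
gp_per_household <- tapply(hhdata$gp, hhdata$household_id,
                           function(g) length(unique(g)))
print(hhdata)
```

Thresholding gp (for example gp <= 0.1 for the test set) then assigns whole households to one side of the split.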
DATA PROVENANCE
Data provenance records the information on how the data sets
were generated.
DATA STRUCTURES
Structured
Semi-structured
Quasi-structured
Unstructured
Big data can come in multiple forms, including
structured and non-structured data such as financial data,
text files, multimedia files, and genetic mappings.
Most Big Data is unstructured or semi-structured
in nature, which requires different techniques and tools to
process and analyze.
Structured data: Data containing a defined data type, format,
and structure.
Semi-structured data: Textual data files with a discernible
pattern that enables parsing.
Quasi-structured data: Textual data with erratic data formats
that can be formatted with effort, tools, and time.
Unstructured data: Data that has no inherent structure, which
may include text documents, PDFs, images, and video.
DRIVERS OF BIG DATA
Smart devices, which provide sensor-based collection of
information from smart electric grids, smart buildings, and many
other public and industry infrastructures
Nontraditional IT devices, including the use of radio-
frequency identification (RFID) readers, GPS navigation
systems, and seismic processing