Unit 1 - DS

Data science is the process of extracting insights from structured and unstructured data. It involves organizing, processing, and analyzing large amounts of data to discover patterns and relationships. Data science draws from multiple fields like statistics, machine learning, and visualization to help businesses translate problems into research projects and practical solutions.

Uploaded by

ramya ravindran

What is Data Science

Data science is an interdisciplinary field that allows you to extract knowledge from structured or unstructured data.

Data science enables you to translate a business problem into a research project and then translate it back into a practical solution.

SRM Institute of Science and Technology 2

Data science is the process of deriving knowledge and insights from a huge and diverse set of data by organizing, processing, and analyzing the data.

What is Data Science

Statistics:

Statistics is the most critical component of data science. It is the science of collecting and analyzing large quantities of numerical data to obtain useful insights.

Visualization:

Visualization techniques present huge amounts of data as easy-to-understand, digestible visuals.

Machine Learning:

Machine learning explores the building and study of algorithms that learn to make predictions about unseen or future data.

Deep Learning:

Deep learning is a newer area of machine learning research whose algorithms are applied to handle huge amounts of data.

Data Science Process

• Defining data science project roles

• Understanding the stages of a data science project

• Setting expectations for a new data science project


The roles in a data science project

Project sponsor: represents the business interests; champions the project
Client: represents the end users' interests
Data scientist: sets and executes the analytic strategy; communicates with the sponsor and client
Data architect: manages data and data storage; sometimes manages data collection
Operations: manages infrastructure; deploys the final project results

PROJECT SPONSOR

The sponsor is the person who wants the data science result.

The sponsor is responsible for deciding whether the project is a success or a failure.

CLIENT

The client is the role that represents the interests of the model's end users.

Generally, the client belongs to a different group in the organization and has other responsibilities beyond your project.

DATA SCIENTIST

The data scientist is responsible for taking all necessary steps to make the project succeed.

The data scientist sets the project strategy and design, and picks the data sources, the tools, and the techniques to be used.

They are also responsible for project planning and tracking, testing and procedures, applying machine learning models, and evaluating results.

DATA ARCHITECT

The data architect is responsible for all of the data and its storage.

Data architects often manage data warehouses for many different projects.

Stages of a data science project


Defining the goal

The first task in a data science project is to define a measurable and quantifiable goal.

Why do the sponsors want the project in the first place?
What do they lack, and what do they need?
What are they doing to solve the problem now, and why isn't that good enough?
What resources will you need: what kind of data and how much staff?
Will you have domain experts to collaborate with, and what are the computational resources?
How do the project sponsors plan to deploy your results?
What are the constraints that have to be met for successful deployment?

Example: The ultimate business goal is to reduce the bank's losses due to bad loans.

Data collection and management

This step encompasses identifying the data you need, exploring it, and conditioning it to be suitable for analysis.

This stage is often the most time-consuming step in the process.

What data is available to me?
Will it help me solve the problem?
Is there enough data?
Is the data quality good enough?

Modeling

Statistics and machine learning finally come into play during the modeling, or analysis, stage. This is where you extract useful insights from the data in order to achieve your goals.

The loan application problem is a classification problem.

The most common data science modeling tasks are these:
Classification: Deciding if something belongs to one category or another
Scoring: Predicting or estimating a numeric value, such as a price or probability
Ranking: Learning to order items by preferences
Clustering: Grouping items into most-similar groups
Finding relations: Finding correlations or potential causes of effects seen in the data
Characterization: Very general plotting and report generation from data
Model evaluation and critique

Once you have a model, you need to determine if it meets your goals.

Is it accurate enough for your needs? Does it generalize well?
Does it perform better than "the obvious guess"? Better than whatever estimate you currently use?
Do the results of the model (coefficients, clusters, rules) make sense in the context of the problem domain?

CONFUSION MATRIX

Cell layout (axes: Predicted Values and Actual Values):

  TP   FP
  FN   TN

Example counts:

   5    0
   1    9

Accuracy

Accuracy is defined as the ratio of the number of correct predictions to the total number of samples.

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

Precision

Precision is the fraction of predicted positives that are actually positive.

Precision = True Positives / (True Positives + False Positives)

Recall

Recall, or true positive rate, is defined as the ratio of true positives to the total number of actual positives (true positives plus false negatives).

Recall = True Positives / (True Positives + False Negatives)
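A short worked example may help connect these formulas to the example counts from the confusion matrix slide. This is only a sketch: it assumes the counts 5, 0, 1, 9 sit in the same cell positions as TP, FP, FN, TN in the layout shown there, which the slides do not state explicitly.

```r
# Counts from the example confusion matrix, read in TP/FP/FN/TN cell order
# (an assumed mapping, not confirmed by the slides).
TP <- 5; FP <- 0; FN <- 1; TN <- 9

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # correct predictions / all samples
precision <- TP / (TP + FP)                   # predicted positives that are right
recall    <- TP / (TP + FN)                   # actual positives that were found

print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

Under that reading, accuracy is 14/15, precision is 5/5, and recall is 5/6.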
Presentation and documentation

Once you have a model that meets your success criteria, you'll present your results to your project sponsor and other stakeholders.

You must also document the model for those in the organization who are responsible for using, running, and maintaining the model once it has been deployed.

Model deployment and maintenance

Finally, the model is put into operation.

In many organizations this means the data scientist no longer has primary responsibility for the day-to-day operation of the model.

WORKING WITH DATA FROM FILES

The most common ready-to-go data format is a family of tabular formats called structured values.

Working with well-structured data from files

Loading file:

uciCar <- read.table(
   'http://www.win-vector.com/dfiles/car.data.csv',
   sep=',', header=TRUE)

This loads the data and stores it in a new R data frame object called uciCar.

EXAMINING OUR DATA

class(): Tells you the class of an R object. Here, class(uciCar) tells us the object uciCar is of class data.frame.
help(): Gives you the documentation for a class.
summary(): Gives you a summary of almost any R object.

summary(uciCar) shows us a lot about the distribution of the UCI car data.

Exploring the car data

> summary(uciCar)
   buying       maint
 high :432   high :432
 low  :432   low  :432
 med  :432   med  :432
 vhigh:432   vhigh:432

The summary() command shows us the distribution of each variable in the dataset.
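The load-and-examine steps can be tried without network access. The sketch below is an illustration only: the inline rows are made up and merely imitate the style of the car data, and `stringsAsFactors = TRUE` is passed explicitly because recent R versions no longer convert text columns to factors by default.

```r
# A tiny inline CSV standing in for car.data.csv (made-up rows).
csv_text <- "buying,maint,doors
vhigh,vhigh,2
high,low,4
med,med,3
low,high,2"

# read.table() can parse a string directly through its text= argument.
cars <- read.table(text = csv_text, sep = ",", header = TRUE,
                   stringsAsFactors = TRUE)

print(class(cars))    # class of the loaded object
print(dim(cars))      # number of rows and columns
print(summary(cars))  # per-column distribution
```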

Structured data

Data containing a defined data type, format, and structure (for example, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).

WORKING WITH OTHER DATA FORMATS

XLS/XLSX: http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets
JSON: http://cran.r-project.org/web/packages/rjson/index.html
XML: http://cran.r-project.org/web/packages/XML/index.html
MongoDB: http://cran.r-project.org/web/packages/rmongodb/index.html
SQL: http://cran.r-project.org/web/packages/DBI/index.html

TRANSFORMING DATA IN R

Building a map to interpret loan use codes

mapping <- list(
   'A40'='car (new)',
   'A41'='car (used)',
   'A42'='furniture/equipment',
   'A43'='radio/television',
   'A44'='domestic appliances', ...)
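One way such a map might be applied is to index the list by the raw codes and flatten the result. The sketch below uses only the five entries visible on the slide and a made-up data frame `d`; the real map has more entries, which are elided in the source.

```r
# Only the entries shown on the slide; the full map is longer.
mapping <- list(
  'A40' = 'car (new)',
  'A41' = 'car (used)',
  'A42' = 'furniture/equipment',
  'A43' = 'radio/television',
  'A44' = 'domestic appliances')

# Made-up data frame holding raw loan-purpose codes.
d <- data.frame(Purpose = c('A40', 'A43', 'A41', 'A43'),
                stringsAsFactors = FALSE)

# Look up each code in the list, flatten to a character vector,
# and store the readable labels as a factor.
d$Purpose <- as.factor(unlist(mapping[d$Purpose], use.names = FALSE))

print(table(d$Purpose))
```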

TRANSFORMING DATA IN R

EXAMINING OUR NEW DATA

> table(d$Purpose,d$Good.Loan)
                      BadLoan GoodLoan
  business                 34       63
  car (new)                89      145
  car (used)               17       86
  domestic appliances       4        8
  education                22       28
  furniture/equipment      58      123
  others                    5        7
  radio/television         62      218
  repairs                   8       14
WORKING WITH RELATIONAL DATABASES

RMySQL Package:

"RMySQL" is a contributed R package, available from CRAN, that provides native connectivity between R and MySql databases. You can install this package in the R environment using the following command.

install.packages("RMySQL")

WORKING WITH RELATIONAL DATABASES - STAGING THE DATA

A staging area is an intermediate storage area used for data processing during the extract, transform, and load (ETL) process.

The data staging area sits between the data source(s) and the data target(s), which are often data warehouses, data marts, or other data repositories.

WORKING WITH RELATIONAL DATABASES - CURATING THE DATA

Data curation is the organization and integration of data collected from various sources.

It involves annotation, publication, and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation.
Connecting R to MySql

Create a connection object in R to connect to the database. It takes the username, password, database name, and host name as input.

# Load the package that provides MySQL() and the db* functions.
library(RMySQL)

mysqlconnection = dbConnect(MySQL(), user = 'root',
   password = '', dbname = 'sakila', host = 'localhost')

# List the tables available in this database.
dbListTables(mysqlconnection)
We can query the database tables in MySql using the function dbSendQuery(). The query is executed in MySql, and the result set is returned using the R fetch() function. It is stored as a data frame in R.

# Create table
actor <- "CREATE TABLE actor(actor_id INT, first_name TEXT, last_name TEXT, last_update TEXT)"
# Send the statement to MySql so the table is actually created.
dbSendQuery(mysqlconnection, actor)

# Insert the data into the table. String values, including the date, are quoted.
dbSendQuery(mysqlconnection, "insert into actor(actor_id,first_name,last_name,last_update) values(188,'Jeba','raj','13/08/2019')")

# Query the "actor" table to get all the rows.
result = dbSendQuery(mysqlconnection, "select * from actor")

# Store the result in an R data frame object. n = 5 is used to fetch the first 5 rows.
data.frame = fetch(result, n = 5)
print(data.frame)

Query with Filter Clause:

result = dbSendQuery(mysqlconnection, "select * from actor where last_name = 'TORN'")

# Fetch all the records (with n = -1) and store them as a data frame.
data.frame = fetch(result, n = -1)
print(data.frame)

Output:
  actor_id first_name last_name         last_update
1       18        DAN      TORN 2006-02-15 04:34:33
2       94    KENNETH      TORN 2006-02-15 04:34:33
3      102     WALTER      TORN 2006-02-15 04:34:33

Updating Rows in the Tables

dbSendQuery(mysqlconnection, "update mtcars set disp = 168.5 where hp = 110")

Dropping Tables in MySql

dbSendQuery(mysqlconnection, 'drop table if exists student')
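The connect, create, insert, query, update, and drop cycle can be rehearsed without a MySql server. The sketch below swaps in an in-memory SQLite database through the DBI and RSQLite packages; that substitution is an assumption made for illustration and is not part of the original slides. Both packages must be installed. It also uses `dbExecute()` and `dbGetQuery()`, the DBI helpers that run a statement and clean up the result in one call, in place of `dbSendQuery()` followed by `fetch()`.

```r
library(DBI)

# In-memory SQLite database: no server, user name, or password needed.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# dbExecute() runs statements that do not return rows.
dbExecute(con, "CREATE TABLE actor(actor_id INT, first_name TEXT, last_name TEXT, last_update TEXT)")
dbExecute(con, "INSERT INTO actor VALUES (18, 'DAN', 'TORN', '2006-02-15 04:34:33')")
dbExecute(con, "INSERT INTO actor VALUES (188, 'Jeba', 'raj', '13/08/2019')")

# dbGetQuery() sends the query and returns all rows as a data frame.
torn <- dbGetQuery(con, "SELECT * FROM actor WHERE last_name = 'TORN'")
print(torn)

# Update a row, then drop the table and confirm it is gone.
dbExecute(con, "UPDATE actor SET first_name = 'DANIEL' WHERE actor_id = 18")
dbExecute(con, "DROP TABLE IF EXISTS actor")
print(dbListTables(con))

dbDisconnect(con)
```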

EXPLORING THE DATA

Using summary statistics to spot problems

Missing Values
Invalid Values and Outliers
Data Range
Units

> summary(custdata)
     custid        sex
 Min.   :   2068   F:440
 1st Qu.: 345667   M:560
 Median : 693403
 Mean   : 698500
 3rd Qu.:1044606
 Max.   :1414286

Missing Values

 is.employed
 Mode :logical
 FALSE:73
 TRUE :599
 NA's :328   (Missing Values)

 housing.type
 Homeowner free and clear    :157
 Homeowner with mortgage/loan:412
 Occupied with no rent       : 11
 Rented                      :364
 NA's                        : 56   (Missing Values)

Invalid values and Outliers

Examples of invalid values include negative values in what should be a non-negative numeric data field (like age or income), or text where you expect numbers.

Outliers are data points that fall well out of the range of where you expect the data to be.

> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000
(Negative values of income represent bad data)

> summary(custdata$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    38.0    50.0    51.7    64.0   146.7
(Customers of age zero, or customers of an age greater than about 110, are outliers)
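The checks described for income and age can be written as simple logical filters. The sketch below runs on a small made-up data frame rather than the real custdata, and reuses the "about 110" age threshold mentioned in the text; both choices are assumptions for illustration.

```r
# Made-up customer records that include the kinds of problems described.
custdata <- data.frame(
  custid = 1:6,
  income = c(-8700, 14600, 35000, 67000, 615000, 22000),
  age    = c(0, 38, 50, 64, 146.7, 29))

# Invalid values: income should never be negative.
bad_income <- custdata$income < 0

# Outliers: an age of zero, or an age above roughly 110.
odd_age <- custdata$age == 0 | custdata$age > 110

# Rows that deserve a closer look before modeling.
print(custdata[bad_income | odd_age, ])
```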
DATA RANGE

> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000

The same values divided by 1000:
   -8.7    14.6    35.0    53.5    67.0   615.0

UNITS

Does the income data represent hourly wages, or yearly wages in units of $1000?

> summary(Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -8.7    14.6    35.0    53.5    67.0   615.0

18CSE396T - DATA SCIENCE

Unit I: Session 5: SLO-1 & SLO-2

Unit I: Session 6: SLO-1 & SLO-2

MANAGING DATA - CLEANING DATA

Treating missing values

To drop or not to drop
Missing data in categorical variables
Missing values in numeric data

> summary(custdata$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0   25000   45000   66200   82000  615000     328

TO DROP OR NOT TO DROP?

summary(custdata[is.na(custdata$housing.type),
                 c("recent.move","num.vehicles")])

The c function in R is used to create a vector with values you provide explicitly.

Missing data in categorical variables

custdata$is.employed.fix <- ifelse(is.na(custdata$is.employed),
                                   "missing",
                                   ifelse(custdata$is.employed == TRUE,
                                          "employed",
                                          "not employed"))

Here is.employed.fix is simply the name of a new column. The .fix suffix is unrelated to R's fix() function, which invokes edit on x and then assigns the new (edited) version of x in the user's workspace.

> summary(as.factor(custdata$is.employed.fix))
    employed      missing not employed
         599          328           73

Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. ...
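For missing values in numeric data, one common option is to fill the gaps with the mean of the known values. The sketch below shows that option on a made-up Income column; the slides list the topic without showing a method, so both the technique and the `Income.fix` column name are assumptions here.

```r
# Made-up income column with missing values.
custdata <- data.frame(Income = c(25000, NA, 45000, 82000, NA, 0))

# Mean of the known values only (na.rm = TRUE skips the NAs).
meanIncome <- mean(custdata$Income, na.rm = TRUE)

# Keep known values, substitute the mean where the value is missing.
custdata$Income.fix <- ifelse(is.na(custdata$Income),
                              meanIncome,
                              custdata$Income)

print(summary(custdata$Income.fix))
```

Mean filling is reasonable when values are missing at random; when they are missing for a systematic reason, it can hide that signal.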

SAMPLING FOR MODELING AND VALIDATION

Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling.

It's important that the dataset that you do use is an accurate representation of your population as a whole.

For example, your customers might come from all over the United States. When you collect your custdata dataset, it might be tempting to use all the customers from one state to train the model. It's a better idea to pick customers randomly from all the states.
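The difference between the tempting choice and the random one can be seen on synthetic data. In the sketch below, the customer table, the four state codes, and the sample size of 100 are all made up for illustration.

```r
set.seed(1)  # fixed seed so the sketch is repeatable

# Made-up population: 1000 customers spread over four states.
customers <- data.frame(
  custid = 1:1000,
  state  = sample(c("CA", "NY", "TX", "WA"), 1000, replace = TRUE))

# Biased choice: every customer from a single state.
one_state <- subset(customers, state == "CA")

# Better: a simple random sample of 100 customers from the whole population.
random_sample <- customers[sample(nrow(customers), 100), ]

print(table(one_state$state))
print(table(random_sample$state))
```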

VALIDATION

Data validation refers to the process of ensuring the accuracy and quality of data.

It is implemented by building several checks into a system or report to ensure the logical consistency of input and stored data.

TYPES OF DATA VALIDATION

Data Type Check
A data type check confirms that the data entered has the correct data type. For example, a field might only accept numeric data.

Code Check
A code check ensures that a field is selected from a valid list of values or follows certain formatting rules.

Range Check
A range check verifies whether input data falls within a predefined range.

Format Check
Many data types follow a certain predefined format. A common use case is date columns that are stored in a fixed format like "YYYY-MM-DD" or "DD-MM-YYYY."

Consistency Check
A consistency check is a type of logical check that confirms the data has been entered in a logically consistent way.

Uniqueness Check
A uniqueness check ensures that an item is not entered multiple times into a database.
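Each of the six checks can be expressed as a one-line test in R. The sketch below is a hedged illustration on made-up records: the column names, the list of valid state codes, the 0 to 110 age range, and the rule that a leave date must not precede the join date are all hypothetical.

```r
# Made-up records containing one violation of each row-level check.
records <- data.frame(
  id     = c(1, 2, 2, 4),
  age    = c(34, 151, 28, 45),
  state  = c("CA", "NY", "ZZ", "TX"),
  joined = c("2019-08-13", "13/08/2019", "2020-01-05", "2021-11-30"),
  left   = c("2020-01-01", NA, "2019-06-01", NA),
  stringsAsFactors = FALSE)

type_ok   <- is.numeric(records$age)                               # data type check
code_ok   <- records$state %in% c("CA", "NY", "TX")                # code check
range_ok  <- records$age >= 0 & records$age <= 110                 # range check
format_ok <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", records$joined) # format check
# Consistency check: a leave date, when present, must not precede the join date.
consistent <- is.na(records$left) | records$left >= records$joined
unique_ok <- !duplicated(records$id)                               # uniqueness check

print(type_ok)
print(data.frame(code_ok, range_ok, format_ok, consistent, unique_ok))
```

Comparing the dates as strings works here only because the valid values share the fixed YYYY-MM-DD format.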

TEST AND TRAINING SPLITS

The training set is the data that you feed to the model-building algorithm (regression, decision trees, and so on).

The test set is the data that you feed into the resulting model, to verify that the model's predictions are accurate.

CREATING A SAMPLE GROUP COLUMN

A convenient way to manage random sampling is to add a sample group column to the data frame.

The sample group column contains a number generated uniformly from zero to one, using the runif function.

> custdata$gp <- runif(dim(custdata)[1])
> testSet <- subset(custdata, custdata$gp <= 0.1)
> trainingSet <- subset(custdata, custdata$gp > 0.1)
> dim(testSet)[1]
[1] 93
> dim(trainingSet)[1]
[1] 907
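To see how the two sets are used once the split exists, the sketch below fits a model on the training rows and measures its error on the held-out test rows. custdata is not available here, so R's built-in mtcars data stands in for it; the 0.3 threshold, the linear model `mpg ~ wt`, and the RMSE measure are illustrative assumptions.

```r
set.seed(42)  # fixed seed so the split is repeatable

# Built-in mtcars stands in for custdata (an assumption for illustration).
d <- mtcars
d$gp <- runif(nrow(d))

testSet     <- subset(d, gp <= 0.3)
trainingSet <- subset(d, gp > 0.3)

# Fit on the training set only.
model <- lm(mpg ~ wt, data = trainingSet)

# Score the held-out test set and measure the error there.
pred <- predict(model, newdata = testSet)
rmse <- sqrt(mean((testSet$mpg - pred)^2))

print(c(train = nrow(trainingSet), test = nrow(testSet)))
print(rmse)
```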

RECORD GROUPING

# One random group number per household, so all records of a household stay together.
hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh, gp = runif(length(hh)))

# Attach each household's group number to its records.
hhdata <- merge(hhdata, households, by="household_id")
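The effect of grouping at the household level can be checked on a small example. The sketch below builds a made-up `hhdata` table, applies the same three steps, and confirms that every household ends up with exactly one group value.

```r
set.seed(7)  # fixed seed so the sketch is repeatable

# Made-up data: several customers share a household.
hhdata <- data.frame(
  custid       = 1:8,
  household_id = c(1, 1, 2, 3, 3, 3, 4, 4))

# One random number per household, not per customer.
hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh, gp = runif(length(hh)))
hhdata <- merge(hhdata, households, by = "household_id")

# Count distinct gp values inside each household; each count should be 1,
# so a threshold on gp cannot split a household across test and training.
groups_per_household <- tapply(hhdata$gp, hhdata$household_id,
                               function(g) length(unique(g)))

print(hhdata)
print(groups_per_household)
```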

DATA PROVENANCE

Data provenance records the information on how the data sets were generated.

DATA STRUCTURES

Structured
Semi-Structured
Quasi-Structured
Unstructured

Big data can come in multiple forms, including structured and non-structured data such as financial data, text files, multimedia files, and genetic mappings.

Most big data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze.

Structured data: Data containing a defined data type, format, and structure.
Semi-structured data: Textual data files with a discernible pattern that enables parsing.
Quasi-structured data: Textual data with erratic data formats that can be formatted with effort, tools, and time.
Unstructured data: Data that has no inherent structure, which may include text documents, PDFs, images, and video.

DRIVERS OF BIG DATA

Smart devices, which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures.

Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS navigation systems, and seismic processing.
