APPROACHING ANALYTICS PROBLEMS
The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects.
The lifecycle has six phases, and project work can occur in several phases at once.
For most phases in the lifecycle, movement can be either forward or backward.
Business Understanding:
Before solving any problem in the business domain, the problem itself must be properly understood.
A sound business understanding forms a concrete base that makes later questions easier to resolve.
Analytic Understanding:
The approaches can be of four types:
Descriptive approach (reports the current status and available information)
Diagnostic approach (also called statistical analysis; explains what is happening and why it is happening)
Predictive approach (forecasts trends or the probability of future events)
Prescriptive approach (recommends how the problem should actually be solved)
Data Requirements:
The chosen analytical approach determines the data content, formats, and sources that need to be gathered.
While defining data requirements, one should find answers to questions such as "what", "where", "when", "why", "how", and "who".
Data Collection:
Collected data can arrive in any arbitrary format. According to the approach chosen and the output to be obtained, the collected data should therefore be validated.
Data Understanding:
Data understanding answers the question "Is the data collected representative of the problem to be solved?".
Descriptive statistics are calculated on the data to assess its content and quality.
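For example, a quick check of content and quality can be done with descriptive statistics in Python using pandas; the file name and columns below are hypothetical placeholders, not part of the original material:

```python
import pandas as pd

# Hypothetical dataset; file and column names are placeholders for illustration only
df = pd.read_csv("customer_transactions.csv")

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
print(df.describe())

# Missing values per column and basic type information help judge data quality
print(df.isna().sum())
print(df.dtypes)
```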
Data Preparation:
This process includes transformation, normalization, and similar steps that make the data ready for analysis.
Modelling:
Modelling determines whether the prepared data is appropriate for processing or requires further refinement.
This phase focuses on building predictive or descriptive models.
Evaluation:
Model evaluation is performed during model development. It assesses the quality of the model and checks whether it meets the business requirements.
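As a minimal illustration, a classification model's held-out predictions can be checked against standard metrics; the labels and predictions below are invented toy values, not real project results:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy held-out labels and model predictions (hypothetical values for illustration)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(confusion_matrix(y_true, y_pred))              # rows: actual class, cols: predicted class
```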
Deployment:
The deployment phase checks how well the model holds up in the external (production) environment and whether it performs well compared to alternatives.
Feedback:
Feedback serves the necessary purpose of refining the model and assessing its performance and impact.
KEY ROLES FOR A SUCCESSFUL
ANALYTICS PROJECT
Business User
Project Sponsor
Project Manager
Business Intelligence Analyst
Database Administrator (DBA)
Data Engineer
Data Scientist
Business User:
Someone who understands the domain area and usually
benefits from the results.
This person can consult and advise the project team on
the context of the project, the value of the results, and how
the outputs will be operationalized.
Project Manager:
Ensures that key milestones and objectives are met on time and at
the expected quality.
Business Intelligence Analyst:
Provides business domain expertise based on a deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective.
Business Intelligence Analysts generally create dashboards and
reports and have knowledge of the data feeds and sources.
Database Administrator (DBA):
Responsibilities include providing access to key databases or tables and ensuring the appropriate security levels are in place for the data repositories.
Data Engineer:
Leverages deep technical skills to assist with tuning SQL
queries for data management and data extraction, and
provides support for data ingestion into the analytic
sandbox.
While the DBA sets up and configures the databases to be used, the data engineer executes the actual data extractions and performs substantial data manipulation to facilitate the analytics.
Data Scientist:
Provides subject matter expertise for analytical techniques and data modeling, and applies valid analytical techniques to given business problems.
Ensures overall analytics objectives are met.
PHASE 1: DISCOVERY
The data science team must learn and investigate the
problem, develop context and understanding, and
learn about the data sources needed and available for
the project.
Learning the Business Domain
Resources
Framing the Problem
Identifying Key Stakeholders
Interviewing the Analytics Sponsor
Developing Initial Hypotheses
Identifying Potential Data Sources
Learning the Business Domain:
Understanding the domain area of the problem is
essential.
Data scientists have deep knowledge of the
methods, techniques, and ways for applying
heuristics to a variety of business and conceptual
problems
Resources:
As part of the discovery phase, the team needs to
assess the resources available to support the
project.
In this context, resources include technology, tools,
systems, data, and people.
Does the requisite level of expertise exist within the
organization today, or will it need to be cultivated?
The team will need to determine whether it must collect
additional data, purchase it from outside sources, or
transform existing data.
Ensure the project team has the right mix of domain
experts, customers, analytic talent, and project
management to be effective.
Framing the Problem:
Framing is the process of stating the analytics problem to be solved.
It is crucial to state the analytics problem, as well as why and to whom it is important.
It is also important to identify the main objectives of the project, identify what needs to be achieved in business terms, and identify what needs to be done to meet those needs.
It is best practice to share the statement of goals
and success criteria with the team and confirm
alignment with the project sponsor's expectations.
Establishing criteria for both success and failure helps the participants avoid unproductive effort and remain aligned with the project sponsors.
Identifying Key Stakeholders:
An important step is to identify the key stakeholders and their interests in the project.
During these discussions, the team can identify the
success criteria, key risks, and stakeholders
When interviewing stakeholders, learn about the domain
area and any relevant history from similar analytics projects.
Depending on the number of stakeholders and
participants, the team may consider outlining the type of
activity and participation expected from each stakeholder
and participant.
This will set clear expectations with the participants and
avoid delays later
Interviewing the Analytics Sponsor:
The team should plan to collaborate with the stakeholders
to clarify and frame the analytics problem.
Sponsors may have a predetermined solution that may not necessarily realize the desired outcome.
In these cases, the team must use its knowledge and expertise to identify the true underlying problem and appropriate solution.
The data science team typically has a more objective understanding of the problem set than the stakeholders, who may be suggesting solutions.
Some tips for interviewing project sponsors:
Prepare for the interview; draft questions, and review with
colleagues.
Use open-ended questions; avoid asking leading
questions.
Document what the team heard, and review it with the sponsors.
Developing Initial Hypotheses:
This step involves forming ideas that the team
can test with data.
In this way, the team can compare its answers with the
outcome of an experiment or test to generate additional
possible solutions to problems
Another part of this process involves gathering and
assessing hypotheses from stakeholders and domain
experts who may have their own perspective on what
the problem is, what the solution should be, and how
to arrive at a solution.
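As a simple illustration of checking an initial hypothesis against data, the sketch below compares two hypothetical groups with a two-sample t-test; the segment names and values are invented for illustration only:

```python
from scipy import stats

# Hypothetical metric observed for two customer segments (invented values)
segment_a = [12.1, 13.4, 11.8, 12.9, 13.1, 12.5]
segment_b = [14.2, 13.9, 14.8, 15.1, 13.7, 14.5]

# Initial hypothesis: the two segments behave the same on this metric
t_stat, p_value = stats.ttest_ind(segment_a, segment_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value suggests the data do not support the initial hypothesis
```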
Identifying Potential Data Sources:
The team should perform five main activities during this
step of the discovery phase:
Identify data sources
Capture aggregated data sources
Review the raw data
Evaluate the data structures and tools needed
Scope the sort of data infrastructure needed for this type of problem
PHASE 2: DATA PREPARATION
The second phase of the Data Analytics Lifecycle involves
data preparation, which includes the steps to explore,
preprocess, and condition data prior to modeling and
analysis.
To get the data into the sandbox, the team needs to perform ETLT, a combination of extracting, transforming, and loading data into the sandbox. Once the data is in the sandbox, the team needs to learn about the data and become familiar with it.
The team may perform data visualizations to help team members understand the data, including its trends, outliers, and relationships among data variables. This phase involves:
Preparing the Analytic Sandbox
Performing ETLT
Learning About the Data
Data Conditioning
Survey and Visualize
Common Tools for the Data Preparation Phase
Preparing the Analytic Sandbox
When developing the analytic sandbox, it is a best practice
to collect all kinds of data there, as team members need
access to high volumes and varieties of data for a Big
Data analytics project.
The sandbox can include everything from summary-level aggregated data, structured data, and raw data feeds to unstructured text data from call logs or web logs, depending on the kind of analysis the team plans to undertake.
Expect the sandbox to be large. It may contain raw data, aggregated data, and other data types that are less commonly used in organizations.
Sandbox size can vary greatly depending on the project. A
good rule is to plan for the sandbox to be at least 5-10 times
the size of the original data sets, partly because copies of the
data may be created that serve as specific tables or data
stores for specific kinds of analysis in the project.
Performing ETLT:
In ETL, users perform extract, transform, load processes to extract data from a datastore, perform data transformations, and load the data back into the datastore.
In ELT, by contrast, the data is extracted in its raw form and loaded into the datastore, where analysts can choose to transform the data into a new state or leave it in its original, raw condition.
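A minimal sketch of the extract-load-then-transform idea with pandas, assuming a hypothetical raw CSV export; the file names, directory, and columns are placeholders, not part of the original material:

```python
import os
import pandas as pd

os.makedirs("sandbox", exist_ok=True)

# Extract and load the raw data into the sandbox as-is (no transformation yet)
raw = pd.read_csv("raw_call_logs.csv")                       # hypothetical raw extract
raw.to_csv("sandbox/raw_call_logs.csv", index=False)         # keep an untouched raw copy

# Later, analysts may choose to transform a working copy for a specific analysis
work = raw.copy()
work["call_date"] = pd.to_datetime(work["call_date"], errors="coerce")
work = work.dropna(subset=["call_date"])
work.to_csv("sandbox/call_logs_conditioned.csv", index=False)
```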
As part of the ETLT step, it is advisable to make an inventory
of the data and compare the data currently available with
datasets the team needs (Gap Analysis).
Learning About the Data:
A critical aspect of a data science project is to become familiar with the data itself. This activity accomplishes several goals:
Clarifies the data that the data science team has access to at the start of the project
Highlights gaps by identifying datasets within an organization
that the team may find useful
Identifies datasets outside the organization that may be
useful to obtain, through open APIs, data sharing, or
purchasing data to supplement already existing datasets
Data Conditioning:
Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.
It can involve many complex steps to join or merge datasets or otherwise get them into a state that enables analysis in later phases.
Typical questions to ask during data conditioning include:
What are the data sources? What are the target fields (for example, columns of the tables)?
How clean is the data?
How consistent are the contents and files?
Review the content of data columns or other inputs
Look for any evidence of systematic error.
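A brief sketch of common conditioning steps with pandas, assuming hypothetical source tables; the file and column names are placeholders for illustration:

```python
import pandas as pd

df = pd.read_csv("web_logs.csv")  # hypothetical source table

# Clean: drop exact duplicates and rows missing the key field
df = df.drop_duplicates()
df = df.dropna(subset=["user_id"])

# Normalize: scale a numeric column to the 0-1 range
df["session_time_norm"] = (
    (df["session_time"] - df["session_time"].min())
    / (df["session_time"].max() - df["session_time"].min())
)

# Transform/merge: join with another hypothetical table on a shared key
users = pd.read_csv("users.csv")
conditioned = df.merge(users, on="user_id", how="left")
```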
Survey and Visualize:
After the team has collected and obtained at least some of
the datasets needed for the subsequent analysis, a useful
step is to leverage data visualization tools to gain an
overview of the data.
Seeing high-level patterns in the data enables one to
understand characteristics about the data very quickly
Review data to ensure that calculations remained consistent
within columns or across tables for a given data field.
Does the data distribution stay consistent over all the data? If
not, what kinds of actions should be taken to address this
problem?
Assess the granularity of the data, the range of values, and
the level of aggregation of the data.
For time-related variables, are the measurements daily,
weekly, monthly?
Is the data standardized/normalized? Are the scales
consistent?
For geospatial datasets, are state or country abbreviations
consistent across the data?
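For instance, a quick visual survey of distributions and aggregation levels can be done with pandas and matplotlib; the dataset and columns below are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("conditioned_data.csv")  # hypothetical conditioned dataset

# Histogram of a numeric field to see its distribution, range, and outliers
df["call_duration"].hist(bins=50)
plt.xlabel("call_duration")
plt.ylabel("count")
plt.title("Distribution of call duration")
plt.show()

# Aggregation-level check: counts per day reveal gaps or inconsistent granularity
df["call_date"] = pd.to_datetime(df["call_date"], errors="coerce")
print(df.groupby(df["call_date"].dt.date).size().head())
```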
Common Tools for the Data Preparation Phase:
Hadoop: can perform massively parallel and custom analysis for web traffic parsing, GPS location analytics, and genomic analysis.
Alpine Miner: provides a graphical user interface (GUI) for creating analytic workflows, including data manipulations and a series of analytic events such as data-mining techniques.
OpenRefine (formerly called Google Refine) is "a free, open source, powerful tool for working with messy data." It is a popular GUI-based tool.
Data Wrangler is an interactive tool for data cleaning and transformation. Wrangler was developed at Stanford University and can be used to perform many transformations on a given dataset.
MODEL PLANNING
The data science team identifies candidate models to apply to the data for clustering, classifying, or finding relationships in the data, depending on the goal of the project.
Assess the structure of the datasets.
Ensure that the analytical techniques enable the team to meet the business
objectives and accept or reject the working hypotheses.
Determine if the situation warrants a single model or a series of techniques as
part of a larger analytic workflow.
Data Exploration and Variable Selection
Model Selection
Common Tools for the Model Planning Phase
Data Exploration and Variable Selection:
Although some data exploration takes place in the data preparation phase, those activities focus mainly on data hygiene and on assessing the quality of the data itself.
In the model planning phase, the objective of data exploration is to understand the relationships among the variables, to inform selection of the variables and methods, and to understand the problem domain.
The key to this approach is to aim for capturing the
most essential predictors and variables rather than
considering every possible variable that people think
may influence the outcome.
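As an illustration, pairwise correlations can help surface the most essential predictors; the dataset and target column below are hypothetical, and correlation is only one of several possible screening methods:

```python
import pandas as pd

df = pd.read_csv("model_input.csv")  # hypothetical prepared dataset

# Correlation of each numeric variable with a hypothetical binary target column
corr = df.corr(numeric_only=True)["churned"].drop("churned")
print(corr.sort_values(key=abs, ascending=False).head(10))

# Variables with very weak correlation (or that duplicate each other) are
# candidates to drop, keeping the short list of essential predictors
```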
Model Selection
The team's main goal is to choose an analytical
technique, or a short list of candidate techniques, based
on the end goal of the project.
In machine learning and data mining, these techniques are grouped into several general families, such as classification, association rules, and clustering.
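For example, a short list of candidate classification techniques might be compared with cross-validation before committing to one. The sketch below uses scikit-learn on a small public dataset purely for illustration; the candidate list is an assumption, not a prescription:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Candidate techniques chosen to match the project's end goal (classification here)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```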
Teams create the initial models using a statistical
software package such as R, SAS, or Matlab.
Although these tools are designed for data mining and
machine learning algorithms, they may have limitations
when applying the models to very large datasets, as is
common with Big Data.
MODEL BUILDING PHASE
The data science team needs to develop data sets for
training, testing, and production purposes.
These data sets enable the data scientist to develop the
analytical model and train it while holding aside some of
the data for testing the model.
During this phase, users run models from analytical
software packages, such as R or SAS, on file extracts
and small data sets for testing purposes. On a small
scale, assess the validity of the model and its results.
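A minimal sketch of developing training and test sets and fitting an initial model on a small extract, using scikit-learn; the file, columns, and model choice are hypothetical placeholders:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical file extract; column names are placeholders
df = pd.read_csv("model_input_extract.csv")
X = df.drop(columns=["churned"])
y = df["churned"]

# Hold aside 30% of the data for testing the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit an initial model on the training set, then assess it on the held-out set
model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))
```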
COMMON TOOLS FOR THE MODEL BUILDING
PHASE
SAS Enterprise Miner allows users to run predictive and
descriptive models based on large volumes of data from
across the enterprise.
SPSS Modeler offers methods to explore and analyze
data through a GUI.
Matlab provides a high-level language for performing a
variety of data analytics, algorithms, and data exploration
Alpine Miner provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.
STATISTICA and Mathematica are popular and well-regarded data mining and analytics tools.
Open-source tools:
R and PL/R: PL/R is a procedural language for PostgreSQL with R.
Octave: a free software programming language for computational modeling that has some of the functionality of Matlab.
WEKA is a free data mining software package with an analytic workbench.
Python is a programming language that provides toolkits for machine learning and analysis, such as scikit-learn, NumPy, SciPy, and pandas, and related data visualization using matplotlib.
COMMUNICATE RESULTS
After executing the model, the team needs to compare
the outcomes of the modeling to the criteria established
for success and failure.
When conducting this assessment, determine if the
results are statistically significant and valid
The best practice in this phase is to record all the findings
and then select the three most significant ones that can
be shared with the stakeholders.
The team will have documented the key findings and
major insights derived from the analysis.
ANALYSIS OVER DIFFERENT MODELS
Comparing analysis across different models helps identify which one offers:
Better performance
Longer lifetime
Easier retraining
Speedy production
OPERATIONALIZE
The team communicates the benefits of the project more
broadly and sets up a pilot project to deploy the work in a
controlled way before broadening the work to a full enterprise
or ecosystem of users.
This allows the team to learn from the deployment and make
any needed adjustments before launching the model across
the enterprise.
The presentation needs to include supporting information
about analytical methodology and data sources
The team should create a mechanism for ongoing monitoring of model accuracy and, if accuracy degrades, find ways to retrain the model.
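A minimal sketch of such a monitoring check, assuming a hypothetical scoring job that logs weekly accuracy; the values, baseline, and threshold are placeholders only:

```python
# Hypothetical weekly accuracy log for the deployed model (newest last)
weekly_accuracy = [0.91, 0.90, 0.89, 0.87, 0.84]

BASELINE = 0.90        # accuracy accepted at deployment time (placeholder)
ALLOWED_DROP = 0.05    # degradation that triggers retraining (placeholder)

latest = weekly_accuracy[-1]
if BASELINE - latest > ALLOWED_DROP:
    # In practice this could open a ticket or kick off a retraining pipeline
    print(f"Accuracy dropped to {latest:.2f}; schedule model retraining.")
else:
    print(f"Accuracy {latest:.2f} is within tolerance.")
```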
The Business Intelligence Analyst needs to know whether the reports and dashboards they manage will be affected and need to change.
The Data Engineer and Database Administrator (DBA) typically need to share their code from the analytics project and create a technical document describing how to implement it.
The Data Scientist needs to share the code and explain the model to managers and other stakeholders.
MOVING MODEL TO DEPLOYMENT ENVIRONMENT
Developing Core Material for Multiple Audiences
Project Goals
Main Findings
Approach
Model Description
Model Details
Providing Technical Specifications and Code
ANALYTICS PLAN
Discovery / business problem framed
Initial hypotheses
Data and scope
Model planning / analytic techniques
Results and key findings
Business impact
KEY DELIVERABLES OF ANALYTICS PROJECT
Developing Core Material for Multiple Audiences
Project Goals
Main Findings
Approach
Model Description
Model Details
Providing Technical Specifications and Code
PRESENTING YOUR RESULTS TO THE PROJECT
SPONSOR
The project sponsor is the person who wants the data science result, generally for the business need it will fill.
1. Summarize the motivation behind the project and its goals.
2. State the project's results.
3. Back up the results with details (code), as needed.
4. Discuss recommendations, outstanding issues, and possible future work.
Project sponsor presentation takeaways
Keep it short.
Keep it focused on the business issues, not the
technical ones.
Your project sponsor might use your presentation to
help sell the project or its results to the rest of the
organization.
PROVIDING TECHNICAL SPECIFICATIONS AND
CODE
The team should anticipate questions from IT related to
how computationally expensive it will be to run the model
in the production environment.
Teams should also write technical documentation for their code and specifications.
Introduce your results early in the presentation, rather
than building up to them.