AJAY KUMAR GARG ENGINEERING COLLEGE, GHAZIABAD
Computer Science And Engineering
            Data Analytics and Visualization (KDS-501)
                  Data Analytics Life Cycle
                      Presented By:
                Ms. Neeharika Tripathi
                   Assistant Professor
       Department of Computer Science And Engineering
Cloud Computing
⚫ Cloud computing is on-demand access, via the
  internet, to computing
  resources—applications, servers (physical
  servers and virtual servers), data
  storage, development tools, networking
  capabilities, and more—hosted at a
  remote data center managed by a cloud
  services provider (or CSP). The CSP makes
  these resources available for a monthly
  subscription fee or bills them according to
  usage.
⚫ Public and Private Cloud: In a private cloud, a single organization controls and maintains the underlying infrastructure to deliver the IT resources. In a public cloud, external cloud providers deliver the resources as a fully managed service over the internet.
Grid computing
⚫ Grid computing is a computing
 infrastructure that combines computer
 resources spread over different
 geographical locations to achieve a
 common goal. All unused resources on
 multiple computers are pooled together
 and made available for a single task.
Components of Grid Computing:
⚫ Nodes
⚫ The computers or servers on a grid
  computing network are called nodes. Each
  node offers unused computing resources
  such as CPU, memory, and storage to the
  grid network.
⚫ Grid middleware
Grid middleware is specialized software that connects the computing resources in grid operations with high-level applications. It controls how users share the available resources so that the grid computers are not overwhelmed. The grid middleware also provides security, protecting the shared resources from misuse.
Grid computing architecture
⚫ Grid architecture represents the internal
    structure of grid computers. The following
    layers are broadly present in a grid node:
⚫   The top layer consists of high-level
    applications, such as an application to
    perform predictive modeling.
⚫   The second layer, also known as
    middleware, manages and allocates
    resources requested by applications.
⚫   The third layer consists of available
    computer resources such as CPU, memory,
    and storage.
⚫   The bottom layer allows the computer to
    connect to a grid computing network.
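A minimal Python sketch of this layered idea, with hypothetical Node and Middleware classes standing in for a real grid framework: the middleware (second layer) allocates tasks from a high-level application (top layer) across the pooled resources of the nodes (third layer); the connectivity handled by the bottom layer is abstracted away.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpus: int                       # unused CPU cores offered to the grid

class Middleware:
    """Second layer: manages and allocates pooled node resources for tasks."""
    def __init__(self, nodes):
        self.nodes = nodes

    def schedule(self, tasks):
        plan = []
        for task in tasks:
            # Greedy policy: pick the node with the most free capacity
            node = max(self.nodes, key=lambda n: n.free_cpus)
            if node.free_cpus == 0:
                raise RuntimeError("grid is saturated")
            node.free_cpus -= 1
            plan.append((task, node.name))
        return plan

# Top layer: a high-level application submits its tasks to the middleware
grid = Middleware([Node("node-a", 2), Node("node-b", 3)])
print(grid.schedule(["simulate", "aggregate", "report"]))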
MapReduce
⚫ MapReduce is a programming model for processing large datasets efficiently and in parallel across a distributed cluster. The data is first split, processed in parallel, and then combined to produce the final result.
⚫ In Hadoop, MapReduce maps each job into smaller tasks and then reduces the intermediate results into the final output, which lowers the overhead on the cluster network and the processing power required (a minimal sketch of the map and reduce steps follows).
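A minimal, single-machine Python sketch of this flow, using a word count as the example; real Hadoop MapReduce jobs are written against Hadoop's Java or streaming APIs and run across a cluster, so the helper names here are purely illustrative.

from collections import defaultdict

def map_phase(record):
    # Map: emit (key, value) pairs; here, (word, 1) for every word
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Reduce: combine all values that share the same key
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate pairs by key (Hadoop does this between phases)
grouped = defaultdict(list)
for record in records:                      # the input data is first split
    for key, value in map_phase(record):
        grouped[key].append(value)

# Reduce each group and combine into the final result
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}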
Components of MapReduce
architecture
⚫ Client: The MapReduce client is the one who submits the Job to MapReduce for processing.
⚫ Job: The MapReduce Job is the actual work the client wants done, made up of many smaller tasks.
⚫ Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
⚫ Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
⚫ Input Data: The data set that is fed to MapReduce for processing.
⚫ Output Data: The final result, obtained after all the job-parts have been processed and their results combined.
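A minimal Python sketch of these components, with a local process pool standing in for the cluster (all names are hypothetical): the client submits input data as a job, a master function divides it into job-parts, and the job-part results are combined into the output data.

from concurrent.futures import ProcessPoolExecutor

def job_part(chunk):
    # one sub-job: count the characters in its slice of the input data
    return sum(len(line) for line in chunk)

def master(input_data, n_parts=3):
    # the master divides the job into roughly equal job-parts
    size = max(1, len(input_data) // n_parts)
    chunks = [input_data[i:i + size] for i in range(0, len(input_data), size)]
    with ProcessPoolExecutor() as pool:
        partial_results = pool.map(job_part, chunks)
    return sum(partial_results)          # combine job-part results into output

if __name__ == "__main__":
    input_data = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
    print(master(input_data))            # the client submits the job; prints 30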
Key Roles in Data Analytics Projects
⚫ There are certain key roles required for a data science team to function fully and execute analytics projects successfully. The key roles are seven in number.
⚫ Each role plays a crucial part in developing a successful analytics project.
1. Business User: The business user is the one who understands the main domain area of the project and benefits directly from its results.
⚫ This user advises the project team on the value of the results obtained and on how the outputs will be used in operations.
⚫ A business manager, line manager, or deep subject matter expert typically fills this role.
2. Project Sponsor :
⚫ The Project Sponsor is the one responsible for initiating the project. The Project Sponsor provides the actual requirements for the project and presents the basic business issue.
⚫ This person generally provides the funds and measures the degree of value delivered by the final output of the team working on the project.
⚫ This person introduces the prime concerns and shapes the desired outputs.
3. Project Manager: This person ensures that the key milestones and objectives of the project are met on time and with the expected quality.
4. Business Intelligence Analyst: The Business Intelligence Analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective.
5. Database Administrator (DBA) :
⚫ The DBA provisions and configures the database environment to support the analytics needs of the team working on the project.
⚫ These responsibilities may include granting access to key databases or tables and making sure that the appropriate security levels are in place for the data repositories.
6. Data Engineer :
⚫ The data engineer brings deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox.
⚫ The data engineer works closely with the data scientist to help shape the data in the right ways for analysis.
7. Data Scientist :
⚫ The data scientist provides subject matter expertise for analytical techniques, data modelling, and applying valid analytical techniques to given business problems.
⚫ This person ensures that the overall analytical objectives are met.
Various Phases in Data Analytics
Life Cycle
1. Phase 1: Discovery – The data science team learns about and investigates the problem.
⚫ Develop context and understanding.
⚫ Identify the data sources needed and available for the project.
⚫ The team formulates initial hypotheses that can later be tested with data.
2. Phase 2: Data Preparation – Steps to explore, preprocess, and condition data prior to modeling and analysis.
⚫ It requires the presence of an analytic sandbox; the team executes extract, load, and transform (ELT) steps to get data into the sandbox.
⚫ Data preparation tasks are likely to be performed multiple times and not in a predefined order.
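A minimal ELT sketch in Python, assuming a local SQLite database serves as the analytic sandbox and a hypothetical CSV source with amount and order_date columns.

import sqlite3
import pandas as pd

# Extract and Load: pull the raw data into the sandbox largely as-is
raw = pd.read_csv("raw_sales.csv")                  # hypothetical source file
sandbox = sqlite3.connect("analytic_sandbox.db")
raw.to_sql("sales_raw", sandbox, if_exists="replace", index=False)

# Transform (often repeated, in no fixed order): clean and condition the data
clean = raw.dropna(subset=["amount"]).copy()
clean["amount"] = clean["amount"].astype(float)
clean["order_date"] = pd.to_datetime(clean["order_date"])
clean.to_sql("sales_clean", sandbox, if_exists="replace", index=False)
sandbox.close()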
3. Phase 3: Model Planning – The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
⚫ In this phase, the data science team develops data sets for training, testing, and production purposes.
⚫ The team builds and executes models based on the work done in the model planning phase.
⚫ Several tools commonly used for this phase are Matlab and STATISTICA.
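A minimal Python sketch of this exploration (the revenue column and file name are hypothetical), shortlisting candidate variables by their correlation with the target and setting aside training and testing data.

import pandas as pd

df = pd.read_csv("sales_clean.csv")          # data conditioned in Phase 2

# Explore relationships: correlation of each numeric variable with the target
corr = df.corr(numeric_only=True)["revenue"].drop("revenue")
key_vars = corr.abs().sort_values(ascending=False).head(5).index.tolist()
print("candidate key variables:", key_vars)

# Set aside data now so later phases have training and testing sets
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)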
4. Phase 4: Model Building – The team develops datasets for testing, training, and production purposes.
⚫ The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing the models.
⚫ Free or open-source tools for this phase include R and PL/R and Octave.
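Purely for illustration (the slides name R, PL/R, Octave, and Matlab as typical tools), a minimal model-building sketch in Python with scikit-learn; the feature and target columns are hypothetical.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("sales_clean.csv")
X, y = df[["ad_spend", "price", "units"]], df["revenue"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)    # fit on the training set
print("held-out R^2:", r2_score(y_test, model.predict(X_test)))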
5. Phase 5: Communicate Results – After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.
⚫ The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats and assumptions.
⚫ The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
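A minimal sketch of the success-or-failure comparison; the metric names and thresholds below are hypothetical placeholders for criteria agreed during Discovery.

# Outcomes from the modeling phase and the pre-agreed success criteria
results = {"r2": 0.81, "rmse": 140.0}
criteria = {"r2_min": 0.75, "rmse_max": 150.0}

meets_criteria = (results["r2"] >= criteria["r2_min"]
                  and results["rmse"] <= criteria["rmse_max"])
print("success" if meets_criteria else "failure", results)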
6. Phase 6: Operationalize – The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
⚫ This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and to make adjustments before full deployment.