
UNIT – 1

INTRODUCTION TO DATA ANALYTICS

Definition: Data analytics is the process of examining, cleaning, transforming, and modelling data to
discover useful insights, inform conclusions, and support decision-making.
Purpose: It helps organizations to make data-driven decisions, identify patterns, and forecast trends.

Data analytics is a rapidly growing field that empowers businesses, governments, and organizations to
understand and make informed decisions based on vast amounts of data. With the evolution of AI,
machine learning, and cloud technologies, the future of data analytics is poised for even more
advancements, making it a crucial tool for success across industries.

Importance of Data Analytics


 Informed Decision-Making: Data analytics helps businesses make decisions based on real,
objective data rather than intuition.
 Competitive Advantage: Organizations can use data insights to outperform competitors and
identify new opportunities.
 Efficiency & Cost Reduction: Analytics can identify inefficiencies and areas for cost savings.

Techniques in Data Analytics


 Data Collection: Gathering data from various sources (e.g., surveys, sensors, web scraping).
 Data Cleaning: Removing duplicates, fixing errors, and dealing with missing values.
 Data Exploration: Visualizing data through graphs and charts to identify patterns.
 Statistical Analysis: Using mathematical techniques to analyze relationships in the data.
 Machine Learning: Leveraging algorithms to improve predictive accuracy over time.
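
To make the cleaning and exploration steps above concrete, here is a minimal pandas sketch. The file name and the columns (age, email) are hypothetical placeholders, not part of any particular dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey data; the file name and column names are placeholders.
df = pd.read_csv("survey_responses.csv")

# Data cleaning: drop exact duplicates and deal with missing values.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())   # impute a numeric field
df = df.dropna(subset=["email"])                   # drop rows missing a key field

# Data exploration: summary statistics and a quick histogram.
print(df.describe())
df["age"].plot(kind="hist", title="Age distribution")
plt.show()
```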

SOURCES AND NATURE OF DATA


PART 1: SOURCES OF DATA
Data can be collected from a myriad of sources, generally categorized as Internal or External, and
further defined by how it is captured.

A. INTERNAL SOURCES
Data generated from within an organization during its routine operations.
● Transactional Databases: Records of business transactions (e.g., sales, purchases,
payments).
o Example: A Sales table in a company's SQL database.
● Customer Relationship Management (CRM) Systems: Contain customer interaction data,
support tickets, and lead information.
o Example: Salesforce or HubSpot data.
● Enterprise Resource Planning (ERP) Systems: Integrated management of core business
processes (e.g., finance, supply chain, manufacturing).
● Application Logs: Records of events generated by software applications (e.g., web server
logs, application error logs).
● Company Documents: Internal reports, emails, and presentations.
B. EXTERNAL SOURCES
Data obtained from outside the organization.
● Public Government Datasets: Open data provided by governments (e.g., census data,
health statistics, economic indicators).
o Example: data.gov, data.gov.uk, Eurostat.
● Commercial Data Providers: Companies that collect and sell specialized data.
o Example: Nielsen for market research, Thomson Reuters for financial data.
● APIs (Application Programming Interfaces): Structured channels to access data from other
web services and platforms.
o Example: Using the Twitter API to get tweets, or the Google Maps API to get location data (a minimal request sketch appears after this list).
● Web Scraping/Crawling: The process of using code to automatically extract data from
websites.
o Example: Scraping product prices and reviews from e-commerce sites for competitive
analysis.
● Social Media: A vast source of unstructured public opinion, trends, and user-generated
content.
● Third-Party Surveys: Data collected by research firms.
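
The API route mentioned above can be illustrated with a short sketch using Python's requests library. The endpoint, parameters, and fields below are placeholders, not any specific provider's API.

```python
import requests

# Hypothetical JSON endpoint; URL, parameters, and fields are placeholders.
url = "https://api.example.com/v1/indicators"
params = {"country": "IN", "indicator": "gdp", "format": "json"}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()      # stop on HTTP errors (4xx/5xx)
records = response.json()        # parse the JSON payload into Python objects

for row in records[:5]:          # inspect the first few records
    print(row)
```

Real providers (Twitter, Google Maps, government portals) each have their own authentication and response formats, so the exact calls differ, but the pattern of request, parse, and inspect stays the same.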

C. THE RISE OF MACHINE-GENERATED DATA


This is a crucial modern source, often falling under both internal and external categories. It's data
automatically created by devices and systems without human intervention.
● Internet of Things (IoT) Sensors: Data from smart devices, wearables, industrial equipment,
and home appliances.
o Example: A temperature sensor in a factory, a smartwatch tracking heart rate.
● Web Streams: Real-time data feeds of user activity on a website or app.
o Example: Clickstream data showing every click a user makes on an e-commerce site.

D. OPEN DATA
Open data is accessible to everyone and free to use. However, if it is high-level, or heavily summarised and aggregated, it might not be very relevant to you. It might also not be in the format you need, or it might be difficult to make sense of. Addressing these challenges can take considerable time before the data is usable.
• Government data
• Health and scientific data: World Health Organisation (WHO), Nature.com scientific data, etc.
• Social media: Google Trends (e.g., national trends in search terms) and Twitter (which allows searching by tags and users; tweets can be downloaded via the Twitter APIs).

PART 2: NATURE OF DATA (DATA TYPES)


The nature of data defines its structure and format, which directly dictates the tools and techniques
needed to analyze it. It's primarily categorized by its structure.
1. STRUCTURED DATA
● Definition: Data that is highly organized and formatted in a fixed schema, making it easily
searchable and analyzable. It is typically stored in tabular form (rows and columns).
● Analogy: A well-organized filing cabinet where every document has a specific, labeled
folder.
● Examples:
o Relational databases (SQL tables: e.g., Customers, Products).
o Spreadsheets (Excel, Google Sheets), CSV (Comma-Separated Values) files.
● Analysis: Easiest to analyze using SQL and standard statistical software.
2. UNSTRUCTURED DATA
● Definition: Data that has no pre-defined format or organization. It constitutes the vast
majority of all data (~80-90%) and is the most challenging to analyze.
● Analogy: A box of random, unlabeled photos, documents, and audio recordings.
● Examples:
o Text: Emails, social media posts, word documents, articles.
o Multimedia: Images, video files, audio recordings, PDFs, PowerPoint presentations.
● Analysis: Requires advanced techniques like Natural Language Processing (NLP) for text,
computer vision for images, and complex data mining algorithms.
3. SEMI-STRUCTURED DATA
● Definition: Data that does not reside in a relational database but has some organizational
properties, like tags or markers, that make it more analyzable than raw unstructured data.
● Analogy: A self-addressed envelope. The information isn't in a strict table, but the structure
of the envelope (to, from, stamp) tells you what the data means.
● Examples:
o JSON (JavaScript Object Notation): A common format for transmitting data from
web APIs. It uses key-value pairs.
o XML (eXtensible Markup Language): Similar to JSON, uses tags to define elements.
o HTML: The code behind web pages, using tags to structure content.
● Analysis: Can be parsed and ingested into tools for analysis. Modern databases (NoSQL) and
tools like Python's Pandas library can handle this well.
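
As a small illustration of that last point, the sketch below flattens a hypothetical nested JSON payload into a table with pandas; the field names are made up.

```python
import pandas as pd

# A small, hypothetical semi-structured payload, as a web API might return it.
orders = [
    {"id": 1, "customer": {"name": "Asha", "city": "Delhi"}, "amount": 250.0},
    {"id": 2, "customer": {"name": "Ravi", "city": "Pune"}, "amount": 410.5},
]

# json_normalize flattens the nested key-value pairs into flat columns.
df = pd.json_normalize(orders)
print(df)   # columns: id, amount, customer.name, customer.city
```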

SUMMARY

● INTERNAL SOURCE
o Structured: Sales records in a company's SQL DB
o Semi-structured: Server logs in JSON format
o Unstructured: Internal company emails
● EXTERNAL SOURCE
o Structured: Public economic data in CSV
o Semi-structured: Data from a Twitter API call
o Unstructured: News articles, YouTube videos
● PRIMARY TOOLS
o Structured: SQL, Excel, Pandas
o Semi-structured: Python (Pandas), NoSQL DBs
o Unstructured: NLP, Computer Vision, AI

CLASSIFICATION OF DATA
1. STRUCTURED DATA

Structured data is highly organized and fits neatly into traditional data models such as tables, rows, and
columns. It is often stored in relational databases or spreadsheets and is easy to enter, store, query,
and analyze.

CHARACTERISTICS
 Fixed Format: Data follows a predefined schema (e.g., tables with rows and columns).
 Easily Searchable: Structured data can be easily accessed using standard database tools (e.g.,
SQL).
 High Consistency: Data types are consistent (e.g., dates, numbers, text).

EXAMPLES
 Databases: SQL databases like MySQL, PostgreSQL, and Oracle.
 Spreadsheets: Excel files or CSV files.
 CRM Systems: Customer databases containing structured fields (e.g., name, age, email).

2. SEMI-STRUCTURED DATA

Semi-structured data does not conform to a rigid schema like structured data, but it contains some
organizational properties that allow for easier analysis compared to unstructured data. It often
includes tags or markers that separate elements.

CHARACTERISTICS
 Flexible Format: Data has a loose structure (e.g., key-value pairs, tags, or metadata).
 Self-Describing: Data contains metadata that describes the data itself, making it easier to
interpret.
 Can Be Nested: Data may have nested elements (e.g., JSON or XML).

EXAMPLES
 JSON (JavaScript Object Notation): A flexible format often used in web APIs.
 XML (eXtensible Markup Language): Used for storing data in a hierarchical format.
 NoSQL Databases: Document-based stores (e.g., MongoDB) or key-value stores (e.g., Redis).
 Email Metadata: Subject, sender, timestamp, etc., that are structured but have unstructured
content.

3. UNSTRUCTURED DATA

Unstructured data is raw data that doesn't have a predefined format or structure, making it difficult to
store, process, or analyze using traditional tools and methods.

CHARACTERISTICS
 Lack of Organization: No predefined structure; it can be varied and messy.
 Diverse Formats: It could be text, images, audio, video, or even social media posts.
 Hard to Analyze Directly: Requires advanced techniques like Natural Language Processing (NLP),
machine learning, or image recognition for extraction of insights.
EXAMPLES
 Text Documents: Word files, PDFs, email bodies.
 Multimedia: Images, videos, audio recordings.
 Social Media Posts: Tweets, Instagram posts, YouTube videos, comments.
 Web Content: Blogs, websites, news articles.

COMPARISON: STRUCTURED VS. SEMI-STRUCTURED VS. UNSTRUCTURED DATA

● Organization
o Structured: Fixed schema (tables, rows, columns)
o Semi-structured: Flexible schema (e.g., JSON, XML)
o Unstructured: No fixed schema or organization
● Storage
o Structured: Relational databases (SQL)
o Semi-structured: NoSQL databases, files
o Unstructured: Data lakes, object storage (e.g., Hadoop)
● Processing
o Structured: SQL queries, BI tools
o Semi-structured: Parsing, transformation, NoSQL queries
o Unstructured: NLP, machine learning, image processing
● Tools
o Structured: SQL, Excel, BI tools (Tableau, Power BI)
o Semi-structured: NoSQL (MongoDB, Cassandra), Hadoop
o Unstructured: Hadoop, Spark, TensorFlow, NLTK, OpenCV
● Examples
o Structured: Customer databases, financial records
o Semi-structured: JSON, XML, NoSQL databases
o Unstructured: Social media posts, images, videos, emails
● Advantages
o Structured: Easy to manage and analyze
o Semi-structured: Flexible, scalable, evolving schema
o Unstructured: Rich insights from diverse data types
● Challenges
o Structured: Limited to predefined formats
o Semi-structured: Requires preprocessing
o Unstructured: Difficult to analyze without advanced methods

DATA ANALYTICS STRATEGY BASED ON TYPE

For Structured Data:


 Traditional analytical methods, including statistical analysis and BI tools, work best.
 Common uses: Financial analysis, sales reports, inventory management.

For Semi-Structured Data:


 More flexibility is required; tools like NoSQL databases and data transformation scripts (e.g.,
Python) are essential.
 Common uses: Web logs, sensor data, customer feedback, and API responses.

For Unstructured Data:


 Machine learning models, deep learning (e.g., CNN for image processing, LSTM for text), and
NLP tools are used to extract meaning.
 Common uses: Sentiment analysis on social media, object detection in images, speech-to-text applications.
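
As a minimal illustration of this strategy, the sketch below trains a tiny sentiment classifier with scikit-learn on a few made-up review snippets; real NLP work would use far more data and richer models (e.g., the deep learning approaches named above).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up labelled examples (1 = positive, 0 = negative sentiment).
texts = [
    "love this product, works great",
    "terrible quality, waste of money",
    "excellent service and fast delivery",
    "awful experience, would not recommend",
    "really happy with the purchase",
    "broke after two days, very disappointed",
]
labels = [1, 0, 1, 0, 1, 0]

# Turn raw text into numeric features, then fit a simple classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# On this toy data the model will usually label these as positive / negative.
print(model.predict(["very happy, great value"]))
print(model.predict(["disappointed, poor quality"]))
```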

CHARACTERISTICS OF DATA

● Volume: Amount of data. Why it matters: impacts storage and processing.
● Variety: Different data types and formats. Why it matters: determines tools and complexity.
● Velocity: Speed of data generation and processing. Why it matters: enables real-time insights.
● Veracity: Accuracy and reliability of data. Why it matters: affects decision quality.
● Value: Usefulness of the data. Why it matters: drives actionable insights.
● Variability: Inconsistency in data patterns. Why it matters: challenges predictive modeling.
● Validity: Data truly measures what it claims to. Why it matters: ensures analysis relevance.
● Timeliness: Data is recent and up to date. Why it matters: important for timely decisions.
● Completeness: No missing or incomplete data. Why it matters: supports accuracy and integrity.
● Granularity: Level of detail in data. Why it matters: affects depth of analysis.
● Consistency: Uniform format and structure. Why it matters: enables reliable comparisons.

BIG DATA
Big Data refers to extremely large and complex datasets that exceed the processing capabilities of
traditional data management and analysis tools. These datasets come in various forms (structured,
semi-structured, unstructured) and are generated rapidly from sources like sensors, social media,
transactions, and devices.

CHARACTERISTICS
 VOLUME: The sheer amount of data, often measured in terabytes or petabytes, generated
every second.
 VELOCITY: The speed at which new data is produced and needs to be managed or analyzed.
 VARIETY: Multiple data formats including text, images, audio, video, logs, and sensor data.
 VERACITY: The accuracy, reliability, and trustworthiness of data sources.
 VALUE: The potential to extract useful insights or benefits from data analysis.
ANALYTICS CAPABILITIES OF BIG DATA PLATFORMS

 DISTRIBUTED DATA PROCESSING

 Platforms like Apache Hadoop and Apache Spark enable parallel computing across clusters, allowing complex analytical tasks such as batch processing, real-time stream processing, and machine learning on huge datasets (see the PySpark sketch after this list).

 REAL-TIME ANALYTICS
 Many platforms support the ingestion and analysis of streaming data for immediate
insights, useful for applications like fraud detection, recommendation engines, and
operational monitoring.
 MACHINE LEARNING INTEGRATION
 Built-in machine learning libraries and support for frameworks let users build, train, and
deploy predictive models directly within the platform, accelerating AI and advanced
analytics workflows.
 SQL AND QUERYING ENGINES
 Big Data platforms provide fast SQL-based querying support (e.g., Google BigQuery, Amazon
Redshift) on structured and semi-structured data, making analytics accessible to data
analysts and business users familiar with relational querying.
 DATA VISUALIZATION AND REPORTING
 Integrated tools or third-party integrations like Power BI and Tableau enable visualization of
big data analytics results, supporting data-driven decision-making via interactive dashboards
and reports.
 SCALABILITY AND FLEXIBILITY
 Cloud-based Big Data platforms (AWS, Azure, Google Cloud) provide elastic scalability to
handle varying workloads and data volumes, supporting diverse analytics needs without
infrastructure constraints.
 DATA INTEGRATION
 These platforms consolidate data from disparate sources, enabling comprehensive analytics
that span operational systems, customer data, IoT, social media, and more, providing 360-
degree insights.
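
To make the distributed processing capability above concrete, here is a minimal PySpark sketch; the file path and column names are placeholders. Run locally it uses one machine, but the same code scales out across a cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (in production this would point at a cluster).
spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Hypothetical sales file; path and column names are placeholders.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A simple aggregation that Spark distributes across its workers.
revenue_by_region = (
    sales.groupBy("region")
         .agg(F.sum("amount").alias("total_revenue"))
         .orderBy(F.desc("total_revenue"))
)
revenue_by_region.show()

spark.stop()
```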

EXAMPLES OF BIG DATA ANALYTICS PLATFORMS

 Apache Hadoop & Spark: Open-source platforms for distributed processing and analytics.
 Google BigQuery: Serverless data warehouse with fast SQL analytics and built-in ML.
 Amazon EMR & Redshift: Cloud-native analytics and data warehousing solutions.
 Microsoft Azure HDInsight & Synapse Analytics: Managed cloud services with strong ML and BI
integration.
 Databricks: Unified platform for large-scale data engineering, analytics, and machine learning.
ANALYTIC PROCESS
The analytic process in data analytics is a structured methodology that transforms raw data into
meaningful insights to aid decision-making. It typically follows these key steps:

ANALYTIC PROCESS STEPS

1. DEFINE THE PROBLEM OR OBJECTIVE


 Clearly outline the specific question or business problem to answer. This step sets the
direction and goals of the entire analytics project and identifies the key performance
indicators (KPIs) to measure success.

2. COLLECT DATA
 Gather relevant data from diverse sources such as databases, APIs, surveys, web logs, and
IoT devices. The quality and relevance of collected data are vital for accurate analysis.
3. DATA CLEANING AND PREPARATION
 Process the raw data to correct errors, handle missing values, remove duplicates, and
transform data into suitable formats. This step ensures the reliability of subsequent
analyses.
4. DATA ANALYSIS
 Apply statistical methods, machine learning algorithms, or other analytical techniques to
explore, model, and identify patterns or relationships in the data. This can include
descriptive, diagnostic, predictive, and prescriptive analytics.
5. INTERPRETATION AND REPORTING
 Interpret the results within the context of the original problem, translating data patterns
into actionable insights. Present findings using visualizations, dashboards, or reports
tailored for stakeholders to support decision-making.

IMPORTANCE OF DATA ANALYTICS

 INFORMED DECISION-MAKING
 Data analytics empowers organizations to make decisions based on facts and trends rather
than intuition, reducing guesswork and increasing accuracy in business strategies and
operations.
 IMPROVED CUSTOMER UNDERSTANDING
 It helps businesses deeply understand customer behavior, preferences, and needs, enabling
personalized marketing and improved customer experiences, which enhance loyalty and
satisfaction.
 OPERATIONAL EFFICIENCY
 Analytics identifies inefficiencies, bottlenecks, and process optimizations to reduce costs
and improve productivity, resulting in streamlined business operations.
 RISK MANAGEMENT
 By analyzing historical data and identifying warning signs, data analytics proactively helps
manage risks related to markets, fraud, security, and compliance.
 COMPETITIVE ADVANTAGE
 Analytics enables companies to analyze competitors' strategies and market trends, allowing
them to innovate and position themselves better in the market.
 FASTER INNOVATION AND PRODUCT DEVELOPMENT
 Data-driven insights inform product development, helping businesses design products that
meet customer demands and respond quickly to market changes.
 COST REDUCTION AND REVENUE GROWTH
 Companies using data analytics often experience significant cost savings through optimized
resource allocation and increased revenues through targeted marketing and operational
improvements.
 REAL-TIME INSIGHTS
 Real-time data analytics provides up-to-date information that helps companies adapt
quickly to changes, seize opportunities, and mitigate issues immediately.
TYPES OF DATA ANALYTICS
1. DESCRIPTIVE ANALYTICS - "WHAT HAPPENED?"

● Purpose: To summarize and describe past data to understand what has occurred. It is the most
common and foundational type of analytics, providing a starting point for all further
investigation.
● Methods: Data aggregation, data mining, and summary statistics (e.g., mean, median, mode).
● Tools: Dashboards, reports, and Business Intelligence (BI) tools.
○ "Sales increased by 15% last quarter."
○ "The website had 100,000 visitors last month."
○ "The customer churn rate was 5%."
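
A minimal pandas sketch of this kind of "what happened" summary; the quarterly figures are made up for illustration.

```python
import pandas as pd

# Hypothetical quarterly revenue figures; illustrative only.
revenue = pd.Series({"Q1": 120_000, "Q2": 138_000, "Q3": 131_000, "Q4": 150_650})

summary = pd.DataFrame({
    "revenue": revenue,
    "growth_pct": (revenue.pct_change() * 100).round(1),  # quarter-over-quarter change
})
print(summary)
print("Mean quarterly revenue:", revenue.mean())
```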

2. DIAGNOSTIC ANALYTICS - "WHY DID IT HAPPEN?"

● Purpose: To investigate past performance to determine the cause of an event or outcome. It involves drilling down into the data, discovering patterns, and identifying correlations.
● Methods: Drill-down, data discovery, correlation analysis, and root cause analysis.
● Tools: SQL queries, OLAP (Online Analytical Processing), and more detailed visualizations.
○ "Why did sales increase by 15%? The analysis shows the spike was directly caused by a
successful marketing campaign in Europe and a competitor's product recall."
○ "Why did the website traffic drop? It correlated with a technical outage that lasted 3
hours."
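
A small sketch of the drill-down and correlation idea using pandas; the daily figures are invented purely to show the mechanics.

```python
import pandas as pd

# Hypothetical daily figures; columns and values are illustrative only.
df = pd.DataFrame({
    "ad_spend":    [200, 220, 180, 400, 420, 430, 410],
    "site_visits": [5100, 5300, 4900, 8800, 9100, 9000, 8700],
    "sales":       [52, 55, 48, 95, 99, 97, 94],
})

# Correlation matrix: a first step in asking "why did sales move?"
print(df.corr().round(2))

# Drill-down: compare average sales on high-spend vs. low-spend days.
df["high_spend"] = df["ad_spend"] > 300
print(df.groupby("high_spend")["sales"].mean())
```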

3. PREDICTIVE ANALYTICS - "WHAT IS LIKELY TO HAPPEN?"

● Purpose: To use historical data to forecast future outcomes and trends. It uses statistical
models and machine learning to identify the likelihood of future results. It is about managing
uncertainty.
● Methods: Statistical modeling, machine learning, forecasting, and pattern matching.
● Tools: Python (scikit-learn, TensorFlow), R, SAS, and other ML platforms.
○ "Based on historical trends & current market data, we predict a 20% increase in
customer churn next month."
○ "This customer has a 92% probability of clicking on the promotional email."
○ "We forecast that demand for this product will peak in the first week of December."
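
A minimal predictive sketch with scikit-learn: a logistic regression that estimates churn probability. The handful of data points (tenure in months, support tickets, churned yes/no) are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical customers: [tenure_months, support_tickets] -> churned (1) or not (0).
X = np.array([[2, 5], [30, 0], [4, 3], [48, 1], [6, 4], [36, 0], [3, 6], [60, 1]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Predicted churn probability for a new customer (5 months tenure, 4 tickets).
print("P(churn):", model.predict_proba([[5, 4]])[0, 1])
print("Test accuracy:", model.score(X_test, y_test))
```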
4. PRESCRIPTIVE ANALYTICS - "WHAT SHOULD WE DO?"

● Purpose: To recommend actions you can take to affect desired outcomes or avoid future
problems. It is the most advanced type, often incorporating predictive models and simulating
the consequences of various choices to determine the best course of action.
● Methods: Optimization, simulation, recommendation engines, and complex machine learning
algorithms.
● Tools: Advanced ML, AI, and complex computational modeling software.
○ "To reduce the predicted customer churn, the model recommends offering a 15%
discount and a free service upgrade to these specific high-risk customers."
○ "To optimize delivery routes and minimize fuel costs, the system prescribes the
following schedule for drivers."
○ Netflix recommends a specific movie to you (prescription) based on what it predicts you
will enjoy.
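
A minimal prescriptive sketch: a linear program (SciPy's linprog) that recommends how to split two retention offers across 100 high-risk customers under a budget. All numbers are invented to illustrate the optimization step.

```python
from scipy.optimize import linprog

# Decision variables: x = [customers_on_offer_A, customers_on_offer_B].
# Offer B retains better (lower expected churn cost) but costs more to give.
expected_churn_cost = [180, 120]          # per customer, after offer A / B

result = linprog(
    c=expected_churn_cost,                # minimize total expected churn cost
    A_eq=[[1, 1]], b_eq=[100],            # every high-risk customer gets one offer
    A_ub=[[20, 35]], b_ub=[3000],         # offer costs must fit the budget
    bounds=[(0, None), (0, None)],
)
print(result.x)   # prescribed split, roughly [33, 67]; round to whole customers
```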

COMMON DATA ANALYTICS TOOLS

 DATA COLLECTION & STORAGE

 Apache Kafka, Apache NiFi, AWS S3, Hadoop HDFS

 DATA CLEANING & TRANSFORMATION

 Python (Pandas, NumPy), R, Talend, Informatica

 DATA ANALYSIS & MODELING

 Python (Scikit-learn, TensorFlow), R, SAS, SPSS

 DATA VISUALIZATION

 Tableau, Microsoft Power BI, Google Data Studio, Looker

 BIG DATA PROCESSING

 Apache Spark, Hadoop MapReduce, Databricks

APPLICATIONS OF DATA ANALYTICS


 Customer Analytics in Retail
 Predictive Maintenance in Manufacturing
 Fraud Detection in Finance
 Supply Chain Optimization
 Marketing Optimization
 Risk Management in Insurance and Finance
 Healthcare
 Human Resources
 Energy Consumption Optimization
 E-commerce
 Transportation and Logistics
 Business Intelligence

ANALYSIS VS. REPORTING

● Purpose
o Reporting: Summarize "what happened"
o Analysis: Explain "why" and recommend "what's next"
● Approach
o Reporting: Routine, structured, repetitive
o Analysis: Exploratory, interpretive, complex
● Output
o Reporting: Dashboards, reports, visual summaries
o Analysis: Insights, forecasts, recommendations
● Users
o Reporting: Business managers, executives
o Analysis: Data analysts, strategists, business leaders
● Value
o Reporting: Monitoring and transparency
o Analysis: Strategic decision support and optimization
DATA ANALYTICS LIFECYCLE

PHASE 1: DISCOVERY & PROBLEM DEFINITION


Understand the business context, objectives, and constraints. Define success.

● Key Activities:
○ Identify the Business Problem: What are we trying to solve or achieve?
○ Define Objectives: How will we measure success?
○ Identify Key Stakeholders: Who is the project for? Who will use the results?
● Key Outputs:
○ Clear project charter with defined objectives and success metrics.
○ A list of data sources needed and timeline.

PHASE 2: DATA ACQUISITION & PREPARATION


Obtain, clean, and transform raw data into a reliable and usable dataset. (Often the longest phase!)

● Key Activities:
○ Data Acquisition: Gathering data from identified sources (SQL databases, APIs, CRM, flat
files, IoT sensors, etc.).
○ Data Cleaning (Data Cleansing):
■ Handling missing values (impute or remove).
■ Correcting errors and inconsistencies.
○ Data Transformation:
■ Normalizing/scaling values.
■ Joining tables from different sources.
■ Formatting data types (e.g., converting strings to dates).
● Key Outputs:
○ A clean, consolidated, and well-structured dataset ready for analysis.
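
A minimal pandas sketch of the transformation activities above (type conversion, joining sources, scaling); the file names and columns are placeholders.

```python
import pandas as pd

# Hypothetical source files; names and columns are placeholders.
orders = pd.read_csv("orders.csv")        # order_id, customer_id, order_date, amount
customers = pd.read_csv("customers.csv")  # customer_id, region, signup_date

# Formatting data types: convert a string column to real dates.
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Joining tables from different sources on a shared key.
dataset = orders.merge(customers, on="customer_id", how="left")

# Normalizing/scaling a numeric column (z-score).
dataset["amount_z"] = (dataset["amount"] - dataset["amount"].mean()) / dataset["amount"].std()

print(dataset.head())
```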
PHASE 3: DATA EXPLORATION & MODEL PLANNING (EDA)
Deeply understand data patterns and relationships to inform choice of analytical techniques.

● Key Activities:
○ Exploratory Data Analysis (EDA):
■ Calculating descriptive statistics (mean, median, standard deviation).
■ Creating visualizations: histograms, box plots, scatter plots, correlation matrices.
○ Identify Patterns & Anomalies: Find trends, outliers, and relationships between
variables.
○ Choose Modeling Techniques: Based on EDA, select appropriate algorithms (e.g., linear
regression, classification, clustering).
● Key Outputs:
○ Insights about data structure and relationships.
○ A plan for which models or analytical approaches to use.

PHASE 4: MODEL BUILDING & EXECUTION


Develop, train, and validate models to generate insights and predictions.

● Key Activities:
○ Data Splitting: Dividing the dataset into training, validation, and testing sets.
○ Model Development: Building the selected models (e.g., using Python's scikit-learn, R,
TensorFlow).
○ Model Training: Feeding the training data to the algorithm to let it learn patterns.
○ Model Validation & Tuning: Using the validation set to test performance and adjust
hyperparameters to avoid overfitting.
○ Model Testing: Evaluating the final model's performance on the unseen test set to get
an unbiased estimate of its accuracy.
● Key Outputs:
○ A trained, validated, and tested analytical model.
○ Model performance metrics (e.g., Accuracy, Precision, Recall, RMSE, R²).
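
A compact sketch of the split, train, tune, and test sequence with scikit-learn. The dataset is synthetic and the hyperparameter grid is arbitrary; it only illustrates the workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared dataset produced in Phase 2.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Data splitting: hold out a test set the model never sees during tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model development + validation/tuning: cross-validated hyperparameter search.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
)
search.fit(X_train, y_train)

# Model testing: unbiased performance estimate on the unseen test set.
print("Best params:", search.best_params_)
print("Test accuracy:", accuracy_score(y_test, search.predict(X_test)))
```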

PHASE 5: COMMUNICATE RESULTS & VALIDATE


Interpret the model's output, validate it against business objectives, and communicate findings
effectively to stakeholders.
● Key Activities:
○ Interpretation: Translating model results into actionable business insights. (e.g., "The top
three factors predicting churn are...")
○ Storytelling: Creating a narrative around the findings. "What does this mean, so what,
and now what?"
○ Data Visualization: Building clear, compelling dashboards, reports, and presentations
(e.g., using Tableau, Power BI).
● Key Outputs:
○ Final report, presentation, or dashboard.
○ A list of actionable recommendations for the business.

PHASE 6: OPERATIONALIZE & MEASURE IMPACT


Implement the model into ongoing business processes and continuously monitor its performance.
● Key Activities:
○ Deployment: Integrating the model into production systems (e.g., as an API, embedded
in an app, automating reports).
○ Monitoring: Continuously tracking the model's performance and data quality over time
to check for model drift (performance degradation).
○ Maintenance: Creating processes for retraining the model with new data.
● Key Outputs:
○ A fully deployed and operational analytics solution.
○ A monitoring and maintenance plan.
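
A very small sketch of the monitoring idea: compare the deployed model's accuracy on recently labelled data against a threshold and flag retraining. The model and data objects are assumed to come from the production system, and the threshold value is arbitrary.

```python
from sklearn.metrics import accuracy_score

RETRAIN_THRESHOLD = 0.80   # arbitrary example threshold

def needs_retraining(model, X_recent, y_recent, threshold=RETRAIN_THRESHOLD):
    """Flag drift: True when accuracy on fresh labelled data falls below threshold."""
    live_accuracy = accuracy_score(y_recent, model.predict(X_recent))
    print(f"Live accuracy: {live_accuracy:.2%}")
    return live_accuracy < threshold

# A scheduled monitoring job would call this on each new batch of labelled data
# and trigger the retraining pipeline when it returns True.
```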

KEY ROLES AND THEIR RESPONSIBILITIES ACROSS THE LIFECYCLE

● Business Stakeholder / Executive Sponsor
o Primary responsibilities: Defines the "Why." Provides domain expertise, secures funding, defines business objectives and KPIs for success, and champions the project.
o Key contribution by lifecycle phase: Phase 1 (Discovery): absolutely critical; they frame the problem. Phase 5 (Communicate): validates that results align with business needs. Phase 6 (Operationalize): ensures the project gets adopted.
● Project Manager / Scrum Master
o Primary responsibilities: Keeps the "How" on track. Manages timelines, resources, budget, and scope. Facilitates communication between technical and business teams and mitigates risks.
o Key contribution by lifecycle phase: All phases; their role is continuous, ensuring the project moves efficiently from discovery through to operationalization without going over budget or missing deadlines.
● Data Analyst
o Primary responsibilities: The Storyteller & Explorer. Bridges the gap between business and technical teams. Uses SQL, statistics, and visualization to find patterns, generate reports, and translate data into actionable business insights.
o Key contribution by lifecycle phase: Phase 1 (Discovery): helps clarify data requirements. Phase 2 (Preparation): often heavily involved in cleaning. Phase 3 (Exploration): core role. Phase 5 (Communicate): creates dashboards and reports.
● Data Engineer
o Primary responsibilities: The Foundation Builder. Designs, builds, and manages the data infrastructure. Responsible for data pipelines, ETL/ELT processes, data warehouses/lakes, and ensuring data is accessible, reliable, and scalable.
o Key contribution by lifecycle phase: Phase 2 (Preparation): absolute core role; they build the pipelines to acquire and prepare data. Phase 6 (Operationalize): helps deploy models into production systems.
● Data Scientist
o Primary responsibilities: The Model Builder. Applies advanced statistical, machine learning, and algorithmic techniques to build predictive and prescriptive models. Focuses on answering complex "what if" questions.
o Key contribution by lifecycle phase: Phase 3 (Exploration): plans the modeling approach. Phase 4 (Model Building): core role; develops, trains, and tunes models. Phase 5 (Communicate): explains model logic and outcomes.
● Machine Learning Engineer
o Primary responsibilities: The Production Specialist. A specialized role focused on taking models from a prototype (e.g., a Jupyter Notebook) and deploying them into scalable, reliable, and automated production systems (MLOps).
o Key contribution by lifecycle phase: Phase 4 (Model Building): optimizes code for production. Phase 6 (Operationalize): core role; manages deployment, monitoring, and retraining pipelines.
● IT/DevOps Engineer
o Primary responsibilities: The Infrastructure Guardian. Manages the cloud or on-premise hardware and software infrastructure. Ensures security, compliance, scalability, and integration with other enterprise systems.
o Key contribution by lifecycle phase: Phase 2 (Preparation): provisions storage and compute resources. Phase 6 (Operationalize): critical for maintaining the production environment.
