Lecture 1

Data science is an interdisciplinary field focused on extracting knowledge from structured and unstructured data using scientific methods and algorithms. The role of a data scientist involves analyzing large datasets, leveraging machine learning, and communicating insights to bridge business gaps. The CRISP-DM framework outlines the data science process, including business understanding, data preparation, modeling, evaluation, and deployment.

Uploaded by

lufunosape

Introduction to Data Science

Lecture 1
22 July 2022
What is data science

Wikipedia:

Data science is an interdisciplinary field that uses scientific
methods, processes, algorithms and systems to extract knowledge
and insights from noisy, structured and unstructured data,[1][2] and
apply knowledge and actionable insights from data across a broad
range of application domains.
What is data science
Data Science Job Growth
Role of Data Scientist

• A data scientist’s role involves handling large amounts of data
and analyzing it using data-driven methodologies.

• Once they have made sense of the data, they identify patterns and
trends through visualizations and bridge business gaps by
communicating their findings to the information technology
leadership teams.

• Data scientists also leverage machine learning and AI, and draw on
their programming knowledge of Java, Python, SQL, big data tools
such as Hadoop, and data mining.

• They need strong communication skills to translate their data
discoveries into insights the business can use.
Data Scientist vs Data Analyst
Where does the Data come from?

• Lots of data is being collected and warehoused:

• Web data, e-commerce
• Financial transactions, bank/credit transactions
• Online trading and purchasing
• Social Network

• Google processes 20 PB a day (2008)

• Facebook has 60 TB of daily logs
• eBay has 6.5 PB of user data + 50 TB/day
• (5/2009) 1000 genomes project: 200 TB
Structured vs Unstructured data

• Structured data
  • Data that has been formatted to a predefined structure
  before being placed in data storage.
  • The best example of structured data is the relational database:
  the data has been formatted into precisely defined fields, such
  as credit card numbers or addresses, so that it can be easily
  queried with SQL.
  • Its largest benefit is how easily it can be used by machine
  learning algorithms.
• Unstructured data
  • Data stored in its native format and not processed until it
  is used.
  • Examples include audio, video and text data.
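
To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and sample values are invented for illustration; they are not from the lecture.

```python
import sqlite3

# Structured data: every row must fit a schema defined before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Thandi", "Durban"), ("Sipho", "Cape Town"), ("Lerato", "Durban")],
)

# Because the fields are predefined, a SQL query can filter on them directly.
durban_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Durban",)
    )
]
print(durban_names)

# Unstructured data: free text kept in its native form, with no fields to query.
review = "Took the taxi from Durban station this morning; the queue was long again."
# Any structure has to be imposed at the time of use, e.g. a naive keyword check.
mentions_durban = "durban" in review.lower()
print(mentions_durban)
conn.close()
```

The query works only because the schema exists up front; the free-text review needs some processing step, however crude, before it can answer the same kind of question.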
Data Types
Where does the Data come from?

• Cost of 1 TB of disk: around R800

• Storage is dirt cheap!
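
As a rough back-of-the-envelope illustration, the quoted price can be applied to the data volumes mentioned earlier. This sketch assumes 1 PB = 1000 TB and counts raw disk only (no redundancy, servers or power). It also mixes data sizes from 2008-2009 with the slide's disk price, so treat the results as orders of magnitude.

```python
# Figures quoted on the slides; the decimal conversion 1 PB = 1000 TB is an assumption.
COST_PER_TB_RAND = 800
TB_PER_PB = 1000


def storage_cost_rand(size_tb: float) -> float:
    """Return the rough disk cost in rand for a data set of size_tb terabytes."""
    return size_tb * COST_PER_TB_RAND


ebay_user_data_tb = 6.5 * TB_PER_PB  # 6.5 PB of user data
genomes_project_tb = 200             # 1000 Genomes Project, 200 TB

print(f"eBay user data: R{storage_cost_rand(ebay_user_data_tb):,.0f}")
print(f"1000 Genomes:   R{storage_cost_rand(genomes_project_tb):,.0f}")
```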
What to Take Away?

● Using data to drive knowledge discovery is not novel.

● Data science emerged from a need to extend understanding to vast,
complex datasets.

● Data science is compute intensive.

● It is important for everyone to develop data science awareness.
Data Science: How is it done?

Data Science and the Scientific Method
A Critical and Established Means for Finding Truth

Source: “The Scientific Method Rap”, https://www.tes.com/lessons/Kj-mMpS6VLrjJg/the-scientific-method, last accessed: 03/05/2018
Scientific Method - The Steps
Making the Data Central

+ = ?
The *Data* Scientific Method
The *Data* Scientific Method – CRISP-DM
1. Business understanding – What does the business need?
2. Data understanding – What data do we have / need? Is it clean?
3. Data preparation – How do we organize the data for modeling?
4. Modeling – What modeling techniques should we apply?
5. Evaluation – Which model best meets the business objectives?
6. Deployment – How do stakeholders access the results?
CRISP-DM - Business Understanding

• Any good project starts with a deep understanding of the
customer’s needs.
• The Business Understanding phase focuses on understanding the
objectives and requirements of the project.
• It can be broken down as follows:
1. Determine business objectives: You should first
“thoroughly understand, from a business perspective, what
the customer really wants to accomplish.” (CRISP-DM
Guide) and then define business success criteria.
2. Assess situation: Determine resource availability and project
requirements, assess risks and contingencies, and conduct a
cost-benefit analysis.
3. Determine data mining goals: In addition to defining the
business objectives, you should also define what success looks
like from a technical data mining perspective.
4. Produce project plan: Select technologies and tools and
define detailed plans for each project phase.
CRISP-DM - Data Understanding

• Building on the foundation of Business Understanding, this phase
focuses on identifying, collecting, and analyzing the data sets that
can help you accomplish the project goals.

• It can be broken down as follows:

1. Collect initial data: Acquire the necessary data and (if
necessary) load it into your analysis tool.
2. Describe data: Examine the data and document its surface
properties like data format, number of records, or field
identities.
3. Explore data: Dig deeper into the data. Query it, visualize
it, and identify relationships among the data.
4. Verify data quality: How clean/dirty is the data? Document
any quality issues.
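
The tasks above can be sketched with the standard library alone. The CSV content and its field names (`name`, `height_m`, `weight_kg`) are hypothetical stand-ins for whatever data a real project would collect.

```python
import csv
import io

# Hypothetical raw export; in practice this would be a file loaded into your tool.
RAW_CSV = """name,height_m,weight_kg
Ann,1.60,64
Ben,1.80,
Cal,abc,90
Dee,1.75,70
"""

# Collect initial data: load it into the analysis tool.
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# Describe data: record count and field names.
print("records:", len(rows))
print("fields:", list(rows[0].keys()))


def is_number(text: str) -> bool:
    """Return True when text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


# Verify data quality: count missing and non-numeric values per numeric field.
quality_issues = {}
for field in ("height_m", "weight_kg"):
    missing = sum(1 for r in rows if r[field] == "")
    invalid = sum(1 for r in rows if r[field] != "" and not is_number(r[field]))
    quality_issues[field] = {"missing": missing, "invalid": invalid}
print(quality_issues)
```

Documenting the counts, rather than silently fixing the values, keeps this phase separate from Data Preparation, where the decision to correct, impute or remove is made.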
CRISP-DM - Data Preparation

• A common rule of thumb is that 80% of the project is
data preparation.

• It can be broken down as follows:

1. Select data: Determine which data sets will be used and
document reasons for inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll
likely fall victim to garbage-in, garbage-out. A common
practice during this task is to correct, impute, or remove
erroneous values.
3. Construct data: Derive new attributes that will be helpful.
For example, derive someone’s body mass index from height
and weight fields.
4. Integrate data: Create new data sets by combining data
from multiple sources.
5. Format data: Re-format data as necessary. For example, you
might convert string values that store numbers to numeric
values so that you can perform mathematical operations.
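
A minimal sketch of three of these tasks (format, clean, construct) on hypothetical records follows. The sample values are invented. Dropping bad rows is only one of the cleaning options the slide lists, chosen here for brevity.

```python
# Hypothetical records with values stored as strings, as they might arrive from a CSV.
raw_records = [
    {"name": "Ann", "height_m": "1.60", "weight_kg": "64"},
    {"name": "Ben", "height_m": "1.80", "weight_kg": ""},    # missing weight
    {"name": "Cal", "height_m": "abc", "weight_kg": "90"},   # erroneous height
    {"name": "Dee", "height_m": "2.00", "weight_kg": "80"},
]


def to_float(text):
    """Format data: convert a numeric string to float, or None if it cannot be parsed."""
    try:
        return float(text)
    except ValueError:
        return None


prepared = []
for record in raw_records:
    height = to_float(record["height_m"])
    weight = to_float(record["weight_kg"])
    # Clean data: this sketch removes rows with erroneous values rather than imputing.
    if height is None or weight is None:
        continue
    # Construct data: derive body mass index = weight (kg) / height (m) squared.
    bmi = weight / height ** 2
    prepared.append({"name": record["name"], "height_m": height,
                     "weight_kg": weight, "bmi": round(bmi, 1)})

for row in prepared:
    print(row)
```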
CRISP-DM - Modeling

• What is widely regarded as data science’s most exciting work is
also often the shortest phase of the project.

• It can be broken down as follows:

1. Select modeling techniques: Determine which algorithms
to try (e.g. regression, neural net).
2. Generate test design: Depending on your modeling approach, you
might need to split the data into training, test, and validation
sets.
3. Build model: As glamorous as this might sound, this
might just be executing a few lines of code like “reg =
LinearRegression().fit(X, y)”.
4. Assess model: Generally, multiple models are competing
against each other, and the data scientist needs to interpret
the model results based on domain knowledge, the pre-
defined success criteria, and the test design.
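
The test design, build and assess tasks can be sketched without any third-party library. The hand-written least-squares fit below is a stand-in for the scikit-learn call quoted above, not a reproduction of it. The data is synthetic, and the 75/25 split is an arbitrary choice for illustration.

```python
import random

# Hypothetical data: y is roughly 2*x + 1 with a little noise (fixed seed for repeatability).
rng = random.Random(0)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + rng.uniform(-0.5, 0.5) for x in xs]

# Generate test design: shuffle the indices, hold out 25% of the rows for testing.
indices = list(range(len(xs)))
rng.shuffle(indices)
split = int(0.75 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]


def fit_line(x_values, y_values):
    """Build model: ordinary least squares for y = slope * x + intercept."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])

# Assess model: mean absolute error on the held-out test rows only.
test_mae = sum(abs(ys[i] - (slope * xs[i] + intercept)) for i in test_idx) / len(test_idx)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MAE={test_mae:.2f}")
```

Scoring on rows the model never saw during fitting is what makes the assessment meaningful; an error measured on the training rows would look better than it should.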
CRISP-DM - Evaluation

• Whereas the Assess Model task of the Modeling phase focuses on
technical model assessment, the Evaluation phase looks more
broadly at which model best meets the business needs and what to do
next.

• It can be broken down as follows:

1. Evaluate results: Do the models meet the business success
criteria? Which one(s) should we approve for the business?
2. Review process: Review the work accomplished. Was
anything overlooked? Were all steps properly executed?
Summarize findings and correct anything if needed.
3. Determine next steps: Based on the previous two tasks,
determine whether to proceed to deployment, iterate further,
or initiate new projects.
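
One way to picture the difference between technical assessment and business evaluation is a check against explicit success criteria. Everything in this sketch is hypothetical: the model names, their scores, and the two criteria (a maximum error and an explainability requirement).

```python
# Hypothetical results from the Modeling phase and hypothetical business criteria.
candidate_models = {
    "linear_regression": {"test_mae_minutes": 6.2, "explainable": True},
    "neural_net": {"test_mae_minutes": 4.1, "explainable": False},
    "decision_tree": {"test_mae_minutes": 4.8, "explainable": True},
}
MAX_ACCEPTABLE_MAE_MINUTES = 5.0  # assumed business success criterion
REQUIRE_EXPLAINABLE = True        # assumed business requirement


def meets_business_criteria(result):
    """Evaluate results: apply the business success criteria, not only accuracy."""
    accurate_enough = result["test_mae_minutes"] <= MAX_ACCEPTABLE_MAE_MINUTES
    explainable_enough = result["explainable"] or not REQUIRE_EXPLAINABLE
    return accurate_enough and explainable_enough


approved = sorted(name for name, result in candidate_models.items()
                  if meets_business_criteria(result))

# Determine next steps: deploy if something was approved, otherwise iterate.
next_step = "proceed to deployment" if approved else "iterate further"
print(approved, "->", next_step)
```

In this made-up case the most accurate model is not the one approved, which is the point of evaluating against business criteria and not against a single technical metric.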
CRISP-DM - Deployment

• A model is not particularly useful unless the customer can
access its results.

• It can be broken down as follows:

1. Plan deployment: Develop and document a plan for
deploying the model.
2. Plan monitoring and maintenance: Develop a thorough
monitoring and maintenance plan to avoid issues during the
operational phase (or post-project phase) of a model.
3. Produce final report: The project team documents a
summary of the project which might include a final
presentation of data mining results.
4. Review project: Conduct a project retrospective about what
went well, what could have been better, and how to improve
in the future.
A Bit Closer to Reality...

Getting our Hands Dirty with a Real Problem
Commuting Times In South Africa
Commuting Times In South Africa (Questions)

1. In which region is public transport most/least efficient?

2. What factors increase an individual’s commute time?

3. Which district experiences the greatest levels of weekday congestion?

4. Does an individual’s weight influence the mode of transportation they use?

5. Pose your own question...
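
As an illustration of how question 1 might be turned into something computable, the sketch below uses made-up trip records. The field names, the region labels "A" and "B", and the choice to define "efficiency" as average speed are all assumptions, not part of the lecture's data set.

```python
from collections import defaultdict

# Made-up survey rows purely for illustration; not real South African commuting data.
trips = [
    {"region": "A", "mode": "public", "distance_km": 10.0, "minutes": 40.0},
    {"region": "A", "mode": "public", "distance_km": 20.0, "minutes": 60.0},
    {"region": "A", "mode": "private", "distance_km": 15.0, "minutes": 20.0},
    {"region": "B", "mode": "public", "distance_km": 12.0, "minutes": 30.0},
    {"region": "B", "mode": "public", "distance_km": 18.0, "minutes": 30.0},
]

# Assumption: "efficiency" means average speed (km/h) over public transport trips.
totals = defaultdict(lambda: {"km": 0.0, "minutes": 0.0})
for trip in trips:
    if trip["mode"] == "public":
        totals[trip["region"]]["km"] += trip["distance_km"]
        totals[trip["region"]]["minutes"] += trip["minutes"]

speed_kmh = {region: t["km"] / (t["minutes"] / 60.0) for region, t in totals.items()}
most_efficient = max(speed_kmh, key=speed_kmh.get)
least_efficient = min(speed_kmh, key=speed_kmh.get)
print(speed_kmh, most_efficient, least_efficient)
```

Writing the definition down as code forces the assumption into the open: a different definition of efficiency, such as cost per kilometre or waiting time, could rank the regions differently.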

Commuting Times In South Africa (Questions)

Consider:

● Realistic
○ Would the data support an answer?
○ Sanity check

● Addresses real needs of client
○ What is gained from the answer?

● Define assumptions
○ Additional data sources
○ Limiting constraints
Thank you

Questions
