
Lecture 1

Data science is an interdisciplinary field focused on extracting knowledge from structured and unstructured data using scientific methods and algorithms. The role of a data scientist involves analyzing large datasets, leveraging machine learning, and communicating insights to bridge business gaps. The CRISP-DM framework outlines the data science process, including business understanding, data preparation, modeling, evaluation, and deployment.

Uploaded by

lufunosape
Copyright © All Rights Reserved

Introduction to Data Science

Lecture 1
22 July 2022
What is data science

Wikipedia:

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains.
What is data science
Data Science Job Growth
Role of Data Scientist

• A data scientist’s role comprises handling large amounts of data and then analyzing it using data-driven methodologies.

• Once they can make sense of the data, they bridge business gaps by communicating their findings to information technology leadership teams and explaining patterns and trends through visualizations.

• Data scientists also leverage machine learning and AI, and draw on their programming knowledge of Java, Python, and SQL, big data tools such as Hadoop, and data mining.

• They need strong communication skills to translate their data discoveries into insights the business can act on.
Data Scientist vs Data Analyst
Where does the Data come from?

• Lots of data is being collected and warehoused:
• Web data, e-commerce
• Financial transactions, bank/credit transactions
• Online trading and purchasing
• Social networks

• Google processes 20 PB a day (2008)
• Facebook has 60 TB of daily logs
• eBay has 6.5 PB of user data + 50 TB/day
• (5/2009) 1000 Genomes Project: 200 TB
Structured vs Unstructured data

• Structured data
• Is data that has been predefined and formatted to a
set structure before being placed in data storage
• The best example of structured data is the relational database:
the data has been formatted into precisely defined fields, such
as credit card numbers or addresses, in order to be easily
queried with SQL.
• Easily used by machine learning algorithms: The largest
benefit of structured data is how easily it can be used by
machine learning.
• Unstructured data
• Is data stored in its native format and not processed until it
is used
• Examples include audio, videos and text data
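
To make the contrast concrete, here is a minimal sketch using only Python's standard library. The table, names, and review text are made up for illustration: the structured records can be queried with SQL straight away, while the free text has to be processed at the moment it is used.

```python
import sqlite3

# Structured: rows must fit a schema that is defined before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Thabo", "Pretoria"), ("Lerato", "Durban"), ("Sipho", "Pretoria")],  # hypothetical rows
)
pretoria = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Pretoria",)
    )
]

# Unstructured: free text kept in its native form; structure is imposed only when used.
review = "Delivery to Pretoria was late, but support in Pretoria was helpful."  # hypothetical text
mentions = review.lower().count("pretoria")

print(pretoria, mentions)
```

The SQL query relies on the predefined `city` field, whereas the text has no fields at all, so even a simple question needs ad hoc processing.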
Data Types
Where does the Data come from?

• Cost of 1 TB of disk: around R800
• Storage is dirt CHEAP!!!
What to Take Away?

● Using data to drive knowledge discovery is not novel.

● Emerged from a need to extend understanding to vast, complex datasets.

● Data Science is compute intensive.

● Important for everyone to develop data science awareness.
Data Science: How is it done?

Data Science and the Scientific Method
A Critical and Established Means for Finding Truth

Source: “The Scientific Method Rap”, https://www.tes.com/lessons/Kj-mMpS6VLrjJg/the-scientific-method, last accessed: 03/05/2018
Scientific Method - The Steps
Making the Data Central

+ = ?
The *Data* Scientific Method
The *Data* Scientific Method – CRISP-DM
1. Business understanding – What does the business need?
2. Data understanding – What data do we have / need? Is it clean?
3. Data preparation – How do we organize the data for modeling?
4. Modeling – What modeling techniques should we apply?
5. Evaluation – Which model best meets the business objectives?
6. Deployment – How do stakeholders access the results?
CRISP-DM - Business Understanding

• Any good project starts with a deep understanding of the customer’s needs.
• The Business Understanding phase focuses on understanding the
objectives and requirements of the project.
• It can be broken down as follows:
1. Determine business objectives: You should first
“thoroughly understand, from a business perspective, what
the customer really wants to accomplish.” (CRISP-DM
Guide) and then define business success criteria.
2. Assess situation: Determine resource availability and project
requirements, assess risks and contingencies, and conduct a
cost-benefit analysis.
3. Determine data mining goals: In addition to defining the
business objectives, you should also define what success looks
like from a technical data mining perspective.
4. Produce project plan: Select technologies and tools and
define detailed plans for each project phase.
CRISP-DM - Data Understanding

• Building on the foundation of Business Understanding, this phase focuses on identifying, collecting, and analyzing the data sets that can help you accomplish the project goals.

• It can be broken down as follows:
1. Collect initial data: Acquire the necessary data and (if
necessary) load it into your analysis tool.
2. Describe data: Examine the data and document its surface
properties like data format, number of records, or field
identities.
3. Explore data: Dig deeper into the data. Query it, visualize
it, and identify relationships among the data.
4. Verify data quality: How clean/dirty is the data? Document
any quality issues.
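
The tasks of this phase can be sketched in a few lines of standard-library Python. The CSV content below is hypothetical and stands in for a real file; the column names are assumptions chosen to match the commuting example later in the lecture.

```python
import csv
import io
from collections import Counter

# Hypothetical raw data standing in for a file loaded from disk.
RAW = """person_id,mode,commute_minutes
1,bus,45
2,car,30
3,taxi,
4,bus,50
5,train,-5
"""

# Collect initial data: load it into the analysis tool (here, a list of dicts).
rows = list(csv.DictReader(io.StringIO(RAW)))

# Describe data: surface properties such as record count and field names.
n_records = len(rows)
fields = list(rows[0].keys())

# Explore data: a first simple query, counting records per transport mode.
mode_counts = Counter(row["mode"] for row in rows)

# Verify data quality: document missing and implausible values.
missing = [r["person_id"] for r in rows if r["commute_minutes"] == ""]
negative = [
    r["person_id"]
    for r in rows
    if r["commute_minutes"] != "" and float(r["commute_minutes"]) < 0
]

print(n_records, fields, dict(mode_counts), missing, negative)
```

Nothing is corrected at this stage. The quality issues are only recorded so that they can be dealt with during data preparation.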
CRISP-DM - Data Preparation

• A common rule of thumb is that 80% of the project is data preparation.

• It can be broken down as follows:
1. Select data: Determine which data sets will be used and
document reasons for inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll
likely fall victim to garbage-in, garbage-out. A common
practice during this task is to correct, impute, or remove
erroneous values.
3. Construct data: Derive new attributes that will be helpful.
For example, derive someone’s body mass index from height
and weight fields.
4. Integrate data: Create new data sets by combining data
from multiple sources.
5. Format data: Re-format data as necessary. For example, you
might convert string values that store numbers to numeric
values so that you can perform mathematical operations.
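
A minimal sketch of the clean, construct, and format tasks, using the body mass index example from the list above. The records and field names are hypothetical, and removing bad rows is only one of the options mentioned (correcting or imputing are the others).

```python
# Hypothetical records; heights in metres, all values stored as strings.
records = [
    {"name": "A", "height_m": "1.80", "weight_kg": "81"},
    {"name": "B", "height_m": "1.60", "weight_kg": ""},     # missing weight
    {"name": "C", "height_m": "1.75", "weight_kg": "-70"},  # erroneous value
    {"name": "D", "height_m": "1.50", "weight_kg": "45"},
]


def to_float(text):
    """Format data: convert a numeric string to float, or None if blank."""
    return float(text) if text.strip() else None


prepared = []
for rec in records:
    height = to_float(rec["height_m"])
    weight = to_float(rec["weight_kg"])
    # Clean data: rows with missing or non-positive values are removed here.
    if height is None or weight is None or height <= 0 or weight <= 0:
        continue
    # Construct data: derive BMI = weight / height^2 as a new attribute.
    bmi = round(weight / height ** 2, 1)
    prepared.append(
        {"name": rec["name"], "height_m": height, "weight_kg": weight, "bmi": bmi}
    )

print(prepared)
```

Only records A and D survive cleaning, and both now carry numeric fields plus the derived `bmi` attribute.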
CRISP-DM - Modeling

• What is widely regarded as data science’s most exciting work is also often the shortest phase of the project.

• It can be broken down as follows:
1. Select modeling techniques: Determine which algorithms
to try (e.g. regression, neural net).
2. Generate test design: Depending on your modeling approach, you
might need to split the data into training, test, and validation
sets.
3. Build model: As glamorous as this might sound, this
might just be executing a few lines of code like “reg =
LinearRegression().fit(X, y)”.
4. Assess model: Generally, multiple models are competing
against each other, and the data scientist needs to interpret
the model results based on domain knowledge, the predefined
success criteria, and the test design.
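
The test design, build, and assess tasks can be sketched as follows. To keep the example free of third-party dependencies, a hand-written least-squares fit stands in for the `LinearRegression().fit(X, y)` call quoted above; the data is synthetic, and the helper names `fit_line` and `mean_abs_error` are made up for this illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and true values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: y = 3x + 2 plus small noise (made up for illustration).
rng = random.Random(0)
xs = [float(i) for i in range(40)]
ys = [3 * x + 2 + rng.uniform(-1, 1) for x in xs]

# Generate test design: hold out 25% of the shuffled data for testing.
indices = list(range(len(xs)))
rng.shuffle(indices)
test_idx, train_idx = indices[:10], indices[10:]

# Build model: fit on the training split only.
model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])

# Assess model: measure error on data the model has not seen.
test_mae = mean_abs_error(
    model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]
)

print(model, test_mae)
```

Fitting really is the shortest step here. Most of the code deals with the split and the assessment, which is what makes the reported error trustworthy.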
CRISP-DM - Evaluation

• Whereas the Assess Model task of the Modeling phase focuses on technical model assessment, the Evaluation phase looks more broadly at which model best meets the business needs and what to do next.

• It can be broken down as follows:
1. Evaluate results: Do the models meet the business success
criteria? Which one(s) should we approve for the business?
2. Review process: Review the work accomplished. Was
anything overlooked? Were all steps properly executed?
Summarize findings and correct anything if needed.
3. Determine next steps: Based on the previous two tasks,
determine whether to proceed to deployment, iterate further,
or initiate new projects.
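
As a rough sketch of how the business success criteria feed into this decision: the model names, error values, and threshold below are all hypothetical, and a real evaluation would weigh more than a single number.

```python
# Hypothetical results from the Modeling phase: test error per candidate model.
candidates = {"linear": 4.2, "tree": 3.1, "neural_net": 3.0}

# Hypothetical business success criterion: average error of at most 3.5 minutes.
MAX_ACCEPTABLE_ERROR = 3.5

# Evaluate results: which models meet the business success criterion?
approved = sorted(
    name for name, err in candidates.items() if err <= MAX_ACCEPTABLE_ERROR
)

# Determine next steps: deploy if something qualifies, otherwise iterate.
next_step = "proceed to deployment" if approved else "iterate further"

print(approved, next_step)
```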
CRISP-DM - Deployment

• A model is not particularly useful unless the customer can access its results.

• It can be broken down as follows:
1. Plan deployment: Develop and document a plan for
deploying the model.
2. Plan monitoring and maintenance: Develop a thorough
monitoring and maintenance plan to avoid issues during the
operational phase (or post-project phase) of a model.
3. Produce final report: The project team documents a
summary of the project which might include a final
presentation of data mining results.
4. Review project: Conduct a project retrospective about what
went well, what could have been better, and how to improve
in the future.
A Bit Closer to Reality...

Getting our Hands Dirty with a Real Problem
Commuting Times In South Africa
Commuting Times In South Africa (Questions)

1. In which region is public transport most/least efficient?
2. What factors increase an individual’s commute time?
3. Which district experiences the greatest levels of weekday congestion?
4. Does an individual’s weight influence the mode of transportation they use?
5. Pose your own question...
Commuting Times In South Africa (Questions)

Consider:

● Realistic
○ Would the data support an answer?
○ Sanity Check

● Addresses real needs of client
○ What is gained from the answer?

● Define Assumptions
○ Additional data sources
○ Limiting constraints
Thank you

Questions
