Data Science 101
Arik Pelkey
Pentaho Senior Director – Product Marketing, Hitachi Vantara
Scott Cooley
Pentaho Data Scientist, Hitachi Vantara
Agenda
This session will provide an introduction to data science fundamentals.
• What is Data Science?
• Common Use Cases and Algorithms
• The Data Science Process
• Building a Data Science Team
• The Future
AI, Machine Learning, and Deep Learning
• AI: Getting machines to do what humans are good at
• Machine Learning: Feeding an algorithm data to learn and predict something
• Deep Learning: A type of machine learning
Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/.
Data Science: Solving Problems with Data
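[Venn diagram of three overlapping skill areas]
• HACKING SKILLS: computer science, data engineering and wrangling, coding
• MATH AND STATISTICS KNOWLEDGE: algorithms and numerical techniques to derive insights
• SUBSTANTIVE EXPERIENCE: domain knowledge, business acumen, experience, value to the business
• Overlaps: Machine Learning (hacking + math and statistics), Traditional Research (math and statistics + substantive experience), Danger Zone! (hacking + substantive experience)
• Center: DATA SCIENCE
• Additional annotation: understanding of the underlying assumptions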
Diagram from Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.
What’s all the fuss?
These techniques were created many, many years ago:
• Bayes' Theorem: Thomas Bayes, mid-1700s
• Regression: Legendre and Gauss (least squares), early 1800s; Galton, late 1800s
• Neural Networks: McCulloch and Pitts, early 1940s
Think about All Our Data and Compute
It is still GROWING!
• SKA (Square Kilometer Array Telescope), 2020: will generate as much data in a day as the entire PLANET does in a year!
https://www.computerworld.com.au/article/392735/ska_telescope_generate_more_data_than_entire_internet_2020/.
Types of Machine Learning
• Regression – Looking for a statistical relationship across variables that may give us an estimate of a particular outcome.
• Classification – Similar to regression but looking for separations in the data given predefined classes. (Supervised)
• Clustering – Do not have predefined classes but trying to find groups or sets based upon data at hand. (Unsupervised)
• Anomaly Detection – Identification of outliers based upon expected ranges of data.
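As a rough illustration of anomaly detection based on expected ranges, the sketch below flags values that fall outside a range derived from the data's quartiles. The interquartile-range rule, the function name `find_outliers`, and the sample readings are assumptions chosen for illustration; they do not come from the slides.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return values outside the expected range [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with one value far from the rest.
readings = [20.1, 19.8, 20.4, 20.0, 19.9, 20.2, 35.7, 20.3]
outliers = find_outliers(readings)
print(outliers)
```

Other definitions of "expected range" (fixed thresholds, standard deviations from the mean) would fit the same description; the quartile rule is just one common choice.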
Labelled vs Unlabelled
Let's say we want to classify houses by size.
Given features or feature set:
FullBath  HalfBath  Bedrooms  Home Age  Size Label
1         0         2         56        M
1         1         3         59        L
2         1         3         20        M
2         1         3         19        S
Supervised Learning
Use the labels to build a model. The model is then used to classify a new house's size based ONLY on the known feature set.
Unsupervised
SIZE is missing! We need to look for similarities in the data and group them into clusters.
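The contrast can be sketched in a few lines using the four houses from the table. This is a minimal illustration, not the presenters' method: it assumes a 1-nearest-neighbour rule for the supervised case and a tiny k-means loop for the unsupervised case, and the new house and the starting centroids are hypothetical. Features are left unscaled, so Home Age dominates the distances.

```python
import math

# Feature rows from the table: (FullBath, HalfBath, Bedrooms, HomeAge)
features = [(1, 0, 2, 56), (1, 1, 3, 59), (2, 1, 3, 20), (2, 1, 3, 19)]
labels = ["M", "L", "M", "S"]

def classify_nearest(new_row, rows, row_labels):
    """Supervised: return the label of the closest labelled row (1-nearest-neighbour)."""
    distances = [math.dist(new_row, row) for row in rows]
    return row_labels[distances.index(min(distances))]

def cluster_kmeans(rows, centroids, max_iter=10):
    """Unsupervised: assign rows to the nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(row, centroids[i]))
            for row in rows
        ]
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [row for row, a in zip(rows, assignment) if a == i]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid unchanged
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return assignment

new_house = (2, 1, 3, 21)  # hypothetical house with no size label
predicted_size = classify_nearest(new_house, features, labels)

# No labels used here: start from the first and last rows as centroids (an assumption).
groups = cluster_kmeans(features, centroids=[features[0], features[-1]])
print(predicted_size, groups)
```

The supervised call needs `labels`; the clustering call never sees them and only returns group indices, which a person would still have to interpret.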
More on Machine Learning
Machine learning is a methodology that creates a model from sample data and then uses that model, in an algorithmic way, to make a prediction or inform a strategy.
SUPERVISED LEARNING MODEL
• Historical records that contain square feet, number of bathrooms, zip code….
• Records that contain the price the house sold for
• Iterate the algorithm over the combined data to train the model
• Use the trained model to predict outcome on new records
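The supervised learning steps above can be sketched as a small training loop. This is an illustrative sketch under stated assumptions: the records are hypothetical, the model is a plain linear regression trained by gradient descent, square feet are expressed in thousands and prices in $1000s to keep the numbers well scaled, and zip code is left out because it is categorical.

```python
# Hypothetical historical records:
# (square feet in thousands, bathrooms) -> sale price in $1000s
history = [
    ((1.0, 1), 170.0),
    ((1.5, 2), 240.0),
    ((2.0, 2), 290.0),
    ((2.5, 3), 360.0),
    ((3.0, 2), 390.0),
    ((1.2, 1), 190.0),
]

def predict(weights, features):
    """Linear model: bias plus a weighted sum of the features."""
    bias, *coefs = weights
    return bias + sum(c * x for c, x in zip(coefs, features))

def train(records, learning_rate=0.05, epochs=20000):
    """Iterate over the combined features and prices to fit the weights."""
    n_features = len(records[0][0])
    weights = [0.0] * (n_features + 1)  # bias first, then one weight per feature
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for features, price in records:
            error = predict(weights, features) - price
            grads[0] += error
            for j, x in enumerate(features):
                grads[j + 1] += error * x
        # Step against the average gradient of the squared error.
        weights = [w - learning_rate * g / len(records) for w, g in zip(weights, grads)]
    return weights

model = train(history)

# Use the trained model on a new record (hypothetical house: 2,200 sq ft, 2 baths).
estimate = predict(model, (2.2, 2))
print(round(estimate, 1))
```

In practice a library implementation would normally replace the hand-written loop; the loop is spelled out here only to make the "iterate to train, then predict" steps visible.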
The Data Science Process: Getting from Raw Data to Outcomes
• Formal framework: CRISP-DM (Cross Industry Standard Process for Data Mining)
• The Data Science Workflow: created by Joe Blitzstein and Hanspeter Pfister for the Harvard Data Science course
Specialist Traditional Data Science Team
Data Scientist (DS)
– Prepares data and engineers features. Most valuable skill: training models.
Data Engineer (DE)
– Data acquisition focus. Builds data pipelines. Not uncommon to have a 5:1 DE:DS ratio.
Data Analyst (DA)
– Assists DS with data prep
Application Architect (AA)
– Designs the complete solution; deploys and maintains models in production
Mythical Creatures
Trends
• Automation
• Tools for Citizen Data Scientists
• Pre-trained models in the cloud
Hiring Guidance
Defining Success
• Easy for the tangible
– Search order optimization
– Recommendation engine or CTR
• Hard for others
– Lead scoring
– Attrition
• Try to measure direct outcomes
• Rarely a silver bullet
• Think ROI
Typical Data Science Project
Stages, in order:
• Understand business objectives
• ID and procure training data
• Prepare data and build new features
• Train model
• Deploy models
• Update models
Roles shown across the stages: DS (five stages), DE and DA (one stage each), AA (three stages).
Preventive Maintenance: Caterpillar Marine Asset Intelligence
[Architecture diagram]
• Data sources: fleet data via satellite; local equipment sensor and server data; cross-department operations data; scheduling/ERP
• Data integration and data marts
• Data Scientist: data mining and predictive maintenance
• Business User (COO): reporting on operations and efficiency
• Onboard and Onshore: dashboards and reports on machine performance
The Future
• Scaling up / enabling more data scientists
• Model management
• Improved productivity
• Support for containerized applications
Pentaho ML Orchestration
• Makes data science teams more productive
• Broad support for open source libraries in various languages
Summary
• What is Data Science?
• Common Use Cases and Algorithms
• The Data Science Process
• Building a Data Science Team
• The Future
Next Steps
Want to learn more?
• Schedule a Meet the Expert
• Read Mark Hall’s Machine Learning with Pentaho Blog