Lesson 1.
DATA SCIENCE AND
AUTOMATION COURSE
MASTER DEGREE SMART
TECHNOLOGY ENGINEERING
Introduction
TEACHER
Mirko Mazzoleni
PLACE
University of Bergamo
Who I am
• Name: Mirko Mazzoleni
• Studies: Ph.D. Engineering and Applied Sciences at University of Bergamo (Control
specialization) + Master degree Computer Engineering (CE) at University of Bergamo
• Currently: Assistant Professor @ University of Bergamo
✓ System identification, machine learning, fault detection, condition monitoring
✓ System identification and data analysis (Master Degree Computer Engineering)
✓ Data science and automation (Master Degree Mechanical Engineering)
• Contact details:
✓ mirko.mazzoleni@unibg.it
✓ https://mirkomazzoleni.github.io/
✓ http://cal.unibg.it/ (CAL research laboratory)
✓ https://www.facebook.com/calunibg/
Course content
Part I: Data science
1. Introduction to data science
1.1 The business perspective
1.2 CRISP-DM process
1.3 Supervised vs. unsupervised problems
2. Linear regression
3. Feasibility of learning
3.1 Bias-Variance tradeoff
4. Logistic regression
5. Overfitting and regularization
5.1 Validation and cross-validation
5.2 Performance metrics
6. Decision trees
7. Neural networks
8. Machine vision
8.1 Classic approaches
8.2 Convolutional neural networks and deep learning
9. Unsupervised learning
9.1 k-means clustering
9.2 Principal Component Analysis
10. Fault diagnosis
10.1 Model-based fault diagnosis
10.2 Signal-based fault diagnosis
10.3 Data-driven fault diagnosis
Part II: Automation
12. Introduction to industrial automation
13. Introduction to PLC
14. Ladder language
15. Structured text language
16. Automatic PLC code generation
17. Laboratory experience
Evaluation
• Written exam (2 hours), up to 25 points: theoretical open questions and exercises
• [OPTIONAL] Small data analysis project (groups of max 3 people), up to 6 points, added to the exam score
Data science projects in the CAL research group
1. Forecasting of sales volume (for food industry)
• Development of the data management platform
• Algorithm design
• Testing/validation
2. Image processing
• Plant disease classification
• People identification and classification
• Blimp
3. Fault diagnosis
• Bearing inner race fault
• Ballscrew jam in EMA
4. Industrial automation
• ICT for remote maintenance
• Automatic transplant machine
Outline
1. Introduction to data science
2. The business perspective and the CRISP-DM process
3. Supervised vs. unsupervised problems
Why
Business value created by AI up to 2030 [1]: $13 trillion
• Retail: $0.8T
• Travel: $480B
• Logistics: $475B
• Automotive & assembly: $405B
• Materials: $300B
• Advanced electronics & semiconductors: $291B
• Healthcare systems & services: $267B
• High tech: $267B
• Telecom: $174B
• Oil & gas: $173B
• Agriculture: $164B
It is difficult to find an industrial sector that will not benefit from AI in the near future
Why
We will use the terms “machine learning”, “data mining” and “data science” quite interchangeably in this course
Data science has been called the sexiest job of the 21st century
• Virtually every aspect of business is now open to data collection (operations, manufacturing, supply-chain management, customer behaviour, marketing campaigns)
• Collected information needs to be analyzed properly in order to get actionable results
• A huge amount of data requires specific infrastructure to be handled
• A huge amount of data requires computational power to be analyzed
• We can let computers make decisions based on previous examples
• Rise of specific job titles
Learning examples
Recent years: stunning breakthroughs in computer vision applications
What learning is about
Machine learning and data science are worth applying if:
1. A pattern exists
2. We cannot pin it down mathematically (an analytical solution does not exist)
3. We have data on it
Assumptions 1 and 2 are not mandatory:
• If a pattern does not exist, we simply do not learn anything
• If the pattern can be described mathematically, learning will presumably not give the best relation
• The real constraint is assumption 3
Data types
The data can have different formats. The most typical is that of a table

House area (feet²)   # bedrooms   Price (1000$)
523                  1            115
645                  1            150
708                  2            210
1034                 3            280
2290                 4            355
2545                 4            440

• AIM: predict house prices (Regression)
• The data can come from a database or from .csv, Excel files…
• A → B: learn the relation from House area to Price
• A → B: learn the relation from House area AND # bedrooms to Price
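The two A → B problems on the house table can be sketched with ordinary least squares. This is only a minimal illustration: it assumes NumPy is available and that a linear model is adequate, and the helper names `fit_linear` and `predict`, as well as the 1500 feet² query, are hypothetical and not part of the course material.

```python
import numpy as np

# Table from the slide: house area (feet^2), number of bedrooms, price (1000$)
area = np.array([523, 645, 708, 1034, 2290, 2545], dtype=float)
bedrooms = np.array([1, 1, 2, 3, 4, 4], dtype=float)
price = np.array([115, 150, 210, 280, 355, 440], dtype=float)


def fit_linear(features, target):
    """Least-squares fit of target ~ w0 + w1*x1 + ... (hypothetical helper)."""
    X = np.column_stack([np.ones(len(target)), features])  # intercept column first
    weights, *_ = np.linalg.lstsq(X, target, rcond=None)
    return weights


def predict(weights, features):
    """Apply the fitted weights to new inputs (hypothetical helper)."""
    X = np.column_stack([np.ones(len(features)), features])
    return X @ weights


# A -> B with one input: area -> price
w_area = fit_linear(area, price)
# A -> B with two inputs: (area, bedrooms) -> price
both = np.column_stack([area, bedrooms])
w_both = fit_linear(both, price)

mse_area = float(np.mean((predict(w_area, area) - price) ** 2))
mse_both = float(np.mean((predict(w_both, both) - price) ** 2))
query_price = float(predict(w_area, np.array([1500.0]))[0])

print("estimated price (1000$) for 1500 feet^2:", query_price)
print("training MSE, area only:", mse_area, "| area + bedrooms:", mse_both)
```

Adding the second input can only lower (or leave unchanged) the error on the six rows used for fitting; whether it also helps on houses not in the table is a separate question.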
Data types
Another type of data can be an image
• AIM: recognize if there is a cat in the image (Classification)
• Learn the relation from an image to a «class of belonging» (cat vs. not cat)
[Figure: example pictures with their labels, Cat or Not cat]
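The image → class relation can be illustrated with a deliberately simple stand-in: a nearest-centroid classifier on tiny arrays. Everything here is an assumption for illustration (the 2x2 "images", their pixel values, and the helper names `fit_centroids` and `classify`); real image classifiers work on far larger inputs and use more capable models.

```python
import numpy as np

# Hypothetical 2x2 grayscale "images" (pixel values in [0, 1]); real images
# are much larger, but the A -> B structure (image -> label) is the same.
train_images = np.array([
    [[0.9, 0.8], [0.1, 0.2]],  # cat
    [[1.0, 0.7], [0.0, 0.1]],  # cat
    [[0.1, 0.2], [0.9, 0.8]],  # not cat
    [[0.0, 0.3], [1.0, 0.9]],  # not cat
])
train_labels = np.array(["cat", "cat", "not cat", "not cat"])


def fit_centroids(images, labels):
    """Average the flattened images of each class (nearest-centroid classifier)."""
    flat = images.reshape(len(images), -1)
    return {str(label): flat[labels == label].mean(axis=0)
            for label in np.unique(labels)}


def classify(image, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    pixels = image.reshape(-1)
    return min(centroids, key=lambda label: np.linalg.norm(pixels - centroids[label]))


centroids = fit_centroids(train_images, train_labels)
new_image = np.array([[0.8, 0.9], [0.2, 0.1]])  # hypothetical unseen picture
print(classify(new_image, centroids))
```

The labeled pictures play the role of the Picture/Label pairs on the slide: the classifier can only be built because each training image comes with its class.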
Data are dirty
Garbage IN, garbage OUT
Data problems:
• Missing values
• Incorrect values

House area (feet²)   # bedrooms   Price (1000$)
523                  1            115
645                  1            0.001
708                  unknown      210
1034                 3            unknown
unknown              4            355
2545                 unknown      440

Different data types:
• Structured data
• Unstructured data: images, audio, text
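A first check for the two data problems above (missing values and incorrect values) can be sketched in plain Python. The plausibility ranges in `VALID_RANGE` and the helper `find_problems` are assumptions made for this sketch; in practice the ranges come from domain knowledge.

```python
# Rows from the slide: (house area in feet^2, bedrooms, price in 1000$).
# None marks an "unknown" entry.
rows = [
    (523, 1, 115),
    (645, 1, 0.001),
    (708, None, 210),
    (1034, 3, None),
    (None, 4, 355),
    (2545, None, 440),
]
COLUMNS = ("area", "bedrooms", "price")
# Assumed plausibility ranges (hypothetical values chosen for illustration)
VALID_RANGE = {"area": (100, 10_000), "bedrooms": (0, 20), "price": (10, 5_000)}


def find_problems(table):
    """Return a list of (row_index, column, problem) tuples."""
    problems = []
    for i, row in enumerate(table):
        for column, value in zip(COLUMNS, row):
            low, high = VALID_RANGE[column]
            if value is None:
                problems.append((i, column, "missing"))
            elif not low <= value <= high:
                problems.append((i, column, "out of range"))
    return problems


problems = find_problems(rows)
for problem in problems:
    print(problem)

# Dropping every affected row is the bluntest possible fix
bad_rows = {index for index, _, _ in problems}
clean_rows = [row for i, row in enumerate(rows) if i not in bad_rows]
print("rows left after dropping:", clean_rows)
```

On this table, dropping every affected row leaves a single house, which shows why filling in missing values is often preferred to discarding rows.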
Machine learning vs. data science

House area (feet²)   # bedrooms   # bathrooms   Recently renovated   Price (1000$)
523                  1            2             No                   115
645                  1            3             No                   150
708                  2            1             No                   210
1034                 3            3             Yes                  280
2290                 4            4             No                   355
2545                 4            5             Yes                  440

A (first four columns) → B (Price)

Machine learning
• Predict B given A
• Output: code and a running software program (web site / mobile app)
Data science
• Houses with 3 bathrooms are more expensive than those with 2 bathrooms of the same size
• Recently renovated houses cost 15% more
• Output: slide deck
Machine learning vs. data science
[Figure: diagram relating AI, ML, deep learning, other tools and data science]
Data-analytic thinking
Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data, rather than purely on intuition [2, 3]
• Some decisions can be made automatically (finance, recommendations)
• Data engineering and processing are a fundamental support to industrial analytics
• Data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets
✓ Need to invest to acquire the right data (even at the cost of losing money)
✓ Understand data science even if you will not practice it yourself
Picture taken from [2]: Provost, Foster, and Tom Fawcett. “Data Science for Business: What you need to know about data mining and data-analytic thinking”. O'Reilly Media, Inc., 2013
Approaching a data mining problem
Cross Industry Standard Process for Data Mining (CRISP-DM)
https://mineracaodedados.files.wordpress.com/2012/04/the-crisp-dm-model-the-new-blueprint-for-data-mining-shearer-colin.pdf
Iteration is the rule rather than the exception. The phases are:
• Business understanding
• Data understanding
• Data preparation
• Modeling
• Evaluation
• Deployment
CRISP-DM: Business understanding
Cast the business problem into one or more data science problems
• Frame the problem such that one or more sub-problems involve building models for a data mining task (classification, regression, probability estimation, and so on)
• Think carefully about the use scenario
✓ What exactly do we want to do?
✓ How exactly would we do it?
✓ What parts of this use scenario constitute possible data mining models?
CRISP-DM: Data understanding
Identify the available and needed data
• Costs/benefits of acquiring each source of data
• Are the available data related to the business problem?
• Can we use a proxy for data that we cannot obtain?
• As data understanding progresses, the solution paths may change
CRISP-DM: Data preparation
Clean and prepare data for use with algorithms
• Usually the algorithms we employ require data in a format different from the available one
✓ Convert strings to numbers, infer missing data, import data from Excel files, …
• Data preprocessing/cleaning/labeling [4] (most of the time in a data science project is spent here)
• Take care not to use historical data that will not be available when the model is used
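Two of the preparation steps named above, converting strings to numbers and inferring missing data, can be sketched with the standard library only. The raw records, the Yes/No encoding and the median fill are assumptions chosen for this sketch; other encodings and imputation strategies are equally possible.

```python
from statistics import median

# Raw records as they might arrive from a .csv or Excel export
# (hypothetical strings, loosely based on the house table).
raw = [
    {"area": "523", "bedrooms": "1", "renovated": "No", "price": "115"},
    {"area": "708", "bedrooms": "unknown", "renovated": "No", "price": "210"},
    {"area": "1034", "bedrooms": "3", "renovated": "Yes", "price": "280"},
    {"area": "2545", "bedrooms": "unknown", "renovated": "Yes", "price": "440"},
]


def to_number(text):
    """Convert a numeric string to float; return None if it cannot be parsed."""
    try:
        return float(text)
    except ValueError:
        return None


def prepare(records):
    """Convert strings to numbers and fill missing bedrooms with the median."""
    table = [
        {
            "area": to_number(r["area"]),
            "bedrooms": to_number(r["bedrooms"]),
            "renovated": 1 if r["renovated"] == "Yes" else 0,  # Yes/No -> 1/0
            "price": to_number(r["price"]),
        }
        for r in records
    ]
    known = [row["bedrooms"] for row in table if row["bedrooms"] is not None]
    fill = median(known)  # simple imputation from the known values
    for row in table:
        if row["bedrooms"] is None:
            row["bedrooms"] = fill
    return table


prepared = prepare(raw)
for row in prepared:
    print(row)
```

The fill value is computed from the data at hand; when a model is later deployed, the same value should be reused rather than recomputed from data that would not have been available at prediction time.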
CRISP-DM: Modeling
Estimate a mathematical model to extract patterns from data
• In most cases, standard algorithms can be applied directly to the data
• The aim is to find a model in order to use it on unseen data
• The type of the model has to be chosen based on:
✓ What data mining task we want to solve
✓ Performance measures
✓ Availability of libraries for deployment
CRISP-DM: Evaluation
Assess the validity of the results
• We could find patterns that exist only in the particular dataset that we have available (overfitting)
• The devised solution and the model’s decisions should be comprehensible to the stakeholders
• Usually evaluation is performed before deploying. In this case, build environments that closely mimic the real use scenario
• Evaluation can also be performed on-line (in production) [5]
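The overfitting risk mentioned above can be checked by holding out part of the data and comparing errors. The sketch below assumes NumPy and uses synthetic data (a linear trend plus noise, fixed seed); the data, the polynomial degrees and the helper `mse` are illustrative choices, not taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption for illustration): linear trend plus noise
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(scale=0.2, size=x.size)

# Hold out every other point to mimic "unseen" data
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def mse(coeffs, x_values, y_values):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x_values) - y_values) ** 2))


results = {}
for degree in (1, 5):
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on training points only
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The more flexible polynomial cannot do worse on the training points, so a low training error alone says little; the held-out error is the figure to compare when judging whether a pattern generalizes.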
CRISP-DM: Deployment
Put the model (or the data mining steps) into production
• Usually requires re-coding the model to make it compatible with the existing technology
• This step can require a considerable investment of time. Usually the data science team builds a prototype that is then passed to the development team
• For this reason, it is advisable to involve a member of the development team in the early phases of the data science project
• Deployment can involve not only the final model, but also the previous phases (data collection, model building, evaluation)
From business problems to data mining tasks
Each data science project is unique. The aim is to decompose the business problem into subtasks for which a common approach exists.
There are many machine learning algorithms. However, they address a handful of tasks:
• Classification and class probability estimation
• Regression
• Similarity matching
• Clustering
• Co-occurrence grouping
• Profiling
• Link prediction
• Data reduction
• Causal modeling
Supervised vs unsupervised methods
A specific data science task can be tackled via a supervised or an unsupervised approach
Unsupervised
“Do our customers naturally fall into different groups?”
There is no specific target (or purpose) for the grouping. The aim is only to find similarities between individuals
Supervised
“Can we find groups of customers who have particularly high likelihoods of canceling their service soon after their contracts expire?”
There is a specific target: find people who will leave when their contract expires. In this case, there must be data on the target. The value of the target for an individual is called label or class. We need a dataset that includes people known to have left (labeled dataset)
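The unsupervised question ("do our customers naturally fall into different groups?") can be sketched with a plain k-means loop. The customer features and values are hypothetical, NumPy is assumed, and seeding the centroids with the first k points is a simplification made so the sketch is deterministic.

```python
import numpy as np

# Hypothetical customers described by two numeric features
# (e.g. monthly spend and number of support calls); no labels are given.
customers = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
    [8.0, 8.0], [9.0, 9.5], [8.5, 9.0],
])


def k_means(points, k, n_iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = points[:k].copy()
    for _ in range(n_iterations):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster is empty
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids


labels, centroids = k_means(customers, k=2)
print("group of each customer:", labels.tolist())
print("group centers:", centroids)
```

No target column is used anywhere in this sketch. The supervised churn question would instead need an extra column recording, for each past customer, whether they left.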
Supervised vs unsupervised methods
Supervised:
• Classification and class probability estimation
• Regression
• Causal modeling
Supervised or unsupervised:
• Similarity matching
• Link prediction
• Data reduction
Unsupervised:
• Clustering
• Co-occurrence grouping
• Profiling
Business problems as data science examples
Supervised:
• Spam e-mail detection system
• Credit approval
• Recognize objects in images
• Find the relation between house prices and house sizes
• Predict the stock market
Unsupervised:
• Market segmentation
• Market basket analysis
• Language models (word2vec)
• Social network analysis
• Low-order data representations
Supervised or unsupervised:
• Movie recommendation
Additional resources
MOOC
• Learning from data (Yaser S. Abu-Mostafa - edX)
• Machine learning (Andrew Ng - Coursera)
• Deep learning (Andrew Ng - Coursera)
• The analytics edge (Dimitris Bertsimas - edX)
• Statistical learning (Trevor Hastie and Robert Tibshirani - Stanford Lagunita)
Books
• Data science for business (Foster Provost, Tom Fawcett)
• An Introduction to Statistical Learning, with Applications in R (Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani)
• Neural Networks and Deep Learning (Michael Nielsen)
• Pattern Recognition and Machine Learning (Christopher Bishop)
References
1. Notes from the AI frontier: Modeling the impact of AI on the world economy, 2018.
2. Provost, Foster, and Tom Fawcett. “Data Science for Business: What you need to know about data mining and
data-analytic thinking”. O'Reilly Media, Inc., 2013.
3. Brynjolfsson, E., Hitt, L. M., and Kim, H. H. “Strength in numbers: How does data driven decision making affect firm
performance?” Tech. rep., available at SSRN: http://ssrn.com/abstract=1819486, 2011.
4. Pyle, D. “Data Preparation for Data Mining”. Morgan Kaufmann, 1999.
5. Kohavi, R., and Longbotham, R. “Online experiments: Lessons learned”. Computer, 40 (9), 103–105, 2007.
6. Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. “Learning from data”. AMLBook, 2012.
7. Ng, Andrew. “Machine learning”. Coursera MOOC. (https://www.coursera.org/learn/machine-learning)