Introduction to Machine Learning

Master ESA - University of Orléans

Christophe HURLIN

University of Orléans and IUF

Master Econométrie et Statistique Appliquée

Christophe HURLIN (University of Orléans and IUF) Introduction to Machine Learning September 14, 2025 1 / 123
Introduction

Outline

1. Introduction

2. AI and ML: Key Definitions

3. Basic Concepts of ML

4. ML Algorithms

5. Taxonomy of Data


Objectives of the Session

1. Understand the distinction between Artificial Intelligence, Machine Learning, Deep Learning, and NLP.
2. Define key ML concepts: model, algorithm, learning modes, supervised vs. unsupervised learning.
3. Identify the main prediction tasks: regression and classification.
4. Introduce the ML vocabulary: features, targets, labels, hypotheses, and loss functions.
5. Learn about the most frequently used ML algorithms in economics and finance.
6. Understand the typology of data and how data structure affects ML applications.


Recommended Books on Machine Learning

• Hull, J. (2021). Machine Learning in Business: An Introduction to the World of Data Science.
• Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. 2nd ed. Springer.
• Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow. O’Reilly Media.


Recommended Readings on ML in Economics and Finance

A selection of key references for understanding the application of Machine Learning in economics and public policy:

• Mullainathan, S. and Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87–106.
• Varian, H. (2014). Big Data: New Tricks for Econometrics. Journal of Economic Perspectives, Spring, 3–28.
• Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., and Mullainathan, S. (2018). Human Decisions and Machine Predictions. Quarterly Journal of Economics, 133(1), 237–293.
• Haghighi, M., Joseph, A., Kapetanios, G., Kurz, C., Lenza, M., and Marcucci, J. (2024). Machine Learning for Economic Policy. Journal of Econometrics, in press.
• Desai, A. (2023). Machine Learning for Economics Research: When, What and How. Bank of Canada, Staff Analytical Note 2023–16.
• Athey, S. and Imbens, G. (2019). Machine Learning Methods That Economists Should Know About. Annual Review of Economics, 11, 685–725.

AI and ML: Key Definitions

Outline

1. Introduction

2. AI and ML: Key Definitions

3. Basic Concepts of ML

4. ML Algorithms

5. Taxonomy of Data


Many Words, Many Concepts


Source: Desai (2023)


Definition: Artificial Intelligence

Artificial Intelligence (AI) is a branch of computer science dedicated to designing systems capable of performing tasks that would normally require human intelligence. These tasks include reasoning, learning, perception, understanding natural language, and decision-making in uncertain environments.

Remark: The terms Artificial Intelligence and Machine Learning are often confused. While AI is
the broader field, ML is a subfield of AI that focuses specifically on systems that learn from data.


Definition: Generative Artificial Intelligence

Generative AI (GenAI) refers to a class of AI models capable of creating new content (such as text, images, audio, or code) by learning the underlying patterns of existing data. These models do not merely classify or predict; they generate data that resembles the training data, often using architectures such as Generative Adversarial Networks (GANs) or Large Language Models (LLMs).


Definition: Machine Learning

Machine Learning (ML) is a subfield of Artificial Intelligence that focuses on developing algorithms and statistical models that allow computers to perform specific tasks without being explicitly programmed. Instead, these systems learn patterns and rules from data and improve their performance through experience.


Definition: Natural Language Processing

Natural Language Processing (NLP) is a subfield of Artificial Intelligence that focuses on the automatic analysis, understanding, and generation of text written in natural language. It enables machines to process and extract meaning from human language in textual form, using computational and statistical techniques.


Definition: Deep Learning

Deep Learning is a branch of Machine Learning based on artificial neural networks with
many layers (hence “deep”). These models are capable of learning complex patterns from
large amounts of data and are particularly effective in tasks such as image recognition,
natural language processing, and speech analysis.


AI, ML, NLP, Deep Learning and Gen AI

Source: Misra (2024), Medium


Source: IBM, 2025


Definition: Large Language Models (LLMs)

Definition: Large Language Model

A Large Language Model (LLM) is a deep learning model trained on massive corpora
of text data to understand, generate, and manipulate human language.
LLMs rely on advanced neural network architectures, typically based on transformers,
to model the statistical relationships between words, phrases, and contexts. They are
capable of a wide range of natural language processing (NLP) tasks, including:
• Text generation
• Summarization
• Translation
• Question answering
• Dialogue systems

LLMs are pre-trained on vast corpora and often fine-tuned for specific applications or
domains.


LLM and other technologies

Source: Pressman et al. (2024)


Source: Alomari (2024)


Examples of Prominent LLMs

Widely Used Large Language Models (as of 2025)


• GPT-4 – Developed by OpenAI. Used extensively in chatbots, content generation,
and research.
• Claude 3 – Developed by Anthropic. Designed for safety, transparency, and
alignment.
• Gemini 1.5 – Google DeepMind’s multi-modal model with language and reasoning
capabilities.
• Mistral – A family of open-weight, efficient models (e.g., Mistral-7B, Mixtral).
• LLaMA 3 – Meta’s open-source model series, widely adopted in academic and
commercial settings.
• Command R+ – Cohere’s model optimized for retrieval-augmented generation
(RAG).

These models are trained on large text corpora and are deployed through APIs or integrated into enterprise solutions.


Popular LLMs in 2025

Source: GraffersID, 2025


AI and ML: Key Definitions

Key Concepts

1. Artificial Intelligence
2. Generative Artificial Intelligence
3. Machine Learning
4. Deep Learning
5. Natural Language Processing (NLP)
6. Large Language Models

Basic Concepts of ML

Outline

1. Introduction

2. AI and ML: Key Definitions

3. Basic Concepts of ML

4. ML Algorithms

5. Taxonomy of Data


Objectives of this section

This section introduces key concepts and terminology of ML, including:

1. Core vocabulary of Machine Learning,
2. Learning paradigms: supervised, semi-supervised, and unsupervised learning,
3. Types of prediction tasks: regression and classification,
4. Distinction between algorithms and models.


Terminology: From Econometrics to Machine Learning

Terminology in ML (data):

• Observation / Data point (example or instance): An observation is denoted $z_i = (x_i, y_i)$, where $x_i \in \mathbb{R}^d$ is a vector of features and $y_i$ is the outcome.
• Training dataset: The dataset used to train a model: $(x_1, y_1), \ldots, (x_n, y_n)$.
• Features (covariates): The explanatory variables $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$.
• Target (dependent variable): The outcome variable $y_i$ to be predicted. Also called the label in classification problems.


Structure of the Data Matrix

In supervised ML, the data is typically organized into a matrix of features $X \in \mathbb{R}^{n \times d}$ and a target vector $y \in \mathbb{R}^n$:

• Each row represents an observation (or example): one unit of analysis.
• Each column of $X$ corresponds to a feature (covariate).
• When the data is stored as a single table, the target variable $y$ is typically the last column.
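As a minimal sketch of this layout, the snippet below splits a small table into a feature matrix and a target vector with NumPy. The loan figures and the column meanings (income, age, debt ratio, default flag) are made up for illustration and do not come from the course material.

```python
import numpy as np

# Hypothetical loan table: income (thousands), age, debt ratio, default flag (last column).
table = np.array([
    [35.0, 29, 0.40, 1],
    [72.0, 45, 0.15, 0],
    [51.0, 38, 0.30, 0],
    [28.0, 23, 0.55, 1],
])

X = table[:, :-1]   # feature matrix, shape (n, d)
y = table[:, -1]    # target vector, shape (n,)

n, d = X.shape
print(n, d)         # number of observations, number of features
print(X[0], y[0])   # first observation z_1 = (x_1, y_1)
```

Each row of `X` paired with the matching entry of `y` is one observation in the notation used above.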


Terminology: From Econometrics to Machine Learning

Terminology in ML (model):

• Prediction function (hypothesis): The model’s output function, typically written as $\hat{f}(x)$ or $\hat{y}$, which approximates the relationship between inputs $x$ and target $y$.
• Loss function: A function that measures the error between predicted values $\hat{y}_i$ and actual values $y_i$, used to train the model. Common examples: squared error, cross-entropy.
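The two loss functions named above can be written in a few lines. This is a sketch under common conventions, not the course's own code: the squared error is averaged over observations, $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$, and the cross-entropy is the binary version, $-\frac{1}{n}\sum_i \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$, where $\hat{p}_i$ is the predicted probability that $y_i = 1$. The function names and the clipping constant `eps` are illustrative choices.

```python
import numpy as np

def squared_error(y, y_hat):
    """Mean squared error, a common regression loss."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, p_hat, eps=1e-12):
    """Binary cross-entropy; y in {0, 1}, p_hat = predicted P(y = 1)."""
    y = np.asarray(y, float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    p_hat = np.clip(np.asarray(p_hat, float), eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

mse = squared_error([3.0, 5.0], [2.0, 7.0])   # (1**2 + 2**2) / 2 = 2.5
ce = cross_entropy([1, 0], [0.5, 0.5])        # equals log(2) for uninformative predictions
```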


Terminology: From Econometrics to Machine Learning

The ML terminology applied to the basic linear regression model.


Source: Data Science Blogathon (2020)


Learning Paradigms in Machine Learning

Definition: Learning Modes in Machine Learning

In Machine Learning, tasks are categorized into different learning modes based on the
structure of the data and the type of feedback available:

• Supervised Learning: The algorithm is trained on labeled data, meaning each input $x_i$ is associated with an output $y_i$. The goal is to learn a function $f$ that maps inputs to outputs.
• Unsupervised Learning: The data is unlabeled. The goal is to discover hidden patterns or structures in the input data $x_i$, such as clusters or latent factors.
• Semi-supervised Learning: Combines a small amount of labeled data with a large amount of unlabeled data. The algorithm leverages both to improve learning accuracy.
• Reinforcement Learning: An agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties.


Learning Paradigms in Machine Learning

Source: DataBaseTown (2020)


Source: Baheti (2021), V7labs

Basic Concepts of ML Supervised Learning

Learning Modes

Supervised Learning

Source: Khulbe (2022), Medium


Definition: Supervised Learning

Supervised learning is a machine learning approach where the algorithm is trained on a labeled training dataset, meaning that each observation in the training data is associated with a known output (or label).

The objective is to learn a general mapping rule, called the model, that can be applied to unseen data to produce accurate predictions. Two types of task are distinguished:
• Classification: Predict a discrete label or category (e.g., spam vs. not spam).
• Regression: Predict a continuous numerical value (e.g., house price, GDP growth).

Once the model has been trained, it can be used to predict the output for new inputs not encountered during training.
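The train-then-predict workflow can be sketched with scikit-learn as follows. The data are synthetic and the choice of logistic regression is an assumption made for illustration; any supervised classifier would fit the same pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic labeled data: two features, label 1 when x1 + x2 > 0.
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out observations that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()      # the algorithm (learning procedure)
model.fit(X_train, y_train)       # training produces the fitted model

y_pred = model.predict(X_test)    # predictions for unseen inputs
accuracy = (y_pred == y_test).mean()
print(f"Test accuracy: {accuracy:.2f}")
```

Accuracy is measured on the held-out observations because the purpose of the model is to predict inputs it was not trained on.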


Example of Supervised Learning

Example: Fruit Image Classification

We consider a set of images (data) in which each image carries a label coded as follows:
• 1 if it’s an apple
• 2 if it’s a citrus fruit
• 3 if it’s a watermelon
• 4 if it’s a banana

For each image, a set of features (e.g., color, shape, etc.) is extracted and represented as a vector. These vectors are used to train the supervised learning algorithm.

The resulting model can then be used to classify new images that were never seen
during the training phase.

Source: Calvo (2019)


Supervised Learning. Source: Calvo (2019)


Regression vs Classification

Regression and Classification

In supervised learning, two major types of problems are distinguished based on the nature of the target variable ($Y$):
• Regression: when the target variable is continuous. The goal is to predict a
numerical value from the explanatory variables.
Example: predicting the price of a house or the temperature.
• Classification: when the target variable is categorical. The goal is to assign each
observation to one or more predefined categories.
Example: detecting whether an email is spam or not, or classifying images of
fruits.
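The contrast can be made concrete on one synthetic dataset with two targets, one continuous and one categorical. Everything here is an assumption chosen for illustration: the surface and price figures, the 280 threshold, and the nearest-centroid rule used as a very simple classifier.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(50, 200, size=200)                # hypothetical surface (square metres)
price = 2.0 * x + 30 + rng.normal(0, 10, 200)     # continuous target
expensive = (price > 280).astype(int)             # categorical target (0 or 1)

# Regression: fit price = a * x + b by ordinary least squares.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, price, rcond=None)
price_hat = a * 120 + b                           # prediction is a real number

# Classification: assign the class whose mean surface is closest (nearest centroid).
centroids = np.array([x[expensive == k].mean() for k in (0, 1)])
class_hat = int(np.argmin(np.abs(120 - centroids)))  # prediction is a category
```

The features are identical in both cases; only the nature of the target, and therefore the type of prediction, changes.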


Applications of Supervised Learning in Economics and Finance

Examples of Supervised Learning Applications

Supervised learning is the most widely used ML approach in economics and finance. It
is particularly suited to prediction tasks with labeled historical data. Typical applications
include:
• Credit risk prediction: estimating the probability of default using features from
loan applications or account behavior.
• Fraud detection: identifying anomalous or fraudulent transactions in real time.
• Forecasting: predicting macroeconomic indicators (e.g., GDP growth, inflation) or
financial variables (e.g., stock returns, interest rates).
• Customer segmentation and targeting: classifying households or firms based
on consumption or investment profiles.
• Text classification: categorizing financial disclosures, news articles, or central
bank communications.

These applications often use structured datasets and require careful feature engineering
and validation.


Application: Credit Risk Assessment

Dumitrescu, E.-I., Hué, S., Hurlin, C., and Tokpavi, S. (2022). Machine Learning for Credit Scoring: Improving Logistic Regression with Non-Linear Decision-Tree Effects. European Journal of Operational Research, 297(3), 1178–1192.

• Ensemble methods (e.g., Random Forest) typically outperform standard logistic regression models in credit scoring, but their lack of interpretability is a major drawback.
• In this article, the authors propose an interpretable and high-performing scoring method called PLTR (Penalised Logistic Tree Regression Model), which integrates decision tree outputs into logistic regression to enhance predictive performance.
• Rules extracted from shallow decision trees built on raw features are used as new predictors in a penalised logistic regression model.
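The general idea of feeding tree-based rules into a penalised logistic regression can be sketched as below. This is a simplified illustration on synthetic data, not the authors' implementation: the published method differs in how trees are built and how the penalty is applied. Using a single depth-2 tree, leaf-membership dummies as rules, and an L1 penalty with `C=1.0` are all assumptions made here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data with a threshold interaction that a linear logit cannot capture.
X = rng.normal(size=(1000, 2))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

# Step 1: a shallow tree (depth 2) yields a handful of leaf rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
leaf_id = tree.apply(X)                          # leaf index of each observation
leaves = np.unique(leaf_id)
rules = (leaf_id[:, None] == leaves[None, :]).astype(float)  # one dummy per leaf

# Step 2: penalised (L1) logistic regression on raw features plus rule dummies.
X_aug = np.hstack([X, rules[:, 1:]])             # drop one dummy to avoid redundancy
plain = LogisticRegression().fit(X, y)
with_rules = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X_aug, y)

acc_plain = plain.score(X, y)                    # in-sample accuracy, raw features only
acc_rules = with_rules.score(X_aug, y)           # in-sample accuracy, features + rules
```

The final model remains a logistic regression, so each rule has a readable coefficient. Both accuracies here are in-sample; a proper comparison would use held-out data.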


Description of one of the datasets used in Dumitrescu et al. (2022).


Source: University of California at Irvine (UCI)

Basic Concepts of ML Unsupervised Learning

Learning Modes

Unsupervised Learning


Unsupervised Learning

Definition: Unsupervised Learning

Unsupervised learning refers to machine learning methods that work with unlabeled
data, meaning there is no predefined target variable.

The objective is not to predict a target, but to uncover hidden structures, patterns, or
associations within the data.


Unsupervised Learning Tasks

The goal of unsupervised learning is to discover and extract useful information from data without
relying on labeled outputs.

Main tasks include:

• Clustering: grouping similar observations into clusters.
• Dimensionality Reduction: projecting high-dimensional data into a lower-dimensional space while preserving relevant information (e.g., Principal Component Analysis, PCA).

Unsupervised learning is particularly useful for exploratory data analysis, allowing researchers to
reveal underlying structures or anomalies without prior assumptions.
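Dimensionality reduction with PCA can be sketched with a singular value decomposition of the centred data. The data are synthetic (three features driven mostly by one latent factor) and the loadings are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 3 observed features driven mostly by one latent factor.
factor = rng.normal(size=(500, 1))
X = factor @ np.array([[1.0, 0.8, 0.6]]) + 0.1 * rng.normal(size=(500, 3))

Xc = X - X.mean(axis=0)                        # centre each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)                # share of variance per component
Z = Xc @ Vt[:1].T                              # projection onto the first component
print(explained.round(3), Z.shape)
```

Here a single component carries almost all of the variance, so the three columns can be summarised by one without losing much information.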


Unsupervised Learning. Source: Calvo (2019)


Unsupervised Learning: Clustering

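K-Means is one of the most well-known clustering algorithms.

It assigns each observation to the cluster with the nearest centroid, so as to minimize the within-cluster variance (the sum of squared distances between observations and their cluster centroid).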
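A minimal NumPy sketch of K-Means follows, using the standard alternating scheme (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The random initialisation, the iteration cap, and the two synthetic blobs are illustrative choices.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each row of X."""
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centres
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    labels = assign(X, centroids)
    inertia = ((X - centroids[labels]) ** 2).sum()   # within-cluster sum of squares
    return labels, centroids, inertia

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
labels, centroids, inertia = kmeans(X, k=2)
```

The returned `inertia` is the quantity that K-Means tries to minimize. The result can depend on the initial centres, which is why practical implementations usually run several initialisations.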

Source: Kefanjin (2019)


Applications of Unsupervised Learning in Economics and Finance

Examples of Unsupervised Learning Applications

Unsupervised learning is widely applied in empirical economics and financial analysis for
pattern discovery and dimensionality reduction. Common use cases include:
• Clustering of consumers or firms: identifying latent segments based on
behavior, preferences, or financial indicators.
• Anomaly detection: detecting unusual patterns in transactions, financial
statements, or macroeconomic time series.
• Topic modeling: extracting themes from large text corpora such as news articles,
policy documents, or academic literature.
• Dimensionality reduction: simplifying large datasets (e.g., survey data or panel
data) using methods like PCA before visualization or modeling.
• Market structure analysis: uncovering patterns in product characteristics or
pricing strategies across firms.

These techniques are particularly valuable in exploratory analysis and for preprocessing
data in supervised pipelines.


Application: Credit Risk Assessment

Bakoben, M., Bellotti, T., and Adams, N. (2020). Identification of credit risk based on cluster analysis of account behaviours. Journal of the Operational Research Society, 71, 775–783.

• Clustering-based method to group bank accounts according to behavioural risk (revolving credit).
• Monthly credit card data from 494 accounts in a UK bank over
a period of up to two years.
• Behaviour modelled parametrically using a VAR model (e.g.,
utilisation and repayment rates).
• Clustering based on dissimilarities derived from these
behavioural models.
• Real-world behavioural data grouped into meaningful credit
card usage segments.
• A new default prediction model incorporates cluster
membership as a predictor, improving performance (AUC
gain).

Basic Concepts of ML Semi-supervised Learning

Learning Modes

Semi-supervised Learning

Source: Atten (2023), Medium


Semi-Supervised Learning

Definition: Semi-Supervised Learning

Semi-supervised learning is a hybrid machine learning approach that combines labeled data and unlabeled data to train a predictive model.
The algorithm learns from a small set of labeled examples while also leveraging the structure of the larger set of unlabeled data to improve learning efficiency and accuracy.


Advantages of semi-supervised learning

This approach is particularly useful when:

• Acquiring labeled data is costly, time-consuming, or requires domain expertise (e.g., fraud detection).
• A large volume of unlabeled data is readily available.

The goal is to exploit the unlabeled data to:

• Enhance model performance compared to purely supervised learning.
• Reduce the need for labeled data while maintaining good generalization.

Semi-supervised methods are widely used in applications such as image classification, natural
language processing, and fraud detection, where unlabeled data is abundant but labels are scarce.


Source: Teksands (2021)


Applications of Semi-Supervised Learning in Economics and Finance

Examples of Semi-Supervised Learning Applications

Semi-supervised learning is useful in situations where labeled data are scarce or costly
to obtain, but large amounts of unlabeled data are available. In economics and finance,
this setting is common. Typical applications include:
• Credit scoring: using a small set of labeled defaults with a large pool of
unlabeled accounts to improve risk prediction.
• Fraud detection: combining a few confirmed fraud cases with a broad set of
transactions to identify suspicious patterns.
• Text classification: labeling economic or financial documents (e.g., reports,
reviews, complaints) using limited annotations.
• Customer segmentation: leveraging sparse labeled information to guide
clustering and behavioral profiling.
• Predicting regulatory outcomes: using partial outcomes from past decisions to
generalize over unlabeled policy cases.

These methods allow analysts to reduce annotation costs while maintaining predictive
accuracy, making them attractive for applied research and industry practice.


Application: Credit Risk Assessment

Shen, F., Yang, Z., Zhao, X., and Lan, D. (2022). Reject inference in credit scoring using a three-way decision and safe semi-supervised support vector machine. Information Sciences, 606, 614–627.

• In credit risk modeling, rejected loan applicants are often excluded from the training data, leading to sample selection bias.
• Accepted applicants are labeled, while rejected applicants remain unlabeled.
• This paper introduces a two-step method to reintegrate rejected clients into the learning process:
  (i) Adjustment of feature distributions between accepted and rejected populations.
  (ii) A semi-supervised classification method based on safe support vector machines (S4VM).
• This strategy improves the accuracy of credit scoring models by leveraging the information contained in the rejected sample.


Common Strategies in Semi-Supervised Learning

Typical Approaches in Semi-Supervised Learning

Although not covered in this training program, several strategies are commonly used in
semi-supervised learning (SSL):
• Self-training: A supervised model is trained on labeled data and then used to
generate pseudo-labels for the unlabeled data. The most confident predictions are
iteratively added to the training set.
• Label Spreading: Labels from labeled examples are propagated to unlabeled
ones based on similarity. This approach assumes that nearby points (e.g., in
terms of nearest neighbors) are likely to share the same label.

These techniques exploit the structure of the data to improve learning performance, even
when labeled data is scarce.
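Both strategies can be sketched in NumPy on a toy dataset with six labeled points out of 200. Several choices below are assumptions made for illustration: the nearest-centroid base learner, the distance margin used as a confidence score, the RBF similarity graph (instead of a nearest-neighbour graph), and the values of `batch`, `alpha`, and `sigma`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y_true = np.repeat([0, 1], 100)
y = np.full(200, -1)                       # -1 marks an unlabeled observation
known = [0, 1, 2, 100, 101, 102]
y[known] = y_true[known]

def self_training(X, y, n_rounds=10, batch=20):
    """Pseudo-label the most confident unlabeled points, round after round."""
    y = y.copy()
    for _ in range(n_rounds):
        unlabeled = np.flatnonzero(y == -1)
        if unlabeled.size == 0:
            break
        # Base learner (an assumption): nearest class centroid.
        centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
        dist = np.linalg.norm(X[unlabeled, None, :] - centroids[None], axis=2)
        confidence = np.abs(dist[:, 0] - dist[:, 1])      # distance margin
        top = np.argsort(confidence)[-batch:]             # most confident points
        y[unlabeled[top]] = dist[top].argmin(axis=1)
    return y

def label_spreading(X, y, alpha=0.9, sigma=1.0, n_iter=50):
    """Propagate labels through a normalised RBF similarity graph."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    D = W.sum(axis=1)
    S = W / np.sqrt(np.outer(D, D))                       # D^{-1/2} W D^{-1/2}
    idx = np.flatnonzero(y >= 0)
    Y0 = np.zeros((len(X), 2))
    Y0[idx, y[idx]] = 1.0                                 # one-hot known labels
    F = Y0.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y0              # spread, then re-anchor
    return F.argmax(axis=1)

y_self = self_training(X, y)
y_spread = label_spreading(X, y)
acc_self = (y_self == y_true).mean()
acc_spread = (y_spread == y_true).mean()
```

Both functions rely on the same assumption stated in the text: points that are close in feature space tend to share a label. When that assumption fails, pseudo-labels can reinforce early mistakes.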

Basic Concepts of ML Reinforcement Learning

Learning Modes

Reinforcement Learning

Source: Wikipedia


Definition: Reinforcement Learning

Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to make sequential decisions by interacting with an environment. At each step, the agent receives feedback in the form of rewards or penalties, allowing it to improve its decision-making policy over time.
The core objective is to learn a strategy (called a policy) that maximizes the cumulative reward over time.


Reinforcement Learning

Source: MathWorks, Matlab R2025a


Applications of RL in Economics

Examples of Economic Applications

Reinforcement Learning has gained interest in economics and finance, particularly in problems involving sequential decision-making under uncertainty. Typical applications include:
• Dynamic pricing and revenue management: adjusting prices in real time based on market response.
• Portfolio optimization: selecting investment strategies that evolve over time to
maximize returns.
• Monetary policy simulation: modeling central bank behavior in response to
macroeconomic indicators.
• Bidding strategies: in online advertising auctions or energy markets.
• Experimental economics: modeling learning and adaptation in games or
bargaining situations.


Main Algorithms in Reinforcement Learning

Families of RL Algorithms

The main categories of Reinforcement Learning algorithms include:

• Value-based methods: Learn a value function (e.g., Q-values) to derive optimal actions.
  Examples: Q-learning, Deep Q-Networks (DQN).
• Policy-based methods: Directly optimize the policy that maps states to actions.
  Examples: REINFORCE, Proximal Policy Optimization (PPO).
• Actor-Critic methods: Combine value and policy learning in a single framework.
  Examples: Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG).

These algorithms are often trained using simulation or interaction with complex environments.
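As an illustration of a value-based method, here is a minimal tabular Q-learning sketch. The environment is a made-up chain of five states in which the agent earns a reward of 1 on reaching the last state. The discount factor, the learning rate, and the purely random exploration policy are assumptions chosen to keep the example short.

```python
import numpy as np

n_states, n_actions = 5, 2          # chain of 5 states; actions: 0 = left, 1 = right
gamma, alpha = 0.9, 0.5             # discount factor and learning rate
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(state, action):
    """Toy environment: reward 1 when the last state is reached, else 0."""
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = nxt == n_states - 1
    return nxt, float(done), done

for _ in range(500):                # episodes
    state, done = 0, False
    while not done:
        action = int(rng.integers(n_actions))        # purely random exploration
        nxt, reward, done = step(state, action)
        # Q-learning target: immediate reward plus discounted best future value.
        target = reward + (0.0 if done else gamma * Q[nxt].max())
        Q[state, action] += alpha * (target - Q[state, action])
        state = nxt

policy = Q.argmax(axis=1)           # greedy policy derived from the Q-values
```

Because Q-learning is off-policy, the agent can explore at random and still learn the values of the greedy policy, which here is to move right from every non-terminal state.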

Basic Concepts of ML Other Types of Learning

Learning Modes

Other Types of Learning Modes

Source: Naveen (2021), Nomidl


Beyond Classical Learning Paradigms

Other Branches of Machine Learning

In addition to supervised, unsupervised, and semi-supervised learning, other major
paradigms include:
• Deep Learning (DL): builds multi-layered neural networks to learn hierarchical
representations of complex data. Widely used in image recognition, speech
processing, and natural language understanding.
• Natural Language Processing (NLP): enables machines to process and
understand human language in text or speech form.
• Self-Supervised Learning: trains models using unlabeled data by creating
artificial labels from the data itself (e.g., predicting masked parts of input).
Common in representation learning and language modeling.
• Online Learning: learns incrementally from streaming data without requiring full
retraining.

In this course, we focus on the foundations of ML and will only briefly mention some of
these paradigms, without detailed coverage.
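
As one illustration of the online learning mode listed above, the sketch below updates a linear model one observation at a time, without ever storing the full dataset or retraining from scratch. Everything in it is an assumption made for the example: the simulated stream `stream`, the noiseless relation y = 1 + 2x, the learning rate, and the function name `online_fit`.

```python
import random


def stream(n_obs, seed=0):
    """Simulated data stream (hypothetical): yields one (x, y) pair at a time."""
    rng = random.Random(seed)
    for _ in range(n_obs):
        x = rng.random()
        yield x, 1.0 + 2.0 * x   # noiseless relation y = 1 + 2x


def online_fit(observations, lr=0.1):
    """Stochastic gradient step on the squared error after each observation."""
    a, b = 0.0, 0.0              # intercept and slope, updated incrementally
    for x, y in observations:
        error = (a + b * x) - y  # prediction error on the newest observation
        a -= lr * error
        b -= lr * error * x
    return a, b


a_hat, b_hat = online_fit(stream(5000))
print(round(a_hat, 3), round(b_hat, 3))
```

Each observation is used once and then discarded, so memory use does not grow with the length of the stream.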


Deep Learning

Deep Learning is a subfield of machine learning that uses multi-layered neural networks to
automatically learn data representations.

Unlike traditional machine learning, deep learning greatly reduces the need for manual feature
engineering. Instead, the algorithm learns both the features and the decision rule directly from raw
data through successive layers of abstraction.


Natural Language Processing (NLP)

Natural Language Processing (NLP) refers to the branch of AI focused on enabling machines to
understand, interpret, generate, and respond to human language, both written and spoken.

NLP combines insights from computational linguistics, statistics, and machine learning, particularly
deep learning.

Source: Amazinum


Other Types of Learning

Source: Misra (2024), Medium

Basic Concepts of ML Summary of Learning Modes

Summary of Learning Modes


Source: DataBaseTown (2020)


Summary of Learning Modes

Source: Forbes, Teich (2000)


AI, Machine Learning and Learning Modes

Source: IBM (2022), Machine learning and data science


Basic Concepts of ML

Key Concepts

1 Model feature.
2 Target or label.
3 Prediction function or hypothesis.
4 Loss function.
5 Regression vs. classification model.
6 Labelled vs. unlabelled data.
7 Supervised learning.
8 Unsupervised learning.
9 Semi-supervised learning.
10 Reinforcement learning.
11 Deep learning.

12 Natural Language Processing.

ML Algorithms

Outline

1. Introduction

2. AI and ML: Key Definitions

3. Basic Concepts of ML

4. ML Algorithms

5. Taxonomy of Data

ML Algorithms Most often used ML algorithm

What is an Algorithm?

Definition: Algorithm

An algorithm is a finite, well-defined sequence of instructions or rules designed to
perform a specific task or solve a particular problem.

In the context of ML, an algorithm defines the procedure by which a model is learned
from data. It specifies how the model’s parameters are estimated, updated, and
optimized based on a given objective function.

An algorithm is not the final model itself, but the method used to generate the model
from data.
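
The distinction between algorithm and model can be sketched in a few lines. In this illustrative example (the data and the names `fit_linear_gd` and `make_model` are assumptions, not part of the course material), gradient descent is the algorithm, the mean squared error is the objective function, and the fitted function returned at the end is the model.

```python
def fit_linear_gd(xs, ys, lr=0.05, n_iter=5000):
    """Algorithm: gradient descent on the mean squared error of y = a + b*x."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        residuals = [a + b * x - y for x, y in zip(xs, ys)]
        grad_a = 2.0 / n * sum(residuals)
        grad_b = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        a -= lr * grad_a         # parameter updates driven by the objective
        b -= lr * grad_b
    return a, b


def make_model(a, b):
    """Model: the fitted prediction function, usable without the algorithm."""
    return lambda x: a + b * x


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # toy data generated by y = 1 + 2x
a_hat, b_hat = fit_linear_gd(xs, ys)
model = make_model(a_hat, b_hat)
print(round(model(5.0), 3))
```

Once training is finished, only `model` is needed to make predictions; a different algorithm (for example, the closed-form least squares solution) could have produced the same model.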


Algorithms in Machine Learning

From Learning Modes to Algorithms

Each learning mode (supervised, unsupervised, semi-supervised, and reinforcement
learning) includes a wide variety of algorithms.
It is difficult to list them exhaustively or to determine which are the most commonly used,
as this largely depends on the application domain and the nature of the data.

However, in economics and finance, some algorithms are more frequently employed in
practice. We now present those that are most commonly used in these fields for each
learning mode.


Machine Learning Algorithms

Source: Scikit-Learn


Machine Learning Algorithms

Source: LinkedIn


Most Common Machine Learning Algorithms Used on Kaggle

Source: Kaggle survey 2017, cited in Gupta (2018), Medium


Most Common Machine Learning Algorithms Used on Kaggle

Source: Kaggle surveys 2017-202, cited in Capellupo (2021), Towards Data Science.

ML Algorithms What should ML training for economists look like?

Transition Question

What should Machine Learning training
for economists look like?
What knowledge, methods, and tools are truly useful
for economic research and policy analysis?


ML Background of Economists

Most of the ML methods used by banks were developed between
the mid-1980s and the early 2000s:
• CART algorithm: Breiman et al. (1984)
• Bagging methods: Breiman (1996)
• Random forests: Breiman (2001)
• XGBoost: Chen and Guestrin (2016), but the Boosting
technique has its roots in an early publication by Freund
and Schapire (1997)

But ML was only introduced into economics and econometrics
master’s programs in the mid-2010s.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression
Trees, Wadsworth International Group.


Example: Master's Program in Econometrics (Univ. Orléans)


Example 3: Most Common ML Algorithms in Credit Scoring

Note: This figure shows the number of times each ML algorithm is used across 110 articles from a literature survey on credit scoring.
Source: Markov et al. (2022), Credit scoring methods: Latest trends and points to consider, Journal of Finance and Data Science, 8,
180-201.


ML Algorithms

Key Concepts

1 Algorithm vs. model.

2 Most often used algorithms.

3 Logistic Regression.

4 Decision Trees.

5 Random Forests.

6 Neural Networks.

7 Ensemble Methods.

8 Support Vector Machines.

9 Gradient Boosted Machines.

10 Convolutional Neural Networks.

Taxonomy of Data

Outline

1. Introduction

2. AI and ML: Key Definitions

3. Basic Concepts of ML

4. ML Algorithms

5. Taxonomy of Data


Typology of Data

Different categories and notions can be used to characterize data:

Source: Luna-Reyes, L. F., Martin, E. G., and Ivonchyk, M. (2022). Data Analytics for Public Policy and Management. Pressbooks.


Typology of Data

We will review the following key concepts:


• Quantitative data.
• Qualitative data.
• Cross-sectional data.
• Time series data.
• Panel or longitudinal data.
• Structured, semi-structured, and unstructured data.


Quantitative Data

Definition: Quantitative Data

Quantitative data (or numerical data) refers to measurable information that takes
numerical values. It is typically divided into:
• Discrete quantitative data: distinct, countable values, usually integers.
• Continuous quantitative data: values that can lie anywhere within a given interval,
so that infinitely many values are possible.

Examples of Discrete and Continuous Quantitative Data

Examples of discrete data: Number of children in a household, number of customers in
a store.
Examples of continuous data: Salary, height, weight of an individual.


Qualitative Data

Definition: Qualitative Data

Qualitative data (or categorical data) refers to non-numeric information that describes
attributes or characteristics. It can be divided into:
• Nominal qualitative data: categories without any intrinsic order.
• Ordinal qualitative data: categories with a meaningful order.

Examples of Nominal and Ordinal Qualitative Data

Examples of nominal data: Eye color (blue, green, brown), housing type (apartment,
house, studio).
Examples of ordinal data: Satisfaction level (dissatisfied, neutral, satisfied), school
ranking (first, second, third).


Encoding Categorical Variables

Definition: Encoding Categorical Variables

Encoding categorical variables refers to the process of transforming qualitative
(categorical) variables into numerical variables so they can be used in statistical analyses or
econometric and machine learning models.

Common encoding methods:


• One-Hot Encoding: Each category is represented by a separate binary variable indicating
the presence (1) or absence (0) of that category.
• Label Encoding: Each category is replaced by a unique integer. This method is simple but
may introduce an artificial ordering.


Encoding Categorical Variables

Example: Label Encoding vs. One-Hot Encoding

Consider a variable "Animal" with the categories: Cat, Dog, Bird.

1. Label Encoding:
• Cat: 1
• Dog: 2
• Bird: 3

2. One-Hot Encoding:

Id  Animal  Cat  Dog  Bird
1   Cat     1    0    0
2   Dog     0    1    0
3   Bird    0    0    1
4   Cat     1    0    0
5   Dog     0    1    0
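
The two encodings of the "Animal" example can be reproduced with pandas. This is only a sketch: the column name `Animal_label` and the dictionary `label_map` are names chosen for the illustration, and `pd.get_dummies` is one of several ways to obtain a one-hot encoding.

```python
import pandas as pd

# Same five observations as in the example
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Animal": ["Cat", "Dog", "Bird", "Cat", "Dog"],
})

# Label encoding: each category is replaced by an integer.
# The mapping is arbitrary, so the implied order Cat < Dog < Bird is artificial.
label_map = {"Cat": 1, "Dog": 2, "Bird": 3}
df["Animal_label"] = df["Animal"].map(label_map)

# One-hot encoding: one 0/1 column per category
# (get_dummies sorts columns alphabetically, so they are reordered here)
one_hot = pd.get_dummies(df["Animal"], dtype=int)[["Cat", "Dog", "Bird"]]

encoded = pd.concat([df, one_hot], axis=1)
print(encoded)
```

In a regression with an intercept, one of the one-hot columns is usually dropped to avoid perfect collinearity; `pd.get_dummies` offers a `drop_first` argument for that purpose.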


Binary Variables and 0/1 Encoding

Definition: Binary Variables

A binary variable is a variable that can take only two values, typically 0 and 1,
representing opposite states such as yes/no or true/false.

Note: Categorical variables with two categories can be directly encoded as 0 and 1.


Exercise: Credit Scoring


The goal is to build a credit scoring model to estimate the probability that a borrower
will default within the first 12 months of a loan contract.
• The dataset includes several characteristics of the borrower and the loan, which
serve as explanatory variables in the logistic regression model.
• The target variable is a binary default indicator equal to 1 in case of default and 0
otherwise.

Data: The data are provided in the file Scoring_data.xlsx.


Exercise: Credit Scoring

Variable                                       Role     Type        Unit    Values / Categories
Default indicator                              Target   Binary      -       1 = Default within 12 months, 0 = No default within 12 months
Tenure in current job                          Feature  Continuous  Years   -
Borrower age                                   Feature  Continuous  Years   -
Car purchase price                             Feature  Continuous  Euros   -
Loan amount                                    Feature  Continuous  Euros   -
Monthly payment as a share of monthly income   Feature  Continuous  %       -
Down payment > 50% of vehicle value            Feature  Binary      -       1 = Yes, 0 = No
Expected loan duration                         Feature  Continuous  Months  -
Credit event in last 6 months                  Feature  Binary      -       1 = Yes, 0 = No
Marital status (married or other)              Feature  Binary      -       1 = Married, 0 = Other
Homeownership status                           Feature  Binary      -       1 = Homeowner, 0 = Other


Qualitative and Quantitative Data

Qualitative vs. Quantitative Data

Many software tools require you to declare the data type of each variable when importing
a dataset, or at least allow you to adjust it. In Python, for example, pandas infers column
types automatically on import, but it is possible (and often recommended) to specify them
manually to ensure consistency and accuracy.


Qualitative and Quantitative Data

Main Variable Types in Python (pandas):

• Numeric variables, continuous or discrete (`float`, `int`): Represent numeric values that can
take a wide range of values. Examples: Property price (`float`), age of an individual in years (`int`).
• Categorical variables (`category`, `object`): Take a limited number of distinct values
representing categories. Examples: Type of contract (`category`), product color (`object`).
• Binary variables (`bool`, `category`): Variables with only two levels (0/1 or True/False).
Examples: Homeownership (`bool`), default indicator (`category`).
• Ordinal variables (`category` with order): Categorical variables with a natural order.
Examples: Satisfaction level (`category` with a defined order), movie rating (1 star, 2 stars,
etc.).
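
The ordinal case is the least obvious of the four, so here is a small sketch of how an ordered `category` can be declared in pandas. The toy DataFrame and its column names (`satisfaction`, `homeowner`) are hypothetical and serve only to illustrate the types listed above.

```python
import pandas as pd

# Hypothetical toy data, not taken from the course dataset
df = pd.DataFrame({
    "satisfaction": ["neutral", "satisfied", "dissatisfied", "satisfied"],
    "homeowner": [1, 0, 1, 1],
})

# Ordinal variable: a 'category' with an explicit order of levels
levels = ["dissatisfied", "neutral", "satisfied"]
df["satisfaction"] = pd.Categorical(df["satisfaction"], categories=levels, ordered=True)

# Binary variable stored as 'bool'
df["homeowner"] = df["homeowner"].astype("bool")

print(df.dtypes)

# Integer codes follow the declared order (dissatisfied = 0, neutral = 1, satisfied = 2)
codes = df["satisfaction"].cat.codes.tolist()

# Because the category is ordered, comparisons between levels are meaningful
at_least_neutral = (df["satisfaction"] >= "neutral").tolist()
print(codes, at_least_neutral)
```

Without `ordered=True`, pandas would treat the three levels as nominal and the comparison on the last lines would raise an error.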


Exercise: Credit Scoring

```python
import pandas as pd

# Load the Excel file
file_path = "Scoring_data.xlsx"
df = pd.read_excel(file_path, sheet_name="Data")

# Show the types detected automatically by pandas
print("Automatically detected types:")
print(df.dtypes)

# Define the binary (categorical) and continuous variables
binary_variables = ["Default", "Down payment", "Credit event", "Married", "Homeowner"]
continuous_variables = ["Age", "Car price", "Funding amount", "Job tenure",
                        "Loan duration", "Monthly payment"]

# Convert binary variables to 'category' (or use 'bool' if preferred)
df[binary_variables] = df[binary_variables].astype("category")

# Store continuous variables as floats so that their type is explicit as well
df[continuous_variables] = df[continuous_variables].astype("float64")

# Show types after explicit conversion
print("\nTypes after explicit conversion:")
print(df.dtypes)
```


Structured Data

Definition: Structured Data

Structured data refers to data that is organized according to a predefined schema
(typically rows and columns in relational databases or tables) before being stored,
for example in a data warehouse.

Example of Structured Data

A relational database is the most common example of structured data: information is
organized into clearly defined fields, such as credit card numbers or addresses, making
it easy to query using SQL.


Structured Data

Definition: Relational Database

A relational database is a system for storing and managing structured data, organized
into interrelated tables. Each table consists of rows (records) and columns (attributes),
clearly defining the structure of the data.

Example of a Relational Database

A relational database for a company storing employee information might contain the
following tables:
• Employees table: (ID, Last Name, First Name, Position, Salary)
• Departments table: (ID, Department Name, Manager)
• Projects table: (ID, Project Name, Budget, Associated Department)
Relationships between these tables allow for efficient querying using SQL.
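
A minimal sketch of such a database can be built with Python's built-in `sqlite3` module. The rows inserted below are invented for the illustration, and the `department_id` column is a hypothetical addition: the Employees table described above does not list a column linking employees to departments, so one is assumed here in order to show a join.

```python
import sqlite3

# In-memory database, discarded when the connection is closed
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (
    id INTEGER PRIMARY KEY,
    department_name TEXT,
    manager TEXT
);
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    last_name TEXT,
    first_name TEXT,
    position TEXT,
    salary REAL,
    department_id INTEGER REFERENCES departments(id)  -- hypothetical link column
);
INSERT INTO departments VALUES (1, 'Research', 'Martin'), (2, 'Finance', 'Durand');
INSERT INTO employees VALUES
    (1, 'Dupont', 'Alice', 'Economist', 52000, 1),
    (2, 'Bernard', 'Paul', 'Analyst', 45000, 2),
    (3, 'Petit', 'Lea', 'Data scientist', 58000, 1);
""")

# The relationship between the two tables is used through a JOIN
query = """
SELECT d.department_name, AVG(e.salary) AS avg_salary
FROM employees AS e
JOIN departments AS d ON e.department_id = d.id
GROUP BY d.department_name
ORDER BY d.department_name;
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

The query returns one row per department with its average salary, which is the kind of question that the fixed schema of structured data makes easy to express.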


Structured Data

Source: PhoenixNAP Global IT Services, June 2021


SQL

Definition: SQL

SQL (Structured Query Language) is the standard language used to query and manage
relational databases.

Example of an SQL statement


SQL

Source: medium.com


Unstructured Data

Definition: Unstructured Data

Unstructured data refers to information that is not organized according to a predefined
format. It lacks the rigid structure of relational databases and is often stored as text,
images, videos, or documents without a fixed schema.

Example of Unstructured Data

Social media content, free-text data, and audio recordings are typical examples of
unstructured data. Unlike the contents of relational databases, these data require specialized
technologies such as NoSQL databases or Natural Language Processing (NLP) techniques to be
processed effectively.


Unstructured Data

Source: Lawtomated, "Structured vs. Unstructured Data: What are they and why care?", April 2019.


Unstructured Data

Source: Edureka!, May 2020.


Semi-Structured Data

Definition: Semi-Structured Data

Semi-structured data refers to information that does not follow a rigid schema like
relational databases but still has an implicit organization through tags, metadata, or a
hierarchical structure. It lies between structured and unstructured data.

Examples of Semi-Structured Data


• Emails: An email includes free text (unstructured), but also well-defined fields
such as sender, recipient, subject, and date, which makes it semi-structured.
• HTML documents: A web page written in HTML is organized using tags (`<title>`,
`<h1>`, `<p>`, `<table>`), although the actual content is flexible and free-form.
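
The email example can be made concrete with Python's standard `email` module. The message below is entirely invented for the illustration (addresses, subject, and body are hypothetical); the point is only that the header fields can be extracted into a structured record while the body remains free text.

```python
from email import message_from_string

# Hypothetical raw email: structured header fields followed by free text
raw_email = """From: alice@example.org
To: bob@example.org
Subject: Quarterly default rates
Date: Mon, 15 Sep 2025 09:30:00 +0200

Hi Bob, the default rate rose slightly this quarter. Details to follow.
"""

msg = message_from_string(raw_email)

# The tagged fields map directly to columns of a table ...
record = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
    "date": msg["Date"],
    # ... whereas the body is unstructured text that would need NLP to analyze
    "body": msg.get_payload().strip(),
}
print(record)
```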


Semi-Structured Data

Source: Educba.


Semi-Structured Data

Source: hackernoon.com.


Big Data

Definition: Big Data

Big Data refers to large, diverse, and high-velocity datasets that exceed the capabilities
of traditional data management tools. These datasets require advanced technologies for
collection, storage, processing, and visualization.

The 4 V’s of Big Data


• Volume: Massive quantities of data (e.g., social media, IoT).
• Variety: Diversity of formats (e.g., text, images, videos, structured and
unstructured data).
• Velocity: Data generated and processed in real time or near real time (e.g.,
streaming).
• Veracity: Reliability and quality of the collected data.


Big Data

Source: Analytixlabs


Definition: High-Dimensional Problem

High-Dimensional Setting

A machine learning or statistical problem is said to be in a high-dimensional setting
when the number of features (variables or covariates), denoted d, exceeds the number
of observations, denoted n, often by a wide margin:
d ≫ n
In such settings, classical estimation techniques may fail. With d > n, the ordinary least
squares estimator is not uniquely defined: the matrix X'X, where X is the n × d matrix of
features, is singular, so the data are insufficient to estimate all parameters reliably.
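
The failure of ordinary least squares when d > n can be checked numerically. The sketch below uses simulated data (the sizes n = 5 and d = 20, the random seed, and all variable names are assumptions for the illustration) and shows that two different coefficient vectors fit the sample perfectly, so the data cannot identify a unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer observations than features
X = rng.normal(size=(n, d))       # simulated feature matrix
y = rng.normal(size=n)            # simulated target

# rank(X'X) = rank(X) <= n < d, so the d x d matrix X'X cannot be inverted
rank = np.linalg.matrix_rank(X)

# lstsq still returns a solution: the one with minimum Euclidean norm
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# The last right-singular vector of X lies in its null space (X @ v is ~0)
_, _, vt = np.linalg.svd(X)
null_vector = vt[-1]

# Adding any multiple of it gives a different vector with the same perfect fit
beta_other = beta_min_norm + 10.0 * null_vector

print(rank, np.allclose(X @ beta_min_norm, y), np.allclose(X @ beta_other, y))
```

Regularized estimators such as ridge or lasso restore a unique solution by adding a penalty on the coefficients, which is one reason they are common in high-dimensional settings.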


Taxonomy of Data

Key Concepts

1 Quantitative data.

2 Qualitative data.

3 One-Hot encoding vs. label encoding.

4 Binary variable.

5 Structured data.

6 Relational databases.

7 SQL (Structured Query Language).

8 Semi-structured data.

9 Unstructured data.

10 Big data.

11 4 Vs: volume, variety, velocity, veracity.

12 High-dimensional setting.


End of Session

Christophe Hurlin (University of Orléans and IUF)
