Expected Goals in Soccer
Eggels, H.P.H.
Award date:
2016
Expected Goals in Soccer:
Explaining Match Results
using Predictive Analytics
Master Thesis
H.P.H. Eggels
Supervisors:
TU/e: dr. M. Pechenizkiy
TU/e: dr. R.J. Almeida
PSV: MSc. R. van Elk
PSV: dr. Luc van Agt
While data-based decision making is taking over business and elite sports, elite soccer is lagging
behind. In elite soccer, decisions are still often based on emotions and recent results. As results
depend on many aspects, however, the reasons for these results are currently unknown to
the elite soccer clubs. In our study, a method is proposed to determine the expected winner of a
match.
Since goals are rare in soccer, goal scoring opportunities are analyzed instead. By analyzing
which team created the best goal scoring opportunities, an indication can be obtained of which team
should have won the game. It is therefore important that the quality of goal scoring opportunities
accurately reflects reality. The proposed method therefore expresses the quality of a goal scoring
opportunity as the probability of that opportunity resulting in a goal. It is
shown that these scores accurately match reality.
The quality scores of individual goal scoring opportunities are then aggregated to obtain an
expected match outcome, which results in an expected winner. In little more than 50% of the
cases, our method is able to determine the correct winner of a match. The majority of incorrectly
classified winners comes from close matches where a draw is predicted.
The quality scores of the proposed method can already be used by elite soccer clubs. First of
all, these clubs can evaluate periods of time more objectively. Secondly, individual matches can be
analyzed to assess the importance of major events during a match, e.g. substitutions. Finally,
the quality metrics can be used to track the performance of players over time, which can be
used to adjust training programs or to support player acquisition.
Contents

1 Introduction
  1.1 Problem Statement
  1.2 Methodology
  1.3 Related Work
  1.4 Main Results
  1.5 Outline of this Thesis
4 Modeling
  4.1 Feature Extraction
  4.2 Data Preparation
  4.3 Class Imbalance
  4.4 Conclusion
5 Evaluation
  5.1 Performance Metrics
  5.2 Reliability Graph
  5.3 Eye Test
  5.4 Match Outcomes
  5.5 Conclusion
7 Conclusion
  7.1 Main Contributions
  7.2 Limitations & Future work
Bibliography
Introduction
In sports, many people are conservative, which makes revolutions in sports difficult. Currently,
however, a revolution is taking place in which data plays a major role in important decisions,
e.g. buying players.
One of the most influential people in this revolution of data-based decisions in sports is Bill
James, a statistician who has been studying baseball since the 1970s, most often through the
use of data. Among his contributions to baseball is a statistic that quantifies a player's
contribution to scored runs: Runs Created. Runs Created correlates well with actual runs scored:

Runs Created = (Hits + Walks) × Total Bases / (At Bats + Walks)
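As an illustration, the formula is simple enough to compute directly; the season line below is invented for the example, not taken from the thesis:

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Bill James' basic Runs Created statistic."""
    return (hits + walks) * total_bases / (at_bats + walks)

# Invented season line: 180 hits, 60 walks, 300 total bases, 550 at bats.
rc = runs_created(180, 60, 300, 550)
```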
The fame of this statistic comes largely from the success story of Billy Beane. Beane, the
general manager of the low-budget Oakland Athletics, began applying James' principles. This resulted in the
Oakland Athletics being competitive with teams with much higher budgets. This success story
was captured by Michael Lewis in his book Moneyball [35]. In response to the successes of the
Oakland Athletics, major teams in Major League Baseball also started hiring statisticians.
The statistical revolution in baseball was also noticed in other sports, one of which was
basketball. Nowadays, basketball teams use "player tracking" technologies to
evaluate the efficiency of a team by analyzing players' movements during a match. To do
so, the basketball players and the ball are tracked at 25 Hz during matches [1].
Currently, data-based decisions are also increasingly made in soccer. The origins of soccer
analytics go back to the 1930s, when Charles Reep started analyzing soccer matches. During his
career, he annotated around 2200 matches, which eventually led to his article "Skill and Chance
in Association Football", published in the Journal of the Royal Statistical Society in 1968 [8].
Reep was the first to collect that amount of data in soccer. His findings, however, were
of limited quality: Reep was too focused on finding evidence for his own vision of how soccer
should be played, and failed to see the drawbacks of this playing style [2].
Although soccer analytics go back a long way, the effective use of these analyses was limited for
a long time. With the rising interest in analytics in other sports, soccer is currently catching up
in the use of data-based decision making. The data is no longer collected by individuals
such as Reep, but by entire companies such as Opta, Prozone, and StatDNA. The effort required
to collect this data has also decreased dramatically: where it took Reep about 80 hours to
analyze a single game, data is now collected by video cameras which track all the players in
real time [8].
Besides the dramatic developments in data collection methods, the data is also used for a
wider variety of applications. Perhaps the oldest application of data analytics in soccer is
in the betting industry, where data is used to determine the odds of winning, losing, and drawing.
Currently, however, it is possible to bet on almost everything, from the number of corners for each
side to the number of fouls and yellow cards.
Furthermore, the data is extensively used by the soccer clubs themselves. Not very long ago,
clubs had to exchange video files with each other to be able to analyze opponents. With the
emergence of companies such as Prozone, this is no longer necessary and clubs have almost all
matches of their opponents available. The rising use of data in soccer also raises some barriers:
Anderson and Sally found many situations in which data could help analyze matches, but where
coaches were not willing to use the data, believing they knew better anyway [2].
Soccer clubs use data not only to analyze matches but also to analyze individual players. By
analyzing individual players from both their own team and other teams, soccer clubs are able
to determine which players should be traded, and to find the best player
from other teams to fill a position in their own team. When Anderson and Sally used this
approach to help a team in the transfer window, they got very positive feedback from the board.
The manager, however, preferred his own judgement over the data-driven approach [2].
At PSV, this data-driven revolution is also taking place. In the approach PSV is following,
however, the projects originate from business-oriented demands. The demands for data-driven
solutions mainly come from soccer trainers themselves. This way, soccer trainers can contribute
to the way in which the projects are carried out and are more likely to effectively use the results.
“How can we quantify the quality of a goal scoring opportunity created by a team?”
By answering this question, not only can the quality of a created goal scoring opportunity be
determined, but also the performance of the player in that situation. A player's shot either
results in a goal or not; comparing this outcome to the expected quality of the goal scoring
opportunity leads to insights into the performance of that player's shot. Therefore, this research
also answers the following question:
“How can we quantify the value of a shot given the scoring opportunity?”
1.2 Methodology
The size of both the tactical data and the physical data provided by PSV calls for data
mining solutions. Data mining is the extraction of useful information from large datasets and databases.
Data mining therefore lies at the intersection of statistics, machine learning, data management
and databases, pattern recognition, artificial intelligence, and more [9].
One of the major data mining frameworks is the Cross-Industry Standard Process for Data
Mining (CRISP-DM) [45]. This data mining framework is used as the generic framework for this
thesis. A graphical representation of this model is provided in Figure 1.1.
[Figure 1.1: The CRISP-DM cycle: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment, with Data at the center.]
Below is some more elaboration on the steps provided in the CRISP-DM methodology by
Shearer:
1. Business Understanding: Focuses on understanding the project objectives from a busi-
ness perspective, converting this knowledge into a data mining problem definition, and then
developing a preliminary plan designed to achieve the objectives.
2. Data Understanding: First collection of the data, after which familiarity with the data
is increased, data quality issues are identified, and first insights are discovered.
3. Data Preparation: Covers all activities to construct the final dataset.
4. Modeling: Selection of modeling techniques and calibrating the parameters of these tech-
niques to optimal values.
5. Evaluation: Review the model’s construction to be certain that it properly achieves the
business objectives.
6. Deployment: Organizing the gained knowledge in a presentable way such that the customer
can use it.
team as Poisson processes, hereby predicting the probability of a team winning a game [30, 24].
Other statistical approaches have studied the relationship between possession time and a team's
performance [29].
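A minimal sketch of the Poisson approach: assuming each team's goal count follows an independent Poisson distribution with a given mean (in practice these means would be estimated from historical data, which is beyond this excerpt), the win/draw/loss probabilities follow directly:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson(lam) distribution."""
    return lam ** k * exp(-lam) / factorial(k)

def match_probabilities(lam_home, lam_away, max_goals=10):
    """Win/draw/loss probabilities under independent Poisson goal counts."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away

# Hypothetical goal rates: the home team averages 1.6 goals, the away team 1.1.
p_home, p_draw, p_away = match_probabilities(1.6, 1.1)
```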
In machine learning, many complex systems have been proposed to predict the winner of
a match. An ensemble of k-NN predictors is proposed by Hoekstra [26]. Dobravec notes the
importance of non-expert-based features [15] and therefore proposes a method for predicting
match outcomes using latent features. Other attempts analyze the
density of network graphs in order to predict match outcomes [19].
The approaches taken in these papers all depend on data available before the match.
More insights might be generated by including data generated during a match. Kerr uses
such data to predict match outcomes [31]: he trains logistic
regression on features extracted during the entire match, e.g. possession time.
Lucey et al. [36], however, take a different approach. They determine different features (e.g.
distance to the goal) to determine the quality of individual goal scoring opportunities. They
call this metric the Expected Goal Value (EGV). The Expected Goal Value is determined by
applying logistic regression to the extracted features.
Similarly to Lucey et al., a quality metric is determined for individual goal scoring opportunities.
Contrary to Lucey et al., however, our approach ensures that the quality metric can be
interpreted as the probability that a goal scoring opportunity results in a goal. Therefore, it is
important that the predicted scores match reality accurately. Furthermore, the quality metric in
our approach is used to estimate match results.
This chapter translates the given business problem into a predictive analytics framework, including
a set of data mining tasks. It is important to make this translation in the early stages of this thesis,
since these choices determine the interpretation of the eventual results. First of all, the category
of data mining tasks to which the business problem belongs is discussed. Secondly, the
appropriate techniques of this category are discussed. Then, a technique (calibration) is introduced
to improve these techniques. Some of the characteristics of the model are then elaborated on:
the need for a range interval is discussed in Section 2.4, and the influence
of bias, variance, and the irreducible error and their implications are elaborated on. The
eventual interpretation of the obtained scores is discussed in Section 2.6. Finally, the chapter
ends with a conclusion.
• Discovering Patterns and Rules: Besides techniques which build models, there also
exist data mining techniques that are concerned with pattern detection. An example of such
techniques is the task of finding combinations of items that occur frequently in transaction
databases [20].
• Retrieval by Content: Consists of finding patterns in the data which are previously
defined by the user. This kind of task is most commonly used for text and image data sets.
The search engine Google, for example, uses this kind of retrieval method to locate documents
on the Web [6].
The objective of this thesis is, as stated in Section 1.1, to determine the quality of a goal scoring
opportunity. The quality of such a goal scoring opportunity is determined based on input variables.
Therefore, the problem of this thesis asks for predictive modeling techniques. More specifically, it
asks for a numerical output in which the quality of a goal scoring opportunity
is provided. One could therefore argue that regression techniques best suit this problem. As we
will see later in this chapter, however, there also exist methods by which classification techniques can
produce numerical output. Furthermore, the provided data set consists of binary target variables
(goal or no goal), and therefore classification techniques suit this problem best.
Other approaches could, however, still be useful with the current data set. Exploratory data
analysis and descriptive modeling could, for example, be used to get a better understanding
of the data.
2.2 Classification
Classification is the task of learning a target function f that maps each attribute set x to one of
the predefined class labels y [49]. In the case of the expected goal model the class labels are goal
or no goal. Therefore, yi ∈ {goal, no goal}.
In classification problems, no single classification algorithm exists which is better than all
other available algorithms on every problem [51]. One way to choose is to determine the best
classification algorithm for a given problem by cross-validation [32]. Using
cross-validation, multiple classification algorithms are trained and performance metrics are
calculated accordingly. Based on these metrics, the best model can then be selected. For this
thesis, four different algorithms are used: logistic regression, decision trees, random forest, and
AdaBoost. Implementations of these algorithms by scikit-learn are used [42]. The random forest
and AdaBoost algorithms are examples of ensemble learners. Ensemble methods use multiple
learning algorithms to obtain better predictions than could have been obtained from the individual
learning algorithms [44].
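A sketch of this model-selection step with scikit-learn, the library the thesis uses [42]. The synthetic, imbalanced data set below is only a stand-in for the real goal/no-goal feature matrix, and the exact model parameters are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the goal/no-goal feature matrix (goals are the minority class).
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0)

# The four algorithms considered in the thesis; AdaBoost uses decision trees by default.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Compare by cross-validated AUC and keep the best-scoring model.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
```

AUC is used as the scoring function here because, as discussed below, ranking opportunities correctly matters more than raw accuracy.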
Decision Tree is a classifier expressed as a recursive partition of the instance space [37].
Random Forest is a set of multiple decision trees where the predicted class is determined as
the mode of the trees' classes [25]. Random forests are able to correct for the poor generalization
of decision trees [22].
AdaBoost starts by fitting a classifier on the original data. Then, additional copies of the
classifier are fitted on the data, with the weights of incorrectly classified instances adjusted such
that the new classifiers focus on more difficult cases [18]. For this thesis, decision trees are used
as the underlying classifier for AdaBoost.
In order to determine the best of these algorithms, multiple performance metrics can be used
which all have their own benefit. Some of these metrics are computed with the help of the confusion
matrix (Table 2.1). These metrics are listed below [47].
Table 2.1: Confusion matrix

                                Actual Positive (Goal)   Actual Negative (Non-Goal)
Predicted Positive (Goal)       True Positive (TP)       False Positive (FP)
Predicted Negative (Non-Goal)   False Negative (FN)      True Negative (TN)

• Precision: TP / (TP + FP)
• Recall: TP / (TP + FN)
• F-score: 2 · Precision · Recall / (Precision + Recall)
• Area under the ROC Curve (AUC): 1/2 · (TP / (TP + FN) + TN / (TN + FP))
A typical way of evaluating classifiers is the 0-1 loss function, where an error of zero is assigned
to correctly predicted values and one to incorrect predictions. The objective
of this thesis, however, is not to predict all samples correctly but to score different goal scoring
opportunities. It is therefore more important to rank different scoring opportunities relative to
each other, which makes the Area Under the ROC Curve (AUC) the most suitable performance metric. The
AUC can be interpreted as the probability that the classifier assigns a higher score to a random
positive example than to a random negative example [4]. The AUC therefore gives insight into how
well the classifier ranks better scoring opportunities higher. Performance
metrics resulting from the 0-1 loss function still provide value, as they give insight
into how well the classifier assigns high scores to good scoring opportunities.
Therefore, the precision, recall, and F-score are also provided in the evaluation of classifiers.
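The confusion-matrix metrics listed above can be computed directly; a small sketch with invented counts (note that the standard F-score includes a factor 2, which the listed formula implies):

```python
def classification_metrics(tp, fp, fn, tn):
    """Metrics derived from the confusion matrix of Table 2.1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    # The AUC expression listed above: the average of the two class-wise accuracies.
    auc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    return precision, recall, f_score, auc

# Invented counts: 30 TP, 10 FP, 20 FN, 140 TN.
precision, recall, f_score, auc = classification_metrics(30, 10, 20, 140)
```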
2.3 Calibration
As previously mentioned in Section 2.1, classification outputs can be represented as probabilities.
This is most often done by computing class membership probabilities, which can be interpreted
as the confidence of a classifier that a sample belongs to a certain class. To obtain better
probability estimates, these confidences are re-calibrated. Two main techniques
exist to map model predictions to posterior probabilities: Platt calibration and isotonic
regression. Since we are dealing with a two-class problem, these methods can easily be applied
to the current problem.
Niculescu-Mizil and Caruana show that Platt scaling outperforms isotonic regression when the
data set is relatively small, while isotonic regression outperforms Platt scaling on larger data sets
(1000 samples or more) [40]. Since the data set provided consists of
more than 1000 samples, isotonic regression is used from here on.
The performance of the calibrated model differs from that of the non-calibrated
classifier. For the calibrated classifier, the main objective is to provide good estimates of
the actual probabilities. Its performance can therefore be determined
by comparing the calibrated probabilities to actual probabilities. Since no actual probabilities
are available, bins of similar calibrated probabilities are created; the actual ratio of goals in each
bin is then plotted against the mean predicted probability of that bin. This method is
similar to the method proposed by Niculescu-Mizil and Caruana [40]. Furthermore, the Brier
score is calculated [5]. This score corresponds in many ways to the mean squared
error and thus measures the error of the predicted probabilities.
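The calibration and Brier-score steps can be sketched with scikit-learn; the synthetic data set is only a stand-in for the real shot data, and the choice of random forest as the base classifier is an assumption for the example:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the goal/no-goal data set (>1000 samples, as in the thesis).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base classifier wrapped in isotonic calibration.
calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                    method="isotonic", cv=3)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_test)[:, 1]

# Brier score: mean squared error between predicted probabilities and outcomes.
brier = brier_score_loss(y_test, probs)
```

The reliability graph described above would then bin `probs` and compare the observed goal ratio per bin against the mean predicted probability.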
n-Nearest Neighbours finds the n nearest neighbours of a sample. The standard deviation
could then be computed over these n samples.
Clustering techniques group similar samples together. The within-cluster standard deviation
could then be computed as the standard deviation of the points in a cluster.
To compute the standard deviation using n-nearest neighbours, the nearest neighbours of each
individual sample would have to be computed, which would be very resource intensive. Using
clustering algorithms, however, only the standard deviation per cluster has to be computed.
Therefore, clustering is used instead of nearest neighbours. Gaussian mixture models are a
clustering technique commonly used for kernel density estimation; therefore, Gaussian mixture
models are used to calculate the standard deviation of clusters.
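A sketch of this clustering step with scikit-learn's `GaussianMixture`; the two-dimensional synthetic data and the number of components are assumptions for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for goal-scoring-opportunity feature vectors: two groups.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 2, (200, 2))])

# Fit a Gaussian mixture and assign each sample to a cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)

# Within-cluster standard deviation per component, one value per feature.
cluster_std = {k: X[labels == k].std(axis=0) for k in range(2)}
```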
Bias is the average distance between the actual value and the predicted value.
Variance is the deviation between a single prediction x and the average prediction x̄.
Irreducible Error (Noise) is the error term that cannot be reduced by any model.
Variance and bias are best explained graphically in Figure 2.1. Figure 2.1 shows that with high
bias (and low variance), the soccer balls land close to each other but not close to the target. With
high variance (and low bias), the mean of the points lies close to the target, but the individual
attempts do not hit it.
Now that the concepts of bias, variance & irreducible error are clear, the implications of these
concepts for the stated problem can be discussed. During this discussion, it is important to keep
the goal of this thesis in mind: analyzing goal scoring opportunities created by a given team with
given players.

[Figure 2.1: (a) Low bias, low variance; (b) High bias, low variance; (c) Low bias, high variance; (d) High bias, high variance.]
In each predictive modeling task, ideally one would create a model with zero bias and zero
variance. Relating such a model to Figure 2.1 would lead to a situation in which the target would
almost always be hit (Figure 2.1a), except for external factors influencing the shot (the irreducible
error). In terms of classification, the classifier would always be able to assign the proper class label
to each case. In most real world applications this is, however, impossible. Therefore, the goal is
to minimize the bias and variance of the predictive model.
In predictive modeling, the input of the model is one of the main aspects that determine the
bias, variance, and irreducible error of the model. To make this clear, let us go back to the
situation in Figure 2.1, in which a player aims at a target in a goal. Without any
knowledge of the situation, it is hard to determine the likelihood of the player hitting the target.
With prior knowledge, such as the distance of the player to the target, this would be somewhat
easier. One could therefore argue that collecting features of the player's situation could lead
to a reasonable estimate of the likelihood that the player hits the target.
Now let us consider the situation in which a player with poor shooting accuracy attempts to
score. Due to the player's characteristics, the outcome of the goal attempt, given the situation,
would fluctuate significantly for this player. The predictive model cannot anticipate these
fluctuations and would therefore have a higher error. The goal attempts of a player with
good shooting accuracy, however, fluctuate less and are therefore more predictable. A
predictive model trained only on players with good shooting accuracy would therefore have
more predictive power, so one could argue that the predictive model should be trained only on
such players.
But what if the original player with poor shooting skills tried to hit the target again?
Would the predictive model trained on the best players still be able to accurately determine the
likelihood of the player hitting the target? Intuitively, it would still provide some insight, since
the situation in which the players find themselves did not change. The likelihood of the poor
player hitting the target would, however, be significantly lower than that of the excellent
player. It therefore seems important to take the quality of the player into account as well.
Since lower-quality players are less predictable, including them would probably lead to a
less optimal model (higher variance). The practical value of a model trained only on the best
players would, however, be limited, since it would only be applicable to those players.
Now let us look at the situation in which the player, instead of aiming at a target, attempts
to score a goal. To make the situation more realistic, a goalkeeper stands in the goal, trying to
stop the goal attempt. In this case, the likelihood
of the player scoring depends not only on the situation and the player quality, but
also on the quality of the goalkeeper. With an excellent goalkeeper on the line, the likelihood
of the goal attempt being stopped is higher than with a poor goalkeeper on
the line. Therefore, the quality of the opposing goalkeeper is included in the model.
The same reasoning could be applied to the defenders who are trying to stop the attacker
from scoring. The quality of a goal scoring opportunity is, however, determined at the moment
the player attempts to score, and at this moment the positions of the defenders on the field are
already given. The only action the defenders can still perform is blocking
the goal attempt, and intuitively there seems to be no significant difference in player quality
regarding the blocking of a goal attempt. Therefore, it seems unnecessary to include defender
quality as input for the predictive model.
2.6 Interpretation
After calibration is applied to the classifier, the classifier scores can be interpreted as posterior
probabilities. Let pi be the probability that a goal is scored from a goal scoring opportunity i.
The goal scoring opportunities can then be modelled as a Bernoulli random variable yi ∼ Ber (pi ).
Then, the expected number of goals in a match of n goal scoring opportunities is equal to:
\[
\mathbb{E}[\#\text{Goals}] = \mathbb{E}\left[\sum_{i=1}^{n} y_i\right] = \sum_{i=1}^{n} \mathbb{E}[y_i] = \sum_{i=1}^{n} p_i \qquad (2.2)
\]
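Under this Bernoulli model, Equation (2.2) says that a team's expected goals are simply the sum of the calibrated scoring probabilities of its opportunities. A minimal sketch, with probability values invented for illustration:

```python
def expected_goals(probabilities):
    """E[#Goals] = sum of per-opportunity scoring probabilities (Equation 2.2)."""
    return sum(probabilities)

# Hypothetical calibrated probabilities for each team's opportunities in one match,
# e.g. a penalty plus three long-range shots for the home team.
home_xg = expected_goals([0.76, 0.12, 0.08, 0.04])
away_xg = expected_goals([0.31, 0.09])
```

Comparing `home_xg` and `away_xg` then yields the expected winner of the match.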
2.7 Conclusion
In this chapter, the choice of predictive modeling, and more specifically classification, was justified,
and different classification algorithms were discussed. These classification algorithms are
evaluated based on the Area Under the Curve (AUC); this performance metric ensures that better
goal scoring opportunities indeed score higher. Since the classification algorithms only provide
a confidence of belonging to a particular class, a calibration step is added. By calibrating the
classification algorithms, more realistic probabilities are obtained. Due to the background of the
problem, it is important to provide a range prediction alongside single point predictions. Finally,
the importance of adding player quality data to the input and the interpretation of the final
scores were discussed.
In order to apply the methods discussed in Chapter 2, a good understanding of the data is
required. Therefore, this chapter elaborates on the available data. First of all, the three different
data sources are discussed; for each source, the collection method, the general format of the
obtained data, and (potential) data quality issues are covered. After the introduction of the data
sources, this chapter shows how these data sources are combined to provide more valuable
insights. Finally, the chapter ends with a conclusion.
related events, however, would result in missing data. Currently, only cards (yellow and red) are
included as related events. The structure of the ORTEC API is provided in Figure 3.1.
[Figure 3.1: Structure of the ORTEC API.]
The events resulting from the ORTEC API are provided in JSON format. Since collecting all
the data takes time, the JSON data is stored in CSV files for later use.
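A minimal sketch of this fetch-once, store-as-CSV pattern. The real ORTEC API call is not specified in this excerpt, so the fetch step is left abstract as a callable, and the demo fetcher below is a stub:

```python
import csv
import os
import tempfile

def cache_events_to_csv(fetch_events, csv_path):
    """Return cached events from csv_path if present; otherwise fetch and store them.

    `fetch_events` stands in for the (slow) API call and must return a
    non-empty list of flat dicts.
    """
    if os.path.exists(csv_path):
        with open(csv_path, newline="") as f:
            return list(csv.DictReader(f))
    events = fetch_events()
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(events[0]))
        writer.writeheader()
        writer.writerows(events)
    return events

# Demo with a stubbed fetcher returning a single card event.
path = os.path.join(tempfile.gettempdir(), "ortec_events_demo.csv")
if os.path.exists(path):
    os.remove(path)
events = cache_events_to_csv(lambda: [{"minute": "12", "type": "card"}], path)
```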
perform tactical analyses, such as quantification of pass options, transition moments, crosses, and
ball pressure [27].
During matches, however, the RFID technology cannot be used due to legislation. Therefore,
during matches players are tracked with cameras. This, however, leads to quality issues,
which are addressed in Section 3.2.4.
Table 3.2: Different versions of marker names. In this table, the marker names are already in
lower case and spaces have been removed.
The web scraper starts from a base URL, in which the main competitions for which
data should be collected can be selected. The web scraper then selects a single team,
of which the individual players are scraped. This process is repeated until all the teams of all
the competitions are finished. Then, the web scraper continues the process for other seasons, if
specified by the user. The process is graphically represented by Figure 3.2.
[Figure 3.2: Flowchart of the web scraper: Start → Select season → Select team → Select player → Collect player attributes; repeat until all teams and all seasons are finished → End.]
The scraped data is stored in CSV files such that the data does not have to be collected every
time.
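The nested loop of the flowchart can be sketched as follows. The page-access functions are passed in as callables, since the scraped site's structure is not given in this chapter; the demo stubs are invented:

```python
def scrape_player_attributes(seasons, get_teams, get_players, get_attributes):
    """Season -> team -> player -> attributes, mirroring the scraper flowchart."""
    rows = []
    for season in seasons:                       # "All seasons finished?" loop
        for team in get_teams(season):           # "All teams finished?" loop
            for player in get_players(season, team):
                rows.append(get_attributes(season, team, player))
    return rows

# Demo with stubbed page-access functions.
rows = scrape_player_attributes(
    ["2015/2016"],
    lambda season: ["PSV"],
    lambda season, team: ["Player A", "Player B"],
    lambda season, team, player: {"season": season, "team": team, "player": player},
)
```

The resulting rows would then be written to CSV, as described above, so the data does not have to be collected every time.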
Table 3.5: Analysis of the player data quality (Source: Daily Mirror [14])

Passing
Name           Rating   Accuracy   Difference
Barry          85       86.8       0
Coutinho       85       80.6       6
Fernandinho    85       88.3       -3
Fletcher       85       89.1       -6
Milner         85       84         3
Nasri          85       91.5       -9
Oscar          85       83.5       4
van Persie     85       76.7       7
Allen          86       86.8       5
Carrick        86       88.6       0
Gerrard        86       86         7
Hazard         86       83.3       10
Yaya Toure     86       90.1       -2
Arteta         87       92.1       -2
Britton        87       90.3       0
Mata           87       88.6       3
Ozil           88       88         8
Silva          89       88.2       8

Tackling
Name           Rating   Success    Difference
Cahill         85       79.4       -7
Canas          85       58.3       5
Clichy         85       77.2       -6
Evra           85       68.4       3
Terry          85       62.1       4
Zabaleta       85       76.7       -4
Agger          86       74.1       3
Ivanovic       86       83.1       -4
Koscielny      86       75.9       1
Mertesacker    86       70.7       5
Cole           87       69.2       9
Ferdinand      87       90         -2
Medel          87       77.2       1
Vidic          89       75.9       5
Kompany        90       70.8       9
In conclusion, the results show that the rankings from EA Sports are questionable but the
results do not show that there are serious data quality issues. Therefore, the attributes ranked by
EA Sports are used in this thesis.
Since many player names occur and some might be similar. the number of player names from
one data source to which the player name is compared should be minimized. When mapping
the player names from the spatiotemporal data, the player names are therefore matched for each
individual match. By mapping the player names for each individual match, the maximum number
of player names of the tactical data to which the player name can be matched equals 28 (22
starting players and at most 3 substitutions for each team).
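This per-match matching can be sketched with standard fuzzy string matching from the Python standard library; the similarity cutoff of 0.6 is an illustrative choice, not a value taken from the thesis:

```python
from difflib import get_close_matches

def match_player_names(spatiotemporal_names, match_sheet):
    """Map each player name from the spatiotemporal data onto the (at most
    28) names on the tactical match sheet of the same match.

    Restricting the candidates to one match sheet keeps the lookup small
    and reduces the chance of matching a similar name of another player.
    """
    mapping = {}
    for name in spatiotemporal_names:
        # closest candidate on the match sheet, if any is similar enough
        candidates = get_close_matches(name, match_sheet, n=1, cutoff=0.6)
        mapping[name] = candidates[0] if candidates else None
    return mapping
```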
To limit the number of player names to be matched with the player data, the players are matched to the players of a particular team in a season. Since the team names are also inconsistent among the data sources, however, there might not always be a direct match. Mapping the team names is more difficult, since there are many teams with short names, and a string matcher tends to prefer short team names (a lower number of characters that can mismatch).
1 Merging of the data sets has been conducted in cooperation with Kees Hendriks, who performed his graduation
3.4.2 Timestamp
The timestamps of the tactical data and the spatiotemporal data are slightly different due to
different starting moments of the match. Therefore, the timestamps have to be modified. By
subtracting the start of the match, both data sources start the match at t = 0. By ensuring that both data sources start their matches at t = 0, timestamps can be matched immediately.
There is, however, one more difference between the tactical data and the spatiotemporal data.
The tactical data restarts counting the second half, where the spatiotemporal data continues
counting throughout the half time break. This can, however, be solved easily. By subtracting
the start of the second half, the second half also starts at t=0. Since this could (potentially)
lead to issues in data manipulations, later on, the choice is made to start the second half at sixty
minutes after the start of the game. The correct time stamp could, therefore, be determined with
equation 3.1
$$
t_{new} =
\begin{cases}
t - t_{start}, & \text{if first half} \\
t - t_{start\ second\ half} + 60 \cdot 60000, & \text{otherwise}
\end{cases}
\qquad (3.1)
$$
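Equation 3.1 translates directly into code. Timestamps are assumed to be in milliseconds, matching the 60 * 60000 offset:

```python
def normalize_timestamp(t, t_start, t_start_second_half, first_half):
    """Implement Equation 3.1: both halves are shifted so the match starts
    at t = 0, and the second half is offset to sixty minutes after the
    start of the game (60 * 60000 milliseconds)."""
    if first_half:
        return t - t_start
    return t - t_start_second_half + 60 * 60000
```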
Figure 3.3: Dimensions of the fields of the tactical data and the spatiotemporal data
There is, however, one more difference: the way the location of players on the field is measured. In the tactical data, teams always play from left to right; in the spatiotemporal data, teams play as they actually did during the match (one half from left to right and the other half from right to left). To overcome this issue, the goalkeeper (the player with the highest absolute x-coordinate) at the beginning of the match is selected first. Under the assumption that the goalkeeper is always on his own half, the direction of play of that team can be determined and the locations can be adapted accordingly.
Sometimes, however, the goalkeeper does not have a location at the beginning of the match. In that case, a striker could have the largest absolute x-coordinate and the wrong side of the field could be selected. To exclude these cases, the player with the largest mean absolute x-coordinate over each half is selected. Since this is the goalkeeper, the direction of play can be determined and the locations of the players can be modified. Equation 3.2 provides the conditional equation for the first half. Here, s_{i,t} is the location of player i at time t and n_i is the number of measurements for player i. When modifying the locations of the players during the second half, the only difference is t > 3300000 instead of t < 3300000.
$$
s_{i,t} =
\begin{cases}
s_{i,t}, & \text{if } \left|\min\limits_{0 \le i \le 10} \dfrac{\sum_{t=0}^{t<3300000} s_{i,t}}{n_i}\right| < \left|\max\limits_{0 \le i \le 10} \dfrac{\sum_{t=0}^{t<3300000} s_{i,t}}{n_i}\right| \\
-s_{i,t}, & \text{otherwise}
\end{cases}
\qquad (3.2)
$$
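The idea (use the extreme mean position to find the goalkeeper, then mirror if the team defends the wrong side) can be sketched as follows. This assumes x is measured from the midfield line and that playing left to right means the own goal lies at negative x; the sign convention in Equation 3.2 may differ:

```python
def normalize_direction(positions):
    """Mirror x-coordinates so the team plays left to right.

    positions maps player index -> list of x-coordinates during one half.
    The player with the largest mean absolute x-coordinate is taken to be
    the goalkeeper; if he sits on the positive side, the team defends the
    right goal and all coordinates are flipped.
    """
    means = {i: sum(xs) / len(xs) for i, xs in positions.items() if xs}
    lo, hi = min(means.values()), max(means.values())
    if abs(lo) >= abs(hi):          # goalkeeper already at negative x: keep
        return positions
    return {i: [-x for x in xs] for i, xs in positions.items()}
```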
After the locations of the players have been adjusted such that they always play from left to right, the difference in dimensions can be accounted for. The quality of the locations of both data sources has to be determined in order to select one. Figure 3.4 shows the locations of the tactical data and the spatiotemporal data for three goal scoring opportunities in a match.
(a) The locations from which the goals are scored according to the tactical data (yellow) and the spatiotemporal data (red); (b) a video snapshot of the first goal; (c) a video snapshot of the second goal; (d) a video snapshot of the third goal
Figure 3.4: Comparison of the spatial data from ORTEC with the Inmotio data
As can be seen from the data, the locations are quite similar. The main difference in location occurs at moment (c), where the spatiotemporal data seems to be better. The spatiotemporal data, however, has issues with tracking players, which can lead to missing data or values that are far off (Section 3.2.4). Furthermore, more data is available from the tactical data source. Therefore, the locations of the players are extracted from the tactical data source.
An important note to make here is that the tactical data does not provide data about surrounding players; data about surrounding players is therefore still extracted from the spatiotemporal data. The location of the player attempting to score and the locations of his surrounding players thus come from different data sources. This could lead to small differences when comparing these locations. Figure 3.4, however, already showed that these locations are very similar.
3.5 Conclusion
In this chapter, the three different data sources are introduced. With the introduction of the data
sources, the data collection methods, the general format, and (potential) data quality issues are
discussed. Finally, this chapter elaborated on the methods applied to combine the data sources. In order to combine them, adjustments to player names, timestamps, and locations had to be made.
Modeling
This chapter elaborates on the modeling steps which have to be taken to actually obtain the
predictive model. First of all, features are extracted from the three data sources. These features
are then prepared to ensure that the model is trained properly. Furthermore, the class imbalance
is discussed and a solution is proposed.
Spatial Features
One of the most important aspects of the ORTEC data is the location on the field. The location on the field from which an attacker is trying to score seems to be an important variable in determining the probability of a goal from a given goal scoring opportunity. There are various ways in which the location on the field can be included in a model. Lucey et al., for example, determine a probability distribution of shot locations resulting in a goal [36]. In this thesis, however, a different approach is used. From the (x, y) coordinates, two features are determined: (1) the distance to the goal and (2) the angle to the goal. Three different cases can be distinguished in order to calculate these features: the attacker can be standing (1) on the left-hand side of the goal, (2) right in front of the goal, or (3) on the right-hand side of the goal. The formulas for all these cases are presented in Table 4.1. A graphical representation of the three scenarios is given in Figure 4.1.
where $\text{angle}_1 = \tan^{-1}\!\left(\dfrac{GW/2 + y}{\Delta x}\right)$ and $\Delta x = \dfrac{FW}{2} - x$.
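Both features can be sketched in a few lines. The field dimensions below are assumed values, and collapsing the three scenarios of Table 4.1 into one formula (angle measured towards the nearest post, zero when level with the goal mouth) is a simplification of the per-case formulas:

```python
import math

FIELD_LENGTH = 105.0   # FW in the formula above (assumed, in metres)
GOAL_WIDTH = 7.32      # GW in the formula above (assumed, in metres)

def goal_features(x, y):
    """Distance and angle to the goal from position (x, y), with the
    origin at the centre of the field and the goal centred on the
    x-axis at x = FIELD_LENGTH / 2."""
    dx = FIELD_LENGTH / 2 - x
    distance = math.hypot(dx, y)
    if abs(y) <= GOAL_WIDTH / 2:           # scenario 2: in front of the goal
        angle = 0.0
    else:                                   # scenarios 1 and 3: beside the goal
        nearest_post = math.copysign(GOAL_WIDTH / 2, y)
        angle = math.atan2(abs(y - nearest_post), dx)
    return distance, angle
```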
Context
Besides the location of the goal attempt, the ORTEC data also includes information about the
context of a goal attempt. Like Lucey et al. [36], six different contexts are used: open play, counter, corner, penalty, direct free kick, and indirect free kick.
Table 4.1: Formula to calculate the distance to the goal and the angle to the goal
(a) Scenario 1: Left side (b) Scenario 2: In front (c) Scenario 3: Right side
Figure 4.1: Different scenarios for which the distance to the goal and the angle to the goal have
to be computed
• Open play: Goal attempt created from a series of passes on the opposition’s half of the
field.
• Corner: Set piece from the corner of the field when the ball passed the goal line, is not counted as a goal, and was last touched by a defender.
• Penalty: Set piece after a foul by the opposition in its own box. A penalty is always taken 11 meters from the goal.
• Direct free kick: Set piece after a foul of the opposition outside the own box which is
taken directly on goal.
• Indirect free kick: Set piece after a foul of the opposition outside the own box which is
not taken directly on goal.
Now that the different contexts are defined, they can be extracted from the data. For both penalties and direct free kicks, this is straightforward, since they are both directly encoded by ORTEC. For the other contexts, however, some more work has to be done. In case the context of the goal scoring opportunity is not one of those, we have to look at previous events in the data. Here, we look at the last fifteen seconds before the event. It can, however, occur that there have not been any events during the last fifteen seconds, e.g., when there was an injury treatment. In this case, the last three events are used.
For the previous events, the first of the following is used:
• If the event was an indirect free kick: context → Indirect free kick
• If the event was possession gain on the first 15 meters of the opposition’s half: context →
Counter
It is, however, very difficult to define open-play situations. Therefore, the context is encoded as open play when none of the above rules applies to the selected previous events.
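The rules above can be sketched as follows. The event-dict format (a type string, time in seconds, and x as metres past the halfway line) is an assumed encoding, not the actual ORTEC one, and corner extraction is omitted since its rule is not covered here:

```python
def extract_context(event, previous_events):
    """Assign a context label to a goal scoring opportunity."""
    # penalties and direct free kicks are encoded directly
    if event["type"] in ("penalty", "direct free kick"):
        return event["type"]

    # events in the fifteen seconds before the opportunity ...
    window = [e for e in previous_events
              if 0 <= event["time"] - e["time"] <= 15]
    if not window:                  # ... or the last three events otherwise
        window = previous_events[-3:]

    for prev in window:
        if prev["type"] == "indirect free kick":
            return "indirect free kick"
        # possession gained in the first 15 metres of the opposition's half
        if prev["type"] == "possession gain" and 0 <= prev.get("x", -1) <= 15:
            return "counter"
    return "open play"              # default when no rule fires
```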
For some contexts, the context itself might not be accurate enough. Therefore, additional information about the origin of a chance is created from the data. A distinction is made between a dribble, rebound, cross pass, long pass, and possession gain.
• Dribble: The attacker dribbled and thus took on a defender just before shooting.
• Rebound: The attacker picks up the ball after it rebounds off the bar, is blocked by a defender, or is saved by the goalkeeper.
• Cross pass: The attacker attempts to score from a cross given from the side of the field.
• Long pass: A scoring opportunity created after a pass longer than 30 meters.
• Possession gain: The attacker gains possession by taking the ball from a defender or the
goalkeeper.
To extract these features from the ORTEC data, the last 10 events before the goal scoring
opportunity are investigated. The first from the above events which occurred before the goal
scoring opportunity is selected and the corresponding feature is extracted.
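This selection can be sketched as follows, assuming a most-recent-first scan of the last ten events and illustrative event-type names:

```python
# Origin categories; the event-type strings are assumed names.
ORIGIN_TYPES = ("dribble", "rebound", "cross pass", "long pass", "possession gain")

def extract_origin(previous_events):
    """Return the origin of a goal scoring opportunity: the first event,
    scanning the last ten events from most recent backwards, whose type
    is one of the origin categories. None when no origin is recognized."""
    for event in reversed(previous_events[-10:]):
        if event["type"] in ORIGIN_TYPES:
            return event["type"]
    return None
```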
Current Score
Another feature that can be extracted from the ORTEC data is the current score line. The current score line can represent the defensiveness of the opposing team. The effect of this feature is best explained by some scenarios. Teams that are behind, for example, are more likely to attack, which means that more of their players are attacking. This automatically results in fewer players defending their own goal, which leads to fewer defenders and thus more space for the opposition. Furthermore, more attackers lead to more passing options and thus more choices for both attacker and defender, which could affect the quality of goal scoring opportunities.
Degree of Difficulty
Even when all the above features are included in the model, the quality of goal scoring opportunities can still not be defined properly. Let's take, for example, a cross from open play where the team of the attacker is 1-0 behind. Intuitively, the quality of the goal scoring opportunity is not the same if the ball is above the ground compared to the same situation in which the ball is on the ground. Since there is no z-value of the ball at the moment the attacker shoots, the best we can do is to select the high attribute as encoded by ORTEC. This results in a categorical variable indicating whether the ball is high or not.
Furthermore, the quality of the same goal scoring opportunity depends on the part of the body with which the attacker attempts to score. If the attacker attempts to score with his head, for example, aiming the ball is more difficult than when he attempts to score with his foot. Besides the higher accuracy, attempting to score with the foot also results in more power and thus less time for the goalkeeper to respond to the attempt.
Table 4.2: Selected attributes of the player and the Goal Keeper from the player data
Table 4.2 shows different player attributes of the players from the player data. Not all attributes of a player are, however, useful in a particular situation. Therefore, different scenarios are created for each of the player attributes. An overview of the player attributes used as features in the different scenarios is provided in Table 4.3.
For the goalkeeper, however, it is difficult to determine the importance of the attributes in a given situation. At the time of the shot, for example, the location of the ball relative to the goalkeeper when it passes him is not yet known. Therefore, it is impossible to decide whether the goalkeeper should dive or not (and thus whether diving is important in this situation). Therefore, the mean of the selected goalkeeper attributes is used as a feature of the goalkeeper.
to save (block) ball, defends a larger area than the defender. Therefore, the number of attackers
and the number of defenders between the goal and the attacker are extracted as features.
Whether or not the goalkeeper is standing between the attacker and the goal could be included
as a categorical variable as well. Not all the situations in which a goalkeeper is between the
attacker and the goal are, however, the same. In situations where the goalkeeper is very close to
the attacker, it is very difficult to shoot the ball past the goalkeeper. On the other hand, when
the goalkeeper is standing far from the attacker, the goalkeeper has more time to react to the goal
attempt and is thus more likely to stop the goal attempt. Since the distance of the goalkeeper to
the attacker seems to play an important role, the choice is made to include the Euclidean distance
of the goalkeeper to the attacker as a feature instead of whether or not the goalkeeper is in line
with the attacker and the goal. A special case exists, however, when the goalkeeper is not in line
with the goal and the attacker. In this case, the maximum distance of the goalkeeper to the goal
from the whole data is used as this distance.
The same logic as for the goalkeeper can be applied to the defender. On the one hand, it is harder to shoot past the defender when the defender is standing close; on the other hand, the defender has more time to respond to the goal attempt when he is standing further away.
Therefore, besides the number of defenders standing in line of the goal, the Euclidean distance of
the defender to the attacker is also extracted as a feature.
The features derived from the spatiotemporal data are visualized in Figure 4.2.
Figure 4.2: Visualization of the features derived from the spatiotemporal data.
The numerical features are scaled using min-max normalization:

$$
X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}} \cdot (max - min) + min \qquad (4.1)
$$
Not all features, however, have a natural order. Take, for example, the feature context: it is difficult to say whether a corner is necessarily better than a free kick. Such variables are so-called categorical variables. These variables are cast to dummy variables so that they can be included in the classification algorithms. This would, for example, lead to one variable indicating whether the opportunity is a corner (1) or not (0) and another variable indicating whether it is a free kick (1) or not (0). An overview of the categorical variables and the numerical variables is provided in Table 4.4.
Numerical Categorical
Dist to goal Number of attackers in line Part of body
Angle to goal Number of defenders in line Originates from
Player quality Distance nearest defender in line Current score line
Goal keeper quality Distance goal keeper Context
High
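Both preparation steps, Equation 4.1 and the casting to dummy variables, can be sketched in a few lines; the feature values in the example are illustrative:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Equation 4.1: scale a numerical feature to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def to_dummies(values):
    """Cast a categorical feature to dummy (one-hot) variables, e.g. one
    variable per context saying whether it is a corner (1) or not (0)."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]
```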
• Balancing
• Ensemble learners
For this thesis, a combination of balancing (under-sampling the majority class) and generating
synthetic instances is used. Chawla et al. show that this is a good method to deal with imbalanced
data [10]. Both methods are briefly discussed below.
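The combination can be sketched as below. SMOTE as described by Chawla et al. [10] interpolates between a minority sample and one of its k nearest minority neighbours; this sketch simplifies that to a randomly chosen minority neighbour and pairs it with random under-sampling of the majority class. In practice a library such as imbalanced-learn would be used instead:

```python
import random

def balance(majority, minority, seed=0):
    """Under-sample the majority class and add SMOTE-style synthetic
    minority samples until both classes reach the same target size.

    Samples are tuples of numerical features; each synthetic instance is
    a random interpolation between two minority samples.
    """
    rng = random.Random(seed)
    target = (len(majority) + len(minority)) // 2
    under = rng.sample(majority, target)            # under-sample majority

    synthetic = []
    while len(minority) + len(synthetic) < target:  # over-sample minority
        a, b = rng.sample(minority, 2)
        gap = rng.random()                          # interpolation factor
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return under, minority + synthetic
```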
4.4 Conclusion
In this chapter, the features extracted from the different data sources are provided. An overview
of these features is provided in Table 4.5. Furthermore, the preparation of these features is
discussed. Here, the distinction was made between the preparation for numerical features using
min-max normalization and the casting of categorical features to dummy variables. Finally, a method for solving the class-imbalance problem is proposed in which a combination of under- and over-sampling is used.
Evaluation
This chapter elaborates on the performance evaluation of the models provided in Chapter 4.
Multiple techniques are used in this chapter to evaluate the performance of the predictive model.
First of all, the performance metrics for predictive modeling are discussed in Section 5.1. Secondly,
the performance of the calibration step is evaluated in Section 5.2.
The true business value of the models is, however, generated by determining the likelihood of a
goal scoring opportunity. The generic performance metrics can only determine the performance of
the model based on the data itself. Therefore, to further evaluate the performance of the model,
the performance of the model is determined by conducting an eye test with a business expert.
The goal of this thesis is to explain match results based on expected goals. Therefore, the
relation to the expected goals and the match outcomes is evaluated in Section 5.4. Finally, the
chapter ends with a conclusion.
Figure 5.1: Visualization of the model (parameter) selection and estimation of generalization
performance of the selected (final) model
Table 5.2 shows that the Random Forest classifier outperforms, or at least equals, the other
classification algorithms on all performance metrics. Furthermore, the table shows that splitting the data for different contexts does not lead to a superior classifier.
Table 5.3 shows that, after calibration, the performance metrics decrease. Especially the recall and F-score are significantly lower than before calibration. The drop in recall can be explained by the definition of recall; the drop in F-score then results from the lower recall.
In Section 2.2, recall was defined as the number of correctly predicted goals divided by the number of actual goals. Since the scores resulting from the calibrated classifier are much lower, fewer goal scoring opportunities are predicted to result in a goal. This also corresponds to the real-life situation, where only a few goal scoring opportunities result in a goal more than half of the time. Intuitively, it is therefore correct that the recall is low.
(Figure 5.2: calibration plots with Brier scores 0.0817, 0.0802, and 0.0813; fraction of positives against predicted probability.)
As can be seen from Figure 5.2, the reliability graphs all follow the optimal line closely. This indicates that the predicted values follow the actual values. Therefore, the calibrated class membership scores can be interpreted as probabilities.
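The Brier scores and the points of a reliability graph can be computed as follows; this is a generic sketch with equal-width bins, not necessarily the exact binning used for Figure 5.2:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between the predicted probability and the
    actual outcome (1 = goal, 0 = no goal); lower is better."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def reliability_curve(probs, outcomes, n_bins=10):
    """Bin the predictions and return, per non-empty bin, the mean
    predicted probability and the fraction of positives: the points
    plotted in a calibration/reliability graph."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # clamp p = 1.0 into last bin
        bins[idx].append((p, o))
    return [(sum(p for p, _ in b) / len(b), sum(o for _, o in b) / len(b))
            for b in bins if b]
```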
Table 5.4: Features and posterior probabilities for the events 1-4
Table 5.5: Features and posterior probabilities for the events 5-8
Table 5.6: Evaluation of match outcomes according to the expected goals model
What stands out from Table 5.6 is that in only 1366 of the 5020 matches, the exact score of the match was predicted based on the expected goals. If, however, a one-goal difference is accepted, 3443 of the 5020 matches have correctly predicted scores. Therefore, it seems that the expected goals model is, in most cases, almost correct. The MSE Match strengthens this statement: it shows that the average MSE of the result of a match is 2.366. The predicted goal difference between both teams therefore differs on average √2.366 ≈ 1.538 goals from the actual difference, which means that many matches would be in the range of only one goal difference.
Furthermore, Table 5.6 shows that the results of match outcomes are not very different across leagues or seasons. The only exceptions are the 1. Bundesliga and the Barclays Premier League, where too few matches were evaluated. In the other cases, in about one out of four matches the actual score is predicted correctly, and the number of matches in which the score was at most one goal off is about 2.5 times as high. Furthermore, the MSE values of the number of goals scored per team and of the difference in goals scored do not differ much from the mean. Therefore, one could conclude that the expected goals model generalizes across different leagues and different seasons. In further analyses, results of different leagues and seasons can, therefore, be aggregated.
So far, only the exact results have been examined. More interesting, perhaps, is how often the expected goals model predicted the correct winner. This is given by the number of correct results in Table 5.6. Obviously, the number of correctly predicted match outcomes is higher than the number of correctly predicted scores. What stands out, however, is that the number of correctly predicted match outcomes is not close to the number of scores predicted correctly when a one-goal difference was allowed. This shows that in games where the model is one goal off, this one goal often also changes the result of the match. To evaluate in which cases the one-goal difference most often influences the result,
the problem of predicting the winner of a match is defined as a three class problem where either
Team 1 wins, Team 2 wins or the game ends in a draw. The confusion matrix of the three-class
problem is provided in Table 5.7.
Table 5.7: Confusion matrix of the three class problem of predicting the winner of a match
Actual
Win 1 Draw Win 2 Total
Win 1 1079 329 210 1618
Predicted Draw 599 524 591 1714
Win 2 220 362 1106 1688
Total 1898 1215 1907 5020
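The figures in Table 5.7 can be checked directly: 1079 + 524 + 1106 = 2709 of the 5020 outcomes lie on the diagonal (an accuracy of about 54%), and the precision of the predicted draws is only 524/1714 ≈ 0.31. A minimal sketch:

```python
# Confusion matrix from Table 5.7 (rows = predicted, columns = actual).
LABELS = ("Win 1", "Draw", "Win 2")
CONFUSION = [
    [1079, 329, 210],    # predicted Win 1
    [599, 524, 591],     # predicted Draw
    [220, 362, 1106],    # predicted Win 2
]

def accuracy(cm):
    """Fraction of matches whose outcome was predicted correctly."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

def precision(cm, i):
    """Of the matches predicted as class i, the fraction that truly were."""
    return cm[i][i] / sum(cm[i])
```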
Table 5.7 shows that most of the incorrectly classified match outcomes originate from predicted draws. Predicted draws are, most likely, games that were very tight. Table 5.6 already showed that in many cases the model was only one goal off. In tight games, being one goal off means that the result of the match is predicted incorrectly. This was most likely the case for the predicted draws.
To dive even further into the predicted match outcomes, the most common actual and predicted match results are provided in Table 5.8.
Table 5.8: Most common predicted and actual match results

                           Actual
 Predicted   2-0  2-1  3-0  2-2  0-0  1-1  1-0  3-1  Total
 1-1         193  222   68   61  170  216  374   62   1469
 2-1         182  201  121   53   41   86  159  131   1189
 1-2          37   85   10   44   26   97   85   33    476
 3-1          48   46   42   16    7   20   15   42    382
 2-2          21   50    9   27    3   29   24   31    278
 2-0          71   16   36    0   11   20   38   23    259
 1-0          36   19   12    1   47   13   75    6    221
 3-0          21    9   20    0    1    4   13   11    126
 Total       630  705  347  246  363  553  840  372   5020
Table 5.7 already showed that the expected goals model is most often incorrect when it predicts a draw. Therefore, it is most interesting to look at the predicted draws in Table 5.8. Let's take, for example, the predicted score of 1-1. What stands out is that in 596 of the 1469 cases, the actual result was a close win with at most one goal difference (374 cases of 1-0 and 222 cases of 2-1). Furthermore, in 447 of the 1469 cases, the predicted outcome (a draw) was correct, which is a similar ratio as in Table 5.7. Of these 447 cases, however, the model predicted another score in 231 of the cases. This could indicate that the number of expected goals at which the expected goals model assigns a goal to a team has to be tweaked. The value at which the model assigns a goal to a team is called the threshold. The effect of the threshold value on the predicted results is provided in Figure 5.4.
Figure 5.4 shows that the highest number of correctly predicted match results is obtained with a threshold of around 0.2. In that case, the number of correctly predicted wins is very high, but the number of correctly predicted draws is very low. Since the number of correctly predicted draws was already low, the choice is made to stick with a threshold value of 0.5.
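The thresholding mechanism is not fully specified here; one plausible interpretation, in which the threshold applies to the fractional part of a team's expected goals (so 0.5 corresponds to ordinary rounding), can be sketched as:

```python
import math

def goals_from_expected(expected_goals, threshold=0.5):
    """Turn a team's expected goals into a predicted integer score: the
    fractional part only counts as an extra goal when it reaches the
    threshold. This is an assumed reading of the threshold described in
    the text, not a mechanism taken verbatim from the thesis."""
    whole = math.floor(expected_goals)
    frac = expected_goals - whole
    return whole + (1 if frac >= threshold else 0)
```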
5.5 Conclusion
In this chapter, the performance of the predictive model is discussed. First, the performance metrics of the different classification algorithms were evaluated. The random forest classifier clearly outperforms the other classification algorithms. Then, the performance of the calibration was determined. Reliability graphs showed that the predicted scores follow the optimal line closely. Therefore, the conclusion could be drawn that the calibrated class membership scores can be interpreted as actual probabilities.
To determine to which extent the proposed expected goals model fits the business problem, the scores for individual goal attempts were evaluated. Here, it was seen that the confidence intervals are very wide; therefore, no valid conclusions about a single goal attempt can be made. Finally, the match outcomes as predicted by the expected goals model were determined and evaluated against the actual results of matches. The expected goals model is limited in predicting the match result, especially in tight games.
The main goal of this thesis is to evaluate matches over a period of time to base strategic decisions
on. This is, however, only a part of the possible applications of the expected goals as introduced
in this thesis. The different applications for soccer clubs can be classified into three different
categories: analysis of a period of time, a single match, and a particular player. These three
different categories are discussed in this chapter.
Due to confidentiality, data of PSV cannot be shared publicly. Therefore, the analyses in this
chapter are performed on different soccer clubs. More specifically, the soccer club FC Barcelona
is used for this case study. The visualizations, however, are similar to the case of PSV.
(Two panels, one per season (2014-2015 and 2015-2016), showing per match the expected goals of fc barcelona, of the opponent, and the difference between the two, for all FC Barcelona matches of that season.)
Figure 6.1: Expected goals by and conceded by FC Barcelona
For this visualization, the pre-attentive attribute color is used to make a clear distinction be-
tween the selected team, the opposing team and the difference between these different information
types.
Another visualization of the performance over time is in the form of a league table. The
league table is often used to present the ranking of teams during the league. This league table,
however, can also be created with expected match outcomes. The expected league table could
then be compared to the actual league table to check the expected performance over time versus
the actual performance. The league table, however, does not provide insights into which games were expected to be lost; Figure 6.1 could provide insights into individual matches. The expected league table of the Primera Division at the end of season 2015/2016 is provided in Table 6.1.
The expected number of goals of the players in a league can be used to identify players who did not have an exceptional season in terms of actual goals but did have a good season in terms of expected goals. Identifying these players could provide a good way of finding good players who are relatively cheap compared to the more popular players. Figure 6.2 shows a way of identifying these players.
(Figure 6.2: scatter plot of actual goals against expected goals, both axes 0-50, with labeled players including Luis Suarez, Cristiano Ronaldo, Lionel Messi, Karim Benzema, Antoine Griezmann, Gareth Bale, and Aritz Aduriz.)
The Expected Goals over a season are plotted here against the Actual Goals scored during that same season. In order to easily see the difference between the distributions of the Expected Goals and the Actual Goals, a Q-Q plot is used [28]. As one would expect, the players roughly follow the diagonal, which means that the Expected Goals of the players come close to their actual goals. In this plot, one more visualization technique is used: the size of the circles (one of the pre-attentive attributes [17]) displays the number of goal scoring opportunities that the players needed to obtain their Expected Goals and Actual Goals. Furthermore, the color red is used to highlight the players of a particular team (in this case FC Barcelona).
(Two rows of panels: expected and actual goals per 15-minute time period, shown for all matches and as the mean per time period.)
Figure 6.3: Evaluation of effectiveness over time periods during a particular season
(Two panels for the match athletic bilbao-fc barcelona: the score and expected goals over time, and the expected goals of the individual players Lionel Messi, Luis Suarez, Ibai Gomez, Andres Iniesta, Sergi Roberto, Aritz Aduriz, Markel Susaeta, and Aymeric Laporte.)
(a) The progression of one match (b) Expected goals of all players during one match
Table 6.2 shows that Lionel Messi scores high on the attributes ball control, dribbling, and acceleration. The similar players, selected by the Nearest Neighbour algorithm, score high on these attributes as well.
The Expected Goals and the actual goals of these similar players are plotted with the use of a
Q-Q plot in Figure 6.5. Similar to Figure 6.2, using the Q-Q plot [28], one can determine which players perform better or worse than expected. If a player performs as expected, he will lie on the diagonal of the plot.
To ensure that the selected player, in this case Lionel Messi, easily stands out from the others, the pre-attentive attribute color is used [17]. Furthermore, the size of the circles is used to show the similarity between the players. The similarity score results from the Nearest Neighbour algorithm.
(Figure 6.5: actual goals plotted against expected goals, both axes 0-40, for Lionel Messi and the players similar to him.)
Team                 Player                Matches   Exp. Success   Success
FC Bayern Munich     Arjen Robben            2.0          0            2
Juventus             Paulo Dybala            5.0          1            1
Chelsea              Eden Hazard             8.0          1            0
FC Barcelona         Neymar Jr.              0.0          0            0
Manchester City      Sergio Aguero           8.0          4            2
FC Bayern Munich     Franck Ribery           6.0          1            0
SL Benfica           Jonas Oliveira          0.0          0            0
Napoli               Lorenzo Insigne         0.0          0            0
Sevilla FC           Yevgen Konoplyanka      0.0          6            6
Real Sociedad        Carlos Vela            29.0          6            5
Atletico Madrid      Antoine Griezmann      44.0         22           31
Manchester United    Juan Garcia             0.0          1            1
Borussia Dortmund    Marco Reus              0.0          0            0
Udinese              Antonio Natale          0.0          0            0
VfL Wolfsburg        Julian Draxler          7.0          1            3
[Figure: bar charts of the expected success (Exp Success) per FC Barcelona match during a season, over matches such as fc barcelona-valencia cf, real madrid-fc barcelona, and espanyol-fc barcelona]

... get the expected goals. In order to provide this kind of information, the ratio of the
expected goals of a player is represented as the expected goals per goal attempt. The expected
goals per ...

To allow for easy use of the visualizations described in this chapter, a Graphical User Interface
(GUI) is created 1. The GUI allows the user to easily switch between different leagues, seasons,
teams, matches, and players. The visualizations can then be viewed in the GUI and stored to pdf
for later use. Furthermore, a multi-paged report can be generated in pdf format for a season, a
match, or a player. The report, in combination with the GUI, allows the technical staff to share
these insights.

1 The GUI is created in cooperation with Kees Hendriks, another student graduating at PSV. The different tabs
at the top of the GUI represent different projects performed at PSV. Therefore, other students can easily add their
projects [23].

CHAPTER 6. CASE STUDY: FC BARCELONA
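The per-attempt representation described above can be sketched as a simple ratio; the function name and the shot values below are illustrative assumptions, not the thesis implementation:

```python
def expected_goals_per_attempt(shot_xg: list[float]) -> float:
    """Average expected-goal value over a player's goal attempts."""
    if not shot_xg:
        return 0.0
    return sum(shot_xg) / len(shot_xg)

# Illustrative shot qualities for one player over a season.
shots = [0.08, 0.31, 0.12, 0.65, 0.04]
print(round(expected_goals_per_attempt(shots), 3))  # → 0.24
```

This normalizes out playing time and shot volume, so players with very different numbers of attempts can be compared on shot quality alone.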
Figure 6.10: Screenshot of the Graphical User Interface
Chapter 7
Conclusion
In this chapter, the thesis is concluded. First, the main contributions of the thesis are
discussed. Furthermore, limitations of the proposed model and future work are discussed.

... on reducing the variance between similar scoring opportunities. More data becomes available
over time as more matches are played, and more data could reduce the variance of the model.
Furthermore, feature selection could be performed to retain only the most important features.
Even more insights into match results could be obtained by adding the actions before the actual
goal attempt. This could be used to determine the best possible action for a given player at a
given moment. By doing this, crosses which did not get touched would also be included in the
expected goals model. Furthermore, choices made by individual players could be evaluated, which
could then be used by trainers for training purposes. An example of a situation in which the
best possible decision could be determined is provided in Figure 7.1.
In Figure 7.1, the player has five intuitive options (four passes and a goal attempt). Currently,
the expected goals model is only able to determine the probability that the goal attempt will
result in a goal. In other words, the expected goals model can determine the quality of the choice
to shoot on goal; the model is, namely, trained on data in which a player already chose to shoot
on goal. Passing the ball to one of his teammates could, however, lead to a higher expected goals
value. Intuitively, when pass 1 in Figure 7.1 succeeds, a goal scoring opportunity is created for
the receiving player. Since it seems quite likely that this pass will succeed, this intuitively
seems to be the best decision.
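Such a decision could, under strong assumptions, be scored by combining the success probability of an action with the expected goals of the attempt it would create. This is not the thesis model (which only values attempts that were actually taken); every probability and expected-goal value below is invented for illustration:

```python
# Score each option as P(action succeeds) * xG of the resulting attempt.
# All numbers are illustrative assumptions.
options = {
    "pass 1": (0.85, 0.40),   # (P(success), xG of resulting attempt)
    "pass 2": (0.60, 0.25),
    "pass 3": (0.40, 0.30),
    "pass 4": (0.90, 0.05),
    "shoot":  (1.00, 0.07),   # shooting "succeeds" by definition
}

def option_value(p_success: float, resulting_xg: float) -> float:
    """Expected goals contributed by choosing this option."""
    return p_success * resulting_xg

best = max(options, key=lambda o: option_value(*options[o]))
print(best)  # → pass 1
```

Under these invented numbers the safe, high-quality pass dominates the low-quality shot, matching the intuition described above.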
Creating a model which determines the best decision at a given time, however, leads to new
challenges. First of all, the assumption that players are not moving is no longer reasonable. Since
players could move in all possible directions, it is hard to predict where they are going. If, however,
only one action is investigated (as in Figure 7.1), the positions of the players could be estimated
from their current speed, acceleration, and direction of movement. The speed and acceleration are
already given in the spatiotemporal data; the direction of movement can be determined by, for
example, looking at the movement over the last 500 milliseconds. Furthermore, the
probability of success has to be predicted for multiple types of events. Approaches similar to the
one presented in this thesis could, however, be followed to determine the probability of success for
other events.
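The position extrapolation described above amounts to simple straight-line kinematics; a minimal sketch (field coordinates in metres, heading in radians, all sample values invented):

```python
import math

def movement_direction(prev_xy: tuple, curr_xy: tuple) -> float:
    """Heading estimated from the displacement over the last sample window
    (e.g. the last 500 milliseconds of spatiotemporal data)."""
    return math.atan2(curr_xy[1] - prev_xy[1], curr_xy[0] - prev_xy[0])

def predict_position(x: float, y: float, speed: float, accel: float,
                     direction_rad: float, dt: float) -> tuple:
    """Constant-acceleration extrapolation along the movement direction."""
    dist = speed * dt + 0.5 * accel * dt * dt
    return (x + dist * math.cos(direction_rad),
            y + dist * math.sin(direction_rad))

# Player at (10, 5) m, running 6 m/s, accelerating 1 m/s^2, heading along x.
print(predict_position(10.0, 5.0, 6.0, 1.0, 0.0, 0.5))  # → (13.125, 5.0)
```

The straight-line assumption only holds for the short horizon of a single action, which is exactly the restriction argued for above.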
Currently, the predictive model is trained on the complete population of players with player
characteristics. As discussed in Section 2.5, this leads to the desired Expected Goals, and thus
expected performance. By training the predictive model on different subpopulations, it could
provide insights into different types of performance. When the predictive model is, for example,
trained on only the best players in the population, the model would indicate desired performance
(closest to optimal performance). Furthermore, the model could be trained on only the players of
a single league to provide insights into the typical performance of players in that league. This
could reveal differences between leagues.
Besides improvements to the expected goals model itself, the model opens new possibilities for
further research, some of which can already be seen in the visualizations shown in Chapter 6.
First of all, the influence of events on the outcome of matches could be determined. A first
approach is provided with Figure 6.4a; a more quantitative approach would, however, provide
insights into a more general influence.
Secondly, the expected goals could be used in player acquisition. In this thesis, a first approach
for finding similar players is provided in Section 6.3.1. Elaborating on this approach could lead
to interesting insights which improve decision making in player acquisition. The expected goals
could be included to find players who are performing better than expected.
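The core idea of the thesis, comparing teams by the aggregated quality of their goal scoring opportunities, can be sketched as follows (the shot list and team labels are illustrative, not real match data):

```python
from collections import defaultdict

# Each shot is (team, expected-goal value of the attempt); illustrative data.
shots = [
    ("home", 0.35), ("home", 0.10), ("away", 0.70),
    ("home", 0.05), ("away", 0.15), ("home", 0.25),
]

def expected_winner(match_shots):
    """Sum each team's expected goals and name the expected winner."""
    totals = defaultdict(float)
    for team, xg in match_shots:
        totals[team] += xg
    if totals["home"] > totals["away"]:
        return "home"
    if totals["away"] > totals["home"]:
        return "away"
    return "draw"

print(expected_winner(shots))  # → away
```

Here the away team created fewer but better opportunities, so it is the expected winner even though the home team shot more often.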
[2] Chris Anderson and David Sally. The numbers game: why everything you know about football
is wrong. Penguin UK, 2013. 1, 2
[3] Richard A. Becker, Stephen G. Eick, and Allan R. Wilks. Visualizing Network Data. IEEE
Transactions on Visualization and Computer Graphics, 1(1):16–28, 1995. 5
[4] Andrew P Bradley. The use of the area under the roc curve in the evaluation of machine
learning algorithms. Pattern recognition, 30(7):1145–1159, 1997. 7
[5] Glen W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather
Review, 78(1):1–3, 1950. 8
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Pro-
ceedings of the Seventh International World-Wide Web Conference, pages 107–117, Brisbane,
Australia, 1998. 6
[7] J.J. Bull. Sky sports use football manager database to profile players in real life. The
Telegraph, 2015. 15
[8] C. Reep and B. Benjamin. Skill and Chance in Association Football. Journal of the Royal
Statistical Society, 131(4):581–585, 1968. 1
[9] Soumen Chakrabarti, Martin Ester, Usama Fayyad, Johannes Gehrke, Jiawei Han, Shinichi
Morishita, Gregory Piatetsky-Shapiro, and Wei Wang. Data mining curriculum: A proposal
(version 1.0). Intensive Working Group of ACM SIGKDD Curriculum Committee, page 140,
2006. 3
[10] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. Smote:
synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–
357, 2002. 28
[11] Xinhua Cheng and John M. Wallace. Cluster Analysis of the Northern Hemisphere Wintertime
500-hPa Height Field: Spatial Patterns, 1993. 5
[12] Anthony C. Constantinou, Norman E. Fenton, and Martin Neil. Pi-football: A Bayesian
network model for forecasting Association Football match outcomes. Knowledge-Based Systems,
36:322–339, 2012. 3
[13] Corinna Cortes and Daryl Pregibon. Giga-mining. In KDD, pages 174–178, 1998. 5
[14] Daily Mirror. FIFA ratings vs real life: How does the video game measure up
to actual Premier League stats? http://www.mirror.co.uk/sport/football/news/
fifa-ratings-vs-real-life-4027925, 2014. Online; accessed 2 August 2016. 16, 17
[15] Stefan Dobravec. Predicting sports results using latent features: A case study. In Informa-
tion and Communication Technology, Electronics and Microelectronics (MIPRO), 2015 38th
International Convention on, pages 1267–1272. IEEE, 2015. 4
[16] Robert Donnelly and W Michael Kelley. The humongous book of Statistics Problems. Penguin,
2009. 39
[17] Stephen Few. Tapping the power of visual perception. Visual Business Intelligence Newsletter,
2004. 39, 41, 42, 44
[18] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning
and an application to boosting. In European Conference on Computational Learning Theory,
pages 23–37. Springer, 1995. 6
[19] Thomas U Grund. Network structure and team performance: The case of english premier
league soccer teams. Social Networks, 34(4):682–690, 2012. 4
[20] David Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. MIT Press,
2001. 5, 6
[21] Harm Eggels, Ruud van Elk, and Mykola Pechenizkiy. Explaining soccer match outcomes with
goal scoring opportunities predictive analytics. In MLSA, Riva del Garda, Italy, 2016.
ECML/PKDD. 4
[22] Trevor J. Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer, 2011. 6
[23] Kees Hendriks. PSV football data analysis: An elaborate analysis on what possession loss
causes and what causes possession loss. Master's thesis, Eindhoven University of Technology,
the Netherlands, 2016. 18, 47
[24] Andreas Heuer, Christian Mueller, and Oliver Rubner. Soccer: is scoring goals a predictable
Poissonian process? EuroPhysics Letters, 89(3):2–6, 2010. 4
[25] Tin Kam Ho. Random decision forests. In Document Analysis and Recognition, 1995., Pro-
ceedings of the Third International Conference on, volume 1, pages 278–282. IEEE, 1995.
6
[26] Vincent Hoekstra, Pieter Bison, and Guszti Eiben. Predicting football results with an evolu-
tionary ensemble classifier. page 68, 2012. 4
[27] Inmotio. Football. http://www.inmotio.eu/en-GB/34/football.html, 2016. Online; ac-
cessed 5 August 2016. 14
[28] Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky. A tour through the visualization zoo.
Communications of the ACM, 53(6):56–67, 2010. 41, 44, 45
[29] P. D. Jones, N. James, and S. D. Mellallieu. Possession as a performance indicator in football.
International Journal of Performance Analysis in Sport, 4:98–102, 2004. 4
[30] Dimitris Karlis and Ioannis Ntzoufras. Analysis of sports data by using bivariate poisson
models. Journal of the Royal Statistical Society: Series D (The Statistician), 52(3):381–393,
2003. 4
[31] Matthew George Soeryadjaya Kerr. Applying machine learning to event data in soccer. PhD
thesis, Massachusetts Institute of Technology, 2015. 4
[32] R. Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model
Selection. In International Joint Conference on Artificial Intelligence, number 0, pages 0–6,
1995. 6
[33] Helge Langseth. Beating the bookie: A look at statistical models for prediction of football
matches. In SCAI, pages 165–174, 2013. 3
[34] Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions and reversals.
In Soviet physics doklady, volume 10, page 707, 1966. 18
[35] Michael Lewis. Moneyball: The art of winning an unfair game. WW Norton & Company,
2004. 1
[36] Patrick Lucey, Alina Bialkowski, Mathew Monfort, Peter Carr, and Iain Matthews. "Quality
vs Quantity": Improved Shot Prediction in Soccer using Strategic Features from Spatiotemporal
Data. In Proc. 9th Annual MIT Sloan Sports Analytics Conference, pages 1–9, 2015. 4, 23
[37] Oded Maimon and Lior Rokach. Data mining and knowledge discovery handbook, volume 2.
Springer, 2005. 6
[38] Tim McGarry and Ian M Franks. On winning the penalty shoot-out in soccer. Journal of
Sports Sciences, 18(6):401–409, 2000. 33
[39] RA Mollineda, R Alejo, and JM Sotoca. The class imbalance problem in pattern classification
and learning. In II Congreso Español de Informática (CEDI 2007). ISBN, pages 978–84, 2007.
28
[40] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised
learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05),
pages 625–632, 2005. 7, 8
[41] ORTEC. Performance Analytics. http://ortecsports.com/performance-analytics/,
2016. Online; accessed 5 August 2016. 11
[43] J.C. Platt. Probabilistic outputs for support vector machines and comparison to regularized
likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999. 7
[44] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010. 6
[45] Colin Shearer, Hugh J. Watson, Daryl G. Grecich, Larissa Moss, Sid Adelman, Katherine
Hammer, and Stacey A. Herdlein. The CRISP-DM model: The new blueprint for data mining.
Journal of Data Warehousing, 5(4):13–22, 2000. 3
[46] Sofifa. Players. http://sofifa.com/players, 2016. Online; accessed 5 August 2016. 15
[47] Marina Sokolova and Guy Lapalme. A systematic analysis of performance measures for
classification tasks. Information Processing and Management, 45(4):427–437, 2009. 6
[48] K Stuart. Why clubs are using football manager as a real-life scouting tool. The Guardian,
4:20–58, 2014. 15
[49] Pang-Ning Tan et al. Introduction to data mining. Pearson Education India, 2006. 6, 8
[50] Strother H Walker and David B Duncan. Estimation of the probability of an event as a
function of several independent variables. Biometrika, 54(1-2):167–179, 1967. 6
[51] David H Wolpert. The Lack of A Priori Distinctions Between Learning Algorithms. Neural
Computation, 8(7):1341–1390, 1996. 6
[52] Show-Jane Yen and Yue-Shi Lee. Cluster-based under-sampling approaches for imbalanced
data distributions. Expert Systems with Applications, 36(3):5718–5727, 2009. 29
[53] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision
trees and naive Bayesian classifiers. In ICML, pages 1–8, 2001. 7
[54] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass
probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD '02), pages 694–699, 2002. 7