Advanced Methods of Data Analysis (AMDA)
Group Project
Suggesting dominant openings for chess games
Study Group 4
• Anirudh Banerjee
• Faraz Riyaz Shaliban
• Sonal Priya
• Ujjwal Gahlot
Project Motivation
• Chess is one of the most popular board games, with as many
as 605 million players worldwide.
• Due to global lockdowns in 2020, there has been an
exponential increase in the amount of chess being played.
• More people have started playing chess on a regular basis
and on a more serious level.
• The opening is a key part of a chess game and, with prior
preparation, can give a player an advantage.
• There is a vast variety of openings, which may be daunting
for beginners to learn.
• To solve this issue, we decided to build a model which aims
to predict which openings will give the highest winning
percentage at various ELO rating levels.
[Figure: Number of chess games played per month on Lichess website]
Introduction
Primary Aim: Suggest which opening gives the highest winning percentage
Secondary Aim: Suggest whether to play an open or closed game (attacking or defensive)
Data Source: Chess Game Dataset (Lichess) on Kaggle
~20k data points were used in the initial database

Var Name        Type of Variable               Details
Win/Lose        Binary Dependent Variable      Result of the match
Player ELO      Continuous Variable            Rating of the player
Opponent ELO    Continuous Variable            Rating of the opponent
White/Black     Categorical Binary Variable    Whether the player is playing white or not
Opening Name    Categorical Multi Variable     Chess opening name
Time Control    Categorical Multi Variable     Time controls of the game
Rated           Categorical Binary Variable    Whether the game is a rated game
Turns           Continuous Variable            Number of moves in the game
Source: https://www.kaggle.com/datasnaek/chess
Analysis Approach
1. Data Cleaning
• Opening reduction (Python)
• Filtering out non-rated games
• Filtering out openings with fewer than 400 data points
• Creating rating difference predictor
• Normalization of rating
2. One hot encoding
Creating dummy variables for:
• Opening
• Victory status
• White rating
• Winner
3. Data partition
• Training data : 70%
• Validation data : 20%
• Test data : 10%
4. Models
• Model 1: Ordinal regression for loss, draw and win
• Model 2: Binomial logistic regression for white winning
• Model 3: Ordinal logistic regression with interaction terms for rating and openings
5. Final model and conclusion
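The cleaning and partition steps can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the project: the field names `rated` and `opening_name`, the rule of cutting an opening name at ":", "|" or "#" to get its family, and the helper names are all assumptions made for the example.

```python
import random
from collections import Counter


def reduce_opening(name):
    """Collapse a detailed opening name to its family.

    Assumed rule (not confirmed by the slides): keep the text
    before the first ':', '|' or '#'.
    """
    for sep in (":", "|", "#"):
        name = name.split(sep)[0]
    return name.strip()


def clean_games(games, min_games=400):
    """Keep rated games whose reduced opening has at least min_games rows."""
    rated = [
        dict(g, opening=reduce_opening(g["opening_name"]))
        for g in games
        if g["rated"]
    ]
    counts = Counter(g["opening"] for g in rated)
    return [g for g in rated if counts[g["opening"]] >= min_games]


def partition(rows, seed=0, train_pct=70, valid_pct=20):
    """Shuffle and split rows into train / validation / test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * train_pct // 100
    n_valid = len(rows) * valid_pct // 100
    return (
        rows[:n_train],
        rows[n_train:n_train + n_valid],
        rows[n_train + n_valid:],
    )
```

With the default percentages the remaining rows (10%) form the test set, matching the 70/20/10 partition described above.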
Descriptive Statistics
[Figure: Winning Based on White and Black ELO Difference (Normalized); shares of White Win, Black Win and Draw (0% to 100%) for difference bins from "Less than -4" up to "From 4 to 3"]
[Figure: Scatter Plot of Ratings and Results; White rating against Black rating (0 to 3000), points marked White, Black or Draw]
[Figure: Result Distribution for Popular Openings; shares of White, Black and Draw for Three Knights Opening, Bird Opening, Center Game, King's Indian Defense, King's Gambit, Indian Game, Four Knights Game, Scandinavian Defense, King's Pawn Game and Sicilian Defense]
• Wins occur at a lower number of turns, whereas draws take a higher number of turns.
• White wins when the white player’s rating is higher.
• Black wins when the black player’s rating is higher.
• Draws occur when the ratings are similar.
Normalizing ELOs
• Chess ELOs are meant to map the difference in skills.
However, the difference is not linear.
• The difference in skill between 2700 and 2600 is generally
higher than the difference between 2100 and 2000.
• The overall distribution of chess ELOs is approximately a normal
distribution. Various papers and websites have shown that
the mean is approximately 1500-1525 and the standard
deviation is approximately 400.
• Hence, we have normalized the ratings.
• For normalization we have taken z-values (with 1500 mean
and 400 std dev) and taken the exponential of the z-value.
• The key statistic we have used is the difference in the
normalized rating of white and black.
[Figure: Distribution of Blitz (~5 min game) ELOs on Lichess]
[Figure: Distribution of Rapid (~10 min game) ELOs on Lichess]
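The normalization described above can be written out as a short sketch. The mean of 1500 and standard deviation of 400 come from the slide; the function names are hypothetical and chosen only for this example.

```python
import math

ELO_MEAN = 1500.0  # mean stated on the slide
ELO_STD = 400.0    # standard deviation stated on the slide


def normalize_elo(rating, mean=ELO_MEAN, std=ELO_STD):
    """Exponential of the z-value of a rating."""
    z = (rating - mean) / std
    return math.exp(z)


def rating_difference(white, black):
    """Normalized white rating minus normalized black rating."""
    return normalize_elo(white) - normalize_elo(black)


# The same 100-point gap counts for more at the top of the scale:
high_gap = rating_difference(2700, 2600)
low_gap = rating_difference(2100, 2000)
```

Because the exponential is convex, `high_gap` comes out larger than `low_gap`, which is the non-linearity in skill difference that the normalization is meant to capture.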
Model 1 : Ordinal Regression for loss, draw and win
We use ordinal logistic regression to calculate the probability of
either side winning or a draw.
The variables taken are:
• Number of turns in the match
• The rating difference
• The opening
• Opening_ply
• Time format
• White rating
Based on multicollinearity and p-values, we drop the following variables:
opening_ply, increment_2, increment_3, increment_4, increment_5,
increment_6, z_white, white_rat_2, white_rat_3, white_rat_4,
white_rat_5, white_rat_6, white_rat_7, white_rat_8, white_rat_9,
opening_CaroKannDefense, opening_IndianGame,
opening_KingsGambit, opening_KingsIndianDefense,
opening_NimzoIndianDefense, opening_NimzoLarsenAttack,
opening_OwenDefense, opening_PircDefense, opening_QueensGambit,
opening_RussianGame, opening_RuyLopez,
opening_ScandinavianDefense, opening_ScotchGame,
opening_SicilianDefense, opening_SlavDefense, opening_ViennaGame
Probability that black wins
$$P(Y=0) = \frac{e^{\alpha_{0/1} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{0/1} + \sum_i \beta_i x_i}}$$
Probability of draw
$$P(Y=1) = \frac{e^{\alpha_{1/2} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{1/2} + \sum_i \beta_i x_i}} - \frac{e^{\alpha_{0/1} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{0/1} + \sum_i \beta_i x_i}}$$
Probability that white wins
$$P(Y=2) = 1 - \frac{e^{\alpha_{1/2} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{1/2} + \sum_i \beta_i x_i}}$$
The model assumption of the predicted variable being on an ordinal scale is true as order is Loss < Draw < Win (for white)
We checked for multicollinearity, p-value and goodness of fit
In the Model 1 equations, $\alpha$ is the intercept value at the ordered level
and $\beta_i$ is the coefficient for the variable $x_i$.
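To make the three Model 1 probabilities concrete, the sketch below evaluates them for a given set of intercepts and coefficients. The numeric values in the usage lines are invented for illustration and are not the fitted coefficients from the project; the function names are likewise hypothetical. The probabilities are only valid when the lower intercept does not exceed the upper one.

```python
import math


def logistic(z):
    """e^z / (1 + e^z), written in the equivalent 1 / (1 + e^-z) form."""
    return 1.0 / (1.0 + math.exp(-z))


def ordinal_probs(alpha_01, alpha_12, betas, xs):
    """Return (P(black wins), P(draw), P(white wins)).

    alpha_01 and alpha_12 are the intercepts at the two ordered levels;
    betas and xs are the coefficients and the predictor values.
    Requires alpha_01 <= alpha_12.
    """
    eta = sum(b * x for b, x in zip(betas, xs))
    p_le_0 = logistic(alpha_01 + eta)  # P(Y <= 0)
    p_le_1 = logistic(alpha_12 + eta)  # P(Y <= 1)
    return p_le_0, p_le_1 - p_le_0, 1.0 - p_le_1


# Hypothetical coefficient on the rating difference (white minus black).
even_match = ordinal_probs(-1.0, 0.5, [-0.8], [0.0])
white_stronger = ordinal_probs(-1.0, 0.5, [-0.8], [2.0])
```

With a negative coefficient on the rating difference, a larger white advantage lowers both cumulative probabilities, so the probability that white wins rises and the probability that black wins falls.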
Running the model on the test data, we get the following results:
[Figure: Model 1 results on the test data]
Based on the model, we suggest the following openings
for players.
Openings for black:
1. Van’t Kruijs Opening
2. Bird Opening
3. Hungarian Opening
Openings for white:
1. Nimzowitsch Defense
2. Zukertort Opening
3. Bishop’s Opening
Model 2 : Binomial Logistic Regression for white winning
We use logistic regression to calculate the probability of white
winning.
The variables taken are:
• Number of turns in the match
• The rating difference
• The opening
• Opening_ply
• Time format
• White rating
Probability that white wins:
$$P(Y=1) = \frac{e^{\beta_0 + \sum_i \beta_i x_i}}{1 + e^{\beta_0 + \sum_i \beta_i x_i}}$$
$\beta_0$ is the intercept value,
$\beta_i$ is the coefficient for the variable $x_i$
Based on multicollinearity and p-values, we drop the following variables:
opening_ply, increment_2, increment_3, increment_4, increment_5, increment_6,
z_white, white_rat_2, white_rat_3, white_rat_4, white_rat_5, white_rat_6,
white_rat_7, white_rat_8, white_rat_9, opening_BirdOpening,
opening_CaroKannDefense, opening_DutchDefense, opening_FourKnightsGame,
opening_FrenchDefense, opening_HorwitzDefense, opening_HungarianOpening,
opening_IndianGame, opening_ItalianGame, opening_KingsIndianDefense,
opening_KingsGambit, opening_KingsKnightOpening, opening_KingsPawnOpening,
opening_ModernDefense, opening_NimzoIndianDefense,
opening_NimzoLarsenAttack, opening_NimzowitschDefense, opening_OwenDefense,
opening_PircDefense, opening_RussianGame, opening_RuyLopez,
opening_ScandinavianDefense, opening_ScotchGame, opening_SicilianDefense,
opening_ViennaGame, opening_ZukertortOpening
The model assumption of a binary variable was
satisfied as the binary outputs were 1 – White
Win and 0 – White Not Win (Loss/Draw).
We checked for multicollinearity, p-value and
goodness of fit in each model implemented
Running the model on the test data, we get the following results:
[Figure: Confusion Matrix and Statistics]
The cut-off value we got from maximising
the accuracy is 0.5.
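Choosing the cut-off by maximising accuracy can be sketched as a simple grid search over candidate thresholds. This is an illustration under assumptions: the slides do not say which grid or which data partition was used, and the toy labels and probabilities below are invented for the example.

```python
def accuracy(y_true, probs, cutoff):
    """Share of games classified correctly when predicting 1 if p >= cutoff."""
    hits = sum((p >= cutoff) == (y == 1) for y, p in zip(y_true, probs))
    return hits / len(y_true)


def best_cutoff(y_true, probs, grid=None):
    """Return the cutoff from the grid with the highest accuracy."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    return max(grid, key=lambda c: accuracy(y_true, probs, c))


def confusion_counts(y_true, probs, cutoff):
    """Counts of true/false positives and negatives (1 = white wins)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, probs):
        pred = 1 if p >= cutoff else 0
        if pred == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts


# Toy data: 1 = white win, 0 = white not win (loss or draw).
labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.6, 0.9]
cutoff = best_cutoff(labels, predicted)
matrix = confusion_counts(labels, predicted, 0.5)
```

When several cutoffs tie on accuracy, `max` returns the first one in the grid, so the exact value depends on the grid order; a finer or coarser grid is an equally reasonable choice.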
Based on the model, we suggest the following openings
for players.
Openings for black:
1. Van’t Kruijs Opening
2. Bird Opening
3. Hungarian Opening
Openings for white:
1. Nimzowitsch Defense
2. Zukertort Opening
3. Bishop’s Opening
Model 3 : Ordinal Logistic Regression with interaction terms
We use ordinal logistic regression to calculate the probability of
either side winning or a draw.
In this model we have included interaction variables between
ratings and openings which enable us to suggest the best
openings for different rating ranges.
The variables taken are:
• The rating difference
• Interaction variables between opening and rating
We have divided the ratings into 9 bins:
1. 700 – 900
2. 900 – 1100
3. 1100 – 1300
4. 1300 – 1500
5. 1500 – 1700
6. 1700 – 1900
7. 1900 – 2100
8. 2100 – 2300
9. 2300 – 2500
Probability that black wins
$$P(Y=0) = \frac{e^{\alpha_{0/1} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{0/1} + \sum_i \beta_i x_i}}$$
Probability of draw
$$P(Y=1) = \frac{e^{\alpha_{1/2} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{1/2} + \sum_i \beta_i x_i}} - \frac{e^{\alpha_{0/1} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{0/1} + \sum_i \beta_i x_i}}$$
Probability that white wins
$$P(Y=2) = 1 - \frac{e^{\alpha_{1/2} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{1/2} + \sum_i \beta_i x_i}}$$
We also used an ordinal logistic model with interactions between the white rating groups and the turns played. However, we found that all
the coefficients turned out to be significant and hence, the data does not give any insights about what type of game to play.
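The rating bins and the opening-by-rating interaction variables can be built as in the sketch below. The bin edges are those listed above; treating the lower edge as inclusive, the feature naming scheme and the function names are assumptions made for this example.

```python
BIN_EDGES = [700, 900, 1100, 1300, 1500, 1700, 1900, 2100, 2300, 2500]


def rating_bin(rating):
    """Return the bin number 1..9, or None outside 700-2500.

    Assumption: each bin includes its lower edge and excludes its upper edge.
    """
    for i in range(len(BIN_EDGES) - 1):
        if BIN_EDGES[i] <= rating < BIN_EDGES[i + 1]:
            return i + 1
    return None


def interaction_features(rating, opening, openings):
    """One 0/1 dummy per (rating bin, opening) pair.

    Exactly one dummy is 1 when the rating falls in a bin and the
    opening is in the list; all are 0 otherwise.
    """
    current_bin = rating_bin(rating)
    return {
        f"bin{b}_x_{name}": int(b == current_bin and name == opening)
        for b in range(1, len(BIN_EDGES))
        for name in openings
    }


openings = ["RuyLopez", "ItalianGame", "SicilianDefense"]
features = interaction_features(1650, "RuyLopez", openings)
```

Each interaction dummy gets its own coefficient in the ordinal model, which is what allows a different ranking of openings in each rating range.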
Running the model on the test data, we get the following results:
[Figure: Model 3 results on the test data]
Conclusion
Based on Model 3, we suggest openings for players of different ratings:
Rating: 900 - 1100
Openings for black: 1. Queen’s Gambit 2. Italian Game 3. Caro-Kann Defense
Openings for white: 1. Philidor Defense 2. Ruy Lopez 3. Scandinavian Defense
Rating: 1100 - 1300
Openings for black: 1. Philidor Defense 2. Caro-Kann Defense 3. Sicilian Defense
Openings for white: 1. English Opening 2. Ruy Lopez 3. French Defense
Rating: 1300 - 1500
Openings for black: 1. Italian Game 2. Philidor Defense 3. Sicilian Defense
Openings for white: 1. Caro-Kann Defense 2. Ruy Lopez 3. Queen’s Pawn Game
Rating: 1500 - 1700
Openings for black: 1. King’s Pawn Game 2. Queen’s Pawn Game 3. Caro-Kann Defense
Openings for white: 1. Queen’s Gambit 2. Philidor Defense 3. Italian Game
Rating: 1700 - 1900
Openings for black: 1. Italian Game 2. Queen’s Pawn Game 3. King’s Pawn Game
Openings for white: 1. Queen’s Gambit 2. Philidor Defense 3. Caro-Kann Defense
Rating: 1900 - 2100
Openings for black: 1. Sicilian Defense 2. Queen’s Pawn Game 3. Ruy Lopez
Openings for white: 1. Caro-Kann Defense 2. English Opening 3. Philidor Defense
Rating: 2100 - 2300
Openings for black: 1. Ruy Lopez 2. French Defense 3. Queen’s Pawn Game
Openings for white: 1. King’s Pawn Game 2. Philidor Defense 3. English Opening
Rating: 2300 - 2500
Openings for black: 1. Ruy Lopez 2. Italian Game 3. English Opening
Openings for white: 1. Caro-Kann Defense 2. French Defense 3. Queen’s Pawn Game