
Advanced Methods of Data Analysis (AMDA) : Suggesting Dominant Openings For Chess Games

The document summarizes a group project analyzing chess opening data to predict which openings result in the highest winning percentage at different player rating levels. Over 20,000 chess games were analyzed from Lichess to build ordinal regression, logistic regression, and interaction models. Descriptive statistics showed white wins when white's rating is higher, black wins when black's rating is higher, and draws occur with similar ratings. Ratings were normalized based on the standard ELO distribution to better model the relationship between rating differences and game outcomes.

Uploaded by

Pritish Bhatt

Advanced Methods of Data Analysis (AMDA)

Group Project

Suggesting dominant openings for chess games

Study Group 4
• Anirudh Banerjee
• Faraz Riyaz Shaliban
• Sonal Priya
• Ujjwal Gahlot

Project Motivation

• Chess is one of the most popular board games, with as many as 605 million players worldwide.
• Due to global lockdowns in 2020, there has been an exponential increase in the amount of chess being played.
• More people have started playing chess on a regular basis and on a more serious level.
• An opening is a key part of the chess game and, with prior preparation, can give an advantage in a chess game.
• There is a vast variety of openings, which may be daunting for beginners to learn.
• To solve this issue, we decided to build a model which aims to predict which openings will give the highest winning percentage at various ELO rating levels.

[Figure: Number of chess games played per month on Lichess website]

Introduction

Primary Aim: Suggest which opening gives the highest winning percentage

Secondary Aim: Suggest whether to play an open or closed game (attacking or defensive)

Data Source: Chess Game Dataset (Lichess) on Kaggle

~20k data points were used in the initial database

Var Name | Type of Variable | Details
Win/Lose | Binary Dependent Variable | Result of the match
Player ELO | Continuous Variable | Rating of the player
Opponent ELO | Continuous Variable | Rating of the opponent
White/Black | Categorical Binary Variable | Whether the player is playing white or not
Opening Name | Categorical Multi Variable | Chess opening name
Time Control | Categorical Multi Variable | Time controls of the game
Rated | Categorical Binary Variable | Whether the game is a rated game
Turns | Continuous Variable | Number of moves in the game

Source: https://www.kaggle.com/datasnaek/chess

Analysis Approach

Step 1: Data Cleaning
• Opening reduction (Python)
• Filtering out non-rated games
• Filtering out openings with fewer than 400 data points
• Creating rating difference predictor
• Normalization of rating

Step 2: One hot encoding
Creating dummy variables for:
• Opening
• Victory status
• White rating
• Winner

Step 3: Data partition
• Training data : 70%
• Validation data : 20%
• Test data : 10%

Step 4: Models
• Model 1: Ordinal regression for loss, draw and win
• Model 2: Binomial logistic regression for white winning
• Model 3: Ordinal logistic regression with interaction terms for rating and openings

Step 5: Final model and conclusion

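The cleaning, encoding and partition steps listed on this slide could look roughly like the sketch below. It is only an illustration under stated assumptions: the record layout (dicts with hypothetical keys `rated` and `opening`), the order of the two filters, the `opening_` dummy prefix and the fixed shuffle seed are not given in the slides.

```python
import random
from collections import Counter


def clean_games(games, min_games_per_opening=400):
    """Drop non-rated games, then drop openings with too few games.

    `games` is a list of dicts with hypothetical keys "rated" and "opening".
    Filtering rated games before counting openings is an assumption.
    """
    rated = [g for g in games if g["rated"]]
    counts = Counter(g["opening"] for g in rated)
    return [g for g in rated if counts[g["opening"]] >= min_games_per_opening]


def one_hot(value, categories, prefix="opening"):
    """Return one 0/1 dummy variable per category."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


def partition(rows, train_pct=70, valid_pct=20, seed=0):
    """Shuffle and split rows into train/validation/test (70/20/10 by default)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * train_pct // 100
    n_valid = len(rows) * valid_pct // 100
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test
```

Integer percentages are used for the split sizes so that the 70/20/10 counts do not depend on floating-point rounding.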
Descriptive Statistics

[Figure: Winning Based on White and Black ELO Difference (Normalized). Stacked bars showing the share of White Win, Black Win and Draw (0% to 100%) for each normalized rating-difference bin, from "Less than -4" up to "From 4 to 3".]

[Figure: Scatter Plot of Ratings and Results. White rating against Black rating (both 0 to 3000), with points marked by result: White, Black, Draw.]

[Figure: Result Distribution for Popular Openings. Share of White, Black and Draw results for ThreeKnightsOpening, BirdOpening, CenterGame, KingsIndianDefense, KingsGambit, IndianGame, FourKnightsGame, ScandinavianDefense, KingsPawnGame and SicilianDefense.]


Descriptive Statistics

• Wins occur at a lower number of turns, whereas draws take a higher number of turns.
• White wins when the white player’s rating is higher.
• Black wins when the black player’s rating is higher.
• Draws occur when the ratings are similar.
Normalizing ELOs

• Chess ELOs are meant to map the difference in skills. However, the difference is not linear.
• The difference in skill between 2700 and 2600 is generally higher than the difference between 2100 and 2000.
• The overall distribution of chess ELOs is approximately normal. Various papers and websites have shown that the mean is approximately 1500-1525 and the standard deviation is approximately 400.
• Hence, we have normalized the ratings.
• For normalization we have taken z-values (with 1500 mean and 400 std dev) and taken the exponential of the z-value.
• The key statistic we have used is the difference in the normalized rating of white and black.

[Figure: Distribution of Blitz (~5 min game) ELOs on Lichess]
[Figure: Distribution of Rapid (~10 min game) ELOs on Lichess]
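As a minimal sketch of the normalization described on this slide, assuming exactly the stated mean of 1500 and standard deviation of 400 (the function names are hypothetical, not taken from the project code):

```python
import math

ELO_MEAN = 1500.0  # mean stated on the slide
ELO_STD = 400.0    # standard deviation stated on the slide


def normalize_rating(rating, mean=ELO_MEAN, std=ELO_STD):
    """Exponential of the z-value of a rating."""
    z = (rating - mean) / std
    return math.exp(z)


def rating_difference(white_rating, black_rating):
    """Normalized rating of white minus normalized rating of black."""
    return normalize_rating(white_rating) - normalize_rating(black_rating)


# The same 100-point gap counts for more at the top of the rating scale.
high_gap = rating_difference(2700, 2600)
low_gap = rating_difference(2100, 2000)
```

Because the exponential is convex, `high_gap` comes out larger than `low_gap`, which matches the claim that a 100-point gap near 2700 reflects a bigger skill difference than the same gap near 2000.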
Model 1 : Ordinal Regression for loss, draw and win
We use ordinal logistic regression to calculate the probability of either side winning or a draw.
The variables taken are:
• Number of turns in the match
• The rating difference
• The opening
• Opening_ply
• Time format
• White rating

Based on multicollinearity and p-values, we drop the following variables:
opening_ply, increment_2, increment_3, increment_4, increment_5, increment_6, z_white, white_rat_2, white_rat_3, white_rat_4, white_rat_5, white_rat_6, white_rat_7, white_rat_8, white_rat_9, opening_CaroKannDefense, opening_IndianGame, opening_KingsGambit, opening_KingsIndianDefense, opening_NimzoIndianDefense, opening_NimzoLarsenAttack, opening_OwenDefense, opening_PircDefense, opening_QueensGambit, opening_RussianGame, opening_RuyLopez, opening_ScandinavianDefense, opening_ScotchGame, opening_SicilianDefense, opening_SlavDefense, opening_ViennaGame

Probability that black wins

$$P(Y = 0) = \frac{e^{\alpha_{0/1} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{0/1} + \sum_i \beta_i x_i}}$$

Probability of draw

$$P(Y = 1) = \frac{e^{\alpha_{1/2} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{1/2} + \sum_i \beta_i x_i}} - \frac{e^{\alpha_{0/1} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{0/1} + \sum_i \beta_i x_i}}$$

Probability that white wins

$$P(Y = 2) = 1 - \frac{e^{\alpha_{1/2} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{1/2} + \sum_i \beta_i x_i}}$$

Here $\alpha$ is the intercept value at the ordered level, and $\beta_i$ is the coefficient for the variable $x_i$.

The model assumption of the predicted variable being on an ordinal scale is true, as the order is Loss < Draw < Win (for white).
We checked for multicollinearity, p-value and goodness of fit.
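To make the three Model 1 probabilities concrete, the sketch below evaluates them for two cut-point intercepts and a linear predictor, following the sign convention of the formulas on this slide (intercept plus the sum of coefficient times variable). The function names and the example values are hypothetical and are not fitted values from the project.

```python
import math


def _logistic(t):
    """e^t / (1 + e^t), written in the equivalent form 1 / (1 + e^-t)."""
    return 1.0 / (1.0 + math.exp(-t))


def ordinal_probabilities(alpha_01, alpha_12, eta):
    """Return (P(black wins), P(draw), P(white wins)).

    alpha_01 and alpha_12 are the intercepts at the two ordered levels
    (they must satisfy alpha_01 < alpha_12), and eta is the sum of
    beta_i * x_i for one game.
    """
    cum_0 = _logistic(alpha_01 + eta)   # P(Y <= 0): black wins
    cum_1 = _logistic(alpha_12 + eta)   # P(Y <= 1): black wins or draw
    return cum_0, cum_1 - cum_0, 1.0 - cum_1


# Illustrative values only.
p_black, p_draw, p_white = ordinal_probabilities(-0.5, 0.5, 0.0)
```

The draw probability is the difference of the two cumulative terms, so the three values always sum to 1.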
Model 1 : Ordinal Regression for loss, draw and win
Running the model on the test data, we get the following results:
Model 1 : Ordinal Regression for loss, draw and win

Based on the model, we suggest the following openings for players.

Openings for black:
1. Van’t Kruijs Opening
2. Bird Opening
3. Hungarian Opening

Openings for white:
1. Nimzowitsch Defense
2. Zukertort Opening
3. Bishop’s Opening
Model 2 : Binomial Logistic Regression for white winning
We use logistic regression to calculate the probability of white winning.
The variables taken are:
• Number of turns in the match
• The rating difference
• The opening
• Opening_ply
• Time format
• White rating

Based on multicollinearity and p-values, we drop the following variables:
opening_ply, increment_2, increment_3, increment_4, increment_5, increment_6, z_white, white_rat_2, white_rat_3, white_rat_4, white_rat_5, white_rat_6, white_rat_7, white_rat_8, white_rat_9, opening_BirdOpening, opening_CaroKannDefense, opening_DutchDefense, opening_FourKnightsGame, opening_FrenchDefense, opening_HorwitzDefense, opening_HungarianOpening, opening_IndianGame, opening_ItalianGame, opening_KingsIndianDefense, opening_KingsGambit, opening_KingsKnightOpening, opening_KingsPawnOpening, opening_ModernDefense, opening_NimzoIndianDefense, opening_NimzoLarsenAttack, opening_NimzowitschDefense, opening_OwenDefense, opening_PircDefense, opening_RussianGame, opening_RuyLopez, opening_ScandinavianDefense, opening_ScotchGame, opening_SicilianDefense, opening_ViennaGame, opening_ZukertortOpening

Probability that white wins:

$$P(Y = 1) = \frac{e^{\beta_0 + \sum_i \beta_i x_i}}{1 + e^{\beta_0 + \sum_i \beta_i x_i}}$$

Here $\beta_0$ is the intercept value, and $\beta_i$ is the coefficient for the variable $x_i$.

The model assumption of a binary variable was satisfied, as the binary outputs were 1 – White Win and 0 – White Not Win (Loss/Draw).
We checked for multicollinearity, p-value and goodness of fit in each model implemented.

Model 2 : Binomial Logistic Regression for white winning

Running the model on the test data, we get the following results:

Confusion Matrix and Statistics


The cut-off value we got from maximising the accuracy is 0.5.
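One way a cut-off could be chosen by maximising accuracy is sketched below. The slides do not say how the search was run, so the candidate grid, the `>=` rule for predicting a white win and the function names are all assumptions made for illustration.

```python
def accuracy_at_cutoff(probabilities, labels, cutoff):
    """Share of games classified correctly when predicting 1 (white wins)
    for every predicted probability at or above `cutoff`."""
    predictions = [1 if p >= cutoff else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)


def best_cutoff(probabilities, labels, candidates=None):
    """Return the candidate cut-off with the highest accuracy.

    Ties are resolved in favour of the first candidate in the grid.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda c: accuracy_at_cutoff(probabilities, labels, c))


# Toy example: predicted P(white wins) and the true 0/1 outcomes.
toy_probabilities = [0.1, 0.4, 0.6, 0.9]
toy_labels = [0, 0, 1, 1]
chosen = best_cutoff(toy_probabilities, toy_labels)
```

In practice such a search would normally be run on the validation partition rather than the test partition, so that the reported test accuracy stays unbiased.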
Model 2 : Binomial Logistic Regression for white winning

Based on the model, we suggest the following openings for players.

Openings for black:
1. Van’t Kruijs Opening
2. Bird Opening
3. Hungarian Opening

Openings for white:
1. Nimzowitsch Defense
2. Zukertort Opening
3. Bishop’s Opening
Model 3 : Ordinal Logistic Regression with interaction terms

We use ordinal logistic regression to calculate the probability of either side winning or a draw.
In this model we have included interaction variables between ratings and openings, which enable us to suggest the best openings for different rating ranges.
The variables taken are:
• The rating difference
• Interaction variables between opening and rating

We have divided the ratings into 9 bins:
1. 700 – 900
2. 900 – 1100
3. 1100 – 1300
4. 1300 – 1500
5. 1500 – 1700
6. 1700 – 1900
7. 1900 – 2100
8. 2100 – 2300
9. 2300 – 2500

Probability that black wins

$$P(Y = 0) = \frac{e^{\alpha_{0/1} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{0/1} + \sum_i \beta_i x_i}}$$

Probability of draw

$$P(Y = 1) = \frac{e^{\alpha_{1/2} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{1/2} + \sum_i \beta_i x_i}} - \frac{e^{\alpha_{0/1} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{0/1} + \sum_i \beta_i x_i}}$$

Probability that white wins

$$P(Y = 2) = 1 - \frac{e^{\alpha_{1/2} + \sum_i \beta_i x_i}}{1 + e^{\alpha_{1/2} + \sum_i \beta_i x_i}}$$

We use an ordinal logistic model with interaction between the white rating groups and the turns played. However, we found that all the coefficients turned out to be significant and hence, the data does not give any insights about what type of game to play.
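The binning and the opening-by-rating interaction variables might be built along the lines of the sketch below. The slides do not state how bin edges are handled or how the interaction columns are named, so the left-closed/right-open bins (which place a rating of exactly 2500 outside the last bin) and the `<opening>_x_bin<k>` naming are assumptions.

```python
def rating_bin(rating, low=700, width=200, n_bins=9):
    """Return the bin number 1..9 for a rating, or None outside 700-2500.

    Bins are treated as left-closed and right-open (an assumption), so
    900 falls in bin 2 and 2500 falls outside the last bin.
    """
    if rating < low or rating >= low + width * n_bins:
        return None
    return (int(rating) - low) // width + 1


def interaction_features(opening, white_rating, openings, n_bins=9):
    """One 0/1 dummy per (opening, rating bin) pair.

    Exactly one dummy is 1 when the opening is in `openings` and the
    rating falls in a bin; otherwise all dummies are 0.
    """
    current_bin = rating_bin(white_rating)
    return {
        f"{name}_x_bin{k}": int(name == opening and k == current_bin)
        for name in openings
        for k in range(1, n_bins + 1)
    }


# Hypothetical example: a 1520-rated white player choosing the Ruy Lopez.
example = interaction_features("RuyLopez", 1520, ["RuyLopez", "SicilianDefense"])
```

With one coefficient per (opening, bin) dummy, the model can rank openings separately inside each rating range, which is what the per-rating suggestions rely on.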
Model 3 : Ordinal Logistic Regression with interaction terms

Running the model on the test data, we get the following results:
Conclusion
Based on Model 3, we suggest openings for players of different ratings:

Rating: 900 - 1100
Openings for black: 1. Queen’s Gambit 2. Italian Game 3. Caro-Kann Defense
Openings for white: 1. Philidor Defense 2. Ruy Lopez 3. Scandinavian Defense

Rating: 1100 - 1300
Openings for black: 1. Philidor Defense 2. Caro-Kann Defense 3. Sicilian Defense
Openings for white: 1. English Opening 2. Ruy Lopez 3. French Defense

Rating: 1300 - 1500
Openings for black: 1. Italian Game 2. Philidor Defense 3. Sicilian Defense
Openings for white: 1. Caro-Kann Defense 2. Ruy Lopez 3. Queen’s Pawn Game

Rating: 1500 - 1700
Openings for black: 1. King’s Pawn Game 2. Queen’s Pawn Game 3. Caro-Kann Defense
Openings for white: 1. Queen’s Gambit 2. Philidor Defense 3. Italian Game

Rating: 1700 - 1900
Openings for black: 1. Italian Game 2. Queen’s Pawn Game 3. King’s Pawn Game
Openings for white: 1. Queen’s Gambit 2. Philidor Defense 3. Caro-Kann Defense

Rating: 1900 - 2100
Openings for black: 1. Sicilian Defense 2. Queen’s Pawn Game 3. Ruy Lopez
Openings for white: 1. Caro-Kann Defense 2. English Opening 3. Philidor Defense

Rating: 2100 - 2300
Openings for black: 1. Ruy Lopez 2. French Defense 3. Queen’s Pawn Game
Openings for white: 1. King’s Pawn Game 2. Philidor Defense 3. English Opening

Rating: 2300 - 2500
Openings for black: 1. Ruy Lopez 2. Italian Game 3. English Opening
Openings for white: 1. Caro-Kann Defense 2. French Defense 3. Queen’s Pawn Game
