MATH 308 Tokyo Medal Ranking and PCA
Archer Yang
Data Description
From Wikipedia: ”The 2020 Summer Olympics was an international multi-sport event held in Tokyo,
Japan, from 23 July to 8 August 2021. The games were postponed by one year as part of the impact of the
COVID-19 pandemic on sports. However, the Games was referred to by its original date in all medals,
uniforms, promotional items, and other related media in order to avoid confusion in future years.”
Overall, there are 93 nations received at least one medal.
# load data
library(readr)
data.source<-paste0("https://raw.githubusercontent.com/",
"mcgillstat/regression/main/data/medals_2020.csv")
medals <- read_csv(data.source, col_names = FALSE)
colnames(medals) <- c("Country","Gold","Silver","Bronze")
print(medals,n=93)
## # A tibble: 93 x 4
## Country Gold Silver Bronze
## <chr> <dbl> <dbl> <dbl>
## 1 UnitedStates 39 41 33
## 2 China 38 32 19
## 3 Japan 27 14 17
## 4 GreatBritain 22 20 22
## 5 ROC 20 28 23
## 6 Australia 17 7 22
## 7 Netherlands 10 12 14
## 8 France 10 12 11
## 9 Germany 10 11 16
## 10 Italy 10 10 20
## 11 Canada 7 7 10
## 12 Brazil 7 6 8
## 13 NewZealand 7 6 7
## 14 Cuba 7 3 5
## 15 Hungary 6 7 7
## 16 SouthKorea 6 4 10
## 17 Poland 4 5 5
## 18 CzechRepublic 4 4 3
## 19 Kenya 4 4 2
## 20 Norway 4 2 2
## 21 Jamaica 4 1 4
## 22 Spain 3 8 6
## 23 Sweden 3 6 0
## 24 Switzerland 3 4 6
## 25 Denmark 3 4 4
## 26 Croatia 3 3 2
## 27 Iran 3 2 2
## 28 Serbia 3 1 5
## 29 Belgium 3 1 3
## 30 Bulgaria 3 1 2
## 31 Slovenia 3 1 1
## 32 Uzbekistan 3 0 2
## 33 Georgia 2 5 1
## 34 ChineseTaipei 2 4 6
## 35 Turkey 2 2 9
## 36 Greece 2 1 1
## 37 Uganda 1 1 4
## 38 Ecuador 2 1 0
## 39 Ireland 2 0 2
## 40 Israel 0 2 4
## 41 Qatar 2 0 1
## 42 Bahamas 2 0 0
## 43 Kosovo 0 0 2
## 44 Ukraine 1 6 12
## 45 Belarus 1 3 3
## 46 Romania 1 3 0
## 47 Venezuela 3 0 4
## 48 India 1 2 4
## 49 HongKong 1 2 3
## 50 Philippines 1 2 1
## 51 Slovakia 2 1 4
## 52 SouthAfrica 1 2 0
## 53 Austria 1 1 5
## 54 Egypt 1 1 4
## 55 Indonesia 1 1 3
## 56 Ethiopia 1 1 2
## 57 Portugal 1 2 4
## 58 Tunisia 1 1 0
## 59 Estonia 1 0 1
## 60 Fiji 0 1 2
## 61 Latvia 0 1 2
## 62 Thailand 0 1 2
## 63 Bermuda 1 0 0
## 64 Morocco 0 0 1
## 65 PuertoRico 0 0 1
## 66 Colombia 0 4 1
## 67 Azerbaijan 0 3 4
## 68 DominicanRepublic 0 3 2
## 69 Armenia 0 2 2
## 70 Kyrgyzstan 0 2 1
## 71 Mongolia 0 1 3
## 72 Argentina 0 1 2
## 73 SanMarino 1 2 3
## 74 Jordan 0 1 1
## 75 Malaysia 1 1 2
## 76 Nigeria 1 1 2
## 77 Bahrain 0 1 0
## 78 Lithuania 1 0 1
## 79 Namibia 1 0 1
## 80 NorthMacedonia 1 0 1
## 81 SaudiArabia 1 0 1
## 82 Turkmenistan 1 0 1
## 83 Kazakhstan 0 0 8
## 84 Mexico 0 0 4
## 85 Finland 0 0 2
## 86 Botswana 0 0 1
## 87 BurkinaFaso 0 1 1
## 88 Ghana 0 1 1
## 89 Grenada 0 1 1
## 90 IvoryCoast 0 1 1
## 91 Kuwait 0 1 1
## 92 Moldova 0 1 1
## 93 Syria 0 1 1
Here are summary statistics of the data
# PCA applied to medal counts in 2020 Summer Game
# Create a data matrix without 1st column (country name)
X_raw = as.matrix(medals[,-1])
summary(X_raw)
## Gold Silver
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 1.000
## Median : 1.000 Median : 1.000
## Mean : 3.667 Mean : 3.688
## 3rd Qu.: 3.000 3rd Qu.: 4.000
## Max. :39.000 Max. :41.000
## Bronze
## Min. : 0.000
## 1st Qu.: 1.000
## Median : 2.000
## Mean : 4.634
## 3rd Qu.: 5.000
## Max. :33.000
Clearly, three variables in data are on the same scale, still we choose to center and scale the data
# Standardize the data
X = scale(X_raw, center = T, scale = T)
Now we perform the PCA (one dimensional). Perform the eigen-decomposition
1 >
S= X X = BΛB >
N
to get B = [b1 , b2 , b3 ] and Λ = diag(λ1 , λ2 , λ3 )
# Compute the principal components
S = t(X) %*% X / nrow(X)
B = eigen(S)
B
## eigen() decomposition
## $values
## [1] 2.74534630 0.15136108 0.07103456
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.5819132 -0.3965166 0.710036333
## [2,] -0.5816912 -0.4072185 -0.704136719
## [3,] -0.5683418 0.8227683 -0.006315571
This is the matrix B contains principal component vectors
B$vectors
## [,1] [,2] [,3]
## [1,] -0.5819132 -0.3965166 0.710036333
## [2,] -0.5816912 -0.4072185 -0.704136719
## [3,] -0.5683418 0.8227683 -0.006315571
The first principle component b1 is the first column of the matrix B.
# First principal component (PC1) is the 1st column of matrix B
b1 = B$vectors[,1]
Now, for any country n, we can project the vector of medal counts xn
xn = [goldn , silvern , bronzen ]>
onto b1 to obtain the coordinates of the project observations
z1n = b>
1 xn = −0.581 × goldn − 0.581 × silvern − 0.568 × bronzen
for n = 1, 2, 3 (US, China, Japan). Notice that the corresponding weights are quite similar for gold, silver
and bronze medals, suggesting their impacts on the data variation are almost equal.
# coordinates of PC1 for all countries
z1 = X %*% b1
dim(z1)
## [1] 93 1
# The coordinate of US for the first component
z1[1]
## [1] -8.879672
X[1,] %*% b1
## [,1]
## [1,] -8.879672
# The coordinate of China for the first component
z1[2]
## [1] -6.692825
# The coordinate of Japan for the first component
z1[3]
## [1] -4.00174
We obtain all the coordinates z11 , z12 , . . . , z1N for N = 93 countries, using which as scores, we rank
the countries.
# Ranking by coordinates on first PC
result1 = rank(z1, ties.method='min')
result1
## [1] 1 2 5 4 3 6 9 10 8 7 11 12 15 18 14
## [16] 13 19 24 25 32 27 17 28 21 23 31 41 26 40 47
## [31] 52 51 30 22 20 62 43 70 60 42 69 88 71 16 37
## [46] 61 39 35 45 59 38 68 34 43 49 56 35 87 81 63
## [61] 63 63 93 89 89 50 33 48 55 67 54 63 45 73 56
## [76] 56 92 81 81 81 81 81 29 53 71 89 73 73 73 73
## [91] 73 73 73
Note that here we used rank() function to assign a rank to each element in a vector. To handle
ties with rank() function, you can use the ties.method argument to specify how we should handle
ties. In the code, we set ties.method=’max’, this will assign every tied element to the highest rank
(elements ranked in the 3rd and 4th position would both receive a rank of 4). For details, see https:
//www.statology.org/sort-order-rank-in-r/.
Let’s see how diffferent the results are, based on different ranking methods
# Ranking by total count
score2 = as.vector(-X_raw%*%matrix(c(1,1,1),nrow=3))
result2 = rank(score2, ties.method='min')
# Ranking by gold count
score3 = -X_raw%*%matrix(c(1,0,0),nrow=3)
result3 = rank(score3, ties.method='min')
# ranking difference PC1 vs total
result1 - result2
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1
## [16] 0 0 1 0 3 1 0 2 1 0 2 8 0 7 5
## [31] 4 3 1 0 0 9 1 7 7 0 6 17 0 0 4
## [46] 8 6 2 3 6 5 5 1 1 1 3 2 16 10 0
## [61] 0 0 4 0 0 2 0 0 2 4 1 0 3 2 3
## [76] 3 3 10 10 10 10 10 0 0 0 0 2 2 2 2
## [91] 2 2 2
# ranking difference PC1 vs gold
result1 - result3
## [1] 0 0 2 0 -2 0 2 3 1 0 0
## [12] 1 4 7 -1 -2 2 7 8 15 10 -5
## [23] 6 -1 1 9 19 4 18 25 30 29 -4
## [34] -12 -14 28 0 36 26 -25 35 54 4 -27
## [45] -6 18 17 -8 2 16 4 25 -9 0 6
## [56] 13 -8 44 38 -4 -4 -4 50 22 22 -17
## [67] -34 -19 -12 0 -13 -4 2 6 13 13 25
## [78] 38 38 38 38 38 -38 -14 4 22 6 6
## [89] 6 6 6 6 6
We can see that the ranking result from the PCA is quite similar to that from the total count, but quite
different to that based on the gold medal count.
Here is the summary of three different ranking results
# Add three ranking results into the data
dat1 = data.frame(X_raw)
rownames(dat1) <- medals$Country
dat1$PC1 = result1
dat1$Tot = result2
dat1$By_G = result3
dat1
## Gold Silver Bronze PC1 Tot By_G
## UnitedStates 39 41 33 1 1 1
## China 38 32 19 2 2 2
## Japan 27 14 17 5 5 3
## GreatBritain 22 20 22 4 4 4
## ROC 20 28 23 3 3 5
## Australia 17 7 22 6 6 6
## Netherlands 10 12 14 9 9 7
## France 10 12 11 10 10 7
## Germany 10 11 16 8 8 7
## Italy 10 10 20 7 7 7
## Canada 7 7 10 11 11 11
## Brazil 7 6 8 12 12 11
## NewZealand 7 6 7 15 13 11
## Cuba 7 3 5 18 18 11
## Hungary 6 7 7 14 13 15
## SouthKorea 6 4 10 13 13 15
## Poland 4 5 5 19 19 17
## CzechRepublic 4 4 3 24 23 17
## Kenya 4 4 2 25 25 17
## Norway 4 2 2 32 29 17
## Jamaica 4 1 4 27 26 17
## Spain 3 8 6 17 17 22
## Sweden 3 6 0 28 26 22
## Switzerland 3 4 6 21 20 22
## Denmark 3 4 4 23 23 22
## Croatia 3 3 2 31 29 22
## Iran 3 2 2 41 33 22
## Serbia 3 1 5 26 26 22
## Belgium 3 1 3 40 33 22
## Bulgaria 3 1 2 47 42 22
## Slovenia 3 1 1 52 48 22
## Uzbekistan 3 0 2 51 48 22
## Georgia 2 5 1 30 29 34
## ChineseTaipei 2 4 6 22 22 34
## Turkey 2 2 9 20 20 34
## Greece 2 1 1 62 53 34
## Uganda 1 1 4 43 42 43
## Ecuador 2 1 0 70 63 34
## Ireland 2 0 2 60 53 34
## Israel 0 2 4 42 42 67
## Qatar 2 0 1 69 63 34
## Bahamas 2 0 0 88 71 34
## Kosovo 0 0 2 71 71 67
## Ukraine 1 6 12 16 16 43
## Belarus 1 3 3 37 33 43
## Romania 1 3 0 61 53 43
## Venezuela 3 0 4 39 33 22
## India 1 2 4 35 33 43
## HongKong 1 2 3 45 42 43
## Philippines 1 2 1 59 53 43
## Slovakia 2 1 4 38 33 34
## SouthAfrica 1 2 0 68 63 43
## Austria 1 1 5 34 33 43
## Egypt 1 1 4 43 42 43
## Indonesia 1 1 3 49 48 43
## Ethiopia 1 1 2 56 53 43
## Portugal 1 2 4 35 33 43
## Tunisia 1 1 0 87 71 43
## Estonia 1 0 1 81 71 43
## Fiji 0 1 2 63 63 67
## Latvia 0 1 2 63 63 67
## Thailand 0 1 2 63 63 67
## Bermuda 1 0 0 93 89 43
## Morocco 0 0 1 89 89 67
## PuertoRico 0 0 1 89 89 67
## Colombia 0 4 1 50 48 67
## Azerbaijan 0 3 4 33 33 67
## DominicanRepublic 0 3 2 48 48 67
## Armenia 0 2 2 55 53 67
## Kyrgyzstan 0 2 1 67 63 67
## Mongolia 0 1 3 54 53 67
## Argentina 0 1 2 63 63 67
## SanMarino 1 2 3 45 42 43
## Jordan 0 1 1 73 71 67
## Malaysia 1 1 2 56 53 43
## Nigeria 1 1 2 56 53 43
## Bahrain 0 1 0 92 89 67
## Lithuania 1 0 1 81 71 43
## Namibia 1 0 1 81 71 43
## NorthMacedonia 1 0 1 81 71 43
## SaudiArabia 1 0 1 81 71 43
## Turkmenistan 1 0 1 81 71 43
## Kazakhstan 0 0 8 29 29 67
## Mexico 0 0 4 53 53 67
## Finland 0 0 2 71 71 67
## Botswana 0 0 1 89 89 67
## BurkinaFaso 0 1 1 73 71 67
## Ghana 0 1 1 73 71 67
## Grenada 0 1 1 73 71 67
## IvoryCoast 0 1 1 73 71 67
## Kuwait 0 1 1 73 71 67
## Moldova 0 1 1 73 71 67
## Syria 0 1 1 73 71 67
All analysis above can be done through a simple function in R, i.e. prcomp
# Alternatively, you can use function prcomp to perform PCA analysis
pca_result = prcomp(~ Gold + Silver + Bronze, data = dat1, center = T, scale= T)
# coordinates of PC1 for all countries
pca_result$x[,1]
## UnitedStates China
## -8.87967167 -6.69282455
## Japan GreatBritain
## -4.00174004 -4.58505818
## ROC Australia
## -5.22035889 -3.02053352
## Netherlands France
## -2.13520920 -1.85496936
## Germany Italy
## -2.23357384 -2.51876505
## Canada Brazil
## -1.07053464 -0.79524617
## NewZealand Cuba
## -0.70183289 -0.24962058
## Hungary SouthKorea
## -0.70739085 -0.72224495
## Poland CzechRepublic
## -0.17783256 0.09745592
## Kenya Norway
## 0.19086920 0.36779303
## Jamaica Spain
## 0.26942839 -0.45372763
## Sweden Switzerland
## 0.28367588 -0.09987997
## Denmark Croatia
## 0.08694659 0.36223507
## Iran Serbia
## 0.45069698 0.25891906
## Belgium Bulgaria
## 0.44574562 0.53915890
## Slovenia Uzbekistan
## 0.63257218 0.62762082
## Georgia ChineseTaipei
## 0.36162847 -0.01697602
## Turkey Greece
## -0.12029203 0.71547613
## Uganda Ecuador
## 0.51814024 0.80888941
## Ireland Israel
## 0.71052477 0.51258228
## Qatar Bahamas
## 0.80393805 0.89735133
## Kosovo Ukraine
## 0.87633267 -0.67147558
## Belarus Romania
## 0.43462969 0.71486953
## Venezuela India
## 0.44079425 0.42967832
## HongKong Philippines
## 0.52309161 0.70991817
## Slovakia SouthAfrica
## 0.43523629 0.80333145
## Austria Egypt
## 0.42472696 0.51814024
## Indonesia Ethiopia
## 0.61155352 0.70496680
## Portugal Tunisia
## 0.42967832 0.89179336
## Estonia Fiji
## 0.88684200 0.78787075
## Latvia Thailand
## 0.78787075 0.78787075
## Bermuda Morocco
## 0.98025528 0.96974595
## PuertoRico Colombia
## 0.96974595 0.61589829
## Azerbaijan DominicanRepublic
## 0.42412036 0.61094692
## Armenia Kyrgyzstan
## 0.69940884 0.79282212
## Mongolia Argentina
## 0.69445747 0.78787075
## SanMarino Jordan
## 0.52309161 0.88128403
## Malaysia Nigeria
## 0.70496680 0.70496680
## Bahrain Lithuania
## 0.97469731 0.88684200
## Namibia NorthMacedonia
## 0.88684200 0.88684200
## SaudiArabia Turkmenistan
## 0.88684200 0.88684200
## Kazakhstan Mexico
## 0.31585299 0.68950611
## Finland Botswana
## 0.87633267 0.96974595
## BurkinaFaso Ghana
## 0.88128403 0.88128403
## Grenada IvoryCoast
## 0.88128403 0.88128403
## Kuwait Moldova
## 0.88128403 0.88128403
## Syria
## 0.88128403
# First principal component (PC1)
pca_result$rotation[,1]
## Gold Silver Bronze
## -0.5819132 -0.5816912 -0.5683418
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.6659 0.3912 0.26797
## Proportion of Variance 0.9251 0.0510 0.02394
## Cumulative Proportion 0.9251 0.9761 1.00000
plot(pca_result)
Variances
0.0 0.5 1.0 1.5 2.0 2.5
pca_result