Colley Matrix: Bias-Free Football Rankings
Colley Matrix: Bias-Free Football Rankings
r
nw
rd r
_
1
0
(1 r)
n
r
nw
d r
=
1 + n
w
2 + n
+ n
w
, (6)
which recovers equation (1). It is an interesting exercise to check a few more examples.
4. Strength of Schedule
The simple statistic developed in the last section would sufce to produce a ranking if we
were condent that all teams had played a schedule of similar strength, or for instance a round-
robin tournament. While a round-robin with 117 teams would require 6786 games, Division I-A
teams play typically a tenth that, so there is absolutely no assurance that the quality of opponents
from team to team is close to the same. Contrast this with the NFL, or especially the Major League,
where each team plays a very healthy sample of the entire league during the regular season.
This problem is complicated by the addition of still more teams in the form of non-I-A op-
ponents. If one were to use those games as input, he would have to form ratings of all the I-AA
teams, which would require ratings of teams in still lesser divisions, since many I-AA teams play
7
such opponents. Forming sensible ratings which relate Florida State to Emory & Henry is ex-
tremely difcult and is frankly beyond the scope of this method. The reason is that my method, in
its simplicity, relies on some interconnectedness between opponents, which simply does not exist
between a given NAIA squad and a given Division I-A squadtheres barely enough interconnect-
edness among the I-A teams themselves! Most other computer rankings within the BCS system
do endeavor to compute such ratings, and in my opinion, do nearly as good a job as is possible
at making sense of such disparate and competitively disconnected teams. To preserve simplic-
ity and total objectivity (no ad hoc division adjustment, etc.), my rating system must ignore all
games against non-I-A opponents. Therefore, padding the schedule with I-AA teams contributes
absolutely nothing to a teams rating.
We may then proceed with mathematical adjustments for strength of schedule within Division
I-A itself.
The number of wins in equation (1) may be divided into n
w,i
= (n
w,i
n
,i
)/2 + n
tot,i
/2
(which the reader can check). Recognizing that the second term may be written as
n
tot,i
1/2
allows one to identify the sum as that of the ratings of team is opponents, if those opponents are
all random (r = 1/2) teams. Instead, then, of using r = 1/2 for all opponents, we now use their
actual ratings, which gives an obvious correction to n
w,i
.
n
eff
w,i
= (n
w,i
n
,i
)/2 +
n
tot,i
j=1
r
i
j
, (7)
where r
i
j
is the rating of the j
th
opponent of team i. The second term (the summation) in equation
(7) is the adjustment for strength of schedule.
Now, the rub. When the teams are not random, but ones which have played other teams,
which may or may not have played some teams in common with the rst team, etc., how does one
possibly gure out simultaneously all the r
i
j
s which are inputs to the r
i
s, which are themselves
r
i
j
s for other r
i
s, etc.?
5. The Iterative Scheme
The most obvious way to solve such a problem is a technique called iteration, which is em-
ployed by several of the other computer ranking methods. The way it works is one rst computes
the ratings, as if all the opponents were random (r = 1/2) teams, using equation (1). Next, each
teams strength of schedule is computed according to its opponents ratings, using equation (7).
The ratings are re-computed with the new schedule strengths, and then strengths of schedule are
re-computed from the new ratings. With luck, the changes to the ratings get smaller and smaller
8
with each step of these calculations, and after a time, when the changes are negligibly small for
any teams rating (a part in a million, say), one calls the list of ratings, at that point, nal.
Here is a very simple example of the iterative technique, after only one week of play, where a
team that has played is either 0-1 against a 1-0 team, or vice versa. Before any iterations, the basic
Laplace statistic from equation (1) is computed for each team. Letting r
W,0
be the initial rating of
a winning team, and r
L,0
be the initial rating of the losing team, one nds that equation (1) initially
gives
Initial ratings
r
W,0
= (1 + 1)/(2 + 1) = 2/3 0.6667
r
L,0
= (1 + 0)/(2 + 1) = 1/3 0.3333.
(8)
The rst adjustment for strength of schedule (theres been only one game, so schedule strength
is just the rating of the one opponent) is made by computing n
eff
w
for each team, using equation
(7):
First Correction
n
eff
w,W,1
= (1 0)/2 + 1/3 = 5/6
n
eff
w,L,1
= (0 1)/2 + 2/3 = 1/6.
(9)
Because the 1-0 team beat a 0-1 team, worse than an average team, the 1-0 team is punished, and
given only 5/6 of a win, whereas the losing team lost to a 1-0 team, better than an average team,
and is rewarded by suffering only 5/6 of a loss. One can see how the method explicitly gives to
one team only by taking from another.
The next step is to re-compute the ratings, given the new n
eff
w
values. Plugging back into
equation (1) yields:
Ratings After First Iteration
r
W,1
= (1 + 5/6)/(2 + 1) = 11/18 0.6111
r
L,1
= (1 + 1/6)/(2 + 1) = 7/18 0.3889.
(10)
Lets look at just one more iteration.
Second Correction
n
eff
w,W,2
= (1 0)/2 + 7/18 = 8/9
n
eff
w,L,2
= (0 1)/2 + 11/18 = 1/9.
(11)
Ratings After Second Iteration
r
W,2
= (1 + 8/9)/(2 + 1) = 17/27 0.6296
r
L,2
= (1 + 1/9)/(2 + 1) = 10/27 0.3704.
(12)
If one examines the ratings of the winning team after the zeroth, rst and second iterations,
one nds that the values r
W,{0,1,2}
{0.6667, 0.6111, 0.6296}, show rst a correction down, then
9
a correction up, by a lesser amount. Corrections that alternate in sign, and shrink in magnitude
are hallmarks of convergence, meaning that with each iteration, the scheme is closer to nding a
nal, consistent value. Table 1 shows how these numbers converge to a part in a million after 11
iterations.
In fact, one can demonstrate that the nal ratings in this simple case are explicitly the sums of
converging series (compare to Table 1),
r
L
=
1
2
_
1
1
3
+
1
9
1
27
+
=
1
2
n=0
(1/3)
n
=
1
2
1
1+1/3
=
3
8
,
(13)
where the last line is the standard formula for the sum of a geometric series. In this simple case,
the iterative method converges rapidly and stably, as a classic alternating geometric series.
Also note that the results converge to an average rating of 1/2, which is the same average as if
there had been no game played at all; average rating has been conserved.
The ratings may converge nicely, but how can one know that these are the right answers?
Furthermore, is the method extensible to the prodigiously more complicated case of 117 teams
having played 11 or 12 games each?
6. The Colley Matrix Method
The previous section showed how an iterative correction for strength of schedule could pro-
vide consistent results that make intuitive sense for the simple one game case, but left us with the
question of how do we know that the result is really right?
Let us return, then, to the example of the two teams 1-0, and 0-1 after their rst game. Refer-
ring to equations (1) and (7), we have
r
W
=
1+1/2+r
L
2+1
r
L
=
11/2+r
W
2+1
.
(14)
A simple rearrangement gives
3r
W
r
L
= 3/2
r
W
+ 3r
L
= 1/2,
(15)
a simple two-variable linear system. Plugging in the results from the iterative technique (Table 1),
one discovers that indeed r
W
= 5/8 and r
L
= 3/8 work exactly.
10
This exercise illustrates that linear methods can be used for two teams, but begs the question,
can the ratings of many teams, after many games, be computed by simple linear methods?
6.1. The Matrix Solution
Returning to equations (1) and (7), using the same denitions for r
i
and r
i
j
, one nds that
equations (1) and (7) can be rearranged in the form:
(2 + n
tot,i
)r
i
n
tot,i
j=1
r
i
j
= 1 + (n
w,i
n
,i
)/2, (16)
which is a system of N linear equations with N variables.
It is convenient at this point to switch to matrix form by rewriting equation (16) as follows,
Cr =
b, (17)
where r is a column-vector of all the ratings r
i
, and
, (20)
where is the vector of the true (nal) ratings, and
_
5 0 1 1 1
0 4 1 0 1
1 1 6 1 1
1 0 1 4 0
1 1 1 0 5
_
_
_
_
r
a
r
b
r
c
r
d
r
e
_
_
=
_
_
1/2
1
1
1
3/2
_
_
, (24)
13
Solving the matrix equation yields sensible results:
r = {19, 24, 23, 22, 27}/46 {0.413, 0.522, 0.500, 0.478, 0.587} (25)
As with the two team case, the ratings average to exactly 1/2 (23/46). Notice that team b, having
played a 2-1 team and a 2-2 team, is rated higher than team d, having played a 1-2 team and a
2-2 team, despite identical 1-1 records: an example of how strength of schedule comes into play.
Finally, there is consistency in that team c has a rating of exactly 1/2, because that team is 2-2
against teams whose ratings average to exactly 1/2.
7. Comments on the Colley Matrix Method
There has been discussion of the fact that the matrix method (and the iterative method) con-
serves an average ranking of 1/2. The reason I emphasize that point is that, as such, the ratings
herein require no renormalization. All teams started out with a rating of 1/2, and only by exchange
of rating points with other teams does one teams rating change. The rst subsection below shows
why the rating scheme explicitly conserves total ratings points. The subsection which follows es-
tablishes that the matrix in equation (17) is indeed positive denite, which allows for quick and
stable solution of the matrix equation. The last subsection shows that the matrix C is singular if
winning percentages are used in place of Laplaces formula, and thus, the method cannot be used
with straight winning percentages.
7.1. Conservation of Average Rating
Why does the matrix solution always preserve the average rating of 1/2? We can tackle this
problem by examining the construction of the matrix C. From the denition of the matrix C in
equation (19), it follows that the matrix can be represented as:
C = 2I +
all games
k
G
k
, (26)
14
where I is the identity matrix, and G
k
is a matrix operator added for each game k. G
k
always has
the form that G
k
ii
= G
k
jj
= 1, and G
k
ij
= G
k
ji
= 1, with all other entries 0, like this:
G
k
=
_
_
.
.
.
.
.
.
1 1
.
.
.
.
.
.
1 1
.
.
.
.
.
.
_
_
. (27)
Carrying out the multiplication r
= G
k
r, we nd that the i
th
and j
th
entries in r
are r
i
r
j
, and
r
j
r
i
, respectively, while all other entries in r
are obviously zeroes. The sum of all the r
values,
is, therefore, zero, no matter what the values of r. Hence,
N
i=1
(Cr )
i
=
N
i=1
(2Ir )
i
= 2
N
i=1
r
i
. (28)
What about the other side of equation (17),
b? The denition of b
i
from equation (18) is
b
i
= 1 +(n
w,i
n
,i
)/2. It is easy to see that the total of the b
i
s must be N, since each win by one
team must be offset by a loss by that teams opponent, so that the total number of wins must equal
the total number of losses, and therefore,
b
i
= N.
So, summing both sides of equation (17), we have
i
(Cr )
i
=
i
b
i
2
i
r
i
= N;
(29)
r
i
N
=
1
2
, (30)
i.e., the average value of r is exactly one-half.
7.2. The Colley Matrix is Positive Denite
In order to use Cholesky decomposition and back substitution to solve the matrix equation
(17), the matrix C must be positive denite, such that, for any non-trivial vector v, this inequality
holds
v
T
(Cv) > 0. (31)
Recalling our separation of C into 2I +
k
G
k
, and noting that matrix multiplication is distribu-
tive, (A+B)v = Av +Bv, we can examine the inequality piece-wise. Obviously the matrix 2I is
15
positive denite, which leaves the G
k
s. In the subsection above, we discovered that the multipli-
cation Gr yielded zeroes in all entries, except the i
th
and j
th
which contained r
i
r
j
and r
j
r
i
,
respectively. The product r
T
(Cr) is thus computed as
r
T
(G
k
r ) = r
i
(r
i
r
j
) + r
j
(r
j
r
i
) = (r
i
r
j
)
2
0. (32)
Since (r
i
r
j
)
2
0, and r
T
(2Ir ) > 0, the matrix C must be positive denite.
7.3. Singularity of C for Straight Winning Percentages
We have seen that the iterative method would fail if straight winning percentages were used
(i.e. if one removed the +1 and +2 from the numerator and denominator in equation [1]). In per-
forming the same exercise with the matrix method, equation (19) would change C into a singular
matrix!
For the one game case, the result is obvious.
C =
_
1 1
1 1
_
, (33)
obviously singular. In the general case, its easy to see from equation (19) that if one removes
the addend of 2 from c
ii
, the total of the c
ij
s for any row or any column is zero. Therefore, in
performing the legal row operation of adding all the rows, one has produced a row of all zeros;
hence the matrix is singular. The matrix method, like the iterative method, cannot work with
simple winning percentages.
8. Performance
With the mathematics of the ranking method on a rm foundation, the next question is how
well does the method perform. Answering that question is difcult because we do not actually
know the truth. We simply do not know in any precise way how good Western Michigan is
relative to Hawaii; therefore there is really nothing to check our results against.
The best we can do is compare our results to the venerable press polls, just as a mutual sanity
check. Before I even begin, I should caution that the press polls are articially correlated simply
because the coaches are well aware of the AP poll, and the press writers are well aware of the
Coaches poll. Nonetheless we shall proceed.
Table 3 compares the nal Colley rankings against the nal AP rankings for 19982002 and
against the Coaches Poll rankings for 19992002. The rst thing to notice is that the national
16
champion is agreed upon every year in all three ranking systems, which is ultimately the most im-
portant test for our purposes. Just scanning down the lists, the rankings are usually quite consistent
within a few places, with an occasional outlier.
To quantify the agreement, I have found it more useful to think in terms of ranking ratios,
or percentage differences, rather than simple arithmetic ranking differences. I have previously
used the median absolute difference as a gure of merit, but have concluded this statistic poorly
describes the behavior of the ranking comparisons. To show that, I have plotted in Fig. 2 the
arithmetic differences (top) and ratios (bottom) of my rankings vs. the press rankings for 1999
2002 (all lumped together). In the plots, the histograms for the AP and Coaches rankings are
over-plotted (theyre very similar).
Throwing all six groups together (three years, AP, Coaches), one can compute the direct
average and variance of the distributions, which are listed as avg and s on the plots (yes, s is
the square-root of the variance, and has a
_
n/(n 1) factor). The corresponding normal curve
is plotted as a dotted line. Another way of nding the mean and variance of a distribution is to t
the integral of the normal curvethats P(x) for you calculator statisticians, or the error function
with
2s in the right places for you math pedantsto the cumulative distribution. I have plotted
those resulting normal curves as solid lines at top and bottom. If a distribution is normal, these two
methods should be nearly identical. Obviously at top, the two curves are not so identical, but at
bottom things seem much better.
Of course there are dozens of formal ways to check Gaussianity of a distribution, but I dont
want to belabor the point. I just want to illustrate that the ratios are much better behaved than the
arithmetic differences. What does this mean? It means that its much more accurate to say my
rankings disagree with the press rankings by a typical percentage, rather than by a typical number.
To compute that typical percentage, one may average the absolute differences of the logs of
the rankings, as such,
= exp
_
1
25
25
i=1
| log j
C
(team
i
) log i|
_
, (34)
where i is the press ranking (either AP or coaches), and j
C
(team
i
) is the Colley ranking of that
team. Its hard to come up with a good name for this statistic, ; perhaps mean absolute ratio.
Anyway, I list that statistic at the bottom of Table 3 for each of the 4 years of the BCS. The values
are typically about 1.25, which means that my rankings agree with the press rankings within about
25% in either direction, so you might have an error of 1 ranking place at around #5, but about 5
places by #20. Inspecting the columns of Table 3 shows this to be quite a good description of the
relative rankings.
Is that a good agreement?
17
Who knows, really? But to me (at least) the agreement is surprisingly good:
The press polls started with a pre-season poll, with all the pre-conceived notions of history
and tradition such an endeavor demands, then week by week allowed their opinions and
judgments to migrate, being duly impressed or disappointed in the styles of winning and
losing by certain teams, being more concerned about recent games than earlier ones, perhaps
mentally weighting games seen on television as more important, perhaps having biases (good
or bad) toward local schools one sees more often... ad nauseam.
My computer rankings started with nothing, literally no information, but then, given only
wins and losses, generated a ranking with pure algebra.
That two such processes produce even remotely consistent results is, frankly, remarkable to
me.
I hope in this section we can agree to have learned, despite a lack of truth data, comparison
of the press polls and my rankings shows both that the press and coaches must not be too loony,
and that the Colley Matrix system yields common sense results.
9. Conclusions
Colleys Bias Free College Football Ranking Method, based on solution of the Colley Matrix,
has been developed with several salient features, desirable in any computer poll that claims to be
unbiased.
1. It has no bias toward conference, tradition, history, or prognostication.
2. It is reproducible; one can check the results.
3. It uses a minimum of assumptions.
4. It uses no ad hoc adjustments.
5. It nonetheless adjusts for strength of schedule.
6. It ignores runaway scores.
7. It produces common sense results that compare well to the press polls.
This information, the weekly poll updates, as well as useful college football links may be
found on the Internet home for Colleys Rankings:
http://www.colleyrankings.com/.
WNC would like to thank A. Peimbert and J. R. Gott for their contributions in many lively
18
discussions on the subject of rankings. All programming for this method was done by WNC in
IDL, FORTRAN, C, C++, Perl and shell script in the Solaris Unix and Linux environments.
REFERENCES
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P., 1992, Numerical Recipes in
FORTRAN, Cambridge University Press, Cambridge, UK, pp. 89-91
This preprint was prepared with the AAS L
A
T
E
X macros v5.0.
19
Convergence of Ratings via Iteration
Iteration Winning Teams Rating Losing Teams Rating
0 0.666667 0.333333
1 0.611111 0.388889
2 0.629630 0.370370
3 0.623457 0.376543
4 0.625514 0.374486
5 0.624829 0.375171
6 0.625057 0.374943
7 0.624981 0.375019
8 0.625006 0.374994
9 0.624998 0.375002
10 0.625001 0.374999
11 0.625000 0.375000
Table 1: Convergence of ratings to nal, stable values, after 11 iterations, for the simple two team,
one game case. The initial ratings are 2/3 for the winner and 1/3 for the loser, before any adjustment
for schedule strength. Moving down the table are successive adjustments for strength of schedule.
Because the winning team beat a below average (0-1) team, while the losing team lost to an above
average (1-0) team, the nal ratings are lower for the winning team, and greater for the losing team
than were the initial ratings.
20
Iterative Convergence of Ratings in a Round-Robin
iter. team 1 other teams
1. r +
1
r =
n
eff
w
=
eff
w
n
eff
w
eff
w
+
1
2. r = r = +
1
/(2 + n)
n
eff
w
=
eff
w
+ n
1
/(2 + n) n
eff
w
=
eff
w
+ (n 1)
1
/(2 + n)
3. r = + n
1
/(2 + n)
2
r = + (n 1)
1
/(2 + n)
2
n
eff
w
=
eff
w
+ n(n 1)
1
/(2 + n)
2
n
eff
w
=
eff
w
+ [(n 1)
2
+ n]
1
/(2 + n)
2
4. r = + n(n 1)
1
/(2 + n)
3
r = + [(n 1)
2
+ n]
1
/(2 + n)
3
.
.
.
.
.
.
Table 2: Iterative convergence in a round-robin. Ratings r and effective wins n
eff
w
are computed
from equations (1) and (7) in each iteration. Starting with the correct (nal) values, and
eff
w
, an
error
1
is added to team 1s rating. Column 1 gives the propagation of that error in team 1 through
4 iterations; Column 2 does the same for all other teams (whose errors will be equivalent). As one
moves down the table, the errors shrink (slowly).
21
Comparison of Final Rankings with Press Polls
Colley Ranking for Teams with Given Press Rank
press 1998 1999 2000 2001 2002
rank AP AP Coaches AP Coaches AP Coaches AP Coaches
1 1 1 1 1 1 1 1 1 1
2 2 5 2 3 3 3 3 3 3
3 3 2 5 2 2 4 4 2 2
4 6 11 11 5 6 2 2 4 4
5 8 3 3 6 5 6 6 5 5
6 4 7 7 4 4 10 10 6 11
7 10 4 4 7 8 8 5 11 6
8 5 6 6 8 11 5 8 8 8
9 11 10 10 11 7 7 7 7 7
10 9 8 8 9 13 11 13 12 12
11 7 9 9 13 9 13 11 10 14
12 13 13 12 22 22 9 9 14 17
13 12 12 15 19 19 15 15 13 13
14 17 15 13 17 12 12 12 19 20
15 16 14 14 10 17 16 16 17 18
16 22 24 24 12 10 14 17 18 19
17 14 28 23 15 30 17 14 9 9
18 19 23 17 20 23 31 31 20 24
19 20 17 28 29 15 18 18 24 32
20 23 18 22 30 20 20 20 21 23
21 24 21 18 23 29 30 21 15 21
22 15 27 27 28 27 24 28 25 28
23 27 22 21 14 18 28 30 28 15
24 21 16 29 27 14 33 19 32 26
25 26 26 16 18 32 19 24 23 25
Colley vs. poll 1.224 1.309 1.281 1.287 1.331 1.262 1.232 1.200 1.253
AP vs. Coaches n/a 1.071 1.098 1.037 1.082
Table 3: Comparison of nal rankings to AP Poll for 19982002, and to the Coaches Poll for
19992002. At bottom is a statistic , described in the text. Essentially, it is the typical ratio of
the Colley ranking to the poll ranking, or vice versa, so that the larger of the two always in the
numerator, (specically is the exponent of the mean of the absolute values of the logs of the
ratios), so = 1.25 means the rankings would differ by typically one place at #4, and 5 places at
#20.
22
Fig. 1. Probability distribution functions for the Laplace dice problem, analogous to rating by
wins and losses. At top left is the initial condition {no dies thrown; no games played}. At top
right is {one die left; one game won}: the probability density must be zero at the left. At bottom
left is {two dies left; two games won}: the probability densities multiply. At bottom right is {one
die left, one die right; one game won, one game lost}: the probability density must be zero at the
left and at the right. The functions here have been normalized to have an integral of one, which is
irrelevant in section 3 of the text, because the normalizations explicitly divide out.
23
Fig. 2. Two different ways for comparing the Colley Rankings with the Press Polls. (at top)
Arithmetic differences, (Colley press), between the nal rankings for 19992002 for both the
AP and Coaches polls. Over-plotted are the normal curves from direct measurement of mean
(= avg) and standard deviation (= s), and from tting for the mean (= ) and standard deviation
(= ). (at bottom) Same plots, but for ratios (Colley press) in logarithmic space.