SUROP Report
SUROP Report
November, 2016
Abstract
1 Introduction
Covariance-correlation matrix is of fundamental importance wherever large data sets are involved
and relations between many random variables are to be understood. In modern portfolio theory,
accurate estimation of covariance-correlation is crucial to risk management and asset allocation [1],
as correlations measure the tendency of collective movement of different stocks, and underpin the
interactions between them.
In this project we will consider a Random Matrix Theory based approach to testing a proposed
multi-layer structured correlation model, constructed with information extracted through mode and
clustering analyses.
The following section gives an overview of the data analysed and its processing. In Section 3 we
will introduce the Marčenko-Pastur law in Random Matrix Theory; in Sections 4 and 5 we will
consider the features of our data using mode and hierarchical clustering analyses; then in Section 7
we propose a multi-layered structure in our correlation model, with the effect of layer depth analysed
in Section 8; finally, in Section 10, we will briefly discuss possible developments to this project.
All computational work in this project has been carried out in MATLAB. The Live Scripts are
published on the author’s web-page [2].
    ∗
     This project is part of the Summer Undergraduate Research Opportunities Programme (SUROP), and is supported
by the Bridgwater Scheme.
                                                        1
2        Data Overview
The S&P 500 stock market data1 studied are stored as a matrix whose rows represent the trading
days and columns the different stocks. We first calculate the logarithmic returns for all consecutive
trading days for each stock,
                                                       pi     pi − pi−1
                                   log return = log        (≈           ),   i>1
                                                      pi−1       pi−1
where pi represent the price index of a stock on the i-th trading day.
This leaves us a matrix of T = 1258 rows of observations and P = 452 columns corresponding
to each individual stock. We will demean each column by subtracting the column average and
normalise the entries so that the total variance for any stock is one. The data matrix X : T × P is
now standardised, and the empirical (covariance-)correlation matrix is simply
                                                         1 T
                                                 E=        X X.                                       (1)
                                                         T
We will denote the underlying, or true, correlation matrix by C – this is the object that we will attempt
to model based on the spectrum of E.
Random Matrix Theory (RMT) was first introduced by John Wishart in 1928, who was the first di-
rector at the Statistical Laboratory at the University of Cambridge. It became a prominent field of
study when the physicist Eugene Wigner applied it to spacing of energy levels in nuclear physics
[3].
The theory has a collection of universality laws, since it concerns the emergent behaviours of large
classes of random matrices in the asymptotic limit, that is to say, when the dimensions of the matrix
tend to infinity. One important instance of these is the Marčenko-Pastur law for a class of random
matrices called the Wishart ensemble, which include all correlation matrices:
Theorem 1 (The Marčenko-Pastur law). If X is a T × P random matrix whose entries are indepen-
dently identically distributed (i.i.d.) random variables (r.v.’s) with mean 0 and variance σ 2 < ∞, then
the eigenvalue density function (e.d.f) of matrix (1) is
                                                  p
                                              1     (λ+ − λ)(λ − λ− )
                                    f (λ) =     2
                                                                                                      (2)
                                            2πσ             rλ
                                                                        √
in the limit P, T → ∞ and P/T → r ∈ (0, 1), where λ± = σ 2 (1 ± r)2 .
    1
        See the author’s webpage [2].
                                                          2
                                                                                 histogram of observed eigenvalues
                                                              1.2
                                                                                 Marčenko-Pastur distribution
                                                              0.8                   0.01
                                                                                                    market mode
                                                                                 0.005
                                                              0.6
                                                                                      0
                                                                                               20   40   60       80   100
                                                              0.4
0.2 signals
                                                               0
                                                                    0   1   2   3              4    5    6        7     8
                                                                                     eigenvalue
Figure 1: The Marčenko-Pastur distribution does not match the empirical eigenvalue distribution of
the our S&P 500 stock market correlations. The eigenvalues lying outside the prediction range are
regarded as signals that our stock market data are not purely random.
For our standardised data, variance σ 2 = 1. The key parameter of the Marčenko-Pastur distribution
is then the concentration r = P/T , which intuitively represents the abundance of observations
compared to the number of variables. An interpretation of the Marčenko-Pastur law in our context
is that if the logarithmic returns of our stocks are independently, identically distributed, i.e. totally
random, then regardless of the underlying distribution2 , the observed eigenvalue distribution of E is
governed by (2).
This result provides a natural test for the null hypothesis that the data are completely random: if we
plot the observed eigenvalue density function against the Marčenko-Pastur distribution, any eigen-
values that lie far out from the Marčenko-Pastur prediction can be regarded as signals, suggesting
that the data are not truly random. In Figure 1, we see that the Marčenko-Pastur distribution is far
from a match to our observed empirical eigenvalue distribution. There are many signals above and
below the edges of the Marčenko-Pastur prediction, all suggesting our S&P 500 stock market data
are not purely random.
However, this is not at all surprising. We know that in reality many stocks are related and the
market structure is entangled and complex. Our aim is to construct a better model for the underlying
correlation matrix C, and then use the techniques in RMT to derive a new prediction for the limiting
empirical eigenvalue distribution when P, T → ∞, which improves the detection for any new
signals.
4 Mode Analysis
The eigenvalue and eigenvector pairs of the empirical correlation matrix E are referred to as the
modes of the market. They give insight into the interactions between individual stocks as well as
market sectors.
One feature of these modes is the localisation of the eigenvector components, conveniently mea-
    2
        As long as its first and second moments are bounded, which is a reasonable assumption.
                                                                                           3
sured by the inverse participation ratio
                                                        P
                                                        X
                                            IPR(v) =          |ṽi |4
                                                        i=1
where ṽ is the vector v demeaned and normalised. In Figure 2 we have shown the components
of the market mode corresponding to the largest eigenvalue, the 9-th mode and the lowest mode
corresponding to the least eigenvalue. We see that the market mode has a low IPR, which means in
this mode all stocks move in a similar fashion, responding to the overall trend of the market (and
hence its name). The 8th mode is more localised, and we can differentiate the edges of the market
sectors. The lowest mode is highly localised, and the interaction between two companies in the
energy and industrial sectors are clearly visible.
These different types of modes suggest there are correlation interactions at the stock, sector and
market levels, so a layered model may be appropriate to capture such interactions.
It is observed that components of the market mode eigenvector are relatively uniform, and more
crucially, have the same sign. The proposition below may explain this.
Proposition 1. If A be a positive square matrix, then
    1) it has a positive real eigenvalue λ1 with multiplicity 1 that has the largest magnitude of all its
       eigenvalues;
    2) it has a unique positive unit eigenvector and it corresponds to λ1 . All other eigenvectors must
       have at least one negative or non-real component.
Proof. Let e1 be the unit eigenvector for the largest eigenvalue λ1 of A. Then
                                           e1 = arg max xT Ax.
                                                   kxk=1
By reordering the basis we can without loss of generality (w.l.o.g.) assume that the first k compo-
nents of e1 are negative, and the rest are positive, where 1 < k < n and n is the dimension of matrix
A. Hence                         XX                           XX
                       eT1 Ae1 =          (e1 )i Aij (e1 )j +    (e1 )i Aij (e1 )j .
                                  i≤k j≤k                       i>k j>k
However, switching the signs of (e1 )i≤k preserves the norm while increasing the sum above, so by
reductio ad absurdum, the non-zero components of e1 must be of the same sign. In fact, e1 cannot
have a zero component by the consistency condition Ae1 = λ1 e1 .
By orthogonality of eigenvectors belonging to different eigen-subspaces, the other eigenvectors
must have components of mixed signs.
If we remove the market mode from the correlation matrix E, i.e. compute the ‘modified’ correlation
matrix
                                       E 0 = E − λ1 v1 v1T                                       (3)
                                                    4
                       0.08
0.07
0.06
0.05
0.04
0.03
                       0.02
                                                    Financials                     Materials
                       0.01                   Energy                            IT      Telecom
                                        Consumer Staples              Industrials         Utilities
                               Consumer Discretionary          Healthcare
                          0
                               0    50      100    150     200     250      300         350    400     450
(a) The market mode with eigenvalue 99.1 and IPR 3.98 × 10−5 .
                                                  Financials        Industrials
                             Consumer Discretionary          Healthcare       IT
                         0.1          Consumer Staples
                                            Energy
0.05
                       -0.05
                                                                                          Materials
                                                                                              Telecom
                        -0.1                                                                     Utilities
                       -0.15
                               0    50      100    150     200     250      300         350    400     450
(b) The 8th mode with eigenvalue 3.46 and IPR 6.51 × 10−3 .
                                                                                         Materials
                                                  Financials                                 Utilities
                         0.4                Energy                                               Telecom
                                      Consumer Staples
                             Consumer Discretionary
                         0.2
                        -0.2
                                                            Healthcare
                                                                         Industrials
                                                                                   IT
                        -0.4
                        -0.6
                               0     50     100    150     200     250       300        350    400     450
(c) The lowest mode with eigenvalue 0.0596 and IPR 0.149.
Figure 2: Plots of the eigenvector components of three different modes with their IPRs calculated.
                                                            5
                                                                ROST
                                                                RSH
                                                                BBBY
                                                                JWN
                                                                HD
                                                                BBY
                                                                COST
                                                                JCP
                                                                KSS
                                                                WMT
                                                                TGT
                                                                FDO
                                                                LTD
                                                                TJX
                                                                GPS
                                                                ANF
                                                                WAG
                                                                HCBK
                                                                CVS
                1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
Figure 3: Two visualisations of the stock market structure and relations without the market mode.
The minimum spanning tree stock nodes are coloured by market sector.
where λ1 , v1 are the largest eigenvalue and eigenvector, then hierarchical clustering analysis may
reveal hidden market structure and relations under the overall market movement.
To perform clustering we must specify a distance measure; a natural choice is the dissimilarity dis-
tance, defined by
                                      dij = 1 − corr(i, j)                                       (4)
where corr(i, j) is the correlation between stocks i and j. Since correlations are reflexive and always
between -1 and 1, this meets the criteria of a metric 3 . It is also a convenient choice as dij is linear in
the correlations.
We will here adopt the average linkage method, which means the distance between two clusters I, J
is the average distance between all pairs of stocks from the two sector
                                                 1      X
                                      DIJ =                   dij                             (5)
                                               |I||J| i∈I,j∈J
Our clustering analysis has given us a new way of defining different market sectors based on the
average dissimilarity distances. We have found that if we do not remove the market mode, the aver-
age linkage method produces a large number of singleton clusters. To avoid this, we have removed
    3
        The triangle inequality can be easily checked.
                                                                       6
the market mode before performing clustering. We then redo the mode analysis in Section 4 to
find whether this new classification of market sectors is satisfactory. The results are presented in
Figure 4.
We see that the new classification of the market sectors unfortunately does not have clear boundaries
as we have seen in Figure 2. This is due to the removal of the market mode: despite being reasonably
uniform, the market mode still contains a substantial amount of structural information about the
market, as we could already differentiate the sectors in the original market mode plot in Figure 2.
There is a major caveat here: although we have removed the market mode in our clustering analysis,
to construct the multi-layer structured correlation model we will restore it. This is because, as could
be seen in Figure 4a, the market mode is not perfectly uniform and the boundaries of one particular
sector could already be distinguished.
This means that in this particular case the market mode contains structural information about the
stock market, and we need to keep it if we are to build an accurate correlation model. In fact, if
we did not, as computations have shown, the constructed correlation matrix might not be positive
semi-definite and arbitrary control of the negative entries would have to be implemented.
In the multi-layered model, the diagonal blocks of the correlation matrix model C represent the
correlations inside the sub-clusters in the lowest layer from the top of the hierarchy. These diagonal
blocks make up large diagonal blocks at a higher layer, and at each layer the off-diagonal blocks
represent the correlation interactions between intermediate clusters in that layer. All background
entries will be filled in with average correlations in that part of the layer, and the diagonal entries
will be set to unit.
For example, for a two-sector market toy model,
                                                                                     
                                1 α1 · · · α1
                                                .
                             α1 1 . . . ..
                                                                                     
                                                                                      
                             .
                             . ... ...                              β                
                             .                α1
                                                                                      
                                                                                      
                               α    · · ·   α  1
                                                                                     
                       C= 1                 1
                                                                                                    (6)
                                                                                     
                                                        1     α2         ···    α2
                                                                                      
                                                                                     
                                                                         ...    ..   
                                                       α2      1                 .   
                            
                                         β              ..    ...       ...
                                                                                      
                                                                                      
                                                         .                     α2    
                                                        α2    ···        α2     1
where diagonal blocks represent two clusters with respective average internal correlations α1,2 , and
the constant off-diagonal block entry β represents the average interaction correlation between the
two clusters.
                                                  7
                        0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
                          0
                               0   50   100   150   200   250   300   350   400   450
(a) The market mode with eigenvalue 99.1 and IPR 3.98 × 10−5 .
0.15
0.1
0.05
-0.05
-0.1
                       -0.15
                               0   50   100   150   200   250   300   350   400   450
(b) The 8th mode with eigenvalue 3.46 and IPR 6.51 × 10−3 .
0.6
0.4
0.2
-0.2
-0.4
                        -0.6
                               0   50   100   150   200   250   300   350   400   450
(c) The lowest mode with eigenvalue 0.0596 and IPR 0.149.
Figure 4: Plots of the eigenvector components of the same three modes. The vertical coloured lines
define the boundaries of newly classified market sectors.
                                                     8
                                                1
                                                                                                                               histogram of observed eigenvalues
       50                                       0.9                                         1.2
                                                                                                                               Marčenko-Pastur distribution
                                                0.5                                         0.6
      250
                                                0.4
      300
                                                                                            0.4
                                                0.3
      350
                                                0.2                                         0.2
      400
                                                0.1
      450                                                                                    0
               100    200   300       400                                                         0                        5                  10                   15
                                                                                                                                 eigenvalue
Figure 5: The heat-map of a 50-layer correlation matrix model and its simulated analytic prediction
for the empirical correlation matrix spectrum.
Now that we have a model from the underlying correlation matrix C with eigenvalues denoted
λ1 , . . . , λP , we will used techniques in RMT to derive a predicted limiting eigenvalue distribution
for the empirical correlation matrix E.
Let the limiting empirical eigenvalue density function (e.d.f.) of E be f (λ), then Marčenko and
Pastur have shown [4] that its Stieltjes transform, related by the transform pair
                               Z ∞
                                         f (λ)
                      G(z) =         dλ        ,    f (λ) = lim Im G(λ + i),                  (7)
                                −∞      λ−z                  →0
Solving this polynomial equation would give the new predicted distribution for the empirical cor-
relation matrix spectrum, but in practice it is computationally costly and may suffer numerical in-
stability. Instead, simulations of the empirical correlation matrix by randomly generated Gaussian
data subject to correlation matrix C would suffice.
In Figure 5, we have shown the heat-map of a 30-layer correlation model constructed with informa-
tion from clustering analysis along with the simulated analytic prediction for the empirical correla-
tion matrix spectrum.
                                                        9
8     Dependence on Layer Depth of the Proposed Model
The key parameter of the proposed model that we could control is the layer depth, i.e. the number
of layers constructed. To increase the layer depth, we essentially divide the lowest layer: in this
process the original diagonal sub-blocks are split into two smaller diagonal sub-blocks, and new
off-diagonal blocks are created.
To understand this process in detail as well as its effect on the eigenvalues of the correlation model,
we consider the following matrices:
                                                                                      
                                                                       0     M1 B
              M := Mm (1, α), M1,2 := Mm1,2 (1, α1,2 ) and M :=                                    (10)
                                                                             B T M2
where m = m1 + m2 , and
                                               ···
                                                         
                                 x    y                y
                                     ...        ...    .. 
                            y                          .
                            
                Mn (x, y) ≡  .      ...
                                      ...                 ,    B = β(1, . . . , 1)T (1, 1, . . . , 1) .
                             ..                       y            | {z } |              {z        }
                                                                             m1              m2
                              y ··· y                  x
                            |      {z                     }
                                           n
The interpretation of these matrices is that M, M1,2 are all diagonal sub-blocks in the correlation
model C and they have the same general form of Mn (x, y) : n × n with diagonal entries x = 1 and
off-diagonal y = α, α1,2 . When layer division happens, the sub-block M becomes M 0 , and we can
view this change as a perturbation to entries α to α1,2 , β depending on its location, whether in M1,2
or in B (T ) .
                                                           10
Using the identity for invertible matrix block V
                                                     S − T V −1 U T
                                                               
                            S T          I      0
                                                   =                  ,
                           U V       −V −1 U I            0       V
we have
               det(M 0 − λI) = det (M1 − λI) − B(M2 − λI)−1 B T det(M2 − λI)
                                                              
Although in this derivation we have assumed λ 6= 1 − α2 , 1 + (m2 − 1)α2 , the characteristic equa-
tion (12) with p(λ) ≡ q(µ) is valid ∀λ because any divergence is offset by the det(M2 − λI) factor
in equation (11).
                                                     11
8.3                                       Interpretation of the solutions of the polynomial equation (14)
In the case α1 = α2 = β ≡ α, equation (14) has two roots λ1 = 1 + (m1 + m2 − 1)α and λ2 = 1 − α,
just as expected for eigenvalues of M since now M 0 = M . Here we note that λ2 coincides with the
other eigenvalues arising from the factor (1 − λ − α1 )m1 −1 (1 − λ − α2 )m2 −1 in the characteristic
polynomial (11).
When layer division takes place to increase the layer number, we may have α1 = α2 ≡ α 6= β so
that equation (14) is perturbed to
                                    1.2                                                                                                 1.2
                                                          1                                                                                              1
                                                                                                        normalised eigenvalue density
    normalised eigenvalue density
                                     1                                                                                                   1
                                                         0.5                                                                                            0.5
                                    0.8                                                                                                 0.8
                                                          0                                                                                              0
                                                               0   0.5   1    1.5   2   2.5   3                                                               0   0.5   1        1.5   2   2.5   3
                                    0.6                                                                                                 0.6
                                    0.4                                                                                                 0.4
                                                       histogram of observed eigenvalues                                                              histogram of observed eigenvalues
                                                       Marčenko-Pastur distribution                                                                  Marčenko-Pastur distribution
                                    0.2                                                                                                 0.2
                                                       simulated analytic prediction                                                                  simulated analytic prediction
                                     0                                                                                                   0
                                          0        5                     10                   15                                              0   5                         10                   15
                                                         eigenvalue                                                                                     eigenvalue
Figure 6: Comparison of the predicted empirical spectral density functions of a 10-layer correlation
model and a 148-layer one with the market mode.
We saw in Section 6 that the new classification of sectors was not robust, whereas in Section 4 we
saw at the market mode level the distinction between pre-assigned sectors was already obvious. We
wonder if this prior information on sectoring could be incorporate in our model. Here we propose
the following construction based on the observations.
We will assume the pre-assigned sectors have zero (little) correlations, whereas inside each sector
                                                                                                   12
                                                       1.2
                                                                      1
                                                       0.8
                                                                      0
                                                                           0    0.5   1        1.5   2   2.5   3
                                                       0.6
Figure 7: Predicted empirical spectral density functions of the alternative model with prior sectoring,
superposed with the M-P law and the previous analytic prediction.
the stocks have equal mutual correlations. The underlying correlation matrix then takes the block-
diagonal form                                             
                                        M1
                                      
                                            M2            
                                                           
                                                 . .      
                                                     .    
                                                        Ms
where Mi ≡ Mmi as defined in equation (10), and s = 10 is the number of pre-assigned sectors (as
in Figure 2). The data are processed without the removal of the market mode.
We repeat the procedure as in Section 7.2 to plot the predicted empirical spectrum of the correlation
matrix in Figure 7.
We see that this model, even without detailed layer division, is a reasonable fit to the observed eigen-
value density, but it resembles in shape more of the M-P distribution than the observed distribution,
or the previous 148-layer model.
Through mode and clustering analyses, we have been able to construct a multi-layer structured
correlation model for the S&P 500 stock market. By analysing the dependence of the spectrum of
our model on the layer depth, we have shown analytically that increasing the number of layers
improves the match between (simulated) prediction of the empirical spectral density function with
observed eigenvalues of the empirical correlation matrix.
This results in a reliable estimation for the underlying correlation structure of the market analysed,
which may then have a positive impact on investment portfolios.
However, our study of the stock market correlations can be further developed by considering:
   1) edge asymptotics;
Our correlation matrix is finite-dimensional, which means eigenvalues are expected to leak out of
the edges of the predicted distribution of the empirical correlation matrix spectrum, which is derived
                                                                           13
in the asymptotic limit. This leakage effect could be studied using the Tracy-Widom law.
  2) time evolution;
The underlying stock market data have been assumed to have a stationary distribution in time, and
this is unlikely a good assumption. We need to build the time parameter/variable t into our model.
  3) fine-tuning.
We have performed the hierarchical clustering analysis with average linkage to build a binary tree
structure. The correlation model can be improved with tailored clustering method suited for the
stock market data.
Acknowledgements
Many thanks goes to my project supervisor, Dr Lucy Colwell, and her PhD student Chongli Qin at
the Department of Chemistry, University of Cambridge, whose guidance and help have been crucial
to this research project. I am also grateful for the generous support by the Bridgwater Scheme.
References
[1] S. Pafka, M. Potters, and I. Kondor. Exponential Weighting and Random-Matrix-Theory-Based
    Filtering of Financial Covariance Matrices for Portfolio Optimization. arXiv:cond-mat/0402573,
    February 2004.
[4] V. A. Marčenko and L. A. Pastur. Distribution of eigenvalues in certain sets of random matrices.
    Mathematics of the USSR-Sbornik, 1(4), 1967.
14