Utilizing Macroeconomic Factors for Sector Rotation based on Interpretable Machine Learning and Explainable AI

Ye Zhu, Chao Yi, Yixin Chen
Data Center, China Asset Management Co., Ltd., Beijing 100033, China
zhuye tj@163.com, yic@chinaamc.com, chenyx@chinaamc.com

Abstract—This paper focuses on the application of explainable AI in finance, introducing machine learning models such as multiple linear regression, ridge regression, and random forest. We compare their effects through an empirical analysis of the Chinese stock market. In addition, we propose three methods, namely feature selection, discretization of returns, and a signal timing strategy, to improve the utility of our model. The empirical results show that our models can effectively select industries that will perform well in the future, which supports the importance and feasibility of applying explainable AI in the financial field.

Keywords—explainable AI, random forest, feature selection, macroeconomic factors, crowded market indicator

978-1-7281-6251-5/20/$31.00 © 2020 IEEE

I. INTRODUCTION

Aristotle said: “Knowing yourself is the beginning of all wisdom”. In recent years, big-data artificial intelligence represented by deep learning has developed rapidly, and machines have gradually surpassed humans in perception capabilities such as image and speech recognition. However, the learning and prediction of machine learning models is often a “black box” that lacks interpretability, which greatly reduces the credibility of the prediction results. Therefore, Explainable Artificial Intelligence (XAI) for machine learning has been developed to make the AI learning process transparent, so that both the process and the results can be better interpreted.

This paper applies several explainable AI models to financial data, constructs an industry rotation strategy based on macroeconomic data, focuses on the interpretability of the model and the interpretation of the results, and demonstrates the effectiveness of explainable AI models in the financial field.

II. BACKGROUND

A. The Impact of Macroeconomic Factors on the Stock Market

It is well known that macroeconomic data has an impact on the stock market. In 2004, Merrill Lynch Wealth Management released the Merrill Lynch investment clock model [1]. Based on an economic growth factor (Gross Domestic Product, GDP) and an inflation factor (Consumer Price Index, CPI), the business cycle was divided into four stages: recession, recovery, overheating and stagflation.

Chen et al. [2] tested seven macroeconomic factors such as industrial production and inflation, and found that inflation, industrial production, changes in the risk premium and twists in the yield curve have a significant impact on the stock market. Adam and Tweneboah [3] analyzed the short-term and long-term effects of four macroeconomic factors on stock market indexes based on Ghanaian data, and found that inflation and the exchange rate have important effects on stock prices in the short term, while in the long run interest rates and inflation have more significant effects. Singh et al. [4] tested the relationship between index returns and macroeconomic factors using Taiwan data. The results showed that the exchange rate and GDP have a significant impact on the overall economic situation, while inflation, the exchange rate and money supply have an impact on large and medium-sized companies.

B. Constructing Macroeconomic Factors

Based on the macroeconomic factors in the references above [2] [3] [4], we construct five types of factors. To increase the modeling frequency, we convert all data to monthly frequency, and each type of factor is synthesized from several indicators.
• Growth factor: GDP reflects the growth of the market economy well, but it is published at a low frequency. Therefore, we use the Purchasing Managers' Index (PMI), the growth rate of infrastructure investment and the growth rate of total industrial profits to reflect the growth rate of GDP. After unified data preprocessing and removal of the seasonal trend, we combine the indicators using the reciprocal of volatility as the weight.
• Inflation factor: We reflect the inflation of daily life and production through the oil price, the pork price and the thread steel (rebar) price index. These indicators not only reflect the CPI but also make an extra lag of the data unnecessary.
• Rate factor: This factor focuses on bond market interest rates. Through the yields to maturity of the one-year and the 10-year treasury bonds, it reflects the short-term and the long-term market structure, respectively,
and characterizes the market liquidity and fundamental prosperity.
• Credit factor: This factor is the credit spread between the five-year AAA short-term and medium-term bill yield and the five-year treasury bond yield to maturity, reflecting investors' confidence in the market.
• Exchange factor: Because we consider the stock market in mainland China, we focus on China's main trading partners and choose the middle price of the US dollar against the RMB.

In addition, the references show that macroeconomic factors have short-term and long-term effects on the stock market. Therefore, this paper adds lagged terms of the macroeconomic factors at different periods to the model to further increase its interpretability. See Feature Combinations in Preliminary Study for details.

C. Data Source

The data in this paper comes from the WIND financial database, a comprehensive tool for financial professionals that covers Chinese stocks, bonds, funds, futures, RMB rates, and the economy. The CITIC Industry Index classification is used in this paper. The comprehensive finance industry is excluded because little data is available for it. The data window runs from April 29, 2010 to August 13, 2020.

III. PRELIMINARY STUDY

Based on these five macroeconomic factors, we can apply the models to predict the trend of industry returns.

A. Explainable AI Models

To explore the effectiveness of explainable AI models applied to financial data, the models tried in this section are multiple linear regression, ridge regression and random forest regression.

Kibria and Banik [5] compared the performance of ordinary least squares (OLS) estimation and ridge regression estimation under different parameter constructions from the perspective of numerical testing. The results showed that the performance of ridge regression estimation is closely related to the construction of the penalty parameter, but that it always performed better than OLS. Saleh et al. [6] introduced multiple linear regression, ridge regression and their parameter estimation, and explained the relationship between ridge regression estimation and LASSO (least absolute shrinkage and selection operator). Breiman [7] argued that the random forest model can improve prediction accuracy by combining multiple decision trees. However, Smith et al. [8] found that on certain neuroscience prediction data sets, multiple linear regression performs better than random forest regression.

The formula of the multiple linear regression model is:

$$y = X\beta + \varepsilon, \tag{1}$$

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},\quad X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix},\quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \tag{2}$$

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \tag{3}$$

$$J(\beta) = \sum (y - X\beta)^2 \tag{4}$$

In the above formulas, y is the rate of return of the industry indexes, with n = 24 because the model rolls forward over two years of monthly data each time; X contains the macroeconomic factors, with p ∈ {5, 10, 15}; β is the coefficient vector to be estimated; ε is the residual of our model; and J(β) is the objective function to be minimized.

Ridge regression builds on multiple linear regression and adds an L2 regularization term to the objective function:

$$J(\beta) = \sum (y - X\beta)^2 + \lambda \lVert \beta \rVert_2^2 \tag{5}$$

where λ is the penalty coefficient, a non-negative number. Generally, we use cross-validation to choose the value of λ. For λ > 0, the regularization term ensures that X^T X + λI is full rank and invertible, making the estimated coefficients more stable and reliable.

Random forest is an ensemble of decision trees based on bagging. Randomly selecting the training samples and the feature subsets for each tree makes the model more robust to noise. The average of all trees is output as the prediction. Since about one-third of the samples are left out of each tree's bootstrap sample and can be used for model validation, there is no need to split off test data in advance, and the model is not prone to overfitting.

B. Feature Combinations

Because macroeconomic factors may have short-term and long-term effects, this paper selects features with different lag periods to build the model. The combinations of factors we have tried are shown in Table I.

TABLE I
FACTOR COMBINATIONS

Sets | Factors in set
factors (1st period lag) | {G_{t−1}, I_{t−1}, R_{t−1}, C_{t−1}, E_{t−1}}
factors (1st+6th periods lag) | {G_{t−1}, I_{t−1}, R_{t−1}, C_{t−1}, E_{t−1}, G_{t−6}, I_{t−6}, R_{t−6}, C_{t−6}, E_{t−6}}
factors (1st+6th+12th periods lag) | {G_{t−1}, I_{t−1}, R_{t−1}, C_{t−1}, E_{t−1}, G_{t−6}, I_{t−6}, R_{t−6}, C_{t−6}, E_{t−6}, G_{t−12}, I_{t−12}, R_{t−12}, C_{t−12}, E_{t−12}}
* {G, I, R, C, E} represent the growth factor, inflation factor, rate factor, credit factor and exchange factor, respectively.
** The subscript of {G, I, R, C, E} gives the number of lag periods.
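The lagged factor sets of Table I and the ridge objective in Eq. (5) can be illustrated with a small sketch. It is not the authors' code: the helper names `build_lagged_features` and `ridge_fit`, the synthetic data, and the fixed λ = 1 are assumptions made for illustration (the paper chooses λ by cross-validation). The ridge estimate uses the standard closed form β = (X^T X + λI)^{-1} X^T y, which follows from minimizing Eq. (5) as written, with the intercept penalized as well.

```python
import numpy as np


def build_lagged_features(factors: np.ndarray, lags: list) -> np.ndarray:
    """Stack the factor values at t - lag for every lag in `lags`.

    `factors` has one row per month and one column per factor (G, I, R, C, E).
    Output row r corresponds to month t = max(lags) + r.
    """
    n = factors.shape[0]
    start = max(lags)
    blocks = [factors[start - lag: n - lag] for lag in lags]
    return np.hstack(blocks)


def ridge_fit(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form minimizer of sum((y - X b)^2) + lam * ||b||^2."""
    X1 = np.column_stack([np.ones(len(X)), X])  # column of ones as in Eq. (2)
    return np.linalg.solve(X1.T @ X1 + lam * np.eye(X1.shape[1]), X1.T @ y)


# Synthetic monthly data standing in for the real factor and return series (assumption).
rng = np.random.default_rng(0)
factors = rng.normal(size=(36, 5))            # 36 months, 5 factors
returns = rng.normal(scale=0.05, size=36)     # monthly returns of one industry index

set_1 = build_lagged_features(factors, [1])          # 1st period lag
set_2 = build_lagged_features(factors, [1, 6])       # 1st + 6th periods lag
set_3 = build_lagged_features(factors, [1, 6, 12])   # 1st + 6th + 12th periods lag

y = returns[12:]                              # returns aligned with the rows of set_3
beta_ols = ridge_fit(set_3, y, lam=0.0)       # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_fit(set_3, y, lam=1.0)     # lam = 1.0 is an arbitrary illustrative value
print(set_1.shape, set_2.shape, set_3.shape)
```

The three sets have 5, 10 and 15 columns, matching p ∈ {5, 10, 15}, and with 36 months of history the longest lag leaves exactly n = 24 rows for one rolling window.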
C. Results

Using the rate of return as the predicted variable, for each industry, at the end of each month, we fit a rolling model on the returns and macroeconomic factors of the past two years and predict the rate of return of the next month. We then sort the predicted returns and select the top five industries to construct the portfolio.

Since each model requires two years of history, the backtest period is from May 31, 2012 to August 13, 2020.

Table II shows the backtest results of our empirical tests. For ridge regression and random forest, the results first improve and then deteriorate as the number of variables increases. This shows that the macroeconomic factors do have a lagged effect on the rate of return, but that factors with too long a lag have insufficient influence and may cause the model to overfit, making the results worse. The ridge regression and random forest models mostly improve on the multiple regression model, most clearly for the 1st + 6th periods lag set, which suggests that the factors are indeed correlated and that this correlation disturbs the model.

TABLE II
BACKTEST RESULTS OF MODELS

Model | 1st period lag: Annualized rate of return | 1st period lag: Sharpe ratio | 1st + 6th periods lag: Annualized rate of return | 1st + 6th periods lag: Sharpe ratio | 1st + 6th + 12th periods lag: Annualized rate of return | 1st + 6th + 12th periods lag: Sharpe ratio
MLR | 9.4% | 0.330 | 7.9% | 0.264 | 10.8% | 0.358
RR | 7.2% | 0.250 | 14.0% | 0.467 | 12.2% | 0.399
RF | 10.8% | 0.364 | 13.0% | 0.439 | 10.3% | 0.358
Benchmark | 8.8% | 0.321 | 8.8% | 0.321 | 8.8% | 0.321
* MLR represents the multiple linear regression model; RR represents the ridge regression model; RF represents the random forest regression model.
** Benchmark is the industry equal weight index.
*** Annualized rate of return refers to the rate of return obtained by converting investment income into one year. Sharpe ratio measures the ratio of benefits to risks. A higher annualized rate of return or a higher Sharpe ratio indicates a better portfolio performance.

IV. FURTHER STUDY

In this section, we propose three methods to improve the utility of our model.

A. Feature Selection

As the number of variables increases, two problems arise: first, collinearity increases; second, the model becomes more and more prone to overfitting. Therefore, we perform feature selection on the combination of variables and incorporate only the selected features into the model.

Abdi and Williams [9] introduced the principal component analysis (PCA) method in detail. PCA obtains orthogonal principal components and reduces the data dimension without losing key information by extracting the important information from the data table. Liaw and Wiener [10] found that although the feature importance obtained by constructing a random forest differs from run to run, the ordering is basically unchanged. Grömping [11] compared linear regression and random forest for assessing the importance of features. The simulation results showed that random forest can select important variables that are highly correlated with the response variable and have better predictive ability. Speiser et al. [12] compared different random forest feature selection methods on multiple classification data sets. The random forest model that selects features based on feature importance has a lower out-of-bag error rate and is simpler.

The methods used in this paper are as follows:
• First, reduce the dimensionality of the variables through PCA: in each rolling window, select the principal components with the highest variance contribution rates until the total variance contribution rate exceeds 0.9, and then build the model through multiple linear regression and random forest to predict the rate of return;
• In each rolling window, first output the feature importance from a random forest model, select the features with the highest importance rankings whose total importance exceeds 0.8, and then build the model through multiple linear regression and random forest to predict the rate of return.

The variable combination selected here is factors (1st period + 6th period + 12th period lag). Table III shows the backtest results of the models after feature selection. The results show that dimensionality reduction through PCA effectively improves the predictive ability of the model. However, the multiple regression model built on features selected by importance does not beat the benchmark on the evaluation metrics, which suggests that the features remaining after importance screening may still be strongly correlated, leading the multiple regression to a spurious fit.

TABLE III
BACKTEST RESULTS OF MODELS AFTER FEATURE SELECTION

Method | Model | Annualized rate of return | Sharpe ratio
Benchmark | | 8.8% | 0.321
PCA | MLR | 11.3% | 0.398
PCA | RF | 11.3% | 0.395
FS | MLR | 8.7% | 0.305
FS | RF | 11.4% | 0.396
* MLR represents the multiple linear regression model; RF represents the random forest regression model.
** Benchmark is the industry equal weight index.
*** PCA represents principal component analysis; FS represents feature importance selection.

B. Discretization of Returns

The result of directly predicting the rate of return is not good. A possible reason is that we hold only the five industries with the highest predicted returns, while the predicted returns may deviate substantially from the realized ones. From macroeconomic factors alone, it may not be possible to find the best industries each time, so the strategy fails to open up a gap over the industry equal weight benchmark. From another perspective, however, we can discretize the rate of return and fit the probability of a positive rate of return through rolling modeling. Using the same three feature combinations, we use logistic regression and random forest classification models to generate the probability that the rate of return is positive.

Logistic regression is often used in classification problems, and its formula is:

$$y = \frac{1}{1 + e^{-(\omega^{T} x + b)}} \tag{6}$$

In the above formula, y = P(Y = 1|x) is the probability that the rate of return is positive, where Y ∈ {−1, 1} is the sign of the rate of return; x is the macroeconomic factors; ω and b are the parameters to be estimated.

A random forest classification is determined jointly by multiple decision trees, so the random forest model can also directly output the probability that the rate of return is positive, that is, the number of decision trees with positive results divided by the total number of decision trees.
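The vote-share probability described above, together with the selection of five industries, might look like the following sketch. It assumes scikit-learn's `RandomForestClassifier` as the classifier and uses synthetic data; the helper `positive_return_probability`, the number of trees, and the 0/1 label coding (instead of the −1/1 coding used in the text) are illustrative assumptions rather than the authors' setup. Note that scikit-learn's own `predict_proba` averages per-tree class probabilities, so the vote share is computed explicitly here to match the description in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def positive_return_probability(X_train, y_train, x_next, seed=0):
    """Share of trees that vote for a positive return in the next month."""
    labels = (y_train > 0).astype(int)       # 1 = positive return, 0 = otherwise
    if labels.min() == labels.max():         # only one class observed in the window
        return float(labels[0])
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X_train, labels)
    x_row = x_next.reshape(1, -1)
    votes = np.array([tree.predict(x_row)[0] for tree in forest.estimators_])
    return float(np.mean(votes == 1))


# Synthetic stand-ins (assumption): 24 training months plus one forecast month.
rng = np.random.default_rng(42)
n_months, n_factors, n_industries = 25, 5, 8
lagged_factors = rng.normal(size=(n_months, n_factors))      # last row is the newest observation
industry_returns = rng.normal(scale=0.05, size=(n_months - 1, n_industries))

probs = np.array([
    positive_return_probability(
        lagged_factors[:-1], industry_returns[:, j], lagged_factors[-1]
    )
    for j in range(n_industries)
])

# Hold the five industries with the highest probability of a positive return.
top_five = np.argsort(probs)[::-1][:5]
print(np.round(probs, 3), top_five)
```

In a rolling backtest this step would be repeated at each month end with the most recent two-year window.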
Breiman [7] found that random forests perform better in classification than other classifiers such as discriminant analysis and support vector machines. Couronné et al. [13] compared the random forest model with default parameters and the logistic regression model on multiple real data sets. In about 69% of the data sets, random forest classification performed better than logistic regression.

We consider the improved method of modeling after discretizing the rate of return. First, we construct a portfolio of the five industries with the highest probability of positive returns. From the backtest results in Table IV, the lagged influence of macroeconomic factors is very obvious, and the prediction ability of the models with the 6th period lag and the 12th period lag is greatly improved.

TABLE IV
BACKTEST RESULTS OF CLASSIFICATION MODELS

Model | Factors | Annualized rate of return | Sharpe ratio
Benchmark | | 8.8% | 0.321
LR | 1st | 7.1% | 0.245
LR | 1st+6th | 12.3% | 0.415
LR | 1st+6th+12th | 14.0% | 0.487
RF | 1st | 8.7% | 0.300
RF | 1st+6th | 14.2% | 0.495
RF | 1st+6th+12th | 11.9% | 0.413
* LR represents the logistic regression model; RF represents the random forest classification model.
** Benchmark is the industry equal weight index.
*** 1st means factors (1st period lag); 1st+6th means factors (1st period lag + 6th period lag), and 1st+6th+12th means factors (1st period lag + 6th period lag + 12th period lag).

Secondly, we set a threshold and select the industries whose probability of a positive rate of return is above the threshold to construct the portfolio. If none of them exceeds the threshold, the portfolio holds the industry equal weight benchmark at that time point. We choose the first period lag of macroeconomic factors as the feature combination, and choose the random forest classification model and a threshold of 0.5. Fig. 1 shows the trend of the net value of our portfolio and the benchmark. Fig. 2 shows the relative strength of our portfolio and the benchmark. We can see that this portfolio performs relatively well. However, changes to the combination of factors or slight adjustments to the threshold make the results unsatisfactory. This shows that if we only want to predict the sign of the rate of return, we can find an effective configuration by adjusting the parameters, but in this way the parameter setting is not objective and the strategy is not stable enough.

Fig. 1. Trend of net value of our portfolio and benchmark

Fig. 2. Relative strength of our portfolio and benchmark

C. Signal Timing Optimization in Pursuit of Revenue

This section analyzes whether there is overheating of trading on the cross-section, so as to optimize the model strategy and increase the interpretability of the strategy. When the industry index has risen in the past period of time and the crowdedness index exceeds the historical quantile threshold, the industry is
considered to be crowded and should not be selected into the portfolio, and those that are already in the portfolio should be cleared.

Yang and Zhou [14] explained excess returns by constructing investor sentiment indicators and stock crowding indicators. The empirical results support the view that crowding and investor sentiment significantly affect stock prices. Kinlaw et al. [15] used congestion and valuation to classify the state of the sector. The empirical results show that the congestion measure can effectively avoid the bubble period.

Our method of constructing the congestion index is to select multiple indicators from six aspects: momentum, liquidity, deviation rate, volume-price correlation coefficient, volatility and distribution characteristic coefficient, and to assign different calculation window periods and quantile thresholds. We test the effectiveness of monthly optimization and daily optimization for each indicator on the basis of the industry equal weight benchmark. The monthly optimization strategy selects all industries that are not crowded at the end of the month and constructs an equal weight portfolio. The daily optimization strategy builds on the monthly optimization strategy: the index is monitored every day, and if a holding industry gives a congestion signal, the industry is cleared immediately and held as cash. Table V shows the finally selected indicators and their corresponding window periods and thresholds.

TABLE V
OUR CONGESTION INDICATORS

Index | Period | Threshold
CTC | 60 | 1%
CVC | 40 | 1%
Kurtosis | 60 | 1%
* CTC is the correlation coefficient between turnover rate and closing price; CVC is the correlation coefficient between volume and closing price; Kurtosis is the kurtosis of the index return.

If any congestion indicator sends a congestion signal, we follow the signal and clear the index; that is, we combine these three congestion indicators into a composite indicator. From the results of daily optimization in Table VI, the composite indicator is better than the single indicators.

TABLE VI
BACKTEST RESULTS AFTER CONGESTION OPTIMIZATION

Indicator | Optimization | Annualized rate of return | Sharpe ratio
Benchmark | | 10.8% | 0.358
CTC | monthly | 10.6% | 0.355
CTC | daily | 11.3% | 0.383
CVC | monthly | 10.8% | 0.357
CVC | daily | 10.7% | 0.359
Kurtosis | monthly | 11.7% | 0.399
Kurtosis | daily | 10.5% | 0.387
CF | monthly | 11.4% | 0.391
CF | daily | 11.6% | 0.436
* CTC is the correlation coefficient between turnover rate and closing price; CVC is the correlation coefficient between volume and closing price; Kurtosis is the kurtosis of the index return; CF is the composite indicator generated from the indicators above.
** Benchmark is the backtest result of the multiple regression model, which uses 1st+6th+12th periods lag as the feature combination.

Finally, we try to combine multiple improvement measures. First, we discretize the rate of return, and then we use the composite indicator for optimization. As an empirical example of how the model performs, we take macroeconomic factors (1st period lag + 6th period lag) as the feature combination and random forest classification as the model. Fig. 3 shows the results. Our strategy significantly outperforms the industry equal weight benchmark, and the Sharpe ratio is also greatly improved.

Fig. 3. Trend of net value of our portfolio and benchmark
* Benchmark is the industry equal weight index; Original strategy uses random forest classification and the feature combination is macroeconomic factors (1st period lag + 6th period lag); Monthly and Daily optimization strategy is based on the original strategy and optimized using the composite crowdedness indicator.

V. CONCLUSION

From the perspective of macroeconomic factors, this paper uses the CITIC industry index data of the past ten years and applies explainable AI models on a rolling basis to predict index returns and construct investment portfolios. The empirical results show that building a model to construct a portfolio can greatly increase the annualized rate of return and the Sharpe ratio compared with the industry equal weight benchmark, which demonstrates the application prospects of explainable AI models in the financial field.

ACKNOWLEDGMENT

This work is supported by China Asset Management. We also thank the reviewer for very helpful comments.

REFERENCES

[1] T. Greetham and M. Hartnett, “The Investment Clock, Special Report #1: Making Money from Macro,” p. 28, 2004.
[2] N.-F. Chen, R. Roll, and S. A. Ross, “Economic forces and the stock market,” Journal of business, pp. 383–403, 1986.
[3] A. M. Adam and G. Tweneboah, “Macroeconomic factors and stock market movement: Evidence from Ghana,” Munich personal RePEc archive, 2008.
[4] T. Singh, S. Mehta, and M. Varsha, “Macroeconomic factors and stock returns: Evidence from Taiwan,” Journal of economics and international finance, vol. 3, no. 4, pp. 217–227, 2011.
[5] B. Kibria and S. Banik, “Some ridge regression estimators and their
performances,” Journal of Modern Applied Statistical Methods, vol. 15,
no. 1, p. 12, 2016.
[6] A. M. E. Saleh, M. Arashi, and B. G. Kibria, Theory of ridge regression
estimation with applications. John Wiley & Sons, 2019, vol. 285.
[7] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp.
5–32, 2001.
[8] P. F. Smith, S. Ganesh, and P. Liu, “A comparison of random forest
regression and multiple linear regression for prediction in neuroscience,”
Journal of neuroscience methods, vol. 220, no. 1, pp. 85–91, 2013.
[9] H. Abdi and L. J. Williams, “Principal component analysis,” Wiley
interdisciplinary reviews: computational statistics, vol. 2, no. 4, pp. 433–
459, 2010.
[10] A. Liaw, M. Wiener et al., “Classification and regression by randomForest,” R news, vol. 2, no. 3, pp. 18–22, 2002.
[11] U. Grömping, “Variable importance assessment in regression: linear
regression versus random forest,” The American Statistician, vol. 63,
no. 4, pp. 308–319, 2009.
[12] J. L. Speiser, M. E. Miller, J. Tooze, and E. Ip, “A comparison of random
forest variable selection methods for classification prediction modeling,”
Expert Systems with Applications, vol. 134, pp. 93–101, 2019.
[13] R. Couronné, P. Probst, and A.-L. Boulesteix, “Random forest versus logistic regression: a large-scale benchmark experiment,” BMC bioinformatics, vol. 19, no. 1, p. 270, 2018.
[14] C. Yang and L. Zhou, “Individual stock crowded trades, individual stock
investor sentiment and excess returns,” The North American Journal of
Economics and Finance, vol. 38, pp. 39–53, 2016.
[15] W. Kinlaw, M. Kritzman, and D. Turkington, “Crowded trades: Impli-
cations for sector rotation and factor timing,” The Journal of Portfolio
Management, vol. 45, no. 5, pp. 46–57, 2019.
