INDIAN STATISTICAL INSTITUTE
Project Report
Analysis of UK Road Accident Data
(Jan 1969 - Dec 1984)
Authors: Arijit Dutta, Kaustav Nandy
Supervisor: Dr. Swagata Nandi
October 20, 2010
Abstract
In this project we carry out a time series analysis of monthly data on the number of deaths and serious injuries in road accidents during the period January 1969 to December 1984. We follow two approaches. In the first approach, we carry out a time series analysis using the classical decomposition model. In the second, we fit an appropriate SARIMA model without any separate trend analysis.
1 Introduction:
We are given monthly data on the number of deaths and serious injuries in road accidents during the
period January 1969 to December 1984. The seatbelt law was introduced in February 1983. Naturally, our interest lies in studying the effectiveness
of the seatbelt law. A simple trend analysis should reveal whether there is any significant
change in the number of deaths and injuries. Along with studying the long-term pattern, we
will also examine how the number of deaths and injuries varies through the year.
2 Preliminary Study:
2.1 Plotting of Data:
First we plot the monthly data to see the general nature of the data:
Figure 1: Plot of the raw data
Comments:
• There is a clear seasonal pattern in the data.
• There is an increasing trend till the year 1973, after which the level falls a little.
• Till the year 1983, the series shows more or less the same seasonal pattern.
• During the year 1983, there is a sudden fall in the number of deaths and injuries.
• This sudden fall may be due to the seatbelt law.
Clearly the pattern of the data varies throughout. Let us look at the data divided
into three parts; the patterns will then be clearer.
Figure 2: Plot of the data divided into three parts
Comments:
• From Fig. 2, the conclusions already drawn become clearer.
• Since we wish to comment on the fall in the number of deaths after the seatbelt
law was introduced, we carry out a separate trend analysis. So we consider an additive
classical decomposition model.
3 Classical Decomposition Model:
3.1 Brief Description of the Model:
Let Y_t be the number of deaths at time point t. Decompose Y_t as follows:
Y_t = T_t + S_t + C_t + e_t
where,
T_t = value of the trend component at time t
S_t = value of the seasonal component at time t
C_t = value of the cyclical component at time t
e_t = error at time t
Trend Analysis: The trend is estimated by the moving-average method, but for that
we first have to determine the period of seasonality. From Fig. 2, we see that the period of
seasonality is more or less one year. So we take a 12-month (centred) moving average to estimate the
trend values.
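The centred 12-month moving average can be sketched as follows (a minimal Python illustration, not the code actually used for the report):

```python
def centered_ma12(y):
    """Centred 12-month moving average: since the period is even, the
    endpoints of the 13-term window get half weight, so the window is
    symmetric about each month."""
    trend = []
    for t in range(6, len(y) - 6):
        s = 0.5 * y[t - 6] + sum(y[t - 5:t + 6]) + 0.5 * y[t + 6]
        trend.append(s / 12)
    return trend
```

A purely linear series passes through this filter unchanged, while a fixed 12-month seasonal pattern averages out, which is why it isolates the trend.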
Figure 3: Plot of the Moving Average Values
Comments:
• From the plot we see an increasing trend till the second half of 1973.
• Then there is a decreasing trend till 1975.
• The level remains stable till 1983.
• It drops suddenly in 1983.
• This is due to the seatbelt rule.
• Overall there is a decreasing trend in the data.
Seasonal Component: Having estimated the trend by the moving average, it is interesting
to look at the seasonal component. Let us examine the nature of the seasonal indices
through a plot.
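The seasonal indices of the additive model are obtained by averaging the detrended values month by month. A sketch, assuming the trend series starts at the seventh observation (as it does for a centred 12-term moving average):

```python
def seasonal_indices(y, trend, period=12):
    """Additive seasonal indices: average the detrended values for each
    calendar month, then centre them so they sum to zero."""
    offset = period // 2            # trend starts at t = period/2
    buckets = [[] for _ in range(period)]
    for i, tr in enumerate(trend):
        t = i + offset
        buckets[t % period].append(y[t] - tr)
    raw = [sum(b) / len(b) for b in buckets]
    mean = sum(raw) / period
    return [r - mean for r in raw]  # centred indices
```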
Figure 4: Plot of the Seasonal Indices
Comments:
• The seasonal indices are low during the first nine months.
• They grow rapidly during the last three months.
• They reach their maximum in the month of December.
Random Component: Let us refer to the cyclical and error components together as the
random component. Now look at the plot of the random component to see whether
it is at all random.
Figure 5: Plot of the Random Component
Comments:
• The random component contains too many high-frequency components.
• So it is difficult to believe that the random component is really random.
Now, without analysing the random component separately, we will try to fit a general
SARIMA model. It will take care of all the deterministic parts of the data.
4 Fitting of SARIMA model
4.1 Preliminary Study from the Plots:
ACF plot: First consider the ACF plot of the data to see whether the data is at all
stationary and, if so, what kind of model suits it.
Figure 6: ACF plot of the raw data
Comments:
• Since the ACF values do not decay at all, there is clear non-stationarity in the data.
• First we should take one difference and see whether the data becomes stationary.
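The sample ACF underlying these plots can be written out directly; a minimal Python sketch (the report's plots were produced in R):

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations rho_hat(1), ..., rho_hat(max_lag),
    using the usual biased autocovariance estimator (divide by n)."""
    n = len(y)
    mean = sum(y) / n
    c0 = sum((v - mean) ** 2 for v in y) / n
    acf = []
    for k in range(1, max_lag + 1):
        ck = sum((y[t] - mean) * (y[t + k] - mean) for t in range(n - k)) / n
        acf.append(ck / c0)
    return acf
```

A trending series gives lag-1 autocorrelation near 1, while an alternating series gives one near -1, matching the non-decaying ACF seen for the raw data.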
ACF Plot of Differenced Data: Let us take one difference at lag one. It will remove
any linear trend in the data. The ACF looks like the following:
Figure 7: ACF plot of the differenced data
Comments:
• From the plot it is clear that the data is stationary in the non-seasonal sense.
• But there is some seasonal non-stationarity in the data.
• So we should take one seasonal difference.
• Obviously the period of seasonality is 12 months.
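The two differencing steps can be sketched as:

```python
def difference(y, lag=1):
    """Apply the lag-d difference operator (1 - B^d) to a series."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

def doubly_difference(y, period=12):
    """(1 - B)(1 - B^period) y: a lag-1 difference to remove the trend,
    then a seasonal difference to remove the periodic pattern."""
    return difference(difference(y, 1), period)
```

Applied to a series with a linear trend plus a fixed 12-month seasonal pattern, the doubly differenced series is identically zero, which is exactly why both kinds of non-stationarity disappear.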
ACF Plot of the Doubly Differenced Data: Following is the ACF plot of the doubly
differenced data.
Figure 8: ACF plot of the doubly differenced data
Comments:
• From the ACF plot it is legitimate to assume that the data is now stationary.
• The correlation at lag 1 is fairly large.
• But it falls off rapidly even at lag 2.
• So a non-seasonal moving-average term of order 1 may be appropriate here.
• Also the correlation at lag 12 is large.
• This is due to the seasonal pattern in the data.
• Notice that the correlations at lags 11 and 13 are significant.
• So there may also be a seasonal MA term of order 1.
• A SARIMA model with appropriate parameters will be suitable here.
4.2 SARIMA Model:
From the plots and the conclusions drawn from them, we may go for a SARIMA model with
parameters (0, 1, 1) × (0, 1, 1)_12. The fitted model is:
(1 − B)(1 − B^12)Y_t = (1 − 0.6029471B)(1 − 0.9050338B^12)Z_t
where,
Z_t ∼ WN(0, 18068)
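Rearranging the fitted equation gives a recursion for Y_t, which can be used, for instance, to simulate from the fitted model. A hypothetical sketch (the start-up values of zero are an arbitrary choice for illustration):

```python
import random

def simulate_sarima(n, theta=0.6029471, Theta=0.9050338,
                    sigma2=18068, seed=1):
    """Simulate from (1-B)(1-B^12) Y_t = (1-theta*B)(1-Theta*B^12) Z_t,
    i.e. Y_t = Y_{t-1} + Y_{t-12} - Y_{t-13}
               + Z_t - theta*Z_{t-1} - Theta*Z_{t-12} + theta*Theta*Z_{t-13}."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    z = [rng.gauss(0.0, sd) for _ in range(n + 13)]
    y = [0.0] * 13                      # arbitrary start-up values
    for t in range(13, n + 13):
        w = (z[t] - theta * z[t - 1]
             - Theta * z[t - 12] + theta * Theta * z[t - 13])
        y.append(y[t - 1] + y[t - 12] - y[t - 13] + w)
    return y[13:]
```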
Comments:
• Notice that the Z_t's have a very high variance.
• Also, there may be some effect due to the seatbelt rule. So we should eliminate its
effect (if any) and then carry out the SARIMA analysis.
4.3 Removing Effect of Seatbelt Rule:
Consider the model:
Y_t = γc_t + A_t
where,
• c_t takes the value 1 or 0 according as the seatbelt rule is in force or not.
• γ = coefficient determining the effect of the seatbelt rule.
• A_t = unexplained part (we will fit a time series model to it).
Fitted Model: The fitted model comes out to be:
Y_t = −319.2297c_t + A_t
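In R, such a γ is estimated jointly with the time-series part (e.g. via arima with an external regressor). As a crude stand-alone illustration of what the step dummy c_t measures (not the estimator actually used), a difference-of-means sketch:

```python
def step_effect(y, change_point):
    """Crude estimate of a level-shift coefficient gamma:
    mean of the series after the intervention minus the mean before it.
    (Ignores trend and serial correlation; illustration only.)"""
    before = y[:change_point]
    after = y[change_point:]
    return sum(after) / len(after) - sum(before) / len(before)
```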
SARIMA Model: Now we fit a SARIMA model to the A_t values, and it comes out to be:
(1 − B)(1 − B^12)A_t = (1 − 0.6833B)(1 − 0.8721B^12)Z_t
where,
Z_t ∼ N(0, 17423)
Comments:
• Here also the variance of Z_t is quite high.
• We should also test the randomness of the residuals.
• We will check it with the Ljung-Box test.
• The assumption of normality is verified by a normal Q-Q plot.
4.4 Test of Randomness:
First we plot the residuals to inspect them for randomness.
Figure 9: Plot of the Residual
Clearly it is hard to conclude anything from this plot, so we should apply a formal test
of randomness. We will use the Ljung-Box test. Our purpose is to test:
H0 : The data is random
H1 : The data is not random.
The Ljung-Box test statistic is:
Q = n(n + 2) Σ_{k=1}^{h} ρ̂²_k / (n − k)
where,
n = sample size
ρ̂_k = autocorrelation at lag k
h = number of lags to be tested
At level of significance α, we reject the hypothesis of randomness if
Q > χ²_{1−α,h}
where χ²_{1−α,h} is the (1 − α) quantile of the chi-square distribution with h d.f. The following
table shows the p-value for different values of h.
Table 1: p-values for different h
h p-value h p-value
1 0.1958407 3 0.2363558
6 0.13894704 8 0.10180052
11 0.21072616 14 0.18139466
25 0.07653932 35 0.11781835
45 0.26980223 55 0.28486153
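The statistic and its p-value can be computed directly. A self-contained sketch (the chi-square tail probability is evaluated by the incomplete-gamma series here rather than a library call; this is an illustration, not the R routine used):

```python
import math

def chi2_sf(x, df):
    """Survival function P(X > x) of the chi-square distribution,
    via the regularized lower incomplete gamma series (x > 0)."""
    a, half = df / 2.0, x / 2.0
    term = math.exp(-half + a * math.log(half) - math.lgamma(a + 1))
    total, n = term, 0
    while term > 1e-15 * total:
        n += 1
        term *= half / (a + n)     # next series term
        total += term
    return 1.0 - total

def ljung_box(acf, n, h):
    """Ljung-Box Q over the first h sample autocorrelations
    (acf[k] is rho_hat at lag k+1) and its p-value."""
    q = n * (n + 2) * sum(acf[k] ** 2 / (n - (k + 1)) for k in range(h))
    return q, chi2_sf(q, h)
```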
Comments:
• For every lag the p-value is more than 0.05, so the residuals can be considered
random.
• But it is not clear from this table how the p-value behaves as the number
of lags considered increases.
Let us consider the plot of the p-value as a function of h.
Figure 10: Plot of p-value
Comments:
• The p-value increases as h increases.
• So the residuals show very little dependence at higher lags.
• The overall conclusion is that the residuals appear to be random.
Now we examine the Q-Q plot to check the normality of the residuals graphically.
Figure 11: Q-Q Plot of the Residual
Comment:
• From the Q-Q plot we can conclude that the residuals are approximately normal.
4.5 Prediction:
Using the innovations algorithm, we have predicted the values for the last two years. Instead of
giving point forecasts alone, 95% prediction intervals are given here.
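A generic sketch of such an interval: with ψ-weights ψ_0, ψ_1, … of the fitted model, the h-step forecast error variance is σ² Σ_{j<h} ψ_j², and the 95% interval is the point forecast ± 1.96 standard errors, assuming approximately normal errors (the function and its arguments are illustrative, not the report's code):

```python
def forecast_interval(point, psi, sigma2, z=1.96):
    """95% interval for an h-step forecast whose error variance is
    sigma2 * sum(psi_j^2 for j = 0..h-1); psi holds those h weights."""
    se = (sigma2 * sum(p * p for p in psi)) ** 0.5
    return point - z * se, point + z * se
```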
Figure 12: Prediction of Time Series Values
5 Fourier Series Analysis:
After doubly differencing the data, we try to estimate the spectral density function.
For that we need the periodogram. For the given data the periodogram takes the
following form:
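The raw periodogram at the Fourier frequencies can be sketched as follows (a plain-DFT illustration, not the R routine used to draw the figure):

```python
import cmath
import math

def periodogram(y):
    """Raw periodogram I(w_j) = |sum_t (y_t - ybar) e^{-i w_j t}|^2 / n
    at the Fourier frequencies w_j = 2*pi*j/n, j = 1..n//2."""
    n = len(y)
    mean = sum(y) / n
    out = []
    for j in range(1, n // 2 + 1):
        w = 2 * math.pi * j / n
        s = sum((y[t] - mean) * cmath.exp(-1j * w * t) for t in range(n))
        out.append((w, abs(s) ** 2 / n))
    return out
```

A pure cosine at a Fourier frequency puts all its power into a single ordinate, which is how sharp peaks in Figure 13 point to periodic components.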
Figure 13: Periodogram
Comments:
• The periodogram here is plotted over (0, 2π) rather than (−π, π).
• This is the raw periodogram, which is not a very good estimator of the spectral density
function.
• So we will consider the periodogram smoothed with the Daniell kernel and the
modified Daniell kernel.
The periodograms smoothed with these two techniques are as follows:
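Both kernels are simple moving averages over neighbouring periodogram ordinates; the modified Daniell kernel gives the two endpoints half weight. A sketch (end effects are handled here by clamping indices, which is an assumption made for this illustration; R's spec.pgram treats the ends differently):

```python
def daniell_smooth(pgram, m, modified=False):
    """Smooth periodogram ordinates with a (modified) Daniell kernel of
    half-width m: an average over 2m+1 neighbours, the endpoints getting
    half weight in the modified version."""
    weights = [1.0] * (2 * m + 1)
    if modified:
        weights[0] = weights[-1] = 0.5
    wsum = sum(weights)
    out = []
    for j in range(len(pgram)):
        acc = 0.0
        for k in range(-m, m + 1):
            idx = min(max(j + k, 0), len(pgram) - 1)  # clamp at the ends
            acc += weights[k + m] * pgram[idx]
        out.append(acc / wsum)
    return out
```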
Figure 14: Smoothed Periodograms
6 Conclusion:
After studying and analysing the data, we can give an overall conclusion as follows:
• The data contains a trend component along with a seasonal one.
• Initially the data was non-stationary in both the seasonal and non-seasonal sense. We
can remove both by taking differences.
• There is a significant effect of the seatbelt rule: it reduces the number of road
accidents significantly.
• After removing all these effects, we see that a SARIMA (0, 1, 1) × (0, 1, 1)_12 model is
appropriate.
7 Acknowledgement:
• Dr. Swagata Nandi, ISI Delhi.
• Computer Systems:
– GNU/Linux as the operating system
– C++ as the programming language
– R as the interpreting language
– LaTeX as the document preparation system
• Reference:
– Introduction to Time Series and Forecasting, P.J. Brockwell and R.A. Davis