CO461 - Data Warehousing and Data Mining
PROJECT REPORT
Weather Forecasting of Time Series Data using ARIMA and Auto ARIMA Model
By
Niwedita (171CO227) and Shweta Hariharan Iyer (171CO245)
Final year B-Tech Computer Science and Engineering
National Institute of Technology, Karnataka
27 November 2020
Abstract
Weather forecasting has become increasingly necessary because of changing and
unpredictable weather conditions. Weather predictions are used to protect life and
property, and agriculture depends heavily on temperature and precipitation
forecasts. ARIMA is a popular model for weather forecasting. It is applied to
univariate time series data to predict future values of the observations. Since
ARIMA works only on stationary data, its variant Auto ARIMA is used for
weather forecasting on non-stationary data. In this project, we implement
ARIMA and Auto ARIMA for stationary and non-stationary data respectively. We also
introduce a slight modification in the data cleaning process. We then compare the
results obtained by ARIMA and Auto ARIMA before and after the introduction of the
modification.
TABLE OF CONTENTS
I. Introduction
II. Dataset Description
III. Materials and Methods
IV. Results and Discussions
V. Conclusion
VI. References
List of Figures
1. Temperature and Dewpoint dataset
2. Import libraries and load dataset
3. Data cleaning
4. Trend and Seasonality of temperature data
5. Mean, rolling mean and standard deviation of temperature data
6. ACF and PACF plots of temperature data
7. Residual error plots of temperature data
8. ARIMA model results of temperature data
9. Actual versus forecast plot of temperature testing data
10. Rolling mean and standard deviation of dewpoint dataset
11. Plot of Actual vs ARIMA forecast values of temperature data using Forward Fill
Data Cleaning process
12. Plot of Actual vs Auto ARIMA forecast values of dewpoint data using Forward
Fill Data Cleaning process
13. Plot of Actual vs ARIMA forecast values of temperature data using Mean of the
Observations Data Cleaning process
14. Plot of Actual vs Auto ARIMA forecast values of dewpoint data using Mean of
the Observations Data Cleaning process
List of Tables
1. Comparison of MSE of temperature and dewpoint data with Forward Fill and
Mean of the Observations data cleaning methods
I. Introduction
Weather forecasting is performed on time series data. A time series is a sequence in which a
metric is recorded at regular time intervals. Such series are used to forecast future values of
quantities such as temperature, rainfall, humidity or budgets.
For prediction we use one of the most popular models for time series, the
Autoregressive Integrated Moving Average (ARIMA) model, which is a standard statistical model
for time series analysis and forecasting. An ARIMA model can be understood by outlining each
of its components as follows:
● Autoregression (AR) - refers to a model that shows a changing variable that regresses
on its own lagged, or prior, values. The notation AR(p) indicates an autoregressive
model of order p.
$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \varepsilon_t$$
● Integrated (I) - represents the differencing of raw observations to allow for the time
series to become stationary, i.e., data values are replaced by the difference between
the data values and the previous values.
● Moving average (MA) - incorporates the dependency between an observation and a
residual error from a moving average model applied to lagged observations. The
notation MA(q) refers to the moving average model of order q.
$$Y_t = \alpha + \varepsilon_t + \phi_1 \varepsilon_{t-1} + \phi_2 \varepsilon_{t-2} + \dots + \phi_q \varepsilon_{t-q}$$
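Equation of the ARIMA model, which combines the AR and MA models [2]. Here Y_t denotes the
series after it has been differenced d times:
$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \varepsilon_t + \phi_1 \varepsilon_{t-1} + \phi_2 \varepsilon_{t-2} + \dots + \phi_q \varepsilon_{t-q}$$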
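To make the combined equation concrete, the following is a minimal sketch of a one-step prediction computed directly from it. The function name `arma_one_step` and all coefficient values are made up for illustration; they are not taken from the fitted model in this report. The unknown current shock ε_t is set to its expected value of zero.

```python
import numpy as np

def arma_one_step(alpha, beta, phi, y_hist, eps_hist):
    """Predict Y_t from the ARMA equation with the unknown shock eps_t set to 0.

    beta: AR coefficients (beta_1..beta_p); phi: MA coefficients (phi_1..phi_q).
    y_hist / eps_hist: past values and past residuals, most recent last.
    """
    beta = np.asarray(beta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    # Reverse the histories so index 0 holds lag 1, index 1 holds lag 2, ...
    y_lags = np.asarray(y_hist, dtype=float)[::-1][: beta.size]
    eps_lags = np.asarray(eps_hist, dtype=float)[::-1][: phi.size]
    return float(alpha + beta @ y_lags + phi @ eps_lags)

# Hypothetical ARMA(2, 1) coefficients and history.
pred = arma_one_step(alpha=1.0, beta=[0.5, 0.2], phi=[0.3],
                     y_hist=[20.0, 22.0], eps_hist=[0.4, -1.0])
print(pred)  # 1.0 + 0.5*22 + 0.2*20 + 0.3*(-1.0), approximately 15.7
```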
Before using the ARIMA model, we need to check whether the dataset is stationary. A stationary
series satisfies the following conditions:
● Constant mean
● Constant variance
● An autocovariance that does not depend on time
If the mean and variance are constant and the test statistic of a stationarity test (such as the
Augmented Dickey-Fuller test) is less than the critical values, the time series is already
stationary, so the 'd' value in the ARIMA model is 0.
If the series is non-stationary, either of the following techniques can be used to make it
stationary:
● Decomposing
● Differencing [1]
Auto ARIMA is a variant of ARIMA that is particularly useful for non-stationary datasets. Auto
ARIMA removes the need to difference the data manually and to compute the p, d and q values of
ARIMA. Forecasting is done directly by fitting the Auto ARIMA model on the univariate time
series data.
The rest of this report is organized as follows. Section II describes the dataset. Section
III explains the methodology involved in weather forecasting. Section IV presents the
results and discusses the modification. Finally, Section V concludes the report.
II. Dataset Description
A time series weather dataset is used to implement the ARIMA model of forecasting. The
dataset contains weather data for New Delhi, India from 1996 to 2017. It includes several
attributes such as temperature, dewpoint, humidity and wind direction. We apply the ARIMA
model to univariate time series from the New Delhi weather dataset. A univariate time series
consists of single observations of one variable recorded sequentially over equal time
increments. Here, the ARIMA model of weather forecasting is applied to the temperature data,
and the Auto ARIMA model is applied to the dewpoint data.
Some data values from the temperature and dewpoint datasets are shown below.
Fig1: Temperature and Dewpoint dataset
III. Materials and Methods
The following process is used to implement the ARIMA model of weather forecasting on
temperature data of New Delhi.
1. Import the statsmodels and pmdarima Python modules for loading the ARIMA and Auto ARIMA
models. Import the numpy, pandas, matplotlib and seaborn Python libraries for the
implementation, and load the temperature dataset for ARIMA forecasting.
Fig2: Import libraries and load dataset
2. Perform data cleaning on the dataset to fill the missing values.
Fig3: Data cleaning
3. Plot the data to check the trend and seasonality of the time series weather data.
Fig4: Seasonality of the data
4. Plot the actual mean, rolling mean and standard deviation of the training data to check if
the data is stationary. This is required because the ARIMA model works only on
stationary data. For forecasting non-stationary data, the Auto ARIMA model is used.
Fig5: Mean, rolling mean and standard deviation of data
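The rolling statistics in this step can be computed with pandas. The sketch below uses synthetic daily temperatures and an assumed 30-day window; the report does not state the window length it used, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2016-01-01", periods=365, freq="D")
# Hypothetical daily temperatures: noise around a fixed level (a stationary series).
temp = pd.Series(25 + rng.normal(scale=2.0, size=idx.size), index=idx)

window = 30  # assumed window length, not taken from the report
rolling_mean = temp.rolling(window).mean()
rolling_std = temp.rolling(window).std()

# For a stationary series both rolling statistics stay close to the overall values.
print("overall mean:", round(temp.mean(), 2))
print("rolling mean range:", round(rolling_mean.min(), 2), "to", round(rolling_mean.max(), 2))
print("rolling std range:", round(rolling_std.min(), 2), "to", round(rolling_std.max(), 2))
```

A rolling mean or standard deviation that drifts well away from the overall value over time is a sign of non-stationarity.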
5. For stationary data, plot ACF and PACF plots to get parameter values for the ARIMA
model.
Fig6: ACF and PACF plots
The grey dotted lines are confidence bounds, which are used to find the values of p and q.
p - the lag at which the PACF first crosses the upper confidence bound. In our case, p = 2.
q - the lag at which the ACF first crosses the upper confidence bound. In our case, q = 2.
d - the number of nonseasonal differences needed for stationarity. In this case it is 0, since
this series is already stationary.
6. Apply ARIMA model on the stationary training data and perform model fitting. Plot the
model's residual errors.
Fig7: Residual error plots
7. Perform weather forecasting using the above model on the testing data. Plot the actual
and forecast weather values of testing data.
Fig8: ARIMA model results
The 'coef' column in the ARIMA model results summary gives the coefficients of AR and MA
models. They are then combined to form the equation for the ARIMA model. This ARIMA
equation is then used to forecast weather values.
Fig9: Actual versus forecast plot
8. To numerically compute the accuracy of the model, calculate the Mean Squared
Error (MSE) between the actual and forecast weather values of the testing data.
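For n test points with actual values y_i and forecasts ŷ_i, the MSE is (1/n) Σ (y_i - ŷ_i)². A minimal sketch of the computation follows; the function name and the sample values are made up for illustration and are not from the New Delhi dataset.

```python
import numpy as np

def mean_squared_error(actual, forecast):
    """Average of squared differences between actual and forecast values."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    if actual.shape != forecast.shape:
        raise ValueError("actual and forecast must have the same shape")
    return float(np.mean((actual - forecast) ** 2))

# Made-up values for illustration only.
actual = [30.0, 32.0, 31.0, 29.0]
forecast = [29.0, 33.0, 31.0, 27.0]
mse = mean_squared_error(actual, forecast)
print(mse)  # (1 + 1 + 0 + 4) / 4 = 1.5
```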
Non-Stationary Data:
For non-stationary data, the ARIMA model as applied above will not give accurate results, so
the Auto ARIMA model is used instead. Auto ARIMA removes the need to make the data
stationary and to compute the parameter values for the ARIMA model by hand. Here, the time
series data of dewpoint from the New Delhi dataset is non-stationary: it does not have a
constant rolling mean and standard deviation. For such data, the Auto ARIMA model is applied
using the same procedure as above, except that the ACF and PACF plots need not be plotted,
since Auto ARIMA selects the parameters itself.
Fig10 : Rolling mean and standard deviation of dewpoint dataset.
We have implemented ARIMA and Auto ARIMA on stationary temperature data and
non-stationary dewpoint data respectively. In the implementation, data cleaning is performed
using Forward Fill. The Forward Fill method propagates the last valid observation forward to
fill the missing values.
We propose a modification to this data cleaning process: the mean of all the observations is
used to fill the missing values. This modification is applied to both the stationary
temperature data and the non-stationary dewpoint data.
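The two data cleaning methods can be compared side by side with pandas. The readings below are made up for illustration; `ffill` and `fillna` are standard pandas methods, but the report does not show the exact calls it used.

```python
import numpy as np
import pandas as pd

# Made-up readings with gaps (NaN), for illustration only.
temp = pd.Series([30.0, np.nan, np.nan, 24.0, 26.0, np.nan])

forward_filled = temp.ffill()            # repeat the last valid observation
mean_filled = temp.fillna(temp.mean())   # mean() skips NaN, so this is the mean of observed values

print(forward_filled.tolist())  # [30.0, 30.0, 30.0, 24.0, 26.0, 26.0]
print(mean_filled.round(2).tolist())
```

Forward Fill keeps each gap at the most recent local level, while mean filling pulls every gap towards the overall average of the series, regardless of where in the series the gap occurs.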
IV. Results and Discussions
The weather forecasting experiment is conducted on two variables from the New Delhi time
series weather dataset: temperature data, which is stationary, and dewpoint data, which is
non-stationary. The ARIMA model of weather forecasting is used for the temperature data and
the Auto ARIMA model is used for the dewpoint data. Mean Squared Error (MSE) is computed to
evaluate each model's performance. Further, the data cleaning process is modified to fill the
missing data values using the mean of the observations. The following table summarises the
results obtained. It shows the MSE obtained from applying ARIMA and Auto ARIMA to the
temperature and dewpoint data respectively, first with Forward Fill and then with the mean of
the observations as the data cleaning method for filling missing values.
Data cleaning method        Temperature data (Stationary), MSE    Dewpoint data (Non-Stationary), MSE
Forward Fill                9.645860513873451                     2.245820662235592
Mean of the Observations    9.785734936454212                     2.1717549429236898
Table1: Comparison of MSE of temperature and dewpoint data with Forward Fill and Mean of the
Observations data cleaning methods
The following graphs are plots of Actual vs Forecast Values of temperature and dewpoint using
ARIMA and Auto ARIMA. The first two graphs show the ARIMA and Auto ARIMA
implementation using Forward Fill Data Cleaning process to fill the missing values. The next
two graphs show the ARIMA and Auto ARIMA implementation using Mean of Observations
Data Cleaning process to fill the missing values.
Fig11 : Plot of Actual vs ARIMA forecast values of temperature data using Forward Fill Data Cleaning
process
Fig12 : Plot of Actual vs Auto ARIMA forecast values of dewpoint data using Forward Fill Data Cleaning
process
Fig13 : Plot of Actual vs ARIMA forecast values of temperature data using Mean of the Observations
Data Cleaning process
Fig14 : Plot of Actual vs Auto ARIMA forecast values of dewpoint data using Mean of the Observations
Data Cleaning process
The above table and graphs show that for the stationary temperature data, Forward Fill data
cleaning with ARIMA forecasting gives a lower MSE (9.65) than filling missing values with the
mean of the observations (9.79), making it the better approach here. For the non-stationary
dewpoint data, filling with the mean of the observations with Auto ARIMA forecasting gives a
lower MSE (2.17) than Forward Fill (2.25), making it the better approach for that series.
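To put the size of these differences in perspective, the relative change in MSE between the two cleaning methods can be worked out from the values in Table 1. The short sketch below does only this arithmetic; the dictionary layout and the helper name are illustrative.

```python
# MSE values copied from Table 1.
mse = {
    "temperature (ARIMA)": {"forward_fill": 9.645860513873451, "mean": 9.785734936454212},
    "dewpoint (Auto ARIMA)": {"forward_fill": 2.245820662235592, "mean": 2.1717549429236898},
}

def relative_change(row):
    """Relative change in MSE when switching from Forward Fill to mean filling."""
    return (row["mean"] - row["forward_fill"]) / row["forward_fill"]

changes = {name: relative_change(row) for name, row in mse.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.2%}")
# temperature (ARIMA): +1.45%
# dewpoint (Auto ARIMA): -3.30%
```

Both changes are a few percent and come from a single train/test run, so they indicate a direction rather than a large effect.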
V. Conclusion
We have implemented the ARIMA model of weather forecasting on New Delhi's weather
dataset. The ARIMA model is applied to New Delhi's temperature data, as the data is
stationary. The Auto ARIMA model is applied to New Delhi's dewpoint data, as the data is
non-stationary. The implementation uses the Forward Fill method in the data cleaning process
to fill missing values. We proposed a modification to the data cleaning process: filling
missing values using the mean of the observations. ARIMA and Auto ARIMA are then applied to
the modified temperature and dewpoint data. Mean Squared Error (MSE) is used to evaluate each
model's performance for a given data cleaning method.
We observe that for the stationary temperature data, Forward Fill data cleaning with ARIMA
forecasting gives a lower MSE than filling missing values with the mean of the observations.
For the non-stationary dewpoint data, filling with the mean of the observations with Auto
ARIMA forecasting gives a lower MSE than Forward Fill.
VI. References
1) Krishna, G.V., 2015. An integrated approach for weather forecasting based on data mining and
forecasting analysis. International Journal of Computer Applications, 120(11).
2) Saikhu, A., Arifin, A.Z. and Fatichah, C., 2017, October. Rainfall forecasting by using
autoregressive integrated moving average, single input and multi input transfer function. In 2017
11th International Conference on Information & Communication Technology and System (ICTS)
(pp. 85-90). IEEE.
3) Yang, Y., Lin, H., Guo, Z. and Jiang, J., 2007. A data mining approach for heavy rainfall
forecasting based on satellite image sequence analysis. Computers & Geosciences, 33(1),
pp.20-30.