Kutay Kaplan 22003434
Melih Kutay Yağdereli 22002795
EEE485 Fall 2023 First Report
Stock Market Price Prediction
1.0 Introduction
The stock market is one of the most active markets in trading today. Research indicates that, as of June 30, 2023, the total market capitalization of the United States stock markets alone was 46,199,811.4 million dollars [1]. It is clear that investors need accurate predictions of the stock market's future movements in order to maximize their gains and minimize their losses. Machine learning algorithms may base such predictions on a variety of factors, such as historical stock market movements, news, social media events, or the market movements of investors with substantial holdings. It is therefore appealing to develop a machine learning system that forecasts the future value of a stock. Starting from this idea, we used four different algorithms for prediction: K-Nearest Neighbors (KNN), autoregressive integrated moving average (ARIMA), linear regression, and Long Short-Term Memory (LSTM). KNN and ARIMA were implemented by Kutay Kaplan, and linear regression and LSTM were implemented by Melih Kutay Yağdereli. The implementation was done in Python 3 via Jupyter Notebook, using the NSE-TATAGLOBAL11.csv dataset [2].
2.0 Dataset Description
Market prices are influenced by various factors, including historical stock movements, news, social media events, and the actions of major shareholders. However, addressing all of these variables simultaneously is challenging without leveraging machine learning libraries in Python. Consequently, our project focuses specifically on using past stock movements to predict future prices. To achieve this, we employed datasets containing information about the historical performance of stocks, which can be obtained from different markets. As an illustration, we use the NSE-TATAGLOBAL11.csv dataset, available from https://www.quandl.com/ [2]. This dataset contains multiple variables that capture past stock movements:
• Opening price of the stock on a particular day.
• Closing price of the stock on a particular day.
• Highest price of the stock on a particular day.
• Lowest price of the stock on a particular day.
• Last price of the stock on a particular day.
• Number of shares bought or sold in a day (Total trade quantity).
• Turnover of the stock on a particular day.
To use the dataset, we need to define a target column for each method. While extracting the variables, we should keep in mind that there is no data for weekends and some public holidays. Since daily profit or loss is generally calculated from closing prices, we use the closing price as the target variable in order to understand how the price graph moves over time.
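As a minimal sketch (assuming the column names listed above), the target column can be extracted as follows:
import pandas as pd

df = pd.read_csv('NSE-TATAGLOBAL11.csv')
y = df['Close'].values  # closing prices as the target variable
# Weekends and public holidays simply do not appear as rows, so the data is
# treated as an ordered sequence of trading days rather than calendar days.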
3.0 Methods and Algorithms
3.1 K-Nearest Neighbors (KNN)
K-nearest neighbors (KNN) is a widely used machine learning algorithm for classification and regression tasks; in our project it is used for regression. KNN is an instance-based learning algorithm, meaning it makes predictions from the entire stored training data rather than from a model fit to parts of the data. The algorithm has two key components: a distance metric (here, the Euclidean distance) and the number of neighbors k. The Euclidean distance measures the similarity between data instances, while the k nearest neighbors are used for making predictions. KNN works in two phases, training and prediction. In the training phase, the algorithm simply memorizes the entire training dataset. In the prediction phase, for a new data point, the algorithm calculates the distances to all points in the training set, identifies the k nearest neighbors, and averages their target values. As with all algorithms, KNN has pros and cons. Its pros include being simple to implement, having essentially no training time since it only memorizes the data, and being effective for small to medium-sized datasets. On the other hand, it is computationally expensive during the prediction phase, especially with large datasets, is sensitive to irrelevant features, and is strongly affected by the choice of k and the distance metric. KNN may be used for different purposes such as recommender systems, pattern recognition, anomaly detection, and regression tasks. To sum up, although KNN is a simple algorithm, its effectiveness depends on the nature of the data and the choice of parameters; in some cases its performance is surpassed by more complex algorithms, especially in high-dimensional spaces.
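As a small toy sketch (with made-up one-dimensional data, not the stock dataset), KNN regression works as follows:
import numpy as np

# Predict a query point as the mean of the targets of its k nearest
# training points (k = 2 here).
X_train = np.array([1.0, 2.0, 3.0, 10.0])
y_train = np.array([1.1, 1.9, 3.2, 9.8])
query, k = 2.5, 2
distances = np.abs(X_train - query)  # Euclidean distance in one dimension
nearest = np.argsort(distances)[:k]  # indices of the k closest points
print(np.mean(y_train[nearest]))     # (1.9 + 3.2) / 2 = 2.55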
In this project, the k-nearest neighbors algorithm is used as a regression algorithm to predict the future values of stock prices. The implementation can be explained in six steps.
1. Loading the data: First, the NSE-TATAGLOBAL11.csv dataset was loaded into a Pandas DataFrame.
df = pd.read_csv('NSE-TATAGLOBAL11.csv')
2. Data extraction: The 'Close' column is selected as the target variable (y), and the training (X_train) and test (X_test) data are extracted from the dataset.
y = df['Close'].values
X_train = df['Close'].iloc[:987].values.reshape(-1, 1)
X_test = df['Close'].iloc[-248:].values.reshape(-1, 1)
3. Standardization: The data is standardized to a mean of 0 and a standard deviation of 1, using the training set's mean and standard deviation for both sets so that no test information leaks into the model.
mean_train = np.mean(X_train)
std_train = np.std(X_train)
X_train = (X_train - mean_train) / std_train
X_test = (X_test - mean_train) / std_train
4. Distance Calculation and Prediction Function: The distance function computes the Euclidean distance between points, which measures the similarity between data instances. The prediction function implements KNN regression: for each point in the test data (X_test), the Euclidean distances to all points in the training data (X_train) are calculated, and the target values of the k nearest neighbors are averaged.
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))

def knn_predict(X_train, y_train, X_test, k=5):
    predictions = []
    for x_test in X_test:
        distances = [euclidean_distance(x_test, x_train) for x_train in X_train]
        nearest_indices = np.argsort(distances)[:k]
        nearest_labels = y_train[nearest_indices]
        prediction = np.mean(nearest_labels)
        predictions.append(prediction)
    return np.array(predictions)
5. Prediction and Visualization: The predictions array is created and the predicted values are plotted alongside the actual values.
6. Evaluation: Mean Squared Error (MSE) and R-squared were calculated to assess the performance of the algorithm. In our case:
Mean Squared Error: 25.482993266998825
R-squared: 77.12%
Figure 1: Prediction plot of KNN
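For reference, a minimal sketch of how these two metrics are computed (assuming aligned NumPy arrays y_true and y_pred; the full computation is in the Appendix):
import numpy as np

def mse(y_true, y_pred):
    # average squared deviation between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    ss_residual = np.sum((y_true - y_pred) ** 2)        # unexplained variation
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation
    return 1 - ss_residual / ss_total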
3.2 Autoregressive integrated moving average (ARIMA)
ARIMA is a time series forecasting algorithm that combines autoregression, differencing, and moving averages to predict future points of a series. ARIMA models are used in finance, economics, and the environmental sciences. They are useful for time series forecasting, but they assume linear relationships between data points, so they may be unsuccessful on complex, nonlinear data. In general, an ARIMA model has three components, represented by the orders p, d, and q.
● Autoregressive (AR) term (p): the AR term captures the relation between the current observation and past observations; the model regresses on its own past values. p is the number of past observations included in the model.
● Integrated (I) term (d): the I term differences the raw data to make the time series stationary, so that its mean and variance become stable over time (illustrated in the sketch after this list). d is the number of differencing passes applied.
● Moving Average (MA) term (q): the MA term captures the relation between the current observation and the residual errors of a moving average model applied to past observations. q is the number of past residuals included.
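As a small illustration of the differencing step (with made-up numbers), first differences remove a linear trend while second differences remove a quadratic one:
import numpy as np

series = np.array([10.0, 12.0, 15.0, 19.0, 24.0])  # accelerating upward trend
print(np.diff(series, n=1))  # [2. 3. 4. 5.] -> still trending
print(np.diff(series, n=2))  # [1. 1. 1.]    -> constant, i.e., stationary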
In this project, the ARIMA algorithm was built in the following steps, after the dataset was loaded and the columns extracted in the same way as for the KNN algorithm.
1. Parameter Selection: A function was designed to search for the best combination of the parameters p, d, and q.
def arima(train_data, test_data):
    best_mse = float('inf')
    best_order = None
    for p in range(3):
        for d in range(3):
            for q in range(3):
                order = (p, d, q)
                predictions = arima_forecast(train_data, order, len(test_data))
                mse = np.mean((test_data - predictions) ** 2)
                if mse < best_mse:
                    best_mse = mse
                    best_order = order
    predictions = arima_forecast(train_data, best_order, len(test_data))
    return predictions, best_order
2. Forecasting: In this part, a function was implemented that takes the training data, the selected (p, d, q) order, and the number of forecast steps as input and returns the predictions as output.
def arima_forecast(data, order, steps):
    p, d, q = order
    history = list(data)
    predictions = []
    for t in range(steps):
        model = history[-p:]              # window of the last p observations
        model_diff = np.diff(model, n=d)  # d-th order differencing
        forecast = history[-1] + np.sum(model_diff)
        predictions.append(forecast)
        history.append(forecast)          # roll the forecast forward
    return np.array(predictions)
3. Evaluation and Plotting: In this part, the MSE and the Mean Absolute Percentage Error (MAPE) were calculated in order to measure how well the predicted data match the test data. Finally, the predictions were plotted together with the training and test data. In our case:
Mean Squared Error (MSE): 589.8486373113335,
Mean Absolute Percentage Error (MAPE): 14.13%,
Best ARIMA Order: (1, 2, 1).
Figure 2: Prediction plot of ARIMA
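As a quick illustration of MAPE (with made-up numbers), each absolute error is scaled by the corresponding actual value before averaging:
import numpy as np

actual = np.array([100.0, 110.0])
predicted = np.array([90.0, 121.0])
# |100 - 90| / 100 = 0.10 and |110 - 121| / 110 = 0.10, so MAPE = 10%
print(np.mean(np.abs((actual - predicted) / actual)) * 100)  # 10.0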
3.3 Linear Regression
The linear regression model is one of the most basic and most commonly used prediction algorithms in machine learning. It is based on a linear relation between dependent and independent variables, where the independent variables are the features we have for the prediction and the dependent variable is the target we want to predict [3]. In our case, the closing prices of the stock are the dependent variable and the dates are the independent variable. To find the best parameters of the linear equation, we take the least squares approach used in class and minimize the residual sum of squares, which gives the estimates beta1 = sum((x_i - x_mean) * (y_i - y_mean)) / sum((x_i - x_mean)^2) and beta0 = y_mean - beta1 * x_mean.
1. We define our dependent (Y) values and independent (X) values in the data. For the X variable we only have date information. In the final report we may give different weights to important dates, but for now we use the dates only as indexes.
data = pd.read_csv('NSE-TATAGLOBAL11.csv')
# Use 'Close' prices as the target variable (dependent variable)
y = data['Close'].values.reshape(-1, 1)
y = np.flip(y, 0)  # flip so the rows are in chronological order
X = np.arange(len(y)).reshape(-1, 1)
2. We define our linear regression function as simple linear regression, since for now it contains only one predictor, and implement it for the single-variable case as studied in the lectures.
def linear_reg(X, Y):
    X = np.ravel(X)  # work with 1-D arrays so the sums broadcast correctly
    Y = np.ravel(Y)
    y_mean = np.mean(Y)  # mean response
    x_mean = np.mean(X)  # mean predictor
    # closed-form least squares estimates for the single-variable case
    beta1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1
3. This simple function gives us the beta0 and beta1 we need, and we can create our predictions from these coefficients, as shown in the sketch below.
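A minimal usage sketch (with X and y as defined in step 1, and the first 987 rows as the training window, matching the other methods; the full script is in the Appendix):
beta0, beta1 = linear_reg(X[:987], y[:987])  # fit on the training window
y_pred = beta0 + beta1 * X                   # extrapolate over the full range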
As seen in the figures, changing the training set can greatly decrease the RMS error. Using the full data as the training set gives an RMS of 133.89 on the validation set, while starting the training data from index 800 gives an RMS of 49.30. The preliminary results therefore show that using more recent data gives better results for our dataset.
3.4 Long Short-Term Memory (LSTM)
This method will be implemented in the final stage of the project.
References
[1] “Total Market Value of the U.S. Stock Market,” Siblis Research. https://siblisresearch.com/data/us-stock-market-value/ [Accessed: 13.10.2023].
[2] “Nasdaq Data Link.” https://www.quandl.com/ [Accessed: 13.10.2023].
[3] “Linear Regression,” IBM Documentation. https://www.ibm.com/docs/en/ias?topic=procedures-linear-regression/ [Accessed: 13.10.2023].
Appendix
Code for KNN
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('NSE-TATAGLOBAL11.csv')
y = df['Close'].values
# Select the first 987 data points as training data
X_train = df['Close'].iloc[:987].values.reshape(-1, 1)
# Select the last 248 data points as the test data
X_test = df['Close'].iloc[-248:].values.reshape(-1, 1)

# Standardize the data using the training statistics
mean_train = np.mean(X_train)
std_train = np.std(X_train)
X_train = (X_train - mean_train) / std_train
X_test = (X_test - mean_train) / std_train

# Define the Euclidean distance function
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))

# Implement the KNN algorithm
def knn_predict(X_train, y_train, X_test, k=5):
    predictions = []
    for x_test in X_test:
        distances = [euclidean_distance(x_test, x_train) for x_train in X_train]
        nearest_indices = np.argsort(distances)[:k]
        nearest_labels = y_train[nearest_indices]
        prediction = np.mean(nearest_labels)
        predictions.append(prediction)
    return np.array(predictions)

predictions = knn_predict(X_train, y[:987], X_test, k=5)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(y, label='Actual Close Price (All)', color='gray', linestyle='--')
plt.plot(range(987, 1235), y[987:1235], label='Actual Close Price (Last 248)', color='blue')
plt.plot(range(987, 1235), predictions, label='Predicted Close Price (Last 248)', color='red')
plt.legend()
plt.show()

# Calculate and print the Mean Squared Error
mse = np.mean((y[987:1235] - predictions)**2)
print(f'Mean Squared Error: {mse}')

# Calculate and print R-squared
ss_total = np.sum((y[987:1235] - np.mean(y[987:1235]))**2)
ss_residual = np.sum((y[987:1235] - predictions)**2)
r_squared = 1 - (ss_residual / ss_total)
print(f'R-squared: {r_squared * 100:.2f}%')
Code for ARIMA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset and extract the columns in the same way as for KNN
df = pd.read_csv('NSE-TATAGLOBAL11.csv')
close_values = df['Close'].values
train_data = close_values[:987]
test_data = close_values[-248:]

def arima(train_data, test_data):
    best_mse = float('inf')
    best_order = None
    for p in range(3):
        for d in range(3):
            for q in range(3):
                order = (p, d, q)
                predictions = arima_forecast(train_data, order, len(test_data))
                mse = np.mean((test_data - predictions) ** 2)
                if mse < best_mse:
                    best_mse = mse
                    best_order = order
    # Re-run the forecast with the best parameters
    predictions = arima_forecast(train_data, best_order, len(test_data))
    return predictions, best_order

# ARIMA Forecast Function
def arima_forecast(data, order, steps):
    p, d, q = order
    history = list(data)
    predictions = []
    for t in range(steps):
        model = history[-p:]              # window of the last p observations
        model_diff = np.diff(model, n=d)  # d-th order differencing
        forecast = history[-1] + np.sum(model_diff)
        predictions.append(forecast)
        history.append(forecast)          # roll the forecast forward
    return np.array(predictions)

# Calculate Mean Absolute Percentage Error (MAPE)
def calculate_mape(actual, predicted):
    return np.mean(np.abs((actual - predicted) / actual)) * 100

# Make predictions
predictions, best_order = arima(train_data, test_data)

# Calculate Mean Squared Error (MSE)
mse = np.mean((test_data - predictions) ** 2)
print(f"Mean Squared Error (MSE): {mse}")

# Calculate Mean Absolute Percentage Error (MAPE)
mape = calculate_mape(test_data, predictions)
print(f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%")
print(f"Best ARIMA Order: {best_order}")

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(close_values, label='Actual Close Data', color='blue')
plt.plot(np.arange(987, 1235), predictions, label='Predicted Close Data', color='red')
plt.title('ARIMA Prediction of Stock Prices')
plt.xlabel('Time')
plt.ylabel('Close Price')
plt.legend()
plt.show()
Code for Simple Linear Regression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

def linear_reg(X, Y):
    X = np.ravel(X)  # work with 1-D arrays so the sums broadcast correctly
    Y = np.ravel(Y)
    y_mean = np.mean(Y)  # mean response
    x_mean = np.mean(X)  # mean predictor
    # least squares estimates from the lectures, computed from the means
    beta1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# basic error measure used in lectures, square-rooted to match units (RMSE)
def Least_Means_Squares_Error(yReal, yPredict, x):
    error = 0
    for i in range(len(x)):
        error += ((yReal[i] - yPredict[i]) ** 2) / float(len(x))
    return math.sqrt(float(error))

DataFrame = pd.read_csv('NSE-TATAGLOBAL11.csv')
y = DataFrame.Close.values.reshape(-1, 1)
y = np.flip(y, 0)  # flip to get the proper date order
x = np.arange(len(y))
N = 900
y_Train = y[N:987]  # choose the oldest data to include in the training set
x_Train = np.arange(len(y_Train))
x_Train = x_Train + N  # make y and x coincide
beta0, beta1 = linear_reg(x_Train, y_Train)  # find the coefficients
y_Prediction = np.zeros(len(y))
for i in range(len(y)):
    y_Prediction[i] = beta1 * x[i] + beta0  # create the Y predictions
rms = Least_Means_Squares_Error(y[987:], y_Prediction[987:], x[987:])
plt.title('Linear Regression with training data [900-987]')
plt.xlabel("The dates as indexes")
plt.ylabel("Close prices")
plt.grid(1)
plt.plot(x, y, "black", label="all data")
plt.scatter(x_Train, y_Train, label="Training data")
plt.plot(x[N:], y_Prediction[N:], label="Prediction")
plt.plot(x[987:], y[987:], "red", label="Test Data")
plt.legend(title="rms = %.2f " % rms)
plt.show()