
PREDICTING CRYPTOCURRENCY RETURNS

USING CLASSIFICATION AND REGRESSION


MACHINE LEARNING MODELS

Amal Alshehri

A dissertation submitted to
The School of Computing Sciences of the University of East Anglia
in partial fulfilment of the requirements for the degree of
MASTER OF SCIENCE.
AUGUST, 2022

© This dissertation has been supplied on condition that anyone who consults it is
understood to recognise that its copyright rests with the author and that no
quotation from the dissertation, nor any information derived therefrom, may be
published without the author’s or the supervisor’s prior consent.
SUPERVISOR(S), MARKERS/CHECKER AND ORGANISER

The undersigned hereby certify that the markers have independently


marked the dissertation entitled “Predicting Cryptocurrency Returns
Using Classification and Regression Machine Learning Models”
by Amal Alshehri, and the external examiner has checked the marking, in
accordance with the marking criteria and the requirements for the degree of
Master of Science.

Supervisor:
Dr. Antony Jackson

Markers:
Marker 1: Dr. Beatriz De La Iglesia

Marker 2: Dr. xxxxxx

External Examiner:
Checker/Moderator

Moderator:
Dr. Wenjia Wang

DISSERTATION INFORMATION AND STATEMENT

Dissertation Submission Date: August, 2022

Student: Amal Alshehri

Title: Predicting Cryptocurrency Returns Using


Classification and Regression Machine Learning
Models

School: Computing Sciences

Course: Computing Science

Degree: MSc.

Duration: 2021--2022

Organiser: Dr. Wenjia Wang

STATEMENT:
Unless otherwise noted or referenced in the text, the work described in
this dissertation is, to the best of my knowledge and belief, my own work. It has
not been submitted, either in whole or in part, for any degree at this or any other
academic or professional institution.
Subject to confidentiality restriction if stated, permission is herewith
granted to the University of East Anglia to circulate and to have copied for
non-commercial purposes, at its discretion, the above title upon the request of
individuals or institutions.

Signature of Student

Abstract

People are starting to see the cryptocurrency market as a viable source of income and investment, similar to the stock market, as the concept of cryptocurrencies continues to gain popularity. Additionally, several projects use tokens or coins built using blockchain technology. Furthermore, it is well established that Bitcoin dominates the cryptocurrency market, which raises the question of whether Bitcoin is predictable. Machine learning may forecast cryptocurrencies more accurately than established analytic methods such as technical analysis. Predicting bitcoin returns is a problem of financial machine learning, which uses time series to forecast price variance. This study starts with the daily close price of bitcoin as its initial dataset. The price is transformed into percentages (daily price percent change) and into binary classes, “Up” and “Down,” after which a time series transformation is applied to produce two datasets: a categorical dataset for classification and a numerical dataset for regression. For classification, which represents binary classification in asset-price forecasting, k-fold cross-validation is applied to ensure that the best classifiers are selected for testing and analysis. Most of the regression analysis was based on visualisation, which displayed each regressor’s predicted prices against the original values and helped analyse the models’ results more accurately. The outcomes of this dissertation were achieved by anticipating bitcoin returns using classification and regression machine learning models, despite the approaches’ low accuracy and a precision rate skewed toward the “Up” class. At this stage, given significant limitations regarding the dataset and the lack of other indicators, a model capable of predicting future variations is a beneficial addition for many trading tools, or even for crypto market analysts.

Keywords — Bitcoin predictability, Time-series cross validation, Binary classification in asset-price forecasting, Financial machine learning

Acknowledgements

I would like to thank my supervisor, Dr. Antony Jackson, for his many suggestions
and constant support during this research. I would also like to express my gratitude
to Jazan University and the Government of the Kingdom of Saudi Arabia for fully sponsoring my course.
Finally, I am very grateful to my mother, the most remarkable woman, for her uncon-
ditional love, unlimited support and patience full of warmth and contentment.

Amal Alshehri

Norwich, UK.

Table of Contents

Abstract iv

Acknowledgements v

Table of Contents vi

List of Tables viii

List of Figures ix

List of Abbreviations xi

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Cryptocurrencies and their price movement . . . . . . . . . . . 2
1.1.2 Effect of Bitcoin variation on Crypto Market . . . . . . . . . . . 4
1.1.3 Machine Learning in relation to Cryptocurrencies . . . . . . . . 5
1.1.4 Machine learning’s role in forecasting cryptocurrency returns . 11
1.2 Aim and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Significance of the study . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Structure of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Literature Review 15
2.1 Basic Financial Theories . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Efficient Market Hypothesis . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Random Walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 Behavioral Financial Theory . . . . . . . . . . . . . . . . . . . . 18
2.2 Machine Learning in Cryptocurrency Returns Prediction . . . . . . . . 19
2.2.1 Supervised Machine Learning . . . . . . . . . . . . . . . . . . . 19
2.2.2 Ensemble Machine Learning . . . . . . . . . . . . . . . . . . . . 21
2.2.3 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Methodology Design 26
3.1 Design of Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Evaluation Methods and Measures . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Classification Evaluation Methods . . . . . . . . . . . . . . . . . 28
3.2.2 Regression Evaluation Methods . . . . . . . . . . . . . . . . . . 30
3.3 Tools and Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4 Results and Analysis 36


4.1 Prediction Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.1 Classification prediction results . . . . . . . . . . . . . . . . . . 36
4.1.2 Regression prediction results . . . . . . . . . . . . . . . . . . . 48
4.2 Comparing and merging Classification and Regression models . . . . . 51
4.3 Ethical, Social, and Legal Issues . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Limitations and Problems . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5 Conclusion 56
5.1 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1.1 Methodology Discussion . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3 Suggestion for Further Work . . . . . . . . . . . . . . . . . . . . . . . . 59

References 60

A Dataset overview, coding parts and other plots 67

List of Tables

4.1 Evaluation metrics comparison for classification models . . . . . . . . . 48

List of Figures

1.1 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1 Decision Tree structure (Safavian and Landgrebe, 1991) . . . . . . . . . 20


2.2 Random Forest structure (Gaur et al., 2022) . . . . . . . . . . . . . . . 23
2.3 LSTM cell structure (Chniti et al., 2017) . . . . . . . . . . . . . . . . . 25

3.1 Machine learning development process . . . . . . . . . . . . . . . . . . 27


3.2 Confusion Matrix (Tharwat, 2020) . . . . . . . . . . . . . . . . . . . . 28
3.3 Initial collected dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Updated Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Variation column classes distribution . . . . . . . . . . . . . . . . . . . 33

4.1 Box Plots for classifiers accuracy range comparison . . . . . . . . . . . 37


4.2 Bar Plot for classifiers max accuracy comparison . . . . . . . . . . . . . 38
4.3 SVM confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 SVM classification report . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5 SVM precision-recall curve . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6 SVM ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.7 XGBoost confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.8 XGBoost Classification report . . . . . . . . . . . . . . . . . . . . . . . 42
4.9 XGBoost Precision-Recall curve . . . . . . . . . . . . . . . . . . . . . . 43
4.10 XGBoost ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.11 Random Forest confusion matrix . . . . . . . . . . . . . . . . . . . . . 44
4.12 Random Forest classification report . . . . . . . . . . . . . . . . . . . . 45
4.13 Random Forest precision-recall curve . . . . . . . . . . . . . . . . . . . 45
4.14 Random Forest ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.15 ANN Classification Report . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.16 ANN Loss plot in function of epoch . . . . . . . . . . . . . . . . . . . . 47
4.17 ANN Accuracy plot in function of epoch . . . . . . . . . . . . . . . . . 47
4.18 Comparison between original and Predicted values for Linear Regression 49
4.19 Comparison between original and predicted values for XGB Regressor . 49

4.20 Comparison between original and predicted values for Random Forest
Regressor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.21 Comparison between original and predicted values for LSTM . . . . . . 51
4.22 Explaining the created voting system that merges multiple classification
and regression models into one predictor . . . . . . . . . . . . . . . . . 52

A.1 Candlestick plot for Bitcoin price variation during the collected 8 years
of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.2 Bitcoin daily price percentage changes line plot . . . . . . . . . . . . . 67
A.3 Testing and training sets distribution . . . . . . . . . . . . . . . . . . . 68
A.4 K-fold cross validation coding part . . . . . . . . . . . . . . . . . . . . 68
A.5 RandomForest hyperparameters tuning using GridSearch coding part . 68
A.6 ANN model architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.7 Overview on the classes distribution in ANN model (high precision rate
for class ‘1’) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.8 Predicting next 10 days price variations using XGBoost Regressor . . . 69
A.9 Predicting next 10 days price variations using RandomForest Regressor 70

Chapter 1

Introduction

As a significant player in the global financial landscape, cryptocurrencies have gained the interest of regulators, governmental organisations, institutional and individual investors, researchers, and the general public. Cryptocurrencies are a new form of money that is sweeping the financial industry and catching the attention of industry pioneers. They are a type of virtual currency intended for use online. The fact that cryptocurrencies are traded for profit and that their prices are rising has made them a hot topic. Furthermore, they are built on a peer-to-peer, decentralised network (Lansky, 2018). The first and best-known cryptocurrency is Bitcoin, introduced in 2008 by Nakamoto (2008). The first linear hash chain, sometimes known as a blockchain, was proposed by Haber and Stornetta (1990): a method for verifying the creation or most recent modification date of a digital document, in which the data itself was not time-stamped in order to protect the material’s privacy. To address spam mail, Dwork and Naor (1992) developed a proof-of-work mechanism: each email would carry a header containing virtual postage computed as a single calculation, and this postage stamp showed that a small amount of CPU time had been spent before sending the email. The term “hash cash” was coined by Back et al. (2002) to describe the cost of computing each hash. Investors and industry professionals alike are concerned with making accurate predictions about cryptocurrency market values, because prices are rising quickly and respond in ways similar to stock market price changes. Although this market exhibits volatility comparable to other stock markets, it continues to reflect investor confidence (Grinberg, 2012; Mai et al., 2018; Sapuric and Kokkinaki, 2014).

1.1 Background

1.1.1 Cryptocurrencies and their price movement

Even though prices are quite volatile, and it is hard to identify the crucial factors and quantify the extent to which they affect the price, researchers continue working on currency price forecasting in an effort to understand the digital financial landscape. There is still no conclusive theory explaining how to price cryptocurrencies (Polasik et al., 2015). The variables affecting cryptocurrency prices fall into several categories, including internal variables such as supply and demand as well as extrinsic variables such as attractiveness and legality (Poyser, 2017; Buchholz et al., 2012). One study has shown that cryptocurrency prices are positively and statistically significantly correlated with computing power and network adoption; the study may also serve as a tool for identifying additional factors, such as regulatory and political risks, affecting the returns on these digital assets (Bhambhwani et al., 2019). The volatility of cryptocurrency exchange values is mostly driven by market emotions, a lack of government regulatory oversight, and the fact that these currencies lack an intrinsic value (Iinuma, 2018). A cryptocurrency’s popularity is one of the key factors influencing its price. One of the critical causes of price instability and volatility is the ineffective integration of cryptocurrencies into the traditional currency and macroeconomic markets (Polasik et al., 2015). According to Aggarwal et al. (2019), the price of cryptocurrencies does not significantly depend on societal factors; more research should be done in this area to obtain more reliable findings.

The following list summarises some of the factors that influence the price of digital currencies:

1. Media

How the media impacts people’s lives cannot be disregarded: the media significantly influences people across a wide range of decisions. As a result, demand will rise if the media actively promotes bitcoin in its ongoing coverage. By spreading awareness of cryptocurrencies, the media can make more people interested in them and more likely to buy them. However, Philippas et al. (2019) found that media networks had a limited impact on Bitcoin values, with the impact being more significant during times of greater uncertainty. In certain circumstances, the media also serve as sources of information demand, so the effect is not of great importance.

2. Political Instability

Politics impacts the acceptance of bitcoin. When a political system is unstable and about to fall apart, demand for bitcoin increases. People give cryptocurrencies a chance and trust them over traditional currencies when they lack confidence in their nation’s economy. Using 2,875 observations from 18 July 2010 to 31 May 2018, one study concluded that the Global Geopolitical Risk Index (GPR) could predict the returns and volatility of Bitcoin. It also concluded that Bitcoin could be utilised as a tool for hedging against global geopolitical risks, because Quantile-on-Quantile (QQ) estimates showed positive effects at higher levels of both GPR and Bitcoin returns and volatility (Aysan et al., 2019).

3. Legal regulations

The presence of legal cover attracts public confidence. Regulations have a dual effect on cryptocurrencies: some regulatory requirements can discourage investors from purchasing cryptocurrencies by imposing charges on them, yet demand is anticipated to increase if the government makes it easy to buy cryptocurrencies.

4. Supply and demand

Because its supply is limited, demand is the main driver of Bitcoin’s price. In essence, Bitcoin’s value comes from the reality that scarce resources are more valuable than abundant ones. As a result, if demand falls, the price of bitcoin will also fall; conversely, the price of Bitcoin increases when there is higher demand for it.


1.1.2 Effect of Bitcoin variation on Crypto Market

With the exception of bitcoin, all cryptocurrencies are referred to as “altcoins,” a term that originates from the notion that these coins are alternatives to bitcoin. There is a plethora of altcoins available, but before any of them were made, Bitcoin was the first and most valuable asset. These alternative currencies aim to improve on Bitcoin or add advantages such as quick transactions and low energy consumption. The cryptocurrency sector is relatively new and still developing, and Bitcoin holds the distinction of being the first cryptocurrency or asset. One of the most critical leading indicators of the performance of cryptocurrencies is the performance of Bitcoin. Dominance is an essential component: Bitcoin’s market capitalisation rules the cryptocurrency industry, so Bitcoin prices significantly influence the markets. Böhme et al. (2015) stated that “Bitcoin’s rules were designed by engineers with no apparent influence from lawyers or regulators”; this idea gave bitcoin momentum and made it the king of the scene in cryptocurrencies. As a result, an entirely new, unrestricted, and uncontrolled industry has been made possible. Altcoins do not have the same level of widespread acceptance as Bitcoin. They also carry most of Bitcoin’s traits, because they were developed using Bitcoin’s source code as a starting point and then improved upon (Hobson, 2013). Prices may differ somewhat between Bitcoin and its clones, but they often do not. Numerous scholars are interested in investigating the connection between Bitcoin and the altcoin markets. After analysing the daily data of 16 alternative currencies in addition to Bitcoin between 2013 and 2016, Ciaian et al. (2018) concluded that there is in fact a significant correlation between the prices of Bitcoin and other alternative currencies, with the correlation being more evident in the short term. The study found that, compared to bitcoin prices, macro-finance factors have a longer-term effect on altcoin pricing. In contrast to stock markets, cryptocurrencies have a strong connection to one another. It is not advised to create a portfolio using Bitcoin and other alternative currencies, since the value of these currencies is significantly impacted by the price of Bitcoin (S Kumar and Ajaz, 2019). The price of other encrypted digital currencies is significantly influenced by Bitcoin, the first and most important symbol for the whole cryptocurrency industry. Various factors contribute to this, including that the most well-known alternative currencies are just enhanced or expanded versions of Bitcoin. The liquidity of Bitcoin also contributes to its continued dominance, and bitcoin is still regarded as the safest cryptocurrency asset. It is impossible to accept or refute the claim that Bitcoin has a 100% impact on the rates of other alternative currencies. In other words, because there are many alternative cryptocurrencies and the financial sector is intricate, a rise in bitcoin prices is not always matched by the same percentage rise across all alternative cryptocurrency rates. The expected drop or rise may not occur because certain factors seem to cancel out the effects of others. Bitcoin and other cryptocurrencies are, nonetheless, closely connected.

1.1.3 Machine Learning in relation to Cryptocurrencies

Machine Learning

Machine learning is a massive field of computing algorithms that aim to simulate intelligence by learning from available data. The concept of artificial intelligence overlaps with machine learning, and the boundary between them is not sharp, as machine learning is an essential part of artificial intelligence and a requirement for it (Alpaydin, 2016). Machine learning is an application or subset of artificial intelligence (AI) which enables computers to learn from data without being explicitly programmed. AI is a larger idea that aims to build intelligent machines that can replicate human thinking capabilities and behaviour, whereas ML is concerned with the development of systems that can learn from data. A machine learning algorithm aims to discover a hidden pattern or function that will help predict some unknown variable; it assumes the existence of a function capable of mapping inputs to outputs. Using mathematical models to “train” or “learn” from a given collection of data so that they can make decisions is the main goal of machine learning. These algorithms start with input data and perform several sorts of statistical analysis before going on to prediction; as they do so, they predict the outcome while also adjusting parameters that help increase the accuracy of the prediction. This differs from typical machines and processes, which consist only of static parts and follow a conventional input/output paradigm in which machines are fed input and return output; a machine learning system instead adapts to external signals and can predict the future from uncertain and fragmentary information, much as humans do. Machine learning techniques have been employed effectively in various industries, including pattern recognition, computer vision, aerospace engineering, finance, entertainment, computational biology, and biological and medical applications. Machine learning also plays a broad role in business and finance, where it can be used in complicated applications such as price variation prediction, risk assessment, and fraud detection. It is used to train and build models that depend on the provided data and that make predictions for various cases; the best known are the forecasting models used to forecast future values. The two most widely used machine learning algorithms for forecasting price or percentage movement are Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs), each with its own learning patterns, which will be discussed in the following parts.

Deep Learning

While it may sound trendy, deep learning is simply a term used to describe certain types of neural networks and related algorithms that often consume very raw input data, processing it through many layers of nonlinear transformations in order to calculate the target output. Feature extraction is when an algorithm can automatically derive or construct meaningful features of the data to be used for further learning, generalisation, and understanding. In most other machine learning approaches, the feature extraction process, along with feature selection and engineering, is traditionally the responsibility of the data scientist or programmer. Feature extraction usually involves dimension reduction, reducing the number of input features and the data required to generate accurate results. Significant benefits result from this, including simplification and a decrease in computational and memory requirements, among many others. More broadly, deep learning falls under the techniques referred to as feature learning or representation learning. As already mentioned, feature extraction is a technique used in machine learning to “learn” which features to emphasise and employ; to create the best-performing model, the machine learning algorithms “learn” the best settings as they go along. Deep learning has been used successfully in many applications and is currently regarded as one of the most cutting-edge machine learning and AI methods. Its algorithms are frequently used for supervised, unsupervised, and semi-supervised learning problems.

Supervised Learning

Supervised learning, commonly referred to as supervised machine learning, is a subtype of machine learning: a paradigm for learning a system’s input-output relationship from a set of paired input-output training samples. An input-output training sample is also known as labelled training data or supervised data, because the output is regarded as the label of the input data, or the supervision. Occasionally, it is also referred to as Learning with a Teacher (Haykin, 2009), Learning from Labelled Data, or Inductive Machine Learning (Kotsiantis et al., 2007). The goal of supervised learning is to develop an artificial system that can learn the mapping between input and output and predict the system’s output given new inputs. If the output takes a defined set of discrete values that indicate the class labels of the input, the learned mapping leads to classification of the input data; if the output takes continuous values, the task is regression. Information about input-output associations is frequently represented by learning-model parameters, and when these parameters are not directly available from training samples, a learning system must perform an estimation process to obtain them. Training data for supervised learning require supervised or labelled information, whereas training data for unsupervised learning are not labelled.

Artificial Neural Networks (ANNs)

Artificial neural networks (ANNs) are statistical models partially modelled on, and directly influenced by, biological neural networks. They can concurrently process and simulate nonlinear interactions between inputs and outcomes. The related algorithms are part of the broader field of machine learning and can be used in many applications, as discussed previously. Artificial neural networks contain adaptive weights along the paths between neurons that can be tuned by a learning algorithm which learns from observed data in order to improve the model. In addition to the learning algorithm, one must choose an appropriate cost function (Castrounis, 2016).

An artificial neural network is divided into three parts:

• a layer of input neurons (nodes, units);

• one or two hidden layers of neurons (or even three);

• a layer of output neurons at the end.

A typical architecture links the neurons with weighted connections; each link carries a numerical weight. The neuron (i) of the hidden layer produces the output (h_i) given by equation (1.1.1) below:

\[ h_i = \sigma\left( \sum_{j=1}^{N} V_{ij} x_j + T_i^{hid} \right) \tag{1.1.1} \]

where \(\sigma\) is called the activation (or transfer) function, \(N\) is the number of input neurons, \(V_{ij}\) are the weights, \(x_j\) are the inputs to the input neurons, and \(T_i^{hid}\) are the threshold terms of the hidden neurons. The purpose of the activation function is, besides introducing nonlinearity into the neural network, to bound the value of the neuron so that divergent neurons do not paralyse the neural network. A typical example of the activation function is the sigmoid (or logistic) function (Wang, 2003), defined as:

\[ \sigma(u) = \frac{1}{1 + \exp(-u)} \tag{1.1.2} \]

Other possible activation functions are the arc tangent and the hyperbolic tangent. They respond to the inputs similarly to the sigmoid function but differ in their output ranges. A neural network built in the manner described above has been demonstrated to be capable of approximating any computable function to arbitrary precision. Numbers given to the input neurons are independent variables, and those returned by the output neurons are dependent variables of the function approximated by the neural network. Binary data (such as yes or no) or even symbols (such as green or red) can be used as inputs to and outputs from neural networks when the data is appropriately encoded; this property gives neural networks a broad spectrum of use (Wang, 2003).
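To make equations (1.1.1) and (1.1.2) concrete, here is a minimal Python/NumPy sketch of one hidden-layer forward pass; the layer sizes, weights, and inputs are arbitrary assumptions for illustration, not values from this study:

```python
import numpy as np

def sigmoid(u):
    # Equation (1.1.2): bounds each neuron's output to (0, 1)
    return 1.0 / (1.0 + np.exp(-u))

# Assumed toy dimensions: N = 3 input neurons, 4 hidden neurons
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))      # weights V_ij on the input->hidden links
T = rng.normal(size=4)           # threshold terms T_i^hid
x = np.array([0.5, -1.2, 0.3])   # inputs to the input neurons

# Equation (1.1.1): h_i = sigma(sum_j V_ij * x_j + T_i^hid)
h = sigmoid(V @ x + T)
print(h)  # four bounded hidden-layer activations
```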

Classification Models

Classification models are a subset of supervised machine learning. A classification model reads some given input and produces an output that categorises the input. Decision trees and logistic regression are two classification techniques that produce a probability score indicating the likelihood that the input belongs to a specific category. The probability is then mapped to a binary label if the categorisation is binary (exists or does not exist, Yes or No, Cat or Dog); with multi-class categorical classification, the output instead belongs to one of a larger set of expected categories.

Regression Models

Regression is a statistical technique used in finance, investing, and other disciplines to establish the nature and strength of the relationship between a single dependent variable (typically denoted by Y) and one or more independent (explanatory) variables. In Figure 1.1 below, the blue line labelled “Line of Regression” represents linear regression, sometimes referred to as simple regression or ordinary least squares (OLS), which is the most widely used form of this method. Linear regression establishes the linear relationship between two variables based on a line of best fit, using the slope of a straight line to show how changing one variable affects another.

Figure 1.1: Simple Linear Regression

In a linear regression relationship, the y-intercept represents the value of one variable when the other is zero. There are also non-linear regression models, but they are far more complicated.

Regression analysis is a valuable method for discovering relationships between the variables observed in the data, even though causation cannot be demonstrated with certainty. It has several uses in business, finance, and economics. It helps investment managers, for example, value assets and understand the relationships between factors such as commodity prices and the stocks of businesses that trade in those commodities. Additionally, it is crucial to comprehend the difference between regression as a statistical technique and regression to the mean. It is worth mentioning that professionals in various industries, such as finance and investments, can benefit from regression: it may help a firm forecast sales based on outside variables such as the weather, previous sales, and GDP growth. The capital asset pricing model (CAPM) is a popular regression model in finance for valuing assets and computing the cost of capital.
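As a minimal sketch of the line of best fit described above (assuming synthetic data, not this study’s dataset), scikit-learn’s OLS implementation recovers the slope and y-intercept directly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed noisy linear relationship: y = 2x + 1 + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))    # single independent variable
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=1.5, size=100)

model = LinearRegression().fit(X, y)     # ordinary least squares fit
print(model.coef_[0])        # slope: how a change in X affects y
print(model.intercept_)      # y-intercept: value of y when X is zero
print(model.predict([[5.0]]))  # prediction for a new observation
```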

Forecasting Models

Forecasting models are tried-and-true frameworks that make forecasting outcomes in the business and marketing sectors simpler. Among the many forecasting models are time series models, econometric models, and judgmental forecasting. Although forecasting takes many forms, most forms are connected to regression and well-known regression models. As previously indicated, there are several formats for forecasting models. Fully cognisant of past occurrences, the time series model employs statistical data to generate predictions for the future. An econometric model can predict economic variables such as sales, demand, supply, and price. When there is a paucity of historical data, a new product is entering the market, or there are new competitors, judgmental forecasting is used. Forecasting outcomes with certainty is impossible in the current climate of uncertainty, increased competition, and erratic client loyalty; although forecasting models cannot wholly alleviate this complexity, they offer long-term insight. Forecasting models account for several variables, including usage frequency, the availability of pertinent data, and the forecasting technique. Forecasted elements include market share, stocks, revenue, and marketing expenses.

Time series analysis

Time series analysis is a technique for analysing a succession of data points gathered over a period of time. Rather than capturing data points inconsistently or randomly, time series analysis requires analysts to record them at regular intervals over a predetermined timeframe. However, this analysis is more than just collecting data over time. Time series data differ from other forms of data because the analysis may demonstrate how variables change over time. In other words, time is a crucial variable, since it shows how the data adjust through time towards the final results; it offers an additional source of information and a predetermined order of data dependencies. Time series analysis often calls for many data points to maintain consistency and dependability: with an extensive dataset, the sample size will be representative and the analysis can cut through noisy data. An extensive dataset also guarantees that trends or patterns are not outliers and can account for seasonal fluctuation. Time series data are mainly used for forecasting, or making predictions based on historical data.
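One common way to exploit this predetermined order of data dependencies is to turn the series into supervised-learning rows via lagged features. The sketch below (with assumed toy prices and an assumed 3-day lookback window, not this study’s settings) does this with pandas:

```python
import pandas as pd

# Assumed toy series of daily close prices (illustrative values only)
close = pd.Series([100.0, 102.0, 101.0, 104.0, 103.5, 105.0], name="close")

pct = close.pct_change() * 100   # daily price percent change
window = 3                       # assumed lookback of 3 days

# Each row: the previous `window` percent changes as features,
# the current day's percent change as the prediction target
frame = pd.concat(
    {f"lag_{k}": pct.shift(k) for k in range(1, window + 1)},
    axis=1,
)
frame["target"] = pct
frame = frame.dropna()
print(frame)
```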

1.1.4 Machine learning’s role in forecasting cryptocurrency returns

Mastering analysis is essential for making progress in buying and selling. Future value may be estimated using two methods: technical analysis and fundamental analysis. Technical analysis makes predictions about future prices using data from the market’s trading activity, such as price and trading volume. In contrast, fundamental analysis forecasts the future using data from sources outside the market, such as the economy, interest rates, and geopolitical issues. While some investors focus on fundamental considerations, others are more interested in technical factors, and some seek to combine fundamental and technical aspects. Two of the most popular algorithms for predicting price movement are Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs), each with its own learning patterns. ANNs have been widely used in securities prediction, and researchers have discussed various ANN-related topics, including parameter selection and training sets. The embedding formulation advises that one-step forecasting be regarded as supervised learning when a historical dataset is provided.

1.2 Aim and objectives

Forecasting returns for the cryptocurrency market using machine learning and deep learning is a multidisciplinary topic spanning Data Science, Computer Science, and Finance, and it places the greatest emphasis on techniques such as regression and classification. Since there is not as much data available for the cryptocurrency market as for other markets, such as the stock market, the deep learning work will focus on the ANN and a very rudimentary LSTM. This research intends to employ a variety of algorithms to estimate specific cryptocurrency percentage values as precisely as possible for a predetermined number of days in the future. The research will then focus on forecasting trading signals, such as “Sell” or “Buy”. The objectives of this study are:
1. Develop a data pipeline that starts by obtaining up-to-date, reliable data about the cryptocurrency market from the Yahoo Finance API and then cleans, processes, and converts the data into daily price changes rather than the close price.

2. Transform the resulting data into labelled data using a time series analysis approach (a sketch of objectives 1 and 2 follows this list).

3. Create and evaluate several regression models that can predict the daily price percent changes for a cryptocurrency, Bitcoin; working with percent changes rather than exact prices makes the models perform better on future data than training on the non-stationary series of exact prices.

4. Develop a collection of classification models that, in addition to forecasting the percentage change in price, classify whether the price will rise or fall over the coming few days.

5. Combine the regression and classification models to construct a trading strategy that can turn the projected outcomes into “Sell” or “Buy” trading signals, which may be the foundation for highly successful trading bots.
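A minimal sketch of objectives 1 and 2 follows. The dissertation names the Yahoo Finance API; the yfinance package used here is one common client for it and is an assumption, as are the ticker and date range:

```python
import pandas as pd
import yfinance as yf  # assumed client for the Yahoo Finance API

# Download daily Bitcoin prices (assumed ticker and date range)
data = yf.download("BTC-USD", start="2014-09-17", end="2022-08-01")
close = data["Close"].squeeze()          # daily close price as a Series

returns = close.pct_change() * 100       # daily price percent change
labels = (returns > 0).astype(int)       # 1 = "Up", 0 = "Down"

dataset = pd.DataFrame({"pct_change": returns, "variation": labels}).dropna()
print(dataset.head())
```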

1.3 Significance of the study

This research makes a scientific contribution at the academic and research levels by identifying the most appropriate algorithms and models for predicting the percentage changes in bitcoin prices. The research contribution of this thesis can be summed up as follows: it offers researchers working at the intersection of data science and finance a method for choosing the most appropriate machine learning and deep learning models for financial time series data. To predict the percentage change of the closing price for financial time series data, this method applies cross-validation with manually split folds, which improves the models’ accuracy and helps ensure that the results generalise to future data. By contrasting several sets of regression and classification approaches to discover the most efficient algorithms, this work provides a solid foundation for determining the most powerful algorithms and models for creating financial projections. This research also focuses on clearly and attractively displaying its final findings in the form of buying and selling signals, with the buy signal appearing when the returns are likely to be high.

1.4 Structure of Dissertation

There are five chapters in this dissertation: an introduction, a literature review, a methodology section, results and analysis, and a conclusion.

Introduction: This chapter provides background information on the major topics of the thesis, describes those topics in more detail, and then outlines the thesis’s goal, scope, and importance.

Literature Review: This chapter covers earlier ideas and methods used to estimate cryptocurrency returns with machine learning, and reviews earlier research on similar subjects and its findings.

Methodology Design: A defined route for the project life cycle was selected and tailored for this project using the “Machine Learning Conceptual Model” methodology, along with a section explaining the data pipeline from the initial raw data to the creation of two sets: one for the classification problem and the other for the regression problem.

Results and Analysis: This chapter highlights the outcomes of the approaches used, each of which included multiple techniques (various models), as well as the analysis and comparison of the results. Following a review of all the trained models, a group of models that produced reliable results was chosen.

Conclusion: This chapter concludes the results of this project and summarises the entire work; in addition, it discusses the project’s aim and objectives and suggests future work in light of the results of this research.
1.5 Summary

This chapter gives a quick overview of cryptocurrencies, the factors that drive their prices up and down, and several arguments for why Bitcoin is the pioneer of the cryptocurrency industry. It also summarises the most essential and tested frameworks for predicting returns in the commercial and financial sectors, explores time series analysis in more depth, and introduces other essential ideas in machine learning and deep learning, together with a brief introduction to technical and fundamental analysis techniques.


Chapter 2

Literature Review

This chapter discusses relevant previous literature. The first part presents the most famous and most researched financial theories; the second part discusses machine learning and deep learning methods used for financial forecasting.

2.1 Basic Financial Theories

2.1.1 Efficient Market Hypothesis

The idea of an efficient market has long been a cornerstone of academic finance research. Eugene Fama, a University of Chicago economist, initially proposed the efficient markets hypothesis (EMH), which maintains that financial markets are “information efficient” and that asset prices in financial markets properly reflect all information currently accessible about an asset. According to this theory, it is very difficult to predict asset prices accurately enough to “beat the market,” because asset prices respond only to new information; as a result, no amount of analysis will provide an investor with a competitive advantage over other investors. Three forms of the efficient market hypothesis are discussed, each with substantial supporting evidence. In the strong form of the theory, public, private, and historical information is reflected in stock prices immediately, and this does not allow the investor to achieve any returns or profits. In the weak form, stock prices reflect all information in past stock prices; thus, technical trading analysis is useless and cannot be relied upon alone to generate profits. The semi-strong form assumes that fundamental analysis based on historical stock prices and publicly available information will not aid in making money without access to private information. The foundation of the EMH is a collection of presumptions, the most crucial of which is that all investors are rational and have the same outlook on potential investments; it also assumes that all investors have full access to the same sources of information (Copeland et al., 2005). Rationality entails updating information accurately and, when new information is received, making decisions that are compatible with the notion of expected benefit and normatively acceptable (Barberis and Thaler, 2003). Not everybody concurs that markets operate efficiently. Academic researchers and practising professionals have traditionally disagreed on and debated the notion of the efficiency of financial markets. The best proof that markets are efficient and respond to new information accurately, according to Malkiel (2005), is that professional investors do not beat the market; Jensen (1978) further emphasised that the efficiency of financial markets is the economic theory best supported by strong empirical evidence. From a different perspective, the theory of market efficiency may be unrealistic, mainly because it assumes the investor’s rationality while ignoring the psychological component of that investor, which may be the cause of some price deviations and distortions; from here, behavioural finance theory emerged (Singh et al., 2021). In an effort to balance the various viewpoints, it can be stated that the market is ultimately efficient, since anomalies are the exception rather than the rule and are extremely unlikely to endure (Malkiel, 2003).

2.1.2 Random Walk

The stock market draws attention because its prices do not always rise in reaction to good news. Many financial and economic professionals believe that the idea of a “random walk” is useful in comprehending the broad variations in stock prices. One of the first people to take an interest in the random walk was Roberts (1959), who concluded that stock prices support the random walk hypothesis: successive price changes are independent, and any future changes to stock prices must be entirely unrelated to previous price changes. Investors are constantly looking for methods to boost their earnings, and they analyse new information as soon as it is available. Because information arrives randomly and can be obtained at any time, it is impossible to anticipate a pattern in a system where randomness reigns supreme. The efficiency of financial markets is intimately tied to the random walk hypothesis: when new information enters the market, stock prices react swiftly and rationally, fluctuating randomly around their real worth (Nayak, 2012). Bachelier (1900) demonstrated that there is no relationship between successive prices, and he illustrated this unpredictability by arguing that speculation is a fair game in which no one can guarantee profits. Kendall and Hill (1953) and Van Horne and Parker (1967) also emphasised that prices reflect the information available on the market. Proponents of this theory assert that the reason some people made profits is that they took a high risk, which could yield either a large profit or a large loss; they contend that diversifying the portfolio is the most effective way to reduce this high risk. On the other hand, numerous studies show that certain international financial markets do not behave randomly. The Indian stock market, for example, was shown by the variance ratio test not to follow a random walk between April 1996 and June 2001. According to other research, Swedish stock prices did not fluctuate randomly over the 72-year period from 1919 to 1990 (Frennberg and Hansson, 1993). Five medium-sized European developing markets (Greece, Hungary, Poland, Portugal, and Turkey) were also investigated; among these five nations, only the Turkish market appears to follow a random walk (Smith and Ryoo, 2003).

Testing the Random Walk Hypothesis:

• Serial Correlation Test: seeks to establish whether stock returns are independent of one another.

• Variance Ratio Test: predicated on the notion that the variance of a randomly generated time series grows linearly with time.

• Rescaled Range Test: aims to identify the kind of correlation between long-term returns.

• Runs Test: a non-parametric test used to determine the degree of independence among returns that cannot be detected by parametric tests.
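As a minimal illustration of the variance ratio idea above: under a random walk, the variance of q-period log returns should be about q times the variance of one-period log returns, so the ratio VR(q) should be near 1. The sketch below (assumed simulated data; it omits the finite-sample corrections and standard errors used in formal tests) computes that ratio:

```python
import numpy as np

def variance_ratio(prices, q):
    """Simple variance ratio VR(q) on log prices; ~1 under a random walk."""
    log_p = np.log(np.asarray(prices, dtype=float))
    r1 = np.diff(log_p)            # one-period returns
    rq = log_p[q:] - log_p[:-q]    # overlapping q-period returns
    return np.var(rq, ddof=1) / (q * np.var(r1, ddof=1))

# Assumed toy data: a simulated random walk, so VR(5) should be near 1
rng = np.random.default_rng(1)
walk = np.exp(np.cumsum(rng.normal(0, 0.01, size=2000)) + 4.6)
print(variance_ratio(walk, q=5))
```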


2.1.3 Behavioral Financial Theory

Behavioural finance theory studies how psychology affects the behaviour of investors and financial analysts, and how their decisions affect the markets. This idea holds that an investor may not always act logically and may commit systematic estimation mistakes that affect judgment. Behavioural finance integrates finance and other social sciences to better understand and explain what happens in financial markets. Several academic studies have examined how people behave as individuals, groups, enterprises, and markets, using a wide range of research methodologies from psychology and finance (Shiller, 2003). Behavioural finance theory holds that investors are not rational: investors do not use market information properly, which hinders them from gaining earnings and causes them to make poor purchasing and selling decisions (Ricciardi and Simon, 2000). The distinction between rational and irrational investors is not explicitly defined by classic financial theory. Because people’s behaviours are interrelated, situations where irrational investors affect rational investors result in a new pricing trend through imitation and simulation; this is known as “herd behaviour” (Hirshleifer and Hong Teoh, 2003). Shefrin (2002) discussed behavioural finance from several perspectives. First, investors make mistakes when investing because they base their decisions on past experiences; this includes so-called heuristics, which shorten the time it takes someone to decide by relying only on prior, personal experiences. Second, both substance and apparent form influence financial decisions (framing). Finally, mistakes and decision-making frameworks have an impact on financial market values. In behavioural theory, biases reflect the complexity and uniqueness of the human mind, whereas in classic theories they are faults that need to be corrected (Copur, 2015). Numerous studies have been done in this area, and the majority follow one of two primary patterns:

• Behavioural finance and established financial theories can explain unforeseen occurrences.

• Identify the beliefs and practices of financial investors that contradict traditional economic ideas.
2.2 Machine Learning in Cryptocurrency Returns Prediction

2.2.1 Supervised Machine Learning

Decision Tree

Decision trees are a non-parametric supervised learning method. Although they may be used for regression, they are most often utilised for classification. Internal nodes of a decision tree represent feature tests, while leaves represent the decisions reached after processing (Rathan et al., 2019). The decision tree’s most valuable feature is the simplicity of interpreting its predictions (Quinlan et al., 1992). A decision tree can be pruned, by computing the error rate, to make it more predictable and capable of producing more effective decisions. In addition, decision trees differ from other models, such as artificial neural networks, by providing explicit decision rules that are particularly beneficial for further investigation and analysis (Tsai and Wang, 2009). Classification trees are commonly employed as adequate exploratory tools in practical disciplines, including finance, marketing, engineering, and medicine. Classification trees do not aim to replace standard statistical methods; various other approaches, including artificial neural networks and support vector machines, can also be utilised (Maimon and Rokach, 2014). Chang et al. (2011) state that few studies have examined how well stock index movements can be predicted in terms of their direction or sign, and we believe the bitcoin market is no exception. Huang et al. (2019) discuss constructing a tree-based method to forecast daily Bitcoin returns, utilising the decision tree classification approach with 124 price-based technical indicators, and found that the model gives solid forecasts of narrow ranges of returns. Decision trees are among the most important models, as their principles include clarity, brevity, and flexibility. One of their main flaws is that not all attributes interact with one another concurrently while making decisions: a single attribute is used to split the dataset at each partitioning stage (Quinlan, 1990). The general decision tree structure is shown below:

Figure 2.1: Decision Tree structure (Safavian and Landgrebe, 1991)
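The interpretability benefit noted above can be seen directly by printing a fitted tree’s rules. This sketch uses assumed synthetic data standing in for lagged-return features, not this study’s dataset:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed synthetic binary "Up"/"Down" data with 4 features
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# max_depth limits tree growth, a simple alternative to post-hoc pruning
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned decision rules are directly human-readable
print(export_text(tree, feature_names=[f"lag_{i}" for i in range(4)]))
```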

Support Vector Machine

Using a hyperplane, the supervised learning algorithm SVM divides the data into classes. The chosen hyperplane is the one with the largest distance to the nearest points of each class, which increases the chance that the classification will be accurate. SVM employs a linear model to construct nonlinear class borders by performing a nonlinear mapping of the input vectors x into a high-dimensional feature space (Kim, 2003). The results of support vector machines depend on the choice of kernel function, which can be considered a weakness; some common kernel functions are the linear kernel, the sigmoid function, and the polynomial kernel. Due to SVM models’ complexity and reliance on the kernel function, computing SVM models takes time and incurs cost (Lee et al., 2019). For large-scale financial time-series forecasting, SVM models based on a limited training and testing sample are unsuitable (Niu et al., 2020). Further research on deep learning techniques is needed in order to make meaningful comparisons, given the efficacy and limitations of machine learning. Forecasting techniques predicated on lowering estimation error may not be sufficient for financial market practitioners to achieve their goals (Cao and Tay, 2001; Tay and Cao, 2002).
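A minimal sketch of an SVM classifier follows; the synthetic data, the RBF kernel choice, and the unshuffled split (to mimic time ordering) are all assumptions for illustration:

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Assumed synthetic up/down data; the kernel is the key choice
# discussed above (linear, polynomial, sigmoid, RBF, ...)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)

clf = SVC(kernel="rbf", C=1.0)   # RBF kernel: one common default
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))     # accuracy on the held-out tail
```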

Logistic Regression

Logistic regression is used as a classification method in finance to identify whether future returns will curve upward or downward. It is a method for fitting a regression curve, y = f(x), where y consists of binary-coded data (0, 1). The link between several independent factors and a categorical dependent variable is examined via logistic regression (Mohammed and Osman, 2021). Since logistic regression does not require a normal data distribution, it is appropriate for use in finance. The algorithm is speedy because there are few parameters; however, this simplicity is also a flaw that can cause over-fitting issues (Akyildirim et al., 2021).
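The probability-then-threshold behaviour described in the Classification Models section can be sketched as below, again on assumed synthetic features rather than this study’s data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Assumed synthetic features standing in for lagged returns
X, y = make_classification(n_samples=250, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The model outputs a probability, which is then mapped to a binary
# "Up"/"Down" label by thresholding at 0.5
proba_up = clf.predict_proba(X[:5])[:, 1]
labels = (proba_up >= 0.5).astype(int)
print(np.round(proba_up, 3), labels)
```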

Regression

The regression technique is a statistical method that evaluates the importance of, and relationship between, two variables. Regression can involve two or more related variables; it is not limited to a specific number. Regression is characterised as simple linear regression when there is just one input variable, and as multiple linear regression when there are several predictor variables (Weisberg, 2005). Regression analysis can be used in financial modelling to measure the strength of the link between variables and then predict how the relationship will behave moving forward. Using time series data, questions can be asked about both the past behaviour of the variables and their expected future behaviour; the benefit of time series regression analysis is its capacity to explain the past and forecast the future behaviour of relevant variables, so the time series’ history serves a double purpose (Ostrom, 1990). According to Campbell et al. (1998), it is customary to assume that asset returns are jointly multivariate normal and independently and identically distributed over time when employing statistical models to determine the normal return of a particular investment. On the other hand, Mills et al. (1996) investigated the effects of incorrectly estimating the regression of the market model and demonstrated that different estimation techniques produce different outcomes.

2.2.2 Ensemble Machine Learning

XGBoost

In machine learning, ensemble models are highly likely to outperform individual models. An ensemble is a machine learning model that combines several different

models into a single model. For instance, the Random Forest uses bagging to calculate

the average of several Decision Trees. Bootstrap aggregation, sometimes known as


“bagging” is selecting samples through replacement and combining them (aggregating

them) by figuring out their averages. Bagging can be substituted with boosting. By

focusing on individual model failures rather than overall forecasts, boosting turns weak learners into strong ones. Gradient boosting builds successive models on the residuals, the differences between predicted and actual results. Gradient boosted trees learn from errors as

they are made instead of simply aggregating trees.

Extreme gradient boosting, also known as XGBoost (Chen and Guestrin, 2016),

is roughly ten times quicker than regular gradient boosting due to performance traits

including parallel processing and cache awareness. XGBoost contains built-in regulari-

sation to reduce overfitting and a split-finding algorithm for optimising trees. Gradient

Boosting often surpasses bagging, and gradient boosting may be the best ensemble boosting strategy; XGBoost, its faster and more precise variant, is arguably the most important machine learning ensemble method, with unmatched outcomes. In classification problems,

this method has produced effective and noteworthy results. This method (XGBoost)

has been widely used in finance and financial forecasting. Nobre and Neves (2019) created a financial system that can provide a portfolio rate of return of about 50% using principal component analysis, the discrete wavelet transform, and XGBoost.

In order to forecast the stock market over 60-day and 90-day periods, Dey et al. (2016)

proposed a system using XGBoost as a classifier. Dey et al. (2016) concluded that

XGBoost outperformed non-ensemble algorithms, such as Support Vector Machines

(SVM) and Artificial Neural Networks (ANN).
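As a brief illustration, assuming the xgboost Python package and placeholder data, an XGBoost classifier with its built-in regularisation could be set up as follows (the hyper-parameters shown are illustrative, not those used later in this work):

```python
# A sketch (placeholder data): gradient-boosted trees with regularisation.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 21))
y = rng.integers(0, 2, size=500)

clf = XGBClassifier(
    n_estimators=200,       # boosting rounds
    max_depth=3,            # shallow trees act as weak learners
    learning_rate=0.1,      # shrinks each tree's contribution
    reg_lambda=1.0,         # built-in L2 regularisation
)
clf.fit(X, y)
print(clf.predict(X[:5]))
```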

Random Forest

The combined decisions of multiple decision trees form the foundation of the machine learning approach known as random forest (Grömping, 2009). It was developed by Breiman

(2001). In a random forest, each decision-tree classifier or regressor depends on a separately drawn random sample, with the same distribution for all trees in the forest; with a good injection of randomness, RF becomes an effective tool for prediction

(Breiman, 2001). Each decision-tree classifier or decision-tree regressor is created in the

random forest by randomly selecting both the training data and the input variables

(Géron, 2019). Compared to neural network models, which frequently fall into local optima, random forests tend to reach a good global solution, and overfitting is less likely to happen with them (Kumar and Thenmozhi, 2006). To mitigate overfitting, Random Forest trains several decision trees on various subspaces of the feature space at the expense of slightly higher bias, meaning that no single tree in the forest sees the entire training data. Partitions are created

for the data recursively. The split is carried out at each node by posing a question

regarding an attribute (Akyildirim et al., 2021). There are not many studies on the use

of random forest regression to predict cryptocurrency returns, but its use in the field of finance in general, to support better business decisions, is noticeable. In their study,

Creamer and Freund (2004) employed the random forest regression approach to assess

the risk associated with corporate governance in Latin American markets and forecast

performance. On one sample of Latin American banks and one sample of Latin Amer-

ican Depository Receipts (ADRs), they perform tenfold cross-validation trials. Their

findings from logistic regression and random forest were contrasted. Results supported

the use of random forest regression. Figure 2.2 illustrates the general concept of the Random Forest regressor:

Figure 2.2: Random Forest structure (Gaur et al., 2022)
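A minimal scikit-learn sketch of the random forest idea follows; the data and hyper-parameters are placeholders:

```python
# A sketch (placeholder data): many trees, each fitted on a bootstrap
# resample with a random feature subset, averaged for the prediction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 21))
y = rng.normal(size=500)              # e.g. daily percent change

reg = RandomForestRegressor(
    n_estimators=100,       # number of trees in the forest
    max_features="sqrt",    # random feature subset at each split
    bootstrap=True,         # each tree sees a bootstrap resample
    random_state=0,
)
reg.fit(X, y)
print(reg.predict(X[:3]))
```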


2.2.3 Deep Learning

Artificial Neural Networks (ANNs)

An artificial neural network is a computational model based on biological princi-

ples that is built from thousands of artificial neurons coupled by coefficients (weights)

that represent the neural architecture. There are three or more linked layers in an

artificial neural network. Neurons in the input layer make up the first layer. These

neurons send the final output results to the final output layer from the deeper layers.

Another set of learning rules uses backpropagation, which enables the ANN to adjust

its output results by accounting for errors. The error is used to modify the weight of

the ANN’s unit connections to account for the mismatch between the expected and ac-

tual results (Zurada, 1992). Investors and scholars have been interested in forecasting

non-stationary time-series financial data. The vast majority of studies have focused

on stock markets as a whole. Because of the financial market’s severe nonlinearity and

volatility, neural networks have become increasingly popular, particularly for time se-

ries forecasting. Support vector machines (SVM) and ANN, two models based on two

classification approaches, were examined to predict the direction of movement on the

Istanbul Stock Exchange. According to their experimental results, the ANN model

outperformed the SVM model significantly on average (Kara et al., 2011). To predict

the daily return direction of an ETF, Zhong and Enke (2017) conduct research. It

demonstrates that three distinct PCA-based logistic regression models do not perform

as well as ANN classifiers, which are PCA-based. It should be emphasised that ANN

consistently outperforms other approaches in comparison studies, according to various

research, and its use in financial forecasting is expanding.

LSTM

Long Short Term Memory, or LSTM, is a type of recurrent neural network (RNN)

used in deep learning that can learn long-term associations, especially in applications

requiring sequence prediction (Hochreiter and Schmidhuber, 1997). A memory cell hav-

ing a state maintained over time is called a “cell state”. It serves as the main functional component of an LSTM model, as shown in the figure below. Gates regulate

the flow of information into and out of the cell by adding or removing information
from it. This gating mechanism relies on two components: a pointwise multiplication operation and a sigmoid neural layer. The sigmoid layer outputs values between

0 (which denotes no data to be passed through) and 1 (which means everything to be

let through). LSTM neural networks can solve problems where earlier learning algorithms like plain RNNs fell short, effectively capturing long-term temporal relationships without an enormous optimization challenge.

This method has produced effective results in situations involving NLP categorization

and time-series forecasting. To predict the future price of Bitcoin, Karakoyun and Cibikdiken (2018) evaluated the ARIMA time series model against the LSTM and noted that the LSTM model performs better when results are compared on test accuracy.

Figure 2.3: LSTM cell structure (Chniti et al., 2017)
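As a brief illustration of the technique, a minimal Keras sketch of an LSTM regressor on a windowed series is given below; the shapes and layer sizes are assumptions for illustration only, not the architecture used later in this dissertation:

```python
# A sketch (placeholder data): an LSTM regressor over 21-step windows.
import numpy as np
from tensorflow import keras

timesteps, features = 21, 1
X = np.random.normal(size=(500, timesteps, features)).astype("float32")
y = np.random.normal(size=(500, 1)).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(timesteps, features)),
    keras.layers.LSTM(32),    # gated cell state carries long-term memory
    keras.layers.Dense(1),    # next-day percent change
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```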

2.3 Summary

This chapter summarised the key papers that were relevant to the thesis topic. In

the first part, we addressed the three most crucial financial theories: the efficient

market hypothesis, the theory of random walks, and the behavioural financial theory,

which describes how human behaviour impacts financial markets. The other section

examined studies on forecasting financial returns by looking at techniques and methods

like supervised machine learning, deep learning, and ensemble learning, focusing on some essential models, such as XGBoost and random forest, that have shown promising results in earlier research.


Chapter 3

Methodology Design

This chapter outlines the methodology designed and followed in this dissertation, starting from the Machine Learning Conceptual Model approach, then presenting the dataset collected along with the data preprocessing steps, followed by the time-series analysis, and finishing with the problem formulation.

3.1 Design of Methodology

The methodology approach followed by this project is Machine Learning Conceptual

Modelling. It is a modern approach introduced in many papers and consists of pairing

conceptual modelling with machine learning because both have long been recognized

as essential research areas (Lukyanenko et al., 2019, 2018; Maass and Storey, 2021).

The conceptual model consists of a well-determined development process that starts

with problem understanding and ends with analytical decision-making.

Taking into account the fact that machine learning problems might vary depending

on the case study included in the project, starting with the overall development process,

a tailored method was created to meet the needs of this project, as shown in figure 3.1:


Figure 3.1: Machine learning development process

The first parts are data-related: they start with data collection from a reliable source such as the Yahoo Finance API, then carry out the required pre-processing, cleaning, and feature engineering of the collected data. Time-series

analysis will be performed on the data, which will be our features for the following

parts. Instead of relying just on one discipline, such as regression or classification,

both will be applied in accordance with the dataset in the next section, and two more

sub-datasets will be created. The first will be a continuous values dataset labelled with

the daily percent change in price for the chosen cryptocurrency (daily closing price

was converted to daily price percent change). The second dataset will also include

categorical data, which will be divided into the classifications “Up” and “Down” based

on the daily price percent change, which represents price variation.

The algorithm selection process starts after the datasets have been created. For

the classification problem, cross-validation will be used to compare the most popular

classification methods, and for the regression problem, a collection of linear and non-

linear regression algorithms will be used. The deep learning model will be approached differently, since the layers’ development and the architecture as a whole will be the core issues of concern.

Multiple training runs will then be performed, followed by the testing and validation

part. The result analysis comes just before validating the final model and setting it up for deployment; it includes visualising predictions against the actual data (for regression only) and then comparing the predictions from the various trained classification and regression models.


3.2 Evaluation Methods and Measures

Regression and classification employ several evaluation methods and metrics. In this

part, the assessment methodology and metrics for the classification and regression

models utilised in this research will be discussed.

3.2.1 Classification Evaluation Methods


Confusion Matrix

A confusion matrix is a table schema used in machine learning classification to represent the performance of a classification model on a test set for which the actual values are known; it captures the four key parameters that help determine other metrics such as accuracy and recall. For

supervised machine learning classification models, it serves as a performance indicator.

Additionally, the output classes can change based on the working scenario. A general

confusion matrix is shown in figure 3.2 below:

Figure 3.2: Confusion Matrix (Tharwat, 2020)

The confusion matrix’s displayed parameters are as follows:

True Positives (TP):

These are the positive values that were successfully predicted, indicating that the

true value of the actual class is yes and the value of the predicted class is also yes.

True Negatives (TN):

These are the successfully predicted negative values, indicating that the actual class

value is no and the predicted class value is also no.

False Positives (FP):

When the predicted class is yes and the actual class is no.
False Negatives (FN):

When the predicted class is no but the actual class is yes.

Accuracy

Regarding classification, accuracy is the most apparent performance indicator since it

broadly characterises how the trained model performs across all classes. It is simply

the ratio of correct predictions to all predictions, as represented in

formula 3.2.1 below:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.2.1} \]

Precision

The precision metric measures the ratio of correctly predicted positive observa-

tions to the total number of predicted positive observations. Its formula is as shown

below:
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{3.2.2} \]

The precision measures the model’s accuracy in classifying a sample as positive, which

has the goal of classifying all the Positive samples as Positive and not misclassifying a

negative sample as Positive (Tharwat, 2020).

Recall

Recall measures how many of the positive cases the classifier correctly predicted out of all the positive cases in the data; formula 3.2.3 represents the recall:
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3.2.3} \]

The recall metric assesses the model’s ability to identify positive samples. The higher

the recall, the more positive samples are detected, regardless of the number of negative samples (Bekkar et al., 2013).


F1-Score

The F1-Score is the harmonic mean of Precision and Recall, calculated as:


\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3.2.4} \]
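For reference, all of the above metrics can be computed with scikit-learn; the label arrays in the sketch below are placeholders:

```python
# A sketch: the classification metrics above, from placeholder labels.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual, cols: predicted
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```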

3.2.2 Regression Evaluation Methods


Mean Squared Error (MSE)

The most widely used regression loss function is mean squared error (L2 loss): the mean squared difference between true and predicted values. For a data point $y_i$ and its predicted value $\hat{y}_i$, where $n$ is the total number of data points in the dataset, it is expressed as:

\[ \text{MSE} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n} \tag{3.2.5} \]

Mean Absolute Error (MAE)

One of the most straightforward loss functions is mean absolute error, sometimes re-

ferred to as L1 loss. It is determined by averaging the absolute difference between the

predicted and actual values over the dataset (Wang and Lu, 2018). It is the arithmetic

average of absolute errors, mathematically denoted by the following equation:

\[ \text{MAE} = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n} \tag{3.2.6} \]

Root Mean squared Error (RMSE)

The square root of the mean of the squares of all the errors is known as the root mean

squared error (RMSE). By obtaining the square root of MSE, RMSE is calculated.

RMSE is also known as the Root Mean Square Deviation; the smaller the RMSE value, the better the model and its predictions. RMSE is defined by the following equation:
\[ \text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \tag{3.2.7} \]
When model errors follow a normal distribution, RMSE is a superior option; however, its sensitivity to outliers, which must be removed for it to work as intended, is one of its main flaws (Chai and Draxler, 2014).
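A short sketch computing the three metrics with scikit-learn and NumPy on placeholder values:

```python
# A sketch: MAE, MSE, and RMSE (the square root of MSE).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([0.012, -0.034, 0.008, -0.001])
y_pred = np.array([0.010, -0.020, 0.000, 0.003])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}")
```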

3.3 Tools and Resources

Several tools and resources were employed for this research, starting with Python as

the main programming language because it is the most common language for data

science and offers a large number of packages that can be used for many different

computer science and data science fields. In terms of data collection, Bitcoin-related data was gathered via the Yahoo Finance API, a reliable source of data with many benefits, the best known of which is that it is already available as a Python package: by simply installing the package and writing a few lines of code, the data is ready for use in subsequent phases without the need for an API key or limited access. TensorFlow, Keras, Scikit-Learn, Pandas, NumPy, Matplotlib,

Plotly, and Dash, are the primary tools utilised; these are the essential packages for our

machine learning conceptual model. The working environment used was Google Colab,

a cloud-based alternative to Jupyter Notebook that removes most hardware issues and

other potential limitations while working on data science projects. To guarantee the

optimum development circumstances for the project, every resource and tool chosen

for it was carefully considered after extensive testing with various tools.

3.4 Dataset

The data will be mainly gathered through the Yahoo Finance API using the “yfinance”

package as indicated in the preceding section. The collected data will be for Bitcoin

since Bitcoin influences the majority of the volatility in the cryptocurrency market.

The initial dataset will contain these features: Open, High, Low, Close, Adj Close, and

Volume, which are defined as:

• Open: It is the price when the trading begins.

• High: The highest price at which a stock traded within a specific time period.
• Low: Low is the minimum price of a stock in a period.

• Close: Closing price usually relates to the last price where a stock trades during

a standard trading session.

• Adj Close: The adjusted closing price modifies a stock’s closing price to reflect

its value after accounting for any dividends.

• Volume: The number of shares traded in a stock or contracts traded in futures

or options is referred to as volume.

The dataset date range will be from “2014-09-17” (the earliest date available in the API) to “2022-07-30” (a date picked to fix a specified end point), which comprises 2874 days (rows of data). The initial dataset is as follows:

Figure 3.3: Initial collected dataset

Since the forecast is based on daily variation and the closing price is the best match for the daily price, all columns were then dropped except the close price column. Because the project does not involve non-stationary values like raw prices, a new column named “Close%” was created; it will be used for the regression portion of the project and reflects the daily percent changes for Bitcoin starting on “2014-09-17”. A second
column called “Variation” was also produced using the column “Close%” with values 1

and 0. The “Variation” column denotes the price percent indications (1 for “Up” and

0 for “Down”). The modified dataset is also displayed below:

Figure 3.4: Updated Dataset

For further illustration, here is how the “Variation” column distribution looks between the two classes “Up” and “Down”:

Figure 3.5: Variation column classes distribution
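A minimal sketch of this dataset construction, assuming the yfinance package and the column names described above:

```python
# A sketch of the dataset construction: download BTC-USD, keep the
# close price, and derive the "Close%" and "Variation" columns.
import yfinance as yf

btc = yf.download("BTC-USD", start="2014-09-17", end="2022-07-30")
df = btc[["Close"]].copy()
df["Close%"] = df["Close"].pct_change()            # daily percent change
df["Variation"] = (df["Close%"] > 0).astype(int)   # 1 = "Up", 0 = "Down"
df = df.dropna()
print(df.head())
```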

We implemented the following in the data preprocessing section:

• Changing the date column to use the pandas DateTime format.


• Normalizing the price percent change (the “Close%” column).

• Feature generation through time-series analysis, using a custom function on the provided dataset with a 21-step window: at each time step, the features are extracted from the original time series by considering a predefined number of past values.

A rolling mechanism will provide a sub-time series of the last (m) time steps to build

the features. The final part will consist of splitting into two sub-datasets, one for

regression and the other for classification, which share the same features but have different labels; each dataset is then split into 70% train and 30% test, as sketched below.
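A minimal sketch of this construction follows; it reuses the df from the previous sketch, variable names are illustrative, and the normalization step is omitted for brevity:

```python
# A sketch of the rolling-window feature generation: each sample holds
# the previous 21 percent changes; the regression label is the next
# day's change and the classification label its sign.
import numpy as np

m = 21                                    # time steps per sample
pct = df["Close%"].to_numpy()

X, y_reg = [], []
for i in range(m, len(pct)):
    X.append(pct[i - m:i])                # last m percent changes
    y_reg.append(pct[i])                  # next-day percent change
X, y_reg = np.array(X), np.array(y_reg)
y_cls = (y_reg > 0).astype(int)           # "Up" = 1, "Down" = 0

split = int(len(X) * 0.7)                 # 70% train / 30% test, no shuffle
X_train, X_test = X[:split], X[split:]
y_train, y_test = y_cls[:split], y_cls[split:]
```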

3.5 Problem Formulation

Regression and classification are the two main types of problem addressed in this research.

Although each has its own formulation, they both fall under the category of supervised

learning, meaning that the prediction will be based on a labelled dataset. In this

case, classification and regression both use the same input feature data, but they

differ when it comes to the output/label data, which must be continuous for regression

and categorical or binary for classification. For the input data, it will be defined by

a set of time-series vectors extracted after the feature engineering part. Each data

point consists of 21-time steps, which present the feature of that specific data point

built based on the last steps, basically a translation for the daily price percent change

variation. For classification, the output vector will be made up of two primary classes:

1, which represents the “Up” class and denotes that the daily price percent change

will go up, and 0, which denotes the “Down” class and indicates that the daily price

percent change will go down. The output vector for regression will include the daily

price percent change for each day, which ranges from -1 to 1. (normalized). for each

part a group of models will be trained to predict the possible classes or values for each

data point, after which the model will be able to make future predictions on new cases.
3.6 Summary

In this chapter, we first reviewed the methodology used, which is the Machine Learning

Conceptual Model. This model is made up of a well-built custom schema that is based

on general conceptual model studies and is useful in contemporary machine learning

projects. The project’s assessment techniques and measurements are covered in the

following section, which also explains the methodologies used by each field, including

classification and regression. Then all the dataset features and characteristics were

summarized, beginning with a description of the data preprocessing steps, along with the creation of the labels (“Close%” and “Variation”); next came the time-series

analysis that was carried out to produce the training features, and finally, the data

structure that would be used for classification and regression to be able to perform both

approaches, followed by an analysis of the results from all produced models. In the

final section, the problem formulation for this project, which is based on classification

and regression problems, is represented and explained in depth. The training results

for both methods will be compared and analysed in the next chapter, along with the

models’ rankings based on various criteria.


Chapter 4

Results and Analysis

In this section, we evaluate the results of each model after dealing with the data prepa-

ration, feature engineering, classification and regression models construction, building,

and training. Analysis of each model’s results and comparison of its metrics will be

the focus to determine which model performs best.

4.1 Prediction Results

This section discusses the results of the classification prediction models in addition to

the regression models. The primary label for classification was whether the price percent change was positive or negative on a given day, each day representing a data point. For regression, the

label will be the daily price percent changes. 70% of the data were utilised for training,

and 30% were used for testing. However, in the case of the ANN, 40% of the testing set

was used for validation while training. As each approach has its own criteria for model

evaluation, the results will first be analysed for the classification component, followed

by the regression component.

4.1.1 Classification prediction results


Classifiers comparison

As previously discussed, the confusion matrix allows representing the performance of

a classification model on the test set that will help determine the other metrics like

accuracy, precision and recall. All the score calculation was done using the 30% test set

on the models. Cross-validation, a resampling technique that tests and trains a model

at different iterations using different parts of the data, is the first step in analysing

the results for machine learning models. It runs on the entire dataset and uses a number of folds (8) matching the number of years in the dataset. It is

primarily used in cases where the goal is prediction and can measure how accurately

a predictive model can perform on different portions of the dataset. A collection of

machine learning classifiers will be utilised for the comparison of results (a cross-validation sketch follows the list), including:

• Logistic Regression

• Linear Discriminant Analysis

• KNN

• Decision Tree

• GaussianNB

• Linear SVM and SVM

• Random Forest

• LGBM

• XGBoost
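A minimal sketch of this comparison is shown below; it reuses X and y_cls from the feature-engineering sketch in Chapter 3, and the LGBM and XGBoost classifiers would be added analogously from their own packages:

```python
# A sketch of the 8-fold cross-validation comparison over classifiers.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "GaussianNB": GaussianNB(),
    "Linear SVM": SVC(kernel="linear"),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y_cls, cv=8)   # 8 folds, one per year
    print(f"{name}: mean={scores.mean():.4f} max={scores.max():.4f}")
```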

Figure 4.1 shows the range of accuracy provided by each classifier after cross-validating each technique over eight folds:

Figure 4.1: Box Plots for classifiers accuracy range comparison


To determine which classifiers can achieve high accuracy throughout this procedure, only the maximum accuracy across the eight folds of training/testing is shown below:

Figure 4.2: Bar Plot for classifiers max accuracy comparison

Below, the average accuracy results for each classifier are shown. Based on all the results, the choice will not only be based on accuracy scores but

also consistency and other factors. In order to get the highest accuracy, the selected

classifiers will also undergo hyper-parameter tuning; the resulting average accuracies

are as follows:

• Logistic Regression: 51.88%

• LDA: 52.40%

• KNN: 49.47%

• Decision Tree: 50.36%

• GaussianNB: 53.08%

• Linear SVM: 52.23%

• SVM: 53.09%

• Random Forest: 52.23%

• LGBM: 52.41%
• XGB: 52.58%

Based on all the findings from this step, this collection of classifiers will be chosen

for the following reasons:

SVM:

Because it has the highest average accuracy.

XGBoost:

The most recommended approach for machine learning classification models, even

if earlier results were not particularly strong.

RandomForest:

This model allows for the highest accuracy obtained through hyper-parameter ad-

justment.

A custom ANN model will be created and included in the analysis and comparison

in addition to these classifiers.

Classification experiments results

After executing the training for each of the selected models, all the metrics indicated in the evaluation methods section will be computed, analysed, and evaluated in this part. The confusion matrix, classification report, precision-recall curve, and ROC curve will be illustrated. After that, all the models will be compared globally.

SVM result analysis:

SVM evaluation metrics gave these values:

• Accuracy: 48.99%

• Balanced Accuracy: 47.68%

• Precision: 65.62%

• Recall: 51.41%

• F1: 57.65%

The confusion matrix for the SVM model is shown below:



Figure 4.3: SVM confusion matrix

The classification report, which is used to evaluate the accuracy of predictions made

using a classification model, is illustrated in figure 4.4. It includes all the necessary

metrics for each class, such as accuracy, precision, recall, and the f1-score.

Figure 4.4: SVM classification report


The precision-recall curve plots the trade-off between precision and recall for different thresholds; the curve for the SVM model is shown below:

Figure 4.5: SVM precision-recall curve

The ROC curve (Receiver Operating Characteristic Curve) is a plot showing the

performance of a classification model at all classification thresholds, as shown below

in the case of the SVM model:

Figure 4.6: SVM ROC curve

In conclusion, the SVM model does not provide the expected results, as its accuracy falls below the 50% baseline expected of a binary classifier.

XGBoost result analysis

The XGBoost classifier evaluation metrics results:

• Accuracy: 52.56%
• Balanced Accuracy: 51.68%

• Precision: 70.70%

• Recall: 53.94%

• F1: 61.22%

Figure 4.7: XGBoost confusion matrix

Figure 4.8: XGBoost Classification report



Figure 4.9: XGBoost Precision-Recall curve

Figure 4.10: XGBoost ROC curve


Among the models so far, the XGBoost classifier has the highest accuracy, a value acceptable enough to be implemented in a trading strategy or a trading bot.

Random Forest result analysis

For Random Forest, a typical run was performed first, and hyper-parameter tuning was then applied to find the parameters that yield the best possible accuracy score. Starting with the evaluation metric results for the first run:

• Accuracy: 53.51%

• Balanced Accuracy: 52.87%

• Precision: 86.07%

• Recall: 53.79%

• F1: 66.21%

Figure 4.11: Random Forest confusion matrix



Figure 4.12: Random Forest classification report

Figure 4.13: Random Forest precision-recall curve

Figure 4.14: Random Forest ROC curve

In addition to these results, thanks to hyper-parameter tuning, the best estimator was able to reach 54.65% accuracy, making the Random Forest classifier the best performer among all the machine learning classifiers chosen. The only drawback is that this classifier has a very high precision rate for class “1”, whereas other models with lower accuracy have a more balanced rate between the two classes.
ANN result analysis

The result analysis for the ANN model shares some points with the other machine learning models, but the ANN has additional comparison aspects, as the focus will be on the loss and accuracy, together with the validation set, which validates each training epoch and computes the validation loss and validation accuracy to help detect any over/underfitting during training. An extra callback, ModelCheckpoint, was added to save the best model during the whole training, i.e. the model with the lowest validation loss. EarlyStopping was also added and tested; however, it did not prove helpful in this case, so the focus remains on the ModelCheckpoint method.
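A Keras sketch of this checkpointing setup is given below; model, X_train, y_train, X_val, and y_val are placeholders for the ANN and data splits discussed in this chapter:

```python
# A sketch: save the weights with the lowest validation loss seen
# during training; all variables here are placeholders.
from tensorflow import keras

checkpoint = keras.callbacks.ModelCheckpoint(
    "best_ann.keras",
    monitor="val_loss",       # track validation loss each epoch
    save_best_only=True,      # keep only the best model seen so far
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[checkpoint],
)
```

The evaluation metrics for the ANN: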

• Accuracy: 52.38%

• Balanced Accuracy: 50.23%

• Precision: 85.82%

• Recall: 53.24%

• F1: 65.71%

Figure 4.15: ANN Classification Report

The following plots show the training and validation loss/accuracy as a function of

the epoch during the model training, which helps analyse the training and look for

any over/underfitting. In this case, the train loss/accuracy will be in the purple line,

and the validation loss/accuracy will be in the orange line.



Figure 4.16: ANN Loss plot in function of epoch

Figure 4.17: ANN Accuracy plot in function of epoch

It can be observed from the figures above that the result does not get better even when using a well-built ANN model.



Table 4.1: Evaluation metrics comparison for classification models

Models Accuracy Balanced acc Precision Recall F1-score AUC-ROC


SVM 48.99% 47.68% 65.62% 51.41% 57.65% 48.00%
XGBoost 52.56% 51.68% 70.79% 53.94% 61.22% 51.00%
RandomForest 53.51% 52.87% 86.07% 53.79% 66.21% 50.00%
ANN 52.38% 50.23% 85.82% 53.24% 65.71% NA

Table 4.1 above presents the final global comparison of the classifiers (green means the highest value in the column; yellow marks a model under consideration).

The main point noticed from the prior confusion matrices and classification reports is that most classifiers have high precision rates for class (1), which results in high precision but low accuracy, reducing the models’ efficiency. This prompts us to consider not only the most accurate model but also the more stable and balanced ones, which are the XGBoost and Random Forest models.

4.1.2 Regression prediction results

The linear model, ensemble regression models, and RNN models make up the three

primary components of the regression section. The same time-series analysis features utilised for classification will be the input for all models, and the intended result will

be the daily price percent change. In order to determine how accurate the model is at

forecasting the daily percent change, the assessment will be based on three metrics

(MAE, MSE, and RMSE) in addition to a comparison plot that displays the results

between the projected percentages and the actual percentage.

Linear Regression Model

Results of Model evaluation metrics for Linear Regression (RMSE, MSE, MAE):

• Mean Absolute Error - MAE: 8.58%

• Mean squared Error - MSE: 1.44%

• Root Mean squared Error – RMSE: 12.02%

The prediction on the training set is shown in the red line, the prediction on the

testing set is shown in the green line, and the actual daily price percent changes are shown in the blue line in figure 4.18 below. It appears that the linear regression model was unable to produce values close to the actual results.

Figure 4.18: Comparison between original and Predicted values for Linear Regression

Ensemble Regression Model(s):

XGB Regressor Model:

Model evaluation metrics RMSE, MSE, MAE (for XGB Regressor):

• Mean Absolute Error - MAE: 9.65%

• Mean squared Error - MSE: 1.67%

• Root Mean squared Error – RMSE: 12.92%

As plotted in figure 4.19 below, the XGBoost Regressor achieved a very close prediction on the training set and a good result on the testing set.

Figure 4.19: Comparison between original and predicted values for XGB Regressor
Random Forest Regressor Model:

Model evaluation metrics RMSE, MSE, MAE (for Random Forest Regressor):

• Mean Absolute Error - MAE: 8.66%

• Mean squared Error - MSE: 1.44%

• Root Mean squared Error – RMSE: 12.00%

The results from the Random Forest model are not as good as those of the previous ensemble model, as shown in figure 4.20 below:

Figure 4.20: Comparison between original and predicted values for Random Forest
Regressor

RNN Regression Model (LSTM):

Model Evaluation metrics RMSE, MSE, MAE results (for LSTM):

• Mean Absolute Error - MAE: 2.64%

• Mean squared Error - MSE: 13.69%

• Root Mean squared Error - RMSE: 3.70%

Despite having a lower loss value, the LSTM model produces predictions far from the actual values, as shown in figure 4.21 below:



Figure 4.21: Comparison between original and predicted values for LSTM

By attempting to predict price returns using traditional regression models and various classification approaches, the experiments tested all the potential models that might be employed in this field. The Random Forest Classifier and the XGBoost Classifier are the best models for classification, and for regression the XGBoost regressor predicts the anticipated values rather well. What stands out is that XGBoost performs strongly in both classification and regression, as it is known for getting the best performance out of boosted tree algorithms at great computational speed.

4.2 Comparing and Merging Classification and Regression Models

The only way to compare regression models to classification models is to transform the numerical prediction results of the regression into categorical results divided into two classes, “1” or “0”. By changing each prediction into “1” if the value is positive or “0” if the value is negative, it becomes possible to calculate accuracy using those transformed results. The accuracy will be calculated using the testing set first and then using the entire dataset; these are the results for the testing set:

• RandomForest Classifier: 53.51%

• XGBoost Classifier: 52.56%


• RandomForest Regressor: 52.91%

• XGBoost Regressor: 52.68%

For the full dataset results:

• RandomForest Classifier: 66.83%

• XGBoost Classifier: 65.15%

• RandomForest Regressor: 54.00%

• XGBoost Regressor: 35.93%

Moving on, we combine all the models into one predictor that utilises a majority voting system: the predictions are gathered into one array and iterated over data point by data point. Under majority voting, the decision is made based on the class that receives the most votes across the five predictions coming from the various models, and the result can then be used to calculate an accuracy value. These are the values we obtained:

• Calculating with the testing set only: 53.27%

• Calculating with the full dataset: 65.43%

And the following figure 4.22 shows how the voting process actually works:

Figure 4.22: Explaining the created voting system that merges multiple classification
and regression models into one predictor
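A minimal sketch of this voting scheme follows; rf_clf, xgb_clf, ann_cls, rf_reg, and xgb_reg are illustrative names for the five trained predictors, not the exact variables used in the project:

```python
# A sketch of the voting scheme in figure 4.22: regression outputs are
# mapped to classes by their sign, then each data point takes the
# majority class across the five predictors (illustrative names).
import numpy as np

preds = np.array([
    rf_clf.predict(X_test),                      # Random Forest classifier
    xgb_clf.predict(X_test),                     # XGBoost classifier
    ann_cls,                                     # ANN class predictions (0/1)
    (rf_reg.predict(X_test) > 0).astype(int),    # sign of RF regressor output
    (xgb_reg.predict(X_test) > 0).astype(int),   # sign of XGB regressor output
])
# With five binary voters, class 1 wins when at least 3 of them vote 1.
majority = (preds.sum(axis=0) >= 3).astype(int)
accuracy = (majority == y_test).mean()
print(f"voting accuracy: {accuracy:.4f}")
```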
In conclusion, the classification models offered higher accuracy than the regression models. The combination method can be improved in future work by adding more prediction models or refining the voting system, and it can readily be turned into a reliable predictor more effective than any single machine learning model.

4.3 Ethical, Social, and Legal Issues

Cryptocurrencies are a widespread global economic phenomenon with repercussions at both the economic and individual levels. They affect the monetary policy of the state in terms of the money supply, since the state has no control over their issuance; they affect fiscal policy by facilitating tax evasion; and they affect the payments and credit system due to the absence of intermediaries. At the same time, they have gained wide acceptance thanks to dealers’ confidence, their fluidity in transactions, and the efficiency and speed offered by widespread trading platforms. On the other hand,

there are significant risks to investing in Bitcoin and cryptocurrencies in general, as

volatility is a crucial characteristic of cryptocurrencies. The price of Bitcoin and

cryptocurrencies, in general, is incredibly volatile because it is a very young and

emerging market. It is common for the price of Bitcoin to experience sharp

fluctuations within a day or even minutes. Therefore, this project is only intended for

research purposes, and we cannot recommend using its results to make an investment

or speculative decisions in cryptocurrencies.

4.4 Limitations and Problems

One of the significant limitations of the project was the dataset used for training: because of the lack of external features, most of the work relied on time-series features derived from the daily price percent variation of Bitcoin over 2874 days. This is a significant factor behind the low accuracy values, and most of the classifiers get distracted into predicting a single class. However, the overall approaches provided reliable models that can be integrated into a useful trading tool.

In the same vein, many other factors can relate to cryptocurrency price variation; some of them are unpredictable, like economic crises or wars, while other factors that might be suitable for prediction were not experimented with here because most of them are beyond the scope of this project.

However, things are a bit different regarding the cryptocurrency market for various

reasons, one of which is that blockchain and cryptocurrencies are still mostly unheard

of, especially when talking about the Web 3.0 era, the future of the web. On the

other hand, the comprehension of cryptocurrencies and forecasting models may be

advanced by extensive research and analysis, which will improve as more data,

features, and other factors become accessible.


4.5 Summary

In this chapter, a thorough analysis of the chosen classification and regression models

is presented, along with metrics assessments and the methodology used to distinguish

between the regression models since the provided metrics cannot adequately convey

how the model would function in actual case studies. After analysing all the given

results, the classification models considered the best were Random Forest and the

XGBoost classifiers, for many reasons, starting with the metric evaluation. In addition to accuracy, the precision and how balanced the precision was across both classes were considered; overall, most models had a high precision rate for the “1” class, which represents the “Up” class, as in the Random Forest model, which has high accuracy but was less balanced than the XGBoost model (based on the

confusion matrices). For the regression model, XGBoost regressors were picked based

on the simulation performed, comparing the predicted values with the actual ones.
Chapter 5

Conclusion

The conclusion of the findings from several parts will be the main focus of this

chapter. It will begin by evaluating the testing and analysis conducted in the previous

chapter, which will discuss the advantages of each approach, particularly in this case

when working with two different machine learning approaches, regression and

classification. The dissertation’s limitations, problems, and potential improvement

points will be examined together with the evaluation section. Following that, the

approach used will be discussed and how it affected the project’s construction. The

conclusions will also analyse the findings and demonstrate how this dissertation’s

goals and objectives were met, ending with a suggestion for future research.

5.1 Evaluation and Discussion

According to all of the prior analysis and metrics calculations, whether for the

classification or regression approach, the overall result is that the average accuracy

for both is about 53% (precisely 52.91%), with an over 60% precision rate for the class “1” (calculated only for the classification models), which indicates that the price would increase on that day. The RandomForest classifier was the top-performing

model, outperforming both classification and regression models with an accuracy rate

of 53.51% and a precision rate of 86.07%. As the cryptocurrency market is heavily dependent on Bitcoin, the model may also be incorporated into a trading strategy to anticipate not only the return of the Bitcoin price but also wider market variety and fluctuations. In this instance, however, it is employed for research and study purposes

that aid in understanding the variance of cryptocurrencies and utilising machine

learning models to forecast returns. Additionally, as discussed in the previous section,

a method was used to combine all the models and use a voting strategy to decide the

prediction. This method had some efficiency and produced predictions with an

accuracy of about 53%, but it still requires additional work and improvements to be

converted into an ensemble learning approach.

5.1.1 Methodology Discussion

As discussed in the previous chapter, the machine learning development method was

used to ensure that this project adheres to the standards for machine learning

projects. The methodology offers a structured approach to follow that still allows flexibility in solving this machine learning problem, and it is based on these axes:

Data collection, Data engineering, Model training, Model optimization, and finally,

Model integration, which results in a machine learning project capable of making

analytical decisions. Although working on a standard software development project

requires a certain way to be able to change, improve, or even delete a part of a

project, in this situation, the process allows flexibility in the work by isolating each

central element even though it is connected to a next or previous one. Any action can

be taken in this manner without endangering other parts (in some cases, a further

change must be done with other sections depending on the type of changes). This

advantage was obvious when concentrating on the model training and model optimization parts: returning to these specific parts to apply further changes did not require reviewing the entire project from the start. The methodology made this possible; it provides numerous advantages and may be maintained in further work by updating datasets, models, or methods.

5.2 Conclusion

This study’s primary objective was to determine whether machine learning models

could anticipate specific cryptocurrency returns using classification or regression

models to predict whether the price will move “Up” or “Down” on that particular
day. As was mentioned in the Literature Review, there are numerous financial

theories that are relevant to this topic and should be taken into account, but when it

comes to the machine learning component, the majority of the work is still in its

infancy, and many studies are constrained by or reliant upon a single approach. The

methodology utilised in this dissertation was systematic, beginning with the data

collection from the Yahoo Finance API, then producing two datasets: one for classification (categorical, with two classes: “1” implies the price will climb, “0” means the price will decline), the other for regression, using numerical values for the daily percent change in price, ranging from -1 to 1. Both datasets were

built using the daily close price of Bitcoin over eight years of data. Based on 21

timesteps, time series were created as the main feature of the prediction. Initial

cross-validation was carried out for the classification strategy to help determine which

machine learning classifiers should be considered for the next training. Along with

the customised ANN model, the initial selection included SVM, XGBoost Classifier,

and RandomForest Classifier. Following training, it was determined that RandomForest outperformed all other classifiers listed, followed by XGBoost, which has slightly lower accuracy than RandomForest but a more balanced score, especially in precision.

The major regression indicators for each model were highly similar, making it difficult

to determine how well each model would perform; as a result, the evaluation of the regression models was a little different: a plot comparing anticipated price percent changes with the original values served as the basis for the comparison. It reveals that only the XGBoost Regressor

outperformed all evaluated models, including LSTM, the model that is best suited for

time series projects. The outputs of the regression models were converted into a

categorical label with ”0” and ”1” classes, and an additional method was devised to

calculate accuracy to compare scores from regression and classification since the

project is accuracy-focused. The classification is better at foretelling whether the

price will move ”Up” or ”Down” on a particular day.


59
5.3 Suggestion for Further Work

The objective of this dissertation was to develop machine learning models that can

forecast a particular cryptocurrency return based on predicting whether the price will

go “up” or “down” in the upcoming days. The models achieved the highest accuracy

of 54%, with a high precision rate for the class “1” (which is “Up”) in most of them.

Most of the selected models can be employed in various profit-oriented, analytical, or related research projects, for example, trading strategies, trading bots, analysis dashboards, and market studies.

The initial voting strategy was utilised to avoid the forecast being made by a single

decision maker, which opened the door for additional work that can be implemented

and enhanced based on the voting approach to predict cryptocurrency market returns

using ensemble machine learning. There are already pre-built methods following the same principle, but each supports classification or regression alone (VotingClassifier and VotingRegressor in Scikit-Learn); these methods may be more effective, but they require additional research and enhancement.
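A minimal sketch of the Scikit-Learn approach, with illustrative base estimators:

```python
# A sketch of Scikit-Learn's built-in voting ensemble for classifiers.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier()),
        ("svm", SVC()),
    ],
    voting="hard",            # majority rule over predicted classes
)
# vote.fit(X_train, y_train); vote.predict(X_test)
```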

More features are required for this subject as it has been demonstrated in this

dissertation that time series alone are insufficient to achieve higher results than those

obtained. One aspect of the additional work that can be done is to expand research

for other features that can interact directly with the cryptocurrency market prices,

which can be related to finance. This project’s one major weakness was the lack of

features when it came to building the dataset. In conclusion, this project showed that

predicting cryptocurrency returns using classification and regression machine learning

models is feasible and could be a useful tool for many other fields or projects since

the cryptocurrency market is regarded as a new market.


References

Aggarwal, G., Patel, V., Varshney, G., and Oostman, K. (2019). Understanding the
social factors affecting the cryptocurrency market. arXiv preprint
arXiv:1901.06245.

Akyildirim, E., Goncu, A., and Sensoy, A. (2021). Prediction of cryptocurrency


returns using machine learning. Annals of Operations Research, 297(1):3–36.

Alpaydin, E. (2016). Machine learning: the new AI. MIT press.

Aysan, A. F., Demir, E., Gozgor, G., and Lau, C. K. M. (2019). Effects of the
geopolitical risks on bitcoin returns and volatility. Research in International
Business and Finance, 47:511–518.

Bachelier, L. (1900). Théorie de la spéculation. In Annales scientifiques de l’École


normale supérieure, volume 17, pages 21–86.

Back, A. et al. (2002). Hashcash-a denial of service counter-measure.

Barberis, N. and Thaler, R. (2003). A survey of behavioral finance. Handbook of the


Economics of Finance, 1:1053–1128.

Bekkar, M., Djemaa, H. K., and Alitouche, T. A. (2013). Evaluation measures for
models assessment over imbalanced data sets. J Inf Eng Appl, 3(10).

Bhambhwani, S., Delikouras, S., Korniotis, G. M., et al. (2019). Do fundamentals


drive cryptocurrency prices? Centre for Economic Policy Research.

Böhme, R., Christin, N., Edelman, B., and Moore, T. (2015). Bitcoin: Economics,
technology, and governance. Journal of economic Perspectives, 29(2):213–38.

Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.

Buchholz, M., Delaney, J., Warren, J., and Parker, J. (2012). Bits and bets,
information, price volatility, and demand for bitcoin. Economics, 312(1):2–48.

Campbell, J. Y., Lo, A. W., MacKinlay, A. C., and Whitelaw, R. F. (1998). The
econometrics of financial markets. Macroeconomic Dynamics, 2(4):559–562.

Cao, L. and Tay, F. E. (2001). Financial forecasting using support vector machines.
Neural Computing & Applications, 10(2):184–192.

Castrounis, A. (2016). Artificial intelligence, deep learning, and neural networks,


explained. KDnuggets informatics blog.

Chai, T. and Draxler, R. R. (2014). Root mean square error (rmse) or mean absolute
error (mae)?–arguments against avoiding rmse in the literature. Geoscientific
model development, 7(3):1247–1250.

Chang, P.-C., Fan, C.-Y., and Lin, J.-L. (2011). Trend discovery in financial time
series data using a case based fuzzy decision tree. Expert Systems with
Applications, 38(5):6070–6080.

Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In


Proceedings of the 22nd acm sigkdd international conference on knowledge
discovery and data mining, pages 785–794.

Chniti, G., Bakir, H., and Zaher, H. (2017). E-commerce time series forecasting using
lstm neural network and support vector regression. In Proceedings of the
international conference on big data and Internet of Thing, pages 80–84.

Ciaian, P., Rajcaniova, M., et al. (2018). Virtual relationships: Short-and long-run
evidence from bitcoin and altcoin markets. Journal of International Financial
Markets, Institutions and Money, 52:173–195.

Copeland, T. E., Weston, J. F., Shastri, K., et al. (2005). Financial theory and
corporate policy, volume 4. Pearson Addison Wesley Boston.

Copur, Z. (2015). Handbook of research on behavioral finance and investment


strategies: Decision making in the financial industry: Decision Making in the
Financial Industry. IGI Global.

Creamer, G. G. and Freund, Y. (2004). Predicting performance and quantifying


corporate governance risk for latin american adrs and banks. Financial Engineering
and Applications, MIT, Cambridge.
Dey, S., Kumar, Y., Saha, S., and Basak, S. (2016). Forecasting to classification:
Predicting the direction of stock market price using xtreme gradient boosting.
PESIT South Campus.

Dwork, C. and Naor, M. (1992). Pricing via processing or combatting junk mail. In
Annual international cryptology conference, pages 139–147. Springer.

Frennberg, P. and Hansson, B. (1993). Testing the random walk hypothesis on


swedish stock prices: 1919–1990. Journal of Banking & Finance, 17(1):175–191.

Gaur, D., Mehrotra, D., and Singh, K. (2022). Estimation of particulate matter
pm2.5 concentration using random forest regressor with hyperparameter tuning. In
2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), pages 465–469.

Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras, and


TensorFlow: Concepts, tools, and techniques to build intelligent systems. O’Reilly Media, Inc.

Grinberg, R. (2012). Bitcoin: An innovative alternative digital currency. Hastings


Sci. & Tech. LJ, 4:159.

Grömping, U. (2009). Variable importance assessment in regression: linear regression


versus random forest. The American Statistician, 63(4):308–319.

Haber, S. and Stornetta, W. S. (1990). How to time-stamp a digital document. In


Conference on the Theory and Application of Cryptography, pages 437–455.
Springer.

Haykin, S. (2009). Neural networks and learning machines, 3/E. Pearson Education
India.

Hirshleifer, D. and Hong Teoh, S. (2003). Herd behaviour and cascading in capital
markets: A review and synthesis. European Financial Management, 9(1):25–66.

Hobson, D. (2013). What is bitcoin? XRDS: Crossroads, The ACM Magazine for
Students, 20(1):40–44.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural


computation, 9(8):1735–1780.
Huang, J.-Z., Huang, W., and Ni, J. (2019). Predicting bitcoin returns using
high-dimensional technical indicators. The Journal of Finance and Data Science,
5(3):140–155.

Iinuma, A. (2018). Why is the cryptocurrency market so volatile: Expert take. Coin
Telegraph.

Jensen, M. C. (1978). Some anomalous evidence regarding market efficiency. Journal


of financial economics, 6(2-3):95–101.

Kara, Y., Boyacioglu, M. A., and Baykan, Ö. K. (2011). Predicting direction of stock
price index movement using artificial neural networks and support vector machines:
The sample of the istanbul stock exchange. Expert systems with Applications,
38(5):5311–5319.

Karakoyun, E. S. and Cibikdiken, A. (2018). Comparison of arima time series model


and lstm deep learning algorithm for bitcoin price forecasting. In The 13th
multidisciplinary academic conference in Prague, volume 2018, pages 171–180.

Kendall, M. G. and Hill, A. B. (1953). The analysis of economic time-series-part i:


Prices. Journal of the Royal Statistical Society. Series A (General), 116(1):11–34.

Kim, K.-j. (2003). Financial time series forecasting using support vector machines.
Neurocomputing, 55(1-2):307–319.

Kotsiantis, S. B., Zaharakis, I., Pintelas, P., et al. (2007). Supervised machine
learning: A review of classification techniques. Emerging artificial intelligence
applications in computer engineering, 160(1):3–24.

Kumar, M. and Thenmozhi, M. (2006). Forecasting stock index movement: A


comparison of support vector machines and random forest. In Indian institute of
capital markets 9th capital markets conference paper.

Lansky, J. (2018). Possible state approaches to cryptocurrencies. Journal of Systems


integration, 9(1):19.

Lee, T. K., Cho, J. H., Kwon, D. S., and Sohn, S. Y. (2019). Global stock market
investment strategies based on financial network indicators using machine learning
techniques. Expert Systems with Applications, 117:228–242.

Lukyanenko, R., Castellanos, A., Parsons, J., Chiarini Tremblay, M., and Storey,
V. C. (2019). Using conceptual modeling to support machine learning. In
International Conference on Advanced Information Systems Engineering, pages
170–181. Springer.

Lukyanenko, R., Parsons, J., and Storey, V. C. (2018). Modeling matters: Can
conceptual modeling support machine learning? AIS SIGSAND, pages 1–12.

Maass, W. and Storey, V. C. (2021). Pairing conceptual modeling with machine learning. Data & Knowledge Engineering, 134:101909.

Mai, F., Shan, Z., Bai, Q., Wang, X., and Chiang, R. H. (2018). How does social media impact Bitcoin value? A test of the silent majority hypothesis. Journal of Management Information Systems, 35(1):19–52.

Maimon, O. Z. and Rokach, L. (2014). Data mining with decision trees: Theory and applications, volume 81. World Scientific.

Malkiel, B. G. (2003). The efficient market hypothesis and its critics. Journal of Economic Perspectives, 17(1):59–82.

Malkiel, B. G. (2005). Reflections on the efficient market hypothesis: 30 years later. Financial Review, 40(1):1–9.

Mills, T. C., Coutts, J. A., and Roberts, J. (1996). Misspecification testing and
robust estimation of the market model and their implications for event studies.
Applied Economics, 28(5):559–566.

Mohammed, E. M. and Osman, E. G. A. (2021). Comparison between neural networks and binary logistic regression for classification observation (case study: risk factors for cardiovascular disease). In 2020 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), pages 1–6.

Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system. Decentralized Business Review, page 21260.

Nayak, K. M. (2012). A study of random walk hypothesis of selected scripts listed on NSE. International Journal of Management Research and Reviews, 2(4):508.

Niu, T., Wang, J., Lu, H., Yang, W., and Du, P. (2020). Developing a deep learning framework with two-stage feature selection for multivariate financial time series forecasting. Expert Systems with Applications, 148:113237.
Nobre, J. and Neves, R. F. (2019). Combining principal component analysis, discrete wavelet transform and XGBoost to trade in the financial markets. Expert Systems with Applications, 125:181–194.

Ostrom, C. W. (1990). Time series analysis: Regression techniques. Number 9. Sage.

Philippas, D., Rjiba, H., Guesmi, K., and Goutte, S. (2019). Media attention and Bitcoin prices. Finance Research Letters, 30:37–43.

Polasik, M., Piotrowska, A. I., Wisniewski, T. P., Kotkowski, R., and Lightfoot, G. (2015). Price fluctuations and the use of Bitcoin: An empirical inquiry. International Journal of Electronic Commerce, 20(1):9–49.

Poyser, O. (2017). Exploring the determinants of Bitcoin's price: An application of Bayesian structural time series. arXiv preprint arXiv:1706.01437.

Quinlan, J. (1990). Decision trees and decision-making. IEEE Transactions on Systems, Man, and Cybernetics, 20(2):339–346.

Quinlan, J. R. et al. (1992). Learning with continuous classes. In 5th Australian Joint Conference on Artificial Intelligence, volume 92, pages 343–348. World Scientific.

Rathan, K., Sai, S. V., and Manikanta, T. S. (2019). Crypto-currency price prediction using decision tree and regression techniques. In 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), pages 190–194.

Ricciardi, V. and Simon, H. K. (2000). What is behavioral finance? Business, Education & Technology Journal, 2(2):1–9.

Roberts, H. V. (1959). Stock-market "patterns" and financial analysis: Methodological suggestions. The Journal of Finance, 14(1):1–10.

S Kumar, A. and Ajaz, T. (2019). Co-movement in crypto-currency markets: Evidences from wavelet analysis. Financial Innovation.

Safavian, S. and Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660–674.

Sapuric, S. and Kokkinaki, A. (2014). Bitcoin is volatile! Isn't that right? In International Conference on Business Information Systems, pages 255–265. Springer.
Shefrin, H. (2002). Beyond greed and fear: Understanding behavioral finance and the
psychology of investing. Oxford University Press on Demand.

Shiller, R. J. (2003). From efficient markets theory to behavioral finance. Journal of Economic Perspectives, 17(1):83–104.

Singh, J. E., Babshetti, V., and Shivaprasad, H. (2021). Efficient market hypothesis
to behavioral finance: A review of rationality to irrationality. Materials Today:
Proceedings.

Smith, G. and Ryoo, H.-J. (2003). Variance ratio tests of the random walk hypothesis for European emerging stock markets. The European Journal of Finance, 9(3):290–300.

Tay, F. E. and Cao, L. (2002). Modified support vector machines in financial time
series forecasting. Neurocomputing, 48(1-4):847–861.

Tharwat, A. (2020). Classification assessment methods. Applied Computing and Informatics.

Tsai, C. F. and Wang, S. P. (2009). Stock price forecasting by hybrid machine learning techniques. In Proceedings of the International MultiConference of Engineers and Computer Scientists, volume 1, page 60.

Van Horne, J. C. and Parker, G. G. (1967). The random-walk theory: An empirical test. Financial Analysts Journal, 23(6):87–92.

Wang, S.-C. (2003). Artificial neural network. In Interdisciplinary Computing in Java Programming, pages 81–100. Springer.

Wang, W. and Lu, Y. (2018). Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. In IOP Conference Series: Materials Science and Engineering, volume 324, page 012049. IOP Publishing.

Weisberg, S. (2005). Applied linear regression, volume 528. John Wiley & Sons.

Zhong, X. and Enke, D. (2017). A comprehensive cluster and classification mining procedure for daily stock market return forecasting. Neurocomputing, 267:152–168.

Zurada, J. (1992). Introduction to artificial neural systems. West Publishing Co.


Appendix A

Dataset overview, code excerpts, and other plots

Figure A.1: Candlestick plot of Bitcoin's price variation over the eight years of collected data

Figure A.2: Line plot of Bitcoin's daily percentage price changes
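For reference, the percentage changes plotted here can be derived from raw closing prices in a few lines of pandas. The sketch below is illustrative, assuming a CSV file named btc_daily.csv with Date and Close columns; neither name is taken from the dissertation's code.

    import pandas as pd

    # Load daily Bitcoin prices (file and column names are assumptions).
    prices = pd.read_csv("btc_daily.csv", parse_dates=["Date"], index_col="Date")

    # Convert closing prices into day-over-day percentage changes.
    returns = prices["Close"].pct_change() * 100

    # The first observation has no predecessor, so drop its NaN.
    returns = returns.dropna()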



Figure A.3: Distribution of the training and testing sets
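A chronological split such as this can be reproduced with scikit-learn; the sketch below uses an assumed 80/20 ratio and placeholder arrays. Setting shuffle=False keeps the time ordering intact, so the test set contains the most recent observations.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for the prepared feature matrix and labels.
    X = np.arange(1000).reshape(500, 2)
    y = np.random.randint(0, 2, 500)

    # Chronological 80/20 split: no shuffling for a financial time series.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False)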

Figure A.4: Code excerpt for k-fold cross-validation
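A minimal sketch of the k-fold procedure with scikit-learn; the classifier, k = 10, and the synthetic data are illustrative assumptions rather than the dissertation's exact settings.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold, cross_val_score

    # Synthetic stand-in for the prepared feature matrix and binary labels.
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # Ten folds: every observation is held out exactly once for evaluation.
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    model = RandomForestClassifier(random_state=42)

    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"Mean CV accuracy: {scores.mean():.3f} (std {scores.std():.3f})")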

Figure A.5: Code excerpt for RandomForest hyperparameter tuning using GridSearch
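A minimal sketch of the tuning step with scikit-learn's GridSearchCV; the parameter grid is an illustrative assumption, and X_train, y_train are taken to be the training arrays from the split sketched above.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Illustrative search space; the dissertation's actual grid may differ.
    param_grid = {
        "n_estimators": [100, 300, 500],
        "max_depth": [None, 5, 10],
        "min_samples_split": [2, 5],
    }

    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=5,               # 5-fold cross-validation within the training data
        scoring="accuracy",
        n_jobs=-1,          # parallelise across available cores
    )
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)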



Figure A.6: ANN model architecture
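A minimal Keras sketch of a feed-forward network of the kind shown in Figure A.6; the layer sizes, activations, and optimiser are illustrative assumptions, not the dissertation's exact architecture.

    import tensorflow as tf

    N_FEATURES = 10  # placeholder input dimension

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(N_FEATURES,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        # Sigmoid output: estimated probability that the next day is an 'up' day.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])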

Figure A.7: Overview of the class distribution in the ANN model (high precision for class '1')

Figure A.8: Predicting the next 10 days' price variations using the XGBoost regressor
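One common way to produce a 10-day forecast with a one-step regressor is to predict a single day ahead and feed each prediction back in as a lag feature. The sketch below makes that assumption, using a synthetic return series and a 5-day lag window; the dissertation's feature construction may differ.

    import numpy as np
    from xgboost import XGBRegressor

    LAGS, HORIZON = 5, 10

    # Synthetic daily percentage changes standing in for the real series.
    rng = np.random.default_rng(0)
    returns = rng.normal(0.0, 2.0, 400)

    # Supervised framing: a window of LAGS past returns predicts the next one.
    X = np.array([returns[i:i + LAGS] for i in range(len(returns) - LAGS)])
    y = returns[LAGS:]

    model = XGBRegressor(n_estimators=300, random_state=42)
    model.fit(X, y)

    # Recursive multi-step forecast: each prediction extends the input window.
    window = list(returns[-LAGS:])
    forecast = []
    for _ in range(HORIZON):
        next_ret = float(model.predict(np.array(window[-LAGS:]).reshape(1, -1))[0])
        forecast.append(next_ret)
        window.append(next_ret)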

Figure A.9: Predicting the next 10 days' price variations using the RandomForest regressor
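For Figure A.9, the same recursive loop applies unchanged with the model swapped; a sketch assuming scikit-learn's RandomForestRegressor:

    from sklearn.ensemble import RandomForestRegressor

    # Drop-in replacement for the XGBoost model in the loop above.
    model = RandomForestRegressor(n_estimators=300, random_state=42)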
