Predicting Cryptocurrency Returns Using Classification and Regression Machine Learning Models
Amal Alshehri
A dissertation submitted to
The School of Computing Sciences of the University of East Anglia
in partial fulfilment of the requirements for the degree of
MASTER OF SCIENCE.
AUGUST, 2022
© This dissertation has been supplied on condition that anyone who consults it is
understood to recognise that its copyright rests with the author and that no
quotation from the dissertation, nor any information derived therefrom, may be
published without the author or the supervisor’s prior consent.
SUPERVISOR(S), MARKERS/CHECKER AND ORGANISER
Supervisor:
Dr. Antony Jackson
Markers:
Marker 1: Dr. Beatriz De La Iglesia
External Examiner:
Checker/Moderator
Moderator:
Dr. Wenjia Wang
DISSERTATION INFORMATION AND STATEMENT
Degree: MSc.
Duration: 2021--2022
STATEMENT:
Unless otherwise noted or referenced in the text, the work described in
this dissertation is, to the best of my knowledge and belief, my own work. It has
not been submitted, either in whole or in part for any degree at this or any other
academic or professional institution.
Subject to confidentiality restriction if stated, permission is herewith
granted to the University of East Anglia to circulate and to have copied for
non-commercial purposes, at its discretion, the above title upon the request of
individuals or institutions.
Signature of Student
Abstract
People are starting to see the cryptocurrency market as a viable source of income and
investment, similar to the stock market, as the concept of cryptocurrencies continues
to gain popularity. Additionally, several projects use tokens or coins built utilizing
blockchain technology. Furthermore, it is well known that Bitcoin dominates the cryptocurrency market, which poses the question of whether Bitcoin is predictable. Machine learning can forecast cryptocurrencies more accurately than established analytic methods like technical analysis. Predicting bitcoin returns is related to financial
machine learning, which uses time series to forecast price variance. This study starts
with the daily close price of bitcoin for its initial dataset. The price is transformed
into percentages (daily price percent change) and binary classes, which categorize into
two classes, “Up” and “Down,” after which a time series is applied to produce two
datasets: a categorical dataset for classification and a numerical dataset for regression.
For classification, which represents a binary classification problem in asset-price forecasting, k-fold cross-validation is applied to ensure that the best classifiers are selected for testing
and analysis. Most of the regression analysis was based on visualization, which displayed the prices predicted by each regressor against the original values and helped analyze the models’ results more accurately. The outcomes of this dissertation were
achieved by anticipating bitcoin returns using classification and regression machine
learning models, despite the approaches’ low overall accuracy and notably high precision rate for the “Up” class. At this stage, with a significant limitation regarding the dataset and
lack of other indicators, a model capable of predicting future variations is considered
a beneficial addition for many trading tools or even for crypto market analysts.
Acknowledgements
I would like to thank my supervisor, Dr. Antony Jackson, for his many suggestions
and constant support during this research. I would also like to express my gratitude
to Jazan University and the Government of the Kingdom of Saudi Arabia for fully sponsoring my course.
Finally, I am very grateful to my mother, the most remarkable woman, for her uncon-
ditional love, unlimited support and patience full of warmth and contentment.
Amal Alshehri
Norwich, UK.
Table of Contents
Abstract iv
Acknowledgements v
Table of Contents vi
List of Figures ix
List of Abbreviations xi
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Cryptocurrencies and their price movement . . . . . . . . . . . 2
1.1.2 Effect of Bitcoin variation on Crypto Market . . . . . . . . . . . 4
1.1.3 Machine Learning in relation to Cryptocurrencies . . . . . . . . 5
1.1.4 Machine learning’s role in forecasting cryptocurrency returns . 11
1.2 Aim and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Significance of the study . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Structure of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Literature Review 15
2.1 Basic Financial Theories . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Efficient Market Hypothesis . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Random Walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 Behavioral Financial Theory . . . . . . . . . . . . . . . . . . . . 18
2.2 Machine Learning in Cryptocurrency Returns Prediction . . . . . . . . 19
2.2.1 Supervised Machine Learning . . . . . . . . . . . . . . . . . . . 19
2.2.2 Ensemble Machine Learning . . . . . . . . . . . . . . . . . . . . 21
2.2.3 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Methodology Design 26
3.1 Design of Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Evaluation Methods and Measures . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Classification Evaluation Methods . . . . . . . . . . . . . . . . . 28
3.2.2 Regression Evaluation Methods . . . . . . . . . . . . . . . . . . 30
3.3 Tools and Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Conclusion 56
5.1 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1.1 Methodology Discussion . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3 Suggestion for Further Work . . . . . . . . . . . . . . . . . . . . . . . . 59
References 60
List of Tables
List of Figures
4.20 Comparison between original and predicted values for Random Forest
Regressor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.21 Comparison between original and predicted values for LSTM . . . . . . 51
4.22 Explaining the created voting system that merges multiple classification
and regression models into one predictor . . . . . . . . . . . . . . . . . 52
A.1 Candlestick plot for Bitcoin price variation during the collected 8 years
of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.2 Bitcoin daily price percentage changes line plot . . . . . . . . . . . . . 67
A.3 Testing and training sets distribution . . . . . . . . . . . . . . . . . . . 68
A.4 K-fold cross validation coding part . . . . . . . . . . . . . . . . . . . . 68
A.5 RandomForest hyperparameters tuning using GridSearch coding part . 68
A.6 ANN model architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.7 Overview on the classes distribution in ANN model (high precision rate
for class ‘1’) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.8 Predicting next 10 days price variations using XGBoost Regressor . . . 69
A.9 Predicting next 10 days price variations using RandomForest Regressor 70
Chapter 1
Introduction
Cryptocurrencies have drawn increasing attention from investors, researchers, and the general public. They are a brand-new form of money that is sweeping the financial industry and catching the eye of industry pioneers.
They are a type of virtual currency intended for use online. The fact that cryptocur-
rencies are traded for profit and that their prices are rising has made them a hot topic.
The first and best-known cryptocurrency is Bitcoin, which was developed in 2008 by
Nakamoto (2008). The first linear hash chain, sometimes known as a blockchain, was
proposed by Haber and Stornetta (1990). A method for verifying the creation or most
recent modification date of a digital document was developed. The data itself wasn’t
time-stamped, in order to protect the material’s privacy. To address spam mail,
Dwork and Naor (1992) developed a proof-of-work mechanism. Each email would have
a header that calculated virtual postage as a single computation. This postage stamp
was used to show that calculating the postage before sending the email took a slight
amount of CPU time. To describe the cost of computing each hash, the term “hashcash” was coined by Back et al. (2002). Investors and industry professionals are both
concerned about making accurate predictions about the values of the cryptocurrency
market because prices are rising quickly and responding similarly to stock market price
changes. Although this market exhibits similar behaviours to other stock markets in
terms of expected volatility, investor confidence has been reflected in it (Grinberg, 2012).
1.1 Background
1.1.1 Cryptocurrencies and their price movement

Even though prices are quite volatile, and it is hard to identify certain crucial factors and quantify the extent to which they affect the price, researchers continue their efforts in the area of cryptocurrencies (Polasik et al., 2015). Several categories may be used to categorize the variables
affecting cryptocurrency prices, including internal variables like supply and demand
as well as extrinsic variables like attractiveness and legality (Poyser, 2017; Buchholz
et al., 2012). A study has shown that cryptocurrency price is positively and statistically
significantly correlated with computing power and network adoption; the study may
also be a tool for identifying additional factors such as regulatory and political risks
affecting the returns on these digital assets (Bhambhwani et al., 2019). The volatility of cryptocurrencies is attributed in part to the absence of government regulatory oversight and the fact that these currencies lack an intrinsic value (Iinuma, 2018). A cryptocurrency’s popularity is one of the key factors influencing its price. One of the critical causes of price instability and volatility is the immaturity of these economic markets (Polasik et al., 2015). According to Aggarwal et al. (2019), more research should be done in this area to get better findings with more reliability.
The following list summarises some of the factors that influence cryptocurrency prices:
1. Media
How the media impacts people’s lives cannot be disregarded. In a wide range of cases, prices rise if the media actively promotes bitcoin in its ongoing coverage. By spreading
awareness of cryptocurrencies, more people can get interested in them and become
more likely to buy them. However, Philippas et al. (2019) found that the media
networks had a limited impact on Bitcoin values, with their impact being more
significant during times of greater uncertainty. In certain circumstances, they also lose their importance.
2. Political Instability
Politics impact the acceptance of bitcoin. When the political system is unstable
and about to fall apart, demand for bitcoin increases. People give cryptocur-
rencies a chance and believe in them over traditional currencies when they lack
confidence in their nation’s economy. Using 2,875 observations from 18 July 2010 to 31 May 2018, one study concluded that the Global Geopolitical Risk Index (GPR) could predict the returns and volatility of Bitcoin. Additionally, it was concluded that Bitcoin could be utilised as a tool for hedging against global geopolitical risk; the study also showed that the effects are positive at higher levels of both GPR and price.
3. Legal regulations
The presence of legal cover attracts the confidence of the public. These rules and regulations shape the public’s willingness to adopt cryptocurrencies.
4. Supply and Demand
Due to its limited supply, demand mainly raises the price of Bitcoin. In essence,
Bitcoin’s value comes from the reality that scarce resources are more valuable
than abundant ones. As a result, if demand falls, the price of bitcoin will also
fall. On the other hand, the price of Bitcoin increases when there is higher demand.
1.1.2 Effect of Bitcoin variation on Crypto Market

With the exception of bitcoin, all cryptocurrencies are referred to as “altcoins” which
is a term that originates from the notion that these coins are alternatives to bitcoin.
There are a plethora of other altcoins available, but before any of them were made,
Bitcoin was the first and most valuable asset. These alternative currencies aim to
improve or add advantages like quick transactions and low energy consumption. Although the cryptocurrency sector is relatively new and still developing, Bitcoin holds the distinction of being the first cryptocurrency or asset. One of the most critical leading indicators is Bitcoin’s price; this component is essential because Bitcoin prices influence markets through their dominance. The market capitalization of Bitcoin rules the cryptocurrency industry and
significantly influences it. Böhme et al. (2015) stated that “Bitcoin’s rules were designed by engineers with no apparent influence from lawyers or regulators”; this idea
gave bitcoin momentum and made it the king of the scene in cryptocurrencies. As a re-
sult, an entirely new, unrestricted, and uncontrolled industry has been made possible.
Altcoins do not have the same level of widespread acceptance as Bitcoin. They also carry most of Bitcoin’s traits, because many were developed using Bitcoin’s source code as a starting point and then improved upon (Hobson, 2013). Prices may differ somewhat
between Bitcoin and its clones, but they often do not. Numerous scholars are inter-
ested in investigating the connection and impact between Bitcoin and other altcoin
markets. After analyzing the performance of the daily data of 16 alternative currencies
in addition to Bitcoin between 2013 and 2016, Ciaian et al. (2018) concluded
that there is, in fact, a significant correlation between the prices of Bitcoin and other
alternative currencies, with the correlation being more evident in the short term. The
study found that, compared to bitcoin prices, macro-finance factors have a longer-term
connection. It is not advised to create a portfolio using Bitcoin and other alternative
currencies since the value of these currencies is significantly impacted by the price of
Bitcoin (S Kumar and Ajaz, 2019). The price of other encrypted digital currencies is
significantly influenced by Bitcoin, the first and most important symbol for the whole
cryptocurrency industry. Various factors contribute to this, including that the most
well-known alternative currencies are just enhanced or expanded versions of Bitcoin.
The liquidity of Bitcoin also contributes to its continued dominance. Bitcoin is still regarded as the safest cryptocurrency asset. It is impossible to accept or refute the claim
that Bitcoin has a 100% impact on the rates of other alternative currencies. In other
words, because there are many alternative cryptocurrencies and the financial sector
is intricate, a rise in bitcoin prices is not always matched by the same percentage rise across all alternative cryptocurrencies. The expected drop or rise may not occur because certain elements seem to cancel out the effects of others. Bitcoin and other altcoins therefore remain closely, but not perfectly, linked.

Machine Learning
Artificial intelligence overlaps with machine learning, and the boundaries between them are often unclear. Machine learning (ML) is a subset of artificial intelligence (AI) which enables computers to learn from data without being
explicitly programmed. AI is a larger idea that aims to build intelligent machines that
can replicate human thinking capabilities and behaviour. ML is concerned with the
development of systems that can learn from data. A machine learning algorithm aims
to discover a hidden pattern or function that will help predict some unknown variable; this assumes the existence of a function capable of mapping inputs to outputs. Using
mathematical models to “train” or “learn” from a given collection of data so that they
can make decisions is the main goal of machine learning. These algorithms start with
input data and do several sorts of statistical analysis before going on to prediction. As
they do so, they predict the outcome while also adjusting parameters that help increase the accuracy of the prediction. This differs from typical machines and processes, which consist only of static parts in a conventional input/output paradigm where machines are fed inputs and produce fixed outputs; instead, a machine learning system adapts to external signals and can predict the future from uncertain and
fragmentary information like humans do. Machine learning techniques have been applied effectively in many fields, including engineering and medical applications. In addition, machine learning plays a broad role in business
and finance. It can be further used in complicated applications like price variation
prediction, risk assessment, and fraud detection. It is used to train and build these
models that depend on the provided data, which will make predictions for various
cases, and the most known are the forecasting models when it comes to forecasting
future values. The two most widely used machine learning algorithms for forecasting
price/percentage movement are Artificial Neural Networks (ANNs) and Support Vector
Machine (SVM), and both have their learning patterns that will be discussed in the
following parts.
Deep Learning
While it may sound trendy, deep learning is simply a term used to describe certain
types of neural networks and related algorithms that often consume very raw input
data. In order to calculate the target output, they process this data through many layers of transformations, and the resulting representations can be used for further learning, generalisation, and understanding. In most other machine learning approaches, the feature extraction process, along with feature selection, must be carried out manually. Feature extraction usually involves dimension reduction and reduces the number of input features and data required to generate accurate results; significant benefits, such as reduced computation, follow from this. More commonly, deep learning falls under the family of techniques that learn such representations automatically, which practitioners increasingly emphasise and employ. To create the best-performing model, the machine learning algorithms
“learn” the best settings as they go along. Deep learning has been used successfully in
many applications and is currently regarded as one of the most cutting-edge machine
learning and AI methods. The algorithms associated with it are frequently used for
supervised, unsupervised, and semi-supervised learning problems.
Supervised Learning
Supervised learning uses training data, or supervised data, because the output is regarded as the label of the input. The approach is also known as Learning with a Teacher (Haykin, 2009), Learning from Labelled Data, or Inductive Machine Learning. The goal is to build a system that can learn the mapping between input and output and predict system output given new inputs. The learned mapping leads to the classification of the input data if the output takes a defined set of discrete values that indicate the class labels of the input. If the output takes continuous values, the input is regressed. Learning amounts to estimating the parameters of these input-output associations. When these parameters are not directly available from training samples, a learning system must perform an estimation process to obtain them. Training data are therefore central to the entire process.
Artificial neural networks (ANNs) are statistical models that were partially mod-
elled by biological neural networks and were directly influenced by them. They can
concurrently process and simulate nonlinear interactions between inputs and outcomes.
The related algorithms are part of the broader field of machine learning and can be characterised by adaptive weights along paths between neurons, which are tuned by a learning algorithm that learns from observed data to improve the model. In addition to the learning algorithm, the network architecture itself is important.
A typical architecture with lines linking neurons is described by equation (1.1.1) below. Each link has a weight, which is a numerical value. The output of a hidden-layer neuron is

$$ h_i = \sigma\left( \sum_{j=1}^{N} V_{ij} x_j + T_i^{hid} \right) \tag{1.1.1} $$
where $\sigma$ is called the activation (or transfer) function, $N$ is the number of input neurons, $V_{ij}$ are the weights, $x_j$ are the inputs to the input neurons, and $T_i^{hid}$ are the threshold terms
of the hidden neurons. The purpose of the activation function is, besides introducing
nonlinearity into the neural network, to bound the value of the neuron so that divergent
neurons do not paralyze the neural network. A typical example of the activation
function is the sigmoid (or logistic) function (Wang, 2003), which is defined as:

$$ \sigma(u) = \frac{1}{1 + \exp(-u)} \tag{1.1.2} $$
Other possible activation functions are arc tangent and hyperbolic tangent. They
have similar responses to the inputs as the sigmoid function but differ in the output
ranges. A neural network built in the manner described above has been demonstrated to be capable of approximating a wide class of functions. The numbers given to the input neurons are independent variables, and those returned from the output neurons are dependent variables of the function approximated by the neural
network. Binary data (such as yes or no) or even symbols (such as green or red) can
be used as inputs to and outputs from neural networks when the data is appropriately
encoded; this property gives neural networks a broad spectrum of use (Wang, 2003).
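Equations (1.1.1) and (1.1.2) can be sketched in a few lines of NumPy. This is an illustrative toy only, with arbitrary zero weights and thresholds chosen so the expected output is obvious; it is not one of the models used later in this dissertation.

```python
import numpy as np

def sigmoid(u):
    """Logistic activation function, equation (1.1.2)."""
    return 1.0 / (1.0 + np.exp(-u))

def hidden_layer(x, V, T):
    """Hidden-neuron outputs h_i = sigmoid(sum_j V_ij * x_j + T_i), equation (1.1.1)."""
    return sigmoid(V @ x + T)

x = np.array([0.5, -1.0, 2.0])   # N = 3 input neurons
V = np.zeros((2, 3))             # weights for 2 hidden neurons, all zero here
T = np.zeros(2)                  # threshold terms, zero here
h = hidden_layer(x, V, T)        # every pre-activation is 0, so each h_i = sigmoid(0)
```

With zero weights every hidden neuron outputs sigmoid(0) = 0.5; in practice, $V$ and $T$ would be tuned by the learning algorithm from observed data.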
Classification Models
A classification model reads some given inputs and produces an output that categorizes the input. Decision trees and logistic regression are two classification techniques that produce a probability score indicating the likelihood that the input belongs to a specific class. In classification, the results belong to a fixed set of expected categories.
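As a minimal sketch of such a probability-scoring classifier (assuming scikit-learn is available; the one-feature toy data below are invented purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data: label is 1 when the single feature is positive
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

proba_up = clf.predict_proba([[1.5]])[0, 1]  # probability score for class 1
label = clf.predict([[1.5]])[0]              # hard class label from the score
```

The classifier outputs a probability that is thresholded into one of the fixed categories, exactly as described above.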
Regression Models
Regression analysis is used to establish the nature and strength of the relationship between a single dependent variable and one or more independent variables. In figure 1.1 below, the blue line labelled “Line of Regression” represents the line of best fit, typically estimated by ordinary least squares (OLS), which is the most widely used form of this method. Linear regression establishes the linear relationship between two variables based on a line of best fit; to show how changing one variable affects another, it uses the slope of that line. In a linear regression relationship, the y-intercept represents the value of one variable when the other is zero. There are other, non-linear regression models, but they are far more complicated.
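A hedged NumPy sketch of an OLS line fit on invented data (generated from y = 2x + 1 so the expected slope and intercept are known in advance):

```python
import numpy as np

# Synthetic data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Ordinary least squares fit of a straight line: y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
# The intercept is the value of y when x is zero, as described above
```

Here the fit recovers a slope near 2.0 and an intercept near 1.0.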
Regression can reveal associations between variables observed in the data, even though causation cannot be demonstrated with certainty. In the areas of business, finance, and economics, it has several uses. It helps investment managers, for example, value assets and understand the relationships between factors like commodity prices and the stocks of businesses that trade in those commodities. This means that professionals in various industries, such as finance and investments, might benefit from regression. Regression may help a firm forecast sales based on outside variables
such as the weather, previous sales, GDP growth, and other variables. The capital
asset pricing model is a well-liked regression model in finance for evaluating assets and estimating expected returns.
Forecasting Models
Forecasting models make planning in the business and marketing sectors simpler. Among the many forecasting models are
time series models, econometric models, and judgmental forecasting. Although forecasting takes many forms, most are connected to regression and well-known regression models. As previously indicated, there are several formats for forecasting models. Fully cognizant of past occurrences, the Time Series Model
employs statistical data to generate predictions for the future. An econometric model
can predict economic variables like sales, demand, supply, and price. When there is
a paucity of historical data, a new product entering the market, or comparable conditions, judgmental forecasting is often used; this is especially true in the current climate of uncertainty, increased competition, and erratic client loyalty. Although they cannot wholly alleviate this complex process, forecasting models offer
long-term insight. Forecasting models account for several variables, including usage
frequency, the availability of pertinent data, and the forecasting technique.
Time series analysis is a technique for analysing a succession of data points gathered
over a period of time. Rather than collecting data inconsistently or randomly, time series analysis requires analysts to capture data points at regular intervals over a predetermined timeframe. However, this analysis is more than just collecting data over time. Time series
data differ from other forms of data because the analysis may demonstrate how vari-
ables change over time. In other words, time is a crucial variable since it demonstrates
how the data adjust through time and the final results; and that offers an additional
information source and a predetermined order of data dependencies. Time series anal-
ysis often calls for many data points to maintain consistency and dependability. The
sample size will be representative, and the analysis can cut through noisy data, provided there are enough observations.
Mastering analysis is essential for making progress in buying and selling. The future
value may be estimated using two methods: technical analysis and fundamental anal-
ysis. Technical analysis makes predictions about future prices using data from the
market’s trading activity, such as price and trading volume. In contrast, the funda-
mental analysis forecasts the future using data from sources outside the market, such
as the economy, interest rates, and geopolitical issues. While some investors focus on one approach exclusively, others seek to combine fundamental and technical aspects. Two of the
most popular algorithms for predicting price movement are Artificial Neural Networks
(ANNs) and Support Vector Machines (SVM), each with its learning patterns. ANNs
have been widely used in securities prediction. Researchers have discussed various
ANN-related topics, including parameter selection and training sets.

1.1.4 Machine learning’s role in forecasting cryptocurrency returns

Forecasting returns for the cryptocurrency market based on machine learning and deep learning falls under the umbrella of Machine Learning in Finance, which places the most significant emphasis on techniques like regression and
classification. Since there is not as much data available for the cryptocurrency market
as for other markets, such as the stock market, the primary focus of deep learning
will be on the ANN and a very rudimentary LSTM.

1.2 Aim and objectives

This research intends to employ regression models to predict price changes as accurately as possible for a predetermined number of days in the future. Then, the research will focus on forecasting trading signals, such as “Sell” or “Buy”. The objectives of this
study are:
1. Develop a data pipeline that starts by obtaining up-to-date, reliable data about
the cryptocurrency market from the Yahoo Finance API and then cleans, pro-
cesses, and converts the data into daily price changes rather than the close price.
2. Transform the present data into labelled data using a time series analysis ap-
proach.
3. Create and evaluate several regression models that can predict the daily price percent changes for a cryptocurrency, Bitcoin, rather than working with exact prices, making the model perform better in future situations.
4. Create and evaluate several classification models that, based on the percentage change in price, classify whether the price will rise or fall over the following day.
5. Build a voting system that can turn the projected outcomes into “Sell” or “Buy” trading signals, merging multiple classification and regression models into one predictor.
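The data-transformation steps in the objectives above can be sketched with pandas. This is a hedged illustration on a short synthetic price series; the actual pipeline obtains Bitcoin prices from the Yahoo Finance API.

```python
import pandas as pd

# Synthetic daily close prices standing in for downloaded Bitcoin data
close = pd.Series([100.0, 102.0, 101.0, 105.0],
                  index=pd.date_range("2022-01-01", periods=4, freq="D"))

# Daily price percent change: the numerical target for regression
pct = close.pct_change().dropna() * 100

# Binary "Up"/"Down" classes: the categorical target for classification
labels = pct.apply(lambda r: "Up" if r > 0 else "Down")
```

For the four prices above, the percent changes are roughly 2.0, -0.98 and 3.96, giving the class sequence Up, Down, Up.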
1.3 Significance of the study

This research makes a significant scientific contribution at the academic, scientific and
research levels by providing the most appropriate algorithms and models for predicting
the percentage changes in bitcoin prices. The research contributions are described below.
This research offers researchers in the area of the intersection between data sci-
ence and finances a method to choose the most appropriate machine learning and deep
learning models for use with financial time series data. In order to predict the per-
centage of the closing price for financial time series data and to ensure that the results
are generalizable to data in the future, this method uses cross-validation and split
folds manually, which increases the model’s accuracy and ensures that the results are reliable. A further contribution lies in determining the most powerful algorithms and models appropriate for creating financial
projections. This research also focused on clearly and attractively displaying its final
findings in the form of buying and selling signals, with the buy sign appearing when a price rise is predicted.

1.4 Structure of Dissertation

Introduction: This chapter introduces the main topics of the thesis, describes those topics in more detail, and then outlines the thesis’s goal, objectives, and significance.
Review of the Literature: This chapter covers earlier ideas and methods that were
used to estimate cryptocurrency returns using machine learning. It also reviews earlier studies in this area.
Methodology Design: A defined route for the project life cycle was selected and
tailored for this project using the “Machine Learning Conceptual Model” methodology.
Along with a section explaining the data pipeline from the initial raw data collected
to the creation of two sets: one for the classification problem and the other for the
regression problem.
Results and Analysis: This chapter highlights the outcomes of many approaches
that were used, each of which included multiple techniques (various models), as well
as the analysis and comparison of the results, following a review of all the trained models.
Conclusion: This chapter concludes the results of this project and summarises the entire work; in addition, it discusses the project’s aim and objectives, and suggests possible future work in light of the results of this research.
1.5 Summary
This chapter gives a quick overview of cryptocurrencies, the factors that drive their
prices up and down, and several arguments for why Bitcoin is the pioneer in the
cryptocurrency industry. The most essential and tested frameworks for prediction, including time series analysis, are summarised in this chapter, together with other essential ideas in machine learning and deep learning, with a brief introduction to technical and fundamental analysis.
Chapter 2

Literature Review
This chapter discusses relevant previous literature. The first part presents the most fa-
mous, discussed and researched financial theories. Machine learning and deep learning
methods used for forecasting financial topics are also discussed in the second part.
2.1 Basic Financial Theories

2.1.1 Efficient Market Hypothesis

The idea of an efficient market has long been a cornerstone of academic finance research.
Eugene Fama, a University of Chicago economist, initially proposed the efficient mar-
kets hypothesis, which maintains that financial markets are “information efficient” and
that asset prices in financial markets properly reflect all information currently accessi-
ble about an asset. According to this theory, it is complicated to predict asset prices
accurately enough to “beat the market” because asset prices will only respond to new information, which by definition cannot be anticipated to gain a competitive advantage over other investors. Three forms of the efficient market hypothesis are discussed, and each has a substantial body of supporting evidence. In the strong form of this theory:
public, private and historical information is reflected on stock prices immediately, and
this does not allow the investor to achieve any returns or profits. In the weak form,
stock prices reflect all information of past stock prices; thus, technical trading analysis
is useless and cannot be relied upon solely to generate profits. On the other hand, the
semi-strong form assumes that fundamental analysis based on historical stock prices
and publicly available information will not aid in making money without access to private information. The efficient market hypothesis rests on several assumptions, the most crucial of which is that all investors are rational and have the same outlook on potential
investments. It also assumes that all investors have full access to the same sources of
information (Copeland et al., 2005). Rationality entails updating the information ac-
curately and making decisions that are compatible with the notion of expected benefit
and normatively acceptable when new information is received (Barberis and Thaler,
2003). Not everybody concurs that the market is operating effectively; academic opinion remains divided on the notion of the efficiency of financial markets. The best proof, according to Malkiel
(2005), that markets are efficient and respond to new information accurately is that
professional investors do not consistently beat the market. Jensen (1978) further emphasized that the strongest economic theory supported by strong empirical evidence is the idea
of the efficiency of financial markets. From a different perspective, the theory of market
efficiency may be unrealistic, mainly because it assumes the investor’s rationality while
ignoring the psychological component of that investor, which may be the cause of some
price deviations and distortions. From here, the behavioural finance theory emerged
(Singh et al., 2021). In an effort to balance the various viewpoints, it should be stated
that the market is ultimately efficient, since anomalies are extremely unlikely to endure: they are the exception rather than the rule (Malkiel, 2003).
The stock market draws attention because its prices do not always rise in reaction to good news. Many financial and economic professionals believe that the idea of the “random walk” is useful in comprehending the broad variations in stock prices. One of the first people to take an interest in the subject of the random walk was Roberts (1959), who concluded that stock prices support the random walk hypothesis.
Subsequent price changes are independent, and any future changes to stock prices must be entirely unrelated to previous price changes. Investors are constantly looking for methods to boost their earnings and to analyse new information; prices react swiftly and rationally, fluctuating randomly around their real worth (Nayak), so that no one can guarantee profits. Kendall and Hill (1953) and Van Horne and Parker (1967) also emphasised that prices reflect the available information on the market.
Proponents of this theory assert that the reason some people made profits is that they took on high risk, which involved either a high profit or a loss. They contend that diversifying the portfolio is the most effective method to reduce this high risk. On
the other hand, numerous studies show that certain international financial markets do
not behave randomly. The variance ratio test has demonstrated that the Indian stock market did not follow a random walk between April 1996 and June 2001. According to other research findings, Swedish stock prices did not fluctuate randomly during the 72-year period from 1919 to 1990 (Frennberg and Hansson, 1993). Greece, Hungary, Poland, Portugal, and Turkey
were five medium-sized European developing markets that were investigated in 2010.
Only the Turkish market among these five nations appears to follow a random walk. Several statistical tests are commonly used to examine the random walk hypothesis:
• Autocorrelation Test: this particular test aims to identify the kind of correlation between long-term returns.
• Runs Test: this test seeks to establish the stock returns’ independence from one another.
• Variance Ratio Test: this test is predicated on the notion that a randomly generated time series’ variance grows in proportion to the sampling interval.
Behavioural finance theory studies how psychology affects investors or financial ana-
lysts’ behaviour and how their decisions affect the markets. This idea holds that an
investor may not always act logically and may commit systematic estimation mistakes
that affect judgment. Behavioural finance integrates finance and other social sciences
to understand better and explain what is happening in financial markets. Several aca-
demic studies have examined how people behave as individuals, groups, enterprises,
and markets using a wide range of research methodologies related to the science dis-
ciplines of psychology and finance (Shiller, 2003). Behavioural finance theory holds
that investors are not rational. Investors do not use the market information properly,
which hinders them from gaining earnings and causes them to make poor purchasing
and selling decisions (Ricciardi and Simon, 2000). The distinction between rational
and irrational investors is not explicitly defined by classic financial theory. Because
people’s behaviours are interrelated, situations where irrational investors can affect ra-
tional investors result in a new pricing trend due to imitation and simulation and that
is known as “herd behaviour” (Hirshleifer and Hong Teoh, 2003). Shefrin (2002) discussed behavioural finance from several perspectives. First, investors make mistakes
while making investments because they base their decisions on past experiences. This
includes so-called heuristics, which shorten the time it takes someone to decide based
only on prior, personal experiences. Second, both substance and form influence how investors perceive information, and this perception can have an impact on financial market values. In behavioural theory, biases reflect the complexity and uniqueness of the human mind, whereas in classic theories they are faults that need to be corrected (Copur, 2015). There are numerous studies done in this
area, and the majority of them follow one of two primary patterns:
• Behavioral finance and established financial theories can explain unforeseen oc-
currences.
• Identify the beliefs and practices of financial investors that contradict traditional
economic ideas.
2.2 Machine Learning in Cryptocurrency Returns
Prediction
Decision Tree
Although a decision tree can be used for regression, it is often utilised for classification. Internal nodes of a decision
tree represent a feature test, while leaves represent decisions reached after additional
processing (Rathan et al., 2019). The decision tree’s most valuable feature is its sim-
plicity of interpretation following prediction (Quinlan et al., 1992). A decision tree can be pruned, by computing the error rate, to make it more generalisable and capable of producing more effective decisions. In addition, decision trees differ from other
models, such as artificial neural networks, by providing several decision rules that are
particularly beneficial for further investigation and analysis (Tsai and Wang, 2009).
Decision trees do not aim to replace the standard statistical methods; various approaches, including artificial neural networks and support vector machines, can be utilised alongside them (Maimon and Rokach, 2014). Chang et al. (2011) state that few studies have examined how
well stock index movements can be predicted in terms of their direction or sign, and
we believe the bitcoin market is no exception. The study by Huang et al. (2019) discusses the decision tree classification approach with 124 price-based technical indicators, and finds that the model gives a solid forecast for narrow ranges of returns. Decision trees
are one of the most important models, as one of their principles is clarity, brevity and
flexibility. One of the main flaws is that not all attributes interact with one another
concurrently while making decisions. A single attribute is used to split the dataset at
each partitioning stage (Quinlan, 1990). Here is the general decision tree structure:
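As a minimal, self-contained sketch of these ideas (the data is invented for illustration, not the dissertation's dataset), a small scikit-learn decision tree can be fitted and its split rules printed, which is exactly the interpretability benefit noted above:

```python
# Hypothetical sketch: a small decision tree on toy "up/down" return data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [yesterday's return, return two days ago]; label: 1 = "Up", 0 = "Down"
X = [[0.02, -0.01], [-0.03, 0.01], [0.01, 0.02], [-0.02, -0.02],
     [0.03, 0.00], [-0.01, -0.03], [0.02, 0.01], [-0.02, 0.02]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The fitted rules can be printed as text: a single attribute is tested at
# each split, as described above.
print(export_text(tree, feature_names=["ret_t-1", "ret_t-2"]))
print(tree.predict([[0.015, 0.0]]))
```

Because the toy labels depend only on the sign of yesterday's return, a depth-2 tree separates them perfectly; real return data is far noisier.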
Support Vector Machine (SVM)
Through the use of a hyperplane, the supervised learning algorithm SVM divides
the data into classes. The chosen hyperplane is the one with the largest distance to the nearest points of each class, which increases the chance that the classification will be correct. The SVM works by performing some nonlinear mapping of the input vectors x into a high-dimensional feature space (Kim, 2003). The support vector machines’ results depend on the kernel
functions’ choice, which can be considered a weakness. Some of the kernel functions
are Linear Kernel function, sigmoid function and the Polynomial Kernel Function.
Due to SVM models’ complexity and reliance on the kernel function, calculating SVM
models takes time and incurs cost (Lee et al., 2019). For large-scale financial time-series
forecasting, SVM models based on a limited training and testing sample are unsuitable
(Niu et al., 2020). Further research on deep learning techniques is needed in order to make meaningful comparisons, given the efficacy and limitations of machine learning for financial market practitioners seeking to achieve their goals (Cao and Tay, 2001; Tay and Cao, 2002).
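To illustrate the dependence on the kernel function discussed above, here is a hedged sketch (invented XOR-style toy data, scikit-learn's SVC) comparing the kernels named in the text:

```python
# Illustrative only: the same SVM with different kernel functions can give
# very different results, which is the weakness noted above.
from sklearn.svm import SVC

X = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9],
     [0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # XOR-like layout: not linearly separable

scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, C=10, gamma=2)
    clf.fit(X, y)
    scores[kernel] = clf.score(X, y)  # training accuracy per kernel
print(scores)
```

On this layout the nonlinear RBF kernel should do at least as well as the linear one; hyper-parameters (C, gamma) are arbitrary choices for the sketch.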
Logistic Regression
Logistic regression can be used to predict whether future returns will move upward or downward. It is a method for fitting a regression curve, y = f(x), where y is composed of data with binary coding (0, 1).
The link between several independent factors and a categorical dependent variable is
examined via logistic regression (Mohammed and Osman, 2021). Since the logistic
regression does not require a normal data distribution, it is appropriate for use in
finance. The algorithm is fast because it has few parameters; however, this simplicity can be a flaw that causes the issue of over-fitting (Akyildirim et al., 2021).
Regression
The regression technique is a statistical method that evaluates the importance of the relationship between variables. Regression can involve two or more variables: it is characterised as simple linear regression when there is just one input variable and as multiple linear regression when there are several predictor variables (Weisberg, 2005). Regression analysis can be used in financial modelling to predict the strength of the link
between variables and then predict how the relationship will behave moving forward.
Questions concerning the past behaviour of the variables and their expected future behaviour can be asked using time series data; the benefit of time series regression analysis is its capacity to explain the past and forecast the future behaviour of relevant variables. As a result, the time series’ history is required to serve a double purpose. It is commonly assumed that asset returns are jointly multivariate normal and independently and identically distributed over time when employing statistical models to determine the normal return of a particular investment. On the other hand, Mills et al. (1996) investigated the effects of incorrectly estimating the regression of the market model and demonstrated the consequences of such misspecification.
XGBoost
Ensemble models perform better than individual models in machine learning with a
high likelihood. A machine learning model called an ensemble combines several different
models into a single model. For instance, the Random Forest uses bagging to combine many decision trees (aggregating them by figuring out their averages). Bagging can be substituted with boosting: by focusing on individual model failures rather than overall forecasts, boosting turns weak learners into strong learners. Gradient boosting develops successive models based on residuals, the differences between predicted and actual results; gradient-boosted trees learn from errors as they are built.
Extreme gradient boosting, also known as XGBoost (Chen and Guestrin, 2016),
is roughly ten times quicker than regular gradient boosting due to performance traits
including parallel processing and cache awareness. XGBoost contains built-in regularisation to reduce overfitting and a split-finding algorithm for optimising trees. In other words, XGBoost is a quicker and more precise variant of gradient boosting. Boosting often surpasses bagging, and gradient boosting may be the best ensemble boosting strategy; XGBoost is arguably the most important machine learning ensemble method, as it is a superior implementation of this idea, and it has produced effective and noteworthy results. This method (XGBoost)
has been widely used in finance and financial forecasting. Nobre and Neves (2019)
has created a financial system that can provide a rate of return for the portfolio of about 50% using principal component analysis, discrete wavelet transform, and XGBoost.
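As a hedged sketch of this boosting workflow on synthetic data (scikit-learn's GradientBoostingClassifier is used as a stand-in; the xgboost package exposes a very similar XGBClassifier API, but is not assumed to be installed):

```python
# Sketch of boosted-tree classification on synthetic "Up/Down" data.
# The features and labels are made up for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 21))  # 21 time-step features, as in this project
# Synthetic direction label driven by the two most recent steps
y = (X[:, -1] + 0.5 * X[:, -2] > 0).astype(int)

# Chronological 70/30 split (no shuffling, as for time series)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, shuffle=False)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```

Here the label is a clean function of the features, so accuracy is high; on real return data, accuracies near 50–55% (as reported later in this dissertation) are far more typical.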
In order to forecast the stock market over 60-day and 90-day periods, Dey et al. (2016) proposed a system using XGBoost as a classifier and concluded that it produced promising results.
Random Forest
The combined decision of multiple decision trees is the foundation of the Random Forest (RF) machine learning technique introduced by Breiman (2001). In the decision-tree classifiers and regressors of a random forest, each tree depends on a separate random sample and has the same distribution as all other trees in the forest; with a good injection of randomness, RF becomes an effective tool for prediction. Randomness is injected into the random forest by randomly selecting both the training data and the input variables (Géron, 2019). Compared to neural network models, random forests are more likely to reach a global solution; since neural networks frequently fall into a local optimum, overfitting is less likely to happen with random forests (Kumar and Thenmozhi, 2006).
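A minimal sketch of the bagging idea behind random forests (synthetic regression data for illustration only): because each tree sees only a bootstrap sample of the training data, the out-of-bag (OOB) score gives a built-in estimate of generalisation without a separate validation set.

```python
# Random forest regression on synthetic return-like data; the numbers are
# illustrative, not the dissertation's results.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 21))
# Next-day change, mostly driven by the most recent step, plus noise
y = X[:, -1] * 0.8 + rng.normal(scale=0.1, size=400)

rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
# OOB R^2: each sample is scored only by trees that never saw it.
print("OOB R^2:", round(rf.oob_score_, 3))
```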
In order to solve the issue of the overfitting, Random Forest trains several decision trees
on various subspaces of the feature space at the expense of a little higher bias. This
indicates that none of the forest’s trees sees the entire training data. Partitions are created
for the data recursively. The split is carried out at each node by posing a question
regarding an attribute (Akyildirim et al., 2021). There are not many studies on the use of random forest regression to predict cryptocurrency returns, but its uses in finance in general, to support better business decisions, are noticeable. In their study,
Creamer and Freund (2004) employed the random forest regression approach to assess
the risk associated with corporate governance in Latin American markets and forecast
performance. On one sample of Latin American banks and one sample of Latin Amer-
ican Depository Receipts (ADRs), they perform tenfold cross-validation trials. Their
findings from logistic regression and random forest were contrasted. Results supported
the use of random forest regression. Here is an illustration of the general concept of a random forest:

Artificial Neural Network (ANN)
An artificial neural network is a model, inspired by biological examples, that is built from thousands of artificial neurons coupled by coefficients (weights) that represent the neural architecture. There are three or more linked layers in an
artificial neural network. Neurons in the input layer make up the first layer; the deeper layers process the inputs and send the final results to the output layer.
Another set of learning rules uses backpropagation, which enables the ANN to adjust
its output results by accounting for errors. The error is used to modify the weight of
the ANN’s unit connections to account for the mismatch between the expected and ac-
tual results (Zurada, 1992). Investors and scholars have been interested in forecasting
non-stationary time-series financial data. The vast majority of studies have focused
on stock markets as a whole. Because of the financial market’s severe nonlinearity and
volatility, neural networks have become increasingly popular, particularly for time series forecasting. Support vector machines (SVM) and ANN, two models based on two different principles, have been compared in predicting the direction of movement of the Istanbul Stock Exchange. According to their experimental results, the ANN model
outperformed the SVM model significantly on average (Kara et al., 2011). To predict
the daily return direction of an ETF, Zhong and Enke (2017) conducted research demonstrating that three distinct PCA-based logistic regression models do not perform as well as PCA-based ANN classifiers. It should be emphasised that ANN classifiers are therefore a strong baseline for this kind of prediction task.
LSTM
Long Short Term Memory, or LSTM, is a type of recurrent neural network (RNN)
used in deep learning that can learn long-term associations, especially in applications
requiring sequence prediction (Hochreiter and Schmidhuber, 1997). A memory cell having a state maintained over time is called a “cell state”; it serves as the main functional component of an LSTM model, as shown in the figure below. Optional gates regulate the flow of information into and out of the cell by adding or removing information from it. This mechanism relies on two components: a pointwise multiplication operation and a sigmoid neuronal layer. The sigmoid layer outputs values between zero and one, describing how much of each component should be let through. LSTM neural networks can solve issues where earlier learning algorithms
like RNNs fell short. High-end problems can be resolved with LSTM, which effectively captures long-term dependencies. This method has produced effective results in situations involving NLP classification and time-series forecasting. In order to predict the future price of Bitcoin, the ARIMA time series model and the LSTM are evaluated by Karakoyun and Cibikdiken (2018); it is noted that the LSTM model performs better when the results are compared.
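The gating mechanism can be illustrated with a NumPy-only toy step of one LSTM cell (random weights, purely for illustration): the sigmoid gates emit values in (0, 1) that scale, via pointwise multiplication, how much information enters or leaves the cell state.

```python
# One LSTM cell step, written out by hand to show the gates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
x = rng.normal(size=n_in)        # input at this time step
h = np.zeros(n_hid)              # previous hidden state
c = np.zeros(n_hid)              # previous cell state
xh = np.concatenate([x, h])

# One weight matrix per gate (forget, input, output) plus the candidate.
Wf, Wi, Wo, Wg = (rng.normal(size=(n_hid, n_in + n_hid)) for _ in range(4))

f = sigmoid(Wf @ xh)             # forget gate: what to keep of the old state
i = sigmoid(Wi @ xh)             # input gate: what to write
o = sigmoid(Wo @ xh)             # output gate: what to expose
g = np.tanh(Wg @ xh)             # candidate cell contents

c = f * c + i * g                # pointwise update of the cell state
h = o * np.tanh(c)               # new hidden state
print(h)
```

A real LSTM (e.g. in Keras) also learns biases and repeats this step over the whole sequence; the sketch only shows where the sigmoid layers and pointwise multiplications sit.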
2.3 Summary
This chapter summarised the key papers that were relevant to the thesis topic. In
the first part, we addressed the three most crucial financial theories: the efficient
market hypothesis, the theory of random walks, and the behavioural financial theory,
which describes how human behaviour impacts financial markets. The other section reviewed machine learning approaches such as supervised machine learning, deep learning, and ensemble learning, focusing on some essential models that have shown promising results in earlier research, such as XGBoost, random forest, ANN, and LSTM.
Methodology Design
This chapter outlines the methodology designed and followed in this dissertation, starting from the Machine Learning Conceptual Model approach through the presentation of the collected dataset and the data preprocessing steps. The approach combines conceptual modelling with machine learning because both have long been recognized as essential research areas (Lukyanenko et al., 2019, 2018; Maass and Storey, 2021).
Taking into account that machine learning problems vary depending on the case study, a tailored method, starting from the overall development process, was created to meet the needs of this project, as shown in figure 3.1.
The first part consists of the data-related stages, which start with data collection from a reliable data source such as the Yahoo Finance API; the required analysis is then performed on the data to produce the features for the following stages. Both preprocessing and feature engineering will be applied to the dataset described in the next section, and two sub-datasets will be created. The first will be a continuous-values dataset labelled with the daily percent change in price for the chosen cryptocurrency (the daily closing price converted to a daily price percent change). The second dataset will instead use categorical labels, divided into the classes “Up” and “Down” based on the sign of the daily change.
The algorithm selection process starts after the datasets have been created. For the classification problem, cross-validation will be used to compare the most popular classification methods, and for the regression problem, a collection of linear and non-linear regression algorithms will be used. The deep learning model will be approached differently, since the development of the layers and the architecture as a whole will be the focus.
Multiple training runs will then be performed, followed by the testing and validation part. The result analysis comes just before validating the final model, computing the error metrics (only for regression) and then comparing the predictions from the various trained models.
3.2 Evaluation Methods and Metrics
Regression and classification employ several evaluation methods and metrics. In this
part, the assessment methodology and metrics for the classification and regression models are presented. The confusion matrix is computed on a test set for which the actual values are known and creates a representation of the four key parameters that will help determine other metrics like accuracy and recall. Additionally, the output classes can change based on the working scenario. A general definition of the four parameters follows:
True Positives (TP): these are the positive values that were successfully predicted, indicating that the true value of the actual class is yes and the value of the predicted class is also yes.
True Negatives (TN): these are the successfully predicted negative values, indicating that the actual class is no and the predicted class is also no.
False Positives (FP): when the predicted class is yes and the actual class is no.
False Negatives (FN): when the predicted class is no and the actual class is yes.
Accuracy
The accuracy metric broadly characterises how the trained model performs across all classes. It is simply:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3.2.1)
Precision
The precision metric compares the number of correctly predicted positive observations to the total number of predicted positive observations. Its formula is shown below:

Precision = TP / (TP + FP)    (3.2.2)

The precision measures the model’s accuracy in classifying a sample as positive, with the goal of classifying all positive samples as positive and not misclassifying any negative sample as positive.
Recall
The recall is a measure of how many of the positive cases in the data the classifier correctly predicted; formula 3.2.3 represents the recall:

Recall = TP / (TP + FN)    (3.2.3)
The recall metric assesses the model’s ability to identify positive samples: the higher the recall, the more positive samples are detected, regardless of the number of negative samples misclassified.
F1-Score
The F1-Score is simply the weighted harmonic mean of the precision and the recall:

F1 = 2 · (Precision · Recall) / (Precision + Recall)    (3.2.4)
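A worked example of these classification metrics from hypothetical confusion-matrix counts (the numbers are invented, not the dissertation's results):

```python
# Hypothetical confusion-matrix counts for a binary "Up"/"Down" classifier.
TP, TN, FP, FN = 40, 35, 15, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)        # eq. 3.2.1
precision = TP / (TP + FP)                         # eq. 3.2.2
recall    = TP / (TP + FN)                         # eq. 3.2.3
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 3))
```

With these counts, accuracy is 0.75 and recall is 0.8, showing how the four cells combine into every metric used later in the results chapter.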
The most used regression loss function is mean squared error (L2 loss), determined as the mean squared difference between true and predicted values. For a data point yᵢ and its predicted value ŷᵢ, where n is the total number of data points, the mean squared error is expressed as follows:

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²    (3.2.5)
One of the most straightforward loss functions is mean absolute error, sometimes referred to as L1 loss, which measures the absolute difference between predicted and actual values over the dataset (Wang and Lu, 2018). It is the arithmetic mean of the absolute errors:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|    (3.2.6)
The square root of the mean of the squares of all the errors is known as the root mean squared error (RMSE); it is calculated by taking the square root of the MSE and is also known as the root mean square deviation. The smaller the RMSE value, the better the model and its predictions. RMSE is defined by the following equation:

RMSE = √( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )    (3.2.7)
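A small worked example of the three error metrics on invented true and predicted daily percent changes:

```python
# Invented true vs. predicted daily percent changes, for illustration only.
import math

y_true = [0.010, -0.020, 0.005, 0.030]
y_pred = [0.012, -0.015, 0.000, 0.025]
n = len(y_true)

mse  = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n   # eq. 3.2.5
mae  = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n     # eq. 3.2.6
rmse = math.sqrt(mse)                                          # eq. 3.2.7

print(mse, mae, rmse)
```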
When model errors follow a normal distribution, RMSE is a superior option; its sensitivity to large errors, via squaring, penalises outliers more heavily than the MAE.
3.3 Tools and Resources
Several tools and resources were employed for this research, starting with Python as
the main programming language because it is the most common language for data
science and offers a large number of packages that can be used for many different
computer science and data science fields. In terms of data collection, Bitcoin-related
data was gathered via the Yahoo Finance API, a reliable source of data with many benefits, the most well-known of which is that it is already implemented as a Python package: by just installing the package and writing a few simple lines of code, the data is ready for use in subsequent phases without the need for an API key. For the remaining stages, packages such as Plotly and Dash are among the primary tools utilised; these are the essential packages for our
machine learning conceptual model. The working environment used was Google Colab,
a cloud-based alternative to Jupyter Notebook that removes most hardware issues and
other potential limitations while working on data science projects. To guarantee the
optimum development circumstances for the project, every resource and tool chosen
for it was carefully considered after extensive testing with various tools.
3.4 Dataset
The data will be mainly gathered through the Yahoo Finance API using the “yfinance”
package as indicated in the preceding section. The collected data will be for Bitcoin
since Bitcoin influences the majority of the volatility in the cryptocurrency market.
The initial dataset will contain these features: Open, High, Low, Close, Adj Close, and Volume.
• Open: The price at which a stock first trades when the market opens.
• High: The highest price at which a stock traded within a specific time period.
• Low: Low is the minimum price of a stock in a period.
• Close: The closing price usually relates to the last price at which a stock trades during a regular trading session.
• Adj Close: The adjusted closing price modifies a stock’s closing price to reflect that stock’s value after accounting for corporate actions.
• Volume: The number of units traded during the period.
The dataset date range will be from “2014-09-17” (the earliest date available from the API) to “2022-07-30” (an arbitrarily picked end date), which comprises 2,874 days (rows of data). The initial dataset is as follows:
In light of the fact that the forecast would be based on daily variation and that
the closing price is the best match to be used as the daily price, all the columns
were then dropped, with the exception of the close price column. Since the project
idea does not involve non-stationary values like prices, a new column named “Close%”
was created. This column will be utilised for the regression portion of the project
and reflects the daily percent changes for Bitcoin starting on “2014-09-17”. A second column called “Variation” was also produced from the “Close%” column, with values 1 and 0; the “Variation” column denotes the direction of the price percent change (1 for “Up” and 0 for “Down”). For more illustration, the distribution of the “Variation” column is shown below.
A time-series analysis then builds the features from the original time series at each time step while considering a fixed window of previous steps: a rolling mechanism will provide a sub-time series of the last (m) time steps to build
the features. The final part will consist of splitting into two sub-datasets, one for
regression and the other for classification, which mainly have the same features but
different labels, then split these datasets into 70% train and 30% test.
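The preprocessing pipeline described in this section can be sketched end to end in plain Python (the prices are invented; in the project the closing prices come from the Yahoo Finance API via the yfinance package):

```python
# Sketch: closing prices -> "Close%" (daily percent change) -> "Variation"
# (1 = "Up", 0 = "Down") -> rolling-window features of the last m steps.
closes = [100.0, 102.0, 101.0, 104.0, 103.0, 105.0, 104.5, 106.0]

# "Close%": daily percent change of the closing price
close_pct = [closes[i] / closes[i - 1] - 1 for i in range(1, len(closes))]
# "Variation": direction of the change
variation = [1 if c > 0 else 0 for c in close_pct]

# Rolling window: features are the last m changes, label is the next day
m = 3  # the project uses m = 21
X, y_reg, y_cls = [], [], []
for t in range(m, len(close_pct)):
    X.append(close_pct[t - m:t])    # features: sub-series of last m steps
    y_reg.append(close_pct[t])      # regression label: next day's change
    y_cls.append(variation[t])      # classification label: next day's sign

# 70% / 30% chronological train/test split
split = int(0.7 * len(X))
X_train, X_test = X[:split], X[split:]
print(len(X), len(X_train), len(X_test))
```

The same feature matrix X serves both sub-datasets; only the label vector differs, exactly as described in the problem formulation below.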
3.5 Problem Formulation
Regression and classification are the two main types of problems used in this research.
Although each has its own formulation, they both fall under the category of supervised
learning, meaning that the prediction will be based on a labelled dataset. In this
case, classification and regression both use the same input feature data, but they
differ when it comes to the output/label data, which must be continuous for regression
and categorical or binary for classification. For the input data, it will be defined by
a set of time-series vectors extracted after the feature engineering part. Each data point consists of 21 time steps, which represent the features of that specific data point built from the preceding steps, essentially a translation of the daily price percent change variation. For classification, the output vector will be made up of two primary classes:
1, which represents the “Up” class and denotes that the daily price percent change
will go up, and 0, which denotes the “Down” class and indicates that the daily price
percent change will go down. The output vector for regression will include the daily
price percent change for each day, which ranges from -1 to 1 (normalized). For each part, a group of models will be trained to predict the possible classes or values for each data point, after which the model will be able to make future predictions on new cases.
3.6 Summary
In this chapter, we first reviewed the methodology used, the Machine Learning Conceptual Model. This model is made up of a well-built custom schema tailored to the needs of machine learning projects. The project’s assessment techniques and measurements are covered in the
following section, which also explains the methodologies used by each field, including
classification and regression. Then all the dataset features and characteristics were
summarized, beginning with a description of the data preprocessing steps, along with the creation of the labels (“Close%” and “Variation”); next came the time-series
analysis that was carried out to produce the training features, and finally, the data
structure that would be used for classification and regression to be able to perform both
approaches, followed by an analysis of the results from all produced models. In the
final section, the problem formulation for this project, which is based on classification
and regression problems, is represented and explained in depth. The training results
for both methods will be compared and analysed in the next chapter, along with the corresponding evaluation metrics.
In this section, we evaluate the results of each model after the data preparation and training. An analysis of each model’s results and a comparison of their metrics will be presented.
This section discusses the results of the classification prediction models in addition to the regression models. The primary label for classification was whether the price percent change was positive or negative for that day, representing a data point. For regression, the label is the daily price percent change. 70% of the data were utilised for training,
and 30% were used for testing. However, in the case of the ANN, 40% of the testing set
was used for validation while training. As each approach has its own criteria for model
evaluation, the results will first be analysed for the classification component, followed by the regression component. The confusion matrix is computed for a classification model on the test set and helps determine the other metrics like accuracy, precision and recall. All the score calculations were done using the 30% test set
on the models. Cross-validation, a resampling technique that tests and trains a model
at different iterations using different parts of the data, is the first step in analysing
the results for machine learning models. It runs on the entire dataset and is based on the number of folds (8 folds), matching the number of years in the dataset. It is primarily used in cases where the goal is prediction, as it can measure how accurately a predictive model will perform in practice. The following machine learning classifiers will be utilised for the comparison of the results:
• Logistic Regression
• KNN
• Decision Tree
• GaussianNB
• Random Forest
• LGBM
• XGBoost
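The 8-fold cross-validation comparison can be sketched as follows (synthetic data and a single classifier for brevity; the project runs this over each classifier listed above):

```python
# Sketch of the 8-fold cross-validation step with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(320, 21))          # 21 rolling-window features
y = (X[:, -1] > 0).astype(int)          # synthetic "Up"/"Down" label

cv = KFold(n_splits=8, shuffle=False)   # 8 folds, roughly one per year of data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("per-fold accuracy:", scores.round(3), "mean:", round(scores.mean(), 3))
```

For genuinely ordered financial data, scikit-learn's TimeSeriesSplit is a common alternative to plain KFold, since it never trains on observations that come after the test fold.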
Figure 4.1 shows the range of accuracy provided by each classifier after cross-validation; for readability, just the maximum accuracy over the eight training/testing folds is shown below:
Below, the average accuracy results for each classifier are shown. Based on all the
results, it is possible that the choice will not only be based on accuracy scores but
also consistency and other factors. In order to get the highest accuracy, the selected
classifiers will also undergo hyper-parameter tuning; the resulting average accuracies
are as follows:
• LDA: 52.40%
• KNNeighbors: 49.47%
• SVM: 53.09%
• LGBM: 52.41%
• XGB: 52.58%
Based on all the findings from this step, the following collection of classifiers was chosen for further evaluation:
• SVM.
• XGBoost: among the most recommended approaches for machine learning classification models.
• Random Forest: the model that allowed the highest accuracy obtained through hyper-parameter adjustment.
In addition, a custom ANN model will be created and included in the analysis and comparison.
After executing the training for each of the selected models, all the metrics indicated earlier in the evaluation methods section will be computed, then analysed and evaluated. The confusion matrix, classification report, precision-recall curve, and ROC curve will be illustrated. After that, all the models will be compared globally.
SVM result analysis
• Accuracy: 48.99%
• Precision: 65.62%
• Recall: 51.41%
• F1: 57.65%
The classification report, which is used to evaluate the accuracy of predictions made
using a classification model, is illustrated in figure 4.4. It includes all the necessary
metrics for each class, such as accuracy, precision, recall, and the f1-score.
The precision-recall curve shows the trade-off between precision and recall at different thresholds; the curve for the SVM model is shown below. The ROC curve (Receiver Operating Characteristic Curve) is a plot showing the model’s true positive rate against its false positive rate at various classification thresholds.
In conclusion, the SVM model does not provide the expected results as it is lower
than the minimum required accuracy for this kind of classifier which is 50%.
XGBoost result analysis
• Accuracy: 52.56%
• Balanced Accuracy: 51.68%
• Precision: 70.70%
• Recall: 53.94%
• F1: 61.22%
For Random Forest, a typical run was performed at the start, but then hyper-parameter tuning was applied to obtain the best possible parameters for the model, and hence the best achievable accuracy score, starting with the evaluation metrics:
• Accuracy: 53.51%
• Precision: 86.07%
• Recall: 53.79%
• F1: 66.21%
After hyper-parameter tuning, the Random Forest classifier was able to hit 54.65% accuracy, and it seems that it performed the best among all the machine learning classifiers chosen. However, the only drawback is that this classifier has a very high precision rate for class “1” compared to other models with lower accuracy, which have a closer rate between the two classes.
ANN result analysis
The result analysis for the ANN model has some similar points to the other machine
learning models. However, ANN will have other comparison aspects, as the focus will
be on the loss and accuracy, in addition to the validation set, which validates each training epoch and calculates the validation loss and validation accuracy to help notice any over/underfitting during training. Also, an extra callback was added that saves the best model during the whole training, ModelCheckpoint, which keeps the model with the lowest validation loss. EarlyStopping was added and tested; however, it seemed unhelpful in this case, so the focus remained on ModelCheckpoint. The evaluation metrics are:
• Accuracy: 52.38%
• Precision: 85.82%
• Recall: 53.24%
• F1: 65.71%
The following plots show the training and validation loss/accuracy as a function of the epoch during model training, which helps analyse the training and look for any over/underfitting; the training loss/accuracy is drawn as the purple line. It can be observed from the figures that the result does not improve as training continues.
Table 4.1.1 above presents the final global comparison of the classifiers (green indicates the highest value in a column; yellow marks a model considered for selection).
The major point noticed from the prior confusion matrices and classification reports is that most classifiers have a high precision rate for class “1”, which results in high precision but low accuracy, reducing the models’ efficiency. This prompts us to consider not only the most accurate model but also the more stable and balanced models, which are the XGBoost model and the Random Forest model.
The linear model, the ensemble regression models, and the RNN model make up the three primary components of the regression section. The same time-series feature used for classification is the input for all models, and the target is the daily price percent change. To determine how accurately each model forecasts the daily percent change, the assessment is based on three metrics (MAE, MSE, and RMSE), in addition to a comparison plot that displays the predicted values against the actual ones.
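The three regression metrics can be computed with scikit-learn as in the following generic sketch (not the project's exact code):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def regression_metrics(y_true, y_pred):
    """Compute the three evaluation metrics used for the regression models."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # RMSE is the square root of the MSE
    return mae, mse, rmse
```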
Results of Model evaluation metrics for Linear Regression (RMSE, MSE, MAE):
In figure 4.18 below, the prediction on the training set is shown as a red line, the prediction on the testing set as a green line, and the actual daily price percent changes as a blue line. It appears that the linear regression model was unable to produce values close to the actual ones.
Figure 4.18: Comparison between original and predicted values for Linear Regression
As plotted in figure 4.19 below, the XGBoost Regressor achieved a very close prediction on the training set and a strong result on the testing set.
Figure 4.19: Comparison between original and predicted values for XGB Regressor
Random Forest Regressor Model:
Model evaluation metrics RMSE, MSE, MAE (for Random Forest Regressor):
The results from the Random Forest model are not as strong as those of the XGBoost Regressor, as shown in figure 4.20.
Figure 4.20: Comparison between original and predicted values for Random Forest
Regressor
Despite having a lower loss value, the LSTM model produces predictions that are very far from the actual values, as shown in figure 4.21.
Figure 4.21: Comparison between original and predicted values for LSTM
By attempting to predict the price returns using traditional regression models and various classification approaches, the experiments covered all the potential models that might be employed in such a field. The Random Forest Classifier and the XGBoost Classifier are the best models for classification, while for regression the XGBoost Regressor predicted the anticipated values rather well. The major point that stands out is that XGBoost is the best model in both classification and regression, as it is known for achieving the best performance among boosted tree algorithms with great computational speed.
An additional step converts the numerical prediction results of the regression into categorical results divided into two classes, “1” or “0”: each prediction value becomes “1” if it is positive and “0” if it is negative, which makes it possible to calculate accuracy from the transformed results. The accuracy is calculated first on the testing set and then on the entire dataset.
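This conversion can be sketched as follows (treating zero as class “0” is an assumption here, since the text only specifies the positive and negative cases):

```python
import numpy as np

def to_classes(pred_changes):
    """Map predicted daily percent changes to classes:
    1 ("Up") if the value is positive, 0 ("Down") otherwise."""
    return (np.asarray(pred_changes) > 0).astype(int)

def accuracy_from_regression(pred_changes, y_true):
    """Accuracy of a regressor's predictions after conversion to classes."""
    return float((to_classes(pred_changes) == np.asarray(y_true)).mean())
```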
Moving on to another part, all the models are combined into one predictor that utilises a majority voting system: the predictors’ results are gathered into one array and iterated over data point by data point. With the majority voting system, the decision is made based on the class that receives the most votes, so the outcomes of five predictions from the various models are merged into a single prediction, from which an overall accuracy value can be calculated.
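The voting rule can be sketched as follows, assuming the five models’ class votes are stacked into one array as described:

```python
import numpy as np

def majority_vote(predictions):
    """predictions: array-like of shape (n_models, n_samples) with 0/1 votes.
    Returns, for each data point, the class that receives the most votes."""
    votes = np.asarray(predictions)
    # With five models the vote count is odd, so ties cannot occur.
    return (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
```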
Figure 4.22 shows how the voting process actually works:
Figure 4.22: Explaining the created voting system that merges multiple classification
and regression models into one predictor
In conclusion, the classification models offered higher accuracy than the regression models. The combination method can be improved in future work by adding more prediction models or refining the voting system, and it can readily be developed into a predictor more reliable than any single machine learning model.
Cryptocurrencies have repercussions at the economic and individual levels: they affect state monetary policy in terms of the money supply, owing to the lack of state control over issuance; they affect fiscal policy by facilitating tax evasion; and they affect the payments and credit system through the absence of intermediation. At the same time, they have gained wide acceptance thanks to dealers’ confidence, their fluidity in transactions, and the efficiency and speed offered by widespread trading platforms. On the other hand, prices can fluctuate sharply within a day or even within minutes. Therefore, this project is intended for research purposes only, and we cannot recommend using its results to make investment decisions.
One of the significant limitations of the project was the dataset used for training: the lack of external features meant most of the work was based on time-series features, namely the daily price percent variation of Bitcoin over 2874 days. This is a significant factor behind the low accuracy values, with most of the classifiers distracted into predicting a single class. However, the overall approaches provided reliable models that can be integrated into a useful trading tool.
In the same matter, many other factors can be related to cryptocurrency price variation; some of them are unpredictable, like economic crises or wars, and other factors could be suitable for prediction but were not experimented with here, as most of them are beyond the scope of this project.
However, things are a bit different regarding the cryptocurrency market for various reasons, one of which is that blockchain and cryptocurrencies are still mostly unheard of, especially when talking about the Web 3.0 era, the future of the web. On the other hand, the field is being advanced by extensive research and analysis, which will improve as more data become available.
In this chapter, a thorough analysis of the chosen classification and regression models was presented, along with metric assessments and the methodology used to distinguish between the regression models, since the provided metrics alone cannot adequately convey how a model would function in actual case studies. After analysing all the results, the classification models considered the best were the Random Forest and XGBoost classifiers, for several reasons, starting with the evaluation metrics: in addition to accuracy, the precision, and how balanced the precision was across both classes, were considered. Overall, most models had a high precision rate for the “1” class, which represents the “Up” class; the Random Forest model, for instance, has high accuracy but was less balanced than the XGBoost model (based on the confusion matrices). For regression, the XGBoost Regressor was picked based on the simulation performed, comparing the predicted values with the actual ones.
Chapter 5
Conclusion
The conclusion of the findings from the several parts is the main focus of this chapter. It begins by evaluating the testing and analysis conducted in the previous chapter, discussing the advantages of each approach, particularly as this project works with two different machine learning approaches, regression and classification; their key points are examined together with the evaluation section. Following that, the methodology used is discussed, along with how it affected the project’s construction. The conclusions also analyse the findings and demonstrate how this dissertation’s goals and objectives were met, ending with suggestions for future research.
According to all of the prior analysis and metric calculations, whether for the classification or the regression approach, the overall result is that the average accuracy for both is about 53% (precisely 52.91%), alongside an over 60% precision rate for class “1” (calculated only for the classification models), which represents the price increasing on that day. The Random Forest classifier was the top-performing model, outperforming both the classification and regression models with an accuracy rate of 54.65%. Such models could anticipate not only the return of the Bitcoin price but also market variety and fluctuations; however, in this instance, they are employed for research and study reasons.
Machine learning models were thus used to forecast returns. Additionally, as discussed in the previous section,
a method was used to combine all the models and use a voting strategy to decide the
prediction. This method proved somewhat effective, producing predictions with an accuracy of about 53%, but it still requires additional work and improvement before it can be relied upon.
As discussed in the previous chapter, the machine learning development method was used to ensure that this project adheres to the standards for machine learning projects. The methodology offers a structured approach to follow while allowing flexibility in solving this machine learning problem, and is based on these axes: Data collection, Data engineering, Model training, Model optimization, and, finally, Model evaluation. In this project, the process allows flexibility by isolating each central element even though it is connected to the next or previous one. Any action can be taken in this manner without endangering other parts (in some cases, a further change must be made in other sections, depending on the type of change). This advantage was obvious when concentrating on the model training and model optimization parts: returning to these specific parts to apply further changes did not require reviewing the entire project from the start or from a certain point. This benefit was made possible by the methodology used, which provides numerous advantages and can be maintained in future work.
5.2 Conclusion
This study’s primary objective was to determine whether machine learning models can forecast cryptocurrency returns, using classification models to predict whether the price will move “Up” or “Down” on a particular
day. As was mentioned in the Literature Review, there are numerous financial
theories that are relevant to this topic and should be taken into account, but when it
comes to the machine learning component, the majority of the work is still in its
infancy, and many are constrained by or reliant upon one approach. The
methodology utilised in this dissertation was systematic, beginning with the data
collection from the Yahoo Finance API, then producing two datasets, one for
classification (categorical, which showed two classes “1” implies the price will climb,
“0” means the price will decline). The other was for regression, which used numbers
to show the daily percent change in price, ranging from -1 to 1. Both datasets were
built using the daily close price of Bitcoin over eight years of data. Based on 21
timesteps, time series were created as the main feature of the prediction. Initial
cross-validation was carried out for the classification strategy to help determine which
machine learning classifiers should be considered for the next training stage. Along with the customised ANN model, the initial selection included the SVM, XGBoost, and Random Forest classifiers. Random Forest outperformed all the other classifiers listed, followed by XGBoost, which has slightly lower accuracy than Random Forest but a more balanced score, especially in precision.
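The 21-timestep feature construction described above can be sketched as follows, assuming the input is the series of daily percent changes:

```python
import numpy as np

def make_windows(pct_changes, timesteps=21):
    """Build sliding-window samples: each row of X holds the previous
    `timesteps` daily percent changes; y holds the next day's value."""
    s = np.asarray(pct_changes, dtype=float)
    X = np.stack([s[i:i + timesteps] for i in range(len(s) - timesteps)])
    y = s[timesteps:]
    return X, y
```

For the classification dataset, the same windows are used but `y` is mapped to 1 when the next day's change is positive and 0 otherwise.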
The major regression indicators for each model were highly similar, making it difficult to determine how well a model would perform. As a result, the evaluation of the regression models was slightly different for each chosen model: a plot comparing the anticipated price percent changes with the original values served as the basis for the comparison. It reveals that the XGBoost Regressor outperformed all evaluated models, including LSTM, the model usually best suited to time-series projects. The outputs of the regression models were converted into categorical labels with “0” and “1” classes, and an additional method was devised to calculate accuracy, allowing scores from regression and classification to be compared.
The objective of this dissertation was to develop machine learning models that can forecast a particular cryptocurrency’s return by predicting whether the price will go “up” or “down” in the upcoming days. The models achieved a highest accuracy of 54%, with a high precision rate for class “1” (“Up”) in most of them. Most of the selected models can be employed in various profitable, analytical, or related projects, for example trading strategies, trading bots, or analysis tools.
The initial voting strategy was utilised to avoid the forecast being made by a single decision maker, which opens the door for additional work that can be implemented and enhanced based on the voting approach to predict cryptocurrency market returns; similar methods exist with the same principle but support only classification alone or regression alone.
More features are required for this subject, as it has been demonstrated in this dissertation that time series alone are insufficient to achieve higher results than those obtained. One avenue for additional work is to research further features that interact directly with cryptocurrency market prices, such as finance-related ones; the lack of features when building the dataset was this project’s one major weakness. In conclusion, this project showed that predicting cryptocurrency returns with machine learning models is feasible and could be a useful tool for many other fields or projects.
Aggarwal, G., Patel, V., Varshney, G., and Oostman, K. (2019). Understanding the
social factors affecting the cryptocurrency market. arXiv preprint
arXiv:1901.06245.
Aysan, A. F., Demir, E., Gozgor, G., and Lau, C. K. M. (2019). Effects of the
geopolitical risks on bitcoin returns and volatility. Research in International
Business and Finance, 47:511–518.
Bekkar, M., Djemaa, H. K., and Alitouche, T. A. (2013). Evaluation measures for
models assessment over imbalanced data sets. J Inf Eng Appl, 3(10).
Böhme, R., Christin, N., Edelman, B., and Moore, T. (2015). Bitcoin: Economics, technology, and governance. Journal of Economic Perspectives, 29(2):213–38.
Buchholz, M., Delaney, J., Warren, J., and Parker, J. (2012). Bits and bets,
information, price volatility, and demand for bitcoin. Economics, 312(1):2–48.
Campbell, J. Y., Lo, A. W., MacKinlay, A. C., and Whitelaw, R. F. (1998). The
econometrics of financial markets. Macroeconomic Dynamics, 2(4):559–562.
Cao, L. and Tay, F. E. (2001). Financial forecasting using support vector machines.
Neural Computing & Applications, 10(2):184–192.
Chai, T. and Draxler, R. R. (2014). Root mean square error (rmse) or mean absolute
error (mae)?–arguments against avoiding rmse in the literature. Geoscientific
model development, 7(3):1247–1250.
Chang, P.-C., Fan, C.-Y., and Lin, J.-L. (2011). Trend discovery in financial time
series data using a case based fuzzy decision tree. Expert Systems with
Applications, 38(5):6070–6080.
Chniti, G., Bakir, H., and Zaher, H. (2017). E-commerce time series forecasting using
lstm neural network and support vector regression. In Proceedings of the
international conference on big data and Internet of Thing, pages 80–84.
Ciaian, P., Rajcaniova, M., et al. (2018). Virtual relationships: Short-and long-run
evidence from bitcoin and altcoin markets. Journal of International Financial
Markets, Institutions and Money, 52:173–195.
Copeland, T. E., Weston, J. F., Shastri, K., et al. (2005). Financial theory and
corporate policy, volume 4. Pearson Addison Wesley Boston.
Dwork, C. and Naor, M. (1992). Pricing via processing or combatting junk mail. In
Annual international cryptology conference, pages 139–147. Springer.
Gaur, D., Mehrotra, D., and Singh, K. (2022). Estimation of particulate matter PM2.5 concentration using random forest regressor with hyperparameter tuning. In 2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), pages 465–469.
Haykin, S. (2009). Neural networks and learning machines, 3/E. Pearson Education
India.
Hirshleifer, D. and Hong Teoh, S. (2003). Herd behaviour and cascading in capital
markets: A review and synthesis. European Financial Management, 9(1):25–66.
Hobson, D. (2013). What is bitcoin? XRDS: Crossroads, The ACM Magazine for
Students, 20(1):40–44.
Iinuma, A. (2018). Why is the cryptocurrency market so volatile: Expert take. Coin
Telegraph.
Kara, Y., Boyacioglu, M. A., and Baykan, Ö. K. (2011). Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul Stock Exchange. Expert Systems with Applications, 38(5):5311–5319.
Kim, K.-j. (2003). Financial time series forecasting using support vector machines.
Neurocomputing, 55(1-2):307–319.
Kotsiantis, S. B., Zaharakis, I., Pintelas, P., et al. (2007). Supervised machine
learning: A review of classification techniques. Emerging artificial intelligence
applications in computer engineering, 160(1):3–24.
Lee, T. K., Cho, J. H., Kwon, D. S., and Sohn, S. Y. (2019). Global stock market
investment strategies based on financial network indicators using machine learning
techniques. Expert Systems with Applications, 117:228–242.
Lukyanenko, R., Castellanos, A., Parsons, J., Chiarini Tremblay, M., and Storey,
V. C. (2019). Using conceptual modeling to support machine learning. In
International Conference on Advanced Information Systems Engineering, pages
170–181. Springer.
Lukyanenko, R., Parsons, J., and Storey, V. C. (2018). Modeling matters: Can
conceptual modeling support machine learning? AIS SIGSAND, pages 1–12.
Mai, F., Shan, Z., Bai, Q., Wang, X., and Chiang, R. H. (2018). How does social
media impact bitcoin value? a test of the silent majority hypothesis. Journal of
management information systems, 35(1):19–52.
Maimon, O. Z. and Rokach, L. (2014). Data mining with decision trees: theory and applications, volume 81. World Scientific.
Malkiel, B. G. (2003). The efficient market hypothesis and its critics. Journal of Economic Perspectives, 17(1):59–82.
Mills, T. C., Coutts, J. A., and Roberts, J. (1996). Misspecification testing and
robust estimation of the market model and their implications for event studies.
Applied Economics, 28(5):559–566.
Niu, T., Wang, J., Lu, H., Yang, W., and Du, P. (2020). Developing a deep learning
framework with two-stage feature selection for multivariate financial time series
forecasting. Expert Systems with Applications, 148:113237.
Nobre, J. and Neves, R. F. (2019). Combining principal component analysis, discrete
wavelet transform and xgboost to trade in the financial markets. Expert Systems
with Applications, 125:181–194.
Philippas, D., Rjiba, H., Guesmi, K., and Goutte, S. (2019). Media attention and
bitcoin prices. Finance Research Letters, 30:37–43.
Polasik, M., Piotrowska, A. I., Wisniewski, T. P., Kotkowski, R., and Lightfoot, G.
(2015). Price fluctuations and the use of bitcoin: An empirical inquiry.
International Journal of Electronic Commerce, 20(1):9–49.
Quinlan, J. R. et al. (1992). Learning with continuous classes. In 5th Australian joint
conference on artificial intelligence, volume 92, pages 343–348. World Scientific.
Singh, J. E., Babshetti, V., and Shivaprasad, H. (2021). Efficient market hypothesis
to behavioral finance: A review of rationality to irrationality. Materials Today:
Proceedings.
Smith, G. and Ryoo, H.-J. (2003). Variance ratio tests of the random walk hypothesis
for european emerging stock markets. The European Journal of Finance,
9(3):290–300.
Tay, F. E. and Cao, L. (2002). Modified support vector machines in financial time
series forecasting. Neurocomputing, 48(1-4):847–861.
Wang, W. and Lu, Y. (2018). Analysis of the mean absolute error (mae) and the root
mean square error (rmse) in assessing rounding model. In IOP conference series:
materials science and engineering, volume 324, page 012049. IOP Publishing.
Weisberg, S. (2005). Applied linear regression, volume 528. John Wiley & Sons.
Figure A.1: Candlestick plot for Bitcoin price variation over the eight years of collected data
Figure A.7: Overview of the class distribution in the ANN model (high precision rate for class ‘1’)
Figure A.8: Predicting the next 10 days’ price variations using the XGBoost Regressor
Figure A.9: Predicting the next 10 days’ price variations using the Random Forest Regressor