
STOCKS ANALYSIS AND PREDICTION USING BIG DATA ANALYTICS
A Major Project Report

Submitted in partial fulfilment of the requirements for the award of the degree of

Bachelor of Engineering

In

Computer Science and Engineering

By

DEEPTHI MANE 245621733097

MUTTINENI SADHVIKA 245621733104

NALLA BHARGAVI 245621733312

Under the Esteemed guidance of

Dr. A. USHASREE

Assistant Professor,

Dept. of CSE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


Gokaraju Lailavathi Engineering College
(Affiliated to Osmania University)
Bachupally, Kukatpally, Hyderabad – 500090
(2024-2025)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Gokaraju Lailavathi Engineering College
(Affiliated to Osmania University)
Bachupally, Kukatpally, Hyderabad – 500090
(2024-2025)

CERTIFICATE

This is to certify that the Major Project report entitled “Stock Analysis and Prediction
Using Big Data Analytics” being submitted by Deepthi Mane (245621733097), Muttineni
Sadhvika (245621733104) and Nalla Bhargavi (245621733312) in partial fulfilment of the
requirements for the award of the degree of Bachelor of Engineering in “Computer Science
and Engineering”, O.U., Hyderabad, during the year 2024-2025 is a record of bona fide work
carried out by them under my guidance. The results presented in this project have been
verified and are found to be satisfactory.

Project Guide HOD


Dr. A. Ushasree Dr. Padmalaya Nayak

Assistant Professor Professor, HOD

Dept of CSE Dept of CSE

External Examiner(s)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Gokaraju Lailavathi Engineering College

(Affiliated to Osmania University)

Bachupally, Kukatpally, Hyderabad – 500090
(2024-2025)

DECLARATION

We, Deepthi Mane bearing Ht. No. 245621733097, Muttineni Sadhvika bearing
Ht. No. 245621733104, and Nalla Bhargavi bearing Ht. No. 245621733312, hereby certify that the
major project entitled “Stocks Analysis and Prediction Using Big Data Analytics” is
submitted in partial fulfilment of the requirements for the award of the degree of Bachelor of
Engineering in Computer Science and Engineering. This is a record of work carried out by us
under the guidance of Dr. A. Ushasree, Assistant Professor, Dept. of CSE, Gokaraju Lailavathi
Engineering College, Bachupally. The results embodied in this report have not been
reproduced or copied from any source, and have not been submitted to any other university
or institute for the award of any other degree or diploma.

DEEPTHI MANE 245621733097

MUTTINENI SADHVIKA 245621733104

NALLA BHARGAVI 245621733312

ACKNOWLEDGEMENT

We, Deepthi Mane (245621733097), Muttineni Sadhvika (245621733104), and Nalla Bhargavi
(245621733312), extend our heartfelt appreciation to those who have been instrumental in
the completion of this project.

Our deepest gratitude goes to our mentor, Dr. A. Ushasree, Assistant Professor, Gokaraju
Lailavathi Engineering College. Her unwavering support, insightful guidance, and
constructive feedback have been invaluable throughout our research journey.

We are particularly thankful to our Project Coordinator, Dr. V. Harika, Professor, Gokaraju
Lailavathi Engineering College, whose timely advice and constant encouragement have
significantly contributed to the success of our work.

Our sincere appreciation extends to Dr. Padmalaya Nayak, Professor & Head of the
Department, Gokaraju Lailavathi Engineering College, for the motivational words and
invaluable input that helped shape our project.

We would like to express our gratitude to Dr. A. Sai Hanuman, Principal, Gokaraju
Lailavathi Engineering College, for fostering an environment conducive to academic
growth and research.

Our thanks also go to all faculty members and staff of the Computer Science and Engineering
Department for their support and encouragement throughout our academic journey.

Lastly, we are deeply indebted to our friends and families for their unwavering emotional
support and understanding, which have been crucial in helping us complete this project.

TABLE OF CONTENTS

S.No Content Pg. No

1 Introduction 1-3
1.1 Objective 1
1.2 Problem Statement 2
1.3 Existing System 2
1.4 Proposed System 3
2 Literature Survey 4-6
2.1 Related Works 4
3 Software Requirements Specification 7-10
3.1 Software Requirements 7
3.2 Hardware Requirements 8
3.3 Functional Requirements 9
3.4 Non-Functional Requirements 10
4 System Architecture & UML Diagrams 11-18
4.1 System Architecture 11
4.2 UML Diagrams 12
5 Methodology 19-28
5.1 Modules 19
5.2 Methodology Proposed 22
5.3 Executable Code 24
6 Testing 29-31
6.1 Testing Definition 29
6.2 Unit Testing 29
6.3 Integration Testing 30
6.4 Acceptance Testing 30
6.5 Test Cases 30
7 Results 32
8 Conclusion 38
8.1 Conclusion 38
8.2 Future Scope 38
9 References 39

LIST OF FIGURES

S.No Figure Title Pg. No

1 System Architecture Diagram 11
2 UML Hierarchy Diagram 13
3 Use Case Diagram 14
4 Class Diagram 15
5 Activity Diagram 16
6 Sequence Diagram 17
7 Component Diagram 18
8 Results 32

LIST OF TABLES

Table No Table Name Pg. No

1 6.5.1 Test Case 1 30
2 6.5.2 Test Case 2 31
3 6.5.3 Test Case 3 31

ABSTRACT

Big data analytics have become integral across various sectors for accurate prediction, pattern
recognition, and insightful analysis of massive datasets. These techniques enable the
discovery of valuable insights that would otherwise remain hidden in traditional data
processing systems. In this paper, we propose a robust Cloudera-Hadoop-based data pipeline
designed to handle and analyse data of any scale and type, providing a scalable and
fault-tolerant architecture for real-time analytics. The pipeline integrates the Apache Hadoop
big-data ecosystem, leveraging HDFS for distributed storage and YARN for resource
management, ensuring efficient processing of high-volume stock market data. As a use case,
we focus on selected stocks from the US stock market and utilize real-time financial data
sourced from Yahoo Finance to analyse and predict daily stock gains. The data pipeline
preprocesses raw stock data, which is then split into training and testing datasets. To enhance
forecasting accuracy, we employ a hybrid approach combining the Autoregressive Integrated
Moving Average (ARIMA) model with Long Short-Term Memory (LSTM) networks.
ARIMA captures linear trends in the data, while LSTM models address non-linear patterns,
providing a comprehensive analysis. The performance of each model is evaluated based on
accuracy, precision, and recall, and the best-performing model is used to forecast stocks with
high daily returns. The proposed solution demonstrates the capability of big data technologies
in financial analytics, providing a scalable, efficient, and accurate mechanism for real-time
stock prediction. This framework can be extended to other domains where large-scale,
real-time data processing and analytics are required.

CHAPTER-1

INTRODUCTION

Big data has become greatly important to the growth of many different sectors. It has been
extensively employed by business organizations to formalize important business insights and
intelligence. It has also been utilized by the healthcare sector to discover important patterns
and knowledge that improve modern healthcare systems. Big data likewise holds significant
importance for the information technology and cloud computing sectors. Recently, the finance
and banking sectors have utilized big data to track financial market activity. Big data analytics
and network analytics have been used to catch illegal trading in the financial markets.
Similarly, traders, big banks, financial institutions and companies have utilized big data to
generate the trade analytics used in high-frequency trading. Big data analytics has also helped
in the detection of illegal activities such as money laundering and financial fraud. In this
paper, we build a system that analyses US oil stocks to predict daily gains in US stocks based
on real-time data from Yahoo Finance. All 13 stocks in the US Oil Fund are picked up, and
their daily gain data are divided into training and test data sets to predict the stocks with high
daily gains using the machine learning module of Spark. Based on our analysis, we propose a
robust Cloudera-Hadoop-based data pipeline that can perform this analysis for any type and
scale of data. Studying the live stream of US oil stock prices can help us better understand
how the US Oil index affects the prices of other stocks in US oil funds. It can also help us
predict profitable stocks for stock traders and provide profits to the US oil stock trading
community.

1.1 OBJECTIVE

1. The goal is to harness the capabilities of big data by developing robust machine learning
models, specifically ARIMA and LSTM algorithms. These models are intended to analyze
vast quantities of data, with the overarching objective of predicting stock movements.
Emphasizing the significance of big data processing, the project aims to uncover meaningful
insights for stock traders and facilitate data-driven decision-making.

2. The objective is to implement PySpark, a Python API for Apache Spark, to efficiently
process large volumes of stock data. Acknowledging the streaming nature of this data, the
project aims to ensure the system can handle substantial datasets and remain responsive to
real-time changes in stock prices.

3. The project's objective is to apply the XGBoost algorithm to refine and enhance the
relevance of features within the stock data. By systematically removing irrelevant variables,
the goal is to improve the accuracy of the predictions made by the machine learning models,
contributing to more reliable outcomes in line with the overarching goal of providing
valuable insights for stock traders.

4. The project aims to assess the performance of the ARIMA and LSTM algorithms using
the R-squared metric. This provides a quantitative measure of how well the models predict
stock movements: a higher R-squared value signifies a more effective predictive capability,
aiding the selection of the most suitable algorithm for comprehensive stock analysis.

1.2 PROBLEM STATEMENT

This project develops a stock analysis and prediction system using big data analytics. Illicit
trading in the financial markets has been detected through the use of network analytics and
big data analytics. In a similar vein, traders, large banks, corporations, and financial
organizations use big data to produce the trade analytics used in high-frequency trading.

1.3 EXISTING SYSTEM:

Recently, the finance and banking sectors utilized big data to track the financial market activity.
Big data analytics and network analytics were used to catch illegal trading in the financial
markets. Similarly, traders, big banks, financial institutions and companies utilized big data
for generating trade analytics utilized in high frequency trading. Besides, big data analytics
also helped in the detection of illegal activities such as: money laundering and financial
frauds.

1.3.1 DISADVANTAGES OF EXISTING SYSTEM:

1. Existing systems do not produce accurate predictions from the data.

2. Existing systems are unable to analyse large data sets.

1.4 Proposed System:

In this paper, we build a system that analyses US oil stocks to predict daily gains in US
stocks based on real-time data from Yahoo Finance. All 13 stocks in the US Oil Fund are
picked up, and their daily gain data are divided into training and test data sets to predict the
stocks with high daily gains using the machine learning module of Spark. Based on our
analysis, we propose a robust Cloudera-Hadoop-based data pipeline that can perform this
analysis for any type and scale of data.

1.4.1 Advantages of proposed system:

1. It supports all sorts of complex analysis.

2. It is faster.

3. With a lot of ML tools available, deciding on a tool that can perform the analysis and
implement ML algorithms efficiently has been a daunting task.

4. It provides a flexible platform for implementing ML algorithms.

CHAPTER-2
LITERATURE SURVEY

2.1 Price Trend Prediction of Stock Market Using Outlier Data Mining
Algorithm:

ABSTRACT: In this paper we present a novel data mining approach to predict the long-term
behavior of stock trends. Traditional techniques for stock trend prediction have shown their
limitations when using time series algorithms or volatility modelling on price sequences. In
our research, a novel outlier mining algorithm is proposed to detect anomalies on the basis of
the volume sequence of high-frequency tick-by-tick data from the stock market. Such
anomalous trades always interfere with the stock price. By using the cluster information of
such anomalies, our approach predicts the stock trend effectively in the real-world market.
Experimental results show that our proposed approach makes profits on the Chinese stock
market, especially in long-term usage.

2.2 Stock price prediction using data analytics:

ABSTRACT: Accurate financial prediction is of great interest for investors. This paper
proposes the use of data analytics to assist investors in making the right financial prediction
so that the right investment decisions can be taken. Two platforms are used: Python and R.
Various techniques like ARIMA, Holt-Winters, neural networks (feed-forward and
multi-layer perceptron), linear regression and time series are implemented to forecast the
opening index price performance in R, while in Python a multi-layer perceptron and support
vector regression are implemented for forecasting the Nifty 50 stock price; sentiment
analysis of the stock was also done using recent tweets on Twitter. Nifty 50 (^NSEI) stock
indices are considered as the data input for the implemented methods, using 9 years of data.
The accuracy was calculated using 2-3 years of forecast results in R and 2 months of forecast
results in Python, compared against the actual stock prices. Mean squared error and other
error parameters were calculated for every prediction system, and it was found that the
feed-forward network produces only a 1.81598342% error when forecasting the opening
price of the stock.

2.3 Stock market: Statistical analysis of its indexes and its constituents:

ABSTRACT: The ever-changing realm of the stock market is constantly thriving under a
process of modification and alteration. Making a profit from it is therefore hard and requires
intensive planning, which makes stock market analysis the first and foremost priority for any
financial investment. The behavioural aspects of stock prices, which have a tendency to rise
and fall unexpectedly, lead to a volatile scenario. To acquire insight and extract the best
outcomes, a thorough and consistent analysis is the most popular and tested way. This paper
aims to determine the top high-performing stocks with good returns under a given index that
would be the safest and most beneficial for investment. Using historical data we were able to
obtain the top stocks that are advisable for investment. We also verified our results by
analyzing contemporary data similarly and found that the performance and returns of these
stocks were still high irrespective of volatility.

2.4 Stocks Analysis and Prediction Using Big Data Analytics:

ABSTRACT: Big data is a new and emerging buzzword in today's times. The stock market is
an ever-evolving, volatile, uncertain and intriguingly potential niche, which is an important
extension in finance and business growth and prediction. The stock market has to deal with a
large amount of vast and distinct data to function and draw meaningful conclusions. Stock
market trends depend broadly on two analyses: technical and fundamental. Technical
analysis is carried out using historical trends and market values. On the other hand,
fundamental analysis is done based on sentiments, values, and social media data and
responses. Since large, complex and exponentially growing data is involved, we use big data
analysis to assist in prediction and in drawing accurate business decisions and profitable
investments.

2.5 Stock market prediction: A big data approach:

ABSTRACT: The stock market process is full of uncertainty and is affected by many factors.
Hence, stock market prediction is one of the important exertions in finance and business.
There are two types of analysis possible for prediction: technical and fundamental. In this
paper both technical and fundamental analysis are considered. Technical analysis is done
using historical data of stock prices by applying machine learning, and fundamental analysis
is done using social media data by applying sentiment analysis. Social media data has a
higher impact today than ever and can aid in predicting the trend of the stock market. The
method involves collecting news and social media data and extracting the sentiments
expressed by individuals. The correlation between the sentiments and the stock values is then
analyzed. The learned model can then be used to make future predictions about stock values.
It can be shown that this method is able to predict the sentiment, and that stock performance
and its recent news and social data are closely correlated.

CHAPTER-3

SOFTWARE REQUIREMENT SPECIFICATION

3.1 SOFTWARE REQUIREMENTS

Software requirements deal with defining software resource requirements and prerequisites
that need to be installed on a computer to provide optimal functioning of an application.
These requirements or prerequisites are generally not included in the software installation
package and need to be installed separately before the software is installed.

Platform – In computing, a platform describes some sort of framework, either in hardware or


software, which allows software to run. Typical platforms include a computer’s architecture,
operating system, or programming languages and their runtime libraries.

An operating system is one of the first requirements mentioned when defining system
(software) requirements. Software may not be compatible with different versions of the same
line of operating systems, although some measure of backward compatibility is often
maintained. For example, most software designed for Microsoft Windows XP does not run
on Microsoft Windows 98, although the converse is not always true. Similarly, software
designed using newer features of Linux kernel v2.6 generally does not run or compile
properly (or at all) on Linux distributions using kernel v2.2 or v2.4.

APIs and drivers – Software making extensive use of special hardware devices, like high-end
display adapters, needs special API or newer device drivers. A good example is DirectX,
which is a collection of APIs for handling tasks related to multimedia, especially game
programming, on Microsoft platforms.

Web browser – Most web applications and software depending heavily on Internet
technologies make use of the default browser installed on the system. Microsoft Internet
Explorer is a frequent choice of software running on Microsoft Windows, which makes use
of ActiveX controls, despite their vulnerabilities.

1. Software : Anaconda
2. Primary Language : Python
3. Frontend Framework : Flask
4. Back-end Framework : Jupyter Notebook
5. Database : Sqlite3

6. Front-End Technologies : HTML, CSS, JavaScript and Bootstrap4

3.2 HARDWARE REQUIREMENTS

The most common set of requirements defined by any operating system or software
application is the physical computer resources, also known as hardware. A hardware
requirements list is often accompanied by a hardware compatibility list (HCL), especially in
the case of operating systems. An HCL lists tested, compatible, and sometimes incompatible
hardware devices for a particular operating system or application. The following subsections
discuss the various aspects of hardware requirements.

Architecture – All computer operating systems are designed for a particular computer
architecture. Most software applications are limited to particular operating systems running
on particular architectures. Although architecture-independent operating systems and
applications exist, most need to be recompiled to run on a new architecture. See also a list of
common operating systems and their supporting architectures.

Processing power – The power of the central processing unit (CPU) is a fundamental system
requirement for any software. Most software running on x86 architecture defines processing
power as the model and the clock speed of the CPU. Many other features of a CPU that
influence its speed and power, like bus speed, cache, and MIPS are often ignored. This
definition of power is often erroneous, as AMD Athlon and Intel Pentium CPUs at similar
clock speed often have different throughput speeds. Intel Pentium CPUs have enjoyed a
considerable degree of popularity, and are often mentioned in this category.

Memory – All software, when run, resides in the random access memory (RAM) of a computer.
Memory requirements are defined after considering demands of the application, operating
system, supporting software and files, and other running processes. Optimal performance of
other unrelated software running on a multi-tasking computer system is also considered when
defining this requirement.

Secondary storage – Hard-disk requirements vary, depending on the size of software


installation, temporary files created and maintained while installing or running the software,
and possible use of swap space (if RAM is insufficient).

Display adapter – Software requiring a better than average computer graphics display, like
graphics editors and high-end games, often define high-end display adapters in the system
requirements.

Peripherals – Some software applications need to make extensive and/or special use of some
peripherals, demanding the higher performance or functionality of such peripherals. Such
peripherals include CD-ROM drives, keyboards, pointing devices, network devices, etc.

1) Operating System : Windows only

2) Processor : i5 and above

3) RAM : 8 GB and above

4) Hard Disk : 25 GB in local drive

3.3 FUNCTIONAL REQUIREMENTS

1. Data acquisition & characterisation


2. Data injection
3. Storage
4. Preprocessing
5. Machine learning

3.4 NON-FUNCTIONAL REQUIREMENTS

A NON-FUNCTIONAL REQUIREMENT (NFR) specifies a quality attribute of a software
system. NFRs judge the software system based on responsiveness, usability, security,
portability and other non-functional standards that are critical to the success of the software
system. An example of a non-functional requirement is "how fast does the website load?"
Failing to meet non-functional requirements can result in systems that fail to satisfy user
needs. Non-functional requirements allow you to impose constraints or restrictions on the
design of the system across the various agile backlogs; for example, the site should load
within 3 seconds when the number of simultaneous users is greater than 10,000. The
description of non-functional requirements is just as critical as that of functional
requirements.

● Usability Requirement
● Serviceability Requirement

● Manageability Requirement
● Recoverability Requirement
● Security Requirement
● Data Integrity requirement
● Capacity Requirement
● Availability Requirement
● Scalability Requirement
● Interoperability Requirement
● Reliability Requirement
● Maintainability Requirement
● Regulatory Requirement
● Environmental Requirement

CHAPTER-4

SYSTEM DESIGN

4.1 SYSTEM ARCHITECTURE:

4.2 UML DIAGRAMS

Unified Modelling Language (UML) diagrams are a standardized way of visualizing the
design of a system. UML is widely used in software engineering for specifying, visualizing,
constructing, and documenting the artifacts of software systems. It is similar to the blueprints
used in other fields of engineering. UML is not a programming language; it is rather a visual
language. UML diagrams provide a graphical representation of software components, their
relationships, and their interactions, which helps in understanding, designing, and managing
complex systems. UML has been revised over the years and is reviewed periodically.

Do we really need UML?

● Serves as comprehensive documentation that can be referred to throughout the project
lifecycle, including during maintenance and future development.

● Specifies the structure and behaviour of the system using detailed diagrams, aiding in
clear and precise communication of requirements and design.

● Helps in visualizing complex systems by breaking them down into manageable
components and interactions.

● UML is linked with object-oriented design and analysis. UML makes use of elements
and forms associations between them to form diagrams.

The primary goals in the design of the UML are as follows:

Standardization: Provide a consistent and uniform modelling language across different
projects and teams.

Comprehensive Modelling: Represent various aspects of a system, including structure,
behaviour, and interactions.

Ease of Understanding: Facilitate clear communication and understanding of complex
systems among stakeholders.

Support for Analysis and Design: Aid in the analysis of requirements and the design of
systems with various diagrams addressing different concerns.

Flexibility: Adapt to different methodologies and approaches, such as object-oriented and
service-oriented design.

Tool Support: Enable automated tools for creating, managing, and analysing diagrams to
enhance productivity and accuracy.

Documentation: Offer detailed documentation of systems for future maintenance,
enhancements, and knowledge transfer.

Facilitation of Development: Assist in planning, designing, and implementing software
systems.

Integration: Integrate with other modelling languages and methodologies for a coherent
modelling approach.

Communication of Design: Ensure effective communication of system design and
architecture to all stakeholders.

Types of UML Diagrams:

Structural Diagrams:

Capture static aspects or structure of a system. Structural Diagrams include: Component


Diagrams, Object Diagrams, Class Diagrams and Deployment Diagrams.

Behavior Diagrams:

Capture dynamic aspects or behavior of the system. Behavior diagrams include: Use Case
Diagrams, State Diagrams, Activity Diagrams and Interaction Diagrams.

The image below shows the hierarchy of diagrams according to UML

Figure-4.2.1 UML Hierarchy diagrams

Use case diagram:

A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors
in the system can be depicted.

Class diagram:

The class diagram is used to refine the use case diagram and define a detailed design
of the system. The class diagram classifies the actors defined in the use case diagram into a
set of interrelated classes. The relationship or association between the classes can be either an
"is-a" or "has-a" relationship. Each class in the class diagram may be capable of providing

14
certain functionalities. These functionalities provided by the class are termed "methods" of
the class. Apart from this, each class may have certain "attributes" that uniquely identify the
class.

Activity diagram:

The process flows in the system are captured in the activity diagram. Similar to a state
diagram, an activity diagram also consists of activities, actions, transitions, initial and final
states, and guard conditions.

Sequence diagram:

A sequence diagram represents the interaction between different objects in the system.
The important aspect of a sequence diagram is that it is time-ordered. This means that the
exact sequence of the interactions between the objects is represented step by step. Different
objects in the sequence diagram interact with each other by passing "messages".

Component diagram:

The component diagram represents the high-level parts that make up the system. This
diagram depicts, at a high level, what components form part of the system and how they are
interrelated. A component diagram depicts the components culled after the system has
undergone the development or construction phase.

CHAPTER-5
METHODOLOGY

5.1 MODULES:

1. Read Data using PySpark:

Read stock data from a dataset file using PySpark.

Initialize Spark and set up a Spark Streaming Context.

Create a Spark session.

Read the dataset as a stream or in batches using PySpark classes.

Display the dataset to ensure it has been loaded correctly.
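A minimal sketch of this module is shown below; the file name, application name, and display calls are illustrative assumptions rather than the report's exact code.

from pyspark.sql import SparkSession

# Create a Spark session (the entry point for DataFrame reads).
spark = SparkSession.builder.appName("StockAnalysis").getOrCreate()

# Read the historical stock dataset in batch mode; header=True keeps the
# column names and inferSchema=True parses numeric columns.
stocks = spark.read.csv("stocks.csv", header=True, inferSchema=True)

stocks.printSchema()  # verify the columns loaded as expected
stocks.show(5)        # display the first rows to confirm the load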

2. Data Normalization:

Normalize dataset values to ensure consistency in scale.

Apply data normalization techniques to scale the values of the dataset.

Normalization is crucial when dealing with features that may have different units or ranges,
ensuring that the algorithms can learn effectively from the data.
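As a sketch, the Min-Max scaling named later in Section 5.2 can be applied as follows; the file name and column names are assumptions.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("stocks.csv")
price_cols = ["Open", "High", "Low", "Close", "Volume"]

# Scale each column to [0, 1] as (x - min) / (max - min) so that features
# with different units contribute comparably during training.
scaler = MinMaxScaler()
df[price_cols] = scaler.fit_transform(df[price_cols])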

3. Feature Engineering With XGBoost:

Apply the XGBoost algorithm to perform feature engineering and select relevant features
while excluding irrelevant ones.

XGBoost can help enhance the model's ability to make accurate predictions by focusing on
the most impactful features.
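One plausible sketch of this step, assuming a feature matrix X of candidate columns and the close price as the target, trains an XGBoost regressor and keeps only the features at or above the median importance; the column choice and threshold are assumptions.

import xgboost as xgb
from sklearn.feature_selection import SelectFromModel

# X holds the candidate features; df is the normalized DataFrame from above.
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

model = xgb.XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X, y)

# Keep only the features whose learned importance is at or above the median.
selector = SelectFromModel(model, threshold="median", prefit=True)
selected = X.columns[selector.get_support()]
print("Selected features:", list(selected))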

4. Training ARIMA Model:

Train the ARIMA model on the preprocessed dataset.

Use the preprocessed dataset to train an ARIMA model, specifying the order of the model
(p, d, q).

The ARIMA model is a time series model that captures temporal dependencies in the data.

5. Evaluate ARIMA Model:

Evaluate the performance of the ARIMA model.

Apply the trained ARIMA model to the test dataset.

Evaluate the model's performance using the R-squared metric.

R-squared measures how well the model explains the variance in the test data.
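A compact sketch of training and scoring an ARIMA model follows. It uses the current statsmodels API (statsmodels.tsa.arima.model) rather than the legacy import that appears in Section 5.3, and the (5, 1, 0) order is an assumption; in practice the order is chosen from ACF/PACF plots or a search.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import r2_score

series = pd.read_csv("stocks.csv")["Close"].values
split = int(len(series) * 0.8)             # chronological 80/20 split
train, test = series[:split], series[split:]

# Fit ARIMA(p, d, q) on the training series and forecast the test horizon.
model_fit = ARIMA(train, order=(5, 1, 0)).fit()
forecast = model_fit.forecast(steps=len(test))

print("R-squared:", r2_score(test, forecast))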

6. Training LSTM Model:

Train the LSTM model on the preprocessed dataset.

Set up a Sequential model using Keras with LSTM layers and fit it to the preprocessed
dataset.

LSTM models are effective for capturing long-term dependencies in sequential data.

7. Evaluate LSTM Model:

Evaluate the performance of the LSTM model.

Apply the trained LSTM model to the test dataset.

Evaluate the model's performance using metrics like R-squared.

Compare the R-squared value with the ARIMA model to determine which model performs
better.
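The sketch below shows one way to set up and train such a Sequential LSTM in Keras; the window length, layer sizes, and training settings are illustrative assumptions, and the stand-in series replaces the normalized prices from the earlier module. Evaluation then mirrors the ARIMA sketch above, scoring the test-set predictions with R-squared.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

train_scaled = np.random.rand(500)  # stand-in for the normalized close-price series

def make_windows(series, lookback=60):
    # Turn a 1-D series into (samples, lookback, 1) windows and next-step targets.
    X, y = [], []
    for i in range(lookback, len(series)):
        X.append(series[i - lookback:i])
        y.append(series[i])
    return np.array(X)[..., np.newaxis], np.array(y)

X_train, y_train = make_windows(train_scaled)

model = Sequential([
    Input(shape=(X_train.shape[1], 1)),
    LSTM(50, return_sequences=True),
    LSTM(50),
    Dense(1),  # next-step price
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=20, batch_size=32)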

Algorithms:

1. LSTM (Long Short-Term Memory):

Definition: LSTM is a type of recurrent neural network (RNN) architecture designed to


overcome the vanishing gradient problem in traditional RNNs. It excels at capturing and
learning patterns in sequential data over extended time periods by maintaining a cell state that
can be selectively updated, allowing for better retention of long-term dependencies.

Why Used in the Project: LSTM is employed in the project for its capability to analyze and
learn patterns in time-series data, which is crucial for predicting stock movements. Its ability
to capture long-term dependencies in the sequential nature of stock prices makes it
well-suited for forecasting stock values.

2. ARIMA (AutoRegressive Integrated Moving Average):

Definition: ARIMA is a statistical method used for time-series analysis and forecasting. It
consists of three main components: AutoRegressive (AR), Integrated (I), and Moving
Average (MA). ARIMA models are effective in capturing trends and seasonality within
time-series data.

Why Used in the Project: ARIMA is applied in the project due to its effectiveness in
modeling time-series data, making it suitable for predicting stock price movements over time.
Its ability to account for trends and seasonality in the data complements the project's goal of
forecasting daily gains or losses in the stock market.

3. XGBoost (Extreme Gradient Boosting):

Definition: XGBoost is a powerful machine learning algorithm based on the gradient
boosting framework. It builds a series of decision trees and combines their predictions to
produce a final output. XGBoost is known for its efficiency, speed, and high performance in
various data science applications.

Why Used in the Project: XGBoost is utilized in the project for feature refinement within
the stock data. Its capability to handle large datasets and prioritize the most influential
features makes it suitable for improving the accuracy of predictions. By removing irrelevant
variables, XGBoost contributes to enhancing the overall performance of the machine
learning models in the project.

5.2 Methodology proposed for implementation

The Stock Prediction System aims to forecast future stock prices using big data analytics and
machine learning. It follows a structured methodology involving data collection,
preprocessing, feature engineering, model selection, training, optimization, performance
evaluation, and forecasting. This approach ensures the system delivers reliable predictions for
financial decision-making.

1. Data Collection & Preprocessing

The first step in stock prediction involves acquiring high-quality historical stock price data
from sources such as Yahoo Finance, Alpha Vantage, and Quandl. Additional sources
include financial reports and news sentiment analysis.

Once collected, the raw data undergoes preprocessing to remove inconsistencies and enhance
accuracy. This includes handling missing values using interpolation, normalizing numerical
features with Min-Max scaling, and detecting outliers using Boxplot analysis or Isolation
Forest algorithms. These steps ensure that the dataset remains clean and structured for further
analysis.
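A minimal sketch of the interpolation and Isolation Forest steps named above is shown here; the file name, column names, and 1% contamination rate are assumptions.

import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("stocks.csv", parse_dates=["Date"]).sort_values("Date")
cols = ["Open", "High", "Low", "Close", "Volume"]

df[cols] = df[cols].interpolate()      # fill interior gaps in the series
df = df.dropna(subset=cols)            # drop any remaining leading gaps

iso = IsolationForest(contamination=0.01, random_state=42)
df["flag"] = iso.fit_predict(df[cols])  # -1 marks suspected outliers
df = df[df["flag"] == 1].drop(columns="flag")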

2. Feature Engineering

To improve predictive accuracy, feature engineering is applied to enhance the dataset with
technical indicators. Moving Averages (SMA & EMA) help track trends over different time
periods, while the Relative Strength Index (RSI) measures stock momentum to indicate
whether a stock is overbought or oversold.

Additional indicators such as MACD (Moving Average Convergence Divergence) and


Bollinger Bands provide insights into trend strength and price fluctuations. Furthermore,
Natural Language Processing (NLP) techniques analyze financial news and investor
sentiment to capture external market influences.
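The indicators described above can be computed directly with pandas, as in the sketch below; the window lengths follow common conventions and are assumptions here.

import pandas as pd

df = pd.read_csv("stocks.csv")
close = df["Close"]

df["SMA_20"] = close.rolling(20).mean()                 # simple moving average
df["EMA_20"] = close.ewm(span=20, adjust=False).mean()  # exponential moving average

# RSI: ratio of average gain to average loss over 14 periods.
delta = close.diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
df["RSI"] = 100 - 100 / (1 + gain / loss)

# MACD: 12-period EMA minus 26-period EMA, with a 9-period signal line.
ema12 = close.ewm(span=12, adjust=False).mean()
ema26 = close.ewm(span=26, adjust=False).mean()
df["MACD"] = ema12 - ema26
df["MACD_signal"] = df["MACD"].ewm(span=9, adjust=False).mean()

# Bollinger Bands: SMA plus/minus two rolling standard deviations.
std20 = close.rolling(20).std()
df["BB_upper"] = df["SMA_20"] + 2 * std20
df["BB_lower"] = df["SMA_20"] - 2 * std20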

3. Model Selection & Implementation

Several machine learning models are tested to determine the most accurate prediction algorithm.

The ARIMA (AutoRegressive Integrated Moving Average) model is ideal for time-series
forecasting, capturing trends based on historical stock behavior. For deeper pattern
recognition, LSTM (Long Short-Term Memory Neural Network) is used to analyze
sequential dependencies in financial data. Additionally, Random Forest Regression, an
ensemble learning method, utilizes multiple decision trees to provide robust predictions by
minimizing overfitting.

To ensure model effectiveness, these methods are benchmarked against traditional approaches
such as Linear Regression and Support Vector Regression (SVR).

4. Model Training & Optimization

The dataset is divided into training (80%), validation (10%), and test (10%) sets to ensure
proper evaluation of the models. Hyperparameter tuning plays a vital role in improving
accuracy. Optimization techniques such as Grid Search and Random Search help refine
parameters like learning rate, number of layers, and activation functions. Neural networks are
optimized using Adam and RMSprop optimizers, while cross-validation methods such as
k-fold validation prevent overfitting.
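A sketch of the chronological 80/10/10 split and a small grid search follows; the grid values, the build_lstm helper, and the stand-in arrays are hypothetical, not the project's exact configuration.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.optimizers import Adam

series = np.sin(np.linspace(0, 50, 1000))      # stand-in for the price series
n = len(series)
train_s = series[: int(0.8 * n)]                # 80% training
val_s = series[int(0.8 * n): int(0.9 * n)]      # 10% validation (no shuffling)
test_s = series[int(0.9 * n):]                  # 10% held-out test

# Stand-ins for the windowed tensors built from train_s and val_s.
X_train, y_train = np.random.rand(200, 60, 1), np.random.rand(200)
X_val, y_val = np.random.rand(40, 60, 1), np.random.rand(40)

def build_lstm(units, lr):
    # Hypothetical helper returning a compiled single-layer LSTM model.
    m = Sequential([Input(shape=(60, 1)), LSTM(units), Dense(1)])
    m.compile(optimizer=Adam(learning_rate=lr), loss="mse")
    return m

best, best_loss = None, float("inf")
for units in (32, 50, 64):                      # grid over layer width
    for lr in (1e-2, 1e-3):                     # grid over learning rate
        model = build_lstm(units, lr)
        model.fit(X_train, y_train, epochs=10, verbose=0)
        loss = model.evaluate(X_val, y_val, verbose=0)
        if loss < best_loss:
            best, best_loss = (units, lr), loss
print("Best (units, learning rate):", best)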

5. Forecasting & Visualization

Once the model is trained, stock price forecasts are generated for short-term (5-30 days) and
long-term (quarterly) predictions. The system provides interactive graphs to visualize
prediction results. Line charts show predicted vs. actual stock movements, while candlestick
charts highlight market opening and closing prices. Users can also access heatmaps and
correlation matrices to analyze relationships between indicators. Dynamic filters allow users
to adjust forecast parameters based on stock symbols, timeframes, and volatility levels.
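A sketch of the predicted-vs-actual line chart is given below; the two series are synthetic stand-ins for the test-set values and model forecasts.

import numpy as np
import matplotlib.pyplot as plt

# Stand-ins for the actual and predicted test-set series.
rng = np.random.default_rng(0)
actual = np.cumsum(rng.normal(0, 1, 60)) + 100
predicted = actual + rng.normal(0, 1, 60)

plt.figure(figsize=(10, 4))
plt.plot(actual, label="Actual close price")
plt.plot(predicted, label="Predicted close price")
plt.xlabel("Trading day")
plt.ylabel("Price")
plt.legend()
plt.title("Predicted vs. actual stock movement")
plt.show()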

6. Performance Evaluation

To validate prediction accuracy, the system employs multiple evaluation metrics. The Mean
Squared Error (MSE) measures deviations from actual stock prices, while Root Mean
Squared Error (RMSE) evaluates prediction stability. The R² Score determines how well the

predicted data aligns with real market trends. Additionally, Mean Absolute Percentage Error
(MAPE) quantifies forecast errors, ensuring precise adjustments for more reliable predictions.
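The four metrics can be computed as in the sketch below; the two arrays are stand-ins for one model's test-set values and forecasts.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Stand-ins for actual and forecast test values.
actual = np.array([100.0, 101.5, 103.0, 102.0])
forecast = np.array([100.5, 101.0, 102.5, 102.8])

mse = mean_squared_error(actual, forecast)
rmse = np.sqrt(mse)
r2 = r2_score(actual, forecast)
mape = np.mean(np.abs((actual - forecast) / actual)) * 100  # in percent

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}  MAPE={mape:.2f}%")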

7. Error Reduction Strategies

Because financial markets are highly volatile, implementing error reduction techniques is
essential. Bayesian Optimization enhances hyperparameter tuning, while hybrid model
integration combines LSTM and ARIMA to improve accuracy. Sentiment analysis
adjustments refine predictions by incorporating investor mood swings, while ensemble
learning techniques merge outputs from multiple models for better stability.
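One simple hybrid is sketched below: weighting the two models' forecasts by their validation R-squared scores. The weighting scheme and values are assumptions; the hybrid described in the report could equally feed ARIMA residuals into the LSTM.

import numpy as np

# Stand-ins for the two models' forecasts and their validation R-squared scores.
arima_forecast = np.array([101.0, 102.5, 103.2])
lstm_forecast = np.array([100.5, 102.0, 104.0])
r2_arima, r2_lstm = 0.82, 0.90

# Weight each forecast by its (non-negative) validation score.
w_a, w_l = max(r2_arima, 0.0), max(r2_lstm, 0.0)
hybrid = (w_a * arima_forecast + w_l * lstm_forecast) / (w_a + w_l)
print("Hybrid forecast:", hybrid)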

5.3 Executable Code:

#**************** IMPORT PACKAGES ********************
from flask import Flask, render_template, request, flash, redirect, url_for
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
import numpy as np
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import math, random
from datetime import datetime
import datetime as dt
import yfinance as yf
import tweepy
import preprocessor as p
import re
from sklearn.linear_model import LinearRegression
from textblob import TextBlob
import constants as ct
from Tweet import Tweet
import nltk
nltk.download('punkt')

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

#***************** FLASK *****************************
app = Flask(__name__)

# To control caching so as to save and retrieve plot figs on client side
@app.after_request
def add_header(response):
    response.headers['Pragma'] = 'no-cache'
    response.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
    response.headers['Expires'] = '0'
    return response

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/insertintotable', methods=['POST'])
def insertintotable():
    nm = request.form['nm']

    #**************** FUNCTIONS TO FETCH DATA ***************************
    def get_historical(quote):
        end = datetime.now()
        start = datetime(end.year - 2, end.month, end.day)
        data = yf.download(quote, start=start, end=end)
        df = pd.DataFrame(data=data)
        df.to_csv('' + quote + '.csv')
        if df.empty:
            # Fall back to Alpha Vantage if Yahoo Finance returns no data
            ts = TimeSeries(key='N6A6QT6IBFJOPJ70', output_format='pandas')
            data, meta_data = ts.get_daily_adjusted(symbol='NSE:' + quote,
                                                    outputsize='full')
            # Format df: last 2 yrs rows => 502, in ascending order => ::-1
            data = data.head(503).iloc[::-1]
            data = data.reset_index()
            # Keep required cols only
            df = pd.DataFrame()
            df['Date'] = data['date']
            df['Open'] = data['1. open']
            df['High'] = data['2. high']
            df['Low'] = data['3. low']
            df['Close'] = data['4. close']
            df['Adj Close'] = data['5. adjusted close']
            df['Volume'] = data['6. volume']
            df.to_csv('' + quote + '.csv', index=False)
        return

    #******************** ARIMA SECTION ********************
    def ARIMA_ALGO(df):
        uniqueVals = df["Code"].unique()
        len(uniqueVals)
        df = df.set_index("Code")

        # for daily basis
        def parser(x):
            return datetime.strptime(x, '%Y-%m-%d')

        def arima_model(train, test):
            history = [x for x in train]
            predictions = list()
            for t in range(len(test)):
                # The report's listing is cut off by a page break here; the
                # rolling one-step forecast below is an assumed completion,
                # not the verbatim project code.
                model_fit = ARIMA(history, order=(6, 1, 0)).fit(disp=0)
                yhat = model_fit.forecast()[0]
                predictions.append(yhat)
                history.append(test[t])
            return predictions
CHAPTER-6
TESTING

6.1 Testing Definition

Testing is like a detective's investigation into a product. Its main job is to uncover any hidden
flaws or weaknesses that could cause problems later on. By carefully checking each part
—from small components to the whole system—testing ensures everything works as it
should. This helps make sure the software meets what users expect and doesn't let them down
unexpectedly. Different types of tests serve different purposes, like checking specific
requirements or making sure everything works together smoothly. Ultimately, testing is about
making sure the software is reliable and meets high standards before it reaches users.

6.2 Unit Testing

Unit testing is like checking each ingredient before baking a cake. It's about testing small,
specific parts of your code—like checking that each ingredient is fresh and measured
correctly. By doing this, you ensure that each piece of your code works as intended before
putting everything together. Unit tests help catch mistakes early, maintain the overall quality
of your code, and build confidence that your software will work smoothly when it's all
combined.
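As an illustration, a pytest-style unit test for one small, isolated helper might look like the sketch below; normalize is a hypothetical function, not taken from the project code.

import numpy as np

def normalize(values):
    # Hypothetical helper: Min-Max scale a sequence to [0, 1].
    arr = np.asarray(values, dtype=float)
    return (arr - arr.min()) / (arr.max() - arr.min())

def test_normalize_bounds():
    out = normalize([10, 20, 30])
    assert out.min() == 0.0 and out.max() == 1.0

def test_normalize_preserves_order():
    out = normalize([3, 1, 2])
    assert list(np.argsort(out)) == [1, 2, 0]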

Test strategy and approach

We'll take the product out for a spin in real-world situations to see how it holds up, making
sure everything works as intended through hands-on testing. At the same time, we'll craft
detailed tests to check each function thoroughly, ensuring they meet our standards and
perform reliably.

Test Objectives

1. Every field entry needs to function correctly. When you click on the predict button,
it should give the right prediction.
2. The entry screen should be responsive, showing messages and responses
instantly without any delays.

Features to be Tested

1. Make sure that all entries are in the correct format.
2. We shouldn't have any duplicates entered.

6.3 Integration Testing

Integration testing is like putting together pieces of a puzzle to see if they fit perfectly. It
checks how different parts of a software system work together to make sure they collaborate
smoothly and perform their tasks correctly when combined. This testing ensures that the
entire system functions seamlessly as intended.

Test Results: All the test cases mentioned above passed successfully. No
defects encountered.

6.4 Acceptance Testing

Acceptance testing is like giving the software a trial run in a real-life situation before it goes
live. It's about making sure the software does everything it's supposed to do and meets all the
goals we set for it. This testing phase helps us decide if the software is ready to be used by
the people who will rely on it every day.

6.5 TEST CASES:

Table 6.5.1 - Test Case 1: Data Import Validation

Test Case No. : 1
Objective : To test whether the system can successfully import stock market data from an external API.
Test Data : Historical stock prices.
Expected Result : Retrieve stock data without errors.
Actual Result : Data successfully fetched from the external API.
Test Result : This indicates the system can efficiently import stock data.
Table 6.5.2 - Test Case 2: Feature Extraction Validation

Test Case No. : 2
Objective : To test whether the system correctly extracts financial indicators from stock data.
Test Data : Raw stock price data, including open, close, high, and low values.
Expected Result : Moving Averages, RSI, and MACD are computed accurately for analysis.
Actual Result : The system successfully extracted all indicators without errors.
Test Result : The above result indicates that feature extraction is functioning correctly.

Table 6.5.3 - Test Case 3: Model Training Validation

Test Case No. : 3
Objective : To test whether the predictive models (ARIMA, LSTM, Random Forest) train successfully.
Test Data : Cleaned and structured stock market data used for training.
Expected Result : Models should learn patterns effectively and optimize parameters for accurate forecasting.
Actual Result : Models successfully trained and optimized, producing accurate stock price trends.
Test Result : The above result indicates that the models have been successfully trained and validated.

CHAPTER-7

RESULTS

Figure 7.1 - Welcome Page of Stock Prediction System

The Welcome Page of the Stock Prediction System introduces users to the platform with a
simple and intuitive interface. It includes options for Sign Up and Home, allowing new users
to create an account and returning users to access stock analysis features directly. The design
ensures easy navigation, guiding users toward financial predictions and insights. A visually
appealing header displays the platform’s name, reinforcing the purpose of the tool.

Figure 7.2 - Sign-In Page of Stock Prediction System

The Sign-In Page is designed to provide users with secure access to the Stock Prediction
System. The interface includes two primary options: Sign In for existing users and Sign Up
for new registrations. Users must enter their email and password to access the dashboard,
ensuring a protected login experience.

Additionally, a "Forgot Password" option allows users to recover their credentials in case of
access issues. The design prioritizes user convenience, featuring a clean layout with intuitive
buttons and responsive fields.

Figure 7.3 - Crypto Symbol Entry Page

This page prompts users to enter the cryptocurrency symbol for analysis. Users can input the
ticker symbol of the crypto asset they want to track, such as BTC for Bitcoin, ETH for
Ethereum, or XRP for Ripple. The system uses the entered symbol to fetch real-time market
data, historical price trends, and predictive insights.

The interface includes an input field with placeholder text guiding users to enter the correct
crypto symbol. A validation mechanism ensures the symbol is recognized, preventing errors
caused by invalid entries. Once a valid symbol is provided, users can proceed to view charts,
predictions, and trading recommendations.

Additionally, users have access to a drop-down menu listing popular cryptocurrencies,


making selection easier. The goal of this page is to enable quick and accurate retrieval of
crypto market data, helping traders and investors make informed decisions.

Figure 7.5 - Stock Prediction Model Based on Investor Sentiments and
Optimized Deep Learning

This image illustrates a stock prediction model integrating investor sentiment analysis with
deep learning algorithms. The system analyzes news, social media, and market trends to
gauge sentiment, classifying it as positive, neutral, or negative. Optimized LSTM and CNN
models process financial data, refining stock forecasts based on historical patterns and
emotional market responses. The visual elements highlight neural network layers, sentiment
classification, and predicted stock movements, enhancing investment strategies with
real-time sentiment insights.

Figure 7.6-Stock Prediction Model Evaluation and Market Trends Analysis

This section presents various graphs analyzing stock trends and prediction model accuracy.
Figure 7.6 visualizes recent trends in Google cryptocurrency prices, tracking historical
fluctuations, volatility, and trading volume to assess market sentiment and investor behavior.
Figure 7.7 evaluates the ARIMA model’s accuracy, displaying error metrics like RMSE,
MAPE, and R² score, which indicate its effectiveness in predicting linear time-series trends.
Figure 7.8 focuses on the LSTM model’s accuracy, highlighting its ability to capture
long-term dependencies in financial data, making it superior for volatile stock forecasting.
Figure 7.9 assesses XGBoost’s predictive performance, illustrating its precision, recall, and
error rates, showcasing how boosted decision trees refine feature selection and enhance stock
forecast accuracy. These graphical insights collectively provide a comprehensive evaluation
of stock prediction methodologies.

Figure 7.10 - Final Test Report and Stock Selling Decision

This image presents the comprehensive test report summarizing the performance evaluations
of different stock prediction models. The table includes results from unit testing, integration
testing, and model accuracy assessments, ensuring the system functions correctly and
provides reliable forecasts. Key metrics such as error rates, processing speed, and prediction
accuracy are displayed to validate the effectiveness of the algorithms.

Additionally, the image features a "Sell or Not Sell" decision panel, where stocks are
analyzed based on market trends and forecasted movements. The system categorizes stocks
into "Recommended for Sale" or "Hold for Future Gains", helping investors make informed
trading decisions. Color-coded indicators highlight profitable opportunities, reinforcing
strategic investment choices.

CHAPTER-8

CONCLUSION

8.1. CONCLUSION

In this paper, big data analytics are used for efficient stock market analysis and prediction.
The stock market is a domain in which uncertainty and the inability to accurately predict
stock values may result in huge financial losses. Through our work we were able to propose
an approach that helps identify stocks with positive everyday return margins, which can be
suggested as potential stocks for enhanced trading. Such an approach acts as a Hadoop-based
pipeline that learns from past data and decides, based on streaming updates, which US stocks
are profitable to trade in. We also identify scope for improving our study in future directions.
We intend to further our study by automating the analysis processes using a scheduling
module, to obtain periodic recommendations for trading US stocks. We also plan to test
neural-network-based learning rather than linear regression, aiming to accurately predict US
stock prices.

8.2 Future Scope

Future iterations may include sentiment analysis of news articles and social media posts
related to the stock market, providing traders with additional contextual information for
improved decision-making. Interactive visualization tools and dashboards can enhance user
experience by enabling deeper exploration of stock market trends and predictions through
customizable charts and real-time updates.

Building predictive APIs would allow integration with trading platforms and financial
applications, offering traders convenient access to real-time stock market predictions within
their existing workflows. Future iterations could also explore reinforcement learning
techniques to dynamically optimize trading strategies, enabling the system to adapt to
changing market conditions for more responsive decision-making.

REFERENCES

[1] Albeladi, K., Zafar, B., & Mueen, A. (2023). Time series forecasting using LSTM and
ARIMA. International Journal of Advanced Computer Science and Applications, 14(1),
313-320.

[2] Sirisha, U. M., Belavagi, M. C., & Attigeri, G. (2022). Profit prediction using ARIMA,
SARIMA and LSTM models in time series forecasting: A comparison. IEEE Access, 10,
124715-124727.

[3] Awan, M. J., Rahim, M. S., Nobanee, H., Munawar, A., Yasin, A., & Zain Azlan, A. M.
(2021). Social Media and Stock Market Prediction: A Big Data Approach. Computers,
Materials & Continua, 67(2).

[4] Z. Peng, “Stocks Analysis and Prediction Using Big Data Analytics,” in 2019
International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS),
Changsha, China, 2019, pp. 309–312.

[5] P. Singh and A. Thakral, “Stock market: Statistical analysis of its indexes and its
constituents,” in 2017 International Conference On Smart Technologies For Smart Nation
(SmartTechCon), Bangalore, 2017, pp. 962–966.

[6] S. Tiwari, A. Bharadwaj, and S. Gupta, “Stock price prediction using data analytics,” in
2017 International Conference on Advances in Computing, Communication and Control
(ICAC3), Mumbai, 2017, pp. 1–5.

[7] L. Zhao and L. Wang, “Price Trend Prediction of Stock Market Using Outlier Data
Mining Algorithm,” in 2015 IEEE Fifth International Conference on Big Data and Cloud
Computing, Dalian, China, 2015, pp. 93–98.
[8] G. V. Attigeri, Manohara Pai M M, R. M. Pai, and A. Nayak, “Stock market prediction: A
big data approach,” in TENCON 2015 - 2015 IEEE Region 10 Conference, Macao, 2015, pp.
1–5.

[9] W.-Y. Huang, A.-P. Chen, Y.-H. Hsu, H.-Y. Chang, and M.-W. Tsai, “Applying Market
Profile Theory to Analyze Financial Big Data and Discover Financial Market Trading
Behavior - A Case Study of Taiwan Futures Market,” in 2016 7th International Conference
on Cloud Computing and Big Data (CCBD), Macau, China, 2016, pp. 166–169.

[10] S. Jeon, B. Hong, J. Kim, and H. Lee, “Stock Price Prediction based on Stock Big Data
and Pattern Graph Analysis,” in Proceedings of the International Conference on Internet of
Things and Big Data, Rome, Italy, 2016, pp. 223–231.

[11] R. Choudhry and K. Garg, “A Hybrid Machine Learning System for Stock Market
Forecasting,” vol. 2, no. 3, p. 4, 2008.

[12] K. Kim, “Financial time series forecasting using support vector machines,”
Neurocomputing, vol. 55, no. 1–2, pp. 307–319, Sep. 2003.

[13] M. Makrehchi, S. Shah, and W. Liao, “Stock Prediction Using EventBased Sentiment
Analysis,” in 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence
(WI) and Intelligent Agent Technologies (IAT), Atlanta, GA, USA, 2013, pp. 337–342.

[14] H. Pouransari and H. Chalabi, “Event-based stock market prediction,” p. 5.

[15] M. D. Jaweed and J. Jebathangam, “Analysis of stock market by using Big Data
Processing Environment,” International Journal of Pure and Applied Mathematics, vol. 119.
