
Predicting Soil Moisture Using Weather Data

by

Aryan Parashar (2000320120048)
Aryan Gupta (2000320120046)
Aryan Tyagi (2000320120050)
Anurag Bhardwaj (2000320120039)

Submitted to the Department of Computer Science
in partial fulfilment of the requirements
for the degree of
Bachelor of Technology in Computer Science

ABES Engineering College, Ghaziabad
Dr. A.P.J. Abdul Kalam Technical University, Uttar Pradesh
Lucknow, May 2024
DECLARATION

We hereby declare that, to the best of our knowledge and belief, this submission is entirely our
own original work, except where due credit has been given within the text. It contains no
material previously published or written by another person, nor material that has been
substantially accepted for the award of any other degree or certificate of this university or any
other institution of higher education.

Signature:
Name: Aryan Parashar
Roll number: 2000320120048
Date:

Signature:
Name: Aryan Gupta
Roll number: 2000320120046
Date:

Signature:
Name: Aryan Tyagi
Roll number: 2000320120050
Date:

Signature:
Name: Anurag Bhardwaj
Roll number: 2000320120039
Date:
CERTIFICATE

This is to certify that the project report "Predicting Soil Moisture Using Weather Data" by
Aryan Parashar, Aryan Gupta, Aryan Tyagi, and Anurag Bhardwaj is a record of their own work
carried out under my supervision, in partial fulfilment of the requirements for the award of a
B.Tech. degree in the Department of Computer Science at Dr. A.P.J. Abdul Kalam Technical
University. The content of this report is original and has not been submitted for consideration
for any other degree.

Date: 17-05-2024 (Supervisor Signature)


Name
Ms. Disha Mohini Pathak

Designation: Assistant Professor


Department of Computer Science
ABES Engineering College,
Ghaziabad.
ACKNOWLEDGEMENT

It gives us great pleasure to present the report of the B.Tech project completed during our
final year. We are especially grateful to Ms. Disha Mohini Pathak, Assistant Professor in the
Computer Science Department of ABES Engineering College, Ghaziabad, for her unwavering
support and guidance throughout the project. Her honesty, diligence, and tenacity have always
been an example for us, and our efforts have succeeded only because of her conscientious
guidance.

We also take this opportunity to thank Professor (Dr.) Pankaj Kumar Sharma, Head of the
Computer Science Department at ABES Engineering College, Ghaziabad, for his invaluable
support and help during the project's development.
Furthermore, we thank the department's faculty for their generous support and cooperation
during the development of the project. Last but not least, we thank our friends for helping to
see the project through to completion.

Signature:
Name: Aryan Parashar
Roll No. 2000320120048
Date:

Signature:
Name: Aryan Gupta
Roll No. 2000320120046
Date:

Signature:
Name: Aryan Tyagi
Roll No. 2000320120050
Date:

Signature:
Name: Anurag Bhardwaj
Roll No. 2000320120039
Date:
ABSTRACT

Agricultural research has the potential to ease food and water scarcity, two issues the world
is currently grappling with. Economically viable machine learning approaches that predict soil
moisture from meteorological data can help address this problem. While some farmers can
afford several moisture sensors and monitor them continuously, many lack the means to check
soil moisture levels over a long-term crop cycle. One solution is for a farmer to hire a
specialist to conduct a one-time sensor-based study of the property; a model trained on that
data could then forecast soil moisture levels from meteorological information alone.

Keeping the soil at the proper moisture level during the growing season can result in higher
yields and fewer crop-related problems overall. Water surplus or deficit has different, and
sometimes negligible, effects at different stages of growth. It is critical to understand how
land uses and stores water, since these factors vary greatly with the type of plants, the
terrain, and elevation changes. We address this problem with regression models trained under
several strategies; judged on prediction time, fit time, and r2 score, Random Forest emerges
as the top option.
TABLE OF CONTENTS

DECLARATION
CERTIFICATE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
LIST OF ABBREVIATIONS
CHAPTER 1 (INTRODUCTION)
1.1 Motivation
1.2 Project Objective
1.3 Scope of the Project
1.4 Related Work
1.5 Organization of the Report
CHAPTER 2 (LITERATURE REVIEW)
CHAPTER 3 (SYSTEM DESIGN AND METHODOLOGY)
CHAPTER 4 (RESULTS AND DISCUSSION)
CHAPTER 5 (CONCLUSION)
APPENDIX A
APPENDIX B
REFERENCES
LIST OF TABLES

Table Description

Table 2 Literature Survey Paper


LIST OF FIGURES

Figure Description

Fig 2.1 Diagrams


LIST OF SYMBOLS

[x] Integer value of x.

≠ Not equal.

∈ Belongs to.

€ Euro, a currency.

_ Optical distance.

Optical thickness or optical half-thickness.
LIST OF ABBREVIATIONS

AAM Active Appearance Model

ICA Independent Component Analysis

ISC Increment Sign Correlation

PCA Principal Component Analysis

ROC Receiver Operating Characteristics


CHAPTER 1
INTRODUCTION

Maintaining the right amount of moisture in the soil during the plant growth season
can increase yields and reduce crop issues overall. Water excess or shortage has
varying, or even insignificant, impacts on different development stages.
Because it can vary widely based on the plants you utilize, the terrain, and the
elevation of the region, it is important to understand how your land consumes and
stores water. This kind of approach has been utilized by farmers for hundreds of
years. What counts is the precision we can get with real data. For the last few
centuries, farmers had to evaluate the moisture level of their soil mostly by touch
and experience. Even though many farmers were successful in the sense that they
were able to grow crops, there were still ways they might have improved the
productivity of their harvests.
Although there are other factors besides plant availability that affect yields, the
goal of this study is to develop a model that farmers can use to estimate soil
moisture levels without having to purchase and install costly sensors. There are
several possible uses for the developed model. The ability to track present soil
conditions is the primary usage, allowing for the potential correction of any
problems by making necessary adjustments.
Second, a farmer might assess past data, compare it to yields or other harvest
outcomes, and utilize this analytical knowledge to guide actions in the future. A
maize farmer, for instance, could just be concerned with the circumstances that
are anticipated to ensure that they fall within acceptable bounds. Together with
other data, a grape farmer in a wine vineyard may use this information to forecast
the wine's quality or even the blend of wine that would be best made from grapes
grown under these circumstances.
The goal of this research is to precisely observe how weather affects a specific
plot of land in the state of Washington. To get benchmarks, this process might be
carried out anywhere in the world. For a farmer without the resources to undertake
a comprehensive study of water consumption on their field, these benchmarks
may be an affordable alternative for training data. Alternatively, they may choose a
model with a comparable soil composition and/or topography, and then estimate
the soil moisture content using their own meteorological data.
This project's main objective is to provide the greatest tool at a price that will allow
for widespread use.
1.1 Motivation
Predicting soil moisture using machine learning and weather data is crucial for
optimizing agriculture, enhancing water resource management, and mitigating the
impacts of droughts. Accurate predictions aid in crop yield optimization through
precise irrigation, promoting resource efficiency and sustainable farming practices.
Early warning systems for droughts benefit communities, governments, and farmers,
enabling proactive measures. The information is valuable for reservoir management,
urban planning, and infrastructure development, reducing risks associated with soil
moisture variations. Integration into weather models improves forecasting accuracy,
while the insights contribute to climate change research. Soil moisture predictions
also play a role in insurance risk assessment for agriculture. Overall, this approach
not only addresses immediate challenges but fosters scientific understanding,
supporting innovation and informed decision-making in various sectors.

1.2 Project Objective

The project aims to develop a robust machine learning model for predicting soil
moisture levels based on weather data. By leveraging advanced algorithms, the
objective is to provide accurate and timely information to optimize agricultural
practices, enhance water resource management, and mitigate the impact of
droughts. The model will contribute to precision farming by guiding farmers in
optimal irrigation scheduling, ultimately improving crop yields and resource
efficiency. Additionally, the project seeks to facilitate early warning systems for
droughts, empowering communities, governments, and farmers to proactively
respond to water scarcity challenges. The research will also explore the integration
of soil moisture predictions into weather forecasting models, with the potential to
improve overall forecasting accuracy. The overarching goal is to create a practical
tool that addresses real-world challenges in agriculture, water management, and
environmental sustainability, fostering informed decision-making and contributing to
the advancement of scientific knowledge.

1.3 Scope of the Project

The project's scope encompasses the development and implementation of a machine
learning-based soil moisture prediction system using weather data. It involves data
collection, preprocessing, and the application of advanced algorithms to create a reliable
model.
The primary focus is on optimizing agricultural practices, enhancing water resource
management, and providing early drought warnings. The scope extends to
assessing the model's impact on precision farming, improving crop yields, and
promoting resource-efficient irrigation. Additionally, the project explores the potential
integration of soil moisture predictions into broader weather forecasting models to
enhance overall prediction accuracy. Beyond agriculture, the research aims to
contribute valuable insights to urban planning, infrastructure development, and
climate change studies. The project's outcomes are expected to offer practical
applications in diverse sectors, providing a holistic solution to soil moisture
prediction challenges and fostering sustainable practices through informed
decision-making.

1.4 Related Previous Work


Previous work in soil moisture prediction has predominantly focused on empirical
models and remote sensing technologies. Many studies have employed statistical
approaches, such as regression models, to correlate weather variables with soil
moisture levels. Remote sensing techniques, including satellite and sensor-based
observations, have been instrumental in providing spatially distributed data for
large-scale analysis. However, machine learning techniques offer a promising
avenue for improved accuracy and predictive capabilities. Recent efforts have
explored the use of algorithms like Random Forest, Support Vector Machines, and
Neural Networks in soil moisture prediction, showcasing their potential in handling
complex, non-linear relationships. While existing research has laid a foundation,
this project seeks to advance the field by integrating state-of-the-art machine
learning methodologies, enhancing prediction accuracy, and addressing practical
applications in agriculture, water resource management, and environmental
sustainability.

1.5 Organization of the Report


The report will follow a structured framework comprising several key sections. The
introduction will provide background information, the project's significance, and its
objectives. The literature review will delve into existing research on soil moisture
prediction, emphasizing previous methodologies, technologies, and their
limitations. The methodology section will detail data collection processes,
preprocessing steps, and the application of machine learning algorithms. Results
will present findings, including model performance metrics and validation
outcomes. A discussion section will interpret results, compare findings with
existing literature, and address any limitations. Practical implications and potential
applications in agriculture, water management, and beyond will be explored.

Chapter 2: Literature Survey and Software Requirement Specification


This chapter reviews existing literature on soil moisture prediction, summarizing
methodologies, algorithms, and technologies. It provides insights crucial for shaping the
project's methodology. Additionally, it outlines the software requirements, specifying
tools, programming languages, and frameworks needed for data processing and
machine learning implementation.

Chapter 3: System Design and Methodology


Detailing the project's architecture and methodology, this chapter serves as a blueprint
for development. It outlines the data flow, component interactions, and machine learning
approaches chosen. By defining the systematic approach, it provides clarity for
subsequent phases of the project.

Chapter 4: Implementation and Results


This chapter focuses on the practical aspects, detailing the implementation of the soil
moisture prediction system. It covers coding, model training, and the integration of
weather data. Results are thoroughly analyzed, and visualizations may be included to
validate the system's efficacy.

Chapter 5: Conclusion
Synthesizing the project, this chapter summarizes key findings, reflecting on limitations
and proposing avenues for future research. It underscores the project's significance in
agriculture and environmental sustainability, providing closure to the report while
emphasizing real-world applications.
CHAPTER 2

LITERATURE SURVEY

India, recognized for its agricultural legacy, saw around 73 million hectares fitted with irrigation
infrastructure in fiscal year 2022-23, accounting for 52% of the total 141 million hectares of gross planted
land [10]. The nation's available utilizable water resources are limited to an annual capacity of 1122 BCM
(Billion Cubic Meters), which includes 690 BCM from surface water and 432 BCM from groundwater. The
utilized water potential for this allotment is roughly 699 BCM, with 450 BCM generated from surface water
and 249 BCM from groundwater. Notably, the agriculture sector is estimated to account for
85-90% of the country's overall water demand.
The work in [2] focuses on predicting soil moisture using specific implementations of RNN
(Recurrent Neural Network), LSTM (Long Short-Term Memory), and volumetric soil-moisture
clustering driven by weather data [5]. The data set was collected from 28 dissimilar locations
near Siberia, covering four distinct types of soil. It was collected over 2017-2018 and tested
for about eight months, from September 2019 to April 2020. The results were about 84 percent
accurate, though accuracy degraded when the number of days was reduced and the soil depth was
increased.
Gap: Shorter-time-frame prediction produced an unsatisfactory accuracy rate, indicating that a
larger data set should be studied.
The research in [3] focuses on soil moisture prediction for irrigation automation and forecasting utilising
time-series modelling. The models utilised were Lasso, Decision Tree, Random Forest, and Support Vector
Machine. The investigation was conducted on a cotton farm, employing wireless soil moisture monitoring
equipment set in five plots. Temperature, depth, air humidity, and wind speed were some of the factors used.
Gap: Only a single data set with a small number of rows is used, and no comparative analysis
is provided in the results section to select the best of the models.
The study in [4] focuses on predicting soil moisture. The model utilised was Deep Neural
Network Regression (DNNR). The research region included 25 weather stations that collected
data on 32 factors such as air temperature, rainfall, soil moisture, and wind direction, with
15 placed in 8 North Dakota counties and 10 in seven Minnesota counties. The study included
various sorts of crops, including wheat, dry beans, canola, oats, and barley.
Gap: Predicts soil moisture only at a depth of 20 cm.
The research in [5] focuses on forecasting soil moisture. The models used were Random Forest (RF),
Extremely Randomised Trees (ET), and Gradient Boosting Machines (GBM), with temperature, wind speed,
and air humidity as variables.
Gap: Limited parameters; not suitable for varied depths.
In this model, we are going to fill the following gaps discovered in several research papers.

1. Comparative analysis between all the regression techniques.


2. Combining multiple datasets for better results.
3. Result values such as prediction time, fit time and r2 score at various depths.
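Gap 3, reporting metrics per depth, reduces to looping one model over depth-specific target columns. A minimal sketch with synthetic stand-ins for the depth columns (the real sensor data is not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# One synthetic regression problem per depth stands in for the
# depth-specific sensor columns used in the report.
depths = ['30cm', '60cm', '90cm', '120cm', '150cm']
r2_by_depth = {}
for i, depth in enumerate(depths):
    X, y = make_regression(n_samples=300, n_features=6, noise=0.2, random_state=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=i)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    r2_by_depth[depth] = r2_score(y_test, model.predict(X_test))

for depth, r2 in r2_by_depth.items():
    print(f"{depth}: r2 = {r2:.3f}")
```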
Related work

Table 2: Literature Survey

Authors: Sagarika Paul, et al. [2]
Methodology: SVM, Linear Regression, Naive Bayes
Attributes selected: temperature, humidity, moisture
Gaps: the study covers fewer types of crops

Authors: Umesh Acharya, et al. [3]
Methodology: random forest regression (including advanced variants such as boosted regression
trees), support vector regression, multiple regression, artificial neural network
Attributes selected: air temperature, rainfall, soil moisture, wind direction
Gaps: data sets are not big enough

Authors: Ramendra Prasad, et al. [4]
Methodology: Lasso, Decision Tree, Random Forest, Support Vector Machine
Attributes selected: rainfall, soil moisture, humidity
Gaps: gaps related to the usage of surface reflectance and land-use data

Authors: Cai Y, Zheng W, et al. [5]
Methodology: Deep Neural Network Regression (DNNR)
Attributes selected: air temperature, air humidity, atmospheric pressure, soil moisture, daily
precipitation, illumination duration
Gaps: predicts soil moisture only at a depth of 20 cm

Authors: Jang, Young-bin, et al. [6]
Methodology: Support Vector Machines (SVM), Random Forest (RF), Extremely Randomized Trees
(ET), Gradient Boosting Machines (GBM), Deep Feedforward Network (DFN)
Attributes selected: air temperature, rainfall, wind direction
Gaps: limited parameters; not suitable for variable depths

CHAPTER 3

SYSTEM DESIGN AND METHODOLOGY

1. System Design

1.1. System Architecture

1.2. DFD, Class Diagram, flow charts, ER Diagrams

2. Methodology
1. Linear Regression

2. Ridge Regression

3. Lasso Regression

4. Decision Tree Regression

5. Random Forest Regression

6. Gradient Boosting Regression

7. Support Vector Regression (SVR)

8. K-Nearest Neighbors Regression (KNN)

9. ElasticNet Regression

10. Bayesian Ridge Regression
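The ten regression techniques listed above can be benchmarked on the three metrics the report uses (fit time, prediction time, r2 score). A minimal sketch on synthetic data with default hyperparameters (the real weather features and soil-moisture targets are not reproduced here):

```python
import time
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import (LinearRegression, Ridge, Lasso,
                                  ElasticNet, BayesianRidge)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for the weather / soil-moisture feature matrix.
X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'DecisionTree': DecisionTreeRegressor(random_state=0),
    'RandomForest': RandomForestRegressor(n_estimators=50, random_state=0),
    'GradientBoosting': GradientBoostingRegressor(random_state=0),
    'SVR': SVR(),
    'KNN': KNeighborsRegressor(),
    'ElasticNet': ElasticNet(),
    'BayesianRidge': BayesianRidge(),
}

# Record the same three metrics the report compares.
results = {}
for name, model in models.items():
    t0 = time.time()
    model.fit(X_train, y_train)
    t1 = time.time()
    preds = model.predict(X_test)
    t2 = time.time()
    results[name] = {'fit_time': t1 - t0, 'pred_time': t2 - t1,
                     'r2': r2_score(y_test, preds)}

for name, r in results.items():
    print(f"{name:16s} fit={r['fit_time']:.3f}s "
          f"pred={r['pred_time']:.4f}s r2={r['r2']:.3f}")
```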



1. Architecture diagrams

Figure 2: Tier Architecture Diagram


2. Data Flow Diagram

Figure 3

3. Class Diagram

Figure 4
4. Database schema diagrams

Figure 5
CHAPTER 4

RESULTS AND DISCUSSION

Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11

Hybrid KNN-Decision Tree and Hybrid Decision Tree-Ridge models have a longer training time.
This could imply that they need more time to study the initial data and understand the underlying
patterns. However, Hybrid Lasso-Ridge and Hybrid KNN-Lasso models can be trained much faster,
approximately between 1.5 to 2 seconds. It seems like they can learn quickly due to getting rid of
features that are unimportant or reducing complexity in general.
On the other hand, both Hybrid Decision Tree-Ridge and Hybrid Lasso-Ridge also produce
predictions in 0.2 seconds or less, unlike the Hybrid KNN-Decision Tree and Hybrid KNN-Lasso
predictions, which require far more time, around 24 seconds. This is likely because, before
making a prediction, KNN must compare each new data point to all of the data it was trained
on. The r2 score measures how well the model's predictions correspond with the actual values;
here, Hybrid Decision Tree-Ridge performs excellently, with the highest score indicating a
strong fit and highly reliable predictions.
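The construction of the hybrid models is not shown in the report; one plausible reading is a simple averaging ensemble of the two base regressors. A hypothetical sketch using scikit-learn's VotingRegressor on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=400, n_features=6, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each "hybrid" here averages the predictions of its two base regressors.
hybrids = {
    'KNN + DecisionTree': VotingRegressor([
        ('knn', KNeighborsRegressor()),
        ('dt', DecisionTreeRegressor(random_state=1))]),
    'DecisionTree + Ridge': VotingRegressor([
        ('dt', DecisionTreeRegressor(random_state=1)),
        ('ridge', Ridge())]),
    'Lasso + Ridge': VotingRegressor([
        ('lasso', Lasso()),
        ('ridge', Ridge())]),
}

scores = {}
for name, model in hybrids.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: r2 = {scores[name]:.3f}")
```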
Figure 12

Random Forest demonstrates remarkable efficiency, showcasing a reduction in processing time as


the dataset expands. Its scalability is evident, making it adept at handling larger datasets efficiently.
Conversely, Linear Regression and SVM experience an uptick in computation time with growing
dataset sizes, potentially due to sequential processing constraints. Ridge CV, while generally
efficient, displays performance variations, indicating sensitivity to dataset size. These observations
underscore the nuanced interplay between algorithmic efficiency and dataset characteristics. When
selecting models, it becomes crucial to consider scalability and processing capabilities, tailoring
choices to the specific demands of the dataset's size and complexity. This nuanced understanding
aids in optimizing model performance across a diverse array of scenarios, ensuring effective and
efficient utilization of machine learning algorithms.
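The scaling behaviour described above can be measured directly by timing each model's fit over growing data sets. A minimal sketch on synthetic data (the model settings and sizes are illustrative, not the report's actual experiment):

```python
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.svm import SVR

# Fit each model at increasing dataset sizes and record wall-clock fit time.
sizes = [200, 400, 800]
timings = {name: [] for name in ['RandomForest', 'Linear', 'SVM', 'RidgeCV']}
for n in sizes:
    X, y = make_regression(n_samples=n, n_features=10, noise=0.1, random_state=0)
    models = {
        'RandomForest': RandomForestRegressor(n_estimators=20, random_state=0),
        'Linear': LinearRegression(),
        'SVM': SVR(),
        'RidgeCV': RidgeCV(alphas=[0.01, 0.1, 1, 10]),
    }
    for name, model in models.items():
        t0 = time.time()
        model.fit(X, y)
        timings[name].append(time.time() - t0)

for name, ts in timings.items():
    print(name, ['%.3fs' % t for t in ts])
```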
CHAPTER 5

CONCLUSION

1. Performance Evaluation

The outcome of all experimentation was a procedure in which two data sets could be joined and
fed into a model to forecast soil moisture with great accuracy: the r2 score lies between
0.977 and 0.991, depending on depth, using a Random Forest Regressor with default settings.
This could be a repeatable process in which a farmer contracts a company to collect training
data on their land for a single growing season. Since permanently installed sensor networks
can be cumbersome and expensive for a farmer to maintain, this alternative is cheaper and
still gives nearly the same outcomes as sensors that run constantly. Alternatively, this
process could be a sub-process in a larger software suite that farmers could use for
forecasting, or to build a season's soil moisture record for post-season analysis of their
crop. As long as large-scale AI programs remain expensive and cumbersome for farmers,
adoption will stay low. This project has shown that large-scale soil moisture prediction
software can be built with relatively low computational costs.
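The join-then-train procedure can be sketched as follows. The column names, the merge key, and the synthetic data are hypothetical; only the overall flow (merge two data sets on a shared date, train a default RandomForestRegressor, report r2) mirrors the report:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 300

# Hypothetical weather data set (column names are illustrative).
weather = pd.DataFrame({
    'date': pd.date_range('2021-01-01', periods=n, freq='D'),
    'air_temp': rng.normal(15, 8, n),
    'humidity': rng.uniform(20, 95, n),
    'wind_speed': rng.uniform(0, 12, n),
})
# Hypothetical soil data set: moisture depends on the weather plus noise,
# so the model has signal to learn.
soil = pd.DataFrame({
    'date': weather['date'],
    'moisture_30cm': (40 - 0.8 * weather['air_temp']
                      + 0.1 * weather['humidity'] + rng.normal(0, 1, n)),
})

# Join the two data sets on the shared date column.
merged = weather.merge(soil, on='date')
X = merged[['air_temp', 'humidity', 'wind_speed']]
y = merged['moisture_30cm']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0)  # default settings, as in the report
model.fit(X_train, y_train)
print('r2:', r2_score(y_test, model.predict(X_test)))
```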

2. Comparison with existing State-of-the-Art Technologies


A comparison with existing state-of-the-art technologies in soil moisture prediction reveals the
project's innovative contributions. While traditional methods often rely on empirical models and
remote sensing, this project leverages advanced machine learning algorithms to enhance accuracy
and adaptability. State-of-the-art technologies may incorporate complex models like ensemble
methods or deep learning, yet this project aims to balance performance and interpretability.
Additionally, the integration of soil moisture predictions into weather forecasting models
represents a forward-looking approach, aligning with the evolving landscape of environmental
data science. The project's focus on practical applications in agriculture and water management
further distinguishes it in the context of existing technologies.
3. Future Directions
Future directions for the soil moisture prediction project involve several potential
enhancements and expansions. Firstly, the integration of real-time data feeds could enable
more dynamic and responsive predictions. Exploring additional machine learning models,
including state-of-the-art deep learning architectures, could further improve accuracy.
Collaboration with meteorological agencies for access to comprehensive weather data may
enhance the model's predictive capabilities. The scalability of the system for broader
geographical coverage and adaptation to diverse ecosystems is another avenue for
development. Moreover, incorporating feedback loops for continuous model improvement
and addressing uncertainties in predictions through probabilistic models are areas
warranting exploration. Finally, expanding the project's applications to include soil health
assessment and irrigation system optimization could provide holistic solutions for
sustainable agriculture.
Code for the Project

The code below is reconstructed from the project's Jupyter notebook. Earlier cells (not
included in this excerpt) define the preprocessor, the train/test splits (X_train_set,
y_train_set, X_test_set, y_test_set), and the imports (Pipeline, Lasso, RidgeCV,
RandomForestRegressor, r2_score, pandas, time, datetime).

Results are better! Let's try Lasso:

pipe_with_estimator = Pipeline(steps=[('preprocessor', preprocessor),
                                      ('classifier', Lasso(alpha=1))])

data_cols = ['30cm', '60cm', '90cm', '120cm', '150cm']
try:
    log
except NameError:
    log = pd.DataFrame(columns=['Experiment', 'Depth', 'Fit_Time', 'Pred_Time',
                                'r2_score', 'datetime'])

for cols in data_cols:
    t0 = time.time()
    pipe_with_estimator.fit(X_train_set[cols], y_train_set[cols])
    t1 = time.time()
    preds = pipe_with_estimator.predict(X_test_set[cols])
    t2 = time.time()
    r2sc = r2_score(y_test_set[cols], preds)
    now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    log.loc[len(log)] = ['Lasso Reg - Alpha = 1', cols, t1 - t0, t2 - t1, r2sc, now]

print(log)

Output:

               Experiment  Depth   Fit_Time  Pred_Time      r2_score
0        First Linear Reg   30cm  32.042364  14.371771  9.154623e-01
1        First Linear Reg   60cm   2.874317   0.135288 -1.662894e+14
2        First Linear Reg   90cm   2.858727   0.156215  9.487954e-01
3        First Linear Reg  120cm   2.874285   0.171831  9.460321e-01
4        First Linear Reg  150cm   2.952462   0.125001  9.433287e-01
5   Ridge Reg - Alpha = 1   30cm  28.188629   0.140590  9.162112e-01
6   Ridge Reg - Alpha = 1   60cm   1.312190   0.218699  9.427566e-01
7   Ridge Reg - Alpha = 1   90cm   1.202838   0.156215  9.487904e-01
8   Ridge Reg - Alpha = 1  120cm   1.140353   0.140597  9.460320e-01
9   Ridge Reg - Alpha = 1  150cm   1.140351   0.140591  9.433202e-01
10  Lasso Reg - Alpha = 1   30cm  15.871741   0.156247 -1.832157e-04
11  Lasso Reg - Alpha = 1   60cm   1.327864   0.140546 -4.613909e-05
12  Lasso Reg - Alpha = 1   90cm   1.359043   0.140626 -5.673799e-06
13  Lasso Reg - Alpha = 1  120cm   1.373374   0.140591 -1.131381e-06
14  Lasso Reg - Alpha = 1  150cm   1.348597   0.140538 -1.814059e-04
(datetime column, 2021-12-22 15:17:55 through 15:19:07, omitted here)
At least with these parameters, Lasso fits poorly.

Ridge with a built-in grid-search cross-validation:
pipe_with_estimator = Pipeline(steps=[('preprocessor', preprocessor),
                                      ('classifier', RidgeCV(alphas=[0.001, 0.01, 0.1, 1,
                                                                     10, 100, 1000]))])

data_cols = ['30cm', '60cm', '90cm', '120cm', '150cm']
try:
    log
except NameError:
    log = pd.DataFrame(columns=['Experiment', 'Depth', 'Fit_Time', 'Pred_Time',
                                'r2_score', 'datetime'])

for cols in data_cols:
    t0 = time.time()
    pipe_with_estimator.fit(X_train_set[cols], y_train_set[cols])
    t1 = time.time()
    preds = pipe_with_estimator.predict(X_test_set[cols])
    t2 = time.time()
    r2sc = r2_score(y_test_set[cols], preds)
    now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    log.loc[len(log)] = ['Ridge Reg - GSCV', cols, t1 - t0, t2 - t1, r2sc, now]

print(log)

Output (rows 0-14 are unchanged from the previous table):

          Experiment  Depth   Fit_Time  Pred_Time      r2_score
15  Ridge Reg - GSCV   30cm  15.183893   0.187457  9.162351e-01
16  Ridge Reg - GSCV   60cm   5.366714   0.156221  9.427570e-01
17  Ridge Reg - GSCV   90cm   5.436197   0.140594  9.487957e-01
18  Ridge Reg - GSCV  120cm   5.483074   0.171830  9.460322e-01
19  Ridge Reg - GSCV  150cm   5.561178   0.203086  9.433280e-01
Grid search found alpha = 1 to be the best parameter.

Other Regressor Tests

Right now Ridge Regression with an alpha of 1 is winning as the best model so far. Let's see
if we can beat it.
{
"cell_type": "code",
"execution_count":
20, "metadata": {},
"outputs": [
{
"name": "stdout",
"output_type":
"stream", "text": [
" Experiment Depth Fit_Time Pred_Time r2_score \\\n",

"0 Random Forest - Default 30cm 688.237841 0.860152 0.980310 \n",


"1 Random Forest - Default 60cm 680.878078 0.687300 0.990726 \n",
"2 Random Forest - Default 90cm 689.800418 0.734206 0.992370 \n",
"3 Random Forest - Default 120cm 722.545116 0.718582 0.992590 \n",
"4 Random Forest - Default 150cm 733.399503 0.725267 0.993203 \n",
"\n",
}
],
"source": [
"pipe_with_estimator = Pipeline(steps=[('preprocessor',
preprocessor),\n", " ('classifier', RandomForestRegressor())])\n",
"\n",
"data_cols = ['30cm', '60cm', '90cm', '120cm', '150cm']\n",
"try:\n",
" log_other\n",
"except
NameError:\n",
" log_other = pd.DataFrame(columns = ['Experiment', 'Depth', 'Fit_Time', 'Pred_Time',
'r2_score', 'datetime'])\n",
"for cols in
data_cols:\n", " t0 =
time.time()\n",
" pipe_with_estimator.fit(X_train_set[cols],
y_train_set[cols])\n", " t1 = time.time()\n",
" preds =
pipe_with_estimator.predict(X_test_set[cols])\n", " t2 =
time.time()\n",
" r2sc = r2_score(y_test_set[cols], preds)\n",
" now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')\n",
" log_other.loc[len(log_other)] = ['Random Forest - Default', cols, t1-t0, t2-t1, r2sc,
now]\n", " \n",
"print(log_other)"
]
},
{
"cell_type":
"markdown",
"metadata": {},
"source": [
"Amazing results! Although it takes considerably longer to train, the default does rather well"
]
},
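{
"cell_type": "markdown",
"metadata": {},
"source": [
"An optional follow-up (a sketch, assuming the fitted Random Forest pipeline from the last depth is still in scope; the variable names are illustrative) is to inspect which features the forest leans on via its impurity-based importances:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical inspection: top impurity-based feature importances of the fitted forest\n",
"import numpy as np\n",
"rf = pipe_with_estimator.named_steps['classifier']\n",
"top = np.argsort(rf.feature_importances_)[::-1][:10]\n",
"print(rf.feature_importances_[top])"
]
},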
{
"cell_type":
"markdown",
"metadata": {},
"source": [
"As a litmus test, lets just try a few more models."
]
},
{
"cell_type": "code",
"execution_count":
21, "metadata": {},
"outputs": [
{
"name": "stdout",
"output_type":
"stream", "text": [
" Experiment Depth Fit_Time Pred_Time r2_score \\\n",

"0 Random Forest - Default 30cm 688.237841 0.860152 0.980310 \n",


"1 Random Forest - Default 60cm 680.878078 0.687300 0.990726 \n",
"2 Random Forest - Default 90cm 689.800418 0.734206 0.992370 \n",
"3 Random Forest - Default 120cm 722.545116 0.718582 0.992590 \n",
"4 Random Forest - Default 150cm 733.399503 0.725267 0.993203 \n",
"5 SVM - Default 30cm 63.639315 7.706923 0.658935 \n",
"6 SVM - Default 60cm 148.874223 10.001588 0.753807 \n",
"7 SVM - Default 90cm 150.792928 10.414850 0.775367 \n",
"8 SVM - Default 120cm 127.845249 9.556673 0.746775 \n",
"9 SVM - Default 150cm 158.235881 11.079853 0.747956 \n",
"\n",
}
],
"source": [
"pipe_with_estimator = Pipeline(steps=[('preprocessor',
preprocessor),\n", " ('classifier', SVR())])\n",
"\n",
"data_cols = ['30cm', '60cm', '90cm', '120cm', '150cm']\n",
"try:\n",
" log_other\n",
"except
NameError:\n",
" log_other = pd.DataFrame(columns = ['Experiment', 'Depth', 'Fit_Time', 'Pred_Time',
'r2_score', 'datetime'])\n",
"for cols in
data_cols:\n", " t0 =
time.time()\n",
" pipe_with_estimator.fit(X_train_set[cols],
y_train_set[cols])\n", " t1 = time.time()\n",
" preds =
pipe_with_estimator.predict(X_test_set[cols])\n", " t2 =
time.time()\n",
" r2sc = r2_score(y_test_set[cols], preds)\n",
" now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')\n",
" log_other.loc[len(log_other)] = ['SVM - Default', cols, t1-t0, t2-t1, r2sc,
now]\n", " \n",
"print(log_other)"
]
},
{
"cell_type":
"markdown",
"metadata": {},
"source": [
"Just with the default values, SVM, did not perform well, but this could just mean that default
parameters are not good"
]
},
{
"cell_type": "code",
"execution_count":
22, "metadata": {},
"outputs": [
{
"name": "stdout",
"output_type":
"stream", "text": [
" Experiment Depth Fit_Time Pred_Time r2_score \\\n",

"0 Random Forest - Default 30cm 688.237841 0.860152 0.980310 \n",


"1 Random Forest - Default 60cm 680.878078 0.687300 0.990726 \n",
"2 Random Forest - Default 90cm 689.800418 0.734206 0.992370 \n",
"3 Random Forest - Default 120cm 722.545116 0.718582 0.992590 \n",
"4 Random Forest - Default 150cm 733.399503 0.725267 0.993203 \n",
"5 SVM - Default 30cm 63.639315 7.706923 0.658935 \n",
"6 SVM - Default 60cm 148.874223 10.001588 0.753807 \n",
"7 SVM - Default 90cm 150.792928 10.414850 0.775367 \n",
"8 SVM - Default 120cm 127.845249 9.556673 0.746775 \n",
"9 SVM - Default 150cm 158.235881 11.079853 0.747956 \n",
"10 SGD - Default 30cm 6.263381 0.558115 0.889629 \n",
"11 SGD - Default 60cm 1.503087 0.148150 0.930496 \n",
"12 SGD - Default 90cm 1.475703 0.139627 0.941424 \n",
"13 SGD - Default 120cm 1.440918 0.169494 0.935683 \n",
"14 SGD - Default 150cm 29.228487 0.136069 0.928644 \n",
"\n",
}
],
"source": [
"pipe_with_estimator = Pipeline(steps=[('preprocessor',
preprocessor),\n", " ('classifier', SGDRegressor())])\n",
"\n",
"data_cols = ['30cm', '60cm', '90cm', '120cm', '150cm']\n",
"try:\n",
" log_other\n",
"except
NameError:\n",
" log_other = pd.DataFrame(columns = ['Experiment', 'Depth', 'Fit_Time', 'Pred_Time',
'r2_score', 'datetime'])\n",
"for cols in
data_cols:\n", " t0 =
time.time()\n",
" pipe_with_estimator.fit(X_train_set[cols],
y_train_set[cols])\n", " t1 = time.time()\n",
" preds =
pipe_with_estimator.predict(X_test_set[cols])\n", " t2 =
time.time()\n",
" r2sc = r2_score(y_test_set[cols], preds)\n",
" now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')\n",
" log_other.loc[len(log_other)] = ['SGD - Default', cols, t1-t0, t2-t1, r2sc, now]\n",
" \n",
"print(log_other)
"
]
},
{
"cell_type":
"markdown",
"metadata": {},
"source": [
"## Hyper Parameter Tuning Random Forest"
]
},
{
"cell_type":
"markdown",
"metadata": {},
"source": [
"The following, will take a considerable amount of time to run. Run with caution!!"
]
},
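{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before committing to the full search, a much smaller dry run at a single depth can gauge the runtime (a sketch; the tiny grid, single depth, and variable names here are illustrative choices, not the settings used below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical smoke test: 2 candidates x 3 folds at one depth only\n",
"rf_pipe = Pipeline(steps=[('preprocessor', preprocessor),\n",
"                          ('classifier', RandomForestRegressor())])\n",
"quick = RandomizedSearchCV(rf_pipe, {'classifier__n_estimators': [100, 200]},\n",
"                           n_iter=2, cv=3, random_state=42, n_jobs=-1)\n",
"quick.fit(X_train_set['30cm'], y_train_set['30cm'])\n",
"print(quick.best_params_)"
]
},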
{
"cell_type":
"markdown",
"metadata": {},
"source": [
"This experiment is not included in the final report, but shows an extension of trying to get better
results."
]
},
{
"cell_type": "code",
"execution_count":
null, "metadata": {},
"outputs": [
{
"name": "stdout",
"output_type":
"stream", "text": [
"Fitting 3 folds for each of 10 candidates, totalling 30 fits\n"
]
},
{
"name": "stderr",
"output_type":
"stream", "text": [
"[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent
workers.\n", "[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 7.9min\n"
]
}
],
"source": [
"## Param grid comes from the following site:\n",
"##
https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn
-28d2aa77dd74\n",
"\n",
"pipe_with_estimator = Pipeline(steps=[('preprocessor',
preprocessor),\n", " ('classifier', RandomForestRegressor())])\n",
"\n",
"param_grid = {'classifier bootstrap': [True, False],\n",
" 'classifier max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],\n",
" 'classifier max_features': ['auto', 'sqrt'],\n",
" 'classifier min_samples_leaf': [1, 2, 4],\n",
" 'classifier min_samples_split': [2, 5, 10],\n",
" 'classifier n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}\n",
"\n",
"data_cols = ['30cm', '60cm', '90cm', '120cm', '150cm']\n",
"cv_res = {}\n",
"try:\n",
" log_rf\n",
"except NameError:\n",
" log_rf = pd.DataFrame(columns = ['Experiment', 'Depth', 'Fit_Time', 'Pred_Time',
'r2_score', 'best_params' 'datetime'])\n",
"for cols in
data_cols:\n", " t0 =
time.time()\n",
" random_search = RandomizedSearchCV(estimator = pipe_with_estimator, param_distributions
= param_grid, n_iter = 10, cv = 3, verbose=10, random_state=42, n_jobs =
-1)\n", " random_search.fit(X_train_set[cols], y_train_set[cols])\n",
" best =
random_search.best_params_\n", "
t1 = time.time()\n",
" preds =
random_search.predict(X_test_set[cols])\n", " t2 =
time.time()\n",
" r2sc = r2_score(y_test_set[cols], preds)\n",
" now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')\n",
" log_rf.loc[len(log_rf)] = ['RF - random search', cols, t1-t0, t2-t1, r2sc, best,
now]\n", " cv_res[cols] = random_search.cv_results_\n",
"
print(log_rf)\n", "
\n", "print(log_rf)"
]
}
],
"metadata": {
"kernelspec":
{
"display_name": "Python
3", "language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode":
{ "name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter":
"python", "pygments_lexer":
"ipython3", "version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}