Mini Project Report on
RAINFALL PREDICTION SYSTEM
Submitted in partial fulfillment of the requirement for the award of the
                               degree of
                    BACHELOR OF TECHNOLOGY
                                   IN
                 COMPUTER SCIENCE & ENGINEERING
                          Submitted by:
 Student Name: Sumit Malan                         University Roll No.: 2019162
                         Under the Mentorship of
                         Ms. Meenakshi Maindola
 Department of Computer Science and Engineering
      Graphic Era (Deemed to be University)
             Dehradun, Uttarakhand
                  January-2024
                      CANDIDATE’S DECLARATION
I hereby certify that the work which is being presented in the project report entitled "Rainfall
Prediction System" in partial fulfillment of the requirements for the award of the Degree of
Bachelor of Technology in Computer Science and Engineering of the Graphic Era (Deemed
to be University), Dehradun has been carried out by me under the mentorship of Ms.
Meenakshi Maindola, Department of Computer Science and Engineering, Graphic Era
(Deemed to be University), Dehradun.
               Name: Sumit Malan
               University Roll No.: 2019162
                   Table of Contents
 Chapter 1    Introduction
 Chapter 2    Literature Survey
 Chapter 3    Methodology
 Chapter 4    Result and Discussion
 Chapter 5    Conclusion and Future Work
              References
Chapter 1
                                      Introduction
1.1 Introduction
Rainfall prediction remains a serious concern and has attracted the attention of governments,
industries, risk-management entities, and the scientific community. Rainfall is a climatic
factor that affects many human activities, such as agricultural production, construction,
power generation, forestry, and tourism. To this end, rainfall prediction is essential, since
rainfall is the variable most strongly correlated with adverse natural events such as
landslides, flooding, mass movements, and avalanches. These incidents have affected society
for years. An appropriate approach to rainfall prediction therefore makes it possible to take
preventive and mitigation measures against these natural phenomena.
Weather forecasting is a dynamic and time-consuming task. Even with recent scientific
advancements it remains difficult, because the atmosphere is intensely active and, in
practice, chaotic. Existing climate prediction methods are largely deterministic and
mathematical; artificial intelligence, by contrast, offers the capacity to process knowledge
correctly in such a dynamic setting.
Atmospheric parameters are not predefined and therefore cannot be fixed in advance. The
forecasting of rainfall remains a major issue that has drawn the attention of governments,
markets, risk-control agencies, and the research community, and such forecasts are obtained
through the collection of knowledge. The last decade, moreover, has seen an unprecedented
level of involvement in the field from both scientists and technology developers. Rainfall is
one of the most critical parameters in a hydrological model, and many models for the
analysis and prediction of precipitation patterns have been established. Thanks to their
time-frequency interpretation, wavelet techniques have been widely applied to water-
management work in recent years. In this work we predict rainfall using data mining
algorithms such as neural networks, random forests, classification and regression trees,
support vector machines, and k-nearest neighbours. The approach proposed in this work is a
spatial-temporal mining algorithm for a better understanding of weather and climate data.
We also developed some novel mining algorithms based on case studies such as rainfall
analysis and simulation, cyclone analysis and simulation, and temperature analysis and
simulation. We found, based on correlation, that various parameters such as temperature and
humidity cause precipitation in the atmosphere. In the proposed work, we predict rainfall
from quantitative data about the current atmospheric state through a complex combination of
mathematical abstractions.
1.2 Existing System
Agriculture is the strength of the Indian economy, and farmers depend on the monsoon for
their cultivation. Good crop productivity needs good soil, fertilizer, and a favourable
climate, so weather forecasting is a very important requirement for every farmer. Due to
sudden changes in climate and weather, people suffer economically and physically. Weather
prediction is one of the challenging problems of the current state of the art. The main
motivation of this work is to predict the weather using various data mining techniques, such
as classification, clustering, decision trees, and neural networks. Weather-related
information is also called meteorological data. The most commonly used weather parameters
in this work are rainfall, wind speed, temperature, and cold.
1.2.1 Disadvantages of Existing System
1. Classification
2. Clustering
3. Decision Tree
1.3 PROPOSED SYSTEM
Rainfall is important for food production planning, water resource management, and all
activities that depend on nature. The occurrence of a prolonged dry period or of heavy rain
at critical stages of crop growth and development may lead to a significant reduction in crop
yield. India is an agricultural country whose economy is largely based on crop productivity;
thus, rainfall prediction is a significant factor in agricultural countries like India. Rainfall
forecasting has been one of the most scientifically and technologically challenging problems
around the world over the last century.
1.3.1 Advantages of Proposed System
1. Numerical Weather Prediction
2. Statistical Weather Prediction
3. Synoptic Weather Prediction
Chapter 2
                                 Literature Survey
 Pritpal Singh et al. Statistical investigation shows the nature of Indian Summer Monsoon
Rainfall (ISMR), which cannot be accurately predicted by statistics or statistical data alone.
Hence, this study demonstrates the use of three techniques: object creation, entropy, and
artificial neural networks (ANN). On this basis, a new method for forecasting ISMR time
series has been created to address the nature of ISMR. The model has been validated and
supported by training and exploratory data, and statistical examination of the various data
sets and comparative studies demonstrate the performance of the proposed technique.
 Sam Cramer, Michael Kampouridis, Alex A. Freitas, Antonios Alexandridis et al. The
primary contribution of this work is to demonstrate the benefits of machine learning
algorithms, as well as their greater degree of predictive power compared with current
rainfall forecasting methods. The authors analyse and compare the existing approach (a
Markov chain extended with rainfall research) with the predictions of six well-known
machine learning methods: genetic programming, support vector regression, radial basis
function networks, M5 rules, M5 model trees, and k-nearest neighbours. To enable a more
detailed assessment, they conducted a rainfall study using data from 42 cities.
 Sahar Hadi Pour, Shamsuddin Shahid, Eun-Sung Chung et al. Random forest (RF) was
used to predict whether it would rain on a given day, while SVM was used to predict the
amount of rain on a rainy day. The capability of the hybrid model was assessed by
downscaling day-by-day rainfall at three sites in the eastern part of Malaysia. The hybrid
models were also found to reproduce the full variability, the number of consecutive wet
days, the 95th percentile of month-to-month rainfall, and the distribution of the observed
rainfall.
 Tanvi Patil, Dr. Kamal Shah et al. The purpose of the framework is to predict the weather
at some point in the future. Climatic conditions are determined using many kinds of
variables at every location; of these, only the most significant features are used in weather
forecasts. Choosing such features depends a great deal on the time period chosen. Structural
modelling is used to incorporate predictive modelling, machine learning applications,
information exchange, and characteristic analysis.
 N. Divya Prabha, P. Radha et al. Compared with other places, where rainfall data are not
available it takes a long time to build up a reliable water survey over many years. Improving
complex neural networks is intended to provide an excellent instrument for predicting the
rainy season. The rainfall sequence was confirmed using a multilayer perceptron neural
network. Measurements such as MSE (mean squared error) and NMSE (normalized mean
squared error), and the arrangement of data sets for short-term planning, are clear in the
comparison of different networks such as AdaNaive and AdaSVM.
 Senthamil Selvi S, Seetha et al. In this paper, artificial neural network (ANN) technology
is used to develop a weather forecasting method that identifies rainfall using Indian rainfall
data. A feed-forward neural network (FFNN) trained with the backpropagation algorithm
was used. The performance of the two models is evaluated on the basis of iteration analysis,
mean square error (MSE), and magnitude of relative error (MRE). The report also provides
a future guide to rainfall forecasting.
 Yashas Athreya, Vaishali B V, Sagar K and Srinidhi H R et al. This work presents rainfall
analysis using machine learning. The main motivation behind the program is to protect
against the effects of floods. The program can be used by ordinary residents or by the public
authorities to anticipate what will happen before a flood and, using a flood map, to provide
the necessary help by relocating people or taking other important measures.
Chapter 3
                                         Methodology
In this work, the overall architecture includes four major components: Data Exploration and
Analysis, Data Pre-processing, Model Implementation, and Model Evaluation, as shown in
Fig. 3.1.
                                  Fig. 3.1 Overall Architecture.
3.1 Data Exploration and Analysis
Exploratory Data Analysis is valuable to machine learning problems since it allows to get
closer to the certainty that the future results will be valid, correctly interpreted, and applicable
to the desired business contexts. Such level of certainty can be achieved only after raw data is
validated and checked for anomalies, ensuring that the data set was collected without errors.
EDA also helps to find insights that were not evident or worth investigating to business
stakeholders and researchers. We performed EDA using two methods - Univariate
Visualization which provides summary statistics for each field in the raw data set (figure 3.2)
and Pair-wise Correlation Matrix which is performed to understand interactions between
different fields in the data set (figure 3.3).
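
As a minimal sketch of these two EDA methods, the snippet below produces per-field
summary statistics, histograms, and a pair-wise correlation heat map. The file name
weatherAUS.csv is our assumption about the underlying Australian rainfall data set; the rest
uses standard pandas and seaborn calls.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Assumed file name for the Australian rainfall data set
    df = pd.read_csv("weatherAUS.csv")

    # Univariate visualization: summary statistics and a histogram per field
    print(df.describe(include="all"))
    df.hist(figsize=(14, 10))
    plt.tight_layout()
    plt.show()

    # Pair-wise correlation matrix of the numeric fields, drawn as a heat map
    corr = df.corr(numeric_only=True)
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
    plt.show()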
                                 Table 3.1 Irrelevant Features
                               Fig. 3.2 Univariate Visualization.
We have other features with null values too, which we will impute in our preprocessing
steps. If we look at the distribution of our target variable, it is clear that we have a class
imbalance problem, with 110316 positive instances and 31877 negative instances.
                                       Fig. 3.3 Heat Map.
The correlation matrix shows that the features MaxTemp, Pressure9am, Pressure3pm,
Temp3pm, and Temp9am are negatively correlated with the target variable. Hence, we can
drop these features in our feature selection step later.
3.2 Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data are often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and are likely to contain many errors. We carried out the
preprocessing steps below.
3.2.1 Missing Values: As per our EDA step, we learned that we have a few instances with
null values, so imputing them becomes an important step. To impute the missing values, we
group our instances based on location and date and replace the null values with their
respective group means.

Feature Expansion: The Date feature can be expanded into Day, Month, and Year, and these
newly created features can then be used in other preprocessing steps.
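
A minimal pandas sketch of these two steps follows. Grouping by Location and the expanded
Month column (rather than the exact date) is an assumption about how the group means were
computed; the column names come from the data set discussed above.

    import pandas as pd

    df = pd.read_csv("weatherAUS.csv", parse_dates=["Date"])

    # Feature expansion: split Date into Day, Month and Year
    df["Day"] = df["Date"].dt.day
    df["Month"] = df["Date"].dt.month
    df["Year"] = df["Date"].dt.year

    # Impute nulls in each numeric column with the mean of its
    # (Location, Month) group, falling back to the overall column mean
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df.groupby(["Location", "Month"])[num_cols].transform(
        lambda s: s.fillna(s.mean()))
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())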
3.2.2 Categorical Values: A categorical feature is one that has two or more categories with
no intrinsic ordering among them. We have a few categorical features (WindGustDir,
WindDir9am, WindDir3pm), each with 16 unique values. Machines process numbers more
readily than text, since the models are based on mathematical equations and calculations;
therefore, we have to encode the categorical data. We tried two different techniques here.
3.2.2.1 Dummy Variables: A dummy variable is an artificial variable created to represent
an attribute with two or more distinct categories/levels. However, as we have 16 unique
values, our one feature gets transformed into 16 new features, which in turn invites the curse
of dimensionality: for each instance, we will have one feature with value 1 and the
remaining 15 features with value 0. Example: categorical encoding of the feature
windDir3pm using dummy variables.
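
A short sketch of this encoding with the pandas get_dummies API (reference [5]); the four
sample directions are illustrative only.

    import pandas as pd

    # A few sample wind directions; the real feature has 16 unique values
    wind = pd.DataFrame({"WindDir3pm": ["N", "SSE", "W", "N"]})

    # Each distinct category becomes its own 0/1 indicator column
    dummies = pd.get_dummies(wind["WindDir3pm"], prefix="WindDir3pm")
    print(dummies)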
                                  Fig. 3.4 Sample Instance.
3.2.2.2 Feature Hashing: Feature hashing is another useful feature engineering scheme for
dealing with large-scale categorical features. In this scheme, a hash function is typically
used with the number of encoded features pre-set (as a vector of pre-defined length), such
that the hashed values of the features are used as indices into this pre-defined vector and the
values are updated accordingly. Example: categorical encoding of the feature windDir3pm
using feature hashing.
                                  Fig. 3.5 Sample Instance.
                                  Fig. 3.6 Feature Hashing.
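
The sketch below encodes the same feature with scikit-learn's FeatureHasher (reference
[6]). The vector length of 8 is an illustrative choice for readability, not necessarily the
setting used in our experiments.

    from sklearn.feature_extraction import FeatureHasher

    # Hash each direction into a fixed-length vector with 8 slots
    hasher = FeatureHasher(n_features=8, input_type="string")
    directions = [["N"], ["SSE"], ["W"]]  # one token list per instance
    hashed = hasher.transform(directions)
    print(hashed.toarray())  # 3 instances x 8 hashed features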
3.2.3 Feature Selection: Feature selection is the process of automatically or manually
selecting those features which contribute most to our prediction variable or output. Having
irrelevant features in the data can decrease the accuracy of the models and make the model
learn based on irrelevant features. Feature selection helps to reduce overfitting, improves
accuracy, and reduces training time. We used two techniques to perform this activity and
obtained the same results.
3.2.3.1 Univariate Selection: Statistical tests can be used to select those features that have
the strongest relationship with the output variable. The scikit-learn library provides the
SelectKBest class, which can be used with a suite of different statistical tests to select a
specific number of features. We used the chi-squared statistical test for non-negative
features to select the 5 best features from our data set.
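
A hedged sketch of this step combines SelectKBest with the chi-squared test (reference [8])
and MinMax scaling (reference [7]) to keep the inputs non-negative; the synthetic X, y
merely stand in for our preprocessed features and target.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.preprocessing import MinMaxScaler

    # Placeholder data standing in for the preprocessed rainfall features
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

    selector = SelectKBest(score_func=chi2, k=5)
    X_best = selector.fit_transform(X, y)
    print(selector.get_support(indices=True))  # indices of the 5 kept features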
3.2.3.2 Correlation Matrix with Heatmap: Correlation states how the features are related to
each other or to the target variable. Correlation can be positive (an increase in one feature's
value increases the value of the target variable) or negative (an increase in one feature's
value decreases the value of the target variable). A heat map makes it easy to identify which
features are most related to the target variable; we plotted a heat map of correlated features
using the seaborn library (figure 3.3).
3.2.4 Handling Class Imbalance: We learned in our EDA step that our data set is highly
imbalanced. Imbalanced data lead to biased results, as our model does not learn much about
the minority class. We performed two experiments, one with oversampled data and another
with undersampled data.
3.2.4.1 Undersampling: We used Imblearn's RandomUnderSampler to randomly eliminate
instances of the majority class [10] (figure 3.7).
3.2.4.2 Oversampling: We used Imblearn's SMOTE technique to generate synthetic
instances for the minority class. A subset of the minority class is taken as an example, and
new, similar synthetic instances are then created (figure 3.8).
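
The sketch below exercises both resampling strategies with imbalanced-learn (reference
[10]); the synthetic X, y mimic the roughly 78/22 class split observed in our EDA step.

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import SMOTE

    # Synthetic two-class data with an imbalance similar to our target
    X, y = make_classification(n_samples=1000, weights=[0.78], random_state=0)
    print("original:", Counter(y))

    # Undersampling: randomly drop majority-class instances
    X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print("undersampled:", Counter(y_u))

    # Oversampling: synthesize new minority-class instances with SMOTE
    X_o, y_o = SMOTE(random_state=0).fit_resample(X, y)
    print("oversampled:", Counter(y_o))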
                                  Fig. 3.7 Undersampling.

3.3 Models

We chose different classifiers, each belonging to a different model family (such as linear,
tree-based, distance-based, rule-based, and ensemble). All the classifiers were implemented
using scikit-learn, except for the Decision Table, which was implemented using Weka. The
following classification algorithms have been used to build prediction models and perform
the experiments:
3.3.1 Logistic Regression is a classification algorithm used to predict a binary outcome (1/0,
Yes/No, True/False) given a set of independent variables. To represent a binary/categorical
outcome, we use dummy variables. We can also think of logistic regression as a special case
of linear regression where the outcome variable is categorical and the log of odds is used as
the dependent variable. In simple words, it predicts the probability of occurrence of an event
by fitting data to a logit function. This makes Logistic Regression a good fit, as ours is a
binary classification problem.
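
In symbols, writing p for the predicted probability of rain tomorrow, the model fits the log
of odds as a linear function, log(p / (1 - p)) = b0 + b1x1 + ... + bnxn, which is equivalent to
passing the linear combination through the logistic function,
p = 1 / (1 + e^-(b0 + b1x1 + ... + bnxn)).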
3.3.2 Decision Trees have a natural if-then-else construction that makes them fit easily into
a programmatic structure. They are also well suited to categorization problems where
attributes or features are systematically checked to determine a final category. The technique
works for both categorical and continuous input and output variables: we split the population
or sample into two or more homogeneous sets (or sub-populations) based on the most
significant splitter/differentiator among the input variables. These characteristics make the
Decision Tree a good fit for our problem, as our target variable is a binary categorical
variable.
3.3.3 K-Nearest Neighbour (KNN) is a non-parametric and lazy learning algorithm. Non-
parametric means there is no assumption about the underlying data distribution; in other
words, the model structure is determined from the dataset. Lazy means the algorithm does
not build a model from the training data ahead of time; all the training data are used in the
testing phase. KNN performs better with a small number of features than with a large
number, since as the number of features increases, more data are required, and an increase
in dimension also leads to overfitting. However, we have performed feature selection, which
helps to reduce the dimension, so KNN looks like a good candidate for our problem. Our
model's configuration: we tried various values of n ranging from 3 to 30 and learned that the
model performs best with n equal to 25, 27, or 29.
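
A hedged sketch of that search, scoring odd neighbourhood sizes from 3 to 29 with 10-fold
cross-validated accuracy; the synthetic data merely stands in for our preprocessed set.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder data standing in for the preprocessed rainfall features
    X, y = make_classification(n_samples=400, random_state=0)

    for n in range(3, 30, 2):  # odd values of n from 3 to 29
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=n),
                              X, y, cv=10).mean()
        print(f"n={n}: mean accuracy={acc:.3f}")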
3.3.4 Decision Table provides a handy and compact way to represent complex business
logic. In a decision table, the logic is divided into conditions, actions (decisions), and rules
representing the various components that form the business logic. This classifier was
implemented using Weka.
3.3.5 Random Forest is a supervised ensemble learning algorithm. Ensemble means that it
takes a bunch of weak learners and has them work together to form one strong predictor.
Here, we have a collection of decision trees, known as a forest. To classify a new object
based on its attributes, each tree gives a classification, and we say the tree votes for that
class. The forest chooses the classification having the most votes over all the trees in the
forest. Our model's configuration: number of weak learners = 100, maximum depth of each
tree = 4.
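
To tie the model line-up together, the sketch below assembles the scikit-learn classifiers with
the configurations stated above and scores each with 10-fold stratified cross-validation on
accuracy and AUC, mirroring the metrics reported in Chapter 4. The Weka Decision Table
is omitted, and the synthetic data is a placeholder for our preprocessed set.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder data standing in for the preprocessed rainfall features
    X, y = make_classification(n_samples=500, random_state=0)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "KNN (n=25)": KNeighborsClassifier(n_neighbors=25),
        "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=4),
        "Gradient Boosting": GradientBoostingClassifier(learning_rate=0.25),
    }

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
        print(f"{name}: accuracy={acc:.3f}, AUC={auc:.3f}")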
Chapter 4
                              Result and Discussion
4.1 Result
4.1.1 Experiment 1 - Original Dataset: After all the preprocessing steps (as described in the
Methodology chapter), we ran all the implemented classifiers, each one with the same input
data (shape: 92037 x 4). Figure 4.1 depicts the two considered metrics (10-fold stratified
cross-validation accuracy and Area Under Curve) for all the classifiers. Accuracy-wise,
Gradient Boosting with a learning rate of 0.25 performed best; coverage-wise, Random
Forest and Decision Tree performed worst.
                                  Fig. 4.1 Experiment 1.
4.1.2 Experiment 2 - Undersampled Dataset: After all the preprocessing steps (as described
in the Methodology chapter), including the undersampling step, we ran all the implemented
classifiers, each one with the same input data (shape: 54274 x 4). Figure 4.2 depicts the two
considered metrics (10-fold stratified cross-validation accuracy and Area Under Curve) for
all the classifiers.
                                   Fig. 4.2 Experiment 2.
Accuracy- and coverage-wise, Logistic Regression performed best and Decision Tree
performed worst.
4.1.3 Experiment 3 - Oversampled Dataset: After all the preprocessing steps (as described
in the Methodology chapter), including the oversampling step, we ran all the implemented
classifiers, each one with the same input data (shape: 191160 x 4). Figure 4.3 depicts the two
considered metrics (10-fold stratified cross-validation accuracy and Area Under Curve) for
all the classifiers.
                                    Fig. 4.3 Experiment 3.
Accuracy- and coverage-wise, Decision Tree performed best and Logistic Regression
performed worst. We obtained a varying range of results with respect to the different input
data and different classifiers. Further metrics are provided in the appendix.
4.2 Discussion
Given the issues with our original dataset, we learned many things from the preprocessing
steps we carried out to rectify them. The first important lesson is the importance of knowing
your data. While imputing missing values, we grouped on two other features and calculated
the group mean instead of directly calculating the mean over all instances; this way, our
imputed values were closer to the correct information. Another lesson concerns leaky
features: while exploring our data, we found that one of our features (RiskMM) had been
used to generate the target variable, and hence it made no sense to use this feature for
predictions. We learned about the curse of dimensionality while dealing with categorical
variables, which we addressed using feature hashing. We also learned two techniques for
performing feature selection: univariate selection and the correlation heat map. Finally, we
explored undersampling and oversampling techniques for handling the class imbalance
problem.
From the experiments that we carried out using different data, we also came to know that in
a few cases we achieved suspiciously high accuracy (Decision Tree), clearly implying a
classic case of overfitting. We also observed that the performance of the classifiers varied
with different input data. To name a few cases: Logistic Regression performed best with
undersampled data whereas it performed worst with oversampled data; the same goes for
KNN, which performed best with oversampled data and worst with undersampled data.
Hence we can say that the input data plays a very important role here. Among the ensembles,
Gradient Boosting in particular performed consistently across all the experiments.
Chapter 5
                          Conclusion and Future Work
In this work, we explored and applied several preprocessing steps and learned their impact
on the overall performance of our classifiers. We also carried out a comparative study of all
the classifiers with different input data and observed how the input data can affect the model
predictions. We can conclude that Australian weather is uncertain and that there is no strong
correlation between rainfall and the respective region and time. We identified certain
patterns and relationships in the data which helped in determining important features; refer
to the appendix section. As we have a huge amount of data, we could apply Deep Learning
models such as the Multilayer Perceptron, Convolutional Neural Networks, and others. It
would be valuable to perform a comparative study between the machine learning classifiers
and deep learning models.
References
1. World Health Organization: Climate Change and Human Health: Risks and Responses.
   World Health Organization, January 2003.
2. Alcántara-Ayala, I.: Geomorphology, natural hazards, vulnerability and prevention of
   natural disasters in developing countries. Geomorphology 47(2-4), 107-124 (2002).
3. Nicholls, N.: Atmospheric and climatic hazards: Improved monitoring and prediction for
   disaster mitigation. Natural Hazards 23(2-3), 137-155 (2001).
4. [Online] InDataLabs, Exploratory Data Analysis: the Best Way to Start a Data Science
   Project. Available: https://medium.com/@InDataLabs/why-start-a-data-science-project-with-exploratory-data-analysis-f90c0efcbe49
5. [Online] Pandas Documentation. Available:
   https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
6. [Online] Scikit-Learn Documentation. Available:
   https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html
7. [Online] Scikit-Learn Documentation. Available:
   https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
8. [Online] Scikit-Learn Documentation. Available:
   https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html