
HOUSE PRICE PREDICTION PROJECT

Submitted by:

Swati Pandey
ACKNOWLEDGMENT
I have put great effort into this project. However, it would not have
been possible without the kind support and help of many
individuals and organizations. I would like to extend my
sincere thanks to all of them.

I am highly indebted to Flip Robo Technologies for their guidance
and constant supervision, for providing the necessary information
regarding the project, and for their support in completing it.

I want to thank my SME Srishti Maan for providing the dataset,
helping us solve the problem, and addressing our queries in a
timely manner.

I would like to express my gratitude towards DataTrained, my
parents, and the members of Flip Robo for their kind co-operation
and encouragement, which helped me complete this project.

I would also like to express my special gratitude and thanks to
industry persons for giving me their attention and time.
INTRODUCTION
Business Problem Framing

Housing is a basic need for people around the globe, and the housing
and real estate market is therefore one of the major contributors to the
world's economy. It is a very large market with many companies
operating in the domain.

Data science is an important tool for solving problems in this domain:
it helps companies increase their overall revenue and profits, improve
their marketing strategies, and track changing trends in house sales and
purchases. Predictive modelling, market mix modelling, and
recommendation systems are some of the machine learning techniques
used to achieve business goals for housing companies. Our problem is
related to one such housing company.

We are required to model the price of houses using the available
independent variables. This model will then be used by the
management to understand exactly how prices vary with these
variables. They can accordingly adjust the strategy of the firm
and concentrate on areas that will yield high returns. Further, the
model will give the management a good understanding of the
pricing dynamics of a new market.
Conceptual Background of the Domain Problem

A US-based housing company named Surprise Housing has decided
to enter the Australian market. The company uses data analytics to
purchase houses at a price below their actual values and flip them at
a higher price. For this purpose, the company has collected a data set
from the sale of houses in Australia, provided as a CSV file.

The company is looking at prospective properties to buy in order
to enter the market. We are required to build a machine learning
model to predict the actual value of the prospective properties and
decide whether to invest in them or not. For this, the company wants
to know:

• Which variables are important in predicting the price of a house?

• How do these variables describe the price of the house?

Motivation for the Problem Undertaken

The main objective of this project is to build a model that predicts
house prices with the help of the other supporting features, using
machine learning algorithms.

The sample data is provided from our client's database. In order to
improve its selection of customers and properties, the client wants
predictions that can guide further investment decisions.

The House Price Index (HPI) is commonly used to estimate changes
in housing prices. Since housing prices are strongly correlated with
other factors such as location, area and population, information beyond
the HPI is required to predict individual housing prices. A considerable
number of papers have adopted traditional machine learning approaches
to predict housing prices accurately, but they rarely examine the
performance of individual models and tend to neglect the less popular
yet more complex models.

As a result, to explore the impact of different features on prediction
methods, this project applies both traditional and advanced machine
learning approaches and investigates the differences among several
models. It also validates multiple regression techniques and aims to
provide an optimal result for housing price prediction.

ANALYTICAL PROBLEM FRAMING

Mathematical/Analytical Modelling of the Problem

We are building a machine learning model to predict the actual value
of the prospective properties and decide whether to invest in them or
not. The model will help determine which variables are important in
predicting the price of a house and how these variables describe that
price. It will therefore help determine the price of houses from the
available independent variables, so the management can adjust the
strategy of the firm accordingly and concentrate on areas that will
yield high returns.

Regression analysis is a set of statistical processes for estimating


the relationships between a dependent variable (often called the
'outcome variable') and one or more independent variables (often
called 'predictors', 'covariates', or 'features'). The most common form
of regression analysis is linear regression, in which one finds the line
(or a more complex linear combination) that most closely fits the data
according to a specific mathematical criterion. For specific
mathematical reasons this allows the researcher to estimate the
conditional expectation of the dependent variable when the
independent variables take on a given set of values.

Regression analysis is also a form of predictive modelling


technique which investigates the relationship between a dependent
(target) and independent variable (predictor). This technique is used
for forecasting, time series modelling and finding the causal effect
relationship between the variables.

The different Mathematical/Analytical models that are used in this


project are as below:

1. Linear regression - is a linear model, e.g., a model that assumes a


linear relationship between the input variables (x) and the single
output variable (y). More specifically, that y can be calculated from a
linear combination of the input variables (x).

2. Lasso - In statistics and machine learning, lasso is a regression


analysis method that performs both variable selection and
regularization in order to enhance the prediction accuracy and
interpretability of the resulting statistical model.

3. Ridge - regression is a way to create a parsimonious model when


the number of predictor variables in a set exceeds the number of
observations, or when a data set has multicollinearity (correlations
between predictor variables).
4. K Neighbors Regressor - KNN algorithm can be used for both
classification and regression problems. The KNN algorithm uses
'feature similarity' to predict the values of any new data points. This
means that the new point is assigned a value based on how closely it
resembles the points in the training set.

5. Decision Tree - is one of the most commonly used, practical


approaches for supervised learning. It can be used to solve both
Regression and Classification tasks with the latter being put more into
practical application. It is a tree-structured classifier with three types
of nodes.

6. Random forest - is a meta estimator that fits a number of classifying


decision trees on various sub-samples of the dataset and uses
averaging to improve the predictive accuracy and control over-fitting.
A Random Forest's nonlinear nature can give it a leg up over linear
algorithms, making it a great option.

7. AdaBoost Regressor - is a meta-estimator that begins by fitting a


regressor on the original dataset and then fits additional copies of the
regressor on the same dataset but where the weights of instances are
adjusted according to the error of the current prediction.

8. Gradient Boosting Regressor - GB builds an additive model in a


forward stage-wise fashion; it allows for the optimization of arbitrary
differentiable loss functions. In each stage a regression tree is fit on
the negative gradient of the given loss function.

9. Support Vector Regressor - SVR is a supervised learning algorithm
that is used to predict continuous values. SVR uses the same principle
as SVMs: the basic idea is to find the best-fit line, which in SVR is the
hyperplane that contains the maximum number of points within the
margin.
-> First, use the train dataset and do the EDA process, fitting the best
model and saving the model.
-> Then, use the test dataset, load the saved model and predict the
values over the test data.
Data Sources and their formats

Let's check the data now. A snapshot is attached below to give an
overview.

Data description

Data contains 1460 entries each having 81 variables. The details of the
features are given below:
1. MSSubClass: Identifies the type of dwelling involved in the sale.
2. MSZoning: Identifies the general zoning classification of the sale.
3. LotFrontage: Linear feet of street connected to property
4. LotArea: Lot size in square feet
5. Street: Type of road access to property
6. Alley: Type of alley access to property
7. LotShape: General shape of property
8. LandContour: Flatness of the property
9. Utilities: Type of utilities available
10. LotConfig: Lot configuration
11. LandSlope: Slope of property
12. Neighborhood: Physical locations within Ames city limits
13. Condition1: Proximity to various conditions
14. Condition2: Proximity to various conditions (if more than one is
present)
15. BldgType: Type of dwelling
16. HouseStyle: Style of dwelling
17. OverallQual: Rates the overall material and finish of the house
18. OverallCond: Rates the overall condition of the house
19. YearBuilt: Original construction date
20. YearRemodAdd: Remodel date (same as construction date if no
remodeling or additions)
21. RoofStyle: Type of roof
22. RoofMatl: Roof material
23. Exterior1st: Exterior covering on house
24. Exterior2nd: Exterior covering on house (if more than one
material)
25. MasVnrType: Masonry veneer type
26. MasVnrArea: Masonry veneer area in square feet
27. ExterQual: Evaluates the quality of the material on the exterior
28. ExterCond: Evaluates the present condition of the material on the
exterior
29. Foundation: Type of foundation
30. BsmtQual: Evaluates the height of the basement
31. BsmtCond: Evaluates the general condition of the basement
32. BsmtExposure: Refers to walkout or garden level walls
33. BsmtFinType1: Rating of basement finished area
34. BsmtFinSF1: Type 1 finished square feet
35. BsmtFinType2: Rating of basement finished area (if multiple
types)
36. BsmtFinSF2: Type 2 finished square feet
37. BsmtUnfSF: Unfinished square feet of basement area
38. TotalBsmtSF: Total square feet of basement area
39. Heating: Type of heating
40. HeatingQC: Heating quality and condition
41. CentralAir: Central air conditioning
42. Electrical: Electrical system
43. 1stFlrSF: First Floor square feet
44. 2ndFlrSF: Second floor square feet
45. LowQualFinSF: Low quality finished square feet (all floors)
46. GrLivArea: Above grade (ground) living area square feet
47. BsmtFullBath: Basement full bathrooms
48. BsmtHalfBath: Basement half bathrooms
49. FullBath: Full bathrooms above grade
50. HalfBath: Half baths above grade
51. Bedroom: Bedrooms above grade (does NOT include basement
bedrooms)
52. Kitchen: Kitchens above grade
53. KitchenQual: Kitchen quality
54. TotRmsAbvGrd: Total rooms above grade (does not include
bathrooms)
55. Functional: Home functionality (Assume typical unless deductions
are warranted)
56. Fireplaces: Number of fireplaces
57. FireplaceQu: Fireplace quality
58. GarageType: Garage location
59. GarageYrBlt: Year garage was built
60. GarageFinish: Interior finish of the garage
61. GarageCars: Size of garage in car capacity
62. GarageArea: Size of garage in square feet
63. GarageQual: Garage quality
64. GarageCond: Garage condition
65. PavedDrive: Paved driveway
66. WoodDeckSF: Wood deck area in square feet
67. OpenPorchSF: Open porch area in square feet
68. EnclosedPorch: Enclosed porch area in square feet
69. 3SsnPorch: Three season porch area in square feet
70. ScreenPorch: Screen porch area in square feet
71. PoolArea: Pool area in square feet
72. PoolQC: Pool quality
73. Fence: Fence quality
74. MiscFeature: Miscellaneous feature not covered in other
categories
75. MiscVal: $Value of miscellaneous feature
76. MoSold: Month Sold (MM)
77. YrSold: Year Sold (YYYY)
78. SaleType: Type of sale
79. SaleCondition: Condition of sale
80. Id: Id of House
81. SalePrice: Price of House

Checking the data type & info of dataset


Checking the no. of null values in the dataset
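A minimal sketch of these checks, assuming the training data has been loaded into a pandas DataFrame named df (the file name is an assumption):

# Basic checks of data types, info and null counts
import pandas as pd

df = pd.read_csv("train.csv")
print(df.dtypes)                                               # data type of every column
df.info()                                                      # non-null counts, dtypes and memory usage
print(df.isnull().sum().sort_values(ascending=False).head(20)) # columns with the most missing values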

Data Pre-processing
Data pre-processing in machine learning refers to the technique of
preparing (cleaning and organizing) the raw data to make it suitable
for building and training machine learning models. In other words,
whenever data is gathered from different sources it is collected in a
raw format that is not feasible for analysis. Data pre-processing is an
integral step in machine learning, as the quality of the data and the
useful information that can be derived from it directly affect the
ability of our model to learn; therefore, it is extremely important that
we pre-process our data before feeding it into our model.
Checking the value counts of categorical data

Observations:

1. There is only one unique value present in the Utilities column, so
we will drop it.

2. In the categorical columns there are missing values present in
Alley, MasVnrType, BsmtQual, BsmtCond, BsmtExposure,
BsmtFinType1, BsmtFinType2, FireplaceQu, GarageType,
GarageFinish, GarageQual, GarageCond, PoolQC, Fence and
MiscFeature.
Handling missing data
Dropping some unnecessary columns
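A possible sketch of the missing-value handling and column dropping described above; the exact columns dropped and the fill strategies ("None" for categoricals, median for numerics) are assumptions and may differ from the notebook:

# Drop identifier / single-valued columns and fill missing values
df = df.drop(columns=["Id", "Utilities"])

cat_na = ["Alley", "MasVnrType", "BsmtQual", "BsmtCond", "BsmtExposure",
          "BsmtFinType1", "BsmtFinType2", "FireplaceQu", "GarageType",
          "GarageFinish", "GarageQual", "GarageCond", "PoolQC", "Fence", "MiscFeature"]
for col in cat_na:
    df[col] = df[col].fillna("None")                 # treat a missing value as its own category

for col in ["LotFrontage", "GarageYrBlt", "MasVnrArea"]:
    df[col] = df[col].fillna(df[col].median())       # numeric columns filled with the median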

Checking the statistical summary of the dataset

In descriptive statistics, summary statistics are used to summarize a
set of observations in order to communicate the largest amount of
information as simply as possible. Summary statistics describe your
sample data: they tell you something about the values in the data set,
including where the average lies and whether the data is skewed.

The describe() function computes a summary of statistics pertaining
to the DataFrame columns. It gives the count, mean, standard
deviation, minimum, quartiles (25%, 50%, 75%) and maximum of each
numeric column in a simple, understandable way.

Observations:
-> The maximum standard deviation, 8957.44, is observed in the LotArea
column.
-> The maximum SalePrice observed is 755000 and the minimum is 34900.
-> In the columns MSSubClass, LotArea, MasVnrArea, BsmtFinSF1,
BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, 1stFlrSF, 2ndFlrSF,
LowQualFinSF, GrLivArea, BsmtFullBath, HalfBath, TotRmsAbvGrd,
WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch,
MiscVal and SalePrice the mean is considerably greater than the median,
so these columns are positively skewed.
-> In the columns FullBath, BedroomAbvGr, Fireplaces, GarageCars,
GarageArea and YrSold the median is greater than the mean, so these
columns are negatively skewed.
-> In the columns MSSubClass, LotFrontage, LotArea, MasVnrArea,
BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, 1stFlrSF, 2ndFlrSF,
LowQualFinSF, GrLivArea, BsmtHalfBath, BedroomAbvGr,
TotRmsAbvGrd, GarageArea, WoodDeckSF, OpenPorchSF,
EnclosedPorch, 3SsnPorch, ScreenPorch, MiscVal and SalePrice there is a
considerable difference between the 75th percentile and the maximum, so
outliers are present.
Correlation Factor
The statistical relationship between two variables is referred to as
their correlation. The correlation factor represents the relation
between columns in a given dataset. A correlation can be positive,
meaning both variables move in the same direction, or negative,
meaning that when one variable's value increases, the other
variable's value decreases.

Correlation matrix and its visualization

A correlation matrix is a table representing the correlations between
pairs of variables in a given dataset. Building it is also a very important
pre-processing step in machine learning pipelines. The correlation
matrix is a data analysis representation used to summarize the data and
understand the relationships between the different variables of the
given dataset.
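A sketch of how the correlation matrix might be computed and visualized here (df is the pre-processed DataFrame; the figure size is arbitrary):

# Correlation matrix heatmap and correlation with the target
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.select_dtypes(include="number").corr()       # pairwise Pearson correlations of numeric columns
plt.figure(figsize=(18, 14))
sns.heatmap(corr, cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()

print(corr["SalePrice"].sort_values(ascending=False))  # correlation of each feature with the target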
Observations:
-> SalePrice is highly positively correlated with the columns
OverallQual, YearBuilt, YearRemodAdd, TotalBsmtSF, 1stFlrSF,
GrLivArea, FullBath, TotRmsAbvGrd, GarageCars, GarageArea.
-> SalePrice is negatively correlated with OverallCond, KitchenAbvGr,
EnclosedPorch and YrSold.
-> We observe multicollinearity in between columns, so we will be
using Principal Component Analysis (PCA).
Correlation with target variable

Observations:
-> MSSubClass, OverallCond, LowQualFinSF, BsmtHalfBath,
KitchenAbvGr, YrSold, EnclosedPorch and MiscVal are negatively
correlated with the target column; all the rest are positively correlated.
-> OverallQual and GrLivArea are highly positively correlated with the
target column.
-> MSSubClass, OverallCond, LowQualFinSF, BsmtHalfBath, YrSold,
MiscVal, MoSold and 3SsnPorch are the least correlated with the
target column.
Checking skewness and plotting the distribution plot

Skewness refers to distortion or asymmetry relative to the symmetrical
bell curve, or normal distribution, in a set of data. Besides positive and
negative skew, distributions can also be said to have zero or undefined
skew; the skewness value can therefore be positive, zero, negative, or
undefined.
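A sketch of the skewness check and distribution plots; since seaborn's distplot is deprecated in newer releases, histplot with a KDE is used here as an assumed equivalent:

# Skewness values and per-column distribution plots
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = df.select_dtypes(include="number").columns
print(df[num_cols].skew().sort_values(ascending=False))   # skewness of every numeric column

for col in num_cols:
    sns.histplot(df[col], kde=True)                        # distribution plot for one column
    plt.title(col)
    plt.show()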
Checking outliers and plotting it
An outlier is a data point which is distant from all other observations
available, i.e., a point that lies outside the overall distribution of the
dataset. A box plot is a method for graphically representing groups of
numerical data through their quartiles; outliers may be plotted as
individual points, while the rest of the numerical data is grouped
together and displayed as boxes in the diagram. In most cases a
threshold of 3 or -3 is used, i.e., if the Z-score of a value is higher
than 3 or lower than -3, that particular data point is identified as an
outlier.
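A sketch of outlier removal with the |Z| < 3 rule described above (assumes the missing values have already been filled):

# Remove rows where any numeric value has |z| >= 3
import numpy as np
from scipy.stats import zscore

num_cols = df.select_dtypes(include="number").columns
z = np.abs(zscore(df[num_cols]))          # z-score of every numeric value, column-wise
df = df[(z < 3).all(axis=1)]              # keep only rows where all |z| are below 3
print(df.shape)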
Treating skewness
Many algorithms in data science rest on statistical assumptions, and a
common one is that the data is approximately normally distributed.
So the closer the data is to normal, the better the predictions tend to
be. There are many ways of transforming skewed data, such as the log
transform, square-root transform, Box-Cox transform,
PowerTransformer, etc.

1. Log Transform
Log transformation is a data transformation method in which each
value x is replaced with log(x). The log transformation is arguably
the most popular of the transformations used to bring skewed data
approximately to normality.

2. Square Root Transform

The square root, x to x^(1/2) = sqrt(x), is a transformation with a
moderate effect on distribution shape: it is weaker than the logarithm
and the cube root. It is also used for reducing right skewness and has
the advantage that it can be applied to zero values. Applying a square
root transform compresses larger values relative to smaller ones,
pulling in the right tail.

3. Box-Cox Transform
In statistics, a power transform is a family of functions that are applied
to create a monotonic transformation of data using power functions.
This is a useful data transformation technique used to stabilize
variance, make the data more normal distribution-like, improve the
validity of measures of association such as the Pearson correlation
between variables and for other data stabilization procedures.
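A sketch of treating the skewed columns with PowerTransformer, which is the method mentioned later in the tools list; the 0.5 skewness threshold and the yeo-johnson method (which, unlike box-cox, also accepts zero and negative values) are assumptions:

# Apply a power transform to the skewed numeric columns
from sklearn.preprocessing import PowerTransformer

num_cols = df.select_dtypes(include="number").columns.drop("SalePrice")
skewed = df[num_cols].skew()
skew_cols = skewed[skewed.abs() > 0.5].index        # columns considered skewed (threshold is assumed)

pt = PowerTransformer(method="yeo-johnson")
df[skew_cols] = pt.fit_transform(df[skew_cols])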
Hardware and Software Requirements and Tools Used
For doing this project, the hardware used is a laptop with high end
specification and a stable internet connection. While coming to
software part, I had used anaconda navigator and in that I have used
Jupyter notebook to do my python programming and analysis.

For using a CSV file, Microsoft Excel is needed. In the Jupyter
notebook, I used several Python libraries to carry out this project;
they are mentioned below with proper justification:
1. Pandas- a library which is used to read the data, visualisation and
analysis of data.
2. NumPy- used for working with array and various mathematical
techniques.
3. Seaborn- visualization tool for plotting different types of plot.
4. Matplotlib- It provides an object-oriented API for embedding plots
into applications.
5. zscore - technique used to detect and remove outliers.
6. skew() - to check skewness; to treat the skewed data I have used
PowerTransformer.
7. PCA - used to reduce the data dimensions to 10 columns.
8. StandardScaler - used to scale the data before sending it to the
model.
9. train_test_split - to split the data into train and test sets.
10. Then I used different regression algorithms to find the best
model for predictions.
11. pickle - library used to save the model as a pickle or obj file.

MODEL/S DEVELOPMENT AND EVALUATION

Identification of possible problem-solving approaches (methods)

From the given dataset it can be concluded that this is a regression
problem, as the output column "SalePrice" has continuous output. So,
for further analysis of the problem, we have to import the
regression-related libraries in the Python environment.

The different libraries used for the problem solving are:


sklearn - Scikit-learn is a free machine learning library for Python. It
features various algorithms like support vector machine, random
forests, and k-neighbours, and it also supports Python numerical and
scientific libraries like NumPy and SciPy.

1. sklearn.linear_model
i. Linear Regression - Linear regression - is a linear model, e.g., a model
that assumes a linear relationship between the input variables (x) and
the single output variable (y). More specifically, that y can be
calculated from a linear combination of the input variables (x).

In statistics, linear regression is a linear approach to modelling the


relationship between a scalar response and one or more explanatory
variables. The case of one explanatory variable is called simple linear
regression; for more than one, the process is called multiple linear
regression.
ii. Lasso - In statistics and machine learning, lasso is a regression
analysis method that performs both variable selection and
regularization in order to enhance the prediction accuracy and
interpretability of the resulting statistical model.

Lasso regression is a type of linear regression that uses shrinkage.


Shrinkage is where data values are shrunk towards a central point, like
the mean. The lasso procedure encourages simple, sparse models (i.e.,
models with fewer parameters). This particular type of regression is
well-suited for models showing high levels of multicollinearity or when
you want to automate certain parts of model selection, like variable
selection/parameter elimination.

iii. Ridge - Ridge regression is a way to create a parsimonious model
when the number of predictor variables in a set exceeds the number
of observations, or when a data set has multicollinearity (correlations
between predictor variables). Ridge regression is particularly useful to
mitigate the problem of multicollinearity in linear regression, which
commonly occurs in models with large numbers of parameters. In
general, the method provides improved efficiency in parameter
estimation problems in exchange for a tolerable amount of bias.

iv. Support Vector Regressor (SVR, from sklearn.svm) - SVR is a
supervised learning algorithm that is used to predict continuous
values. SVR uses the same principle as SVMs: the basic idea is to
find the best-fit line, which in SVR is the hyperplane that contains
the maximum number of points within the margin.
2. sklearn.tree –
Decision Trees (DTs) are a non-parametric supervised learning method
used for classification and regression. The goal is to create a model
that predicts the value of a target variable by learning simple decision
rules inferred from the data features.

There are several advantages of using decision trees for predictive


analysis:

• Decision trees can be used to predict both continuous and


discrete values i.e., they work well for both regression and
classification tasks.
• They require relatively less effort for training the algorithm.
• They can be used to classify non-linearly separable data.
• They're very fast and efficient compared to KNN and other
algorithms.

Decision tree learning is one of the predictive modelling approaches


used in statistics, data mining and machine learning. It uses a decision
tree to go from observations about an item to conclusions about the
item's target value.
Decision Tree Regressor - Decision Tree is one of the most commonly
used, practical approaches for supervised learning. It can be used to
solve both Regression and Classification tasks with the latter being put
more into practical application. It is a tree-structured classifier with
three types of nodes.

Decision tree builds regression or classification models in the form of


a tree structure. It breaks down a dataset into smaller and smaller
subsets while at the same time an associated decision tree is
incrementally developed. The final result is a tree with decision nodes
and leaf nodes. A decision node (e.g., Outlook) has two or more
branches (e.g., Sunny, Overcast and Rainy), each representing values
for the attribute tested. Leaf node (e.g., Hours Played) represents a
decision on the numerical target. The topmost decision node in a tree
which corresponds to the best predictor called root node. Decision
trees can handle both categorical and numerical data.

3. sklearn.ensemble
The goal of ensemble methods is to combine the predictions of several
base estimators built with a given learning algorithm in order to
improve generalizability / robustness over a single estimator. The
sklearn.ensemble module includes two averaging algorithms based on
randomized decision trees: the RandomForest algorithm and the
Extra-Trees method. Both algorithms are perturb-and-combine
techniques specifically designed for trees. This means a diverse set of
classifiers is created by introducing randomness in the classifier
construction. The prediction of the ensemble is given as the averaged
prediction of the individual classifiers.

Boosting ensemble algorithms create a sequence of models that


attempt to correct the mistakes of the models before them in the
sequence. Once created, the models make predictions which may be
weighted by their demonstrated accuracy and the results are
combined to create a final output prediction.
The different types of ensemble techniques used in the model are:

i. Random Forest Regressor - It is a meta estimator that fits a number


of classifying decision trees on various sub-samples of the dataset and
uses averaging to improve the predictive accuracy and control over-
fitting. A Random Forest's nonlinear nature can give it a leg up over
linear algorithms, making it a great option. Random forest is a type of
supervised learning algorithm that uses ensemble methods (bagging)
to solve both regression and classification problems. The algorithm
operates by constructing a multitude of decision trees at training time
and outputting the mean/mode of prediction of the individual trees.

ii. AdaBoost Regressor - It is a meta-estimator that begins by fitting a


regressor on the original dataset and then fits additional copies of the
regressor on the same dataset but where the weights of instances are
adjusted according to the error of the current prediction.

iii. Gradient Boosting Regressor - GB builds an additive model in a


forward stage-wise fashion; it allows for the optimization of arbitrary
differentiable loss functions. In each stage a regression tree is fit on
the negative gradient of the given loss function.

4. sklearn.metrics - The sklearn.metrics module implements several
loss, score, and utility functions to measure model performance.
Some metrics might require probability estimates of the positive
class, confidence values, or binary decision values.

Important sklearn.metrics modules used in the project are:

i. mean_absolute_error - In statistics, mean absolute error is a


measure of errors between paired observations expressing the same
phenomenon. Examples of Y versus X include comparisons of
predicted versus observed, subsequent time versus initial time, and
one technique of measurement versus an alternative technique of
measurement.
The MAE measures the average magnitude of the errors in a set of
forecasts, without considering their direction. It measures accuracy
for continuous variables. Mean Absolute Error (MAE): MAE measures
the average magnitude of the errors in a set of predictions, without
considering their direction. It's the average over the test sample of the
absolute differences between prediction and actual observation
where all individual differences have equal weight.

ii. mean_squared_error - In statistics, the mean squared error or


mean squared deviation of an estimator measures the average of the
squares of the errors—that is, the average squared difference
between the estimated values and the actual value. MSE is a risk
function, corresponding to the expected value of the squared error
loss. Mean Square Error (MSE) is defined as Mean or Average of the
square of the difference between actual and estimated values.

iii. r2_score - In statistics, the coefficient of determination, denoted


R² or r² and pronounced "R squared", is the proportion of the variance
in the dependent variable that is predictable from the independent
variable. R-squared is a statistical measure of how close the data are
to the fitted regression line. It is also known as the coefficient of
determination, or the coefficient of multiple determinations for
multiple regressions.

5. sklearn.model_selection –

i. GridSearchCV - It is a library function that is a member of sklearn’s


model_selection package. It helps to loop through predefined hyper
parameters and fit your estimator (model) on your training set. So, in
the end, you can select the best parameters from the listed
hyperparameters. GridSearchCV combines an estimator with a grid
search to tune hyper-parameters. The method picks the optimal
parameters from the grid search and uses them with the estimator
selected by the user.
ii. cross_val_score - Cross validation helps to find out the over fitting
and under fitting of the model. In the cross validation the model is
made to run on different subsets of the dataset which will get multiple
measures of the model. If we take 5 folds, the data will be divided into
5 pieces where each part being 20% of full dataset. While running the
Cross validation the 1st part (20%) of the 5 parts will be kept out as a
holdout set for validation and everything else is used for training data.

This way we will get the first estimate of the model quality of the
dataset. In the similar way further iterations are made for the second
20% of the dataset is held as a holdout set and remaining 4 parts are
used for training data during process. This way we will get the second
estimate of the model quality of the dataset. These steps are repeated
during the cross-validation process to get the remaining estimate of
the model quality.

cross_val_score estimates the expected accuracy of the model on out-


of-training data (pulled from the same underlying process as the
training data). The benefit is that one need not set aside any data to
obtain this metric, and we can still train the model on all of the
available data.
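A sketch of a 5-fold cross-validation call for one candidate model; x and y denote the prepared feature matrix and the SalePrice target (assumed from the pre-processing steps), and Lasso is just an example estimator:

# Cross-validation score for one model
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso

scores = cross_val_score(Lasso(), x, y, cv=5, scoring="r2")
print(scores)
print("mean r2:", scores.mean())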

Testing of Identified Approaches


After completing the required pre-processing techniques, the data for
model building is separated into input and output columns before
being passed to train_test_split.
Scaling the data using Standard Scaler
For each value in a feature, StandardScaler subtracts the mean of that
feature and then divides by the feature's standard deviation, so that
every feature has zero mean and unit variance. Because this is a linear
rescaling, StandardScaler preserves the shape of the original distribution.
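A sketch of the scaling step; x is the encoded feature matrix assumed from the earlier input/output separation:

# Standardize the features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)     # each feature now has zero mean and unit variance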

Using PCA
An important machine learning method for dimensionality reduction
is called Principal Component Analysis. It is a method that uses simple
matrix operations from linear algebra and statistics to calculate a
projection of the original data into the same number or fewer
dimensions.
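A sketch of the reduction to the 10 components mentioned in the tools list (x_scaled is the standardized feature matrix from the previous step):

# Project the scaled features onto 10 principal components
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
x_pca = pca.fit_transform(x_scaled)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained by the 10 components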
Run and evaluate selected models
We will find the best random state value so that we can create our
train_test_split.
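A sketch of one common way to search for the best random_state with a baseline LinearRegression model; the 1-100 search range and the baseline model are assumptions, and x, y denote the prepared features and the target:

# Find the random_state that gives the best test r2 score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

best_score, r_state = 0.0, 0
for rs in range(1, 101):
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=rs, test_size=0.25)
    pred = LinearRegression().fit(x_tr, y_tr).predict(x_te)
    score = r2_score(y_te, pred)
    if score > best_score:
        best_score, r_state = score, rs
print("best random_state:", r_state, "r2:", best_score)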
Train Test Split

Scikit-learn is a Python library that offers various features for data


processing that can be used for classification, clustering, and model
selection. Model_selection is a method for setting a blueprint to
analyze data and then using it to measure new data. Selecting a proper
model allows you to generate accurate results when making a
prediction. If we have one dataset, then it needs to be split by using
the Sklearn train_test_split function first. By default, Sklearn
train_test_split will make random partitions for the two subsets.

The train_test_split is a function in Sklearn model selection for


splitting data arrays into two subsets: for training data and for testing
data. With this function, we don't need to divide the dataset manually.
The train_test_split function splits a single dataset for two different
purposes: training and testing. The training subset is for building your
model; the testing subset is for applying the model to unseen data in
order to evaluate its performance.
#Creating train_test_split using best random_state
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=r_state, test_size=0.25)

Now, we will run a for loop over all the regression algorithms and find
the best model. I called the algorithms, created an empty list named
models [], and then ran each model one by one, storing its results in
that list.

I also imported the metrics in order to interpret each model's output,
and computed the cross-validation score for each model.
Let’s check the code below:

rmse.append(RMSE)
print('\n\n')   #Last 2 lines of the loop
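A condensed sketch of the comparison loop described here; the exact set of models and stored metrics in the notebook may differ, and x, y, x_train, x_test, y_train, y_test come from the split shown earlier:

# Fit each regression algorithm and collect its metrics
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

models, r2_scores, cv_scores, mae_scores, rmse = [], [], [], [], []
algorithms = [LinearRegression(), Lasso(), Ridge(), KNeighborsRegressor(),
              DecisionTreeRegressor(), SVR()]
for model in algorithms:
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    models.append(type(model).__name__)
    r2_scores.append(r2_score(y_test, pred))
    cv_scores.append(cross_val_score(model, x, y, cv=5).mean())
    mae_scores.append(mean_absolute_error(y_test, pred))
    RMSE = np.sqrt(mean_squared_error(y_test, pred))
    rmse.append(RMSE)
    print('\n\n')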


As you can observe above, I made a for loop that calls all the algorithms one
by one and appends their results to the lists. The same is done to store the
MSE, RMSE, MAE, standard deviation and cross-validation score of each
model. The outputs of the different algorithms, along with the metric scores
obtained, are then collected into a data frame as follows:

We can see that the Ridge and Lasso regression algorithms are performing
well compared to the other algorithms. Now we will try hyperparameter
tuning to find the best parameters and try to increase their scores.

Key Metrics for success in solving the problem under consideration

The key metrics used here were r2_score, cross_val_score, standard
deviation, MAE, MSE and RMSE. We tried to find the best parameters
and to increase our scores by using hyperparameter tuning with the
GridSearchCV method.

1. Cross Validation:
Cross-validation helps to find out the over fitting and under fitting of
the model. In the cross validation the model is made to run on different
subsets of the dataset which will get multiple measures of the model.
If we take 5 folds, the data will be divided into 5 pieces
where each part being 20% of full dataset. While running the Cross-
validation the 1st part (20%) of the 5 parts will be kept out as a holdout
set for validation and everything else is used for training data. This way
we will get the first estimate of the model quality of the dataset.

In the similar way further iterations are made for the second 20% of
the dataset is held as a holdout set and remaining 4 parts are used for
training data during process. This way we will get the second estimate
of the model quality of the dataset. These steps are repeated during
the cross-validation process to get the remaining estimate of the
model quality.
2. R2 Score:
It is a statistical measure that represents the goodness of fit of a
regression model. The ideal value for r-square is 1. The closer the value
of r-square to 1, the better is the model fitted.

3. Mean Squared Error (MSE):


MSE of an estimator (of a procedure for estimating an unobserved
quantity) measures the average of the squares of the errors — that is,
the average squared difference between the estimated values and
what is estimated. MSE is a risk function, corresponding to the
expected value of the squared error loss. RMSE is the Root Mean
Squared Error.

4. Mean Absolute Error (MAE):


MAE measures the average magnitude of the errors in a set of
predictions, without considering their direction. It’s the average over
the test sample of the absolute differences between prediction and
actual observation where all individual differences have equal weight.
5. Hyperparameter Tuning:
There is a list of different machine learning models. They all are
different in some way or the other, but what makes them different is
nothing but input parameters for the model. These input parameters
are named as Hyperparameters. These hyperparameters will define
the architecture of the model, and the best part about these is that
you get a choice to select these for your model. You must select from
a specific list of hyperparameters for a given model as it varies from
model to model.

We are not aware of optimal values for hyperparameters which would


generate the best model output. So, what we tell the model is to
explore and select the optimal model architecture automatically. This
selection procedure for hyperparameter is known as Hyperparameter
Tuning. We can do tuning by using GridSearchCV.

GridSearchCV is a function that comes in Scikit-learn (or SK-learn)


model selection package. An important point here to note is that we
need to have Scikit-learn library installed on the computer. This
function helps to loop through predefined hyperparameters and fit
your estimator (model) on your training set. So, in the end, we can
select the best parameters from the listed hyperparameters.
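A sketch of the tuning step for the two best linear models; the alpha grids are illustrative assumptions:

# Grid search over the regularization strength of Lasso and Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso, Ridge

params = {"alpha": [0.001, 0.01, 0.1, 1, 10, 100]}
for estimator in (Lasso(), Ridge()):
    grid = GridSearchCV(estimator, params, cv=5, scoring="r2")
    grid.fit(x_train, y_train)
    print(type(estimator).__name__, grid.best_params_, grid.best_score_)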
After tuning the best algorithms, we can see that Lasso regression has
not improved. Now we will try ensemble techniques like
RandomForestRegressor, AdaBoostRegressor
and GradientBoostingRegressor to boost our scores.
After applying the ensemble techniques, we can see that
RandomForestRegressor is the best performing algorithm among them
all, giving an r2_score of 89.49 and a cross-validation score of 84.37.
It also has the lowest error values; the lower the RMSE, the better the
model. Now we will finalize this model.
Saving the model
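A sketch of saving the final model with pickle; the variable and file names are assumptions:

# Persist the tuned RandomForestRegressor to disk
import pickle

with open("house_price_model.pkl", "wb") as f:
    pickle.dump(final_rf_model, f)        # final_rf_model: the tuned RandomForestRegressor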

Using the test dataset and doing pre-processing

Here, I will perform the same steps as for the training dataset:
handling missing data, dropping unnecessary columns, encoding
categorical data, treating skewness, etc. Then I will scale the data and
apply PCA according to the best model's requirements.
Predicting over the test data
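A sketch of loading the saved model and predicting over the pre-processed test features; x_test_final and the file names are assumptions:

# Load the saved model and predict on the prepared test data
import pickle
import pandas as pd

with open("house_price_model.pkl", "rb") as f:
    model = pickle.load(f)

predictions = model.predict(x_test_final)     # test data after the same scaling/PCA pipeline
pd.DataFrame({"Predicted_SalePrice": predictions}).to_csv("test_predictions.csv", index=False)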

Visualizations

Now, we will look at the different plots made with this dataset in order
to gain insight into the data. Below are the codes for the plots and the
outputs obtained:

Importing required libraries and plotting graphs for categorical data
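A sketch of the count plots for the categorical columns (df holds the cleaned training data; figure sizes are arbitrary):

# Count plot for every categorical column
import matplotlib.pyplot as plt
import seaborn as sns

cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    plt.figure(figsize=(8, 4))
    sns.countplot(x=col, data=df)
    plt.xticks(rotation=45)
    plt.title(col)
    plt.show()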

Below are some of the graphs we obtain after running this code:
Observations:
-> The Residential Low Density zone has the highest count for sale.

-> Paved street type (road access to property) is the most common for sale.

-> No_alley_access is the most common alley type for sale.

-> Regular-shaped properties are the most common for sale, followed by
slightly irregular shapes.

-> The near-flat/level flatness type is more common for sale than the other
types.

-> The Inside lot configuration is the most common for sale, followed by
corner lots.

-> A gentle slope of property is the most common for sale.

-> In Condition1 and Condition2, Normal is more common for sale than the
others.

-> For building type (type of dwelling), single-family detached is the most
common for sale.

-> For house style (style of dwelling), 1Story is the most common for sale.

-> The Gable roof style is the most common for sale, followed by the Hip
roof style.

-> Houses with Standard (Composite) Shingle roof material are more
common for sale than others.

-> For most houses the masonry veneer type is None, meaning it was either
not present or not recorded at the time of data collection.

-> Houses with Average/Typical exterior material quality and exterior
condition are the most common for sale.

-> Cinder Block and Poured Concrete foundations are the most common for
sale, followed by Brick & Tile foundations.

-> In the Northpark Villa neighborhood, the condition is mostly Normal.

-> Typical basement quality is the most common for sale, followed by Good
quality.

-> Houses with Typical (slight dampness) basement condition are more
common for sale than others.

-> Basement exposure 'No' (no walkout or garden-level walls) is the most
common for sale.

-> For the rating of basement finished area, Unfinished is the most common
for sale, followed by Good Living Quarters. For the rating of a second
finished area (if multiple types), Unfinished is also the most common.

-> Houses with forced warm air furnace heating are the most common for
sale.

-> Houses with Excellent heating quality and condition are the most
common for sale.

-> Houses with central air conditioning are the most common for sale.

-> Houses with Standard Circuit Breakers & Romex electrical systems are
the most common for sale.

-> Typical kitchen quality is the most common for sale.

-> Typical home functionality is the most common for sale.

-> Houses with no fireplace are the most common for sale, followed by
those with a masonry fireplace on the main level and those with a
prefabricated fireplace in the main living area or a masonry fireplace in the
basement.

-> Garages attached to the home are the most common for sale, followed by
detached garages.

-> Unfinished garage interiors are the most common for sale, followed by
rough-finished and finished interiors.

-> Houses with Typical/Average garage quality and garage condition are the
most common for sale.

-> Houses with paved driveways are the most common for sale.

-> Houses with no fence are the most common for sale.

-> Conventional Warranty Deed sales are the most common type of sale.

-> Houses sold under normal sale condition are the most common.

Taking all continuous data and plotting histogram
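A sketch of the histograms for the continuous columns (df is the cleaned training data; figure size and bin count are arbitrary):

# Histograms of all numeric columns
import matplotlib.pyplot as plt

df.select_dtypes(include="number").hist(figsize=(20, 18), bins=30)
plt.tight_layout()
plt.show()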


Observations:
-> 20-(1-STORY 1946 & NEWER ALL STYLES), 30-(1-STORY 1945 & OLDER)
are high in number followed by 50 1-1/2 STORY FINISHED ALL AGES.

-> The mean LotFrontage (linear feet of street) is almost 71 feet. Most of the
houses have a LotFrontage below the mean value.

-> Overall quality is above average for most of the houses for sale.

-> Houses with an overall condition rating below 4 or above 7 are fewer in
number for sale, while ratings from 5 to 7 (Average, Above Average and
Good) are more common for sale.

-> Residential Low Density properties are also high in number, and street
access is mostly paved.

-> Original construction date: houses built after 2000 are high in count for
sale, and houses remodelled after 2010 are even higher in count.

-> Houses with a masonry veneer area below the mean of about 101 sq ft are
more common for sale.

-> BsmtFinSF1: Type 1 finished square feet-:Most houses have Type 1


finished square feet area of basement between 0 and 1500.
-> BsmtFinSF2 (Type 2 finished square feet): above 1000 houses have a
Type 2 finished type basement.

-> BsmtUnfSF (unfinished square feet of basement area): around 310 houses
have an unfinished basement area of about 100-500 sq ft.

-> TotalBsmtSF (total square feet of basement area): above 700 houses for
sale have less than 1062 sq ft of basement area.

-> 1stFlrSF (first floor square feet): around 310 houses have a first-floor
area between 800 and 1200 sq ft.

-> 2ndFlrSF: around 100 houses have a second-floor area of 500 to 1000
sq ft.

-> GrLivArea: Above grade (ground) living area square feet-: Most houses
have above ground living sq ft area in between 800 to 3000.

-> BsmtFullBath: Basement full bathrooms-:50% houses have no full


bathrooms in basement and in remaining houses most have 1 full bathroom
in basement and very few has 2 full bathrooms.

-> FullBath (full bathrooms above grade): 25% of houses have 1 full
bathroom above ground, 50% have 2, and very few have 3.

-> HalfBath (half baths above grade): around 700 houses have no half
bathroom, and very few have 1.

-> Bedroom: Bedrooms above ground (does NOT include basement


bedrooms)-: Most houses have 3 bedrooms above ground followed by 2
and 4.

-> Kitchen (kitchens above grade): most houses have 1 kitchen; very few
have 2.

-> TotRmsAbvGrd (total rooms above grade, not including bathrooms):
around 300 houses have 6 rooms, around 200 have 5, and about 250 have 7.
Very few have 12 or 14 rooms.

-> Fireplaces: Number of fireplaces-: Most houses have 0 fireplaces followed


by 1.

-> GarageCars (size of garage in car capacity): most houses have a garage
with 2-car capacity.

-> GarageArea: Size of garage in square feet-: Most houses have Garage
area in between 200 to 800.

-> woodDeckSF: Wood deck area in square feet-: More than 50% of houses
have 0 Wood Deck sq ft area and rest have in between 0 to 400.

-> OpenPorchSF: Open porch area in square feet-: 25% of houses have 0
open porch sq ft area and rest have in between 0 to 300.

-> EnclosedPorch: Enclosed porch area in square feet-: Almost all houses
have 0 enclosed porch sq ft area.

-> ScreenPorch: Screen porch area in square feet-: Almost all houses have 0
screen porch area sq ft.

-> Houses with a Type 1 finished square feet area under about 444.7 sq ft
are more common.

-> Properties out for sale were mostly built in the years 1977, 1970, 1997,
2006, 1957, 1965, 1947, 1937, 2003 and 1974, which means that properties
built from roughly 1945 to 2006 are, across all frequencies, largely out for
sale.

-> Sale Price: above 500 houses have a sale price between 100000 and
200000. Very few houses have a sale price of 600000 or 700000.
Bivariate Analysis with SalePrice for categorical
data
Above are some of the plots obtained and its value counts for bivariate
analysis of all columns with the target variable SalePrice.
Observations:
-> 2-STORY 1946 & NEWER dwelling types have higher sale prices than the
others.

-> LotFrontage (linear feet of street connected to property): lot frontage does
not impact the sale price much, since houses with different sale prices have
the same lot frontage area.

-> LotArea (lot size in square feet): LotArea doesn't affect the sale price of the
houses much, as different sale prices are available within the lot area range of
0 to 20000; in fact, some houses with a very large lot area have only a
moderate sale price.

-> OverallQual: Rates the overall material and finish of the house-: Overall
quality is directly proportional to the sale price of houses

-> OverallCond (rates the overall condition of the house): average-condition
houses are the most common, and their sale prices are higher.

-> YearBuilt & YearRemodAdd: houses built most recently, after 2000, have
high sale prices in comparison to those built in earlier years; similar is the
case with the remodelling date.

-> MasVnrArea

-> BsmtFinSF1 (Type 1 finished square feet): the total sq ft of basement area
is directly proportional to the sale price. Houses with a higher number of full
bathrooms seem to have higher sale prices.

-> GrLivArea (above grade / ground living area square feet) has a linear
relationship with sale price.

-> BsmtFullBath (basement full bathrooms): houses with one basement full
bathroom have higher sale prices.

-> Kitchen (kitchens above grade): houses with 1 kitchen above ground have
higher sale prices in comparison to those having 2 kitchens.

-> Bedrooms above grade: houses with 4 bedrooms have higher sale prices,
followed by those with 2 bedrooms.

-> Fireplaces: Number of fireplaces-: Houses with 1 and 2 fireplaces have higher
prices in comparison to houses having 0 or 3 fireplaces

-> HalfBath, Wood deck, Enclosed porch, three season porch, screen porch,
pool area, Miscval do not have impact on sale price

Plotting for continuous data


Observations:
-> LotFrontage: Linear feet of street connected to property-: Lot
frontage does not impact much on sale price since houses with
different sale price are having same Lot frontage area
-> LotArea: Lot size in square feet-: LotArea doesn't affect sale price of
the houses much, as can be seen different sale price are available
within the Lot area range of 0 to 20000.In fact some houses where Lot
Area is very large have moderate sale price
-> OverallQual: Rates the overall material and finish of the house-:
Overall quality is directly proportional to the sale price of houses
-> YearBuilt & YearRemodAdd: houses built most recently have high sale
prices in comparison to those built in earlier years; similar is the case with
the remodelling date.
-> BsmtFinSF1 (Type 1 finished square feet): the total sq ft of basement
area is directly proportional to the sale price. Houses with a higher number
of full bathrooms seem to have higher sale prices.
-> Kitchen: Kitchens above grade-: houses with 1 kitchen above
ground have high sale price in comparison to those having 2 kitchens
-> Fireplaces: Number of fireplaces-: Houses with 1 and 2 fireplaces
have higher prices in comparison to houses having 0 or 3 fireplaces
-> Wood deck, Enclosed porch, three season porch, screen porch, pool
area, Miscval do not have impact on sale price

CONCLUSION
Key Findings and Conclusions of the Study
-> After getting an insight into this dataset, we were able to understand
that housing prices are determined on the basis of different features.
-> First, I loaded the train dataset and did the EDA process and other
pre-processing techniques like checking and removing skewness, handling
the outliers present, filling the missing data, visualizing the
distribution of data, etc.
-> Then I did the model training, building the models and finding the
best model on the basis of the different metric scores obtained, such as
Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, etc.
-> I got Lasso and Ridge regression as the best algorithms among all, as
they gave the highest r2_score and cross_val_score. Then, to find the
best parameters and improve the scores, we performed hyperparameter
tuning.
-> As the scores did not increase, we also tried ensemble techniques
like RandomForestRegressor, AdaBoostRegressor and
GradientBoostingRegressor to boost the scores.
Finally, we concluded that RandomForestRegressor was the best
performing algorithm, as it had lower error values and a lower RMSE
compared to the other algorithms. It gave an r2_score of 89.47 and a
cross_val_score of 84.37, the highest scores among all.
-> I saved the model as a pickle file with a filename so it can be used
whenever required.
-> I predicted the values obtained and saved them.
-> Then we used the test dataset and applied the same pre-processing
pipeline to it.
-> After treating skewness, I loaded the saved model and made
predictions over the test data.
-> From this project, I learnt how to handle train and test data
separately and how to predict values from them. This will be useful
when working on a real-time case study, as we can receive new data
from the client, load the best model we obtained, and proceed with the
analysis of the new data.
-> The final result is the set of predictions obtained from the new data,
saved separately.
-> Overall, we can say that this dataset is good for predicting housing
prices using regression analysis, and RandomForestRegressor is the best
performing model we obtained.
-> The data could be improved by adding more features that are
positively correlated with the target variable, having fewer outliers,
more normally distributed values, etc.
Learning Outcomes of the Study in respect of Data
Science
1. Price Prediction modeling – This allows predicting the prices of
houses & how they are varying in nature considering the different
factors affecting the prices in the real time scenarios.

2. Prediction of Sale Price – This helps to predict the future revenues


based on inputs from the past and different types of factors related to
real estate & property related cases. This is best done using predictive
data analytics to calculate the future values of houses. This helps in
segregating houses, identifying the ones with high future value, and
investing more resources on them.

3. Deployment of ML models – The Machine learning models can also


predict the houses depending upon the needs of the buyers and
recommend them, so customers can make final decisions as per the
needs.

4. How to deal with outliers when every row has at least one value
with Z > 3.

5. How to do visualisation when the data has a high standard deviation
and no classification.

6. Ways to select features and to do hyperparameter tuning efficiently.

7. Ways of removing skewness, and that even the best methods are still
not versatile when it comes to data containing zero values.
Limitations of this work and Scope for Future
Work

1. The biggest limitation I observed was that not all categories of a
particular feature were available in the training data. So, if a new
category appears in the test data, the model is not able to identify it.

Example: MSZoning has 8 categories


A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low-Density Park
RM Residential Medium Density

2. However, in the training dataset only 5 of these categories are present.
If the other 3 categories appear in the test data in the future, it will be
difficult for the model to identify them and predict correctly.

3. The high skewness of the data reduces the model's effectiveness.

4. Many features have more than 50% NaN values; imputing them can
decrease effectiveness, while dropping them leads to a loss of data.

5. The efficiency of the model could be increased by selecting better
methods to remove outliers and skewness, and by organizing the model
search so that changing some parameters does not require re-running
all the models again.
