Sathyabama: House Price Prediction
by
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI
SALAI, CHENNAI – 600 119
MARCH - 2020
www.sathyabama.ac.in
BONAFIDE CERTIFICATE
This is to certify that this project report is the bonafide work of K PAVAN (Reg. No.
37110555) and T RAGHUL (Reg. No. 37110613) who carried out the project entitled
“HOUSE PRICE PREDICTION MODEL” under my supervision from August 2019 to
March 2020.
Internal Guide
Dr. Ashok Kumar, M.E., Ph.D.
DECLARATION
We, K PAVAN and T RAGHUL, hereby declare that the Project Report entitled “HOUSE
PRICE PREDICTION MODEL”, done by us under the guidance of Dr. ASHOK KUMAR, M.E.,
Ph.D., Department of Computer Science and Engineering at Sathyabama Institute of
Science and Technology, is submitted in partial fulfillment of the requirements
for the award of Bachelor of Engineering degree in Computer Science and Engineering.
DATE:
ACKNOWLEDGEMENT
I convey my thanks to Dr. T. Sasikala, M.E., Ph.D., Dean, School of Computing, Dr. S.
Vigneswari, M.E., Ph.D., and Dr. L. Lakshmanan, M.E., Ph.D., Heads of the Department
of Computer Science and Engineering for providing me necessary support and details at
the right time during the progressive reviews.
I would like to express my sincere and deep sense of gratitude to my Project Guide
Dr. ASHOK KUMAR, M.E., Ph.D., Professor, for his valuable guidance, suggestions and
constant encouragement, which paved the way for the successful completion of my project work.
I wish to express my thanks to all Teaching and Non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many ways for
the completion of the project.
Abstract:
A house price index usually represents the summarized price changes of residential housing.
To make it easier for a family to search for a house, we make the prediction more precise by
asking for the required square footage and the number of bedrooms and bathrooms.
With a preloaded dataset and its data features, practical data pre-processing and a creative
feature engineering method are examined in this paper. The paper also proposes a regression
technique in machine learning to predict house prices.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION
MACHINE LEARNING
ADVANTAGES AND APPLICATIONS
2. LITERATURE SURVEY
4. EXPERIMENTAL METHODS AND ALGORITHMS
HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
PYTHON
ANACONDA
SYSTEM DESIGN
USE CASE DIAGRAM
SEQUENCE DIAGRAM
ACTIVITY DIAGRAM
5. RESULTS AND DISCUSSION
MODULE IMPLEMENTATION
SOFTWARE TESTING
RESULTS
REFERENCES
APPENDIX
A. PAPER ACCEPTANCE MAIL
B. PLAGIARISM REPORT
C. JOURNAL PAPER
D. SOURCE CODE
LIST OF FIGURES
ANACONDA
ANACONDA NAVIGATOR
SYSTEM DESIGN
USE CASE DIAGRAM
SEQUENCE DIAGRAM
ACTIVITY DIAGRAM
CHAPTER 1
INTRODUCTION:
Data is at the heart of technical innovation, and predictive models now make many kinds of
results achievable. Machine learning is used extensively in this approach. Machine
learning means providing a valid dataset on which later predictions are based: the
machine itself learns how much importance a particular event may have on the entire
system based on its pre-loaded data and predicts the result accordingly. Modern
applications of this technique include predicting stock prices, predicting the
possibility of an earthquake, and predicting company sales, among many others.
Our aim is to predict a house price based on the customer's needs and priorities. Future
prices are predicted by analyzing previous market trends and price ranges as well as
upcoming developments. The system involves a website which accepts the customer's
specifications and then applies a neural network.
Machine Learning
It is a subset of artificial intelligence (AI). It gives a system the ability to automatically
learn and improve by itself. It focuses on the development of computer programs that can
access data and learn by themselves. The process of learning begins with observations based
on the examples that we provide. The aim is to make computers learn by themselves without
the need for human intervention.
Machine Learning Methods
Machine learning can be classified into three types, namely supervised, unsupervised
and reinforcement learning. Supervised machine learning algorithms apply
what has been learned in the past to new data in order to predict future events. The
algorithm analyzes a known training dataset and produces a function to predict outputs.
After training, the system can provide outputs for new inputs. It compares its outputs with
the correct, intended outputs, finds the errors and modifies the model accordingly to make
it more practical and useful.
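The compare-and-correct cycle described above can be sketched in a few lines. This is an illustrative assumption, not the code of this project: it fits a single weight `w` in the model `y = w * x` by gradient descent on made-up data, and the names `fit_weight`, `learning_rate` and `epochs` are hypothetical.

```python
def fit_weight(xs, ys, learning_rate=0.01, epochs=500):
    """Fit y = w * x by repeatedly correcting w from the prediction error."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Move w a small step in the direction that reduces the error.
        w -= learning_rate * grad
    return w


# Made-up training data that follows y = 3x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
w = fit_weight(xs, ys)
print(w)
```

Each pass compares the predictions `w * x` with the intended outputs `y` and adjusts `w`, so the learned weight approaches 3.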
While even experts often cannot be sure where and through which correlation a production
error in a plant fleet arises, machine learning offers the possibility of identifying the
error early, which saves downtime and money. Machine learning is now also used in the
medical field. In the future, after collecting huge amounts of data, apps will be able to
warn a patient if the doctor wants to prescribe a drug that the patient cannot tolerate.
The app can also suggest alternative options by taking into account the genetics of the patient.
2. Traffic Predictions:
Whenever we visit a new place or are not sure about the route, we generally use
maps. A map shows the distance and the amount of time it takes to cover it, and it also
provides information about traffic congestion. Machine learning
predicts the traffic on a particular route by analyzing the previous days' traffic on that
route at the same time of day. Hence machine learning helps us in predicting traffic.
3. Video Surveillance:
A single person cannot monitor multiple cameras at the same time, and that is where
machine learning is used. Nowadays video cameras are powered by AI, which helps by
tracking unusual behaviour, for example a person standing motionless for a long time
looking at the camera. It has been used extensively in video surveillance and has been
extremely useful.
Machine learning has been extensively used in filtering spam and malware emails. It
detects new and varied malware and protects users against it.
Whenever we search for anything on the web, the search engine (Google, for example)
keeps track of what users open after the results are shown. It checks whether
users click the top search results or the bottom ones. Machine learning thus
makes the search engine better with time.
7. Product recommendations:
Every time a product is recommended to you, whether after you purchase a certain
product from a website or because it is a new product, machine learning is what
recommends products to customers.
It helps in detecting online money fraud. Many payment gateways have started to
implement this technique to prevent fraud. Companies like PayPal use machine learning to
detect fraud.
INTRODUCTION TO PROJECT
Housing is one of the most valuable economic assets an individual can purchase during
adult life. Hence we need to be extremely careful before buying a house and make sure we
pay the right amount of money for it.
CHAPTER 2
LITERATURE SURVEY
House price prediction is a crucial topic in real estate. The literature attempts to derive
useful knowledge from historical data of property markets. Machine learning techniques are
applied to analyze historical property transactions in Australia to discover useful models
for house buyers and sellers. The analysis reveals a high discrepancy between house prices
in the most expensive and the most affordable places within Melbourne city. Moreover,
experiments demonstrate that the combination of Stepwise and Support Vector Machine, based
on mean squared error measurement, is a competitive approach.
In this article we describe our solution for the “House Prices: Advanced Regression
Techniques” machine learning competition, which was held on the Kaggle platform. The
goal is to predict the house sale price from attributes like house area, year of
construction, etc. In our solution, we use classic machine learning algorithms and our
original methods, which are described here. At the end of the competition, we took 18th
place among 2124 participants from all over the world.
3. Real Estate Value Prediction Using Linear Regression
The real estate market is among the most competitive in terms of
pricing and keeps fluctuating. It is one of the prime fields in which to apply the ideas of
machine learning to improve and forecast prices with high accuracy.
Three factors influence the price of a house: physical
condition, concept and location. The current framework estimates the price
of houses without any expectation of market prices and price increases. The objective
of the paper is the prediction of residential prices for customers considering their
financial plans and wishes. By analyzing past market trends and price ranges, and upcoming
developments, future prices are predicted. This study aims to
predict house prices in Mumbai city with Linear Regression. It will help clients to invest
in a property without approaching a broker. The result from this research
showed that linear regression gives the minimum prediction error, which is 0.3713.
In this study, we attempt to predict Dutch housing market trends using text mining
and machine learning as an application of data science methods in finance. Our
main goal is to predict the short-term upward or downward trend of the average house
price in the Dutch market by using text data collected from Twitter. Twitter is widely
used and has been proven to be a helpful source of information. However, Twitter, text
mining (tokenization, bag-of-words, n-grams, weighted term frequencies) and machine
learning (classification algorithms) have not yet been combined in order to predict
housing market trends in the short term. In this study, tweets including predefined search
words are collected based on domain knowledge, and the corresponding text is grouped by
month as documents. Then words and word sequences are transformed into numerical values.
These values serve as attributes to predict whether the housing market moves up or down,
i.e. we approach this as a binomial classification problem relating text data of a
month with (up or down) trends for the subsequent month.
Our main results reveal that there is a correlation between the (weighted) frequency of
words and short-term housing trends; in other words, we were able to make accurate
predictions of short-term trends using multiple machine learning and text mining
techniques combined.
Real estate is the least transparent industry in our ecosystem. Housing prices keep
changing day in and day out and are sometimes hyped rather than being based on
valuation. Predicting housing prices with real factors is the main crux of our
research. Here we aim to base our evaluations on every basic
parameter that is considered while determining the price. We use various
regression techniques in this pathway, and our results are not the sole
determination of one technique; rather, they are the weighted mean of various techniques
to give the most accurate results. The results showed that this approach yields lower error
and higher accuracy than the individual algorithms applied. We also propose to use
real-time neighborhood details from Google Maps to get exact real-world valuations.
Whether the Chinese housing market continues to prosper or not is related to the
development of China, and it also has an impact on world finance.
Thus forecasting the house price index is extremely important and challenging.
In this paper we propose an unsupervised learnable neuron
model (DNM) by including the nonlinear interactions between excitation and inhibition
on dendrites.
We use the DNM to fit the House Price Index (HPI) data and then forecast the trends of the
Chinese housing market. To verify the effectiveness of the DNM, we use a standard
statistical model (i.e., the exponential smoothing (ES) model) to
make a performance comparison. Three quantitative statistical metrics, including
normalized mean square error, absolute percentage of error, and correlation coefficient,
are used to evaluate the forecasting performance of the two models. Experimental results
demonstrate that the proposed DNM is better than
ES in all three quantitative statistical metrics.
It is well known that many economic parameters may more or less influence
real estate price variation. In addition, bankers and investors are also
interested in future real estate price changes. There has been no
appropriate model that includes these factors for price prediction. Here, the
influences of most macroeconomic parameters
on real estate price variation are investigated before establishing the price fluctuation
prediction model. A back propagation neural network (BPN) and a radial basis
function neural network (RBF) are employed to
determine the nonlinear model for real estate price variation prediction in Taipei,
Taiwan, based on leading and simultaneous economic indices. The prediction
results are compared with the public Cathay House Price Index or the Sinyi
Home Price Index. The mean absolute error and root mean square error of
the price variation are selected as
the performance indices. Public data on Taipei, Taiwan real estate price
variation during 2005 ~ 2015 are adopted for analysis and prediction
comparison.
8. Predicting house sale price using fuzzy logic, Artificial
Neural Network and K-Nearest Neighbor
The price of land and houses is usually determined first by
the seller; however, determining the proper price in the sales process affects
the buyer's desire to choose and bid. As special characteristics in Indonesia, the tax
object value (NJOP) and location parameters have a high influence
on the price. In this paper we propose the prediction of land and house prices
using several methods. Fuzzy logic, Artificial Neural Network and
K-Nearest Neighbor are compared in this paper to find the most
appropriate method that can be used as a reference for
determining the price by sellers. Google Maps is used to represent the spatial
data for the prediction parameters. The variables used in the methods are the NJOP of
the land, the location, the age, the NJOP of the house, and
the valuable location of the land. The methods are tested by comparing
the real transaction price and the prediction using the MAPE
formula.
The housing sector in India has been predicted to grow at 30-35% over
the next decade. In terms of employment provided, it is second only to the agricultural
sector. Housing is one of the main domains of real estate. Pune is emerging as one of the
main metropolitan cities of India and has many prestigious educational institutions and
IT parks. This makes it a perfect place to buy homes. Vagueness in the
prices of homes makes it challenging for customers to pick their dream house.
The interests of both buyer and seller should be satisfied so that they do not
overestimate or underestimate the price. This housing price prediction model acts as a
helping hand for a buyer, a seller or a real estate agent to make a better-informed
decision. To achieve this, diverse features are selected as input from the feature set and
various algorithms are applied, such as Random Forest and Decision Tree.
CHAPTER 3
Multiple Linear Regression: it shows the relationship between two or more explanatory
variables and a scalar response variable. Each value of the independent variables is
associated with a value of the dependent variable.
Limitations
The dependent variable y must be continuous. The independent variables can be of any
type. The dependent variable is usually affected by the independent variables.
Proposed System
Linear Regression is a technique that helps to identify the relationship between a scalar
response (or dependent variable) and one or more explanatory variables (or independent
variables). The case of one explanatory variable is called simple linear regression.
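As a sketch of the relationship described above, the multiple linear regression model can be written as follows. The symbols are our own notation and not taken from the project code: y is the response (here the house price), x_1 to x_p are the explanatory variables (for example area, bathrooms and bedrooms), the beta values are the coefficients to be learned, and epsilon is the error term.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

With p = 1 this reduces to simple linear regression.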
Advantages
Space complexity is very low: it only needs to save the weights at the end of
training. Prediction is also cheap, hence it is a low-latency algorithm.
Good interpretability
Feature importance is generated at the time of model building. With the help of the
hyperparameter lambda, you can handle feature selection and hence achieve
dimensionality reduction.
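The report does not state which regularized variant is meant. Assuming an L1 (lasso) penalty, which is the usual way a lambda hyperparameter leads to feature selection, the training objective could be sketched as below. Here n is the number of training rows, p the number of features, and x_{ij} the value of feature j in row i; all of this notation is an assumption.

```latex
\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

A larger lambda drives more coefficients to exactly zero, which removes the corresponding features from the model.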
FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is put
forth with a very general plan for the project and some cost estimates. The feasibility
study of the proposed system is carried out to ensure that the proposed system is
not a burden to the company. The study considers three aspects:
1. Economical feasibility
2. Technical feasibility
3. Social feasibility
ECONOMICAL FEASIBILITY
This study is carried out to check whether the right amount of funds is invested in
the model. It is done to avoid pouring an excess amount of money into a single
model and to make sure the model stays well within the budget. It is extremely
important to spend only the right amount of funds on a model.
TECHNICAL FEASIBILITY
It makes sure that the technical requirements are limited to what we can offer. Any
system developed should not place a high demand on technical resources, since that puts a
burden on the client. It also checks the project's potential, that is, what it can do once
developed.
SOCIAL FEASIBILITY
It is carried out to check how the system interacts with other systems. It checks the level
of acceptance of the system by the user. This includes training the user to use the system
efficiently, which is a necessity. Since the client is the final user of the system, the
client can criticize the system, but this should be done in a disciplined and meaningful
manner.
CHAPTER 4
HARDWARE REQUIREMENTS
The most common set of requirements defined by any operating system or software
application is the physical computer resources, also known as hardware. A hardware
requirements list is often accompanied by a hardware compatibility list, especially in case
of operating systems. The minimal hardware requirements are as follows,
1. PROCESSOR : PENTIUM IV
2. RAM : 8 GB
SOFTWARE REQUIREMENTS
Software requirements deal with defining the resource requirements and prerequisites that
need to be installed on a computer for an application to function. These
prerequisites need to be installed separately before the software is installed. The
minimal software requirements are as follows,
2. IDE : ANACONDA
Python Language
Syntax is simple
A module in Python may have one or more classes and free functions
Applications of Python Programming
Web Applications
We can create web apps in Python by using frameworks and CMSs, for example
Django, Flask, Pyramid, Plone and Django CMS. Sites like Mozilla, Reddit,
Instagram and PBS are written in Python.
There are many libraries in Python that can be used for scientific and numeric
computing. SciPy and NumPy are used in general-purpose computing, EarthPy is used
for earth science, AstroPy is used for astronomy, and so on. Python is also used in machine
learning, data mining and deep learning.
Python is slow but is great for creating prototypes. For example, you can use Pygame
to create a game prototype. If you are satisfied with the prototype, then you can
build the app using C or C++.
Python has been used by many students, and several companies teach Python to
their employees. It has a lot of features and capabilities. The syntax is simple and it is
one of the easiest languages to learn.
Compared to other languages like C/C++, Python is slower. However, Python can be easily
extended with C/C++: we can write code in C/C++ and create a Python wrapper around it.
This gives us two advantages: first, our code is as fast as the original C/C++ code, and
second, it is very easy to code in Python. For example, OpenCV-Python is a Python wrapper
around the original C++ implementation.
Anaconda is free
It is used for scientific computing, data science, statistical analysis and machine learning.
What is Anaconda Navigator?
Anaconda Navigator is a desktop graphical user interface (GUI) included in the
Anaconda distribution. It allows us to launch applications provided in the Anaconda
distribution and easily manage conda packages, environments and channels without
using command-line commands. It is available for Windows, macOS and Linux.
Applications Provided In Anaconda Distribution
The Anaconda distribution comes with the following applications along with Anaconda
Navigator.
1. JupyterLab
2. Jupyter Notebook
3. Qt Console
4. Spyder
5. Glueviz
6. Orange3
7. RStudio
> JupyterLab: This is the extensible working environment for interactive and the
reproducible computing, based on the Jupyter Notebook and architecture.
> Jupyter Notebook: This is a web-based, interactive computing notebook
environment. We can edit and run human-readable docs while describing the
data analysis.
> Qt Console: It is a PyQt GUI that supports inline figures and proper multiline
editing.
> VS Code: It is a streamlined code editor with support for development operations like
debugging, task running and version control.
Glueviz: It is used for multidimensional data visualization across files. It explores
relationships within and among related datasets.
RStudio: This is a set of integrated tools designed to help you be more productive
with R. It includes R essentials and notebooks.
New Features of Anaconda 5.3
• Compiled with the latest Python release: Anaconda 5.3 is compiled with Python 3.7, taking
advantage of Python’s speed and feature improvements.
• Better Reliability: The reliability of Anaconda is improved in the latest release by
capturing and storing the package metadata for the installed packages.
• Users deploying TensorFlow can make use of MKL 2019 for Deep Neural Networks. These
Python binary packages are provided to achieve high CPU performance.
• New packages added: Over 230 packages have been
updated or added in the new release.
• Work in progress: There is a casting bug in NumPy with Python 3.7, but the
team is currently working on patching it until NumPy is updated.
Flask
Flask is a Python web framework that lets us build web applications. It was developed by
Armin Ronacher. Flask’s framework is more explicit than Django’s framework and it is
also easier to learn because it needs less base code to implement a simple web application.
METHOD DESCRIPTION
SYSTEM DESIGN
Architecture
[Figure: system architecture. Stages shown: Collection of Dataset; Data Loading and
Pre-Processing; Determine Dependent and Independent Variables; Calculate the Prediction;
Determine the Prediction Results]
UML DIAGRAMS
o UML stands for Unified Modeling Language.
o It is used in the field of object-oriented software engineering.
o The goal is for UML to become a common language for creating models of object
oriented computer software.
o It consists of two components: a meta-model and a notation.
The Unified Modeling Language is a standard language for specifying, visualizing,
constructing and documenting the artifacts of a software system, as well as for business
modeling and other non-software systems.
o It has been proven successful in the modeling of large and complex systems.
o The UML is a very important part of developing object-oriented software and the
software development process. It uses graphical notations to show the design of
software projects.
GOALS:
The Primary goals are as follows:
1. Provide users a ready-to-use, expressive visual modeling Language so that they can
develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of OO tools market.
6. Support higher level development concepts such as collaborations,
frameworks, patterns and components.
7. Integrate best practices.
USE CASE DIAGRAM:
A use case diagram is a behavioural diagram. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals
(represented as use cases), and any dependencies between those use cases. The main
purpose of a use case diagram is to show what system functions are performed for which
actor. Roles of the actors in the system can be depicted.
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is an interaction diagram that
shows how processes operate with one another and in what order. Sequence diagrams are
sometimes called event diagrams, event scenarios, and timing diagrams.
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. Activity diagrams can be used
to describe the business and operational step-by-step workflows of components in a
system. An activity diagram shows the overall flow of control.
CHAPTER 5
Module Implementation
Collection of Dataset
The dataset used in this project contains parameters such as the area in square feet, the
location, and the number of bedrooms and bathrooms of each property. The selling price is
the dependent variable, which depends on several independent variables.
Data Preprocessing
Import Libraries
A library is a collection of modules. The first step is to import the libraries that we
require in our system. They provide functions that can be invoked without writing the
required code ourselves. pandas is one of the most popular Python libraries for data
science. We have imported the pandas library and named it pd.
Import the Dataset
Many datasets come in CSV format. First we have to locate the directory of the CSV file and
read it using a function called read_csv, which is found in the pandas library.
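A minimal sketch of this loading step is shown below. To keep it self-contained, the CSV text is held in a string and wrapped in `io.StringIO`; this is an assumption made for illustration, since in the project `read_csv` would be given the path of the CSV file. The three rows are taken from the sample data in the Results section.

```python
import io

import pandas as pd

# A few rows in the same column layout as the sample data of this report.
csv_text = """area_type,availability,location,size,society,total_sqft,bath,balcony,price
Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2,1,39.07
Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5,3,120
Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2,3,62
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df.head())
```

The empty society field in the third row is read as a missing value (NaN), which is why missing data has to be handled during preprocessing.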
Sometimes our data is in qualitative form, that is, we have text as our data. We can find
categories in text form. It is harder for machines to understand and process text
than numbers, since the models are based on mathematical equations and
calculations. Therefore, we have to encode the categorical data.
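One common way to encode a categorical column is one-hot encoding. The sketch below uses `pandas.get_dummies` on a small made-up frame; the report does not say which encoding the project uses, so the choice of one-hot encoding and of the location column is an assumption.

```python
import pandas as pd

# Small made-up sample with one text (categorical) column.
df = pd.DataFrame({
    "location": ["Whitefield", "Kothanur", "Whitefield", "Uttarahalli"],
    "total_sqft": [1170, 1200, 1800, 1440],
})

# Replace "location" with one 0/1 indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["location"], dtype=int)
print(encoded)
```

Each row now has a 1 in the column of its own location and 0 elsewhere, so the model only sees numbers.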
Now we should split our dataset into two sets: a training set and a test set. We
train our machine learning models on the training set, i.e. the models
try to learn any correlations in the training set, and then we test the models on
the test set to check how accurately they can predict. A common choice is to allocate 80%
of the dataset to the training set and the remaining 20% to the test set.
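The mechanics of an 80/20 split can be sketched with the standard library alone. The helper `split_train_test` below is hypothetical and written only to show what happens; the project itself uses scikit-learn's `train_test_split`, as shown in the Results section.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=10):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for 10 dataset rows
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Shuffling before cutting matters: without it, the test set would simply be the first rows of the file, which may not be representative.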
Regression analysis describes the relationship between independent variables and the
dependent variable. It predicts value of dependent variable by analyzing the value of
independent variables.
Regression coefficient
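As an illustration of how regression coefficients are obtained, the sketch below fits a linear model by least squares with NumPy. The data is made up and generated from known coefficients so that the result can be checked; it is not the project's dataset, and the variable names are assumptions.

```python
import numpy as np

# Made-up data generated from price = 10 + 0.05 * sqft + 4 * bath.
sqft = np.array([1000.0, 1200.0, 1500.0, 1800.0, 2600.0])
bath = np.array([2.0, 2.0, 3.0, 2.0, 5.0])
price = 10 + 0.05 * sqft + 4 * bath

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(sqft), sqft, bath])

# Least-squares solution of X @ coef = price.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, coef_sqft, coef_bath = coef
print(intercept, coef_sqft, coef_bath)
```

Each coefficient states how much the predicted price changes when the corresponding variable increases by one unit while the others stay fixed.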
Prediction
A prediction is the output of an algorithm that has been trained on a dataset and is then
applied to new data. Finally, our model predicts the house price
based on user inputs.
SOFTWARE TESTING
General
In general, system testing is a type of testing in which
the main aim is to make sure that the system performs efficiently and seamlessly. Testing
is applied to a program with the main aim of discovering an as yet undiscovered error,
an error which could otherwise have damaged the future of the software. A test case that
has a high probability of discovering an error is considered successful. Such a
successful test helps to uncover still unknown errors.
TEST CASE
Testing, as already explained, is the process of discovering all possible weak points
in the finalized software product. Testing helps to check the working of sub-assemblies,
components, assemblies and the complete result. The software is taken through different
exercises with the main aim of making sure that it meets the business requirements
and user expectations and does not fail abruptly. Several types of tests are used today.
Each test type addresses a specific testing requirement.
Testing Techniques
A test plan is a document which describes the approach, scope, resources and
schedule of the intended testing activities. It identifies, among other things, the test
items, the features which are to be tested, the testing tasks, who will do each task, the
degree of tester independence, the test environment, the test design techniques
and the entry and exit criteria to be used along with the rationale for their choice, and
any risks requiring contingency planning. It can also be referred to as the record of the
test planning process. Test plans are usually prepared with significant input from test
engineers.
(I) UNIT TESTING
Unit testing involves the design of test cases that validate the
internal program logic. All decision branches and internal code flow are
validated. It is carried out
after an individual unit is completed and before integration. The unit test thus performs
basic tests at the component level and tests a particular business process, system
configuration, etc. It ensures that each unique path of the process
performs precisely to the documented specifications and contains clearly defined inputs
and expected results.
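A unit test with clearly defined inputs and expected results could look like the sketch below, written with Python's built-in `unittest` module. The function `price_per_sqft` is a hypothetical unit invented for this example and is not part of the project code; the input values come from the Kothanur row of the sample data.

```python
import unittest


def price_per_sqft(price, total_sqft):
    """Hypothetical unit under test: price divided by area."""
    if total_sqft <= 0:
        raise ValueError("total_sqft must be positive")
    return price / total_sqft


class PricePerSqftTest(unittest.TestCase):
    def test_known_value(self):
        # Defined input (51, 1200) with its expected result.
        self.assertAlmostEqual(price_per_sqft(51, 1200), 0.0425)

    def test_rejects_non_positive_area(self):
        # The error branch is validated as well.
        with self.assertRaises(ValueError):
            price_per_sqft(51, 0)


# Run the two test cases and collect the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PricePerSqftTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The two cases exercise both decision branches of the unit: the normal path and the rejection of invalid input.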
(II) INTEGRATION TESTING
Integration tests are designed to test integrated software components to
determine whether they really execute as a single program or application. The testing is
event driven and thus is concerned with the basic outcome of fields. Integration tests
demonstrate that although the components were individually satisfactory, as shown by
successful unit testing, the combination of components is also correct. This type of
testing is specifically aimed at exposing the issues that arise from the combination of
components.
(III) FUNCTIONAL TESTING
Functional tests provide systematic demonstrations that the functions tested
are available as specified by the technical requirements, the system documentation and the
user manual.
(IV) SYSTEM TESTING
System testing, as the name suggests, is the type of testing which ensures that the
software system meets the business requirements and aims. The configuration is
tested here to ensure known and predictable results. System testing relies
on the description of processes and their flows, stressing pre-driven process links and
the points of integration.
(V) WHITE BOX TESTING
White box testing is the type of testing in which the internal components of the system
software are open to and can be inspected by the tester. It is therefore a complex type of
testing process. All the data structures, components, etc. are tested by the tester
to find possible bugs or errors. It is used in situations in which black box testing is
incapable of finding a bug. It is a complex type of testing which takes more time to apply.
(VI) BLACK BOX TESTING
Black box testing is the type of testing in which the internal components of the
software are hidden and only the input and output of the system are available to the
tester to find a bug. It is therefore a simple type of testing. A programmer with basic
knowledge can also perform this type of testing. It is less time consuming than white box
testing. It is very successful for software which is less complex and straightforward in
nature. It is also less costly than white box testing.
RESULTS:
area_type           | availability  | location                 | size      | society | total_sqft | bath | balcony | price
Super built-up Area | 19-Dec        | Electronic City Phase II | 2 BHK     | Coomee  | 1056       | 2    | 1       | 39.07
Plot Area           | Ready To Move | Chikka Tirupathi         | 4 Bedroom | Theanmp | 2600       | 5    | 3       | 120
Built-up Area       | Ready To Move | Uttarahalli              | 3 BHK     |         | 1440       | 2    | 3       | 62
Super built-up Area | Ready To Move | Lingadheeranahalli       | 3 BHK     | Soiewre | 1521       | 3    | 1       | 95
Super built-up Area | Ready To Move | Kothanur                 | 2 BHK     |         | 1200       | 2    | 1       | 51
Super built-up Area | Ready To Move | Whitefield               | 2 BHK     | DuenaTa | 1170       | 2    | 1       | 38
Super built-up Area | 18-May        | Old Airport Road         | 4 BHK     | Jaades  | 2732       | 4    |         | 204
Super built-up Area | Ready To Move | Rajaji Nagar             | 4 BHK     | Brway G | 3300       | 4    |         | 600
Super built-up Area | Ready To Move | Marathahalli             | 3 BHK     |         | 1310       | 3    | 1       | 63.25
Plot Area           | Ready To Move | Gandhi Bazar             | 6 Bedroom |         | 1020       | 6    |         | 370
Super built-up Area | 18-Feb        | Whitefield               | 3 BHK     |         | 1800       | 2    | 2       | 70
Plot Area           | Ready To Move | Whitefield               | 4 Bedroom | Prrry M | 2785       | 5    | 3       | 295
Super built-up Area | Ready To Move | 7th Phase JP Nagar       | 2 BHK     | Shncyes | 1000       | 2    | 1       | 38
Built-up Area       | Ready To Move | Gottigere                | 2 BHK     |         | 1100       | 2    | 2       | 40
Plot Area           | Ready To Move | Sarjapur                 | 3 Bedroom | Skityer | 2250       | 3    | 2       | 148
Super built-up Area | Ready To Move | Mysore Road              | 2 BHK     | PrntaEn | 1175       | 2    | 2       | 73.5
Super built-up Area | Ready To Move | Bisuvanahalli            | 3 BHK     | Prityel | 1180       | 3    | 2       | 48
Super built-up Area | Ready To Move | Raja Rajeshwari Nagar    | 3 BHK     | GrrvaGr | 1540       | 3    | 3       | 60
These are sample rows from the preloaded dataset used in our model.
Graph:
Before deleting anomalies:
[Figure: bathrooms per property]
[Figure: property price by square feet]
Importing libraries:
We use the pandas library to read the train and test files.
import pandas as pd  # used for data analysis
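As a small illustration of reading tabular data with pandas, the sketch below parses two rows copied from the sample table. The rows are supplied as an in-memory string (an assumption made so the example runs on its own); a real run would pass the path of the dataset file to read_csv instead.

```python
import io

import pandas as pd

# Two rows in the same column layout as the sample table; the second row has
# an empty 'society' field, which pandas reads as a missing value.
csv_text = (
    "area_type,availability,location,size,society,total_sqft,bath,balcony,price\n"
    "Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2,1,39.07\n"
    "Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2,3,62\n"
)

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)                        # (rows, columns)
print(sample["society"].isnull().sum())    # number of missing society values
```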
Data preprocessing:
This step counts the rows for each area type in the dataset and removes unwanted columns.
We take 80% of our data as training data and 20% as test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
The dependent variable in our model is price (since it relies on other factors for its value).
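The 80/20 split can be checked on a small synthetic array. This is a minimal sketch: the array contents are illustrative assumptions, and only the train_test_split call with test_size=0.2 and random_state=10 comes from the text above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows with 3 feature columns and one target value each.
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# 80% of the rows go to training and 20% to testing.
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so repeated runs evaluate the model on the same test rows.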
Linear regression:
def predict_price(location, sqft, bath, bhk):
    # position of the one-hot column for this location (empty if the location is unknown)
    loc_index = np.where(X.columns == location)[0]
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if loc_index.size > 0:
        x[loc_index[0]] = 1
    return regressor.predict([x])[0]
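To make the input layout concrete: the first three entries of the feature vector hold the square feet, bathrooms and bedrooms, and the remaining entries are one-hot indicators for the location. The sketch below uses a hypothetical column list and hand-picked weights in place of the trained regressor, so the names columns, weights and build_features are assumptions for illustration only.

```python
import numpy as np

# Hypothetical feature layout: three numeric columns, then one column per location.
columns = np.array(["total_sqft", "bath", "bhk", "Hebbal", "Whitefield"])
# Hand-picked weights standing in for a trained regressor's coefficients.
weights = np.array([0.05, 2.0, 3.0, 20.0, 10.0])


def build_features(location, sqft, bath, bhk):
    """Return the numeric inputs followed by a one-hot encoding of the location."""
    x = np.zeros(len(columns))
    x[0], x[1], x[2] = sqft, bath, bhk
    matches = np.where(columns == location)[0]
    if matches.size > 0:  # unknown locations leave every indicator at 0
        x[matches[0]] = 1
    return x


features = build_features("Whitefield", 1000, 2, 2)
print(features)
print(features @ weights)  # weighted sum, as a linear model would compute
```

An unknown location simply contributes nothing to the weighted sum, which is why the lookup has to tolerate a missing column rather than raise an error.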
Screenshot:
CHAPTER 6
In this paper, several tests have been performed using the linear regression algorithm for
house price prediction. The algorithm predicts the prices of new properties that are going
to be listed by taking some input variables and predicting a justified price. Building this
predictive sale price model was a great learning experience. In future, different methods
suited to time-series data will be used to obtain smaller prediction errors, and more data
will be used to get better results.
Paper Acceptance mail:
Plagiarism report:
C. Journal Paper
K Pavan, T Raghul
Abstract:
Usually, the house price index represents the summarized price changes of residential housing. To make it easier for
a family to search for a house, we make the search more precise by asking for the required square feet and the number of
bedrooms and bathrooms. With a preloaded dataset and data features, a practical data pre-processing and creative feature
engineering method is examined in this paper. The paper also proposes a regression technique in machine learning to
predict house prices.
1. INTRODUCTION:
Data is at the heart of technical innovations; achieving any result is now possible using
predictive models. Machine learning is extensively used in this approach. Machine learning
means providing a valid dataset on which further predictions are based: the machine itself
learns how much importance a particular event may have on the entire system, based on its
pre-loaded data, and predicts the result accordingly. Modern applications of this technique
include predicting stock prices, predicting the possibility of an earthquake and predicting
company sales, and the list of possibilities is endless.
Our aim is to predict a house price based on the customer's needs and priorities. By
analyzing previous market trends and price ranges, and also upcoming developments, future
prices will be predicted. The functioning involves a website which accepts the customer's
specifications and then combines them with the application of a neural network.
Machine Learning
It is a subset of artificial intelligence (AI). It provides a system with the ability to
automatically learn and improve by itself. It focuses on the development of computer
programs that can access data and learn by themselves. The process of learning begins with
observations based on the examples that we provide. The aim is to make computers learn by
themselves without the need of a human.
Machine Learning Methods
Machine learning can be classified into three types, namely supervised, unsupervised and
reinforcement learning. Supervised machine learning algorithms can apply what has been
learned in the past to new data to predict future events. The algorithm analyses a known
training dataset and produces a function to predict outputs. After training, the system
provides outputs for new inputs. It can also compare its output with the correct, intended
output, find errors and modify the model to make it more practical and useful.
In contrast, unsupervised machine learning algorithms are the ones which do not require any
supervision. They are used when the sample data used for training is not classified or
labelled. As the name suggests, the model itself finds the hidden patterns and insights. The
system may or may not produce the right output, but it explores the data and can draw
inferences from datasets on its own.
Semi-supervised machine learning is a combination of supervised and unsupervised learning.
In semi-supervised learning, an algorithm learns from a dataset that includes both labeled
and unlabeled data, usually mostly unlabeled. Generally it is chosen when the sample data
requires skilled resources in order to train from it. Otherwise, it doesn't require
additional resources.
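The supervised workflow described above (learn from a known training dataset, compare predictions with the intended output, and adjust the model) can be sketched with a one-variable linear model. The data, learning rate and iteration count are illustrative assumptions, not values from this project.

```python
import numpy as np

# Known training data generated from y = 3x + 2 (an illustrative choice).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

w, b = 0.0, 0.0        # model parameters, initially uninformed
learning_rate = 0.05

for _ in range(2000):
    predicted = w * x + b
    error = predicted - y          # compare with the correct, intended output
    # Adjust the parameters in the direction that reduces the mean squared error.
    w -= learning_rate * 2 * np.mean(error * x)
    b -= learning_rate * 2 * np.mean(error)

print(round(float(w), 3), round(float(b), 3))
```

After training, the learned w and b are close to the slope and intercept used to generate the data, so the model can produce outputs for inputs it has not seen.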
Reinforcement machine learning is a learning method that works based on feedback.
Reinforcement learning differs from supervised learning in not needing labelled
input/output pairs to be presented.
Linear Regression
Multiple linear regression shows the relationship between two or more explanatory variables
and a scalar response variable. Each independent variable value is associated with a
dependent variable value.
Limitations
The dependent variable y must be continuous. The independent variables can be of any type.
The dependent variable is usually dependent on the independent variables.
Advantages
Space complexity is very low: it just needs to save the weights at the end of training.
Hence it's a high
2. RAM : 8 GB
3. PROCESSOR : 2.4 GHZ
4. MAIN MEMORY : 8 GB RAM
5. PROCESSING SPEED : 600 MHZ
6. HARD DISK DRIVE : 1 TB
7. KEYBOARD : 104 KEYS
SOFTWARE REQUIREMENTS
Software requirements deal with defining the resource requirements and prerequisites that
need to be installed on a computer to provide functioning of an application. These
requirements need to be installed separately. The minimal software requirements are as follows,
4. ARCHITECTURE OF PROPOSED SYSTEM:
[Figure: flow diagram of the proposed system. Recoverable stage labels: "Collection of" (truncated), "Import Libraries", "Data Loading and Pre-Processing", "Determine Dependent and Independent Value"]
Data Preprocessing
It is the process of transforming raw, complex data into systematic, understandable
knowledge. It finds missing and redundant data in the dataset and thus brings uniformity
to the dataset. In our dataset, however, there were no missing values.
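Finding missing and redundant data, as described above, can be sketched with pandas. The three rows below are illustrative assumptions (one repeated row and one row with a missing value), not rows from the project dataset.

```python
import pandas as pd

# Illustrative rows only: the second row repeats the first, the third lacks a value.
raw = pd.DataFrame(
    {
        "location": ["Whitefield", "Whitefield", "Hebbal"],
        "total_sqft": [1170.0, 1170.0, None],
        "price": [38.0, 38.0, 95.0],
    }
)

print(raw.isnull().sum())       # missing values per column
print(raw.duplicated().sum())   # number of fully repeated rows

# Remove the repeated row, then the row with a missing value.
clean = raw.drop_duplicates().dropna()
print(clean.shape)
```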
7.Conclusion
CODING:
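# importing libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt  # needed for the plt.* calls used below

matplotlib.rcParams["figure.figsize"] = (20, 10)

# raw string so the backslashes in the Windows-style path are kept literally
dataset = pd.read_csv(r'..\dataset\Bengaluru_House_Data.csv')
print(dataset.head(10))
print(dataset.shape)

# Data preprocessing
# count of rows per area type
print(dataset.groupby('area_type')['area_type'].agg('count'))
# drop the columns that are not used as features
dataset.drop(['area_type', 'society', 'availability', 'balcony'], axis='columns', inplace=True)
print(dataset.shape)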
## data cleaning
print(dataset.isnull().sum())
dataset.dropna(inplace=True)
print(dataset.shape)
print(dataset['size'].unique())
print(dataset['total_sqft'].unique())

#### defining a function to check whether the value is float or not
def is_float(x):
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

# rows whose 'total_sqft' is not a plain number (ranges, other units)
print(dataset[~dataset['total_sqft'].apply(is_float)].head(10))
#### defining a function to convert the range of column values to a single value
def convert_sqft_to_num(x):
    tokens = x.split('-')
    if len(tokens) == 2:
        try:
            # assumption: a range such as '2100 - 2850' is replaced by its midpoint;
            # the source listing does not show this line
            return (float(tokens[0]) + float(tokens[1])) / 2
        except ValueError:
            return None
    try:
        return float(x)
    except ValueError:
        return None

print(convert_sqft_to_num('290'))
print(convert_sqft_to_num('2100 - 2850'))
print(convert_sqft_to_num('4.46Sq. Meter'))
dataset['total_sqft'] = dataset['total_sqft'].apply(convert_sqft_to_num)
print(dataset['total_sqft'].head(10))
print(dataset.loc[30])
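## feature engineering
# assumption: 'bhk' (number of bedrooms) is taken from the leading number in 'size',
# e.g. '2 BHK' -> 2; later steps use this column but the source listing does not define it
dataset['bhk'] = dataset['size'].apply(lambda s: int(s.split(' ')[0]))
print(dataset.head(10))
# price is in lakhs (1 lakh = 100000 rupees), so this gives rupees per square foot
dataset['price_per_sqft'] = dataset['price'] * 100000 / dataset['total_sqft']
print(dataset['price_per_sqft'])
print(len(dataset['location'].unique()))
location_stats = dataset.groupby('location')['location'].agg('count').sort_values(ascending=False)
print(location_stats[0:10])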
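#### locations with 10 or fewer occurrences
print(len(location_stats[location_stats <= 10]))
location_stats_less_than_10 = location_stats[location_stats <= 10]
print(location_stats_less_than_10)

#### redefining the 'location' column as 'other' value where location count
#### is <= 10
dataset['location'] = dataset['location'].apply(
    lambda x: 'other' if x in location_stats_less_than_10 else x)
print(dataset['location'].head(10))
print(len(dataset['location'].unique()))

### checking 'total_sqft'/'bhk': if it is very low, the row is treated as an anomaly
# assumption: the threshold is not given in the source listing;
# 300 square feet per bedroom is an illustrative value
MIN_SQFT_PER_BHK = 300
print(dataset.shape)
dataset = dataset[~(dataset['total_sqft'] / dataset['bhk'] < MIN_SQFT_PER_BHK)]
print(dataset.shape)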
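### checking columns where 'price_per_sqft' is very low ###
print(dataset['price_per_sqft'].describe())

### function to remove these extreme cases of very high or low values ###
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for _, subdf in df.groupby('location'):
        mean = np.mean(subdf['price_per_sqft'])
        std = np.std(subdf['price_per_sqft'])
        # assumption: keep rows within one standard deviation of the location mean;
        # the filter itself is missing from the source listing
        kept = subdf[(subdf['price_per_sqft'] > mean - std) &
                     (subdf['price_per_sqft'] <= mean + std)]
        df_out = pd.concat([df_out, kept], ignore_index=True)
    return df_out

dataset = remove_pps_outliers(dataset)
print(dataset.shape)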
### plotting graph where we can visualize that properties with same location
### and the price of 3 bhk properties with higher 'total_sqft' is less than 2 bhk properties
def plot_scatter_chart(df, location):
    # rows for this location, split by number of bedrooms
    bhk2 = df[(df['location'] == location) & (df['bhk'] == 2)]
    bhk3 = df[(df['location'] == location) & (df['bhk'] == 3)]
    matplotlib.rcParams['figure.figsize'] = (15, 10)
    plt.scatter(bhk2['total_sqft'], bhk2['price'], color='blue', label='2 BHK', s=50)
    plt.scatter(bhk3['total_sqft'], bhk3['price'], marker='+', color='green', label='3 BHK', s=50)
    plt.ylabel('Price')
    plt.title(location)
    plt.legend()
    plt.show()

plot_scatter_chart(dataset, "Hebbal")
plot_scatter_chart(dataset, "Rajaji Nagar")
### defining a function where we can get the rows where 'location' is the same
### but the property with fewer 'bhk' has a higher price than the property
### which has more 'bhk'. So, it's also an anomaly and we have to remove
### these properties
def remove_bhk_outliers(df):
    exclude_indices = []
    for _, location_df in df.groupby('location'):
        # per-location statistics of price_per_sqft for each bhk value
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df['price_per_sqft']),
                'std': np.std(bhk_df['price_per_sqft']),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk - 1)
            # assumption: the guard on 'count' is missing from the source listing;
            # requiring more than 5 smaller properties is an illustrative value
            if stats and stats['count'] > 5:
                exclude_indices.extend(
                    bhk_df[bhk_df['price_per_sqft'] < stats['mean']].index.values)
    return df.drop(index=exclude_indices)

dataset = remove_bhk_outliers(dataset)
print(dataset.shape)
# re-plot the same locations after removing the bhk outliers
plot_scatter_chart(dataset, "Hebbal")
plot_scatter_chart(dataset, "Rajaji Nagar")

matplotlib.rcParams['figure.figsize'] = (20, 10)
plt.hist(dataset['price_per_sqft'], rwidth=0.8)
plt.show()
### exploring bathroom feature
print(dataset['bath'].unique())
# histogram of the number of bathrooms per property
plt.hist(dataset['bath'], rwidth=0.8)
plt.xlabel('Number of Bathrooms')
plt.ylabel('Count')
plt.show()
print(dataset.shape)
print(dataset.head())
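## one hot encoding the 'location' column
dummies = pd.get_dummies(dataset['location'])
print(dummies.head())
# assumption: the source listing does not show how the dummies are attached; here the
# non-numeric and helper columns are dropped so that X starts with total_sqft, bath, bhk
dataset = pd.concat([dataset.drop(['location', 'size', 'price_per_sqft'], axis='columns'),
                     dummies], axis='columns')
print(dataset.head())
print(dataset.shape)
X = dataset.drop(['price'], axis='columns')
y = dataset['price']
print(X.shape)
print(y.shape)

## splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)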
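from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

regressor = LinearRegression()
regressor.fit(X_train, y_train)
print(regressor.score(X_test, y_test))

# 5 random 80/20 splits for cross-validation
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
print(cross_val_score(regressor, X, y, cv=cv))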
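from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.tree import DecisionTreeRegressor

def find_best_model_using_gridsearch(X, y):
    algos = {
        'linear_regression': {
            'model': LinearRegression(),
            # assumption: this entry is empty in the source listing;
            # 'fit_intercept' is used as an illustrative hyperparameter
            'params': {
                'fit_intercept': [True, False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1, 2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                # 'mse' was renamed to 'squared_error' in scikit-learn 1.0
                'criterion': ['squared_error', 'friedman_mse'],
                'splitter': ['best', 'random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'], config['params'], cv=cv,
                          n_jobs=-1, return_train_score=False)
        gs.fit(X, y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })
    return pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])

model_scores = find_best_model_using_gridsearch(X, y)
print(model_scores)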
### so after running grid search, linear regression model has the best score
regressor = LinearRegression()
regressor.fit(X, y)

def predict_price(location, sqft, bath, bhk):
    # position of the one-hot column for this location (empty if the location is unknown)
    loc_index = np.where(X.columns == location)[0]
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if loc_index.size > 0:
        x[loc_index[0]] = 1
    return regressor.predict([x])[0]

print(predict_price('Indira Nagar', 1000, 3, 3))

# exporting the trained model
import pickle
with open('bangalore_home_prices_model.pickle', 'wb') as f:
    pickle.dump(regressor, f)

# exporting columns
import json
# assumption: 'columns' is not defined in the source listing; the feature names of X are used
columns = list(X.columns)
with open("columns.json", "w") as f:
    f.write(json.dumps(columns))
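The exported model and column list would later be read back, for example by the website mentioned in the introduction. The sketch below shows one way this round trip could look. It is self-contained, so it trains a tiny stand-in model on made-up rows and writes to a temporary folder; the file names, data and column list here are assumptions, not the project's artifacts.

```python
import json
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model and column list so the example runs on its own.
columns = ["total_sqft", "bath", "bhk", "Hebbal"]
X = np.array([[1000, 2, 2, 0], [1500, 3, 3, 1], [1200, 2, 2, 1], [800, 1, 1, 0]])
y = np.array([50.0, 110.0, 80.0, 35.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "model.pickle")
    columns_path = os.path.join(folder, "columns.json")

    # Export step: save the fitted model and the feature names.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(columns_path, "w") as f:
        json.dump(columns, f)

    # Import step: read both files back.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(columns_path) as f:
        loaded_columns = json.load(f)

# Build an input vector using the loaded column order and predict with the loaded model.
x = np.zeros(len(loaded_columns))
x[0], x[1], x[2] = 1000, 2, 2
x[loaded_columns.index("Hebbal")] = 1
same = bool(np.allclose(loaded_model.predict([x]), model.predict([x])))
print(loaded_columns, same)
```

Saving the column order alongside the model matters because the input vector must be built in exactly the order the model was trained on.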