Sathyabama: House Price Prediction

Chatbots and virtual assistants like Siri, Alexa, Cortana are powered by machine learning. They learn from conversations to improve over time. 2. Recommendation Systems: Recommendation engines used by Netflix, Amazon, Spotify use machine learning to understand user preferences and recommend new content. 3. Image and Speech Recognition: Computer vision and speech recognition technologies used in self-driving cars, smartphones, security systems use machine learning. 4. Fraud Detection: Machine learning helps detect fraudulent transactions by analyzing spending patterns and flagging anomalies. 5. Precision Medicine: Machine learning analyzes genetic data and medical records to deliver customized treatment plans. 6. Predictive Maintenance:

Uploaded by

divyanshuroka

HOUSE PRICE PREDICTION

Submitted in partial fulfillment of the requirements for the award of
Bachelor of Engineering degree in Computer Science and Engineering

by

K PAVAN (Reg. No. 37110555)

T RAGHUL (Reg. No. 37110613)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SCHOOL OF COMPUTING

SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY

(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI – 600 119

MARCH - 2020
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY

(DEEMED TO BE UNIVERSITY)
Accredited with “A” grade by NAAC
Jeppiaar Nagar, Rajiv Gandhi Salai, Chennai – 600 119
www.sathyabama.ac.in

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

BONAFIDE CERTIFICATE

This is to certify that this project report is the bonafide work of K PAVAN (Reg. No.
37110555) and T RAGHUL (Reg. No. 37110613) who carried out the project entitled
“HOUSE PRICE PREDICTION MODEL” under my supervision from August 2019 to
March 2020.

Internal Guide
Dr. Ashok Kumar, M.E., Ph.D.

Head of the Department

Submitted for Viva voce Examination held on

Internal Examiner External Examiner
DECLARATION

We, K PAVAN and T RAGHUL, hereby declare that the Project Report entitled “HOUSE
PRICE PREDICTION MODEL”, done by us under the guidance of Dr. Ashok Kumar,
M.E., Ph.D., Department of Computer Science and Engineering at Sathyabama
Institute of Science and Technology, is submitted in partial fulfillment of the requirements
for the award of Bachelor of Engineering degree in Computer Science and Engineering.

DATE:

PLACE: CHENNAI SIGNATURE OF THE CANDIDATE
ACKNOWLEDGEMENT

I am pleased to acknowledge my sincere thanks to the Board of Management of
SATHYABAMA for their kind encouragement in doing this project and for completing it
successfully. I am grateful to them.

I convey my thanks to Dr. T. Sasikala, M.E., Ph.D., Dean, School of Computing, and to Dr. S.
Vigneswari, M.E., Ph.D., and Dr. L. Lakshmanan, M.E., Ph.D., Heads of the Department
of Computer Science and Engineering, for providing me the necessary support and details at
the right time during the progressive reviews.

I would like to express my sincere and deep sense of gratitude to my Project Guide,
Dr. Ashok Kumar, M.E., Ph.D., Professor, for his valuable guidance, suggestions and
constant encouragement, which paved the way for the successful completion of my project work.

I wish to express my thanks to all Teaching and Non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many ways for
the completion of the project.
Abstract:

A house price index usually summarizes the price changes of residential housing. To make it
easier for a family to search for a house, we make the prediction more precise by asking for the
required square footage and the number of bedrooms and bathrooms.
Using a preloaded dataset and its features, this paper examines practical data pre-processing
and feature engineering methods. The paper also proposes a regression technique in
machine learning to predict house prices.

Keywords: House Price, Regression Technique, Machine Learning
TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES

CHAPTER No. TITLE

1. INTRODUCTION
MACHINE LEARNING
ADVANTAGES AND APPLICATIONS

2. LITERATURE SURVEY

3. AIM AND SCOPE OF PROJECT
EXISTING SYSTEM
PROPOSED SYSTEM
FEASIBILITY STUDY

4. EXPERIMENTAL METHODS AND ALGORITHMS
HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
PYTHON
ANACONDA
SYSTEM DESIGN
USE CASE DIAGRAM
SEQUENCE DIAGRAM
ACTIVITY DIAGRAM

5. RESULTS AND DISCUSSION
MODULE IMPLEMENTATION
SOFTWARE TESTING
RESULTS

6. CONCLUSION AND FUTURE WORK

REFERENCES

APPENDIX
A. PAPER ACCEPTANCE MAIL
B. PLAGIARISM REPORT
C. JOURNAL PAPER
D. SOURCE CODE

LIST OF FIGURES

FIGURE NAME

ANACONDA
ANACONDA NAVIGATOR
SYSTEM DESIGN
USE CASE DIAGRAM
SEQUENCE DIAGRAM
ACTIVITY DIAGRAM
CHAPTER 1

INTRODUCTION:

Data is at the heart of technical innovation, and predictive models now make many kinds of
results achievable. Machine learning is used extensively in this approach. Machine
learning means providing a valid dataset on which further predictions are based: the
machine itself learns how much importance a particular event may have on the entire
system, based on its pre-loaded data, and predicts the result accordingly. Modern
applications of this technique include predicting stock prices, predicting the
possibility of an earthquake and predicting company sales, and the list of possibilities
is endless.

Our aim is to predict a house price based on the customer's needs and priorities. Future prices
are predicted by analyzing previous market trends and price ranges, as well as upcoming
developments. The system involves a website which accepts the customer's
specifications and then applies a neural network.

Machine Learning
Machine learning is a subset of artificial intelligence (AI). It gives a system the ability to
learn and improve automatically. It focuses on the development of computer programs that can
access data and learn from it by themselves. The process of learning begins with observations
based on the examples that we provide. The aim is to make computers learn by themselves
without the need for human intervention.
Machine Learning Methods
Machine learning can be classified into supervised, unsupervised, semi-supervised
and reinforcement learning. Supervised machine learning algorithms apply
what has been learned in the past to new data in order to predict future events. The algorithm
analyzes a known training dataset and produces a function to predict outputs.
After training, the system can provide outputs for new inputs. It can also compare its outputs
with the correct, intended outputs, find the errors and modify the model accordingly to make it
more practical and useful.
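The supervised workflow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the training pairs (house area mapped to price) are made-up numbers, and `fit_line` and `predict` are hypothetical helper names, not code from this project.

```python
# Hypothetical labeled training data: house area (sq ft) -> price (in thousands).
train_areas = [1000.0, 1500.0, 2000.0, 2500.0]
train_prices = [200.0, 300.0, 400.0, 500.0]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned function to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Training: learn a function from the known input/output pairs.
model = fit_line(train_areas, train_prices)

# Compare the model's outputs with the correct, intended outputs.
errors = [predict(model, x) - y for x, y in zip(train_areas, train_prices)]
mean_squared_error = sum(e * e for e in errors) / len(errors)
```

The mean squared error computed at the end is the kind of feedback signal that a training procedure uses to adjust the model.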

In contrast, unsupervised machine learning algorithms do not require any supervision.
They are used when the sample data used for training is neither classified nor labeled.
As the name suggests, the model itself finds the hidden patterns and insights. The
system may or may not produce the right output, but it explores the data and can draw
inferences from datasets on its own.
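As a small illustration of finding structure without labels, the sketch below groups one-dimensional values with k-means clustering. The technique is chosen here only as a familiar example of unsupervised learning; the source does not name a specific algorithm, and the price values and the function name `k_means_1d` are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster numbers around the given initial centers (plain k-means)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical, unlabeled house prices (in thousands).
prices = [100, 110, 120, 480, 500, 520]
centers, clusters = k_means_1d(prices, centers=[100, 520])
```

No price is ever tagged as "cheap" or "expensive"; the two groups emerge from the data alone.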

Semi-supervised machine learning algorithms combine supervised and unsupervised
learning. In semi-supervised learning, an algorithm learns
from a dataset that includes both labeled and unlabeled data, usually mostly
unlabeled. It is generally chosen when obtaining labeled data requires skilled resources,
whereas acquiring unlabeled data usually does not require additional resources.

Reinforcement machine learning is a learning method that works based on feedback.
Reinforcement learning differs from supervised learning in that it does not
need labelled input/output pairs to be presented. It is studied in various disciplines such as
statistics and information theory.

Advantages of Machine Learning


It helps to manage large amounts of data. It reduces the need for human intervention and
can perform complex operations on its own. It is extremely useful in fields such as
e-commerce and healthcare, and in the manufacturing industry.
While even experts often cannot be sure where and through which correlation a production error
in a plant fleet arises, machine learning offers the possibility of identifying the error early,
which saves downtime and money. Machine learning is now also used in the medical field. In the
future, after collecting huge amounts of data, apps will be able to warn a patient when the doctor
wants to prescribe a drug that the patient cannot tolerate. The app can also suggest alternative
options by taking the patient's genetics into account.

Applications of Machine Learning


1. Virtual Personal Assistants
There are many personal assistants available, such as Apple's Siri, Google Assistant and
Amazon's Alexa. Their job is to find information when a customer asks for it
by voice. To ask a question, we activate the assistant and say, for example, “What is the time in
London?”. To answer, the personal assistant looks for the
information in a browser or collects it from phone apps. You can also ask your assistant to perform
certain tasks, such as “Set a reminder for tomorrow” or “Remind me to wish my friend a happy
birthday”. The personal assistant uses machine learning to respond to the user's tasks or
questions. These assistants are also integrated into various other devices such as
smart TVs and speakers, making those devices smarter.

2. Traffic Predictions:

Whenever we visit a new place or are not sure about the route, we generally use
maps. A map service shows the distance, the amount of time it takes to cover that distance and
information about traffic congestion. Using machine learning,
it predicts the traffic on a particular route by analyzing the traffic on that route at the
same time on previous days. Hence machine learning helps us in predicting traffic.
3. Video Surveillance:
A single person cannot monitor multiple cameras at the same time, and that is where
machine learning is used. Nowadays video cameras are powered by AI, which helps by
tracking unusual behaviour. For example, if a person stands motionless for a long
time or is stumbling, the system alerts the attendant who is
looking after the camera. Machine learning has been used extensively in video surveillance and
has been extremely useful.

4. Email Spam and Filtering:

Machine learning is used extensively to check for spam and malware emails. It
detects new malware and protects users against it.

5. Online Customer Support:

Many websites provide customers with a chatbox to answer their queries and doubts, but
most of the time there is no executive behind the chatbox. These chatboxes are powered by
AI, and machine learning makes them get better with time.

6. Search Engine Result Refining:

Whenever we search for anything on the web, the search engine, for example Google,
keeps track of what users open after the results are shown. It checks whether
users click the top search results or the bottom ones. Machine learning uses this to
make the search engine better with time.
7. Product Recommendations:
Whenever a product is recommended to you, whether after you purchase a certain
product from a website or as a new product, machine learning is what helps in
recommending products to customers.

8. Online Fraud Detection:

Machine learning helps in detecting online money fraud. Many payment gateways have started to
implement this technique to prevent fraud. Companies like PayPal use machine learning to
detect fraud.
INTRODUCTION TO PROJECT

Housing is one of the most valuable economic assets an individual can purchase during
adult life. Hence we need to be extremely careful before buying a house and make sure that we
pay the right price for it.

In the following, we explore different machine learning techniques and methodologies to
predict house prices. The data consists of a training dataset and a test dataset. Our objective is to
predict house prices based on the user's requirements and needs. Our model predicts the price of a
house from the sample data that has been given.
CHAPTER 2

LITERATURE SURVEY

Literature Survey

1. Housing Price Prediction Using Machine Learning Algorithms: The Case of Melbourne City, Australia

Author: The Danh Phan, 2018

House price prediction is an important topic in real estate. The literature attempts to derive useful
knowledge from historical data of property markets. Machine learning techniques are applied
to analyze historical property transactions in
Australia to discover useful models for house buyers and sellers. The study reveals the high
discrepancy between house prices in the most expensive and the most affordable suburbs of
Melbourne city. Moreover, experiments demonstrate that the combination of Stepwise and
Support Vector Machine, based on mean squared error measurement, is a
competitive approach.

2. Predicting Sales Prices of the Houses Using Regression Methods of Machine Learning

Authors: Parasich Andrey Viktorovich ; Parasich Viktor Aleksandrovich ; Kaftannikov Igor
Leopoldovich ; Parasich Irina Vasilevna, 2018

In this article we describe our solution for the “House Prices: Advanced Regression
Techniques” machine learning competition, which was held on the Kaggle platform. The
goal is to predict the house sale price from attributes like house area, year of building etc. In
our solution, we use classic machine learning algorithms and our original methods, which
are described here. At the end of the competition, we took 18th place among
2124 participants from the whole world.
3. Real Estate Value Prediction Using Linear Regression

Authors: Nehal N Ghosalkar ; Sudhir N Dhage, 2018

The real estate market is among the most competitive in terms of
pricing and keeps fluctuating. It is one of the prime fields in which to apply the ideas of
machine learning to improve and predict prices with high accuracy.
Three factors influence the price of a house: physical
condition, concept and location. The current framework estimates the value
of homes without any forecast of market prices and price increases. The objective
of the paper is to predict residential prices for customers, considering their
financial plans and wishes. By analyzing past market trends and price ranges, as well as upcoming
developments, future prices are predicted. The study aims to
predict house prices in Mumbai city with linear regression. It will help customers to
invest in a property without approaching a broker. The results of this research
show that linear regression gives a minimum prediction error of 0.3713.

4. Predicting Housing Market Trends Using Twitter Data

Authors: Marlon Velthorst ; Cicek Güven, 2019

In this study, we attempt to predict the Dutch housing market trends using text mining
and machine learning as an application of data science methods in finance. Our
main goal is to predict the short term upward or downward trend of the average house
price in the Dutch market by using text data collected from Twitter. Twitter is widely
used and has been proven to be a helpful
source of information. However, Twitter, text mining (tokenization,
bag-of-words, n-grams, weighted term frequencies) and machine learning (classification
algorithms) have not yet been combined in order to predict the housing market trends in the
short term. In this study, tweets including predefined search words are collected based
on domain knowledge, and
the corresponding text is grouped by month as documents. Then words and word
sequences are transformed into numerical values. These values served as attributes to
predict whether the housing market moves up or down,
i.e. we approached this as a binomial classification problem relating text data of a
month with (up or down) trends for the subsequent month.
Our main results reveal that there is a correlation between the (weighted) frequency of
words and short term housing trends. In other words, we were able to make accurate
predictions of trends in the short term using multiple machine learning and text mining
techniques combined.
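The text-mining steps named in this abstract (tokenization, bag-of-words, n-grams) can be illustrated with the standard library alone. This is a minimal sketch, not the cited authors' pipeline: the two "monthly documents" are invented, and `tokenize`, `ngrams` and `bag_of_words` are hypothetical helper names.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens on whitespace."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(documents):
    """Map each document to a Counter of term frequencies."""
    return [Counter(tokenize(doc)) for doc in documents]


# Hypothetical monthly "documents" built by grouping tweets by month.
monthly_docs = [
    "house prices rise again house market strong",
    "house prices fall buyers wait",
]
bags = bag_of_words(monthly_docs)
bigrams = ngrams(tokenize(monthly_docs[1]), 2)
```

Each `Counter` turns a month of text into numerical attributes (term counts) that a classifier could then relate to an up or down trend.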

5. House Price Prediction Using Machine Learning and Neural Networks

Authors: Ayush Varma ; Abhijit Sarma ; Sagar Doshi ; Rohini Nair, 2018

Real estate is the least transparent industry in our ecosystem. Housing prices keep
changing day in and day out and are sometimes hyped instead of being based on
valuation. Predicting housing prices with real factors is the main crux of our
research project. Here we aim to base our evaluations on every basic
parameter that is considered while determining the price. We use various
regression techniques in this pathway, and our results are not the sole
determination of one technique; rather, they are the weighted mean of various techniques, to
give the most accurate results. The results proved that this approach yields lower error
and higher accuracy than the individual algorithms applied. We also propose to use real-
time neighborhood details using Google maps to get exact real-world valuations.
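The idea of combining several regressors through a weighted mean can be sketched as follows. The three predictions and their weights are made-up values, and `weighted_mean_prediction` is a hypothetical name; the cited paper does not state how its weights were chosen.

```python
def weighted_mean_prediction(predictions, weights):
    """Combine per-model price predictions into one weighted average."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is needed per prediction")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(predictions, weights)) / total_weight


# Hypothetical predictions (in thousands) from three different regressors.
model_predictions = [300.0, 320.0, 310.0]
# Hypothetical weights, e.g. reflecting each model's validation accuracy.
model_weights = [0.5, 0.3, 0.2]

combined = weighted_mean_prediction(model_predictions, model_weights)
```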

6. Forecasting house price index of China using dendritic neuron model

Authors: Ying Yu ; Shuangbao Song ; Tianle Zhou ; Hanaki Yachi ; Shangce Gao, 2016

Whether the Chinese housing market continues to prosper or not is related to the development of
China, and it also has an impact on world finance.
Thus forecasting the house price index is extremely important and challenging.
In this paper we propose an unsupervised learnable neuron
model (DNM) by including the nonlinear interactions between excitation and inhibition
on dendrites.
We use DNM to fit the House Price Index (HPI) data and then forecast the trends of the
Chinese housing market. To verify the effectiveness of the DNM, we use a standard
statistical model (i.e., the exponential smoothing (ES) model) to
make a performance comparison. Three quantitative statistical metrics, namely
normalized mean square error, absolute percentage of error and coefficient of correlation,
are used to evaluate the forecasting performance of the two models. Experimental results
demonstrate that the proposed DNM is better than
ES in all of the three quantitative statistical metrics.

7. Prediction of real estate price variation based on economic parameters

Authors: Li Li ; Kai-Hsuan Chu, 2017

It is well known that many economic parameters may more or less influence
real estate price variation. In addition, bankers and investors are also
interested in knowing the future change of real estate prices. There has been no
appropriate model that includes these factors for price prediction. Here, the
influences of most macroeconomic parameters
on real estate price variation are investigated before establishing the price fluctuation
prediction model. Two schemes, a back propagation neural network (BPN) and a radial basis
function neural network (RBF), are employed to
establish the nonlinear model for real estate price variation prediction in Taipei,
Taiwan, based on leading and simultaneous economic indices. The prediction
results are compared with the public Cathay House Price Index or the Sinyi
Home Price Index. The mean absolute error and root mean square error of
the price variation are selected as
the performance indices. Public data on real estate price
variation in Taipei, Taiwan during 2005 ~ 2015 are adopted for analysis and prediction
comparison.
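The two performance indices mentioned here, mean absolute error (MAE) and root mean square error (RMSE), are easy to state in code. The sketch below uses invented index values purely for illustration; the function names are hypothetical.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """Square root of the average squared difference; penalizes large errors more."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical price-index values and model predictions.
actual = [100.0, 102.0, 105.0, 103.0]
predicted = [101.0, 100.0, 105.0, 107.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_square_error(actual, predicted)
```

Because RMSE squares each error before averaging, the single error of 4 in this sample pulls RMSE above MAE.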

8. Predicting house sale price using fuzzy logic, Artificial
Neural Network and K-Nearest Neighbor

Authors: Muhammad Fahmi Mukhlishin ; Ragil Saputra ; Adi Wibowo, 2017

The price of land and houses is usually first set by
the seller; however, determining the right price in the sales process affects
the buyer's desire to choose and bid. In Indonesia, special characteristics such as the tax object
value (NJOP) and location parameters have a high influence
on the price. In this paper we propose the prediction of land and house value
using several methods. Fuzzy logic, Artificial Neural Network and
K-Nearest Neighbor are compared in this paper to find the most
appropriate method, which can be used as a reference for
determining the price by sellers. Google Maps is used to represent the spatial
data for the prediction parameters. The variables used in the methods are the NJOP of
the land, the location, the age, the NJOP of the house and
the value of the land's location. The methods are tested by comparing
the real transaction price and the prediction using the MAPE
formula.
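For reference, the mean absolute percentage error (MAPE) used for this comparison can be written as a short function. The prices below are hypothetical and the helper name `mape` is chosen only for this sketch.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Every actual value must be non-zero, since each error is divided by it.
    """
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical real transaction prices and predicted prices.
actual_prices = [200.0, 400.0, 500.0]
predicted_prices = [180.0, 420.0, 500.0]

error_percent = mape(actual_prices, predicted_prices)
```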

9. Comprehensive Analysis of Housing Price Prediction in Pune Using Multi-Featured Random
Forest Approach

Authors: Rushab Sawant ; Yashwant Jangid ; Tushar Tiwari ; Saurabh Jain ; Ankita Gupta, 2018

The housing sector in India has been predicted to grow at 30-35% over
the next decade. In terms of employment provided, it is second only to the agricultural
sector. Housing is one of the main domains of real estate. Pune is emerging as one of the
main metropolitan cities of India and has many prestigious educational institutions and
IT parks. This makes it an ideal place to buy homes. Vagueness about the
prices of homes makes it challenging for customers to pick their dream house.
The interests of both buyer and seller should be satisfied so that they do not
overestimate or underestimate the price. This housing price prediction model acts as a helping hand
for a buyer, a seller or a real estate agent to make a better-informed decision. To achieve this,
diverse features are selected as input from the feature set, and various algorithms are applied,
such as Random Forest and Decision Tree.

10. Time-Aware Latent Hierarchical Model for Predicting House Prices

Authors: Fei Tan ; Chaoran Cheng ; Zhi Wei, 2017

It is widely acknowledged that the value of a house is the mixture of a large
number of characteristics. House price prediction thus presents a unique set of
challenges in practice. While a large body of work is dedicated to this task,
its performance and applications are limited by the shortage of transaction data covering a long
time span, the absence of real-world settings and the insufficiency of
housing features. To this end, a time-aware latent hierarchical model is introduced
to capture the underlying spatiotemporal interactions behind the evolution of house prices.
The hierarchical perspective obviates the need for historical transaction data of
exactly the same houses when temporal effects are considered. The proposed framework is
examined on a large-scale dataset of property transactions in Beijing. The entire
procedure strictly complies with the real-world scenario. The empirical evaluation
results demonstrate that our approach outperforms alternative competitive
methods.
CHAPTER 3

AIM AND SCOPE OF THE PRESENT SYSTEM

EXISTING SYSTEM

Multiple Linear Regression

Multiple linear regression models the relationship between two or more explanatory
(independent) variables and a scalar response (dependent) variable. Each set of independent
variable values is associated with a value of the dependent variable.

Limitations

The dependent variable y must be continuous. The independent variables can be of any
type. The dependent variable is usually affected by the independent variables.

Proposed System

Linear Regression is a technique that helps to identify the relationship between a scalar
response (or dependent variable) and one or more explanatory variables (or independent
variables). The case of one explanatory variable is called simple linear regression.
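A minimal sketch of linear regression with several explanatory variables is given below, solving the normal equations in plain Python. It is an illustration only: the rows (area in thousands of square feet, bedrooms, bathrooms) and the prices are made-up values rather than the project's dataset, and `solve_linear_system`, `fit_linear_regression` and `predict_price` are hypothetical names.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_linear_regression(rows, targets):
    """Return [intercept, w1, w2, ...] minimizing the squared error."""
    x = [[1.0] + list(r) for r in rows]  # prepend a constant term
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(x, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_price(weights, features):
    """Apply the fitted weights to one feature tuple."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


# Hypothetical data: (area in 1000 sq ft, bedrooms, bathrooms) -> price in thousands.
rows = [(1.0, 2, 1), (1.5, 3, 2), (2.0, 3, 1), (2.5, 4, 3), (1.2, 2, 2), (3.0, 5, 2)]
prices = [175.0, 240.0, 285.0, 355.0, 200.0, 410.0]

weights = fit_linear_regression(rows, prices)
estimate = predict_price(weights, (1.8, 3, 2))
```

With a single feature per row the same code reduces to simple linear regression.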

Advantages

• Space complexity is very low, since the model only needs to store the weights at the end of
training. Hence it is a low-latency algorithm.

• It is very simple to understand.

• Good interpretability.

• Feature importance is obtained when the model is built. With the help of the
regularization hyperparameter lambda, feature selection can be handled, and hence
dimensionality reduction can be achieved.
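The link between lambda and feature selection can be sketched with L1-regularized (lasso) regression. This assumes that "lambda" refers to the strength of an L1 penalty, which the text does not state explicitly. The data are invented, the features are assumed to be centered so that no intercept is needed, and `soft_threshold` and `lasso_fit` are hypothetical names.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_fit(rows, targets, lam, sweeps=50):
    """Coordinate descent for (1/2n) * squared error + lam * sum(|w|).

    Assumes features and targets are already centered, so no intercept is fitted.
    """
    n = len(rows)
    k = len(rows[0])
    weights = [0.0] * k
    for _ in range(sweeps):
        for j in range(k):
            rho = 0.0
            scale = 0.0
            for row, target in zip(rows, targets):
                # Prediction from all features except feature j.
                others = sum(weights[m] * row[m] for m in range(k) if m != j)
                rho += row[j] * (target - others)
                scale += row[j] ** 2
            weights[j] = soft_threshold(rho / n, lam) / (scale / n)
    return weights


# Hypothetical centered data: the first feature matters a lot, the second barely.
rows = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
targets = [3.0 * a + 0.1 * b for a, b in rows]

unpenalized = lasso_fit(rows, targets, lam=0.0)
penalized = lasso_fit(rows, targets, lam=0.5)
```

With `lam=0.5` the weight of the weak second feature is driven to exactly zero, which is the sense in which the penalty performs feature selection and reduces dimensionality.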

FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase, and a business proposal is put forth
with a very general plan for the project and some cost estimates. The feasibility study of
the proposed system is carried out to ensure that the proposed system is
not a burden to the company. The study covers three aspects:

1. Economical feasibility

2. Technical feasibility

3. Social feasibility

ECONOMICAL FEASIBILITY

This study is carried out to check whether the right amount of funds is invested in
the model. It is done to avoid pouring an excess amount of money into a single
model and to make sure that the model is well within the budget. It is extremely
important to spend only the right amount of funds on a model.

TECHNICAL FEASIBILITY

This study makes sure that the technical requirements are limited to what we can offer. Any
system developed should not place a high demand on technical resources, since that puts a burden
on the client. It also checks the project's potential, that is, what it can do once developed.
SOCIAL FEASIBILITY

This study is carried out to check how the system interacts with other systems. It checks the level of
acceptance of the system by the user. This includes training the user to use the system
efficiently, which is a necessity. Since the client is the final user of the system, the client may
criticize it, and such criticism should be made in a disciplined and meaningful manner.
CHAPTER 4

EXPERIMENTAL METHODS AND ALGORITHMS

HARDWARE REQUIREMENTS

The most common set of requirements defined by any operating system or software
application is the physical computer resources, also known as hardware. A hardware
requirements list is often accompanied by a hardware compatibility list, especially in case
of operating systems. The minimal hardware requirements are as follows,

1. PROCESSOR : PENTIUM IV

2. RAM : 8 GB

3. PROCESSOR : 2.4 GHz

4. MAIN MEMORY : 8 GB RAM

5. PROCESSING SPEED : 600 MHz

6. HARD DISK DRIVE : 1 TB

7. KEYBOARD : 104 KEYS

SOFTWARE REQUIREMENTS

Software requirements define the resources and prerequisites that
need to be installed on a computer for an application to function. These
prerequisites need to be installed separately before the software itself is installed. The
minimal software requirements are as follows,

1. FRONT END : PYTHON

2. IDE : ANACONDA

3. OPERATING SYSTEM : WINDOWS 10
Python Language

• Python is an object-oriented programming language.
• It was created by Guido van Rossum, who began work on it in 1989.
• It is well suited to rapid prototyping of complex applications.
• It can be extended with C or C++.
• Organizations such as Google and NASA also use Python.
• It is widely used in AI.
Python Programming Characteristics

• It provides rich data types.

• Its syntax is simple.

• It is a platform-independent scripting language.

• Compared to other programming languages, it allows more run-time flexibility.

• A module in Python may have one or more classes and free functions.

• Python libraries run on both Linux and Windows.

• For building large applications, Python can be compiled to byte-code.

• It supports functional and structured programming.

• It supports an interactive mode that allows interactive testing and debugging of
snippets of code.

• Editing, debugging and testing are fast in Python.
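The byte-code point in the list above can be observed directly with the standard library. The sketch below is only an illustration for the CPython interpreter; `price_per_square_foot` is a hypothetical function written just to have something to inspect.

```python
import dis


def price_per_square_foot(price, area):
    """Hypothetical helper used only to have something to inspect."""
    return price / area


# Every function carries its compiled byte-code in a code object.
bytecode = price_per_square_foot.__code__.co_code

# compile() turns source text into a code object that eval() can run.
expression = compile("250000 / 1250", "<example>", "eval")
result = eval(expression)

# dis.dis prints the human-readable byte-code instructions.
dis.dis(price_per_square_foot)
```

Running the same lines one at a time in the interactive interpreter is also an example of the interactive mode mentioned in the list.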

Applications of Python Programming

Web Applications

We can create web apps in Python by using frameworks and CMSs such as
Django, Flask, Pyramid, Plone and Django CMS. Sites like Mozilla, Reddit,
Instagram and PBS are written in Python.

Scientific and Numeric Computing

There are many libraries in Python that can be used for scientific and numeric
computing. SciPy and NumPy are used in general-purpose computing, EarthPy is used
for earth science, AstroPy is used for astronomy and so on. Python is also used in machine
learning, data mining and deep learning.

Creating Software Prototypes

Python is slower than compiled languages such as C or C++ but is great for creating prototypes.
For example, you can use Pygame to create a game prototype. If you are satisfied with the
prototype, you can then build the app using C or C++.

Good Language to Teach Programming

Python is used by many students, and several companies teach Python to
their employees. It has a lot of features and capabilities. The syntax is simple and it is one
of the easiest languages to learn.

About OpenCV Package

Python is a general-purpose programming language started by Guido van Rossum. It
became very popular because of its simplicity and code readability. It enables the
programmer to express ideas in fewer lines of code.

Compared to languages like C/C++, Python is slower. However, Python can be easily extended
with C/C++: we can write code in C/C++ and create a Python wrapper for it.
This gives us two advantages: first, our code is as fast as the original C/C++ code and second,
it is very easy to code in Python. Hence OpenCV-Python is a Python wrapper around the
original C++ implementation.

OpenCV-Python also makes use of NumPy, which provides a MATLAB-style syntax. The OpenCV array
structures are converted to and from NumPy arrays. Whatever operations you can do in
NumPy, you can combine with OpenCV, which increases the number of weapons in your
arsenal. Besides that, several other libraries that support NumPy, such as SciPy and Matplotlib,
can be used with it.

So OpenCV-Python is an appropriate tool for fast prototyping of computer vision
problems.

FEATURES OF ANACONDA NAVIGATOR

Anaconda is free.

It is an open-source, easy-to-install distribution of the Python and R programming languages.

It is used for scientific computing, data science, statistical analysis and machine learning.

At the time of writing, the latest distribution of Anaconda is Anaconda 5.3.
What is Anaconda Navigator?

Anaconda Navigator is a desktop graphical user interface (GUI) included in the
Anaconda distribution. It allows us to launch applications provided in the Anaconda
distribution and easily manage conda packages, environments and channels without the
use of command-line commands. It is available for Windows, macOS and Linux.
Applications Provided In Anaconda Distribution

The Anaconda distribution comes with the following applications along with Anaconda
Navigator.

1. JupyterLab

2. Jupyter Notebook

3. Qt Console

4. Spyder

5. Glueviz

6. Orange3

7. RStudio

8. Visual Studio Code

> JupyterLab: This is an extensible working environment for interactive and
reproducible computing, based on the Jupyter Notebook and architecture.

> Jupyter Notebook: This is a web-based, interactive computing notebook
environment. We can edit and run human-readable docs while describing the
data analysis.

> Qt Console: It is a PyQt GUI that supports inline figures, proper multiline
editing with syntax highlighting, graphical calltips and more.

> Spyder: Spyder is a scientific Python development environment. It is a powerful Python
IDE with advanced editing, interactive testing, debugging and
introspection features.

> VS Code: It is a streamlined code editor with support for development operations like
debugging, task running and version control.
Glueviz: It is used for multidimensional data visualization across the files. It is explored
in relationships within and among related datasets.

Orange 3: It is an component-based on data mining framework. it can be used for the


data visualization and data analysis. The workflows under Orange 3 is very interactive
and provides a large toolbox.

Rstudio: This is a set of integrated tools designed for help you to be more productive
by R.Then it includes R essentials and notebooks.

New Features of Anaconda 5.3

• Compiled with the latest Python release: Anaconda 5.3 is compiled with Python 3.7, taking advantage of Python's speed and feature improvements.
• Better reliability: The reliability of Anaconda is improved in the latest release by capturing and storing the package metadata for the installed packages.
• Users deploying TensorFlow can make use of MKL 2019 for Deep Neural Networks. These Python binary packages are provided to achieve high CPU performance.
• New packages have been added: Over 230 packages have been updated or added in the new release.
• Work in progress: There is a casting bug in NumPy with Python 3.7, and the team is currently working on a patch until NumPy is updated.

Flask
Flask is a Python web framework that lets us build web applications. It was developed by Armin Ronacher. Flask is more explicit than Django and is also easier to learn, because it needs less base code to implement a simple web application.

A web application framework, or web framework, is a collection of modules and libraries that helps the developer write applications without writing low-level code for protocols, thread management, etc. Flask is based on the Werkzeug WSGI (Web Server Gateway Interface) toolkit and the Jinja2 template engine.

METHOD    DESCRIPTION

GET       Sends data to the server in unencrypted form.

HEAD      Same as GET, but the response has no body.

POST      Sends the form data to the server. Data received by the POST method is not cached by the server.

PUT       Replaces the current representation of the target resource with the uploaded content.

DELETE    Deletes the target resource given by a URL.
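The methods in the table can be tied to a Flask route roughly as follows. This is only a sketch under stated assumptions: the route /listings/<name>, the in-memory listings dictionary and the JSON payload are hypothetical and are not taken from the project's web application.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory store standing in for a real database.
listings = {}


@app.route("/listings/<name>", methods=["GET", "PUT", "DELETE"])
def listing(name):
    if request.method == "PUT":
        # PUT replaces the stored representation with the uploaded JSON.
        listings[name] = request.get_json()
        return jsonify(listings[name]), 200
    if request.method == "DELETE":
        # DELETE removes the resource if it exists.
        listings.pop(name, None)
        return "", 204
    # GET returns the stored representation, or 404 if there is none.
    if name not in listings:
        return jsonify({"error": "not found"}), 404
    return jsonify(listings[name])
```

Flask answers HEAD requests automatically for any route that accepts GET, so HEAD does not need to be listed in methods.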

SYSTEM DESIGN

Architecture

[Figure: architecture of the proposed system. Blocks shown: Collection of Dataset; Data Loading and Pre-Processing; Determine Dependent and Independent; Calculate the variable coefficient using Regression Technique; Calculate the Prediction; Determine the Prediction Results; Get the results and calculate.]

UML DIAGRAMS
o UML stands for Unified Modeling Language.
o It is used in the field of object-oriented software engineering.
o The goal is for UML to become a common language for creating models of object-oriented computer software.
o It consists of two components: a meta-model and a notation.
o The Unified Modeling Language is a standard language for specifying, visualizing, constructing and documenting the artifacts of software systems, as well as for business modeling and other non-software systems.
o It has proven successful in the modeling of large and complex systems.
o The UML is a very important part of developing object-oriented software and the software development process. It uses graphical notations to show the design of software projects.
GOALS:
The primary goals are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that they can develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations, frameworks, patterns and components.
7. Integrate best practices.

USE CASE DIAGRAM:
A use case diagram is a behavioural diagram. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals (represented as use cases), and any dependencies between those use cases. The main purpose of a use case diagram is to show which system functions are performed for which actor. The roles of the actors in the system can also be depicted.

SEQUENCE DIAGRAM:
A sequence diagram in the Unified Modeling Language (UML) is an interaction diagram that shows how processes operate with one another and in what order. Sequence diagrams are sometimes called event diagrams, event scenarios or timing diagrams.

ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. Activity diagrams can be used
to describe the business and operational step-by-step workflows of components in a
system. An activity diagram shows the overall flow of control.

CHAPTER 5

RESULTS AND PERFORMANCE ANALYSIS

Module Implementation

Collection of Dataset

The dataset used in this project contains parameters such as the area in square feet, the location, and the number of bedrooms and bathrooms of each property. The selling price is the dependent variable, which depends on several independent variables.

Data Preprocessing

Data preprocessing is the process of transforming raw, complex data into systematic, understandable knowledge. It involves finding missing and redundant data in the dataset, which brings uniformity to the dataset. In our dataset, however, there were no missing values.

Import Libraries

A library is a collection of modules. The first step is to import the libraries that our system requires. They provide functions that can be invoked without writing the underlying code ourselves. pandas is one of the most popular Python libraries for data science; we have imported the pandas library and named it pd.

Import the Dataset

Many datasets come in CSV format. First we locate the directory of the CSV file and then read it using the read_csv function from the pandas library.
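A minimal sketch of this step is shown below. To keep it self-contained, a few rows from the sample data are placed in an in-memory string that stands in for the CSV file on disk; in the project, read_csv is given the path of the file instead.

```python
import io

import pandas as pd

# Inline CSV text standing in for the file on disk (rows from the sample data).
csv_text = """location,size,total_sqft,bath,price
Uttarahalli,3 BHK,1440,2,62
Kothanur,2 BHK,1200,2,51
Whitefield,2 BHK,1170,2,38
"""

# read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())
print(dataset.shape)
```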

Encoding categorical data

Sometimes our data is in qualitative form, that is, the data contains text, and the categories appear in text form. It is harder for machines to understand and process text than numbers, since the models are based on mathematical equations and calculations. Therefore, we have to encode the categorical data as numbers.
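One common way to encode a text category is one-hot encoding, sketched below with pandas. The three-row frame is made up for illustration, and one-hot encoding is an assumption here; the text above does not name the encoding method.

```python
import pandas as pd

# Made-up rows: one categorical column and one numeric column.
df = pd.DataFrame({
    "location": ["Whitefield", "Kothanur", "Whitefield"],
    "total_sqft": [1170, 1200, 1800],
})

# Each distinct location becomes its own 0/1 column.
dummies = pd.get_dummies(df["location"], dtype=int)

# Replace the text column with the numeric indicator columns.
encoded = pd.concat([df.drop(columns="location"), dummies], axis=1)
print(encoded)
```

After encoding, every column is numeric, so the frame can be passed to a regression model.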

Split Dataset into Training and Test Set

Now we split our dataset into two sets: a training set and a test set. We train our machine learning models on the training set, i.e. the models try to learn the correlations in the training set, and then we test the models on the test set to check how accurately they can predict. A common choice is to allocate 80% of the dataset to the training set and the remaining 20% to the test set.
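The idea behind an 80/20 split can be sketched with NumPy alone. The helper split_80_20 and the toy arrays are hypothetical and serve only to show the mechanics: shuffle the row indices once, then cut them at 80%.

```python
import numpy as np


def split_80_20(X, y, seed=10):
    """Shuffle the rows and return an 80% training / 20% test split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))   # shuffled row indices
    cut = int(0.8 * len(X))           # position of the 80% boundary
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy data: 10 rows with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = split_80_20(X, y)
print(X_train.shape, X_test.shape)
```

Fixing the seed makes the split reproducible, which keeps results comparable between runs.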

Dependent and independent variable in regression

Regression analysis describes the relationship between independent variables and the
dependent variable. It predicts value of dependent variable by analyzing the value of
independent variables.

Regression coefficient

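The regression coefficient is the slope of the regression line: it gives the change in the dependent variable for a one-unit change in an independent variable.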
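For a single independent variable, the coefficient can be read off a straight-line fit. The sketch below uses made-up data that lies exactly on the line price = 10 + 0.05 * area, so the fitted slope should come out close to 0.05.

```python
import numpy as np

# Made-up data lying exactly on price = 10 + 0.05 * area (illustration only).
area = np.array([1000.0, 1200.0, 1500.0, 2000.0])
price = 10 + 0.05 * area

# A degree-1 polynomial fit returns the slope first, then the intercept.
slope, intercept = np.polyfit(area, price, deg=1)

print(round(slope, 4), round(intercept, 4))
```

Here the slope says that each additional unit of area adds about 0.05 to the predicted price.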

Prediction

Prediction is the output of an algorithm that has been trained on a dataset and is then applied to new data. Finally, our model predicts the house price based on user inputs.

SOFTWARE TESTING

General

In general, system testing is a type of testing whose main aim is to make sure that the system performs efficiently and seamlessly. Testing is applied to a program to discover previously undetected errors that could otherwise damage the future of the software. A test case with a high probability of discovering an error is considered successful, and such a test helps to uncover errors that are still unknown.

TEST CASE

Testing, as explained earlier, is the process of discovering all possible weak points in the finalized software product. Testing checks the working of sub-assemblies, components, assemblies and the complete product. The software is taken through different exercises to make sure that it meets the business requirements and user expectations and doesn't fail abruptly. Several types of tests are used today, and each test type addresses a specific testing requirement.

Testing Techniques

A test plan is a document that describes the approach, scope, resources and schedule of the intended testing activities. It identifies, among other things, the test items, the features to be tested, the testing tasks, who will do each task, the degree of tester independence, the test environment, the test design techniques and the entry and exit criteria to be used, along with the rationale for their choice, and any risks that require contingency planning. It can also be described as a record of the test planning process. Test plans are usually prepared with significant input from test engineers.

(I) UNIT TESTING

Unit testing involves designing test cases that validate the internal program logic. All decision branches and the internal code flow are validated. It is performed after an individual unit is completed and before integration. The unit test thus performs basic tests at the component level and tests a particular business process, system configuration, etc. It ensures that each unique path of a process performs precisely to the documented specifications and has clearly defined inputs and expected results.
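As an illustration, the helper convert_sqft_to_num from the project's code listing can be unit tested with Python's unittest module. The function is repeated here so that the block is self-contained; the test class name and the choice of cases are assumptions, although the three inputs are the ones the listing itself tries out.

```python
import unittest


def convert_sqft_to_num(x):
    """Convert a 'total_sqft' string to a number; ranges are averaged."""
    tokens = x.split('-')
    if len(tokens) == 2:
        return (float(tokens[0]) + float(tokens[1])) / 2
    try:
        return float(x)
    except ValueError:
        return None


class ConvertSqftTests(unittest.TestCase):
    def test_plain_number(self):
        self.assertEqual(convert_sqft_to_num('290'), 290.0)

    def test_range_is_averaged(self):
        self.assertEqual(convert_sqft_to_num('2100 - 2850'), 2475.0)

    def test_unparseable_value(self):
        self.assertIsNone(convert_sqft_to_num('4.46Sq. Meter'))


# Run the three test cases and collect the result.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConvertSqftTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test has clearly defined inputs and one expected result, which is the property of unit tests described above.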

(II) INTEGRATION TESTING

Integration tests are designed to test integrated software items to determine whether they actually run as a single program or application. The testing is event driven and is therefore concerned with the basic outcome of fields. Integration tests demonstrate that, although the components were individually satisfactory, as shown by successful unit testing, the combination of components is also correct and consistent. This type of testing is specifically aimed at exposing the problems that arise from combining components.

(III) FUNCTIONAL TESTING

Functional tests provide systematic demonstrations that the functions tested are available as specified by the technical requirements, the system documentation and the user manual.

(IV) SYSTEM TESTING

System testing, as the name suggests, ensures that the software system as a whole meets the business requirements and aims. The configuration is tested here to ensure predictable results, which can then be analysed. System testing relies on the description of the process and its flow, with emphasis on pre-driven processes and the points of integration.

(V) WHITE BOX TESTING

White box testing is a type of testing in which the internal components of the software are visible to the tester and can be examined. All the data structures, components, etc. are tested by the tester to find possible bugs or errors. It is used in situations where black box testing is incapable of finding a bug. It is a complex type of testing which takes more time to apply.

(VI) BLACK BOX TESTING

Black box testing is a type of testing in which the internal components of the software are hidden, and only the inputs and outputs of the system are available to the tester for finding bugs. It is therefore a simpler type of testing, which a programmer with basic knowledge can also carry out. It is less time consuming and less costly than white box testing. It works very well for software that is less complex and straightforward in nature.

(VII) ACCEPTANCE TESTING


User acceptance testing is a critical phase of any project and requires significant participation by the end user. It also ensures that the system meets the functional requirements.

RESULTS:

SAMPLE DATA SET:

area_type           | availability  | location                 | size      | society | total_sqft | bath | balcony | price
Super built-up Area | 19-Dec        | Electronic City Phase II | 2 BHK     | Coomee  | 1056       | 2    | 1       | 39.07
Plot Area           | Ready To Move | Chikka Tirupathi         | 4 Bedroom | Theanmp | 2600       | 5    | 3       | 120
Built-up Area       | Ready To Move | Uttarahalli              | 3 BHK     |         | 1440       | 2    | 3       | 62
Super built-up Area | Ready To Move | Lingadheeranahalli       | 3 BHK     | Soiewre | 1521       | 3    | 1       | 95
Super built-up Area | Ready To Move | Kothanur                 | 2 BHK     |         | 1200       | 2    | 1       | 51
Super built-up Area | Ready To Move | Whitefield               | 2 BHK     | DuenaTa | 1170       | 2    | 1       | 38
Super built-up Area | 18-May        | Old Airport Road         | 4 BHK     | Jaades  | 2732       | 4    |         | 204
Super built-up Area | Ready To Move | Rajaji Nagar             | 4 BHK     | Brway G | 3300       | 4    |         | 600
Super built-up Area | Ready To Move | Marathahalli             | 3 BHK     |         | 1310       | 3    | 1       | 63.25
Plot Area           | Ready To Move | Gandhi Bazar             | 6 Bedroom |         | 1020       | 6    |         | 370
Super built-up Area | 18-Feb        | Whitefield               | 3 BHK     |         | 1800       | 2    | 2       | 70
Plot Area           | Ready To Move | Whitefield               | 4 Bedroom | Prrry M | 2785       | 5    | 3       | 295
Super built-up Area | Ready To Move | 7th Phase JP Nagar       | 2 BHK     | Shncyes | 1000       | 2    | 1       | 38
Built-up Area       | Ready To Move | Gottigere                | 2 BHK     |         | 1100       | 2    | 2       | 40
Plot Area           | Ready To Move | Sarjapur                 | 3 Bedroom | Skityer | 2250       | 3    | 2       | 148
Super built-up Area | Ready To Move | Mysore Road              | 2 BHK     | PrntaEn | 1175       | 2    | 2       | 73.5
Super built-up Area | Ready To Move | Bisuvanahalli            | 3 BHK     | Prityel | 1180       | 3    | 2       | 48
Super built-up Area | Ready To Move | Raja Rajeshwari Nagar    | 3 BHK     | GrrvaGr | 1540       | 3    | 3       | 60

This is a sample of the preloaded dataset used in our model.

Graph:
Before deleting anomalies:

After deleting anomalies (we don't have any unwanted data):

This graph shows bathrooms per property.
This graph represents property price by square feet.

Importing libraries:
We use the pandas library to read the train and test files.
import pandas as pd               # used for data analysis

import numpy as np                # used for computations

import matplotlib.pyplot as plt   # used to plot values in graphs

Data preprocessing:

This step gets the count of each area type in the dataset and removes unwanted columns.

Encoding categorical data:

Splitting dataset into train and test data:

We are taking 80% of our data as training data and 20% as test data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

Dependent and independent variable in regression:

Eg:

Locality          Area   Bedrooms   Bathrooms   Price
Electronic city   1056   2          1           40
Whitefield        1170   2          1           38

The dependent variable in our model is the price, since it relies on other factors for its value.

The independent variables in our model are locality, area, bedrooms and bathrooms, since they do not depend on other variables for their value.

Linear regression uses these variables to predict the value of the output.

Linear regression:

It predicts the price from user-supplied inputs.

def predict_price(location, sqft, bath, bhk):
    # Find the one-hot column for this location, if there is one
    matches = np.where(X.columns == location)[0]

    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if len(matches) > 0:
        x[matches[0]] = 1
    return regressor.predict([x])[0]
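The same prediction step can be tried out end to end in a self-contained sketch. Everything below is assumed for illustration: the column names, the six made-up training rows, the weights used to generate the prices, and the use of NumPy least squares in place of the project's regressor object.

```python
import numpy as np

# Hypothetical feature layout: three numeric columns, then one-hot locations.
columns = np.array(["total_sqft", "bath", "bhk", "Kothanur", "Whitefield"])

# Made-up training rows (not taken from the dataset).
X = np.array([
    [1000, 2, 2, 1, 0],
    [1200, 2, 3, 1, 0],
    [1500, 3, 3, 1, 0],
    [1100, 1, 2, 0, 1],
    [1600, 3, 3, 0, 1],
    [2000, 4, 4, 0, 1],
], dtype=float)

# Prices generated from known weights so the fit can be checked.
true_weights = np.array([0.05, 5.0, 3.0, 10.0, 20.0])
y = X @ true_weights

# Least-squares fit; stands in for a trained linear regression model.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)


def predict_price(location, sqft, bath, bhk):
    x = np.zeros(len(columns))
    x[0], x[1], x[2] = sqft, bath, bhk
    matches = np.where(columns == location)[0]
    if matches.size > 0:          # unknown locations leave all dummies at 0
        x[matches[0]] = 1
    return float(x @ weights)


print(predict_price("Whitefield", 1800, 3, 3))
print(predict_price("Nowhere", 1800, 3, 3))
```

Checking whether the location was found before indexing avoids an IndexError for locations that have no one-hot column.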

Screenshot:

Fig:The final output of our model

CHAPTER 6

Conclusion and Future Work

In this paper, several tests have been performed using the linear regression algorithm for house price prediction. The algorithm predicts the prices of new properties that are going to be listed by taking some input variables and predicting a correct and justified price. It was a great learning experience building this predictive sale price model. In future work, different methods suited to time-series data will be used to obtain smaller prediction errors, and more data will be used to get better results.

References

1. Housing Price Prediction Using Machine Learning Algorithms: The Case of Melbourne City, Australia, The Danh Phan, 2018 International Conference on Machine Learning and Data Engineering (iCMLDE)

2. Predicting Sales Prices of the Houses Using Regression Methods of Machine Learning, Parasich Andrey Viktorovich; Parasich Viktor Aleksandrovich; Kaftannikov Igor Leopoldovich; Parasich Irina Vasilevna, 2018 3rd Russian-Pacific Conference on Computer Technology and Applications (RPC)

3. Real Estate Value Prediction Using Linear Regression, Nehal N Ghosalkar; Sudhir N Dhage, 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA)

4. Predicting Housing Market Trends Using Twitter Data, Marlon Velthorst; Cicek Güven, 2019 6th Swiss Conference on Data Science (SDS)

5. House Price Prediction Using Machine Learning and Neural Networks, Ayush Varma; Abhijit Sarma; Sagar Doshi; Rohini Nair, 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT)

6. Forecasting house price index of China using dendritic neuron model, Ying Yu; Shuangbao Song; Tianle Zhou; Hanaki Yachi; Shangce Gao, 2016 International Conference on Progress in Informatics and Computing (PIC)

7. Prediction of real estate price variation based on economic parameters, Li Li; Kai-Hsuan Chu, 2017 International Conference on Applied System Innovation (ICASI)

8. Predicting house sale price using fuzzy logic, Artificial Neural Network and K-Nearest Neighbor, Muhammad Fahmi Mukhlishin; Ragil Saputra; Adi Wibowo, 2017 1st International Conference on Informatics and Computational Sciences (ICICoS)

9. Comprehensive Analysis of Housing Price Prediction in Pune Using Multi-Featured Random Forest Approach, Rushab Sawant; Yashwant Jangid; Tushar Tiwari; Saurabh Jain; Ankita Gupta, 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA)

10. Time-Aware Latent Hierarchical Model for Predicting House Prices, Fei Tan; Chaoran Cheng; Zhi Wei, 2017 IEEE International Conference on Data Mining (ICDM)

Paper Acceptance mail:

Plagiarism report:

C.Journal Paper

House Price Prediction using machine learning

K Pavan,T Raghul
Abstract:

Usually, the house price index represents the summarized price changes of residential housing. To make it easier for a family to search for a house, we make the prediction more precise by asking for the required square feet and the number of bedrooms and bathrooms required. With a preloaded dataset and data features, a practical data pre-processing and creative feature engineering method is examined in this paper. The paper also proposes a regression technique in machine learning to predict house prices.

Keywords: House Price, Regression Technique, Machine Learning

1. INTRODUCTION:

Data is at the heart of technical innovations, and achieving almost any result is now possible using predictive models. Machine learning is used extensively in this approach. Machine learning means providing a valid dataset on which further predictions are based; the machine itself learns how much importance a particular event may have on the entire system based on its pre-loaded data and predicts the result accordingly. Modern applications of this technique include predicting stock prices, predicting the possibility of an earthquake and predicting company sales, and the list of possibilities is endless.

Our aim is to predict a house price based on the customer's needs and priorities. By analyzing previous market trends and price ranges, as well as upcoming developments, future prices will be predicted. The system involves a website which accepts the customer's specifications and then combines them with the application of a neural network.

Machine Learning

Machine learning is a subset of artificial intelligence (AI). It gives a system the ability to learn and improve automatically by itself. It focuses on the development of computer programs that can access data and learn by themselves. The process of learning begins with observations based on the examples that we provide. The aim is to make computers learn by themselves without the need for a human.

Machine Learning Methods

Machine learning can be classified into three types, namely supervised, unsupervised and reinforcement learning. Supervised machine learning algorithms apply what has been learned in the past to new data to predict future events. They analyse a known training dataset and produce a function to predict outputs. After training, the system provides outputs for new inputs. The system also compares its output with the correct, intended output, finds the errors and modifies the model to make it more practical and useful.

In contrast, unsupervised machine learning algorithms are the ones which do not require any supervision. They are used when the sample data used for training is not classified or labeled. As the name suggests, the model itself finds the hidden patterns and insights. The system may or may not produce the right output, but it explores the data and can draw inferences from datasets on its own.

Semi-supervised machine learning algorithms are a combination of supervised and unsupervised learning. In semi-supervised learning, an algorithm learns from a dataset that includes both labeled and unlabeled data, usually mostly unlabeled. Generally it is chosen when the sample data requires skilled resources in order to train from it; otherwise, it does not require additional resources.

Reinforcement machine learning is a learning method that works based on feedback. Reinforcement learning differs from supervised learning in that labelled input/output pairs need not be presented. It is studied in various disciplines such as statistics, information theory, etc.

The method that we have used here is supervised machine learning.

PROBLEM STATEMENT:

A house is one of the most valuable assets an individual can purchase during their life. Hence we need to be extremely careful before buying a house and make sure we pay the right price for it.

In this paper, we explore different machine learning techniques and methods to predict house prices. The data contains the train and the test dataset. Our objective is to predict house prices based on the user's requirements and needs. Our model predicts the price of a house from the sample data that has been given.

2. EXISTING AND PROPOSED SYSTEM

EXISTING SYSTEM

Multiple Linear Regression

Multiple linear regression shows the relationship between two or more explanatory variables and a scalar response variable. Each independent variable value is associated with a dependent variable value.

Limitations

The dependent variable y must be continuous. The independent variables can be of any type. The dependent variable is usually dependent on the independent variables.

Proposed System

Linear regression is a technique that helps to identify the relationship between a dependent variable and an independent variable. The regression technique that we used here is linear regression.

Advantages

• Space complexity is very low: it only needs to save the weights at the end of training. Hence it is a low-latency algorithm.
• It is very simple to understand.
• Good interpretability.
• Feature importance is generated at the time of model building. With the help of the hyperparameter lambda, feature selection can be handled, and hence dimensionality reduction can be achieved.

3. REQUIRED SYSTEM

HARDWARE REQUIREMENTS

The most common set of requirements defined by any operating system or software application is the physical computer resources, also known as hardware. A hardware requirements list is often accompanied by a hardware compatibility list, especially in the case of operating systems. The minimal hardware requirements are as follows:

1. PROCESSOR : PENTIUM IV
2. RAM : 8 GB
3. PROCESSOR : 2.4 GHZ
4. MAIN MEMORY : 8 GB RAM
5. PROCESSING SPEED : 600 MHZ
6. HARD DISK DRIVE : 1 TB
7. KEYBOARD : 104 KEYS

SOFTWARE REQUIREMENTS

Software requirements deal with defining the resource requirements and prerequisites that need to be installed on a computer for an application to function. These requirements need to be installed separately. The minimal software requirements are as follows:

• FRONT END : PYTHON
• IDE : ANACONDA
• OPERATING SYSTEM : WINDOWS 10

45
4. ARCHITECTURE OF PROPOSED SYSTEM:

[Figure: architecture of the proposed system. Blocks shown: Collection of Dataset; Data Loading and Pre-Processing; Determine Dependent and Independent Value; Calculate the variable coefficient using Regression; Calculate the Prediction; Determine the Prediction Results; Get the results and calculate.]

Description of the Architecture

• The sample dataset is collected.
• The sample data is loaded.
• The dependent value is determined. This is the value that depends on other values; here the dependent value is the price.
• The independent values are also determined. These are the values that do not depend on other values; here the square feet (area), number of bedrooms and number of bathrooms are the independent values.
• Using linear regression, the variables are calculated.
• When a user enters the requirements, the system predicts and shows the results.

5. Module Implementation

Collection of Dataset

The dataset used in this project contains parameters such as the area in square feet, the location, and the number of bedrooms and bathrooms of each property. The selling price is the dependent variable, which depends on several independent variables.

Data Preprocessing

Data preprocessing is the process of transforming raw, complex data into systematic, understandable knowledge. It involves finding missing and redundant data in the dataset, which brings uniformity to the dataset. In our dataset, however, there were no missing values.

Import Libraries

A library is a collection of modules. The first step is to import the libraries that our system requires. They provide functions that can be invoked without writing the underlying code ourselves. pandas is one of the most popular Python libraries for data science; we have imported the pandas library and named it pd.

Import the Dataset

Many datasets come in CSV format. First we locate the directory of the CSV file and then read it using the read_csv function from the pandas library.

Encoding categorical data

Sometimes the data contains text, and the categories appear in text form. It is harder for machines to understand and process text, so we convert it to numbers. Therefore, we have to encode the categorical data.

Split Dataset into Training and Test Set

Now we split our dataset into two sets: a training set and a test set. We train our machine learning models on the training set, i.e. the models learn the relationships in the training set, and then we test the models on the test set to check how well they predict. A common choice is to allocate 80% of the dataset to the training set and the remaining 20% to the test set.

7.Conclusion

In this paper, several tests have been performed using the linear regression algorithm for house price prediction. The algorithm predicts the prices of new properties that are going to be listed by taking some input variables and predicting a correct and justified price. It was a great learning experience building this predictive sale price model. In future work, different methods suited to time-series data will be used to obtain smaller prediction errors, and more data will be used to get better results.

References
1. Housing Price Prediction Using Machine Learning Algorithms: The Case of Melbourne City, Australia, The Danh Phan.

2. Predicting Sales Prices of the Houses Using Regression Methods of Machine Learning, Parasich Andrey Viktorovich; Parasich Viktor Aleksandrovich; Kaftannikov Igor Leopoldovich; Parasich Irina Vasilevna.

3. Real Estate Value Prediction Using Linear Regression, Nehal N Ghosalkar; Sudhir N Dhage.

4. Predicting Housing Market Trends Using Twitter Data, Marlon Velthorst; Cicek Güven.

5. House Price Prediction Using Machine Learning and Neural Networks, Ayush Varma; Abhijit Sarma; Sagar Doshi; Rohini Nair.

CODING:

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

matplotlib.rcParams["figure.figsize"] = (20, 10)

# importing the dataset (raw string so the backslashes are kept literally)
dataset = pd.read_csv(r'..\dataset\Bengaluru_House_Data.csv')
print(dataset.head(10))
print(dataset.shape)

# Data preprocessing

## getting the count of area type in the dataset
print(dataset.groupby('area_type')['area_type'].agg('count'))

## dropping unnecessary columns
dataset.drop(['area_type', 'society', 'availability', 'balcony'], axis='columns', inplace=True)
print(dataset.shape)

## data cleaning
print(dataset.isnull().sum())
dataset.dropna(inplace=True)
print(dataset.shape)

### data engineering
print(dataset['size'].unique())
dataset['bhk'] = dataset['size'].apply(lambda x: float(x.split(' ')[0]))

### exploring 'total_sqft' column
print(dataset['total_sqft'].unique())

#### defining a function to check whether the value is float or not
def is_float(x):
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

print(dataset[~dataset['total_sqft'].apply(is_float)].head(10))

#### defining a function to convert the range of column values to a single value
def convert_sqft_to_num(x):
    tokens = x.split('-')
    if len(tokens) == 2:
        # a range such as '2100 - 2850' is replaced by its midpoint
        return (float(tokens[0]) + float(tokens[1])) / 2
    try:
        return float(x)
    except ValueError:
        return None

#### testing the function
print(convert_sqft_to_num('290'))
print(convert_sqft_to_num('2100 - 2850'))
print(convert_sqft_to_num('4.46Sq. Meter'))

#### applying this function to the dataset
dataset['total_sqft'] = dataset['total_sqft'].apply(convert_sqft_to_num)
print(dataset['total_sqft'].head(10))
print(dataset.loc[30])

## feature engineering
print(dataset.head(10))

### creating new column 'price_per_sqft', as we know that in the real estate
### market, price per sqft matters a lot
dataset['price_per_sqft'] = dataset['price'] * 100000 / dataset['total_sqft']
print(dataset['price_per_sqft'])

### exploring 'location' column
print(len(dataset['location'].unique()))
dataset['location'] = dataset['location'].apply(lambda x: x.strip())
location_stats = dataset.groupby('location')['location'].agg('count').sort_values(ascending=False)
print(location_stats[0:10])

#### creating 'location_stats' to get the locations with their total count of
#### occurrences, and 'location_stats_less_than_10' to get the locations with
#### <= 10 occurrences
print(len(location_stats[location_stats <= 10]))
location_stats_less_than_10 = location_stats[location_stats <= 10]
print(location_stats_less_than_10)

#### redefining the 'location' column as 'other' where the location count
#### is <= 10
dataset['location'] = dataset['location'].apply(lambda x: 'other' if x in location_stats_less_than_10 else x)
print(dataset['location'].head(10))
print(len(dataset['location'].unique()))

## Outlier detection and removal

### checking 'total_sqft' / 'bhk': if it is very low, there is some
### anomaly and we have to remove these outliers
print(dataset[dataset['total_sqft'] / dataset['bhk'] < 300].sort_values(by='total_sqft').head(10))
print(dataset.shape)
dataset = dataset[~(dataset['total_sqft'] / dataset['bhk'] < 300)]
print(dataset.shape)

### checking columns where ’price_per_sqft’ is very low ###

where it should not be that low, so it’s an anomaly and ### we

have to remove those rows

print(dataset[’price_per_sqft’].describe())

### function to remove extreme cases of very high or low 'price_per_sqft':
### for each location, keep only the rows within one std() of the location mean

def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        mean = np.mean(subdf['price_per_sqft'])
        std = np.std(subdf['price_per_sqft'])
        reduced_df = subdf[(subdf['price_per_sqft'] > (mean - std)) &
                           (subdf['price_per_sqft'] <= (mean + std))]
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out

dataset = remove_pps_outliers(dataset)

print(dataset.shape)
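The per-location filter can also be written without an explicit loop. The following is a sketch of an equivalent vectorised form using `groupby().transform`, run on hypothetical data; it is not the script's own implementation.

```python
import pandas as pd

# Hypothetical data: one obvious outlier (500) in location "A".
toy = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B", "B"],
    "price_per_sqft": [100.0, 110.0, 90.0, 500.0, 40.0, 50.0, 60.0],
})

grouped = toy.groupby("location")["price_per_sqft"]
mean = grouped.transform("mean")
# ddof=0 gives the population standard deviation, which is what np.std computes.
std = grouped.transform(lambda s: s.std(ddof=0))

# Keep rows inside (mean - std, mean + std] of their own location.
kept = toy[(toy["price_per_sqft"] > mean - std) & (toy["price_per_sqft"] <= mean + std)]

print(kept)
```

In location "B" only the middle value survives, which shows that a band of one standard deviation is fairly tight and removes a noticeable share of ordinary rows along with the true outliers.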

### plotting a graph to visualize, for a given location, cases where 3 bhk
### properties with a higher 'total_sqft' are priced lower than 2 bhk
### properties with a lower 'total_sqft'

def plot_scatter_chart(df, location):
    bhk2 = df[(df['location'] == location) & (df['bhk'] == 2)]
    bhk3 = df[(df['location'] == location) & (df['bhk'] == 3)]
    matplotlib.rcParams['figure.figsize'] = (15, 10)
    plt.scatter(bhk2['total_sqft'], bhk2['price'],
                color='blue', label='2 BHK', s=50)
    plt.scatter(bhk3['total_sqft'], bhk3['price'],
                marker='+', color='green', label='3 BHK', s=50)
    plt.xlabel('Total Square Feet Area')
    plt.ylabel('Price')
    plt.title(location)
    plt.legend()
    plt.show()

plot_scatter_chart(dataset, "Hebbal")

plot_scatter_chart(dataset, "Rajaji Nagar")

### defining a function to find the rows where, within the same 'location',
### a property with more 'bhk' has a lower 'price_per_sqft' than the mean
### 'price_per_sqft' of properties with one fewer 'bhk'. This is also an
### anomaly and we have to remove these properties

def remove_bhk_outliers(df):
    exclude_indices = np.array([], dtype=int)
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df['price_per_sqft']),
                'std': np.std(bhk_df['price_per_sqft']),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            # compare against the group with one fewer bedroom, if it is large enough
            stats = bhk_stats.get(bhk - 1)
            if stats and stats['count'] > 5:
                exclude_indices = np.append(
                    exclude_indices,
                    bhk_df[bhk_df['price_per_sqft'] < (stats['mean'])].index.values)
    return df.drop(exclude_indices, axis='index')

dataset = remove_bhk_outliers(dataset)

print(dataset.shape)
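The rule applied by `remove_bhk_outliers` can be reproduced on a small hypothetical frame with a merge instead of nested loops. This is a sketch of the same comparison (each flat against the mean of flats with one fewer bedroom in the same location), not a drop-in replacement.

```python
import pandas as pd

# Hypothetical data: six 2 bhk flats at 4000 per sqft and two 3 bhk flats.
toy = pd.DataFrame({
    "location": ["X"] * 8,
    "bhk": [2, 2, 2, 2, 2, 2, 3, 3],
    "price_per_sqft": [4000.0] * 6 + [3500.0, 4500.0],
})

# Mean and count of price_per_sqft for every (location, bhk) group.
stats = (toy.groupby(["location", "bhk"])["price_per_sqft"]
            .agg(["mean", "count"])
            .reset_index())

# Shift bhk up by one so each flat joins the stats of the next-smaller size.
stats["bhk"] = stats["bhk"] + 1
merged = toy.merge(stats, on=["location", "bhk"], how="left")

# Outlier: cheaper per sqft than the smaller size, given more than 5 such flats.
is_outlier = (merged["count"] > 5) & (merged["price_per_sqft"] < merged["mean"])
cleaned = toy[~is_outlier.to_numpy()]

print(cleaned)
```

Only the 3 bhk flat at 3500 per sqft is removed. The 2 bhk flats have no smaller group to compare against, so their merged statistics are NaN and the condition is False for them.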


plot_scatter_chart(dataset,"Hebbal")

plot_scatter_chart(dataset,"Rajaji Nagar")

### histogram of properties by price per square foot

matplotlib.rcParams['figure.figsize'] = (20, 10)

plt.hist(dataset['price_per_sqft'], rwidth=0.8)

plt.xlabel('Price Per Square Feet')

plt.ylabel('Count')

plt.title('Histogram of Properties by Price Per Square Feet')

plt.show()

### exploring bathroom feature

print(dataset['bath'].unique())

#### having more than 10 bathrooms is unusual, even for a 10-bedroom property,
#### so we will inspect and remove these anomalies

print(dataset[dataset['bath'] > 10])

#### plotting histogram of bathroom

plt.hist(dataset['bath'], rwidth=0.8, color='red')

plt.xlabel('Number of Bathrooms')

plt.ylabel('Count')

plt.title('Histogram of Bathroom per Property')

plt.show()

print(dataset[dataset['bath'] > dataset['bhk'] + 2])

#### keeping only the rows with fewer than 'bhk' + 2 bathrooms
#### (rows with exactly 'bhk' + 2 bathrooms are dropped as well)

dataset = dataset[dataset['bath'] < dataset['bhk'] + 2]

print(dataset.shape)

### after removing outliers, dropping unwanted features

dataset = dataset.drop(['size', 'price_per_sqft'], axis='columns')

print(dataset.head())

## one hot encoding the 'location' column

dummies = pd.get_dummies(dataset['location'])

print(dummies.head())

### the 'other' dummy column is dropped, so 'other' becomes the all-zeros baseline

dataset = pd.concat([dataset, dummies.drop('other', axis='columns')], axis='columns')

dataset.drop('location', axis=1, inplace=True)

print(dataset.head())

print(dataset.shape)
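What the encoding step produces can be seen on three hypothetical rows. The sketch passes `dtype=int` only to make the printed output 0/1 rather than booleans; the script itself uses the default.

```python
import pandas as pd

# Hypothetical rows after rare locations have been folded into 'other'.
toy = pd.DataFrame({"location": ["Hebbal", "other", "Yelahanka"],
                    "total_sqft": [1200.0, 900.0, 1500.0]})

# One indicator column per distinct location.
dummies = pd.get_dummies(toy["location"], dtype=int)

# Drop one category ('other') so the indicator columns are not perfectly collinear.
encoded = pd.concat([toy.drop(columns="location"),
                     dummies.drop(columns="other")], axis="columns")

print(encoded)
```

The 'other' row ends up with zeros in every location column, so it acts as the baseline that the remaining location coefficients are measured against.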

## distributing independent features in 'X' and dependent feature in 'y'

X = dataset.drop(['price'], axis='columns')

y = dataset['price']

print(X.shape)

print(y.shape)

## splitting the dataset into training set and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

## training the model

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X_train,y_train)

print(regressor.score(X_test,y_test))

## cross validation with 5 random shuffle splits

from sklearn.model_selection import ShuffleSplit, cross_val_score

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

print(cross_val_score(regressor, X, y, cv=cv))
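`ShuffleSplit` draws independent random train/test splits, whereas `KFold` partitions the data into disjoint folds. The sketch below runs both on synthetic data (the areas, prices and noise level are made up) to show that either splitter can be passed to `cross_val_score` through `cv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

# Synthetic, hypothetical data: price grows linearly with area plus noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(500, 3000, size=(200, 1))
y_toy = 0.05 * X_toy[:, 0] + rng.normal(0, 5, size=200)

# Five random 80/20 splits; test sets may overlap between splits.
shuffle_cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
# Five disjoint folds; every row is used for testing exactly once.
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=0)

shuffle_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=shuffle_cv)
kfold_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=kfold_cv)

print(shuffle_scores.mean(), kfold_scores.mean())
```

For a regressor, `cross_val_score` reports the R² score of each split by default, the same metric that `regressor.score` returns.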

## grid search, hyperparameter tuning

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Lasso

from sklearn.tree import DecisionTreeRegressor

def find_best_model_using_gridsearch(X, y):
    # note: the 'normalize' parameter and the 'mse' criterion were removed in
    # scikit-learn 1.2, so 'fit_intercept' and 'squared_error' are used instead
    algos = {
        'linear_regression': {
            'model': LinearRegression(),
            'params': {'fit_intercept': [True, False]}
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1, 2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion': ['squared_error', 'friedman_mse'],
                'splitter': ['best', 'random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'],
                          config['params'],
                          cv=cv,
                          n_jobs=-1,
                          return_train_score=False)
        gs.fit(X, y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })
    return pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])

model_scores = find_best_model_using_gridsearch(X, y)

print(model_scores)

### after running grid search, the linear regression model has the best score,

### so I will fit a linear regression model on the whole dataset

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X,y)

## evaluating the model on sample inputs

def predict_price(location, sqft, bath, bhk):
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    # locations without a dummy column (including 'other') keep all dummies at 0
    if location in X.columns:
        loc_index = np.where(X.columns == location)[0][0]
        x[loc_index] = 1
    return regressor.predict([x])[0]

print(predict_price('1st Phase JP Nagar', 1000, 2, 2))

print(predict_price('1st Phase JP Nagar', 1000, 3, 3))

print(predict_price('Indira Nagar', 1000, 3, 3))
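The input row that `predict_price` hands to the model can be illustrated without a fitted regressor. The column layout below is hypothetical (three numeric features followed by two location dummies), and `build_features` is an illustrative helper, not part of the script.

```python
import numpy as np

# Hypothetical column layout: numeric features first, then location dummies.
columns = np.array(["total_sqft", "bath", "bhk", "1st Phase JP Nagar", "Indira Nagar"])

def build_features(location, sqft, bath, bhk):
    """Return one input row in the same order as `columns`."""
    x = np.zeros(len(columns))
    x[0], x[1], x[2] = sqft, bath, bhk
    matches = np.where(columns == location)[0]
    if matches.size > 0:
        # Known location: switch on its dummy column.
        x[matches[0]] = 1
    # Unknown location: all dummies stay 0, the same encoding as 'other'.
    return x

print(build_features("Indira Nagar", 1000, 3, 3))
print(build_features("Unknown Layout", 1000, 3, 3))
```

Checking `matches.size` before indexing matters because `np.where(...)[0]` is empty when the name is not a column, and taking `[0]` of an empty array raises an IndexError.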

# saving the model

import pickle

with open('bangalore_home_prices_model.pickle', 'wb') as f:
    pickle.dump(regressor, f)

# exporting columns

import json

columns = {'data_columns': [col.lower() for col in X.columns]}

with open("columns.json", "w") as f:
    f.write(json.dumps(columns))
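Reading the two saved files back might look like the sketch below. `load_artifacts` is a hypothetical helper, and the round trip uses temporary files with a stand-in object in place of the fitted regressor, so it runs without the real artifacts.

```python
import json
import pickle
import tempfile
from pathlib import Path

def load_artifacts(model_path, columns_path):
    """Read back a pickled model and the JSON list of column names."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(columns_path) as f:
        data_columns = json.load(f)["data_columns"]
    return model, data_columns

# Round trip with a stand-in object; in practice the pickle holds the regressor.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pickle"
    columns_path = Path(tmp) / "columns.json"
    model_path.write_bytes(pickle.dumps({"stand_in": True}))
    columns_path.write_text(json.dumps({"data_columns": ["total_sqft", "bath", "bhk"]}))
    model, data_columns = load_artifacts(model_path, columns_path)

print(data_columns)
```

Because the column names are lower-cased before export, a location passed in at prediction time has to be lower-cased as well before it is looked up in `data_columns`. Pickle files should only be loaded from sources that are trusted, since unpickling can execute arbitrary code.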
