
INTRODUCTION TO DATA SCIENCE

UNIT-5
Data Visualization and Prototype Application Development: Data Visualization
options, Crossfilter, the JavaScript MapReduce library, Creating an interactive
dashboard with dc.js, Dashboard development tools.
Applying the Data Science process for real world problem solving scenarios as a
detailed case study.

INTRODUCTION

Data visualization to the end user

• Data visualization is the process of presenting data in a visual format,
such as charts, graphs, or maps, to make it easier for end users to
understand and interpret.
• It helps users quickly identify patterns, trends, and insights from the
data, making complex information more accessible and actionable.
• Common tools for data visualization include bar charts, line graphs, pie
charts, heatmaps, and dashboards.
• Often, data scientists must deliver their new insights to the end user.

The results can be communicated in several ways:

• A one-time presentation
• A new viewport on your data
• A real-time dashboard

Prepared by Mr.K Srikanth, Asst. Professor, IT, VNITSW, Guntur Page 1


1. A One-time Presentation
• This is typically used for presenting findings from a specific analysis or
project.
• It involves creating a visual and verbal presentation, often using slides,
to explain the insights and recommendations clearly.
• Charts, graphs, and infographics are used to make the information
more understandable and impactful.

2. New Viewport on Your Data
• This involves creating a new way of viewing or interacting with existing
data.
• It might be a new report, chart, or interactive visualization that provides
fresh insights or highlights a specific aspect of the data relevant to the
end user.
• This approach allows users to explore data from different perspectives
and gain new understanding.

3. Real-time Dashboard
• Dashboards provide dynamic, real-time visualizations of data, allowing
end users to monitor key metrics and performance indicators as they
happen.
• Dashboards are often used for continuous monitoring, offering an
up-to-date view of trends, progress, or any issues that may need
immediate action.
• They are highly interactive and customizable, designed to meet the
specific needs of the user.



When working on data projects, you need to think about a few key factors:

1. Type of Decision:

Are you supporting a strategic decision or an operational one?

• Strategic decisions are usually made once and may not need frequent
updates.
• Operational decisions require reports that are updated regularly.

2. Size of the Organization:

• In small organizations, you may be responsible for everything, from
collecting data to creating reports.
• In larger organizations, there may be a team that creates dashboards
for you.
• Even then, delivering a prototype dashboard yourself can be beneficial,
because it presents a concrete example and often shortens delivery time.

Data visualization options


Here are some common data visualization options:

1. Charts and Graphs: Examples include bar charts, line graphs, pie charts,
and scatter plots. They help show trends, comparisons, and distributions.
2. Maps: Geographic maps display data related to locations, such as sales
per region or population density.



3. Dashboards: Dashboards combine multiple charts and graphs in one view.
They give a quick overview of key information and allow for real-time
monitoring.
4. Infographics: Visual presentations that combine text, images, and data to
tell a story or explain complex information.
5. Heatmaps: Show patterns or intensity of data values using color
gradients. They are useful for highlighting high and low points in large
datasets.
6. Tables: Display raw data in a structured format, making it easy to view
details and compare values directly.

Example

Creating a Dashboard for a Hospital Pharmacy

1. Overview
• A new government rule requires pharmacies to check if their medicines
are sensitive to light and store them in special containers.
• However, the government hasn’t provided a list of which medicines are
light-sensitive.

2. Data Scientist's Role
• As a data scientist, you can find out which medicines are
light-sensitive by examining their patient information leaflets (small
printed sheets or booklets).
• You can use text mining to categorize each medicine as "light
sensitive" or "not light sensitive."
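
The tagging step described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the leaflet text is already available as plain strings. The phrase list, the function name is_light_sensitive, and the two sample leaflets are hypothetical and are not taken from the case study; a real text-mining pipeline would need more careful handling of wording and negation.

```python
import re

# Hypothetical phrases that suggest a storage warning about light.
LIGHT_PATTERNS = [
    r"protect(ed)?\s+from\s+(sun)?light",
    r"sensitive\s+to\s+light",
    r"light[- ]sensitive",
]


def is_light_sensitive(leaflet_text: str) -> bool:
    """Return True if any light-related storage phrase occurs in the leaflet."""
    text = leaflet_text.lower()
    return any(re.search(pattern, text) for pattern in LIGHT_PATTERNS)


# Made-up leaflet snippets, for illustration only.
leaflets = {
    "Medicine A": "Store below 25 C. Protect from light.",
    "Medicine B": "Keep out of the reach of children.",
}

# Tag every medicine with one of the two labels used in the case study.
tags = {
    name: "light sensitive" if is_light_sensitive(text) else "not light sensitive"
    for name, text in leaflets.items()
}
print(tags)
```

Simple keyword rules like these are easy to inspect, which helps when the result has to be checked by pharmacy staff before it is uploaded.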

3. Database Update
• After tagging the medicines, you upload this information to a central
database.

4. Stock Analysis
• The pharmacy provides you with access to its stock data to determine how
many special containers are needed for light-sensitive medicines.

5. Data Format
• The data includes time-series information for one year, with around 10,000
entries for 29 different medicines.

6. Dashboard Options
• There are many options for creating dashboards, but this unit focuses
on dc.js, a JavaScript library that combines data handling
(Crossfilter) and data visualization (d3.js).

7. Why dc.js?
• User-Friendly: dc.js is easy to set up and allows you to create interactive
dashboards where clicking on one graph filters the data shown in other
graphs.
• Time-Saving: It helps you focus on your analysis instead of spending too
much time on dashboard creation.

8. Prerequisites
• You need d3.js and crossfilter.js for dc.js to work.
• Although d3.js can be complex, you don’t need to be an expert in it to
use dc.js.

9. Example Dashboard: You can explore a sample dashboard on the dc.js
website to see how it works and interact with the graphs.

10. Next Steps: By the end of this unit, you will be able to create a
dashboard yourself using the information and tools provided.



Crossfilter, the JavaScript MapReduce library

• Crossfilter is a powerful JavaScript library designed to handle large
datasets efficiently in web browsers.
• It allows users to filter and aggregate data across multiple dimensions
simultaneously, making it ideal for interactive data analysis and
visualization.
• The main purpose of Crossfilter is to enable fast, interactive exploration
and analysis of large datasets in web applications.
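
To make "filter and aggregate across multiple dimensions" concrete, here is a small Python sketch of the idea. It is not Crossfilter itself and does not use its API: the records, the field names, and the helper functions group_sum and cross_filter are made up for illustration. Crossfilter also relies on sorted indexes and incremental updates for speed, whereas this sketch simply recomputes everything.

```python
from collections import defaultdict

# Made-up stock records, for illustration only.
records = [
    {"medicine": "A", "light_sensitive": "Yes", "stock": 10},
    {"medicine": "B", "light_sensitive": "No", "stock": 20},
    {"medicine": "A", "light_sensitive": "Yes", "stock": 5},
    {"medicine": "C", "light_sensitive": "Yes", "stock": 7},
]


def group_sum(rows, key, value):
    """Aggregate: sum `value` for each distinct value of the `key` dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def cross_filter(rows, filters):
    """Keep rows that satisfy every active filter (dimension -> allowed values)."""
    return [
        row for row in rows
        if all(row[dim] in allowed for dim, allowed in filters.items())
    ]


# Without any filter: total stock per medicine.
unfiltered = group_sum(records, "medicine", "stock")
print(unfiltered)  # {'A': 15, 'B': 20, 'C': 7}

# A filter on one dimension changes the aggregate on another dimension.
only_sensitive = cross_filter(records, {"light_sensitive": {"Yes"}})
filtered = group_sum(only_sensitive, "medicine", "stock")
print(filtered)  # {'A': 15, 'C': 7}
```

This is the behaviour a dashboard user sees when clicking a pie slice: the selection on one chart narrows the data shown in the others.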



• JavaScript is a high-level, dynamic, interpreted programming language
primarily used for adding interactivity and functionality to web pages.
• It is one of the core technologies of the World Wide Web, alongside HTML
(HyperText Markup Language) and CSS (Cascading Style Sheets).
• JavaScript isn’t the best language for heavy data processing. However,
that did not stop companies like Square from creating MapReduce
libraries for it; Crossfilter is one of them.
• MapReduce is a programming model, with associated implementations, for
processing and generating large datasets. The work can be parallelized
across a distributed cluster of computers.
• The MapReduce model allows for the efficient processing of big data by
breaking down tasks into smaller, manageable pieces.
• When working with data, every speed improvement is helpful.
• You want to avoid sending large amounts of data over the internet or even
your internal network, for a few reasons:

1. Slow Performance: Sending big data files can slow down your
application.
2. Network Congestion: Large data transfers can clog your network,
making it less efficient.
3. Increased Costs: Transferring large volumes of data can lead to higher
data usage costs.
4. Longer Load Times: Users may have to wait longer for data to load,
which can be frustrating.
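
The MapReduce model mentioned above can be illustrated on a single machine. The following Python sketch shows the three phases (map, shuffle, reduce) on made-up stock movements; the data and variable names are hypothetical, and a real MapReduce system would distribute these phases across many workers.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Made-up (medicine, units) stock movements, for illustration only.
movements = [("A", 10), ("B", 4), ("A", -3), ("C", 8), ("B", 6)]

# Map: emit one (key, value) pair per input record.
mapped = [(medicine, units) for medicine, units in movements]

# Shuffle: bring all pairs with the same key next to each other.
mapped.sort(key=itemgetter(0))

# Reduce: combine the values of each key into a single result.
totals = {
    key: reduce(lambda acc, pair: acc + pair[1], pairs, 0)
    for key, pairs in groupby(mapped, key=itemgetter(0))
}
print(totals)  # {'A': 7, 'B': 10, 'C': 8}
```

Because each key is reduced independently, the reduce step is the part that can be split over several machines, or in Crossfilter's case, recomputed quickly inside the browser.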

Setting up everything

It’s time to build the actual application. The ingredients of our small
dc.js application are as follows:

• jQuery: To handle the interactivity
• Crossfilter.js: A MapReduce library and prerequisite to dc.js
• d3.js: A popular data visualization library and prerequisite to dc.js
• dc.js: The visualization library you will use to create your interactive
dashboard
• Bootstrap: A widely used layout library you’ll use to make it all look
better

You’ll write only three files:

• index.html: The HTML page that contains your application
• application.js: To hold all the JavaScript code you’ll write
• application.css: For your own CSS

In addition, you’ll need to run the code on an HTTP server. You could go
through the effort of setting up a LAMP (Linux, Apache, MySQL, PHP),
WAMP (Windows, Apache, MySQL, PHP), or XAMPP (Cross Environment,
Apache, MySQL, PHP, Perl) server.

But for the sake of simplicity we won’t set up any of those servers here.
Instead, you can do it with a single Python command.

Steps to Launch an HTTP Server with Python 3

1. Create a Folder Named dashboard
Create the necessary files (index.html, application.js, application.css,
and the data file) in that folder.

2. Open a Command-Line Tool
• For Windows: Open Command Prompt (CMD) by searching for "cmd" in
the Start menu.
• For Linux or macOS: Open the Terminal application.

3. Navigate to the Folder
Change to the dashboard folder, for example
C:\Users\SRIKANTH\Desktop\dashboard, using the cd command. On Windows
you can also open the folder in File Explorer, type cmd in the address
bar, and press Enter to open Command Prompt directly in that folder.

4. Check Python Installation
python --version

5. Start the HTTP Server
• Run the following command:

python -m http.server

6. Access the Server
• Open a web browser and go to http://localhost:8000.
• If the folder contains index.html, the server serves that page directly.
Otherwise the browser shows a listing of the files in the folder.
• You can also open the page explicitly:

http://localhost:8000/index.html
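
For readers who prefer to start the server from a script, the same standard-library module can be used programmatically. The sketch below is an illustration under stated assumptions, not part of the original steps: the helper name start_server is hypothetical, the server runs in a background thread on a free port chosen by the operating system, and a temporary folder with a one-line index.html stands in for the dashboard folder.

```python
import functools
import http.server
import tempfile
import threading
import urllib.request
from pathlib import Path


def start_server(directory: str, port: int = 0):
    """Serve `directory` over HTTP in a background thread (port 0 = any free port)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


with tempfile.TemporaryDirectory() as folder:
    # Stand-in for the dashboard folder: a single index.html file.
    Path(folder, "index.html").write_text(
        "<h1>Medicine Dashboard</h1>", encoding="utf-8"
    )
    server = start_server(folder)
    url = f"http://127.0.0.1:{server.server_address[1]}/"

    # Ignore any system proxy settings, since this is a local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    body = opener.open(url).read().decode("utf-8")

    server.shutdown()
    server.server_close()

print(body)
```

Requesting the folder root returns the contents of index.html, which matches what the browser shows in step 6.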



index.html

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Interactive Dashboard</title>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css">
<link rel="stylesheet" href="application.css">
</head>
<body>
<div class="container">
<h1>Medicine Dashboard</h1>

<div id="chart">
<div id="bar-chart"></div>
<div id="pie-chart"></div>
</div>
</div>

<script src="https://code.jquery.com/jquery-3.5.1.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.16.0/d3.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/crossfilter2/1.4.0/crossfilter.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/dc/3.1.1/dc.min.js"></script>
<script src="application.js"></script>
</body>
</html>



application.js

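// Load the data
d3.json("data.json").then(function(data) {
  // Create a Crossfilter instance
  var ndx = crossfilter(data);

  // Define dimensions. Each chart gets its own dimension: a Crossfilter
  // group ignores filters set on its own dimension, so two charts that
  // share one dimension would not filter each other.
  var barCategoryDim = ndx.dimension(function(d) { return d.category; });
  var pieCategoryDim = ndx.dimension(function(d) { return d.category; });

  // Group data: sum the value field per category. A plain group() only
  // counts records, which is 1 for every category in data.json.
  var barCategoryGroup = barCategoryDim.group().reduceSum(function(d) { return d.value; });
  var pieCategoryGroup = pieCategoryDim.group().reduceSum(function(d) { return d.value; });

  // Total of all values under the current filters (not used by the charts)
  var valueTotal = ndx.groupAll().reduceSum(function(d) { return d.value; });

  // Create a bar chart
  var barChart = dc.barChart("#bar-chart");
  barChart
    .width(400)
    .height(200)
    .dimension(barCategoryDim)
    .group(barCategoryGroup)
    .x(d3.scaleBand())
    .xUnits(dc.units.ordinal)
    .renderHorizontalGridLines(true)
    .renderVerticalGridLines(true)
    .elasticY(true);

  // Create a pie chart
  var pieChart = dc.pieChart("#pie-chart");
  pieChart
    .width(200)
    .height(200)
    .dimension(pieCategoryDim)
    .group(pieCategoryGroup);

  // Render the charts
  dc.renderAll();
});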



application.css

body {
background-color: #f8f9fa;
font-family: Arial, sans-serif;
}

h1 {
text-align: center;
margin-top: 20px;
}

#chart {
display: flex;
justify-content: space-around;
margin-top: 30px;
}

#bar-chart, #pie-chart {
border: 1px solid #ccc;
border-radius: 5px;
background-color: #fff;
padding: 10px;
}

data.json

[
{"category": "Pain Relief", "value": 10},
{"category": "Antibiotics", "value": 20},
{"category": "Antidepressants", "value": 15},
{"category": "Antihistamines", "value": 5},
{"category": "Vitamins", "value": 25},
{"category": "Others", "value": 30}
]



Creating an interactive dashboard with dc.js

• Creating an interactive dashboard with dc.js involves combining several
libraries: dc.js itself (for charts), crossfilter.js (for handling the dataset),
and d3.js (for rendering and manipulating SVG elements).

Prerequisites

• dc.js
• crossfilter.js
• d3.js
• HTML/CSS/JavaScript skills

Components in the Dashboard

The target dashboard contains the following charts and components:

• Line Chart: for stock tracking over time (likely for a single medicine).
• Bar Chart: displaying various medicines and their quantities.
• Pie Chart: showing a categorical division (e.g., availability "Yes" or "No").
• Reset Filters Button: to reset all applied filters.

Steps to Build a Dashboard like this Using dc.js

1. Setting up the HTML structure

index.html

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Medicine Stock Dashboard</title>



<!-- Include necessary libraries -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/6.6.2/d3.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/crossfilter/1.3.12/crossfilter.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/dc/4.2.7/dc.min.js"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/dc/4.2.7/dc.min.css">

<!-- Custom styles -->


<style>
body {
font-family: Arial, sans-serif;
background-color: #f4f4f9;
margin: 20px;
}

.dashboard-container {
display: flex;
flex-wrap: wrap;
justify-content: space-between;
}

.chart {
margin: 20px;
padding: 20px;
background-color: #ffffff;
border: 1px solid #ddd;
border-radius: 10px;
}

.chart h3 {
text-align: center;
margin-bottom: 15px;
}

#reset-filters {



margin: 20px;
padding: 10px 20px;
background-color: #28a745;
color: white;
border: none;
border-radius: 5px;
cursor: pointer;
}

#reset-filters:hover {
background-color: #218838;
}
</style>
</head>
<body>

<h1>Medicine Stock Dashboard</h1>

<div class="dashboard-container">
<!-- Line Chart (Stock Over Time) -->
<div id="line-chart" class="chart" style="width: 45%;">
<h3>Stock Over Time</h3>
</div>

<!-- Bar Chart (Medicines and Stock Levels) -->


<div id="bar-chart" class="chart" style="width: 45%;">
<h3>Medicine Stock Levels</h3>
</div>

<!-- Pie Chart (Yes/No Availability) -->


<div id="pie-chart" class="chart" style="width: 30%;">
<h3>Medicine Availability</h3>
</div>
</div>

<!-- Reset Filters Button -->


<button id="reset-filters">Reset Filters</button>

<script src="script.js"></script>



</body>
</html>

2. Create Multiple Charts with dc.js in script.js

This file will hold the logic for the charts. Below is the JavaScript to create
different types of charts (line, bar, pie, etc.), based on the components in the
image.

script.js

// Load the CSV data (replace with your actual dataset)
d3.csv('data.csv').then(function(data) {
  // Parse the data; dates are assumed to be in the format YYYY-MM-DD
  var parseDate = d3.timeParse('%Y-%m-%d');
  data.forEach(function(d) {
    d.date = parseDate(d.date);
    d.stock = +d.stock; // convert the string to a number
    // d.medicine and d.available ("Yes" or "No") stay as strings
  });

  // Initialize crossfilter
  var ndx = crossfilter(data);

  // Define dimensions (one per chart, so the charts filter each other)
  var dateDimension = ndx.dimension(function(d) { return d.date; });
  var medicineDimension = ndx.dimension(function(d) { return d.medicine; });
  var availabilityDimension = ndx.dimension(function(d) { return d.available; });

  // Define groups
  var stockByDate = dateDimension.group().reduceSum(function(d) { return d.stock; });
  var stockByMedicine = medicineDimension.group().reduceSum(function(d) { return d.stock; });
  var availabilityGroup = availabilityDimension.group(); // counts records

  // Line Chart (Stock Over Time)
  var lineChart = dc.lineChart('#line-chart');
  lineChart
    .width(450)
    .height(300)
    .dimension(dateDimension)
    .group(stockByDate)
    .x(d3.scaleTime().domain(d3.extent(data, function(d) { return d.date; })))
    .xAxisLabel("Date")
    .yAxisLabel("Stock Level");

  // Bar Chart (Stock Levels by Medicine)
  var barChart = dc.barChart('#bar-chart');
  barChart
    .width(450)
    .height(300)
    .dimension(medicineDimension)
    .group(stockByMedicine)
    .x(d3.scaleBand())
    .xUnits(dc.units.ordinal)
    .xAxisLabel("Medicine")
    .yAxisLabel("Stock Level")
    .barPadding(0.1)
    .outerPadding(0.05);

  // Pie Chart (Availability Yes/No)
  var pieChart = dc.pieChart('#pie-chart');
  pieChart
    .width(300)
    .height(300)
    .radius(150)
    .dimension(availabilityDimension)
    .group(availabilityGroup);

  // Reset Filters Button
  d3.select('#reset-filters').on('click', function() {
    dc.filterAll(); // Reset all filters
    dc.redrawAll(); // Redraw all charts with the filters cleared
  });

  // Render all charts once; no per-chart render() call is needed
  dc.renderAll();
});

3. Data (data.csv)

To simulate the dashboard, you can create a sample dataset with fields for
date, medicine, stock, and availability.

data.csv

date,medicine,stock,available
2023-01-01,Paracetamol,200,Yes
2023-01-02,Ibuprofen,150,No
2023-01-03,Amoxicillin,180,Yes
2023-01-04,Aspirin,100,Yes
2023-01-05,Cetirizine,90,No
2023-01-06,Metformin,250,Yes
2023-01-07,Atorvastatin,300,Yes
2023-01-08,Lisinopril,120,No
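
Before wiring the file into the dashboard, it can help to check by hand what the bar and pie charts should display. The following Python sketch, using only the standard library, recomputes the two aggregates from the sample rows above. The rows are embedded as a string so the example is self-contained; the variable names are illustrative.

```python
import csv
import io
from collections import Counter, defaultdict

# The sample rows from data.csv, embedded so the sketch is self-contained.
SAMPLE_CSV = """date,medicine,stock,available
2023-01-01,Paracetamol,200,Yes
2023-01-02,Ibuprofen,150,No
2023-01-03,Amoxicillin,180,Yes
2023-01-04,Aspirin,100,Yes
2023-01-05,Cetirizine,90,No
2023-01-06,Metformin,250,Yes
2023-01-07,Atorvastatin,300,Yes
2023-01-08,Lisinopril,120,No
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Bar chart: total stock per medicine (sum of the stock column).
stock_by_medicine = defaultdict(int)
for row in rows:
    stock_by_medicine[row["medicine"]] += int(row["stock"])

# Pie chart: number of records per availability value.
availability = Counter(row["available"] for row in rows)

print(dict(stock_by_medicine))
print(dict(availability))  # {'Yes': 5, 'No': 3}
```

If the charts in the browser disagree with these numbers, the problem is usually in the parsing step or in the group definitions.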



Dashboard development tools
• When it comes to dashboard development tools, there are many options
available depending on your budget, technical expertise, and specific needs.

Here’s an overview of some popular tools for developing interactive
dashboards:

1. Paid Dashboard Tools

These tools are known for their powerful features, ease of use, and support.
They are often used by businesses, but usually require purchasing a license.

• Tableau: Known for its ease of use, powerful visualizations, and
interactivity. It’s widely used for business intelligence and data analysis.
• Microsoft Power BI: Microsoft’s tool for creating interactive dashboards,
with seamless integration into the Microsoft ecosystem (Excel, Azure,
etc.).
• QlikView / Qlik Sense: These are strong tools for data visualization and
business intelligence, offering drag-and-drop simplicity.
• SAP Analytics Cloud: Provides cloud-based analytics and data
visualization solutions, tightly integrated with SAP’s enterprise systems.
• MicroStrategy: Known for its powerful data analytics capabilities and
enterprise-level scalability.
• IBM Cognos Analytics: Another enterprise-level tool for building
interactive dashboards, reports, and analytics.
• TIBCO Spotfire: Offers a user-friendly interface with powerful analytics,
and supports a wide range of data sources.
• SAS Visual Analytics: Provides visual data exploration tools, advanced
analytics, and forecasting.



2. Free and Open-Source Tools

These are excellent options for developers who prefer customizable and
cost-effective solutions.

• Grafana: A popular open-source platform for monitoring and
observability, allowing integration with various data sources like
time-series databases.
• Google Data Studio (renamed Looker Studio in 2022): A free tool from
Google that integrates well with Google’s ecosystem (Google Analytics,
Sheets, etc.) for building dashboards.
• Metabase: A free, open-source platform designed for easy access to
business data, great for teams to ask questions and create visual
dashboards.
• Redash: Open-source tool to create visualizations and dashboards
directly from SQL databases.
• Kibana: Part of the Elastic Stack, Kibana allows visualization of data
stored in Elasticsearch, commonly used for logs and analytics.
• Dash by Plotly: An open-source framework for building interactive
web-based dashboards using Python, R, or Julia.
• Superset (Apache): An open-source data exploration and visualization
platform originally developed by Airbnb. It allows creating charts, maps,
and dashboards.

3. JavaScript Libraries for Custom Dashboards

If you prefer to build dashboards from scratch, these JavaScript libraries are
excellent for creating custom, highly interactive dashboards.

• D3.js: A powerful library for creating complex data visualizations in the
browser. It provides full control over how data is visualized.



• dc.js: Built on top of D3.js, dc.js is designed for building fast, interactive
dashboards with crossfiltering capabilities.
• Chart.js: A simple, flexible library for charting. Good for basic charts and
lightweight dashboard solutions.
• Highcharts: A JavaScript charting library used to create interactive
charts. It’s free for personal use, but requires a license for commercial
use.
• Google Charts: Free and easy to use, Google Charts provides a variety of
chart types and works well for basic data visualizations.

4. Embedded Analytics Platforms

For developers who want to embed dashboards within applications:

• Looker: A modern data platform for embedded analytics. Acquired by
Google (the acquisition was completed in 2020), it provides tools to build
powerful visualizations and insights.
• Sisense: Known for embedding analytics into applications, it provides
powerful visualization and dashboard creation capabilities.
• Embedded Power BI: Microsoft also offers Power BI for embedding
dashboards into web applications or portals.

5. Cloud-Based Tools

These platforms allow you to build, share, and access dashboards entirely
online:

• Zoho Analytics: A cloud-based business intelligence tool that allows
easy drag-and-drop dashboard creation.
• ClicData: A cloud-based platform for building and sharing data
visualizations, with a focus on easy integration with other services.



• Mode Analytics: A cloud-based platform focused on SQL-based queries
with built-in visualization tools for dashboards.

Applying the Data Science process for real world problem solving
scenarios as a detailed case study
Case Study: Predicting Housing Prices in a City

Objective

Overview:

• The housing market in urban areas is often influenced by various factors,
such as location, size, number of bedrooms, and local amenities.
• Accurately predicting housing prices can assist buyers, sellers, and real
estate agents in making informed decisions.

Goal:

• The aim of this case study is to develop a predictive model that estimates
housing prices based on several features of the properties.

1. Problem Definition

Key Questions:

• What factors most significantly influence housing prices?
• Can we accurately predict housing prices using historical data?
• How can this model help stakeholders in the real estate market?

Scope:

• The scope includes residential properties within a specific urban area over
the past five years, focusing on factors such as square footage, location (zip
code), number of bedrooms, number of bathrooms, and additional features
like a garage or garden.

2. Data Collection

Data Sources:

• Real Estate Websites: APIs from platforms like Zillow or Redfin, or web
scraping, to collect historical housing data.
• Government Databases: Local government databases for demographic
information, neighborhood crime rates, school district ratings, etc.
• Survey Data: Surveys that gather information on buyer preferences and
market sentiment.

Dataset Fields:

• Price: The sale price of the house (target variable).
• Square Footage: Total area of the house.
• Bedrooms/Bathrooms: Number of bedrooms and bathrooms.
• Location: Zip code or neighborhood classification.
• Year Built: The year the house was built, from which its age can be
derived.
• Additional Features: Presence of a garden, garage, pool, etc.

3. Data Cleaning and Preprocessing

Data cleaning and preprocessing are crucial for ensuring the quality of the
dataset before analysis. This step includes:

• Handling Missing Values: Assess and impute or remove records with
missing values.
• Outlier Detection: Identify and manage outliers that could skew results.
• Data Transformation: Convert categorical variables into numerical
format (e.g., using one-hot encoding for neighborhoods).
• Normalization: Scale numerical features to ensure uniformity, especially
for models sensitive to feature scales.

Example Code:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load data
data = pd.read_csv('housing_data.csv')

# Handle missing values: fill numeric columns with their column mean
data = data.fillna(data.mean(numeric_only=True))

# Outlier handling: drop the most expensive 5% of houses
data = data[data['price'] < data['price'].quantile(0.95)]

# One-hot encoding for categorical features
data = pd.get_dummies(data, columns=['neighborhood'], drop_first=True)

# Normalize numerical features (zero mean, unit variance)
numeric_cols = ['square_footage', 'bedrooms', 'bathrooms']
scaler = StandardScaler()
data[numeric_cols] = scaler.fit_transform(data[numeric_cols])
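
To see what the normalization step computes, the sketch below standardizes one column by hand using only the Python standard library. It subtracts the mean and divides by the population standard deviation, which is the calculation StandardScaler applies to each column. The square-footage values and the function name standardize are made up for illustration.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(value - mean) / std for value in values]


# Made-up square footage values, for illustration only.
square_footage = [800, 1000, 1200, 1500, 2000]
scaled = standardize(square_footage)

print([round(z, 3) for z in scaled])
print(round(statistics.fmean(scaled), 6))   # mean of the scaled values
print(round(statistics.pstdev(scaled), 6))  # standard deviation of the scaled values
```

After scaling, the column has mean 0 and standard deviation 1 (up to rounding), so features measured in different units become comparable.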

4. Exploratory Data Analysis (EDA)

Exploratory Data Analysis helps uncover patterns and relationships within the
data. This includes:

• Visualizations: Use scatter plots to visualize the relationship between
square footage and price, or box plots to understand price distributions
across different neighborhoods.
• Statistical Analysis: Conduct correlation analysis to identify significant
relationships between features and the target variable.



Example Code for Visualization:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Plot the raw data: after preprocessing, 'neighborhood' is one-hot encoded
# and 'square_footage' is scaled, so reload the original file for EDA
raw = pd.read_csv('housing_data.csv')

# Scatter plot for square footage vs price
sns.scatterplot(data=raw, x='square_footage', y='price')
plt.title('Square Footage vs Price')
plt.xlabel('Square Footage')
plt.ylabel('Price')
plt.show()

# Box plot for price distribution by neighborhood
plt.figure(figsize=(12, 6))
sns.boxplot(data=raw, x='neighborhood', y='price')
plt.xticks(rotation=90)
plt.title('Price Distribution by Neighborhood')
plt.show()
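
The correlation analysis mentioned above is not covered by the plotting code, so here is a small sketch of it. It computes the Pearson correlation coefficient by hand with the standard library; the function name pearson and the five data points are hypothetical, chosen only to show the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up values, for illustration only (price in thousands).
square_footage = [800, 1000, 1200, 1500, 2000]
price = [160, 190, 250, 290, 410]

r = pearson(square_footage, price)
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship, close to -1 a strong negative one, and close to 0 little linear relationship.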

5. Model Selection

Based on the characteristics of the data and the problem at hand, several
machine learning models can be considered:

• Linear Regression: A simple model for predicting a continuous target
variable based on one or more predictor variables.
• Decision Trees: A non-linear model that can capture interactions
between features.
• Random Forest: An ensemble method that combines multiple decision
trees for better performance.
• Gradient Boosting Machines (GBM): Another ensemble method that
builds trees in a sequential manner to improve performance.
• Deep Learning: More complex models like neural networks for larger
datasets.

Example Code for Training a Random Forest Model:

from sklearn.model_selection import train_test_split


from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Split data into features and target


X = data.drop('price', axis=1)
y = data['price']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Train Random Forest model


model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

6. Model Evaluation

Evaluating the model’s performance is crucial to ensure its accuracy and
reliability. Metrics to consider include:

• Mean Absolute Error (MAE): Average of the absolute errors between
predicted and actual prices.
• Mean Squared Error (MSE): Average of the squares of the errors.
• R-squared: Measures how well the model explains the variability of the
target variable.

Example Code for Evaluation Metrics:


# Evaluate model performance
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = model.score(X_test, y_test)

print(f'Mean Absolute Error: {mae:.2f}')


print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
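
To make the three metrics concrete, the following sketch computes them by hand on four made-up price pairs, using only the Python standard library. The function name regression_metrics and the numbers are illustrative; the formulas are the standard definitions of MAE, MSE, and R-squared.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R-squared) for two equally long sequences."""
    n = len(y_true)
    errors = [actual - predicted for actual, predicted in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total variation
    r2 = 1 - ss_res / ss_tot
    return mae, mse, r2


# Made-up actual and predicted prices (in thousands), for illustration only.
y_true = [200.0, 300.0, 400.0, 500.0]
y_pred = [210.0, 290.0, 420.0, 480.0]

mae, mse, r2 = regression_metrics(y_true, y_pred)
print(f'MAE: {mae:.2f}')        # errors are 10, 10, 20, 20, so MAE is 15.00
print(f'MSE: {mse:.2f}')        # (100 + 100 + 400 + 400) / 4 = 250.00
print(f'R-squared: {r2:.2f}')   # 1 - 1000 / 50000 = 0.98
```

MSE penalizes the two larger errors more heavily than MAE does, which is why both are usually reported together.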

7. Insights

Once the model is evaluated, the following insights can be derived:

• Feature Importance: Analyze which features significantly influence
housing prices (e.g., square footage may be the most important
predictor).
• Predictions vs. Actual Prices: Visualize predicted prices against actual
prices to assess the model's performance visually.
• Recommendations: Suggest potential improvements or adjustments
based on the model's findings, such as focusing on specific
neighborhoods for investment.



Conclusion

This case study demonstrates the application of the Data Science process in
predicting housing prices. By collecting and preprocessing data, conducting
exploratory analysis, selecting an appropriate model, and evaluating its
performance, stakeholders can leverage data-driven insights to make informed
decisions in the housing market. Continuous monitoring and refinement of the
model can further enhance its accuracy and utility.
