INTRODUCTION TO DATA SCIENCE
UNIT-5
Data Visualization and Prototype Application Development: Data Visualization
options, Crossfilter, the JavaScript MapReduce library, Creating an interactive
dashboard with dc.js, Dashboard development tools.
Applying the Data Science process for real world problem solving scenarios as a
detailed case study.
  INTRODUCTION
  Data visualization to the end user
        Data visualization is the process of presenting data in a visual format,
         such as charts, graphs, or maps, to make it easier for end users to
         understand and interpret.
        It helps users quickly identify patterns, trends, and insights from the
         data, making complex information more accessible and actionable.
        Common tools for data visualization include bar charts, line graphs, pie
         charts, heatmaps, and dashboards.
        Often, data scientists must deliver their new insights to the end user.
  The results can be communicated in several ways:
        A one-time presentation
        A new viewport on your data
        A real-time dashboard
  Prepared by Mr. K Srikanth, Asst. Professor, IT, VNITSW, Guntur
1. A one-time presentation
      This is typically used for presenting findings from a specific analysis or
       project.
       It involves creating a visual and verbal presentation, often using slides,
       to explain the insights and recommendations clearly.
      Charts, graphs, and infographics are used to make the information
       more understandable and impactful.
2. New Viewport on Your Data
      This involves creating a new way of viewing or interacting with existing
       data.
      It might be a new report, chart, or interactive visualization that provides
       fresh insights or highlights a specific aspect of the data relevant to the
       end user.
      This approach allows users to explore data from different perspectives
       and gain new understanding.
3. Real-time Dashboard
      Dashboards provide dynamic, real-time visualizations of data, allowing
       end users to monitor key metrics and performance indicators as they
       happen. Dashboards are often used for continuous monitoring, offering
       an up-to-date view of trends, progress, or any issues that may need
       immediate action.
      They are highly interactive and customizable, designed to meet the
       specific needs of the user.
When working on data projects, you need to think about a few key factors:
1. Type of Decision:
       Are you supporting a strategic decision or an operational one?
           Strategic decisions are usually made once and may not need frequent
            updates.
           Operational decisions require reports that are updated regularly.
2. Size of the Organization:
     Do you handle everything yourself, from collecting data to creating
     reports?
           In small organizations, you are responsible for everything, from
            collecting data to making reports.
           In larger organizations, there may be a team that creates dashboards
            for you.
           Even in that case, delivering a prototype dashboard yourself can be
            beneficial because it presents an example and often shortens
            delivery time.
Data visualization options
Here are some common data visualization options:
 1. Charts and Graphs: Examples include bar charts, line graphs, pie charts,
     and scatter plots. They help show trends, comparisons, and distributions.
 2. Maps: Geographic maps display data related to locations, such as sales
     per region or population density.
 3. Dashboards: Dashboards combine multiple charts and graphs in one view.
     They give a quick overview of key information and allow for real-time
     monitoring.
 4. Infographics: Visual presentations that combine text, images, and data to
     tell a story or explain complex information.
 5. Heatmaps: Show patterns or intensity of data values using color
     gradients. They are useful for highlighting high and low points in large
     datasets.
 6. Tables: Display raw data in a structured format, making it easy to view
     details and compare values directly.
 Example
Creating a Dashboard for a Hospital Pharmacy
1. Overview
            A new government rule requires pharmacies to check if their medicines
             are sensitive to light and store them in special containers.
            However, the government hasn’t provided a list of which medicines are
             light-sensitive.
2. Data Scientist's Role
             As a data scientist, you can find out which medicines are
              light-sensitive by examining their patient information leaflets
              (the small printed sheets or booklets supplied with each medicine).
             You can use text mining to categorize each medicine as "light
              sensitive" or "not light sensitive."
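The text-mining step can be sketched in a few lines. This is only a minimal illustration, assuming that a leaflet is available as plain text and that a handful of storage phrases is enough to flag a medicine. The phrase list, the function names, and the sample leaflets are hypothetical; a real project would need a vetted phrase list or a trained classifier.

```python
import re

# Hypothetical phrases that suggest a light-related storage warning.
LIGHT_PATTERNS = [
    r"protect(ed)?\s+from\s+(sun)?light",
    r"keep\s+away\s+from\s+(direct\s+)?(sun)?light",
    r"light[-\s]sensitive",
]


def is_light_sensitive(leaflet_text):
    """Return True if any light-related storage phrase appears in the text."""
    text = leaflet_text.lower()
    return any(re.search(pattern, text) for pattern in LIGHT_PATTERNS)


def tag_medicines(leaflets):
    """Map each medicine name to 'light sensitive' or 'not light sensitive'."""
    return {
        name: "light sensitive" if is_light_sensitive(text) else "not light sensitive"
        for name, text in leaflets.items()
    }


# Made-up leaflet excerpts, for illustration only.
sample_leaflets = {
    "Medicine A": "Store below 25 degrees C. Protect from light and moisture.",
    "Medicine B": "Keep out of the reach of children. Store in a dry place.",
}
tags = tag_medicines(sample_leaflets)
print(tags)
```

The resulting tags are what would be uploaded to the central database in the next step.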
3. Database Update
   After tagging the medicines, you upload this information to a central
    database.
4. Stock Analysis
   The pharmacy provides you with access to its stock data to determine how
    many special containers are needed for light-sensitive medicines.
5. Data Format
   The data includes time-series information for one year, with around 10,000
    entries for 29 different medicines.
6. Dashboard Options
   There are many options for creating dashboards, but this chapter will focus
    on using dc.js, a JavaScript library that combines data handling
    (Crossfilter) and data visualization (d3.js).
7. Why dc.js?
       User-Friendly: dc.js is easy to set up and allows you to create interactive
        dashboards where clicking on one graph filters the data shown in other
        graphs.
       Time-Saving: It helps you focus on your analysis instead of spending too
        much time on dashboard creation.
8. Prerequisites:
       You need to use d3.js and crossfilter.js for dc.js to work.
       Although d3.js can be complex, you don’t need to be an expert in it to
        use dc.js.
9. Example Dashboard: You can explore a sample dashboard on the dc.js
    website to see how it works and interact with the graphs.
10. Next Steps: By the end of this chapter, you will be able to create a
    dashboard yourself using the information and tools provided.
Crossfilter, the JavaScript MapReduce library
   Crossfilter is a powerful JavaScript library designed to handle large
    datasets efficiently in web browsers.
   It allows users to filter and aggregate data across multiple dimensions
    simultaneously, making it ideal for interactive data analysis and
    visualization.
   The main purpose of Crossfilter is to enable fast, interactive exploration
    and analysis of large datasets in web applications.
   JavaScript is a high-level, dynamic, and interpreted programming
    language primarily used for adding interactivity and functionality to web
    pages.
   It is one of the core technologies of the World Wide Web, alongside HTML
    (HyperText Markup Language) and CSS (Cascading Style Sheets).
   JavaScript isn’t the best language for heavy data processing.
   MapReduce is a programming model, with associated implementations, for
    processing and generating large datasets in a way that can be
    parallelized across a distributed cluster of computers.
    The MapReduce model allows for the efficient processing of big data by
    breaking down tasks into smaller, manageable pieces.
   However, companies such as Square have created MapReduce libraries for
    JavaScript; Crossfilter is one of them.
   When working with data, every speed improvement is helpful.
   You want to avoid sending large amounts of data over the internet or even
    your internal network, for a few reasons:
    1. Slow Performance: Sending big data files can slow down your
        application.
    2. Network Congestion: Large data transfers can clog your network,
        making it less efficient.
    3. Increased Costs: Transferring large volumes of data can lead to higher
        data usage costs.
    4. Longer Load Times: Users may have to wait longer for data to load,
        which can be frustrating.
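To make the MapReduce idea concrete, here is a small single-process sketch of its three stages: map each record to a key/value pair, group the values by key, and reduce each group to one result. It is written in Python rather than JavaScript and runs on one machine, so it shows only the shape of the computation, not the distribution across a cluster. The records and function names are made up for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical stock records; the field names are illustrative.
records = [
    {"medicine": "Paracetamol", "stock": 200},
    {"medicine": "Ibuprofen", "stock": 150},
    {"medicine": "Paracetamol", "stock": 50},
    {"medicine": "Ibuprofen", "stock": 25},
]


def map_phase(record):
    """Emit one (key, value) pair per record."""
    return record["medicine"], record["stock"]


def shuffle(pairs):
    """Group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Combine all values for one key into a single result."""
    return reduce(lambda total, value: total + value, values, 0)


pairs = map(map_phase, records)
totals = {key: reduce_phase(values) for key, values in shuffle(pairs).items()}
print(totals)  # {'Paracetamol': 250, 'Ibuprofen': 175}
```

Crossfilter performs the same kind of keyed aggregation inside the browser: a dimension chooses the key, and a group with reduceSum plays the role of the reduce step.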
Setting up everything
It’s time to build the actual application, and the ingredients of our small
dc.js application are as follows:
      JQuery—To handle the interactivity
       Crossfilter.js—A MapReduce library and prerequisite to dc.js
      d3.js—A popular data visualization library and prerequisite to dc.js
      dc.js—The visualization library you will use to create your interactive
       dashboard
       Bootstrap—A widely used layout library you’ll use to make it all look
       better
You’ll write only three files:
      index.html—The HTML page that contains your application
       application.js—To hold all the JavaScript code you’ll write
       application.css—For your own CSS
      In addition, you’ll need to run your code on an HTTP server. You could go
       through the effort of setting up a LAMP (Linux, Apache, MySQL, PHP),
       WAMP (Windows, Apache, MySQL, PHP), or XAMPP (cross-platform,
       Apache, MySQL, PHP, Perl) server.
But for the sake of simplicity we won’t set up any of those servers here.
Instead you can do it with a single Python command.
Steps to Launch an HTTP Server with Python 3
1. Create a folder named dashboard
   Create the necessary files (index.html, application.js, application.css)
   in that folder.
2. Open Command-Line Tool:
      For Windows: Open Command Prompt (CMD) by searching for "cmd" in
       the Start menu.
      For Linux or macOS: Open the Terminal application.
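3. Navigate to the Folder
      Change to the dashboard folder, for example:
                 cd C:\Users\SRIKANTH\Desktop\dashboard
      On Windows, you can instead open the folder in File Explorer, type cmd
       in the address bar, and press Enter.
4. Check Python Installation:
                 python --version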
5. Start the HTTP Server:
      Run the following command
                 python -m http.server
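6. Access the Server
      Open a web browser and go to http://localhost:8000.
      If the folder contains an index.html file, the browser shows that page
       directly; otherwise you see a listing of the files in the folder.
      You can also open the page explicitly:
                               http://localhost:8000/index.html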
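The command in step 5 can also be reproduced from a script, which is handy when the folder to serve is not the current directory. The sketch below uses only Python's standard http.server module; the helper names make_server and serve are illustrative, not part of any library.

```python
import functools
import http.server


def make_server(directory, port=8000):
    """Create an HTTP server that serves static files from `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("localhost", port), handler)


def serve(directory, port=8000):
    """Serve `directory` until interrupted with Ctrl+C."""
    with make_server(directory, port) as server:
        print(f"Serving {directory} on http://localhost:{port}")
        try:
            server.serve_forever()
        except KeyboardInterrupt:
            print("Server stopped")
```

Calling serve("dashboard") from the folder that contains the dashboard folder should behave like running python -m http.server inside it.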
index.html
<!DOCTYPE html>
<html lang="en">
<head>
   <meta charset="UTF-8">
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
   <title>Interactive Dashboard</title>
   <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css">
   <link rel="stylesheet" href="application.css">
</head>
<body>
   <div class="container">
      <h1>Medicine Dashboard</h1>
     <div id="chart">
        <div id="bar-chart"></div>
        <div id="pie-chart"></div>
     </div>
   </div>
   <script src="https://code.jquery.com/jquery-3.5.1.min.js"></script>
   <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.16.0/d3.min.js"></script>
   <script src="https://cdnjs.cloudflare.com/ajax/libs/crossfilter2/1.4.0/crossfilter.min.js"></script>
   <script src="https://cdnjs.cloudflare.com/ajax/libs/dc/3.1.1/dc.min.js"></script>
   <script src="application.js"></script>
</body>
</html>
application.js
// Load the data (d3.json returns a promise in d3 v5 and later)
d3.json("data.json").then(function(data) {
  // Create a Crossfilter instance
  var ndx = crossfilter(data);
  // Define dimensions. Each chart gets its own dimension because a
  // Crossfilter group ignores filters set on its own dimension, so two
  // charts sharing one dimension would not filter each other.
  var categoryDim = ndx.dimension(function(d) { return d.category; });
  var pieDim = ndx.dimension(function(d) { return d.category; });
  // Group data: sum "value" per category. A plain group() would only count
  // records, which is 1 for every category in data.json.
  var categoryGroup = categoryDim.group().reduceSum(function(d) { return d.value; });
  var pieGroup = pieDim.group().reduceSum(function(d) { return d.value; });
  // Create a bar chart
  var barChart = dc.barChart("#bar-chart");
  barChart
    .width(400)
    .height(200)
    .dimension(categoryDim)
    .group(categoryGroup)
    .x(d3.scaleBand())
    .xUnits(dc.units.ordinal)
    .renderHorizontalGridLines(true)
    .renderVerticalGridLines(true)
    .elasticY(true);
  // Create a pie chart
  var pieChart = dc.pieChart("#pie-chart");
  pieChart
    .width(200)
    .height(200)
    .dimension(pieDim)
    .group(pieGroup);
  // Render the charts
  dc.renderAll();
});
application.css
body {
  background-color: #f8f9fa;
  font-family: Arial, sans-serif;
}
h1 {
  text-align: center;
  margin-top: 20px;
}
#chart {
  display: flex;
  justify-content: space-around;
  margin-top: 30px;
}
#bar-chart, #pie-chart {
  border: 1px solid #ccc;
  border-radius: 5px;
  background-color: #fff;
  padding: 10px;
}
data.json
[
    {"category":   "Pain Relief", "value": 10},
    {"category":   "Antibiotics", "value": 20},
    {"category":   "Antidepressants", "value": 15},
    {"category":   "Antihistamines", "value": 5},
    {"category":   "Vitamins", "value": 25},
    {"category":   "Others", "value": 30}
]
Creating an interactive dashboard with dc.js
   Creating an interactive dashboard with dc.js involves combining several
    libraries: dc.js itself (for charts), crossfilter.js (for handling the dataset),
    and d3.js (for rendering and manipulating SVG elements).
Prerequisites
       dc.js
       crossfilter.js
       d3.js
       HTML/CSS/JavaScript skills
Components in the Dashboard
From the image, we can see the following charts and components:
Line Chart – for stock tracking over time (likely for a single medicine).
Bar Chart – displaying various medicines and their quantities.
Pie Chart – showing a categorical division (e.g., availability "Yes" or "No").
Reset Filters Button – to reset all applied filters.
Steps to Build a Dashboard like this Using dc.js
1. Setting up the HTML structure
index.html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Medicine Stock Dashboard</title>
   <!-- Include necessary libraries -->
   <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/6.6.2/d3.min.js"></script>
   <script src="https://cdnjs.cloudflare.com/ajax/libs/crossfilter/1.3.12/crossfilter.min.js"></script>
   <script src="https://cdnjs.cloudflare.com/ajax/libs/dc/4.2.7/dc.min.js"></script>
   <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/dc/4.2.7/dc.min.css">
   <!-- Custom styles -->
   <style>
      body {
         font-family: Arial, sans-serif;
         background-color: #f4f4f9;
         margin: 20px;
      }
      .dashboard-container {
         display: flex;
         flex-wrap: wrap;
         justify-content: space-between;
      }
      .chart {
         margin: 20px;
         padding: 20px;
         background-color: #ffffff;
         border: 1px solid #ddd;
         border-radius: 10px;
      }
      .chart h3 {
         text-align: center;
         margin-bottom: 15px;
      }
      #reset-filters {
          margin: 20px;
          padding: 10px 20px;
          background-color: #28a745;
          color: white;
          border: none;
          border-radius: 5px;
          cursor: pointer;
      }
    #reset-filters:hover {
       background-color: #218838;
    }
  </style>
</head>
<body>
   <h1>Medicine Stock Dashboard</h1>
   <div class="dashboard-container">
     <!-- Line Chart (Stock Over Time) -->
     <div id="line-chart" class="chart" style="width: 45%;">
        <h3>Stock Over Time</h3>
     </div>
      <!-- Bar Chart (Medicines and Stock Levels) -->
      <div id="bar-chart" class="chart" style="width: 45%;">
         <h3>Medicine Stock Levels</h3>
      </div>
     <!-- Pie Chart (Yes/No Availability) -->
     <div id="pie-chart" class="chart" style="width: 30%;">
        <h3>Medicine Availability</h3>
     </div>
   </div>
   <!-- Reset Filters Button -->
   <button id="reset-filters">Reset Filters</button>
   <script src="script.js"></script>
</body>
</html>
2. Create Multiple Charts with dc.js in script.js
This file will hold the logic for the charts. Below is the JavaScript to create
different types of charts (line, bar, pie, etc.), based on the components in the
image.
script.js
// Load the CSV data (replace with your actual dataset)
d3.csv('data.csv').then(function(data) {
  // Parse the data: every CSV field arrives as a string
  data.forEach(function(d) {
    d.date = new Date(d.date); // Assuming date is in the format YYYY-MM-DD
    d.stock = +d.stock;        // Convert stock to a number
    // d.medicine and d.available ("Yes" or "No") stay as strings
  });
  // Initialize crossfilter
  var ndx = crossfilter(data);
  // Define dimensions
  var dateDimension = ndx.dimension(function(d) { return d.date; });
  var medicineDimension = ndx.dimension(function(d) { return d.medicine; });
  var availabilityDimension = ndx.dimension(function(d) { return d.available; });
  // Define groups. Keep "return" and its value on one line: a line break
  // directly after "return" makes JavaScript return undefined.
  var stockByDate = dateDimension.group().reduceSum(function(d) { return d.stock; });
  var stockByMedicine = medicineDimension.group().reduceSum(function(d) { return d.stock; });
  var availabilityGroup = availabilityDimension.group();
  // Line Chart (Stock Over Time)
  var lineChart = dc.lineChart('#line-chart');
  lineChart
    .width(450)
    .height(300)
    .dimension(dateDimension)
    .group(stockByDate)
    .x(d3.scaleTime().domain(d3.extent(data, function(d) { return d.date; })))
    .xAxisLabel("Date")
    .yAxisLabel("Stock Level");
  // Bar Chart (Stock Levels by Medicine)
  var barChart = dc.barChart('#bar-chart');
  barChart
    .width(450)
    .height(300)
    .dimension(medicineDimension)
    .group(stockByMedicine)
    .x(d3.scaleBand())
    .xUnits(dc.units.ordinal)
    .xAxisLabel("Medicine")
    .yAxisLabel("Stock Level")
    .barPadding(0.1)
    .outerPadding(0.05);
  // Pie Chart (Availability Yes/No)
  var pieChart = dc.pieChart('#pie-chart');
  pieChart
    .width(300)
    .height(300)
    .radius(150)
    .dimension(availabilityDimension)
    .group(availabilityGroup);
  // Reset Filters Button
  d3.select('#reset-filters').on('click', function() {
    dc.filterAll(); // Clear the filters on every chart
    dc.redrawAll(); // Redraw the charts with the unfiltered data
  });
  // Render all charts once, after they have been configured
  dc.renderAll();
});
3. Data (data.csv)
To simulate the dashboard, you can create a sample dataset with fields for
date, medicine, stock, and availability.
data.csv
date,medicine,stock,available
2023-01-01,Paracetamol,200,Yes
2023-01-02,Ibuprofen,150,No
2023-01-03,Amoxicillin,180,Yes
2023-01-04,Aspirin,100,Yes
2023-01-05,Cetirizine,90,No
2023-01-06,Metformin,250,Yes
2023-01-07,Atorvastatin,300,Yes
2023-01-08,Lisinopril,120,No
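To check what the charts should display for this sample file, the same aggregations can be reproduced outside the browser. The sketch below is an illustration in plain Python, not part of the dashboard: it inlines the rows above, sums stock per medicine (the bar chart), counts rows per availability value (the pie chart), and then mimics a click on the "No" slice by filtering the rows before re-aggregating. The function names are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

# The sample rows from data.csv, inlined so the sketch is self-contained.
CSV_TEXT = """date,medicine,stock,available
2023-01-01,Paracetamol,200,Yes
2023-01-02,Ibuprofen,150,No
2023-01-03,Amoxicillin,180,Yes
2023-01-04,Aspirin,100,Yes
2023-01-05,Cetirizine,90,No
2023-01-06,Metformin,250,Yes
2023-01-07,Atorvastatin,300,Yes
2023-01-08,Lisinopril,120,No
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))


def stock_by_medicine(records):
    """Sum stock per medicine (what the bar chart plots)."""
    totals = defaultdict(int)
    for record in records:
        totals[record["medicine"]] += int(record["stock"])
    return dict(totals)


def availability_counts(records):
    """Count rows per availability value (what the pie chart plots)."""
    return Counter(record["available"] for record in records)


print(availability_counts(rows))  # Counter({'Yes': 5, 'No': 3})

# Simulate clicking the "No" slice: the other charts only see matching rows.
filtered = [record for record in rows if record["available"] == "No"]
print(stock_by_medicine(filtered))
# {'Ibuprofen': 150, 'Cetirizine': 90, 'Lisinopril': 120}
```

This is the behaviour Crossfilter provides in the browser: a filter on one dimension narrows the records that every other chart aggregates.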
Dashboard development tools
   When it comes to dashboard development tools, there are many options
    available depending on your budget, technical expertise, and specific needs.
Here’s an overview of some popular tools for developing interactive
dashboards:
1. Paid Dashboard Tools
These tools are known for their powerful features, ease of use, and support.
They are often used by businesses, but usually require purchasing a license.
       Tableau: Known for its ease of use, powerful visualizations, and
        interactivity. It’s widely used for business intelligence and data analysis.
       Microsoft Power BI: Microsoft’s tool for creating interactive dashboards,
        with seamless integration into the Microsoft ecosystem (Excel, Azure,
        etc.).
       QlikView / Qlik Sense: These are strong tools for data visualization and
        business intelligence, offering drag-and-drop simplicity.
       SAP Analytics Cloud: Provides cloud-based analytics and data
        visualization solutions, tightly integrated with SAP’s enterprise systems.
       MicroStrategy: Known for its powerful data analytics capabilities and
        enterprise-level scalability.
       IBM Cognos Analytics: Another enterprise-level tool for building
        interactive dashboards, reports, and analytics.
       TIBCO Spotfire: Offers a user-friendly interface with powerful analytics,
        and supports a wide range of data sources.
       SAS Visual Analytics: Provides visual data exploration tools, advanced
        analytics, and forecasting.
2. Free and Open-Source Tools
These are excellent options for developers who prefer customizable and cost-
effective solutions.
      Grafana: A popular open-source platform for monitoring and
       observability, allowing integration with various data sources like time-
       series databases.
      Google Data Studio (now Looker Studio): A free tool from Google that
       integrates well with Google’s ecosystem (Google Analytics, Sheets, etc.)
       for building dashboards.
      Metabase: A free, open-source platform designed for easy access to
       business data, great for teams to ask questions and create visual
       dashboards.
      Redash: Open-source tool to create visualizations and dashboards
       directly from SQL databases.
      Kibana: Part of the Elastic Stack, Kibana allows visualization of data
       stored in Elasticsearch, commonly used for logs and analytics.
      Dash by Plotly: An open-source framework for building interactive web-
       based dashboards using Python, R, or Julia.
      Superset (Apache): An open-source data exploration and visualization
       platform originally developed by Airbnb. It allows creating charts, maps,
       and dashboards.
3. JavaScript Libraries for Custom Dashboards
If you prefer to build dashboards from scratch, these JavaScript libraries are
excellent for creating custom, highly interactive dashboards.
      D3.js: A powerful library for creating complex data visualizations in the
       browser. It provides full control over how data is visualized.
      dc.js: Built on top of D3.js, dc.js is designed for building fast, interactive
       dashboards with crossfiltering capabilities.
      Chart.js: A simple, flexible library for charting. Good for basic charts and
       lightweight dashboard solutions.
      Highcharts: A JavaScript charting library used to create interactive
       charts. It’s free for personal use, but requires a license for commercial
       use.
      Google Charts: Free and easy to use, Google Charts provides a variety of
       chart types and works well for basic data visualizations.
4. Embedded Analytics Platforms
For developers who want to embed dashboards within applications:
      Looker: A modern data platform for embedded analytics. Now part of
       Google Cloud, it provides tools to build powerful visualizations and
       insights.
      Sisense: Known for embedding analytics into applications, it provides
       powerful visualization and dashboard creation capabilities.
      Embedded Power BI: Microsoft also offers Power BI for embedding
       dashboards into web applications or portals.
5. Cloud-Based Tools
These platforms allow you to build, share, and access dashboards entirely
online:
      Zoho Analytics: A cloud-based business intelligence tool that allows
       easy drag-and-drop dashboard creation.
      ClicData: A cloud-based platform for building and sharing data
       visualizations, with a focus on easy integration with other services.
         Mode Analytics: A cloud-based platform focused on SQL-based queries
          with built-in visualization tools for dashboards.
Applying the Data Science process for real world problem
solving scenarios as a detailed case study.
Case Study: Predicting Housing Prices in a City
Objective
Overview:
        The housing market in urban areas is often influenced by various factors,
         such as location, size, number of bedrooms, and local amenities.
        Accurately predicting housing prices can assist buyers, sellers, and real
         estate agents in making informed decisions.
 Goal:
    The aim of this case study is to develop a predictive model that estimates
     housing prices based on several features of the properties.
1. Problem Definition
Key Questions:
         What factors most significantly influence housing prices?
         Can we accurately predict housing prices using historical data?
         How can this model help stakeholders in the real estate market?
Scope:
    The scope includes residential properties within a specific urban area over
     the past five years, focusing on factors such as square footage, location (zip
    code), number of bedrooms, number of bathrooms, and additional features
    like a garage or garden.
2. Data Collection
Data Sources:
      Real Estate Websites: APIs from platforms like Zillow, Redfin, or web
       scraping to collect historical housing data.
      Government Databases: Local government databases for demographic
       information, neighborhood crime rates, school district ratings, etc.
      Survey Data: Collect data from surveys to gather information on buyer
       preferences and market sentiment.
Dataset Fields:
      Price: The sale price of the house (target variable).
      Square Footage: Total area of the house.
      Bedrooms/Bathrooms: Number of bedrooms and bathrooms.
      Location: Zip code or neighborhood classification.
      Year Built: Year the house was constructed (used to derive its age).
      Additional Features: Presence of a garden, garage, pool, etc.
3. Data Cleaning and Preprocessing
Data cleaning and preprocessing are crucial for ensuring the quality of the
dataset before analysis. This step includes:
      Handling Missing Values: Assess and impute or remove records with
       missing values.
      Outlier Detection: Identify and manage outliers that could skew results.
      Data Transformation: Convert categorical variables into numerical
       format (e.g., using one-hot encoding for neighborhoods).
      Normalization: Scale numerical features to ensure uniformity, especially
       for models sensitive to feature scales.
Example Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load data
data = pd.read_csv('housing_data.csv')
# Handle missing values: fill numeric columns with their column mean
# (calling mean() on text columns would raise an error)
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
# Outlier handling: drop houses priced at or above the 95th percentile
data = data[data['price'] < data['price'].quantile(0.95)]
# One-hot encoding for categorical features
# (this replaces the 'neighborhood' column with 0/1 indicator columns)
data = pd.get_dummies(data, columns=['neighborhood'], drop_first=True)
# Normalize numerical features to zero mean and unit variance
num_features = ['square_footage', 'bedrooms', 'bathrooms']
scaler = StandardScaler()
data[num_features] = scaler.fit_transform(data[num_features])
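To see what the last two steps compute, here is a small hand-written sketch of standardization and one-hot encoding using only the standard library. It is meant as an illustration of the arithmetic, assuming made-up values; the function names and the sample data are hypothetical, and in practice the pandas and scikit-learn calls above would be used.

```python
from statistics import mean, pstdev

# Made-up values for illustration.
square_footage = [800, 1200, 1500, 2000]
neighborhoods = ["north", "south", "north", "east"]


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


def one_hot(labels, prefix, drop_first=True):
    """Return one 0/1 column per category, optionally dropping the first."""
    categories = sorted(set(labels))
    if drop_first:
        categories = categories[1:]
    return {
        f"{prefix}_{category}": [int(label == category) for label in labels]
        for category in categories
    }


scaled = standardize(square_footage)
encoded = one_hot(neighborhoods, prefix="neighborhood")
print(scaled)
print(encoded)
# {'neighborhood_north': [1, 0, 1, 0], 'neighborhood_south': [0, 1, 0, 0]}
```

Dropping the first category ("east" here) avoids a redundant column: a row with zeros in every remaining column belongs to the dropped category.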
4. Exploratory Data Analysis (EDA)
Exploratory Data Analysis helps uncover patterns and relationships within the
data. This includes:
      Visualizations: Use scatter plots to visualize the relationship between
       square footage and price, or box plots to understand price distributions
       across different neighborhoods.
      Statistical Analysis: Conduct correlation analysis to identify significant
       relationships between features and the target variable.
Example Code for Visualization:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Plot from the raw file: after one-hot encoding, the 'neighborhood' column
# no longer exists in the preprocessed DataFrame
raw = pd.read_csv('housing_data.csv')
# Scatter plot for square footage vs price
sns.scatterplot(data=raw, x='square_footage', y='price')
plt.title('Square Footage vs Price')
plt.xlabel('Square Footage')
plt.ylabel('Price')
plt.show()
# Box plot for price distribution by neighborhood
plt.figure(figsize=(12, 6))
sns.boxplot(data=raw, x='neighborhood', y='price')
plt.xticks(rotation=90)
plt.title('Price Distribution by Neighborhood')
plt.show()
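The correlation analysis mentioned under Statistical Analysis can be sketched as well. The example below computes the Pearson correlation coefficient by hand so that each term of the formula is visible; the sample values are made up, and with a pandas DataFrame the built-in corr() method would normally be used instead.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration.
square_footage = [800, 1200, 1500, 2000, 2400]
bedrooms = [2, 3, 3, 4, 4]
price = [150_000, 210_000, 250_000, 330_000, 390_000]

for name, feature in [("square_footage", square_footage), ("bedrooms", bedrooms)]:
    print(f"{name}: r = {pearson(feature, price):.3f}")
```

Values close to 1 or -1 indicate a strong linear relationship with price; values near 0 indicate little linear relationship.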
5. Model Selection
Based on the characteristics of the data and the problem at hand, several
machine learning models can be considered:
      Linear Regression: A simple model for predicting a continuous target
       variable based on one or more predictor variables.
      Decision Trees: A non-linear model that can capture interactions
       between features.
      Random Forest: An ensemble method that combines multiple decision
       trees for better performance.
      Gradient Boosting Machines (GBM): Another ensemble method that
       builds trees in a sequential manner to improve performance.
      Deep Learning: More complex models like neural networks for larger
       datasets.
Example Code for Training a Random Forest Model:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Split data into features and target
X = data.drop('price', axis=1)
y = data['price']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Train Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
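Before relying on an ensemble, it is often useful to have a simple baseline to compare against. The sketch below fits the first model in the list, a one-feature linear regression, using the ordinary least squares formulas directly. The training values are made up and the function name fit_line is hypothetical; it is not a replacement for the Random Forest code above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: price as a function of square footage.
train_sqft = [800, 1200, 2000, 2400]
train_price = [150_000, 210_000, 330_000, 390_000]

slope, intercept = fit_line(train_sqft, train_price)
predicted = slope * 1500 + intercept
print(f"price = {slope:.1f} * sqft + {intercept:.1f}")
print(f"Predicted price for 1500 sq ft: {predicted:.0f}")
```

If a more complex model cannot beat a baseline like this on the test set, the extra complexity is probably not justified.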
6. Model Evaluation
Evaluating the model’s performance is crucial to ensure its accuracy and
reliability. Metrics to consider include:
      Mean Absolute Error (MAE): Average of the absolute errors between
       predicted and actual prices.
      Mean Squared Error (MSE): Average of the squares of the errors.
      R-squared: Measures how well the model explains the variability of the
       target variable.
Example Code for Evaluation Metrics:
# Evaluate model performance
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = model.score(X_test, y_test)
print(f'Mean Absolute Error: {mae:.2f}')
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
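The three metrics are simple enough to compute by hand, which helps when interpreting the numbers that scikit-learn reports. The sketch below implements the definitions given above on a few made-up prices (in thousands); the function names are illustrative.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up prices in thousands, for illustration only.
actual = [200, 300, 400, 500]
predicted = [210, 290, 420, 480]

print(f"MAE: {mae(actual, predicted):.2f}")              # MAE: 15.00
print(f"MSE: {mse(actual, predicted):.2f}")              # MSE: 250.00
print(f"R-squared: {r_squared(actual, predicted):.2f}")  # R-squared: 0.98
```

MAE is in the same unit as the price, while MSE is in squared units and penalizes large errors more heavily.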
7. Insights
Once the model is evaluated, the following insights can be derived:
      Feature Importance: Analyze which features significantly influence
       housing prices (e.g., square footage may be the most important
       predictor).
      Predictions vs. Actual Prices: Visualize predicted prices against actual
       prices to assess the model's performance visually.
      Recommendations: Suggest potential improvements or adjustments
       based on the model's findings, such as focusing on specific
       neighborhoods for investment.
Conclusion
This case study demonstrates the application of the Data Science process in
predicting housing prices. By collecting and preprocessing data, conducting
exploratory analysis, selecting an appropriate model, and evaluating its
performance, stakeholders can leverage data-driven insights to make informed
decisions in the housing market. Continuous monitoring and refinement of the
model can further enhance its accuracy and utility.