Homework 1 OM690

The document contains homework assignments for a course on data mining, focusing on supervised and unsupervised learning tasks, overfitting, and data normalization. It includes specific problems with examples related to loan approval, customer recommendations, and network security, as well as data normalization calculations for age and income. Additionally, it covers data visualization tasks using Python for appliance shipments and riding mower sales analysis.


Homework 1

OM690
Lauren Miles
6/20/2025
Chapter 2
Problems 2.1 a, b, c: supervised vs unsupervised
2. Assuming that data mining techniques are to be used in the following cases, identify
whether the task required is supervised or unsupervised learning.

a. Deciding whether to issue a loan to an applicant based on demographic and financial
data (with reference to a database of similar data on prior customers).

Supervised because the model learns from labeled data.

b. In an online bookstore, making recommendations to customers concerning additional
items to buy based on the buying patterns in prior transactions.

Unsupervised, because no labels are used; the recommendations are derived from patterns
in user behavior.

c. Identifying a network data packet as dangerous (virus, hacker attack) based on
comparison to other packets whose threat status is known.

Supervised because it uses labeled historical data.

Problem 2.5: overfitting

5. Using the concept of overfitting, explain why when a model is fit to training data, zero
error with those data is not necessarily good.

Overfitting occurs when a model learns not only the underlying patterns in the training data
but also its noise and random fluctuations. A model with zero error on the training set has
most likely memorized the data rather than learned patterns that generalize. It then performs
poorly on new data, because it is too closely tailored to the training data to make accurate
predictions on inputs it has not seen.
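
A small numerical sketch can make this concrete. The data below are made up for illustration (a linear trend with an alternating ±0.3 disturbance standing in for noise), and the polynomial degrees are arbitrary choices; none of this comes from the assignment. A degree-7 polynomial can pass through all eight training points, so its training error is essentially zero, yet it should follow the underlying trend worse than a straight line does.

```python
import numpy as np

# Hypothetical data: a linear trend plus an alternating +/-0.3 disturbance
# that stands in for noise (fixed rather than random, for reproducibility).
x_train = np.linspace(0.0, 1.0, 8)
noise = 0.3 * np.array([1, -1, 1, -1, 1, -1, 1, -1])
y_train = 2.0 * x_train + 1.0 + noise

# New inputs that follow the same underlying trend, without the disturbance.
x_new = np.linspace(0.0, 1.0, 200)
y_new = 2.0 * x_new + 1.0


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


# Simple model: a straight line. Flexible model: degree 7, which has
# enough coefficients to pass through all 8 training points.
simple = np.polyfit(x_train, y_train, deg=1)
flexible = np.polyfit(x_train, y_train, deg=7)

train_simple = mse(simple, x_train, y_train)
new_simple = mse(simple, x_new, y_new)
train_flexible = mse(flexible, x_train, y_train)
new_flexible = mse(flexible, x_new, y_new)

print(f"degree 1: training MSE = {train_simple:.4f}, new-data MSE = {new_simple:.4f}")
print(f"degree 7: training MSE = {train_flexible:.4f}, new-data MSE = {new_flexible:.4f}")
```

The flexible model wins on the training data only because it reproduces the disturbance; the comparison on `x_new` is the one that reflects how the model would behave in use.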
Problem 2.8: Data Normalization
Normalize the data in Table 2.18, showing calculations.
Table 2.18
Age    Income
25     49,000
56     156,000
65     99,000
32     192,000
41     39,000
49     57,000

Min-max normalization is used: normalized value = (value - min) / (max - min).

Age: min = 25, max = 65, range = 40

25    (25-25)/40 = 0.0
32    (32-25)/40 = 0.175
41    (41-25)/40 = 0.4
49    (49-25)/40 = 0.6
56    (56-25)/40 = 0.775
65    (65-25)/40 = 1.0

Income: min = 39,000, max = 192,000, range = 153,000

39,000     (39000-39000)/153000 = 0.0
49,000     (49000-39000)/153000 = 0.065
57,000     (57000-39000)/153000 = 0.118
99,000     (99000-39000)/153000 = 0.392
156,000    (156000-39000)/153000 = 0.765
192,000    (192000-39000)/153000 = 1.0
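
The hand calculations above can be cross-checked in pandas. This is a minimal sketch, assuming min-max scaling is the intended form of normalization; the variable names `df` and `normalized` are illustrative only.

```python
import pandas as pd

# Values from Table 2.18, in the table's original row order
df = pd.DataFrame({
    "Age": [25, 56, 65, 32, 41, 49],
    "Income": [49_000, 156_000, 99_000, 192_000, 39_000, 57_000],
})

# Min-max normalization: (x - min) / (max - min), applied column by column
normalized = (df - df.min()) / (df.max() - df.min())

print(normalized.round(3))
```

Each column ends up on a 0 to 1 scale, so Age and Income become directly comparable despite their very different units.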
Chapter 3
Problem 3.1
1. Shipments of Household Appliances: Line Graphs. The file ApplianceShipments.csv
contains the series of quarterly shipments (in millions of dollars) of US household
appliances between 1985 and 1989.
a. Create a well-formatted time plot of the data using Python.

b. Does there appear to be a quarterly pattern? For a closer view of the
patterns, zoom in to the range of 3500–5000 on the y-axis.

Yes, Q2 and Q3 are higher, whereas Q1 and Q4 are lower.

c. Using Python, create one chart with four separate lines, one line for each
of Q1, Q2, Q3, and Q4. In Python, this can be achieved by adding columns for
quarter and year. Then group the data frame by quarter and plot
shipment versus year for each quarter as a separate series on a line graph.
Zoom in to the range of 3500–5000 on the y-axis. Does there appear to be a
difference between quarters?

Just like in part b, Q2 and Q3 are higher, whereas Q1 and Q4 are lower.
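
The reshaping step described in part c can be sketched on its own, without the data file. The shipment values below are placeholders invented for illustration, not the real ApplianceShipments.csv figures, and the sketch assumes quarter labels of the form "Q1-1985". It uses `pivot` rather than an explicit group-by, which yields the same year-by-quarter layout when each year and quarter pair occurs once.

```python
import pandas as pd

# Placeholder values for illustration only; these are NOT the real
# ApplianceShipments.csv figures.
df = pd.DataFrame({
    "Quarter": ["Q1-1985", "Q2-1985", "Q3-1985", "Q4-1985",
                "Q1-1986", "Q2-1986", "Q3-1986", "Q4-1986"],
    "Shipments": [4000, 4300, 4400, 4100, 4050, 4350, 4450, 4150],
})

# Split the label into a quarter column and a year column
df["Q"] = df["Quarter"].str[:2]
df["Year"] = df["Quarter"].str[-4:].astype(int)

# One row per year, one column per quarter
by_quarter = df.pivot(index="Year", columns="Q", values="Shipments")
print(by_quarter)

# by_quarter.plot(marker="o") would then draw one line per quarter
```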

d. Using Python, create a line graph of the series at a yearly aggregated level
(i.e., the total shipments in each year).

Problem 3.2
2. Sales of Riding Mowers: Scatter Plots. A company that manufactures riding mowers
wants to identify the best sales prospects for an intensive sales campaign. In
particular, the manufacturer is interested in classifying households as prospective
owners or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000 ft²). The
marketing expert looked at a random sample of 24 households, given in the file
RidingMowers.csv.
a. Using Python, create a scatter plot of Lot Size vs. Income,
color-coded by the outcome variable owner/nonowner.
Make sure to obtain a well-formatted plot (create legible
labels and a legend, etc.).
In [6]: import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("ApplianceShipments.csv")

# Extract Year and Quarter (assumes labels that start with the quarter,
# e.g. "Q1", and end with the four-digit year)
df['Year'] = df['Quarter'].str[-4:].astype(int)
df['Q'] = df['Quarter'].str[:2]

# Map each quarter to its first month for proper datetime formatting
quarter_mapping = {'Q1': 1, 'Q2': 4, 'Q3': 7, 'Q4': 10}
df['Month'] = df['Q'].map(quarter_mapping)

# Create a datetime column (first day of each quarter) and sort
# chronologically in case the file is not already in date order
df['Date'] = pd.to_datetime(df[['Year', 'Month']].assign(DAY=1))
df = df.sort_values('Date')

# Part a: time plot of the full series
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Shipments'], marker='o')
plt.title("Quarterly Appliance Shipments (1985–1989)")
plt.xlabel("Date")
plt.ylabel("Shipments (millions of dollars)")
plt.grid(True)
plt.tight_layout()
plt.show()

In [7]: plt.figure(figsize=(10, 6))
# Part b: same time plot, zoomed in to 3500–5000 on the y-axis
plt.plot(df['Date'], df['Shipments'], marker='o')
plt.title("Quarterly Appliance Shipments (Zoomed 3500–5000)")
plt.xlabel("Date")
plt.ylabel("Shipments (millions of dollars)")
plt.ylim(3500, 5000)
plt.grid(True)
plt.tight_layout()
plt.show()

In [8]: # Part c: group data so that each quarter becomes its own column
grouped = df.groupby(['Year', 'Q'])['Shipments'].sum().unstack()

# Plot each quarter as a separate line against year
grouped.plot(marker='o', figsize=(10, 6))
plt.title("Quarterly Shipments by Quarter")
plt.xlabel("Year")
plt.ylabel("Shipments (millions of dollars)")
plt.xticks(grouped.index)  # whole-year tick labels only
plt.ylim(3500, 5000)
plt.grid(True)
plt.legend(title="Quarter")
plt.tight_layout()
plt.show()

In [9]: # Part d: aggregate by year
yearly = df.groupby('Year')['Shipments'].sum()

# Plot yearly totals
plt.figure(figsize=(8, 5))
yearly.plot(kind='line', marker='o')
plt.title("Total Appliance Shipments Per Year")
plt.xlabel("Year")
plt.ylabel("Total Shipments (millions of dollars)")
plt.xticks(yearly.index)  # whole-year tick labels only
plt.grid(True)
plt.tight_layout()
plt.show()

In [1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Excel file and read the correct sheet
df = pd.read_excel("RidingMowers.xlsx", sheet_name='Data')

# Create a scatter plot, color-coded by ownership
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Lot_Size', y='Income', hue='Ownership', palette='Set1', s=100)
plt.title("Lot Size vs. Income by Ownership")
plt.xlabel("Lot Size (in 1000 ft²)")
plt.ylabel("Income (in $1000s)")
plt.legend(title="Ownership")
plt.grid(True)
plt.tight_layout()
plt.show()
