Homework 1
OM690
Lauren Miles
6/20/2025
Chapter 2
Problems 2.1 a, b, c: supervised vs unsupervised
2. Assuming that data mining techniques are to be used in the following cases, identify
whether the task required is supervised or unsupervised learning.
a. Deciding whether to issue a loan to an applicant based on demographic and
financial data (with reference to a database of similar data on prior customers).
Supervised because the model learns from labeled data.
b. In an online bookstore, making recommendations to customers concerning
additional items to buy based on the buying patterns in prior transactions.
Unsupervised, because no labels are used; the recommendations are derived from
patterns in customers' buying behavior.
c. Identifying a network data packet as dangerous (virus, hacker attack) based on
comparison to other packets whose threat status is known.
Supervised because it uses labeled historical data.
Problem 2.5: overfitting
5. Using the concept of overfitting, explain why when a model is fit to training data, zero
error with those data is not necessarily good.
Overfitting occurs when a model learns not only the underlying patterns in the training
data but also the noise and random fluctuations. If the model has zero error on the
training set, it has most likely memorized the data rather than generalized from it. This
causes poor performance on new data, because the model is too closely tailored to the
training data and fails to make accurate predictions on new inputs.
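The point above can be illustrated with a small sketch. It is not part of the textbook problem: the data are synthetic (a made-up linear relationship plus random noise), and the "memorizer" is a simple nearest-neighbor lookup chosen only because it reproduces every training value exactly.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

def make_data(n):
    # Synthetic data (assumption for illustration): y = 2x plus random noise
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + random.gauss(0, 3.0) for x in xs]
    return xs, ys

x_train, y_train = make_data(30)
x_test, y_test = make_data(30)

def predict_memorizer(x):
    # Return the y of the closest training x, so training points map to themselves
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]

def mse(xs, ys, predict):
    # Mean squared error of a prediction function on a data set
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_err = mse(x_train, y_train, predict_memorizer)
test_err = mse(x_test, y_test, predict_memorizer)

print("Training MSE:", train_err)
print("Test MSE:", test_err)
```

The training error is zero because the model has stored the noise along with the pattern, while the error on the new sample is not zero, which is the behavior described in the answer.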
Problem 2.8: Data Normalization
Normalize the data in Table 2.18, showing calculations.
Table 2.18
Age Income
25 49,000
56 156,000
65 99,000
32 192,000
41 39,000
49 57,000
Min-max normalization: normalized value = (x - min) / (max - min)
Age: Min = 25, Max = 65 (range = 40)
25 (25-25)/(65-25) = 0.0
32 (32-25)/40 = 0.175
41 (41-25)/40 = 0.4
49 (49-25)/40 = 0.6
56 (56-25)/40 = 0.775
65 (65-25)/40 = 1.0
Income: Min = 39,000, Max = 192,000 (range = 153,000)
39,000 (39000-39000)/153000 = 0.0
49,000 (49000-39000)/153000 = 0.065
57,000 (57000-39000)/153000 = 0.118
99,000 (99000-39000)/153000 = 0.392
156,000 (156000-39000)/153000 = 0.765
192,000 (192000-39000)/153000 = 1.0
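The hand calculations above can be checked with a short script. This is a minimal sketch of min-max scaling applied to the values in Table 2.18; the function name min_max_normalize is my own and not from the textbook.

```python
def min_max_normalize(values):
    # Rescale values to the 0-1 range: (x - min) / (max - min)
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Values from Table 2.18, in the original row order
ages = [25, 56, 65, 32, 41, 49]
incomes = [49000, 156000, 99000, 192000, 39000, 57000]

age_norm = min_max_normalize(ages)
income_norm = min_max_normalize(incomes)

# Print each row with its normalized values, rounded to three decimals
for age, income, a, i in zip(ages, incomes, age_norm, income_norm):
    print(age, income, round(a, 3), round(i, 3))
```

Rounded to three decimals, the script's results agree with the values worked out by hand.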
Chapter 3
Problem 3.1
1. Shipments of Household Appliances: Line Graphs. The file ApplianceShipments.csv
contains the series of quarterly shipments (in millions of dollars) of US household
appliances between 1985 and 1989.
a. Create a well-formatted time plot of the data using Python.
b. Does there appear to be a quarterly pattern? For a closer view of the
patterns, zoom in to the range of 3500–5000 on the y-axis.
Yes. Shipments are higher in Q2 and Q3, whereas Q1 and Q4 are lower.
c. Using Python, create one chart with four separate lines, one line for each
of Q1, Q2, Q3, and Q4. In Python, this can be achieved by adding columns for
quarter and year. Then group the data frame by quarter and then plot
shipment versus year for each quarter as a separate series on a line graph.
Zoom in to the range of 3500–5000 on the y-axis. Does there appear to be a
difference between quarters?
Just as in part b, Q2 and Q3 are higher, whereas Q1 and Q4 are
lower.
d. Using Python, create a line graph of the series at a yearly aggregated level
(i.e., the total shipments in each year).
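The answers to parts b and c are read off the plots. One way to support them with numbers is to average shipments by quarter. The sketch below assumes the column names 'Quarter' (formatted like 'Q1-1985') and 'Shipments' used in my plotting code. The rows in demo are made-up placeholder values, not the real data, and the helper name quarterly_means is my own.

```python
import pandas as pd

def quarterly_means(df):
    # Take the quarter label (first two characters, e.g. 'Q1') and average shipments
    quarter_label = df["Quarter"].str[:2]
    return df.groupby(quarter_label)["Shipments"].mean()

# Placeholder rows for illustration only (NOT the ApplianceShipments.csv values)
demo = pd.DataFrame({
    "Quarter": ["Q1-2000", "Q2-2000", "Q3-2000", "Q4-2000",
                "Q1-2001", "Q2-2001", "Q3-2001", "Q4-2001"],
    "Shipments": [100, 120, 118, 95, 104, 126, 122, 99],
})

means = quarterly_means(demo)
print(means)
```

To use it on the real data, pass the data frame loaded from ApplianceShipments.csv instead of demo. If the Q2 and Q3 means come out above the Q1 and Q4 means, that supports the pattern seen in the plots.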
Problem 3.2
2. Sales of Riding Mowers: Scatter Plots. A company that manufactures riding mowers
wants to identify the best sales prospects for an intensive sales campaign. In
particular, the manufacturer is interested in classifying households as prospective
owners or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000 ft²). The
marketing expert looked at a random sample of 24 households, given in the file
RidingMowers.csv.
a. Using Python, create a scatter plot of Lot Size vs. Income,
color-coded by the outcome variable owner/nonowner.
Make sure to obtain a well-formatted plot (create legible
labels and a legend, etc.).
# Problem 3.1a: time plot of quarterly shipments
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv("ApplianceShipments.csv")
# Extract Year and Quarter
df['Year'] = df['Quarter'].str[-4:].astype(int)
df['Q'] = df['Quarter'].str[:2]
# Map Quarters to Months for proper datetime formatting
quarter_mapping = {'Q1': 1, 'Q2': 4, 'Q3': 7, 'Q4': 10}
df['Month'] = df['Q'].map(quarter_mapping)
# Create a datetime column
df['Date'] = pd.to_datetime(df[['Year', 'Month']].assign(DAY=1))
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Shipments'], marker='o')
plt.title("Quarterly Appliance Shipments (1985–1989)")
plt.xlabel("Date")
plt.ylabel("Shipments (in millions of dollars)")
plt.grid(True)
plt.tight_layout()
plt.show()
# Problem 3.1b: same time plot, zoomed in to 3500–5000 on the y-axis
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Shipments'], marker='o')
plt.title("Quarterly Appliance Shipments (Zoomed 3500–5000)")
plt.xlabel("Date")
plt.ylabel("Shipments (in millions of dollars)")
plt.ylim(3500, 5000)
plt.grid(True)
plt.tight_layout()
plt.show()
# Problem 3.1c: one line per quarter, shipments vs. year
# Group data so each quarter becomes its own column (one line per column)
grouped = df.groupby(['Year', 'Q'])['Shipments'].sum().unstack()
# Plot each quarter separately
grouped.plot(marker='o', figsize=(10, 6))
plt.title("Appliance Shipments by Quarter and Year")
plt.xlabel("Year")
plt.ylabel("Shipments (in millions of dollars)")
plt.ylim(3500, 5000)
plt.grid(True)
plt.legend(title="Quarter")
plt.tight_layout()
plt.show()
# Problem 3.1d: yearly aggregated shipments
# Aggregate by year
yearly = df.groupby('Year')['Shipments'].sum()
# Plot yearly totals
plt.figure(figsize=(8, 5))
yearly.plot(kind='line', marker='o')
plt.title("Total Appliance Shipments Per Year")
plt.xlabel("Year")
plt.ylabel("Total Shipments (in millions of dollars)")
plt.grid(True)
plt.tight_layout()
plt.show()
# Problem 3.2a: scatter plot of Lot Size vs. Income, colored by ownership
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Excel file and read the correct sheet
df = pd.read_excel("RidingMowers.xlsx", sheet_name='Data')
# Create a scatter plot
plt.figure(figsize=(8, 6))
# "Lot Size vs. Income": Lot Size on the y-axis, Income on the x-axis
sns.scatterplot(data=df, x='Income', y='Lot_Size', hue='Ownership', palette='Set1', s=100)
plt.title("Lot Size vs. Income by Ownership")
plt.xlabel("Income (in $1000s)")
plt.ylabel("Lot Size (in 1000 ft²)")
plt.legend(title="Ownership")
plt.grid(True)
plt.tight_layout()
plt.show()