
Homework 1 OM690

The document contains homework assignments for a course on data mining, focusing on supervised and unsupervised learning tasks, overfitting, and data normalization. It includes specific problems with examples related to loan approval, customer recommendations, and network security, as well as data normalization calculations for age and income. Additionally, it covers data visualization tasks using Python for appliance shipments and riding mower sales analysis.

Homework 1

OM690
Lauren Miles
6/20/2025
Chapter 2
Problems 2.1 a, b, c: supervised vs unsupervised
2. Assuming that data mining techniques are to be used in the following cases, identify
whether the task required is supervised or unsupervised learning.

a. Deciding whether to issue a loan to an applicant based on demographic and
financial data (with reference to a database of similar data on prior customers).

Supervised, because the model learns from labeled data: the database of prior
customers supplies records whose outcome is already known.

b. In an online bookstore, making recommendations to customers concerning
additional items to buy based on the buying patterns in prior transactions.

Unsupervised, because no labeled outcome is used; the recommendations are derived
from patterns in customers' buying behavior.

c. Identifying a network data packet as dangerous (virus, hacker attack) based on
comparison to other packets whose threat status is known.

Supervised, because it uses labeled historical data: packets whose threat status
is already known.
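
To make the distinction in part b concrete, here is a minimal sketch of a recommendation step that uses no labels at all: it only counts which items appear together in past baskets. The basket contents and the function names `build_cooccurrence` and `recommend` are hypothetical, chosen for illustration; a real recommender would more likely use association rules or collaborative filtering.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(transactions):
    """Count, for each item, how often every other item shares a basket with it."""
    counts = {}
    for basket in transactions:
        # Ordered pairs (item, other) of distinct items in the same basket.
        for item, other in permutations(set(basket), 2):
            counts.setdefault(item, Counter())[other] += 1
    return counts


def recommend(item, counts, top_n=2):
    """Return the items most often bought together with `item`."""
    ranked = sorted(counts.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical baskets; note that no row carries an outcome label.
transactions = [
    ["python", "pandas"],
    ["python", "statistics"],
    ["python", "pandas", "statistics"],
    ["cooking"],
]
counts = build_cooccurrence(transactions)
print(recommend("pandas", counts))  # ['python', 'statistics']
```

Parts a and c differ because each historical record there comes with a known outcome (loan result, threat status) that a model can be trained to predict.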

Problem 2.5: overfitting


5. Using the concept of overfitting, explain why, when a model is fit to training data, zero
error with those data is not necessarily good.

Overfitting occurs when a model learns not only the underlying patterns in the training
data but also its noise and random fluctuations. A model with zero error on the training
set has likely memorized the data rather than generalized from it. It then performs poorly
on new data, because it is too closely tailored to the training data to make accurate
predictions on new inputs.
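
The effect can be illustrated with a small sketch. The data below are hypothetical: the true relationship is assumed to be y = 2x, the training values carry noise of +1 or -1, and the test points lie on the true line. A "memorizer" (1-nearest-neighbor lookup) is compared with a least-squares line; both are stand-ins chosen for illustration, not models from the assignment.

```python
def mse(model, xs, ys):
    """Mean squared error of `model` on the points (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_memorizer(xs, ys):
    """1-nearest-neighbor lookup: reproduces every training point exactly."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


# Hypothetical data: y = 2x plus alternating noise of +1 / -1 in training.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]

memorizer = fit_memorizer(train_x, train_y)
line = fit_line(train_x, train_y)

for name, model in [("memorizer", memorizer), ("line", line)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

On these numbers the memorizer has zero training error but a test error of about 1.32, while the line has a nonzero training error (about 0.91) and a much smaller test error (about 0.06).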
Problem 2.8: Data Normalization
Normalize the data in Table 2.18, showing calculations.
Table 2.18
Age Income
25 49,000
56 156,000
65 99,000
32 192,000
41 39,000
49 57,000

Age: Min = 25, Max = 65, Range = 40
Min-max normalization: (value - min) / (max - min)

25 (25-25)/(65-25) = 0.0
32 (32-25)/40 = 0.175
41 (41-25)/40 = 0.4
49 (49-25)/40 = 0.6
56 (56-25)/40 = 0.775
65 (65-25)/40 = 1.0

Income: Min = 39,000, Max = 192,000, Range = 153,000 (results rounded to three decimals)

39,000 (39000-39000)/153000 = 0.0
49,000 (49000-39000)/153000 = 0.065
57,000 (57000-39000)/153000 = 0.118
99,000 (99000-39000)/153000 = 0.392
156,000 (156000-39000)/153000 = 0.765
192,000 (192000-39000)/153000 = 1.0
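
The hand calculations above use min-max scaling, and they can be checked with a short sketch. The function name `min_max_normalize` is an illustrative choice, not something from the course materials; the values are those of Table 2.18, in table order.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (value - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]


# Table 2.18, in the order the rows appear in the table.
ages = [25, 56, 65, 32, 41, 49]
incomes = [49_000, 156_000, 99_000, 192_000, 39_000, 57_000]

norm_ages = [round(v, 3) for v in min_max_normalize(ages)]
norm_incomes = [round(v, 3) for v in min_max_normalize(incomes)]

print(norm_ages)     # [0.0, 0.775, 1.0, 0.175, 0.4, 0.6]
print(norm_incomes)  # [0.065, 0.765, 0.392, 1.0, 0.0, 0.118]
```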
Chapter 3
Problem 3.1
1. Shipments of Household Appliances: Line Graphs. The file ApplianceShipments.csv
contains the series of quarterly shipments (in millions of dollars) of US household
appliances between 1985 and 1989.
a. Create a well-formatted time plot of the data using Python.

b. Does there appear to be a quarterly pattern? For a closer view of the
patterns, zoom in to the range of 3500–5000 on the y-axis.

Yes. Shipments in Q2 and Q3 are higher, whereas Q1 and Q4 are lower.

c. Using Python, create one chart with four separate lines, one line for each
of Q1, Q2, Q3, and Q4. In Python, this can be achieved by adding columns for
quarter and year. Then group the data frame by quarter and plot
shipment versus year for each quarter as a separate series on a line graph.
Zoom in to the range of 3500–5000 on the y-axis. Does there appear to be a
difference between quarters?

Yes. As in part b, Q2 and Q3 are higher, whereas Q1 and Q4 are lower.

d. Using Python, create a line graph of the series at a yearly aggregated level
(i.e., the total shipments in each year).
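
The quarterly comparison in parts b and c can also be checked numerically by averaging shipments within each quarter. The sketch below assumes quarter labels of the form "Q1-1985" (quarter in the first two characters), and the numbers are placeholders for illustration only, not values from ApplianceShipments.csv. The helper name `mean_by_quarter` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_by_quarter(rows):
    """Average shipments per quarter from (label, shipments) pairs.

    Labels are assumed to start with the quarter, e.g. 'Q1-1985'.
    """
    by_quarter = defaultdict(list)
    for label, shipments in rows:
        by_quarter[label[:2]].append(shipments)
    return {quarter: mean(vals) for quarter, vals in sorted(by_quarter.items())}


# Placeholder numbers for illustration only (not the real shipment data).
rows = [
    ("Q1-1985", 100), ("Q2-1985", 130), ("Q3-1985", 120), ("Q4-1985", 90),
    ("Q1-1986", 110), ("Q2-1986", 150), ("Q3-1986", 140), ("Q4-1986", 100),
]
quarter_means = mean_by_quarter(rows)
print(quarter_means)
```

With a pandas data frame that already has `Q` and `Shipments` columns, `df.groupby('Q')['Shipments'].mean()` gives the same kind of summary.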

Problem 3.2
2. Sales of Riding Mowers: Scatter Plots. A company that manufactures riding mowers
wants to identify the best sales prospects for an intensive sales campaign. In
particular, the manufacturer is interested in classifying households as prospective
owners or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000 ft²). The
marketing expert looked at a random sample of 24 households, given in the file
RidingMowers.csv.
a. Using Python, create a scatter plot of Lot Size vs. Income,
color-coded by the outcome variable owner/nonowner.
Make sure to obtain a well-formatted plot (create legible
labels and a legend, etc.).
In [6]: import pandas as pd
import matplotlib.pyplot as plt

# Problem 3.1(a): time plot of quarterly shipments

# Load the dataset
df = pd.read_csv("ApplianceShipments.csv")

# Extract year and quarter from the label
# (last four characters are the year, first two the quarter)
df['Year'] = df['Quarter'].str[-4:].astype(int)
df['Q'] = df['Quarter'].str[:2]

# Map each quarter to its first month so a date can be built
quarter_mapping = {'Q1': 1, 'Q2': 4, 'Q3': 7, 'Q4': 10}
df['Month'] = df['Q'].map(quarter_mapping)

# Create a datetime column (first day of each quarter) and sort chronologically
df['Date'] = pd.to_datetime(df[['Year', 'Month']].assign(DAY=1))
df = df.sort_values('Date')

plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Shipments'], marker='o')
plt.title("Quarterly Appliance Shipments (1985–1989)")
plt.xlabel("Date")
plt.ylabel("Shipments (millions of dollars)")
plt.grid(True)
plt.tight_layout()
plt.show()

In [7]: # Problem 3.1(b): same time plot, zoomed to 3500–5000 on the y-axis
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Shipments'], marker='o')
plt.title("Quarterly Appliance Shipments (Zoomed 3500–5000)")
plt.xlabel("Date")
plt.ylabel("Shipments (millions of dollars)")
plt.ylim(3500, 5000)
plt.grid(True)
plt.tight_layout()
plt.show()

In [8]: # Problem 3.1(c): one line per quarter, shipments versus year

# Rows are years, columns are quarters (Q1..Q4)
grouped = df.groupby(['Year', 'Q'])['Shipments'].sum().unstack()

# Plot each quarter as a separate series
grouped.plot(marker='o', figsize=(10, 6))
plt.title("Quarterly Shipments by Quarter")
plt.xlabel("Year")
plt.ylabel("Shipments (millions of dollars)")
plt.xticks(grouped.index)  # show whole years only on the x-axis
plt.ylim(3500, 5000)
plt.grid(True)
plt.legend(title="Quarter")
plt.tight_layout()
plt.show()

In [9]: # Problem 3.1(d): total shipments per year

# Aggregate by year
yearly = df.groupby('Year')['Shipments'].sum()

# Plot yearly totals
plt.figure(figsize=(8, 5))
yearly.plot(kind='line', marker='o')
plt.title("Total Appliance Shipments Per Year")
plt.xlabel("Year")
plt.ylabel("Total Shipments (millions of dollars)")
plt.xticks(yearly.index)  # show whole years only on the x-axis
plt.grid(True)
plt.tight_layout()
plt.show()

In [1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Problem 3.2(a): scatter plot of lot size and income, colored by ownership

# Load the data from the 'Data' sheet of the Excel workbook
# (the problem statement names RidingMowers.csv; this notebook reads an .xlsx copy)
df = pd.read_excel("RidingMowers.xlsx", sheet_name='Data')

# Create the scatter plot, one color per ownership class
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Lot_Size', y='Income', hue='Ownership', palette='Set1', s=100)
plt.title("Lot Size vs. Income by Ownership")
plt.xlabel("Lot Size (in 1000 ft²)")
plt.ylabel("Income (in $1000s)")
plt.legend(title="Ownership")
plt.grid(True)
plt.tight_layout()
plt.show()
