---
title: '7.19 Problem Set'
author: 'Trinh Le'
---
1. Save the “tips” dataset as a tibble to a variable. Read the documentation for “tips.”
```{r}
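library(tidyverse)  # attaches ggplot2, dplyr, and tibble, so ggplot2 need not be loaded separately
data("tips", package = "reshape2")  # requires the reshape2 package to be installed
tips <- as_tibble(tips)
head(tips)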
```
a) In your own words, describe the dataset.
The "tips" dataset records the tips one waiter received over a few months of working in a
restaurant. Each of its 244 rows is one bill and includes how much the party paid for the
meal, the tip amount, and other factors that might influence tipping, such as the day of
the week and the number of people dining.
b) In your own words, describe each variable, including whether it is numeric or
categorical and whether you believe there is a plausible association with “tip.”

total_bill (Numeric): The total amount of the bill, in dollars. A higher bill plausibly
leads to a higher tip, since tips are usually figured as a share of the bill.

tip (Numeric): The amount of money, in dollars, given as a tip. This is the outcome we are
trying to understand and predict.

sex (Categorical): The sex of the person who paid the bill (Female or Male). It might
influence tipping habits.

smoker (Categorical): Whether there were smokers in the party (Yes or No). It may be
associated with tipping behavior.

day (Categorical): The day of the week of the visit, coded as Thur, Fri, Sat, or Sun.
Tipping might vary by day.

time (Categorical): Whether the meal was Lunch or Dinner. Tips may differ between the two
because dinner bills tend to be larger.

size (Numeric): The number of people in the dining party. Bigger parties usually run up
higher bills and therefore leave higher tips.
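
The variable descriptions above can be checked against the data. The chunk below is a minimal sketch in base R: it lists the storage class of each column, then looks at the correlation of each numeric predictor with tip and the mean tip within each level of each categorical variable. The helper name `categorical_vars` is introduced only for this illustration.

```{r}
# Reload the data so this chunk runs on its own
data("tips", package = "reshape2")

# Storage class of each column: numeric/integer columns are quantitative,
# factor columns are categorical
sapply(tips, class)

# Pearson correlation of tip with each numeric predictor
cor(tips$tip, tips[, c("total_bill", "size")])

# Mean tip within each level of every categorical variable
categorical_vars <- c("sex", "smoker", "day", "time")  # illustrative helper
lapply(tips[categorical_vars], function(g) tapply(tips$tip, g, mean))
```

Differences in group means are only descriptive; they do not show on their own that a variable affects tipping.
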
2. Run a bivariate regression using “tip” as the outcome variable and “total bill” as the
predictor variable.
a) Write out the regression equation (you can use “b” instead of “beta,” for example:
“salary = b0 + b1*yrs.since.phd…”).
```{r}
tips_regression <- lm(tip ~ total_bill, data = tips)
summary(tips_regression)
```
tip = b0 + b1 * total_bill
Where b0 is the intercept and b1 is the coefficient for total_bill.
b) Write an interpretation of the results.
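Intercept (b0): This is the tip the model predicts when the total bill is $0, about $0.92.
No bill in the data is close to $0, so this value anchors the line rather than describing
a realistic situation.
Coefficient (b1): This is the expected change in tip for every extra dollar added to the
total bill. The estimate is about 0.105, so the tip rises by roughly 10.5 cents per
additional dollar of the bill.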
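
To make the interpretation concrete, the sketch below pulls b0 and b1 out of the fitted model and applies the equation by hand. The $20 bill is a hypothetical example value, and the model is refit here only so the chunk runs on its own.

```{r}
# Refit the bivariate model so this chunk is self-contained
data("tips", package = "reshape2")
tips_regression <- lm(tip ~ total_bill, data = tips)

# Named coefficient vector: "(Intercept)" is b0, "total_bill" is b1
b <- coef(tips_regression)
b0 <- b[["(Intercept)"]]
b1 <- b[["total_bill"]]

# Hypothetical bill of $20: apply tip = b0 + b1 * total_bill by hand
example_bill <- 20
by_hand <- b0 + b1 * example_bill

# The same prediction via predict(), which should agree with the hand calculation
by_predict <- unname(predict(tips_regression,
                             newdata = data.frame(total_bill = example_bill)))
c(by_hand = by_hand, by_predict = by_predict)

# 95% confidence intervals for b0 and b1
confint(tips_regression)
```
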
c) Is the effect of “total bill” statistically significant? What percentage of the
variation in “tip” is explained by this model?
P-value: The p-value for total_bill is far below 0.05 (R reports it as < 2e-16), so the
effect of total bill is statistically significant. Significance means the association is
very unlikely to be due to chance alone; it does not by itself measure how strong the
association is.
R-squared: This is the proportion of the variation in tip explained by total bill. Here it
is about 0.457, so the model explains roughly 46% of the variation in tips.
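
Both quantities can be read programmatically rather than copied from the printed summary. This is a minimal sketch; the names `model_summary`, `p_value`, `is_significant`, and `r_squared_pct` are introduced here for illustration.

```{r}
# Refit the bivariate model so this chunk is self-contained
data("tips", package = "reshape2")
tips_regression <- lm(tip ~ total_bill, data = tips)
model_summary <- summary(tips_regression)

# The coefficient table has columns Estimate, Std. Error, t value, Pr(>|t|)
p_value <- model_summary$coefficients["total_bill", "Pr(>|t|)"]
is_significant <- p_value < 0.05  # significance at the conventional 5% level

# R-squared as a percentage of the variation in tip explained by the model
r_squared_pct <- 100 * model_summary$r.squared

p_value
is_significant
r_squared_pct
```

With a single predictor, R-squared equals the squared Pearson correlation between tip and total_bill, which gives a quick sanity check.
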
d) Create a scatterplot of the data with a regression line added. Do not use
geom_smooth(); instead, use the actual results of your regression. (Hint: You will need to
use predict() to add a new column to your dataset.)
```{r}
# Add the fitted values from the regression as a new column
tips <- tips %>%
  mutate(predicted_tip = predict(tips_regression))

# Scatterplot of the raw data, with the fitted values drawn as a line
ggplot(tips, aes(x = total_bill, y = tip)) +
  geom_point() +
  geom_line(aes(y = predicted_tip), color = "blue") +
  labs(title = "Scatterplot of Total Bill vs Tip with Regression Line",
       x = "Total Bill ($)",
       y = "Tip ($)")
```
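
As a cross-check on the plot above, the same line can be drawn directly from the estimated coefficients with `geom_abline()` instead of a predicted column. This is an alternative sketch, not a replacement for the `predict()` approach the question asks for; the object name `p` is illustrative.

```{r}
library(ggplot2)

# Refit the bivariate model so this chunk is self-contained
data("tips", package = "reshape2")
tips_regression <- lm(tip ~ total_bill, data = tips)
b <- coef(tips_regression)

# Draw the line tip = b0 + b1 * total_bill from the coefficients themselves
p <- ggplot(tips, aes(x = total_bill, y = tip)) +
  geom_point() +
  geom_abline(intercept = b[["(Intercept)"]],
              slope = b[["total_bill"]],
              color = "blue") +
  labs(x = "Total Bill ($)", y = "Tip ($)")
p
```

Unlike `geom_line()` on fitted values, `geom_abline()` extends the line across the whole panel, including beyond the range of observed bills.
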
e) Create a new dataset with ten random amounts for “total bill.” Predict the tip amount
for each row.
```{r}
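set.seed(719)  # arbitrary seed so the random bills are reproducible

# Draw ten bill amounts uniformly within the observed range of total_bill
new_data <- tibble(
  total_bill = runif(10, min = min(tips$total_bill), max = max(tips$total_bill))
)

# Predict a tip for each simulated bill from the bivariate model
new_data <- new_data %>%
  mutate(predicted_tip = predict(tips_regression, newdata = new_data))
new_data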
```
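
A point prediction says nothing about how far an individual tip may fall from the line. As an optional extension, `predict()` can also return 95% prediction intervals. The sketch below assumes an arbitrary seed and uses the illustrative names `new_bills` and `intervals`.

```{r}
set.seed(719)  # arbitrary seed, chosen only for reproducibility

# Refit the bivariate model so this chunk is self-contained
data("tips", package = "reshape2")
tips_regression <- lm(tip ~ total_bill, data = tips)

# Ten random bills within the observed range of total_bill
new_bills <- data.frame(
  total_bill = runif(10, min = min(tips$total_bill), max = max(tips$total_bill))
)

# Matrix with columns fit (point prediction), lwr and upr (interval bounds)
intervals <- predict(tips_regression, newdata = new_bills,
                     interval = "prediction", level = 0.95)
cbind(new_bills, round(intervals, 2))
```
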
3. Run a multivariate regression using “tip” as the outcome variable. Use at least three
predictor variables. At least one predictor variable must be continuous and at least one
must be categorical.
a) Write out the regression equation.
```{r}
tips_multivariate <- lm(tip ~ total_bill + size + sex, data = tips)
summary(tips_multivariate)
```
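tip = b0 + b1 * total_bill + b2 * size + b3 * sexMale
sexMale is a dummy variable (1 if the bill payer is male, 0 if female). R uses the first
factor level, Female, as the reference category, so the summary reports a coefficient
named sexMale.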
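
Which level R treats as the reference can be checked directly instead of assumed. The sketch below prints the factor levels of `sex` (the first level is the reference under R's default treatment coding), the coefficient names, and the first rows of the design matrix. The names `sex_levels` and `coef_names` are illustrative.

```{r}
# Refit the multivariate model so this chunk is self-contained
data("tips", package = "reshape2")
tips_multivariate <- lm(tip ~ total_bill + size + sex, data = tips)

# The first factor level is the reference category under default treatment coding
sex_levels <- levels(tips$sex)
sex_levels

# The dummy's name is the variable name followed by the non-reference level
coef_names <- names(coef(tips_multivariate))
coef_names

# Design matrix: the dummy column is 1 for the non-reference level, 0 otherwise
head(model.matrix(tips_multivariate))
```

To use the other level as the reference, the factor can be releveled before fitting, for example with `relevel(tips$sex, ref = "Male")`.
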
b) Write an interpretation of the results for each predictor variable (including each
dummy variable for categorical variables). If the variable is categorical, the
interpretation should state what the reference category is.
total_bill: The expected change in tip, in dollars, for each additional dollar on the
total bill, holding party size and sex constant.
size: The expected change in tip for each additional person in the dining party, holding
total bill and sex constant.
sexMale: The expected difference in tip when the bill payer is male rather than female
(the reference category), holding total bill and size constant.
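
The "holding the other predictors constant" reading of the dummy can be illustrated with two hypothetical tables that differ only in the sex of the bill payer. The $25 bill and party of 2 are made-up example values, and `scenarios`, `gap`, and `sex_coef` are illustrative names.

```{r}
# Refit the multivariate model so this chunk is self-contained
data("tips", package = "reshape2")
tips_multivariate <- lm(tip ~ total_bill + size + sex, data = tips)

# Two hypothetical tables: same bill and party size, different sex of the payer
scenarios <- data.frame(
  total_bill = 25,
  size = 2,
  sex = factor(c("Female", "Male"), levels = levels(tips$sex))
)
scenarios$predicted_tip <- unname(predict(tips_multivariate, newdata = scenarios))
scenarios

# Gap between the two predictions (second row minus first row)
gap <- diff(scenarios$predicted_tip)

# The coefficient on the sex dummy, located by name prefix
all_coefs <- coef(tips_multivariate)
sex_coef <- unname(all_coefs[grep("^sex", names(all_coefs))])
c(gap = gap, sex_coef = sex_coef)
```

Because everything else is held fixed, the gap between the two predictions matches the coefficient on the sex dummy in magnitude.
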
c) Which of your predictor variables have effects that are statistically significant? What
percentage of the variation in “tip” is explained by this model? What changes do you
notice from the previous model?
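Significance of Predictors: Look at the Pr(>|t|) column of the summary for each predictor.
A p-value below 0.05 means the predictor's effect is statistically significant at the 5%
level after accounting for the other predictors; it does not by itself mean the effect is
large.
R-squared Value: Multiple R-squared, multiplied by 100, is the percentage of the variation
in tip explained by all the predictors combined.
Comparing R-squared: R-squared can never decrease when predictors are added, so compare
Adjusted R-squared with the model that uses total bill alone. A clear rise in Adjusted
R-squared means the additional predictors add real explanatory power; little or no change
means total bill was already doing most of the work.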
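
The comparison described above can be sketched as follows. Both models are refit so the chunk runs on its own; `comparison` and `p_values` are illustrative names, and the nested-model F test via `anova()` is an optional extra rather than something the question requires.

```{r}
# Refit both models so this chunk is self-contained
data("tips", package = "reshape2")
tips_regression <- lm(tip ~ total_bill, data = tips)
tips_multivariate <- lm(tip ~ total_bill + size + sex, data = tips)

# p-value for each term in the multivariate model
p_values <- summary(tips_multivariate)$coefficients[, "Pr(>|t|)"]
round(p_values, 4)

# R-squared and Adjusted R-squared for the two models, side by side
comparison <- data.frame(
  model = c("total_bill only", "total_bill + size + sex"),
  r_squared = c(summary(tips_regression)$r.squared,
                summary(tips_multivariate)$r.squared),
  adj_r_squared = c(summary(tips_regression)$adj.r.squared,
                    summary(tips_multivariate)$adj.r.squared)
)
comparison

# F test of whether size and sex jointly improve on the total_bill-only model
anova(tips_regression, tips_multivariate)
```
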