Predicting Car Fuel Efficiency
Objective: Use Polynomial Regression to predict car fuel efficiency based on engine size.
Dataset: https://www.kaggle.com/uciml/autompg-dataset
Tasks:
1. Load and explore the dataset.
2. Create scatter plots to visualize the relationships between engine size and fuel efficiency.
3. Implement Polynomial Regression (e.g., degree=3) to predict fuel efficiency.
4. Evaluate and compare the performance with a Simple Linear Regression model.
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Step 1: Load and explore the dataset
file_path = "/content/drive/MyDrive/nkphd/auto-mpg.csv"
df = pd.read_csv(file_path)
# Display basic information about the dataset
print("Dataset Overview:")
print(df.head())
print("\nSummary Statistics:")
print(df.describe())
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
# Drop rows with missing values
df.dropna(inplace=True)
# Step 2: Scatter plot of engine size vs. fuel efficiency
plt.figure(figsize=(8, 6))
plt.scatter(df['displacement'], df['mpg'], color='blue', alpha=0.6)
plt.title("Engine Size vs. Fuel Efficiency")
plt.xlabel("Engine Size (Displacement)")
plt.ylabel("Fuel Efficiency (MPG)")
plt.grid()
plt.show()
# Step 3: Polynomial Regression
# Define features (engine size) and target (mpg)
X = df[['displacement']]
y = df['mpg']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create polynomial features of degree 3
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Train the polynomial regression model
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
# Predict using the polynomial regression model
y_pred_poly = poly_model.predict(X_test_poly)
# Step 4: Simple Linear Regression
# Train the simple linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# Predict using the simple linear regression model
y_pred_linear = linear_model.predict(X_test)
# Evaluate the models
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)
print("\nModel Performance:")
print(f"Polynomial Regression (degree=3) - MSE: {mse_poly:.2f}, R²: {r2_poly:.2f}")
print(f"Simple Linear Regression - MSE: {mse_linear:.2f}, R²: {r2_linear:.2f}")
# Visualize the Polynomial Regression fit
plt.figure(figsize=(8, 6))
plt.scatter(X, y, color='blue', alpha=0.6, label="Actual")
# Sort as a DataFrame so the fitted models still see the 'displacement' feature name
X_sorted = X.sort_values(by='displacement')
plt.plot(X_sorted, poly_model.predict(poly.transform(X_sorted)), color='red', label="Polynomial Regression (degree=3)")
plt.plot(X_sorted, linear_model.predict(X_sorted), color='green', linestyle='--', label="Simple Linear Regression")
plt.title("Model Comparison")
plt.xlabel("Engine Size (Displacement)")
plt.ylabel("Fuel Efficiency (MPG)")
plt.legend()
plt.grid()
plt.show()
Performance Evaluation
The Mean Squared Error (MSE) and R² score are used to compare both models:
Polynomial Regression (degree=3) yields a lower MSE and a higher R² score on the test set,
indicating a better fit and improved predictive accuracy.
Simple Linear Regression has a higher MSE and a lower R² score on the test set, because a
straight line cannot capture the nonlinear relationship between engine size and fuel
efficiency.
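To make the two metrics concrete, here is a minimal sketch that computes MSE and R² by hand. The function names and the four sample values are illustrative assumptions, not results from the Auto MPG dataset; the formulas follow the standard definitions that `mean_squared_error` and `r2_score` are based on.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals; lower is better."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of variance the model explains."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical MPG values and predictions, for illustration only
y_true = [30.0, 25.0, 20.0, 15.0]
y_pred = [28.0, 26.0, 19.0, 16.0]

print(f"MSE: {mse(y_true, y_pred):.2f}, R^2: {r2(y_true, y_pred):.3f}")
```

MSE is expressed in squared MPG units, so it depends on the scale of the target, while R² is unitless, which is why the two are usually reported together.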
Comparison
Linear Regression assumes a straight-line relationship, leading to underfitting in cases
where the relationship is nonlinear.
Polynomial Regression captures the curvature in the data, fitting more accurately but at
the cost of increased model complexity.
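The underfitting described above can be reproduced on a small synthetic example. This is only a sketch: the curve `40 / x + 5`, the liter scale for engine size, and the helper name `fit_mse` are assumptions chosen for illustration, and `np.polyfit` stands in for the `PolynomialFeatures` + `LinearRegression` combination used in the script.

```python
import numpy as np

# Synthetic curved data (assumption): efficiency drops quickly, then flattens.
# These values are not taken from the Auto MPG dataset.
x = np.linspace(1.0, 7.0, 50)   # engine size in liters (assumed scale)
y = 40.0 / x + 5.0              # assumed nonlinear relationship


def fit_mse(degree):
    """Fit a polynomial of the given degree by least squares and return its training MSE."""
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals ** 2))


mse_line = fit_mse(1)    # straight line: 2 coefficients
mse_cubic = fit_mse(3)   # cubic: 4 coefficients

print(f"Linear MSE: {mse_line:.3f}, cubic MSE: {mse_cubic:.3f}")
```

The cubic uses twice as many coefficients as the line. On training data a higher-degree least-squares fit can never do worse than a lower-degree one, so this number alone says nothing about generalization; that is the reason the report compares the models on a held-out test set.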
Conclusion
Polynomial Regression (degree=3) predicts fuel efficiency better than Simple Linear
Regression. Its lower MSE and higher R² score on the held-out test set indicate better
accuracy on this dataset.
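The comparison in this report uses a single train/test split. One way to check that the ranking of the two models does not depend on that particular split is k-fold cross-validation. The sketch below is a hedged illustration only: it runs on synthetic stand-in data (an assumed `40 / x + 5` trend plus noise), uses `np.polyfit` rather than the scikit-learn models from the script, and the names `folds` and `cv_mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): curved trend plus Gaussian noise.
x = rng.uniform(1.0, 7.0, size=200)
y = 40.0 / x + 5.0 + rng.normal(0.0, 1.5, size=200)

# Shuffle the row indices once and split them into 5 folds,
# so both polynomial degrees are scored on identical splits.
folds = np.array_split(rng.permutation(len(x)), 5)


def cv_mse(degree):
    """Average held-out MSE of a polynomial of the given degree over the folds."""
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[test_idx] = False          # hold this fold out of training
        coeffs = np.polyfit(x[train_mask], y[train_mask], deg=degree)
        pred = np.polyval(coeffs, x[test_idx])
        scores.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(scores))


for degree in (1, 3):
    print(f"degree={degree}: mean CV MSE = {cv_mse(degree):.2f}")
```

If the degree-3 model keeps the lower average error across folds on the real data, the conclusion holds beyond one random split; with scikit-learn, `cross_val_score` applied to a pipeline of `PolynomialFeatures` and `LinearRegression` would serve the same purpose.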