Department of Software Engineering
Mehran University of Engineering and Technology, Jamshoro
Course: SWE – Data Analytics and Business Intelligence
Instructor: Ms. Sana Faiz    Practical/Lab No.: 10
Date:    CLOs: 04
Signature:    Assessment Score:
Topic: To understand the basics of Python
Objectives: To become familiar with linear regression using scikit-learn
Lab Discussion: Theoretical concepts and Procedural steps
Linear regression
Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things:
(1) Does a set of predictor variables do a good job of predicting an outcome (dependent) variable?
(2) Which variables in particular are significant predictors of the outcome variable?
Simple linear regression
1 dependent variable (interval or ratio), 1 independent variable
Regression estimates are used to explain the relationship between one dependent variable and one or more independent variables. The simplest form of the regression equation, with one dependent and one independent variable, is defined by the formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression coefficient, and x = score on the independent variable.
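To see the formula in action, the short sketch below plugs hypothetical values (a constant of 25000 and a coefficient of 9500, chosen only for illustration) into y = c + b*x:
# Hypothetical illustration of y = c + b*x (the numbers are made up)
c = 25000        # constant (intercept)
b = 9500         # regression coefficient (slope)
x = 5            # score on the independent variable, e.g. 5 years of experience
y = c + b * x    # estimated dependent variable score
print(y)         # 72500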
Regression variables
Naming the variables. There are many names for a regression's dependent variable: it may be called an outcome variable, criterion variable, endogenous variable, or regressand. The independent variables can be called exogenous variables, predictor variables, or regressors.
Import libraries and read data from a CSV file
# Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Import the dataset
dataset = pd.read_csv('salaryData.csv')
X = dataset.iloc[:, :-1].values  # Every column except the last: the feature (years of experience)
y = dataset.iloc[:, -1].values   # The last column: the target (salary)
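Before splitting the data, it can help to confirm that the file loaded as expected. The following optional check assumes salaryData.csv has two columns (for example YearsExperience and Salary; adjust to your file):
# Optional: quick look at the loaded data
print(dataset.head())      # first five rows of the CSV
print(dataset.shape)       # (number of rows, number of columns)
print(X.shape, y.shape)    # X is 2-D with shape (n, 1); y is 1-D with shape (n,)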
Train the regressor and predict outcomes
# Split the dataset into the training set and test set
# test_size=1/3 holds out one third of the data, so out of 30 rows,
# 20 rows go into the training set and 10 rows go into the testing set.
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size=1/3, random_state=0)
# Optional: inspect the training split (this step depends on your needs)
show_data = pd.DataFrame({'Training Set': xTrain.flatten(), 'Training Target': yTrain})
print(show_data)
# Creating a LinearRegression object and fitting it on our training set.
linearRegressor = LinearRegression()
linearRegressor.fit(xTrain.reshape(-1, 1), yTrain)
# Predicting the test set results
yPrediction = linearRegressor.predict(xTest.reshape(-1, 1))
# predict() returns a 1-D array, so no extra flattening is needed
print(yPrediction)
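Because this is simple linear regression, the fitted model is just the equation y = c + b*x from the theory section, with c stored in intercept_ and b in coef_. The optional lines below print them:
# Inspect the learned parameters: y = c + b*x
print("Intercept (c):", linearRegressor.intercept_)
print("Coefficient (b):", linearRegressor.coef_[0])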
Visualizing the training data and targets
Showing the actual and predicted values and visualizing them
# Showing test set and predicted values side by side
results = pd.DataFrame({
'Test Set': xTest.flatten(),
'Actual Value': yTest,
'Predicted Values': yPrediction
})
print(results)
# Visualising the training set results
plt.subplot(2, 1, 1) # Define a 2-row, 1-column grid, and use the 1st cell
plt.scatter(xTrain, yTrain, color='red')
plt.plot(xTrain, linearRegressor.predict(xTrain.reshape(-1, 1)), color='blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
Actual and predicted values
Plotting test set data
# Visualising the test set results
plt.subplot(2, 1, 2) # Define a 2-row, 1-column grid, and use the 2nd cell
plt.scatter(xTest, yTest, color='red')
plt.plot(xTrain, linearRegressor.predict(xTrain.reshape(-1, 1)), color='blue')  # Use the training data to plot the regression line
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
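Because each of the blocks above ends with plt.show(), the two plots appear as separate figures and the 2x1 subplot grid is never filled in one window. If you want both panels stacked in a single figure, one possible sketch (using the same variables as above) is:
# Alternative: draw both panels in one figure before calling plt.show()
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 8))
ax1.scatter(xTrain, yTrain, color='red')
ax1.plot(xTrain, linearRegressor.predict(xTrain.reshape(-1, 1)), color='blue')
ax1.set_title('Salary vs Experience (Training set)')
ax1.set_xlabel('Years of Experience')
ax1.set_ylabel('Salary')
ax2.scatter(xTest, yTest, color='red')
ax2.plot(xTrain, linearRegressor.predict(xTrain.reshape(-1, 1)), color='blue')
ax2.set_title('Salary vs Experience (Test set)')
ax2.set_xlabel('Years of Experience')
ax2.set_ylabel('Salary')
plt.tight_layout()
plt.show()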
Regression metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Calculating and printing the performance metrics
print("Mean Absolute Error:", mean_absolute_error(yTest, yPrediction))
print("Mean Squared Error:", mean_squared_error(yTest, yPrediction))
print("Variance Score (R^2):", r2_score(yTest, yPrediction))
Mean Absolute Error: the mean absolute error is a measure of the difference between two continuous variables; here, it is the average absolute difference between the actual and predicted salaries.
Mean Squared Error: the mean squared error (or mean squared deviation) of an estimator measures the average of the squares of the errors, that is, the average squared difference between the estimated values and what is estimated. MSE is a risk function corresponding to the expected value of the squared error loss.
Variance Score (R^2): the R^2 score reports the proportion of the variance in the dependent variable that the model explains; a score of 1.0 indicates a perfect fit.
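To connect these definitions to the code, the sketch below recomputes the same three quantities directly with NumPy (using the yTest and yPrediction arrays from above); the results should match the scikit-learn functions:
# Recomputing the metrics by hand to see what the formulas do
errors = yTest - yPrediction
mae = np.mean(np.abs(errors))                   # mean absolute error
mse = np.mean(errors ** 2)                      # mean squared error
r2 = 1 - np.sum(errors ** 2) / np.sum((yTest - np.mean(yTest)) ** 2)   # R^2
print(mae, mse, r2)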
Visualizing data
Class Tasks
Submission Date: --
1. Perform linear regression on the student dataset uploaded on the drive.
2. Perform linear regression on a dataset of your own choice.