Regression
Regression in machine learning is a technique for finding the relationship between independent and dependent variables, with the main purpose of predicting an outcome. It involves training an algorithm on observed data to reveal the patterns that relate inputs to outputs. Once those patterns are identified, the model can make accurate predictions for new data points or input values.
Types of Regression
1. Linear Regression
2. Logistic Regression
Linear Regression
Linear regression is a type of supervised machine-learning algorithm that learns from labelled datasets and maps the data points to the best-fitting linear function, which can then be used for prediction on new datasets. It assumes a linear relationship between the input and output, meaning the output changes at a constant rate as the input changes. This relationship is represented by a straight line.
For example, suppose we want to predict a student's exam score based on how many hours they studied. We observe that as students study more hours, their scores go up. In this example:
Independent variable (input): hours studied, because it is the factor we control or observe.
Dependent variable (output): exam score, because it depends on how many hours were studied.
Equation of the Best-Fit Line
For simple linear regression (with one independent variable), the best-fit
line is represented by the equation
y = mx + b
Where:
y is the predicted value (dependent variable)
x is the input (independent variable)
m is the slope of the line (how much y changes when x changes)
b is the intercept (the value of y when x = 0)
The best-fit line will be the one that optimizes the values of m (slope) and
b (intercept) so that the predicted y values are as close as possible to the
actual data points.
Mean(X) = 4, Mean(Y) = 50

Study Hours (X) | Test Score (Y) | Deviation (X - Mean X) | Deviation (Y - Mean Y) | Product of Deviations | Square of Deviation for X
2               | 40             | -2                     | -10                    | 20                    | 4
4               | 50             | 0                      | 0                      | 0                     | 0
6               | 60             | 2                      | 10                     | 20                    | 4
Calculate m = Sum of product of deviations / Sum of square of deviation for X
Calculate b = Mean of Y – (m* Mean of X)
Calculations
Sum of Product of Deviations = 20 + 0 + 20 = 40
Sum of Square of Deviations for X = 4 + 0 + 4 = 8
m = Sum of Product of Deviations / Sum of Square of Deviations for X
m = 40/8 = 5
b = Mean(Y) - (m * Mean(X)) = 50 - (5 * 4) = 30
Final Regression Equation
Y = 5X + 30
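The hand calculation above can be checked with a short script. This is a minimal sketch using plain Python (no libraries), reproducing the deviation method on the same three data points:

```python
# Data from the worked example: (Study Hours, Test Score)
hours = [2, 4, 6]      # X
scores = [40, 50, 60]  # Y

mean_x = sum(hours) / len(hours)    # Mean(X) = 4
mean_y = sum(scores) / len(scores)  # Mean(Y) = 50

# Sum of products of deviations and sum of squared X deviations
sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))  # 40
sum_squares_x = sum((x - mean_x) ** 2 for x in hours)                           # 8

m = sum_products / sum_squares_x   # slope: 40 / 8 = 5.0
b = mean_y - m * mean_x            # intercept: 50 - 5*4 = 30.0
print(f"Y = {m}X + {b}")           # Y = 5.0X + 30.0
```

Running this confirms the regression equation derived by hand.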
Study_hours.py
import pandas as pd
from sklearn.linear_model import LinearRegression
# Dataset
data = {
'StudyHours': [2, 3, 4, 5, 6, 7, 8],
'Marks': [40, 50, 55, 65, 70, 80, 85]
}
df = pd.DataFrame(data)
# Train model
X = df[['StudyHours']]
y = df['Marks']
regr = LinearRegression()
regr.fit(X, y)
# User input
study_hours = float(input("Enter study hours: "))
# Wrap input in DataFrame with the same column name
input_data = pd.DataFrame({'StudyHours': [study_hours]})
predicted_marks = regr.predict(input_data)
print(f"Study Hours: {study_hours}")
print(f"Predicted Marks: {predicted_marks[0]:.2f}")
Output
Enter study hours: 6
Study Hours: 6.0
Predicted Marks: 71.07
Q1: Fit a linear regression model for data set (x, y): (1, 1.5), (2, 3.0), (3, 4.5), (4, 6.0) and
predict y for x = 5
Non-Linear Regression
Non-linear regression is a type of regression in machine learning where the relationship between input X and output Y is not a straight line. Instead, the data follows a curved pattern.
In such cases, a straight line (linear regression) does not fit well, so we use equations like
polynomial, exponential, logarithmic, or other non-linear functions.
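As an illustration, polynomial regression can be sketched with NumPy's `polyfit`. The data points below are invented for this example (roughly following y = x^2), not taken from any real dataset:

```python
import numpy as np

# Invented data that follows a curved (roughly quadratic) pattern
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 3.9, 9.2, 15.8, 25.1])

# polyfit returns coefficients [a, b, c] for y = a*x^2 + b*x + c
coeffs = np.polyfit(x, y, deg=2)
predict = np.poly1d(coeffs)

print("Coefficients:", coeffs)
print("Prediction at x = 6:", predict(6.0))
```

A straight line would systematically under- and over-shoot this data, while the degree-2 polynomial captures the curvature.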
Logistic Regression
Logistic regression is a type of supervised machine-learning algorithm that also learns from labelled
datasets but is mainly used for classification problems instead of predicting continuous values. It
assumes that the output is categorical, such as Yes/No or 0/1, and maps the data points using a
logistic function (sigmoid curve) to estimate probabilities between 0 and 1. This probability is then
used to decide the class of new data points. For example, we may want to predict whether a student
will pass or fail based on how many hours they studied. We observe that as study hours increase, the
probability of passing also increases, which is captured by the S-shaped logistic curve.
Sigmoid Function
Y = 1 / (1 + e^-(a0 + a1*X))
Where :
a0 → Intercept (similar to b in linear regression).
a1 → Coefficient/weight of the feature X.
X → Input (independent variable).
Output → A probability between 0 and 1.
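The sigmoid function itself is easy to write down. This is a minimal sketch with illustrative values of a0 and a1 (not fitted from any data):

```python
import math

def sigmoid(a0, a1, x):
    """Logistic function: returns a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(a0 + a1 * x)))

# When a0 + a1*X = 0 the output is exactly 0.5 (the decision boundary)
print(sigmoid(0.0, 1.0, 0.0))   # 0.5
# Large positive values of a0 + a1*X push the probability toward 1
print(sigmoid(0.0, 1.0, 5.0))   # close to 1
```

The S-shape comes from the exponential: very negative inputs give probabilities near 0, very positive inputs give probabilities near 1, and the transition happens smoothly around the decision boundary.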
Example :
Student_ID Study_Hours Outcome
S01 0.5 Fail
S02 1 Fail
S03 1.5 Fail
S04 2 Fail
S05 2.5 Fail
S06 3 Fail
S07 3.5 Fail
S08 4 Fail
S09 4.5 Fail
S10 5 Pass
S11 5.5 Pass
S12 6 Pass
S13 6.5 Pass
S14 7 Pass
S15 7.5 Pass
S16 8 Pass
S17 8.5 Pass
S18 9 Pass
S19 9.5 Pass
S20 10 Pass
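The table above can be fitted with scikit-learn's `LogisticRegression`, in the same style as the earlier `Study_hours.py` script. This is a sketch: the outcome is encoded as 0 = Fail, 1 = Pass, and default model settings are assumed:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Dataset from the table above (Outcome: 0 = Fail, 1 = Pass)
hours = [0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5,
         5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10]
outcome = [0] * 9 + [1] * 11

df = pd.DataFrame({'Study_Hours': hours, 'Outcome': outcome})

# Train model
model = LogisticRegression()
model.fit(df[['Study_Hours']], df['Outcome'])

# Predicted probability of passing for a few study-hour values
for h in [2.0, 4.5, 8.0]:
    p = model.predict_proba(pd.DataFrame({'Study_Hours': [h]}))[0][1]
    print(f"{h} hours -> P(Pass) = {p:.2f}")
```

The fitted sigmoid rises from near 0 at low study hours to near 1 at high study hours, with the transition around the Fail/Pass boundary in the table (between 4.5 and 5 hours).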