Experiment 2
Roll no:
Aim: To study and implement Logistic Regression.
Theory:
Logistic regression is a supervised machine learning algorithm widely used for binary
classification tasks, such as identifying whether an email is spam or not and diagnosing diseases
by assessing the presence or absence of specific conditions based on patient test results. This
approach utilizes the logistic (or sigmoid) function to transform a linear combination of input
features into a probability value ranging between 0 and 1. This probability indicates the likelihood
that a given input corresponds to one of two predefined categories. The essential mechanism of
logistic regression is grounded in the logistic function's ability to model the probability of binary
outcomes accurately. With its distinctive S-shaped curve, the logistic function effectively maps
any real-valued number to a value within the 0 to 1 interval. For simplicity, consider the following
data (a single feature and a single target).
The image shows the basic idea behind Logistic Regression, which is a method used to classify things
into one of two categories — like answering yes or no, spam or not spam, or diabetic or not diabetic.
At the center of this method is a special curve called the logistic function or sigmoid curve, which has an
S-shape like the one shown in the diagram. This curve takes input values (shown along the X-axis) and
turns them into predicted probabilities (shown on the Y-axis) that always fall between 0 and 1. This
means:
• When the input value is low, the prediction is closer to 0 (meaning “no” or negative class).
• When the input value is high, the prediction is closer to 1 (meaning “yes” or positive class).
• In between, the prediction smoothly changes — it’s never just a hard switch.
The method behind this involves computing a linear combination of the input values (a weighted sum
plus a bias term), then passing it through the S-shaped curve to get the probability. This allows us to
model things like how likely a person is to have a condition based on their age, test results, or other
features; a small sketch of this computation is given after the list below.
In the diagram:
• The X-axis is the input or feature.
• The Y-axis is the predicted result — a probability that increases from 0 to 1.
• The S-curve shows how predictions move smoothly from one class to the other.
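The following is a minimal Python sketch of this mechanism. The weight and bias values here are hypothetical, chosen only so the numbers are easy to follow.

import numpy as np

def sigmoid(z):
    # The sigmoid maps any real number into the (0, 1) interval: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight and bias for a single feature (illustration only)
w, b = 0.8, -4.0
for x in [2.0, 5.0, 8.0]:
    z = w * x + b        # linear combination of the input
    p = sigmoid(z)       # squashed into a probability between 0 and 1
    print(f"x = {x}: P(class = 1) = {p:.3f}")

For a low input (x = 2.0) the probability comes out near 0, for a high input (x = 8.0) it comes out near 1, and in between it changes smoothly, exactly as the curve shows.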
Types of Logistic Regression
1. Binary Logistic Regression
Binary Logistic Regression is the most common type. In this method, the output (target) can only be one
of two possible outcomes — like success/failure, yes/no, or 0/1. It calculates the probability that a
certain input belongs to one of these two categories by using a curve called the sigmoid function. This
curve turns any number into a value between 0 and 1.
If the predicted value is above a set threshold (usually 0.5), the model says it belongs to one class. If it’s
below, it belongs to the other. This type is often used for things like detecting spam emails, spotting
fraud, or predicting if a student will pass or fail.
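As a small illustration (the one-feature data below is a toy set invented for this sketch), scikit-learn's predict_proba gives the probability, which is then compared against the 0.5 threshold:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data made up for illustration: small x -> class 0, large x -> class 1
X = np.array([[1], [2], [3], [6], [7], [8]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

probs = model.predict_proba([[4], [5]])[:, 1]   # P(class = 1) for new inputs
labels = (probs >= 0.5).astype(int)             # apply the usual 0.5 threshold
print(probs, labels)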
2. Multinomial Logistic Regression
Multinomial Logistic Regression is used when the outcome can be more than two categories, and those
categories don’t follow any order. For example, predicting whether someone owns a dog, cat, or rabbit
— the choices are different, but not ranked.
This version uses a different function called softmax, which calculates the probability for each possible
category. Then, it chooses the one with the highest probability. This type is common in problems like
classifying text, marketing segments, or types of images.
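A minimal sketch of the softmax step, using hypothetical class scores for the dog/cat/rabbit example (the score values are invented for illustration):

import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalise to probabilities
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for dog, cat, rabbit
probs = softmax(scores)
print(dict(zip(["dog", "cat", "rabbit"], probs.round(3))))
print("Predicted:", ["dog", "cat", "rabbit"][int(np.argmax(probs))])  # highest probability wins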
3. Ordinal Logistic Regression
Ordinal Logistic Regression is used when the result has three or more categories with a natural
order (though we don't know exactly how far apart they are). For example, rating satisfaction as low,
medium, or high involves an order.
Instead of looking at each class one by one, this model works with cumulative probabilities, which take
the ranking into account. It's useful in surveys, customer reviews, and any situation where the responses
have levels or ranks.
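A small sketch of the cumulative-probability idea in its common proportional-odds form; the cutpoints and coefficient below are hypothetical, invented only for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical cutpoints separating low | medium | high, plus one feature
cut_low, cut_med = -1.0, 1.5
beta, x = 0.5, 2.0

p_le_low = sigmoid(cut_low - beta * x)   # cumulative P(Y <= low)
p_le_med = sigmoid(cut_med - beta * x)   # cumulative P(Y <= medium)

# Per-category probabilities fall out as differences of cumulative ones
print({"low": round(p_le_low, 3),
       "medium": round(p_le_med - p_le_low, 3),
       "high": round(1.0 - p_le_med, 3)})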
Applications of Logistic Regression
1. Medical Diagnosis
Logistic regression is commonly used in healthcare to predict whether a person has a disease or not
based on their test results and medical history.
2. Spam Detection
It helps in identifying whether an email is spam or not spam by looking at things like the subject line,
sender, and content of the email.
3. Credit Scoring / Loan Approval
Banks and financial institutions use logistic regression to evaluate credit risk and decide whether a
person should be approved for a loan.
Advantages of Logistic Regression
1. Simple and Easy to Implement
Logistic regression is easy to understand and use. It’s a great starting point for solving binary
classification problems, especially when you're just beginning with machine learning.
2. Fast Training and Prediction
It trains and predicts quickly, even on large datasets. Logistic regression works well when the
relationship between the inputs and the log-odds of the outcome is roughly linear.
3. Probabilistic Output
Instead of just saying "yes" or "no," logistic regression gives a probability score (like "87% chance of
success"), which can help in making smarter decisions.
Common Challenges with Logistic Regression
1. Assumes Linear Relationship
Logistic regression assumes a linear relationship between the input features and the log-odds of the
output. If the data has strongly non-linear patterns, this model might not perform well.
2. Not Ideal for Large Feature Spaces
If there are too many input features, especially if many are similar or unnecessary, the model can
become unstable unless you apply methods like regularization.
3. Sensitive to Irrelevant Features
Logistic regression doesn’t automatically ignore features that don’t matter. Too many unhelpful inputs
can confuse the model, so it’s important to use feature selection before training.
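As a hedged sketch of both remedies (the toy data below is invented for illustration), an L1 penalty in scikit-learn shrinks the weights of unhelpful features toward zero, effectively performing feature selection:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy data: only the first of five features actually drives the label
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# L1 regularization pushes irrelevant coefficients to exactly zero
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)
print(model.coef_)   # expect a large weight on feature 0, near-zero elsewhere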
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset; Age is the single feature, Outcome the binary target
df = pd.read_csv('diabetes.csv')
X = df[['Age']]
y = df['Outcome']

# Hold out 20% for testing, fit the model, and predict on the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Plot the fitted sigmoid curve over the observed data
x_range = pd.DataFrame({'Age': range(df['Age'].min(), df['Age'].max() + 1)})
pred_probs = model.predict_proba(x_range)[:, 1]
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Outcome', data=df, hue='Outcome', alpha=0.6, palette='coolwarm')
plt.plot(x_range['Age'], pred_probs, color='black', linewidth=2)
plt.title("Logistic Regression: Age vs Probability of Diabetes")
plt.xlabel("Age")
plt.ylabel("Predicted Probability of Diabetes")
plt.grid(True)
plt.tight_layout()
plt.show()
Output:
Conclusion:
Therefore, I have studied and understood logistic regression and implemented the same on the diabetes dataset.