6/1/23, 10:31 PM logistic_regression
Some Important Questions and Answers on Logistic Regression
What is logistic regression, and when is it used?
Logistic regression belongs to the regression family, but unlike other regression
methods, which predict a continuous value, it is used for problems where the target
is categorical, such as male or female, true or false, yes or no.
When is it used?
1. Predictive modelling
2. Medical research
3. Credit scoring
4. Market and customer analytics
What is the logistic function (also known as the sigmoid function), and why is it used in
logistic regression?
The logistic function, also known as the sigmoid function, is a mathematical function that
maps any real number to a value between 0 and 1. It has an S-shaped curve and is
given by the formula σ(z) = 1 / (1 + e^(-z))
where σ(z) is the output (interpreted as a probability) and z is the input to the function.
In logistic regression, z is a linear combination of the features, so the sigmoid turns an
unbounded linear score into a probability for the positive class.
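To make the formula concrete, here is a minimal sketch of the sigmoid in plain Python. The function name `sigmoid` and the sample inputs are chosen for illustration only; branching on the sign of z is one common way to avoid overflow for large negative inputs, not the only one.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: 1 / (1 + e^(-z))."""
    # Branch on the sign of z so math.exp never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative inputs approach 0, large positive inputs approach 1,
# and z = 0 maps to exactly 0.5
for z in (-5.0, 0.0, 5.0):
    print(f"sigmoid({z}) = {sigmoid(z):.4f}")
```

A prediction is then usually obtained by comparing the output with a threshold such as 0.5.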
How do you evaluate the performance of a logistic regression model?
Here are some commonly used evaluation methods for logistic regression:
1. Confusion matrix
2. Accuracy
3. Precision
4. Recall
5. F1 Score
6. ROC Curve
Import Necessary Libraries
In [ ]: import numpy as np
import pandas as pd
Load Dataset
In [ ]: df = pd.read_csv('ft.csv')
df.head()
Out[ ]: male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp
0 1 39 4.0 0 0.0 0.0 0 0
1 0 46 2.0 0 0.0 0.0 0 0
2 1 48 1.0 1 20.0 0.0 0 0
3 0 61 3.0 1 30.0 0.0 0 1
4 0 46 3.0 1 23.0 0.0 0 0
Perform EDA
In [ ]: df.shape
Out[ ]: (4238, 16)
In [ ]: # View column dtypes and non-null counts
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 male 4238 non-null int64
1 age 4238 non-null int64
2 education 4133 non-null float64
3 currentSmoker 4238 non-null int64
4 cigsPerDay 4209 non-null float64
5 BPMeds 4185 non-null float64
6 prevalentStroke 4238 non-null int64
7 prevalentHyp 4238 non-null int64
8 diabetes 4238 non-null int64
9 totChol 4188 non-null float64
10 sysBP 4238 non-null float64
11 diaBP 4238 non-null float64
12 BMI 4219 non-null float64
13 heartRate 4237 non-null float64
14 glucose 3850 non-null float64
15 TenYearCHD 4238 non-null int64
dtypes: float64(9), int64(7)
memory usage: 529.9 KB
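The non-null counts above imply how many values are missing in each column. As a small worked example, the arithmetic can be sketched in plain Python; the numbers are copied from the `df.info()` output, and the dictionary name `non_null` is only illustrative. On the real DataFrame, `df.isnull().sum()` gives the same counts directly.

```python
# Total rows and non-null counts, copied from the df.info() output above
n_rows = 4238
non_null = {
    "education": 4133,
    "cigsPerDay": 4209,
    "BPMeds": 4185,
    "totChol": 4188,
    "BMI": 4219,
    "heartRate": 4237,
    "glucose": 3850,
}

# Missing values per column = total rows - non-null count
missing = {col: n_rows - count for col, count in non_null.items()}

# Print columns from most to least missing, with the share of rows affected
for col, n_missing in sorted(missing.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<12} {n_missing:>4} missing ({100 * n_missing / n_rows:.1f}%)")
```

Glucose stands out with 388 missing values, roughly 9% of the rows, which is why it needs explicit handling before modelling.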
In [ ]: # View descriptive statistics
df.describe()
Out[ ]: male age education currentSmoker cigsPerDay BPMeds preval
count 4238.000000 4238.000000 4133.000000 4238.000000 4209.000000 4185.000000 42
mean 0.429212 49.584946 1.978950 0.494101 9.003089 0.029630
std 0.495022 8.572160 1.019791 0.500024 11.920094 0.169584
min 0.000000 32.000000 1.000000 0.000000 0.000000 0.000000
25% 0.000000 42.000000 1.000000 0.000000 0.000000 0.000000
50% 0.000000 49.000000 2.000000 0.000000 0.000000 0.000000
75% 1.000000 56.000000 3.000000 1.000000 20.000000 0.000000
max 1.000000 70.000000 4.000000 1.000000 70.000000 1.000000
In [ ]: #check duplicate rows
duplicate_rows = df.duplicated()
#count the number of True values
num_dup_rows = duplicate_rows.sum()
num_dup_rows
Out[ ]: 0
In [ ]: import matplotlib.pyplot as plt
import seaborn as sns
num_feat = ["male","age","education","currentSmoker","cigsPerDay","BPMeds","prev
for feature in num_feat:
    plt.figure(figsize=(7, 7))
    sns.histplot(df[feature], kde=True)
    plt.title(f"Histogram of {feature}")
In [ ]: num_feat = ["male","age","education","currentSmoker","cigsPerDay","BPMeds","prev
sns.pairplot(df[num_feat])
plt.show()
In [ ]: for feature in num_feat:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[feature])
    plt.title(f'Boxplot of {feature}')
    plt.show()
In [ ]: # Select the features and the target. .copy() makes X an independent DataFrame,
# so assigning to its columns later does not raise SettingWithCopyWarning
X = df[['age','prevalentHyp','sysBP','diaBP','glucose']].copy()
y = df['TenYearCHD']
In [ ]: X.isnull().sum()
Out[ ]: age 0
prevalentHyp 0
sysBP 0
diaBP 0
glucose 388
dtype: int64
In [ ]: # Fill the missing glucose values with the column mean
X['glucose'] = X['glucose'].fillna(value=df['glucose'].mean())
In [ ]: from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_st
In [ ]: from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
Out[ ]: ▾ LogisticRegression
LogisticRegression()
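For intuition about what fitting estimates, here is a minimal sketch of logistic regression trained with plain gradient descent on a made-up one-feature dataset. It is not how scikit-learn fits the model: `LogisticRegression` uses a different optimizer and applies L2 regularization by default. The function name `fit_logistic_1d`, the learning rate, the number of epochs, and the toy data are all assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors (p - y) drive both gradients
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data (made up): small x values are class 0, large x values are class 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic_1d(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
print(f"P(y=1 | x=-2) = {sigmoid(w * -2.0 + b):.3f}")
print(f"P(y=1 | x=+2) = {sigmoid(w * 2.0 + b):.3f}")
```

The learned weight is positive, so the predicted probability of class 1 rises with x, which is the same relationship the fitted coefficients of the real model encode for each feature.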
In [ ]: # Accuracy on the training set (not the held-out test set)
score = lr.score(x_train, y_train)
score
Out[ ]: 0.8486176668914363
In [ ]: from sklearn.metrics import confusion_matrix
y_pred = lr.predict(x_test)
y_true = y_test
confusion_matrix(y_true, y_pred)
Out[ ]: array([[1080, 4],
[ 182, 6]])
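The metrics listed earlier can be worked out by hand from this confusion matrix. The sketch below uses the four counts shown above and assumes scikit-learn's layout for binary labels 0 and 1, where rows are true classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 1080, 4
fn, tp = 182, 6
total = tn + fp + fn + tp

accuracy = (tp + tn) / total             # share of all predictions that are correct
precision = tp / (tp + fp)               # of predicted positives, how many are real
recall = tp / (tp + fn)                  # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class (0), for comparison
baseline_accuracy = (tn + fp) / total

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1        = {f1:.4f}")
print(f"baseline  = {baseline_accuracy:.4f}")
```

Test accuracy is about 0.85, but recall is only about 0.03: the model finds 6 of the 188 positive cases, and always predicting class 0 would reach nearly the same accuracy. On an imbalanced target like this one, precision, recall, and F1 are more informative than accuracy alone.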
In [ ]: score = np.array(score).reshape(-1, 1)
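In [ ]: from sklearn.metrics import roc_curve, auc
# Use the predicted probability of the positive class rather than hard 0/1 labels,
# so the curve is evaluated at many thresholds instead of a single one
y_prob = lr.predict_proba(x_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_true, y_prob)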
In [ ]: roc_auc = auc(fpr, tpr)
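A ROC curve is traced by sweeping a decision threshold over continuous scores, so it is usually computed from predicted probabilities rather than from hard 0/1 predictions. The sketch below illustrates the difference using the pairwise-ranking view of AUC: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The helper name `pairwise_auc`, the toy labels, and the toy scores are made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example (made up): true labels and predicted probabilities
y_true_toy = [0, 0, 1, 0, 1, 1]
probs_toy = [0.1, 0.3, 0.35, 0.4, 0.6, 0.8]

# Hard predictions at a 0.5 threshold discard the ranking information
hard_toy = [1 if p >= 0.5 else 0 for p in probs_toy]

auc_from_probs = pairwise_auc(y_true_toy, probs_toy)
auc_from_labels = pairwise_auc(y_true_toy, hard_toy)
print(f"AUC from probabilities: {auc_from_probs:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

With hard labels the curve has a single interior point, and the area reduces to (1 + TPR - FPR) / 2 at that one threshold, which generally differs from the AUC obtained from probabilities.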
In [ ]: plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' %
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()