
professional-machine-learning-4

October 14, 2024

[72]: #ML Workflow:


'''The whole end-to-end process that is involved can be generalized into the six-stage process outlined here:

1. Problem Understanding: Negotiation with stakeholders for requirements.
2. Data Collection: Retrieval from databases, APIs, or scraping.
3. Data Annotation and Data Preparation: Labeling, cleaning, reformatting, normalization, etc.
4. Data Wrangling: Conversion of data into a form ML can use (i.e. numeric).
5. Model Development, Training, and Evaluation: Train/test splits, then evaluating different models and finalizing the best one.
6. Model Deployment and Maintenance: The model is deployed to real-world cases and then, as more data arrives, it is retrained again and again.'''

#Scikit-learn is the most commonly used Python library for ML.


import sklearn
'''It is organized around three primary APIs, namely, estimator, predictor, and transformer.
A transformer (transform) is used in preprocessing the data for ML modeling.
An estimator is initialized from hyperparameter values and implements the actual learning process in the fit method, which you call while providing the input data and labels in the form of X_train and y_train arrays.
Predictors provide a predict method that takes the data which needs to be predicted as a NumPy array, usually referred to as X_test. It applies the required transformation with respect to the parameters learned by the fit method and provides the predicted values or labels.
Sklearn works on pandas DataFrames and NumPy arrays.
Pipeline objects chain multiple estimators into a single one. Thus, you can encapsulate multiple preprocessing, transformation, and prediction steps into a single object.

SUPERVISED LEARNING WITH SCIKIT-LEARN, scikit-learn syntax:

from sklearn.module import Model
model = Model()
(Preprocess data by relevant transformation)
model.fit(X, y)
predictions = model.predict(X_new)
print(predictions)'''
print('.')
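To make the generic syntax above concrete, here is a minimal hedged sketch that chains a transformer and an estimator through a Pipeline; the data is synthetic (make_classification) and purely illustrative, and the variable names are mine, not from the notebook:

# Minimal sketch of the transformer/estimator/predictor pattern via a Pipeline.
# The dataset is synthetic and only for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler      # transformer: fit/transform
from sklearn.linear_model import LogisticRegression   # estimator + predictor: fit/predict

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_tr, y_tr)               # fits the scaler, transforms, then fits the model
print(pipe.predict(X_te)[:5])      # applies the learned scaling, then predicts
print(pipe.score(X_te, y_te))      # mean accuracy on held-out data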

[73]: #Supervised Learning:

[74]: #1. Classification:


import pandas as pd
df=pd.read_csv('Churn.csv') #https://www.kaggle.com/datasets/nushkaa/telecom-customer-churn

df.head()

[74]: CustomerID Age Gender Tenure Usage Frequency Support Calls \


0 1 22 Female 25 14 4
1 2 41 Female 28 28 7
2 3 47 Male 27 10 2
3 4 35 Male 9 12 5
4 5 53 Female 58 24 9

Payment Delay Subscription Type Contract Length Total Spend \


0 27 Basic Monthly 598
1 13 Standard Monthly 584
2 29 Premium Annual 757
3 17 Premium Quarterly 232
4 2 Standard Annual 533

Last Interaction Churn


0 9 1
1 20 0
2 21 0
3 18 0
4 18 0

[75]: #1. KNN:


from sklearn.neighbors import KNeighborsClassifier
X=df[['Tenure','Usage Frequency','Support Calls','Payment Delay','Total Spend','Last Interaction']].values #predictors
Y=df['Churn'].values #target
#.values is used to convert it to numpy array
knn=KNeighborsClassifier(n_neighbors=10)
knn.fit(X,Y)

[75]: KNeighborsClassifier(n_neighbors=10)

[76]: import numpy as np
#let's test how it predicts:
y_pred=knn.predict(np.array([[128,25,265.1,197.4,244.7,10.01],[20,150,155,200,10,3]]))

print(f'Predictions: {y_pred}')

Predictions: [1 1]

[77]: #Train/test split + computing accuracy:


import pandas as pd
# Import the module
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X=df[['Tenure','Usage Frequency','Support Calls','Payment Delay','Total Spend','Last Interaction']].values #predictors
y=df['Churn'].values #target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Print the accuracy
print(accuracy_score(y_test, y_pred))

0.8111067961165048

[78]: #A large K means a less complex model, which can cause under-fitting
#A small K means a more complex model, which can cause over-fitting, i.e. treating noise (messy data) as real signal

# Create neighbors
neighbors = np.arange(1,15)
train_accuracies = {}
test_accuracies = {}
for neighbor in neighbors:
    # Set up a KNN Classifier
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    # Fit the model
    knn.fit(X_train, y_train)
    # Compute accuracy
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)
print(train_accuracies, '\n', test_accuracies)

{1: 1.0, 2: 0.9002699081535563, 3: 0.9020757684615235, 4: 0.8830851084487077, 5:
0.8780170488747354, 6: 0.8719780966620712, 7: 0.8671818870269326, 8:
0.8639779413192489, 9: 0.8601526243228024, 10: 0.8580360783704538, 11:
0.8534534651158275, 12: 0.8532592865880891, 13: 0.8506378764636207, 14:
0.8498805802054409}
{1: 0.785242718446602, 2: 0.7761553398058253, 3: 0.8040388349514563, 4:
0.8034174757281554, 5: 0.8111067961165048, 6: 0.8150679611650485, 7:
0.817009708737864, 8: 0.8203495145631068, 9: 0.8199611650485437, 10:
0.8218252427184466, 11: 0.8213592233009709, 12: 0.8240776699029126, 13:
0.8225242718446601, 14: 0.8233009708737864}

[79]: #Visualising Under-fitting and over-fitting with changing k:


# Add a title
import matplotlib.pyplot as plt
plt.title("KNN: Varying Number of Neighbors")
# Plot training accuracies
plt.plot(neighbors, train_accuracies.values(), label="Training Accuracy")
# Plot test accuracies
plt.plot(neighbors, test_accuracies.values(), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
# Display the plot
plt.show()

[80]: #Accuracy is not always a good metric for assessing a classification model. It is the ratio of the number of correct predictions to the total number of predictions. Accuracy gives a general measure of how well the model performs overall, but it can be misleading in cases of class imbalance, where the majority class dominates the metric.

#We use the classification report to check other facts about the model, i.e.:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
#Precision is the ratio of true positive predictions to the total number of predicted positive cases. Precision measures the accuracy of positive predictions: it tells you how many of the predicted positives are actually positive. High precision means fewer false positives.

#Recall is the ratio of true positive predictions to the total number of actual positive cases. It measures how well the model identifies positive cases, i.e. the proportion of actual positives that were correctly predicted. High recall means fewer false negatives.

#The F1 score is the harmonic mean of precision and recall. It combines both metrics into a single score that balances their trade-offs. The F1 score is especially useful when you need a single metric to evaluate the performance of a model, particularly when dealing with imbalanced classes, because it considers both false positives and false negatives. A higher F1 score indicates a better balance between precision and recall.

#Which metric to use depends on the model's requirements. Use accuracy when there is no class imbalance and you care about the overall correctness of the model. Use precision when false positives are more costly than false negatives and you want to minimize false alarms, e.g. if you want to minimize non-spam emails being labeled as spam. Use recall when false negatives are more costly than false positives and you want to capture as many positives as possible, e.g. you prefer to catch all possible cases of a disease, even if some non-diseased people are flagged. Use F1 when there is an imbalance between classes, you need a balance between precision and recall, and you are concerned with both false positives and false negatives, e.g. you want to show the most relevant results (high precision) but you also don't want to miss relevant results (high recall).

precision recall f1-score support

0 0.84 0.79 0.81 6776


1 0.78 0.84 0.81 6099

accuracy 0.81 12875


macro avg 0.81 0.81 0.81 12875
weighted avg 0.81 0.81 0.81 12875
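As a sanity check on these definitions, the same per-class numbers can be recomputed by hand from the confusion matrix; a hedged sketch, assuming y_test and y_pred from the KNN cell above are still in scope:

# Recompute precision, recall and F1 for the positive class (churn = 1) by hand.
# Assumes y_test and y_pred from the cell above.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()   # 2x2 matrix flattened as tn, fp, fn, tp
precision = tp / (tp + fp)    # of the predicted positives, how many are truly positive
recall = tp / (tp + fn)       # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")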

[81]: #ii. Logistic Regression for Classification: only applicable to a binary target variable.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
# Predict on test data
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # Returns a 2-column array of probabilities for each test record; column 1 holds the positive class (1), so we take [:, 1]
print(y_prob)

[0.01971114 0.01458912 0.7208891 … 0.92076501 0.45363509 0.24609744]

[82]: #By default the probability threshold is >=0.5 for class 1 (positive). If we vary this threshold we can use the ROC curve to see how it impacts the true positive rate (tpr) and false positive rate (fpr)

from sklearn.metrics import roc_curve


# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot([0, 1], [0, 1], 'k--')
# Plot tpr against fpr
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
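To make the threshold idea concrete, here is a hedged sketch that re-labels the same probabilities with a stricter cut-off than the default 0.5 (the 0.7 value is arbitrary and only for illustration):

# Apply a custom decision threshold to the predicted probabilities.
# Assumes y_prob and y_test from the logistic regression cell above; 0.7 is an arbitrary choice.
from sklearn.metrics import precision_score, recall_score

y_pred_strict = (y_prob >= 0.7).astype(int)    # stricter than the default 0.5 threshold
print("precision:", precision_score(y_test, y_pred_strict))   # tends to rise as the threshold rises
print("recall   :", recall_score(y_test, y_pred_strict))      # tends to fall as the threshold rises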

[83]: # Import roc_auc_score:
'''The AUC value ranges from 0 to 1:
1.0: Perfect model (ideal performance with no errors).
0.5: Random guessing (the model is no better than chance).
0.0: Worst-case scenario (the model is predicting everything incorrectly).'''
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
# Calculate roc_auc_score
print(roc_auc_score(y_test, y_prob))
# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))
# Calculate the classification report
print(classification_report(y_test, y_pred))
#The ROC curve is mostly used to assess binary classification models.

0.8945414435900391
[[5545 1248]
[1103 4979]]
precision recall f1-score support

0 0.83 0.82 0.83 6793

1 0.80 0.82 0.81 6082

accuracy 0.82 12875


macro avg 0.82 0.82 0.82 12875
weighted avg 0.82 0.82 0.82 12875

[84]: #Regularizing Logistic Regression: regularization is used to overcome overfitting in logistic regression, which mainly occurs due to large coefficient values chosen by the model while minimizing the loss function.

#i. Ridge Regression (L2 Regularization): minimizes the loss function while adding a penalty proportional to the sum of squared coefficients. Hence, it shrinks the coefficients towards zero, but doesn't eliminate them. Helps in situations with multicollinearity.

#ii. Lasso Regression (L1 Regularization): minimizes the loss function while adding a penalty proportional to the sum of the absolute values of the coefficients. Lasso can shrink some coefficients exactly to zero, effectively performing feature selection, i.e. removing redundant features.

#The algorithm for L1 is similar to L2; we can also use it to check the importance of features in the model based on the lasso coefficients (the coefficient of each feature in the model equation).

#iii. Elastic Net (Combination of L1 and L2 Regularization): combines the Ridge (L2) and Lasso (L1) regularization penalties. It combines the benefits of Ridge and Lasso by shrinking coefficients while still allowing feature selection.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Initialize the model
model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000) #penalty: specifies the norm used in the penalization ('l1', 'l2', 'elasticnet', or 'none'). C: inverse of regularization strength; smaller values specify stronger regularization. solver: algorithm to use in the optimization problem ('liblinear' for L1, 'lbfgs' and others for L2, 'saga' for Elastic Net).
# Train the model
model.fit(X_train, y_train)
# Predict on test data
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # Returns a 2-column array of probabilities for each test record; column 1 holds the positive class (1), so we take [:, 1]
print(y_prob)
print(classification_report(y_test, y_pred))
#Note that we select an optimized value of C by hyperparameter tuning for best results.

[0.01971114 0.01458912 0.7208891 … 0.92076501 0.45363509 0.24609744]


precision recall f1-score support

0 0.83 0.82 0.83 6793
1 0.80 0.82 0.81 6082

accuracy 0.82 12875


macro avg 0.82 0.82 0.82 12875
weighted avg 0.82 0.82 0.82 12875
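The cell above only fits the L2 penalty; below is a hedged sketch of the L1 (lasso-style) variant on the same churn split, mainly to show that a strong enough penalty drives some coefficients exactly to zero (the C value is an arbitrary illustrative choice):

# L1-penalized logistic regression on the same churn split.
# Assumes X_train and y_train from the cell above; C=0.1 is illustrative only.
from sklearn.linear_model import LogisticRegression

model_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)   # liblinear supports the L1 penalty
model_l1.fit(X_train, y_train)
print(model_l1.coef_)   # with strong regularization some coefficients can become exactly 0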

[85]: #iii. Multi-Class Logistic Regression: the idea of logistic regression can be extended to multi-class prediction as follows:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#>One-vs-Rest Logistic Regression: groups one class C1 against all other classes Ci, then based on predict_proba checks whether the target belongs to C1 or Ci. If Ci, it repeats the same procedure for the remaining classes.
lr_ovr = LogisticRegression(multi_class="ovr")
lr_ovr.fit(X_train, y_train)
print("One-vs-Rest Logistic Regression")
print("Training Accuracy:", lr_ovr.score(X_train, y_train))
print("Test Accuracy :", lr_ovr.score(X_test, y_test))
print("\nClassification Report:\n", classification_report(y_test, lr_ovr.predict(X_test)))

#>Multinomial (Softmax) Logistic Regression: uses the softmax function to model the probability distribution over all K classes.
lr_mn = LogisticRegression(multi_class="multinomial")
lr_mn.fit(X_train, y_train)
print("Multinomial Logistic Regression")
print("Training Accuracy:", lr_mn.score(X_train, y_train))
print("Test Accuracy :", lr_mn.score(X_test, y_test))
print("\nClassification Report:\n", classification_report(y_test, lr_mn.predict(X_test)))

#For a large number of classes, OvR becomes computationally expensive

One-vs-Rest Logistic Regression


Training Accuracy: 0.95
Test Accuracy : 0.9666666666666667

Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 0.89 0.94 9
2 0.92 1.00 0.96 11

accuracy 0.97 30
macro avg 0.97 0.96 0.97 30
weighted avg 0.97 0.97 0.97 30

Multinomial Logistic Regression


Training Accuracy: 0.975
Test Accuracy : 1.0

Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
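As a hedged check that the multinomial model really applies the softmax to the per-class scores, predict_proba can be compared with a softmax of decision_function (scipy is assumed to be available):

# Verify that multinomial logistic regression probabilities equal softmax(decision scores).
# Assumes lr_mn and X_test from the cell above; requires scipy.
import numpy as np
from scipy.special import softmax

scores = lr_mn.decision_function(X_test)     # one raw score per class and sample
manual_probs = softmax(scores, axis=1)       # softmax across the K classes
print(np.allclose(manual_probs, lr_mn.predict_proba(X_test)))   # expected: True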

[86]: #iv. SVC (Support Vector Classifier): SVCs aim to find the optimal boundary (hyperplane) that best separates data points of different classes with the maximum possible margin. This optimal hyperplane ensures that future data points are classified with higher confidence. They are similar to logistic regression, but SVC is more robust due to its ability to create boundaries other than linear ones, so it is also applicable to data with a non-linear boundary, and it is computationally fast and effective. Using training data, they create separating hyperplanes (a line, plane, ellipse, etc.) between classes using the boundary data points (support vectors), hence ensuring maximum class margins (Margin: the distance between the hyperplane and the closest data points from each class). SVM aims to maximize this margin to enhance generalization.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import datasets
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=42)
svm = SVC(kernel="linear") #Here our data can be separated by a linear boundary
'''Linear Kernel (kernel='linear'): Suitable for linearly separable data. Faster to compute.
RBF Kernel (kernel='rbf'): Handles non-linear relationships by mapping data to a higher-dimensional space.
Polynomial Kernel (kernel='poly'): Useful for data with polynomial relationships.
Sigmoid Kernel (kernel='sigmoid'): Less commonly used; can behave like a neural network.'''
svm.fit(X_train, y_train)
train_accuracy = svm.score(X_train, y_train)
test_accuracy = svm.score(X_test, y_test)
print(f"SVC Training Accuracy: {train_accuracy:.4f}")
print(f"SVC Test Accuracy : {test_accuracy:.4f}")

SVC Training Accuracy: 1.0000


SVC Test Accuracy : 0.9822

[87]: #Let's apply SVC to data with a non-linear boundary:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
# Generate synthetic non-linear data
X, y = make_circles(n_samples=500, factor=0.5, noise=0.1, random_state=42)
print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
print(y)

Feature matrix shape: (500, 2)


Target vector shape: (500,)
[1 0 1 0 0 1 1 0 0 1 0 0 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 0 0 1 1
0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 1 1 1 1
1 1 1 0 1 1 0 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 1 0 0
1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1 0 1 1 0 0
1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 1 1 1
0 1 0 0 0 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 0 1 1
0 0 0 1 0 1 1 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 1 0 1 1 0 1 0
1 1 0 1 0 0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0
1 1 1 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 1 1 0 1 1 0 1 0 1 0 1 0 0 0 1 0 0 0
0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1 0 0 0 1 0 0 0 0 1 1 0 1
0 1 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 1
1 1 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 0
1 1 0 1 1 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0
0 1 0 1 0 1 1 0 1 0 1 0 0 0 0 1 1 1 0]

[88]: from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

Training set size: (400, 2)


Test set size: (100, 2)

[89]: from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
# Fit the scaler on training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Display mean and variance of the scaled features
print("Mean of scaled features (training set):", np.mean(X_train_scaled,␣
↪axis=0))

print("Variance of scaled features (training set):", np.var(X_train_scaled,␣


↪axis=0))

Mean of scaled features (training set): [6.80011603e-18 7.21644966e-18]


Variance of scaled features (training set): [1. 1.]

[90]: #Tuning Hyperparameters first: (We will learn about it later)


from sklearn.model_selection import GridSearchCV
# Instantiate an RBF SVM
svm = SVC()
# Instantiate the GridSearchCV object and run the search
parameters = {'C':[0.1, 1, 10], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(svm, parameters)
searcher.fit(X_train, y_train)
print("Best CV params", searcher.best_params_)

Best CV params {'C': 10, 'gamma': 0.1}

[91]: from sklearn.svm import SVC


# Initialize the SVM classifier with RBF kernel
svm_rbf = SVC(kernel='rbf', C=10, gamma=0.1, probability=True, random_state=42) #gamma = kernel coefficient, controls the smoothness of the boundary; larger gamma means more complex boundaries

# Train the SVM model


svm_rbf.fit(X_train_scaled, y_train)
# Evaluate accuracy on training and test sets
train_accuracy = svm_rbf.score(X_train_scaled, y_train)
test_accuracy = svm_rbf.score(X_test_scaled, y_test)
print(f"SVM with RBF Kernel - Training Accuracy: {train_accuracy:.2f}")
print(f"SVM with RBF Kernel - Test Accuracy: {test_accuracy:.2f}")

SVM with RBF Kernel - Training Accuracy: 0.99


SVM with RBF Kernel - Test Accuracy: 0.98

[92]: #2. Regression:
import pandas as pd
df = pd.read_csv("diabetes_clean.csv") #https://www.kaggle.com/datasets/
↪saurabh00007/diabetescsv

df.head()

[92]: pregnancies glucose diastolic triceps insulin bmi dpf age \


0 6 148 72 35 0 33.6 0.627 50
1 1 85 66 29 0 26.6 0.351 31
2 8 183 64 0 0 23.3 0.672 32
3 1 89 66 23 94 28.1 0.167 21
4 0 137 40 35 168 43.1 2.288 33

diabetes
0 1
1 0
2 1
3 0
4 1

[93]: from sklearn.model_selection import train_test_split


from sklearn.linear_model import LinearRegression
X = df.drop("glucose", axis=1).values
y = df["glucose"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
#Under the hood this regression performs OLS (ordinary least squares), i.e. minimizing the sum of squared distances (residuals) between the actual data points and the fitted points (the loss/cost/error function)

[94]: #Testing model performance:


reg_all.score(X_test, y_test) #R-squared explains the proportion of the variance in the dependent variable (target) that is predictable from the independent variable(s) (features)

[94]: 0.28280468810375137

[95]: from sklearn.metrics import mean_squared_error


rmse=mean_squared_error(y_test, y_pred, squared=False) #square root of the mean of the residual sum of squares; tells the average error in predictions
rmse

C:\Users\14274\anaconda3\Lib\site-packages\sklearn\metrics\_regression.py:483:
FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in
1.6. To calculate the root mean squared error, use the
function 'root_mean_squared_error'.
warnings.warn(

[95]: 26.34145958223226
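As the FutureWarning says, the squared=False argument is deprecated; a hedged alternative that avoids it (root_mean_squared_error exists from scikit-learn 1.4 onward, while np.sqrt works on any version):

# Two non-deprecated ways to compute the RMSE.
# Assumes y_test and y_pred from the linear regression cell above.
import numpy as np
from sklearn.metrics import mean_squared_error

print(np.sqrt(mean_squared_error(y_test, y_pred)))   # works on any scikit-learn version

# On scikit-learn >= 1.4 the dedicated function can be used instead:
# from sklearn.metrics import root_mean_squared_error
# print(root_mean_squared_error(y_test, y_pred))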

[96]: #3. Cross-Validation: We tested our models by accuracy in classification and by R-squared, RMSE, etc. in regression; however, these metrics depend on the train-test split, which can be biased. Hence, we use cross-validation to handle this bias. For that purpose we split our data into k folds (k is up to you), i.e. Data = f1 + f2 + ... + fk. Then we train, test, and compute the model assessment metric, using each fi as test data once and the rest as training data. This is known as k-fold cross-validation.

# Import the necessary modules
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
# Create a KFold object
kf = KFold(n_splits=6, shuffle=True, random_state=5) #n_splits = number of folds; shuffle shuffles the data before starting the process
reg = LinearRegression() #Use a model depending on the requirement
# Compute 6-fold cross-validation scores
cv_scores = cross_val_score(reg, X, y, cv=kf)
# Print scores
print(cv_scores)
print(np.mean(cv_scores))
print(np.std(cv_scores))
print(np.quantile(cv_scores, [0.025, 0.975])) #95% confidence interval
#LOOCV: Leave-One-Out CV is CV with n_splits = len(df), i.e. the number of data points. It is the most thorough technique to estimate the overall error of our model, but it is computationally too expensive

[0.37915966 0.29257178 0.38953015 0.22314647 0.3677666 0.31175482]


0.3273215793495467
0.05844544156906046
[0.23182464 0.38823384]
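Since LOOCV is only described above, here is a hedged sketch of what it looks like with scikit-learn's LeaveOneOut splitter, run on a small slice to keep it cheap (the 100-row limit is arbitrary):

# Leave-One-Out cross-validation on a small slice (it is expensive on the full dataset).
# Assumes X and y from the diabetes cells above; the 100-row slice is illustrative.
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

loo = LeaveOneOut()
scores = cross_val_score(LinearRegression(), X[:100], y[:100], cv=loo,
                         scoring="neg_mean_squared_error")   # R-squared is undefined on a single test point
print(-scores.mean())   # average squared error over the left-out points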

[97]: #4. Regularised Regression: it is used to overcome overfitting in regression, which mainly occurs due to large coefficient values chosen by the model while minimizing the loss function.

[98]: #i. Ridge Regression (L2 Regularization): minimizes the sum of squared errors (SSE) while adding a penalty proportional to the sum of squared coefficients. Hence, it shrinks the coefficients towards zero, but doesn't eliminate them. Helps in situations with multicollinearity.

# Import Ridge
from sklearn.linear_model import Ridge
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_scores = []
for alpha in alphas:
    # Create a Ridge regression model
    ridge = Ridge(alpha=alpha)
    # Fit the data
    ridge.fit(X_train, y_train)
    # Obtain R-squared
    score = ridge.score(X_test, y_test)
    ridge_scores.append(score)
print(ridge_scores) #notice that R-squared only drops noticeably for very large values of alpha (over-penalizing the loss function), which indicates the onset of underfitting. For practical modeling we don't need to simulate over different values of alpha by hand; take one (tuned) value and fit the model.
[0.28284666232222233, 0.28320633574804777, 0.2853000732200006,


0.26423984812668133, 0.19292424694100951, 0.1768272855049815]

[99]: #ii. Lasso Regression (L1 Regularization): minimizes the sum of squared errors (SSE) while adding a penalty proportional to the sum of the absolute values of the coefficients. Lasso can shrink some coefficients exactly to zero, effectively performing feature selection, i.e. removing redundant features.

#The algorithm for L1 is similar to L2; here we use it to check the importance of features in the regression model based on the lasso coefficients (the coefficient of each feature in the regression equation).

# Import Lasso
from sklearn.linear_model import Lasso
import matplotlib.pyplot as plt
# Instantiate a lasso regression model
lasso = Lasso()
# Fit the model to the data
lasso.fit(X, y)
# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)
df_features=df.drop('glucose',axis=1)
plt.bar(df_features.columns,lasso_coef)
plt.xticks(rotation=45)
plt.show()

[-0.22906289 0.1058408 -0.28282436 0.09302862 0.38436673 0.


0.49564721 19.95533798]

[100]: #iii. Elastic Net (Combination of L1 and L2 Regularization): combines the Ridge (L2) and Lasso (L1) regularization penalties. It combines the benefits of Ridge and Lasso by shrinking coefficients while still allowing feature selection.
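Since Elastic Net is only described here, below is a hedged sketch using sklearn's ElasticNet on the same diabetes train/test split as the Ridge example; the alpha and l1_ratio values are arbitrary illustrative choices:

# Elastic Net regression on the same split as the Ridge example above.
# Assumes X_train, X_test, y_train, y_test from the diabetes cells; alpha and l1_ratio are illustrative.
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=1.0, l1_ratio=0.5)   # l1_ratio=0.5 mixes the L1 and L2 penalties evenly
enet.fit(X_train, y_train)
print("R-squared:", enet.score(X_test, y_test))
print("Coefficients:", enet.coef_)   # the L1 part may zero some coefficients, the L2 part shrinks the rest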

[101]: #When to Use Regularized Regression:

'''Ridge: When you expect that all predictors contribute a bit to the prediction and you want to prevent multicollinearity.
Lasso: When you want to perform feature selection, especially when many features are irrelevant.
Elastic Net: When you need a balance between Ridge and Lasso, especially when you have correlated features.'''
print('.')

[102]: #5. Hyperparameter Tuning: the search for the best/optimised parameter values (k, alpha, etc., i.e. the arguments of the model constructor) for our model.

#i. GridSearchCV: it evaluates the model for each value in a given range
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
# Set up the parameter grid i.e. which parameters to search
param_grid = {"alpha": np.linspace(0.00001, 1, 20)}
kf = KFold(n_splits=6, shuffle=True, random_state=5) #for cross-validation to avoid over-fitting
# Instantiate lasso_cv
lasso = Lasso() #model instantiate
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)
# Fit to the training data
lasso_cv.fit(X_train, y_train)
print("Tuned lasso paramaters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))

Tuned lasso paramaters: {'alpha': 0.10527210526315789}


Tuned lasso score: 0.3502520430880187

[103]: #ii. RandomizedSearchCV: GridSearchCV is good, but it is exhaustive for big data with many K-folds and many model hyperparameters. Hence, we use this technique, which randomly picks hyperparameter values.
from sklearn.model_selection import RandomizedSearchCV
# Set up the parameter grid i.e. which parameters to search
param_grid = {"alpha": np.linspace(0.00001, 1, 20)}
kf = KFold(n_splits=6, shuffle=True, random_state=5) #for cross-validation to avoid over-fitting
# Instantiate lasso_cv
lasso = Lasso() #model instantiate
lasso_cv = RandomizedSearchCV(lasso, param_grid, cv=kf, n_iter=3) #n_iter is optional and specifies how many random parameter values to test
# Fit to the training data
lasso_cv.fit(X_train, y_train)
print("Tuned lasso paramaters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))

Tuned lasso paramaters: {'alpha': 0.47368947368421055}


Tuned lasso score: 0.3481428249066394

[104]: #Remember that we must apply data science, i.e. cleaning, analysis, visualisation to understand relations, and preprocessing, before applying machine learning models to our data.

[105]: #6. Pipelining the ML project: handling categorical data, missing data, data preprocessing and more in one go. For conversion of categorical data to numerical we already know many techniques such as one-hot encoding, binary encoding, the pandas get_dummies method, etc., so we must convert our data to numbers before applying ML models. For handling missing data we also know various data analysis techniques; however, sklearn has a method of its own for filling missing data using the sklearn.impute module. Here we also discuss ML pipelining, where we pipeline the steps the code should perform on the data set to create the model.

import pandas as pd
df=pd.read_csv('music_genre.csv') #
df.head()

[105]: instance_id artist_name track_name popularity \


0 32894.0 Röyksopp Röyksopp's Night Out 27.0
1 46652.0 Thievery Corporation The Shining Path 31.0
2 30097.0 Dillon Francis Hurricane 28.0
3 62177.0 Dubloadz Nitro 34.0
4 24907.0 What So Not Divide & Conquer 32.0

acousticness danceability duration_ms energy instrumentalness key \


0 0.00468 0.652 -1.0 0.941 0.79200 A#
1 0.01270 0.622 218293.0 0.890 0.95000 D
2 0.00306 0.620 215613.0 0.755 0.01180 G#
3 0.02540 0.774 166875.0 0.700 0.00253 C#
4 0.00465 0.638 222369.0 0.587 0.90900 F#

liveness loudness mode speechiness tempo obtained_date \


0 0.115 -5.201 Minor 0.0748 100.889 4-Apr
1 0.124 -7.043 Minor 0.0300 115.00200000000001 4-Apr
2 0.534 -4.617 Major 0.0345 127.994 4-Apr
3 0.157 -4.498 Major 0.2390 128.014 4-Apr
4 0.157 -6.266 Major 0.0413 145.036 4-Apr

valence music_genre
0 0.759 Electronic
1 0.531 Electronic
2 0.333 Electronic
3 0.270 Electronic
4 0.323 Electronic

[106]: df.drop(['instance_id','artist_name','track_name','obtained_date','key'],axis=1,inplace=True) #dropping redundant features
df.head()

[106]: popularity acousticness danceability duration_ms energy \
0 27.0 0.00468 0.652 -1.0 0.941
1 31.0 0.01270 0.622 218293.0 0.890
2 28.0 0.00306 0.620 215613.0 0.755
3 34.0 0.02540 0.774 166875.0 0.700
4 32.0 0.00465 0.638 222369.0 0.587

instrumentalness liveness loudness mode speechiness \


0 0.79200 0.115 -5.201 Minor 0.0748
1 0.95000 0.124 -7.043 Minor 0.0300
2 0.01180 0.534 -4.617 Major 0.0345
3 0.00253 0.157 -4.498 Major 0.2390
4 0.90900 0.157 -6.266 Major 0.0413

tempo valence music_genre


0 100.889 0.759 Electronic
1 115.00200000000001 0.531 Electronic
2 127.994 0.333 Electronic
3 128.014 0.270 Electronic
4 145.036 0.323 Electronic

[107]: df=pd.get_dummies(df, columns=['music_genre','mode']) #Converting categorical data to binary
df.head()

[107]: popularity acousticness danceability duration_ms energy \


0 27.0 0.00468 0.652 -1.0 0.941
1 31.0 0.01270 0.622 218293.0 0.890
2 28.0 0.00306 0.620 215613.0 0.755
3 34.0 0.02540 0.774 166875.0 0.700
4 32.0 0.00465 0.638 222369.0 0.587

instrumentalness liveness loudness speechiness tempo … \


0 0.79200 0.115 -5.201 0.0748 100.889 …
1 0.95000 0.124 -7.043 0.0300 115.00200000000001 …
2 0.01180 0.534 -4.617 0.0345 127.994 …
3 0.00253 0.157 -4.498 0.2390 128.014 …
4 0.90900 0.157 -6.266 0.0413 145.036 …

music_genre_Blues music_genre_Classical music_genre_Country \


0 False False False
1 False False False
2 False False False
3 False False False
4 False False False

music_genre_Electronic music_genre_Hip-Hop music_genre_Jazz \

0 True False False
1 True False False
2 True False False
3 True False False
4 True False False

music_genre_Rap music_genre_Rock mode_Major mode_Minor


0 False False False True
1 False False False True
2 False False True False
3 False False True False
4 False False True False

[5 rows x 23 columns]

[108]: df.dtypes #look at the tempo column's data type

[108]: popularity float64


acousticness float64
danceability float64
duration_ms float64
energy float64
instrumentalness float64
liveness float64
loudness float64
speechiness float64
tempo object
valence float64
music_genre_Alternative bool
music_genre_Anime bool
music_genre_Blues bool
music_genre_Classical bool
music_genre_Country bool
music_genre_Electronic bool
music_genre_Hip-Hop bool
music_genre_Jazz bool
music_genre_Rap bool
music_genre_Rock bool
mode_Major bool
mode_Minor bool
dtype: object

[109]: df=df.drop(df[df['tempo']=='?'].index) #the tempo column had an inconsistent value '?' so we applied some data science and dropped all rows with that value.
df['tempo']=df['tempo'].astype('float')
df.dtypes

[109]: popularity float64
acousticness float64
danceability float64
duration_ms float64
energy float64
instrumentalness float64
liveness float64
loudness float64
speechiness float64
tempo float64
valence float64
music_genre_Alternative bool
music_genre_Anime bool
music_genre_Blues bool
music_genre_Classical bool
music_genre_Country bool
music_genre_Electronic bool
music_genre_Hip-Hop bool
music_genre_Jazz bool
music_genre_Rap bool
music_genre_Rock bool
mode_Major bool
mode_Minor bool
dtype: object

[110]: # The modeling below includes pipelining, cross-validation, hyperparameter tuning and model assessment all in one

from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold


from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Drop rows with missing target values (loudness)
df = df.dropna(subset=['loudness'])
# Set features and target variable
X = df.drop("loudness", axis=1).values # Set of features
y = df["loudness"].values # Target variable
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate an imputer, standard scaler, and Lasso model


imputer = SimpleImputer() # Handles missing values in features
SS = StandardScaler() # Standardizes features
lasso = Lasso(max_iter=5000) # Lasso regression model
# Build steps for the pipeline

steps = [("imputer", imputer), ("scaler", SS), ("lasso", lasso)]
# Create the pipeline
pipeline = Pipeline(steps)
# Cross-validation setup
kf = KFold(n_splits=6, shuffle=True, random_state=5)
# Hyperparameter tuning setup
param = {"lasso__alpha": np.linspace(0.0001, 1, 20)} # Alpha range for Lasso
# Use RandomizedSearchCV with the pipeline
cv = RandomizedSearchCV(pipeline, param, cv=kf, n_iter=3, random_state=42)
# Fit the RandomizedSearchCV pipeline to the training data
cv.fit(X_train, y_train)
# Make predictions on the test set
y_pred = cv.predict(X_test)
# Evaluate the model using appropriate regression metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Print evaluation metrics
print(f"Mean Squared Error: {mse}")
print(f"R² Score: {r2}")
# Get the best hyperparameters found by RandomizedSearchCV
print(f"Best Parameters: {cv.best_params_}")

Mean Squared Error: 7.361469722085846


R² Score: 0.8088190354074949
Best Parameters: {'lasso__alpha': 0.0001}

[111]: #7. Evaluating Multiple Models at Once:


import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv('Churn.csv')
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier #We will learn about this soon
X=df[['Tenure','Usage Frequency','Support Calls','Payment Delay','Total Spend','Last Interaction']].values #predictors
y=df['Churn'].values #target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create models dictionary
models = {"Logistic Regression": LogisticRegression(), "KNN": KNeighborsClassifier(), "Decision Tree Classifier": DecisionTreeClassifier()}

results = []
# Loop through the models' values
for model in models.values():
    # Instantiate a KFold object
    kf = KFold(n_splits=6, random_state=12, shuffle=True)
    # Perform cross-validation
    cv_results = cross_val_score(model, X_train, y_train, cv=kf)
    results.append(cv_results)
print(results)
plt.boxplot(results, labels=models.keys())
plt.show()

[array([0.81220876, 0.81288594, 0.80810905, 0.80927415, 0.80880811,


0.81463358]), array([0.81139329, 0.80927415, 0.81043924, 0.81673075,
0.80577887,
0.81381801]), array([0.85869059, 0.8633345 , 0.86275195, 0.8620529 ,
0.86193639,
0.86554818])]

[112]: #Nota bene: while applying any ML model in supervised learning, to check whether the model is suitable for the data, assess the model on the training data, i.e. find model.score(X_train,y_train); for post-fitting assessment we use model.score(X_test,y_test). This will help you choose the best model for your data, which you can then make more accurate/precise by hyperparameter tuning.

#In supervised learning, overfitting refers to doing better on the training set than on the test set.

[113]: #Unsupervised Learning:

[114]: #1. K-Means Clustering:

import numpy as np
train_points.shape #train_points contains a 2D array with two columns (features) and 300 rows (records)

[114]: (300, 2)
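The cell above assumes a train_points array whose definition is not shown in this extract; a hedged synthetic stand-in with the same shape can be generated like this (make_blobs with 3 centers is an assumption, chosen to match the 3 clusters used below):

# Hypothetical stand-in for train_points: 300 two-dimensional points in 3 blobs.
# This is an assumption made for reproducibility; it is not the original data.
from sklearn.datasets import make_blobs

train_points, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)
print(train_points.shape)   # (300, 2)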

[115]: from sklearn.cluster import KMeans


# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)
# Fit model to train points
model.fit(train_points)
# Determine the cluster labels of new_points: labels
new_points=np.array([[-1.45703573e+00, -2.91842036e-01],[-1.59048842e+00, 1.66063031e-01],
    [ 9.25549284e-01, 7.41406406e-01],[ 1.97245469e-01, -7.80703225e-01],
    [ 2.88401697e-01, -8.32425551e-01],[ 7.24141618e-01, -7.99149200e-01],
    [-1.62658639e+00, -1.80005543e-01],[ 5.84481588e-01, 1.13195640e+00],
    [ 1.02146732e+00, 4.59657799e-01],[ 8.65050554e-01, 9.57714887e-01],
    [ 3.98717766e-01, -1.24273147e+00],[ 8.62234892e-01, 1.10955561e+00],
    [-1.35999430e+00, 2.49942654e-02],[-1.19178505e+00, -3.82946323e-02],
    [ 1.29392424e+00, 1.10320509e+00],[ 1.25679630e+00, -7.79857582e-01],
    [ 9.38040302e-02, -5.53247258e-01],[-1.73512175e+00, -9.76271667e-02],
    [ 2.23153587e-01, -9.43474351e-01],[ 4.01989100e-01, -1.10963051e+00],
    [-1.42244158e+00, 1.81914703e-01],[ 3.92476267e-01, -8.78426277e-01],
    [ 1.25181875e+00, 6.93614996e-01],[ 1.77481317e-02, -7.20304235e-01],
    [-1.87752521e+00, -2.63870424e-01],[-1.58063602e+00, -5.50456344e-01],
    [-1.59589493e+00, -1.53932892e-01],[-1.01829770e+00, 3.88542370e-02],
    [ 1.24819659e+00, 6.60041803e-01],[-1.25551377e+00, -2.96172009e-02],
    [-1.41864559e+00, -3.58230179e-01],[ 5.25758326e-01, 8.70500543e-01],
    [ 5.55599988e-01, 1.18765072e+00],[ 2.81344439e-02, -6.99111314e-01]])
↪70500543e-01],[ 5.55599988e-01, 1.18765072e+00],[ 2.81344439e-02, -6.
↪99111314e-01]])

labels = model.predict(new_points)
# Print cluster labels of new_points
print(labels)

[1 1 0 2 2 2 1 0 0 0 2 0 1 1 0 2 2 1 2 2 1 2 0 2 1 1 1 1 0 1 1 0 0 2]
C:\Users\14274\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1446:
UserWarning: KMeans is known to have a memory leak on Windows with MKL, when
there are less chunks than available threads. You can avoid it by setting the
environment variable OMP_NUM_THREADS=2.
warnings.warn(

[116]: #Visualizing Clusters: to decide the value of K for clustering we analyse the data by visualization and then decide how many clusters are possible.

import matplotlib.pyplot as plt


# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs,ys,c=labels,alpha=0.8)
# Assign the cluster centers: centroids
centroids = model.cluster_centers_ #The mean of a cluster is called its centroid.

# Assign the columns of centroids: centroids_x, centroids_y


centroids_x = centroids[:,0]
centroids_y = centroids[:,1]
# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x,centroids_y,marker="D",s=50)
plt.show()

[117]: #Assessing Clustering:

#i. Inertia for K: we use the inertia metric to check the quality of our clusters. It is a measure of the tightness of our clusters (i.e. variability and outliers). A good clustering model has a small number of clusters that are tight. So we can check the variation in inertia of our model while changing the number of clusters and then select the best one.
ks = range(1, 6)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model=KMeans(n_clusters=k)
    # Fit model to samples
    model.fit(train_points)
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
#The inertia decreases very slowly from 3 clusters to 4, so it looks like 3 clusters would be a good choice for this data.

C:\Users\14274\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1446:
UserWarning: KMeans is known to have a memory leak on Windows with MKL, when
there are less chunks than available threads. You can avoid it by setting the
environment variable OMP_NUM_THREADS=2.
warnings.warn(
[118]: #ii. Crosstab for cluster analysis:
# Create a KMeans model with 3 clusters: model
import pandas as pd
model = KMeans(n_clusters=3)
# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(train_points) #fits the model and then asks it to predict
labels

C:\Users\14274\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1446:
UserWarning: KMeans is known to have a memory leak on Windows with MKL, when
there are less chunks than available threads. You can avoid it by setting the
environment variable OMP_NUM_THREADS=2.
warnings.warn(

[118]: array([1, 2, 0, 0, 2, 2, 0, 1, 2, 2, 0, 1, 2, 0, 2, 1, 0, 0, 1, 0, 2, 1,
2, 1, 1, 2, 1, 1, 1, 2, 0, 0, 0, 2, 1, 2, 1, 1, 2, 1, 1, 0, 2, 2,
2, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 2, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2,
0, 2, 2, 1, 0, 2, 0, 1, 0, 2, 1, 1, 1, 0, 1, 2, 0, 2, 2, 2, 2, 1,
1, 0, 2, 0, 2, 1, 1, 1, 0, 2, 2, 0, 2, 1, 2, 0, 1, 0, 0, 0, 2, 2,
1, 2, 0, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 0, 0,
2, 1, 2, 1, 0, 2, 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 1, 1, 1, 0, 0, 2,

0, 2, 1, 1, 0, 2, 0, 0, 0, 2, 1, 1, 2, 0, 0, 1, 1, 0, 1, 1, 2, 1,
0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 2, 0, 1, 1, 1, 1, 2, 0, 1, 2, 2, 2,
1, 2, 1, 1, 2, 0, 0, 1, 0, 1, 1, 2, 2, 1, 0, 2, 0, 1, 0, 2, 1, 2,
2, 2, 2, 0, 0, 0, 1, 1, 2, 1, 0, 2, 1, 1, 2, 1, 0, 0, 0, 0, 0, 2,
1, 1, 0, 0, 1, 2, 0, 2, 2, 1, 1, 2, 2, 2, 1, 0, 1, 2, 1, 0, 0, 0,
0, 0, 1, 1, 2, 1, 1, 2, 0, 0, 2, 1, 0, 0, 2, 2, 1, 1, 1, 2, 2, 1,
0, 2, 2, 0, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2])

[119]: varieties=[]
for i in labels:
    if i==0:
        varieties.append('Cluster_1')
    elif i==1:
        varieties.append('Cluster_2')
    else:
        varieties.append('Cluster_3')
print(varieties)

['Cluster_2', 'Cluster_3', 'Cluster_1', 'Cluster_1', 'Cluster_3', 'Cluster_3',


'Cluster_1', 'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_1', 'Cluster_2',
'Cluster_3', 'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_1',
'Cluster_2', 'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_3', 'Cluster_2',
'Cluster_2', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_2', 'Cluster_3',
'Cluster_1', 'Cluster_1', 'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_3',
'Cluster_2', 'Cluster_2', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_1',
'Cluster_3', 'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_1',
'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_1', 'Cluster_2', 'Cluster_2',
'Cluster_2', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_3', 'Cluster_1',
'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_3',
'Cluster_1', 'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_3',
'Cluster_1', 'Cluster_2', 'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_2',
'Cluster_2', 'Cluster_1', 'Cluster_2', 'Cluster_3', 'Cluster_1', 'Cluster_3',
'Cluster_3', 'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_1',
'Cluster_3', 'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_2',
'Cluster_1', 'Cluster_3', 'Cluster_3', 'Cluster_1', 'Cluster_3', 'Cluster_2',
'Cluster_3', 'Cluster_1', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_1',
'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_3', 'Cluster_1', 'Cluster_3',
'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_3', 'Cluster_1', 'Cluster_1',
'Cluster_2', 'Cluster_2', 'Cluster_2', 'Cluster_2', 'Cluster_2', 'Cluster_3',
'Cluster_1', 'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_1', 'Cluster_1',
'Cluster_3', 'Cluster_2', 'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_3',
'Cluster_1', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_2', 'Cluster_1',
'Cluster_1', 'Cluster_2', 'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_2',
'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_3', 'Cluster_1', 'Cluster_3',
'Cluster_2', 'Cluster_2', 'Cluster_1', 'Cluster_3', 'Cluster_1', 'Cluster_1',
'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_3', 'Cluster_1',
'Cluster_1', 'Cluster_2', 'Cluster_2', 'Cluster_1', 'Cluster_2', 'Cluster_2',

'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_1', 'Cluster_2',
'Cluster_2', 'Cluster_1', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_2',
'Cluster_3', 'Cluster_1', 'Cluster_2', 'Cluster_2', 'Cluster_2', 'Cluster_2',
'Cluster_3', 'Cluster_1', 'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_3',
'Cluster_2', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_3', 'Cluster_1',
'Cluster_1', 'Cluster_2', 'Cluster_1', 'Cluster_2', 'Cluster_2', 'Cluster_3',
'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_3', 'Cluster_1', 'Cluster_2',
'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_3',
'Cluster_3', 'Cluster_1', 'Cluster_1', 'Cluster_1', 'Cluster_2', 'Cluster_2',
'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_2',
'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_1', 'Cluster_1',
'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_1', 'Cluster_1',
'Cluster_2', 'Cluster_3', 'Cluster_1', 'Cluster_3', 'Cluster_3', 'Cluster_2',
'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_1',
'Cluster_2', 'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_1',
'Cluster_1', 'Cluster_1', 'Cluster_2', 'Cluster_2', 'Cluster_3', 'Cluster_2',
'Cluster_2', 'Cluster_3', 'Cluster_1', 'Cluster_1', 'Cluster_3', 'Cluster_2',
'Cluster_1', 'Cluster_1', 'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_2',
'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_3',
'Cluster_3', 'Cluster_1', 'Cluster_2', 'Cluster_2', 'Cluster_2', 'Cluster_3',
'Cluster_2', 'Cluster_2', 'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_3']

[120]: # Create a DataFrame with labels and varieties as columns: df


df = pd.DataFrame({'labels': labels, 'varieties': varieties})
# Create crosstab: ct
ct = pd.crosstab(df['labels'],df['varieties'])
# Display ct
print(ct)
#The cross-tabulation shows that the 3 varieties of clusters separate really well into 3 clusters. But depending on the type of data you are working with, the clustering may not always be this good. Is there anything you can do in such situations to improve your clustering? You'll find out next!

varieties Cluster_1 Cluster_2 Cluster_3


labels
0 94 0 0
1 0 111 0
2 0 0 95

[121]: #When data has high variability it can negatively impact clustering, so we should standardize (put every feature on the same scale) or normalize (scale based on each column's own data) our data before applying clustering algorithms.

from sklearn.preprocessing import StandardScaler


from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans
import pandas as pd
scaler = StandardScaler()

kmeans = KMeans(n_clusters=3)
pipeline = make_pipeline(scaler,kmeans)
pipeline.fit(train_points)
labels = pipeline.predict(train_points)
df = pd.DataFrame({'labels':labels,'varieties':varieties})
ct = pd.crosstab(df['labels'],df['varieties'])
# Display ct
print(ct)

varieties Cluster_1 Cluster_2 Cluster_3


labels
0 0 0 95
1 93 0 0
2 1 111 0
C:\Users\14274\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1446:
UserWarning: KMeans is known to have a memory leak on Windows with MKL, when
there are less chunks than available threads. You can avoid it by setting the
environment variable OMP_NUM_THREADS=2.
warnings.warn(

[122]: #2. Hierarchical Clustering for Visualization of Data: creates a hierarchy of any sort of data for better visualization for non-technical audiences.
print(train_points[0:50].shape)

(50, 2)

[123]: from scipy.cluster.hierarchy import linkage,dendrogram

import matplotlib.pyplot as plt
# Calculate the linkage: mergings
mergings = linkage(train_points[0:50],method='complete') #method sets how the distance between clusters is measured, i.e. complete = max distance, single = min distance between any pair of data points from the two clusters
# Plot the dendrogram, using varieties as labels
dendrogram(mergings,labels=['C'+str(i) for i in range(0,50)],
           leaf_rotation=90,
           leaf_font_size=6,)
plt.show()
#In the output plot the x-labels are all clusters generated by HC, while the y-axis shows the distance between clusters. Clustering proceeds from top to bottom, so if you draw a horizontal line at any point on the y-axis, the number of lines it intersects is the number of clusters at that point.

[124]: #We can print out the data of these clusters as below:
import pandas as pd
from scipy.cluster.hierarchy import fcluster
# Use fcluster to extract labels: labels
labels = fcluster(mergings, 2, criterion='distance') #2 is the height limit on the y-axis (distance)
# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties':['C'+str(i) for i in range(0,50)]})
# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])
# Display ct
ct
#Note there is no prediction step in HC; it is just for visualization

[124]: varieties C0 C1 C10 C11 C12 C13 C14 C15 C16 C17 … C45 C46 C47 \
labels …
1 0 1 0 0 1 0 1 0 0 0 … 0 0 0
2 0 0 1 0 0 1 0 0 1 1 … 0 0 1
3 1 0 0 1 0 0 0 1 0 0 … 1 1 0

varieties C48 C49 C5 C6 C7 C8 C9

labels
1 0 0 1 0 0 1 1
2 0 1 0 1 0 0 0
3 1 0 0 0 1 0 0

[3 rows x 50 columns]

[125]: #3. t-SNE for 2D visualization of higher-dimensional data: HC is good for small data, but for big data we use t-SNE.
import pandas as pd
df=pd.read_csv('ANSUR II FEMALE Public.csv') #https://www.kaggle.com/datasets/seshadrikolluri/ansur-ii
numeric_df = df.select_dtypes(include=['number']) #t-SNE is applied to numeric data only; convert categorical to numeric if you want

print(numeric_df.shape)
from sklearn.manifold import TSNE
m = TSNE(learning_rate=50)
tsne_features = m.fit_transform(numeric_df)
print(tsne_features) #Reduced to 2D
#Assigning t-SNE features to our dataset
df['x'] = tsne_features[:,0]
df['y'] = tsne_features[:,1]
import seaborn as sns
sns.scatterplot(x="x", y="y", data=df)
plt.show()

(1986, 99)
[[-45.502087 23.22181 ]
[-45.30828 24.381142]
[-44.91254 24.29287 ]

[ 36.55094 -24.02615 ]
[ 40.648514 -21.196135]
[ 39.447086 -25.889902]]

[126]: #We can further customize our plot based on some categories the data belong to:
cat_df = df.select_dtypes(include=['object'])
cat_df.head()

[126]: Gender Date Installation Component Branch \


0 Female 5-Oct-10 Fort Hood Regular Army Combat Support
1 Female 5-Oct-10 Fort Hood Regular Army Combat Service Support
2 Female 5-Oct-10 Fort Hood Regular Army Combat Service Support
3 Female 5-Oct-10 Fort Hood Regular Army Combat Service Support
4 Female 5-Oct-10 Fort Hood Regular Army Combat Arms

PrimaryMOS SubjectsBirthLocation Ethnicity WritingPreference


0 92Y Germany NaN Right hand
1 25U California Mexican Right hand
2 35D Texas NaN Right hand
3 25U District of Columbia Caribbean Islander Right hand
4 42A Texas NaN Right hand

[127]: #Segregation based on the Branch feature:

sns.scatterplot(x="x", y="y",hue='Branch', data=df)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',fontsize='small', markerscale=0.7)
plt.show()

[128]: #Segregation based on the Installation feature:

sns.scatterplot(x="x", y="y",hue='Installation', data=df)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',fontsize='small', markerscale=0.7)
plt.show()

[129]: #Segregation based on the WritingPreference feature:
sns.scatterplot(x="x", y="y",hue='WritingPreference', data=df)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',fontsize='small', markerscale=0.7)
plt.show()
#You can see the power of t-SNE !!!!

[130]: #4. PCA for Dimensionality Reduction: PCA reduces the dimensions of a data set of features to the intrinsic dimension (the essential features only)

38
samples=np.array([[ 242. , 23.2, 25.4, 30. , 38.4, 13.4],[ 290. , ␣
↪24. , 26.3, 31.2, 40. , 13.8],[ 340. , 23.9, 26.5, 31.1, 39.
↪8, 15.1],[ 363. , 26.3, 29. , 33.5, 38. , 13.3],[ 430. , 26.5,␣
↪ 29. , 34. , 36.6, 15.1],[ 450. , 26.8, 29.7, 34.7, 39.2, ␣
↪14.2],[ 500. , 26.8, 29.7, 34.5, 41.1, 15.3],[ 390. , 27.6, 30.
↪ , 35. , 36.2, 13.4],[ 450. , 27.6, 30. , 35.1, 39.9, 13.
↪8],[ 500. , 28.5, 30.7, 36.2, 39.3, 13.7],[ 475. , 28.4, 31. ,␣
↪ 36.2, 39.4, 14.1],[ 500. , 28.7, 31. , 36.2, 39.7, 13.3],[␣
↪500. , 29.1, 31.5, 36.4, 37.8, 12. ],[ 600. , 29.4, 32. , 37.
↪2, 40.2, 13.9],[ 600. , 29.4, 32. , 37.2, 41.5, 15. ],[ 700. ,␣
↪ 30.4, 33. , 38.3, 38.8, 13.8],[ 700. , 30.4, 33. , 38.5, ␣
↪38.8, 13.5],[ 610. , 30.9, 33.5, 38.6, 40.5, 13.3],[ 650. , 31.
↪ , 33.5, 38.7, 37.4, 14.8],[ 575. , 31.3, 34. , 39.5, 38.3, ␣
↪ 14.1],[ 685. , 31.4, 34. , 39.2, 40.8, 13.7],[ 620. , 31.5, ␣
↪34.5, 39.7, 39.1, 13.3],[ 680. , 31.8, 35. , 40.6, 38.1, 15.
↪1],[ 700. , 31.9, 35. , 40.5, 40.1, 13.8],[ 725. , 31.8, 35. ,␣
↪ 40.9, 40. , 14.8],[ 720. , 32. , 35. , 40.6, 40.3, 15. ],[␣
↪714. , 32.7, 36. , 41.5, 39.8, 14.1],[ 850. , 32.8, 36. , 41.
↪6, 40.6, 14.9],[1000. , 33.5, 37. , 42.6, 44.5, 15.5],[ 920. ,␣
↪ 35. , 38.5, 44.1, 40.9, 14.3],[ 955. , 35. , 38.5, 44. , ␣
↪41.1, 14.3],[ 925. , 36.2, 39.5, 45.3, 41.4, 14.9],[ 975. , 37.
↪4, 41. , 45.9, 40.6, 14.7],[ 950. , 38. , 41. , 46.5, 37.9, ␣
↪ 13.7],[ 40. , 12.9, 14.1, 16.2, 25.6, 14. ],[ 69. , 16.5, ␣
↪18.2, 20.3, 26.1, 13.9],[ 78. , 17.5, 18.8, 21.2, 26.3, 13.
↪7],[ 87. , 18.2, 19.8, 22.2, 25.3, 14.3],[ 120. , 18.6, 20. ,␣
↪ 22.2, 28. , 16.1],[ 0. , 19. , 20.5, 22.8, 28.4, 14.7],[␣
↪110. , 19.1, 20.8, 23.1, 26.7, 14.7],[ 120. , 19.4, 21. , 23.
↪7, 25.8, 13.9],[ 150. , 20.4, 22. , 24.7, 23.5, 15.2],[ 145. ,␣
↪ 20.5, 22. , 24.3, 27.3, 14.6],[ 160. , 20.5, 22.5, 25.3, ␣
↪27.8, 15.1],[ 140. , 21. , 22.5, 25. , 26.2, 13.3],[ 160. , 21.
↪1, 22.5, 25. , 25.6, 15.2],[ 169. , 22. , 24. , 27.2, 27.7, ␣
↪ 14.1],[ 161. , 22. , 23.4, 26.7, 25.9, 13.6],[ 200. , 22.1, ␣
↪23.5, 26.8, 27.6, 15.4],[ 180. , 23.6, 25.2, 27.9, 25.4, 14.␣
↪],[ 290. , 24. , 26. , 29.2, 30.4, 15.4],[ 272. , 25. , 27. , ␣
↪ 30.6, 28. , 15.6],[ 390. , 29.5, 31.7, 35. , 27.1, 15.3],[ ␣
↪6.7, 9.3, 9.8, 10.8, 16.1, 9.7],[ 7.5, 10. , 10.5, 11.
↪6, 17. , 10. ],[ 7. , 10.1, 10.6, 11.6, 14.9, 9.9],[ 9.7,␣
↪ 10.4, 11. , 12. , 18.3, 11.5],[ 9.8, 10.7, 11.2, 12.4, ␣
↪16.8, 10.3],[ 8.7, 10.8, 11.3, 12.6, 15.7, 10.2],[ 10. , 11.
↪3, 11.8, 13.1, 16.9, 9.8],[ 9.9, 11.3, 11.8, 13.1, 16.9, ␣
↪ 8.9],[ 9.8, 11.4, 12. , 13.2, 16.7, 8.7],[ 12.2, 11.5, ␣
↪12.2, 13.4, 15.6, 10.4],[ 13.4, 11.7, 12.4, 13.5, 18. , 9.
↪4],[ 12.2, 12.1, 13. , 13.8, 16.5, 9.1],[ 19.7, 13.2, 14.3,␣
↪ 15.2, 18.9, 13.6],[ 19.9, 13.8, 15. , 16.2, 18.1, 11.6],[␣
↪200. , 30. , 32.3, 34.8, 16. , 9.7],[ 300. , 31.7, 34. , 37.
↪8, 15.1, 11. ]]) #remaining rows were truncated in the export; the full dataset has 85 samples
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
# Create scaler: scaler
scaler = StandardScaler()
# Create a PCA instance: pca
pca = PCA()
# Create pipeline: pipeline
pipeline = make_pipeline(scaler,pca)
# Fit the pipeline to 'samples'
pipeline.fit(samples)
# Plot the explained variances
features = range(pca.n_components_) #total no. of components
plt.bar(features,pca.explained_variance_) #variance explained by each PCA component
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
#So we can see that only 2 components carry most of the variance, hence the intrinsic␣
↪dimension is 2, so we can pass n_components=2 to PCA to drop the other␣

↪components, see next

[131]: scaler = StandardScaler()
pca = PCA(n_components=2)
pipeline = make_pipeline(scaler,pca)
pca_features=pipeline.fit_transform(samples)
print(pca_features.shape)
pca_features

(85, 2)

[131]: array([[-0.57640502, -0.94649159],


[-0.36852393, -1.17103598],
[-0.28028168, -1.59709224],
[-0.00955427, -0.81967711],
[ 0.1238945 , -1.33121167],
[ 0.23193213, -1.1927194 ],
[ 0.33446854, -1.69284796],
[ 0.16081896, -0.70782116],
[ 0.29541529, -1.09283972],
[ 0.46043999, -1.00339441],
[ 0.44608943, -1.14148947],
[ 0.47829437, -0.89109102],
[ 0.4735883 , -0.31831092],
[ 0.73492898, -1.11327468],
[ 0.7739823 , -1.56968088],
[ 0.97048843, -0.96979764],
[ 0.97191941, -0.86843416],
[ 0.92078931, -0.89954755],
[ 0.97269014, -1.19729028],
[ 0.93832898, -1.00798007],
[ 1.09413347, -1.04718308],
[ 1.02607199, -0.7885073 ],
[ 1.19594301, -1.31990788],
[ 1.21981843, -1.01731496],
[ 1.28215964, -1.34797412],
[ 1.28102008, -1.43363506],
[ 1.35512796, -1.07895843],
[ 1.56324473, -1.40804828],
[ 1.92584477, -1.86051611],
[ 1.93991544, -1.17881192],
[ 1.98315357, -1.19459578],
[ 2.10532253, -1.3892677 ],
[ 2.2928189 , -1.24438685],
[ 2.25616715, -0.71826822],
[-2.43111613, -0.52636674],
[-1.90454031, -0.44279213],
[-1.79125251, -0.37108832],

[-1.66919652, -0.48749439],
[-1.53019235, -1.26864097],
[-1.64591521, -0.80754599],
[-1.49861879, -0.69611918],
[-1.46918052, -0.36116684],
[-1.30868872, -0.62387538],
[-1.2912297 , -0.67610179],
[-1.19653317, -0.87126453],
[-1.26874943, -0.15394748],
[-1.20676216, -0.75223637],
[-1.00605794, -0.49490585],
[-1.09293025, -0.21265068],
[-0.97227047, -0.93109612],
[-0.87563108, -0.27891495],
[-0.54366744, -1.07524439],
[-0.45311316, -0.95766887],
[ 0.24132972, -0.70313608],
[-3.2247437 , 1.46444109],
[-3.11578856, 1.31921191],
[-3.13606884, 1.49511086],
[-3.01196518, 0.73797522],
[-3.01859702, 1.24741427],
[-3.01953587, 1.35709736],
[-2.94938148, 1.422223 ],
[-2.9682391 , 1.72456031],
[-2.95851531, 1.80816635],
[-2.91341611, 1.31381391],
[-2.88193847, 1.49337533],
[-2.85443347, 1.7047016 ],
[-2.56545202, 0.05902699],
[-2.52301149, 0.79959474],
[-0.21609392, 1.94129785],
[ 0.18753056, 1.60003896],
[ 0.32007038, 1.50797703],
[ 0.52734202, 1.92249333],
[ 0.83194827, 1.37991283],
[ 0.72243211, 2.09400048],
[ 1.37907983, 2.21751453],
[ 1.44168633, 2.18030073],
[ 1.57043452, 1.58066239],
[ 1.71845528, 2.12978513],
[ 1.93944502, 2.11600295],
[ 2.44514154, 2.04389519],
[ 3.1608639 , 1.79776573],
[ 4.09193928, 1.58736259],
[ 4.9648268 , 2.55461606],
[ 4.90112817, 2.55764882],

[ 5.49512681, 2.09367309]])

[132]: #5. Dimensionality Reduction in NLP: In NLP we apply a tf-idf transformation to a␣


↪feature whose records contain lines of text. Each record is known as a␣

↪document, and each document is converted to a row of a 2D numpy array. The␣

↪columns of that array represent the weight of each word in the document.␣

↪Most of the time we get a sparse array (mostly 0s), so␣

↪dimensionality reduction of such arrays is essential for smooth ML␣

↪application.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF
import matplotlib.pyplot as plt
import seaborn as sns
# Sample documents
documents = ['I say good', 'I say bad', 'nothing'] # 3 Customer remarks
# Step 1: TF-IDF Transformation
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)
cols_name = tfidf.get_feature_names_out()
df_tfidf = pd.DataFrame(csr_mat.toarray(), index=['Customer_1', 'Customer_2',␣
↪'Customer_3'],columns=cols_name)

print("TF-IDF DataFrame:")
print(df_tfidf)

TF-IDF DataFrame:
bad good nothing say
Customer_1 0.000000 0.795961 0.0 0.605349
Customer_2 0.795961 0.000000 0.0 0.605349
Customer_3 0.000000 0.000000 1.0 0.000000

[133]: #i. TruncatedSVD: TruncatedSVD is a variation of Singular Value Decomposition␣


↪(SVD) that reduces the dimensionality of a matrix by retaining only the top␣

↪k singular values and corresponding singular vectors. It is applicable to␣

↪any type of array and is especially useful for arrays with both positive and negative␣

↪values, e.g. in image processing or other data with mixed signs.

svd = TruncatedSVD(n_components=2, random_state=42)


reduced_doc_svd = svd.fit_transform(csr_mat)
df_svd = pd.DataFrame(reduced_doc_svd.round(3),index=['Customer_1',␣
↪'Customer_2', 'Customer_3'],columns=['SVD_Component_1', 'SVD_Component_2'])

print("\nTruncatedSVD Features DataFrame:")


print(df_svd)
# Inspect TruncatedSVD Components
feature_names = tfidf.get_feature_names_out()
df_svd_components = pd.DataFrame(svd.components_.
↪round(3),index=['SVD_Component_1', 'SVD_Component_2'],columns=feature_names)

print("\nTruncatedSVD Component Loadings:")

print(df_svd_components) #This tells the weightage of each word in new␣
↪components

TruncatedSVD Features DataFrame:


SVD_Component_1 SVD_Component_2
Customer_1 0.827 -0.0
Customer_2 0.827 -0.0
Customer_3 -0.000 1.0

TruncatedSVD Component Loadings:


bad good nothing say
SVD_Component_1 0.481 0.481 -0.0 0.732
SVD_Component_2 -0.000 0.000 1.0 -0.000

[134]: #ii. NMF: Non-negative Matrix Factorization decomposes a non-negative matrix␣


↪into the product of two non-negative matrices, typically of lower rank. It␣

↪requires the input array to be non-negative, and is useful for image decomposition,␣

↪word documents that never have negative values, recommender systems, etc. It is often more␣

↪interpretable than SVD.

nmf = NMF(n_components=2, random_state=42)


nmf_features = nmf.fit_transform(csr_mat)
df_nmf = pd.DataFrame(nmf_features.round(2),index=['Customer_1', 'Customer_2',␣
↪'Customer_3'],columns=['NMF_Component_1', 'NMF_Component_2'])

print("\nNMF Features DataFrame:")


print(df_nmf)
df_nmf_components = pd.DataFrame(nmf.components_.
↪round(3),index=['NMF_Component_1', 'NMF_Component_2'],columns=feature_names)

print("\nNMF Component Loadings:")


print(df_nmf_components) #This tells the weightage of each word in new␣
↪components

NMF Features DataFrame:


NMF_Component_1 NMF_Component_2
Customer_1 0.48 0.00
Customer_2 0.48 0.00
Customer_3 0.00 0.77

NMF Component Loadings:


bad good nothing say
NMF_Component_1 0.824 0.824 0.000 1.253
NMF_Component_2 0.000 0.000 1.301 0.000

[135]: component = df_nmf_components.iloc[1] #Most influential word in the 2nd component


# Print result of nlargest
print(component.nlargest(1))

nothing 1.301
Name: NMF_Component_2, dtype: float64

[136]: #We can (approximately) reconstruct our original sparse matrix by multiplying the NMF␣
↪features with the NMF components:

import numpy as np
from scipy.sparse import csr_matrix
reconstructed_mat = np.dot(nmf_features, nmf.components_)
# If you prefer to have the reconstructed matrix in sparse format
reconstructed_sparse = csr_matrix(reconstructed_mat)
# Display the reconstructed matrix
print(reconstructed_mat.round(3)) #it is almost the same as our original matrix
pd.DataFrame(reconstructed_mat.round(3),␣
↪index=['Customer_1','Customer_2','Customer_3'],columns=cols_name)

[[0.398 0.398 0. 0.605]


[0.398 0.398 0. 0.605]
[0. 0. 1. 0. ]]

[136]: bad good nothing say


Customer_1 0.398 0.398 0.0 0.605
Customer_2 0.398 0.398 0.0 0.605
Customer_3 0.000 0.000 1.0 0.000

[137]: #Statistical ML

[138]: #1. Decision Trees: They don't need feature scaling because they are based on␣
↪if-else rules. They can be used for both regression and classification␣

↪tasks.

from sklearn.tree import DecisionTreeClassifier #Use DecisionTreeRegressor for␣


↪Continuous data

from sklearn.model_selection import train_test_split


import pandas as pd
data=pd.read_csv('data_cancer.csv') #https://www.kaggle.com/datasets/uciml/
↪breast-cancer-wisconsin-data

data.head()

[138]: id diagnosis radius_mean texture_mean perimeter_mean area_mean \


0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0

smoothness_mean compactness_mean concavity_mean concave points_mean \


0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017

2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430

… texture_worst perimeter_worst area_worst smoothness_worst \


0 … 17.33 184.60 2019.0 0.1622
1 … 23.41 158.80 1956.0 0.1238
2 … 25.53 152.50 1709.0 0.1444
3 … 26.50 98.87 567.7 0.2098
4 … 16.67 152.20 1575.0 0.1374

compactness_worst concavity_worst concave points_worst symmetry_worst \


0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
3 0.8663 0.6869 0.2575 0.6638
4 0.2050 0.4000 0.1625 0.2364

fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN

[5 rows x 33 columns]

[139]: data['diagnosis'].unique()

[139]: array(['M', 'B'], dtype=object)

[140]: data['diagnosis']=data['diagnosis'].replace('M',1).replace('B',0)
data.head()

C:\Users\14274\AppData\Local\Temp\ipykernel_9596\1286790736.py:1: FutureWarning:
Downcasting behavior in `replace` is deprecated and will be removed in a future
version. To retain the old behavior, explicitly call
`result.infer_objects(copy=False)`. To opt-in to the future behavior, set
`pd.set_option('future.no_silent_downcasting', True)`
data['diagnosis']=data['diagnosis'].replace('M',1).replace('B',0)

[140]: id diagnosis radius_mean texture_mean perimeter_mean area_mean \


0 842302 1 17.99 10.38 122.80 1001.0
1 842517 1 20.57 17.77 132.90 1326.0
2 84300903 1 19.69 21.25 130.00 1203.0
3 84348301 1 11.42 20.38 77.58 386.1
4 84358402 1 20.29 14.34 135.10 1297.0

smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430

… texture_worst perimeter_worst area_worst smoothness_worst \


0 … 17.33 184.60 2019.0 0.1622
1 … 23.41 158.80 1956.0 0.1238
2 … 25.53 152.50 1709.0 0.1444
3 … 26.50 98.87 567.7 0.2098
4 … 16.67 152.20 1575.0 0.1374

compactness_worst concavity_worst concave points_worst symmetry_worst \


0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
3 0.8663 0.6869 0.2575 0.6638
4 0.2050 0.4000 0.1625 0.2364

fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN

[5 rows x 33 columns]

[141]: y=data['diagnosis']
X=data[['concave points_mean','radius_mean']]

[142]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,␣


↪random_state=42, stratify=y)

dt = DecisionTreeClassifier(max_depth=7, random_state=1)
dt.fit(X_train,y_train)
print(dt.score(X_test,y_test))

0.8859649122807017

[143]: #2. Bias-Variance Tradeoff in Supervised ML:


#Overfitting: If y=f(x1,x2,...,xn) is our model, then an overfit model fits the␣
↪training set noise (incorrect labels, measurement errors, irrelevant␣

↪features, or random fluctuations). Hence, training error is low, while␣

↪testing error is high.

#Variance: measures how much a model's predictions fluctuate across different␣
↪training datasets. A high-variance model causes overfitting. We check the variance␣

↪of a model by cross-validation and hence decide on overfitting. If f suffers␣

↪from high variance, the CV error of f > the training set error of f, and f is said to␣

↪overfit the training set. To remedy overfitting: decrease model complexity, i.

↪e. decrease max depth, increase min samples per leaf, gather more data, etc.

#Underfitting: an underfit model is not flexible enough to approximate f. Hence,␣


↪training error and test error are both high and roughly equal.

#Bias: the difference between the original values and the predicted values. A high-bias model␣
↪causes underfitting. If f suffers from high bias then the CV error of f ≈ the training␣

↪set error of f, both are high, and f is said to underfit the training set.␣

↪To remedy underfitting, increase model complexity, i.e. increase max depth,␣

↪decrease min samples per leaf, gather more relevant features.

#Model Complexity: the flexibility of the model is called its complexity and is␣
↪controlled by hyperparameters (maximum tree depth, minimum samples per leaf,␣

↪etc.). Generally, highly complex models cause high variance and hence␣

↪overfitting, while very simple models cause high bias, which results in␣

↪underfitting. Hence, choosing the best model complexity matters: this is the␣

↪Bias-Variance Tradeoff.

#Generalization error: the overall error of the model on unseen data; the test set error of the␣
↪model is used as an estimate of the generalization error.
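
#A minimal sketch of how the over/underfitting diagnosis described above can be done in code:
#compare training accuracy with cross-validated accuracy for models of increasing complexity.
#It assumes the X_train and y_train arrays created from the cancer data in the cells above;
#the depth values are only illustrative.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np
for depth in (2, 7, 20):  #low, medium, high complexity
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)  #training accuracy
    cv_acc = np.mean(cross_val_score(tree, X_train, y_train, cv=5))  #cross-validated accuracy
    #A large gap (train_acc >> cv_acc) signals high variance/overfitting;
    #both scores low and close together signal high bias/underfitting.
    print('max_depth={}: train={:.3f}, cv={:.3f}'.format(depth, train_acc, cv_acc))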

[144]: #3. Ensemble Modeling: Train multiple models on the data, compare them, and/or combine␣
↪their predictions (e.g. by majority voting). Ensembles can be used for both regression and classification tasks.

from sklearn.metrics import accuracy_score


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier
lr = LogisticRegression()
knn = KNN(n_neighbors=27)
dt = DecisionTreeClassifier(min_samples_leaf=0.13)
# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn),␣
↪('Classification Tree', dt)]

# Iterate over the pre-defined list of classifiers


for clf_name, clf in classifiers:
# Fit clf to the training set
clf.fit(X_train, y_train)
# Predict y_pred
y_pred = clf.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Evaluate clf's accuracy on the test set
print('{:s} : {:.3f}'.format(clf_name, accuracy))

# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier
# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers) #takes the outputs of the models␣
↪defined in the list classifiers and assigns labels by majority voting.

# Fit vc to the training set


vc.fit(X_train, y_train)
# Evaluate the test set predictions
y_pred = vc.predict(X_test)
# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Logistic Regression : 0.860


K Nearest Neighbours : 0.860
Classification Tree : 0.886
Voting Classifier: 0.860

[145]: #4. Bagging: Bagging is like the basic ensemble recipe, except that instead of fitting␣
↪various models to the same data, a single␣

↪model (called the base model) is fitted to various bootstrap resamples of the␣

↪data (sampled with replacement). The final decision is majority␣

↪voting (BaggingClassifier) for classification and␣

↪averaging (BaggingRegressor) for regression. Hence, it reduces variance and␣

↪controls overfitting.

from sklearn.tree import DecisionTreeClassifier #Base Model


from sklearn.ensemble import BaggingClassifier
dt = DecisionTreeClassifier(random_state=1)
bc = BaggingClassifier(dt, n_estimators=50, random_state=1) #n_estimators␣
↪defines no. of bootstraps to make

bc.fit(X_train, y_train)
y_pred = bc.predict(X_test)
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test)) #Test accuracy is␣
↪better than the plain voting ensemble above

Test set accuracy of bc: 0.90

[146]: #OOB (out of bag) Evaluation: In bagging, since we use bootstrapping, part of the␣
↪data is never seen during training (on average about 37% of the rows are left out of␣

↪each bootstrap sample). We can use this held-out data for evaluation instead of␣

↪cross-validation.

from sklearn.tree import DecisionTreeClassifier #Base Model


from sklearn.ensemble import BaggingClassifier
dt = DecisionTreeClassifier(random_state=1)
bc = BaggingClassifier(dt, n_estimators=50,oob_score=True,random_state=1)␣
↪#n_estimators defines no. of bootstraps to make

bc.fit(X_train, y_train)
y_pred = bc.predict(X_test)
acc_test = accuracy_score(y_test, y_pred)
# Evaluate OOB accuracy
acc_oob = bc.oob_score_
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test,␣
↪acc_oob)) #Accuracies are close, which indicates low variance and little overfitting

Test set accuracy: 0.904, OOB accuracy: 0.914

[147]: #i. Random Forest: It is a Bagging ensemble method with a DecisionTree as its␣
↪base model, with one important extension: in addition to sampling the␣

↪records, the algorithm also samples the variables, i.e. it considers only a␣

↪random subset of the features (drawn without replacement) at each split.

import pandas as pd
df=pd.read_csv('data_cancer.csv')
df.head()

[147]: id diagnosis radius_mean texture_mean perimeter_mean area_mean \


0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0

smoothness_mean compactness_mean concavity_mean concave points_mean \


0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430

… texture_worst perimeter_worst area_worst smoothness_worst \


0 … 17.33 184.60 2019.0 0.1622
1 … 23.41 158.80 1956.0 0.1238
2 … 25.53 152.50 1709.0 0.1444
3 … 26.50 98.87 567.7 0.2098
4 … 16.67 152.20 1575.0 0.1374

compactness_worst concavity_worst concave points_worst symmetry_worst \


0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
3 0.8663 0.6869 0.2575 0.6638
4 0.2050 0.4000 0.1625 0.2364

fractal_dimension_worst Unnamed: 32

0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN

[5 rows x 33 columns]

[148]: y=df['area_mean']
X=df[['texture_mean','perimeter_mean','radius_mean','smoothness_mean','compactness_mean']]

[149]: from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,␣
↪random_state=42)

[150]: from sklearn.ensemble import RandomForestRegressor #For classification we use␣


↪RFClassifier

from sklearn.metrics import mean_squared_error as MSE


rf = RandomForestRegressor(n_estimators=25,min_samples_leaf=0.
↪10,random_state=2) #min_samples_leaf=0.10 each leaf contains min. 10% of␣

↪training data

rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))
importances = pd.Series(data=rf.feature_importances_,index= X_train.columns)␣
↪#How important each feature was to the fitted forest (impurity-based␣

↪importances, computed from the training data)

importances

Test set RMSE of rf: 119.47

[150]: texture_mean 0.00000


perimeter_mean 0.06848
radius_mean 0.93152
smoothness_mean 0.00000
compactness_mean 0.00000
dtype: float64

[151]: #5. Boosting: A boosting algorithm trains models (often the same base model again and␣
↪again) sequentially, where each successive model tries to make better predictions␣

↪than the previous one, hence it reduces bias. The final decision is a weighted␣

↪prediction (weighted majority voting for classification and a weighted␣

↪average for regression), with more accurate models having more influence.

#i. AdaBoost: In this form of boosting each successive model pays more attention to␣
↪the wrongly predicted instances of its predecessor; however, each model is␣

↪trained on the whole data. Alpha is the coefficient that defines the weight of a model␣

↪in the final decision, in combination with the learning rate (we'll see in the␣

↪example), and it depends on that model's training error (no. of incorrect predictions␣

↪it made).

from sklearn.tree import DecisionTreeClassifier


from sklearn.ensemble import AdaBoostClassifier
y=data['diagnosis']
X=data[['concave points_mean','radius_mean']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,␣
↪random_state=42, stratify=y)

dt = DecisionTreeClassifier(max_depth=2, random_state=1)
ada = AdaBoostClassifier(dt, n_estimators=180, random_state=1) #n_estimators␣
↪defines no. of models to use, and learning rate (eta) combined with alpha is␣

↪weight of each model in final decision

ada.fit(X_train, y_train)
# Compute the probabilities of obtaining the positive class
y_pred_proba = ada.predict_proba(X_test)[:,1]
from sklearn.metrics import roc_auc_score
# Evaluate test-set roc_auc_score
ada_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('ROC AUC score:',ada_roc_auc)

C:\Users\14274\anaconda3\Lib\site-
packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R
algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME
algorithm to circumvent this warning.
warnings.warn(
ROC AUC score: 0.9704034391534391

[152]: #ii. Gradient Boosting(GB): Like AdaBoost, each model is trained sequentially on the␣
↪training data, but in GB each successive model is fitted to the residual errors of the␣

↪ensemble built so far. Hence, the final prediction is the combination (sum) of the␣

↪predictions of all the models in the sequence.

from sklearn.ensemble import GradientBoostingClassifier


gb = GradientBoostingClassifier(max_depth=4,n_estimators=200,random_state=2)
gb.fit(X_train, y_train)
y_pred_proba = gb.predict_proba(X_test)[:,1]
# Evaluate test-set roc_auc_score
gb_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('ROC AUC score:',gb_roc_auc)

ROC AUC score: 0.9690806878306878

[153]: #Stochastic Gradient Boosting(SGB): Gradient Boosting with row and␣
↪feature subsampling (without replacement) for each successive tree. It␣

↪adds a randomness factor (akin to bagging) to GB to control variance along with␣

↪bias.

# Import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
sgbr = GradientBoostingClassifier(max_depth=4, subsample=0.9,max_features=0.
↪75,n_estimators=200,random_state=2) #Sample 90% of the training rows for each␣

↪tree, and use at most 75% of the features per split

# Fit sgbr to the training set


sgbr.fit(X_train, y_train)
# Predict test set labels
y_pred_proba = sgbr.predict_proba(X_test)[:,1]
# Evaluate test-set roc_auc_score
sgbr_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('ROC AUC score:',sgbr_roc_auc)

ROC AUC score: 0.9646164021164021

[154]: #Note: Use Hyperparameter Tuning to optimize your ML models for better␣
↪performance.
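
#A minimal grid-search sketch for the note above, tuning the Gradient Boosting model on the
#same X_train/y_train used in the previous cells; the parameter values here are only
#illustrative choices, not recommended settings.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
param_grid = {'max_depth': [2, 4, 6], 'n_estimators': [100, 200], 'learning_rate': [0.05, 0.1]}
grid = GridSearchCV(GradientBoostingClassifier(random_state=2), param_grid,
                    scoring='roc_auc', cv=5)  #exhaustive 5-fold CV over every combination
grid.fit(X_train, y_train)
print(grid.best_params_)  #best hyperparameter combination found
print(grid.best_score_)   #its mean cross-validated ROC AUC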

[155]: #Extreme Gradient Boosting(XGB): XGBoost is a highly optimized gradient boosting␣
↪library for supervised ML. It incorporates L1 and L2 regularization into␣

↪Gradient Boosting. For most other models we have to explicitly handle␣

↪missing data, while XGB automatically learns optimal splits for missing␣

↪values. It is also highly optimized with parallel processing.

import xgboost as xgb


xg_cl = xgb.XGBClassifier(booster='gbtree',objective='binary:logistic',␣
↪n_estimators=10,reg_lambda=10, seed=123) #For regression use XGBRegressor

'''The booster parameter defines base learner (gbtree, gblinear, dart etc.)
The objective parameter defines the learning task and loss function that␣
↪XGBoost

will use to optimize during training. It tells the algorithm what kind of␣
↪problem it's

solving and how to measure the error. "binary:logistic": This specifies that␣
↪the model

is performing binary classification with logistic regression as the loss␣


↪function. Depending

on your specific use case, you might choose different objectives:


"reg:squarederror": For regression tasks using mean squared error.
"multi:softmax": For multi-class classification using softmax.
"binary:hinge": For binary classification using hinge loss (similar to SVM).
"rank:pairwise": For ranking tasks
reg_lambda sets L2 regularization, alpha sets L1, and gamma is the minimum loss␣
↪reduction required to make a further split'''

xg_cl.fit(X_train,y_train)
y_pred_proba = xg_cl.predict_proba(X_test)[:,1]
xg_cl_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('ROC AUC score:',xg_cl_roc_auc)

ROC AUC score: 0.9708994708994708

[156]: #XGB CV: XGB has built-in CV method. We don't need to apply CV explicitly
import pandas as pd
churn_data = pd.read_csv("Churn.csv")
churn_data.head()

[156]: CustomerID Age Gender Tenure Usage Frequency Support Calls \


0 1 22 Female 25 14 4
1 2 41 Female 28 28 7
2 3 47 Male 27 10 2
3 4 35 Male 9 12 5
4 5 53 Female 58 24 9

Payment Delay Subscription Type Contract Length Total Spend \


0 27 Basic Monthly 598
1 13 Standard Monthly 584
2 29 Premium Annual 757
3 17 Premium Quarterly 232
4 2 Standard Annual 533

Last Interaction Churn


0 9 1
1 20 0
2 21 0
3 18 0
4 18 0

[157]: churn_dmatrix = xgb.DMatrix(data=churn_data[['Age','Tenure','Usage␣


↪Frequency','Support Calls','Payment Delay','Total Spend','Last␣

↪Interaction']],label=churn_data.Churn) #converts to structure accepted by XGB

params={'booster':'gbtree',"objective":"binary:logistic","max_depth":
↪4,'reg_lambda':10}

cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=4,␣


↪num_boost_round=10, metrics="error", as_pandas=True) #num_boost_round = no. of␣

↪boosting rounds, nfold = no. of CV folds

cv_results

[157]: train-error-mean train-error-std test-error-mean test-error-std


0 0.095510 0.000509 0.095536 0.001525
1 0.099160 0.001899 0.098984 0.003351
2 0.095510 0.000509 0.095536 0.001525

3 0.087654 0.000513 0.087660 0.001533
4 0.087489 0.000489 0.087737 0.001391
5 0.087375 0.000343 0.087706 0.001436
6 0.086614 0.000620 0.086914 0.001782
7 0.085205 0.000720 0.085640 0.001271
8 0.085806 0.000312 0.086184 0.001344
9 0.084113 0.000614 0.084397 0.002022

[158]: print("Accuracy: %f" %((1-cv_results["test-error-mean"]).iloc[-1]))

Accuracy: 0.915603

[159]: #Tuning in XGB:


#Tree tunable parameters i.e. when base model is gbtree
'''learning rate: learning rate/eta
gamma: min loss reduction to create new tree split
reg_lambda: L2 reg on leaf weights
alpha: L1 reg on leaf weights
max_depth: max depth per tree
subsample: % samples used per tree
colsample_bytree: % features used per tree'''
#Linear tunable parameters i.e. when base model is gblinear
'''reg_lambda: L2 reg on weights
alpha: L1 reg on weights
lambda_bias: L2 reg term on bias'''
#You can also tune the number of estimators (n_estimators) used for both base␣
↪model types.

#let's tune an XGB model:

[159]: 'reg_lambda: L2 reg on weights\nalpha: L1 reg on weights\nlambda_bias: L2 reg


term on bias'

[160]: import pandas as pd


import xgboost as xgb
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
df=pd.read_csv('data_cancer.csv')
y=df['area_mean']
X=df[['texture_mean','perimeter_mean','radius_mean','smoothness_mean','compactness_mean']]
df_dmatrix = xgb.DMatrix(data=X,label=y)
gbm_param_grid = {'learning_rate': np.arange(0.05,1.05,.05),'n_estimators':␣
↪[200],'subsample': np.arange(0.05,1.05,.05)}

gbm = xgb.XGBRegressor()
randomized_mse = RandomizedSearchCV(estimator=gbm,␣
↪param_distributions=gbm_param_grid, n_iter=25,␣

↪scoring='neg_mean_squared_error', cv=4, verbose=1)

randomized_mse.fit(X, y)

print("Best parameters found: ",randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

Fitting 4 folds for each of 25 candidates, totalling 100 fits


Best parameters found: {'subsample': 1.0, 'n_estimators': 200, 'learning_rate':
0.1}
Lowest RMSE found: 23.29125947679293

[161]: #However, tuning with a pipeline is a little bit different:


import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
xgb_pipeline = Pipeline([("st_scaler",StandardScaler()), ("xgb_model",xgb.
↪XGBRegressor())])

gbm_param_grid = {'xgb_model__subsample': np.arange(.05, 1, .


↪05),'xgb_model__max_depth': np.arange(3,20,1),'xgb_model__colsample_bytree':␣

↪np.arange(.1,1.05,.05) }

randomized_neg_mse =␣
↪RandomizedSearchCV(estimator=xgb_pipeline,param_distributions=gbm_param_grid,␣

↪n_iter=10, scoring='neg_mean_squared_error', cv=4)

randomized_neg_mse.fit(X, y)
print("Best rmse: ", np.sqrt(np.abs(randomized_neg_mse.best_score_)))
print("Best model: ", randomized_neg_mse.best_estimator_)

Best rmse: 26.665464920465865


Best model: Pipeline(steps=[('st_scaler', StandardScaler()),
('xgb_model',
XGBRegressor(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=0.9500000000000003, device=None,
early_stopping_rounds=None,
enable_categorical=False, eval_metric=None,
feature_types=None, gamma=None, grow_policy=None,
importance_type=None,
interaction_constraints=None, learning_rate=None,
max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None,
max_depth=7, max_leaves=None,
min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None,
n_estimators=None, n_jobs=None,
num_parallel_tree=None, random_state=None, …))])

[162]: '''XGB is a very powerful ML library and the most widely used. What We Have Not␣
↪Covered (And How You Can Proceed):

-Using XGBoost for ranking/recommendation problems (Netflix/Amazon problem)


-Using more sophisticated hyperparameter tuning strategies for tuning XGBoost␣
↪models (Bayesian Optimization; there is an entire field devoted to it)

-Using XGBoost as part of an ensemble of other models for regression/


↪classification; XGB itself is an ensemble, but nothing stops us from ensembling it␣

↪with other models, even with another XGB!'''

#XGB works well for most tabular data, but it is not optimal for Image Processing,␣
↪NLP, or CV tasks; rather use Deep Learning for them.

[162]: 'XGB is a very powerful ML library and the most widely used. What We Have Not
Covered (And How You Can Proceed):\n-Using XGBoost for ranking/recommendation
problems (Netflix/Amazon problem)\n-Using more sophisticated hyperparameter
tuning strategies for tuning XGBoost models (Bayesian Optimization; there is an
entire field devoted to it)\n-Using XGBoost as part of an ensemble of other models
for regression/classification; XGB itself is an ensemble, but nothing stops us
from ensembling it with other models, even with another XGB!'

[163]: #Dimensionality Reduction for ML:

[164]: #a. General Insight into the Data: We can peek into our data visually using a␣
↪simple pair-plot for low-dimensional data; for high-dimensional data we␣

↪use t-SNE visualization (a pair-plot sketch is shown after the t-SNE output below):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.read_csv('ANSUR II FEMALE Public.csv') #https://www.kaggle.com/datasets/
↪seshadrikolluri/ansur-ii

numeric_df = df.select_dtypes(include=['number'])
print(df.shape)
print(numeric_df.shape)
from sklearn.manifold import TSNE
m = TSNE(learning_rate=50)
tsne_features = m.fit_transform(numeric_df)
print(tsne_features) #Reduced to 2D
#Assigning t-SNE features to our dataset
df['x'] = tsne_features[:,0]
df['y'] = tsne_features[:,1]
sns.scatterplot(x="x", y="y",hue='Branch', data=df)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',fontsize='small',␣
↪markerscale=0.7)

plt.show()

(1986, 108)
(1986, 99)
[[-46.565998 22.655666]

[-45.529686 23.14485 ]
[-45.72889 23.567385]

[ 35.08874 -23.190556]
[ 39.116215 -20.26782 ]
[ 37.955853 -25.027208]]
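
#A minimal pair-plot sketch for the low-dimensional case mentioned above; the columns
#chosen here are just an illustrative subset of the ANSUR data.
cols = ['Heightin', 'Weightlbs', 'anklecircumference', 'Branch']  #assumed illustrative subset
sns.pairplot(df[cols], hue='Branch', diag_kind='hist')  #pairwise scatter plots + histograms
plt.show()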

[165]: #b. Feature Selection: Selecting only Important Features for Modeling:
#i. Dropping Features with Low variance and High Missing Values:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.005) #Filter features with variance less␣
↪than/equal to 0.005

sel.fit(numeric_df / numeric_df.mean()) #We divided the data by mean to␣


↪Normalize it then check for variance

mask = sel.get_support()
reduced_df = numeric_df.loc[:, mask]
print(reduced_df.shape) #We dropped a lot of features

(1986, 31)

[166]: print(df.shape)
mask = df.isna().sum() / len(df) < 0.3 #Keep features (columns) with less than 30% missing␣
↪values

reduced_df = df.loc[:, mask]


print(reduced_df.shape) #We dropped only 1 feature with more than 30% missing␣
↪values

(1986, 110)

(1986, 109)

[167]: reduced_df.head()

[167]: SubjectId abdominalextensiondepthsitting acromialheight \


0 10037 231 1282
1 10038 194 1379
2 10042 183 1369
3 10043 261 1356
4 10051 309 1303

acromionradialelength anklecircumference axillaheight \


0 301 204 1180
1 320 207 1292
2 329 233 1271
3 306 214 1250
4 308 214 1210

balloffootcircumference balloffootlength biacromialbreadth \


0 222 177 373
1 225 178 372
2 237 196 397
3 240 188 384
4 217 182 378

bicepscircumferenceflexed … PrimaryMOS SubjectsBirthLocation \


0 315 … 92Y Germany
1 272 … 25U California
2 300 … 35D Texas
3 364 … 25U District of Columbia
4 320 … 42A Texas

SubjectNumericRace DODRace Age Heightin Weightlbs WritingPreference \


0 2 2 26 61 142 Right hand
1 3 3 21 64 120 Right hand
2 1 1 23 68 147 Right hand
3 8 2 22 66 175 Right hand
4 1 1 45 63 195 Right hand

x y
0 -46.565998 22.655666
1 -45.529686 23.144850
2 -45.728889 23.567385
3 -46.882313 23.612629
4 -47.587242 23.539295

[5 rows x 109 columns]

[168]: #ii. Dealing with Highly Correlated Features:
#Visualization:
corr=reduced_df.select_dtypes(include=['number']).corr()
corr #We can drop one of each pair of highly correlated features (a sketch follows the matrix below)

[168]: SubjectId abdominalextensiondepthsitting \


SubjectId 1.000000 -0.013789
abdominalextensiondepthsitting -0.013789 1.000000
acromialheight 0.004100 0.214947
acromionradialelength 0.022717 0.237222
anklecircumference -0.031470 0.372719
… … …
Age -0.018173 0.293228
Heightin 0.008401 0.170861
Weightlbs -0.014049 0.767667
x 0.279870 -0.052270
y 0.000265 0.053877

acromialheight acromionradialelength \
SubjectId 0.004100 0.022717
abdominalextensiondepthsitting 0.214947 0.237222
acromialheight 1.000000 0.811059
acromionradialelength 0.811059 1.000000
anklecircumference 0.350197 0.259060
… … …
Age 0.039226 0.057805
Heightin 0.896396 0.728164
Weightlbs 0.553647 0.505108
x 0.061294 0.147724
y 0.002129 0.041040

anklecircumference axillaheight \
SubjectId -0.031470 0.004606
abdominalextensiondepthsitting 0.372719 0.151886
acromialheight 0.350197 0.981581
acromionradialelength 0.259060 0.791981
anklecircumference 1.000000 0.324684
… … …
Age -0.086713 0.000305
Heightin 0.345679 0.891467
Weightlbs 0.586055 0.493883
x -0.072520 0.062300
y 0.004004 -0.003276

balloffootcircumference balloffootlength \
SubjectId 0.000897 0.033391
abdominalextensiondepthsitting 0.316927 0.241917

acromialheight 0.468676 0.676899
acromionradialelength 0.398971 0.616667
anklecircumference 0.590284 0.413988
… … …
Age 0.051034 0.016836
Heightin 0.462897 0.646926
Weightlbs 0.553959 0.522848
x -0.021848 0.096852
y -0.017098 0.019238

biacromialbreadth bicepscircumferenceflexed \
SubjectId -0.029601 -0.010903
abdominalextensiondepthsitting 0.225261 0.729729
acromialheight 0.511699 0.278318
acromionradialelength 0.472225 0.276870
anklecircumference 0.326983 0.499570
… … …
Age -0.016117 0.268575
Heightin 0.518662 0.243009
Weightlbs 0.442281 0.833521
x -0.040787 0.031924
y -0.014967 0.076291

… weightkg wristcircumference \
SubjectId … -0.010213 -0.010356
abdominalextensiondepthsitting … 0.791290 0.474677
acromialheight … 0.556479 0.531893
acromionradialelength … 0.510630 0.451131
anklecircumference … 0.608834 0.644307
… … … …
Age … 0.220650 0.064341
Heightin … 0.502772 0.511133
Weightlbs … 0.970784 0.695328
x … 0.048091 0.010348
y … 0.061714 0.002420

wristheight SubjectNumericRace DODRace \


SubjectId 0.016271 0.008103 0.028523
abdominalextensiondepthsitting 0.205898 0.063056 -0.000055
acromialheight 0.888409 0.031315 -0.195841
acromionradialelength 0.547174 0.053812 -0.161936
anklecircumference 0.363866 -0.009231 -0.177347
… … … …
Age 0.059227 0.068719 0.012311
Heightin 0.789351 0.036638 -0.192266
Weightlbs 0.513441 0.053603 -0.069460
x 0.114857 0.024622 -0.000311

y -0.002867 0.029857 0.022460

Age Heightin Weightlbs x \


SubjectId -0.018173 0.008401 -0.014049 0.279870
abdominalextensiondepthsitting 0.293228 0.170861 0.767667 -0.052270
acromialheight 0.039226 0.896396 0.553647 0.061294
acromionradialelength 0.057805 0.728164 0.505108 0.147724
anklecircumference -0.086713 0.345679 0.586055 -0.072520
… … … … …
Age 1.000000 0.017574 0.225218 -0.060501
Heightin 0.017574 1.000000 0.502403 0.046455
Weightlbs 0.225218 0.502403 1.000000 0.040829
x -0.060501 0.046455 0.040829 1.000000
y 0.066305 0.009133 0.054477 0.048208

y
SubjectId 0.000265
abdominalextensiondepthsitting 0.053877
acromialheight 0.002129
acromionradialelength 0.041040
anklecircumference 0.004004
… …
Age 0.066305
Heightin 0.009133
Weightlbs 0.054477
x 0.048208
y 1.000000

[101 rows x 101 columns]
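
#A minimal sketch of one common recipe for the dropping step mentioned above: keep only the
#upper triangle of the correlation matrix computed in the previous cell so each pair is seen
#once, then drop one column of every pair whose absolute correlation exceeds a threshold
#(0.95 here is an arbitrary illustrative choice).
import numpy as np
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  #upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]  #columns to remove
print(len(to_drop), 'highly correlated features to drop')
decorrelated_df = reduced_df.drop(columns=to_drop)
print(decorrelated_df.shape)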

[169]: #iii. Dropping Features with Low Importance:


#Using Non-Tree Models: Logistic regression (the same process applies to regression models)
df_1=reduced_df.drop(['x','y'],axis=1).select_dtypes(include=['number'])
df_1.head()

[169]: SubjectId abdominalextensiondepthsitting acromialheight \


0 10037 231 1282
1 10038 194 1379
2 10042 183 1369
3 10043 261 1356
4 10051 309 1303

acromionradialelength anklecircumference axillaheight \


0 301 204 1180
1 320 207 1292
2 329 233 1271
3 306 214 1250

4 308 214 1210

balloffootcircumference balloffootlength biacromialbreadth \


0 222 177 373
1 225 178 372
2 237 196 397
3 240 188 384
4 217 182 378

bicepscircumferenceflexed … waistfrontlengthsitting \
0 315 … 345
1 272 … 329
2 300 … 367
3 364 … 371
4 320 … 380

waistheightomphalion weightkg wristcircumference wristheight \


0 942 657 152 756
1 1032 534 155 815
2 1035 663 162 799
3 999 782 173 818
4 911 886 152 762

SubjectNumericRace DODRace Age Heightin Weightlbs


0 2 2 26 61 142
1 3 3 21 64 120
2 1 1 23 68 147
3 8 2 22 66 175
4 1 1 45 63 195

[5 rows x 99 columns]

X=df_1[['abdominalextensiondepthsitting','acromialheight','acromionradialelength','anklecircumference','axillaheight','balloffootcircumference','balloffootlength','biacromialbreadth','bicepscircumferenceflexed']]
y=df_1['DODRace'].astype('category')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr = LogisticRegression(multi_class="multinomial")
lr.fit(X_train_std, y_train)
X_test_std = scaler.transform(X_test)
y_pred = lr.predict(X_test_std)
print(accuracy_score(y_test, y_pred))

print(dict(zip(X.columns, abs(lr.coef_[0])))) #Absolute coefficient of each␣
↪feature (for the first class), used as a rough importance measure

0.6359060402684564
{'abdominalextensiondepthsitting': 0.21515381931083946, 'acromialheight':
0.8877099972958575, 'acromionradialelength': 0.10341865403147825,
'anklecircumference': 0.663169629845846, 'axillaheight': 0.17577417771122847,
'balloffootcircumference': 0.35320106477499696, 'balloffootlength':
0.44790839786798325, 'biacromialbreadth': 0.14615567832723877,
'bicepscircumferenceflexed': 0.0853947527952258}

[171]: X.drop('acromionradialelength', axis=1,inplace=True) #dropping the feature with␣


↪the lowest-valued coefficient

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


lr.fit(scaler.fit_transform(X_train), y_train)
print(accuracy_score(y_test, lr.predict(scaler.transform(X_test)))) #Accuracy␣
↪is almost the same even after dropping the feature

0.6124161073825504
C:\Users\14274\AppData\Local\Temp\ipykernel_9596\2131005849.py:1:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-


docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
X.drop('acromionradialelength', axis=1,inplace=True) #droping feature with
low-valued coefficients

[172]: #Be cautious when you use lasso regularization in your model, because it already␣
↪shrinks the coefficients of unimportant features to zero, so just drop the zero-␣

↪coefficient features in that case.

#Note: drop only the single least important feature, not two or more, because once␣
↪you drop one the importance of the others changes when you refit the same model.␣

↪If you want to drop more features, drop one, refit the model, drop one more, and␣

↪so on, not all at once (see the minimal loop sketch below; RFE, used later, automates this).
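
#A minimal sketch of the drop-one-refit-repeat loop described above, reusing the lr, scaler,
#y and current X from the previous cells; keeping 5 features is an arbitrary illustrative
#stopping point. RFE, shown a few cells below, automates exactly this procedure.
import numpy as np
X_iter = X.copy()
while X_iter.shape[1] > 5:
    X_tr, X_te, y_tr, y_te = train_test_split(X_iter, y, test_size=0.3, random_state=0)
    lr.fit(scaler.fit_transform(X_tr), y_tr)
    weakest = X_iter.columns[np.argmin(np.abs(lr.coef_[0]))]  #smallest absolute coefficient
    print('dropping', weakest)
    X_iter = X_iter.drop(columns=[weakest])  #drop one feature, then refit on the rest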

[173]: #Using Tree Based Models:


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print(accuracy_score(y_test, rf.predict(X_test)))

0.5637583892617449

[174]: print(rf.feature_importances_)
mask = rf.feature_importances_ >0.11

print(mask)
X_reduced = X.loc[:, mask]
print(X_reduced.columns)

[0.12111906 0.12687506 0.13888741 0.12178715 0.10499965 0.14776223


0.11075023 0.12781921]
[ True True True True False True True True]
Index(['abdominalextensiondepthsitting', 'acromialheight',
'anklecircumference', 'axillaheight', 'balloffootlength',
'biacromialbreadth', 'bicepscircumferenceflexed'],
dtype='object')

[175]: #Using Combination/Ensemble of models: (multiple models at once)


df=pd.read_csv('ANSUR II FEMALE Public.csv')
numeric_df = df.select_dtypes(include=['number'])
X=numeric_df.drop('DODRace',axis=1)
y=numeric_df['DODRace'].astype('category')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

[176]: #First apply LassoCV


from sklearn.linear_model import LassoCV #LassoCV is lasso regression with the␣
↪alpha value tuned by cross-validation.

lcv = LassoCV()
lcv.fit(X_train, y_train)
lcv.score(X_test, y_test)
lcv_mask = lcv.coef_ != 0
sum(lcv_mask)

[176]: 5

[177]: from sklearn.feature_selection import RFE #As we discussed above, we first check␣
↪the importance of the features and drop the least important one, refit the model,␣

↪drop the next least important one, and so on. Recursive feature␣

↪elimination (RFE) automates this process.

from sklearn.ensemble import RandomForestClassifier


rfe_rf = RFE(estimator=RandomForestClassifier(),n_features_to_select=5, step=5,␣
↪verbose=1)

rfe_rf.fit(X_train, y_train)
rf_mask = rfe_rf.support_

Fitting estimator with 98 features.


Fitting estimator with 93 features.
Fitting estimator with 88 features.
Fitting estimator with 83 features.
Fitting estimator with 78 features.
Fitting estimator with 73 features.
Fitting estimator with 68 features.

Fitting estimator with 63 features.
Fitting estimator with 58 features.
Fitting estimator with 53 features.
Fitting estimator with 48 features.
Fitting estimator with 43 features.
Fitting estimator with 38 features.
Fitting estimator with 33 features.
Fitting estimator with 28 features.
Fitting estimator with 23 features.
Fitting estimator with 18 features.
Fitting estimator with 13 features.
Fitting estimator with 8 features.

[ ]: from sklearn.feature_selection import RFE


from sklearn.ensemble import GradientBoostingClassifier
rfe_gb = RFE(estimator=GradientBoostingClassifier(), n_features_to_select=5,␣
↪step=5, verbose=1)

rfe_gb.fit(X_train, y_train)
gb_mask = rfe_gb.support_

Fitting estimator with 98 features.


Fitting estimator with 93 features.

[ ]: import numpy as np
votes = np.sum([lcv_mask, rf_mask], axis=0)
print(votes)
mask = votes >= 1 #Keep only the features that got a vote of retention from at least␣
↪one of the above models.

reduced_X = X.loc[:, mask]

[ ]: reduced_X.head()

[ ]: #iv. Using PCA for Feature selection and Modeling:


import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
df=pd.read_csv('ANSUR II FEMALE Public.csv')


X=df.select_dtypes(include=['number']).drop('DODRace',axis=1)
y=df['DODRace'].astype('category')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.
↪3,random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),('reducer', PCA(n_components=0.
↪9)),('classifier', RandomForestClassifier())]) ##We keep 90% of the variance of our␣

↪data for modeling (i.e. only as many principal components as needed to explain␣

↪90% of the variance are kept)

pipe.fit(X_train, y_train)
print(pipe['reducer'].explained_variance_ratio_.sum()) #Total variance explained␣
↪by the selected components

print(pipe.score(X_test,y_test))
