
Assignment 2: ELL 409

Arnav Raj
October 13, 2024

Abstract
In this report, I present the implementation and analysis of a Decision Tree
classifier that I built from scratch, including both pre-pruning and post-pruning
strategies. I focus on handling class imbalance, choosing appropriate pruning
methods, and evaluating the model’s performance before and after pruning. I also
discuss my rationale for not using certain techniques like SMOTE and detail how I
selected pruning parameters such as the alpha value.

1 Introduction
1.1 Problem Statement
The objective of this assignment was to implement a Decision Tree classifier from scratch to predict
whether a bank client will subscribe to a term deposit based on various attributes. The dataset was heavily imbalanced, with the ’no’ class forming a large majority of the samples. The key challenges included handling
class imbalance, selecting appropriate pruning strategies to prevent overfitting, and optimizing the
model’s performance.

2 Approach
2.1 Data Preprocessing
I began by loading the dataset and inspecting it for missing values and data types. I used Label
Encoding to convert categorical variables into numerical values suitable for the model. Then, I split
the data into training and validation sets using an 80-20 split with stratification to maintain the
class distribution.
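A minimal sketch of this preprocessing step is shown below, assuming the data sits in a pandas DataFrame loaded from a hypothetical bank.csv with a target column named y; the actual file name and column names in my code may differ.

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split

    # Load the data and label-encode every categorical (object) column
    df = pd.read_csv("bank.csv")  # illustrative file name
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col])

    X = df.drop(columns=["y"]).values  # "y" is the (assumed) target column
    y = df["y"].values

    # 80-20 split with stratification to preserve the class distribution
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )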

2.2 Handling Class Imbalance


Initially, I applied the Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset.
However, it led to overfitting: the model began to predict the minority class (’yes’) so often that its ’yes’ predictions approached the number of ’no’ predictions. To mitigate this, I employed a custom resampling strategy (sketched in code after the list below):

• I oversampled the ’yes’ class to constitute one-third of the total training data.

• This approach ensured that the minority class was better represented without causing the
model to overfit.

• I left the ’no’ class unchanged to preserve the original data distribution as much as possible.
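A minimal sketch of this resampling strategy, assuming the arrays produced by the split above and the ’yes’ class encoded as 1 (the exact label encoding is an assumption):

    import numpy as np

    def oversample_to_one_third(X_train, y_train, pos_label=1, seed=42):
        """Oversample the minority class with replacement until it makes up
        roughly one third of the training data; the majority class is untouched."""
        rng = np.random.default_rng(seed)
        pos_idx = np.where(y_train == pos_label)[0]
        neg_idx = np.where(y_train != pos_label)[0]
        target_pos = len(neg_idx) // 2  # pos = neg / 2  =>  pos / (pos + neg) = 1/3
        extra = rng.choice(pos_idx, size=max(0, target_pos - len(pos_idx)), replace=True)
        keep = np.concatenate([neg_idx, pos_idx, extra])
        rng.shuffle(keep)
        return X_train[keep], y_train[keep]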

2.3 Decision Tree Implementation
I implemented the Decision Tree with the following considerations (the split-selection step is sketched in code after the list):

• Criteria for Splitting: I used entropy as the criterion for measuring the quality of a split.

• Handling Continuous Features: I handled continuous features by finding the best threshold
that maximizes information gain.

• Stopping Conditions: I controlled the tree growth using parameters like maximum depth,
minimum samples per leaf, and minimum gain.
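To make the split criterion concrete, here is a minimal sketch of the entropy computation and the threshold search for a single continuous feature; the function names are illustrative rather than taken directly from my implementation.

    import numpy as np

    def entropy(y):
        """Shannon entropy of a label vector."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_threshold(x, y):
        """Return the (threshold, information gain) pair that maximises the gain
        when splitting one continuous feature x against labels y."""
        parent = entropy(y)
        best_t, best_gain = None, 0.0
        for t in np.unique(x)[:-1]:  # every observed value except the largest
            left, right = y[x <= t], y[x > t]
            gain = parent - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if gain > best_gain:
                best_t, best_gain = t, gain
        return best_t, best_gain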

2.4 Pruning Strategy


2.4.1 Post-Pruning (Cost Complexity Pruning)
I chose post-pruning over pre-pruning for the following reasons:

• Full Tree Exploration: It allows the model to consider all possible splits before simplifying
the tree, potentially capturing more complex patterns.

• Better Generalization: By pruning after the tree is fully grown, I can use validation data
to decide which branches to prune, enhancing generalization.

2.4.2 Alpha Value Selection


The alpha value controls the complexity penalty in the cost complexity pruning algorithm. I selected
the alpha value as follows (see the sketch after this list):

• I generated a range of alpha values from the tree’s structure.

• I performed cross-validation to select the optimal alpha that minimizes the validation error.

• The optimal alpha value was found to be 0.25.

• The selected alpha balances the trade-off between tree complexity and model performance.
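Since my tree is implemented from scratch, the exact pruning code is not reproduced here. The sketch below illustrates the same selection procedure using scikit-learn's cost-complexity pruning API purely as a stand-in: candidate alphas are generated from the fully grown tree and scored by cross-validation (the scoring metric here is a choice, not necessarily the one I used).

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    # Candidate alphas derived from the structure of the fully grown tree
    full_tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
    path = full_tree.cost_complexity_pruning_path(X_train, y_train)
    ccp_alphas = np.unique(path.ccp_alphas)

    # Score each candidate alpha by cross-validation and keep the best one
    scores = [
        cross_val_score(
            DecisionTreeClassifier(criterion="entropy", ccp_alpha=a, random_state=42),
            X_train, y_train, cv=5, scoring="f1",
        ).mean()
        for a in ccp_alphas
    ]
    best_alpha = ccp_alphas[int(np.argmax(scores))]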

2.5 Rationale for Not Using SMOTE


I decided not to use SMOTE for the following reasons:

• Overfitting Risk: SMOTE generated synthetic samples that caused the model to overfit, as
it started predicting the minority class excessively.

• Data Integrity: The synthetic samples might not represent realistic scenarios, potentially
introducing noise.

• Alternative Approach: My custom resampling provided better control over the class distribution without significantly altering the original data.

3 Results and Observations


3.1 Optimal Alpha and Pruning
After performing cross-validation, the optimal alpha value was determined to be 0.25. Pruning the
tree with this alpha value resulted in a significant reduction in tree size and improved generalization.

3.2 Model Performance Before and After Pruning
I evaluated the model’s performance on the validation set before and after pruning (the metric computation is sketched after the results below):

• Performance Before Pruning:

– Accuracy: 0.8552
– Precision: 0.4149
– Recall: 0.5784
– F1-Score: 0.4832

• Performance After Pruning:

– Accuracy: 0.8588
– Precision: 0.4292
– Recall: 0.6276
– F1-Score: 0.5098
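A minimal sketch of how these metrics can be computed on the validation set, assuming a fitted tree exposing a predict method and the ’yes’ class encoded as 1:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_pred = model.predict(X_val)  # `model` is the tree before or after pruning
    print("Accuracy :", accuracy_score(y_val, y_pred))
    print("Precision:", precision_score(y_val, y_pred, pos_label=1))
    print("Recall   :", recall_score(y_val, y_pred, pos_label=1))
    print("F1-Score :", f1_score(y_val, y_pred, pos_label=1))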

Pruning produced a small but consistent improvement across all four metrics, with the largest gain in recall. In addition, the pruned tree is significantly less complex, which is beneficial for interpretability and may generalize better to unseen data.

3.3 Comparison of Tree Sizes

                 Before Pruning    After Pruning
Total Nodes      5989              523
Total Leaves     2995              262

Table 1: Comparison of Tree Sizes Before and After Pruning

As shown in Table 1, pruning reduced the total number of nodes from 5989 to 523 and the number
of leaves from 2995 to 262. This significant reduction in complexity indicates that many branches in
the unpruned tree were not contributing to better performance.
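The node and leaf counts in Table 1 can be obtained with a simple recursive walk over the tree. A minimal sketch, assuming each internal node stores left and right children and a leaf has both set to None (my actual node representation may differ):

    def count_nodes_and_leaves(node):
        """Recursively count (total nodes, total leaves) in a binary decision tree."""
        if node is None:
            return 0, 0
        if node.left is None and node.right is None:  # leaf node
            return 1, 1
        ln, ll = count_nodes_and_leaves(node.left)
        rn, rl = count_nodes_and_leaves(node.right)
        return 1 + ln + rn, ll + rl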

3.4 Visualizations

Figure 1: Feature Importance

Figure 2: Confusion Matrix of the Pruned Model on Validation Data

Figure 3: ROC Curve of the Pruned Model on Validation Data

4 Conclusion
In conclusion, implementing a Decision Tree classifier with post-pruning effectively addressed the
issues of overfitting and class imbalance. By adjusting the class distribution so that the ’yes’ class made up one-third of the training data, the model learned the patterns associated with the minority class more effectively without overfitting. Post-pruning with an optimal alpha value of 0.25 significantly simplified
the tree without compromising performance, enhancing the model’s generalization capabilities.

5 References
1. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.

2. Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. Elsevier.

3. Imbalanced-learn Documentation. https://imbalanced-learn.org/stable/

4. Scikit-learn Documentation. https://scikit-learn.org/stable/modules/tree.html

5. SMOTE: Synthetic Minority Over-sampling Technique. https://arxiv.org/abs/1106.1813
