
ARM 210

Introduction to Machine Learning
Project Report

Submitted To: Dr. Amit Choudhary
Submitted By: Prarit Arora (AIML B1, 04919051623)
Email: aroraprarit017.pa@gmail.com
Contact: 9999538421
Digit Recognition Using Classical Machine
Learning Models

Link to Notebook

Abstract

Handwritten digit recognition is a classical problem in machine learning and computer
vision, often used to benchmark model performance. This project utilizes the UCI Digits
dataset to evaluate various traditional machine learning classifiers on their ability to
identify handwritten digits (0–9). A range of models including Logistic Regression, K-
Nearest Neighbors, Support Vector Machine, Decision Tree, and Random Forest were
implemented. Their performance was compared using metrics such as accuracy,
precision, recall, F1-score, confusion matrices, and cross-validation scores. Preprocessing
techniques, including normalization and an attempted dimensionality reduction using
PCA, are discussed. Support Vector Machine achieved the highest performance among
all models.

Keywords

Handwritten Digit Recognition, Supervised Learning, Classification Algorithms, Model Evaluation, Confusion Matrix

1. Introduction

Handwritten digit classification is a well-known pattern recognition problem and serves
as an ideal case study for evaluating various supervised learning algorithms. The task is
to automatically recognize digits written by hand, which is foundational to applications
like postal code recognition, bank check processing, and digit-based entry systems.

This study uses the UCI Digits dataset, which is smaller and more lightweight compared
to MNIST, making it ideal for quick prototyping and comparisons.
2. Dataset Overview

 Dataset Source: UCI Machine Learning Repository (via sklearn.datasets.load_digits)
 Shape: 1797 images of 8x8 pixels (64 features per image)
 Classes: 10 (Digits 0 through 9)
 Format: Each image is flattened into a 1D array of 64 pixel intensity values

Each sample in the dataset represents a grayscale digit image. Pixel values range from 0
to 16.
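The dataset described above ships with scikit-learn, so it can be loaded in a few lines; a minimal sketch confirming the shapes and value range stated here:

```python
from sklearn.datasets import load_digits

# Load the UCI Digits dataset bundled with scikit-learn.
digits = load_digits()

# 1797 samples, each an 8x8 grayscale image flattened to 64 features.
print(digits.data.shape)    # (1797, 64)
print(digits.target.shape)  # (1797,)

# Pixel intensities range from 0 to 16.
print(digits.data.min(), digits.data.max())  # 0.0 16.0
```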

3. Data Preprocessing

 Normalization: Since pixel values range from 0–16, all values were normalized
by dividing by 16 to bring them into the [0, 1] range, which often improves model
convergence and accuracy.
 Train-Test Split:
o 80% for training (1437 samples)
o 20% for testing (360 samples)
o Stratified split was used to ensure class distribution remains consistent
across sets.
 Principal Component Analysis (PCA):
o PCA was attempted to reduce dimensionality and possibly enhance
performance.
o However, applying PCA led to a slight drop in accuracy, possibly due to
loss of information critical for classification. Hence, the raw normalized
features were retained.
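The preprocessing steps above can be sketched as follows. This is an illustrative reconstruction, not the report's exact code: the `random_state=42` seed and the 95%-variance PCA threshold are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

digits = load_digits()
X = digits.data / 16.0  # normalize pixel values into [0, 1]
y = digits.target

# Stratified 80/20 split keeps the per-digit class balance across sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_train), len(X_test))  # 1437 360

# PCA attempt: keep enough components for 95% of the variance (assumed
# threshold). The report found a slight accuracy drop with PCA, so the
# raw normalized features were retained for modeling.
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
```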

4. Models Used

1. Logistic Regression

 A baseline linear classifier that works well with normalized numeric data.
 Surprisingly effective for this task, achieving over 93% accuracy.

2. K-Nearest Neighbors (KNN)

 A non-parametric model that classifies based on the majority class of its k closest
neighbors.
 It performed extremely well, achieving 98.6% accuracy, as digit images
tend to cluster well in pixel space.

3. Support Vector Machine (SVM)

 A powerful classifier that finds the optimal hyperplane to separate classes using
kernel tricks (RBF used here).
 This model achieved the highest accuracy of all: over 99.1%.

4. Decision Tree

 A simple and interpretable model that recursively splits data based on feature
values.
 Its performance was the weakest among all, with an accuracy of 83.3%.

5. Random Forest

 An ensemble model of multiple decision trees, helping reduce overfitting and improve generalization.
 Achieved 96.1% accuracy — much better than a single tree.
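All five models can be fit with a common scikit-learn interface; a minimal training sketch, where the hyperparameters shown (k=5 neighbors, 100 trees, `max_iter=1000`, seeds) are illustrative defaults rather than the report's exact settings:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
X, y = digits.data / 16.0, digits.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# The five classifiers compared in this report.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Support Vector Machine": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and record its test-set accuracy.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.4f}")
```

Exact accuracies depend on the split seed, but the ranking (SVM and KNN on top, a single decision tree at the bottom) is stable.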
5. Evaluation Metrics

The following metrics were used for evaluation:

 Accuracy: Ratio of correctly predicted instances over total instances.
 Precision (weighted): True positives / (True positives + False positives),
weighted by class.
 Recall (weighted): True positives / (True positives + False negatives), weighted
by class.
 F1-Score (weighted): Harmonic mean of precision and recall.
 Confusion Matrix: Shows detailed breakdown of actual vs predicted classes.
 Cross-Validation (5-fold): Measures model stability across multiple subsets.
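These metrics map directly onto functions in sklearn.metrics; a sketch computing them for one classifier (SVM with an RBF kernel is used here as the example, and the split seed is an assumption):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

digits = load_digits()
X, y = digits.data / 16.0, digits.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = SVC(kernel="rbf").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Weighted averaging accounts for the (slightly) unequal class sizes.
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average="weighted")
rec = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")

# 10x10 table of actual (rows) vs predicted (columns) digit classes.
cm = confusion_matrix(y_test, y_pred)
print(acc, prec, rec, f1)
```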

6. Results

Metric Comparison Table:

Model                    Accuracy  Precision  Recall  F1-Score
Logistic Regression      0.9361    0.9366     0.9361  0.9353
K-Nearest Neighbors      0.9861    0.9867     0.9861  0.9861
Support Vector Machine   0.9917    0.9920     0.9917  0.9917
Decision Tree            0.8333    0.8372     0.8333  0.8335
Random Forest            0.9611    0.9620     0.9611  0.9609

Cross-Validation Scores (5-fold):

 SVM: 0.9882 ± 0.0052
 KNN: 0.9882 ± 0.0087
 Random Forest: 0.9756 ± 0.0062
 Logistic Regression: 0.9429 ± 0.0061
 Decision Tree: 0.8427 ± 0.0233
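The mean ± standard deviation scores above come from 5-fold cross-validation, which can be reproduced in sketch form (shown here for the SVM; default `SVC` hyperparameters are an assumption, so the exact numbers may differ slightly):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data / 16.0, digits.target

# 5-fold cross-validation: five train/validate rounds, each holding out
# a different fifth of the data. Mean ± std summarizes model stability.
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
print(f"SVM: {scores.mean():.4f} ± {scores.std():.4f}")
```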
7. Conclusion

Among the evaluated models, Support Vector Machine (SVM) performed the best with
an accuracy of 99.17% on the test set and strong cross-validation performance. KNN was
a close second, showing the strength of instance-based learning for small image datasets.
Decision Tree, while simple and fast, underperformed likely due to its tendency to
overfit small datasets. Random Forest demonstrated a strong balance between
interpretability and performance.
