9/7/2018                                                  komal_digits_recognition_dataset
The digits recognition dataset
           In [18]: from sklearn import datasets
                    import matplotlib.pyplot as plt
                    import numpy as np
           In [4]: # Load the digits dataset: digits
                   digits = datasets.load_digits()
           In [5]: # Print the keys of the dataset
                   print(digits.keys())
                    dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
           In [15]: digits['data']
           Out[15]: array([[ 0.,       0.,    5., ..., 0.,         0.,     0.],
                           [ 0.,       0.,    0., ..., 10.,        0.,     0.],
                           [ 0.,       0.,    0., ..., 16.,        9.,     0.],
                           ...,
                           [ 0.,       0., 1., ..., 6.,            0.,     0.],
                           [ 0.,       0., 2., ..., 12.,           0.,     0.],
                           [ 0.,       0., 10., ..., 12.,          1.,     0.]])
           In [16]: digits['target']
           Out[16]: array([0, 1, 2, ..., 8, 9, 8])
file:///D:/komal/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/codes%20in%20pdf/komal_digits_recognition_dataset.h…   1/4
           In [6]: print(digits.DESCR)
                    Optical Recognition of Handwritten Digits Data Set
                    ===================================================
                    Notes
                    -----
                    Data Set Characteristics:
                        :Number of Instances: 5620
                        :Number of Attributes: 64
                        :Attribute Information: 8x8 image of integer pixels in the range 0..16.
                        :Missing Attribute Values: None
                        :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
                        :Date: July; 1998
                    This is a copy of the test set of the UCI ML hand-written digits datasets
                    http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
                    The data set contains images of hand-written digits: 10 classes where
                    each class refers to a digit.
                    Preprocessing programs made available by NIST were used to extract
                    normalized bitmaps of handwritten digits from a preprinted form. From a
                    total of 43 people, 30 contributed to the training set and different 13
                    to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
                    4x4 and the number of on pixels are counted in each block. This generates
                    an input matrix of 8x8 where each element is an integer in the range
                    0..16. This reduces dimensionality and gives invariance to small
                    distortions.
                    For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
                    T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
                    L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
                    1994.
                    References
                    ----------
                      - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
                        Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
                        Graduate Studies in Science and Engineering, Bogazici University.
                      - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
                      - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
                        Linear dimensionality reduction using relevance weighted LDA. School of
                        Electrical and Electronic Engineering Nanyang Technological University.
                        2005.
                      - Claudio Gentile. A New Approximate Maximal Margin Classification
                        Algorithm. NIPS. 2000.
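
The DESCR text above says each 32x32 bitmap is split into non-overlapping 4x4 blocks and the "on" pixels in each block are counted, giving an 8x8 matrix of integers in 0..16. A minimal sketch of that counting step follows. It assumes a binary NumPy array as input; the `block_count` helper and the random bitmap are hypothetical illustrations, not the NIST preprocessing routines themselves.

```python
import numpy as np

def block_count(bitmap, block=4):
    """Count 'on' pixels in each non-overlapping block x block tile."""
    rows, cols = bitmap.shape
    # Split each axis into (number of tiles, tile size), then sum within tiles.
    tiles = bitmap.reshape(rows // block, block, cols // block, block)
    return tiles.sum(axis=(1, 3))

# Hypothetical 32x32 binary bitmap (random, for illustration only).
rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(32, 32))

features = block_count(bitmap)
print(features.shape)                  # 8x8 grid of block counts
print(features.min(), features.max())  # every count lies within 0..16
```

Each output cell summarizes 16 input pixels, which is why shifting a stroke by a pixel or so often changes the counts only slightly.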
           In [7]: # Print the shape of the images and data keys
                   print(digits.images.shape)
                   print(digits.data.shape)
                    (1797, 8, 8)
                    (1797, 64)
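
The two shapes above describe the same pixels: `digits.data` holds each 8x8 image flattened into a row of 64 values. A short check, reloading the dataset so the snippet runs on its own (the variable name `flattened` is only for illustration):

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into a row of 64 values.
flattened = digits.images.reshape(len(digits.images), -1)

print(flattened.shape)
print(np.array_equal(flattened, digits.data))
```

This is why `digits.data` can be passed directly to a classifier while `digits.images` is the convenient form for `plt.imshow`.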
           In [8]: # Display digit 1010
                   plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
                   plt.show()
    APPLICATION OF KNN
           In [9]: from sklearn.model_selection import train_test_split
                   from sklearn.neighbors import KNeighborsClassifier
           In [10]: X = digits.data
                    y = digits.target
           In [11]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
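
The `stratify=y` argument asks `train_test_split` to keep the share of each digit roughly the same in the training and test sets. A small sketch to check this, repeating the split with the same arguments so it runs standalone:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of each digit class in the full, training, and test labels.
for name, labels in [("all", y), ("train", y_train), ("test", y_test)]:
    print(name, np.round(np.bincount(labels) / len(labels), 3))
```

The three printed rows should be nearly identical, so a class cannot end up under-represented in the test set by chance.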
           In [12]: knn = KNeighborsClassifier(n_neighbors=7)
           In [13]: knn.fit(X_train,y_train)
           Out[13]: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                               metric_params=None, n_jobs=1, n_neighbors=7, p=2,
                               weights='uniform')
           In [14]: print(knn.score(X_test, y_test))
                    0.9833333333333333
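
For a classifier, `knn.score` reports mean accuracy: the fraction of test samples whose predicted label matches the true label. The sketch below recomputes it by hand with the same split and `n_neighbors=7` as above; `manual_accuracy` is a name introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.2, random_state=42, stratify=digits.target,
)

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Accuracy by hand: share of test images whose predicted digit is correct.
y_pred = knn.predict(X_test)
manual_accuracy = np.mean(y_pred == y_test)

print(manual_accuracy)
print(knn.score(X_test, y_test))
```

Both printed values should agree.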
    Overfitting and underfitting
           In [19]: # Setup arrays to store train and test accuracies
                    neighbors = np.arange(1, 9)
                    train_accuracy = np.empty(len(neighbors))
                    test_accuracy = np.empty(len(neighbors))
                    # Loop over different values of k
                     for i, k in enumerate(neighbors):
                         # Setup a k-NN Classifier with k neighbors: knn
                         knn = KNeighborsClassifier(n_neighbors=k)
                         # Fit the classifier to the training data
                         knn.fit(X_train, y_train)
                         # Compute accuracy on the training set
                         train_accuracy[i] = knn.score(X_train, y_train)
                         # Compute accuracy on the testing set
                         test_accuracy[i] = knn.score(X_test, y_test)
                    # Generate plot
                    plt.title('k-NN: Varying Number of Neighbors')
                    plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
                    plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
                    plt.legend()
                    plt.xlabel('Number of Neighbors')
                    plt.ylabel('Accuracy')
                    plt.show()
  OBSERVATIONS MADE: A low value of K gives a more complex model that tends to overfit; a high value of K gives a smoother model that tends to underfit.
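
The two extremes of K can be seen in a small from-scratch sketch. It uses hypothetical toy data and a hypothetical `knn_predict` helper written in plain NumPy, not scikit-learn's implementation, and it uses brute-force distances that suit only small arrays.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict labels by majority vote among the k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Indices of the k closest training points for each query point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote; ties go to the smallest label.
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])

# Hypothetical toy data: two noisy 2-D classes of unequal size.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
               rng.normal(1.5, 1.0, size=(20, 2))])
y = np.array([0] * 30 + [1] * 20)

# k = 1: every training point is its own nearest neighbor (memorization).
acc_k1 = np.mean(knn_predict(X, y, X, k=1) == y)

# k = n: every point sees the whole training set, so the vote is always
# the overall majority class and the input is ignored.
pred_kn = knn_predict(X, y, X, k=len(X))

print(acc_k1)
print(np.unique(pred_kn))
```

With K = 1 the training accuracy is perfect because the model memorizes the training set, noise included. With K equal to the training set size every prediction is the majority class, which is the underfitting extreme.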