Machine Learning Guide: Meher Krishna Patel
Table of contents

2 Multiclass classification
   2.1 Introduction
   2.2 Iris-dataset
       2.2.1 Load the dataset
       2.2.2 Split the data as 'training' and 'test' data
   2.3 Conclusion
3 Binary classification
   3.1 Introduction
   3.2 Dataset
   3.3 Extract the data i.e. 'features' and 'targets'
   3.4 Prediction
   3.5 Rock vs Mine example
   3.6 Conclusion
4 Regression
   4.1 Noisy sine wave dataset
   4.2 Regression model
   4.3 Conclusion
5 Cross validation
   5.1 Introduction
   5.2 Cross validation
   5.3 Splitting of data
       5.3.1 Manual shuffling
       5.3.2 Automatic shuffling (KFold, StratifiedKFold and ShuffleSplit)
   5.4 Template for comparing algorithms
6 Clustering
   6.1 Introduction
   6.2 KMeans
7 Dimensionality reduction
   7.1 Introduction
   7.2 Principal component analysis (PCA)
       7.2.1 Create dataset
       7.2.2 Reduce dimension using PCA
       7.2.3 Compare the performances
   7.3 Usage of PCA for dimensionality reduction method
   7.4 PCA limitations
   7.5 Conclusion
9 Pipeline
   9.1 Introduction
   9.2 Pipeline
11 Image recognition
   11.1 Introduction
   11.2 Fetch the dataset
   11.3 Plot the images
   11.4 Prediction using SVM model
   11.5 Convert features to images
Chapter 1

Machine learning terminologies

1.1 Introduction
In this chapter, we will understand the basic building blocks of the SciKit-Learn library. Further, we will discuss the various types of machine learning algorithms, along with several terms which are used in the machine learning process.
Machine learning is a part of the data analysis process. The data analysis process involves the following steps,
   •   Collecting the data from various sources
   •   Cleaning and rearranging the data e.g. filling the missing values from the dataset etc.
   •   Exploring the data e.g. checking the statistical values of the data and visualizing the data using plots etc.
   •   Modeling the data using suitable machine learning algorithms.
   •   Lastly, checking the performance of the newly created model.
In this tutorial, we will cover all the steps of the data analysis process except the first step, i.e. data collection. We will use data which is already available on various websites.
Important: Data analysis requires knowledge of multiple fields, e.g. data cleaning using the Python or R language, and good knowledge of mathematics for measuring the statistical parameters of the data. Also, we need knowledge of the specific domain to which we want to apply the machine learning algorithm. Lastly, we must have an understanding of the machine learning algorithms themselves.
In conventional programming, we write code to solve a problem, and that code can solve only that particular type of problem. This is known as the 'hard coding' method. In the machine learning process, however, the code is designed to find patterns in datasets in order to solve the problem; therefore it is more general and can make decisions on new problems as well. This difference is shown in Table 1.1.
Lastly, machine learning can be defined as the process of extracting knowledge from the data, such that an accurate prediction can be made on future data. In other words, machine learning algorithms are able to predict the outcomes of new data based on their training.
In this section, we will see the basic building blocks of the SciKit library along with several terms used in the machine learning process.
In SciKit, data is stored in a two-dimensional form, whose two axes are known as 'samples' and 'features'.
Note:
   • Samples: Each dataset contains a certain number of samples.
   • Features: Each sample has some features, e.g. if we have samples of lines, then the features of these lines can be the 'x' and 'y' coordinates.
   • All samples should have an identical set of features in SciKit. For example, all the lines should have only two features, i.e. the 'x' and 'y' coordinates. If some lines have a third feature, e.g. 'thickness of line', then we need to append/delete this feature so that it is consistent across all the lines.
1.3.2 Target
   • Target: There may be a certain number of possible outputs for the data, which are known as the 'target'. For example, the points can be on a 'straight line' or on a 'curved line'. Therefore, the possible targets for this case are 'line' and 'curve'.
   • Different names are used for 'targets' and 'features', as shown in Table 1.2.
Let's understand this with an example. The SciKit library includes some input data as well. First we will use these datasets, and later we will read the data from files for data analysis.
   • The datasets stored in the SciKit library can be loaded as shown below,
   • Now, we can see the data stored in 'iris'. Note that the dataset is stored in the form of a 'dictionary'.
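The loading step itself is not reproduced in this extract; a minimal sketch using SciKit's built-in loader (the variable name 'iris' matches the usage below):

from sklearn.datasets import load_iris

iris = load_iris()   # returns a dictionary-like object holding the samples and targets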
>>> iris.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
>>> iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
   • 'data': It contains the samples of the dataset, e.g. this dataset contains 150 samples and each sample has four features. In the results below, the first three entries of the data are shown. The names of the columns (i.e. the features of the data) are given by 'feature_names', e.g. the first column stores the sepal length.
   • 'target': It contains the possible outputs for the data (optional). This is required for supervised learning, which will be discussed in this chapter. Here '0' represents the 'setosa' family of the Iris flower.
           >>> iris.target
           array([0, 0, 0, 0, ..., 0, 1, 1, 1, ..., 2, 2, 2])
           >>> iris.DESCR
           'Iris Plants Database\n====================\n [...]
Note: Following are the important points about the dataset, which we discussed in this section,
   • Datasets have samples of data, where each sample includes some features.
   • All the features should be available in every sample. If there are missing/extra features in some samples, then we need to add/remove those features so that the data is consistent for SciKit.
   • Also, the dataset may contain the 'target' values in it.
Machine learning can be divided into two categories, i.e. supervised and unsupervised, as shown in this section.
In Supervised Learning, we have a dataset which contains both the input 'features' and the output 'target', as discussed in Section 1.3.3, where the Iris flower dataset has both 'features' and 'target'.
The supervised learning can be further divided into two categories i.e. classification and regression.
   • Classification: In classification the targets are discrete, i.e. there is a fixed number of possible output values, e.g. in Section 1.3.3 there are only three types of flower. These outputs are represented using strings, e.g. (Male/Female), or with a fixed number of integers, as shown for the 'iris' dataset in Section 1.3.3 where 0, 1 and 2 are used for the three types of flower.
         – If the target has only two possible values, then it is known as 'binary classification'.
        – If the target has more than two possible values, then it is known as ‘multiclass classification’.
   • Regression: In regression the targets are continuous, e.g. suppose we want to calculate the 'age of an animal (i.e. target)' with the help of a 'fossil dataset (i.e. features)'. This is a regression problem, as age is a continuous quantity which does not have a fixed number of values.
In Unsupervised Learning, the dataset contains only 'features' and 'no target'. Here, we need to find the relationships between the various items of data. In other words, we have to find the labels from the given dataset.
Unsupervised learning can be divided into three categories, i.e. Clustering, Dimensionality reduction and Anomaly detection.
   • Clustering: It is the process of reducing the number of observations. This is achieved by collecting similar data points into one class.
   • Dimensionality reduction: This is the reduction of higher dimensional data to 2-dimensional or 3-dimensional data, as it is easy to visualize the data in 2-dimensional and 3-dimensional form.
   • Anomaly detection: This is the process of removing undesired data from the dataset.
Note: Sometimes these two methods, i.e. supervised and unsupervised learning, are combined. For example, unsupervised learning can be used to find useful features and targets, and then these features can be used by the supervised learning method.
For example, consider the 'titanic' dataset, where we have all the information about the passengers, e.g. age, gender, traveling-class and the number of people who died during the accident. Here, we need to find the relationships between the various items of data, e.g. people who were traveling in a higher class may have had a higher chance of survival.
   • Below is the summary of this section. Table 1.3 shows the types of machine learning, and Table 1.4 shows the types of variables in machine learning algorithms.
1.5 Conclusion
In this chapter, we discussed various terms used in machine learning algorithms, which are summarized in Table 1.2, Table 1.3 and Table 1.4. In the next chapter, we will see an example of 'multiclass classification'.
Chapter 2
Multiclass classification
2.1 Introduction
In this chapter, we will use the 'Iris-dataset' which is available in the 'SciKit library'. Here, we will use the 'KNeighborsClassifier' for training on the data, and then the trained model is used to predict the outputs for the test data. Finally, the predicted outputs are compared with the desired outputs.
2.2 Iris-dataset
Let's see the Iris-dataset, which has the following features and targets available in it, as shown in Listing 2.1.
   • Features:
       – sepal length in cm
       – sepal width in cm
       – petal length in cm
       – petal width in cm
   • Targets:
       – Iris Setosa
       – Iris Versicolour
       – Iris Virginica
We have 150 samples in our data. We can divide it into two parts, i.e. 'training dataset' and 'testing dataset'. A good choice can be 80% training data and 20% test data.
Important: The training dataset must include all the possible 'targets' in it, otherwise the machine will not be trained for all the 'targets', and will generate huge errors when those targets appear in the test data. We can use the 'stratify' option of 'train_test_split', which takes care of this, as shown in Listing 2.2.
Here we will use the 'KNeighborsClassifier' class of 'sklearn' for training the machine. Let's write the code in the file. Here Lines 17-27 are used to create the training and test datasets. Then Line 36 instantiates an object of KNeighborsClassifier, which fits the model on the training data at Line 38. Next, the trained model is used to predict the outcome of the test data at Line 40. Finally, the prediction error is calculated at Line 44.
3    import numpy as np
4    from sklearn.datasets import load_iris
5    from sklearn.neighbors import KNeighborsClassifier
6    from sklearn.model_selection import train_test_split
7
34
$ python multiclass_ex.py
Accuracy: 0.933333333333
Note: We need to follow the below steps for training and testing the machine,
   •   Get the inputs, i.e. 'features', from the dataset.
   •   Get the desired outputs, i.e. 'targets', from the dataset.
   •   Next, split the dataset into 'training' and 'testing' data.
   •   Then train the model using the 'fit' method on the 'training' data.
   •   Finally, predict the outputs for the 'test data', and print and plot the outputs in different formats. This printing and plotting operation will be discussed in the next chapter.
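Since Listing 2.2 is reproduced only partially above, below is a minimal self-contained sketch that follows these steps; the variable names are illustrative and not necessarily the author's.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# get the 'features' and 'targets' from the dataset
iris = load_iris()
features, targets = iris.data, iris.target

# split into 'training' and 'testing' data; 'stratify' keeps all targets in both parts
train_f, test_f, train_t, test_t = train_test_split(
    features, targets, test_size=0.2, stratify=targets)

# train the model using 'fit' on the training data
classifier = KNeighborsClassifier()
classifier.fit(train_f, train_t)

# predict the outputs for the test data and measure the accuracy
prediction = classifier.predict(test_f)
print("Accuracy:", np.mean(prediction == test_t))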
2.3 Conclusion
In this chapter, we learned to split the dataset into 'training' and 'test' data. Then the training data was used to fit the model, and finally the model was used to predict the outputs for the test data of a 'classification problem'. In the next chapter, we will discuss the 'binary classification problem'. Also, we will read the data from a file, instead of using the inbuilt dataset of SciKit.
Chapter 3

Binary classification
3.1 Introduction
In Chapter 2, we saw an example of 'classification', which was performed on data that was already available in SciKit. In this chapter, we will read the data from an external file. Here the "Hill-Valley" dataset is used, which is available at the UCI Repository, and which contains 100 input points (i.e. features) per sample. Based on these points, the output (i.e. 'target') is assigned one of two values, i.e. "1 for Hill" or "0 for Valley". Fig. 3.1 shows the graph of these points for a valley and a hill. Further, we will use the "LogisticRegression" model for classification in this chapter. It is a linear model, which finds a line to separate the 'hill' from the 'valley'.
Note that there are different datasets available on the website, i.e. noisy and without noise. In this chapter, we will use the dataset without any noise. Lastly, we can download different datasets from the website according to our study, e.g. data for regression problems, classification problems or mixed problems etc.
3.2 Dataset
Let's quickly see the contents of the dataset "Hill_Valley_without_noise_Training.data", as shown in Listing 3.1. Fig. 3.2 shows the plot of Rows 10 and 11 of the data, which represent a "hill" and a "valley" respectively. In Listing 3.1, Lines 12-23 read the data, clean it (i.e. remove the header line and line-breaks etc.) and change it into the desired format (i.e. make a list of lists and then a numpy array). This process is known as data-cleaning and data-transformation, which constitutes 70%-90% of the work in machine-learning tasks.
3    # 1:hill, 0:valley
4
8    f = open("data/Hill_Valley_without_noise_Training.data", 'r')
9    data = f.read()
10   f.close()
11
42 plt.show()
$ python hill_valley.py
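The cleaning and transformation at Lines 12-23 are not reproduced above; a rough sketch of such a step is shown below, assuming the file is comma-separated with one header line and the class label in the last column (an illustration, not the author's exact code).

import numpy as np

f = open("data/Hill_Valley_without_noise_Training.data", 'r')
data = f.read()
f.close()

lines = data.splitlines()                                  # remove the line-breaks
lines = lines[1:]                                          # drop the header line
data_list = [line.split(',') for line in lines if line]    # list of lists of strings
data_list = np.array(data_list, dtype=float)               # numpy array of floats
print(data_list.shape)                                     # (samples, 100 features + 1 target)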
3.3 Extract the data i.e. 'features' and 'targets'

In Chapter 2, it was shown that machine-learning tasks require 'features' and 'targets'. In the current data, both are available in the dataset in combined form, i.e. the 'target' is available at the end of each data sample. Now, our task is to extract the 'features' and 'targets' into separate variables, so that the further code can be written easily. This can be done as shown in Listing 3.2,
3    # 1:hill, 0:valley
4
8    f = open("data/Hill_Valley_without_noise_Training.data", 'r')
9    data = f.read()
10   f.close()
11
42   # plt.show()
43
44
48   # features : last column i.e. target value will be removed from the dataset
49   features = np.zeros((row_sample, col_sample-1), float)
50   # target : store only last column
51   targets = np.zeros(row_sample, int)
52
69 plt.show()
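The loop that fills these two arrays is elided above; it is presumably along the lines of the following sketch (shown here with a tiny stand-in for 'data_list' so that it runs on its own).

import numpy as np

# stand-in for 'data_list': rows of features with the target stored in the last column
data_list = np.array([[0.1, 0.4, 0.9, 1.0],
                      [0.8, 0.5, 0.2, 0.0]])

row_sample, col_sample = data_list.shape

features = np.zeros((row_sample, col_sample-1), float)   # all columns except the last
targets = np.zeros(row_sample, int)                      # last column only

for i in range(row_sample):
    features[i] = data_list[i, :-1]
    targets[i] = int(data_list[i, -1])

# equivalently: features = data_list[:, :-1]; targets = data_list[:, -1].astype(int)
print(features)
print(targets)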
3.4 Prediction
Once the data is transformed into the desired format, the prediction task is quite straightforward, as shown in Listing 3.3. Here the following steps are performed for prediction,
        •   Split the data for training and testing (Lines 77-88).
        •   Select the classifier for modeling, and fit the data (Lines 90-93).
        •   Check the accuracy of prediction for the training set itself (Lines 95-98).
        •   Finally check the accuracy of the prediction for the test-data (Lines 100-103).
Note: The ‘accuracy_score’ is used here to calculate the accuracy (see Lines 97 and 102).
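Listing 3.3 is reproduced only partially below; a compact sketch of these four steps, written as a helper that expects the 'features' and 'targets' arrays extracted earlier, could look like this (an illustration, not the book's exact code).

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def fit_and_score(features, targets):
    """Split the data, fit LogisticRegression and report the train/test accuracy."""
    train_f, test_f, train_t, test_t = train_test_split(
        features, targets, test_size=0.2, stratify=targets)

    classifier = LogisticRegression()
    classifier.fit(train_f, train_t)                 # training using 'training data'

    self_pred = classifier.predict(train_f)          # self accuracy (training data)
    print("Accuracy for training data (self accuracy):", accuracy_score(train_t, self_pred))

    test_pred = classifier.predict(test_f)           # accuracy on the unseen test data
    print("Accuracy for test data:", accuracy_score(test_t, test_pred))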
3    # 1:hill, 0:valley
4
12
13   f = open("data/Hill_Valley_without_noise_Training.data", 'r')
14   data = f.read()
15   f.close()
16
47   # plt.show()
48
49
50   # extract targets
51   row_sample, col_sample = data_list.shape # extract row and columns in dataset
52
53   # features : last column i.e. target value will be removed from the dataset
54   features = np.zeros((row_sample, col_sample-1), float)
55   # target : store only last column
56   targets = np.zeros(row_sample, int)
57
74   # plt.show()
75
76
90    # use LogisticRegression
91    classifier = LogisticRegression()
92    # training using 'training data'
93    classifier.fit(train_features, train_targets) # fit the model for training data
94
      $ python hill_valley.py
      Accuracy for training data (self accuracy): 0.997933884298
      Accuracy for test data: 1.0
Note: In the Iris dataset in Chapter 2, the target depends directly on the input features, i.e. the width and length of the petal and sepal. But in the Hill-Valley problem, the output does not depend directly on the values of the individual inputs, but on the relative positions of certain inputs with respect to all the other inputs.
LogisticRegression assigns a weight to each of the features and then calculates a weighted sum for making decisions, e.g. if the sum is greater than 0 then 'hill', and if it is less than 0 then 'valley'. The coefficients which are assigned to each feature can be seen as below,
$ python -i hill_valley.py
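The interactive output is not reproduced in this extract; the learned weights can be inspected roughly as below (the values differ on every run, so only the shapes are shown here).

>>> classifier.coef_.shape       # one learned weight per input feature
(1, 100)
>>> classifier.intercept_.shape  # plus a single bias term
(1,)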
Also, the KNeighborsClassifier will not work here, as it looks for samples whose features are close to each other, assigns them the same 'target', and then decides the boundaries accordingly. But in the Hill-Valley case, a valley can be at the top of the graph, as shown in Fig. 3.1, or at the bottom of the graph. Similarly, a hill can be at the top of the graph or at the bottom. Therefore it is not possible to find nearest points for the Hill-Valley problem which can distinguish a hill from a valley. Hence, the KNeighborsClassifier will have an accuracy_score of about 0.5 (i.e. a random guess). We can verify this by importing "KNeighborsClassifier" and replacing "LogisticRegression" with "KNeighborsClassifier" in Listing 3.3.
3.5 Rock vs Mine example

The file "sonar.all-data" contains the patterns obtained by bouncing sonar signals off a metal cylinder and off rocks under similar conditions. The last column contains the target names, i.e. 'R' and 'M', where 'R' and 'M' represent rocks and metals respectively.
Note: Remember that in classification problems the targets must be discrete, and can have 'string' or 'number' values, as shown in Table 1.4.
As opposed to the previous section, here the 'targets' have a direct relationship with the 'features', therefore we can use both classifiers, i.e. "LogisticRegression" and "KNeighborsClassifier", as shown in Listing 3.4.
Since the targets are not numeric values, they are stored in a list as shown in Line 33 (instead of a numpy array). Select any one of the classifiers from Lines 55-56 and run the code to see the prediction accuracy.
13
14   f = open("data/sonar.all-data", 'r')
15   data = f.read()
16   f.close()
17
27   # extract targets
28   row_sample, col_sample = len(data_list), len(data_list[0])
29
30   # features : last column i.e. target value will be removed from the dataset
31   features = np.zeros((row_sample, col_sample-1), float)
32   # target : store only last column
33   targets = [] # targets are 'R' and 'M'
34
54   # select classifier
55   classifier = LogisticRegression()
56   # classifier = KNeighborsClassifier()
57
     (for LogisticRegression)
     $ python rock_mine.py
     Accuracy for training data (self accuracy): 0.795180722892
     Accuracy for test data: 0.761904761905
     (for KNeighborsClassifier)
     $ python rock_mine.py
     Accuracy for training data (self accuracy): 0.843373493976
     Accuracy for test data: 0.785714285714
3.6 Conclusion
In this chapter, we read the data from a file, and then converted it into the format which is used by the SciKit library for further operations. Further, we used the 'LogisticRegression' class for modeling the system, and checked the accuracy of the model for the training and test data.
Chapter 4
Regression
In previous chapters, we saw examples of supervised learning for 'classification' problems, i.e. where the 'targets' have a fixed number of possible values. In this chapter, we will see the other class of supervised learning, i.e. 'regression', where the 'targets' can have continuous values. Note that the 'features' can have continuous values in both cases.
Also, in previous chapters, we used SciKit's inbuilt dataset or read the dataset from a file. In this chapter, we will create the dataset ourselves.
4.1 Noisy sine wave dataset

Let's create a dataset where the 'features' are samples of the x-axis coordinates, whereas the 'targets' are noisy samples of a sine wave, i.e. uniformly distributed noise samples are added to the sine wave; the corresponding waveforms are shown in Fig. 4.1. This can be achieved as below,
Fig. 4.1: Sine wave + Uniformly distributed noise generated by Listing 4.1
3    import numpy as np
4    import matplotlib.pyplot as plt
5
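The generation code itself is elided above; a minimal sketch of such a dataset is shown below (the x-axis range and the noise amplitude are assumptions; the 100 samples are taken from the output further down).

import numpy as np
import matplotlib.pyplot as plt

N = 100                                    # number of samples
x = np.linspace(0, 2*np.pi, N)             # samples of the x-axis coordinates -> 'features'
noise = np.random.uniform(-0.3, 0.3, N)    # uniformly distributed noise
targets = np.sin(x) + noise                # noisy sine-wave samples -> 'targets'

plt.plot(x, np.sin(x), label='sine wave')
plt.scatter(x, targets, color='r', label='noisy samples')
plt.legend()
plt.show()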
Note: For the SciKit library, the features must be in 2-dimensional format, i.e. the features are a 'list of lists', whereas the target must be in 1-dimensional format. Currently, we have both in 1-dimensional format, therefore we need to convert the 'features' into 2-dimensional format, as shown in Listing 4.2.
3    import numpy as np
4    import matplotlib.pyplot as plt
5
     $ python regression_ex.py
     Before: (100,)
     After: (100, 1)
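The conversion line itself is elided; one common way (a sketch) is numpy's reshape, which turns the 1-dimensional array into one column per feature:

import numpy as np

x = np.linspace(0, 2*np.pi, 100)
print("Before:", x.shape)           # (100,)
features = x.reshape(-1, 1)         # 2-dimensional: one row per sample, one column per feature
print("After:", features.shape)     # (100, 1)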
4.2 Regression model

Now, we test the regression model, i.e. "LinearRegression", on the dataset as below, which follows similar steps as the classification problems. The predicted and the actual points of the sine wave are shown in Fig. 4.2.
3    import numpy as np
4    import matplotlib.pyplot as plt
5
29
     $ python regression_ex.py
     Accuracy for training data (self accuracy): 0.843858910263
     Accuracy for test data: 0.822872868183
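The modeling lines are elided above; a sketch of the remaining steps is shown below. Note that the 'score' of a regressor is the R^2 coefficient of determination, not a classification accuracy, which is the difference in scoring mentioned in the conclusion.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_sine_regression(features, targets):
    """Fit LinearRegression and report the R^2 scores (a sketch, not the exact listing)."""
    train_f, test_f, train_t, test_t = train_test_split(features, targets, test_size=0.2)

    regressor = LinearRegression()
    regressor.fit(train_f, train_t)                 # training using 'training data'

    print("Accuracy for training data (self accuracy):", regressor.score(train_f, train_t))
    print("Accuracy for test data:", regressor.score(test_f, test_t))
    return regressor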
4.3 Conclusion
In this chapter, we saw an example of a regression problem. Also, we saw the basic differences between the scoring in regression and classification problems.
Chapter 5

Cross validation
5.1 Introduction
In this chapter, we will enhance Listing 2.2 to understand the concept of 'cross validation'. Let's comment out Line 24 of Listing 2.2 as shown below and execute the code 7 times.
1    # multiclass_ex.py
2
3    import numpy as np
4    from sklearn.datasets import load_iris
5    from sklearn.neighbors import KNeighborsClassifier
6    from sklearn.model_selection import train_test_split
7
34
• Now execute the code 7 times; we will get a different 'accuracy' on each run.
     $ python multiclass_ex.py
     Accuracy: 0.966666666667
     $ python multiclass_ex.py
     Accuracy: 1.0
     $ python multiclass_ex.py
     Accuracy: 1.0
     $ python multiclass_ex.py
     Accuracy: 0.966666666667
     $ python multiclass_ex.py
     Accuracy: 1.0
     $ python multiclass_ex.py
     Accuracy: 0.966666666667
     $ python multiclass_ex.py
     Accuracy: 0.933333333333
     Note:
   • The 'accuracy' may change dramatically for some other datasets with different 'train' and 'test' splits. Therefore it is not a good measure for comparing two models.
   • Also, in this method of finding the accuracy, we have very little data as the 'test data'. Further, we have less training data as well due to the splitting.
To avoid these problems, the 'cross-validation' method is used for calculating the accuracy.
In the code below, the cross-validation value is set to 7, i.e. 'cv=7' at Line 48.
1    # multiclass_ex.py
2
3    import numpy as np
35
47   # cross-validation
48   scores = cross_val_score(classifier, features, targets, cv=7)
49   print("Cross validation scores:", scores)
50   print("Mean score:", np.mean(scores))
• Below are the outputs of the above code, which are the same for each run,
     $ python multiclass_ex.py
     Cross validation scores: [ 0.95833333     1.   0.95238095
         0.9047619   0.95238095 1. 1. ]
     Mean score: 0.966836734694
     $ python multiclass_ex.py
     Cross validation scores: [ 0.95833333     1.   0.95238095
          0.9047619   0.95238095 1. 1. ]
     Mean score: 0.966836734694
     $ python multiclass_ex.py
Warning:
   • Note that in cross-validation the data is not split randomly, therefore it is not good for data where the 'targets' are nicely ordered. Therefore, it is good to shuffle the targets before applying 'cross-validation', as shown in Listing 5.1.
   • Further, it does not create a model for predicting new samples; it only gives an idea about the accuracy of a model.
   • It takes more time to cross validate the dataset as the number of iterations increases, e.g. for cv=7, the data will be split into 7 parts and each part will be used once as the test data while the model is trained on the remaining parts, i.e. the model is fitted and evaluated 7 times.
3    import numpy as np
4    from sklearn.datasets import load_iris
5    from sklearn.neighbors import KNeighborsClassifier
6    from sklearn.model_selection import cross_val_score
7    from sklearn.model_selection import train_test_split
8
35
53   # cross-validation
54   scores = cross_val_score(classifier, features, targets, cv=7)
55   print("Cross validation scores:", scores)
56   print("Mean score:", np.mean(scores))
   • Below is the output of the above code. In the iris dataset we have an equal number of samples for each target, therefore the effect of shuffle and no-shuffle is almost the same, but it may vary when the targets do not have an equal distribution.
     $ python multiclass_ex.py
     Targets before shuffle:
      [0 0 0 0 0 0 0 ... 0 0 0 0 0
      1 1 1 1 1 1 1 1 ... 1 1 1 1 1
      2 2 2 2 2 2 2 2 ... 2 2 2 2 2
      ]
     Targets after shuffle:
      [2 1 0 2 0 2 0 1 1 1 2 1 1 1 ...
      1 1 1 2 0 2 0 0 1 2 2 2 2 1 2 ...
      1 0 2 1 0 1 2 1 0 2 2 2 2 0 0 ...
      ]
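The shuffling lines of Listing 5.1 are not reproduced above; they are essentially a random permutation applied to both arrays with the same index order, so that features and targets stay aligned (a sketch):

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
features, targets = iris.data, iris.target
print("Targets before shuffle:\n", targets)

shuffle_index = np.random.permutation(len(targets))   # one random order for both arrays
features = features[shuffle_index]
targets = targets[shuffle_index]
print("Targets after shuffle:\n", targets)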
The shuffling can be performed using inbuilt functions as well, as shown in the code below.
Note: The data is not shuffled in Listing 5.2, but is chosen randomly while splitting the data into the 'training data' and 'test data'. The following 3 options are available for splitting (select any one from Lines 55, 56 and 58),
   • KFold(n_splits=3, shuffle=True) : Shuffles the data and splits it into 3 equal parts (same as Listing 5.1).
   • StratifiedKFold(n_splits=3, shuffle=True) : KFold with the 'stratify' option (see Listing 2.2 for details).
   • ShuffleSplit(n_splits=3, test_size=0.2) : Randomly splits the data. Also, it has an option to define the size of the test data.
Warning: Note that in the Iris dataset the targets are stored in sorted order; therefore, if we use the option KFold(n_splits=3), i.e. no shuffling, then we will get an accuracy of '0', as each fold contains only one class and the model is always tested on a class which was not present in its training data. Hence it is a good idea to keep shuffle on.
3    import numpy as np
4    from sklearn.datasets import load_iris
5    from sklearn.neighbors import KNeighborsClassifier
6    from sklearn.model_selection import cross_val_score
7    from sklearn.model_selection import train_test_split
8    from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit
9
36
54   # cross-validation
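The lines that build the 'cv' object are elided above; a sketch of the three options together with 'cross_val_score' (select one of the three splitters):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit

iris = load_iris()
features, targets = iris.data, iris.target
classifier = KNeighborsClassifier()

cv = KFold(n_splits=3, shuffle=True)               # option 1: shuffled 3-fold split
# cv = StratifiedKFold(n_splits=3, shuffle=True)   # option 2: KFold with stratification
# cv = ShuffleSplit(n_splits=3, test_size=0.2)     # option 3: random splits of chosen size

scores = cross_val_score(classifier, features, targets, cv=cv)
print("Cross validation scores:", scores)
print("Mean score:", np.mean(scores))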
5.4 Template for comparing algorithms

As discussed before, the main usage of cross-validation is to compare various algorithms. This can be done as below, where 4 algorithms (Lines 9-12) are compared.
1    # cross_valid_ex.py
2
3    import numpy as np
4    import matplotlib.pyplot as plt
5    from sklearn.datasets import load_iris
6    from sklearn.model_selection import cross_val_score
7    from sklearn.model_selection import StratifiedKFold
8
20   models = []
21   models.append(('LogisticRegression', LogisticRegression()))
22   models.append(('KNeighborsClassifier', KNeighborsClassifier()))
23   models.append(('SVC', SVC()))
24   models.append(('DecisionTreeClassifier', DecisionTreeClassifier()))
25
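The loop that evaluates each model (around Line 29) is elided above; a sketch that matches the printed output format is shown below (the number of splits is an assumption).

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
features, targets = iris.data, iris.target

models = [('LogisticRegression', LogisticRegression()),
          ('KNeighborsClassifier', KNeighborsClassifier()),
          ('SVC', SVC()),
          ('DecisionTreeClassifier', DecisionTreeClassifier())]

cv = StratifiedKFold(n_splits=7, shuffle=True)   # 'cv=cv' in the original; n_splits assumed
for name, model in models:
    scores = cross_val_score(model, features, targets, cv=cv)
    print("Model:{0}, Score: mean={1:.5f}, var={2:.5f}".format(
        name, np.mean(scores), np.var(scores)))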
• Below is the output of the above code, where we can see that SVC performs better than the other algorithms.
$ python cross_valid_ex.py
Model:LogisticRegression, Score: mean=0.96088, var=0.00141
Model:KNeighborsClassifier, Score: mean=0.96088, var=0.00141
Model:SVC, Score: mean=0.97449, var=0.00164
Model:DecisionTreeClassifier, Score: mean=0.95408, var=0.00115
Warning: Note that different values of 'cv' will give different results, e.g. if we put cv=3 at Line 29 (instead of cv=cv), then we will get the following results, which show that 'KNeighborsClassifier' has the best performance.
 $ python cross_valid_ex.py
 Model:LogisticRegression, Score: mean=0.94690, var=0.00032
 Model:KNeighborsClassifier, Score: mean=0.98693, var=0.00009
 Model:SVC, Score: mean=0.97345, var=0.00008
 Model:DecisionTreeClassifier, Score: mean=0.96732, var=0.00111
Chapter 6

Clustering
6.1 Introduction
In this chapter, we will see examples of clustering. Let's understand clustering with an example first. In Listing 6.1, two lists, i.e. x and y, are plotted using a 'scatter' plot. We can see that the data can be divided into three clusters, as shown in Fig. 6.1.
Note: In Fig. 6.1, it is easy to see the clusters as the number of samples is very small; but the clusters cannot be visualized so easily if we have a huge number of samples, as shown later in this chapter. In those cases, the machine learning approach can be quite useful.
6.2 KMeans
1    # cluster_ex.py
2
     $ python cluster_ex.py
     [[-3 11]
      [25 66]
      [-2 13]
      [ 7 25]
      [-1 12]
      [ 9 27]]
   • Now, we can apply the "KMeans" algorithm to the transformed data, as shown in Listing 6.2. The clusters generated by the algorithm are shown in Fig. 6.2.
Note:
   • Centroids are the locations of the mean points generated by the KMeans algorithm; they can be obtained using 'cluster_centers_'.
   • Also, each point can be assigned a label using 'labels_'. Note that once we get the labels, we can use supervised learning for further analysis.
   • The number of samples should be higher than the number of clusters. For example, currently we have 6 samples; if we use "n_clusters=7", then an error will be generated.
   • We may need to increase the value of "n_clusters" to separate the outliers from the clustering. For example, in the current dataset, the point at location [25, 66] can be seen as an outlier, i.e. it may be in the dataset due to measurement error or noise. Since it is present in the dataset, it will affect the final locations of the clusters. In other words, if we put "n_clusters=2", then one cluster will capture the point [25, 66], and the second cluster will take the mean of the rest of the points, which may not be desirable. Therefore, we need to decide the value of "n_clusters" according to the dataset.
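Listing 6.2 itself is not reproduced here; a minimal sketch of such a KMeans fit is shown below (the data matches the six points printed above).

import numpy as np
from sklearn.cluster import KMeans

# the six 2D points printed above
data = np.array([[-3, 11], [25, 66], [-2, 13], [7, 25], [-1, 12], [9, 27]])

kmeans = KMeans(n_clusters=3)                    # look for 3 clusters
kmeans.fit(data)

print("Centroids:\n", kmeans.cluster_centers_)   # mean point of each cluster
print("Targets or Labels:\n", kmeans.labels_)    # cluster index assigned to each point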
$ python cluster_ex.py
     Centroids:
      [[ -2. 12.]
      [ 25. 66.]
      [ 8. 26.]]
     Targets or Lables:
      [0 1 0 2 0 2]
Tip: The KMeans algorithm is suitable when the number of samples is less than about 10000. If there are more than 10000 samples, then the MiniBatchKMeans algorithm can be used, which converges faster than KMeans, but the quality of the results may be reduced.
Chapter 7

Dimensionality reduction
7.1 Introduction
During the data collection process, our aim is to collect as much data as possible. During this process, it is possible that some of the 'features' are correlated. If the dataset has lots of features, then it is good to remove some of the correlated features, so that the data can be processed faster; but at the same time the accuracy of the model may be reduced.
PCA is one technique for reducing the dimensionality of the data, as shown in this section.
# dimension_ex.py
import numpy as np
import pandas as pd
#   feature values
x   = np.random.randn(1000)
y   = 2*x
z   = np.random.randn(1000)
# target values
t=len(x)*[0] # list of len(x)
for i, val in enumerate(z):
    if x[i]+y[i]+z[i] < 0:
        t[i] = 'N' # negative
    else:
        t[i] = 'P'
Warning: The output 't' depends on the variables 'x', 'y' and 'z'; therefore, if these variables are not correlated, then dimensionality reduction will result in severe performance degradation, as shown later in this chapter.
     $ python dimension_ex.py
                           0                   1                            2    3
     0     1.619558594848966   3.239117189697932          -1.7181741395151733    P
     1    0.7926656328473467  1.5853312656946934          -0.5003026519806438    P
     2 -0.40666904321652636 -0.8133380864330527           -0.5233957097467451    N
     3    -1.813173189559588  -3.626346379119176           -1.418416461398814    N
     4    0.4357818365640018  0.8715636731280036           1.7840245820080853    P
Now, we create the PCA model as shown in Listing 7.1, which will transform the above dataset into a new dataset which has only 2 features (instead of 3).
Note: PCA can only take 'numeric' features as inputs, therefore we need to 'drop' the 'categorical' features, as shown in Line 26. Next, we need to instantiate an object of the PCA class (Line 27) and then apply the 'fit' method (Line 28). Finally, we can transform our data using the 'transform' method, as shown in Line 29.
3    import numpy as np
4    import pandas as pd
5    from sklearn.decomposition import PCA
6
7    #   feature values
8    x   = np.random.randn(1000)
9    y   = 2*x
10   z   = np.random.randn(1000)
11
12   # target values
13   t=len(x)*[0] # list of len(x)
14   for i, val in enumerate(z):
15       if x[i]+y[i]+z[i] < 0:
16           t[i] = 'N' # negative
17       else:
18           t[i] = 'P'
19
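Lines 26-29, referenced above, are elided from this extract; a sketch of them is shown below (the exact DataFrame construction is an assumption).

df = pd.DataFrame(list(zip(x, y, z, t)))   # columns 0, 1, 2 = features, column 3 = target
X = df.drop(3, axis=1)                     # drop the categorical target column for PCA
pca = PCA(n_components=2)                  # keep only 2 principal components
pca.fit(X)
T = pca.transform(X)                       # transformed data with 2 features per sample
print(T[:3])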
• Following is the output of the above code, where the dataset now has only two features,
     $ python dimension_ex.py
     [[-2.54693351 -0.07879497]
      [ 0.42820972 -0.90158131]
      [-1.94145497 -1.70738801]
      ...,
      [-0.92088711 0.54590025]
      [-2.44899588 -1.403821 ]
      [-1.94568343 -0.50371273]]
Now, we will compare the performance of the system with and without dimensionality reduction.
The code which is added to Listing 7.1 is exactly the same as the code discussed in Listing 3.3, i.e. split the dataset into 'test' and 'training' parts and then check the score, as shown in the code below.
Here Lines 42-70 calculate the score for the 'without dimensionality reduction' case, whereas Lines 73-103 calculate the score for the 'dimensionality reduction using PCA' case.
3    import numpy as np
4    import pandas as pd
5    from sklearn.decomposition import PCA
6    from sklearn.linear_model import LogisticRegression
7    from sklearn.metrics import accuracy_score
8    from sklearn.model_selection import train_test_split
9
10
11   #   feature values
12   x   = np.random.randn(1000)
13   y   = 2*x
14   z   = np.random.randn(1000)
15
16   # target values
17   t=len(x)*[0] # list of len(x)
18   for i, val in enumerate(z):
19       if x[i]+y[i]+z[i] < 0:
20           t[i] = 'N' # negative
21       else:
22           t[i] = 'P'
23
56   # use LogisticRegression
57   classifier = LogisticRegression()
58   # training using 'training data'
59   classifier.fit(train_features, train_targets) # fit the model for training data
60
72
73
90   # use LogisticRegression
Note: Since 'x' and 'y' are completely correlated (i.e. y = 2*x), the performance with dimensionality reduction is exactly the same as in the case without reduction.
Also, we will get different results on each execution of the code, as 'x', 'y' and 'z' are randomly generated on each run.
$ python dimension_ex.py
   • Next, replace the value of 'y' at Line 13 of Listing 7.2 with the following value, and run the code again,
      [...]
      y = 2*x + np.random.randn(1000)
      [...]
Since noise is now added to 'x', the variables 'x' and 'y' are no longer completely correlated (but are still highly correlated); therefore the performance of the system will reduce slightly, as shown in the results below,
Note: Remember, the 'target' variable depends on 'x', 'y' and 'z', i.e. it is the sign of the sum of these variables. Therefore, as the correlation between the 'features' reduces, the performance after dimensionality reduction will also reduce.
$ python dimension_ex.py
   • Again, replace the value of 'y' at Line 13 of Listing 7.2 with the following value, and run the code again,
[...]
y = np.random.randn(1000)
[...]
Now 'x', 'y' and 'z' are completely independent of each other, therefore the performance will reduce significantly, as shown below,
Note: Each run will give a different result; below is the worst-case result, where the test-data accuracy is 0.575 (i.e. close to a probability of 0.5), which is equivalent to a random guess of the target.
$ python dimension_ex.py
 Warning: Note that PCA is very sensitive to scaling operations; more specifically, it maximizes variability based on the variances of the features.
 Due to this reason, it gives more weight to high-variance features, i.e. a high-variance feature will dominate the overall result.
 To avoid this problem, it is better to normalize the features before applying the PCA model, as shown in Section 8.4.
7.5 Conclusion
In this chapter, we learned the concept of dimensionality reduction and PCA. In the next chapter, we will see the usage of PCA in a practical problem.
Chapter 8

Preprocessing of the data using Pandas and SciKit

In previous chapters, we did some minor preprocessing of the data, so that it could be used by the SciKit library. In this chapter, we will do some preprocessing of the data to change the 'statistics' and the 'format' of the data, in order to improve the results of the data analysis.
The "chronic_kidney_disease.arff" dataset is used for this tutorial, which is available at the UCI Repository.
   • Let's read and clean the data first,
import pandas as pd
import numpy as np
     $ python kidney_dis.py
     Total samples: 157
     Partial data
         age bp      sg al su     rbc
     30 48 70 1.005 4 0       normal
     36 53 90 1.020 2 0 abnormal
     38 63 70 1.010 3 0 abnormal
     41 68 80 1.010 3 2       normal
In this dataset we have two 'targets', i.e. 'ckd' and 'notckd', in the last column ('classification'). It is better to save the 'targets' of a classification problem as 'color names' for plotting purposes. This helps in visualizing the scatter plot, as shown later in this chapter.
3    import pandas as pd
4    import numpy as np
5
26   targets = df['classification'].astype('category')
27   # save target-values as color for plotting
28   # red: disease, green: no disease
29   label_color = ['red' if i=='ckd' else 'green' for i in targets]
30   print(label_color[0:3], label_color[-3:-1])
Note: We can convert the 'categorical' targets (i.e. the strings 'ckd' and 'notckd') into 'numeric' targets (i.e. 0 and 1) using the ".cat.codes" attribute, as shown below,
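The conversion itself is not reproduced here; it is essentially a one-liner on the categorical series (a sketch, assuming the 'targets' series defined above):

numeric_targets = targets.cat.codes   # e.g. 'ckd' -> 0, 'notckd' -> 1
print(numeric_targets[:3])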
• Below are the first three and the last two samples of the 'label_color',
     $ python kidney_dis.py
     ['red', 'red', 'red'] ['green', 'green']
Let's perform the dimensionality reduction using PCA, which was discussed in Section 7.2.
Note that for PCA the features should be 'numeric' only. Therefore we need to remove the 'categorical' features from the dataset.
3    import pandas as pd
4    import numpy as np
5
26   targets = df['classification'].astype('category')
27   # save target-values as color for plotting
28   # red: disease, green: no disease
29   label_color = ['red' if i=='ckd' else 'green' for i in targets]
30   # print(label_color[0:3], label_color[-3:-1])
31
   • Below is the output of the above code. Note that if we compare the results below with the results of Listing 8.1, we can see that the 'rbc' column has been removed.
     $ python kidney_dis.py
     Partial data
         age bp      sg al su bgr
     30 48 70 1.005 4 0 117
     36 53 90 1.020 2 0       70
     38 63 70 1.010 3 0 380
     41 68 80 1.010 3 2 157
Let's perform dimensionality reduction using the PCA model, as shown in Listing 8.4. The results are shown in Fig. 8.1, where we can see that the model can fairly well separate the kidney-disease cases based on the provided features. In the next section we will improve this performance with some more preprocessing of the data.
3    import pandas as pd
4    import numpy as np
5    import matplotlib.pyplot as plt
6    from sklearn.decomposition import PCA
7
28   targets = df['classification'].astype('category')
29   # save target-values as color for plotting
30   # red: disease, green: no disease
31   label_color = ['red' if i=='ckd' else 'green' for i in targets]
32   # print(label_color[0:3], label_color[-3:-1])
33
46   pca = PCA(n_components=2)
47   pca.fit(df)
48   T = pca.transform(df) # transformed data
49   # change 'T' to Pandas-DataFrame to plot using Pandas-plots
50   T = pd.DataFrame(T)
51
It was shown in Section 7.4 that the overall performance of PCA is dominated by 'high variance features'. Therefore the features should be normalized before using the PCA model.
In the code below, the 'StandardScaler' preprocessing class is used to normalize the features, which sets 'mean=0' and 'variance=1' for all the features. Note the improvement in the results in Fig. 8.2, achieved just by adding one line in Listing 8.5.
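The added line is not reproduced in this extract; it is presumably a scaling step similar to the sketch below, applied to the numeric-feature DataFrame 'df' before PCA.

from sklearn import preprocessing

scaled = preprocessing.StandardScaler().fit_transform(df)   # mean=0 and variance=1 per feature
df = pd.DataFrame(scaled, columns=df.columns)               # keep it as a DataFrame for PCA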
If we want to use preprocessing in 'supervised learning', then it is better to 'split' the dataset into 'test' and 'train' parts first, and then apply the preprocessing to the 'training data' only. This is good practice, as in real-life problems we will not have the future data available for preprocessing.
3    import pandas as pd
4    import numpy as np
5    import matplotlib.pyplot as plt
6    from sklearn.decomposition import PCA
7    from sklearn import preprocessing
8
29   targets = df['classification'].astype('category')
30   # save target-values as color for plotting
31   # red: disease, green: no disease
32   label_color = ['red' if i=='ckd' else 'green' for i in targets]
33   # print(label_color[0:3], label_color[-3:-1])
34
50   pca = PCA(n_components=2)
51   pca.fit(df)
52   T = pca.transform(df) # transformed data
53   # change 'T' to Pandas-DataFrame to plot using Pandas-plots
54   T = pd.DataFrame(T)
55
Note that in Section 8.3.1 we dropped several 'categorical' features, as these cannot be used by PCA. But we can convert these features to 'numeric' features and use them in the PCA model.
Again, see the further improvement in the results in Fig. 8.3, achieved just by adding one line in Listing 8.6.
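The exact line used in Listing 8.6 is not shown in this extract; one common way to make categorical columns usable by PCA is pandas' one-hot encoding, sketched below (the author may have used a different encoding).

df = pd.get_dummies(df)   # expand each categorical column into numeric indicator columns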
3    import pandas as pd
4    import numpy as np
5    import matplotlib.pyplot as plt
6    from sklearn.decomposition import PCA
7    from sklearn import preprocessing
8
29   targets = df['classification'].astype('category')
30   # save target-values as color for plotting
31   # red: disease, green: no disease
32   label_color = ['red' if i=='ckd' else 'green' for i in targets]
33   # print(label_color[0:3], label_color[-3:-1])
34
53   pca = PCA(n_components=2)
54   pca.fit(df)
55   T = pca.transform(df) # transformed data
56   # change 'T' to Pandas-DataFrame to plot using Pandas-plots
57   T = pd.DataFrame(T)
58
Important: Let's summarize what we did in this chapter. We had a dataset which had a large number of features. PCA looks for correlations between these features and reduces the dimensionality. In this example, we reduced the number of features to 2 using PCA.
After the dimensionality reduction, we had only 2 features, therefore we could draw a scatter plot, which is easier to visualize. For example, we can clearly see the difference between 'ckd' and 'notckd' in the current example.
In conclusion, dimensionality reduction methods, such as PCA and Isomap, are used to reduce the dimensionality of the features to 2 or 3. Next, these 2 or 3 features can be plotted to visualize the information. It is important that the plot is in 2D or 3D format, otherwise it is very difficult for the eyes to visualize and interpret the information.
     Chapter 9
Pipeline
9.1 Introduction
A Pipeline takes 'a list of transforms' along with 'one estimator at the end' as its input. In this chapter, we will use the 'Pipeline' class to reimplement Listing 8.6.
9.2 Pipeline
In this section, Listing 8.6 is reimplemented using 'Pipeline'. In Listing 9.1, the Pipeline 'pca' is defined at Lines 56-60; a sketch of such a definition is given below. When the 'pca.fit(df)' operation is applied at Line 62, 'df' is sent through the Pipeline for processing and the model is fitted, and the fitted Pipeline is finally used at Line 63. This can be a very handy tool when we have a chain of preprocessing steps.
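A rough sketch of the definition (the step names are assumptions):

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# a StandardScaler followed by a 2-component PCA, chained into a single estimator
pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
])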
3    import pandas as pd
4    import numpy as np
5    import matplotlib.pyplot as plt
6    from sklearn.decomposition import PCA
7    from sklearn import preprocessing
8    from sklearn.pipeline import Pipeline
9
30   targets = df['classification'].astype('category')
31   # save target-values as color for plotting
32   # red: disease, green: no disease
33   label_color = ['red' if i=='ckd' else 'green' for i in targets]
34   # print(label_color[0:3], label_color[-3:-1])
35
54   # pca = PCA(n_components=2)
55
62   pca.fit(df)
63   T = pca.transform(df) # transformed data
64   # change 'T' to Pandas-DataFrame to plot using Pandas-plots
65   T = pd.DataFrame(T)
66
Chapter 10
Clustering with dimensionality reduction
10.1 Introduction
In previous chapters, we saw examples of 'clustering (Chapter 6)', 'dimensionality reduction (Chapter 7 and Chapter 8)' and 'preprocessing (Chapter 8)'. Further, in Chapter 8, the performance of the dimensionality reduction technique (i.e. PCA) was significantly improved by preprocessing the data.
Remember, in Chapter 7 we used the PCA model to reduce the dimensionality of the features to 2, so that a 2D plot could be drawn, which is easy to visualize. In this chapter, we will combine these three techniques together, so that we can get more information from the scatter plot.
Note: In this chapter, we will use the 'Wholesale customers' dataset, which is available at the UCI Repository.
Our aim is to cluster the data so that we can see which products are bought together by the customers. For example, if a person goes to the shop to buy some grocery, then it is quite likely that he will buy 'milk' as well, therefore we can put the 'milk' near the grocery items; similarly, it is quite unlikely that the same person will buy fresh vegetables at the same time.
If we can predict such behavior of the customers, then we can arrange the shop accordingly, which will increase the sales of the items. In this chapter, we will do the same.
   • First, load the dataset and drop the columns which have "Null" values,
# whole_sale.py
import pandas as pd
df = pd.read_csv('data/Wholesale customers data.csv')  # file name is an assumption
print(df.isnull().sum())  # count the Null values in each column
   • Following is the output of the above code. Note that there is no 'Null' value, therefore we do not need to drop anything.
$ python whole_sale.py
Channel             0
Region              0
Fresh               0
Milk                0
Grocery             0
   • Next, our aim is to find the buying-patterns of the customers, therefore we do not need the columns 'Channel' and 'Region' for this analysis. Hence we will drop these two columns (the drop step is sketched after the listing below),
1    # whole_sale.py
2
3    import pandas as pd
4
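The drop step itself is not visible in the excerpt above; a minimal sketch, assuming the DataFrame 'df' loaded earlier:

# 'Channel' and 'Region' are not needed for the buying-pattern analysis
df = df.drop(['Channel', 'Region'], axis=1)
print(df.head())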
   • Now perform the clustering as below. Note that 'Normalizer()' is used at Line 14 for the preprocessing. We can try different preprocessing methods as well, to visualize the outputs. The clustering step itself is sketched after the listing.
Note: After completing the chapter, try the following as well and see the outputs,
   • Use different 'preprocessing' methods, e.g. 'MaxAbsScaler' and 'StandardScaler' etc., and see the performance of the code.
   • Use different values of n_clusters, e.g. 2, 3 and 4 etc.
1    # whole_sale.py
2
3    import pandas as pd
4    from sklearn import preprocessing
5    from sklearn.cluster import KMeans
6
13   # preprocessing
14   T = preprocessing.Normalizer().fit_transform(df)
15
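The clustering lines are not reproduced in the excerpt above; a rough sketch of this step, assuming the normalized array 'T' from Line 14 and the KMeans import from the listing (n_clusters=3 follows the later discussion of Fig. 10.1):

# cluster the normalized data into 3 groups
kmeans = KMeans(n_clusters=3)
kmeans.fit(T)
labels = kmeans.labels_              # cluster label of each sample
centroids = kmeans.cluster_centers_  # coordinates of the 3 cluster centers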
Now, we will perform the dimensionality reduction using PCA. We will reduce the dimensions to 2.
     Important:
   • Currently, we perform the clustering first and then the dimensionality reduction, as we have only a few features in this example.
   • If we have a very large number of features, then it is better to perform the dimensionality reduction first and then use the clustering algorithm, e.g. KMeans.
1    # whole_sale.py
2
3    import pandas as pd
4    from sklearn import preprocessing
5    from sklearn.cluster import KMeans
6    from sklearn.decomposition import PCA
7
14   # preprocessing
15   T = preprocessing.Normalizer().fit_transform(df)
16
27   # Dimensionality reduction to 2
28   pca_model = PCA(n_components=2)
29   pca_model.fit(T) # fit the model
30   T = pca_model.transform(T) # transform the 'normalized model'
31   # transform the 'centroids of KMean'
32   centroid_pca = pca_model.transform(centroids)
33   # print(centroid_pca)
Finally, plot the results as below; the scatter plot is shown in Fig. 10.1. A rough sketch of these plotting steps is given after the list.
   • Lines 36-39 assign a color to each 'label' generated by KMeans at Line 24.
   • Lines 41-45 plot the components of the PCA model using the scatter-plot. Note that KMeans generates 3 clusters, which are used for coloring, therefore a total of 3 colors are displayed in the plot.
   • Lines 47-51 plot the 'centroids' generated by KMeans.
   • Lines 53-66 plot the 'feature names' along with the 'arrows'.
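In the sketch below, the transformed data 'T', the KMeans 'labels', the DataFrame 'df', the fitted 'pca_model' and 'centroid_pca' from the listing are assumed; the colors and arrow scaling are assumptions as well:

import matplotlib.pyplot as plt

# one color per cluster label
colors = ['red', 'green', 'blue']
label_color = [colors[i] for i in labels]

# scatter plot of the two PCA components, colored by cluster
plt.scatter(T[:, 0], T[:, 1], c=label_color, alpha=0.5)

# centroids of the clusters (after the same PCA transform)
plt.scatter(centroid_pca[:, 0], centroid_pca[:, 1], marker='x', s=100, c='black')

# arrows: projection of each original feature on the principal component axes
for i, name in enumerate(df.columns):
    plt.arrow(0, 0, pca_model.components_[0, i], pca_model.components_[1, i], color='gray', head_width=0.01)
    plt.annotate(name, (pca_model.components_[0, i], pca_model.components_[1, i]))

plt.show()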
Important:
   • The arrows are the projections of each feature on the principal component axes. These arrows represent the level of importance of each feature in the multidimensional scaling. For example, 'Frozen' and 'Fresh' contribute more than the other features.
   • From Fig. 10.1 we can conclude that the 'Fresh items such as fruits and vegetables' should be placed separately, whereas 'Grocery', 'Detergents_Paper' and 'Milk' should be placed close to each other.
1    # whole_sale.py
2
3    import pandas as pd
4    import matplotlib.pyplot as plt
5    from sklearn import preprocessing
6    from sklearn.cluster import KMeans
7    from sklearn.decomposition import PCA
8
15   # preprocessing
16   T = preprocessing.Normalizer().fit_transform(df)
17
28   # Dimensionality reduction to 2
29   pca_model = PCA(n_components=2)
30   pca_model.fit(T) # fit the model
31   T = pca_model.transform(T) # transform the 'normalized model'
32   # transform the 'centroids of KMean'
33   centroid_pca = pca_model.transform(centroids)
34   # print(centroid_pca)
35
68 plt.show()
Chapter 11
Image recognition
11.1 Introduction
In previous chapters, we saw examples of 'classification', 'regression', 'preprocessing', 'dimensionality reduction' and 'clustering'. In those examples we considered numeric and categorical features. In this chapter, we will again use 'numerical features', but these features will represent images.
Note: In Chapter 2, we used the Iris-dataset which is shipped with the SciKit library package; the datasets which are included in the SciKit library start with the prefix 'load_', e.g. load_iris.
In this chapter, we will use a dataset whose loader is available in SciKit, but whose data must be downloaded over an Internet connection. These datasets start with the prefix 'fetch_', e.g. 'fetch_olivetti_faces', as shown in the next section.
When the dataset 'fetch_olivetti_faces' is instantiated, the data will be downloaded and saved in ~/scikit_learn_data. Once the dataset has been downloaded, it will be reused from this directory.
Let's download the dataset and see its contents. Note that the dataset is downloaded during instantiation (Line 4), and not by Line 2.
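A minimal sketch of this step (the print statements are only illustrative):

# faces_ex.py

from sklearn.datasets import fetch_olivetti_faces

# the data is downloaded on the first run and cached in ~/scikit_learn_data
faces = fetch_olivetti_faces()

print(faces.keys())        # 'images', 'data', 'target', 'DESCR'
print(faces.images.shape)  # (400, 64, 64)
print(faces.data.shape)    # (400, 4096)
print(faces.target.shape)  # (400,)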
Note: In the dataset, there are images of 40 people with 10 different poses each, e.g. smiling and angry faces etc. Therefore, there are a total of 400 samples (i.e. 40x10).
Following is the output of the above code. Note that there are a total of 400 samples and the image size is (64, 64), which is stored as a feature vector of size 4096 (i.e. 64x64).
    $ python faces_ex.py
Note: Please look at the values of the 'images', 'data' and 'targets' as well, as shown below,
    $ python -i faces_ex.py
    >>> # Sizes
    >>> print(faces.images[0].shape)
    (64, 64)
    >>> print(faces.data[0].shape)
    (4096,)
    >>> print(faces.target[0].size)
    1
    >>> # Contents
    >>> print(faces.images[0])
    [[ 0.30991736 0.36776859        0.41735536 ...,     0.37190083    0.33057851
       0.30578512]
     [ 0.3429752   0.40495867       0.43801653 ...,     0.37190083    0.33884299
       0.3140496 ]
     [ 0.3429752   0.41735536       0.45041323 ...,     0.38016528    0.33884299
       0.29752067]
     ...,
     [ 0.21487603 0.20661157        0.22314049 ...,    0.15289256    0.16528925
       0.17355372]
     [ 0.20247933 0.2107438         0.2107438   ...,    0.14876033    0.16115703
       0.16528925]
     [ 0.20247933 0.20661157        0.20247933 ...,    0.15289256    0.16115703
       0.1570248 ]]
Let's plot the first 20 images, which are shown in Fig. 11.1,
14   # note that images can not be saved as features, as we need 2D data for
15   # features, whereas faces.images are 3D data i.e. (samples, pixel-x, pixel-y)
16   features = faces.data # features
17   targets = faces.target # targets
18
26 plt.show()
   • Before moving further, let's convert Listing 11.2 into a function, so that the code can be reused. Listing 11.3 contains the function, which can be used to plot any number of images with the desired number of rows and columns, e.g. Line 26 plots 10 images with 2 rows and 5 columns.
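A rough sketch of such a helper function, assuming the 'images' array of the dataset (layout details are assumptions):

import matplotlib.pyplot as plt

def plot_images(images, total_images=20, rows=4, cols=5):
    """Plot the first 'total_images' images in a grid of rows x cols."""
    fig = plt.figure()
    for i in range(total_images):
        ax = fig.add_subplot(rows, cols, i + 1)  # subplot indices start at 1
        ax.imshow(images[i], cmap=plt.cm.gray)
        ax.set_xticks([])  # hide the ticks
        ax.set_yticks([])
    plt.show()

# e.g. plot 10 images with 2 rows and 5 columns
# plot_images(faces.images, total_images=10, rows=2, cols=5)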
23   # note that images can not be saved as features, as we need 2D data for
24   # features, whereas faces.images are 3D data i.e. (samples, pixel-x, pixel-y)
25   features = faces.data # features
26   targets = faces.target # targets
27
Since there are images of 40 people here, the number of different target values is fixed (i.e. 40), hence the problem is a 'classification' problem. In Chapter 2 and Chapter 3, we used the 'KNeighborsClassifier' and 'LogisticRegression' for classification problems; in this chapter we will use the 'Support Vector Machine (SVM)' model for the classification.
Note: SVM looks for the line that separates the two classes in the best way.
The code for prediction is exactly the same as in Chapter 2 and Chapter 3; the only difference is that the 'SVC (from SVM)' model is used with 'kernel="linear"' (Line 49). Note that, by default, 'kernel="rbf"' is used in SVC, which is required for non-linear problems.
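A rough sketch of the surrounding split-and-train steps (the variable names follow the listing excerpts below, and 'features' and 'targets' are assumed from the earlier code; the test_size value is an assumption):

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# split the features and targets into training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
    features, targets, test_size=0.2)

classifier = SVC(kernel="linear")              # default kernel is 'rbf'
classifier.fit(train_features, train_targets)  # train on the training data only

print("Accuracy for training data (self accuracy):", classifier.score(train_features, train_targets))
print("Accuracy for test data:", classifier.score(test_features, test_targets))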
26   # note that images can not be saved as features, as we need 2D data for
27   # features, whereas faces.images are 3D data i.e. (samples, pixel-x, pixel-y)
28   features = faces.data # features
29   targets = faces.target # targets
30
48   # use SVC
49   classifier = SVC(kernel="linear") # default kernel=rbf
50   # training using 'training data'
51   classifier.fit(train_features, train_targets) # fit the model for training data
52
$ python faces_ex.py
   • Let's print the locations among the first 20 images where the test-images and the predicted-images differ from each other. Also, plot the images to see the differences; a sketch of this check is given below.
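The sketch assumes 'classifier', 'test_features' and 'test_targets' from the previous step:

import numpy as np

predicted_targets = classifier.predict(test_features)

# locations (among the first 20 test images) where prediction and truth differ
mismatch = np.where(predicted_targets[:20] != test_targets[:20])[0]
print("Wrongly detected image-locations:", mismatch)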
27   # note that images can not be saved as features, as we need 2D data for
28   # features, whereas faces.images are 3D data i.e. (samples, pixel-x, pixel-y)
29   features = faces.data # features
30   targets = faces.target # targets
31
49   # use SVC
50   classifier = SVC(kernel="linear") # default kernel=rbf
51   # training using 'training data'
52   classifier.fit(train_features, train_targets) # fit the model for training data
53
64
   • Below are the outputs of the above code. The plotted test-images and predicted-images are shown in Fig. 11.2 and Fig. 11.3 respectively, where we can see that the image at location 14 (see red boxes) is in error.
     $ python faces_ex.py
     Accuracy for training data (self accuracy): 1.0
     Accuracy for test data: 0.9875
     Wrongly detected image-locations: 14
Note: In Listing 11.5, we used the 'images (i.e. faces_test.append(images[i]))' at Lines 75 and 80 to plot the images.
Alternatively, we can convert the 'features' back into images for plotting, as shown at Lines 77 and 84 of Listing 11.6.
27   # note that images can not be saved as features, as we need 2D data for
28   # features, whereas faces.images are 3D data i.e. (samples, pixel-x, pixel-y)
29   features = faces.data # features
30   targets = faces.target # targets
31
49   # use SVC
50   classifier = SVC(kernel="linear") # default kernel=rbf
51   # training using 'training data'
52   classifier.fit(train_features, train_targets) # fit the model for training data
53
64
Chapter 12
More examples on Supervised learning
12.1 Introduction
In this chapter, some more examples are added for supervised learning.
12.2 Visualizing the Iris dataset
In this section, we will visualize the Iris dataset, which is available in the SciKit library, using 'numpy' and 'matplotlib'.
   • First load the dataset and quickly see its contents,
# visualization_ex1.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()  # load the dataset

# unique targets
print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
# counts of each target
print("Bin counts for targets:", np.bincount(iris.target))
$ python visualization_ex1.py
Unique targets: [0 1 2]
Bin counts for targets: [50 50 50]
12.2.2 Histogram
• Let's plot the histogram of the 'targets' with respect to each feature of the dataset (a sketch of this step is given after the listing below),
1    # visualization_ex1.py
2
5    import numpy as np
6    import matplotlib.pyplot as plt
7    from sklearn.datasets import load_iris
8
18   #   # unique targets
19   #   print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
20   #   # counts of each target
21   #   print("Bin counts for targets:", np.bincount(iris.target))
22
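A rough sketch of one way to draw Fig. 12.1, reusing the 'iris' object and the matplotlib import from the listing (colors and layout are assumptions):

# one subplot per feature; each subplot shows the histogram of that feature
# for each of the three targets
fig, axes = plt.subplots(2, 2)
for i, ax in enumerate(axes.ravel()):
    for t, color in zip(range(3), ['blue', 'red', 'green']):
        ax.hist(iris.data[iris.target == t, i], color=color, alpha=0.5)
    ax.set_xlabel(iris.feature_names[i])
plt.show()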
   • Fig. 12.1 shows the histogram of the targets with respect to each feature. We can clearly see that the feature 'petal width' can distinguish the targets better than the other features.
        • Now, we will plot the scatter-plot between ‘petal-width’ and ‘all other features’.
1    # visualization_ex1.py
2
5    import numpy as np
6    import matplotlib.pyplot as plt
7    from sklearn.datasets import load_iris
8
18   #   # unique targets
19   #   print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
20   #   # counts of each target
21   #   print("Bin counts for targets:", np.bincount(iris.target))
22
   • Fig. 12.2 shows the scatter-plots between 'petal width' and 'all other features'. Here we can see that the 'setosa' samples can be clearly distinguished from 'versicolor' and 'virginica'; but 'versicolor' and 'virginica' cannot be completely separated from each other with any combination of 'x' and 'y' axes.
   • In Fig. 12.2, we plotted the scatter-plots between 'petal width' and 'all other features'; however, many other combinations are still possible, e.g. 'petal length' and 'all other features'. The Pandas library provides the method 'scatter_matrix', which plots the scatter plots for all possible combinations along with the histograms, as shown below,
1    # visualization_ex1.py
2
5    import numpy as np
6    import pandas as pd
7    import matplotlib.pyplot as plt
8    from sklearn.datasets import load_iris
9
19   #   # unique targets
20   #   print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
21   #   # counts of each target
22   #   print("Bin counts for targets:", np.bincount(iris.target))
23
54   # create Pandas-dataframe
55   iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
56   # print(iris_df.head())
57   pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(8, 8));
58   plt.show()
• Below are the histogram and scatter plots generated by the above code,
   • Next, split the data into 'training' and 'test' sets. Then we will fit the training data to the model "KNeighborsClassifier" and check the accuracy of the model on the test data; a rough sketch of these steps is given below.
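In the sketch, the 'iris' object loaded earlier is reused, and the test_size and random_state values are assumptions:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# split the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(iris.data, iris.target, test_size=0.3, random_state=23)

# select classifier and train it
cls = KNeighborsClassifier()
cls.fit(train_X, train_y)

# accuracy on the test data
pred_y = cls.predict(test_X)
print("Accuracy:", accuracy_score(test_y, pred_y))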
1    # visualization_ex1.py
2
22   #   # unique targets
23   #   print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
24   #   # counts of each target
25   #   print("Bin counts for targets:", np.bincount(iris.target))
26
57   #   # create Pandas-dataframe
58   #   iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
59   #   # print(iris_df.head())
60   #   pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(8, 8));
61   #   plt.show()
62
63
75   # select classifier
76   cls = KNeighborsClassifier()
77   cls.fit(train_X, train_y)
78
$ python visualization_ex1.py
1    # visualization_ex1.py
2
5    import numpy as np
6    import pandas as pd
7    import matplotlib.pyplot as plt
8    from sklearn.datasets import load_iris
9    from sklearn.model_selection import train_test_split
10   from sklearn.metrics import accuracy_score
11   from sklearn.neighbors import KNeighborsClassifier
12
22   #   # unique targets
23   #   print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
24   #   # counts of each target
25   #   print("Bin counts for targets:", np.bincount(iris.target))
26
57   #   # create Pandas-dataframe
58   #   iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
59   #   # print(iris_df.head())
60   #   pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(8, 8));
61   #   plt.show()
62
63
75   # select classifier
76   cls = KNeighborsClassifier()
77   cls.fit(train_X, train_y)
78
114       plt.xlabel('{0}'.format(iris.feature_names[feature_x]))
115       plt.ylabel('{0}'.format(iris.feature_names[feature_y]))
116       plt.legend()
117   plt.show()
   • The results for the above code are shown in Fig. 12.4. In the two subplots, there are only 3 visible triangles, as two of them overlap each other; the overlapped triangles also look darker because we are using the 'alpha' parameter.
Overlapped points
1    $ python -i visualization_ex1.py
2
In this section, we will see the classification boundaries of the 'linear' and 'nonlinear' classification models.
• Let's create a dataset using 'make_blobs' with two centers and plot the scatter-plot for it,
# make_blob_ex.py

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(centers=2, random_state=0)
print("First 5 samples:\n", X[:5])
print("First 5 labels:", y[:5])

# scatter plot of the two classes (markers and labels are assumed)
plt.scatter(X[y == 0, 0], X[y == 0, 1], marker='o', label='class 0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], marker='^', label='class 1')
plt.xlabel('first feature')
plt.ylabel('second feature')
plt.legend()
plt.show()
• Below is the output of the above code. Fig. 12.5 is the scatter plot generated by it,
$ python make_blob_ex.py
     First 5 samples:
      [[ 4.21850347 2.23419161]
      [ 0.90779887 0.45984362]
      [-0.27652528 5.08127768]
      [ 0.08848433 2.32299086]
      [ 3.24329731 1.21460627]]
First 5 labels: [1 1 0 0 1]
8    X, y = make_blobs(centers=2, random_state=0)
9
19   #   plt.xlabel('first feature')
20   #   plt.ylabel('second feature')
21   #   plt.legend()
22   #   plt.show()
23
29   # Linear classifier
30   cls = LogisticRegression()
31   cls.fit(X_train, y_train)
32   prediction = cls.predict(X_test)
33   score = cls.score(X_test, y_test)
34   print("Accuracy:", score)
$ python make_blob_ex.py
Accuracy: 0.9
Since the model is linear, it will use a 'straight line' as the boundary for the classification. The boundary can be drawn using the 'plot_2d_separator' helper, as shown in the code below,
9    X, y = make_blobs(centers=2, random_state=0)
10
20   #   plt.xlabel('first feature')
21   #   plt.ylabel('second feature')
22   #   plt.legend()
23   #   plt.show()
24
30   # Linear classifier
31   cls = LogisticRegression()
32   cls.fit(X_train, y_train)
33   prediction = cls.predict(X_test)
34   score = cls.score(X_test, y_test)
35   print("Accuracy:", score)
36
• Fig. 12.6 shows the decision boundary generated by the above code.
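Note that 'plot_2d_separator' is a small helper from the tutorial's supporting material rather than a part of SciKit itself. If it is not available, a similar boundary plot can be drawn directly with a mesh grid and 'plt.contour'; below is a generic sketch of that alternative (not the listing's code), assuming the fitted classifier 'cls' and the data 'X', 'y':

import numpy as np
import matplotlib.pyplot as plt

# evaluate the classifier on a grid covering the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = cls.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contour(xx, yy, Z, levels=[0.5])  # boundary between class 0 and class 1
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel('first feature')
plt.ylabel('second feature')
plt.show()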
Let's now use the nonlinear classifier, i.e. 'KNeighborsClassifier', and see the decision boundary for it,
10   X, y = make_blobs(centers=2, random_state=0)
11
21   #   plt.xlabel('first feature')
22   #   plt.ylabel('second feature')
23   #   plt.legend()
24   #   plt.show()
25
31   # Linear classifier
32   # cls = LogisticRegression()
33
34   # Nonlinear classifier
35   cls = KNeighborsClassifier()
36   cls.fit(X_train, y_train)
37   prediction = cls.predict(X_test)
38   score = cls.score(X_test, y_test)
39   print("Accuracy:", score)
40
   • Below is the output of the above code. Fig. 12.7 shows the nonlinear decision boundary generated by the code.
     $ python make_blob_ex.py
     Accuracy: 1.0
Note:
   • Now, increase the noise (i.e. cluster_std) in the make_blobs dataset by replacing Line 10 of Listing 12.2 (a snippet showing this replacement is given after the code below).
   • Note that we may get multiple boundaries in nonlinear classification when the noise is high, which will reduce the performance of the system. Those multiple boundaries can be removed by increasing the number of neighbors at Line 35 of 'KNeighborsClassifier', as shown below,
cls = KNeighborsClassifier(n_neighbors=25)
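Similarly, the noise can be increased by passing a larger 'cluster_std' to make_blobs (the value 3.0 is only an example):

X, y = make_blobs(centers=2, random_state=0, cluster_std=3.0)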
 Warning: Increasing 'n_neighbors' in 'KNeighborsClassifier' does not mean that it will increase the performance all the time. It may reduce the performance as well.
 For better results, we must have a higher number of samples to reduce the variability in the performance metrics.
Chapter 13
Performance analysis of models
13.1 Introduction
In the previous chapters, we saw examples of 'supervised machine learning', i.e. classification and regression models. Also, we calculated the 'score' to see the performance of these models. But there are several other standard methods to evaluate the performance of the models. Table 13.1 shows the list of metrics which can be used to measure the performance of different types of models; these are discussed in this chapter.
13.2 Performance of classification problem
In this section, we will see the performance measurement of the classification problem.
13.2.1 Accuracy
The 'accuracy' is the ratio of the 'correct predictions' to 'all the predictions'. By default, the scoring is done based on 'accuracy'.
Note: In previous chapters, we already calculated the 'accuracy' for the 'training' and 'test' datasets. For easy analysis, the cross-validation routines have built-in performance-measurement options, e.g. 'accuracy', 'mean_squared_error' and 'r2_score' etc., as shown in this chapter.
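A minimal sketch of 'accuracy' scoring with cross-validation (the dataset and classifier here are placeholders, not the chapter's listing):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
classifier = KNeighborsClassifier()

# 'accuracy' is the default scoring for classifiers
scores = cross_val_score(classifier, iris.data, iris.target, cv=5, scoring='accuracy')
print("Accuracy of each fold:", scores)
print("Mean accuracy:", scores.mean())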
13.2.2 Logarithmic loss
It measures the probability of having the correct predictions, and reports the logarithmic value of that probability. Since a probability lies between 0 and 1, the 'Logarithmic loss' has a range between '-infinity' and 0.
Note: The higher the 'Logarithmic loss' value, the better the model. A perfect model will have the maximum value, i.e. '0'.
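Scikit-learn exposes this metric through the scoring option 'neg_log_loss' (i.e. the negative of the logarithmic loss, so that higher is better); a minimal sketch with a placeholder dataset and classifier:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
classifier = LogisticRegression()

scores = cross_val_score(classifier, iris.data, iris.target, cv=5, scoring='neg_log_loss')
print("Logarithmic loss of each fold:", scores)  # values are <= 0; closer to 0 is better
print("Mean:", scores.mean())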
The classification report gives the 'precision', 'recall', 'F1-score' and 'support' values for each class; a minimal sketch of generating such a report is given below,
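The data, the split and the classifier in the sketch are placeholders:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_blobs(centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

cls = LogisticRegression().fit(X_train, y_train)
prediction = cls.predict(X_test)

# precision, recall, F1-score and support for each class
print(classification_report(y_test, prediction))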
Let's understand the Confusion matrix first, which is the basis for ROC and can be used with 'binary (not multiclass) classification'. The confusion matrix is a 2 × 2 matrix, whose entries are shown in Table 13.2 and explained below,
   •   True positive : Actual value is positive, and predicted value is also positive.
   •   False negative : Actual value is positive, and predicted value is negative.
   •   False positive : Actual value is negative, and predicted value is positive.
   •   True negative : Actual value is negative, and predicted value is negative.
Note: Clearly, the desired results are the 'True positive' and 'True negative' entries. Therefore, for better performance, these values should be higher than the 'False negative' and 'False positive' entries.
   •   True positive = 9
   •   True negative = 9
   •   False positive = 1
   •   False negative = 1
>>> from sklearn.datasets import make_blobs
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import confusion_matrix
>>>
>>> X, y = make_blobs(centers=2, random_state=0)
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...         test_size=0.2,
...         random_state=23,
...         stratify=y)
>>>
>>> # Linear classifier
... cls = LogisticRegression()
>>> cls.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
>>> prediction = cls.predict(X_test)
>>> c_matrix = confusion_matrix(y_test, prediction)
>>> print(c_matrix) # print confusion_matrix
[[9 1]
 [1 9]]
ROC is the plot between the 'true positive rate' and the 'false positive rate', which are defined as below,
   • True positive rate = (True positive) / (True positive + False negative)
   • False positive rate = (False positive) / (False positive + True negative)
Note: ROC and AUC are used for 'binary (not multiclass) classification' problems; 'AUC = 1' represents the perfect model. A minimal sketch of computing them is given below,
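The sketch continues the interactive session of the confusion-matrix example above:

>>> from sklearn.metrics import roc_curve, roc_auc_score
>>>
>>> # probability of the positive class for each test sample
>>> probs = cls.predict_proba(X_test)[:, 1]
>>>
>>> fpr, tpr, thresholds = roc_curve(y_test, probs)
>>> print("AUC:", roc_auc_score(y_test, probs))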
Note:
   • By default, the Scikit library calculates the 'r2_score', as shown in Lines 44-46. The 'r2_score' has values between 0 (no fit) and 1 (perfect fit).
   • Mean absolute error (MAE) is the sum of the 'absolute differences' between the predicted and the actual values; it is calculated at Lines 48-50.
   • Mean squared error (MSE) is the sum of the squares of the errors, where the errors are the differences between the 'predicted' and 'actual' values; it is calculated at Lines 53-55.
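The listing itself is not reproduced here; a rough sketch of these scoring options with cross-validation (the regression dataset below is only a placeholder):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# noisy sine-wave data, similar to Chapter 4
x = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * np.random.randn(100)

model = LinearRegression()
for scoring in ('r2', 'neg_mean_absolute_error', 'neg_mean_squared_error'):
    scores = cross_val_score(model, x, y, cv=5, scoring=scoring)
    print(scoring, ":", scores.mean())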
 Error:
      • The mean score for 'r2' is calculated as '-7.7967', which is negative. Note that, at first glance, a negative value does not seem possible for the 'r2' score.
      • Similarly, replace 'r2' with 'neg_mean_squared_error' and 'neg_mean_absolute_error', and it may give some undesired-looking results.
      • Please clarify the reason.
Chapter 14
Quick reference guide
14.1 Introduction
In previous chapters, we saw several examples of machine learning methods. In this chapter, we will summarize those methods along with several other useful ways to analyze the data.
When we get the data, we need to look at the data and its statistics. Then we need to perform certain clean/transform operations, e.g. filling the null values etc. In this section, we will see several steps which may be useful to understand the data.
Although we can use plain Python or Numpy to load the data, it is better to use the Pandas library for it.
   • Add a header to the data: In the below code, the first 29 rows are skipped as these lines do not contain samples but information about each sample. A rough sketch of this step is given below.
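In the sketch, the file name and the exact number of skipped rows are assumptions; the column names are taken from the outputs shown below:

import pandas as pd

# names of the 25 columns
header = ['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr',
          'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc', 'htn',
          'dm', 'cad', 'appet', 'pe', 'ane', 'classification']

# skip the first 29 description rows and attach the header
df_kidney = pd.read_csv('data/chronic_kidney_disease.arff', skiprows=29, header=None, names=header)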
1    >>> df_kidney.isnull().sum()
2    age               0
3    bp                0
4    sg                0
5    al                0
6    su                0
7    rbc               0
8    pc                0
9    pcc               0
10   ba                0
11   bgr               0
12   bu                0
13   sc                0
14   sod               0
15   pot               0
16   hemo              0
17   pcv               0
18   wbcc              0
19   rbcc              0
20   htn               0
21   dm                1
22   cad               0
23   appet             0
24   pe                0
25   ane               0
26   classification    0
27   dtype: int64
>>> df_kidney[df_kidney.dm.isnull()]
    age bp      sg al su     rbc     pc                 pcc              ba   bgr   \
369 75 70 1.020 0 0 normal normal                notpresent      notpresent   107
[1 rows x 25 columns]
>>>
>>> df_kidney[df_kidney.dm.isnull()].iloc[:, 0:2] # display only two columns
    age bp
369 75 70
Sometimes the datatypes are not correctly inferred by Pandas, therefore it is better to check the data type of each column.
   • In the below results, all the types are 'object' (not numeric), because the samples contain '?' in them; therefore we need to replace the '?' values with some other values,
>>> df_kidney.dtypes
age               object
bp                object
sg                object
al                object
su                object
rbc               object
pc                object
pcc               object
ba                object
bgr               object
bu                object
sc                object
sod               object
pot               object
hemo              object
pcv               object
wbcc              object
rbcc              object
htn               object
dm                object
cad               object
appet             object
pe                object
ane               object
classification    object
dtype: object
• If we perform the 'conversion' operation at this moment, then an error will be generated due to the '?' in the data.
• Replace the '?' with 'NaN' values using the 'replace' command, and then change the 'type' of the 'bgr' column,
• Next, we can drop or fill the 'NaN' values. In the below code we dropped the NaN values; a rough sketch of these cleaning steps is given below,
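In the sketch, the 'df_kidney' DataFrame is reused; the exact string to be replaced may differ in the original data:

>>> import numpy as np
>>>
>>> # replace the '?' entries with NaN
>>> df_kidney = df_kidney.replace('?', np.nan)
>>>
>>> # now the numeric columns can be converted, e.g. 'bgr'
>>> df_kidney['bgr'] = df_kidney['bgr'].astype(float)
>>>
>>> # finally drop the rows which contain NaN values
>>> df_kidney = df_kidney.dropna()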
>>> df_whole_sale.describe()
          channel        area               fresh           milk          grocery   \
count 440.000000 440.000000            440.000000     440.000000       440.000000
mean     1.322727    2.543182        12000.297727    5796.265909      7951.277273
std      0.468052    0.774272        12647.328865    7380.377175      9503.162829
min      1.000000    1.000000            3.000000      55.000000         3.000000
25%      1.000000    2.000000         3127.750000    1533.000000      2153.000000
50%      1.000000    3.000000         8504.000000    3627.000000      4755.500000
75%      2.000000    3.000000        16933.750000    7190.250000     10655.750000
max      2.000000    3.000000       112151.000000   73498.000000     92780.000000
>>> df_whole_sale.describe().iloc[:,0:2]
          channel      area
count 440.000000 440.000000
mean     1.322727    2.543182
std      0.468052    0.774272
min      1.000000    1.000000
25%      1.000000    2.000000
50%      1.000000    3.000000
75%      2.000000    3.000000
max      2.000000    3.000000
It is better to see the distribution of the outputs for a classification problem. In the below output, we can see that we have more data for 'no chronic kidney disease (notckd)' than for 'chronic kidney disease (ckd)',
>>> df_kidney.groupby("classification").size()
classification
ckd        43
notckd    114
dtype: int64
It is also good to see the correlation between the features. In the below results we can see that the correlation of 'milk' is higher with 'grocery' and 'detergent', which indicates that customers who are buying 'milk' are more likely to buy 'grocery' and 'detergent' as well. See Chapter 10 for more details about this relationship. A minimal sketch of computing the correlation is given below,
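The sketch reuses the 'df_whole_sale' DataFrame from above:

>>> corr = df_whole_sale.corr()
>>> print(corr['milk'].sort_values(ascending=False))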
In the tutorial, we already saw several data-visualization techniques such as the 'histogram' and the 'scatter plot' etc. In this section, we will summarize these techniques.
The plots can be divided into two categories, as shown in Table 14.1. These plots are described below.
The univariate plots are used to visualize each feature independently. In this section we will see some of the important univariate plots.
14.3.1.1 Histogram
Histograms are the quickest way to visualize the distribution of the data, as sketched below,
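The sketch assumes the 'df_whole_sale' DataFrame from the previous section:

>>> import matplotlib.pyplot as plt
>>> df_whole_sale.hist()
>>> plt.show()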
Box and Whisker plots draw a line at the median value and a box around the 25th and 75th percentiles, as sketched below,
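Again using 'df_whole_sale' (and matplotlib.pyplot imported as plt):

>>> df_whole_sale.plot(kind='box')
>>> plt.show()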
The multivariate plots are used to visualize the relationships between two or more variables.
Important: Note that we need to convert a numpy-array into a Pandas DataFrame before plotting it using Pandas. This is applicable to both 'univariate' and 'multivariate' plots.
Note: We can plot the multicolor 'Scatter plot' and 'Histogram' as shown in Section 12.2, which are easier to visualize as compared to single-color plots.
For a colorful scatter_matrix plot, we can use code along the lines of the sketch below,
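The sketch uses the Iris data as in Section 12.2, so that the 'c' argument has targets to color by:

>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import load_iris
>>>
>>> iris = load_iris()
>>> iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(8, 8))
>>> plt.show()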
• A correlation-matrix plot displays the correlation values of the data.
• Also, we can add a 'colorbar' to see the relationship between the colors and the correlation values.
Fig. 14.6: Correlation-matrix plot with ‘colorbar’ for the wholesale data
      • Finally, we can add 'headers' (tick names) to the plot so that it is more readable. A rough sketch of the complete correlation-matrix plot is given below,
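The sketch continues with 'df_whole_sale'; the figure details are assumptions:

>>> import matplotlib.pyplot as plt
>>>
>>> corr = df_whole_sale.corr()
>>> fig, ax = plt.subplots()
>>> cax = ax.matshow(corr)          # correlation-matrix plot
>>> fig.colorbar(cax)               # add the colorbar
>>>
>>> # add the column names as tick-names
>>> ticks = range(len(corr.columns))
>>> ax.set_xticks(ticks)
>>> ax.set_yticks(ticks)
>>> ax.set_xticklabels(corr.columns, rotation=90)
>>> ax.set_yticklabels(corr.columns)
>>> plt.show()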
Fig. 14.7: Correlation-matrix plot with ‘colorbar’ and ‘tick-name’ for the wholesale data
Note: From the correlation-matrix plot it is quite clear that people are buying 'grocery' and 'detergent' together.
See Chapter 10 for more details about these relationships, where a scatter plot is used to visualize them.
In Chapter 8, we saw examples of preprocessing of the data and the resulting performance improvement in the model. Further, we learned that some of the algorithms are sensitive to the statistics of the features, e.g. the PCA algorithm gives more weight to the features which have high variance. In other words, a feature with high variance will dominate the output of the PCA. In this section, we will summarize some of the preprocessing methods.
• Let's read the samples from the 'Wholesale customers' data first; we will preprocess this data in this section,
14.4.2 StandardScaler
We used the ‘StandardScaler’ in Chapter 8 and saw the performance improvement in the model with it. It sets
the ‘mean = 0’ and ‘variance = 1’ for all the features,
      • Now, process the data using StandardScaler. We can also combine the two steps (i.e. fit and transform) into a single step; both variants are sketched below,
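The sketch assumes the 'df_whole_sale' DataFrame; the variable name 'df_temp' follows the discussion below:

>>> from sklearn.preprocessing import StandardScaler
>>>
>>> # two steps: fit the scaler, then transform the data
>>> scaler = StandardScaler()
>>> scaler.fit(df_whole_sale)
>>> df_temp = scaler.transform(df_whole_sale)
>>>
>>> # or combine both steps into one
>>> df_temp = StandardScaler().fit_transform(df_whole_sale)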
   • Note that the type of 'df_temp' is 'numpy.ndarray', therefore we need to loop through each column to calculate the mean and variance, as shown below,
>>> type(df_temp) # numpy array
<class 'numpy.ndarray'>
>>>
>>> # mean and var of each column
... for i in range(df_temp.shape[1]):
...     print("row {0}: mean={1:<5.2f} var={2:<5.2f}".format(i,
...         np.mean(df_temp[:,i]),
...         np.var(df_temp[:,i])
...         )
...     )
...
row 0: mean=0.00 var=1.00
row 1: mean=0.00 var=1.00
row 2: mean=-0.00 var=1.00
row 3: mean=-0.00 var=1.00
row 4: mean=-0.00 var=1.00
row 5: mean=0.00 var=1.00
row 6: mean=0.00 var=1.00
row 7: mean=-0.00 var=1.00
   • Also, we can convert the numpy-array to Pandas-DataFrame and then calculate the mean and variance,
>>> # convert numpy-array to Pandas-dataframe
... df = pd.DataFrame(df_temp, columns=header)
>>>
>>> type(df) # Pandas DataFrame
<class 'pandas.core.frame.DataFrame'>
>>>
>>> np.mean(df) # mean = 0
channel        -2.523234e-18
area            2.828545e-16
fresh          -3.727684e-17
milk           -8.815549e-18
grocery        -5.197665e-17
frozen          3.587724e-17
detergent       2.618250e-17
delicatessen   -2.508450e-18
dtype: float64
>>>
>>> np.var(df)
channel         1.0
area            1.0
fresh           1.0
milk            1.0
grocery         1.0
frozen          1.0
detergent       1.0
delicatessen    1.0
dtype: float64
14.4.3 MinMaxScaler
The MinMax scaler scales the features into the range (0 to 1), i.e. the minimum and maximum values are scaled to 0 and 1 respectively.
>>> from sklearn.preprocessing import MinMaxScaler
>>> df_temp = MinMaxScaler().fit_transform(df_whole_sale)
>>> df = pd.DataFrame(df_temp, columns=header)
14.4.4 Normalizer
The Normalizer processes each row (i.e. each sample) so that it has unit norm; with the 'l1' norm the sum of the absolute values of each row is 1 (the default 'l2' norm makes the sum of squares 1 instead), as sketched in the code below,
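The sketch continues with 'df_whole_sale' and uses the 'l1' norm explicitly:

>>> from sklearn.preprocessing import Normalizer
>>>
>>> df_temp = Normalizer(norm='l1').fit_transform(df_whole_sale)
>>> # each row of 'df_temp' now sums to 1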
In Chapter 7, we saw an example of feature selection, where PCA analysis was done to reduce the dimension of the features.
Note: While collecting the data, our aim is to collect it without worrying about the relationship between the 'features' and the 'targets'. It is possible that some of this data has no impact on the target, e.g. the 'first name' of a person has no relationship with 'chronic kidney disease'. If we use this feature, i.e. the first name, to predict 'chronic kidney disease', then we will get wrong results.
Feature selection is the process of 'removing' or 'giving less weight to' irrelevant or partially relevant features. In this way we can achieve the following:
  1. Reduced overfitting: as the partially relevant data is removed from the dataset.
  2. Reduced training time: as we have fewer features after feature selection.
14.5.1 SelectKBest
The 'SelectKBest' class can be used to find the best 'K' features of the dataset. In the original listing, 'new_features' contained the two selected columns of 'features'; a rough sketch of this step is given below,
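The sketch uses the Iris data as a placeholder; the scoring function and the 'k' value are assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
features, targets = iris.data, iris.target

# keep only the 2 best features according to the ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2)
new_features = selector.fit_transform(features, targets)
print(new_features.shape)  # (150, 2)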
RFE recursively checks the accuracy of the model and removes the attributes which result in lower accuracy; a rough sketch is given below,
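The estimator and the number of features to select in the sketch are assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# recursively eliminate features until only 2 remain
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2)
rfe.fit(iris.data, iris.target)
print("Selected features:", rfe.support_)  # True for the kept features
print("Feature ranking:", rfe.ranking_)    # 1 for the kept features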
Please see Chapter 7, where PCA is discussed in detail. Note that PCA does not select features but transforms them.
14.6 Algorithms
In this section, we will see some of the widely used algorithms for 'classification' and 'regression' problems.
Important: Note that not all the models work well in all the cases. Therefore, we need to check the performance of various machine learning algorithms before finalizing the model.
Table 14.2 shows some of the widely used classification algorithms. We have already seen examples of 'Logistic Regression (Chapter 3)', 'K-nearest neighbor (Chapter 2)' and 'SVM (Chapter 11)'. In this section we will discuss the LDA, Naive Bayes and Decision-tree (CART) algorithms.
The below code is the same as Listing 3.4, but LDA is used instead of the 'K-nearest' and 'LogisticRegression' algorithms,
1    # rock_mine2.py
2
5    import numpy as np
6    from sklearn.metrics import accuracy_score
7    from sklearn.model_selection import train_test_split
8    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
9
10   f = open("data/sonar.all-data", 'r')
11   data = f.read()
12   f.close()
13
23   # extract targets
24   row_sample, col_sample = len(data_list), len(data_list[0])
25
26   # features : last column i.e. target value will be removed form the dataset
50   # select classifier
51   classifier = LinearDiscriminantAnalysis()
52
     $ python rock_mine2.py
     Accuracy for training data (self accuracy): 0.885542168675
     Accuracy for test data: 0.809523809524
Note: Both the LogisticRegression and LinearDiscriminantAnalysis algorithms assume that the input features have Gaussian distributions.
The Naive Bayes algorithm assumes that all the features are independent of each other and have Gaussian distributions. Below is an example of the Naive Bayes algorithm,
# multiclass_ex.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# load the dataset and extract features and targets
iris = load_iris()
features = iris.data
targets = iris.target

# select classifier
classifier = GaussianNB()

# cross-validation
scores = cross_val_score(classifier, features, targets, cv=3)
print("Cross validation scores:", scores)
print("Mean score:", np.mean(scores))
The Decision-tree classifier creates a binary decision tree from the training data, so as to minimize a cost function,
# multiclass_ex.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# load the dataset and extract features and targets
iris = load_iris()
features = iris.data
targets = iris.target

# select classifier
classifier = DecisionTreeClassifier()

# cross-validation
scores = cross_val_score(classifier, features, targets, cv=3)
print("Cross validation scores:", scores)
print("Mean score:", np.mean(scores))
Table 14.3 shows some of the widely used regression algorithms. We have already seen the example of 'Linear regression (Chapter 4)'. Also, we saw the examples of 'K-nearest neighbor (Chapter 2)', 'SVM (Chapter 11)' and 'Decision Tree (Section 14.6.1.3)' for classification problems; in this section we will use these algorithms for regression problems. Further, we will discuss the 'Ridge', 'LASSO' and 'Elastic-net' algorithms.
Ridge regression is an extended version of Linear regression, where the ridge coefficients minimize a penalized residual sum of squares; the penalty is known as the L2 norm.
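The full listing is not reproduced here; below is a rough self-contained sketch of the idea, using a noisy sine-wave dataset similar to Chapter 4 (the exact numbers will therefore differ from the output shown after the excerpt):

# regression_ex.py (sketch)

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# noisy sine-wave data
x = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(x).ravel() + 0.2 * np.random.randn(100)

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, random_state=0)

model = Ridge(alpha=1.0)  # alpha controls the strength of the L2 penalty
model.fit(train_x, train_y)
print("Accuracy for test data:", model.score(test_x, test_y))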
1    # regression_ex.py
2
3    import numpy as np
4    from sklearn.model_selection import train_test_split
5    from sklearn.linear_model import Ridge
6
     $ python regression_ex.py
     Accuracy for test data: 0.82273039102
LASSO is an extended version of Linear regression, where the coefficients are penalized by the sum of their absolute values, which is known as the L1 norm.
1    # regression_ex.py
2
3    import numpy as np
4    from sklearn.model_selection import train_test_split
5    from sklearn.linear_model import Lasso
6
3    import numpy as np
4    from sklearn.model_selection import train_test_split
5    from sklearn.linear_model import ElasticNet
6
     $ python regression_ex.py
     Accuracy for test data: 0.744348295083
Note: SVR is used for regression problems, whereas SVC was used for classification problems. The same applies to the 'Decision tree' and 'K-nearest neighbor' algorithms.
1    # regression_ex.py
2
3    import numpy as np
4    from sklearn.model_selection import train_test_split
5    from sklearn.svm import SVR
6
     $ python regression_ex.py
     Accuracy for test data: 0.961088256595
1    # regression_ex.py
2
3    import numpy as np
4    from sklearn.model_selection import train_test_split
5    from sklearn.tree import DecisionTreeRegressor
6
1    # regression_ex.py
2
3    import numpy as np
4    from sklearn.model_selection import train_test_split
5    from sklearn.neighbors import KNeighborsRegressor
6
$ python regression_ex.py
Accuracy for test data: 0.991613506388