Python for Machine Learning Enthusiasts
Heikki Huttunen
          University Lecturer, Dr.Eng.
          Heikki.Huttunen@tut.fi
 • Motivation
    • Why machine learning?
    • Machine learning basics
    • Why Python?
 • Review of widely used classifiers and their implementation in
   Python
 • Examples
     • Data driven design of a classifier for ovarian cancer detection
     • ICANN MEG Mind Reading Challenge 2011
     • IEEE MLSP Birds 2013 Competition
     • DecMeg 2014 Decoding the Human Brain Competition
 • Hands-on session (TC407)
 • Slides available at http://www.cs.tut.fi/~hehu/MMLP.pdf
Introduction
• However, due to licensing issues and the heavy development of Python, scientific Python started to gain its user base.
• Python's strength is in its versatility and huge community.
• There are two versions in use: Python 2.7 and 3.4. We'll use the former, as it is better supported by most packages interesting to us.
(Figure: the scientific Python stack — Python, numpy, sklearn, matplotlib.)
Alternatives to Python in Science

Python vs. Matlab
 • Matlab is the #1 workhorse for linear algebra.
 • Matlab is a professionally maintained product.
 • Some of Matlab's toolboxes are great (Image Processing tb); some are obsolete (Neural Network tb).
 • New versions twice a year. The amount of novelty varies.
 • Matlab is expensive for non-educational users.

Python vs. R
 • R is the #1 workhorse for statistics and data analysis.
 • Popularity: http://www.kdnuggets.com/polls/2015/r-vs-python.html
 • R is great for specific data analysis and visualization needs.
 • However, Python interfaces to other domains, ranging from deep neural networks (Caffe, Theano) and image analysis (OpenCV) to even a full-blown web server (Django).
Essential Modules
                  Samples: $x_1, x_2, \ldots, x_N \in \mathbb{R}^P$
              Class labels: $y_1, y_2, \ldots, y_N \in \{1, 2, \ldots, C\}$
                 Classifier: $F(x): \mathbb{R}^P \mapsto \{1, 2, \ldots, C\}$
  • Now, the task is to find the function $F$ that maps the samples most accurately to their corresponding labels.
  • For example: find the function $F$ that minimizes the number of erroneous predictions, i.e., the cases in which $F(x_k) \neq y_k$.
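  • As a quick illustration, the number of erroneous predictions can be counted directly with numpy (the data and the dummy classifier F below are made up for this sketch):

   import numpy as np

   # Made-up data: N = 100 samples with P = 5 features, labels in {0, 1, 2}.
   X = np.random.randn(100, 5)
   y = np.random.randint(0, 3, size=100)

   def F(X):
       """Dummy classifier that always predicts class 0."""
       return np.zeros(len(X), dtype=int)

   errors = np.sum(F(X) != y)        # number of cases where F(x_k) != y_k
   error_rate = np.mean(F(X) != y)   # fraction of erroneous predictions
   print(errors, error_rate)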
Classification Example
(Figure: two-class example; the sample is classified as RED.)
• The expression $\mathbf{w}^T \mathbf{x} = \sum_k w_k x_k$ essentially transforms the multidimensional data $\mathbf{x}$ to a real number, which is then compared to a threshold $b$.
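• A minimal numpy sketch of this decision rule (the weight vector w and the threshold b below are invented for illustration):

   import numpy as np

   w = np.array([1.5, -0.5])   # weight vector (illustrative values)
   b = 1.0                     # decision threshold
   x = np.array([2.0, 3.0])    # one sample in R^2

   score = np.dot(w, x)        # w^T x = sum_k w_k x_k, a single real number
   print("RED" if score > b else "BLUE")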
     Flavors of Linear Classifiers
(Figure: three panels comparing linear classifier decision boundaries on the same two-class data.)
Details
 • The LDA:
     • The oldest of the three: Fisher, 1935.
     • ”Find the projection that maximizes class separation”, i.e., pull
        the classes as far from each other as possible.
     • Closed form solution, fast to train.
 • The SVM:
     • Vapnik and Chervonenkis, 1963.
     • ”Maximize the margin between classes”.
     • Slowest of the three, performs well with sparse
        high-dimensional data.
 • Logistic Regression (a.k.a. Generalized Linear Model):
      • History traces back to the 1700s, but the proper formulation and an
         efficient training algorithm are due to Nelder and Wedderburn (1972).
      • Statistical algorithm: ”Maximize the likelihood of the data”.
      • Also outputs class probabilities. Has been extended to
         automatic feature selection.
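 • All three are available in scikit-learn; a minimal sketch on made-up data (class and module names follow recent scikit-learn releases; older versions expose LDA as sklearn.lda.LDA):

   import numpy as np
   from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
   from sklearn.svm import LinearSVC
   from sklearn.linear_model import LogisticRegression

   # Made-up two-class data.
   X = np.random.randn(200, 2)
   y = (X[:, 0] + X[:, 1] > 0).astype(int)

   for clf in (LinearDiscriminantAnalysis(), LinearSVC(), LogisticRegression()):
       clf.fit(X, y)
       print(type(clf).__name__, clf.score(X, y))   # training accuracy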
The Class Probabilities of Logistic Regression
(Figure: example point classified as RED with class probability p = 60.6 %.)
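• The probability in the figure is exactly what scikit-learn's predict_proba returns; a small sketch on made-up data:

   import numpy as np
   from sklearn.linear_model import LogisticRegression

   X = np.random.randn(100, 2)
   y = (X[:, 0] > 0).astype(int)

   clf = LogisticRegression().fit(X, y)
   print(clf.predict([[1.0, 0.5]]))         # predicted class label
   print(clf.predict_proba([[1.0, 0.5]]))   # class probabilities for the new point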
 Nonlinear SVM
(Figure: Support Vector Machine with kernel trick; example point classified as BLUE.)
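• A minimal scikit-learn sketch of a kernel SVM (the RBF kernel and the toy data below are assumptions for illustration; the slide does not specify the kernel or its parameters):

   import numpy as np
   from sklearn.svm import SVC

   # Toy data with a nonlinear class boundary: inside vs. outside the unit circle.
   X = np.random.randn(300, 2)
   y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)

   clf = SVC(kernel="rbf", C=1.0)   # the kernel trick yields a nonlinear decision boundary
   clf.fit(X, y)
   print(clf.score(X, y))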
• The classifier also outputs feature importances.
(Figure: Relative Importance of Attributes — Vertical axis 73.2 %, Horizontal axis 26.8 %.)
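• A hedged sketch of extracting such attribute importances, assuming a tree ensemble like scikit-learn's RandomForestClassifier (the slide does not name the classifier used here):

   import numpy as np
   from sklearn.ensemble import RandomForestClassifier

   # Made-up data with two attributes; the labels depend mostly on the second one.
   X = np.random.randn(200, 2)
   y = (X[:, 1] > 0).astype(int)

   clf = RandomForestClassifier(n_estimators=100).fit(X, y)
   print(clf.feature_importances_)   # relative importance of each attribute, sums to 1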
Generalization
 • The important thing is how well the classifier works with
   unseen data.
 • An overfitted classifier memorizes the training data and does
   not generalize.
(Figure: 1-NN classifier decision regions — Training Data: Acc = 100.0 %; Test Data: Acc = 88.2 %.)
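 • The effect is easy to reproduce; a sketch with a 1-NN classifier on made-up data (the import path matches the scikit-learn versions of this era; newer releases use sklearn.model_selection):

   import numpy as np
   from sklearn.cross_validation import train_test_split
   from sklearn.neighbors import KNeighborsClassifier

   # Made-up data with noisy labels, so memorizing the training set does not generalize.
   X = np.random.randn(400, 2)
   y = (X[:, 0] + np.random.randn(400) > 0).astype(int)

   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
   clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
   print("train acc:", clf.score(X_train, y_train))   # 100 % (training data memorized)
   print("test acc:", clf.score(X_test, y_test))      # clearly lower on unseen data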
 Overfitting
• Generalization is also related to overfitting.
• A zero coefficient for a linear operator is equivalent to discarding the corresponding feature altogether.
• The plots illustrate the model coefficients without regularization, with traditional regularization, and with sparse regularization.
• The importance of sparsity is twofold: the model can be used for feature selection, but it often also generalizes better.
(Figure: an example time series and the fitted model coefficients by order — ℓ2 penalty: 24 nonzeros; ℓ1 penalty: 8 nonzeros.)
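• A hedged sketch of the ℓ2 vs. ℓ1 effect with scikit-learn's Ridge and Lasso (the regression problem and the penalty strengths below are made up; the slide's own example is a fit to a time series):

   import numpy as np
   from sklearn.linear_model import Ridge, Lasso

   # Made-up regression problem: 25 candidate features, only two truly relevant.
   rng = np.random.RandomState(0)
   X = rng.randn(100, 25)
   y = 3 * X[:, 0] - 2 * X[:, 3] + 0.5 * rng.randn(100)

   ridge = Ridge(alpha=1.0).fit(X, y)   # l2 penalty: shrinks coefficients, rarely exactly zero
   lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty: drives many coefficients exactly to zero
   print("ridge nonzeros:", np.sum(ridge.coef_ != 0))
   print("lasso nonzeros:", np.sum(lasso.coef_ != 0))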
   Example: Classifier Design for Ovarian Cancer
   Detection
 • … (preprocessing step).ᵃ
ᵃ Conrads, Thomas P., et al., ”High-resolution serum proteomic features for ovarian cancer detection,” Endocrine-Related Cancer (2004).
(Figure: an example spectrum from the dataset, roughly 4000 data points.)
 Reading the Data
   from sklearn.cross_validation import cross_val_score
   from sklearn.linear_model import LogisticRegression

   # X: (N, P) feature matrix, y: length-N label vector (loaded earlier)
   clf = LogisticRegression()
   scores = cross_val_score(clf, X, y)
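 • In the scikit-learn releases contemporary with these slides, cross_val_score defaults to 3-fold (stratified) cross-validation and to the classifier's accuracy score, so scores holds three accuracies; scores.mean() gives a single figure of merit.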
 • The four most popular deep learning toolboxes are (in order of decreasing popularity):
     •   Caffe: http://caffe.berkeleyvision.org/ from Berkeley
     •   Torch: http://torch.ch/ from NYU (used by Facebook)
     •   pylearn2: http://deeplearning.net/ from U. Montreal
     •   Minerva: https://github.com/dmlc/minerva from NYU and Peking Univ.
 • All except Torch can be used via a Python front end (Torch uses the Lua language).
Applications
    1. H. Huttunen et al., ”Regularized logistic regression for mind reading with parallel validation,” Proc. ICANN/PASCAL2 Challenge: MEG Mind-Reading, Aalto University publication series, Espoo, June 2011.
    2. H. Huttunen et al., ”MEG Mind Reading: Strategies for Feature Selection,” Proc. of The Federated Computer Science Event 2012, Helsinki, May 2012.
    3. H. Huttunen et al., ”Mind Reading with Regularized Multinomial Logistic Regression,” Machine Vision and Applications, Aug. 2013.
ICANN MEG Mind Reading Challenge Data
(Figure: example MEG signals over 200 time points; amplitudes on the order of 10⁻¹¹.)
Where Are the Selected Features Located?
(Figure: three histograms of the selected features — Count vs. Atom index (0–50).)
 DecMeg2014 Competition
 • The task was to predict whether the test persons were shown a face.
 • Sequences were 1.5 seconds long.
 • Approximately 600 data points from each subject.
(Figure: MEG responses as Sensor Number (0–300) vs. Time / s (0–1.5); upper panel: face shown.)
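 • A hedged sketch of how such trial × sensor × time data can be flattened for a scikit-learn classifier (the array shapes below are illustrative assumptions, not the exact competition dimensions):

   import numpy as np

   # Illustrative MEG data: trials x sensors x time samples.
   n_trials, n_sensors, n_times = 600, 306, 375
   data = np.random.randn(n_trials, n_sensors, n_times)
   labels = np.random.randint(0, 2, size=n_trials)   # e.g. face shown / not shown

   # Flatten each trial into one long feature vector for a standard classifier.
   X = data.reshape(n_trials, n_sensors * n_times)
   print(X.shape)                                     # (600, 114750)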
Our Model
(Figure: model architecture — a bank of logistic regression (LR) classifiers along the sensor dimension, whose outputs are combined by a random forest (RF) that makes the final decision.)
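 • A minimal sketch of that kind of two-stage (stacked) model; the per-sensor split, the sizes, and all parameters below are assumptions read off the diagram, not the authors' exact pipeline:

   import numpy as np
   from sklearn.linear_model import LogisticRegression
   from sklearn.ensemble import RandomForestClassifier

   # Small illustrative sizes to keep the sketch fast.
   n_trials, n_sensors, n_times = 200, 30, 50
   data = np.random.randn(n_trials, n_sensors, n_times)
   y = np.random.randint(0, 2, size=n_trials)

   # Stage 1: one logistic regression per sensor, each yielding a class probability.
   stage1 = np.zeros((n_trials, n_sensors))
   for s in range(n_sensors):
       lr = LogisticRegression().fit(data[:, s, :], y)
       stage1[:, s] = lr.predict_proba(data[:, s, :])[:, 1]

   # Stage 2: a random forest combines the per-sensor probabilities into the decision.
   rf = RandomForestClassifier(n_estimators=100).fit(stage1, y)
   print(rf.score(stage1, y))

 • In practice the stage-1 probabilities would be produced with cross-validation, so that the random forest is not trained on probabilities that have already seen the labels.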
 Feature Importances
(Figure: feature importances, with values around 0.01–0.06; annotation: 31 timepoints.)