Machine Learning:
Introduction
Why “Learn”?
◼   Machine learning is programming computers to
    optimize a performance criterion using example
    data or past experience.
◼   There is no need to “learn” to calculate payroll
◼   Learning is used when:
       Human expertise does not exist (navigating on Mars),
       Humans are unable to explain their expertise (speech
        recognition)
       Solution changes in time (routing on a computer network)
       Solution needs to be adapted to particular cases (user
        biometrics)
What We Talk About When We
Talk About “Learning”
◼   Learning general models from data of particular
    examples
◼   Data is cheap and abundant (data warehouses, data
    marts); knowledge is expensive and scarce.
◼   Example in retail: Customer transactions to
    consumer behavior:
      People who bought “Da Vinci Code” also bought “The Five
      People You Meet in Heaven” (www.amazon.com)
◼   Build a model that is a good and useful
    approximation to the data.
Data Mining/KDD
Definition := “KDD is the non-trivial process of
identifying valid, novel, potentially useful, and
ultimately understandable patterns in data” (Fayyad)
    Applications:
◼    Retail: Market basket analysis, Customer
     relationship management (CRM)
◼    Finance: Credit scoring, fraud detection
◼    Manufacturing: Optimization, troubleshooting
◼    Medicine: Medical diagnosis
◼    Telecommunications: Quality of service
     optimization
◼    Bioinformatics: Motifs, alignment
◼    Web mining: Search engines
◼    ...
Knowledge Discovery (KDD) Process
◼   Data mining is the core of the knowledge
    discovery process
◼   Steps of the process:
       Databases → Data Cleaning → Data Integration →
       Data Warehouse → Selection (task-relevant data) →
       Data Mining → Pattern Evaluation
What is Machine Learning?
◼   Machine Learning
       Study of algorithms that
       improve their performance
       at some task
       with experience
◼   Optimize a performance criterion using example
    data or past experience.
◼   Role of Statistics: Inference from a sample
◼   Role of Computer science: Efficient algorithms to
      Solve the optimization problem
      Representing and evaluating the model for
        inference
       Growth of Machine Learning
◼    Machine learning is the preferred approach to
       Speech recognition, Natural language processing
       Computer vision
       Medical outcomes analysis
       Robot control
       Computational biology
◼    This trend is accelerating
       Improved machine learning algorithms
       Improved data capture, networking, faster computers
       Software too complex to write by hand
       New sensors / IO devices
       Demand for self-customization to user, environment
       It turns out to be difficult to extract knowledge from human
        experts → hence the failure of expert systems in the 1980’s
Alpydin & Ch. Eick: ML Topic1
Applications
◼   Association Analysis
◼   Supervised Learning
     Classification
     Regression/Prediction
◼   Unsupervised Learning
◼   Semi-supervised Learning
◼   Reinforcement Learning
Learning Associations
◼     Basket analysis:
      P(Y | X): the probability that somebody who buys X also
      buys Y, where X and Y are products/services.
      Example: P(chips | beer) = 0.7
Market-Basket transactions
TID      Items
1        Bread, Milk
2        Bread, Diaper, Beer, Eggs
3        Milk, Diaper, Beer, Coke
4        Bread, Milk, Diaper, Beer
5        Bread, Milk, Diaper, Coke
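As a sketch (not from the slides), the conditional probability above can be estimated directly from the five transactions in the table:

```python
# Estimating the basket-analysis conditional P(Y | X) from the
# slide's transaction table (TIDs 1-5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def conditional(y, x, baskets):
    """P(Y | X) = (# baskets with both X and Y) / (# baskets with X)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

# Beer appears in baskets 2, 3, 4, and all three also contain Diaper:
print(conditional("Diaper", "Beer", transactions))  # -> 1.0
```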
Classification: Definition
 ◼   Given a collection of records
     (training set)
        Each record contains a set of attributes, one of the
         attributes is the class.
 ◼   Find a model for class attribute as
     a function of the values of other
     attributes.
 ◼   Goal: previously unseen records
     should be assigned a class as
     accurately as possible.
        A test set is used to determine the accuracy of the
         model. Usually, the given data set is divided into
         training and test sets, with training set used to build
         the model and test set used to validate it.
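A minimal sketch of the train/test split described above, using only Python's standard library; the labeled records and the 30% test fraction are illustrative assumptions:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the labeled records, then hold out a test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # training set, test set

# Hypothetical labeled records: (attributes, class)
records = [({"income": 125}, "No"), ({"income": 60}, "No"),
           ({"income": 95}, "Yes"), ({"income": 85}, "Yes"),
           ({"income": 75}, "No"),  ({"income": 90}, "Yes")]
train, test = train_test_split(records)
print(len(train), len(test))  # 5 training records, 1 test record
```

The training set is used to build the model; the held-out test set is used to validate it.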
Classification
◼   Example: Credit
    scoring
◼   Differentiating
    between low-risk
    and high-risk
    customers from
    their income and
    savings
     Discriminant: IF income > θ1 AND savings > θ2
                        THEN low-risk ELSE high-risk
     Model
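The discriminant above can be written directly as code; θ1 and θ2 below are illustrative values, since the slide leaves the thresholds unspecified:

```python
# The slide's credit-scoring discriminant as a function.
# THETA1 and THETA2 are assumed example thresholds, not from the slides.
THETA1, THETA2 = 50_000, 20_000

def credit_risk(income, savings):
    """IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk."""
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"

print(credit_risk(80_000, 30_000))  # -> low-risk
print(credit_risk(30_000, 30_000))  # -> high-risk
```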
Examples of Classification Task
◼   Predicting tumor cells as benign or malignant
◼   Classifying credit card transactions
    as legitimate or fraudulent
◼   Classifying secondary structures of protein
    as alpha-helix, beta-sheet, or random
    coil
◼   Categorizing news stories as finance,
    weather, entertainment, sports, etc.
           Illustrating a basic Classification Task
◼   Instances are described as a set of features or
    attributes and their values
◼   The class that the instance belongs to is also
    called its “label”
       Input is a set of “labeled instances”
The training set is fed to a learning algorithm, which induces a
model; the model is then applied (deduction) to the test set.

Training Set
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test Set
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?
Example of a Decision Tree

Training Data
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
    Refund?
    ├─ Yes → NO
    └─ No  → MarSt?
             ├─ Married          → NO
             └─ Single, Divorced → TaxInc?
                                   ├─ < 80K → NO
                                   └─ > 80K → YES
Another Example of Decision Tree

(same training data as the previous slide)

    MarSt?
    ├─ Married          → NO
    └─ Single, Divorced → Refund?
                          ├─ Yes → NO
                          └─ No  → TaxInc?
                                   ├─ < 80K → NO
                                   └─ > 80K → YES

There could be more than one tree that fits
the same data!
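The first of the two example trees can be hand-coded as nested tests, which makes plain that a decision tree is just a sequence of attribute tests (a minimal sketch, not from the slides):

```python
# The first example tree (Refund -> MarSt -> TaxInc) as nested conditionals.
def cheat(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income at 80K
    return "Yes" if taxable_income > 80_000 else "No"

print(cheat("No", "Single", 85_000))    # Tid 8  -> Yes
print(cheat("Yes", "Single", 125_000))  # Tid 1  -> No
```

This tree classifies all ten training records correctly, as does the second tree on this slide.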
Decision Tree Classification Task
Same workflow as the basic classification task: the training set
(Tids 1–10) is fed to a tree-induction algorithm, which learns a
decision-tree model; the model is then applied (deduction) to the
test set (Tids 11–15).
Apply Model to Test Data

Test instance: Refund = No, Marital Status = Married,
Taxable Income = 80K, Cheat = ?

Walking the tree: Refund? = No → MarSt? = Married →
assign Cheat to “No”
   Types of Modelers/Models
A modeler takes a set of labeled instances and builds a model (a
classifier); the model then assigns a class to each new instance.
Common modelers:
◼   Logistic regression
◼   Naïve Bayes classifiers
◼   Support vector machines (SVMs)
◼   Decision trees
◼   Random forests
◼   Kernel methods
◼   Genetic algorithms
◼   Neural networks
    What Modeler to Choose?
◼   Logistic regression
◼   Naïve Bayes classifiers
◼   Support vector machines (SVMs)
◼   Decision trees
◼   Random forests
◼   Kernel methods
◼   Genetic algorithms (GAs)
◼   Neural networks: perceptrons
Data scientists try different modelers, with different parameters,
and check the accuracy to figure out which one works best for the
data at hand.
 Ensembles
◼   An ensemble method uses several algorithms that do the
    same task, and combines their results
       “Ensemble learning”
◼   A combination function joins the results
       Majority vote: each algorithm gets a vote
       Weighted voting: each algorithm’s vote has a weight
       Other complex combination functions
[Figure: Modelers A, B, C each build a model from the instances;
a combination function joins the three models into a final model]
http://magizbox.com/index.php/machine-learning/ds-model-building/ensemble/
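A minimal sketch of the majority-vote combination function; the model predictions below are illustrative:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one label per model; returns the most common label."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical models vote on one instance:
print(majority_vote(["Yes", "No", "Yes"]))  # -> Yes
```

Weighted voting would replace the simple count with a per-model weight.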
Evaluating a Classifier
Classification Accuracy
◼   Accuracy: percentage of correct classifications

    Accuracy = (total test instances classified correctly)
               / (total number of test instances)

    There are many more metrics, which we will study later.
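The accuracy formula above, as a small function (the predicted/actual labels are illustrative):

```python
def accuracy(predicted, actual):
    """Fraction of test instances classified correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# 3 of 4 test predictions match the true labels:
print(accuracy(["No", "Yes", "No", "Yes"],
               ["No", "No", "No", "Yes"]))  # -> 0.75
```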
Prediction: Regression
◼   Example: Price of a
    used car
◼   x: car attributes
    y: price
         y = g(x | θ)
    g(): model, θ: parameters
    e.g., the linear model y = wx + w0
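A minimal sketch of fitting the linear model y = wx + w0 by ordinary least squares with one attribute; the (mileage, price) pairs below are made up for illustration:

```python
# Closed-form least squares for one feature: w is the slope, w0 the intercept.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    w0 = mean_y - w * mean_x
    return w, w0

# Hypothetical (mileage in 10k km, price in $1000) pairs:
xs, ys = [1, 2, 3, 4], [20, 18, 16, 14]
w, w0 = fit_line(xs, ys)
print(w, w0)  # -2.0 22.0: price drops $2k per 10k km
```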
Regression Applications
◼   Navigating a car: Angle of the steering wheel (CMU
    NavLab)
◼   Kinematics of a robot arm: given a target position (x, y),
    predict the joint angles α1 = g1(x, y) and α2 = g2(x, y)
Supervised Learning: Uses
Example: decision-tree tools that create rules
◼   Prediction of future cases: Use the rule to predict
    the output for future inputs
◼   Knowledge extraction: The rule is easy to
    understand
◼   Compression: The rule is simpler than the data it
    explains
◼   Outlier detection: Exceptions that are not covered
    by the rule, e.g., fraud
Unsupervised Learning
◼   Learning “what normally happens”
◼   No output
◼   Clustering: Grouping similar instances
◼   Other applications: Summarization, Association
    Analysis
◼   Example applications
      Customer segmentation in CRM
      Image compression: Color quantization
      Bioinformatics: Learning motifs
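As a sketch of “grouping similar instances”, a tiny one-dimensional k-means with k = 2; the data and the min/max initialization are illustrative, and both groups are assumed non-empty:

```python
# Minimal k-means sketch (k = 2, one feature): alternate between
# assigning points to the nearest center and recomputing the centers.
def kmeans_1d(points, iters=10):
    c1, c2 = min(points), max(points)   # initial centers (assumed)
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

print(kmeans_1d([1, 2, 3, 10, 11, 12]))  # -> ([1, 2, 3], [10, 11, 12])
```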
Reinforcement Learning
◼   Topics:
       Policies: what actions should an agent take in a particular
        situation
       Utility estimation: how good is a state (→ used by the policy)
◼   No supervised output but delayed reward
◼   Credit assignment problem (what was responsible
    for the outcome)
◼   Applications:
       Game playing
       Robot in a maze
       Multiple agents, partial observability, ...
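A tiny sketch of utility estimation with delayed reward: value iteration on a one-dimensional corridor where only the last cell gives reward. The corridor size, discount factor, and reward are all illustrative assumptions:

```python
# Value iteration: each cell's utility becomes the discounted utility of
# its best neighbor, so the delayed reward propagates backward.
GAMMA = 0.9   # discount factor (assumed)
N = 5         # corridor cells 0..4; reward 1 at cell 4 (assumed)

values = [0.0] * N
for _ in range(100):
    new = values[:]
    for s in range(N):
        if s == N - 1:
            new[s] = 1.0                                  # terminal reward
        else:
            # best action: step left or right
            new[s] = GAMMA * max(values[max(s - 1, 0)], values[s + 1])
    values = new

print([round(v, 2) for v in values])  # -> [0.66, 0.73, 0.81, 0.9, 1.0]
```

Utility grows toward the rewarding cell, which is exactly the signal a policy needs to choose good actions far from the reward.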
Resources: Datasets
◼   UCI Repository:
    http://www.ics.uci.edu/~mlearn/MLRepository.html
◼   UCI KDD Archive:
    http://kdd.ics.uci.edu/summary.data.application.html
◼   Statlib: http://lib.stat.cmu.edu/
◼   Delve: http://www.cs.utoronto.ca/~delve/
Resources: Journals
◼   Journal of Machine Learning Research www.jmlr.org
◼   Machine Learning
◼   IEEE Transactions on Neural Networks
◼   IEEE Transactions on Pattern Analysis and Machine
    Intelligence
◼   Annals of Statistics
◼   Journal of the American Statistical Association
◼   ...
    Resources: Conferences
◼   International Conference on Machine Learning (ICML)
◼   European Conference on Machine Learning (ECML)
◼   Neural Information Processing Systems (NIPS)
◼   Conference on Computational Learning Theory (COLT)
◼   International Joint Conference on Artificial Intelligence (IJCAI)
◼   ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    (KDD)
◼   IEEE Int. Conf. on Data Mining (ICDM)