UNIT-V
Learning: Forms of Learning, Supervised Learning, Learning Decision Trees.
Knowledge in Learning: Logical Formulation of Learning, Knowledge in Learning,
Explanation-Based Learning, Learning Using Relevance Information, Inductive
Logic Programming.
Learning from Examples: Forms of Learning, Supervised Learning, Learning
Decision Trees, Evaluating and Choosing the Best Hypothesis, The Theory of
Learning, Regression and Classification with Linear Models, Artificial Neural
Networks, Nonparametric Models, Support Vector Machines, Ensemble
Learning, Practical Machine Learning.
Knowledge in Learning: A Logical Formulation of Learning, Knowledge in
Learning, Explanation-Based Learning, Learning Using Relevance Information,
Inductive Logic Programming.

FORMS OF LEARNING
   An agent is learning if it improves its performance on future tasks after
   making observations about the world.
   Any component of an agent can be improved by learning from data. It depends
   upon four factors:
      • Which component is to be improved
         o direct mapping from conditions on the current state to actions
         o infer relevant properties of the world
         o results of possible actions
         o Action-value information
         o Goals that describe classes of states
      • What prior knowledge the agent already has.
      • What representation is used for the data and the component.
         o representations: propositional and first-order logical sentences
         o Bayesian networks for the inferential components
         o factored representation—a vector of attribute values—and outputs that
            can be either a continuous numerical value or a discrete value
   • What feedback is available to learn from: the type of feedback available
     determines the three main types of learning
      o In unsupervised learning the agent learns patterns in the input even
        though no explicit feedback is supplied.
      o In reinforcement learning the agent learns from a series of
        reinforcements: rewards or punishments.
      o In supervised learning the agent observes some example input–output pairs
        and learns a function that maps from input to output.
      o In semi-supervised learning we are given a few labeled examples and must
        make what we can of a large collection of unlabelled examples.
SUPERVISED LEARNING
       Given a training set of N example input–output pairs (x1, y1), (x2, y2), . . . (xN,
       yN) , where each yj was generated by an unknown function y = f(x), discover a
       function h that approximates the true function f. The function h is a hypothesis. To
       measure the accuracy of a hypothesis we give it a test set of examples that are distinct
       from the training set.
       Conditional Probability Distribution: when the function f is stochastic, that is,
       not strictly a function of x, what we have to learn is a conditional probability
       distribution, P(Y | x).
       Classification: When the output y is one of a finite set of values, the learning
       problem is called classification.
       Regression: When y is a number (such as tomorrow’s temperature), the learning
       problem is called regression.
       Hypothesis space, H, can be, for example, a set of polynomials when we are
       fitting a function of a single variable to some data points.
       Ockham’s razor :how do we choose a function or a polynomial from among
       multiple consistent hypotheses? One answer is to prefer the simplest hypothesis
       consistent with the data. This principle is called Ockham’s razor
       Realizable : a learning problem is realizable if the hypothesis space contains the
       true function. Unfortunately, we cannot always tell whether a given learning problem
       is realizable, because the true function is not known.
       Supervised learning can be done by choosing the hypothesis h* that is most
       probable given the data: h* = argmax_{h ∈ H} P(h | data), which by Bayes’
       rule equals argmax_{h ∈ H} P(data | h) P(h).
      There is a tradeoff between the expressiveness of a hypothesis space and the
      complexity of finding a good hypothesis within that space.
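As a concrete sketch of this setup (an illustration, not from the text): the hypothesis space below is the set of straight lines h(x) = a*x + b, fit to training pairs by least squares, with accuracy measured on a test set distinct from the training set. The data and the true function f(x) = 2x + 1 are made up for the example.

```python
# A minimal sketch of supervised learning: fit a hypothesis h(x) = a*x + b
# to training pairs by closed-form least squares, then evaluate on a test set.

def fit_line(pairs):
    """Closed-form least-squares fit for h(x) = a*x + b."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def mse(h, pairs):
    """Mean squared error of hypothesis h on a set of examples."""
    return sum((h(x) - y) ** 2 for x, y in pairs) / len(pairs)

# An unknown true function f(x) = 2x + 1 generates the labels.
train = [(x, 2 * x + 1) for x in range(5)]
test = [(x, 2 * x + 1) for x in range(5, 8)]   # distinct from the training set

a, b = fit_line(train)
h = lambda x: a * x + b
```

Here the hypothesis space is realizable (the true function is a line), so the fitted h recovers f exactly and the test-set error is zero.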
LEARNING DECISION TREES
    Decision tree induction is one of the simplest and yet most successful forms of machine
    learning.
    The decision tree representation :The aim here is to learn a definition for the goal
    predicate.
    A decision tree represents a function that takes as input a vector of attribute values and
    returns a “decision”—a single output value. The input and output values can be discrete or
    continuous
    o A decision tree reaches its decision by performing a sequence of tests.
    o Each internal node in the tree corresponds to a test of the value of one of the input
       attributes, Ai,
    o the branches from the node are labeled with the possible values of the attribute, Ai =vik.
    o Each leaf node in the tree specifies a value to be returned by the function.
         Decision Tree Algorithm:
         The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-
         conquer strategy. This test divides the problem up into smaller subproblems that
         can then be solved recursively.
          function DECISION-TREE-LEARNING(examples, attributes, parent_examples)
          returns a tree
            if examples is empty then return PLURALITY-VALUE(parent_examples)
            else if all examples have the same classification then return the classification
            else if attributes is empty then return PLURALITY-VALUE(examples)
            else
              A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
              tree ← a new decision tree with root test A
              for each value vk of A do
                exs ← {e : e ∈ examples and e.A = vk}
                subtree ← DECISION-TREE-LEARNING(exs, attributes − A, examples)
                add a branch to tree with label (A = vk) and subtree subtree
              return tree
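The pseudocode above can be rendered compactly in Python. The dict-based tree representation, the attribute/label encoding, and the toy weather data are illustrative assumptions; IMPORTANCE is implemented as information gain.

```python
# A compact Python sketch of DECISION-TREE-LEARNING; examples are dicts of
# attribute values plus a "label" key, and the tree is a nested dict keyed
# by (attribute, value) branches.
from collections import Counter
import math

def plurality_value(examples):
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def entropy(examples):
    counts = Counter(e["label"] for e in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def importance(a, examples):
    """Information gain of attribute a on these examples."""
    remainder = 0.0
    for v in {e[a] for e in examples}:
        exs = [e for e in examples if e[a] == v]
        remainder += len(exs) / len(examples) * entropy(exs)
    return entropy(examples) - remainder

def decision_tree_learning(examples, attributes, parent_examples=()):
    if not examples:
        return plurality_value(parent_examples)
    if len({e["label"] for e in examples}) == 1:
        return examples[0]["label"]
    if not attributes:
        return plurality_value(examples)
    A = max(attributes, key=lambda a: importance(a, examples))
    tree = {}
    for vk in {e[A] for e in examples}:
        exs = [e for e in examples if e[A] == vk]
        tree[(A, vk)] = decision_tree_learning(
            exs, [a for a in attributes if a != A], examples)
    return tree

# Hypothetical toy data: the label is True exactly when it rains and
# there is no umbrella.
data = [
    {"rain": "yes", "umbrella": "no",  "label": True},
    {"rain": "yes", "umbrella": "yes", "label": False},
    {"rain": "no",  "umbrella": "no",  "label": False},
    {"rain": "no",  "umbrella": "yes", "label": False},
]
tree = decision_tree_learning(data, ["rain", "umbrella"])
```

On this data the learner splits on rain first, then on umbrella inside the rainy branch, mirroring the greedy divide-and-conquer strategy.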
   Expressiveness of decision trees
   A Boolean decision tree is logically equivalent to the assertion that the goal
   attribute is true if and only if the input attributes satisfy one of the paths leading
   to a leaf with value true.
    Goal ⇔ (Path1 ∨ Path2 ∨ ···), where each Path is a conjunction of attribute-value
    tests required to follow that path. A tree consists of just tests on
   attributes in the interior nodes, values of attributes on the branches, and output
   values on the leaf nodes. For a wide variety of problems, the decision tree format
   yields a nice, concise result. But some functions cannot be represented concisely.
   We can evaluate the accuracy of a learning algorithm with a learning curve.
Choosing attribute tests
The greedy search used in decision tree learning is designed to approximately
minimize the depth of the final tree. The idea is to pick the attribute that goes as far
as possible toward providing an exact classification of the examples. A perfect
attribute divides the examples into sets, each of which is either all positive or all
negative and will thus be a leaf of the tree.
Entropy is a measure of the uncertainty of a random variable; acquisition of
information corresponds to a reduction in entropy. For a random variable V with
values vk, each with probability P(vk), H(V) = −Σk P(vk) log2 P(vk).
We can check that the entropy of a fair coin flip is indeed 1 bit:
H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 .
The information gain from the attribute test on A is the expected reduction in
entropy: Gain(A) = B(p/(p + n)) − Remainder(A), where B(q) is the entropy of a
Boolean variable that is true with probability q, p and n are the numbers of positive
and negative examples, and Remainder(A) = Σk ((pk + nk)/(p + n)) B(pk/(pk + nk)).
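These quantities are easy to check numerically. The sketch below defines B(q), the entropy of a Boolean variable that is true with probability q, and verifies the fair-coin claim plus the gain of a perfect attribute on a hypothetical 6-positive/6-negative sample.

```python
# Checking the entropy and information-gain claims numerically.
import math

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

fair = B(0.5)      # a fair coin flip carries 1 bit of uncertainty
loaded = B(0.99)   # a heavily loaded coin carries far less

# A perfect attribute splits 6 positives and 6 negatives into an
# all-positive and an all-negative subset, so the remainder is 0 and
# the gain equals the full 1 bit of starting entropy.
gain_perfect = B(6 / 12) - (6 / 12 * B(1.0) + 6 / 12 * B(0.0))
```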
Pruning
      In decision trees, a technique called decision tree pruning combats overfitting. Pruning
      works by eliminating nodes that are not clearly relevant.
      Issues in decision trees:
         • Missing data
         • Multivalued attributes
         • Continuous and integer-valued input attributes
         • Continuous-valued output attributes
       KNOWLEDGE IN LEARNING
       A LOGICAL FORMULATION OF LEARNING
      Current-best-hypothesis search
      The idea behind current-best-hypothesis search is to maintain a single hypothesis,
      and to adjust it as new examples arrive in order to maintain consistency.
      The extension of the hypothesis must be increased to include new examples. This
      is called generalization.
       function CURRENT-BEST-LEARNING(examples, h) returns a hypothesis or fail
         if examples is empty then return h
         e ← FIRST(examples)
         if e is consistent with h then
           return CURRENT-BEST-LEARNING(REST(examples), h)
         else if e is a false positive for h then
           for each h' in specializations of h consistent with examples seen so far do
             h'' ← CURRENT-BEST-LEARNING(REST(examples), h')
             if h'' ≠ fail then return h''
         else if e is a false negative for h then
           for each h' in generalizations of h consistent with examples seen so far do
             h'' ← CURRENT-BEST-LEARNING(REST(examples), h')
             if h'' ≠ fail then return h''
         return fail
      The extension of the hypothesis must be decreased to exclude the example. This is
      called specialization.
      Least-commitment search
     Backtracking arises because the current-best-hypothesis approach has to choose a
     particular hypothesis as its best guess even though it does not have enough data yet
     to be sure of the choice. What we can do instead is to keep around all and only those
     hypotheses that are consistent with all the data so far. Each new example will either
     have no effect or will get rid of some of the hypotheses.
     One important property of this approach is that it is incremental: one never has to go
     back and reexamine the old examples.
Boundary Set :
      We also have an ordering on the hypothesis space, namely,
      generalization/specialization. This is a partial ordering, which means that each
      boundary will not be a point but rather a set of hypotheses called a boundary set.
      The great thing is that we can represent the entire version space using just
      two boundary sets:
       a most general boundary (the G-set) and
       a most specific boundary (the S-set).
      Everything in between is guaranteed to be consistent with the examples.
      Consider the members Si and Gi of the S- and G-sets. For each one, the new
      example may be a false positive or a false negative.
     1. False positive for Si: This means Si is too general, but there are no consistent
     specializations of Si (by definition), so we throw it out of the S-set.
     2. False negative for Si: This means Si is too specific, so we replace it by all its
     immediate generalizations, provided they are more specific than some member of G.
      3. False positive for Gi: This means Gi is too general, so we replace it by all its
      immediate specializations, provided they are more specific than some member of G
      and more general than some member of S.
     4. False negative for Gi: This means Gi is too specific, but there are no consistent
     generalizations of Gi (by definition) so we throw it out of the G-set
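The four update rules can be sketched for the simple case of conjunctive hypotheses over discrete attributes. The representation below ('?' matches anything, '0' matches nothing, one hypothesis per boundary initially) and the toy data are assumptions for illustration, in the style of candidate elimination.

```python
# A minimal version-space sketch: maintain the S- and G-sets for
# conjunctive hypotheses, applying the boundary-update rules above.

def covers(h, x):
    """Does hypothesis h classify example x as positive?"""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def more_general_or_equal(g, s):
    return all(gv == '?' or gv == sv for gv, sv in zip(g, s))

def minimal_generalization(s, x):
    """Generalize s just enough to cover the positive example x."""
    return tuple(xv if sv == '0' else (sv if sv == xv else '?')
                 for sv, xv in zip(s, x))

def minimal_specializations(g, x, domains):
    """Specialize g just enough to exclude the negative example x."""
    out = []
    for i, gv in enumerate(g):
        if gv == '?':
            out.extend(g[:i] + (v,) + g[i + 1:]
                       for v in domains[i] if v != x[i])
    return out

def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple('0' for _ in range(n))]   # most specific boundary
    G = [tuple('?' for _ in range(n))]   # most general boundary
    for x, positive in examples:
        if positive:
            # false negative: drop inconsistent G members, generalize S
            G = [g for g in G if covers(g, x)]
            S = [minimal_generalization(s, x) for s in S]
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:
            # false positive: drop inconsistent S members, specialize G
            S = [s for s in S if not covers(s, x)]
            G = [sp for g in G
                 for sp in ([g] if not covers(g, x)
                            else minimal_specializations(g, x, domains))
                 if any(more_general_or_equal(sp, s) for s in S)]
    return S, G

domains = [('red', 'blue'), ('small', 'large')]
examples = [(('red', 'small'), True),
            (('red', 'large'), True),
            (('blue', 'small'), False)]
S, G = candidate_elimination(examples, domains)
```

After these three examples the two boundaries meet at the single hypothesis "first attribute is red", showing how each new example either leaves a boundary alone or moves it.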
EXPLANATION-BASED LEARNING
     Explanation-based learning is a method for extracting general rules from individual
     observations.
Memoization
      The technique of memoization has long been used in computer science to speed up
      programs by saving the results of computation. The basic idea of memo functions is
      to accumulate a database of input–output pairs; when the function is called, it first
      checks the database to see whether it can avoid solving the problem from scratch.
     Explanation-based learning takes this a good deal further, by creating general rules
     that cover an entire class of cases.
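The memo-function idea can be sketched in a few lines of Python (the decorator and the Fibonacci example are illustrative choices, not from the text):

```python
# A memo function: cache input-output pairs and consult the cache
# before solving the problem from scratch.
import functools

def memoize(f):
    cache = {}                      # the database of input-output pairs
    @functools.wraps(f)
    def wrapper(*args):
        if args not in cache:       # solve from scratch only on a cache miss
            cache[args] = f(*args)
        return cache[args]
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

Without the cache the recursive fib is exponential; with it, each input is solved once. EBL goes further by caching a general rule rather than individual input-output pairs.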
      Basic EBL process works as follows:
      1. Given an example, construct a proof that the goal predicate applies to the example
      using the available background knowledge
      2. In parallel, construct a generalized proof tree for the variabilized goal using the
      same inference steps as in the original proof.
      3. Construct a new rule whose left-hand side consists of the leaves of the proof tree
      and whose right- hand side is the variabilized goal (after applying the necessary
      bindings from the generalized proof).
      4. Drop any conditions from the left-hand side that are true regardless of the values
      of the variables in the goal.
      Three factors involved in the analysis of efficiency gains from EBL:
      1. Adding large numbers of rules can slow down the reasoning process, because the
      inference mechanism must still check those rules even in cases where they do not
      yield a solution. In other words, it increases the branching factor in the search space.
      2. To compensate for the slowdown in reasoning, the derived rules must offer
      significant increases in speed for the cases that they do cover. These increases come
      about mainly because the derived rules avoid dead ends that would otherwise be
      taken, but also because they shorten the proof itself.
      3. Derived rules should be as general as possible, so that they apply to the largest possible
        set of cases.
LEARNING USING RELEVANCE INFORMATION
     The learning algorithm is based on a straightforward attempt to find the simplest
     determination consistent with the observations.
      A determination P ≻ Q says that if any pair of examples match on P, then they must
      also match on Q. A determination is therefore consistent with a set of examples if
      every pair that matches on the predicates on the left-hand side also matches on the
      goal predicate.
      An algorithm for finding a minimal consistent determination
       function MINIMAL-CONSISTENT-DET(E, A) returns a set of attributes
         inputs: E, a set of examples
                 A, a set of attributes, of size n
         for i = 0 to n do
           for each subset Ai of A of size i do
             if CONSISTENT-DET?(Ai, E) then return Ai

       function CONSISTENT-DET?(A, E) returns a truth value
         inputs: A, a set of attributes
                 E, a set of examples
         local variables: H, a hash table
         for each example e in E do
           if some example in H has the same values as e for the attributes A
              but a different classification then return false
           store the class of e in H, indexed by the values for attributes A of e
         return true
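A direct Python transcription of the two functions above (a sketch; the dict-based example encoding and the conductance-style toy data are assumptions):

```python
# MINIMAL-CONSISTENT-DET in Python; examples are dicts mapping attribute
# names to values plus a "class" key.
from itertools import combinations

def consistent_det(attrs, examples):
    """CONSISTENT-DET?: no two examples agree on attrs but differ in class."""
    H = {}                                     # the hash table
    for e in examples:
        key = tuple(e[a] for a in attrs)
        if key in H and H[key] != e["class"]:
            return False
        H[key] = e["class"]
    return True

def minimal_consistent_det(examples, attributes):
    """Return a smallest attribute subset that determines the class."""
    for i in range(len(attributes) + 1):
        for subset in combinations(attributes, i):
            if consistent_det(subset, examples):
                return set(subset)

# Toy data: 'material' alone determines the class; 'size' is irrelevant.
examples = [
    {"material": "copper", "size": 3,  "class": "conducts"},
    {"material": "copper", "size": 12, "class": "conducts"},
    {"material": "rubber", "size": 3,  "class": "insulates"},
    {"material": "rubber", "size": 12, "class": "insulates"},
]
```

Trying subsets in order of increasing size guarantees the first consistent determination returned is minimal; this is the attribute set RBDTL would hand to the decision-tree learner.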
      Given an algorithm for learning determinations, a learning agent has a way to
      construct a minimal hypothesis within which to learn the target predicate. For
       example, we can combine MINIMAL-CONSISTENT-DET with the
       DECISION-TREE-LEARNING algorithm.
      This yields a relevance-based decision-tree learning algorithm RBDTL that first
      identifies a minimal set of relevant attributes and then passes this set to the decision
      tree algorithm for learning.
INDUCTIVE LOGIC PROGRAMMING
     Inductive logic programming (ILP) combines inductive methods with the power of
     first-order representations, concentrating in particular on the representation of hypotheses
     as logic programs.
     It has gained popularity for three reasons.
         1. ILP offers a rigorous approach to the general knowledge-based inductive learning
            problem.
         2. It offers complete algorithms for inducing general, first-order theories from
            examples, which can therefore learn successfully in domains where attribute-
            based algorithms are hard to apply.
          3. Inductive logic programming produces hypotheses that are (relatively) easy
             for humans to read.
      The object of an inductive learning program is to come up with a set of sentences
      for the Hypothesis such that the entailment constraint is satisfied. Suppose, for the
      moment, that the agent has no background knowledge: Background is empty. Then
      one possible solution requires making pairs of people into objects.
     Top-down inductive learning methods
     The first approach to ILP works by starting with a very general rule and gradually
     specializing it so that it fits the data.
     This is essentially what happens in decision-tree learning, where a decision tree is
     gradually grown until it is consistent with the observations.
      To do ILP we use first-order literals instead of attributes, and the hypothesis is a
     set of clauses instead of a decision tree.
     Three kinds of literals
               1. Literals using predicates
               2. Equality and inequality literals
               3. Arithmetic comparisons
      Inductive learning with inverse deduction
      The second major approach to ILP involves inverting the normal deductive
      proof process. Inverse resolution is based on the observation that if the example
      classifications follow from the background knowledge together with the hypothesis,
      then one can run a resolution proof backward to recover that hypothesis.
      Recall that an ordinary resolution step takes two clauses C1 and C2 and resolves
      them to produce the resolvent C.
      An inverse resolution step takes a resolvent C and produces two clauses C1 and
      C2, such that C is the result of resolving C1 and C2.
      Alternatively, it may take a resolvent C and clause C1 and produce a clause C2 such that
      C is the result of resolving C1 and C2.
      A number of approaches to taming the search implemented in ILP systems
                   1. Redundant choices can be eliminated
                   2. The proof strategy can be restricted
                   3. The representation language can be restricted
                   4. Inference can be done with model checking rather than theorem proving
                   5. Inference can be done with ground propositional clauses rather
                       than in first-order logic.
Neural Networks (ANN - Artificial Neural Network)
Introduction
The term "Artificial Neural Network" is derived from biological neural networks
that develop the structure of a human brain.
Similar to the human brain that has neurons interconnected to one another, artificial neural
networks also have neurons that are interconnected to one another in various layers of the
networks. These neurons are known as nodes.
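The node-and-layer idea can be sketched as a tiny network: each node computes a weighted sum of its inputs and squashes it through an activation function. The hand-picked weights below are an assumption for illustration; they make the hidden layer implement OR and NAND so the output node computes XOR, the classic function a single node cannot represent.

```python
# A minimal forward pass through a 2-input, one-hidden-layer network.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One node: weighted sum of inputs passed through the activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(x):
    h1 = neuron(x, [20.0, 20.0], -10.0)        # roughly an OR gate
    h2 = neuron(x, [-20.0, -20.0], 30.0)       # roughly a NAND gate
    return neuron([h1, h2], [20.0, 20.0], -30.0)  # AND of both: XOR

outputs = {x: forward(list(x)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

In a trained network these weights would be learned from data rather than set by hand.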
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can be
classified into two classes by using a single straight line, then such data is termed as linearly
separable data, and classifier is used called as Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means if a
dataset cannot be classified by using a straight line, then such data is termed as non-linear data
and classifier used is called as Non-linear SVM classifier.
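The linear case can be made concrete with a toy linear SVM trained by batch gradient descent on the regularized hinge loss (a sketch only: the data, hyperparameters, and the omission of a bias term are simplifying assumptions).

```python
# A toy linear SVM: minimize  lam*||w||^2 + mean(max(0, 1 - y * w.x))
# by batch subgradient descent.

def svm_train(X, y, lam=0.01, lr=0.1, epochs=200):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [2 * lam * wi for wi in w]          # regularizer term
        for xi, yi in zip(X, y):
            margin = yi * sum(wi * xij for wi, xij in zip(w, xi))
            if margin < 1:                          # hinge-loss violator
                for j in range(len(w)):
                    grad[j] -= yi * xi[j] / len(X)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Linearly separable toy data: the classes sit on opposite sides of a line.
X = [[2, 2], [3, 3], [-2, -2], [-3, -3]]
y = [1, 1, -1, -1]
w = svm_train(X, y)
```

For non-linearly separable data the same machinery is applied after a kernel maps the inputs into a higher-dimensional space where a separating line (hyperplane) exists.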
Ensemble Learning is a machine learning paradigm that combines multiple models (often
referred to as "weak learners" or "base learners") to produce a more robust and accurate prediction
model. The central idea is that by aggregating the outputs of multiple models, the ensemble can
outperform any single model.
Ensemble learning is a cornerstone of practical machine learning, as it often leads to significant
improvements in predictive performance. Below is a guide to ensemble learning and its practical
applications.
Key Ensemble Techniques
  1. Bagging (Bootstrap Aggregating):
        o Involves training multiple instances of the same learning algorithm on different
           random subsets of the training data (sampled with replacement).
        o Example: Random Forest.
        o Strength: Reduces variance and helps avoid overfitting.
  2. Boosting:
        o Trains multiple models sequentially, where each model focuses on correcting the
           errors of its predecessor.
        o Example: AdaBoost, Gradient Boosting Machines (e.g., XGBoost, LightGBM,
           CatBoost).
        o Strength: Reduces bias and variance by creating a strong learner from weak ones.
  3. Stacking (Stacked Generalization):
        o Combines predictions from multiple base models using a meta-model (another
           model that learns how to best combine the predictions).
        o Strength: Can leverage the strengths of diverse algorithms.
  4. Voting:
        o Combines predictions from multiple models by averaging (for regression) or taking
           a majority vote (for classification).
        o Types: Hard voting (majority vote), Soft voting (weighted probabilities).
        o Strength: Simple and effective.
  5. Blending:
        o A simplified version of stacking, where the meta-model is trained on base-
           model predictions over a separate holdout dataset rather than via cross-
           validation.
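Bagging with hard voting can be sketched in plain Python (an illustration, not a production method: the decision-stump weak learner and the toy dataset are made-up assumptions).

```python
# Bare-bones bagging: train decision stumps on bootstrap samples and
# combine them by hard (majority) voting.
import random
from collections import Counter

def bootstrap(data, rng):
    return [rng.choice(data) for _ in data]   # sample with replacement

def train_stump(data):
    """Weak learner: the best single-feature threshold rule on this sample."""
    best = None
    n_features = len(data[0][0])
    for f in range(n_features):
        for t in sorted({x[f] for x, _ in data}):
            for sign in (1, -1):
                pred = lambda x, f=f, t=t, s=sign: 1 if s * (x[f] - t) >= 0 else 0
                errors = sum(pred(x) != label for x, label in data)
                if best is None or errors < best[0]:
                    best = (errors, pred)
    return best[1]

def bagged_ensemble(data, n_models=7, seed=0):
    rng = random.Random(seed)
    models = [train_stump(bootstrap(data, rng)) for _ in range(n_models)]
    def vote(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return vote

# Toy data: the label is 1 iff the first feature exceeds 5.
data = [([v, v % 3], int(v > 5)) for v in range(11)]
clf = bagged_ensemble(data)
```

Each stump sees a different bootstrap sample, so their individual thresholds vary; the majority vote averages that variance away, which is the core argument for bagging.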
Practical Applications
  1. Handling Complex Datasets:
         o Ensemble models are particularly effective when individual models struggle with
            capturing the complexity of the dataset.
         o Examples: Image recognition, financial modeling, and text classification.
  2. Improving Model Robustness:
         o Reduces sensitivity to noise in the data, ensuring better generalization.
  3. Winning Data Science Competitions:
         o Ensembles are a go-to strategy in competitions like Kaggle, where blending and
            stacking are widely used.
  4. Feature Selection and Importance:
         o Ensemble methods, especially Random Forest and Gradient Boosting, provide
            insights into feature importance.
Challenges in Ensemble Learning
  1. Computational Complexity:
        o Training multiple models can be resource-intensive.
  2. Risk of Overfitting:
        o Boosting methods like XGBoost need careful tuning to avoid overfitting.
  3. Interpretability:
        o Combining models reduces the interpretability compared to single models.
Ensemble Learning in Practice:
  •   Use Random Forest for quick, baseline results.
  •   Apply Boosting for high-performance tasks, especially when optimizing for competitions.
  •   Experiment with Stacking and Blending for complex datasets or when combining diverse
      algorithms.