Introduction To Data Mining
Introduction
Scenarios in life
    1. Human in vitro fertilization involves collecting several eggs from a woman’s ovaries, which,
       after fertilization with partner or donor sperm, produce several embryos. Some of these are
       selected and transferred to the woman’s uterus. The challenge is to select the “best” embryos
       to use—the ones that are most likely to survive. Selection is based on around 60 recorded
       features of the embryos—characterizing their morphology, oocyte, and follicle, and the
       sperm sample. The number of features is large enough to make it difficult for an
       embryologist to assess them all simultaneously and correlate historical data with the crucial
       outcome of whether that embryo did or did not result in a live child. Machine learning has
       been investigated as a technique for making the selection, using historical records of
       embryos and their outcome as training data.
    2. Every year, large scale dairy farmers have to make a tough business decision: which cows to
       retain in their herd and which to sell off to an abattoir. Typically, one-fifth of the cows in a
       dairy herd are culled each year near the end of the milking season as feed reserves dwindle.
       Each cow’s breeding and milk production history influences this decision. Other factors
       include age (a cow nears the end of its productive life at eight years), health problems,
       history of difficult calving, undesirable temperament traits (kicking or jumping fences), and
       not being pregnant with calf for the following season. About 700 attributes for each of
       several million cows have been recorded over the years. Machine learning has been
       investigated as a way of ascertaining what factors are taken into account by successful
       farmers—not to automate the decision but to propagate their skills and experience to others.
Machine learning is a growing new technology for mining knowledge from data.
  Data mining is defined as the process of discovering patterns in data. The process must be
  automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that
  they lead to some advantage, usually an economic one. The data is invariably present in
  substantial quantities.
Machine Learning provides techniques for finding and describing structural patterns in data.
  Machine Learning
  What is learning, anyway? What is machine learning? These are philosophical questions.
  Dictionaries define “to learn” as:
        To get knowledge of something by study, experience, or being taught.
        To become aware by information or from observation
        To commit to memory
        To be informed of or to ascertain
        To receive instruction
  These meanings have some shortcomings when it comes to talking about computers. For the
  first two, it is virtually impossible to test whether learning has been achieved or not.
        How do you know whether a machine has got knowledge of something?
        How do you know whether it has become aware of something?
  For the last three meanings, merely committing to memory and receiving instruction seem to
  fall far short of what we might mean by machine learning.
       They are too passive – computers find these tasks trivial.
       An entity can receive instruction without benefiting from it at all.
  Note: Things learn when they change their behavior in a way that makes them perform better in
  the future. This ties learning to performance rather than knowledge. You can test learning by
  observing present behavior and comparing it with past behavior.
                                            Examples
The Weather Problem
   This weather problem concerns conditions that are suitable for playing some unspecified
    game.
   In general, instances in a dataset are characterized by the values of features, or attributes,
    that measure different aspects of the instance. In this case there are four attributes: outlook,
    temperature, humidity, and windy. The outcome is whether to play or not.
   In table 1.2, all four attributes have values that are symbolic categories rather than numbers.
    Outlook can be sunny, overcast, or rainy; temperature can be hot, mild, or cool; humidity
    can be high or normal; and windy can be true or false. This creates 36 possible
    combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of input examples.
    The following rules, which are intended to be interpreted in sequence, could be derived
     from the data above:
   A set of rules that are intended to be interpreted in sequence is called a decision list.
   Suppose that two of the attributes—temperature and humidity—have numeric values as
    shown in table 1.3. This means that any learning scheme must create inequalities involving
    these attributes rather than simple equality tests as in the former case. This is called a
    numeric-attribute problem—in this case, a mixed-attribute problem because not all
    attributes are numeric.
      The first rule might take the form:
      The rules we have seen so far are classification rules: They predict the classification of the
       example in terms of whether to play or not.
      It is equally possible to disregard the classification and just look for any rules that strongly
       associate different attribute values. These are called association rules. Examples from table
        1.2 include:
    1. Web Mining: Search engine companies examine the hyperlinks in web pages to derive a
       measure for ranking web pages. One way to attack the ranking problem is to apply
       machine learning to a training set of example queries, together with documents that
       contain the query terms and human judgments of how relevant each document is to its
       query. A learning algorithm then analyzes this training data and comes up with a way
       to predict the relevance judgment for any document and query. For each document, a
       set of feature values is calculated that depends on the query term: for example,
       whether it occurs in the title tag, whether it occurs in the document’s URL, how often
       it occurs in the document itself, and how often it appears in the anchor text of
       hyperlinks that point to the document. Search engines also mine the web to select
       advertisements that you might be interested in.
    2. Judgment on loans: Loan defaulters are a risky group of customers. Credit industry
       professionals point out that if only their future repayment behavior could be reliably
       determined, it is precisely these customers whose business should be wooed, since they
       will always be in financial need. A machine learning procedure has been used to
       produce a small set of classification rules that made correct predictions on
       two-thirds of the borderline-case customers in an independently chosen test set. Such
       rules improve the success rate of loan decisions on potential defaulters.
    3. Screening images for oil slicks on the ocean.
   4. Marketing and sales: Market basket analysis is the use of association techniques to find
      groups of items that tend to occur together in transactions, typically supermarket checkout
      data.
Classification learning is sometimes called supervised, because, in a sense, the scheme operates
under supervision by being provided with the actual outcome for each of the training examples—the
play or don’t play judgment, the lens recommendation, the type of iris, the acceptability of the labor
contract. This outcome is called the class of the example. The success of classification learning can
be judged by trying out the concept description that is learned on an independent set of test data for
which the true classifications are known but not made available to the machine. The success rate on
test data gives an objective measure of how well the concept has been learned.
The input to a machine learning scheme is a set of instances. These instances are the things that are
to be classified or associated or clustered. Although until now we have called them examples,
henceforth we will generally use the more specific term instances to refer to the input. In the
standard scenario, each instance is an individual, independent example of the concept to be learned.
Instances are characterized by the values of a set of predetermined attributes.
Each instance that provides the input to machine learning is characterized by its values on a fixed,
predefined set of features or attributes. The instances are the rows of the tables that we have shown
for the weather and the iris problems while the attributes are the columns.
Attributes can be either numeric or nominal quantities. Numeric attributes, sometimes called
continuous attributes, measure numbers, either real- or integer-valued (calling integer-valued
attributes “continuous” is a slight abuse of the term). Nominal attributes take on values from a
prespecified, finite set of possibilities and are sometimes called categorical.
1) TABLES
The simplest, most rudimentary way of representing the output from machine learning is to make it
just the same as the input—a table. For example, Table 1.2 is a decision table for the weather data:
You just look up the appropriate conditions to decide whether or not to play.
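As a sketch, a decision table amounts to a lookup structure; the Python dictionary below holds a few illustrative weather rows, not the full Table 1.2:

```python
# A decision table as a plain lookup: keys are (outlook, temperature,
# humidity, windy) tuples, values are the class. Only a few illustrative
# rows are shown, not the full Table 1.2.
decision_table = {
    ("sunny", "hot", "high", False): "no",
    ("overcast", "hot", "high", False): "yes",
    ("rainy", "mild", "high", False): "yes",
}

def classify(outlook, temperature, humidity, windy):
    # Look up the matching row; combinations absent from the table get None.
    return decision_table.get((outlook, temperature, humidity, windy))

print(classify("sunny", "hot", "high", False))  # -> no
```

A decision table can only answer for attribute combinations it has actually stored, which is exactly the weakness of representing the output in the same form as the input.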
2) LINEAR MODELS
Another simple style of representation is a linear model, the output of which is just the sum of the
attribute values, except that weights are applied to each attribute before adding them together. The
trick is to come up with good values for the weights—ones that make the model’s output match the
desired output. For linear models, the output and the inputs—attribute values—are all numeric.
Regression is the process of predicting a numeric quantity, and regression model is another term
for this kind of linear model. Linear models are easiest to visualize in two dimensions, where they
involve drawing a straight line through a set of data points.
Figure 3.1 shows a line fitted to the CPU performance data where only the cache attribute is used as
input. The class attribute performance is shown on the vertical axis, with cache on the horizontal
axis; both are numeric. The straight line represents the “best fit” prediction equation.
Given a test instance, a prediction can be produced by plugging the observed value of cache into
this expression to obtain a value for performance.
Linear models can also be applied to binary classification problems. In this case, the line produced
by the model separates the two classes: It defines where the decision changes from one class value
to the other. Such a line is often referred to as the decision boundary. Figure 3.2 shows a decision
boundary for the iris data that separates the Iris setosas from the Iris versicolors.
In this case, the data is plotted using two of the input attributes—petal length and petal width—and
the straight line defining the decision boundary is a function of these two attributes. As before,
given a test instance, a prediction is produced by plugging the observed values of the attributes in
question into the expression. But here we check the result and predict one class if it is greater than
or equal to 0 (in this case, Iris setosa) and the other if it is less than 0 (Iris versicolor).
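A minimal sketch of such a decision boundary, with made-up weights rather than ones fitted to the real iris data:

```python
# Linear decision boundary over two attributes (petal length, petal width).
# The intercept and weights below are illustrative, not learned from data.
w0, w1, w2 = 2.0, -0.5, -0.8

def predict(petal_length, petal_width):
    score = w0 + w1 * petal_length + w2 * petal_width
    # Predict one class if the expression is >= 0, the other if it is < 0.
    return "Iris setosa" if score >= 0 else "Iris versicolor"

print(predict(1.4, 0.2))  # small petals fall on the setosa side
print(predict(4.7, 1.4))  # larger petals fall on the versicolor side
```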
3) TREES
Consider table 1.1 that shows contact lens data that tells the kind of contact lens to prescribe, given
certain information about a patient.
   The first column of Table 1.1 gives the age of the patient where presbyopia is a form of
    longsightedness that accompanies the onset of middle age.
   The second column gives the spectacle prescription: Myope means shortsighted and
    hypermetrope means longsighted.
   The third column shows whether the patient is astigmatic, while the fourth relates to the rate
    of tear production, which is important in this context because tears lubricate contact lenses.
   The final column shows which kind of lenses to prescribe, whether hard, soft, or none.
   Rules shown in figure 1.1 can be derived from this table.
   However, a smaller set of rules could perform better.
   Machine learning techniques can be used to gain insight into the structure of data rather
    than to make predictions for new cases.
   Figure 1.2 shows a structural description for the contact lens data in the form of a decision
    tree, which is a more concise representation of the rules and has the advantage that it can be
    visualized more easily.
      Nodes in a decision tree involve testing a particular attribute. Usually, the test compares an
       attribute value with a constant. Leaf nodes give a classification that applies to all instances
       that reach the leaf, or a set of classifications, or a probability distribution over all possible
       classifications. To classify an unknown instance, it is routed down the tree according to the
       values of the attributes tested in successive nodes, and when a leaf is reached the instance is
       classified according to the class assigned to the leaf.
      In summary, decision trees divide the data at a node by comparing the value of some
       attribute with a constant.
4) RULES
     Rules are a popular alternative to decision trees.
     The antecedent, or precondition, of a rule is a series of tests just like the tests at nodes in
      decision trees, while the consequent, or conclusion, gives the class or classes that apply to
      instances covered by that rule, or perhaps gives a probability distribution over the classes.
Classification Rules
    It is easy to read a set of classification rules directly off a decision tree since one rule is
       generated for each leaf.
    The antecedent of the rule includes a condition for every node on the path from the root to
       that leaf, and the consequent of the rule is the class assigned by the leaf.
    The left diagram of Figure 3.6 shows an exclusive-or function for which the output is a if
       x=1 or y=1 but not both. To make this into a tree, you have to split on one attribute first,
       leading to a structure like the one shown in the center. In contrast, rules can faithfully reflect
       the true symmetry of the problem with respect to the attributes, as shown on the right.
      In this example the rules are not notably more compact than the tree.
      Reasons why rules are popular:
           1. Each rule seems to represent an independent “nugget” of knowledge.
           2. New rules can be added to an existing rule set without disturbing ones already there,
               whereas to add to a tree structure may require reshaping the whole tree. However,
               this independence is something of an illusion because it ignores the question of how
               the rule set is executed.
                    if rules are meant to be interpreted in order as a “decision list,” some of them,
                       taken individually and out of context, may be incorrect.
                    if the order of interpretation is supposed to be immaterial, then it is not clear
                       what to do when different rules lead to different conclusions for the same
                       instance.
                    This situation cannot arise for rules that are read directly off a decision tree
                       because the redundancy included in the structure of the rules prevents any
                       ambiguity in interpretation.
    What if a rule set gives multiple classifications for a particular example?
           1. One solution is to give no conclusion at all.
           2. Another is to count how often each rule fires on the training data and go with the
               most popular one.
    Note: Individual rules are simple, and sets of rules seem deceptively simple, but given
    just a set of rules with no additional information, it is not clear how they should be
    interpreted.
Association Rules
    Association rules can predict any attribute, not just the class – this gives them the freedom to
       predict combinations of attributes too.
    Also, association rules are not intended to be used together as a set, as classification rules
       are.
    Different association rules express different regularities that underlie the dataset, and they
       generally predict different things.
    Since so many different association rules can be derived from even a very small dataset,
       interest is restricted to those that apply to a reasonably large number of instances and have a
       reasonably high accuracy on the instances to which they apply.
    The coverage of an association rule is the number of instances for which it predicts
       correctly—this is often called its support. Its accuracy – often called confidence – is the
       number of instances that it predicts correctly, expressed as a proportion of all instances to
       which it applies.
5) INSTANCE-BASED REPRESENTATION
     The simplest form of learning is plain memorization, or rote learning.
     Once a set of training instances has been memorized, on encountering a new instance the
      memory is searched for the training instance that most strongly resembles the new one.
     The only problem is how to interpret “resembles”.
      In instance-based learning, instances are stored, and new instances whose class is
       unknown are related to existing ones whose class is known.
      In this case, instead of trying to create rules, learning happens directly from the
       examples.
     Note: In a sense, all the other learning methods are instance-based too, because we always
      start with a set of instances as the initial training information.
     But the instance-based knowledge representation uses the instances themselves to represent
      what is learned, rather than inferring a rule set or decision tree and storing it instead.
     The difference between this method and the others that we have seen is the time at which the
      “learning” takes place.
          1. Instance-based learning is lazy, deferring the real work as long as possible, whereas
              other methods are eager, producing a generalization as soon as the data has been
                seen.
           2.   In instance-based classification, each new instance is compared with existing ones
                using a distance metric, and the closest existing instance is used to assign the class to
                the new one. This is called the nearest-neighbor classification method. Sometimes
                more than one nearest neighbor is used, and the majority class of the closest k
                neighbors (or the distance-weighted average if the class is numeric) is assigned to the
                new instance. This is termed the k-nearest-neighbor method.
           3.   Computing the distance between two examples is trivial when examples have just
                one numeric attribute: It is just the difference between the two attribute values.
            4.   For examples with several numeric attributes, the standard Euclidean
                 distance (the distance between two points) is used, with the assumption that
                 the attributes are normalized and of equal importance. One of the main
                 problems in learning is to determine which features are “important”. Some
                 attributes will be more important than others, and this is usually reflected
                 in the distance metric by some kind of attribute weighting. Deriving
                 suitable attribute weights from the training set is a key problem in
                 instance-based learning.
           5.   When nominal attributes are present, it is necessary to come up with a “distance”
                between different values of that attribute. What are the distances between, say, the
                values red, green, and blue? Usually, a distance of zero is assigned if the values are
                identical; otherwise, the distance is one. Thus, the distance between red and red is
                zero but the distance between red and green is one.
           6.   However, it may be desirable to use a more sophisticated representation of the
                attributes. For example, with more colors one could use a numeric measure of hue in
                color space, making yellow closer to orange than it is to green and ocher closer still.
           7.   An apparent drawback to instance-based representations is that they do not make
                explicit the structures that are learned i.e., instances do not really “describe” the
                patterns in data.
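The pieces above (Euclidean distance for numeric attributes, the zero/one distance for nominal ones, and majority voting among the k closest instances) can be sketched as follows; the training instances are invented for illustration:

```python
import math
from collections import Counter

def distance(a, b):
    # Euclidean over numeric attributes; nominal attributes contribute
    # 0 if the values are identical and 1 otherwise.
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)

def knn_classify(train, query, k=3):
    # train: list of (attribute_tuple, class_label) pairs; numeric
    # attributes are assumed to be normalized and equally important.
    neighbors = sorted(train, key=lambda inst: distance(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, "red"), "a"), ((1.1, "red"), "a"),
         ((5.0, "blue"), "b"), ((5.2, "blue"), "b"), ((4.9, "red"), "b")]
print(knn_classify(train, (1.05, "red"), k=3))  # -> a
```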
Figure 3.10 shows different ways of partitioning the instance space. Given a single instance of each
of two classes, the nearest-neighbor rule effectively splits the instance space along the perpendicular
bisector of the line joining the instances. Given several instances of each class, the space is divided
by a set of lines that represent the perpendicular bisectors of selected lines joining an instance of
one class to one of another class.
    1. Figure 3.10(a) illustrates a nine-sided polygon that separates the filled-circle class from the
        open-circle class. This polygon is implicit in the operation of the nearest-neighbor rule.
    2. When training instances are discarded (to save space and increase execution speed), the
        result is to save just a few critical examples of each class. Figure 3.10(b) shows only the
        examples that actually get used in nearest-neighbor decisions: The others (the light-gray
        ones) can be discarded without affecting the result. These examples serve as a kind of
        explicit knowledge representation.
    3. Some instance-based representations go further and explicitly generalize the instances.
        Typically, this is accomplished by creating rectangular regions that enclose examples of the
        same class. Figure 3.10(c) shows the rectangular regions that might be produced. Unknown
        examples that fall within one of the rectangles will be assigned the corresponding class; ones
        that fall outside all rectangles will be subject to the usual nearest-neighbor rule. Of course,
        this produces different decision boundaries from the straightforward nearest-neighbor rule,
        as can be seen by superimposing the polygon in Figure 3.10(a) onto the rectangles. Any part
        of the polygon that lies within a rectangle will be chopped off and replaced by the
        rectangle’s boundary.
6) CLUSTERS
When clusters rather than a classifier are learned, the output takes the form of a diagram that
shows how the instances fall into clusters.
    1. In the simplest case this involves associating a cluster number with each instance, which
       might be depicted by laying the instances out in two dimensions and partitioning the space
       to show each cluster, as illustrated in Figure 3.11(a).
    2. Some clustering algorithms allow one instance to belong to more than one cluster, so the
       diagram might lay the instances out in two dimensions and draw overlapping subsets
       representing each cluster—a Venn diagram, as in Figure 3.11(b).
    3. Some algorithms associate instances with clusters probabilistically rather than categorically.
       In this case, for every instance there is a probability or degree of membership with which it
       belongs to each of the clusters. This is shown in Figure 3.11(c). The numbers for each
       example sum to 1.
    4. Other algorithms produce a hierarchical structure of clusters so that at the top level the
       instance space divides into just a few clusters, each of which divides into its own subcluster
       at the next level down, and so on. In this case a diagram such as the one in Figure 3.11(d) is
       used, in which elements joined together at lower levels are more tightly clustered than ones
        joined together at higher levels. Such diagrams are called dendrograms, from the
        Greek word for tree.
                                DATA MINING ALGORITHMS
This section explains the basic ideas behind the algorithms that are used in practical data
mining.
Concept of the 1R algorithm
     Rules are made that test a single attribute and branch accordingly. Each branch
       corresponds to a different value of the attribute.
     Each branch is assigned the class that occurs most often in the training data for
       instances with that attribute value.
     The error rate of the rules is then determined by counting the errors that occur on
       the training data, that is, the number of instances that do not have the majority
       class for their attribute value.
     Each attribute generates a different set of rules, one rule for every value of the
       attribute. The error rate of each attribute’s rule set is evaluated and the best one
       is chosen.
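A sketch of 1R on a handful of invented rows in the style of the weather data (not the full table): for each attribute it builds one rule per value predicting the majority class, counts training errors, and keeps the attribute with the fewest.

```python
from collections import Counter, defaultdict

def one_r(rows, class_attr):
    # For each attribute, build one rule per value (predict the majority
    # class for that value), count the training errors, and keep the
    # attribute whose rule set makes the fewest errors.
    best = None
    for attr in rows[0]:
        if attr == class_attr:
            continue
        counts = defaultdict(Counter)
        for r in rows:
            counts[r[attr]][r[class_attr]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(r[class_attr] != rules[r[attr]] for r in rows)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

rows = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": True,  "play": "no"},
]
attr, rules, errors = one_r(rows, "play")
print(attr, rules, errors)
```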
STATISTICAL MODELING
The 1R method uses a single attribute as the basis for its decisions and chooses the one that works
best. Another simple technique is to use all attributes and allow them to make contributions to the
decision that are equally important and independent of one another, given the class. This is
unrealistic, of course: What makes real-life datasets interesting is that the attributes are certainly not
equally important or independent. But it leads to a simple scheme that, again, works surprisingly
well in practice.
      Table 4.2 (generate the table in class) shows a summary of the weather data obtained by
       counting how many times each attribute–value pair occurs with each value (yes and no) for
       play.
      For example, from the weather data outlook is sunny for five examples, two of which have
       play = yes and three of which have play = no.
      The cells in the first row of the new table simply count these occurrences for all possible
       values of each attribute, and the play figure in the final column counts the total number of
       occurrences of yes and no.
      The lower part of the table contains the same information expressed as fractions, or
       observed probabilities. For example, of the nine days that play is yes, outlook is sunny for
       two, yielding a fraction of 2/9.
      For play the fractions are different: They are the proportion of days that play is yes and no,
       respectively.
Now suppose we encounter a new example with the values that are shown in Table 4.3.
      We treat the five features in Table 4.2—outlook, temperature, humidity, windy, and the
       overall likelihood that play is yes or no—as equally important, independent pieces of
       evidence and multiply the corresponding fractions.
      Looking at the outcome yes gives;
      The fractions are taken from the yes entries in the table according to the values of the
       attributes for the new day, and the final 9/14 is the overall fraction representing the
       proportion of days on which play is yes. A similar calculation for the outcome no leads to
      The likelihood numbers can be turned into probabilities by normalizing them so that they
       sum to 1:
      This indicates that for the new day, no is four times more likely than yes.
       This simple and intuitive method is based on Bayes’ rule of conditional probability.
        Bayes’ rule says that if you have a hypothesis H and evidence E that bears on that
        hypothesis, then

            Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

        With four independent pieces of evidence E1, E2, E3, E4 and the hypothesis that play
        is yes, this becomes

            Pr[yes | E] = Pr[E1 | yes] × Pr[E2 | yes] × Pr[E3 | yes] × Pr[E4 | yes] × Pr[yes] / Pr[E]

       The Pr[yes] at the end is the probability of a yes outcome without knowing any of the
        evidence E, that is, without knowing anything about the particular day in question;
        it is called the prior probability of the hypothesis H. In this case, it is just
        9/14, because 9 of the 14 training examples had a yes value for play.
      Substituting the fractions in Table 4.2 for the appropriate evidence probabilities leads to;
       This method goes by the name of Naïve Bayes because it is based on Bayes’ rule and
        “naïvely” assumes independence: it is only valid to multiply probabilities when the
        events are independent. The assumption that attributes are independent (given the
        class) is certainly a simplistic one for real-life data.
       Despite the name, Naïve Bayes works very effectively when tested on actual datasets.
        However, things go wrong when one of the attribute probabilities is zero, because
        the overall product then becomes zero: a zero probability holds a veto over all the
        others. (A standard remedy is the Laplace estimator, which adds one to every count
        so that no estimated probability is ever exactly zero.)
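The whole calculation can be sketched in a few lines. The rows below reproduce the standard 14-example weather dataset, and the new day is assumed to be the usual one (outlook = sunny, temperature = cool, humidity = high, windy = true), since Table 4.3 is not reproduced above; with these values, no comes out roughly four times more likely than yes.

```python
from collections import Counter

# The standard 14-example weather dataset:
# (outlook, temperature, humidity, windy, play).
data = [
    ("sunny", "hot", "high", False, "no"),       ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),   ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),   ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),   ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]

def naive_bayes(data, new_day):
    class_counts = Counter(row[-1] for row in data)
    likelihoods = {}
    for cls, n_cls in class_counts.items():
        # Prior Pr[class] times the product of Pr[attribute value | class].
        lik = n_cls / len(data)
        for i, value in enumerate(new_day):
            n_match = sum(1 for row in data if row[-1] == cls and row[i] == value)
            lik *= n_match / n_cls  # a zero count zeroes the whole product!
        likelihoods[cls] = lik
    total = sum(likelihoods.values())
    # Normalize the likelihoods so they sum to 1.
    return {cls: lik / total for cls, lik in likelihoods.items()}

probs = naive_bayes(data, ("sunny", "cool", "high", True))
print(probs)  # roughly {'no': 0.795, 'yes': 0.205}
```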
CONSTRUCTING DECISION TREES
The problem is how to determine which attribute to split on, given a set of examples with different
classes. Consider the weather data. There are four possibilities for each split, and at the top level
they produce the trees in Figure 4.2.
Any leaf with only one class—yes or no—will not have to be split further, and the recursive process
down that branch will terminate. Because we seek small trees, we would like this to happen as soon
as possible. If we had a measure of the purity of each node, we could choose the attribute that
produces the purest daughter nodes.
The measure of purity that we will use is called the information and is measured in units called bits.
Associated with each node of the tree, it represents the expected amount of information that would
be needed to specify whether a new instance should be classified yes or no, given that the example
reached that node. Unlike the bits in computer memory, the expected amount of information usually
involves fractions of a bit—and is often less than 1! It is calculated based on the number of yes and
no classes at the node.
When evaluating the tree in Figure 4.2(a), the numbers of yes and no classes at the leaf nodes
are [2, 3], [4, 0], and [3, 2], respectively, and the information values of these nodes are

    info([2, 3]) = 0.971 bits
    info([4, 0]) = 0.0 bits
    info([3, 2]) = 0.971 bits

We calculate the average information value of these, taking into account the number of instances
that go down each branch, i.e., five down the first and third and four down the second:

    info([2, 3], [4, 0], [3, 2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

The training examples at the root comprised nine yes and five no nodes, corresponding to an
information value of

    info([9, 5]) = 0.940 bits

Thus, the tree in Figure 4.2(a) is responsible for an information gain of

    gain(outlook) = info([9, 5]) − info([2, 3], [4, 0], [3, 2]) = 0.940 − 0.693 = 0.247 bits

Thus we need to calculate the information gain for each attribute and split on the one that gains
the most information. In the situation that is shown in Figure 4.2:

    gain(outlook) = 0.247 bits
    gain(temperature) = 0.029 bits
    gain(humidity) = 0.152 bits
    gain(windy) = 0.048 bits
Therefore, we select outlook as the splitting attribute at the root of the tree. It is the only choice for
which one daughter node is completely pure, which gives it a considerable advantage over the
other attributes. Humidity is the next best choice because it produces a larger daughter node that is
almost completely pure. Then we continue, recursively.
Figure 4.3 shows the possibilities for a further branch at the node reached when outlook is sunny.
Continued application of the same idea leads to the decision tree of Figure 4.4 for the weather data.
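The information and gain calculations described above can be checked with a short sketch:

```python
import math

def info(counts):
    # Information (entropy) of a class distribution, in bits.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, branch_counts):
    # Information gain = info at the node minus the weighted average
    # info of the daughter nodes.
    n = sum(parent_counts)
    remainder = sum(sum(b) / n * info(b) for b in branch_counts)
    return info(parent_counts) - remainder

# Outlook split of the weather data: daughters [2 yes, 3 no], [4, 0], [3, 2].
print(round(info([9, 5]), 3))                            # -> 0.94
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # -> 0.247
```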
LINEAR REGRESSION
    When the outcome, or class, is numeric, and all the attributes are numeric, linear
     regression is a natural technique to consider. This is a staple method in statistics.
     The idea is to express the class as a linear combination of the attributes, with
     predetermined weights:

         x = w0 + w1a1 + w2a2 + … + wkak

     where x is the class, a1, a2, …, ak are the attribute values, and w0, w1, …, wk are
     the weights.
    Linear regression is a process that allows one to make predictions about a variable Y
     based on knowledge of a variable X. It summarizes how average values of a numeric
     outcome variable vary over subpopulations defined by linear functions of the
     predictors.
      Linear regression is an excellent, simple method for numeric prediction.
      However, linear models suffer from the disadvantage of linearity – if the data exhibits a
       nonlinear dependency, the best-fitting straight line would be found. This line may not fit
       very well.
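A least-squares fit for a single attribute can be sketched as follows; the (cache, performance) points are made up, since the real CPU performance data is not reproduced here:

```python
# Least-squares fit of performance = w0 + w1 * cache on made-up points.
points = [(0, 20), (8, 35), (16, 50), (32, 80), (64, 140)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
w1 = (sum((x - mean_x) * (y - mean_y) for x, y in points)
      / sum((x - mean_x) ** 2 for x, _ in points))
w0 = mean_y - w1 * mean_x

def predict(cache):
    # Plug the observed cache value into the fitted expression.
    return w0 + w1 * cache

print(w0, w1)       # -> 20.0 1.875
print(predict(24))  # -> 65.0
```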
LOGISTIC REGRESSION
   Logistic regression is a discriminative classifier that employs a probabilistic approach to
     directly model posterior probability by learning a discriminative function that maps an input
     feature vector directly onto a class label. The goal of Logistic regression is to directly
     estimate the posterior probability P(Y (T) = i|X ) from the training data. The Logistic
     regression model is defined as;
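A minimal sketch of the model itself; the weights below are illustrative, not learned (training them, e.g., by gradient descent, is the part the learning algorithm performs):

```python
import math

def sigmoid(z):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def posterior(x, weights, bias):
    # Models P(Y = 1 | X = x) as sigmoid(w . x + b).
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Illustrative weights and bias, not fitted to any real dataset.
p = posterior([2.0, 1.0], weights=[0.8, -0.4], bias=-0.5)
print(round(p, 3))  # sigmoid(0.7) -> 0.668
```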
EVALUATING CLASSIFIER PERFORMANCE
Precision = TP/(TP+FP)          Recall = TP/(TP+FN)
where TP, FP, and FN are the numbers of true positives, false positives, and false negatives,
respectively.
        In terms of defective software modules, precision is the ratio of actually defective
         modules among the modules predicted as defective, while recall is the ratio of
         detected defective modules among all defective modules. Recall (sensitivity)
         therefore measures how often we find what we are looking for, and evaluates to “1”
         if all instances of the True class are classified to the True class.
        The main goal for learning from imbalanced datasets is to improve recall without
         hurting precision. However, the goals of recall and precision often conflict: when
         the number of true positives for the minority class is increased, the number of
         false positives can increase as well, reducing precision. To resolve this, an
         F-value metric that combines the trade-off between precision and recall is used.
         The relationship of recall and precision is given by the following F-value
         equation:

             F = ((1 + β²) × Recall × Precision) / (β² × Recall + Precision)

      where β is usually set to 1 and corresponds to the relative importance of precision
      vs recall.
     The single number that the F-value outputs reflects the “goodness” of a classifier in
      the presence of rare classes. While recall (sensitivity) measures how often we find
      what we are looking for, specificity measures the proportion of negative instances
      that are correctly classified as negative, that is, how well the classifier avoids
      false alarms.
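These measures follow directly from confusion-matrix counts; the counts below are hypothetical, and the F-value uses β = 1 (the usual F1):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_value(p, r, beta=1.0):
    # F-value combining precision p and recall r; beta = 1 weights them
    # equally (the usual F1, the harmonic mean of p and r).
    return (1 + beta**2) * r * p / (beta**2 * r + p)

# Hypothetical defect-prediction counts: 40 true positives,
# 10 false positives, 20 false negatives.
p, r = precision(40, 10), recall(40, 20)
print(p, round(r, 3), round(f_value(p, r), 3))  # -> 0.8 0.667 0.727
```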
       To plot a model on an ROC graph, its sensitivity (true positive rate) is plotted on
        the vertical axis against 1 − specificity (the false positive rate) on the
        horizontal axis.
       An ideal model would be represented by the location (0, 1) on the graph in Fig. 2,
        corresponding to 100% specificity and 100% sensitivity.
       To obtain a curve on the ROC plot that corresponds to a single classifier, a
        threshold on the quantity under study is chosen and used to produce a single point
        on the graph. If the value of the threshold is varied, several points are obtained,
        which when plotted generate an ROC curve, as shown by the curves for classifiers A
        and B in Fig. 2.
       For models with curves that do not overlap in the graph space, the curve that lies
        more to the upper left indicates the better classifier.
       However, the best way of selecting an optimal model is to determine the Area Under
        the Curve (AUC) of the ROC: the better-performing model will have the larger area
        under its curve.
       A perfect classifier would have an AUC of 1, while a random classifier is expected
        to achieve an AUC of 0.5.
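A sketch of how an ROC curve and its AUC are computed from classifier scores; the scores and labels below are invented:

```python
# ROC points: sweep a threshold down through the sorted scores, recording
# (false positive rate, true positive rate) at each step, then estimate
# AUC with the trapezoidal rule. Scores and labels are invented.
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,    1,   0,   0,   0]   # 1 = positive class

def roc_points(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Trapezoidal area under the piecewise-linear ROC curve.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points(scores, labels)
print(round(auc(pts), 3))  # -> 0.875
```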