DWDM Unit-3 (3-2)

Data classification is a key data analysis technique that predicts categorical class labels, with applications in various fields such as banking and medicine. The classification process involves a learning step to build a model and a classification step to predict labels, often utilizing decision trees for their intuitive structure and ease of understanding. Decision trees are constructed using attribute selection measures like information gain, gain ratio, and Gini index, and can be pruned to improve accuracy and reduce complexity.
Data Warehousing and Data Mining
UNIT-3: DATA CLASSIFICATION

Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can build a classification model to categorize bank loan applications as either safe or risky. Such analysis can help provide us with a better understanding of the data at large. Many classification methods have been proposed by researchers in machine learning, pattern recognition, and statistics.

3.1 Why Classification?
A bank loans officer needs analysis of her data to learn which loan applicants are "safe" and which are "risky" for the bank. A marketing manager at AllElectronics needs data analysis to help guess whether a customer with a given profile will buy a new computer. A medical researcher wants to analyze breast cancer data to predict which one of three specific treatments a patient should receive. In each of these examples, the data analysis task is classification, where a model or classifier is constructed to predict class (categorical) labels, such as "safe" or "risky" for the loan application data; "yes" or "no" for the marketing data; or "treatment A," "treatment B," or "treatment C" for the medical data.
Suppose that the marketing manager wants to predict how much a given customer will spend during a sale at AllElectronics. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a class label. This model is a predictor. Regression analysis is a statistical methodology that is most often used for numeric prediction; hence the two terms tend to be used synonymously, although other methods for numeric prediction exist. Classification and numeric prediction are the two major types of prediction problems.

3.2 General Approach for Classification:
Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data).
* In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or "learning from" a training set made up of database tuples and their associated class labels.
* Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
* In the second step, the model is used for classification. First, the predictive accuracy of the classifier is estimated. If we were to use the training set to measure the classifier's accuracy, this estimate would likely be optimistic, because the classifier tends to overfit the data.
* Accuracy rate is the percentage of test set samples that are correctly classified by the model.

Fig: Learning Step
Fig: Classification Step
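To make the two-step process concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the tiny loan data set and its attribute values are invented purely for illustration and are not part of the notes.

```python
# A minimal sketch of the two-step process (learning, then classification),
# assuming scikit-learn is installed. The loan data below is invented purely
# for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Training tuples: [income (thousands), years_employed]; class labels: "safe"/"risky".
X_train = [[25, 1], [80, 10], [40, 3], [95, 12], [30, 0], [60, 7]]
y_train = ["risky", "safe", "risky", "safe", "risky", "safe"]

# Step 1 (learning step): build the classifier from the training set.
clf = DecisionTreeClassifier(criterion="entropy")  # entropy corresponds to information gain
clf.fit(X_train, y_train)

# Step 2 (classification step): estimate accuracy on an independent test set,
# then predict class labels for new, unlabeled tuples.
X_test = [[35, 2], [70, 8]]
y_test = ["risky", "safe"]
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("new applicant:", clf.predict([[50, 5]])[0])
```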
3.3 Decision Tree Induction:
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.
"How are decision trees used for classification?" Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can easily be converted to classification rules.
"Why are decision tree classifiers so popular?" The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision trees can handle multidimensional data. Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans. The learning and classification steps of decision tree induction are simple and fast. Decision tree induction algorithms have been used for classification in many application areas such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology. Decision trees are the basis of several commercial rule induction systems.
During tree construction, attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes. When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J. Marin, and P. T. Stone. Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to which newer supervised learning algorithms are often compared. In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the book Classification and Regression Trees (CART), which described the generation of binary decision trees.
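As a small illustration of how a tuple is classified by tracing a path from the root to a leaf, the sketch below encodes a tree as nested Python tuples. The tree shown mirrors the buys_computer example worked out at the end of this unit; the encoding itself is just one possible choice, not a prescribed one.

```python
# Each internal node is ("attribute", {outcome: subtree, ...}); each leaf is a class label.
tree = ("age", {
    "youth":       ("student", {"no": "no", "yes": "yes"}),
    "middle_aged": "yes",
    "senior":      ("credit_rating", {"fair": "yes", "excellent": "no"}),
})

def classify(node, tuple_x):
    """Trace a root-to-leaf path for tuple_x (a dict of attribute -> value)."""
    while isinstance(node, tuple):           # internal node: test an attribute
        attribute, branches = node
        node = branches[tuple_x[attribute]]  # follow the branch for this outcome
    return node                              # leaf node: holds the class prediction

X = {"age": "senior", "student": "no", "credit_rating": "fair"}
print(classify(tree, X))  # -> "yes"
```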
3.4 Decision Tree Algorithm:
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition, D.
Input:
* Data partition, D, which is a set of training tuples and their associated class labels;
* attribute_list, the set of candidate attributes;
* Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split-point or splitting subset.
Method:
1) create a node N;
2) if the tuples in D are all of the same class, C, then
3)     return N as a leaf node labeled with the class C;
4) if attribute_list is empty then
5)     return N as a leaf node labeled with the majority class in D; // majority voting
6) apply Attribute_selection_method(D, attribute_list) to find the "best" splitting criterion;
7) label node N with the splitting criterion;
8) if the splitting attribute is discrete-valued and multiway splits are allowed then // not restricted to binary trees
9)     attribute_list ← attribute_list − splitting attribute; // remove the splitting attribute
10) for each outcome j of the splitting criterion // partition the tuples and grow subtrees for each partition
11)     let Dj be the set of data tuples in D satisfying outcome j; // a partition
12)     if Dj is empty then
13)         attach a leaf labeled with the majority class in D to node N;
14)     else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
    endfor
15) return N;
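The following Python skeleton is a rough, illustrative translation of the pseudocode above, not a complete implementation; the representation of D as (tuple, label) pairs and the attribute_selection_method callback are assumptions made for the sketch.

```python
from collections import Counter

def majority_class(D):
    """Most frequent class label in the partition D (majority voting)."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """D is a list of (tuple_dict, class_label) pairs."""
    labels = {label for _, label in D}
    if len(labels) == 1:                       # steps 2-3: all tuples in one class
        return labels.pop()
    if not attribute_list:                     # steps 4-5: majority voting
        return majority_class(D)
    best = attribute_selection_method(D, attribute_list)   # step 6
    remaining = [a for a in attribute_list if a != best]   # step 9 (multiway split)
    branches = {}
    for value in {x[best] for x, _ in D}:      # step 10: one branch per outcome
        Dj = [(x, y) for x, y in D if x[best] == value]    # step 11
        branches[value] = (majority_class(D) if not Dj     # steps 12-13
                           else generate_decision_tree(Dj, remaining,
                                                       attribute_selection_method))
    return (best, branches)                    # steps 7, 14-15: node labeled with 'best'
```

The returned (attribute, branches) structure is the same nested-tuple representation used by the classify() sketch in Section 3.3.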
3.5 Methods for Selecting Best Test Conditions
Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.
Binary Attributes: The test condition for a binary attribute generates two potential outcomes (e.g., body temperature: warm-blooded or cold-blooded).
Fig: Binary split on a binary attribute (body temperature: warm-blooded / cold-blooded)
Nominal Attributes: These can have many values and can be represented in two ways: a multiway split (one branch per value, e.g., marital status: single, divorced, married) or a binary split obtained by grouping attribute values.
Fig: Multiway and binary (grouped) splits on a nominal attribute (marital status)
Ordinal Attributes: These can produce binary or multiway splits. The values can be grouped as long as the grouping does not violate the order property of the attribute values (e.g., shirt size {small, medium} vs. {large, extra large} preserves the order, while {small, large} vs. {medium, extra large} does not).
Fig: Grouping of ordinal attribute values (shirt size)

3.6 Attribute Selection Measures
> An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes.
> If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a given partition would belong to the same class).
> Conceptually, the "best" splitting criterion is the one that most closely results in such a scenario. Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.
> The attribute selection measure provides a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
> If the splitting attribute is continuous-valued or if we are restricted to binary trees, then, respectively, either a split-point or a splitting subset must also be determined as part of the splitting criterion.
> The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly.
> There are three popular attribute selection measures: information gain, gain ratio, and Gini index.

3.6.1 Information Gain
ID3 uses information gain as its attribute selection measure. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
The expected information needed to classify a tuple in D is given by

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|. A log function to the base 2 is used, because the information is encoded in bits. Info(D) is also known as the entropy of D.
The information still needed after using attribute A to split D into v partitions is

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) - Info_A(D)

The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to saying that we want to partition on the attribute A that would do the "best classification," so that the amount of information still required to finish classifying the tuples is minimal.

3.6.2 Gain Ratio
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome the bias of information gain toward attributes with many values. It applies a kind of normalization to information gain using a "split information" value defined analogously with Info(D) as

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)

This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. It differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}

3.6.3 Gini Index
The Gini index is used in CART. Using the notation previously described, the Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D| over m classes.
Note: The Gini index considers a binary split for each attribute. When considering a binary split, we compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D_1 and D_2, the Gini index of D given that partitioning is

Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)

> For each attribute, each of the possible binary splits is considered. For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as its splitting subset.
> For continuous-valued attributes, each possible split-point must be considered. The strategy is similar to that described earlier for information gain, where the midpoint between each pair of (sorted) adjacent values is taken as a possible split-point.
> The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is

\Delta Gini(A) = Gini(D) - Gini_A(D)
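The three measures can be computed directly from class-label counts. Below is a self-contained Python sketch of Info(D), Info_A(D), Gain, SplitInfo, GainRatio, and Gini(D); the small 9-yes/5-no example and its split are chosen only to illustrate the functions and are not part of the notes.

```python
from math import log2

def info(labels):
    """Info(D): entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def info_a(partitions):
    """Info_A(D): weighted entropy after splitting D into the given partitions."""
    n = sum(len(d) for d in partitions)
    return sum(len(d) / n * info(d) for d in partitions)

def gain(labels, partitions):
    return info(labels) - info_a(partitions)

def split_info(partitions):
    n = sum(len(d) for d in partitions)
    return -sum(len(d) / n * log2(len(d) / n) for d in partitions if d)

def gain_ratio(labels, partitions):
    return gain(labels, partitions) / split_info(partitions)

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Example: 9 "yes" and 5 "no" tuples, split by some attribute into two partitions.
D = ["yes"] * 9 + ["no"] * 5
parts = [["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4]
print(round(info(D), 3))          # 0.940
print(round(gain(D, parts), 3))   # information gain of this particular split
print(round(gini(D), 3))          # 0.459
```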
3.7 Tree Pruning:
> When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
> Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least-reliable branches.
> Pruned trees tend to be smaller and less complex and, thus, easier to comprehend.
> They are usually faster and better at correctly classifying independent test data (i.e., previously unseen tuples) than unpruned trees.
"How does tree pruning work?" There are two common approaches to tree pruning: prepruning and postpruning.
> In the prepruning approach, a tree is "pruned" by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset tuples or the probability distribution of those tuples.
> If partitioning the tuples at a node would result in a split that falls below a prespecified threshold, then further partitioning of the given subset is halted. There are difficulties, however, in choosing an appropriate threshold.
> In postpruning, subtrees are removed from a "fully grown" tree. A subtree at a given node is pruned by removing its branches and replacing it with a leaf. The leaf is labeled with the most frequent class among the subtree being replaced.
Fig: Unpruned and Pruned Trees
> The cost complexity pruning algorithm used in CART is an example of the postpruning approach. This approach considers the cost complexity of a tree to be a function of the number of leaves in the tree and the error rate of the tree (where the error rate is the percentage of tuples misclassified by the tree).
> It starts from the bottom of the tree. For each internal node, N, it computes the cost complexity of the subtree at N, and the cost complexity of the subtree at N if it were to be pruned (i.e., replaced by a leaf node). The two values are compared. If pruning the subtree at node N would result in a smaller cost complexity, then the subtree is pruned. Otherwise, it is kept.
> A pruning set of class-labeled tuples is used to estimate cost complexity. This set is independent of the training set used to build the unpruned tree and of any test set used for accuracy estimation.
> The algorithm generates a set of progressively pruned trees. In general, the smallest decision tree that minimizes the cost complexity is preferred.
> C4.5 uses a method called pessimistic pruning, which is similar to the cost complexity method in that it also uses error rate estimates to make decisions regarding subtree pruning.
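As an illustration of the cost complexity idea (a hedged sketch, not the CART algorithm itself), scikit-learn's DecisionTreeClassifier exposes cost complexity pruning through cost_complexity_pruning_path and the ccp_alpha parameter; the names X_train, y_train, X_prune, y_prune below are placeholders for an already-prepared training set and an independent pruning set.

```python
from sklearn.tree import DecisionTreeClassifier

def prune_by_cost_complexity(X_train, y_train, X_prune, y_prune):
    # Grow the full tree first, then obtain the sequence of effective alphas
    # that produce progressively pruned trees.
    full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    path = full.cost_complexity_pruning_path(X_train, y_train)

    best_tree, best_score = full, full.score(X_prune, y_prune)
    for alpha in path.ccp_alphas:
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        pruned.fit(X_train, y_train)
        score = pruned.score(X_prune, y_prune)
        # Prefer the more heavily pruned (smaller) tree whenever it scores
        # at least as well on the independent pruning set.
        if score >= best_score:
            best_tree, best_score = pruned, score
    return best_tree
```

Larger ccp_alpha values prune more aggressively, so iterating over the returned alphas yields the set of progressively pruned trees described above.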
3.8 Scalability of Decision Tree Induction:
"What if D, the disk-resident training set of class-labeled tuples, does not fit in memory? In other words, how scalable is decision tree induction?" The efficiency of existing decision tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively small data sets. Efficiency becomes an issue of concern when these algorithms are applied to the mining of very large real-world databases. The pioneering decision tree algorithms that we have discussed so far have the restriction that the training tuples should reside in memory.
In data mining applications, very large training sets of millions of tuples are common. Most often, the training data will not fit in memory. Therefore, decision tree construction becomes inefficient due to swapping of the training tuples in and out of main and cache memories. More scalable approaches, capable of handling training data that are too large to fit in memory, are required. Earlier strategies to "save space" included discretizing continuous-valued attributes and sampling data at each node. These techniques, however, still assume that the training set can fit in memory.
Several scalable decision tree induction methods have been introduced in recent studies. RainForest, for example, adapts to the amount of main memory available and applies to any decision tree induction algorithm. The method maintains an AVC-set (where "AVC" stands for "Attribute-Value, Classlabel") for each attribute, at each tree node, describing the training tuples at the node. The AVC-set of an attribute A at node N gives the class label counts for each value of A for the tuples at N. The set of all AVC-sets at a node N is the AVC-group of N.
The size of an AVC-set for attribute A at node N depends only on the number of distinct values of A and the number of classes in the set of tuples at N. Typically, this size should fit in memory, even for real-world data. RainForest also has techniques, however, for handling the case where the AVC-group does not fit in memory. Therefore, the method has high scalability for decision tree induction in very large data sets.
Fig: AVC-sets for the dataset
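A minimal sketch of how an AVC-set could be tabulated for one attribute at a node is shown below (illustrative only; RainForest's actual data structures and disk-handling techniques are not reproduced here). The class attribute name buys_computer follows the example dataset worked out next.

```python
from collections import defaultdict

def avc_set(tuples, attribute, class_attr="buys_computer"):
    """AVC-set of one attribute: class label counts for each attribute value."""
    counts = defaultdict(lambda: defaultdict(int))   # value -> {class label: count}
    for t in tuples:
        counts[t[attribute]][t[class_attr]] += 1
    return {value: dict(labels) for value, labels in counts.items()}

node_tuples = [
    {"age": "youth", "student": "no", "buys_computer": "no"},
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "senior", "student": "no", "buys_computer": "yes"},
]
print(avc_set(node_tuples, "age"))
# {'youth': {'no': 1, 'yes': 1}, 'senior': {'yes': 1}}
```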
Example of Decision Tree Construction and Classification Rules:
Construct a decision tree for the following dataset.

age         | income | student | credit_rating | buys_computer
------------|--------|---------|---------------|--------------
youth       | high   | no      | fair          | no
youth       | high   | no      | excellent     | no
middle_aged | high   | no      | fair          | yes
senior      | medium | no      | fair          | yes
senior      | low    | yes     | fair          | yes
senior      | low    | yes     | excellent     | no
middle_aged | low    | yes     | excellent     | yes
youth       | medium | no      | fair          | no
youth       | low    | yes     | fair          | yes
senior      | medium | yes     | fair          | yes
youth       | medium | yes     | excellent     | yes
middle_aged | medium | no      | excellent     | yes
middle_aged | high   | yes     | fair          | yes
senior      | medium | no      | excellent     | no

Solution: Here the target class is buys_computer and its values are yes and no. We construct the decision tree using the ID3 algorithm, so we calculate the information gain attribute selection measure.

Class counts: P = buys_computer (yes) = 9, N = buys_computer (no) = 5, Total = 14.

Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Age         | P | N | Total | I(P, N)
------------|---|---|-------|---------------
youth       | 2 | 3 | 5     | I(2,3) = 0.970
middle_aged | 4 | 0 | 4     | I(4,0) = 0
senior      | 3 | 2 | 5     | I(3,2) = 0.970

I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.970
I(4,0) = -(4/4) log2(4/4) = 0
I(3,2) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.970

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.693
Gain(Age) = Info(D) - Info_age(D) = 0.940 - 0.693 = 0.247

Similarly,
Gain(Income) = 0.029
Gain(Student) = 0.151
Gain(Credit_rating) = 0.048

Finally, since age has the highest information gain among the attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches are grown for each of the attribute's values. The tuples are then partitioned accordingly.

Fig: Partitions of the tuples by age (youth, middle_aged, senior)
Fig: The tree after splitting on age
Fig: The tree after tree pruning

Finally, the classification rules are:
> IF age = Youth AND Student = Yes THEN buys_computer = Yes
> IF age = Middle_aged THEN buys_computer = Yes
> IF age = Senior AND Credit_rating = Fair THEN buys_computer = Yes
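For completeness, the three classification rules above translate directly into code; the final default return value of "no" is an assumption for tuples not covered by any rule, consistent with the dataset but not stated explicitly in the rules.

```python
def buys_computer(age, student, credit_rating):
    # The three IF-THEN rules derived from the pruned tree.
    if age == "youth" and student == "yes":
        return "yes"
    if age == "middle_aged":
        return "yes"
    if age == "senior" and credit_rating == "fair":
        return "yes"
    return "no"  # assumed default for tuples not covered by any rule

print(buys_computer("senior", "no", "fair"))       # yes
print(buys_computer("youth", "no", "excellent"))   # no
```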
