Machine Learning Module 2

The document discusses various statistical concepts relevant to machine learning, including correlation coefficients, multivariate statistics, and graphical representations like heatmaps and pairplots. It emphasizes the importance of linear algebra, probability, and statistics in understanding and analyzing data, particularly in the context of machine learning models. Additionally, it covers probability distributions, including normal, uniform, and binomial distributions, and their applications in data analysis.

Example: Find the correlation coefficient of the data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.

Solution: The mean values of X and Y are 3 and 11, respectively. The standard deviations of X and Y are 1.41 and 8.6486, respectively. Therefore, the correlation coefficient is the ratio of the covariance (12, from the previous problem) to the product of the standard deviations of X and Y, as per Eq. (2.18):

r = 12 / (1.41 × 8.6486) ≈ 0.984

2.7 MULTIVARIATE STATISTICS

In machine learning, almost all datasets are multivariable. Multivariate data is the analysis of more than two observable variables, and often thousands of measurements need to be conducted for one or more subjects. Multivariate data is like bivariate data but may have more than two dependent variables. Some of the multivariate analyses are regression analysis, principal component analysis, and path analysis.

Id | Attribute 1 | Attribute 2 | Attribute 3
1  | 4           | 4           | 1
2  | 2           | 5           | 2
3  | 3           | 6           | 1

The mean of multivariate data is a mean vector; the mean of the above three attributes is (3, 5, 1.33). The variance of multivariate data becomes the covariance matrix. The mean vector is called the centroid and the variance is called the dispersion matrix; this is discussed in the next section.

Multivariate data has three or more variables, and the aims of multivariate analysis are much broader. They include regression analysis, factor analysis and multivariate analysis of variance, which are explained in the subsequent chapters of this book.

Heatmap

A heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours it: darker colours indicate large values and lighter colours indicate smaller values. The advantage of this method is that humans perceive colours well, so by colour shading, larger values can be spotted quickly. For example, in vehicle traffic data, heavy-traffic regions can be differentiated from low-traffic regions through a heatmap. In Figure 2.13, patient data highlighting weight and self-assessed health status is plotted: the X-axis shows weights, the Y-axis shows health-status categories, and the dark cells highlight large patient counts.

[Figure 2.13: Heatmap for Patient Data - count of self-assessed health status vs. weight]

Pairplot

A pairplot or scatter matrix is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data, presented in a matrix format. By visual examination of the chart, one can easily find relationships among the variables, such as correlation between variables. A random matrix of three columns is chosen and the relationships of the columns are plotted as a pairplot (or scatter matrix) as shown in Figure 2.14.

[Figure 2.14: Pairplot for Random Data]
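The following is a minimal sketch, not taken from the textbook, showing how the correlation coefficient, mean vector, heatmap and pairplot ideas above can be reproduced with NumPy, pandas and seaborn; the data values are the ones used in the examples above, and the plotting calls assume a graphical backend is available.

```python
# Illustrative sketch: correlation, mean vector, heatmap and pairplot.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation coefficient for X = {1..5}, Y = {1, 4, 9, 16, 25}
x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 4, 9, 16, 25])
print(np.corrcoef(x, y)[0, 1])          # ~0.98, matching the worked example

# Multivariate data: mean vector (centroid) and covariance (dispersion) matrix
df = pd.DataFrame({"attr1": [4, 2, 3], "attr2": [4, 5, 6], "attr3": [1, 2, 1]})
print(df.mean().values)
print(df.cov())

# Heatmap of the correlation matrix, and a pairplot of random 3-column data
sns.heatmap(df.corr(), annot=True, cmap="viridis")
plt.show()
sns.pairplot(pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"]))
plt.show()
```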
2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA

Machine learning involves many mathematical concepts from the domains of linear algebra, statistics, probability and information theory. The subsequent sections discuss important aspects of linear algebra and probability.

Linear algebra is a branch of mathematics that is central to many scientific applications and other mathematical subjects. While all branches of mathematics are crucial for machine learning, linear algebra plays a particularly large role, as it is the mathematics of data. Linear algebra deals with linear equations, vectors, matrices, vector spaces and transformations. These are the driving forces of machine learning, and machine learning cannot exist without these data types. Let us discuss some of the important concepts of linear algebra now.

2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data

A linear system of equations is a group of equations with unknown variables. Let Ax = y; then the solution x is given as:

x = y/A = A⁻¹y    (2.19)

This is true if y is not zero and A is non-singular. The logic can be extended to a set of N equations with n unknown variables: if A is the coefficient matrix (aᵢⱼ) and y = (y₁, y₂, ..., yₙ)ᵀ, then the unknown vector x can be computed as:

x = y/A = A⁻¹y    (2.20)

If there is a unique solution, the system is called consistent independent. If there are multiple solutions, the system is called consistent dependent. If there are no solutions and the equations are contradictory, the system is called inconsistent.

For solving a large number of equations, Gaussian elimination can be used. The procedure for applying Gaussian elimination is as follows:
1. Write the given matrix A.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a₁₁ as pivot and eliminate a₂₁ in the second row using the row operation R₂ − (a₂₁/a₁₁)R₁, where R₂ is the second row and (a₂₁/a₁₁) is called the multiplier. The same logic can be used to remove aᵢ₁ in all other rows.
4. Repeat the same logic and reduce the matrix to reduced echelon form. The last unknown variable is then obtained directly as:
   xₙ = yₙ′ / aₙₙ′    (2.21)
5. The remaining unknown variables can be found by back-substitution:
   xₖ = (yₖ′ − Σⱼ₌ₖ₊₁ⁿ aₖⱼ′ xⱼ) / aₖₖ′    (2.22)
   This part is called backward substitution.

To facilitate the application of the Gaussian elimination method, the following row operations are applied:
1. Swapping rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it

These concepts are illustrated in Example 2.8.

Example 2.8: Solve the following set of equations using the Gaussian elimination method.

2x₁ + 4x₂ = 6
4x₁ + 3x₂ = 7

Solution: Rewrite this in augmented matrix form as follows:

( 2  4 | 6 )
( 4  3 | 7 )

Apply the transformation by dividing row 1 by 2. There are no general guidelines for row operations other than reducing the given matrix to row echelon form. The matrix can be reduced as follows:

( 1  2 | 3 )
( 4  3 | 7 )    R₁ → R₁/2

( 1  2 | 3 )
( 0 −5 | −5 )   R₂ → R₂ − 4R₁

( 1  2 | 3 )
( 0  1 | 1 )    R₂ → R₂/(−5)

( 1  0 | 1 )
( 0  1 | 1 )    R₁ → R₁ − 2R₂

Therefore, in the reduced echelon form, it can be observed that x₁ = 1 and x₂ = 1.

2.8.2 Matrix Decompositions

It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations can be performed. These methods are also known as matrix factorization methods. The most popular matrix decomposition is eigen decomposition, which reduces a matrix into its eigenvalues and eigenvectors. The matrix A can then be decomposed as:

A = QΛQᵀ    (2.23)

where Q is the matrix of eigenvectors, Λ is the diagonal matrix of eigenvalues, and Qᵀ is the transpose of Q.
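As a quick check, the sketch below (an illustration assuming NumPy is available, not part of the book) solves the system of Example 2.8 and verifies the eigen decomposition A = QΛQᵀ for a small symmetric matrix.

```python
import numpy as np

# Solve 2x1 + 4x2 = 6, 4x1 + 3x2 = 7
A = np.array([[2.0, 4.0], [4.0, 3.0]])
y = np.array([6.0, 7.0])
print(np.linalg.solve(A, y))            # [1. 1.], matching the Gaussian elimination result

# Eigen decomposition of a symmetric matrix
S = np.array([[2.0, 1.0], [1.0, 2.0]])
vals, Q = np.linalg.eigh(S)              # eigenvalues and orthonormal eigenvectors
print(np.allclose(Q @ np.diag(vals) @ Q.T, S))   # True: S = Q diag(vals) Q^T
```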
LU Decomposition

One of the simplest matrix decompositions is LU decomposition, where the matrix A is decomposed into two matrices:

A = LU

Here, L is a lower triangular matrix and U is an upper triangular matrix. The decomposition can be done using the Gaussian elimination method discussed in the previous section. First, an identity matrix is augmented to the given matrix; then row operations and Gaussian elimination are applied to reduce the given matrix and obtain the matrices L and U. Example 2.9 illustrates the application of Gaussian elimination to get L and U.

Example 2.9: Find the LU decomposition of the given matrix:

A = ( 1  2  4 )
    ( 3  3  2 )
    ( 3  4  2 )

Solution: First, augment an identity matrix and apply Gaussian elimination. The steps are as shown below:

( 1 0 0 | 1  2  4 )
( 0 1 0 | 3  3  2 )    [Initial matrix]
( 0 0 1 | 3  4  2 )

( 1 0 0 | 1  2   4  )
( 3 1 0 | 0 −3 −10 )   R₂ → R₂ − 3R₁
( 0 0 1 | 3  4   2  )

( 1 0 0 | 1  2   4  )
( 3 1 0 | 0 −3 −10 )   R₃ → R₃ − 3R₁
( 3 0 1 | 0 −2 −10 )

( 1  0   0 | 1  2    4   )
( 3  1   0 | 0 −3  −10  )   R₃ → R₃ − (2/3)R₂
( 3 2/3  1 | 0  0  −10/3 )

Now, it can be observed that the first matrix is L, the lower triangular matrix, whose entries are the multipliers used in the reduction above (3, 3 and 2/3). The second matrix is U, the upper triangular matrix, whose entries are the values of the reduced matrix after Gaussian elimination:

L = ( 1   0   0 )        U = ( 1  2    4   )
    ( 3   1   0 )            ( 0 −3  −10  )
    ( 3  2/3  1 )            ( 0  0  −10/3 )

It can be cross-verified that the multiplication LU yields the original matrix A. What are the applications of LU decomposition? Some of the applications are finding matrix inverses and determinants. If the order of the matrix is large, this method can be used.
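The following hedged sketch (assuming SciPy is available) cross-checks Example 2.9. Note that scipy.linalg.lu returns a permuted factorization A = PLU, so its factors may differ from the hand-computed ones while still being valid; the hand-computed factors are verified directly as well.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[1.0, 2.0, 4.0],
              [3.0, 3.0, 2.0],
              [3.0, 4.0, 2.0]])

P, L, U = lu(A)
print(np.allclose(P @ L @ U, A))        # True: the factors reconstruct A

# Verify the hand-computed (Doolittle) factors from Example 2.9
L_hand = np.array([[1, 0, 0], [3, 1, 0], [3, 2/3, 1]])
U_hand = np.array([[1, 2, 4], [0, -3, -10], [0, 0, -10/3]])
print(np.allclose(L_hand @ U_hand, A))  # True
```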
2.8.3 Machine Learning and Importance of Probability and Statistics

Machine learning is linked with statistics and probability. Like linear algebra, statistics is the heart of machine learning. The importance of statistics needs to be stressed, as without statistics, analysis of data is difficult. Probability is especially important for machine learning: any data can be assumed to be generated by a probability distribution, and machine learning datasets have multiple data that are generated by multiple distributions. So, knowledge of probability distributions and random variables is a must for a better understanding of machine learning concepts.

What is the objective of an experiment? It is about a hypothesis, or constructing a model. Machine learning has many models. So, the theory of hypothesis, construction of models, evaluation of models using hypothesis testing, significance of the model and evaluation of the model are all linking factors between machine learning and probability theory. Similarly, the construction of datasets using sampling theory is also required. The subsequent sections deal with the important concepts of statistics and probability.

Probability Distributions

A probability distribution of a variable, say X, summarizes the probability associated with X's events. A distribution is a parameterized mathematical function; in other words, a distribution is a function that describes the relationship between the observations in a sample space. Consider a set of data: the data is said to follow a distribution if it obeys a mathematical function that characterizes that distribution. The function can be used to calculate the probability of individual observations. Probability distributions are of two types:
1. Discrete probability distributions
2. Continuous probability distributions

The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution. It is summarized by a Probability Density Function (PDF). The PDF gives the relative likelihood of observing an instance, and the plot of the PDF shows the shape of the distribution. The Cumulative Distribution Function (CDF) computes the probability that an observation is less than or equal to a given value.

Discrete Distributions

Binomial, Poisson, and Bernoulli distributions fall under this category.

1. Binomial Distribution - The binomial distribution is often encountered in machine learning. Each trial has only two outcomes, success or failure; such a trial is also called a Bernoulli trial. The objective of this distribution is to find the probability of getting k successes out of n trials. The number of ways of getting k successes out of n trials is given as:

   C(n, k) = n! / (k!(n − k)!)    (2.27)

   If p is the probability of success, the probability of failure is (1 − p), and the probability of k successes and (n − k) failures in a particular sequence of trials is:

   pᵏ(1 − p)ⁿ⁻ᵏ    (2.28)

   Combining both, one gets the PMF of the binomial distribution as:

   P(X = k) = C(n, k) pᵏ(1 − p)ⁿ⁻ᵏ    (2.29)

   Here, p is the probability of success in each trial, k is the number of successes, and n is the total number of trials. The mean of the binomial distribution is:

   μ = n × p    (2.30)

   the variance is:

   σ² = np(1 − p)    (2.31)

   and hence the standard deviation is:

   σ = √(np(1 − p))    (2.32)

2. Poisson Distribution - This is another important and quite useful distribution. Given an interval of time, it is used to model the probability of a given number of events k occurring in that interval, when the mean rate λ is known. Some examples of Poisson-distributed quantities are the number of emails received, the number of customers visiting a shop, and the number of phone calls received by an office. The PMF of the Poisson distribution is:

   f(x; λ) = P(X = x) = e⁻λ λˣ / x!    (2.33)

   Here, x is the number of times the event occurs and λ is the mean number of times the event occurs in the interval (for example, the mean number of emails received). The mean of the distribution is λ and the standard deviation is √λ.

3. Bernoulli Distribution - This distribution models an experiment whose outcome is binary. The outcome is positive (k = 1) with probability p and negative (k = 0) with probability 1 − p. The PMF of this distribution is:

   P(k) = 1 − p  if k = 0
   P(k) = p      if k = 1    (2.34)

   The mean is p and the variance is p(1 − p) = pq.
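The short sketch below (an illustration assuming SciPy is available, with arbitrary parameter values) checks the PMF, mean and variance formulas of the three discrete distributions above.

```python
from scipy import stats

# Binomial: n = 10 trials, success probability p = 0.3
n, p = 10, 0.3
binom = stats.binom(n, p)
print(binom.pmf(3))                 # P(X = 3) = C(10,3) * 0.3^3 * 0.7^7
print(binom.mean(), binom.var())    # np = 3.0, np(1-p) = 2.1

# Poisson: mean rate lambda = 4 events per interval
pois = stats.poisson(4)
print(pois.pmf(2))                  # e^-4 * 4^2 / 2!
print(pois.mean(), pois.std())      # 4.0 and sqrt(4) = 2.0

# Bernoulli: p = 0.3
bern = stats.bernoulli(p)
print(bern.pmf(1), bern.pmf(0))     # p and 1 - p
print(bern.mean(), bern.var())      # p and p(1 - p)
```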
Density Estimation

Let there be a set of observed values x₁, x₂, ..., xₙ from a larger set of data whose distribution is not known. Density estimation is the problem of estimating the density function from the observed data. The estimated density function, denoted p̂(x), can be evaluated directly for any unknown data point, say xₜ, as p̂(xₜ). If this value is less than a threshold ε, then xₜ is categorized as an outlier or anomaly; otherwise it is treated as normal data. There are two types of density estimation methods, namely parametric density estimation and non-parametric density estimation.

Parametric Density Estimation

This assumes that the data comes from a known probability distribution and can be estimated as p(x | θ), where θ is the parameter. The maximum likelihood function is a parametric estimation method.

Maximum Likelihood Estimation

For a sample of observations, one can estimate the probability distribution; this is called density estimation. Maximum Likelihood Estimation (MLE) is a probabilistic framework that can be used for density estimation. It involves formulating a function called the likelihood function, which is the probability of observing the given samples under the distribution function with its parameters. For example, if the observations are X = {x₁, x₂, ..., xₙ}, then density estimation is the problem of choosing a PDF with suitable parameters to describe the data. MLE treats this problem as a search or optimization problem, where the probability of the joint observations X under the parameter θ should be maximized. This is expressed as p(X; θ), where X = (x₁, x₂, ..., xₙ).

The likelihood of observing the data is given as a function L(X; θ), and the objective of MLE is to maximize this function, max L(X; θ). The joint probability can be restated as ∏ᵢ p(xᵢ; θ). The computation of this product is numerically unstable, so the problem is restated as maximizing the log probability given θ:

Σᵢ log p(xᵢ; θ)    (2.35)

Instead of maximizing, one can minimize the negative of this function:

min −Σᵢ log p(xᵢ; θ)

as minimization is often preferred over maximization. This is called the negative log-likelihood function.

The relevance of MLE for machine learning is that MLE can solve the problem of predictive modelling. The regression problem is treated in Chapter 5 using the least-squares approach; the same problem can be viewed from the MLE perspective too. If one assumes that the regression problem can be framed as predicting output y given input x, then for p(y | x) the MLE framework can be applied as:

max Σᵢ log p(yᵢ | xᵢ, h)    (2.36)

Here, h is the linear regression model. If a Gaussian distribution is assumed, as most data follow a Gaussian distribution, then MLE can be stated as:

max_β Σᵢ log N(yᵢ; h(xᵢ, β), σ²)    (2.37)

Here, β is the regression coefficient and xᵢ is the given sample. One can maximize this function, or minimize the negative log-likelihood, to obtain a solution for the linear regression problem; Eq. (2.37) yields the same answer as the least-squares approach. Stochastic gradient descent (SGD) is an algorithm that is normally used to maximize or minimize such functions, and it can be used to minimize the above function effectively; hence, many statistical packages use this approach for solving the linear regression problem. Theoretically, any model can be used instead of linear regression in the above equation. Logistic regression is another model that can be used in the MLE framework; logistic regression is discussed in Chapter 5 of this book.
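As a small illustration (not from the book), the sketch below carries out MLE by minimizing the negative log-likelihood of a Gaussian model for synthetic data; the optimizer, data and parameterization are all illustrative choices.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)    # synthetic sample

def negative_log_likelihood(params):
    mu, log_sigma = params                          # log-sigma keeps sigma positive
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

result = optimize.minimize(negative_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)     # close to the true values 5.0 and 2.0
```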
Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm

In machine learning, clustering is one of the important tasks; it is discussed in Chapter 13. The MLE framework is quite useful for designing model-based methods for clustering data. A model is a statistical method, and the data is assumed to be generated by a distribution model with its parameters θ. There may be many distributions involved, which is why it is called a mixture model. Since Gaussians are normally assumed for data, this mixture model is categorized as a Gaussian Mixture Model (GMM).

The EM algorithm is commonly used for estimating the MLE in the presence of latent or missing variables. What is a latent variable? Let us assume that the dataset includes the weights of boys and girls. Considering that the boys' weights would typically be slightly more than the girls' weights, one can assume that the larger weights are generated by one Gaussian distribution with one set of parameters, while the girls' weights are generated with another set of parameters. There is an influence of gender in the data, but it is not directly present or observable; such variables are called latent variables. The EM algorithm is effective for estimating the PDF in the presence of latent variables. Generally, there can be many unspecified distributions with different sets of parameters.

The EM algorithm has two stages:
1. Expectation (E) stage - In this stage, the expected PDF and its parameters are estimated for each latent variable.
2. Maximization (M) stage - In this stage, the parameters are optimized using the MLE function.

This process is iterative, and the iteration continues till all the latent variables are fitted effectively by probability distributions along with their parameters.

Non-parametric Density Estimation

A non-parametric estimation can be generative or discriminative. The Parzen window is a generative estimation method that finds p(x | θ) as a conditional density, while discriminative methods directly compute p(θ | x) as a posteriori probability. The Parzen window and the k-Nearest Neighbour (KNN) rule are examples of non-parametric density estimation. Let us discuss them now.

Parzen Window

Let there be n samples, X = {x₁, x₂, ..., xₙ}, drawn independently (identically and independently distributed). Let R be a region that covers k of the total n samples. Then the probability of the region is:

p = k/n    (2.38)

and the density estimate is:

p(x) = (k/n) / V    (2.39)

where V is the volume of the region R. If R is a hypercube centred at x and h is the edge length of the hypercube, the volume V is h² for a 2D square and h³ for a 3D cube. The Parzen window function is given as:

φ(u) = 1 if |uⱼ| ≤ 1/2 for every component j, and 0 otherwise    (2.40)

The window indicates whether a sample falls inside the region or not. The Parzen probability density estimate using Eq. (2.40) is:

p(x) = (k/n) / V, where k = Σᵢ φ((x − xᵢ)/h)    (2.41)

The window function can be replaced by any other function; if a Gaussian function is used, it is called a Gaussian kernel density function.

KNN Estimation

KNN estimation is another non-parametric density estimation method. Here, the parameter k is fixed in advance, and based on it the k nearest neighbours of a point are determined. The probability density estimate is then the average of the values returned by these k neighbours.
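The following brief sketch (assuming SciPy is available; the weights data and the threshold are illustrative, not from the book) estimates a density non-parametrically with a Gaussian kernel, which plays the role of the Gaussian window mentioned above, and applies the density-based anomaly check described in the density estimation discussion.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Weights drawn from two latent groups (e.g., girls ~55 kg, boys ~68 kg)
weights = np.concatenate([rng.normal(55, 5, 300), rng.normal(68, 6, 300)])

kde = gaussian_kde(weights)              # Gaussian kernel instead of a hypercube window
grid = np.linspace(40, 85, 5)
print(kde(grid))                         # estimated density p(x) at a few points

# Density-based anomaly check: flag points whose estimated density is tiny
threshold = 1e-4
print(kde([30.0])[0] < threshold)        # True: 30 kg is far from both groups
```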
2.9 OVERVIEW OF HYPOTHESIS

Data collection alone is not enough; data must be interpreted to give a conclusion, and the conclusion should be a structured outcome. This assumption about the outcome is called a hypothesis. Statistical methods are used to confirm or reject the hypothesis. The assumption of the statistical test is called the null hypothesis, also called hypothesis zero (H0); in other words, it is the existing belief. The violation of this hypothesis is called the first hypothesis (H1), or the alternate hypothesis; this is the hypothesis the researcher is trying to establish.

There are two types of hypothesis tests: parametric and non-parametric. Parametric tests are based on parameters such as the mean and standard deviation. Non-parametric tests depend on characteristics such as independence of events or the data following a certain distribution.

Statistical tests help to:
1. Define the null and alternate hypotheses
2. Describe the hypotheses using parameters
3. Identify the statistical test and the test statistic
4. Decide the criterion, called the significance value α
5. Compute the p-value (probability value)
6. Take the final decision of accepting or rejecting the null hypothesis based on these values

Hypothesis testing is particularly important as it is an integral part of learning algorithms. Generally, the data size is small, so one may need to know whether the hypothesis will hold for additional samples and how accurate it is. No matter how effective the statistical tests are, two kinds of errors are involved: Type I and Type II. A Type I error is the incorrect rejection of a true null hypothesis and is called a false positive. A Type II error is the failure to reject a false null hypothesis and is called a false negative. During these calculations, one must account for the size of the data sample. The degree of freedom indicates the number of independent pieces of information used for the test and is denoted as v; the mean or variance can be used to indicate the degree of freedom.

Hypothesis Testing

Let us define two important errors called sample error and true (or actual) error. Let us assume that D is the unknown distribution, the target function is f(x): X → {0, 1}, x is an instance, h(x) is the hypothesis, and S is the sample set containing instances drawn from X. Then, the true (actual) error is defined as:

error_D(h) = Pr_{x∈D}[f(x) ≠ h(x)]    (2.42)

In other words, the true error is the probability that the hypothesis will misclassify an instance drawn at random. The point is that the population is very large, so the true error cannot be determined exactly and can only be estimated. So another error, called the sample error (or estimator), is defined with respect to the sample S. It is the fraction of S that is misclassified:

error_S(h) = (1/n) Σ_{x∈S} δ(f(x), h(x))    (2.43)

Here, δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

The data associated with the sample has two problems: bias and variance. Since the dataset is small, the model may overfit, and generalizing the concept of the population from a smaller dataset is difficult. If S is the sample set, the bias can be estimated as E[error_S(h)] − error_D(h). Another data error is the variance, that is, the deviation of error_S(h) from error_D(h). Here, the errors are the sample error and the true (actual) error.

p-value

Statistical tests can be performed to either accept or reject the null hypothesis. This is done using a value called the p-value or probability value, which is used to interpret or quantify the test. For example, a statistical test may give a p-value of 0.03. One can compare it with the level 0.05: as 0.03 < 0.05, the result is considered significant, which means that the variables tested are not independent. Here, 0.05 is called the significance level. In general, the significance level is denoted α and the p-value is compared with α: if p-value ≤ α, the null hypothesis H0 is rejected (in favour of H1), and if p-value > α, the null hypothesis H0 is accepted.
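A toy sketch (illustrative only; the target, hypothesis and distribution are made-up) showing the difference between the true error and the sample error of Eqs. (2.42) and (2.43): the sample error fluctuates from sample to sample around the true error.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):          # "true" target: positive when x > 0
    return (x > 0).astype(int)

def h(x):          # an imperfect hypothesis: positive when x > 0.5
    return (x > 0.5).astype(int)

# True error approximated on a very large sample (stand-in for the population D)
population = rng.normal(size=1_000_000)
true_error = np.mean(f(population) != h(population))

# Sample error, Eq. (2.43), on a few small samples S of size 50
sample_errors = [np.mean(f(s) != h(s))
                 for s in (rng.normal(size=50) for _ in range(5))]

print(round(true_error, 3))                    # about 0.19
print([round(e, 2) for e in sample_errors])    # values scattered around the true error
```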
Confidence Intervals

The acceptance or rejection of the hypothesis can also be done using a confidence interval. The confidence interval is the range of values that indicates the location of the true mean, and the associated confidence level is computed as:

Confidence level = 1 − significance level    (2.44)

Confidence intervals indicate the confidence in the result. If the confidence level is 90%, it means there is a 90% chance that the true mean lies in this range, and a 10% chance that it does not. To find the interval, one requires the mean and the standard deviation; the interval for the mean is then given as:

mean(x) ± z × s/√N

Here, s is the standard deviation, N is the number of samples, and z is the value associated with the chosen confidence level (for example, 90%). The term z × s/√N is also called the margin of error.

The sample error is an unbiased estimate of the true error; if no other information is provided, both errors are taken to be the same. It is, however, often safer to attach a margin of confidence to the hypothesis. The interval with 95% confidence for the true error, based on the sample error, is given as:

error_S(h) ± 1.96 × √( error_S(h)(1 − error_S(h)) / n )    (2.45)

The value 1.96 corresponds to 95% confidence; it can be replaced by the value associated with any other confidence level.

The procedure to estimate the difference between two hypotheses, say h₁ and h₂, is as follows:
1. A parameter d is chosen to estimate the difference between the errors of the two hypotheses:
   d = error_S1(h₁) − error_S2(h₂)    (2.46)
   Here, the two hypotheses h₁ and h₂ are tested on two sample sets S₁ and S₂, and n₁ and n₂ are the numbers of randomly drawn samples.
2. The estimator d̂ is computed as the difference error_S1(h₁) − error_S2(h₂).
3. A confidence interval can be attached to this estimator as well:
   d ± z × √( error_S1(h₁)(1 − error_S1(h₁))/n₁ + error_S2(h₂)(1 − error_S2(h₂))/n₂ )    (2.47)

Sometimes, it is desirable to find an interval [L, U] such that N% of the probability mass falls in this interval.

Let us now discuss comparing two algorithms based on hypothesis testing. This is done using statistical tests; some of the important statistical tests are discussed in the next section.

2.9.1 Comparing Learning Methods

Some of the methods for comparing learning programs are given below.

Z-test

The Z-test assumes a normal distribution of data whose population variance is known, and the sample size is assumed to be large. The focus is on testing the population mean. The z-statistic is given as:

z = (X̄ − μ) / (σ/√n)    (2.48)

Here, X̄ is the sample mean of the input data X, n is the number of data elements, and μ and σ are the population mean and standard deviation, respectively.

Example: Let 12 be the population mean (μ) and let the population standard deviation (σ) be 2. Consider the sample X = {1, 2, 3, 4, 5}. Apply the z-test and show whether the result is significant.

Solution: The sample mean of X is X̄ = 3 and the number of samples is n = 5. Substituting in Eq. (2.48) gives:

z = (3 − 12) / (2/√5) ≈ −10.06

By checking the critical value at significance level 0.05, one can find that |z| far exceeds it, and hence the null hypothesis H0 is rejected.
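A small sketch (not from the book) reproducing the z-test example above and converting the statistic to a two-sided p-value with SciPy.

```python
import numpy as np
from scipy import stats

sample = np.array([1, 2, 3, 4, 5])
mu, sigma = 12.0, 2.0                        # assumed population parameters
n = len(sample)

z = (sample.mean() - mu) / (sigma / np.sqrt(n))
p_value = 2 * stats.norm.sf(abs(z))          # two-sided p-value

print(round(z, 2))                           # about -10.06
print(p_value < 0.05)                        # True: reject the null hypothesis
```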
t-test and Paired t-test

The t-test is a hypothesis test that checks whether the difference between two sample means is real or due to chance. Here, the data is continuous and randomly selected, the number of samples is small, and the observed variance between groups is assumed to be real rather than due to chance. The t-statistic follows a t-distribution under the null hypothesis and is used when the number of samples is less than 30; the distribution followed is the t-distribution rather than the Gaussian distribution. It indicates whether the two groups are different or not.

One-Sample t-test

In this test, the mean of one group is checked against a set average, which can be either a theoretical value or the population mean. The procedure is:
* Select a group
* Compute its average
* Compare it with the theoretical value and compute the t-statistic:

t = (m − μ) / (s/√n)    (2.49)

Here, t is the t-statistic, m is the mean of the group, μ is the theoretical value or population mean, s is the standard deviation, and n is the group (sample) size.

Independent Two-Sample t-test

The t-statistic for two groups A and B is computed as:

t = (mean(A) − mean(B)) / √( s_A²/N_A + s_B²/N_B )

Here, mean(A) and mean(B) are the means of the two samples, N_A and N_B are the sample sizes of the two groups, and s_A² and s_B² are the variances of the two samples. The degrees of freedom are given as N_A + N_B − 2. The t-statistic is then compared with the critical value.

Example: Let one group be {8, 6, 7, 6, 3} and another group be {7, 8, 8, 8, 2}. The number of elements in each group (n) is 5. For the first group, the mean and variance are 6 and 3.5, respectively; for the second group, they are 6.6 and 6.8, respectively. The degree of freedom is taken as n − 1 = 4. Applying the t-statistic formula to these data gives t ≈ −0.42. Since |t| is small (the associated probability is about 0.343, well above the 0.05 level), the test is insignificant.
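The following check (an illustration assuming SciPy is available) runs the independent two-sample t-test on the two groups from the example above; equal_var=False selects the separate-variance form used in the hand computation.

```python
import numpy as np
from scipy import stats

group_a = np.array([8, 6, 7, 6, 3])
group_b = np.array([7, 8, 8, 8, 2])

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(round(t_stat, 2))        # about -0.42
print(p_value > 0.05)          # True: the difference is not significant
```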
Paired t-test

The paired t-test is used to evaluate a hypothesis before and after an intervention; the two sets of samples are not independent. For example, consider the effect of medication on a diabetic patient: first the sugar level is tested, then the medication is given, and the sugar test is conducted again to study the effect of the medication. In short, in a paired t-test the data is taken from the same subjects twice, whereas in an unpaired t-test the samples are taken independently. Only one group is involved, and the t-statistic is computed as:

t = (m − μ) / (s/√n)

Here, t is the t-statistic, m is the mean of the group (of differences), μ is the theoretical value or population mean, s is the standard deviation, and n is the group or sample size.

Chi-Square Test

The Chi-Square test is a non-parametric test. The goodness-of-fit test statistic follows a Chi-Square distribution under the null hypothesis and measures the statistical significance of the difference between observed frequencies and expected frequencies; each observation is assumed to be independent of the others. It is used to compare the observed frequency of some observation (such as the frequency of buying different brands of fans) with the expected frequency of the observation (of each fan being bought by customers) and to decide whether any statistical difference exists. This comparison is used to calculate the value of the Chi-Square statistic as:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ    (2.51)

Here, Eᵢ is the expected frequency, Oᵢ is the observed frequency, and the degree of freedom is C − 1, where C is the number of categories. The Chi-Square test also allows us to detect duplication of data and helps to remove redundant values.

Example: Consider the following Table 2.4, where registration for the machine learning course by boys and girls is recorded. There are 50 boys and 50 girls in the class, and the registrations are given in the table. Apply the Chi-Square test and find out whether any difference exists between boys and girls with respect to course registration.

Table 2.4: Observed Data

        | Registered | Not Registered | Total
Boys    | 35         | 15             | 50
Girls   | 25         | 25             | 50
Total   | 60         | 40             | 100

Solution: Let the null hypothesis H0 be that there is no difference between boys and girls, and let H1 be the alternate hypothesis that there is a significant difference between boys and girls. For applying the Chi-Square test, the expected frequencies are obtained by multiplying the row total by the column total and dividing by the grand total, for example (Total boys × Total registered)/Total, as shown in Table 2.5.

Table 2.5: Expected Data

        | Registered           | Not Registered       | Total
Boys    | 50 × 60/100 = 30     | 50 × 40/100 = 20     | 50
Girls   | 50 × 60/100 = 30     | 50 × 40/100 = 20     | 50
Total   | 60                   | 40                   | 100

The Chi-Square statistic is obtained using Eq. (2.51) as:

χ² = (35 − 30)²/30 + (15 − 20)²/20 + (25 − 30)²/30 + (25 − 20)²/20 ≈ 4.166

for degrees of freedom = number of categories − 1 = 2 − 1 = 1. The p-value for this statistic is 0.0412, which is less than 0.05. Therefore, the result is significant.
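A quick check of the contingency-table example above (assuming SciPy is available) using the chi-square test of independence; correction=False disables the continuity correction so the statistic matches the hand computation.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[35, 15],     # boys: registered, not registered
                     [25, 25]])    # girls: registered, not registered

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 3), round(p_value, 4), dof)   # ~4.167, ~0.0412, 1
print(expected)                                  # [[30. 20.] [30. 20.]]
```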
2.10 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION TECHNIQUES

Features are attributes. Feature engineering is about determining the subset of features that form an important part of the input and improve the performance of a model, be it classification or any other model in machine learning. Feature engineering deals with two problems: feature transformation and feature selection. Feature transformation is the extraction of features and the creation of new features that may be helpful in increasing performance; for example, height and weight may give a new attribute called Body Mass Index (BMI). Feature subset selection is another important aspect of feature engineering that focuses on selecting features to reduce training time, but not at the cost of reliability.

Subset selection reduces the dataset size by removing irrelevant features and constructs a minimum set of attributes for machine learning. If the dataset has n attributes, the time complexity is extremely high, as n dimensions need to be processed for the given dataset. For n attributes, there are 2ⁿ possible subsets; if the value of n is high, the problem becomes intractable. This is called the 'curse of dimensionality': as the number of dimensions increases, the time complexity increases. The remedy is that components that do not contribute much can be deleted, which results in a reduction of dimensionality. Choosing optimal attributes becomes a graph search problem. Typically, the feature subset selection problem uses a greedy approach, making the locally optimal choice at each step while hoping that it leads to a globally optimal solution.

Features can be removed based on two aspects:
1. Feature relevancy - Some features contribute more to classification than others. For example, a mole on the face can help in face detection more than common features like a nose. In simple words, the features should be relevant. The relevancy of features can be determined based on information measures such as mutual information, correlation-based measures like the correlation coefficient, and distance measures. Distance measures are discussed in Chapter 13 of this book.
2. Feature redundancy - Some features are redundant. For example, when a database table has a field called date of birth, an age field is not needed, as age can be computed easily from the date of birth. Removing the age column leads to a reduction of dimension by one.

So, the procedure is:
1. Generate all possible subsets
2. Evaluate the subsets and model performance
3. Evaluate the results for optimal feature selection

Filter-based selection uses statistical measures for assessing features; in this approach, no learning algorithm is used. Correlation and information-gain measures like mutual information and entropy are all examples of this approach. Wrapper-based methods use classifiers to identify the best features; candidate subsets are selected and evaluated by the learning algorithm itself. This procedure is computationally intensive but has superior performance. Let us discuss some of the important algorithms that fall under this category.

2.10.1 Stepwise Forward Selection

This procedure starts with an empty set of attributes. At each step, an attribute is tested for statistical significance (best quality) and added to the reduced set. This process is continued till a good reduced set of attributes is obtained.

2.10.2 Stepwise Backward Elimination

This procedure starts with the complete set of attributes. At every stage, the procedure removes the worst attribute from the set, leading to the reduced set.

Combined Approach

Both the forward and backward methods can be combined so that, at each step, the procedure adds the best attribute and removes the worst attribute.
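A hedged sketch of wrapper-style stepwise selection, assuming scikit-learn (version 0.24 or later, which provides SequentialFeatureSelector) is available; the dataset, estimator and number of features to keep are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

forward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                     direction="backward").fit(X, y)

print(forward.get_support())    # mask of attributes kept by stepwise forward selection
print(backward.get_support())   # mask of attributes kept by backward elimination
```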
2.10.3 Principal Component Analysis

The idea of principal component analysis (PCA), or the KL transform, is to transform a given set of measurements to a new set of features so that the features exhibit high information-packing properties. This leads to a reduced and compact set of features. This elimination is made possible because of information redundancies, and the compact representation is of reduced dimension.

Consider a group of random vectors x = (x₁, x₂, ..., xₙ)ᵀ. The mean vector of the set of random vectors is defined as:

m_x = E{x}

The operator E refers to the expected value over the population, calculated theoretically using the probability density functions (PDF) of the elements xᵢ and the joint probability density functions between the elements xᵢ and xⱼ. From this, the covariance matrix can be calculated as:

C = E{(x − m_x)(x − m_x)ᵀ}    (2.52)

For M random vectors, when M is large enough, the mean vector and covariance matrix can be approximated as:

m_x = (1/M) Σₖ xₖ    (2.53)

C = (1/M) Σₖ xₖxₖᵀ − m_x m_xᵀ    (2.54)

This covariance matrix is real and symmetric. If eᵢ and λᵢ (i = 1, 2, ..., n) are the eigenvectors and corresponding eigenvalues of the covariance matrix, the eigenvalues can be arranged in descending order so that λᵢ ≥ λᵢ₊₁, and the corresponding eigenvectors are calculated. Based on this, the transform kernel is constructed. Let the transform kernel be A; its rows are formed from the eigenvectors of the covariance matrix. The mapping of the vectors x to y using the transformation can now be described as:

y = A(x − m_x)    (2.55)

This transform is also called the Karhunen-Loeve or Hotelling transform. The original vector x can be reconstructed as:

x = Aᵀy + m_x    (2.56)

The goal of PCA is to reduce the set of attributes to a newer, smaller set that captures the variance of the data. The variance is captured by a few components, which give approximately the same result as using all the attributes. Instead of using all the eigenvectors of the covariance matrix, only a small set that captures most of the variance can be used; by using a small set, the KL transform achieves maximum compression of the available data. If only the K largest eigenvalues are used, the recovered information is:

x̂ = A_Kᵀ y + m_x    (2.57)

The advantages of PCA are immense: it reduces the attribute list by eliminating all irrelevant attributes. The PCA algorithm is as follows:
1. The target dataset x is obtained.
2. The mean is subtracted from the dataset. Let the mean be m; the adjusted dataset is then x − m. The objective of this step is to transform the dataset to zero mean.
3. The covariance of the dataset x is obtained. Let it be C.
4. The eigenvalues and eigenvectors of the covariance matrix are calculated.
5. The eigenvector of the highest eigenvalue is the principal component of the dataset. The eigenvalues are arranged in descending order, and the feature vector is formed with these eigenvectors in its columns:
   Feature vector = {eigenvector₁, eigenvector₂, ..., eigenvectorₙ}
6. Obtain the transpose of the feature vector. Let it be A.
7. The PCA transform is y = A × (x − m), where x is the input dataset, m is the mean, and A is the transpose of the feature vector.

The original data can be retrieved using the formula below (since A is orthogonal, A⁻¹ = Aᵀ):

Original data = (A⁻¹ × y) + m    (2.58)
              = (Aᵀ × y) + m    (2.59)

The new data y is a dimensionally reduced matrix that represents the original data. Therefore, PCA is effective in removing attributes that do not contribute, and if the original data is required, it can be obtained back with no loss of information (when all components are retained).

A scree plot is a visualization technique for identifying the principal components (or variables) that play a more important role compared to the other attributes. For a randomly selected dataset of 246 attributes, PCA is applied and its scree plot is shown in Figure 2.15; the plot indicates that only 6 out of the 246 attributes are important, and that the first attribute is more important than all the others.

[Figure 2.15: Scree Plot - eigenvalue versus component number]

Example: Let the data points be (2, 6)ᵀ and (1, 7)ᵀ. Apply PCA and find the transformed data. Then apply the inverse transform and show that PCA works.

Solution: One can combine the two vectors into a matrix:

( 2  1 )
( 6  7 )

The mean vector can be computed using Eq. (2.53) as:

m = ( (2 + 1)/2, (6 + 7)/2 )ᵀ = (1.5, 6.5)ᵀ

As part of PCA, the mean must be subtracted from the data to get the adjusted data:

( 2 − 1.5 )   (  0.5 )        ( 1 − 1.5 )   ( −0.5 )
( 6 − 6.5 ) = ( −0.5 ),       ( 7 − 6.5 ) = (  0.5 )

One can find the covariance for these data vectors from their outer products:

(  0.5 )( 0.5  −0.5 ) = (  0.25  −0.25 )        ( −0.5 )( −0.5  0.5 ) = (  0.25  −0.25 )
( −0.5 )                ( −0.25   0.25 )        (  0.5 )               ( −0.25   0.25 )

The final covariance matrix is obtained by adding these two matrices:

C = (  0.5  −0.5 )
    ( −0.5   0.5 )

The eigenvalues and eigenvectors of matrix C can be obtained (left as an exercise) as λ₁ = 1 and λ₂ = 0, with eigenvectors (1, −1)ᵀ and (1, 1)ᵀ. The matrix A is obtained by packing the eigenvectors of these eigenvalues (after sorting by eigenvalue) as rows:

A = ( 1  −1 )
    ( 1   1 )

The matrix can be normalized by dividing each element of an eigenvector by the norm of that vector:

A = ( 1/√2  −1/√2 )
    ( 1/√2   1/√2 )

One can check that the PCA matrix A is orthogonal: a matrix is orthogonal if A⁻¹ = Aᵀ and AAᵀ = I. Here,

AAᵀ = ( 1  0 )
      ( 0  1 )

The transformed matrix y using Eq. (2.55) is y = A × (x − m), where (x − m) is the adjusted matrix:

y = ( 1/√2  −1/√2 ) (  0.5  −0.5 )   ( 1/√2  −1/√2 )
    ( 1/√2   1/√2 ) ( −0.5   0.5 ) = (   0      0  )

One can check that the original matrix can be retrieved from this matrix as:

x = (Aᵀ × y) + m = ( 2  1 )
                   ( 6  7 )

Therefore, one can infer that the original data is obtained without any loss of information.
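The minimal NumPy sketch below (not from the book) reproduces the two-point PCA example above: transform, then inverse transform with no loss of information. The eigenvector signs returned by the library may differ from the hand computation, but the reconstruction is unaffected.

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [6.0, 7.0]])               # each column is a data point

m = X.mean(axis=1, keepdims=True)         # mean vector (1.5, 6.5)^T
adjusted = X - m

C = adjusted @ adjusted.T                 # covariance as used in the worked example
eigvals, eigvecs = np.linalg.eigh(C)      # eigenpairs in ascending order
order = np.argsort(eigvals)[::-1]         # sort descending
A = eigvecs[:, order].T                   # rows of A are the sorted eigenvectors

y = A @ adjusted                          # PCA transform, Eq. (2.55)
X_back = A.T @ y + m                      # reconstruction, Eq. (2.56)

print(y)                                  # second row is ~0: one component suffices
print(np.allclose(X_back, X))             # True: recovered without loss
```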
2.10.4 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is also a feature reduction technique like PCA. The focus of LDA is to project higher-dimensional data onto a line (lower-dimensional data). LDA is also used to classify the data. Let there be two classes, c₁ and c₂, and let μ₁ and μ₂ be the means of the patterns of the two classes. The means of the classes c₁ and c₂ can be computed as:

μ₁ = (1/N₁) Σ_{x∈c₁} x   and   μ₂ = (1/N₂) Σ_{x∈c₂} x

The aim of LDA is to optimize the function:

J(V) = (Vᵀ σ_B V) / (Vᵀ σ_W V)    (2.60)

where V is the linear projection, and σ_B and σ_W are the between-class scatter matrix and the within-class scatter matrix, respectively. For the two-class problem, these matrices are given as:

σ_B = N₁(μ₁ − μ)(μ₁ − μ)ᵀ + N₂(μ₂ − μ)(μ₂ − μ)ᵀ    (2.61)

σ_W = Σ_{x∈c₁}(x − μ₁)(x − μ₁)ᵀ + Σ_{x∈c₂}(x − μ₂)(x − μ₂)ᵀ    (2.62)

The maximization of J(V) should satisfy the equation:

σ_B V = λ σ_W V, or equivalently σ_W⁻¹ σ_B V = λV    (2.63)

As σ_B V is always in the direction of (μ₁ − μ₂), V can be given as:

V = σ_W⁻¹(μ₁ − μ₂)    (2.64)

Let V = {v₁, v₂, ..., v_d} be the generalized eigenvectors of σ_B and σ_W, where d is the number of largest eigenvalues retained, as in PCA. The transformation of x is then given as:

y = Vᵀx    (2.65)

As in PCA, only the eigenvectors corresponding to the largest eigenvalues need be retained for the projections.

2.10.5 Singular Value Decomposition

Singular Value Decomposition (SVD) is another useful decomposition technique. Let A be the given matrix of dimension m × n; then A can be decomposed as:

A = USVᵀ    (2.66)

Here, U is an orthogonal matrix of dimension m × m, S is a diagonal matrix of dimension m × n containing the singular values, and V is an orthogonal matrix of dimension n × n. The procedure for finding the decomposition matrices is as follows:
1. For the given matrix, find AAᵀ.
2. Find the eigenvalues of AAᵀ.
3. Sort the eigenvalues in descending order and pack the corresponding eigenvectors as the matrix U.
4. Arrange the square roots of the eigenvalues along the diagonal; this diagonal matrix is S.
5. Find the eigenvalues and eigenvectors of AᵀA, and pack the eigenvectors as the matrix V.

Thus, A = USVᵀ, where U and V are orthogonal matrices whose columns are the left and right singular vectors, respectively. SVD is useful in compression, as one can decide to retain only a certain number of components instead of the original matrix A:

A ≈ Σᵢ₌₁ᵏ uᵢ sᵢ vᵢᵀ    (2.67)

Based on the choice of retention k, the compression can be controlled.

Example: Find the SVD of the matrix:

A = ( 1  2 )
    ( 4  9 )

Solution: The first step is to compute:

AAᵀ = ( 1  2 )( 1  4 )   (  5   22 )
      ( 4  9 )( 2  9 ) = ( 22   97 )

The eigenvalues and eigenvectors of this matrix are calculated to get U. The eigenvalues of this matrix are 101.9902 and 0.0098. The corresponding eigenvectors, normalized by dividing each vector by its norm, are:

u₁ = ( 0.2212 )        u₂ = ( −0.9752 )
     ( 0.9752 )             (  0.2212 )

The matrix U is obtained by concatenating these vectors as columns:

U = ( 0.2212  −0.9752 )
    ( 0.9752   0.2212 )

The matrix V is obtained by finding AᵀA:

AᵀA = ( 17  38 )
      ( 38  85 )

Its eigenvalues are again 101.9902 and 0.0098, and the normalized eigenvectors (with signs chosen so that A = USVᵀ holds) are:

v₁ = ( 0.4082 )        v₂ = ( −0.9129 )
     ( 0.9129 )             (  0.4082 )

so that:

V = ( 0.4082  −0.9129 )
    ( 0.9129   0.4082 )

The matrix S is the diagonal matrix of the square roots of the eigenvalues:

S = ( √101.9902      0     )   ( 10.099    0   )
    (     0      √0.0098   ) = (   0     0.099 )

Therefore, the matrix decomposition A = USVᵀ is complete.
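A quick NumPy check (not from the book) of the SVD example above and of the rank-k truncation in Eq. (2.67).

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [4.0, 9.0]])

U, s, Vt = np.linalg.svd(A)
print(np.round(s, 3))                          # singular values ~[10.099, 0.099]
print(np.allclose(U @ np.diag(s) @ Vt, A))     # True: A = U S V^T

# Rank-1 approximation: keep only the largest singular value
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.round(A1, 2))                         # close to A because s[1] is tiny
```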
The main advantage of SVD is compression: a matrix, say an image, can be decomposed and only certain components retained selectively by making all other elements zero. This reduces the content of the image while retaining its quality. SVD is useful in data reduction too.

Summary

1. Data is fact. Data must be converted to information.
2. Data analysis is an operation that converts data to information. Data analytics is a general area encompassing data analysis.
3. Exploratory data analysis aims to understand the data better.
4. Descriptive statistics aims to summarize the data, and data visualization aims to understand the data using charts.
5. Data types are qualitative and quantitative. The qualitative data types include nominal and ordinal; the quantitative data types include interval and ratio.
6. Data can also be classified as univariate, bivariate, and multivariate data based on the number of variables used.
7. The measures of central tendency include mean, median, mode, and midrange.
8. The most common dispersion measures are range, five-number summary, interquartile range and standard deviation.
9. Skewness and kurtosis are shape measures. Skewness, kurtosis, mean absolute deviation, and coefficient of variation help in assessing the shape.
10. Visualization is an important aspect of data mining which helps the user to recognize and interpret the results quickly.
11. Bivariate data involves two variables, and hence the focus is on finding relationships between two variables.
12. Multivariate data has more variables; the focus is on finding relations, cause and effect, and statistical inference.
13. Covariance is a measure of the joint variability of random variables, say x and y; it is denoted cov(X, Y) and is used to measure variance between two dimensions. Correlation measures the strength of the linear relationship among variables.
14. The idea of feature reduction is to transform a given set of measurements to a new set of features so that the features exhibit high information-packing properties. This leads to a reduced and compact set of features.
15. Sampling is a statistical procedure of data reduction. Statistics starts by identifying the group for carrying out a study.
16. The assumption of an experiment is called a hypothesis. Statistical methods are used to confirm or reject the hypothesis.
17. Z-test, t-tests, and Chi-Square tests are used to evaluate hypotheses.

Key Terms

* Data - Raw facts.
* Information - Processed data.
* Data Analysis - An operation that extracts knowledge from facts.
* Data Analytics - A general area of analysis that encompasses data analysis.
* Big Data - Data that has characteristics like volume, velocity, variety and veracity.
* Data Preprocessing - A set of operations that are used to condition the data needed for machine learning.
* Noisy Data - Abnormal data that is different from normal data. Outlier data is similar to noisy data but useful.
* Data Transformations - Steps that are necessary to convert data to a processable form.
* Descriptive Statistics - A branch of statistics that summarizes data.
* Data Visualization - A branch of study that deals with the presentation of data.
* Central Tendency - A summary of data in terms of its centre, such as mean, median and mode.
* Five-Point Summary - A summary of data in terms of minimum, maximum, median, and upper/lower quartiles.
* Kurtosis - The peakedness of the data distribution.
* Bivariate Data - A dataset that has two variables.
* Pairplot - A technique for visualizing multivariate data. Multivariate data is data characterized by multiple variables.
* Expectation of X - The expectation is the mean of the random variable's observations.
* Variance - The average squared difference of each value from the expected value.
* Covariance - A measure of the variance between two sets of data.
* Pearson Correlation Coefficient - The normalized covariance is called the correlation coefficient.
* Feature Reduction - Reduction of a set of attributes to a new or simpler set that captures the variance of the data.
* Random Variable - A set of all possible values of an experiment.
* Distribution - A parameterized mathematical function that characterizes the data.
* Sampling Techniques - A set of procedures for reducing the data.
* Hypothesis - The assumption being tested on the data. The current assumption is called the null hypothesis, and the one the researcher wishes to establish is called the alternate hypothesis.
* Error - The incorrect rejection of a true null hypothesis (called a Type I error) and the failure to reject a false null hypothesis (called a Type II error).

Short Questions

1. What is univariate data?
2. What are the types of data?
3. Distinguish between 'good' and 'bad' data.
4. What are the problems of data collection?
5. Explain missing data analysis.
6. What are the measures of central tendency?
7. Why are central tendency and dispersion measures important for data miners?
8. What are the measures of skewness and kurtosis?
9. How is the interquartile range useful in eliminating outliers?
10. List the visualization aids available for exploratory data analysis.
11. What is the use of the correlation coefficient in data mining?
12. What are the advantages of LU decomposition?
13. What is the difference between correlation and covariance?
14. What is the need for PCA analysis?
15. Justify the use of scree plots in PCA.
16. Explain the role of probability distributions in machine learning.
17. Explain the basics of the binomial distribution.
18. Distinguish between sample error and actual error.
19. List the types of sampling techniques.
20. What is a hypothesis?

Long Questions

1. Explain in stages the data management life cycle.
2. Explain the types of Big Data with an example.
3. Explain in detail the data cleaning processes.
4. Explain in detail univariate data analysis.
5. Explain at least five charts in detail that help data visualization.
6. Explain the procedure for SVD.
7. Explain the process of obtaining principal components and its relevance in feature reduction.
8. Explain the procedure for hypothesis testing.
9. Explain the procedure for paired t-tests and the Chi-Square goodness-of-fit test.

Numerical Problems

1. For a given univariate dataset S = {5, 10, 15, 20, 25, 30} of marks, find the mean, median, mode, standard deviation and variance.
2. For a given univariate dataset S = {5, 10, 15, 20, 25, 30} of marks, find the arithmetic mean and geometric mean.
3. For a given univariate dataset S = {5, 10, 15, 20, 25, 30} of marks, find the five-point summary and plot the box chart.
4. For the given Tables 2.6 and 2.7, perform descriptive analysis of the data:

Table 2.6: Sample Data (Weight)
42, 45, 47, 52, 6, 62, 7, 72, 75, 85

Table 2.7: Students Marks Table (subject column headers lost in the source)
1: 45, 705, 90, 40
2: 60, 25, 80, 5
3: –, 80, 90, 50
4: 80, 80, 90, 80
5: 85, 72, 70, 60

   (a) Find the minimum and maximum marks scored in each subject.
   (b) Find the details of the student who scored the highest marks in Maths.
   (c) Find the students with marks in English > 60 and Maths > 70.
5. For univariate attributes such as Weight, English and Maths marks (consider Tables 2.6 and 2.7), find the following:
   (a) Mean, median, mode
   (b) Weighted mean, geometric mean and harmonic mean
   (c) Variance and standard deviation
   (d) Absolute deviation, mean absolute deviation, and median absolute deviation
   (e) Coefficient of variation
   (f) Skewness and kurtosis
   (g) Five-point summary, IQR, semi-quartile range
6. For the bivariate data such as English and Maths marks (consider Tables 2.6 and 2.7), find:
   (a) Covariance and correlation between the two variables
   (b) Covariance between English and Hindi marks
7. Use appropriate data visualization to plot Table 2.6 using the following charts:
   (a) Bar plot and pie chart
   (b) Histogram, box plot, and QQ plot
   (c) Dot plot, line chart, scatter plot
   (d) Stem-and-leaf plot
8. Solve the following set of equations using the Gaussian elimination method.
   2x₁ + 5x₂ = 7
   6x₁ + 12x₂ = 18
9. Solve the following set of equations using the LU decomposition method.
10. Apply PCA for the following matrix and prove that it works. Perform matrix decompositions and prove that SVD works.
11. Apply SVD for the following matrix:
12. Find the covariance and correlation coefficients for the following two sets of data:
    X: 1, 2, 6, 12
    Y: 8, 12, 18, 22

Crossword

Across
2. Mean and mode are examples of ___ of univariate data.
4. Noisy data is ___ (normal/abnormal) data.
7. A branch of statistics that is used for summarizing data is called ___ statistics.
10. Kurtosis characterizes the ___ of data.
12. The assumption to be tested on data is called a ___.
13. Raw facts are called ___.
14. Data wrangling refers to making data suitable for processing. (Yes/No)
15. Pairplot is used to visualize univariate data. (Yes/No)

Down
1. The averaged squared distance from the mean is called ___.
* The characteristics of Big Data are volume, velocity and ___.
* A dataset of two variables is called ___ data.
* Visualization helps in the presentation of data. (Yes/No)
* Normalized covariance is called ___.
* Processed data is ___.
* Incorrect rejection of a true hypothesis is called ___ error.

Word Search
Find and mark the words listed below.

[Word-search letter grid]

Words: Yes, Abnormal, Central tendency, Shape, Variety, Correlation, Information, Data, Bivariate, Type I, Yes, Descriptive, No, Hypothesis, Variance

Chapter 3
Basics of Learning Theory

"Predicting the future isn't magic, it's artificial intelligence." - Dave Waters

Learning is a process by which one can acquire knowledge and construct new ideas or concepts based on experiences. Machine learning is an intelligent way of learning general concepts from training examples without writing a program. There are many machine learning algorithms through which computers can intelligently learn from past data or experiences, identify patterns, and make predictions when new data is fed. This chapter deals with an introduction to concept learning, hypothesis spaces, and the modelling of machine learning.
Learning Objectives
* Understand the basics of learning theory and the key elements of machine learning
* Introduce concept learning to obtain an abstraction and generalization from the data
* Learn about hypothesis representation language and explore searching the hypothesis space using the Find-S algorithm and its limitations
* Study version spaces, the List-Then-Eliminate algorithm and the Candidate Elimination algorithm
* Introduce inductive bias, a set of prior assumptions considered by a learning algorithm beyond the training data in order to perform induction
* Discuss the tradeoff between the two factors called bias and variance that exists when modelling a machine learning algorithm
* Understand the different challenges in model selection and model evaluation
* Study the popular re-sampling model selection method called Cross-Validation (K-fold, LOOCV, etc.) to tune machine learning models
* Introduce the various learning frameworks and learn to evaluate hypotheses

3.1 INTRODUCTION TO LEARNING AND ITS TYPES
The process of acquiring knowledge and expertise through study, experience, or being taught is called learning. Generally, humans learn in different ways. To make machines learn, we need to simulate the strategies of human learning in machines. But, will the computers learn? This question has been raised over many centuries by philosophers, mathematicians and logicians.
First let us address the question – What sort of tasks can the computers learn? This depends on the nature of problems that the computers can solve. There are two kinds of problems – well-posed and ill-posed. Computers can solve only well-posed problems, as these have well-defined specifications and have the following components inherent to them:
1. Class of learning tasks (T)
2. A measure of performance (P)
3. A source of experience (E)
The standard definition of learning proposed by Tom Mitchell is that a program can learn from experience E for the task T, and its performance P improves with experience E.
Let us formalize the concept of learning as follows. Let x be the input and X be the input space, which is the set of all inputs, and let Y be the output space, which is the set of all possible outputs, that is, yes/no. Let D be the input dataset with examples (x1, y1), (x2, y2), ..., (xn, yn). Let the unknown target function be f: X → Y, which maps the input space to the output space. The objective of the learning program is to pick a function g: X → Y that approximates the target function f. All the possible formulae form a hypothesis space. In short, let H be the set of all formulae from which the learning algorithm chooses. The choice is good when the hypothesis g replicates f for all samples. This is shown in Figure 3.1.

Figure 3.1: Learning Environment (training samples, candidate formulae in the hypothesis space H, the ideal target hypothesis f(x), and the generated hypothesis g(x))

It can be observed that the training samples and the target function are dependent on the given problem, while the learning algorithm and the hypothesis set are independent of it. Thus, a learning model is informally the hypothesis set together with the learning algorithm, and can be stated as follows:
Learning Model = Hypothesis Set + Learning Algorithm
Let us assume a problem of predicting a label for a given input. Let D be the input dataset with both positive and negative examples, and let y be the output with class +1 or -1. The simple learning model can be given as:
Σ xi wi > Threshold, belongs to class +1, and
Σ xi wi < Threshold, belongs to the other class.
This can be put into a single equation as follows:
h(x) = sign((Σ xi wi) + b)
where x1, x2, ..., xn are the components of the input vector, w1, w2, ..., wn are the weights, and +1 and -1 represent the two classes. This simple model is called the perceptron model. One can simplify this by treating the threshold as a weight w0 = b with a fixed input x0 = 1; then the model can further be simplified as:
h(x) = sign(wT x)
This is called the perceptron learning algorithm. The formal learning models are discussed later in Section 3.7 of this chapter.
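The perceptron decision rule and its weight-update step can be made concrete with a small sketch. The following Python snippet is only an assumed illustration (the toy data, learning rate and number of epochs are invented for the example), not code from the book:

```python
import numpy as np

def predict(w, x):
    # h(x) = sign(w^T x); w[0] is the bias weight with a fixed input x0 = 1
    return 1 if np.dot(w, np.insert(x, 0, 1.0)) > 0 else -1

def perceptron_train(X, y, epochs=10, lr=1.0):
    # Toy trainer: update weights only on misclassified samples
    w = np.zeros(X.shape[1] + 1)              # one extra weight for the bias
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if predict(w, xi) != yi:           # misclassified -> move the boundary
                w += lr * yi * np.insert(xi, 0, 1.0)
    return w

# Small linearly separable dataset (assumed for illustration)
X = np.array([[2.0, 3.0], [1.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(w, [predict(w, xi) for xi in X])
```

For linearly separable data such as this toy set, the update rule eventually stops changing the weights and every training sample is classified correctly.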
Classical and Adaptive Machine Learning Systems
A classical machine learning system has components such as Input, Process and Output. The input values are taken from the environment directly. These values are processed and a hypothesis is generated as the output model. This model is then used for making predictions. The predicted values are consumed by the environment.
In contrast to classical systems, adaptive systems interact with the input for getting labelled data, as direct inputs are not available. This process is called reinforcement learning. In reinforcement learning, a learning agent interacts with the environment and in return gets feedback. Based on the feedback, the learning agent generates input samples for learning, which are used for generating the learning model. Such learning agents are not static and change their behaviour according to the external signal received from the environment. The feedback is known as a reward, and learning here is the ability of the learning agent to adapt to the environment based on the reward. These are the characteristics of an adaptive system.
Learning Types
There are different types of learning. Some of the different learning methods are as follows:
1. Learn by memorization, or learn by repetition, also called rote learning, is done by memorizing without understanding the logic or concept. Although rote learning is basically learning by repetition, from a machine learning perspective the learning occurs by simply comparing with the existing knowledge for the same input data and producing the output if present.
2. Learn by examples, also called learn by experience or from previously acquired knowledge, is like finding an analogy; it means performing inductive learning from observations that formulate a general concept. Here, the learner learns by inferring a general rule from the set of observations or examples. Therefore, inductive learning is also called discovery learning.
3. Learn by being taught by an expert or a teacher, generally called passive learning. However, there is a special kind of learning called active learning where the learner can interactively query a teacher/expert to label unlabelled data instances with the desired outputs.
4. Learning by critical thinking, also called deductive learning, deduces new facts or conclusions from related known facts and information.
5. Self learning, also called reinforcement learning, is a self-directed learning that normally learns from mistakes, punishments and rewards.
6. Learning to solve problems is a type of cognitive learning where learning happens in the mind and is possible by devising a methodology to achieve a goal. Here, the learner initially is not aware of the solution or the way to achieve the goal but only knows the goal.
The learning happens either directly from the initial state, by following the steps to achieve the goal, or indirectly, by inferring the behaviour.
7. Learning by generalizing explanations, also called explanation-based learning (EBL), is another learning method that exploits domain knowledge from experts to improve the accuracy of concepts learned by supervised learning.
Acquiring a general concept from specific instances of the training dataset is the main challenge of machine learning.
Scan for the figure showing 'Types of Learning'.

3.2 INTRODUCTION TO COMPUTATIONAL LEARNING THEORY
There are many questions that have been raised by mathematicians and logicians about how computers learn. Some of the questions are as follows:
1. How can a learning system predict an unseen instance?
2. How close is the hypothesis h to the target function f, when f itself is unknown?
3. How many samples are required?
4. Can we measure the performance of a learning system?
5. Is the solution obtained local or global?
These questions are the basis of a field called 'Computational Learning Theory', or COLT for short. It is a specialized field of study within machine learning. COLT deals with formal methods used for learning systems. It deals with frameworks for quantifying learning tasks and learning algorithms, and it provides a fundamental basis for the study of machine learning. It deals with Probably Approximately Correct (PAC) learning and Vapnik-Chervonenkis (VC) dimensions. The focus of PAC is the quantification of the computational difficulty of learning tasks and algorithms, while the quantification of computational capacity is the focus of the VC dimension.
Computational Learning Theory uses many concepts from diverse areas such as Theoretical Computer Science, Artificial Intelligence and Statistics. The core concept of COLT is the learning framework. One such important framework is PAC. The learning framework is discussed in a detailed manner in Section 3.7. COLT focuses on supervised learning tasks. Since the complexity of the analysis is high, binary classification tasks are normally considered for analysis.

3.3 DESIGN OF A LEARNING SYSTEM
A system that is built around a learning algorithm is called a learning system. The design of such systems focuses on these steps:
1. Choosing a training experience
2. Choosing a target function
3. Representation of the target function
4. Function approximation
Training Experience
Let us consider the design of a chess game. In direct experience, individual board states and correct moves of the chess game are given directly. In an indirect system, only the move sequences and results are given. The training experience also depends on the presence of a supervisor who can label all valid moves for a board state. In the absence of a supervisor, the game agent plays against itself and learns the good moves, provided the training samples cover all scenarios, or in other words, are distributed enough for performance computation. If the training samples and testing samples have the same distribution, the results would be good.
Determine the Target Function
The next step is the determination of a target function. In this step, the type of knowledge that needs to be learnt is determined. In direct experience, a board move is selected and it is determined whether it is a good move or not against all other moves. If it is the best move, then it is chosen using a mapping B → M, where B is the set of board states and M is the set of legal moves. In indirect experience, all legal moves are accepted and a score is generated for each.
The move with the largest score is then chosen and executed.
Determine the Target Function Representation
The representation of knowledge may be a table, a collection of rules or a neural network. A linear combination of board features can be used:
V = w0 + w1 x1 + w2 x2 + w3 x3
where x1, x2 and x3 represent different board features and w0, w1, w2 and w3 represent weights.
Choosing an Approximation Algorithm for the Target Function
The focus is to choose weights that fit the given training samples effectively. The aim is to reduce the error, given as:
E = Σb [Vtrain(b) - V̂(b)]²
Here, b is a training sample, Vtrain(b) is its training value and V̂(b) is the value predicted by the current hypothesis. The approximation is carried out as follows:
* Compute the error as the difference between the trained and expected hypothesis values. Let this error be error(b).
* Then, for every board feature xi, the weights are updated as:
wi = wi + μ × error(b) × xi
Here, μ is a constant that moderates the size of the weight update. (A small code sketch of this update is given after the list of system components below.)
Thus, the learning system has the following components:
* A Performance system to allow the game to play against itself.
* A Critic system to generate the samples.
* A Generalizer system to generate a hypothesis based on the samples.
* An Experimenter system to generate a new problem based on the currently learnt function. This is given as input to the performance system.
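The following Python snippet is an assumed illustration of the LMS-style weight update described above; the board features, training values and the constant μ are invented for the example, so it is a sketch of the idea rather than the book's implementation:

```python
# Minimal sketch of the LMS-style weight update described above.
# Each sample is (feature vector x, training value Vtrain(b)); all values are assumed.

def predicted_value(w, x):
    # V_hat(b) = w0 + w1*x1 + ... + wn*xn
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def lms_update(w, samples, mu=0.01, epochs=100):
    for _ in range(epochs):
        for x, v_train in samples:
            error = v_train - predicted_value(w, x)   # error(b)
            w[0] += mu * error                        # bias weight, feature fixed at 1
            for i, xi in enumerate(x, start=1):
                w[i] += mu * error * xi               # w_i <- w_i + mu * error(b) * x_i
    return w

# Toy board features and target values (assumed)
samples = [([3, 0, 1], 10.0), ([1, 2, 0], -5.0), ([2, 1, 1], 4.0)]
w = lms_update([0.0, 0.0, 0.0, 0.0], samples)
print([round(v, 3) for v in w])
```

Repeated passes over the samples drive the squared error E down, which is exactly the role the Critic and Generalizer play in the learning system described above.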
3.4 INTRODUCTION TO CONCEPT LEARNING
Concept learning is a learning strategy of acquiring abstract knowledge, inferring a general concept, or deriving a category from the given training samples. It is a process of abstraction and generalization from the data. Concept learning helps to classify an object that has a set of common, relevant features. Thus, it helps a learner compare and contrast categories based on the similarity and association of positive and negative instances in the training data to classify an object. The learner tries to simplify by observing the common features from the training samples and then applies this simplified model to the future samples. This task is also known as learning from experience.
Each concept or category obtained by learning is a Boolean valued function which takes a true or false value. For example, humans can identify different kinds of animals based on common relevant features and categorize all animals based on specific sets of features. The special features that distinguish one animal from another can be called a concept. This way of learning categories for objects and recognizing new instances of those categories is called concept learning. It is formally defined as inferring a Boolean valued function by processing training instances.
Concept learning requires three things:
1. Input – Training dataset, which is a set of training instances, each labelled with the name of the concept or category to which it belongs. This past experience is used to train and build the model.
2. Output – Target concept or target function f. It is a mapping function f(x) from input x to output y. Its purpose is to determine the specific or common features that identify an object. In other words, it is to find the hypothesis that determines the target concept. For example, the specific set of features that identifies an elephant among all animals.
3. Test – New instances to test the learned model.
Formally, concept learning is defined as: "Given a set of hypotheses, the learner searches through the hypothesis space to identify the best hypothesis that matches the target concept."
Consider the following set of training instances shown in Table 3.1.

Table 3.1: Sample Training Instances
No. | Horns | Tail  | Tusks | Paws | Fur | Color | Hooves | Size   | Elephant
1   | No    | Short | Yes   | No   | No  | Black | No     | Big    | Yes
2   | Yes   | Short | No    | No   | No  | Brown | Yes    | Medium | No
3   | No    | Short | Yes   | No   | No  | Black | No     | Medium | Yes
4   | No    | Long  | No    | Yes  | Yes | White | No     | Medium | No
5   | No    | Short | Yes   | Yes  | Yes | Black | No     | Big    | Yes

Here, in this set of training instances, the independent attributes considered are 'Horns', 'Tail', 'Tusks', 'Paws', 'Fur', 'Color', 'Hooves' and 'Size'. The dependent attribute is 'Elephant'. The target concept is to identify the animal as an Elephant. Let us now take this example and understand further the concept of a hypothesis.
Target Concept: Predict the type of animal – for example, 'Elephant'.
3.4.1 Representation of a Hypothesis
A hypothesis 'h' approximates a target function 'f' to represent the relationship between the independent attributes and the dependent attribute of the training instances. The hypothesis is the predicted approximate model that best maps the inputs to outputs. Each hypothesis is represented as a conjunction of attribute conditions in the antecedent part.
For example, (Tail = Short) ∧ (Color = Black).
The set of hypotheses in the search space is called the hypotheses. Hypotheses is the plural form of hypothesis. Generally, 'H' is used to represent the hypotheses and 'h' is used to represent a candidate hypothesis.
Each attribute condition is a constraint on the attribute, represented as an attribute-value pair. In the antecedent of an attribute condition of a hypothesis, each attribute can take the value '?' or 'ϕ', or can hold a single value.
* '?' denotes that the attribute can take any value [e.g., Color = ?]
* 'ϕ' denotes that the attribute cannot take any value, i.e., it represents a null value [e.g., Horns = ϕ]
* A single value denotes a specific single value from the acceptable values of the attribute, i.e., the attribute 'Tail' can take a value such as 'Short' [e.g., Tail = Short]
For example, with the attributes (Horns, Tail, Tusks, Paws, Fur, Color, Hooves, Size), a hypothesis 'h' may look like:
h = <No, Short, Yes, ?, ?, Black, No, ?>
Given a test instance x, we say h(x) = 1 if the test instance x satisfies this hypothesis h.
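This representation can be held in code as a simple tuple of attribute constraints matched against a test instance. The short Python sketch below is an assumed example (the attribute ordering follows Table 3.1; the helper name and the 'phi' marker are invented), not code from the book:

```python
# A hypothesis is a tuple of constraints, one per attribute:
# '?'   -> any value is acceptable
# 'phi' -> no value is acceptable
# otherwise an exact attribute value is required.
ATTRIBUTES = ["Horns", "Tail", "Tusks", "Paws", "Fur", "Color", "Hooves", "Size"]

def satisfies(hypothesis, instance):
    """Return 1 if the instance satisfies every attribute constraint, else 0."""
    for constraint, value in zip(hypothesis, instance):
        if constraint == "phi":                       # null constraint matches nothing
            return 0
        if constraint != "?" and constraint != value:
            return 0
    return 1

h = ("No", "Short", "Yes", "?", "?", "Black", "No", "?")
x = ("No", "Short", "Yes", "No", "No", "Black", "No", "Big")   # instance 1 of Table 3.1
print(satisfies(h, x))   # 1 -> classified as an elephant
```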
The training dataset given above has 5 training instances with 8 independent attributes and one dependent attribute. Different hypotheses can be predicted for the target concept. The task is to predict the best hypothesis for the target concept (an elephant).
The most general hypothesis allows any value for each of the attributes. It is represented as <?, ?, ?, ?, ?, ?, ?, ?>. This hypothesis indicates that any animal can be an elephant.
The most specific hypothesis does not allow any value for any of the attributes: <ϕ, ϕ, ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>. This hypothesis indicates that no animal can be an elephant.
The target concept in this example is to identify the conjunction of specific features from the training instances that correctly identifies an elephant.

Example 3.1: Explain the concept learning task of an Elephant from the dataset given in Table 3.1.
Given,
Input: 5 instances, each with 8 attributes
Target concept/function c: Elephant → {Yes, No}
Hypotheses H: Set of hypotheses, each a conjunction of literals as propositions [i.e., each literal is represented as an attribute-value pair]
Solution: The hypothesis 'h' for the concept learning task of an Elephant is given as:
h = <No, Short, Yes, ?, ?, Black, No, ?>
This hypothesis h is expressed in propositional logic form as below:
(Horns = No) ∧ (Tail = Short) ∧ (Tusks = Yes) ∧ (Paws = ?) ∧ (Fur = ?) ∧ (Color = Black) ∧ (Hooves = No) ∧ (Size = ?)
Output: Learn the hypothesis 'h' to predict an 'Elephant' such that for a given test instance x, h(x) = c(x).
The hypothesis produced is also called a concept description, which is a model that can be used to classify subsequent instances.
Thus, concept learning can also be called Inductive Learning, which tries to induce a general function from specific training instances. This way of learning a hypothesis that produces an approximate target function from a sufficiently large set of training instances, and that can also approximately classify other unobserved instances, is called the inductive learning hypothesis. We can only determine an approximate target function because it is very difficult to find an exact target function from the observed training instances. That is why a hypothesis is an approximate target function that best maps the inputs to outputs.
3.4.2 Hypothesis Space
The hypothesis space is the set of all possible hypotheses that approximate the target function f. In other words, the set of all possible approximations of the target function can be defined as the hypothesis space. From this set of hypotheses in the hypothesis space, a machine learning algorithm determines the best possible hypothesis that best describes the target function or best fits the outputs. Generally, a hypothesis representation language represents a larger hypothesis space. Every machine learning algorithm represents the hypothesis space in a different manner with respect to the function that maps the input variables to output variables. For example, a regression algorithm represents the hypothesis space as a linear function, whereas a decision tree algorithm represents the hypothesis space as a tree.
The set of hypotheses that can be generated by a learning algorithm can be further reduced by specifying a language bias.
The subset of the hypothesis space that is consistent with all observed training instances is called the Version Space. The version space represents the only hypotheses that are used for the classification.
For example, each of the attributes given in Table 3.1 has the following possible set of values:
Horns – Yes, No
Tail – Long, Short
Tusks – Yes, No
Paws – Yes, No
Fur – Yes, No
Color – Brown, Black, White
Hooves – Yes, No
Size – Medium, Big
Considering these values for each of the attributes, there are (2 × 2 × 2 × 2 × 2 × 3 × 2 × 2) = 384 distinct instances, covering all the 5 instances in the training dataset. We can generate (4 × 4 × 4 × 4 × 4 × 5 × 4 × 4) = 81,920 distinct hypotheses when including the two extra values [?, ϕ] for each attribute. However, any hypothesis containing one or more ϕ symbols represents the empty set of instances; that is, it classifies every instance as a negative instance. Therefore, there will be (3 × 3 × 3 × 3 × 3 × 4 × 3 × 3) + 1 = 8,749 semantically distinct hypotheses, obtained by including only '?' as the extra value for each attribute plus one hypothesis representing the empty set of instances. Thus, the hypothesis space is much larger, and hence we need efficient learning algorithms to search for the best hypothesis from the set of hypotheses.
Hypothesis ordering is also important, wherein the hypotheses are ordered from the most specific to the most general in order to restrict searching the hypothesis space exhaustively.
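These counts can be verified with a few lines of Python. The snippet below is a small illustrative check (not from the book) that recomputes the number of distinct instances and hypotheses from the attribute value counts listed above:

```python
from math import prod

# Number of possible values for each of the 8 attributes of Table 3.1
value_counts = [2, 2, 2, 2, 2, 3, 2, 2]   # Horns, Tail, Tusks, Paws, Fur, Color, Hooves, Size

distinct_instances = prod(value_counts)                      # 384
syntactic_hypotheses = prod(v + 2 for v in value_counts)     # '?' and 'phi' added: 81,920
semantic_hypotheses = prod(v + 1 for v in value_counts) + 1  # only '?' added, plus the all-phi hypothesis: 8,749

print(distinct_instances, syntactic_hypotheses, semantic_hypotheses)
```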
3.4.3 Heuristic Space Search
Heuristic search is a search strategy that finds an optimized hypothesis/solution to a problem by iteratively improving the hypothesis/solution based on a given heuristic function or a cost measure. Heuristic search methods generate a possible hypothesis that can be a solution in the hypothesis space or a path from the initial state. This hypothesis is then tested against the target function or the goal state to see if it is a real solution. If the tested hypothesis is a real solution, then it is selected. This method generally increases efficiency because it is guaranteed to find a better hypothesis, though it may not be the best hypothesis. It is useful for solving tough problems that cannot be solved by any other method. A typical example problem solved by heuristic search is the travelling salesman problem.
Several commonly used heuristic search methods are hill climbing methods, constraint satisfaction problems, best-first search, simulated annealing, the A* algorithm, and genetic algorithms.
3.4.4 Generalization and Specialization
In order to understand how we construct this concept hierarchy, let us apply the general principle of the generalization/specialization relation. By generalization of the most specific hypothesis, and by specialization of the most general hypothesis, the hypothesis space can be searched for an approximate hypothesis that matches all positive instances but does not match any negative instance.
Searching the Hypothesis Space
There are two ways of learning a hypothesis consistent with all training instances from the large hypothesis space:
1. Specialization – General to Specific learning
2. Generalization – Specific to General learning
Scan for information on 'Additional Examples on Generalization and Specialization'.
Generalization – Specific to General Learning
This learning methodology searches through the hypothesis space for an approximate hypothesis by generalizing the most specific hypothesis.

Example 3.2: Consider the training instances shown in Table 3.1 and illustrate Specific to General learning.
Solution: We start from all false, that is, the most specific hypothesis, to determine the most restrictive specialization. Consider only the positive instances and generalize the most specific hypothesis; ignore the negative instances. This learning is illustrated as follows.
The most specific hypothesis is taken first, which will not classify any instance to true:
h = <ϕ, ϕ, ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>
Read the first instance I1 and generalize the hypothesis h so that this positive instance is classified by the hypothesis h1.
I1: No Short Yes No No Black No Big | Yes (Positive instance)
h1 = <No, Short, Yes, No, No, Black, No, Big>
When reading the second instance I2, it is a negative instance, so ignore it.
I2: Yes Short No No No Brown Yes Medium | No (Negative instance)
h2 = <No, Short, Yes, No, No, Black, No, Big>
Similarly, when reading the third instance I3, it is a positive instance, so generalize h2 to h3 to accommodate it. The resulting h3 is generalized.
I3: No Short Yes No No Black No Medium | Yes (Positive instance)
h3 = <No, Short, Yes, No, No, Black, No, ?>
Ignore I4, since it is a negative instance.
I4: No Long No Yes Yes White No Medium | No (Negative instance)
h4 = <No, Short, Yes, No, No, Black, No, ?>
When reading the fifth instance I5, h4 is further generalized to h5.
I5: No Short Yes Yes Yes Black No Big | Yes (Positive instance)
h5 = <No, Short, Yes, ?, ?, Black, No, ?>
h5 classifies all the positive instances to true.

Specialization – General to Specific Learning
This learning methodology searches through the hypothesis space for an approximate hypothesis by specializing the most general hypothesis. Consider the same training instances shown in Table 3.1: start with the most general hypothesis and specialize it with every negative instance so that no negative instance is classified to true.
I1: No Short Yes No No Black No Big | Yes (Positive instance)
h1 = <?, ?, ?, ?, ?, ?, ?, ?>
I2: Yes Short No No No Brown Yes Medium | No (Negative instance)
h2 = {<?, ?, Yes, ?, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Black, ?, ?>, <?, ?, ?, ?, ?, ?, No, ?>, <?, ?, ?, ?, ?, ?, ?, Big>}
h2 imposes constraints so that it will not classify a negative instance to true.
I3: No Short Yes No No Black No Medium | Yes (Positive instance)
h3 = {<?, ?, Yes, ?, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Black, ?, ?>, <?, ?, ?, ?, ?, ?, No, ?>, <?, ?, ?, ?, ?, ?, ?, Big>}
I4: No Long No Yes Yes White No Medium | No (Negative instance)
Remove any hypothesis that is inconsistent with this negative instance.
h4 = {<?, ?, Yes, ?, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Black, ?, ?>, <?, ?, ?, ?, ?, ?, ?, Big>}
I5: No Short Yes Yes Yes Black No Big | Yes (Positive instance)
h5 = {<?, ?, Yes, ?, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Black, ?, ?>, <?, ?, ?, ?, ?, ?, ?, Big>}
Thus, h5 is the set of hypotheses generated that classifies the positive instances to true and the negative instances to false.

3.4.5 Hypothesis Space Search by Find-S Algorithm
The Find-S algorithm is guaranteed to converge to the most specific hypothesis in H that is consistent with the positive instances in the training dataset. It will, of course, also be consistent with the negative instances. Thus, this algorithm considers only the positive instances and eliminates negative instances while generating the hypothesis. It initially starts with the most specific hypothesis.

Algorithm 3.1: Find-S
Input: Positive instances in the training dataset
Output: Hypothesis 'h'
1. Initialize 'h' to the most specific hypothesis:
   h = <ϕ, ϕ, ϕ, ..., ϕ>
2. Generalize the initial hypothesis for the first positive instance [since 'h' is more specific].
3. For each subsequent instance:
   If it is a positive instance,
     check each attribute value in the instance against the hypothesis:
       if the attribute value is the same as the hypothesis value, then do nothing;
       else, if the attribute value is different from the hypothesis value, change it to '?' in 'h'.
   Else, if it is a negative instance,
     ignore it.

Example 3.3: Consider the training dataset of 4 instances shown in Table 3.2. It contains the details of the performance of students and their likelihood of getting a job offer or not in their final semester. Apply the Find-S algorithm.

Table 3.2: Training Dataset
CGPA | Interactiveness | Practical Knowledge | Communication Skills | Logical Thinking | Interest | Job Offer
≥9   | Yes             | Excellent           | Good                 | Fast             | Yes      | Yes
≥9   | Yes             | Good                | Good                 | Fast             | Yes      | Yes
≥8   | No              | Good                | Good                 | Fast             | No       | No
≥9   | Yes             | Good                | Good                 | Slow             | No       | Yes

Solution:
Step 1: Initialize 'h' to the most specific hypothesis. There are 6 attributes, so for each attribute we initially fill 'ϕ' in the initial hypothesis 'h':
h = <ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>
Step 2: Scan the first instance I1. Since it is a positive instance, generalize the most specific hypothesis to include it.
I1: ≥9 Yes Excellent Good Fast Yes (Positive instance)
h = <≥9, Yes, Excellent, Good, Fast, Yes>
Step 3: Scan the next instance I2. Since I2 is a positive instance, generalize 'h' to include it. For each non-matching attribute value in 'h', put a '?'. The third attribute value of 'h' mismatches with I2, so put a '?' there.
I2: ≥9 Yes Good Good Fast Yes (Positive instance)
h = <≥9, Yes, ?, Good, Fast, Yes>
Now, scan I3. Since it is a negative instance, ignore it. Hence, the hypothesis remains the same without any change after scanning I3.
I3: ≥8 No Good Good Fast No (Negative instance)
h = <≥9, Yes, ?, Good, Fast, Yes>
Now scan I4. Since it is a positive instance, check for mismatches in the hypothesis 'h' with I4. The 5th and 6th attribute values mismatch, so change those attributes to '?' in 'h'.
I4: ≥9 Yes Good Good Slow No (Positive instance)
h = <≥9, Yes, ?, Good, ?, ?>
Now, the final hypothesis generated by the Find-S algorithm is:
h = <≥9, Yes, ?, Good, ?, ?>
It includes all positive instances and obviously ignores any negative instance.
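The Find-S walk-through above can be captured in a few lines of code. The following Python sketch is an assumed illustration of the algorithm on the Table 3.2 data (the attribute ordering, the 'phi' marker and the string encoding of values are choices made for this example), not the book's own implementation:

```python
# Minimal sketch of the Find-S algorithm on the Table 3.2 data.
# 'phi' marks the most specific (null) constraint and '?' accepts any value.

def find_s(training_data):
    n_attrs = len(training_data[0][0])
    h = ["phi"] * n_attrs                      # most specific hypothesis
    for instance, label in training_data:
        if label != "Yes":                     # negative instances are ignored
            continue
        for i, value in enumerate(instance):
            if h[i] == "phi":                  # first positive instance: copy its values
                h[i] = value
            elif h[i] != value:                # mismatch: generalize to '?'
                h[i] = "?"
    return h

# (CGPA, Interactiveness, Practical Knowledge, Communication Skills, Logical Thinking, Interest) -> Job Offer
data = [
    ((">=9", "Yes", "Excellent", "Good", "Fast", "Yes"), "Yes"),
    ((">=9", "Yes", "Good",      "Good", "Fast", "Yes"), "Yes"),
    ((">=8", "No",  "Good",      "Good", "Fast", "No"),  "No"),
    ((">=9", "Yes", "Good",      "Good", "Slow", "No"),  "Yes"),
]
print(find_s(data))   # ['>=9', 'Yes', '?', 'Good', '?', '?']
```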
Limitations of Find-S Algorithm
1. The Find-S algorithm tries to find a hypothesis that is consistent with the positive instances, ignoring all negative instances. As long as the training dataset is consistent, the hypothesis found by this algorithm may be consistent.
2. The algorithm finds only one unique hypothesis, whereas there may be many other hypotheses that are consistent with the training dataset.
3. Many times, the training dataset may contain some errors; such inconsistent data instances can mislead this algorithm in determining the consistent hypothesis, since it ignores negative instances.
Hence, it is necessary to find the set of hypotheses that are consistent with the training data, including the negative examples. To overcome the limitations of the Find-S algorithm, the Candidate Elimination algorithm was proposed to output the set of all hypotheses consistent with the training dataset.
3.4.6 Version Spaces
The version space contains the subset of hypotheses from the hypothesis space that is consistent with all training instances in the training dataset.
Scan for information on 'Additional Examples on Version Spaces'.
List-Then-Eliminate Algorithm
The principal idea of this learning algorithm is to initialize the version space to contain all hypotheses and then eliminate any hypothesis that is found inconsistent with any training instance. Initially, the algorithm starts with a version space containing all hypotheses and scans each training instance. The hypotheses that are inconsistent with the training instance are eliminated. Finally, the algorithm outputs the list of remaining hypotheses, which are all consistent.

Algorithm 3.2: List-Then-Eliminate
Input: Version space – a list of all hypotheses
Output: Set of consistent hypotheses
1. Initialize the version space with a list of all hypotheses.
2. For each training instance,
   * remove from the version space any hypothesis that is inconsistent with it.

This algorithm works fine if the hypothesis space is finite, but practically it is difficult to deploy. Hence, a variation of this idea is introduced in the Candidate Elimination algorithm.
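To make the elimination step concrete, here is a small Python sketch of the List-Then-Eliminate idea. It is an assumed illustration (the two-attribute toy domains, the hypothesis generator and the satisfies helper are invented for this example, and ϕ hypotheses are omitted); it is workable only because the toy hypothesis space is tiny:

```python
from itertools import product

# Attribute domains for a tiny toy problem (assumed): two attributes only.
domains = [["Yes", "No"], ["Fast", "Slow"]]

def satisfies(h, x):
    # '?' accepts any value; otherwise the value must match exactly
    return all(c == "?" or c == v for c, v in zip(h, x))

def consistent(h, instance, label):
    # A hypothesis is consistent if it classifies the instance the same way as its label
    return satisfies(h, instance) == (label == "Yes")

# Version space: every hypothesis made of '?' or a concrete value per attribute
version_space = list(product(*[d + ["?"] for d in domains]))

training_data = [(("Yes", "Fast"), "Yes"), (("No", "Fast"), "No")]
for instance, label in training_data:
    version_space = [h for h in version_space if consistent(h, instance, label)]

print(version_space)   # hypotheses consistent with both training instances
```

Enumerating the full hypothesis space this way quickly becomes infeasible for realistic attribute sets, which is exactly why the Candidate Elimination algorithm keeps only the boundary hypotheses instead.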
Version Spaces and the Candidate Elimination Algorithm
Version space learning generates all consistent hypotheses. This algorithm computes the version space by combining the two cases, namely:
* Specific to General learning – Generalize S to include the positive example
* General to Specific learning – Specialize G to exclude the negative example
Using the Candidate Elimination algorithm, we can compute the version space containing all (and only those) hypotheses from H that are consistent with the given observed sequence of training instances. The algorithm defines two boundaries: the 'general boundary', which is the set of all hypotheses that are the most general, and the 'specific boundary', which is the set of all hypotheses that are the most specific. Thus, the algorithm limits the version space to contain only the hypotheses bounded by the most general and most specific ones, and it provides a compact representation of the List-Then-Eliminate algorithm.

Algorithm 3.3: Candidate Elimination
Input: Set of instances in the training dataset
Output: Hypothesis boundaries G and S
1. Initialize G to the maximally general hypotheses.
2. Initialize S to the maximally specific hypotheses.
   * Generalize the initial hypothesis for the first positive instance.
3. For each subsequent new training instance:
   * If the instance is positive,
     o Generalize S to include the positive instance:
       > Check the attribute values of the positive instance against S.
         - If an attribute value of the positive instance and S are different, fill that field value with '?'.
         - If an attribute value of the positive instance and S are the same, then make no change.
     o Prune G to exclude all hypotheses in G that are inconsistent with the positive instance.
   * If the instance is negative,
     o Specialize G to exclude the negative instance:
       > Add to G all minimal specializations that exclude the negative example and are consistent with S.
         - If an attribute value of S and the negative instance are different, then fill that attribute value with the S value.
         - If an attribute value of S and the negative instance are the same, there is no need to update 'G'; fill that attribute value with '?'.
     o Remove from S all hypotheses inconsistent with the negative instance.

Generating Positive Hypothesis 'S'
If it is a positive example, refine S to include the positive instance. We need to generalize S to include the positive instance. The hypothesis is the conjunction of 'S' and the positive instance. When generalizing, for the first positive instance, add to S all minimal generalizations such that S is filled with the attribute values of the positive instance. For the subsequent positive instances scanned, check the attribute value of the positive instance and S obtained in the
