Tycs Ai Unit 2

This document discusses machine learning from examples using supervised learning techniques. It covers several key concepts:
- Supervised learning involves using labeled training examples (inputs with corresponding outputs) to learn a function that maps inputs to outputs. This is done by searching for a hypothesis that best approximates the true function.
- Decision trees are a simple yet powerful representation for supervised learning problems. A decision tree performs a sequence of tests on attribute values of inputs to reach an output classification.
- The document outlines the process for inducing a decision tree from labeled training examples using a greedy search approach. It aims to select the attribute at each step that best splits the examples into homogeneous sets.


TYCS

USCS501
Artificial
Intelligence
UNIT II
Learning from Examples: Forms of
Learning,
Supervised Learning,
Learning Decision Trees,
Evaluating and Choosing the Best
Hypothesis,
Theory of Learning,
Regression and Classification with
Linear Models,
Artificial Neural Networks,
Nonparametric Models,
Support Vector Machines,
Ensemble Learning,
Practical Machine Learning
Learning from
Examples
An agent is learning if it improves its performance on
future tasks after making observations about the world.

 Learning can range from the trivial, as exhibited by jotting down a phone number, to the profound, as exhibited by Albert Einstein, who inferred a new theory of the universe.

 Why would we want an agent to learn? If the design of the agent can be improved, why wouldn’t the designers just program in that improvement to begin with?

 There are three main reasons.


 First, the designers cannot anticipate all possible situations
that the agent might find itself in.
 For example, a robot designed to navigate mazes must
learn the layout of each new maze it encounters.

 Second, the designers cannot anticipate all changes over time; a program designed to predict tomorrow’s stock market prices must learn to adapt when conditions change from boom to bust.

 Third, sometimes human programmers have no idea how to program a solution themselves.
 For example, most people are good at recognizing the faces of family members, but even the best programmers are unable to program a computer to accomplish that task, except by using learning algorithms.
FORMS OF LEARNING
 Any component of an agent can be improved by learning
from data.

 The improvements, and the techniques used to make them, depend on four major factors:
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.
 The components of these agents include:

1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the
percept sequence.
3. Information about the way the world evolves and about the
results of possible actions the agent can take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of
actions.
6. Goals that describe classes of states whose achievement
maximizes the agent’s utility.
Supervised Learning
 The task of supervised learning is:
 Given a training set of N example input–output pairs
 (x1, y1), (x2, y2), . . . (xN, yN) ,
 where each yj was generated by an unknown function y
= f(x), discover a function h that approximates the true
function f.
 Here x and y can be any value; they need not be
numbers. The function h is a hypothesis.
 Learning is a search through the space of possible
hypotheses for one that will perform well, even on new
examples beyond the training set.
 To measure the accuracy of a hypothesis we give it a
test set of examples that are distinct from the training
set.
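As a minimal illustration of measuring a hypothesis on a test set (the hypothesis h and the data below are invented for the example, not taken from the slides):

```python
# A sketch of measuring hypothesis accuracy on a held-out test set.

def accuracy(h, test_set):
    """Fraction of test examples (x, y) for which h(x) == y."""
    correct = sum(1 for x, y in test_set if h(x) == y)
    return correct / len(test_set)

# Suppose the true function is f(x) = (x > 5) and our learned hypothesis
# approximates it with a slightly different threshold.
h = lambda x: x > 4
test_set = [(2, False), (5, False), (7, True), (9, True)]
print(accuracy(h, test_set))   # 0.75
```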
LEARNING DECISION
TREES
 A decision tree is one of the simplest and yet most successful forms of machine learning.
 We first describe the representation—the hypothesis space—and then show how to learn a good hypothesis.
The Decision Tree
Representation
 A decision tree represents a function that takes as input a vector of attribute values and returns a “decision”—a single output value.
 The input and output values can be discrete or continuous.

 We concentrate on problems where the inputs have discrete values and the output has exactly two possible values; this is Boolean classification, where each example input will be classified as true (a positive example) or false (a negative example).
 A decision tree reaches its decision by performing a
sequence of tests.

 Each internal node in the tree corresponds to a test of the value of one of the input attributes, Ai, and the branches from the node are labeled with the possible values of the attribute, Ai = vik.
 Each leaf node in the tree specifies a value to be returned by the function.
 The decision tree representation is natural for humans;
indeed, many “How To” manuals (e.g., for car repair) are
written entirely as a single decision tree stretching over
hundreds of pages.
 As an example, we will build a decision tree to decide
whether to wait for a table at a restaurant.
 The aim here is to learn a definition for the goal predicate
“WillWait” .
 First we list the attributes that we will consider as part of
the input:
1. Alternate: whether there is a suitable alternative
restaurant nearby.
2. Area: whether the restaurant has a comfortable bar area to
wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full).
6. Price: the restaurant’s price range.

7. Raining: whether it is raining outside.

8. Reservation: whether we made a reservation.

9. Type: the kind of restaurant (French, Italian, Thai, or burger).

10. WaitEstimate: the wait estimated by the host (0–10 minutes, 10–30, 30–60, or >60).
Expressiveness of decision
trees
 A Boolean decision tree is logically equivalent to
the assertion that the goal attribute is true if
and only if the input attributes satisfy one of the
paths leading to a leaf with value true.
 Writing this out in propositional logic, we have
 Goal ⇔ (Path1 ∨ Path2 ∨ ・ ・ ・ ) ,
 where each Path is a conjunction of attribute-
value tests required to follow that path.
 Thus, the whole expression is equivalent to
disjunctive normal form , which means that any
function in propositional logic can be expressed
as a decision tree.
Inducing decision
trees from examples
 An example for a Boolean decision tree consists of an (x, y) pair, where x is a vector of values for the input attributes, and y is a single Boolean output value.
 A training set of 12 examples is shown in Figure 18.3.
 The positive examples are the ones in which the goal WillWait is true (x1, x3, . . .);
 the negative examples are the ones in which it is false (x2, x5, . . .).
 Figure 18.4(a) shows that Type is a poor attribute, because it
leaves us with four possible outcomes, each of which has
the same number of positive as negative examples.

 On the other hand, in (b) we see that Patrons is a fairly important attribute, because if the value is None or Some, then we are left with example sets for which we can answer definitively (No and Yes, respectively).
 If the value is Full, we are left with a mixed set of examples.
 In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning problem in itself, with fewer examples and one less attribute.
 There are four cases to consider for these recursive
problems:
1. If the remaining examples are all positive (or all negative),
then we are done: we can answer Yes or No. Figure 18.4(b)
shows examples of this happening in the None and Some
branches.

2. If there are some positive and some negative examples, then choose the best attribute to split them. Figure 18.4(b) shows Hungry being used to split the remaining examples.

3. If there are no examples left, it means that no example has been observed for this combination of attribute values, and we return a default value calculated from the classification of all the examples that were used in constructing the node’s parent.
4. If there are no attributes left, but both positive and
negative examples, it means that these examples have
exactly the same description, but different classifications.

 This can happen because there is an error or noise in the data; because the domain is nondeterministic; or because we can’t observe an attribute that would distinguish the examples.
 The best we can do is return the classification of the remaining examples.
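The four recursive cases can be sketched in Python. This is a simplified illustration, not the textbook’s DECISION-TREE-LEARNING pseudocode: choose_attribute here is a placeholder that just takes the first remaining attribute, where a real learner would use information gain.

```python
from collections import Counter

def plurality_value(examples):
    """Most common output value among a list of (x, y) examples."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def choose_attribute(attributes, examples):
    # Placeholder: a real learner picks the attribute with the highest
    # information gain; here we simply take the first remaining one.
    return attributes[0]

def learn_tree(examples, attributes, parent_examples):
    if not examples:                      # case 3: no examples left
        return plurality_value(parent_examples)
    labels = {y for _, y in examples}
    if len(labels) == 1:                  # case 1: all positive or all negative
        return labels.pop()
    if not attributes:                    # case 4: mixed labels, no attributes
        return plurality_value(examples)
    a = choose_attribute(attributes, examples)   # case 2: split and recurse
    return {a: {v: learn_tree([(x, y) for x, y in examples if x[a] == v],
                              [b for b in attributes if b != a],
                              examples)
                for v in {x[a] for x, _ in examples}}}

examples = [({'Patrons': 'None'}, 'No'),
            ({'Patrons': 'Some'}, 'Yes'),
            ({'Patrons': 'Full'}, 'No')]
print(learn_tree(examples, ['Patrons'], examples))
```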
Choosing attribute
tests
 The greedy search used in decision tree learning is
designed to approximately minimize the depth of
the final tree.
 The idea is to pick the attribute that goes as far as
possible toward providing an exact classification of
the examples.
 A perfect attribute divides the examples into sets,
each of which are all positive or all negative and
thus will be leaves of the tree.
 The Patrons attribute is not perfect, but it is fairly
good. A really useless attribute, such as Type,
leaves the example sets with roughly the same
proportion of positive and negative examples as
the original set.
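The standard way to quantify how far an attribute goes toward an exact classification is information gain, the expected reduction in entropy. A sketch, using the restaurant figures from the text (6 positive and 6 negative examples):

```python
from math import log2

def entropy_bool(p, n):
    """Entropy of a Boolean distribution with p positives and n negatives."""
    if p == 0 or n == 0:
        return 0.0
    q = p / (p + n)
    return -(q * log2(q) + (1 - q) * log2(1 - q))

def information_gain(p, n, subsets):
    """Gain of an attribute splitting (p, n) into (p_k, n_k) subsets."""
    total = p + n
    remainder = sum((pk + nk) / total * entropy_bool(pk, nk)
                    for pk, nk in subsets)
    return entropy_bool(p, n) - remainder

# Patrons splits the 12 examples into None (0+, 2-), Some (4+, 0-),
# Full (2+, 4-): a fairly good attribute.
print(information_gain(6, 6, [(0, 2), (4, 0), (2, 4)]))   # ≈ 0.541
# Type splits into four subsets, each with equal positives and negatives:
print(information_gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]))  # gain ≈ 0
```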
Broadening the applicability of decision trees
 In order to extend decision tree induction to a wider variety of problems, a number of issues must be addressed.

1. Missing data: In many domains, not all the


attribute values will be known for every example.
 The values might have gone unrecorded, or they might be
too expensive to obtain.
 This gives rise to two problems:
 First, given a complete decision tree, how should one
classify an example that is missing one of the test
attributes?
 Second, how should one modify the information-gain
formula when some examples have unknown values for the
attribute?
2. Multivalued attributes
 When an attribute has many possible values, the information gain measure gives an inappropriate indication of the attribute’s usefulness.
 In the extreme case, an attribute such as ExactTime has a
different value for every example, which means each subset
of examples is a singleton with a unique classification, and
the information gain measure would have its highest value
for this attribute.
 But choosing this split first is unlikely to yield the best tree.
 One solution is to use the gain ratio.
 Another possibility is to allow a Boolean test of the form
A=vk, that is, picking out just one of the possible values for
an attribute, leaving the remaining values to possibly be
tested later in the tree.
3. Continuous and integer-valued input attributes
 Continuous or integer-valued attributes, such as Height and Weight, have an infinite set of possible values.

 Rather than generate infinitely many branches, decision-tree learning algorithms typically find the split point that gives the highest information gain.
 For example, at a given node in the tree, it might be the case that testing on Weight > 160 gives the most information.
 Efficient methods exist for finding good split points:

 start by sorting the values of the attribute, and then


consider only split points that are between two examples
in sorted order that have different classifications, while
keeping track of the running totals of positive and
negative examples on each side of the split point.

 Splitting is the most expensive part of real-world decision


tree learning applications.
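The sorted-order trick can be sketched as follows. The Weight values below are invented for the demo; a real learner would score each candidate split by information gain while keeping running totals of positives and negatives.

```python
def candidate_splits(points):
    """points: list of (value, label) pairs. Returns midpoints between
    adjacent sorted values whose labels differ -- the only split points
    that can possibly give information gain."""
    pts = sorted(points)
    return [(a[0] + b[0]) / 2
            for a, b in zip(pts, pts[1:])
            if a[1] != b[1] and a[0] != b[0]]

print(candidate_splits([(160, 'No'), (165, 'No'), (170, 'Yes'), (180, 'Yes')]))
# -> [167.5]
```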
4. Continuous-valued
output attributes
 If we are trying to predict a numerical output value, such as
the price of an apartment, then we need a regression tree
rather than a classification tree.

 A regression tree has at each leaf a linear function of some


subset of numerical attributes, rather than a single value.

 For example, the branch for two bedroom apartments might


end with a linear function of square footage, number of
bathrooms, and average income for the neighborhood.

 The learning algorithm must decide when to stop splitting


and begin applying linear regression over the attributes.
Evaluating And Choosing
The Best Hypothesis
Theory of Learning
 We’ll start with the question of how many examples
are needed for learning.
 We saw from decision tree learning on the restaurant problem that performance improves with more training data.
 Learning curves are useful, but they are specific to a
particular learning algorithm on a particular problem.
 Are there some more general principles governing the
number of examples needed in general?
 Questions like this are addressed by computational
learning theory, which lies at the intersection of AI,
statistics, and theoretical computer science.
Theory of Learning continue
 The underlying principle is that any hypothesis that is
seriously wrong will almost certainly be “found out” with
high probability after a small number of examples,
because it will make an incorrect prediction.
 Thus, any hypothesis that is consistent with a sufficiently
large set of training examples is unlikely to be seriously
wrong: that is, it must be probably approximately
correct.
 Any learning algorithm that returns hypotheses that are
probably approximately correct is called a PAC
learning algorithm;
 we can use this approach to provide bounds on the
performance of various learning algorithms.
 PAC-learning theorems, like all theorems, are logical consequences of axioms.
 When a theorem states something about the future based on the past, the axioms have to provide the “juice” to make that connection.

 The standard assumption is that examples are drawn from the same fixed distribution. Note that we do not have to know what that distribution is, just that it doesn’t change.
 In addition, to keep things simple, we will assume that the true function f is deterministic and is a member of the hypothesis class H that is being considered.
PAC learning example:
Learning decision lists
 We now show how to apply PAC learning to a new hypothesis
space: decision lists.
 A decision list consists of a series of tests, each of which is a
conjunction of literals.
 If a test succeeds when applied to an example description,
the decision list specifies the value to be returned.
 If the test fails, processing continues with the next test in the
list.
 Decision lists resemble decision trees, but their overall structure is simpler: they branch only in one direction.
 In contrast, the individual tests are more complex. Figure 18.10 shows a decision list that represents the following hypothesis:
 WillWait ⇔ (Patrons = Some) ∨ (Patrons = Full ∧ Fri/Sat).
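That decision list can be executed with a small sketch (a hypothetical encoding of the tests as predicates over an example dictionary):

```python
def decision_list_predict(rules, example, default=False):
    """rules: ordered (test, outcome) pairs; the first test that succeeds
    on the example determines the returned value."""
    for test, outcome in rules:
        if test(example):
            return outcome
    return default

# The WillWait decision list from the text:
rules = [
    (lambda e: e['Patrons'] == 'Some', True),
    (lambda e: e['Patrons'] == 'Full' and e['Fri/Sat'], True),
]
print(decision_list_predict(rules, {'Patrons': 'Full', 'Fri/Sat': True}))   # True
print(decision_list_predict(rules, {'Patrons': 'Full', 'Fri/Sat': False}))  # False
```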


REGRESSION AND
CLASSIFICATION WITH LINEAR
MODELS
 So now it is time to move on from decision trees and lists to a different hypothesis space, one that has been used for hundreds of years: the class of linear functions of continuous-valued inputs.
 We’ll start with the simplest case: regression with a univariate linear function, otherwise known as “fitting a straight line.”
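Fitting a straight line by least squares has a closed form. A minimal sketch, with invented data lying exactly on y = 2x + 1:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w1*x + w0 (univariate linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx      # intercept makes the line pass through the means
    return w0, w1

w0, w1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(w0, w1)   # 1.0 2.0
```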
ARTIFICIAL NEURAL
NETWORKS
 Mental activity is hypothesized to consist primarily of electrochemical activity in networks of brain cells called “neurons”.
 Inspired by this hypothesis, some of the earliest AI work aimed to create Artificial Neural Networks.

 Other names for the field are connectionism, parallel distributed processing, and neural computation.

 Figure 18.19 shows a simple mathematical model of the neuron devised by McCulloch and Pitts (1943).
 A neural network is just a collection of units connected together; the properties of the network are determined by its topology and the properties of the “neurons.”
 The activation function g is typically either a hard threshold (Figure 18.17(a)), in which case the unit is called a perceptron, or a logistic function (Figure 18.17(b)), in which case the term sigmoid perceptron is sometimes used.

 Both of these nonlinear activation functions ensure the important property that the entire network of units can represent a nonlinear function (see Exercise 18.22).
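The two activation functions can be written down directly (a minimal sketch):

```python
from math import exp

def hard_threshold(z):
    """Perceptron activation: 1 when the weighted input is >= 0, else 0."""
    return 1 if z >= 0 else 0

def logistic(z):
    """Sigmoid activation, a soft, differentiable version of the threshold."""
    return 1 / (1 + exp(-z))

print(hard_threshold(0.3), hard_threshold(-0.3))   # 1 0
print(logistic(0.0))                               # 0.5
```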
 Having decided on the mathematical model for individual
“neurons,” the next task is to connect them together to
form a network.
 There are two fundamentally distinct ways to do this.
 A feed-forward network has connections only in one
direction—that is, it forms a directed acyclic graph.
 Every node receives input from “upstream” nodes and
delivers output to “downstream” nodes; there are no loops.
 A feed-forward network represents a function of its current
input; thus, it has no internal state other than the weights
themselves.
 A recurrent network, on the other hand, feeds its outputs
back into its own inputs.
 This means that the activation levels of the network form a dynamical system that may reach a stable state or exhibit oscillations or even chaotic behavior.
 Moreover, the response of the network to a given input depends on its initial state, which may depend on previous inputs.
 Hence, recurrent networks (unlike feed-forward networks) can support short-term memory.
 This makes them more interesting as models of the brain, but also more difficult to understand.
 Feed-forward networks are usually arranged in layers, such that each unit receives input only from units in the immediately preceding layer.
2. Single-layer feed-forward
neural networks
(perceptrons)
 A network with all the inputs connected directly to the outputs
is called a single-layer neural network, or a perceptron
network.
 Figure 18.20 shows a simple two-input, two-output perceptron
network.
 With such a network, we might hope to learn the two-bit
adder function, for example.
 Here are all the training data we will need:
 The first thing to notice is that a perceptron network with m outputs is really m separate networks, because each weight affects only one of the outputs.
 Thus, there will be m separate training processes.

 Furthermore, depending on the type of activation function used, the training process will use either the perceptron learning rule or the gradient descent rule for logistic regression.
 If you try either method on the two-bit-adder data, something interesting happens.
 Unit 3 learns the carry function easily, but unit 4 completely fails to learn the sum function.
 We saw in Section 18.6 that linear classifiers (whether hard
or soft) can represent linear decision boundaries in the input
space.
 This works fine for the carry function, which is a logical AND
(see Figure 18.21(a)).
 The sum function, however, is an XOR (exclusive OR) of the
two inputs.
 As Figure 18.21(c) illustrates, this function is not linearly
separable so the perceptron cannot learn it.
 The linearly separable functions constitute just a small fraction of all Boolean functions.
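A sketch of the perceptron learning rule makes the contrast concrete: trained on the carry function (logical AND) it converges, while on the sum function (XOR) no weight vector can ever fit all four examples. The learning rate and epoch count below are arbitrary choices for the demo.

```python
def train_perceptron(data, epochs=20, lr=0.1):
    """Perceptron learning rule: w <- w + lr * (y - hw(x)) * x,
    applied to each (inputs, target) pair in turn."""
    w = [0.0, 0.0, 0.0]                 # bias weight + one weight per input
    for _ in range(epochs):
        for (x1, x2), y in data:
            x = [1.0, x1, x2]           # dummy input fixed at 1 for the bias
            hw = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            w = [wi + lr * (y - hw) * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x1, x2):
    return 1 if w[0] + w[1] * x1 + w[2] * x2 >= 0 else 0

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w_and = train_perceptron(AND)
print([predict(w_and, *x) for x, _ in AND])   # [0, 0, 0, 1] -- AND is learned
w_xor = train_perceptron(XOR)
print([predict(w_xor, *x) for x, _ in XOR])   # can never match [0, 1, 1, 0]
```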
3. Multilayer feed-forward neural networks
 McCulloch and Pitts (1943) were well aware that a single threshold unit would not solve all their problems.

 In fact, their paper proves that such a unit can represent the basic Boolean functions AND, OR, and NOT and then goes on to argue that any desired functionality can be obtained by connecting large numbers of units into (possibly recurrent) networks of arbitrary depth.
 The problem was that nobody knew how to train such networks.
 This turns out to be an easy problem if we think of a
network the right way: as a function hw(x) parameterized
by the weights w.

 Consider the simple network shown in Figure 18.20(b), which has two input units, two hidden units, and two output units. (In addition, each unit has a dummy input fixed at 1.)

 Given an input vector x = (x1, x2), the activations of the input units are set to (a1, a2) = (x1, x2).
5. Learning neural
network structures
 So far, we have considered the problem of learning weights, given a fixed network structure; just as with Bayesian networks, we also need to understand how to find the best network structure.
 If we choose a network that is too big, it will be able to memorize all the examples by forming a large lookup table, but will not necessarily generalize well to inputs that have not been seen before.

 In other words, like all statistical models, neural networks are subject to overfitting when there are too many parameters in the model.
 We saw this in Figure 18.1 (page 696), where the high-
parameter models in (b) and (c) fit all the data, but might
not generalize as well as the low-parameter models in (a)
and (d).

 If we stick to fully connected networks, the only choices to be made concern the number of hidden layers and their sizes.
 The usual approach is to try several and keep the best.
 The cross-validation techniques of Chapter 18 are needed if we are to avoid peeking at the test set.
 That is, we choose the network architecture that gives the highest prediction accuracy on the validation sets.
PARAMETRIC AND NONPARAMETRIC MODELS
 Linear regression and neural networks use the training data
to estimate a fixed set of parameters w.

 That defines our hypothesis hw(x), and at that point we can throw away the training data, because they are all summarized by w.
 A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model.
NONPARAMETRIC MODELS
 A nonparametric model is one that cannot be
characterized by a bounded set of parameters.

 For example, suppose that each hypothesis we generate simply retains within itself all of the training examples and uses all of them to predict the next example.
 Such a hypothesis family would be nonparametric because the effective number of parameters is unbounded: it grows with the number of examples.
 This approach is called instance-based learning or memory-based learning.
1. Nearest neighbor
models
 We can improve on table lookup with a slight
variation: given a query xq, find the k examples
that are nearest to xq.
 This is called k -nearest neighbors lookup.
 We’ll use the notation NN(k, xq) to denote the set
of k nearest neighbors.
 To do classification, first find NN(k, xq), then take
the plurality vote of the neighbors (which is the
majority vote in the case of binary classification).
 To avoid ties, k is always chosen to be an odd
number. To do regression, we can take the mean
or median of the k neighbors, or we can solve a
linear regression problem on the neighbors.
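A minimal k-nearest-neighbors sketch (Euclidean distance, plurality vote; the data points below are invented):

```python
from collections import Counter

def knn_classify(k, xq, examples):
    """examples: list of (x, y) with numeric feature tuples x.
    Finds NN(k, xq) by Euclidean distance and takes the plurality vote."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    neighbors = sorted(examples, key=lambda e: dist(e[0], xq))[:k]
    return Counter(y for _, y in neighbors).most_common(1)[0][0]

data = [((1, 1), 'A'), ((1, 2), 'A'), ((6, 6), 'B'),
        ((7, 7), 'B'), ((2, 1), 'A')]
print(knn_classify(3, (1.5, 1.5), data))   # 'A'
```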
 In Figure 18.26, we show the decision boundary of k-nearest-neighbors classification for k = 1 and 5 on the earthquake data set from Figure 18.15.
 Nonparametric methods are still subject to underfitting and overfitting, just like parametric methods.
 In this case 1-nearest neighbors is overfitting; it reacts too much to the black outlier in the upper right and the white outlier at (5.4, 3.7).
 The 5-nearest-neighbors decision boundary is good; higher k would underfit.
 As usual, cross-validation can be used to select the best
value of k.
2. Finding nearest
neighbors with k-d trees
 A balanced binary tree over data with an arbitrary number of
dimensions is called a k-d tree, for k-dimensional tree.
 (In our notation, the number of dimensions is n, so they would be n-d trees.)
 The construction of a k-d tree is similar to the construction of
a one-dimensional balanced binary tree.
 We start with a set of examples and at the root node we split
them along the ith dimension by testing whether xi ≤ m.
 We chose the value m to be the median of the examples
along the ith dimension; thus half the examples will be in the
left branch of the tree and half in the right.
 We then recursively make a tree for the left and right sets of
examples, stopping when there are fewer than two examples
left.
 To choose a dimension to split on at each node of the tree,
one can simply select dimension i mod n at level i of the
tree.
 Another strategy is to split on the dimension that has the
widest spread of values.
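The construction described above can be sketched as follows, cycling through the dimensions and splitting at the median. This is a simplified illustration; the point coordinates are invented.

```python
def build_kdtree(points, depth=0):
    """points: list of n-dimensional tuples. Splits on dimension
    depth mod n at the median, recursing until fewer than two remain."""
    if len(points) < 2:
        return points[0] if points else None
    n = len(points[0])
    d = depth % n                          # cycle through the dimensions
    pts = sorted(points, key=lambda p: p[d])
    mid = len(pts) // 2                    # median index: half left, half right
    return {'dim': d,
            'split': pts[mid][d],
            'left': build_kdtree(pts[:mid], depth + 1),
            'right': build_kdtree(pts[mid:], depth + 1)}

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree['dim'], tree['split'])   # 0 7  (root splits on x at the median)
```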
SUPPORT VECTOR
MACHINES
The support vector machine or SVM framework is
currently the most popular approach for “off-the-shelf”
supervised learning:
 if you don’t have any specialized prior knowledge about a
domain, then the SVM is an excellent method to try first.
 There are three properties that make SVMs attractive:
 1. SVMs construct a maximum margin separator—a
decision boundary with the largest possible distance to
example points. This helps them generalize well.
 2. SVMs create a linear separating hyperplane, but they
have the ability to embed the data into a higher-
dimensional space, using the so-called kernel trick.
 Often, data that are not linearly separable in the original input space are easily separable in the higher-dimensional space.
 The high-dimensional linear separator is actually nonlinear in
the original space.
 This means the hypothesis space is greatly expanded over methods that use strictly linear representations.
 3. SVMs are a nonparametric method—they retain training examples and potentially need to store them all.
 On the other hand, in practice they often end up retaining
only a small fraction of the number of examples—sometimes
as few as a small constant times the number of dimensions.
 Thus SVMs combine the advantages of nonparametric and
parametric models: they have the flexibility to represent
complex functions, but they are resistant to overfitting.
ENSEMBLE LEARNING
 So far we have looked at learning methods in which a single hypothesis, chosen from a hypothesis space, is used to make predictions.
 The idea of ensemble learning methods is to select a collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions.
 For example, during cross-validation we might generate twenty different decision trees, and have them vote on the best classification for a new example.
 The motivation for ensemble learning is simple.

 Consider an ensemble of K = 5 hypotheses and suppose that we combine their predictions using simple majority voting.
 For the ensemble to misclassify a new example, at least three of the five hypotheses have to misclassify it.
 The hope is that this is much less likely than a misclassification by a single hypothesis.
 Suppose we assume that each hypothesis hk in the
ensemble has an error of p—that is, the probability that a
randomly chosen example is misclassified by hk is p.
 Furthermore, suppose we assume that the errors made by
each hypothesis are independent.
 In that case, if p is small, then the probability of a large number of misclassifications occurring is minuscule.
 For example, a simple calculation (Exercise 18.18) shows that using an ensemble of five hypotheses reduces an error rate of 1 in 10 down to an error rate of less than 1 in 100.
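The calculation behind that claim is a binomial sum: under the independence assumption, the ensemble errs only if a majority of the K hypotheses err.

```python
from math import comb

def ensemble_error(p, k=5):
    """Probability that a majority of k independent hypotheses, each with
    error rate p, misclassify a given example."""
    need = k // 2 + 1   # at least 3 of 5 must be wrong
    return sum(comb(k, m) * p**m * (1 - p)**(k - m)
               for m in range(need, k + 1))

print(ensemble_error(0.1))   # ≈ 0.00856: error of 1 in 10 becomes < 1 in 100
```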

 Now, obviously the assumption of independence is unreasonable, because hypotheses are likely to be misled in the same way by any misleading aspects of the training data.
 But if the hypotheses are at least a little bit different, thereby reducing the correlation between their errors, then ensemble learning can be very useful.
Online Learning
 On the one hand, that is a sensible assumption: if the
future bears no resemblance to the past, then how can
we predict anything?

 On the other hand, it is too strong an assumption: it is rare that our inputs have captured all the information that would make the future truly independent of the past.
 Let us consider the situation where our input consists of predictions from a panel of experts.
 For example, each day a set of K pundits predicts whether the stock market will go up or down, and our task is to pool those predictions and make our own.
 One way to do this is to keep track of how well each expert performs, and choose to believe them in proportion to their past performance.
 This is called the randomized weighted majority algorithm.
 We can describe it more formally.
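Since the formal description is not reproduced in these slides, the following is a hypothetical minimal sketch: every expert starts with weight 1; each round we follow an expert sampled in proportion to the weights, then multiply the weight of every wrong expert by beta (here 0.5). The pundit data below are invented.

```python
import random

def randomized_weighted_majority(expert_preds, outcomes, beta=0.5):
    """expert_preds: per-round tuples of each expert's prediction.
    Returns the final weights after observing the outcomes."""
    k = len(expert_preds[0])
    w = [1.0] * k
    for preds, y in zip(expert_preds, outcomes):
        # Our (randomized) prediction: follow an expert chosen with
        # probability proportional to its current weight.
        follow = random.choices(range(k), weights=w)[0]
        our_prediction = preds[follow]
        # After the outcome is revealed, penalize every wrong expert.
        w = [wi * (beta if preds[i] != y else 1.0) for i, wi in enumerate(w)]
    return w

# Three pundits over three days; the first is always right,
# the second wrong once, the third always wrong.
preds = [('up', 'up', 'down'), ('up', 'down', 'down'), ('down', 'down', 'up')]
outcomes = ['up', 'up', 'down']
print(randomized_weighted_majority(preds, outcomes))   # [1.0, 0.5, 0.125]
```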


PRACTICAL MACHINE
LEARNING
 We consider two aspects of practical machine learning.
 The first involves finding algorithms capable of learning to recognize handwritten digits and squeezing every last drop of predictive performance out of them.
 The second involves anything but: pointing out that obtaining, cleaning, and representing the data can be at least as important as algorithm engineering.
1. Case study: Handwritten digit recognition
 Recognizing handwritten digits is an important problem with many applications, including automated sorting of mail by postal code, automated reading of checks and tax returns, and data entry for hand-held computers.
 It is an area where rapid progress has been made, in part
because of better learning algorithms and in part because of
the availability of better training sets.
 The United States National Institute of Standards and Technology (NIST) has archived a database of 60,000 labeled digits, each 20×20 = 400 pixels with 8-bit grayscale values.
 It has become one of the standard benchmark problems for comparing new learning algorithms.
 Some example digits are shown in Figure 18.36.
 Many different learning approaches have been tried. One of
the first, and probably the simplest, is the 3-nearest-
neighbor classifier, which also has the advantage of
requiring no training time.
 As a memory-based algorithm, however, it must store all
60,000 images, and its run time performance is slow. It
achieved a test error rate of 2.4%.

 A single-hidden-layer neural network was designed for this problem with 400 input units (one per pixel) and 10 output units (one per class).
 Using cross-validation, it was found that roughly 300 hidden units gave the best performance.
 With full interconnections between layers, there were a total of 123,300 weights. This network achieved a 1.6% error rate.
2. Case study: Word
senses and house prices
 In practical applications of machine learning, the data set
is usually large, multidimensional, and messy.

 The data are not handed to the analyst in a prepackaged set of (x, y) values; rather the analyst needs to go out and acquire the right data.
 There is a task to be accomplished, and most of the engineering problem is deciding what data are necessary to accomplish the task; a smaller part is choosing and implementing an appropriate machine learning method to process the data.
…Thank You… 
********
