AI and Expert System
Department of Information
Technology
Ambo University
About the course
Course Title: AI & Expert Systems
Course Code: MIT6124
Credit Hours: 3
Contact Hour: 2 lecture + 1 lab
Course Objective
This course aims to develop an appreciation of the ideas underpinning
different theories of knowledge in AI, expert systems, robotics
fundamentals, neural networks, pattern recognition, image processing,
genetic algorithms and machine learning.
Contents
Unit 1: Recap of basic Artificial Intelligence concepts
Unit 2: Expert Systems
Unit 3: Introduction to Neural Networks
Unit 4: Fuzzy Systems
Unit 5: Introduction to Machine Learning & Genetic Algorithms
Unit 6: Introduction to Pattern Recognition & Image Processing
What is AI?
There is no officially agreed definition of AI.
It may be better to define it in terms of two characteristics:
Autonomy - The ability to perform tasks in complex environments
without constant guidance from a user.
Adaptivity - The ability to improve
performance by learning from
experience.
Example
Self-driving cars:
search and planning to find the most
convenient route from A to B
computer vision to identify obstacles, and
decision making under uncertainty to cope
with the complex and dynamic environment.
The same technologies are also used in other
autonomous systems such as:
delivery robots, flying drones, and autonomous
ships.
Exercises
Which of the following are AI and
which are not?
1. Spreadsheet that calculates sums and
other pre-defined functions on given
data
The outcome is determined by the user-specified formula, so no AI is
needed.
2. Predicting the stock market by fitting
a curve to past data about stock prices
Fitting a simple curve is not really AI, but there are so many different
curves to choose from, even if there's a lot of data to constrain them,
that choosing a good one requires learning from the data; this brings
it closer to AI.
Related Fields
In addition to AI, there are several other closely
related topics.
These include
machine learning,
data science, and
deep learning
Machine learning
Systems that improve their performance in a
given task with more and more experience or
data
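As a minimal illustration of this idea (a sketch in Python, not part of
the official course materials), the toy example below fits a straight
line to noisy data; with more data the learned line is usually closer
to the true one:

import numpy as np

def fit_and_evaluate(n_samples):
    # Noisy observations of the true line y = 2x + 1.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, n_samples)
    y = 2 * x + 1 + rng.normal(0, 1, n_samples)
    # "Learn" the slope and intercept by least-squares fitting.
    slope, intercept = np.polyfit(x, y, 1)
    # How far the learned parameters are from the true ones.
    return abs(slope - 2) + abs(intercept - 1)

print(fit_and_evaluate(10))    # small amount of experience
print(fit_and_evaluate(1000))  # more experience, usually a smaller error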
Related Fields (contd)
Deep learning
is a subfield of machine learning, which itself is a
subfield of AI, which itself is a subfield of computer
science.
The "depth" of deep learning refers to
the complexity of the mathematical model;
the increased computing power of modern computers has allowed
researchers to increase this complexity
to reach levels that appear not only quantitatively
but also qualitatively different from before.
Related Fields (contd)
Data Science
is a recent umbrella term (term that
covers several subdisciplines) that
includes
machine learning and statistics
certain aspects of computer science
including
algorithms,
data storage, and
web application development
Related Fields (contd)
Robotics means
building and programming robots so
that
they can operate in complex, real-world
scenarios.
In a way, robotics is the ultimate
challenge of AI
since it requires a combination of
virtually all areas of AI.
For example:
Computer vision and speech recognition for sensing the environment.
Related Fields (contd)
Natural language processing,
information retrieval, and reasoning
under uncertainty
for processing instructions and predicting
consequences of potential actions
Cognitive modeling and affective
computing
(systems that respond to expressions of
human feelings or that mimic feelings)
for interacting and working together with
humans
Philosophy of AI
The very nature of the term
artificial intelligence brings up
philosophical questions
whether intelligent behavior implies
or requires the existence of a mind,
and
to what extent consciousness is replicable as computation.
The Turing Test
Alan Turing (1912-1954) was an
English mathematician and
logician.
He is rightfully considered to be the
father of computer science.
Turing was fascinated by
intelligence and thinking, and
the possibility of simulating them by
machines
Turing's most prominent contribution to AI is
his imitation game, which later became known as the Turing test.
The Turing Test (contd)
In the test, a human interrogator
interacts with two players, A and B, by
exchanging written messages (in a
chat).
If the interrogator cannot determine
which player, A or B, is a computer and
which is a human,
the computer is said to pass the test.
The argument is that if a computer is
indistinguishable from a human in a
general natural language conversation,
then it must have reached human-level intelligence.
Probability
One of the reasons why modern AI
methods actually work in real-
world problems
is their ability to deal with uncertainty
Probability has turned out to be
the best approach for reasoning
under uncertainty, and
almost all current AI applications are
based, to at least some degree, on
probabilities.
Why probability matters
Probability can be used
to quantify and compare risks in
everyday life:
what are the chances of crashing your car
if you exceed the speed limit,
what are the chances that the interest
rates on your mortgage will go up by five
percentage points within the next five
years, or
what are the chances that AI will
automate particular tasks
such as detecting fractured bones in X-ray
images or waiting tables in a restaurant.
The key lesson about probability
It is the ability to think of
uncertainty as
a thing that can be quantified at least
in principle.
This means that we can talk about
uncertainty as if it were a number:
numbers can be compared (is this
thing more probable than that thing),
and
they can often be measured.
The key lesson about probability
(contd)
Granted, measuring probabilities is
hard:
we usually need many observations
about a phenomenon to draw
conclusions.
However, by systematically collecting
data,
we can critically evaluate probabilistic
statements, and
our numbers can sometimes be found to
be right or wrong.
The key lesson about probability
(contd)
In other words, the key lesson is
that
uncertainty is not beyond the scope
of rational thinking and discussion,
and
probability provides a systematic way
of doing just that.
Why quantifying uncertainty matters
If we think of uncertainty as
something that can't be quantified
or measured,
the uncertainty aspect may become
an obstacle for rational discussion.
We may for example argue that
since we don't know exactly whether
a vaccine may cause a harmful side-
effect, it is too dangerous to use.
Why quantifying uncertainty matters
However, this may lead us
to ignore a life-threatening disease
that the vaccine will eradicate.
In most cases, the benefits and
risks are known
to sufficient precision to clearly see
that one is more significant than the
other.
Odds
Probably the easiest way
to represent uncertainty is through
odds.
They make it particularly easy
to update beliefs when more
information becomes available
By odds, we mean
for example 3:1 (three to one), which
means that
we expect that for every three cases of an outcome, there is one case
of the opposite outcome.
Odds (contd)
The other way to express the same
would be to say that
the chances of winning are 3/4 (three
in four).
These are called natural frequencies
since they involve only whole numbers.
With whole numbers, it is easy to
imagine,
for example, four people out of whom,
three have brown eyes.
Or four days out of which it rains on three (if you're in Woliso).
Why we use odds and not
percentages
Three out of four is of course the
same as 75%
(mathematicians prefer to use
fractions like 0.75 instead of
percentages).
It has been found that people get
confused and make mistakes more
easily
when dealing with fractions and
percentages than with natural
frequencies or odds.
Example
The odds 1:5 mean that
you'd have to play the game six times
to get one win on the average.
The probability 20% means that
you'd have to play five times to get
one win on the average.
Example (contd)
For odds that are greater than one,
such as 5:1,
it is easy to remember that
we are not dealing with probabilities
because no probability can be greater
than 1 (or greater than 100%),
but for odds that are less than one
such as 1:5, the danger of confusion
lurks around the corner.
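The conversion between odds and probabilities can be made concrete with
a small Python sketch (illustrative only, not part of the lecture
slides):

def odds_to_probability(wins, losses):
    # Odds wins:losses mean 'wins' favourable cases for every 'losses' unfavourable ones.
    return wins / (wins + losses)

def probability_to_odds(p):
    # Ratio of favourable to unfavourable cases.
    return p / (1 - p)

print(odds_to_probability(1, 5))   # odds 1:5 -> about 0.167, i.e. one win in six games
print(odds_to_probability(3, 1))   # odds 3:1 -> 0.75, i.e. three in four
print(probability_to_odds(0.2))    # probability 20% -> 0.25, i.e. odds 1:4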
The Bayes rule
The Bayes rule can be expressed in
many forms.
The simplest one is in terms of
odds.
The idea is
to take the odds for something happening (against it not happening),
which we'll write as the prior odds.
The word prior refers to
our assessment of the odds before obtaining some new information that
may change them.
The Bayes rule (contd)
The purpose of the formula is
to update the prior odds
when new information becomes
available,
to obtain the posterior odds, or
the odds after obtaining the information
How odds change
In order to weigh the new
information, and decide how the
odds change when it becomes
available,
we need to consider how likely we
would be to encounter this
information in alternative situations.
Let's take as an example the odds that it will rain later today.
Imagine getting up in the morning in
Woliso.
How odds change (contd)
The chances of rain are 206 in 365
The number of days without rain is
therefore 159.
This converts to prior odds of
206:159 for rain,
so the cards are stacked against you
already before you open your eyes.
How odds change (contd)
However, after opening your eyes and
taking a look outside,
you notice it's cloudy.
Suppose the chances of having a cloudy morning on a rainy day are 9 out
of 10;
that means that only one out of 10 rainy days starts out with blue skies.
But sometimes there are also clouds
without rain:
the chances of having clouds on a rainless
day are 1 in 10.
How odds change (contd)
Now how much higher are the
chances of clouds on a rainy day
compared to a rainless day?
Think about this carefully as it will
be important
to be able to comprehend the
question and obtain the answer in
what follows.
The answer is that
the chances of clouds are nine times higher on a rainy day than on a
rainless day.
Likelihood ratio
The above ratio
(nine times higher chance of clouds
on a rainy day than on a rainless day)
is called
the likelihood ratio.
More generally, the likelihood ratio is
the probability of the observation in case the event of interest occurs
(in the above, rain), divided by the probability of the observation in
case the event does not occur (no rain).
Likelihood ratio (contd)
So we concluded that on a cloudy
morning, we have:
likelihood ratio = (9/10) / (1/10) =
9
The mighty Bayes rule
for converting prior odds into
posterior odds is as follows:
posterior odds = likelihood ratio ×
prior odds
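Applying the rule to the rain example above, the arithmetic can be
sketched as follows (the numbers are the ones given earlier in this
section):

# Prior odds for rain in Woliso: 206 rainy days against 159 rainless days.
prior_odds = 206 / 159
# Likelihood ratio for observing clouds: 9/10 on rainy days vs 1/10 on rainless days.
likelihood_ratio = (9 / 10) / (1 / 10)
# Bayes rule in odds form.
posterior_odds = likelihood_ratio * prior_odds
print(posterior_odds)                         # about 11.7 in favour of rain
print(posterior_odds / (1 + posterior_odds))  # probability of rain, about 0.92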
The Bayes rule in practice: breast
cancer screening
This example also illustrates
a common bias in dealing with
uncertain information called the base-
rate fallacy.
Consider mammographic screening
for breast cancer.
Using made up percentages for the
sake of simplifying the numbers,
let's assume that 5 in 100 women
have breast cancer.
...The Bayes rule in practice
Suppose that if a person has
breast cancer,
then the mammograph test will find it
80 times out of 100.
When the test comes out
suggesting that breast cancer is
present,
we say that the result is positive
...The Bayes rule in practice
The test may also fail in the other
direction,
namely to indicate breast cancer when none
exists.
This is called a false positive finding.
Suppose that if the person being tested
actually doesn't have breast cancer,
the chances that the test nevertheless
comes out positive are 10 in 100.
Based on the above probabilities, complete the following.
Assignment - 1
1. Calculate the prior probability of breast cancer.
2. Calculate the likelihood ratio of a positive test result.
Naive Bayes classification
One of the most useful applications
of the Bayes rule is
the so-called naive Bayes classifier.
The Bayes classifier is a machine
learning technique that can be
used
to classify objects such as text
documents into two or more classes.
The classifier is trained by
analyzing a set of training data, for
which the correct classes are given.
...Naive Bayes classification
The naive Bayes classifier can be
used
to determine
the probabilities of the classes given
a number of different observations.
The assumption in the model is
that
the feature variables are conditionally
independent given the class
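In the odds form used earlier, this independence assumption means that
the likelihood ratios of the individual observed words can simply be
multiplied together (written here in the same style as the Bayes rule
above; the exact notation is ours):

posterior odds = prior odds × likelihood ratio(word 1) × likelihood ratio(word 2) × ... × likelihood ratio(word n)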
Real world application: spam filters
We will use a spam email filter as a
running example
for illustrating the idea of the naive
Bayes classifier.
Thus, the class variable indicates
whether a message is spam (or junk
email) or
whether it is a legitimate message
(also called ham).
...Real world application: spam
filters
The words in the message
correspond to the feature
variables,
so that the number of feature
variables in the model is
determined by
the length of the message.
Why we call it naive
Using spam filters as an example,
the idea is to think of the words as being
produced by choosing one word after the
other
so that the choice of the word depends only on
whether the message is spam or ham.
This is a crude simplification of the
process because
it means that there is no dependency
between adjacent words, and
the order of the words has no significance.
This is in fact why the method is called naive.
Estimating parameters
To get started,
we need to specify the prior odds for spam
(against ham).
For simplicity assume this to be 1:1
which means that on the average half of the
incoming messages are spam
(in reality, the amount of spam is probably much
higher).
To get our likelihood ratios,
we need two different probabilities for any
word occurring:
one in spam messages and another one in
ham messages.
...Estimating parameters
The word distributions for the two
classes are
best estimated from actual training
data
that contains some spam messages as
well as legitimate messages.
The simplest way is
to count how many times each word appears in the data and divide the
number by the total word count.
...Estimating parameters
To illustrate the idea,
let's assume that we have at our disposal some spam and some ham.
You can easily obtain such data by saving a batch of your emails in
two files.
Assume that we have calculated the number of occurrences of the
following words (along with all other words) in the two classes of
messages:
...Estimating parameters
word          spam      ham
million        156       98
dollars         29      119
adclick         51        0
conferences      0       12
total        95791   306438
...Estimating parameters
We can now estimate that
the probability that a word in a spam
message is million,
for example, is about 156 out of 95791,
which is roughly the same as 1 in 614.
Likewise,
the probability that a word in a ham message is million
is about 98 out of 306438, which is roughly the same as 1 in 3127.
...Estimating parameters
Both of these probability estimates
are small, less than 1 in 500,
but more importantly,
the former is higher than the latter: 1
in 614 is higher than 1 in 3127.
This means that the likelihood ratio is
(1/614) / (1/3127) = 3127/614 = 5.1 (rounded
to one decimal digit).
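The same arithmetic in a small Python sketch (using the counts from the
table above; only an illustration):

# Occurrences of the word "million" and total word counts in the training data.
spam_count, spam_total = 156, 95791
ham_count, ham_total = 98, 306438

p_word_given_spam = spam_count / spam_total   # about 1 in 614
p_word_given_ham = ham_count / ham_total      # about 1 in 3127

likelihood_ratio = p_word_given_spam / p_word_given_ham
print(round(likelihood_ratio, 1))             # 5.1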
Zero means trouble
One problem with estimating the
probabilities directly from the counts is
that
zero counts lead to zero estimates.
This can be quite harmful for the performance of the classifier:
it easily leads to situations where the posterior odds are 0/0, which
is nonsense.
The simplest solution is to use a small lower bound for all probability
estimates.
The value 1/100000, for instance, does the job.
...Zero means trouble
Using the above logic,
we can determine the likelihood
ratio for all possible words without
having to use zero, giving us the
following likelihood ratios:
word          likelihood ratio
million        5.1
dollars        0.8
adclick       53.2
conferences    0.3
...Zero means trouble
We are now ready to apply the
method to classify new messages.
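The classification step itself can be sketched in a few lines of Python
(an illustration of the method, using the likelihood ratios from the
table above and the 1:1 prior odds assumed earlier; it is not a
complete spam filter):

# Likelihood ratios (spam vs ham) estimated above, with the 1/100000 floor already applied.
likelihood_ratios = {
    "million": 5.1,
    "dollars": 0.8,
    "adclick": 53.2,
    "conferences": 0.3,
}

def posterior_odds_for_spam(message, prior_odds=1.0):
    # Naive Bayes in odds form: multiply the prior odds by the
    # likelihood ratio of every word in the message.
    odds = prior_odds
    for word in message.split():
        # Words with no estimate are ignored (ratio 1) in this sketch.
        odds *= likelihood_ratios.get(word, 1.0)
    return odds

# Example with a message containing two of the words above.
odds = posterior_odds_for_spam("million dollars")
print(odds)                           # 5.1 * 0.8 = about 4.1
print("spam" if odds > 1 else "ham")  # odds above 1:1 suggest spam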
Assignment -2
1. One word spam filter
Let's start with a message that only has
one word in it: million
a) Calculate the posterior odds for
spam given this word using the table
above.
b) Is that spam or ham?
Assignment -2 (contd)
2. Full spam filter
a) Now use the naive Bayes method to
calculate the posterior odds for
spam given the message: million
dollars adclick conferences.
b) Is it spam or ham?