An Introduction to Probabilistic Neural Networks
Vincent Cheung and Kevin Cannons
Signal & Data Compression Laboratory, Electrical & Computer Engineering, University of Manitoba, Winnipeg, Manitoba, Canada
Advisor: Dr. W. Kinsner
June 10, 2002
Outline
- Introduction
- Classifier Example
- Theory and Architecture
- Training
- Program Implementations
- Conclusion
What is a PNN?
- A probabilistic neural network (PNN) is predominantly a classifier
  - Maps any input pattern to one of a number of classifications
  - Can be forced into a more general function approximator
- A PNN is an implementation of a statistical algorithm called kernel discriminant analysis, in which the operations are organized into a multilayered feedforward network with four layers:
  - Input layer
  - Pattern layer
  - Summation layer
  - Output layer
Advantages and Disadvantages
- Advantages
  - Fast training process
    - Orders of magnitude faster than backpropagation
  - An inherently parallel structure
  - Guaranteed to converge to an optimal classifier as the size of the representative training set increases
    - No local minima issues
  - Training samples can be added or removed without extensive retraining
- Disadvantages
  - Not as general as backpropagation
  - Large memory requirements
  - Slow execution of the network
  - Requires a representative training set
    - Even more so than other types of NNs
Classification Theory
- If the probability density function (pdf) of each of the populations is known, then an unknown sample X belongs to class i if:

  f_i(X) > f_j(X), for all j ≠ i

  where f_k is the pdf for class k

- Prior probability (h)
  - The probability of an unknown sample being drawn from a particular population
- Other parameters may be included
  - Misclassification cost (c)
    - The cost of incorrectly classifying an unknown
- The classification decision then becomes (see the sketch below):

  h_i c_i f_i(X) > h_j c_j f_j(X), for all j ≠ i   (the Bayes optimal decision rule)
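As a concrete illustration of the decision rule above, here is a minimal Java sketch (in the spirit of the Java implementation mentioned later, but not taken from it); the class and method names are ours, and the pdf values f_i(X) are assumed to come from an estimator such as the Parzen estimator introduced on the next slides.

```java
// Bayes optimal decision rule: pick the class i that maximizes h_i * c_i * f_i(X).
// priors[i]    = h_i, the prior probability of class i
// costs[i]     = c_i, the cost of misclassifying a sample that truly belongs to class i
// pdfValues[i] = f_i(X), the value of class i's pdf at the unknown sample X
public class BayesDecision {
    public static int classify(double[] priors, double[] costs, double[] pdfValues) {
        int bestClass = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < pdfValues.length; i++) {
            double score = priors[i] * costs[i] * pdfValues[i];
            if (score > bestScore) {
                bestScore = score;
                bestClass = i;
            }
        }
        return bestClass;  // index of the winning class
    }
}
```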
PDF Estimation
- Estimate the pdf using the samples of the populations (the training set)
- PDF for a single sample (in a population):

  \frac{1}{\sigma} W\!\left(\frac{x - x_k}{\sigma}\right)

  where x = unknown (input), x_k = kth sample, W = weighting function, σ = smoothing parameter

- PDF for a single population:

  g(x) = \frac{1}{n\sigma} \sum_{k=1}^{n} W\!\left(\frac{x - x_k}{\sigma}\right)

  (the average of the pdfs for the n samples in the population; this is Parzen's pdf estimator — see the sketch below)

- The estimated pdf approaches the true pdf as the training set size increases, as long as the true pdf is smooth
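A minimal Java sketch of Parzen's estimator for one population, assuming a one-dimensional input and a caller-supplied weighting function W; the interface and names are ours, not from the program described later.

```java
import java.util.function.DoubleUnaryOperator;

// Parzen pdf estimate for one population:
// g(x) = (1 / (n * sigma)) * sum over k of W((x - x_k) / sigma)
public class ParzenEstimator {
    public static double estimate(double x, double[] samples, double sigma,
                                  DoubleUnaryOperator w) {
        double sum = 0.0;
        for (double xk : samples) {
            sum += w.applyAsDouble((x - xk) / sigma);  // contribution of the kth sample
        }
        return sum / (samples.length * sigma);         // average, scaled by 1/sigma
    }
}
```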
Weighting Function
- Provides a sphere-of-influence
  - Large values for small distances between the unknown and the training samples
  - Rapidly decreases to zero as the distance increases
- The Gaussian function is commonly used
  - Behaves well and is easily computed
  - Its use is not related to any assumption about a normal distribution
- The estimated pdf then becomes (see the sketch below):

  g(x) = \frac{1}{n\sigma\sqrt{2\pi}} \sum_{k=1}^{n} e^{-\frac{(x - x_k)^2}{2\sigma^2}}
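For illustration, the Gaussian weighting function can be plugged into the ParzenEstimator sketch from the previous slide; the 1/√(2π) normalizing factor is kept here so the estimate integrates to one, and the sample values in the usage example are hypothetical.

```java
import java.util.function.DoubleUnaryOperator;

public class GaussianKernel {
    // W(u) = (1 / sqrt(2*pi)) * exp(-u^2 / 2), the Gaussian weighting function
    public static final DoubleUnaryOperator W =
            u -> Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);

    public static void main(String[] args) {
        double[] samples = {1.0, 1.5, 2.2};                       // hypothetical training samples
        double g = ParzenEstimator.estimate(1.8, samples, 0.5, W); // sigma = 0.5
        System.out.println("estimated pdf at x = 1.8: " + g);
    }
}
```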
Multivariate Inputs
- The input to the network is a vector
- PDF for a single sample (in a population):

  \frac{1}{(2\pi)^{p/2}\sigma^{p}} \, e^{-\frac{\|X - X_k\|^2}{2\sigma^2}}

  where X = unknown (input), X_k = kth sample, σ = smoothing parameter, p = length of the vector

- PDF for a single population:

  g_i(X) = \frac{1}{(2\pi)^{p/2}\sigma^{p} n_i} \sum_{k=1}^{n_i} e^{-\frac{\|X - X_{ik}\|^2}{2\sigma^2}}

  (the average of the pdfs for the n_i samples in the ith population)

- Classification criteria: g_i(X) > g_j(X), for all j ≠ i, using the simplified

  g_i(X) = \frac{1}{n_i} \sum_{k=1}^{n_i} e^{-\frac{\|X - X_{ik}\|^2}{\sigma^2}}
  (obtained by eliminating the common factors and absorbing the 2 into σ; see the sketch below)
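A sketch of the simplified per-class discriminant above; input vectors are plain double arrays, and the class and method names are ours.

```java
// Simplified per-class score: g_i(X) = (1/n_i) * sum over k of exp(-||X - X_ik||^2 / sigma^2)
public class ClassScore {
    public static double score(double[] x, double[][] classSamples, double sigma) {
        double sum = 0.0;
        for (double[] xk : classSamples) {
            double dist2 = 0.0;
            for (int d = 0; d < x.length; d++) {       // squared Euclidean distance ||X - X_ik||^2
                double diff = x[d] - xk[d];
                dist2 += diff * diff;
            }
            sum += Math.exp(-dist2 / (sigma * sigma)); // contribution of one pattern unit
        }
        return sum / classSamples.length;              // average over the class's samples
    }
}
```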
Training Set
- The training set must be thoroughly representative of the actual population for effective classification
  - More demanding than most NNs
  - A sparse set is sufficient
  - Erroneous samples and outliers are tolerable
- Adding and removing training samples simply involves adding or removing neurons in the pattern layer (see the sketch below)
  - Minimal retraining is required, if any
- As the training set increases in size, the PNN asymptotically converges to the Bayes optimal classifier
  - The estimated pdf approaches the true pdf, assuming the true pdf is smooth
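Because the pattern layer simply stores the training samples, adding or removing a sample amounts to adding or removing one stored vector (one pattern neuron), with no weight re-estimation beyond possibly re-tuning σ. A minimal sketch with hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// The "weights" of a PNN pattern layer are just the stored training samples.
public class PatternLayer {
    private final List<double[]> samples = new ArrayList<>();

    public void addSample(double[] x)    { samples.add(x); }     // add a pattern neuron
    public void removeSample(double[] x) { samples.remove(x); }  // remove a pattern neuron
                                                                 // (the exact array previously added)
    public List<double[]> getSamples()   { return samples; }
}
```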
Training
- The training process of a PNN is essentially the act of determining the value of the smoothing parameter, sigma (σ)
- Training is fast
  - Orders of magnitude faster than backpropagation

[Figure: plot of correct classifications versus σ — the PNN behaves like a nearest-neighbour classifier for small σ and like a matched filter for large σ, with the optimum in between]

- Determining sigma
  - Make an educated guess based on knowledge of the data
  - Estimate a value using a heuristic technique
Estimating Sigma Using Jackknifing
- Systematically test values of sigma over some range
  - Bound the optimal value to some interval
  - Shrink the interval
- Jackknifing is used to grade the performance of each test sigma (see the sketch below)
  - Exclude a single sample from the training set
  - Test whether the PNN correctly classifies the excluded sample
  - Repeat the exclusion and testing process for each sample in the training set
- The number of correct classifications over the entire process is a measure of the performance for that value of sigma
- Not an unbiased measure of performance
  - The training and testing sets are not independent
  - Gives a ballpark estimate of the quality of sigma
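A sketch of how one candidate sigma could be graded by the leave-one-out procedure above, reusing the ClassScore sketch from the multivariate slide; the outer search that bounds and shrinks the interval of candidate sigmas is omitted, and all names are ours.

```java
// Grade one candidate sigma by leave-one-out classification of the training set.
// trainingSet[c] holds the samples of class c; the return value is the number of
// correctly classified held-out samples (higher is better).
public class JackknifeSigma {
    public static int grade(double[][][] trainingSet, double sigma) {
        int correct = 0;
        for (int c = 0; c < trainingSet.length; c++) {
            for (int k = 0; k < trainingSet[c].length; k++) {
                double[] heldOut = trainingSet[c][k];
                int best = -1;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (int j = 0; j < trainingSet.length; j++) {
                    double[][] samples = (j == c)
                            ? withoutIndex(trainingSet[j], k)   // exclude the held-out sample
                            : trainingSet[j];
                    double s = ClassScore.score(heldOut, samples, sigma);
                    if (s > bestScore) { bestScore = s; best = j; }
                }
                if (best == c) correct++;                       // held-out sample classified correctly
            }
        }
        return correct;
    }

    // Copy of a class's samples with one sample (index skip) removed.
    private static double[][] withoutIndex(double[][] samples, int skip) {
        double[][] out = new double[samples.length - 1][];
        for (int i = 0, j = 0; i < samples.length; i++) {
            if (i != skip) out[j++] = samples[i];
        }
        return out;
    }
}
```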
Implementations
- Current Work
  - Basic PNN coded in Java
  - Simple examples
    - Boy/girl classifier (same as the perceptron example)
    - Classification of points in R² or R³ into the quadrants
  - Multithreaded PNN
    - For parallel processing (on supercomputers)
    - One thread per class
- Future Work
  - Artificially create a time series of a chaotic system and use a PNN to classify its features in order to reconstruct the strange attractor
  - Further test the classification abilities of the PNN
  - Test the PNN's tolerance to noisy inputs
Conclusion
- PNNs should be used if
  - A near-optimal classifier with a short training time is desired
  - Slow execution speed and large memory requirements can be tolerated
- No extensive testing of our implementation of PNNs has been done yet
- Once chaotic time series have been obtained, we will have more challenging data to work with
Simple Classifier Example
- The idea behind classification using a PNN
- Three classes or populations
  - X, O, and a third class of samples
- The ? is an unknown sample and must be classified into one of the populations
- A nearest-neighbour algorithm would classify the ? into the third class, because one of that class's samples is the closest sample to the ?
  - In other words, with nearest neighbour, the unknown belongs to the same population as the closest sample
Simple Classifier Example
- A more effective classifier would also take the other samples into consideration when making its decision
- However, not all samples should contribute equally to the classification of a particular unknown
  - Samples close to the unknown should have a large contribution (increasing the probability of classifying the unknown as that population)
  - Samples far from the unknown should have a small contribution (decreasing the probability of classifying the unknown as that population)
  - This defines a sphere-of-influence
Simple Classifier Example
- For each population, the more effective classifier calculates the average of all the contributions made by the samples in that population
- The unknown sample is then classified as a member of the population with the largest average
Architecture

[Figure: block diagram of the four layers — Input Layer, Pattern Layer, Summation Layer, Output Layer]
Architecture
  g_i(X) = \frac{1}{n_i} \sum_{k=1}^{n_i} e^{-\frac{\|X - X_{ik}\|^2}{\sigma^2}}

[Figure: worked example of the architecture with three classes. Input layer: the unknown vector X. Pattern layer (the training set): one unit per training sample X_ik, each computing y_ik = exp(−||X − X_ik||² / σ²), e.g. y_11, y_12 for class 1; y_21, y_22 for class 2; y_31, y_32, y_33 for class 3. Summation layer: one unit per class averaging that class's pattern outputs, e.g. g_1(X) = (y_11 + y_12) / 2. Output layer: Output = class of Max(g_1, g_2, g_3).]
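Putting the four layers together, a minimal sketch of the feedforward pass shown in the diagram above, again reusing the ClassScore sketch; this mirrors the structure of the network but is not the authors' implementation, and all names and the sample values in main are hypothetical.

```java
// Minimal PNN feedforward pass: input layer -> pattern layer -> summation layer -> output layer.
public class SimplePNN {
    private final double[][][] trainingSet;  // trainingSet[c] = samples of class c (the pattern layer)
    private final double sigma;              // smoothing parameter found during training

    public SimplePNN(double[][][] trainingSet, double sigma) {
        this.trainingSet = trainingSet;
        this.sigma = sigma;
    }

    // Returns the index of the class with the largest summation-layer output.
    public int classify(double[] x) {
        int bestClass = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < trainingSet.length; c++) {
            double g = ClassScore.score(x, trainingSet[c], sigma);  // summation unit for class c
            if (g > bestScore) { bestScore = g; bestClass = c; }    // output layer: argmax over classes
        }
        return bestClass;
    }

    public static void main(String[] args) {
        // Hypothetical 2-D example with two classes of two samples each.
        double[][][] train = {
            { {0.0, 0.0}, {0.5, 0.3} },   // class 0
            { {3.0, 3.0}, {2.7, 3.2} }    // class 1
        };
        SimplePNN pnn = new SimplePNN(train, 0.8);
        System.out.println("class of (2.9, 2.8): " + pnn.classify(new double[]{2.9, 2.8}));
    }
}
```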