
10 Image Segmentation

The whole is equal to the sum of its parts.
Euclid

The whole is greater than the sum of its parts.
Max Wertheimer

Preview

The material in the previous chapter began a transition from image processing methods whose inputs and outputs are images, to methods in which the inputs are images but the outputs are attributes extracted from those images. Most of the segmentation algorithms in this chapter are based on one of two basic properties of image intensity values: discontinuity and similarity. In the first category, the approach is to partition an image into regions based on abrupt changes in intensity, such as edges. Approaches in the second category are based on partitioning an image into regions that are similar according to a set of predefined criteria. Thresholding, region growing, and region splitting and merging are examples of methods in this category. We show that improvements in segmentation performance can be achieved by combining methods from distinct categories, such as techniques in which edge detection is combined with thresholding. We also discuss image segmentation using clustering and superpixels, and give an introduction to graph cuts, an approach ideally suited for extracting the principal regions of an image. This is followed by a discussion of image segmentation based on morphology, an approach that combines several of the attributes of segmentation based on the techniques presented in the first part of the chapter. We conclude the chapter with a brief discussion on the use of motion cues for segmentation.

Upon completion of this chapter, readers should:

- Understand the characteristics of various types of edges found in practice.
- Understand how to use spatial filtering for edge detection.
- Be familiar with other types of edge detection methods that go beyond spatial filtering.
- Understand image thresholding using several different approaches.
- Know how to combine thresholding and spatial filtering to improve segmentation.
- Be familiar with region-based segmentation, including clustering and superpixels.
- Understand how graph cuts and morphological watersheds are used for segmentation.
- Be familiar with basic techniques for utilizing motion in image segmentation.

10.1 FUNDAMENTALS

Let R represent the entire spatial region occupied by an image. We may view image segmentation as a process that partitions R into n subregions, R1, R2, …, Rn, such that

(a) ∪_{i=1}^{n} Ri = R.
(b) Ri is a connected set, for i = 1, 2, …, n.
(c) Ri ∩ Rj = ∅ for all i and j, i ≠ j.
(d) Q(Ri) = TRUE for i = 1, 2, …, n.
(e) Q(Ri ∪ Rj) = FALSE for any adjacent regions Ri and Rj.

where Q(Rk) is a logical predicate defined over the points in set Rk, and ∅ is the null set. The symbols ∪ and ∩ represent set union and intersection, respectively, as defined in Section 2.6. Two regions Ri and Rj are said to be adjacent if their union forms a connected set, as defined in Section 2.5. If the set formed by the union of two regions is not connected, the regions are said to be disjoint.

Condition (a) indicates that the segmentation must be complete, in the sense that every pixel must be in a region. Condition (b) requires that points in a region be connected in some predefined sense (e.g., the points must be 8-connected). Condition (c) says that the regions must be disjoint. Condition (d) deals with the properties that must be satisfied by the pixels in a segmented region—for example, Q(Ri) = TRUE if all pixels in Ri have the same intensity. Finally, condition (e) indicates that two adjacent regions Ri and Rj must be different in the sense of predicate Q. In general, Q can be a compound expression such as "Q(Ri) = TRUE if the average intensity of the pixels in region Ri is less than mi AND if the standard deviation of their intensity is greater than si," where mi and si are specified constants.

Thus, we see that the fundamental problem in segmentation is to partition an image into regions that satisfy the preceding conditions. Segmentation algorithms for monochrome images generally are based on one of two basic categories dealing with properties of intensity values: discontinuity and similarity. In the first category, we assume that boundaries of regions are sufficiently different from each other, and from the background, to allow boundary detection based on local discontinuities in intensity. Edge-based segmentation is the principal approach used in this category. Region-based segmentation approaches in the second category are based on partitioning an image into regions that are similar according to a set of predefined criteria.


FIGURE 10.1 (a) Image of a constant intensity region. (b) Boundary based on intensity discontinuities. (c) Result of segmentation. (d) Image of a texture region. (e) Result of intensity discontinuity computations (note the large number of small edges). (f) Result of segmentation based on region properties.

Figure 10.1 illustrates the preceding concepts. Figure 10.1(a) shows an image of a region of constant intensity superimposed on a darker background, also of constant intensity. These two regions comprise the overall image. Figure 10.1(b) shows the result of computing the boundary of the inner region based on intensity discontinuities. Points on the inside and outside of the boundary are black (zero) because there are no discontinuities in intensity in those regions. To segment the image, we assign one level (say, white) to the pixels on or inside the boundary, and another level (e.g., black) to all points exterior to the boundary. Figure 10.1(c) shows the result of such a procedure. We see that conditions (a) through (c) stated at the beginning of this section are satisfied by this result. The predicate of condition (d) is: if a pixel is on or inside the boundary, label it white; otherwise, label it black. We see that this predicate is TRUE for the points labeled black or white in Fig. 10.1(c). Similarly, the two segmented regions (object and background) satisfy condition (e).

The next three images illustrate region-based segmentation. Figure 10.1(d) is similar to Fig. 10.1(a), but the intensities of the inner region form a textured pattern. Figure 10.1(e) shows the result of computing intensity discontinuities in this image. The numerous spurious changes in intensity make it difficult to identify a unique boundary for the original image because many of the nonzero intensity changes are connected to the boundary, so edge-based segmentation is not a suitable approach. However, we note that the outer region is constant, so all we need to solve this segmentation problem is a predicate that differentiates between textured and constant regions. The standard deviation of pixel values is a measure that accomplishes this because it is nonzero in areas of the texture region, and zero otherwise. Figure 10.1(f) shows the result of dividing the original image into subregions of size 8 × 8. Each subregion was then labeled white if the standard deviation of its pixels was positive (i.e., if the predicate was TRUE), and zero otherwise. The result has a "blocky" appearance around the edge of the region because groups of 8 × 8 squares were labeled with the same intensity (smaller squares would have given a smoother region boundary). Finally, note that these results also satisfy the five segmentation conditions stated at the beginning of this section.
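To make the role of the predicate Q concrete, the following sketch reproduces the idea behind Fig. 10.1(f): label each 8 × 8 subregion white when the standard deviation of its pixels is positive, and black otherwise. This is a minimal illustration under the assumption that the image is a grayscale NumPy array; the function name is hypothetical and not part of the book.

```python
import numpy as np

def block_predicate_segmentation(image, block_size=8):
    """Label each block_size x block_size subregion 1 (white) if the standard
    deviation of its pixels is positive (textured), and 0 (black) otherwise.
    A minimal illustration of a logical predicate Q applied to subregions."""
    out = np.zeros(image.shape, dtype=np.uint8)
    rows, cols = image.shape
    for r in range(0, rows, block_size):
        for c in range(0, cols, block_size):
            tile = image[r:r + block_size, c:c + block_size]
            if tile.std() > 0:                  # predicate Q(tile) is TRUE
                out[r:r + block_size, c:c + block_size] = 1
    return out
```

Smaller blocks would give a smoother region boundary, at the cost of making the predicate more sensitive to local noise.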
10.2 POINT, LINE, AND EDGE DETECTION

The focus of this section is on segmentation methods that are based on detecting sharp, local changes in intensity. The three types of image characteristics in which we are interested are isolated points, lines, and edges. Edge pixels are pixels at which the intensity of an image changes abruptly, and edges (or edge segments) are sets of connected edge pixels (see Section 2.5 regarding connectivity). Edge detectors are local image processing tools designed to detect edge pixels. A line may be viewed as a (typically) thin edge segment in which the intensity of the background on either side of the line is either much higher or much lower than the intensity of the line pixels. In fact, as we will discuss later, lines give rise to so-called "roof edges." Finally, an isolated point may be viewed as a foreground (background) pixel surrounded by background (foreground) pixels. (When we refer to lines, we are referring to thin structures, typically just a few pixels thick. Such lines may correspond, for example, to elements of a digitized architectural drawing, or roads in a satellite image.)

BACKGROUND

As we saw in Section 3.5, local averaging smoothes an image. Given that averaging is analogous to integration, it is intuitive that abrupt, local changes in intensity can be detected using derivatives. For reasons that will become evident shortly, first- and second-order derivatives are particularly well suited for this purpose.

Derivatives of a digital function are defined in terms of finite differences. There are various ways to compute these differences but, as explained in Section 3.6, we require that any approximation used for first derivatives (1) must be zero in areas of constant intensity; (2) must be nonzero at the onset of an intensity step or ramp; and (3) must be nonzero at points along an intensity ramp. Similarly, we require that an approximation used for second derivatives (1) must be zero in areas of constant intensity; (2) must be nonzero at the onset and end of an intensity step or ramp; and (3) must be zero along intensity ramps. Because we are dealing with digital quantities whose values are finite, the maximum possible intensity change is also finite, and the shortest distance over which a change can occur is between adjacent pixels.

We obtain an approximation to the first-order derivative at an arbitrary point x of a one-dimensional function f(x) by expanding the function f(x + Δx) into a Taylor series about x:

    f(x + Δx) = f(x) + Δx ∂f(x)/∂x + ((Δx)²/2!) ∂²f(x)/∂x² + ((Δx)³/3!) ∂³f(x)/∂x³ + ···
              = Σ_{n=0}^{∞} ((Δx)^n / n!) ∂^n f(x)/∂x^n                                  (10-1)

where Δx is the separation between samples of f. (Recall that the notation n! means "n factorial": n! = 1·2· ··· ·n.) For our purposes, this separation is measured in pixel units. Thus, following the convention in the book, Δx = 1 for the sample preceding x and Δx = −1 for the sample following x. When Δx = 1, Eq. (10-1) becomes

    f(x + 1) = f(x) + ∂f(x)/∂x + (1/2!) ∂²f(x)/∂x² + (1/3!) ∂³f(x)/∂x³ + ···
             = Σ_{n=0}^{∞} (1/n!) ∂^n f(x)/∂x^n                                          (10-2)

(Although these are expressions in only one variable, we use partial derivative notation for consistency when we discuss functions of two variables later in this section.)


Similarly, when Δx = −1,

    f(x − 1) = f(x) − ∂f(x)/∂x + (1/2!) ∂²f(x)/∂x² − (1/3!) ∂³f(x)/∂x³ + ···
             = Σ_{n=0}^{∞} ((−1)^n / n!) ∂^n f(x)/∂x^n                                   (10-3)

In what follows, we compute intensity differences using just a few terms of the Taylor series. For first-order derivatives we use only the linear terms, and we can form differences in one of three ways.
The forward difference is obtained from Eq. (10-2):

    ∂f(x)/∂x = f'(x) = f(x + 1) − f(x)        (10-4)

where, as you can see, we kept only the linear terms. The backward difference is similarly obtained by keeping only the linear terms in Eq. (10-3):

    ∂f(x)/∂x = f'(x) = f(x) − f(x − 1)        (10-5)

and the central difference is obtained by subtracting Eq. (10-3) from Eq. (10-2):

    ∂f(x)/∂x = f'(x) = [f(x + 1) − f(x − 1)] / 2        (10-6)

The higher terms of the series that we did not use represent the error between an exact and an approximate derivative expansion. In general, the more terms we use from the Taylor series to represent a derivative, the more accurate the approximation will be. To include more terms implies that more points are used in the approximation, yielding a lower error. However, it turns out that central differences have a lower error for the same number of points (see Problem 10.1). For this reason, derivatives are usually expressed as central differences.
The second-order derivative based on a central difference, ∂²f(x)/∂x², is obtained by adding Eqs. (10-2) and (10-3):

    ∂²f(x)/∂x² = f''(x) = f(x + 1) − 2f(x) + f(x − 1)        (10-7)

To obtain the third-order central derivative we need one more point on either side of x. That is, we need the Taylor expansions for f(x + 2) and f(x − 2), which we obtain from Eqs. (10-2) and (10-3) with Δx = 2 and Δx = −2, respectively. The strategy is to combine the two Taylor expansions to eliminate all derivatives lower than the third. The result after ignoring all higher-order terms [see Problem 10.2(a)] is

    ∂³f(x)/∂x³ = f'''(x) = [f(x + 2) − 2f(x + 1) + 0f(x) + 2f(x − 1) − f(x − 2)] / 2        (10-8)

Similarly [see Problem 10.2(b)], the fourth finite difference (the highest we use in the book) after ignoring all higher-order terms is given by

    ∂⁴f(x)/∂x⁴ = f''''(x) = f(x + 2) − 4f(x + 1) + 6f(x) − 4f(x − 1) + f(x − 2)        (10-9)

TABLE 10.1 First four central digital derivatives (finite differences) for samples taken uniformly, Δx = 1 units apart.

                f(x+2)   f(x+1)   f(x)   f(x−1)   f(x−2)
    2 f'(x)                 1       0       −1
      f''(x)                1      −2        1
    2 f'''(x)      1       −2       0        2       −1
      f''''(x)     1       −4       6       −4        1

Table 10.1 summarizes the first four central derivatives just discussed. Note the symmetry of the coefficients about the center point. This symmetry is at the root of why central differences have a lower approximation error for the same number of points than the other two differences. For two variables, we apply the results in Table 10.1 to each variable independently. For example,

    ∂²f(x, y)/∂x² = f(x + 1, y) − 2f(x, y) + f(x − 1, y)        (10-10)

and

    ∂²f(x, y)/∂y² = f(x, y + 1) − 2f(x, y) + f(x, y − 1)        (10-11)
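As a quick numerical check of these formulas, the sketch below builds the central-difference kernels of Table 10.1 as NumPy arrays and applies the third-derivative kernel to f(x) = x³, whose exact third derivative is 6 everywhere. The kernel names and the use of np.convolve are illustrative choices, not part of the book.

```python
import numpy as np

# Central-difference kernels from Table 10.1, for samples one pixel apart.
# Coefficient order is [f(x-2), f(x-1), f(x), f(x+1), f(x+2)].
D1 = np.array([ 0.0, -1.0,  0.0,  1.0, 0.0]) / 2.0   # f'(x),    from Eq. (10-6)
D2 = np.array([ 0.0,  1.0, -2.0,  1.0, 0.0])         # f''(x),   Eq. (10-7)
D3 = np.array([-1.0,  2.0,  0.0, -2.0, 1.0]) / 2.0   # f'''(x),  Eq. (10-8)
D4 = np.array([ 1.0, -4.0,  6.0, -4.0, 1.0])         # f''''(x), Eq. (10-9)

f = np.arange(10.0) ** 3                    # test signal f(x) = x^3
# np.convolve flips its kernel, so reverse D3 to obtain a plain correlation.
d3 = np.convolve(f, D3[::-1], mode='same')
print(d3)                                   # about 6 away from the array borders
```

The zero-padded borders are the only places where the result deviates from 6; the interior values confirm that the five-point formula is exact for a cubic.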


It is easily verified that the first- and second-order derivatives in Eqs. (10-4) through (10-7) satisfy the conditions stated at the beginning of this section regarding derivatives of the first and second order. To illustrate this, consider Fig. 10.2. Part (a) shows an image of various objects, a line, and an isolated point. Figure 10.2(b) shows a horizontal intensity profile (scan line) through the center of the image, including the isolated point. Transitions in intensity between the solid objects and the background along the scan line show two types of edges: ramp edges (on the left) and step edges (on the right). As we will discuss later, intensity transitions involving thin objects such as lines often are referred to as roof edges.

Figure 10.2(c) shows a simplified profile, with just enough points to make it possible for us to analyze manually how the first- and second-order derivatives behave as they encounter a point, a line, and the edges of objects. In this diagram the transition in the ramp spans four pixels, the noise point is a single pixel, the line is three pixels thick, and the transition of the step edge takes place between adjacent pixels. The number of intensity levels was limited to eight for simplicity.

FIGURE 10.2 (a) Image. (b) Horizontal intensity profile that includes the isolated point indicated by the arrow. (c) Subsampled profile; the dashes were added for clarity. The numbers in the boxes are the intensity values of the dots shown in the profile. The derivatives were obtained using Eq. (10-4) for the first derivative and Eq. (10-7) for the second. The profile features, from left to right, are a ramp, an isolated point, a line, a flat segment, and a step.
    Intensity values:   5 5 4 3 2 1 0 0 0 6 0 0 0 0 1 3 1 0 0 0 0 7 7 7 7
    First derivative:  −1 −1 −1 −1 −1 0 0 6 −6 0 0 0 1 2 −2 −1 0 0 0 7 0 0 0
    Second derivative: −1 0 0 0 0 1 0 6 −12 6 0 0 1 1 −4 1 1 0 0 7 −7 0 0

Consider the properties of the first and second derivatives as we traverse the profile from left to right. Initially, the first-order derivative is nonzero at the onset and along the entire intensity ramp, while the second-order derivative is nonzero only at the onset and end of the ramp. Because the edges of digital images resemble this type of transition, we conclude that first-order derivatives produce "thick" edges, and second-order derivatives much thinner ones. Next we encounter the isolated noise point. Here, the magnitude of the response at the point is much stronger for the second- than for the first-order derivative. This is not unexpected, because a second-order derivative is much more aggressive than a first-order derivative in enhancing sharp changes. Thus, we can expect second-order derivatives to enhance fine detail (including noise) much more than first-order derivatives. The line in this example is rather thin, so it too is fine detail, and we see again that the second derivative has a larger magnitude. Finally, note in both the ramp and step edges that the second derivative has opposite signs (negative to positive or positive to negative) as it transitions into and out of an edge. This "double-edge" effect is an important characteristic that can be used to locate edges, as we will show later in this section. As we move into the edge, the sign of the second derivative is used also to determine whether an edge is a transition from light to dark (negative second derivative), or from dark to light (positive second derivative).

In summary, we arrive at the following conclusions: (1) First-order derivatives generally produce thicker edges. (2) Second-order derivatives have a stronger response to fine detail, such as thin lines, isolated points, and noise. (3) Second-order derivatives produce a double-edge response at ramp and step transitions in intensity. (4) The sign of the second derivative can be used to determine whether a transition into an edge is from light to dark or dark to light.

FIGURE 10.3 A general 3 × 3 spatial filter kernel. The w's are the kernel coefficients (weights), arranged as w1 w2 w3 / w4 w5 w6 / w7 w8 w9.

The approach of choice for computing first and second derivatives at every pixel location in an image is to use spatial convolution. For the 3 × 3 filter kernel in Fig. 10.3, the procedure is to compute the sum of products of the kernel coefficients with the intensity values in the region encompassed by the kernel, as we explained in Section 3.4. That is, the response of the filter at the center point of the kernel is

    Z = w1 z1 + w2 z2 + … + w9 z9 = Σ_{k=1}^{9} wk zk        (10-12)

where zk is the intensity of the pixel whose spatial location corresponds to the location of the kth kernel coefficient. (This equation is an expansion of Eq. (3-35) for a 3 × 3 kernel, valid at one point, and using simplified subscript notation for the kernel coefficients.)
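A minimal way to evaluate Eq. (10-12) at one pixel is a direct sum of products over the 3 × 3 neighborhood. The sketch below assumes a grayscale NumPy array and a location at least one pixel away from the image border; the function name is hypothetical.

```python
import numpy as np

def kernel_response_at(f, kernel, x, y):
    """Eq. (10-12): sum of products of the 3x3 kernel coefficients with the
    intensities under the kernel, centered at (x, y). Here x indexes rows
    and y indexes columns, following the book's coordinate convention."""
    region = f[x - 1:x + 2, y - 1:y + 2].astype(float)
    return float(np.sum(kernel * region))
```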
DETECTION OF ISOLATED POINTS

Based on the conclusions reached in the preceding section, we know that point detection should be based on the second derivative which, from the discussion in Section 3.6, means using the Laplacian:

    ∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y²        (10-13)


where the partial derivatives are computed using the second-order finite differences in Eqs. (10-10) and (10-11). The Laplacian is then

    ∇²f(x, y) = f(x + 1, y) + f(x − 1, y) + f(x, y + 1) + f(x, y − 1) − 4f(x, y)        (10-14)

As explained in Section 3.6, this expression can be implemented using the Laplacian kernel in Fig. 10.4(a). We then say that a point has been detected at a location (x, y) on which the kernel is centered if the absolute value of the response of the filter at that point exceeds a specified threshold. Such points are labeled 1 and all others are labeled 0 in the output image, thus producing a binary image. In other words, we use the expression

    g(x, y) = 1 if |Z(x, y)| > T, and 0 otherwise        (10-15)

where g(x, y) is the output image, T is a nonnegative threshold, and Z is given by Eq. (10-12). This formulation simply measures the weighted differences between a pixel and its 8-neighbors. Intuitively, the idea is that the intensity of an isolated point will be quite different from its surroundings, and thus will be easily detectable by this type of kernel. Differences in intensity that are considered of interest are those large enough (as determined by T) to be considered isolated points. Note that, as usual for a derivative kernel, the coefficients sum to zero, indicating that the filter response will be zero in areas of constant intensity.

FIGURE 10.4 (a) Laplacian kernel used for point detection, with coefficients 1 1 1 / 1 −8 1 / 1 1 1. (b) X-ray image of a turbine blade with a porosity manifested by a single black pixel. (c) Result of convolving the kernel with the image. (d) Result of using Eq. (10-15) was a single point (shown enlarged at the tip of the arrow). (Original image courtesy of X-TEK Systems, Ltd.)

EXAMPLE 10.1: Detection of isolated points in an image.

Figure 10.4(b) is an X-ray image of a turbine blade from a jet engine. The blade has a porosity manifested by a single black pixel in the upper-right quadrant of the image. Figure 10.4(c) is the result of filtering the image with the Laplacian kernel, and Fig. 10.4(d) shows the result of Eq. (10-15) with T equal to 90% of the highest absolute pixel value of the image in Fig. 10.4(c). The single pixel is clearly visible in this image at the tip of the arrow (the pixel was enlarged to enhance its visibility). This type of detection process is specialized because it is based on abrupt intensity changes at single-pixel locations that are surrounded by a homogeneous background in the area of the detector kernel. When this condition is not satisfied, other methods discussed in this chapter are more suitable for detecting intensity changes.
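The point-detection procedure of Eqs. (10-14) and (10-15) can be sketched in a few lines. The version below uses SciPy's correlate for the sum of products and, as in Example 10.1, sets T to a fraction of the largest absolute response; the function name and the border-handling mode are implementation assumptions, not part of the book.

```python
import numpy as np
from scipy.ndimage import correlate

# Laplacian kernel of Fig. 10.4(a)
LAPLACIAN = np.array([[1,  1, 1],
                      [1, -8, 1],
                      [1,  1, 1]], dtype=float)

def detect_isolated_points(image, frac=0.9):
    """Eq. (10-15): label a pixel 1 if the absolute Laplacian response exceeds
    T, here chosen as a fraction of the maximum response (Example 10.1 uses
    frac = 0.9)."""
    Z = correlate(image.astype(float), LAPLACIAN, mode='nearest')
    T = frac * np.abs(Z).max()
    return (np.abs(Z) > T).astype(np.uint8)
```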
LINE DETECTION

The next level of complexity is line detection. Based on the discussion earlier in this section, we know that for line detection we can expect second derivatives to result in a stronger filter response, and to produce thinner lines than first derivatives. Thus, we can use the Laplacian kernel in Fig. 10.4(a) for line detection also, keeping in mind that the double-line effect of the second derivative must be handled properly. The following example illustrates the procedure.

EXAMPLE 10.2: Using the Laplacian for line detection.

Figure 10.5(a) shows a 486 × 486 (binary) portion of a wire-bond mask for an electronic circuit, and Fig. 10.5(b) shows its Laplacian image. Because the Laplacian image contains negative values (see the discussion after Example 3.18), scaling is necessary for display. As the magnified section shows, mid gray represents zero, darker shades of gray represent negative values, and lighter shades are positive. The double-line effect is clearly visible in the magnified region.

At first, it might appear that the negative values can be handled simply by taking the absolute value of the Laplacian image. However, as Fig. 10.5(c) shows, this approach doubles the thickness of the lines. A more suitable approach is to use only the positive values of the Laplacian (in noisy situations we use the values that exceed a positive threshold to eliminate random variations about zero caused by the noise). As Fig. 10.5(d) shows, this approach results in thinner lines that generally are more useful. Note in Figs. 10.5(b) through (d) that when the lines are wide with respect to the size of the Laplacian kernel, the lines are separated by a zero "valley." This is not unexpected. For example, when the 3 × 3 kernel is centered on a line of constant intensity 5 pixels wide, the response will be zero, thus producing the effect just mentioned. When we talk about line detection, the assumption is that lines are thin with respect to the size of the detector. Lines that do not satisfy this assumption are best treated as regions and handled by the edge detection methods discussed in the following section.

FIGURE 10.5 (a) Original image. (b) Laplacian image; the magnified section shows the positive/negative double-line effect characteristic of the Laplacian. (c) Absolute value of the Laplacian. (d) Positive values of the Laplacian.


The Laplacian detector kernel in Fig. 10.4(a) is isotropic, so its response is independent of direction (with respect to the four directions of the 3 × 3 kernel: vertical, horizontal, and two diagonals). Often, interest lies in detecting lines in specified directions. Consider the kernels in Fig. 10.6. Suppose that an image with a constant background and containing various lines (oriented at 0°, ±45°, and 90°) is filtered with the first kernel. The maximum responses would occur at image locations in which a horizontal line passes through the middle row of the kernel. This is easily verified by sketching a simple array of 1's with a line of a different intensity (say, 5s) running horizontally through the array. A similar experiment would reveal that the second kernel in Fig. 10.6 responds best to lines oriented at +45°; the third kernel to vertical lines; and the fourth kernel to lines in the −45° direction. The preferred direction of each kernel is weighted with a larger coefficient (i.e., 2) than other possible directions. The coefficients in each kernel sum to zero, indicating a zero response in areas of constant intensity.

FIGURE 10.6 Line detection kernels. Detection angles are with respect to the axis system in Fig. 2.19, with positive angles measured counterclockwise with respect to the (vertical) x-axis. (a) Horizontal: −1 −1 −1 / 2 2 2 / −1 −1 −1. (b) +45°: 2 −1 −1 / −1 2 −1 / −1 −1 2. (c) Vertical: −1 2 −1 / −1 2 −1 / −1 2 −1. (d) −45°: −1 −1 2 / −1 2 −1 / 2 −1 −1.

Let Z1, Z2, Z3, and Z4 denote the responses of the kernels in Fig. 10.6, from left to right, where the Zs are given by Eq. (10-12). Suppose that an image is filtered with these four kernels, one at a time. If, at a given point in the image, |Zk| > |Zj| for all j ≠ k, that point is said to be more likely associated with a line in the direction of kernel k. For example, if at a point in the image, |Z1| > |Zj| for j = 2, 3, 4, that point is said to be more likely associated with a horizontal line. If we are interested in detecting all the lines in an image in the direction defined by a given kernel, we simply run the kernel through the image and threshold the absolute value of the result, as in Eq. (10-15). The nonzero points remaining after thresholding are the strongest responses which, for lines one pixel thick, correspond closest to the direction defined by the kernel. The following example illustrates this procedure.

EXAMPLE 10.3: Detecting lines in specified directions.

Figure 10.7(a) shows the image used in the previous example. Suppose that we are interested in finding all the lines that are one pixel thick and oriented at +45°. For this purpose, we use the kernel in Fig. 10.6(b). Figure 10.7(b) is the result of filtering the image with that kernel. As before, the shades darker than the gray background in Fig. 10.7(b) correspond to negative values. There are two principal segments in the image oriented in the +45° direction, one in the top left and one at the bottom right. Figures 10.7(c) and (d) show zoomed sections of Fig. 10.7(b) corresponding to these two areas. The straight line segment in Fig. 10.7(d) is brighter than the segment in Fig. 10.7(c) because the line segment in the bottom right of Fig. 10.7(a) is one pixel thick, while the one at the top left is not. The kernel is "tuned" to detect one-pixel-thick lines in the +45° direction, so we expect its response to be stronger when such lines are detected. Figure 10.7(e) shows the positive values of Fig. 10.7(b). Because we are interested in the strongest response, we let T equal 254 (the maximum value in Fig. 10.7(e) minus one). Figure 10.7(f) shows in white the points whose values satisfied the condition g > T, where g is the image in Fig. 10.7(e). The isolated points in the figure are points that also had similarly strong responses to the kernel. In the original image, these points and their immediate neighbors are oriented in such a way that the kernel produced a maximum response at those locations. These isolated points can be detected using the kernel in Fig. 10.4(a) and then deleted, or they can be deleted using morphological operators, as discussed in the last chapter.

FIGURE 10.7 (a) Image of a wire-bond template. (b) Result of processing with the +45° line detector kernel in Fig. 10.6. (c) Zoomed view of the top left region of (b). (d) Zoomed view of the bottom right region of (b). (e) The image in (b) with all negative values set to zero. (f) All points (in white) whose values satisfied the condition g > T, where g is the image in (e) and T = 254 (the maximum pixel value in the image minus 1). (The points in (f) were enlarged to make them easier to see.)
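The directional procedure just illustrated can be sketched as follows: filter with one of the Fig. 10.6 kernels, keep only the positive values, and threshold near the maximum response (Example 10.3 uses the maximum minus one). The dictionary keys, the function name, and the use of scipy.ndimage are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import correlate

# Line-detection kernels of Fig. 10.6: horizontal, +45 degrees, vertical, -45 degrees
KERNELS = {
    'horizontal': np.array([[-1, -1, -1], [ 2,  2,  2], [-1, -1, -1]], float),
    '+45':        np.array([[ 2, -1, -1], [-1,  2, -1], [-1, -1,  2]], float),
    'vertical':   np.array([[-1,  2, -1], [-1,  2, -1], [-1,  2, -1]], float),
    '-45':        np.array([[-1, -1,  2], [-1,  2, -1], [ 2, -1, -1]], float),
}

def detect_lines(image, direction, frac=0.99):
    """Filter with the kernel for one direction, keep the positive responses,
    and retain only the strongest of them, as in Example 10.3."""
    g = correlate(image.astype(float), KERNELS[direction], mode='nearest')
    g = np.clip(g, 0, None)          # keep only positive values
    T = frac * g.max()
    return (g > T).astype(np.uint8)
```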
EDGE MODELS

Edge detection is an approach used frequently for segmenting images based on abrupt (local) changes in intensity. We begin by introducing several ways to model edges and then discuss a number of approaches for edge detection.


FIGURE 10.8 From left to right, models (ideal representations) of a step, a ramp, and a roof edge, and their corresponding intensity profiles.

Edge models are classified according to their intensity profiles. A step edge is characterized by a transition between two intensity levels occurring ideally over the distance of one pixel. Figure 10.8(a) shows a section of a vertical step edge and a horizontal intensity profile through the edge. Step edges occur, for example, in images generated by a computer for use in areas such as solid modeling and animation. These clean, ideal edges can occur over the distance of one pixel, provided that no additional processing (such as smoothing) is used to make them look "real." Digital step edges are used frequently as edge models in algorithm development. For example, the Canny edge detection algorithm discussed later in this section was derived originally using a step-edge model.

In practice, digital images have edges that are blurred and noisy, with the degree of blurring determined principally by limitations in the focusing mechanism (e.g., lenses in the case of optical images), and the noise level determined principally by the electronic components of the imaging system. In such situations, edges are more closely modeled as having an intensity ramp profile, such as the edge in Fig. 10.8(b). The slope of the ramp is inversely proportional to the degree to which the edge is blurred. In this model, we no longer have a single "edge point" along the profile. Instead, an edge point now is any point contained in the ramp, and an edge segment would then be a set of such points that are connected.

A third type of edge is the so-called roof edge, having the characteristics illustrated in Fig. 10.8(c). Roof edges are models of lines through a region, with the base (width) of the edge being determined by the thickness and sharpness of the line. In the limit, when its base is one pixel wide, a roof edge is nothing more than a one-pixel-thick line running through a region in an image. Roof edges arise, for example, in range imaging, when thin objects (such as pipes) are closer to the sensor than the background (such as walls). The pipes appear brighter and thus create an image similar to the model in Fig. 10.8(c). Other areas in which roof edges appear routinely are in the digitization of line drawings and also in satellite images, where thin features, such as roads, can be modeled by this type of edge.

It is not unusual to find images that contain all three types of edges. Although blurring and noise result in deviations from the ideal shapes, edges in images that are reasonably sharp and have a moderate amount of noise do resemble the characteristics of the edge models in Fig. 10.8, as the profiles in Fig. 10.9 illustrate. What the models in Fig. 10.8 allow us to do is write mathematical expressions for edges in the development of image processing algorithms. The performance of these algorithms will depend on the differences between actual edges and the models used in developing the algorithms.

FIGURE 10.9 A 1508 × 1970 image showing (zoomed) actual ramp (bottom, left), step (top, right), and roof edge profiles. The profiles are from dark to light, in the areas enclosed by the small circles. The ramp and step profiles span 9 pixels and 2 pixels, respectively. The base of the roof edge is 3 pixels. (Original image courtesy of Dr. David R. Pickens, Vanderbilt University.)

FIGURE 10.10 (a) Two regions of constant intensity separated by an ideal ramp edge. (b) Detail near the edge, showing a horizontal intensity profile, and its first and second derivatives; the zero crossing of the second derivative is marked on the profile.

Figure 10.10(a) shows the image from which the segment in Fig. 10.8(b) was extracted. Figure 10.10(b) shows a horizontal intensity profile. This figure shows also the first and second derivatives of the intensity profile. Moving from left to right along the intensity profile, we note that the first derivative is positive at the onset of the ramp and at points on the ramp, and it is zero in areas of constant intensity. The second derivative is positive at the beginning of the ramp, negative at the end of the ramp, zero at points on the ramp, and zero at points of constant intensity. The signs of the derivatives just discussed would be reversed for an edge that transitions from light to dark. The intersection between the zero intensity axis and a line extending between the extrema of the second derivative marks a point called the zero crossing of the second derivative.
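The behavior shown in Fig. 10.10(b) is easy to reproduce numerically. The sketch below builds a 1-D ramp-edge profile, computes first and second differences with Eqs. (10-4) and (10-7), and estimates the zero crossing as the midpoint between the positive and negative extrema of the second derivative, following the geometric construction just described. The profile values are illustrative, not taken from the book.

```python
import numpy as np

# Dark flat region, a 4-pixel ramp, then a bright flat region (illustrative values)
profile = np.concatenate([np.zeros(6), np.linspace(0.0, 6.0, 5), np.full(6, 6.0)])

first = np.zeros_like(profile)
second = np.zeros_like(profile)
first[:-1] = profile[1:] - profile[:-1]                          # Eq. (10-4)
second[1:-1] = profile[2:] - 2.0 * profile[1:-1] + profile[:-2]  # Eq. (10-7)

# The zero crossing lies on the line joining the two extrema of the second
# derivative; for this symmetric ramp it is simply their midpoint.
zero_crossing = (np.argmax(second) + np.argmin(second)) / 2.0
print(first, second, zero_crossing)
```

The first derivative is nonzero along the whole ramp, while the second derivative is nonzero only at its onset and end, with the sign change that produces the zero crossing at the center of the ramp.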


We conclude from these observations that the magnitude of the first derivative can be used to detect the presence of an edge at a point in an image. Similarly, the sign of the second derivative can be used to determine whether an edge pixel lies on the dark or light side of an edge. Two additional properties of the second derivative around an edge are: (1) it produces two values for every edge in an image; and (2) its zero crossings can be used for locating the centers of thick edges, as we will show later in this section. Some edge models utilize a smooth transition into and out of the ramp (see Problem 10.9). However, the conclusions reached using those models are the same as with an ideal ramp, and working with the latter simplifies theoretical formulations. Finally, although attention thus far has been limited to a 1-D horizontal profile, a similar argument applies to an edge of any orientation in an image. We simply define a profile perpendicular to the edge direction at any desired point, and interpret the results in the same manner as for the vertical edge just discussed.

EXAMPLE 10.4: Behavior of the first and second derivatives in the region of a noisy edge.

The edge models in Fig. 10.8 are free of noise. The image segments in the first column in Fig. 10.11 show close-ups of four ramp edges that transition from a black region on the left to a white region on the right (keep in mind that the entire transition from black to white is a single edge). The image segment at the top left is free of noise. The other three images in the first column are corrupted by additive Gaussian noise with zero mean and standard deviation of 0.1, 1.0, and 10.0 intensity levels, respectively. The graph below each image is a horizontal intensity profile passing through the center of the image. All images have 8 bits of intensity resolution, with 0 and 255 representing black and white, respectively.

Consider the image at the top of the center column. As discussed in connection with Fig. 10.10(b), the derivative of the scan line on the left is zero in the constant areas. These are the two black bands shown in the derivative image. The derivatives at points on the ramp are constant and equal to the slope of the ramp. These constant values in the derivative image are shown in gray. As we move down the center column, the derivatives become increasingly different from the noiseless case. In fact, it would be difficult to associate the last profile in the center column with the first derivative of a ramp edge. What makes these results interesting is that the noise is almost visually undetectable in the images on the left column. These examples are good illustrations of the sensitivity of derivatives to noise.

As expected, the second derivative is even more sensitive to noise. The second derivative of the noiseless image is shown at the top of the right column. The thin white and black vertical lines are the positive and negative components of the second derivative, as explained in Fig. 10.10. The gray in these images represents zero (as discussed earlier, scaling causes zero to show as gray). The only noisy second derivative image that barely resembles the noiseless case corresponds to noise with a standard deviation of 0.1. The remaining second-derivative images and profiles clearly illustrate that it would be difficult indeed to detect their positive and negative components, which are the truly useful features of the second derivative in terms of edge detection.

The fact that such little visual noise can have such a significant impact on the two key derivatives used for detecting edges is an important issue to keep in mind. In particular, image smoothing should be a serious consideration prior to the use of derivatives in applications where noise with levels similar to those we have just discussed is likely to be present.
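The noise sensitivity illustrated in Example 10.4 can be reproduced with a 1-D analog of the profiles in Fig. 10.11: add zero-mean Gaussian noise of increasing standard deviation to a ramp edge and compare the spread of the first and second differences. The specific profile and the use of NumPy's random generator are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ideal 8-bit ramp edge: black flat, 9-pixel ramp, white flat
ramp = np.concatenate([np.zeros(20), np.linspace(0, 255, 9), np.full(20, 255.0)])

for sigma in (0.0, 0.1, 1.0, 10.0):
    noisy = ramp + rng.normal(0.0, sigma, ramp.size)
    d1 = noisy[1:] - noisy[:-1]                           # first derivative, Eq. (10-4)
    d2 = noisy[2:] - 2 * noisy[1:-1] + noisy[:-2]         # second derivative, Eq. (10-7)
    print(sigma, np.abs(d1).max(), np.abs(d2).max())
```

Even noise that is visually negligible in the profile produces large excursions in the second difference, which is why smoothing precedes derivative-based edge detection.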
In summary, the three steps performed typically for edge detection are:

1. Image smoothing for noise reduction. The need for this step is illustrated by the results in the second and third columns of Fig. 10.11.
2. Detection of edge points. As mentioned earlier, this is a local operation that extracts from an image all points that are potential edge-point candidates.
3. Edge localization. The objective of this step is to select from the candidate points only the points that are members of the set of points comprising an edge.

The remainder of this section deals with techniques for achieving these objectives.


FIGURE 10.11 First column: 8-bit images with values in the range [0, 255], and intensity profiles of a ramp edge corrupted by Gaussian noise of zero mean and standard deviations of 0.0, 0.1, 1.0, and 10.0 intensity levels, respectively. Second column: First-derivative images and intensity profiles. Third column: Second-derivative images and intensity profiles.

BASIC EDGE DETECTION

As illustrated in the preceding discussion, detecting changes in intensity for the purpose of finding edges can be accomplished using first- or second-order derivatives. We begin with first-order derivatives, and work with second-order derivatives in the following subsection.

The Image Gradient and Its Properties

The tool of choice for finding edge strength and direction at an arbitrary location (x, y) of an image, f, is the gradient, denoted by ∇f and defined as the vector

    ∇f(x, y) ≡ grad[f(x, y)] ≡ [gx(x, y), gy(x, y)]^T = [∂f(x, y)/∂x, ∂f(x, y)/∂y]^T        (10-16)

(For convenience, we repeat here some of the gradient concepts and equations introduced in Chapter 3.) This vector has the well-known property that it points in the direction of maximum rate of change of f at (x, y) (see Problem 10.10). Equation (10-16) is valid at an arbitrary (but single) point (x, y). When evaluated for all applicable values of x and y, ∇f(x, y) becomes a vector image, each element of which is a vector given by Eq. (10-16). The magnitude, M(x, y), of this gradient vector at a point (x, y) is given by its Euclidean vector norm:

    M(x, y) = ‖∇f(x, y)‖ = [gx²(x, y) + gy²(x, y)]^(1/2)        (10-17)

This is the value of the rate of change in the direction of the gradient vector at point (x, y). Note that M(x, y), ‖∇f(x, y)‖, gx(x, y), and gy(x, y) are arrays of the same size as f, created when x and y are allowed to vary over all pixel locations in f. It is common practice to refer to M(x, y) and ‖∇f(x, y)‖ as the gradient image, or simply as the gradient when the meaning is clear. The summation, square, and square root operations are elementwise operations, as defined in Section 2.6.
The direction of the gradient vector at a point (x, y) is given by

    α(x, y) = tan⁻¹[gy(x, y) / gx(x, y)]        (10-18)

Angles are measured in the counterclockwise direction with respect to the x-axis (see Fig. 2.19). This is also an image of the same size as f, created by the elementwise division of gy by gx over all applicable values of x and y. As the following example illustrates, the direction of an edge at a point (x, y) is orthogonal to the direction, α(x, y), of the gradient vector at the point.
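As a sketch of Eqs. (10-17) and (10-18), the function below computes gx and gy with central differences and returns the magnitude and angle images. It assumes the book's convention that x indexes rows and y indexes columns of the array; the function name is hypothetical.

```python
import numpy as np

def gradient_magnitude_angle(f):
    """Gradient magnitude, Eq. (10-17), and angle, Eq. (10-18), with the
    partial derivatives approximated by central differences, Eq. (10-6)."""
    f = f.astype(float)
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[1:-1, :] = (f[2:, :] - f[:-2, :]) / 2.0   # d/dx, x = row index (book convention)
    gy[:, 1:-1] = (f[:, 2:] - f[:, :-2]) / 2.0   # d/dy, y = column index
    M = np.hypot(gx, gy)                         # elementwise Euclidean norm
    alpha = np.arctan2(gy, gx)                   # angle measured from the x-axis
    return M, alpha
```

Using arctan2 instead of a plain arctan avoids division by zero where gx = 0 and keeps the full range of angles.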


EXAMPLE 10.5: Computing the gradient.

Figure 10.12(a) shows a zoomed section of an image containing a straight edge segment. Each square corresponds to a pixel, and we are interested in obtaining the strength and direction of the edge at the point highlighted with a box. The shaded pixels in this figure are assumed to have value 0, and the white pixels have value 1. We discuss after this example an approach for computing the derivatives in the x- and y-directions using a 3 × 3 neighborhood centered at a point. The method consists of subtracting the pixels in the top row of the neighborhood from the pixels in the bottom row to obtain the partial derivative in the x-direction. Similarly, we subtract the pixels in the left column from the pixels in the right column of the neighborhood to obtain the partial derivative in the y-direction. It then follows, using these differences as our estimates of the partials, that ∂f/∂x = −2 and ∂f/∂y = 2 at the point in question. Then,

    ∇f = [gx, gy]^T = [∂f/∂x, ∂f/∂y]^T = [−2, 2]^T

from which we obtain ‖∇f‖ = 2√2 at that point. Similarly, the direction of the gradient vector at the same point follows from Eq. (10-18): α = tan⁻¹(gy/gx) = −45°, which is the same as 135° measured in the positive (counterclockwise) direction with respect to the x-axis in our image coordinate system (see Fig. 2.19). Figure 10.12(b) shows the gradient vector and its direction angle.

As mentioned earlier, the direction of an edge at a point is orthogonal to the gradient vector at that point. So the direction angle of the edge in this example is α − 90° = 135° − 90° = 45°, as Fig. 10.12(c) shows. All edge points in Fig. 10.12(a) have the same gradient, so the entire edge segment is in the same direction. The gradient vector sometimes is called the edge normal. When the vector is normalized to unit length by dividing it by its magnitude, the resulting vector is referred to as the edge unit normal.

FIGURE 10.12 Using the gradient to determine edge strength and direction at a point. Note that the edge direction is perpendicular to the direction of the gradient vector at the point where the gradient is computed. Each square represents one pixel. (Recall from Fig. 2.19 that the origin of our coordinate system is at the top, left.)

Gradient Operators

Obtaining the gradient of an image requires computing the partial derivatives ∂f/∂x and ∂f/∂y at every pixel location in the image. For the gradient, we typically use a forward or centered finite difference (see Table 10.1). Using forward differences we obtain

    gx(x, y) = ∂f(x, y)/∂x = f(x + 1, y) − f(x, y)        (10-19)

and

    gy(x, y) = ∂f(x, y)/∂y = f(x, y + 1) − f(x, y)        (10-20)

These two equations can be implemented for all values of x and y by filtering f(x, y) with the 1-D kernels in Fig. 10.13.

FIGURE 10.13 1-D kernels used to implement Eqs. (10-19) and (10-20): the column kernel [−1; 1] and the row kernel [−1 1].

When diagonal edge direction is of interest, we need 2-D kernels. The Roberts cross-gradient operators (Roberts [1965]) are one of the earliest attempts to use 2-D kernels with a diagonal preference. Consider the 3 × 3 region in Fig. 10.14(a). The Roberts operators are based on implementing the diagonal differences

    gx = ∂f/∂x = (z9 − z5)        (10-21)

and

    gy = ∂f/∂y = (z8 − z6)        (10-22)

These derivatives can be implemented by filtering an image with the kernels shown in Figs. 10.14(b) and (c).

FIGURE 10.14 A 3 × 3 region of an image (the z's are intensity values), and various kernels used to compute the gradient at the point labeled z5. (b), (c) Roberts kernels: −1 0 / 0 1 and 0 −1 / 1 0. (d), (e) Prewitt kernels: −1 −1 −1 / 0 0 0 / 1 1 1 and −1 0 1 / −1 0 1 / −1 0 1. (f), (g) Sobel kernels: −1 −2 −1 / 0 0 0 / 1 2 1 and −1 0 1 / −2 0 2 / −1 0 1. (Recall the important result in Problem 3.32 that using a kernel whose coefficients sum to zero produces a filtered image whose pixels also sum to zero; this implies in general that some pixels will be negative. Similarly, if the kernel coefficients sum to 1, the sum of pixels in the original and filtered images will be the same; see Problem 3.31.)

Kernels of size 2 × 2 are simple conceptually, but they are not as useful for computing edge direction as kernels that are symmetric about their centers, the smallest of which are of size 3 × 3. These kernels take into account the nature of the data on opposite sides of the center point, and thus carry more information regarding the direction of an edge. The simplest digital approximations to the partial derivatives using kernels of size 3 × 3 are given by

    gx = ∂f/∂x = (z7 + z8 + z9) − (z1 + z2 + z3)
    gy = ∂f/∂y = (z3 + z6 + z9) − (z1 + z4 + z7)        (10-23)

(Observe that these two equations are first-order central differences as given in Eq. (10-6), but multiplied by 2. Filter kernels used to compute the derivatives needed for the gradient are often called gradient operators, difference operators, edge operators, or edge detectors.) In this formulation, the difference between the third and first rows of the 3 × 3 region approximates the derivative in the x-direction, and the difference between the third and first columns approximates the derivative in the y-direction. Intuitively, we would expect these approximations to be more accurate than the approximations obtained using the Roberts operators. The two components of Eq. (10-23) can be implemented over an entire image by filtering it with the two kernels in Figs. 10.14(d) and (e). These kernels are called the Prewitt operators (Prewitt [1970]).


A slight variation of the preceding two equations uses a weight of 2 in the center coefficient:

    gx = ∂f/∂x = (z7 + 2z8 + z9) − (z1 + 2z2 + z3)        (10-24)

and

    gy = ∂f/∂y = (z3 + 2z6 + z9) − (z1 + 2z4 + z7)        (10-25)

It can be demonstrated (see Problem 10.12) that using a 2 in the center location provides image smoothing. Figures 10.14(f) and (g) show the kernels used to implement Eqs. (10-24) and (10-25). These kernels are called the Sobel operators (Sobel [1970]). The Prewitt kernels are simpler to implement than the Sobel kernels, but the slight computational difference between them typically is not an issue. The fact that the Sobel kernels have better noise-suppression (smoothing) characteristics makes them preferable because, as mentioned earlier in the discussion of Fig. 10.11, noise suppression is an important issue when dealing with derivatives. Note that the coefficients of all the kernels in Fig. 10.14 sum to zero, thus giving a response of zero in areas of constant intensity, as expected of derivative operators.

Any of the pairs of kernels from Fig. 10.14 are convolved with an image to obtain the gradient components gx and gy at every pixel location. These two partial derivative arrays are then used to estimate edge strength and direction. Obtaining the magnitude of the gradient requires the computations in Eq. (10-17). This implementation is not always desirable because of the computational burden required by squares and square roots, and an approach used frequently is to approximate the magnitude of the gradient by absolute values:

    M(x, y) ≈ |gx| + |gy|        (10-26)

This equation is more attractive computationally, and it still preserves relative changes in intensity levels. The price paid for this advantage is that the resulting filters will not be isotropic (invariant to rotation) in general. However, this is not an issue when kernels such as the Prewitt and Sobel kernels are used to compute gx and gy, because these kernels give isotropic results only for vertical and horizontal edges. This means that results would be isotropic only for edges in those two directions anyway, regardless of which of the two equations is used. That is, Eqs. (10-17) and (10-26) give identical results for vertical and horizontal edges when either the Sobel or Prewitt kernels are used (see Problem 10.11).
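A minimal Sobel-based edge map following Eqs. (10-24) through (10-26) is sketched below, assuming a grayscale NumPy array; the correlate call and border mode are implementation choices, not the book's.

```python
import numpy as np
from scipy.ndimage import correlate

SOBEL_X = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=float)   # Eq. (10-24), Fig. 10.14(f)
SOBEL_Y = SOBEL_X.T                               # Eq. (10-25), Fig. 10.14(g)

def sobel_gradient(image):
    """Gradient components and the absolute-value magnitude of Eq. (10-26)."""
    f = image.astype(float)
    gx = correlate(f, SOBEL_X, mode='nearest')
    gy = correlate(f, SOBEL_Y, mode='nearest')
    return gx, gy, np.abs(gx) + np.abs(gy)
```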
The 3 × 3 kernels in Fig. 10.14 exhibit their strongest response predominantly for vertical and horizontal edges. The Kirsch compass kernels (Kirsch [1971]) in Fig. 10.15 are designed to detect edge magnitude and direction (angle) in all eight compass directions. Instead of computing the magnitude using Eq. (10-17) and the angle using Eq. (10-18), Kirsch's approach was to determine the edge magnitude by convolving an image with all eight kernels and assign the edge magnitude at a point as the response of the kernel that gave the strongest convolution value at that point. The edge angle at that point is then the direction associated with that kernel. For example, if the strongest value at a point in the image resulted from using the north (N) kernel, the edge magnitude at that point would be assigned the response of that kernel, and the direction would be 0° (because compass kernel pairs differ by a rotation of 180°, choosing the maximum response will always result in a positive number). Although when working with, say, the Sobel kernels, we think of a north or south edge as being vertical, the N and S compass kernels differentiate between the two, the difference being the direction of the intensity transitions defining the edge. For example, assuming that intensity values are in the range [0, 1], the binary edge in Fig. 10.8(a) is defined by black (0) on the left and white (1) on the right. When all Kirsch kernels are applied to this edge, the N kernel will yield the highest value, thus indicating an edge oriented in the north direction (at the point of the computation).
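A sketch of Kirsch's procedure: build the eight compass kernels, filter with each, and take the per-pixel maximum as the edge magnitude, with the winning kernel giving the direction. The kernels here are generated by rotating the ring of outer coefficients of the N kernel in 45° steps, which reproduces the coefficient sets shown in Fig. 10.15; the stated ordering and the function names are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import correlate

RING_POS = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
N_RING = [-3, -3, 5, 5, 5, -3, -3, -3]   # outer ring of the N kernel in Fig. 10.15

def kirsch_kernels():
    """Eight compass kernels obtained by rotating the ring of the N kernel in
    45-degree steps (intended to follow the N, NW, W, SW, S, SE, E, NE order
    of Fig. 10.15; the center coefficient is always 0)."""
    kernels = []
    for shift in range(8):
        k = np.zeros((3, 3))
        for (r, c), v in zip(RING_POS, np.roll(N_RING, -shift)):
            k[r, c] = v
        kernels.append(k)
    return kernels

def kirsch_edges(image):
    """Edge magnitude = strongest kernel response at each pixel; the index of
    the winning kernel identifies the compass direction (0 = N, 1 = NW, ...)."""
    responses = np.stack([correlate(image.astype(float), k, mode='nearest')
                          for k in kirsch_kernels()])
    return responses.max(axis=0), responses.argmax(axis=0)
```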


FIGURE 10.15 Kirsch compass kernels. The edge direction of strongest response of each kernel is labeled below it. N: −3 −3 5 / −3 0 5 / −3 −3 5. NW: −3 5 5 / −3 0 5 / −3 −3 −3. W: 5 5 5 / −3 0 −3 / −3 −3 −3. SW: 5 5 −3 / 5 0 −3 / −3 −3 −3. S: 5 −3 −3 / 5 0 −3 / 5 −3 −3. SE: −3 −3 −3 / 5 0 −3 / 5 5 −3. E: −3 −3 −3 / −3 0 −3 / 5 5 5. NE: −3 −3 −3 / −3 0 5 / −3 5 5.

EXAMPLE 10.6: Illustration of the 2-D gradient magnitude and angle.

Figure 10.16 illustrates the Sobel absolute value response of the two components of the gradient, gx and gy, as well as the gradient image formed from the sum of these two components. The directionality of the horizontal and vertical components of the gradient is evident in Figs. 10.16(b) and (c). Note, for example, how strong the roof tile, horizontal brick joints, and horizontal segments of the windows are in Fig. 10.16(b) compared to other edges. In contrast, Fig. 10.16(c) favors features such as the vertical components of the façade and windows. It is common terminology to use the term edge map when referring to an image whose principal features are edges, such as gradient magnitude images. The intensities of the image in Fig. 10.16(a) were scaled to the range [0, 1]. We use values in this range to simplify parameter selection in the various methods for edge detection discussed in this section.

FIGURE 10.16 (a) Image of size 834 × 1114 pixels, with intensity values scaled to the range [0, 1]. (b) gx, the component of the gradient in the x-direction, obtained using the Sobel kernel in Fig. 10.14(f) to filter the image. (c) gy, obtained using the kernel in Fig. 10.14(g). (d) The gradient image, |gx| + |gy|.

Figure 10.17 shows the gradient angle image computed using Eq. (10-18). In general, angle images are not as useful as gradient magnitude images for edge detection, but they do complement the information extracted from an image using the magnitude of the gradient. For instance, the constant intensity areas in Fig. 10.16(a), such as the front edge of the sloping roof and top horizontal bands of the front wall, are constant in Fig. 10.17, indicating that the gradient vector direction at all the pixel locations in those regions is the same. As we will show later in this section, angle information plays a key supporting role in the implementation of the Canny edge detection algorithm, a widely used edge detection scheme.

FIGURE 10.17 Gradient angle image computed using Eq. (10-18). Areas of constant intensity in this image indicate that the direction of the gradient vector is the same at all the pixel locations in those regions.

The original image in Fig. 10.16(a) is of reasonably high resolution, and at the distance the image was acquired, the contribution made to image detail by the wall bricks is significant. This level of fine detail often is undesirable in edge detection because it tends to act as noise, which is enhanced by derivative computations and thus complicates detection of the principal edges. One way to reduce fine detail is to smooth the image prior to computing the edges. Figure 10.18 shows the same sequence of images as in Fig. 10.16, but with the original image smoothed first using a 5 × 5 averaging filter (see Section 3.5 regarding smoothing filters). The response of each kernel now shows almost no contribution due to the bricks, with the results being dominated mostly by the principal edges in the image.

Figures 10.16 and 10.18 show that the horizontal and vertical Sobel kernels do not differentiate between edges in the ±45° directions. If it is important to emphasize edges oriented in particular diagonal directions, then one of the Kirsch kernels in Fig. 10.15 should be used. Figures 10.19(a) and (b) show the responses of the 45° (NW) and −45° (SW) Kirsch kernels, respectively. The stronger diagonal selectivity of these kernels is evident in these figures. Both kernels have similar responses to horizontal and vertical edges, but the response in these directions is weaker.
filter the image.
(c) g y , obtained
using the kernel The threshold used to
generate Fig. 10.20(a)
Combining the Gradient with Thresholding
in Fig. 10.14(g). was selected so that most The results in Fig. 10.18 show that edge detection can be made more selective by
(d) The gradient of the small edges caused
image, g x + g y . by the bricks were smoothing the image prior to computing the gradient. Another approach aimed
eliminated. This was the
same objective as when
at achieving the same objective is to threshold the gradient image. For example,
the image in Fig. 10.16(a) Fig. 10.20(a) shows the gradient image from Fig. 10.16(d), thresholded so that pix-
was smoothed prior to
computing the gradient.
els with values greater than or equal to 33% of the maximum value of the gradi-
ent image are shown in white, while pixels below the threshold value are shown in

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 721 6/16/2017 2:12:53 PM DIP4E_GLOBAL_Print_Ready.indb 722 6/16/2017 2:12:53 PM


10.2 Point, Line, and Edge Detection 723 724 Chapter 10 Image Segmentation

a b a b
c d FIGURE 10.20
FIGURE 10.18 (a) Result of
Same sequence as thresholding
in Fig. 10.16, but Fig. 10.16(d), the
with the original gradient of the
image smoothed original image.
using a 5 × 5 aver- (b) Result of
aging kernel prior thresholding
to edge detection. Fig. 10.18(d), the
gradient of the
smoothed image.

MORE ADVANCED TECHNIQUES FOR EDGE DETECTION


The edge-detection methods discussed in the previous subsections are based on fil-
tering an image with one or more kernels, with no provisions made for edge char-
acteristics and noise content. In this section, we discuss more advanced techniques
that attempt to improve on simple edge-detection methods by taking into account
factors such as image noise and the nature of edges themselves.

The Marr-Hildreth Edge Detector


One of the earliest successful attempts at incorporating more sophisticated analy-
black. Comparing this image with Fig. 10.16(d), we see that there are fewer edges
sis into the edge-finding process is attributed to Marr and Hildreth [1980]. Edge-
in the thresholded image, and that the edges in this image are much sharper (see,
detection methods in use at the time were based on small operators, such as the
for example, the edges in the roof tile). On the other hand, numerous edges, such
Sobel kernels discussed earlier. Marr and Hildreth argued (1) that intensity chang-
as the sloping line defining the far edge of the roof (see arrow), are broken in the
es are not independent of image scale, implying that their detection requires using
thresholded image.
operators of different sizes; and (2) that a sudden intensity change will give rise to a
When interest lies both in highlighting the principal edges and on maintaining
peak or trough in the first derivative or, equivalently, to a zero crossing in the second
as much connectivity as possible, it is common practice to use both smoothing and
derivative (as we saw in Fig. 10.10).
thresholding. Figure 10.20(b) shows the result of thresholding Fig. 10.18(d), which is
These ideas suggest that an operator used for edge detection should have two
the gradient of the smoothed image. This result shows a reduced number of broken
salient features. First and foremost, it should be a differential operator capable of
edges; for instance, compare the corresponding edges identified by the arrows in
Figs. 10.20(a) and (b). computing a digital approximation of the first or second derivative at every point in
the image. Second, it should be capable of being “tuned” to act at any desired scale,
so that large operators can be used to detect blurry edges and small operators to
a b detect sharply focused fine detail.
FIGURE 10.19 Marr and Hildreth suggested that the most satisfactory operator fulfilling these
Diagonal edge conditions is the filter ( 2G where, as defined in Section 3.6, ( 2 is the Laplacian, and
detection. Equation (10-27) differs G is the 2-D Gaussian function
(a) Result of using from the definition of a
the Kirsch kernel in Gaussian function by a x2 + y2
multiplicative constant −
Fig. 10.15(c). 2s 2
[see Eq. (3-45)]. Here, G( x, y) = e (10-27)
(b) Result of using we are interested only in
the kernel in Fig. the general shape of the
10.15(d). The input Gaussian function.
with standard deviation s (sometimes s is called the space constant in this context).
image in both cases
was Fig. 10.18(a). We find an expression for ( 2G by applying the Laplacian to Eq. (10-27):

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 723 6/16/2017 2:12:54 PM DIP4E_GLOBAL_Print_Ready.indb 724 6/16/2017 2:12:55 PM


10.2 Point, Line, and Edge Detection 725 726 Chapter 10 Image Segmentation

∂ 2G( x, y) ∂ 2G( x, y) a b ( 2G
∇ 2G( x, y) = + c d
∂x 2 ∂y 2
FIGURE 10.21
x2 + y2 x2 + y2 (a) 3-D plot of
∂ −x − 2s2
∂ −y − 2s2
= a e b + a e b (10-28) the negative of the
∂x s 2 ∂y s 2 LoG.
x2 + y2 x2 + y2 (b) Negative of
x2 1 − 2 y2 1 − 2 the LoG
=a − 2b e 2s + a − 2b e 2s
s4 s s4 s displayed as an
image.
Collecting terms, we obtain (c) Cross section
of (a) showing y
x
x2 + y2 zero crossings.
x 2 + y 2 − 2s 2 − (d) 5 × 5 kernel
∇ 2G( x, y) = a be
2s2 (10-29) (2G
s4 approximation to
0 0 %1 0 0
the shape in (a).
The negative
This expression is called the Laplacian of a Gaussian (LoG). of this kernel 0 %1 %2 %1 0
Figures 10.21(a) through (c) show a 3-D plot, image, and cross-section of the would be used in
negative of the LoG function (note that the zero crossings of the LoG occur at practice. %1 %2 16 %2 %1
x + y 2 = 2s 2 , which defines a circle of radius 2s centered on the peak of the
2

Gaussian function). Because of the shape illustrated in Fig. 10.21(a), the LoG func- 0 %1 %2 %1 0
tion sometimes is called the Mexican hat operator. Figure 10.21(d) shows a 5 × 5 Zero crossing Zero crossing
kernel that approximates the shape in Fig. 10.21(a) (normally, we would use the neg- 0 0 %1 0 0
ative of this kernel). This approximation is not unique. Its purpose is to capture the 2 2s
essential shape of the LoG function; in terms of Fig. 10.21(a), this means a positive,
central term surrounded by an adjacent, negative region whose values decrease as a
direction, thus avoiding having to use multiple kernels to calculate the strongest
function of distance from the origin, and a zero outer region. The coefficients must
sum to zero so that the response of the kernel is zero in areas of constant intensity. response at any point in the image.
The Marr-Hildreth algorithm consists of convolving the LoG kernel with an input
Filter kernels of arbitrary size (but fixed s) can be generated by sampling Eq. (10-29),
and scaling the coefficients so that they sum to zero. A more effective approach for image,
generating a LoG kernel is sampling Eq. (10-27) to the desired size, then convolving
the resulting array with a Laplacian kernel, such as the kernel in Fig. 10.4(a). Because g( x, y) = ( 2G( x, y) ! f ( x, y) (10-30)
convolving an image with a kernel whose coefficients sum to zero yields an image
whose elements also sum to zero (see Problems 3.32 and 10.16), this approach auto- and then finding the zero crossings of g( x, y) to determine the locations of edges in
matically satisfies the requirement that the sum of the LoG kernel coefficients be This expression is
f ( x, y). Because the Laplacian and convolution are linear processes, we can write
implemented in the Eq. (10-30) as
zero. We will discuss size selection for LoG filter later in this section. spatial domain using
There are two fundamental ideas behind the selection of the operator ∇ 2G. First, Eq. (3-35). It can be
implemented also in the g( x, y) = ∇ 2 [G( x, y) ! f ( x, y)] (10-31)
the Gaussian part of the operator blurs the image, thus reducing the intensity of frequency domain using
structures (including noise) at scales much smaller than s. Unlike the averaging Eq. (4-104).
indicating that we can smooth the image first with a Gaussian filter and then com-
filter used in Fig. 10.18, the Gaussian function is smooth in both the spatial and
frequency domains (see Section 4.8), and is thus less likely to introduce artifacts pute the Laplacian of the result. These two equations give identical results.
(e.g., ringing) not present in the original image. The other idea concerns the second- The Marr-Hildreth edge-detection algorithm may be summarized as follows:
derivative properties of the Laplacian operator, ∇ 2 . Although first derivatives can 1. Filter the input image with an n × n Gaussian lowpass kernel obtained by sam-
be used for detecting abrupt changes in intensity, they are directional operators. The pling Eq. (10-27).
Laplacian, on the other hand, has the important advantage of being isotropic (invari- 2. Compute the Laplacian of the image resulting from Step 1 using, for example,
ant to rotation), which not only corresponds to characteristics of the human visual the 3 × 3 kernel in Fig. 10.4(a). [Steps 1 and 2 implement Eq. (10-31).]
system (Marr [1982]) but also responds equally to changes in intensity in any kernel
3. Find the zero crossings of the image from Step 2.

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 725 6/16/2017 2:12:55 PM DIP4E_GLOBAL_Print_Ready.indb 726 6/16/2017 2:12:57 PM


10.2 Point, Line, and Edge Detection 727 728 Chapter 10 Image Segmentation

To specify the size of the Gaussian kernel, recall from our discussion of Fig. 3.35 that a b
the values of a Gaussian function at a distance larger than 3s from the mean are c d
small enough so that they can be ignored. As discussed in Section 3.5, this implies FIGURE 10.22
As explained in Section
3.5, <⋅= and :⋅; denote the
using a Gaussian kernel of size L 6sM × L 6sM , where L 6sM denotes the ceiling of 6s; that (a) Image of size
ceiling and floor func- is, smallest integer not less than 6s. Because we work with kernels of odd dimen- 834 × 1114 pixels,
tions. That is, the ceiling sions, we would use the smallest odd integer satisfying this condition. Using a kernel with intensity
and floor functions map values scaled to the
a real number to the smaller than this will “truncate” the LoG function, with the degree of truncation
range [0, 1].
smallest following, or the
largest previous, integer,
being inversely proportional to the size of the kernel. Using a larger kernel would (b) Result of
respectively. make little difference in the result. Steps 1 and 2 of
One approach for finding the zero crossings at any pixel, p, of the filtered image, the Marr-Hildreth
g( x, y), is to use a 3 × 3 neighborhood centered at p. A zero crossing at p implies algorithm using
s = 4 and n = 25.
that the signs of at least two of its opposing neighboring pixels must differ. There are
Attempts to find zero (c) Zero cross-
crossings by finding the four cases to test: left/right, up/down, and the two diagonals. If the values of g( x, y) ings of (b) using
coordinates (x, y) where
g(x, y) = 0 are impractical
are being compared against a threshold (a common approach), then not only must a threshold of 0
because of noise and the signs of opposing neighbors be different, but the absolute value of their numeri- (note the closed-
other computational cal difference must also exceed the threshold before we can call p a zero-crossing loop edges).
inaccuracies. (d) Zero cross-
pixel. We illustrate this approach in Example 10.7.
ings found using a
Computing zero crossings is the key feature of the Marr-Hildreth edge-detection threshold equal to
method. The approach discussed in the previous paragraph is attractive because of 4% of the maxi-
its simplicity of implementation and because it generally gives good results. If the mum value of the
accuracy of the zero-crossing locations found using this method is inadequate in a image in (b). Note
the thin edges.
particular application, then a technique proposed by Huertas and Medioni [1986]
for finding zero crossings with subpixel accuracy can be employed.
with s1 > s 2 . Experimental results suggest that certain “channels” in the human
vision system are selective with respect to orientation and frequency, and can be
EXAMPLE 10.7 : Illustration of the Marr-Hildreth edge-detection method. modeled using Eq. (10-32) with a ratio of standard deviations of 1.75:1. Using the
Figure 10.22(a) shows the building image used earlier and Fig. 10.22(b) is the result of Steps 1 and 2 of ratio 1.6:1 preserves the basic characteristics of these observations and also pro-
the Marr-Hildreth algorithm, using s = 4 (approximately 0.5% of the short dimension of the image) vides a closer “engineering” approximation to the LoG function (Marr and Hil-
and n = 25 to satisfy the size condition stated above. As in Fig. 10.5, the gray tones in this image are due dreth [1980]). In order for the LoG and DoG to have the same zero crossings, the
to scaling. Figure 10.22(c) shows the zero crossings obtained using the 3 × 3 neighborhood approach just value of s for the LoG must be selected based on the following equation (see
discussed, with a threshold of zero. Note that all the edges form closed loops. This so-called “spaghetti Problem 10.19):
effect” is a serious drawback of this method when a threshold value of zero is used (see Problem 10.17).
We avoid closed-loop edges by using a positive threshold. s12 s22  s2 
s2 = ln  1  (10-33)
Figure 10.22(d) shows the result of using a threshold approximately equal to 4% of the maximum s12 − s22  s22 
value of the LoG image. The majority of the principal edges were readily detected, and “irrelevant” fea-
tures, such as the edges due to the bricks and the tile roof, were filtered out. This type of performance Although the zero crossings of the LoG and DoG will be the same when this value
is virtually impossible to obtain using the gradient-based edge-detection techniques discussed earlier. of s is used, their amplitude scales will be different. We can make them compatible
Another important consequence of using zero crossings for edge detection is that the resulting edges are by scaling both functions so that they have the same value at the origin.
1 pixel thick. This property simplifies subsequent stages of processing, such as edge linking. The profiles in Figs. 10.23(a) and (b) were generated with standard devia-
tion ratios of 1:1.75 and 1:1.6, respectively (by convention, the curves shown are
inverted, as in Fig. 10.21). The LoG profiles are the solid lines, and the DoG profiles
It is possible to approximate the LoG function in Eq. (10-29) by a difference of are dotted. The curves shown are intensity profiles through the center of the LoG
Gaussians (DoG): and DoG arrays, generated by sampling Eqs. (10-29) and (10-32), respectively. The
2 2 2 2
amplitude of all curves at the origin were normalized to 1. As Fig. 10.23(b) shows,
1 x +y 1 x +y
DG ( x, y) =

e 2 s12 − e − 2 s22 (10-32) the ratio 1:1.6 yielded a slightly closer approximation of the LoG and DoG func-
2ps12 2ps 22 tions (for example, compare the bottom lobes of the two figures).

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 727 6/16/2017 2:12:59 PM DIP4E_GLOBAL_Print_Ready.indb 728 6/16/2017 2:12:59 PM


10.2 Point, Line, and Edge Detection 729 730 Chapter 10 Image Segmentation

a b where the approximation was only about 20% worse that using the optimized
FIGURE 10.23 numerical solution (a difference of this magnitude generally is visually impercep-
(a) Negatives of tible in most applications).
the LoG (solid) Generalizing the preceding result to 2-D involves recognizing that the 1-D
and DoG approach still applies in the direction of the edge normal (see Fig. 10.12). Because
(dotted) profiles
using a s ratio of the direction of the normal is unknown beforehand, this would require applying the
1.75:1. (b) Profiles 1-D edge detector in all possible directions. This task can be approximated by first
obtained using a smoothing the image with a circular 2-D Gaussian function, computing the gradient
ratio of 1.6:1. of the result, and then using the gradient magnitude and direction to estimate edge
strength and direction at every point.
Gaussian kernels are separable (see Section 3.4). Therefore, both the LoG and Let f ( x, y) denote the input image and G( x, y) denote the Gaussian function:
the DoG filtering operations can be implemented with 1-D convolutions instead of
using 2-D convolutions directly (see Problem 10.19). For an image of size M × N −
x2 + y2
G( x, y) = e 2s 2 (10-35)
and a kernel of size n × n, doing so reduces the number of multiplications and addi-
tions for each convolution from being proportional to n 2 MN for 2-D convolutions
to being proportional to nMN for 1-D convolutions. This implementation difference We form a smoothed image, fs ( x, y), by convolving f and G:
is significant. For example, if n = 25, a 1-D implementation will require on the order
of 12 times fewer multiplication and addition operations than using 2-D convolution. fs ( x, y) = G( x, y) ! f ( x, y) (10-36)

The Canny Edge Detector This operation is followed by computing the gradient magnitude and direction
(angle), as discussed earlier:
Although the algorithm is more complex, the performance of the Canny edge detec-
tor (Canny [1986]) discussed in this section is superior in general to the edge detec-
Ms ( x, y) = (fs ( x, y) = g x2 ( x, y) + g y2 ( x, y) (10-37)
tors discussed thus far. Canny’s approach is based on three basic objectives:
and
1. Low error rate. All edges should be found, and there should be no spurious
 g y ( x, y) 
responses. a( x, y) = tan −1   (10-38)
2. Edge points should be well localized. The edges located must be as close as pos-  g x ( x, y) 
sible to the true edges. That is, the distance between a point marked as an edge
with g x ( x, y) = ∂fs ( x, y) ∂x and g y ( x, y) = ∂fs ( x, y) ∂y. Any of the derivative fil-
by the detector and the center of the true edge should be minimum.
ter kernel pairs in Fig. 10.14 can be used to obtain g x ( x, y) and g y ( x, y). Equation
3. Single edge point response. The detector should return only one point for each (10-36) is implemented using an n × n Gaussian kernel whose size is discussed below.
true edge point. That is, the number of local maxima around the true edge should Keep in mind that (fs ( x, y) and a( x, y) are arrays of the same size as the image
be minimum. This means that the detector should not identify multiple edge pix-
from which they are computed.
els where only a single edge point exists.
Gradient image (fs ( x, y) typically contains wide ridges around local maxima.
The essence of Canny’s work was in expressing the preceding three criteria math- The next step is to thin those ridges. One approach is to use nonmaxima suppres-
ematically, and then attempting to find optimal solutions to these formulations. In sion. The essence of this approach is to specify a number of discrete orientations of
general, it is difficult (or impossible) to find a closed-form solution that satisfies the edge normal (gradient vector). For example, in a 3 × 3 region we can define four
all the preceding objectives. However, using numerical optimization with 1-D step orientations† for an edge passing through the center point of the region: horizontal,
edges corrupted by additive white Gaussian noise† led to the conclusion that a good vertical, + 45°, and − 45°. Figure 10.24(a) shows the situation for the two possible
approximation to the optimal step edge detector is the first derivative of a Gaussian, orientations of a horizontal edge. Because we have to quantize all possible edge
directions into four ranges, we have to define a range of directions over which we
2
−x − x
2
d − x2 consider an edge to be horizontal. We determine edge direction from the direction
e 2s = 2 e 2s2 (10-34)
dx s of the edge normal, which we obtain directly from the image data using Eq. (10-38).

As Fig. 10.24(b) shows, if the edge normal is in the range of directions from −22.5° to
Recall that white noise is noise having a frequency spectrum that is continuous and uniform over a specified
frequency band. White Gaussian noise is white noise in which the distribution of amplitude values is Gaussian.

Gaussian white noise is a good approximation of many real-world situations and generates mathematically Every edge has two possible orientations. For example, an edge whose normal is oriented at 0° and an edge
tractable models. It has the useful property that its values are statistically independent. whose normal is oriented at 180° are the same horizontal edge.

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 729 6/16/2017 2:13:00 PM DIP4E_GLOBAL_Print_Ready.indb 730 6/16/2017 2:13:02 PM


10.2 Point, Line, and Edge Detection 731 732 Chapter 10 Image Segmentation

a b %157.5' &157.5'
improve on this situation by using hysteresis thresholding which, as we will discuss
c Edge normal in Section 10.3, uses two thresholds: a low threshold, TL and a high threshold, TH .
FIGURE 10.24 Experimental evidence (Canny [1986]) suggests that the ratio of the high to low
(a) Two possible p1 p2 p3 p1 p2 p3 threshold should be in the range of 2:1 to 3:1.
orientations of a p5 y
We can visualize the thresholding operation as creating two additional images:
horizontal edge p4 p6 p4 p p6
5
(shaded) in a 3 × 3 Edge Edge normal
p7 p8 p9 p7 p8 p9 gNH ( x, y) = gN ( x, y) ≥ TH (10-39)
neighborhood. (gradient vector)
(b) Range of values a and
(shaded) of a, the Edge normal %22.5' &22.5' gNL ( x, y) = gN ( x, y) ≥ TL (10-40)
direction angle of
the edge normal x
for a horizontal %157.5' &157.5' Initially, gNH ( x, y) and gNL ( x, y) are set to 0. After thresholding, gNH ( x, y) will usu-
&45'edge
edge. (c) The angle ally have fewer nonzero pixels than gNL ( x, y), but all the nonzero pixels in gNH ( x, y)
ranges of the edge will be contained in gNL ( x, y) because the latter image is formed with a lower thresh-
normals for the %112.5' &112.5'
four types of edge old. We eliminate from gNL ( x, y) all the nonzero pixels from gNH ( x, y) by letting
directions in a 3 × 3 gNL ( x, y) = gNL ( x, y) − gNH ( x, y) (10-41)
Vertical edge
neighborhood.
Each edge direc- The nonzero pixels in gNH ( x, y) and gNL ( x, y) may be viewed as being “strong”
tion has two ranges, %67.5' &67.5'
and “weak” edge pixels, respectively. After the thresholding operations, all strong
shown in corre-
sponding shades. pixels in gNH ( x, y) are assumed to be valid edge pixels, and are so marked imme-
%45'edge diately. Depending on the value of TH , the edges in gNH ( x, y) typically have gaps.
%22.5' &22.5'
0' Longer edges are formed using the following procedure:
Horizontal edge
(a) Locate the next unvisited edge pixel, p, in gNH ( x, y).
22.5° or from −157.5° to 157.5°, we call the edge a horizontal edge. Figure 10.24(c) (b) Mark as valid edge pixels all the weak pixels in gNL ( x, y) that are connected to
shows the angle ranges corresponding to the four directions under consideration. p using, say, 8-connectivity.
Let d1 , d2 , d3 ,and d4 denote the four basic edge directions just discussed for (c) If all nonzero pixels in gNH ( x, y) have been visited go to Step (d). Else, return
a 3 × 3 region: horizontal, −45°, vertical, and +45°, respectively. We can formulate to Step ( a).
the following nonmaxima suppression scheme for a 3 × 3 region centered at an (d) Set to zero all pixels in gNL ( x, y) that were not marked as valid edge pixels.
arbitrary point ( x, y) in a:
At the end of this procedure, the final image output by the Canny algorithm is
1. Find the direction dk that is closest to a( x, y).
formed by appending to gNH ( x, y) all the nonzero pixels from gNL ( x, y).
2. Let K denote the value of (fs at ( x, y). If K is less than the value of (fs at one We used two additional images, gNH ( x, y) and gNL ( x, y) to simplify the discussion.
or both of the neighbors of point ( x, y) along dk , let gN ( x, y) = 0 (suppression); In practice, hysteresis thresholding can be implemented directly during nonmaxima
otherwise, let gN ( x, y) = K . suppression, and thresholding can be implemented directly on gN ( x, y) by forming a
When repeated for all values of x and y, this procedure yields a nonmaxima sup- list of strong pixels and the weak pixels connected to them.
pressed image gN ( x, y) that is of the same size as fs ( x, y). For example, with reference Summarizing, the Canny edge detection algorithm consists of the following steps:
to Fig. 10.24(a), letting ( x, y) be at p5 , and assuming a horizontal edge through p5 ,
1. Smooth the input image with a Gaussian filter.
the pixels of interest in Step 2 would be p2 and p8 . Image gN ( x, y) contains only the
thinned edges; it is equal to image (fs ( x, y) with the nonmaxima edge points sup- 2. Compute the gradient magnitude and angle images.
pressed. 3. Apply nonmaxima suppression to the gradient magnitude image.
The final operation is to threshold gN ( x, y) to reduce false edge points. In the 4. Use double thresholding and connectivity analysis to detect and link edges.
Marr-Hildreth algorithm we did this using a single threshold, in which all values
below the threshold were set to 0. If we set the threshold too low, there will still Although the edges after nonmaxima suppression are thinner than raw gradient edg-
be some false edges (called false positives). If the threshold is set too high, then es, the former can still be thicker than one pixel. To obtain edges one pixel thick, it is
valid edge points will be eliminated (false negatives). Canny’s algorithm attempts to typical to follow Step 4 with one pass of an edge-thinning algorithm (see Section 9.5).

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 731 6/16/2017 2:13:05 PM DIP4E_GLOBAL_Print_Ready.indb 732 6/16/2017 2:13:07 PM


10.2 Point, Line, and Edge Detection 733 734 Chapter 10 Image Segmentation

As mentioned earlier, smoothing is accomplished by convolving the input image to achieve the objectives stated in the previous paragraph for the gradient and Marr-Hildreth images.
with a Gaussian kernel whose size, n × n, must be chosen. Once a value of s has Comparing the Canny image with the other two images, we see in the Canny result significant improve-
Usually, selecting a
been specified, we can use the approach discussed in connection with the Marr-Hil- ments in detail of the principal edges and, at the same time, more rejection of irrelevant features. For
suitable value of s dreth algorithm to determine an odd value of n that provides the “full” smoothing example, note that both edges of the concrete band lining the bricks in the upper section of the image
for the first time in an capability of the Gaussian filter for the specified value of s. were detected by the Canny algorithm, whereas the thresholded gradient lost both of these edges, and
application requires
experimentation. Some final comments on implementation: As noted earlier in the discussion of the Marr-Hildreth method detected only the upper one. In terms of filtering out irrelevant detail, the
the Marr-Hildreth edge detector, the 2-D Gaussian function in Eq. (10-35) is sepa- Canny image does not contain a single edge due to the roof tiles; this is not true in the other two images.
rable into a product of two 1-D Gaussians. Thus, Step 1 of the Canny algorithm can The quality of the lines with regard to continuity, thinness, and straightness is also superior in the Canny
be formulated as 1-D convolutions that operate on the rows (columns) of the image image. Results such as these have made the Canny algorithm a tool of choice for edge detection.
one at a time, and then work on the columns (rows) of the result. Furthermore, if
we use the approximations in Eqs. (10-19) and (10-20), we can also implement the
gradient computations required for Step 2 as 1-D convolutions (see Problem 10.22). EXAMPLE 10.9 : Another illustration of the three principal edge-detection methods discussed in this section.
As another comparison of the three principal edge-detection methods discussed in this section, consider
EXAMPLE 10.8 : Illustration and comparison of the Canny edge-detection method. Fig. 10.26(a), which shows a 512 × 512 head CT image. Our objective is to extract the edges of the outer
Figure 10.25(a) shows the familiar building image. For comparison, Figs. 10.25(b) and (c) show, respec- contour of the brain (the gray region in the image), the contour of the spinal region (shown directly
tively, the result in Fig. 10.20(b) obtained using the thresholded gradient, and Fig. 10.22(d) using the behind the nose, toward the front of the brain), and the outer contour of the head. We wish to generate
Marr-Hildreth detector. Recall that the parameters used in generating those two images were selected the thinnest, continuous contours possible, while eliminating edge details related to the gray content in
to detect the principal edges, while attempting to reduce “irrelevant” features, such as the edges of the the eyes and brain areas.
bricks and the roof tiles. Figure 10.26(b) shows a thresholded gradient image that was first smoothed using a 5 × 5 averaging
Figure 10.25(d) shows the result obtained with the Canny algorithm using the parameters TL = 0.04, kernel. The threshold required to achieve the result shown was 15% of the maximum value of the gradi-
TH = 0.10 (2.5 times the value of the low threshold), s = 4, and a kernel of size 25 × 25, which cor- ent image. Figure 10.26(c) shows the result obtained with the Marr-Hildreth edge-detection algorithm
responds to the smallest odd integer not less than 6s. These parameters were chosen experimentally with a threshold of 0.002, s = 3, and a kernel of size 19 × 19 . Figure 10.26(d) was obtained using the
Canny algorithm with TL = 0.05,TH = 0.15 (3 times the value of the low threshold), s = 2, and a kernel
of size 13 × 13.
a b a b
c d c d
FIGURE 10.25 FIGURE 10.26
(a) Original image (a) Head CT image
of size 834 × 1114 of size 512 × 512
pixels, with pixels, with
intensity values intensity values
scaled to the range scaled to the range
[0, 1]. [0, 1].
(b) Thresholded (b) Thresholded
gradient of the gradient of the
smoothed image. smoothed image.
(c) Image obtained (c) Image obtained
using the using the Marr-Hil-
Marr-Hildreth dreth algorithm.
algorithm. (d) Image obtained
(d) Image obtained using the Canny
using the Canny algorithm.
algorithm. Note the (Original image
significant courtesy of Dr.
improvement of David R. Pickens,
the Canny image Vanderbilt
compared to the University.)
other two.

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 733 6/16/2017 2:13:08 PM DIP4E_GLOBAL_Print_Ready.indb 734 6/16/2017 2:13:09 PM


10.2 Point, Line, and Edge Detection 735 736 Chapter 10 Image Segmentation

In terms of edge quality and the ability to eliminate irrelevant detail, the results in Fig. 10.26 correspond The direction angle of the gradient vector is given by Eq. (10-18). An edge pixel
closely to the results and conclusions in the previous example. Note also that the Canny algorithm was with coordinates ( s, t ) in Sxy has an angle similar to the pixel at ( x, y) if
the only procedure capable of yielding a totally unbroken edge for the posterior boundary of the brain,
and the closest boundary of the spinal cord. It was also the only procedure capable of finding the cleanest a( s, t ) − a( x, y) ≤ A (10-43)
contours, while eliminating all the edges associated with the gray brain matter in the original image.
where A is a positive angle threshold. As noted earlier, the direction of the edge at
( x, y) is perpendicular to the direction of the gradient vector at that point.
The price paid for the improved performance of the Canny algorithm is a sig-
A pixel with coordinates ( s, t ) in Sxy is considered to be linked to the pixel at ( x, y)
nificantly more complex implementation than the two approaches discussed earlier.
if both magnitude and direction criteria are satisfied. This process is repeated for
In some applications, such as real-time industrial image processing, cost and speed
every edge pixel. As the center of the neighborhood is moved from pixel to pixel, a
requirements usually dictate the use of simpler techniques, principally the thresh-
record of linked points is kept. A simple bookkeeping procedure is to assign a dif-
olded gradient approach. When edge quality is the driving force, the Marr-Hildreth
ferent intensity value to each set of linked edge pixels.
and Canny algorithms, especially the latter, offer superior alternatives.
The preceding formulation is computationally expensive because all neighbors of
every point have to be examined. A simplification particularly well suited for real
LINKING EDGE POINTS time applications consists of the following steps:
Ideally, edge detection should yield sets of pixels lying only on edges. In practice,
1. Compute the gradient magnitude and angle arrays, M( x, y) and a( x, y), of the
these pixels seldom characterize edges completely because of noise, breaks in the
input image, f ( x, y).
edges caused by nonuniform illumination, and other effects that introduce disconti-
nuities in intensity values. Therefore, edge detection typically is followed by linking 2. Form a binary image, g( x, y), whose value at any point ( x, y) is given by:
algorithms designed to assemble edge pixels into meaningful edges and/or region
boundaries. In this section, we discuss two fundamental approaches to edge linking 1 if M( x, y) > TM AND a( x, y) = A ± TA
g( x, y) = 
that are representative of techniques used in practice. The first requires knowledge 0 otherwise
about edge points in a local region (e.g., a 3 × 3 neighborhood), and the second
is a global approach that works with an entire edge map. As it turns out, linking where TM is a threshold, A is a specified angle direction, and ±TA defines a
points along the boundary of a region is also an important aspect of some of the “band” of acceptable directions about A.
segmentation methods discussed in the next chapter, and in extracting features from 3. Scan the rows of g and fill (set to 1) all gaps (sets of 0’s) in each row that do not
a segmented image, as we will do in Chapter 11. Thus, you will encounter additional exceed a specified length, L. Note that, by definition, a gap is bounded at both
edge-point linking methods in the next two chapters. ends by one or more 1’s. The rows are processed individually, with no “memory”
kept between them.
Local Processing 4. To detect gaps in any other direction, u, rotate g by this angle and apply the
A simple approach for linking edge points is to analyze the characteristics of pixels horizontal scanning procedure in Step 3. Rotate the result back by −u.
in a small neighborhood about every point ( x, y) that has been declared an edge When interest lies in horizontal and vertical edge linking, Step 4 becomes a simple
point by one of the techniques discussed in the preceding sections. All points that procedure in which g is rotated ninety degrees, the rows are scanned, and the result
are similar according to predefined criteria are linked, forming an edge of pixels that is rotated back. This is the application found most frequently in practice and, as the
share common properties according to the specified criteria. following example shows, this approach can yield good results. In general, image
The two principal properties used for establishing similarity of edge pixels in this rotation is an expensive computational process so, when linking in numerous angle
kind of local analysis are (1) the strength (magnitude) and (2) the direction of the directions is required, it is more practical to combine Steps 3 and 4 into a single,
gradient vector. The first property is based on Eq. (10-17). Let Sxy denote the set of radial scanning procedure.
coordinates of a neighborhood centered at point ( x, y) in an image. An edge pixel
with coordinates ( s, t ) in Sxy is similar in magnitude to the pixel at ( x, y) if
EXAMPLE 10.10 : Edge linking using local processing.
M( s, t ) − M( x, y) ≤ E (10-42) Figure 10.27(a) shows a 534 × 566 image of the rear of a vehicle. The objective of this example is to
illustrate the use of the preceding algorithm for finding rectangles whose sizes makes them suitable
where E is a positive threshold. candidates for license plates. The formation of these rectangles can be accomplished by detecting

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 735 6/16/2017 2:13:10 PM DIP4E_GLOBAL_Print_Ready.indb 736 6/16/2017 2:13:12 PM


10.2 Point, Line, and Edge Detection 737 738 Chapter 10 Image Segmentation

a b c comparisons of every point to all lines. This is a computationally prohibitive task in


d e f most applications.
FIGURE 10.27
The original formulation
Hough [1962] proposed an alternative approach, commonly referred to as the
(a) Image of the rear of the Hough transform Hough transform. Let ( xi , yi ) denote a point in the xy-plane and consider the general
of a vehicle. presented here works equation of a straight line in slope-intercept form: yi = axi + b. Infinitely many lines
(b) Gradient magni- with straight lines. For a
tude image. generalization to pass through ( xi , yi ), but they all satisfy the equation yi = axi + b for varying val-
(c) Horizontally
arbitrary shapes, see
Ballard [1981].
ues of a and b. However, writing this equation as b = − xi a + yi and considering the
connected edge ab-plane (also called parameter space) yields the equation of a single line for a fixed
pixels. point ( xi , yi ). Furthermore, a second point ( x j , y j ) also has a single line in parameter
(d) Vertically con- space associated with it, which intersects the line associated with ( xi , yi ) at some
nected edge pixels.
(e) The logical OR point (a$, b$) in parameter space, where a$ is the slope and b$ the intercept of the line
of (c) and (d). containing both ( xi , yi ) and ( x j , y j ) in the xy-plane (we are assuming, of course, that
(f) Final result, the lines are not parallel). In fact, all points on this line have lines in parameter space
using morphological that intersect at (a$, b$). Figure 10.28 illustrates these concepts.
thinning. (Original In principle, the parameter space lines corresponding to all points ( xk , yk ) in the
image courtesy of
Perceptics xy-plane could be plotted, and the principal lines in that plane could be found by
Corporation.) identifying points in parameter space where large numbers of parameter-space lines
intersect. However, a difficulty with this approach is that a, (the slope of a line)
approaches infinity as the line approaches the vertical direction. One way around
this difficulty is to use the normal representation of a line:
strong horizontal and vertical edges. Figure 10.27(b) shows the gradient magnitude image, M( x, y), and x cos u + y sin u = r (10-44)
Figs. 10.27(c) and (d) show the result of Steps 3 and 4 of the algorithm, obtained by letting TM equal
to 30% of the maximum gradient value, A = 90°, TA = 45°, and filling all gaps of 25 or fewer pixels Figure 10.29(a) illustrates the geometrical interpretation of the parameters r and u.
(approximately 5% of the image width). A large range of allowable angle directions was required to A horizontal line has u = 0°, with r being equal to the positive x-intercept. Simi-
detect the rounded corners of the license plate enclosure, as well as the rear windows of the vehicle. larly, a vertical line has u = 90°, with r being equal to the positive y-intercept, or
Figure 10.27(e) is the result of forming the logical OR of the two preceding images, and Fig. 10.27(f) u = −90°, with r being equal to the negative y-intercept (we limit the angle to the
was obtained by thinning 10.27(e) with the thinning procedure discussed in Section 9.5. As Fig. 10.27(f) range −90° ≤ u ≤ 90°). Each sinusoidal curve in Figure 10.29(b) represents the fam-
shows, the rectangle corresponding to the license plate was clearly detected in the image. It would be ily of lines that pass through a particular point ( xk , yk ) in the xy-plane. The intersec-
a simple matter to isolate the license plate from all the rectangles in the image, using the fact that the tion point (r$, u$) in Fig. 10.29(b) corresponds to the line that passes through both
width-to-height ratio of license plates have distinctive proportions (e.g., a 2:1 ratio in U.S. plates). ( xi , yi ) and ( x j , y j ) in Fig. 10.29(a).
The computational attractiveness of the Hough transform arises from subdividing
the ru parameter space into so-called accumulator cells, as Fig. 10.29(c) illustrates,
Global Processing Using the Hough Transform
where (rmin , rmax ) and ( umin , umax ) are the expected ranges of the parameter values:
The method discussed in the previous section is applicable in situations in which
knowledge about pixels belonging to individual objects is available. Often, we have
to work in unstructured environments in which all we have is an edge map and no a b b$
y b
knowledge about where objects of interest might be. In such situations, all pixels FIGURE 10.28
are candidates for linking, and thus have to be accepted or eliminated based on pre- (a) xy-plane. b ) %xi a & yi
defined global properties. In this section, we develop an approach based on whether (b) Parameter (xi, yi)
space.
sets of pixels lie on curves of a specified shape. Once detected, these curves form the
edges or region boundaries of interest. a$
Given n points in an image, suppose that we want to find subsets of these points
that lie on straight lines. One possible solution is to find all lines determined by every (xj, yj)
b ) %xj a & yj
pair of points, then find all subsets of points that are close to particular lines. This
approach involves finding n ( n − 1) 2 ∼ n2 lines, then performing ( n ) ( n ( n − 1)) 2 ∼ n3 x a

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 737 6/16/2017 2:13:13 PM DIP4E_GLOBAL_Print_Ready.indb 738 6/16/2017 2:13:16 PM


10.2 Point, Line, and Edge Detection 739 740 Chapter 10 Image Segmentation

u$ umin 0 umax a
y u
r min
u b
xj cosu & yj sinu ) r
FIGURE 10.30
u (a) Image of size
r
101 × 101 pixels,
0 containing five
white points (four
(xj, yj)
in the corners and
one in the center).
(xi, yi) r$ r max (b) Corresponding
xi cosu & yi sinu ) r parameter space.
x r r

a b c
FIGURE 10.29 (a) (r, u) parameterization of a line in the xy-plane. (b) Sinusoidal curves in the ru-plane;the point of
intersection (r$, u$) corresponds to the line passing through points ( xi , yi ) and ( x j , y j ) in the xy-plane. (c) Division Q
of the ru-plane into accumulator cells. %100
2

−90° ≤ u ≤ 90° and − D ≤ r ≤ D, where D is the maximum distance between opposite %50
R
corners in an image. The cell at coordinates (i, j ) with accumulator value A(i, j ) cor-
responds to the square associated with parameter-space coordinates (ri , u j ). Ini-
tially, these cells are set to zero. Then, for every non-background point ( xk , yk ) in S A 1 S
0
the xy-plane, we let u equal each of the allowed subdivision values on the u-axis

r
and solve for the corresponding r using the equation r = xk cos u + yk sin u. The
resulting r values are then rounded off to the nearest allowed cell value along the 3
R
r axis. If a choice of uq results in the solution rp , then we let A( p, q) = A( p, q) + 1. 50 4
B
At the end of the procedure, a value of K in a cell A(i, j ) means that K points in the
xy-plane lie on the line x cos u j + y sin u j = ri . The number of subdivisions in the
Q
ru-plane determines the accuracy of the colinearity of these points. It can be shown 100
(see Problem 10.27) that the number of computations in the method just discussed is
linear with respect to n, the number of non-background points in the xy-plane. 5

%80 %60 %40 %20 0 20 40 60 80


EXAMPLE 10.11 : Some basic properties of the Hough transform.
u
Figure 10.30 illustrates the Hough transform based on Eq. (10-44). Figure 10.30(a) shows an image
of size M × M (M = 101) with five labeled white points, and Fig. 10.30(b) shows each of these points value). Finally, the points labeled Q, R, and S in Fig. 10.30(b) illustrate the fact that the Hough transform
mapped onto the ru-plane using subdivisions of one unit for the r and u axes. The range of u values is exhibits a reflective adjacency relationship at the right and left edges of the parameter space. This prop-
±90°, and the range of r values is ± 2M. As Fig. 10.30(b) shows, each curve has a different sinusoidal erty is the result of the manner in which r and u change sign at the ± 90° boundaries.
shape. The horizontal line resulting from the mapping of point 1 is a sinusoid of zero amplitude.
The points labeled A (not to be confused with accumulator values) and B in Fig. 10.30(b) illustrate Although the focus thus far has been on straight lines, the Hough transform is
the colinearity detection property of the Hough transform. For example, point B, marks the intersection applicable to any function of the form g ( v, c ) = 0, where v is a vector of coordinates
of the curves corresponding to points 2, 3, and 4 in the xy image plane. The location of point A indicates and c is a vector of coefficients. For example, points lying on the circle
that these three points lie on a straight line passing through the origin (r = 0) and oriented at −45° [see
Fig. 10.29(a)]. Similarly, the curves intersecting at point B in parameter space indicate that points 2, 3, (x − c1 ) + ( y − c2 ) = c32
2 2
(10-45)
and 4 lie on a straight line oriented at 45°, and whose distance from the origin is r = 71 (one-half the
can be detected by using the basic approach just discussed. The difference is the
diagonal distance from the origin of the image to the opposite corner, rounded to the nearest integer presence of three parameters c1 , c2 , and c3 that result in a 3-D parameter space with

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 739 6/16/2017 2:13:19 PM DIP4E_GLOBAL_Print_Ready.indb 740 6/16/2017 2:13:22 PM


10.2 Point, Line, and Edge Detection 741 742 Chapter 10 Image Segmentation

cube-like cells, and accumulators of the form A(i, j, k ). The procedure is to incre-
ment c1 and c2 , solve for the value of c3 that satisfies Eq. (10-45), and update the
accumulator cell associated with the triplet (c1 , c2 , c3 ). Clearly, the complexity of the
Hough transform depends on the number of coordinates and coefficients in a given
functional representation. As noted earlier, generalizations of the Hough transform
to detect curves with no simple analytic representations are possible, as is the appli-
cation of the transform to grayscale images.
Returning to the edge-linking problem, an approach based on the Hough trans-
form is as follows:
1. Obtain a binary edge map using any of the methods discussed earlier in this section.
2. Specify subdivisions in the ru-plane.
3. Examine the counts of the accumulator cells for high pixel concentrations.
4. Examine the relationship (principally for continuity) between pixels in a chosen
cell.
Continuity in this case usually is based on computing the distance between discon-
nected pixels corresponding to a given accumulator cell. A gap in a line associated
with a given cell is bridged if the length of the gap is less than a specified threshold.
Being able to group lines based on direction is a global concept applicable over the
entire image, requiring only that we examine pixels associated with specific accumu-
lator cells. The following example illustrates these concepts. a b
c d e
FIGURE 10.31 (a) A 502 × 564 aerial image of an airport. (b) Edge map obtained using Canny’s algorithm. (c) Hough
EXAMPLE 10.12 : Using the Hough transform for edge linking. parameter space (the boxes highlight the points associated with long vertical lines). (d) Lines in the image plane
Figure 10.31(a) shows an aerial image of an airport. The objective of this example is to use the Hough corresponding to the points highlighted by the boxes. (e) Lines superimposed on the original image.
transform to extract the two edges defining the principal runway. A solution to such a problem might be
of interest, for instance, in applications involving autonomous air navigation.
The first step is to obtain an edge map. Figure 10.31(b) shows the edge map obtained using Canny’s orientations of runways throughout the world are available in flight charts, and the direction of travel
algorithm with the same parameters and procedure used in Example 10.9. For the purpose of computing is easily obtainable using GPS (Global Positioning System) information. This information also could be
the Hough transform, similar results can be obtained using any of the other edge-detection techniques used to compute the distance between the vehicle and the runway, thus allowing estimates of param-
discussed earlier. Figure 10.31(c) shows the Hough parameter space obtained using 1° increments for u, eters such as expected length of lines relative to image size, as we did in this example.
and one-pixel increments for r.
The runway of interest is oriented approximately 1° off the north direction, so we select the cells cor-
responding to ± 90° and containing the highest count because the runways are the longest lines oriented 10.3 THRESHOLDING
10.3

in these directions. The small boxes on the edges of Fig. 10.31(c) highlight these cells. As mentioned ear- Because of its intuitive properties, simplicity of implementation, and computational
lier in connection with Fig. 10.30(b), the Hough transform exhibits adjacency at the edges. Another way speed, image thresholding enjoys a central position in applications of image segmen-
of interpreting this property is that a line oriented at +90° and a line oriented at −90° are equivalent (i.e., tation. Thresholding was introduced in Section 3.1, and we have used it in various
they are both vertical). Figure 10.31(d) shows the lines corresponding to the two accumulator cells just discussions since then. In this section, we discuss thresholding in a more formal way,
discussed, and Fig. 10.31(e) shows the lines superimposed on the original image. The lines were obtained and develop techniques that are considerably more general than what has been pre-
by joining all gaps not exceeding 20% (approximately 100 pixels) of the image height. These lines clearly sented thus far.
correspond to the edges of the runway of interest.
Note that the only information needed to solve this problem was the orientation of the runway and FOUNDATION
the observer’s position relative to it. In other words, a vehicle navigating autonomously would know
In the previous section, regions were identified by first finding edge segments,
that if the runway of interest faces north, and the vehicle’s direction of travel also is north, the runway
then attempting to link the segments into boundaries. In this section, we discuss
should appear vertically in the image. Other relative orientations are handled in a similar manner. The

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 741 6/16/2017 2:13:23 PM DIP4E_GLOBAL_Print_Ready.indb 742 6/16/2017 2:13:24 PM


10.3 Thresholding 743 744 Chapter 10 Image Segmentation

techniques for partitioning images directly into regions based on intensity values where a, b, and c are any three distinct intensity values. We will discuss dual threshold-
and/or properties of these values. ing later in this section. Segmentation problems requiring more than two thresholds
are difficult (or often impossible) to solve, and better results usually are obtained using
The Basics of Intensity Thresholding other methods, such as variable thresholding, as will be discussed later in this section,
Suppose that the intensity histogram in Fig. 10.32(a) corresponds to an image, f ( x, y), or region growing, as we will discuss in Section 10.4.
composed of light objects on a dark background, in such a way that object and back- Based on the preceding discussion, we may infer intuitively that the success of
ground pixels have intensity values grouped into two dominant modes. One obvious intensity thresholding is related directly to the width and depth of the valley(s) sepa-
Remember, f(x, y)
way to extract the objects from the background is to select a threshold, T, that sepa- rating the histogram modes. In turn, the key factors affecting the properties of the
denotes the intensity of f
at coordinates (x, y). rates these modes. Then, any point ( x, y) in the image at which f ( x, y) > T is called valley(s) are: (1) the separation between peaks (the further apart the peaks are, the
an object point. Otherwise, the point is called a background point. In other words, better the chances of separating the modes); (2) the noise content in the image (the
the segmented image, denoted by g( x, y), is given by modes broaden as noise increases); (3) the relative sizes of objects and background;
Although we follow
convention in using 0 (4) the uniformity of the illumination source; and (5) the uniformity of the reflectance
intensity for the back- properties of the image.
ground and 1 for object 1 if f ( x, y) > T
pixels, any two distinct g( x, y) =  (10-46)
values can be used in 0 if f ( x, y) ≤ T The Role of Noise in Image Thresholding
Eq. (10-46).
The simple synthetic image in Fig. 10.33(a) is free of noise, so its histogram con-
When T is a constant applicable over an entire image, the process given in this equa-
sists of two “spike” modes, as Fig. 10.33(d) shows. Segmenting this image into two
tion is referred to as global thresholding. When the value of T changes over an image,
regions is a trivial task: we just select a threshold anywhere between the two modes.
we use the term variable thresholding. The terms local or regional thresholding are
Figure 10.33(b) shows the original image corrupted by Gaussian noise of zero
used sometimes to denote variable thresholding in which the value of T at any point
mean and a standard deviation of 10 intensity levels. The modes are broader now
( x, y) in an image depends on properties of a neighborhood of ( x, y) (for example,
the average intensity of the pixels in the neighborhood). If T depends on the spa-
tial coordinates ( x, y) themselves, then variable thresholding is often referred to as
dynamic or adaptive thresholding. Use of these terms is not universal.
Figure 10.32(b) shows a more difficult thresholding problem involving a histo-
gram with three dominant modes corresponding, for example, to two types of light
objects on a dark background. Here, multiple thresholding classifies a point ( x, y) as
belonging to the background if f ( x, y) ≤ T1 , to one object class if T1 < f ( x, y) ≤ T2 ,
and to the other object class if f ( x, y) > T2 . That is, the segmented image is given by

a if f ( x, y) > T2

g ( x, y ) = b if T1 < f ( x, y) ≤ T2 (10-47)
c if f ( x, y) ≤ T1

a b
FIGURE 10.32
Intensity
histograms that 0 63 127 191 255 0 63 127 191 255 0 63 127 191 255

can be partitioned
(a) by a single a b c
threshold, and d e f
(b) by dual FIGURE 10.33 (a) Noiseless 8-bit image. (b) Image with additive Gaussian noise of mean 0 and standard deviation of
thresholds. 10 intensity levels. (c) Image with additive Gaussian noise of mean 0 and standard deviation of 50 intensity levels.
T T1 T2 (d) through (f) Corresponding histograms.

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 743 6/16/2017 2:13:25 PM DIP4E_GLOBAL_Print_Ready.indb 744 6/16/2017 2:13:25 PM


10.3 Thresholding 745 746 Chapter 10 Image Segmentation

[see Fig. 10.33(e)], but their separation is enough so that the depth of the valley between them is sufficient to make the modes easy to separate. A threshold placed midway between the two peaks would do the job. Figure 10.33(c) shows the result of corrupting the image with Gaussian noise of zero mean and a standard deviation of 50 intensity levels. As the histogram in Fig. 10.33(f) shows, the situation is much more serious now, as there is no way to differentiate between the two modes. Without additional processing (such as the methods discussed later in this section) we have little hope of finding a suitable threshold for segmenting this image.

The Role of Illumination and Reflectance in Image Thresholding

FIGURE 10.34 (a) Noisy image. (b) Intensity ramp in the range [0.2, 0.6]. (c) Product of (a) and (b). (d) through (f) Corresponding histograms.

Figure 10.34 illustrates the effect that illumination can have on the histogram of an image. Figure 10.34(a) is the noisy image from Fig. 10.33(b), and Fig. 10.34(d) shows its histogram. As before, this image is easily segmentable with a single threshold. With reference to the image formation model discussed in Section 2.3, suppose that we multiply the image in Fig. 10.34(a) by a nonuniform intensity function, such as the intensity ramp in Fig. 10.34(b), whose histogram is shown in Fig. 10.34(e). (In theory, the histogram of a ramp image is uniform. In practice, the degree of uniformity depends on the size of the image and the number of intensity levels.) Figure 10.34(c) shows the product of these two images, and Fig. 10.34(f) is the resulting histogram. The deep valley between peaks was corrupted to the point where separation of the modes without additional processing (to be discussed later in this section) is no longer possible. Similar results would be obtained if the illumination was perfectly uniform, but the reflectance of the image was not, as a result, for example, of natural reflectivity variations in the surface of objects and/or background.
The important point is that illumination and reflectance play a central role in the success of image segmentation using thresholding or other segmentation techniques. Therefore, controlling these factors when possible should be the first step considered in the solution of a segmentation problem. There are three basic approaches to the problem when control over these factors is not possible. The first is to correct the shading pattern directly. For example, nonuniform (but fixed) illumination can be corrected by multiplying the image by the inverse of the pattern, which can be obtained by imaging a flat surface of constant intensity. The second is to attempt to correct the global shading pattern via processing using, for example, the top-hat transformation introduced in Section 9.8. The third approach is to “work around” nonuniformities using variable thresholding, as discussed later in this section.

BASIC GLOBAL THRESHOLDING

When the intensity distributions of objects and background pixels are sufficiently distinct, it is possible to use a single (global) threshold applicable over the entire image. In most applications, there is usually enough variability between images that, even if global thresholding is a suitable approach, an algorithm capable of estimat-
ing the threshold value for each image is required. The following iterative algorithm
can be used for this purpose:
1. Select an initial estimate for the global threshold, T.
2. Segment the image using T in Eq. (10-46). This will produce two groups of
pixels: G1 , consisting of pixels with intensity values > T; and G2 , consisting of
pixels with values ≤ T.
3. Compute the average (mean) intensity values m1 and m2 for the pixels in G1
and G2 , respectively.
4. Compute a new threshold value midway between m1 and m2: T = (1/2)(m1 + m2).
5. Repeat Steps 2 through 4 until the difference between values of T in successive iterations is smaller than a predefined value, ΔT. (A code sketch of this procedure follows these steps.)
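Below is a minimal sketch of this iterative procedure, assuming the input is an 8-bit grayscale image stored as a NumPy array; the function name basic_global_threshold and its parameters are illustrative, not from the text.

```python
import numpy as np

def basic_global_threshold(f, delta_T=0.5):
    """Iteratively estimate a single global threshold T (Steps 1-5 above)."""
    f = f.astype(np.float64)
    T = f.mean()                      # Step 1: initial estimate (the image mean)
    while True:
        g1 = f[f > T]                 # Step 2: pixels with values > T
        g2 = f[f <= T]                # Step 2: pixels with values <= T
        m1 = g1.mean() if g1.size else T
        m2 = g2.mean() if g2.size else T
        T_new = 0.5 * (m1 + m2)       # Steps 3-4: class means and new midpoint
        if abs(T_new - T) < delta_T:  # Step 5: stop when the change is small
            return T_new
        T = T_new

# Example use, applying Eq. (10-46) with the estimated threshold:
# g = (f > basic_global_threshold(f)).astype(np.uint8)
```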
The algorithm is stated here in terms of successively thresholding the input image and calculating the means at each step, because it is more intuitive to introduce it in this manner. However, it is possible to develop an equivalent (and more efficient) procedure by expressing all computations in terms of the image histogram, which has to be computed only once (see Problem 10.29).
The preceding algorithm works well in situations where there is a reasonably clear valley between the modes of the histogram related to objects and background. Parameter ΔT is used to stop iterating when the changes in threshold values are small. The initial threshold must be chosen greater than the minimum and less than the maximum intensity level in the image (the average intensity of the image is a good


initial choice for T). If this condition is met, the algorithm converges in a finite number of steps, whether or not the modes are separable (see Problem 10.30).

FIGURE 10.35 (a) Noisy fingerprint. (b) Histogram. (c) Segmented result using a global threshold (thin image border added for clarity). (Original image courtesy of the National Institute of Standards and Technology.)

EXAMPLE 10.13 : Global thresholding.
Figure 10.35 shows an example of segmentation using the preceding iterative algorithm. Figure 10.35(a) is the original image and Fig. 10.35(b) is the image histogram, showing a distinct valley. Application of the basic global algorithm resulted in the threshold T = 125.4 after three iterations, starting with T equal to the average intensity of the image, and using ΔT = 0. Figure 10.35(c) shows the result obtained using T = 125 to segment the original image. As expected from the clear separation of modes in the histogram, the segmentation between object and background was perfect.

OPTIMUM GLOBAL THRESHOLDING USING OTSU'S METHOD

Thresholding may be viewed as a statistical-decision theory problem whose objective is to minimize the average error incurred in assigning pixels to two or more groups (also called classes). This problem is known to have an elegant closed-form solution known as the Bayes decision function (see Section 12.4). The solution is based on only two parameters: the probability density function (PDF) of the intensity levels of each class, and the probability that each class occurs in a given application. Unfortunately, estimating PDFs is not a trivial matter, so the problem usually is simplified by making workable assumptions about the form of the PDFs, such as assuming that they are Gaussian functions. Even with simplifications, the process of implementing solutions using these assumptions can be complex and not always well-suited for real-time applications.
The approach in the following discussion, called Otsu's method (Otsu [1979]), is an attractive alternative. The method is optimum in the sense that it maximizes the between-class variance, a well-known measure used in statistical discriminant analysis. The basic idea is that properly thresholded classes should be distinct with respect to the intensity values of their pixels and, conversely, that a threshold giving the best separation between classes in terms of their intensity values would be the best (optimum) threshold. In addition to its optimality, Otsu's method has the important property that it is based entirely on computations performed on the histogram of an image, an easily obtainable 1-D array (see Section 3.3).
Let {0, 1, 2, …, L − 1} denote the set of L distinct integer intensity levels in a digital image of size M × N pixels, and let ni denote the number of pixels with intensity i. The total number, MN, of pixels in the image is MN = n0 + n1 + n2 + ⋯ + nL−1. The normalized histogram (see Section 3.3) has components pi = ni/MN, from which it follows that

∑_{i=0}^{L−1} pi = 1,  pi ≥ 0    (10-48)

Now, suppose that we select a threshold T(k) = k, 0 < k < L − 1, and use it to threshold the input image into two classes, c1 and c2, where c1 consists of all the pixels in the image with intensity values in the range [0, k] and c2 consists of the pixels with values in the range [k + 1, L − 1]. Using this threshold, the probability, P1(k), that a pixel is assigned to (i.e., thresholded into) class c1 is given by the cumulative sum

P1(k) = ∑_{i=0}^{k} pi    (10-49)

Viewed another way, this is the probability of class c1 occurring. For example, if we set k = 0, the probability of class c1 having any pixels assigned to it is zero. Similarly, the probability of class c2 occurring is

P2(k) = ∑_{i=k+1}^{L−1} pi = 1 − P1(k)    (10-50)

From Eq. (3-25), the mean intensity value of the pixels in c1 is

m1(k) = ∑_{i=0}^{k} i P(i | c1) = ∑_{i=0}^{k} i P(c1 | i) P(i)/P(c1) = [1/P1(k)] ∑_{i=0}^{k} i pi    (10-51)

where P1(k) is given by Eq. (10-49). The term P(i | c1) in Eq. (10-51) is the probability of intensity value i, given that i comes from class c1. The rightmost term in the first line of the equation follows from Bayes' formula:

P(A | B) = P(B | A) P(A) / P(B)

The second line follows from the fact that P(c1 | i), the probability of c1 given i, is 1 because we are dealing only with values of i from class c1. Also, P(i) is the probability of the ith value, which is the ith component of the histogram, pi. Finally, P(c1) is the probability of class c1 which, from Eq. (10-49), is equal to P1(k).


Similarly, the mean intensity value of the pixels assigned to class c2 is

m2(k) = ∑_{i=k+1}^{L−1} i P(i | c2) = [1/P2(k)] ∑_{i=k+1}^{L−1} i pi    (10-52)

The cumulative mean (average intensity) up to level k is given by

m(k) = ∑_{i=0}^{k} i pi    (10-53)

and the average intensity of the entire image (i.e., the global mean) is given by

mG = ∑_{i=0}^{L−1} i pi    (10-54)

The validity of the following two equations can be verified by direct substitution of the preceding results:

P1 m1 + P2 m2 = mG    (10-55)

and

P1 + P2 = 1    (10-56)

where we have omitted the ks temporarily in favor of notational clarity.
In order to evaluate the effectiveness of the threshold at level k, we use the normalized, dimensionless measure

h = sB2 / sG2    (10-57)

where sG2 is the global variance [i.e., the intensity variance of all the pixels in the image, as given in Eq. (3-26)],

sG2 = ∑_{i=0}^{L−1} (i − mG)^2 pi    (10-58)

and sB2 is the between-class variance, defined as

sB2 = P1(m1 − mG)^2 + P2(m2 − mG)^2    (10-59)

This expression can also be written as

sB2 = P1 P2 (m1 − m2)^2 = (mG P1 − m)^2 / [P1(1 − P1)]    (10-60)

(The second step in this equation makes sense only if P1 is greater than 0 and less than 1, which, in view of Eq. (10-56), implies that P2 must satisfy the same condition.) The first line of this equation follows from Eqs. (10-55), (10-56), and (10-59). The second line follows from Eqs. (10-50) through (10-54). This form is slightly more efficient computationally because the global mean, mG, is computed only once, so only two parameters, m1 and P1, need to be computed for any value of k.
The first line in Eq. (10-60) indicates that the farther the two means m1 and m2 are from each other, the larger sB2 will be, implying that the between-class variance is a measure of separability between classes. Because sG2 is a constant, it follows that h also is a measure of separability, and maximizing this metric is equivalent to maximizing sB2. The objective, then, is to determine the threshold value, k, that maximizes the between-class variance, as stated earlier. Note that Eq. (10-57) assumes implicitly that sG2 > 0. This variance can be zero only when all the intensity levels in the image are the same, which implies the existence of only one class of pixels. This in turn means that h = 0 for a constant image because the separability of a single class from itself is zero.
Reintroducing k, we have the final results:

h(k) = sB2(k) / sG2    (10-61)

and

sB2(k) = [mG P1(k) − m(k)]^2 / { P1(k)[1 − P1(k)] }    (10-62)

Then, the optimum threshold is the value, k*, that maximizes sB2(k):

sB2(k*) = max_{0 ≤ k ≤ L−1} sB2(k)    (10-63)

To find k* we simply evaluate this equation for all integer values of k (subject to the condition 0 < P1(k) < 1) and select the value of k that yielded the maximum sB2(k). If the maximum exists for more than one value of k, it is customary to average the various values of k for which sB2(k) is maximum. It can be shown (see Problem 10.36) that a maximum always exists, subject to the condition 0 < P1(k) < 1. Evaluating Eqs. (10-62) and (10-63) for all values of k is a relatively inexpensive computational procedure, because the maximum number of integer values that k can have is L, which is only 256 for 8-bit images.
Once k* has been obtained, input image f(x, y) is segmented as before:

g(x, y) = 1 if f(x, y) > k*,  and  g(x, y) = 0 if f(x, y) ≤ k*    (10-64)

for x = 0, 1, 2, …, M − 1 and y = 0, 1, 2, …, N − 1. Note that all the quantities needed to evaluate Eq. (10-62) are obtained using only the histogram of f(x, y). In addition to the optimum threshold, other information regarding the segmented image can be extracted from the histogram. For example, P1(k*) and P2(k*), the class probabilities evaluated at the optimum threshold, indicate the portions of the areas occupied by the classes (groups of pixels) in the thresholded image. Similarly, the means m1(k*) and m2(k*) are estimates of the average intensity of the classes in the original image.

In general, the measure in Eq. (10-61) has values in the range

0 ≤ h(k) ≤ 1    (10-65)

for values of k in the range [0, L − 1]. When evaluated at the optimum threshold k*, this measure is a quantitative estimate of the separability of classes, which in turn gives us an idea of the accuracy of thresholding a given image with k*. The lower bound in Eq. (10-65) is attainable only by images with a single, constant intensity level. The upper bound is attainable only by two-valued images with intensities equal to 0 and L − 1 (see Problem 10.37).

FIGURE 10.36 (a) Original image. (b) Histogram (high peaks were clipped to highlight details in the lower values). (c) Segmentation result using the basic global algorithm from Section 10.3. (d) Result using Otsu's method. (Original image courtesy of Professor Daniel A. Hammer, the University of Pennsylvania.)

Otsu's algorithm may be summarized as follows (a code sketch follows the list):
1. Compute the normalized histogram of the input image. Denote the components of the histogram by pi, i = 0, 1, 2, …, L − 1.
2. Compute the cumulative sums, P1(k), for k = 0, 1, 2, …, L − 1, using Eq. (10-49).
3. Compute the cumulative means, m(k), for k = 0, 1, 2, …, L − 1, using Eq. (10-53).
4. Compute the global mean, mG, using Eq. (10-54).
5. Compute the between-class variance term, sB2(k), for k = 0, 1, 2, …, L − 1, using Eq. (10-62).
6. Obtain the Otsu threshold, k*, as the value of k for which sB2(k) is maximum. If the maximum is not unique, obtain k* by averaging the values of k corresponding to the various maxima detected.
7. Compute the global variance, sG2, using Eq. (10-58), and then obtain the separability measure, h*, by evaluating Eq. (10-61) with k = k*.
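A compact histogram-based sketch of these seven steps, assuming an 8-bit image stored as a NumPy array; otsu_threshold and all variable names are illustrative, not the book's implementation.

```python
import numpy as np

def otsu_threshold(f, L=256):
    """Return (k_star, eta_star): Otsu threshold and separability measure."""
    hist = np.bincount(f.ravel(), minlength=L).astype(np.float64)
    p = hist / hist.sum()                    # Step 1: normalized histogram, p_i
    P1 = np.cumsum(p)                        # Step 2: cumulative sums P1(k)
    m = np.cumsum(np.arange(L) * p)          # Step 3: cumulative means m(k)
    mG = m[-1]                               # Step 4: global mean
    with np.errstate(divide='ignore', invalid='ignore'):
        sB2 = (mG * P1 - m) ** 2 / (P1 * (1.0 - P1))   # Step 5: Eq. (10-62)
    sB2[~np.isfinite(sB2)] = 0.0             # ignore k where P1(k) is 0 or 1
    k_star = int(np.mean(np.flatnonzero(sB2 == sB2.max())))  # Step 6: average ties
    sG2 = np.sum((np.arange(L) - mG) ** 2 * p)                # Step 7: Eq. (10-58)
    eta_star = sB2[k_star] / sG2 if sG2 > 0 else 0.0
    return k_star, eta_star

# Segmentation as in Eq. (10-64):
# k, eta = otsu_threshold(f)
# g = (f > k).astype(np.uint8)
```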
The following example illustrates the use of this algorithm.

EXAMPLE 10.14 : Optimum global thresholding using Otsu’s method. USING IMAGE SMOOTHING TO IMPROVE GLOBAL THRESHOLDING
Figure 10.36(a) shows an optical microscope image of polymersome cells. These are cells artificially engi- As illustrated in Fig. 10.33, noise can turn a simple thresholding problem into an
neered using polymers. They are invisible to the human immune system and can be used, for example, unsolvable one. When noise cannot be reduced at the source, and thresholding is the
to deliver medication to targeted regions of the body. Figure 10.36(b) shows the image histogram. The preferred segmentation method, a technique that often enhances performance is to
objective of this example is to segment the molecules from the background. Figure 10.36(c) is the result smooth the image prior to thresholding. We illustrate this approach with an example.
of using the basic global thresholding algorithm discussed earlier. Because the histogram has no distinct Figure 10.37(a) is the image from Fig. 10.33(c), Fig. 10.37(b) shows its histogram,
valleys and the intensity difference between the background and objects is small, the algorithm failed to and Fig. 10.37(c) is the image thresholded using Otsu’s method. Every black point
achieve the desired segmentation. Figure 10.36(d) shows the result obtained using Otsu’s method. This in the white region and every white point in the black region is a thresholding error,
result obviously is superior to Fig. 10.36(c). The threshold value computed by the basic algorithm was so the segmentation was highly unsuccessful. Figure 10.37(d) shows the result of
169, while the threshold computed by Otsu’s method was 182, which is closer to the lighter areas in the smoothing the noisy image with an averaging kernel of size 5 × 5 (the image is of size
image defining the cells. The separability measure h* was 0.467. 651 × 814 pixels), and Fig. 10.37(e) is its histogram. The improvement in the shape
As a point of interest, applying Otsu’s method to the fingerprint image in Example 10.13 yielded a of the histogram as a result of smoothing is evident, and we would expect threshold-
threshold of 125 and a separability measure of 0.944. The threshold is identical to the value (rounded to ing of the smoothed image to be nearly perfect. Figure 10.37(f) shows this to be the
the nearest integer) obtained with the basic algorithm. This is not unexpected, given the nature of the case. The slight distortion of the boundary between object and background in the
segmented, smoothed image was caused by the blurring of the boundary. In fact, the
histogram. In fact, the separability measure is high because of the relatively large separation between
more aggressively we smooth an image, the more boundary errors we should antici-
modes and the deep valley between them.
pate in the segmented result.
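A brief sketch of this smooth-then-threshold idea, assuming a 5 × 5 box (averaging) kernel and reusing the otsu_threshold() helper from the earlier sketch (that definition is assumed to be in scope; names are illustrative).

```python
import numpy as np
from scipy.ndimage import uniform_filter  # box (averaging) filter

def smooth_then_otsu(f, ksize=5):
    """Average with a ksize x ksize kernel, then threshold with Otsu's method."""
    fs = uniform_filter(f.astype(np.float64), size=ksize)
    fs = np.clip(np.round(fs), 0, 255).astype(np.uint8)
    k, _ = otsu_threshold(fs)          # helper sketched after the algorithm summary
    return (fs > k).astype(np.uint8)
```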


FIGURE 10.37 (a) Noisy image from Fig. 10.33(c) and (b) its histogram. (c) Result obtained using Otsu's method. (d) Noisy image smoothed using a 5 × 5 averaging kernel and (e) its histogram. (f) Result of thresholding using Otsu's method.
FIGURE 10.38 (a) Noisy image and (b) its histogram. (c) Result obtained using Otsu's method. (d) Noisy image smoothed using a 5 × 5 averaging kernel and (e) its histogram. (f) Result of thresholding using Otsu's method. Thresholding failed in both cases to extract the object of interest. (See Fig. 10.39 for a better solution.)

Next, we investigate the effect of severely reducing the size of the foreground objects and the background. An immediate and obvious improvement is that his-
region with respect to the background. Figure 10.38(a) shows the result. The noise in tograms should be less dependent on the relative sizes of objects and background.
this image is additive Gaussian noise with zero mean and a standard deviation of 10 For instance, the histogram of an image composed of a small object on a large back-
intensity levels (as opposed to 50 in the previous example). As Fig. 10.38(b) shows, ground area (or vice versa) would be dominated by a large peak because of the high
the histogram has no clear valley, so we would expect segmentation to fail, a fact that concentration of one type of pixels. We saw in Fig. 10.38 that this can lead to failure
is confirmed by the result in Fig. 10.38(c). Figure 10.38(d) shows the image smoothed in thresholding.
with an averaging kernel of size 5 × 5, and Fig. 10.38(e) is the corresponding histo- If only the pixels on or near the edges between objects and background were
gram. As expected, the net effect was to reduce the spread of the histogram, but the used, the resulting histogram would have peaks of approximately the same height. In
distribution still is unimodal. As Fig. 10.38(f) shows, segmentation failed again. The addition, the probability that any of those pixels lies on an object would be approxi-
reason for the failure can be traced to the fact that the region is so small that its con- mately equal to the probability that it lies on the background, thus improving the
tribution to the histogram is insignificant compared to the intensity spread caused symmetry of the histogram modes. Finally, as indicated in the following paragraph,
by noise. In situations such as this, the approach discussed in the following section is
using pixels that satisfy some simple measures based on gradient and Laplacian
more likely to succeed.
operators has a tendency to deepen the valley between histogram peaks.
The approach just discussed assumes that the edges between objects and back-
USING EDGES TO IMPROVE GLOBAL THRESHOLDING ground are known. This information clearly is not available during segmentation,
Based on the discussion thus far, we conclude that the chances of finding a “good” as finding a division between objects and background is precisely what segmenta-
threshold are enhanced considerably if the histogram peaks are tall, narrow, sym- tion aims to do. However, an indication of whether a pixel is on an edge may be
metric, and separated by deep valleys. One approach for improving the shape of obtained by computing its gradient or Laplacian. For example, the average value
histograms is to consider only those pixels that lie on or near the edges between of the Laplacian is 0 at the transition of an edge (see Fig. 10.10), so the valleys of


histograms formed from the pixels selected by a Laplacian criterion can be expected
to be sparsely populated. This property tends to produce the desirable deep valleys
discussed above. In practice, comparable results typically are obtained using either
the gradient or Laplacian images, with the latter being favored because it is compu-
tationally more attractive and is also created using an isotropic edge detector.
The preceding discussion is summarized in the following algorithm, where f(x, y) is the input image (a code sketch follows the list):
1. Compute an edge image as either the magnitude of the gradient, or absolute value of the Laplacian, of f(x, y) using any of the methods in Section 10.2.
2. Specify a threshold value, T.
3. Threshold the image from Step 1 using T from Step 2 to produce a binary image, gT(x, y). This image is used as a mask image in the following step to select pixels from f(x, y) corresponding to “strong” edge pixels in the mask.
4. Compute a histogram using only the pixels in f(x, y) that correspond to the locations of the 1-valued pixels in gT(x, y).
5. Use the histogram from Step 4 to segment f(x, y) globally using, for example, Otsu's method.
(It is possible to modify this algorithm so that both the magnitude of the gradient and the absolute value of the Laplacian images are used. In this case, we would specify a threshold for each image and form the logical OR of the two results to obtain the marker image. This approach is useful when more control is desired over the points deemed to be valid edge points.)
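A rough sketch of these five steps, using the gradient magnitude as the edge image and a high percentile for T; the function edge_guided_threshold and all names here are illustrative assumptions, not the book's code.

```python
import numpy as np

def edge_guided_threshold(f, percentile=99.7):
    """Steps 1-5: build an edge mask, then threshold f using only masked pixels."""
    fd = f.astype(np.float64)
    gy, gx = np.gradient(fd)                   # Step 1: gradient magnitude
    edge = np.hypot(gx, gy)
    T = np.percentile(edge, percentile)        # Step 2: high-percentile threshold
    mask = edge > T                            # Step 3: binary mask g_T(x, y)
    masked_values = f[mask]                    # Step 4: pixels of f on strong edges
    hist = np.bincount(masked_values.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    # Step 5: Otsu threshold computed from the masked histogram, as in Eq. (10-62)
    P1 = np.cumsum(p)
    m = np.cumsum(np.arange(256) * p)
    mG = m[-1]
    with np.errstate(divide='ignore', invalid='ignore'):
        sB2 = (mG * P1 - m) ** 2 / (P1 * (1.0 - P1))
    sB2[~np.isfinite(sB2)] = 0.0
    k = int(sB2.argmax())
    return (f > k).astype(np.uint8)            # apply the threshold globally
```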
a b c
The nth percentile is
If T is set to any value less than the minimum value of the edge image then, accord- d e f
the smallest number ing to Eq. (10-46), gT ( x, y) will consist of all 1’s, implying that all pixels of f ( x, y) FIGURE 10.39 (a) Noisy image from Fig. 10.38(a) and (b) its histogram. (c) Mask image formed as the gradient mag-
that is greater than n%
of the numbers in a will be used to compute the image histogram. In this case, the preceding algorithm nitude image thresholded at the 99.7 percentile. (d) Image formed as the product of (a) and (c). (e) Histogram of
given set. For example, becomes global thresholding using the histogram of the original image. It is custom- the nonzero pixels in the image in (d). (f) Result of segmenting image (a) with the Otsu threshold based on the
if you received a 95 in a
ary to specify the value of T to correspond to a percentile, which typically is set histogram in (e). The threshold was 134, which is approximately midway between the peaks in this histogram.
test and this score was
greater than 85% of all high (e.g., in the high 90’s) so that few pixels in the gradient/Laplacian image will
the students taking the
test, then you would be be used in the computation. The following examples illustrate the concepts just dis-
in the 85th percentile EXAMPLE 10.16 : Using edge information based on the Laplacian to improve global thresholding.
with respect to the test
cussed. The first example uses the gradient, and the second uses the Laplacian. Simi-
scores. lar results can be obtained in both examples using either approach. The important In this example, we consider a more complex thresholding problem. Figure 10.40(a) shows an 8-bit
issue is to generate a suitable derivative image. image of yeast cells for which we want to use global thresholding to obtain the regions corresponding
to the bright spots. As a starting point, Fig. 10.40(b) shows the image histogram, and Fig. 10.40(c) is
the result obtained using Otsu’s method directly on the image, based on the histogram shown. We see
EXAMPLE 10.15 : Using edge information based on the gradient to improve global thresholding. that Otsu’s method failed to achieve the original objective of detecting the bright spots. Although the
method was able to isolate some of the cell regions themselves, several of the segmented regions on the
Figures 10.39(a) and (b) show the image and histogram from Fig. 10.38. You saw that this image could right were actually joined. The threshold computed by the Otsu method was 42, and the separability
not be segmented by smoothing followed by thresholding. The objective of this example is to solve the measure was 0.636.
problem using edge information. Figure 10.39(c) is the mask image, gT ( x, y), formed as gradient mag- Figure 10.40(d) shows the mask image gT ( x, y) obtained by computing the absolute value of the
nitude image thresholded at the 99.7 percentile. Figure 10.39(d) is the image formed by multiplying the Laplacian image, then thresholding it with T set to 115 on an intensity scale in the range [0, 255]. This
mask by the input image. Figure 10.39(e) is the histogram of the nonzero elements in Fig. 10.39(d). Note value of T corresponds approximately to the 99.5 percentile of the values in the absolute Laplacian
that this histogram has the important features discussed earlier; that is, it has reasonably symmetrical image, so thresholding at this level results in a sparse set of pixels, as Fig. 10.40(d) shows. Note in this
modes separated by a deep valley. Thus, while the histogram of the original noisy image offered no hope image how the points cluster near the edges of the bright spots, as expected from the preceding dis-
for successful thresholding, the histogram in Fig. 10.39(e) indicates that thresholding of the small object cussion. Figure 10.40(e) is the histogram of the nonzero pixels in the product of (a) and (d). Finally,
from the background is indeed possible. The result in Fig. 10.39(f) shows that this is the case. This image Fig. 10.40(f) shows the result of globally segmenting the original image using Otsu’s method based on
was generated using Otsu’s method [to obtain a threshold based on the histogram in Fig. 10.39(e)], and
then applying the Otsu threshold globally to the noisy image in Fig. 10.39(a). The result is nearly perfect. threshold computed by the Otsu method was 115, and the separability measure was 0.762, both of which
are higher than the values obtained by using the original histogram.


FIGURE 10.41
Image in Fig.
10.40(a) segmented
using the same
procedure as
explained in Figs.
10.40(d) through
(f), but using a
lower value to
threshold the
absolute Laplacian
image.


because the separability measure on which it is based also extends to an arbitrary


In applications involving
number of classes (Fukunaga [1972]). In the case of K classes, c1 , c2 ,…, cK , the
more than one variable between-class variance generalizes to the expression
(for example the RGB
components of a color K
∑ Pk ( mk − mG )
2
image), thresholding can sB2 = (10-66)
be implemented using a
k =1
distance measure, such
as the Euclidean distance,
or Mahalanobis distance where
discussed in Section 6.7
(see Eqs. (6-48), (6-49),
and Example 6.15).
Pk = ∑ pi
i ∈ck
(10-67)



and

1
a b c
d e f
mk =
Pk
∑ ipi
i ∈ck
(10-68)

FIGURE 10.40 (a) Image of yeast cells. (b) Histogram of (a). (c) Segmentation of (a) with Otsu’s method using the
histogram in (b). (d) Mask image formed by thresholding the absolute Laplacian image. (e) Histogram of the non- As before, mG is the global mean given in Eq. (10-54). The K classes are separated
zero pixels in the product of (a) and (d). (f) Original image thresholded using Otsu’s method based on the histogram by K − 1 thresholds whose values, k1∗ , k2∗ ,…, kK∗ −1 , are the values that maximize Eq.
in (e). (Original image courtesy of Professor Susan L. Forsburg, University of Southern California.) (10-66):

(
s B2 k1∗ , k2∗ ,…, kK∗ −1 ) = max
0 < k1 < k2 <…kK < L −1
s B2 ( k1 , k2 ,… kK −1 ) (10-69)
By varying the percentile at which the threshold is set, we can even improve the segmentation of the
complete cell regions. For example, Fig. 10.41 shows the result obtained using the same procedure as in Although this result is applicable to an arbitrary number of classes, it begins to lose
the previous paragraph, but with the threshold set at 55, which is approximately 5% of the maximum meaning as the number of classes increases because we are dealing with only one
value of the absolute Laplacian image. This value is at the 53.9 percentile of the values in that image. variable (intensity). In fact, the between-class variance usually is cast in terms of
This result clearly is superior to the result in Fig. 10.40(c) obtained using Otsu’s method with the histo- multiple variables expressed as vectors (Fukunaga [1972]). In practice, using mul-
gram of the original image. tiple global thresholding is considered a viable approach when there is reason to
believe that the problem can be solved effectively with two thresholds. Applications
MULTIPLE THRESHOLDS that require more than two thresholds generally are solved using more than just
intensity values. Instead, the approach is to use additional descriptors (e.g., color)
Thus far, we have focused attention on image segmentation using a single global and the application is cast as a pattern recognition problem, as you will learn shortly
threshold. Otsu’s method can be extended to an arbitrary number of thresholds in the discussion on multivariable thresholding.

(Recall from the discussion of the Canny edge detector that thresholding with two thresholds is referred to as hysteresis thresholding.) For three classes consisting of three intensity intervals (which are separated by two thresholds), the between-class variance is given by:

sB2 = P1(m1 − mG)^2 + P2(m2 − mG)^2 + P3(m3 − mG)^2    (10-70)

where

P1 = ∑_{i=0}^{k1} pi,  P2 = ∑_{i=k1+1}^{k2} pi,  P3 = ∑_{i=k2+1}^{L−1} pi    (10-71)

and

m1 = (1/P1) ∑_{i=0}^{k1} i pi,  m2 = (1/P2) ∑_{i=k1+1}^{k2} i pi,  m3 = (1/P3) ∑_{i=k2+1}^{L−1} i pi    (10-72)

As in Eqs. (10-55) and (10-56), the following relationships hold:

P1 m1 + P2 m2 + P3 m3 = mG    (10-73)

and

P1 + P2 + P3 = 1    (10-74)

We see from Eqs. (10-71) and (10-72) that P and m, and therefore sB2, are functions of k1 and k2. The two optimum threshold values, k1* and k2*, are the values that maximize sB2(k1, k2). That is, as indicated in Eq. (10-69), we find the optimum thresholds by finding

sB2(k1*, k2*) = max_{0 < k1 < k2 < L−1} sB2(k1, k2)    (10-75)

The procedure starts by selecting the first value of k1 (that value is 1 because looking for a threshold at 0 intensity makes no sense; also, keep in mind that the increment values are integers because we are dealing with integer intensity values). Next, k2 is incremented through all its values greater than k1 and less than L − 1 (i.e., k2 = k1 + 1, …, L − 2). Then, k1 is incremented to its next value and k2 is incremented again through all its values greater than k1. This procedure is repeated until k1 = L − 3. The result of this procedure is a 2-D array, sB2(k1, k2), and the last step is to look for the maximum value in this array. The values of k1 and k2 corresponding to that maximum in the array are the optimum thresholds, k1* and k2*. If there are several maxima, the corresponding values of k1 and k2 are averaged to obtain the final thresholds. The thresholded image is then given by

g(x, y) = a if f(x, y) ≤ k1*;  b if k1* < f(x, y) ≤ k2*;  c if f(x, y) > k2*    (10-76)

where a, b, and c are any three distinct intensity values.
Finally, the separability measure defined earlier for one threshold extends directly to multiple thresholds:

h(k1*, k2*) = sB2(k1*, k2*) / sG2    (10-77)

where sG2 is the total image variance from Eq. (10-58).

EXAMPLE 10.17 : Multiple global thresholding.
Figure 10.42(a) shows an image of an iceberg. The objective of this example is to segment the image into three regions: the dark background, the illuminated area of the iceberg, and the area in shadows. It is evident from the image histogram in Fig. 10.42(b) that two thresholds are required to solve this problem. The procedure discussed above resulted in the thresholds k1* = 80 and k2* = 177, which we note from Fig. 10.42(b) are near the centers of the two histogram valleys. Figure 10.42(c) is the segmentation that resulted using these two thresholds in Eq. (10-76). The separability measure was 0.954. The principal reason this example worked out so well can be traced to the histogram having three distinct modes separated by reasonably wide, deep valleys. But we can do even better using superpixels, as you will see in Section 10.5.

FIGURE 10.42 (a) Image of an iceberg. (b) Histogram. (c) Image segmented into three regions using dual Otsu thresholds. (Original image courtesy of NOAA.)
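A brute-force sketch of the two-threshold search just described, using Eqs. (10-70) through (10-76); it is O(L^2), which, as noted above, is inexpensive for L = 256. The function name and parameters are illustrative.

```python
import numpy as np

def dual_otsu_thresholds(f, L=256):
    """Exhaustively search (k1, k2) maximizing the three-class sB2 of Eq. (10-70)."""
    hist = np.bincount(f.ravel(), minlength=L).astype(np.float64)
    p = hist / hist.sum()
    i = np.arange(L)
    P = np.cumsum(p)                 # P[k] = sum of p_i for i = 0..k
    M = np.cumsum(i * p)             # M[k] = sum of i*p_i for i = 0..k
    mG = M[-1]
    best, k1_star, k2_star = -1.0, 1, 2
    for k1 in range(1, L - 2):
        for k2 in range(k1 + 1, L - 1):
            P1, P2, P3 = P[k1], P[k2] - P[k1], 1.0 - P[k2]   # Eq. (10-71)
            if P1 <= 0 or P2 <= 0 or P3 <= 0:
                continue
            m1 = M[k1] / P1                                   # Eq. (10-72)
            m2 = (M[k2] - M[k1]) / P2
            m3 = (mG - M[k2]) / P3
            sB2 = (P1 * (m1 - mG) ** 2 + P2 * (m2 - mG) ** 2
                   + P3 * (m3 - mG) ** 2)                     # Eq. (10-70)
            if sB2 > best:
                best, k1_star, k2_star = sB2, k1, k2
    return k1_star, k2_star

# Segmentation as in Eq. (10-76), with (a, b, c) = (0, 128, 255):
# k1, k2 = dual_otsu_thresholds(f)
# g = np.where(f <= k1, 0, np.where(f <= k2, 128, 255)).astype(np.uint8)
```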


VARIABLE THRESHOLDING where Q is a predicate based on parameters computed using the pixels in neighbor-
As discussed earlier in this section, factors such as noise and nonuniform illumina- ( )
hood Sxy . For example, consider the following predicate, Q s xy , mxy , based on the
tion play a major role in the performance of a thresholding algorithm. We showed local mean and standard deviation:
that image smoothing and the use of edge information can help significantly. How-
TRUE if f ( x, y) > asxy AND f ( x, y) > bmxy
ever, sometimes this type of preprocessing is either impractical or ineffective in ( )
Q sxy , mxy =  (10-82)
improving the situation, to the point where the problem cannot be solved by any  FALSE otherwisee
of the thresholding methods discussed thus far. In such situations, the next level of
thresholding complexity involves variable thresholding, as we will illustrate in the Note that Eq. (10-80) is a special case of Eq. (10-81), obtained by letting Q be TRUE
following discussion. if f ( x, y) > Txy and FALSE otherwise. In this case, the predicate is based simply on
the intensity at a point.
Variable Thresholding Based on Local Image Properties
A basic approach to variable thresholding is to compute a threshold at every point, EXAMPLE 10.18 : Variable thresholding based on local image properties.
( x, y), in the image based on one or more specified properties in a neighborhood Figure 10.43(a) shows the yeast image from Example 10.16. This image has three predominant inten-
of ( x, y). Although this may seem like a laborious process, modern algorithms and sity levels, so it is reasonable to assume that perhaps dual thresholding could be a good segmentation
hardware allow for fast neighborhood processing, especially for common functions approach. Figure 10.43(b) is the result of using the dual thresholding method summarized in Eq. (10-76).
such as logical and arithmetic operations. As the figure shows, it was possible to isolate the bright areas from the background, but the mid-gray
We illustrate the approach using the mean and standard deviation of the pixel regions on the right side of the image were not segmented (i.e., separated) properly. To illustrate the use
values in a neighborhood of every point in an image. These two quantities are use-
ful for determining local thresholds because, as you know from Chapter 3, they are a b
descriptors of average intensity and contrast. Let mxy and sxy denote the mean and c d
We simplified the nota-
standard deviation of the set of pixel values in a neighborhood, Sxy , centered at FIGURE 10.43
tion slightly from the coordinates ( x, y) in an image (see Section 3.3 regarding computation of the local (a) Image from
form we used in
Eqs. (3-27) and (3-28) by
mean and standard deviation). The following are common forms of variable thresh- Fig. 10.40.
letting xy imply a olds based on the local image properties: (b) Image
neighborhood S, centered segmented using
at coordinates (x, y). Txy = as xy + bmxy (10-78) the dual
thresholding
where a and b are nonnegative constants, and approach given
by Eq. (10-76).
Txy = as xy + bmG (10-79) (c) Image of local
Note that Txy is a standard
threshold array of the deviations.
same size as the image where mG is the global image mean. The segmented image is computed as (d) Result
from which it was
obtained. The threshold obtained using
at a location (x, y) in the
1 if f ( x, y) > Txy local thresholding.
array is used to segment
g( x, y) =  (10-80)
the value of an image at
that location. 0 if f ( x, y) ≤ Txy

where f ( x, y) is the input image. This equation is evaluated for all pixel locations
in the image, and a different threshold is computed at each location ( x, y) using the
pixels in the neighborhood Sxy .
Significant power (with a modest increase in computation) can be added to vari-
able thresholding by using predicates based on the parameters computed in the neigh-
borhood of a point ( x, y) :

1 if Q(local parameters) is TRUE


g( x, y) =  (10-81)
0 if Q(local parameters) is FALSE


of local thresholding, we computed the local standard deviation sxy for all ( x, y) in the input image using
a neighborhood of size 3 × 3. Figure 10.43(c) shows the result. Note how the faint outer lines correctly
delineate the boundaries of the cells. Next, we formed a predicate of the form shown in Eq. (10-82), but
using the global mean instead of mxy . Choosing the global mean generally gives better results when the
background is nearly constant and all the object intensities are above or below the background intensity.
The values a = 30 and b = 1.5 were used to complete the specification of the predicate (these values
were determined experimentally, as is usually the case in applications such as this). The image was then
segmented using Eq. (10-82). As Fig. 10.43(d) shows, the segmentation was quite successful. Note in par-
ticular that all the outer regions were segmented properly, and that most of the inner, brighter regions
were isolated correctly.
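A sketch of this local-statistics approach, using the predicate of Eq. (10-82) with the global mean in place of m_xy, a 3 × 3 neighborhood, and the values a = 30 and b = 1.5 from this example; the function name and helper choices are illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_property_threshold(f, a=30.0, b=1.5, size=3):
    """Segment with Q: f > a*sigma_xy AND f > b*m_G, in the spirit of Eq. (10-82)."""
    fd = f.astype(np.float64)
    local_mean = uniform_filter(fd, size=size)
    local_sq_mean = uniform_filter(fd * fd, size=size)
    sigma_xy = np.sqrt(np.maximum(local_sq_mean - local_mean ** 2, 0.0))  # local std
    mG = fd.mean()                                  # global mean (used instead of m_xy)
    g = (fd > a * sigma_xy) & (fd > b * mG)         # predicate Q is TRUE
    return g.astype(np.uint8)
```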
a b c
Variable Thresholding Based on Moving Averages FIGURE 10.44 (a) Text image corrupted by spot shading. (b) Result of global thresholding using Otsu’s method.
(c) Result of local thresholding using moving averages.
A special case of the variable thresholding method discussed in the previous sec-
tion is based on computing a moving average along scan lines of an image. This
implementation is useful in applications such as document processing, where speed As another illustration of the effectiveness of this segmentation approach, we used the same param-
is a fundamental requirement. The scanning typically is carried out line by line in a eters as in the previous paragraph to segment the image in Fig. 10.45(a), which is corrupted by a sinu-
zigzag pattern to reduce illumination bias. Let zk+1 denote the intensity of the point soidal intensity variation typical of the variations that may occur when the power supply in a document
encountered in the scanning sequence at step k + 1. The moving average (mean scanner is not properly grounded. As Figs. 10.45(b) and (c) show, the segmentation results are compa-
intensity) at this new point is given by rable to those in Fig. 10.44.
Note that successful segmentation results were obtained in both cases using the same values for n
1 k +1 and c, which shows the relative ruggedness of the approach. In general, thresholding based on moving
m(k + 1) = ∑ zi
n i = k+2−n
for k ≥ n − 1
averages works well when the objects of interest are small (or thin) with respect to the image size, a
(10-83) condition satisfied by images of typed or handwritten text.
= m(k ) +
1
n
(
zk +1 − zk − n ) for k ≥ n + 1
10.4 SEGMENTATION BY REGION GROWING AND BY REGION
where n is the number of points used in computing the average, and m(1) = z1 . The 10.4
SPLITTING AND MERGING
conditions imposed on k are so that all subscripts on zk are positive. All this means
is that n points must be available for computing the average. When k is less than the You should review the
As we discussed in Section 10.1, the objective of segmentation is to partition an
limits shown (this happens near the image borders) the averages are formed with terminology introduced image into regions. In Section 10.2, we approached this problem by attempting to
the available image points. Because a moving average is computed for every point
in Section 10.1 before
proceeding.
find boundaries between regions based on discontinuities in intensity levels, where-
in the image, segmentation is implemented using Eq. (10-80) with Txy = cmxy , where as in Section 10.3, segmentation was accomplished via thresholds based on the dis-
c is positive scalar, and mxy is the moving average from Eq. (10-83) at point ( x, y) in tribution of pixel properties, such as intensity values or color. In this section and in
the input image. Sections 10.5 and 10.6, we discuss segmentation techniques that find the regions
directly. In Section 10.7, we will discuss a method that finds the regions and their
boundaries simultaneously.
EXAMPLE 10.19 : Document thresholding using moving averages.
Figure 10.44(a) shows an image of handwritten text shaded by a spot intensity pattern. This form of REGION GROWING
intensity shading is typical of images obtained using spot illumination (such as a photographic flash). As its name implies, region growing is a procedure that groups pixels or subregions
Figure 10.44(b) is the result of segmentation using the Otsu global thresholding method. It is not unex- into larger regions based on predefined criteria for growth. The basic approach is to
pected that global thresholding could not overcome the intensity variation because the method gener- start with a set of “seed” points, and from these grow regions by appending to each
ally performs poorly when the areas of interest are embedded in a nonuniform illumination field. Figure seed those neighboring pixels that have predefined properties similar to the seed
10.44(c) shows successful segmentation with local thresholding using moving averages. For images of (such as ranges of intensity or color).
written material, a rule of thumb is to let n equal five times the average stroke width. In this case, the Selecting a set of one or more starting points can often be based on the nature of
average width was 4 pixels, so we let n = 20 in Eq. (10-83) and used c = 0.5. the problem, as we show later in Example 10.20. When a priori information is not
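A sketch of thresholding against the moving average of Eq. (10-83) with T_xy = c * m_xy, using the values n = 20 and c = 0.5 from this example; the zigzag handling and all names are illustrative assumptions.

```python
import numpy as np

def moving_average_threshold(f, n=20, c=0.5):
    """Zigzag scan; threshold each pixel against c times the running mean of Eq. (10-83)."""
    fd = f.astype(np.float64)
    z = fd.copy()
    z[1::2, :] = z[1::2, ::-1]            # reverse every other row (zigzag scan order)
    flat = z.ravel()
    means = np.empty_like(flat)
    window_sum = 0.0
    for k, zk in enumerate(flat):         # running mean over the last n samples
        window_sum += zk
        if k >= n:
            window_sum -= flat[k - n]
        means[k] = window_sum / min(k + 1, n)
    m = means.reshape(fd.shape)
    m[1::2, :] = m[1::2, ::-1]            # undo the zigzag reordering
    return (fd > c * m).astype(np.uint8)
```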


See Sections 2.5 and 9.5


1. Find all connected components in S( x, y) and reduce each connected component
regarding connected to one pixel; label all such pixels found as 1. All other pixels in S are labeled 0.
components, and
Section 9.2 regarding 2. Form an image fQ such that, at each point ( x, y), fQ ( x, y) = 1 if the input image
erosion. satisfies a given predicate, Q, at those coordinates, and fQ ( x, y) = 0 otherwise.
3. Let g be an image formed by appending to each seed point in S all the 1-valued
points in fQ that are 8-connected to that seed point.
4. Label each connected component in g with a different region label (e.g.,integers
or letters). This is the segmented image obtained by region growing.
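A sketch of these four steps using scipy.ndimage for connected components; the predicate is the absolute intensity difference used in Example 10.20 below, the seed reduction uses component centroids rather than the morphological erosion of the text, and all names are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def region_grow(f, S, T):
    """Steps 1-4: grow 8-connected regions from seed image S where |f - seed value| <= T."""
    eight = np.ones((3, 3), dtype=bool)                  # 8-connectivity structure
    # Step 1: reduce each connected seed component to one point (its centroid pixel,
    # a simplification of the erosion-to-a-point used in the text)
    lbl, n = ndimage.label(S > 0, structure=eight)
    seeds = np.zeros(S.shape, dtype=bool)
    for r, ccol in ndimage.center_of_mass(S > 0, lbl, range(1, n + 1)):
        seeds[int(round(r)), int(round(ccol))] = True
    # Step 2: predicate image f_Q (similarity to the mean seed intensity)
    seed_value = f[seeds].mean()
    fQ = np.abs(f.astype(np.float64) - seed_value) <= T
    # Step 3: keep only the f_Q points 8-connected to a seed point
    comp, _ = ndimage.label(fQ, structure=eight)
    keep = np.unique(comp[seeds & fQ])
    g = np.isin(comp, keep[keep > 0])
    # Step 4: label each connected region of the result
    regions, _ = ndimage.label(g, structure=eight)
    return regions
```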
The following example illustrates the mechanics of this algorithm.
a b c
FIGURE 10.45 (a) Text image corrupted by sinusoidal shading. (b) Result of global thresholding using Otsu’s method. EXAMPLE 10.20 : Segmentation by region growing.
(c) Result of local thresholding using moving averages..
Figure 10.46(a) shows an 8-bit X-ray image of a weld (the horizontal dark region) containing several
cracks and porosities (the bright regions running horizontally through the center of the image). We illus-
trate the use of region growing by segmenting the defective weld regions. These regions could be used
available, the procedure is to compute at every pixel the same set of properties that
in applications such as weld inspection, for inclusion in a database of historical studies, or for controlling
ultimately will be used to assign pixels to regions during the growing process. If the
an automated welding system.
result of these computations shows clusters of values, the pixels whose properties
The first thing we do is determine the seed points. From the physics of the problem, we know that
place them near the centroid of these clusters can be used as seeds.
cracks and porosities will attenuate X-rays considerably less than solid welds, so we expect the regions
The selection of similarity criteria depends not only on the problem under con-
containing these types of defects to be significantly brighter than other parts of the X-ray image. We
sideration, but also on the type of image data available. For example, the analysis of
can extract the seed points by thresholding the original image, using a threshold set at a high percen-
land-use satellite imagery depends heavily on the use of color. This problem would
tile. Figure 10.46(b) shows the histogram of the image, and Fig. 10.46(c) shows the thresholded result
be significantly more difficult, or even impossible, to solve without the inherent infor-
obtained with a threshold equal to the 99.9 percentile of intensity values in the image, which in this case
mation available in color images. When the images are monochrome, region analysis
was 254 (see Section 10.3 regarding percentiles). Figure 10.46(d) shows the result of morphologically
must be carried out with a set of descriptors based on intensity levels and spatial
eroding each connected component in Fig. 10.46(c) to a single point.
properties (such as moments or texture). We will discuss descriptors useful for region
Next, we have to specify a predicate. In this example, we are interested in appending to each seed
characterization in Chapter 11.
all the pixels that (a) are 8-connected to that seed, and (b) are “similar” to it. Using absolute intensity
Descriptors alone can yield misleading results if connectivity properties are not
differences as a measure of similarity, our predicate applied at each location ( x, y) is
used in the region-growing process. For example, visualize a random arrangement of
pixels that have three distinct intensity values. Grouping pixels with the same inten-  TRUE if the absolute difference of intensities
sity value to form a “region,” without paying attention to connectivity, would yield a  between the seed and the pixel at ( x, y) is ≤ T
Q=
segmentation result that is meaningless in the context of this discussion.  FALSE
 otherwise
Another problem in region growing is the formulation of a stopping rule. Region
growth should stop when no more pixels satisfy the criteria for inclusion in that where T is a specified threshold. Although this predicate is based on intensity differences and uses a
region. Criteria such as intensity values, texture, and color are local in nature and single threshold, we could specify more complex schemes in which a different threshold is applied to
do not take into account the “history” of region growth. Additional criteria that can each pixel, and properties other than differences are used. In this case, the preceding predicate is suf-
increase the power of a region-growing algorithm utilize the concept of size, like- ficient to solve the problem, as the rest of this example shows.
ness between a candidate pixel and the pixels grown so far (such as a comparison of From the previous paragraph, we know that all seed values are 255 because the image was thresh-
the intensity of a candidate and the average intensity of the grown region), and the olded with a threshold of 254. Figure 10.46(e) shows the difference between the seed value (255) and
shape of the region being grown. The use of these types of descriptors is based on Fig. 10.46(a). The image in Fig. 10.46(e) contains all the differences needed to compute the predicate at
the assumption that a model of expected results is at least partially available. each location ( x, y). Figure 10.46(f) shows the corresponding histogram. We need a threshold to use in
Let: f ( x, y) denote an input image; S( x, y) denote a seed array containing 1’s the predicate to establish similarity. The histogram has three principal modes, so we can start by apply-
at the locations of seed points and 0’s elsewhere; and Q denote a predicate to be ing to the difference image the dual thresholding technique discussed in Section 10.3. The resulting two
applied at each location ( x, y). Arrays f and S are assumed to be of the same size. thresholds in this case were T1 = 68 and T2 = 126, which we see correspond closely to the valleys of
A basic region-growing algorithm based on 8-connectivity may be stated as follows. the histogram. (As a brief digression, we segmented the image using these two thresholds. The result in


candidates. However, Step 3 will reject the outer points because they are not 8-connected to the seeds.
In fact, as Fig. 10.46(i) shows, this step resulted in the correct segmentation, indicating that the use of
connectivity was a fundamental requirement in this case. Finally, note that in Step 4 we used the same
value for all the regions found by the algorithm. In this case, it was visually preferable to do so because
all those regions have the same physical meaning in this application—they all represent porosities.

REGION SPLITTING AND MERGING


The procedure just discussed grows regions from seed points. An alternative is to sub-
0 63 127 191 255 divide an image initially into a set of disjoint regions and then merge and/or split the
regions in an attempt to satisfy the conditions of segmentation stated in Section 10.1.
The basics of region splitting and merging are discussed next.
Let R represent the entire image region and select a predicate Q. One approach
for segmenting R is to subdivide it successively into smaller and smaller quadrant
regions so that, for any region Ri , Q(Ri ) = TRUE. We start with the entire region, R.
If Q(R) = FALSE, we divide the image into quadrants. If Q is FALSE for any
quadrant, we subdivide that quadrant into sub-quadrants, and so on. This splitting
technique has a convenient representation in the form of so-called quadtrees; that
is, trees in which each node has exactly four descendants, as Fig. 10.47 shows (the
0 63 127 191 255 images corresponding to the nodes of a quadtree sometimes are called quadregions
or quadimages). Note that the root of the tree corresponds to the entire image, and
that each node corresponds to the subdivision of a node into four descendant nodes.
In this case, only R4 was subdivided further.
If only splitting is used, the final partition normally contains adjacent regions with
identical properties. This drawback can be remedied by allowing merging as well as
splitting. Satisfying the constraints of segmentation outlined in Section 10.1 requires
merging only adjacent regions whose combined pixels satisfy the predicate Q. That
See Section 2.5
regarding region (
is, two adjacent regions Rj and Rk are merged only if Q Rj " Rk = TRUE. )
adjacency. The preceding discussion can be summarized by the following procedure in which,
at any step, we
a b c
d e f 1. Split into four disjoint quadrants any region Ri for which Q(Ri ) = FALSE.
g h i
2. When no further splitting is possible, merge any adjacent regions Rj and Rk for
Figure 10.46 (a) X-ray image of a defective weld. (b) Histogram. (c) Initial seed image. (d) Final seed image (the
points were enlarged for clarity). (e) Absolute value of the difference between the seed value (255) and (a).
(
which Q Rj " Rk = TRUE. )
(f) Histogram of (e). (g) Difference image thresholded using dual thresholds. (h) Difference image thresholded with
the smallest of the dual thresholds. (i) Segmentation result obtained by region growing. (Original image courtesy
of X-TEK Systems, Ltd.) a b R
R
FIGURE 10.47
(a) Partitioned
image. R1 R2
Fig. 10.46(g) shows that segmenting the defects cannot be accomplished using dual thresholds, despite (b) Corresponding
quadtree. R1 R2 R3 R4
the fact that the thresholds are in the deep valleys of the histogram.)
R represents R41 R42
Figure 10.46(h) shows the result of thresholding the difference image with only T1 . The black points the entire image R3
are the pixels for which the predicate was TRUE; the others failed the predicate. The important result region. R43 R44
here is that the points in the good regions of the weld failed the predicate, so they will not be included R41 R42 R43 R44
in the final result. The points in the outer region will be considered by the region-growing algorithm as


3. Stop when no further merging is possible. TRUE if sR > a AND 0 < mR < b
Q(R) = 
Numerous variations of this basic theme are possible. For example, a significant  FALSE otherwise
simplification results if in Step 2 we allow merging of any two adjacent regions Rj
and Rk if each one satisfies the predicate individually. This results in a much sim- where sR and mR are the standard deviation and mean of the region being processed, and a and b are
pler (and faster) algorithm, because testing of the predicate is limited to individual nonnegative constants.
quadregions. As the following example shows, this simplification is still capable of Analysis of several regions in the outer area of interest revealed that the mean intensity of pixels
yielding good segmentation results. in those regions did not exceed 125, and the standard deviation was always greater than 10. Figures
10.48(b) through (d) show the results obtained using these values for a and b, and varying the minimum
size allowed for the quadregions from 32 to 8. The pixels in a quadregion that satisfied the predicate
EXAMPLE 10.21 : Segmentation by region splitting and merging.

Figure 10.48(a) shows a 566 × 566 X-ray image of the Cygnus Loop supernova. The objective of this example is to segment (extract from the image) the "ring" of less dense matter surrounding the dense inner region. The region of interest has some obvious characteristics that should help in its segmentation. First, we note that the data in this region has a random nature, indicating that its standard deviation should be greater than the standard deviation of the background (which is near 0) and of the large central region, which is smooth. Similarly, the mean value (average intensity) of a region containing data from the outer ring should be greater than the mean of the darker background and less than the mean of the lighter central region. Thus, we should be able to segment the region of interest using the following predicate:

    Q(R) = TRUE if σR > a AND 0 < mR < b;  FALSE otherwise

where σR and mR are the standard deviation and mean of the region being processed, and a and b are nonnegative constants.
Analysis of several regions in the outer area of interest revealed that the mean intensity of pixels in those regions did not exceed 125, and the standard deviation was always greater than 10. Figures 10.48(b) through (d) show the results obtained using these values for a and b, and varying the minimum size allowed for the quadregions from 32 to 8. The pixels in a quadregion that satisfied the predicate were set to white; all others in that region were set to black. The best result in terms of capturing the shape of the outer region was obtained using quadregions of size 16 × 16. The small black squares in Fig. 10.48(d) are quadregions of size 8 × 8 whose pixels did not satisfy the predicate. Using smaller quadregions would result in increasing numbers of such black regions. Using regions larger than the one illustrated here would result in a more "block-like" segmentation. Note that in all cases the segmented region (white pixels) was a connected region that completely separates the inner, smoother region from the background. Thus, the segmentation effectively partitioned the image into three distinct areas that correspond to the three principal features in the image: background, a dense region, and a sparse region. Using any of the white regions in Fig. 10.48 as a mask would make it a relatively simple task to extract these regions from the original image (see Problem 10.43). As in Example 10.20, these results could not have been obtained using edge- or threshold-based segmentation.

FIGURE 10.48 (a) Image of the Cygnus Loop supernova, taken in the X-ray band by NASA's Hubble Telescope. (b) through (d) Results of limiting the smallest allowed quadregion to be of sizes of 32 × 32, 16 × 16, and 8 × 8 pixels, respectively. (Original image courtesy of NASA.)

As used in the preceding example, properties based on the mean and standard deviation of pixel intensities in a region attempt to quantify the texture of the region (see Section 11.3 for a discussion on texture). The concept of texture segmentation is based on using measures of texture in the predicates. In other words, we can perform texture segmentation by any of the methods discussed in this section simply by specifying predicates based on texture content.
10.5 REGION SEGMENTATION USING CLUSTERING AND SUPERPIXELS

In this section, we discuss two related approaches to region segmentation. The first is a classical approach based on seeking clusters in data, related to such variables as intensity and color. The second approach is significantly more modern, and is based on using clustering to extract "superpixels" from an image.

REGION SEGMENTATION USING K-MEANS CLUSTERING

The basic idea behind the clustering approach used in this chapter is to partition a set, Q, of observations into a specified number, k, of clusters. In k-means clustering, each observation is assigned to the cluster with the nearest mean (hence the name of the method), and each mean is called the prototype of its cluster. A k-means algorithm is an iterative procedure that successively refines the means until convergence is achieved. (A more general form of clustering is unsupervised clustering, in which a clustering algorithm attempts to find a meaningful set of clusters in a given set of samples. We do not address this topic, as our focus in this brief introduction is only to illustrate how supervised clustering is used for image segmentation.)
Let {z1, z2, …, zQ} be a set of vector observations (samples). These vectors have the form


    z = [z1, z2, …, zn]^T        (10-84)

FIGURE 10.49 (a) Image of size 688 × 688 pixels. (b) Image segmented using the k-means algorithm with k = 3.

In image segmentation, each component of a vector z represents a numerical pixel attribute. For example, if segmentation is based on just grayscale intensity, then z = z is a scalar representing the intensity of a pixel. If we are segmenting RGB color
images, z typically is a 3-D vector, each component of which is the intensity of a pixel
in one of the three primary color images, as we discussed in Chapter 6. The objec-
tive of k-means clustering is to partition the set Q of observations into k (k ≤ Q)
disjoint cluster sets C = {C1 , C2 , … , Ck }, so that the following criterion of optimality
is satisfied:†

    arg min_C  Σ_{i=1}^{k}  Σ_{z ∈ Ci}  ‖z − mi‖²        (10-85)
where mi is the mean vector (or centroid) of the samples in set Ci, and ‖arg‖ is the vector norm of the argument. Typically, the Euclidean norm is used, so the term ‖z − mi‖ is the familiar Euclidean distance from a sample in Ci to mean mi. In words, this equation says that we are interested in finding the sets C = {C1, C2, …, Ck} such that the sum of the distances from each point in a set to the mean of that set is minimum.
Unfortunately, finding this minimum is an NP-hard problem for which no practical solution is known. As a result, a number of heuristic methods that attempt to find approximations to the minimum have been proposed over the years. In this section, we discuss what is generally considered to be the "standard" k-means algorithm, which is based on the Euclidean distance (see Section 2.6). Given a set {z1, z2, …, zQ} of vector observations and a specified value of k, the algorithm is as follows (the initial means are the initial cluster centers; they are also called seeds):

1. Initialize the algorithm: Specify an initial set of means, mi(1), i = 1, 2, …, k.
2. Assign samples to clusters: Assign each sample to the cluster set whose mean is the closest (ties are resolved arbitrarily, but samples are assigned to only one cluster):
       zq → Ci  if  ‖zq − mi‖² < ‖zq − mj‖²,   j = 1, 2, …, k (j ≠ i);  q = 1, 2, …, Q
3. Update the cluster centers (means):
       mi = (1/|Ci|) Σ_{z ∈ Ci} z,   i = 1, 2, …, k
   where |Ci| is the number of samples in cluster set Ci.
4. Test for completion: Compute the Euclidean norms of the differences between the mean vectors in the current and previous steps. Compute the residual error, E, as the sum of the k norms. Stop if E ≤ T, where T is a specified, nonnegative threshold. Else, go back to Step 2.

When T = 0, this algorithm is known to converge in a finite number of iterations to a local minimum. It is not guaranteed to yield the global minimum required to minimize Eq. (10-85). The result at convergence does depend on the initial values chosen for mi. An approach used frequently in data analysis is to specify the initial means as k randomly chosen samples from the given sample set, and to run the algorithm several times, with a new random set of initial samples each time. This is to test the "stability" of the solution. In image segmentation, the important issue is the value selected for k, because this determines the number of segmented regions; thus, multiple passes are rarely used.

† Remember, min_x(h(x)) is the minimum of h with respect to x, whereas arg min_x(h(x)) is the value (or values) of x at which h is minimum.

EXAMPLE 10.22 : Using k-means clustering for segmentation.

Figure 10.49(a) shows an image of size 688 × 688 pixels, and Fig. 10.49(b) is the segmentation obtained using the k-means algorithm with k = 3. As you can see, the algorithm was able to extract all the meaningful regions of this image with high accuracy. For example, compare the quality of the characters in both images. It is important to realize that the entire segmentation was done by clustering of a single variable (intensity). Because k-means works with vector observations in general, its power to discriminate between regions increases as the number of components of vector z in Eq. (10-84) increases.
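As a concrete illustration, the following is a minimal NumPy sketch of the four steps just listed. The function and variable names are ours, and seeding the initial means with randomly chosen samples is one of several reasonable initialization choices, as discussed above.

```python
import numpy as np

def kmeans(z, k, T=0.0, max_iters=100, seed=0):
    """Basic k-means on an array z of shape (Q, n): Q observations, n attributes.
    Follows the four steps in the text; initial means are k randomly chosen samples."""
    rng = np.random.default_rng(seed)
    m = z[rng.choice(len(z), size=k, replace=False)].astype(float)   # Step 1: seeds
    for _ in range(max_iters):
        # Step 2: assign each sample to the cluster with the nearest mean.
        d2 = ((z[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)      # squared distances
        labels = d2.argmin(axis=1)
        # Step 3: update the cluster centers (means).
        new_m = np.array([z[labels == i].mean(axis=0) if np.any(labels == i) else m[i]
                          for i in range(k)])
        # Step 4: residual error E = sum of the norms of the mean changes.
        E = np.linalg.norm(new_m - m, axis=1).sum()
        m = new_m
        if E <= T:
            break
    return labels, m

# Segmenting a grayscale image by intensity alone (k = 3), as in Example 10.22:
# labels, means = kmeans(image.reshape(-1, 1), k=3)
# segmented = means[labels].reshape(image.shape)
```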
REGION SEGMENTATION USING SUPERPIXELS

The idea behind superpixels is to replace the standard pixel grid by grouping pixels into primitive regions that are more perceptually meaningful than individual pixels. The objectives are to lessen computational load, and to improve the performance of segmentation algorithms by reducing irrelevant detail. A simple example will help explain the basic approach of superpixel representations.
Figure 10.50(a) shows an image of size 600 × 800 (480,000) pixels containing various levels of detail that could be described verbally as: "This is an image of two large carved figures in the foreground, and at least three, much smaller, carved figures resting on a fence behind the large figures. The figures are on a beach, with the ocean and sky in the background."


FIGURE 10.50 (a) Image of size 600 × 800 (480,000) pixels. (b) Image composed of 4,000 superpixels (the boundaries between superpixels, in white, are superimposed on the superpixel image for reference—the boundaries are not part of the data). (c) Superpixel image. (Original image courtesy of the U.S. National Park Services.) Figures 10.50(b) and (c) were obtained using a method to be discussed later in this section.

Figure 10.50(b) shows the same image represented by 4,000 superpixels and their boundaries (the boundaries are shown for reference—they are not part of the data), and Fig. 10.50(c) shows the superpixel image. One could argue that the level of detail in the superpixel image would lead to the same description as the original, but the former contains only 4,000 primitive units, as opposed to 480,000 in the original. Whether the superpixel representation is "adequate" depends on the application. If the objective is to describe the image at the level of detail mentioned above, then the answer is yes. On the other hand, if the objective is to detect imperfections at pixel-level resolutions, then the answer obviously is no. And there are applications, such as computerized medical diagnosis, in which approximate representations of any kind are not acceptable. Nevertheless, there are numerous application areas, such as image-database queries, autonomous navigation, and certain branches of robotics, in which economy of implementation and potential improvements in segmentation performance far outweigh any appreciable loss of image detail.
One important requirement of any superpixel representation is adherence to boundaries. This means that boundaries between regions of interest must be preserved in a superpixel image. We can see that this indeed is the case with the image in Fig. 10.50(c). Note, for example, how clear the boundaries between the figures and the background are. The same is true of the boundaries between the beach and the ocean, and between the ocean and the sky. Other important characteristics are the preservation of topological properties and, of course, computational efficiency. The superpixel algorithm discussed in this section meets these requirements.
As another illustration, we show the results of severely decreasing the number of superpixels to 1,000, 500, and 250. The results in Fig. 10.51 show a significant loss of detail compared to Fig. 10.50(a), but the first two images contain most of the detail relevant to the image description discussed earlier. A notable difference is that two of the three small carvings on the fence in the back were eliminated. The 250-element superpixel image even lost the third. However, the boundaries between the principal regions, as well as the basic topology of the images, were preserved.

FIGURE 10.51 Top row: Results of using 1,000, 500, and 250 superpixels in the representation of Fig. 10.50(a). As before, the boundaries between superpixels are superimposed on the images for reference. Bottom row: Superpixel images.

SLIC Superpixel Algorithm

In this section we discuss an algorithm for generating superpixels, called simple linear iterative clustering (SLIC). This algorithm, developed by Achanta et al. [2012], is conceptually simple, and has computational and other performance advantages over other superpixel techniques. SLIC is a modification of the k-means algorithm discussed in the previous section. SLIC observations typically use (but are not limited to) 5-dimensional vectors containing three color components and two spatial coordinates. For example, if we are using the RGB color system, the 5-dimensional vector associated with an image pixel has the form (as you will learn in Chapter 11, vectors containing image attributes are called feature vectors)

    z = [r, g, b, x, y]^T        (10-86)

where (r, g, b) are the three color components of a pixel, and (x, y) are its two spatial coordinates. Let nsp denote the desired number of superpixels and let ntp denote the total number of pixels in the image. The initial superpixel centers, mi = [ri gi bi xi yi]^T, i = 1, 2, …, nsp, are obtained by sampling the image on a regular grid spaced s units apart.

To generate superpixels that are approximately equal in size (i.e., area), the grid spacing interval is selected as s = [ntp/nsp]^(1/2). To prevent centering a superpixel on the Specifying the Distance Measure
edge of the image, and to reduce the chances of starting at a noisy point, the initial
SLIC superpixels correspond to clusters in a space whose coordinates are colors
cluster centers are moved to the lowest gradient position in the 3 × 3 neighborhood
and spatial variables. It would be senseless to use a single Euclidean distance in this
about each center.
case, because the scales in the axes of this coordinate system are different and unre-
The SLIC superpixel algorithm consists of the following steps. Keep in mind that
lated. In other words, spatial and color distances must be treated separately. This is
superpixels are vectors in general. When we refer to a “pixel” in the algorithm, we
accomplished by normalizing the distance of the various components, then combin-
are referring to the ( x, y) location of the superpixel relative to the image.
ing them into a single measure. Let dc and ds denote the color and spatial Euclidean
1. Initialize the algorithm: Compute the initial superpixel cluster centers, distances between two points in a cluster, respectively:
12
m i = [ ri gi bi xi yi ] , i = 1, 2, … , nsp
T dc = (rj − ri )2 + ( g j − gi )2 + (bj − bi )2  (10-87)

and
by sampling the image at regular grid steps, s. Move the cluster centers to the
lowest gradient position in a 3 × 3 neighborhood. For each pixel location, p, in 12
the image, set a label L( p) = −1 and a distance d( p) = ". ds = ( x j − xi )2 + ( y j − yi )2  (10-88)
2. Assign samples to cluster centers: For each cluster center m i , i = 1, 2, … , nsp ,
compute the distance, Di ( p) between m i and each pixel p in a 2 s × 2 s neighbor- We then define D as the composite distance
hood about m i . Then, for each p and i = 1, 2, … , nsp , if Di < d( p), let d( p) = Di 12
and L( p) = i.  d 2 d 2
D = a c b + a s b  (10-89)
3. Update the cluster centers: Let Ci denote the set of pixels in the image with  dcm dsm 
label L( p) = i. Update m i :
where dcm and dsm are the maximum expected values of dc and ds . The maximum spa-
tial distance should correspond to the sampling interval; that is, dsm = s = [ ntp nsp ]1 2 .
1
mi =
Ci
∑z i = 1, 2, … , nsp Determining the maximum color distance is not as straightforward, because these
z ∈Ci distances can vary significantly from cluster to cluster, and from image to image. A
solution is to set dcm to a constant c so that Eq. (10-89) becomes
where Ci is the number of pixels in set Ci , and the z’s are given by Eq. (10-86). 12
4. Test for convergence: Compute the Euclidean norms of the differences between  d 2 d 2
D = a c b + a s b  (10-90)
the mean vectors in the current and previous steps. Compute the residual error,  c s 
E, as the sum of the nsp norms. If E < T , where T a specified nonnegative thresh-
old, go to Step 5. Else, go back to Step 2. We can write this equation as
5. Post-process the superpixel regions: Replace all the superpixels in each region, 12
 d 2 
Ci , by their average value, m i . D = dc2 + a s b c 2  (10-91)
 s 
Note in Step 5 that superpixels end up as contiguous regions of constant value. The
average value is not the only way to compute this constant, but it is the most widely This is the distance measure used for each cluster in the algorithm. Constant c can be
used. For graylevel images, the average is just the average intensity of all the pixels used to weigh the relative importance between color similarity and spatial proximity.
in the region spanned by the superpixel. This algorithm is similar to the k-means When c is large, spatial proximity is more important, and the resulting superpixels
algorithm in the previous section, with the exceptions that the distances, Di , are not are more compact. When c is small, the resulting superpixels adhere more tightly to
specified as Euclidean distances (see below), and that these distances are computed image boundaries, but have less regular size and shape.
for regions of size 2 s × 2 s, rather than for all the pixels in the image, thus reduc- For grayscale images, as in Example 10.23 below, we use
ing computation time significantly. In practice, SLIC convergence with respect to 12
E can be achieved with fairly large values of T. For example, all results reported by dc = (l j − li )2  (10-92)
Achanta et al. [2012] were obtained using T = 10.
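To make the procedure concrete, here is a compact, unoptimized sketch of the SLIC loop for a grayscale image, using the composite distance of Eq. (10-91) with dc taken as the intensity difference. The function name, the default values of c and T, and the omission of the gradient-based adjustment of the initial centers are our simplifications for illustration, not part of the algorithm as published.

```python
import numpy as np

def slic_gray(image, n_sp=100, c=10.0, T=10.0, max_iters=10):
    """Minimal SLIC sketch for a grayscale image (2-D float array).
    Produces roughly n_sp clusters; the grid spacing is s = sqrt(n_tp / n_sp)."""
    M, N = image.shape
    s = max(int(np.sqrt(image.size / n_sp)), 1)
    # Initial cluster centers [l, x, y] on a regular grid.
    ys, xs = np.meshgrid(np.arange(s // 2, M, s), np.arange(s // 2, N, s), indexing="ij")
    centers = np.stack([image[ys, xs].ravel(),
                        xs.ravel().astype(float),
                        ys.ravel().astype(float)], axis=1)
    labels = -np.ones((M, N), dtype=int)
    dist = np.full((M, N), np.inf)

    for _ in range(max_iters):
        for i, (l_i, x_i, y_i) in enumerate(centers):
            # Search only a 2s x 2s neighborhood about each center.
            x0, x1 = int(max(x_i - s, 0)), int(min(x_i + s + 1, N))
            y0, y1 = int(max(y_i - s, 0)), int(min(y_i + s + 1, M))
            patch = image[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            dc2 = (patch - l_i) ** 2
            ds2 = (xx - x_i) ** 2 + (yy - y_i) ** 2
            D = np.sqrt(dc2 + (ds2 / s**2) * c**2)      # composite distance, Eq. (10-91)
            better = D < dist[y0:y1, x0:x1]
            dist[y0:y1, x0:x1][better] = D[better]
            labels[y0:y1, x0:x1][better] = i
        # Update the centers and compute the residual error E.
        new_centers = centers.copy()
        for i in range(len(centers)):
            mask = labels == i
            if mask.any():
                yy, xx = np.nonzero(mask)
                new_centers[i] = [image[mask].mean(), xx.mean(), yy.mean()]
        E = np.linalg.norm(new_centers - centers, axis=1).sum()
        centers = new_centers
        if E < T:
            break
    return labels, centers
```

A post-processing pass that replaces each cluster with its average value, and a connected-components cleanup for stray pixels, would complete the superpixel image as described in the text.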


in Eq. (10-91), where the l’s are intensity levels of the points for which the distance
is being computed.
In 3-D, superpixels become supervoxels, which are handled by defining

12
ds = ( x j − xi )2 + ( y j − yi )2 + (zj − zi )2  (10-93)

where the z’s are the coordinates of the third spatial dimension. We must also add
the third spatial variable, z, to the vector in Eq. (10-86).
Because no provision is made in the algorithm to enforce connectivity, it is pos-
sible for isolated pixels to remain after convergence. These are assigned the label
of the nearest cluster using a connected components algorithm (see Section 9.6).
Although we explained the algorithm in the context of RGB color components, the
method is equally applicable to other colors systems. In fact, other components of
vector z in Eq. (10-86) (with the exception of the spatial variables) could be other
real-valued feature values, provided that a meaningful distance measure can be
defined for them.

EXAMPLE 10.23 : Using superpixels for image segmentation.


Figure 10.52(a) shows an image of an iceberg, and Fig. 10.52(b) shows the result of segmenting this
image using the k-means algorithm developed in the last section, with k = 3. Although the main regions
of the image were segmented, there are numerous segmentation errors in both regions of the iceberg, a b
and also on the boundary separating it from the background. Errors are visible as isolated pixels (and c d e
also as small groups of pixels) with the wrong shade (e.g., black pixels within a white region). Figure FIGURE 10.52 (a) Image of size 533 × 566 (301,678) pixels. (b) Image segmented using the k-means algorithm.
10.52(c) shows a 100-superpixel representation of the image with the superpixel boundaries superim- (c) 100-element superpixel image showing boundaries for reference. (d) Same image without boundaries. (e) Super-
posed for reference, and Fig. 10.52(d) shows the same image without the boundaries. Figure 10.52(e) is pixel image (d) segmented using the k-means algorithm. (Original image courtesy of NOAA.)
the segmentation of (d) using the k-means algorithm with k = 3 as before. Note the significant improve-
ment over the result in (b), indicating that the original image has considerably more (irrelevant) detail
than is needed for a proper segmentation. In terms of computational advantage, consider that generat- IMAGES AS GRAPHS
ing Fig. 10.52(b) required individual processing of over 300K pixels, while (e) required processing of 100 A graph, G, is a mathematical structure consisting of a set V of nodes and a set E of
Nodes and edges are also
pixels with considerably fewer shades of gray. referred to as vertices edges connecting those vertices:
and links, respectively.
G = (V, E ) (10-94)
where V is a set and
10.6 REGION SEGMENTATION USING GRAPH CUTS
10.6
See Section 2.5 for an
E 8 V×V (10-95)
In this section, we discuss an approach for partitioning an image into regions by explanation of the
expressing the pixels of the image as nodes of a graph, and then finding an optimum Cartesian product V × V
and for a review of the is a set of ordered pairs of elements from V. If (u, v) ∈ E implies that (v, u) ∈ E, and
partition (cut) of the graph into groups of nodes. Optimality is based on criteria whose set symbols used in this vice versa, the graph is said to be undirected; otherwise the graph is directed. For
values are high for members within a group (i.e., a region) and low across members of section.
example, we may consider a street map as a graph in which the nodes are street
different groups. As you will see later in this section, graph-cut segmentation is capa-
intersections, and the edges are the streets connecting those intersections. If all
ble in some cases of results that can be superior to the results achievable by any of the
streets are bidirectional, the graph is undirected (meaning that we can travel both
segmentation methods studied thus far. The price of this potential benefit is added
ways from any two intersections). Otherwise, if at least one street is a one-way street,
complexity in implementation, which generally translates into slower execution.
the graph is directed.


The types of graphs in which we are interested are undirected graphs whose a b
edges are further characterized by a matrix, W, whose element w(i, j ) is a weight c d
associated with the edge that connects nodes i and j. Because the graph is undirected, FIGURE 10.53
w(i, j ) = w( j, i), which means that W is a symmetric matrix. The weights are selected (a) A 3 × 3 image.
to be proportional to one or more similarity measures between all pairs of nodes. A (c) A corresponding
graph.
graph whose edges are associated with weights is called a weighted graph. (d) Graph cut.
The essence of the material in this section is to represent an image to be seg- (c) Segmented
Superpixels are also well
mented as a weighted, undirected graph, where the nodes of the graph are the pixels image.
suited for use as graph in the image, and an edge is formed between every pair of nodes. The weight, w(i, j ),
nodes. Thus, when we
refer in this section to
of each edge is a function of the similarity between nodes i and j. We then seek to Image ⇓ Segmentation
partition the nodes of the graph into disjoint subsets V1 , V2 ,…, VK where, by some
“pixels” in an image, we
are, by implication,
also referring to super-
measure, the similarity among the nodes within a subset is high, and the similarity ⇓ Node
pixels. across the nodes of different subsets is low. The nodes of the partitioned subsets Edge
correspond to the regions in the segmented image.


Set V is partitioned into subsets by cutting the graph. A cut of a graph is a parti-
tion of V into two subsets A and B such that
Graph Cut
A ´ B = V and A ¨ B = ∅ (10-96)
image graphs. Figure 10.54 shows the same graph as the one we just discussed, but
where the cut is implemented by removing the edges connecting subgraphs A and B. here you see two additional nodes called the source and sink terminal nodes, respec-
There are two key aspects of using graph cuts for image segmentation: (1) how to tively, each connected to all nodes in the graph via unidirectional links called t-links.
associate a graph with an image; and (2) how to cut the graph in a way that makes The terminal nodes are not part of the image; their role, for example, is to associate
sense in terms of partitioning the image into background and foreground (object) with each pixel a probability that it is a background or foreground (object) pixel.
pixels. We address these two questions next. The probabilities are the weights of the t-links. In Figs. 10.54(c) and (d), the thickness
Figure 10.53 shows a simplified approach for generating a graph from an image. of each t-link is proportional to the value of the probability that the graph node to
The nodes of the graph correspond to the pixels in the image and, to keep the expla- which it is connected is a foreground or background pixel (the thicknesses shown
nation simple, we allow edges only between adjacent pixels using 4-connectivity, are so that the segmentation result would be the same as in Fig. 10.53). Which of the
which means that there are no diagonal edges linking the pixels. But, keep in mind two nodes we call background or foreground is arbitrary.
that, in general, edges are specified between every pair of pixels. The weights for the
edges typically are formed from spatial relationships (for example, distance from the
vertex pixel) and intensity measures (for example, texture and color), consistent with MINIMUM GRAPH CUTS
exhibiting similarity between pixels. In this simple example, we define the degree Once an image has been expressed as a graph, the next step is to cut the graph into
of similarity between two pixels as the inverse of the difference in their intensities. two or more subgraphs. The nodes (pixels) in each resulting subgraph correspond
That is, for two nodes (pixels) ni and n j , the weight of the edge between them is to a region in the segmented image. Approaches based on Fig. 10.54 rely on inter-
w(i, j ) = 1"A # I (ni ) − I (n j ) # + cB, where I (ni ) and I (n j ), are the intensities of the two preting the graph as a flow network (of pipes, for example) and obtaining what is
nodes (pixels) and c is a constant included to prevent division by 0. Thus, the closer commonly referred to as a minimum graph cut. This formulation is based on the
the values of intensity between adjacent pixels is, the larger the value of w will be. so-called Max-Flow, Min-Cut Theorem. This theorem states that, in a flow network,
For illustrative purposes, the thickness of each edge in Fig. 10.53 is shown propor- the maximum amount of flow passing from the source to the sink is equal to the
tional to the degree of similarity between the pixels that it connects (see Problem minimum cut. This minimum cut is defined as the smallest total weight of the edges
10.44). As you can see in the figure, the edges between the dark pixels are stronger that, if removed, would disconnect the sink from the source:
than the edges between dark and light pixels, and vice versa. Conceptually, segmen-
tation is achieved by cutting the graph along its weak edges, as illustrated by the
dashed line in Fig. 10.53(d). Figure 10.53(c) shows the segmented image. cut( A, B) = ∑
u ∈A,v∈B
w(u, v) (10-97)
Although the basic structure in Fig. 10.53 is the focus of the discussion in this
section, we mention for completeness another common approach for constructing


a b FIGURE 10.55 A min cut A more meaningful cut


c d An example
FIGURE 10.54 showing how a
(a) Same image min cut can lead
as in Fig. 10.53(a). to a meaningless
(c) Corresponding segmentation. In
graph and terminal this example, the
nodes. (d) Graph similarity between
cut. (b) Segmented pixels is defined
image. as their spatial
proximity, which
Image ⇓ Segmentation results in two

⇓ distinct regions.

Source Terminal Source Terminal


(Background) (Background)

Graph
their proximity, such as the partition shown in Fig. 10.55. The approach presented in
this section, proposed by Shi and Malik [2000] (see also Hochbaum [2010]), is aimed
at avoiding this type of behavior by redefining the concept of a cut.
Instead of looking at the total weight value of the edges that connect two parti-
⇓ tions, the idea is to work with a measure of “disassociation” that computes the cost
as a fraction of the total edge connections to all nodes in the graph. This measure,
called the normalized cut (Ncut), is defined as
Cut
cut( A, B) cut( A, B)
Ncut ( A, B) = + (10-98)
assoc( A, V ) assoc(B, V )

Sink Terminal Sink Terminal where cut( A, B) is given by Eq. (10-97) and
(Foreground) (Foreground)
assoc( A, V ) = ∑
u ∈A, z ∈V
w(u, z) (10-99)
where A and B satisfy Eq. (10-96). The optimum partition of a graph is the one that
is the sum of the weights of all the edges from the nodes of subgraph A to the nodes
minimizes this cut value. There is an exponential number of such partitions, which
of the entire graph. Similarly,
would present us with an intractable computational problem. However, efficient
algorithms that run in polynomial time have been developed for solving max-flow
problems. Therefore, based on the Max-Flow, Min-Cut Theorem, we can apply these
assoc(B, V ) = ∑
v ∈B, z ∈V
w(v, z) (10-100)
algorithms to image segmentation, provided that we cast segmentation as a flow
is the sum of the weights of the edges from all the edges in B to the entire graph. As
problem and select the weights for the edges and t-links such that minimum graph
you can see, assoc( A, V ) is simply the cut of A from the rest of the graph, and simi-
cuts will result in meaningful segmentations.
larly for assoc(B, V ).
Although the min-cut approach offers an elegant solution, it can result in group-
By using Ncut ( A, B) instead of cut( A, B), the cut that partitions isolated points
ings that favor cutting small sets of isolated nodes in a graph, leading to improper
will no longer have small values. You can see this, for example, by noting in Fig. 10.55
segmentations. Figure 10.55 shows an example, in which the two regions of interest
that if A is the single node shown, cut( A, B) and assoc( A, V ) will have the same val-
are characterized by the tightness of the pixel groupings. Meaningful edge weights
ue. Thus, independently of how small cut( A, B) is, Ncut ( A, B) will always be greater
that reflect this property would be inversely proportional to the distance between
than or equal to 1, thus providing normalization for “pathological” cases such as this.
pairs of points. But this would lead to weights that would be smaller for isolated
Based on similar concepts, we can define a measure for total normalized associa-
points, resulting in min cuts such as the example in Fig. 10.55. In fact, any cut that
tion within graph partitions as
partitions out individual points on the left of the figure will have a smaller cut value
in Eq. (10-97) than a cut that properly partitions the points into two groups based on


Eq. (10-105) gives K eigenvalues and K eigenvectors, each corresponding to one eigenvalue. The solution to our problem is the eigenvector corresponding to the second smallest eigenvalue.

    Nassoc(A, B) = assoc(A, A)/assoc(A, V) + assoc(B, B)/assoc(B, V)        (10-101)
smallest eigenvalue.
where assoc( A, A) and assoc(B, B) are the total weights connecting the nodes within We can convert the preceding generalized eigenvalue formulation into a standard
A and within B, respectively. It is not difficult to show (see Problem 10.46) that eigenvalue problem by writing Eq. (10-105) as (see Problem 10.45):

Ncut ( A, B) = 2 − Nassoc( A, B) (10-102)


Az = l z (10-106)
which implies that minimizing Ncut ( A, B) simultaneously maximizes Nassoc( A, B).
Based on the preceding discussion, image segmentation using graph cuts is now where
based on finding a partition that minimizes Ncut ( A, B). Unfortunately, minimizing −1 − 12
this quantity exactly is an NP-complete computational task, and we can no longer A = D 2 (D − W)D (10-107)
rely on the solutions available for max flow because the approach being followed
now is based on the concepts explained in connection with Fig. 10.53. However, Shi and
and Malik [2000] (see also Hochbaum [2010]) were able to find an approximate dis- 1
z = D2 y (10-108)
crete solution to minimizing Ncut ( A, B) by formulating minimization as a general-
ized eigenvalue problem, for which numerous implementations exist. from which it follows that
−1
COMPUTING MINIMAL GRAPH CUTS y = D 2z (10-109)
As above, let V denote the nodes of a graph G, and let A and B be two subsets
If the nodes of graph of V satisfying Eq. (10-96). Let K denote the number of nodes in V and define a Thus, we can find the (continuous-valued) eigenvector corresponding to the second
G are the pixels in an K-dimensional indicator vector, x, whose element xi has the property xi = 1 if node smallest eigenvalue using either a generalized or a standard eigenvalue solver. The
image, then K = M × N,
where M and N are the ni of V is in A and xi = −1 if it is in B. Let desired (discrete) vector x can be generated from the resulting, continuous valued
number of rows and
solution vector by finding a splitting point that divides the values of the continuous
di = ∑ w(i, j )
columns in the image.
(10-103) eigenvector elements into two parts. We do this by finding the splitting point that
j
yields the smallest value of Ncut ( A, B), since this is the quantity we are trying to
be the sum of the weights from node ni to all other nodes in V. Using these defini- minimize. To simplify the search, we divide the range of values in the continuous
tions, we can write Eq. (10-98) as vector into Q evenly spaced values, evaluate Eq. (10-104) for each value, and choose
the splitting point that yields the smallest value of Ncut ( A, B). Then, all values of the
cut( A, B) cut( A, B) eigenvector with values above the split point are assigned the value 1; all others are
Ncut ( A, B) = +
cut( A, V ) cut(B, V ) assigned the value −1. The result is the desired vector x. Then, partition A is the set
∑ − w(i, j )xi x j ∑ −w(i, j )xi x j (10-104) nodes in V corresponding to 1’s in x; the remaining nodes correspond to partition B.
xi > 0, x j < 0 xi < 0 , x j > 0 This partitioning is carried out only if the stability criterion discussed in the follow-
= +
ing paragraph is met.
∑ di
xi > 0
∑ di
xi < 0 Searching for a splitting point implies computing a total of Q values of Ncut ( A, B)
and selecting the smallest one. A region that is not clearly segmentable into two
The objective is to find a vector, x, that minimizes Ncut ( A, B). A closed-form solu- subregions using the specified weights will usually result in many splitting points
tion that minimizes Eq. (10-104) can be found, but only if the elements of x are with similar values of Ncut ( A, B). Trying to segment such a region is likely to result
allowed to be real, continuous numbers instead of being constrained to be ±1. The in a meaningless partition. To avoid this behavior, a region (i.e., subgraph) is split
solution derived by Shi and Malik [2000] is given by solving the generalized eigen- only if it satisfies a stability criterion, obtained by first computing the histogram of
system expression the eigenvector values, then forming the ratio of the minimum to the maximum bin
(D − W)y = lDy (10-105) counts. In an “uncertain” eigenvector, the values in the histogram will stay relatively
the same, and the ratio will be relatively high. Shi and Malik [2000] found experi-
where D is a K × K diagonal matrix with main-diagonal elements di , i = 1, 2, … , K , mentally that thresholding the ratio at 0.06 was a effective criterion for not splitting
and W is a K × K weight matrix with elements w(i, j ), as defined earlier. Solving the region in question.


GRAPH CUT SEGMENTATION ALGORITHM

In the preceding discussion, we illustrated two ways in which edge weights can be generated from an image. In Figs. 10.53 and 10.54, we looked at weights generated

    w(i, j) = exp( −[I(ni) − I(nj)]² / σI² ) exp( −dist(ni, nj) / σd² )   if dist(ni, nj) < r;   w(i, j) = 0 otherwise
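Below is a small NumPy sketch that builds a weight matrix of this form for a grayscale image, with every pixel treated as a graph node. The parameter values are illustrative only, and the dense matrix is practical only for small images (or for superpixels used as nodes).

```python
import numpy as np

def graph_weights(image, sigma_I=0.1, sigma_d=5.0, r=8.0):
    """Weight matrix w(i, j) from intensity similarity and spatial proximity.
    Pixels farther apart than r get zero weight. W has (M*N)^2 entries, so this
    is intended for small images or for superpixel nodes."""
    M, N = image.shape
    yy, xx = np.mgrid[0:M, 0:N]
    coords = np.stack([yy.ravel(), xx.ravel()], axis=1).astype(float)
    I = image.ravel().astype(float)

    diff2 = (I[:, None] - I[None, :]) ** 2                    # [I(ni) - I(nj)]^2
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    W = np.exp(-diff2 / sigma_I**2) * np.exp(-dist / sigma_d**2)
    W[dist >= r] = 0.0                                        # keep only nearby pairs
    return W
```

A matrix produced this way can be passed directly to the normalized-cut bipartition sketch given earlier in this section.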
using image intensity values, and in Fig. 10.55 we considered weights based on the
distance between pixels. But these are just two examples of the many ways that
we can generate a graph and corresponding weights from an image. For example, where I (ni ) is the intensity of node ni , s I2 and sd2 are constants determining the spread of the two
we could use color, texture, statistical moments about a region, and other types of Gaussian-like functions, dist(ni , n j ) is the distance (e.g., the Euclidean distance) between the two nodes,
features to be discussed in Chapter 11. In general, then, graphs can be constructed and r is a radial constant that establishes how far away we are willing to consider similarity. The expo-
from image features, of which pixel intensities are a special case. With this concept nential terms decrease as a function of dissimilarity in intensity and as function of distance between the
as background, we can summarize the discussion thus far in this section as the fol- nodes, as required of our measure of similarity in this case.
lowing algorithm:

1. Given a set of features, specify a weighted graph, G = (V , E ) in which V contains


the points in the feature space, and E contains the edges of the graph. Compute EXAMPLE 10.25 : Segmentation using graph cuts.
the edge weights and use them to construct matrices W and D. Let K denote the Graph cuts are ideally suited for obtaining a rough segmentation of the principal regions in an image.
desired number of partitions of the graph. Figure 10.56 shows a typical result. Figure 10.56(a) is the familiar building image. Consistent with the
2. Solve the eigenvalue system (D − W)y = lDy to find the eigenvector with the idea of extracting the principal regions of an image, Fig. 10.56(b) shows the image smoothed with a
second smallest eigenvalue. simple 25 × 25 box kernel. Observe how the fine detail is smoothed out, leaving only major regional
3. Use the eigenvector from Step 2 to bipartition the graph by finding the splitting features such as the facade and sky. Figure 10.56(c) is the result of segmentation using the graph cut
point such that Ncut ( A, B) is minimized. algorithm just developed, with weights of the form discussed in the previous example, and allowing only
4. If the number of cuts has not reached K, decide if the current partition should two partitions. Note how well the region corresponding to the building was extracted, with none of the
be subdivided by checking the stability of the cut. details characteristic of the methods discussed earlier in this chapter. In fact, it would have been nearly
impossible to obtain comparable results using any of the methods we have discussed thus far without
5. Recursively repartition the segmented parts if necessary.
significant additional processing. This type of result is ideal for tasks such as providing broad cues for
Note that the algorithm works by recursively generating two-way cuts. The number of autonomous navigation, for searching image databases, and for low-level image analysis.
groups (e.g., regions) in the segmented image is controlled by K. Other criteria, such
as the maximum size allowed for each cut, can further refine the final segmentation.
For example, when using pixels and their intensities as the basis for constructing the
10.7 SEGMENTATION USING MORPHOLOGICAL WATERSHEDS
graph, we can specify the maximum and/or minimum size allowed for each region. 10.7

Thus far, we have discussed segmentation based on three principal concepts: edge
detection, thresholding, and region extraction. Each of these approaches was found
EXAMPLE 10.24 : Specifying weights for graph cut segmentation. to have advantages (for example, speed in the case of global thresholding) and dis-
In Fig. 10.53, we illustrated how to generate graph weights using intensity values, and in Fig. 10.55 we advantages (for example, the need for post-processing, such as edge linking, in edge-
discussed briefly how to generate weights based on the distance between pixels. In this example, we give based segmentation). In this section, we discuss an approach based on the concept of
a more practical approach for generating weights that include both intensity and distance from a pixel, so-called morphological watersheds. Segmentation by watersheds embodies many of
thus introducing the concept of a neighborhood in graph segmentation. the concepts of the other three approaches and, as such, often produces more stable
Let ni and n j denote two nodes (image pixels). As mentioned earlier in this section, weights are sup- segmentation results, including connected segmentation boundaries. This approach
posed to reflect the similarity between nodes in a graph. When considering segmentation, one of the also provides a simple framework for incorporating knowledge-based constraints
principal ways to establish how likely two pixels in an image are to be a part of the same region or object (see Fig. 1.23) in the segmentation process, as we discuss at the end of this section.
is to determine the difference in their intensity values, and how close the pixels are to each other. The
weight value of the edge between two pixels should be large when the pixels are very close in intensity BACKGROUND
and proximity (i.e., when the pixels are “similar), and should decrease as their intensity difference and The concept of a watershed is based on visualizing an image in three dimensions,
distance from each other increases. That is, the weight value should be a function of how similar the two spatial coordinates versus intensity, as in Fig. 2.18(a). In such a “topographic”
pixels are in intensity and distance. These two concepts can be embedded into a single weight function interpretation, we consider three types of points: (1) points belonging to a regional
using the following expression: minimum; (2) points at which a drop of water, if placed at the location of any of those


a c
b d
FIGURE 10.57
(a) Original
image.
(b) Topographic
view. Only the
background is
black. The basin
on the left is
slightly lighter
than black.
(c) and (d) Two
stages of flooding.
a b c All constant dark
values of gray are
FIGURE 10.56 (a) Image of size 600 × 600 pixels. (b) Image smoothed with a 25 × 25 box kernel. (c) Graph cut segmen- intensities in the
tation obtained by specifying two regions. original image. Water Water
Only constant
light gray repre-
sents “water.”
(Courtesy of Dr.
points, would fall with certainty to a single minimum; and (3) points at which water S. Beucher, CMM/
would be equally likely to fall to more than one such minimum. For a particular Ecole des Mines Water
regional minimum, the set of points satisfying condition (2) is called the catchment de Paris.)
(Continued on
basin or watershed of that minimum. The points satisfying condition (3) form crest next page.)
lines on the topographic surface, and are referred to as divide lines or watershed lines.
The principal objective of segmentation algorithms based on these concepts is to
find the watershed lines. The method for doing this can be explained with the aid of
Fig. 10.57. Figure 10.57(a) shows a gray-scale image and Fig. 10.57(b) is a topograph-
ic view, in which the height of the “mountains” is proportional to intensity values in
the input image. For ease of interpretation, the backsides of structures are shaded.
This is not to be confused with intensity values; only the general topography of the
three-dimensional representation is of interest. In order to prevent the rising water effect is more pronounced as water continues to rise, as shown in Fig. 10.57(g). This
from spilling out through the edges of the image, we imagine the perimeter of the figure shows a longer dam between the two catchment basins and another dam in
entire topography (image) being enclosed by dams that are higher than the highest the top part of the right basin. The latter dam was built to prevent merging of water
possible mountain, whose value is determined by the highest possible intensity value from that basin with water from areas corresponding to the background. This pro-
in the input image. cess is continued until the maximum level of flooding (corresponding to the highest
Suppose that a hole is punched in each regional minimum [shown as dark areas in intensity value in the image) is reached. The final dams correspond to the watershed
Fig. 10.57(b)] and that the entire topography is flooded from below by letting water lines, which are the desired segmentation boundaries. The result for this example is
Because of neighboring rise through the holes at a uniform rate. Figure 10.57(c) shows the first stage of flood- shown in Fig. 10.57(h) as dark, one-pixel-thick paths superimposed on the original
contrast, the leftmost ing, where the “water,” shown in light gray, has covered only areas that correspond image. Note the important property that the watershed lines form connected paths,
basin in Fig. 10.57(c)
appears black, but it is a to the black background in the image. In Figs. 10.57(d) and (e) we see that the water thus giving continuous boundaries between regions.
few shades lighter than now has risen into the first and second catchment basins, respectively. As the water One of the principal applications of watershed segmentation is in the extraction
the black background.
The mid-gray in the continues to rise, it will eventually overflow from one catchment basin into another. of nearly uniform (blob-like) objects from the background. Regions characterized
second basin is a natural The first indication of this is shown in 10.57(f). Here, water from the lower part of by small variations in intensity have small gradient values. Thus, in practice, we often
gray from the image
in (a). the left basin overflowed into the basin on the right, and a short “dam” (consisting of see watershed segmentation applied to the gradient of an image, rather than to the
single pixels) was built to prevent water from merging at that level of flooding (the image itself. In this formulation, the regional minima of catchment basins correlate
mathematical details of dam building are discussed in the following section). The nicely with the small value of the gradient corresponding to the objects of interest.


e f
g h
FIGURE 10.57
(Continued)
(e) Result of
further flooding.
(f) Beginning of
merging of water
from two
catchment basins
(a short dam was
built between
them).
(g) Longer dams.
(h) Final water-
shed (segmenta-
tion) lines super-
imposed on the
original image.
(Courtesy of Dr.
S. Beucher, CMM/
Ecole des Mines
de Paris.)

Origin
1 1 1
1 1 1
1 1 1

DAM CONSTRUCTION
Dam construction is based on binary images, which are members of 2-D integer
space Z 2 (see Sections 2.4 and 2.6). The simplest way to construct dams separating
sets of binary points is to use morphological dilation (see Section 9.2).
Figure 10.58 illustrates the basics of dam construction using dilation. Part (a)
shows portions of two catchment basins at flooding step n − 1, and Fig. 10.58(b)
shows the result at the next flooding step, n. The water has spilled from one basin
to another and, therefore, a dam must be built to keep this from happening. In
order to be consistent with notation to be introduced shortly, let M1 and M2 denote Second dilation
the sets of coordinates of points in two regional minima. Then let the set of coordi- a Dam points
nates of points in the catchment basin associated with these two minima at stage n − 1 b c
of flooding be denoted by Cn−1 (M1 ) and Cn−1 (M2 ), respectively. These are the two d
gray regions in Fig. 10.58(a). FIGURE 10.58 (a) Two partially flooded catchment basins at stage n − 1 of flooding. (b) Flooding at stage n, showing
See Sections 2.5 and 9.5
that water has spilled between basins. (c) Structuring element used for dilation. (d) Result of dilation and dam
regarding connected Let C [ n − 1] denote the union of these two sets. There are two connected com-
components. construction.
ponents in Fig. 10.58(a), and only one component in Fig. 10.58(b). This connected


component encompasses the earlier two components, which are shown dashed. Geometrically, T [ n] is the set of coordinates of points in g( x, y) lying below the
Two connected components having become a single component indicates that plane g( x, y) = n.
water between the two catchment basins has merged at flooding step n. Let this The topography will be flooded in integer flood increments, from n = min + 1 to
connected component be denoted by q. Note that the two components from step n = max + 1. At any step n of the flooding process, the algorithm needs to know
n − 1 can be extracted from q by performing a logical AND operation, q !C [ n − 1]. the number of points below the flood depth. Conceptually, suppose that the coordi-
Observe also that all points belonging to an individual catchment basin form a nates in T [ n] that are below the plane g( x, y) = n are “marked” black, and all other
single connected component. coordinates are marked white. Then when we look “down” on the xy-plane at any
Suppose that each of the connected components in Fig. 10.58(a) is dilated by increment n of flooding, we will see a binary image in which black points correspond
the structuring element in Fig. 10.58(c), subject to two conditions: (1) The dilation to points in the function that are below the plane g( x, y) = n. This interpretation is
has to be constrained to q (this means that the center of the structuring element quite useful, and will make it easier to understand the following discussion.
can be located only at points in q during dilation); and (2) the dilation cannot be Let Cn ( Mi ) denote the set of coordinates of points in the catchment basin associ-
performed on points that would cause the sets being dilated to merge (i.e., become ated with minimum Mi that are flooded at stage n. With reference to the discussion
a single connected component). Figure 10.58(d) shows that a first dilation pass (in in the previous paragraph, we may view Cn ( Mi ) as a binary image given by
light gray) expanded the boundary of each original connected component. Note that
condition (1) was satisfied by every point during dilation, and that condition (2) did Cn ( Mi ) = C ( Mi ) ! T [ n ] (10-111)
not apply to any point during the dilation process; thus, the boundary of each region
was expanded uniformly. In other words, Cn ( Mi ) = 1 at location ( x, y) if ( x, y) ∈ C ( Mi ) AND ( x, y) ∈T [ n ];
In the second dilation, shown in black in 10.58(d), several points failed condition otherwise Cn ( Mi ) = 0. The geometrical interpretation of this result is straightfor-
(1) while meeting condition (2), resulting in the broken perimeter shown in the figure. ward. We are simply using the AND operator to isolate at stage n of flooding the
It is evident that the only points in q that satisfy the two conditions under consid- portion of the binary image in T [ n] that is associated with regional minimum Mi .
eration describe the one-pixel-thick connected path shown crossed-hatched in Fig. Next, let B denote the number of number of flooded catchment basins at stage n,
10.58(d). This path is the desired separating dam at stage n of flooding. Construction and let C[ n] denote the union of these basins at stage n :
of the dam at this level of flooding is completed by setting all the points in the path B
just determined to a value greater than the maximum possible intensity value of the C[ n] = ∪ Cn ( Mi ) (10-112)
image (e.g., greater than 255 for an 8-bit image). This will prevent water from cross- i =1

ing over the part of the completed dam as the level of flooding is increased. As noted Then C[max + 1] is the union of all catchment basins:
earlier, dams built by this procedure, which are the desired segmentation boundaries,
B
are connected components. In other words, this method eliminates the problems of
C [ max + 1] = ∪ C ( Mi ) (10-113)
broken segmentation lines. i =1
Although the procedure just described is based on a simple example, the method
used for more complex situations is exactly the same, including the use of the 3 × 3 It can be shown (see Problem 10.47) that the elements in both Cn ( Mi ) and T [ n] are
symmetric structuring element in Fig. 10.58(c). never replaced during execution of the algorithm, and that the number of elements
in these two sets either increases or remains the same as n increases. Thus, it fol-
WATERSHED SEGMENTATION ALGORITHM lows that C[ n − 1] is a subset of C[ n]. According to Eqs. (10-112) and (10-113), C[ n]
is a subset of T [ n], so it follows that C[ n − 1] is also a subset of T [ n]. From this we
Let M1 , M2 ,…, MR be sets denoting the coordinates of the points in the regional
have the important result that each connected component of C[ n − 1] is contained
minima of an image, g( x, y). As mentioned earlier, this typically will be a gradient
in exactly one connected component of T [ n].
image. Let C ( Mi ) be a set denoting the coordinates of the points in the catchment
The algorithm for finding the watershed lines is initialized by letting C[min + 1] =
basin associated with regional minimum Mi (recall that the points in any catchment
T[min + 1]. The procedure then proceeds recursively, successively computing C[ n]
basin form a connected component). The notation min and max will be used to
from C[ n − 1], using the following approach. Let Q denote the set of connected com-
denote the minimum and maximum values of g( x, y). Finally, let T [ n] represent the
ponents in T [ n]. Then, for each connected component q ∈ Q[ n], there are three pos-
set of coordinates ( s, t ) for which g( s, t ) < n. That is,
sibilities:
    T[n] = {(s, t) | g(s, t) < n}        (10-110)

1. q ∩ C[n − 1] is empty.


2. q ∩ C[n − 1] contains one connected component of C[n − 1].


3. q ∩ C[n − 1] contains more than one connected component of C[n − 1].
FIGURE 10.59
The construction of C[ n] from C[ n − 1] depends on which of these three conditions (a) Image of blobs.
holds. Condition 1 occurs when a new minimum is encountered, in which case con- (b) Image gradient.
(c) Watershed lines,
nected component q is incorporated into C[ n − 1] to form C[ n]. Condition 2 occurs superimposed on
when q lies within the catchment basin of some regional minimum, in which case the gradient image.
q is incorporated into C[ n − 1] to form C[ n]. Condition 3 occurs when all (or part) (d) Watershed lines
of a ridge separating two or more catchment basins is encountered. Further flood- superimposed on
the original image.
ing would cause the water level in these catchment basins to merge. Thus, a dam (or (Courtesy of Dr.
dams if more than two catchment basins are involved) must be built within q to pre- S. Beucher, CMM/
vent overflow between the catchment basins. As explained earlier, a one-pixel-thick Ecole des Mines de
dam can be constructed when needed by dilating q ! C[ n − 1] with a 3 × 3 structur- Paris.)
ing element of 1’s, and constraining the dilation to q.
Algorithm efficiency is improved by using only values of n that correspond to
existing intensity values in g( x, y). We can determine these values, as well as the
values of min and max, from the histogram of g( x, y).
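Although the text describes dam construction in terms of morphological dilation, the same recursive flooding can be sketched compactly with a priority queue: pixels are visited in order of increasing g(x, y); a pixel reached by water from a single basin joins that basin (Conditions 1 and 2), and a pixel reached by water from two or more basins becomes a dam point (Condition 3). The following is a minimal, unoptimized Python sketch of that idea, not the book's implementation; the function name watershed_flood and the assumption that the regional minima are supplied as a labeled marker image are choices made here for illustration.

```python
import heapq
import numpy as np

def watershed_flood(g, markers):
    """Priority-flood watershed sketch.
    g: 2-D array (typically a gradient image).
    markers: signed-integer array, 0 = unlabeled, k > 0 = regional minimum k.
    Returns a label image in which -1 marks watershed (dam) pixels."""
    labels = markers.copy()
    rows, cols = g.shape
    nbrs = [(-1, 0), (1, 0), (0, -1), (0, 1),
            (-1, -1), (-1, 1), (1, -1), (1, 1)]   # 8-connectivity, as in the 3x3 SE
    heap, counter = [], 0                         # counter breaks ties in the heap

    # Seed the flooding front with the unlabeled neighbors of every marker pixel.
    for r in range(rows):
        for c in range(cols):
            if labels[r, c] > 0:
                for dr, dc in nbrs:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and labels[rr, cc] == 0:
                        heapq.heappush(heap, (g[rr, cc], counter, rr, cc))
                        counter += 1

    while heap:
        _, _, r, c = heapq.heappop(heap)
        if labels[r, c] != 0:
            continue                              # already flooded, or already a dam
        neighbor_labels = {labels[r + dr, c + dc]
                           for dr, dc in nbrs
                           if 0 <= r + dr < rows and 0 <= c + dc < cols
                           and labels[r + dr, c + dc] > 0}
        if len(neighbor_labels) == 1:
            labels[r, c] = neighbor_labels.pop()  # water arrives from a single basin
            for dr, dc in nbrs:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and labels[rr, cc] == 0:
                    heapq.heappush(heap, (g[rr, cc], counter, rr, cc))
                    counter += 1
        elif len(neighbor_labels) > 1:
            labels[r, c] = -1                     # waters would merge: build a dam here
    return labels
```

The dams produced this way are one pixel thick and connected, which is the property the discussion above emphasizes.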

EXAMPLE 10.26 : Illustration of the watershed segmentation algorithm.


Consider the image and its gradient in Figs. 10.59(a) and (b), respectively. Application of the watershed
algorithm just described yielded the watershed lines (white paths) shown superimposed on the gradient
image in Fig. 10.59(c). These segmentation boundaries are shown superimposed on the original image in
Fig. 10.59(d). As noted at the beginning of this section, the segmentation boundaries have the important
property of being connected paths.
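A result along these lines can be reproduced with standard library routines. The sketch below is not the code used to generate Fig. 10.59; it assumes scikit-image and SciPy are available, and the file name and the 5% gradient threshold used to pick flooding markers are illustrative assumptions.

```python
from scipy import ndimage as ndi
from skimage import io, filters, segmentation

# Hypothetical input file; any grayscale image of blob-like objects will do.
blobs = io.imread('blobs.tif').astype(float)

# The watershed transform is normally applied to a gradient image, so that
# catchment basins correspond to the interiors of objects and of the background.
gradient = filters.sobel(blobs)

# Use low-gradient regions as stand-ins for the regional minima (flooding sources).
markers, _ = ndi.label(gradient < 0.05 * gradient.max())

labels = segmentation.watershed(gradient, markers, watershed_line=True)
dams = (labels == 0)   # with watershed_line=True, label 0 marks the watershed lines
```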

THE USE OF MARKERS


Direct application of the watershed segmentation algorithm in the form discussed in the previous section generally leads to over-segmentation, caused by noise and other local irregularities of the gradient. As Fig. 10.60 illustrates, over-segmentation can be serious enough to render the result of the algorithm virtually useless. In this case, this means a large number of segmented regions. A practical solution to this problem is to limit the number of allowable regions by incorporating a preprocessing stage designed to bring additional knowledge into the segmentation procedure.
An approach used to control over-segmentation is based on the concept of markers. A marker is a connected component belonging to an image. We have internal markers, associated with objects of interest, and external markers, associated with the background. A procedure for marker selection typically will consist of two principal steps: (1) preprocessing; and (2) definition of a set of criteria that markers must satisfy. To illustrate, consider Fig. 10.60(a) again. Part of the problem that led to the over-segmented result in Fig. 10.60(b) is the large number of potential minima. Because of their size, many of these minima are irrelevant detail. As has been pointed out several times in earlier discussions, an effective method for minimizing the effect of small spatial detail is to filter the image with a smoothing filter. This is an appropriate preprocessing scheme in this case also.

FIGURE 10.60 (a) Electrophoresis image. (b) Result of applying the watershed segmentation algorithm to the gradient image. Over-segmentation is evident. (Courtesy of Dr. S. Beucher, CMM/Ecole des Mines de Paris.)
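As a rough illustration of why smoothing helps, the sketch below (an assumption-laden example, not part of the original discussion) counts the regional minima of a gradient image before and after Gaussian smoothing; the choice sigma = 3 and the file name in the usage comment are arbitrary.

```python
from scipy import ndimage as ndi
from skimage import filters, morphology

def count_regional_minima(img):
    """Count the regional minima of img; each one becomes a catchment basin."""
    _, n = ndi.label(morphology.local_minima(img))
    return n

def minima_before_and_after_smoothing(image, sigma=3):
    """Compare the number of gradient minima with and without Gaussian smoothing."""
    g = filters.sobel(image.astype(float))
    g_smooth = filters.sobel(ndi.gaussian_filter(image.astype(float), sigma=sigma))
    return count_regional_minima(g), count_regional_minima(g_smooth)

# Typical use (the file name is hypothetical):
#   n_raw, n_smooth = minima_before_and_after_smoothing(io.imread('gel.tif'))
# The smoothed gradient normally has far fewer minima, hence far fewer catchment
# basins, which directly reduces over-segmentation.
```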


FIGURE 10.61 (a) Image showing internal markers (light gray regions) and external markers (watershed lines). (b) Result of segmentation. Note the improvement over Fig. 10.60(b). (Courtesy of Dr. S. Beucher, CMM/Ecole des Mines de Paris.)

Suppose that we define an internal marker as (1) a region that is surrounded by points of higher “altitude”; (2) such that the points in the region form a connected component; and (3) in which all the points in the connected component have the same intensity value. After the image was smoothed, the internal markers resulting from this definition are shown as light gray, blob-like regions in Fig. 10.61(a). Next, the watershed algorithm was applied to the smoothed image, under the restriction that these internal markers be the only allowed regional minima. Figure 10.61(a) shows the resulting watershed lines. These watershed lines are defined as the external markers. Note that the points along the watershed line pass along the highest points between neighboring markers.
The external markers in Fig. 10.61(a) effectively partition the image into regions, with each region containing a single internal marker and part of the background. The problem is thus reduced to partitioning each of these regions into two: a single object, and its background. We can bring to bear on this simplified problem many of the segmentation techniques discussed earlier in this chapter. Another approach is simply to apply the watershed segmentation algorithm to each individual region. In other words, we simply take the gradient of the smoothed image [as in Fig. 10.59(b)] and restrict the algorithm to operate on a single watershed that contains the marker in that particular region. Figure 10.61(b) shows the result obtained using this approach. The improvement over the image in Fig. 10.60(b) is evident.
Marker selection can range from simple procedures based on intensity values and connectivity, as we just illustrated, to more complex descriptions involving size, shape, location, relative distances, texture content, and so on (see Chapter 11 regarding feature descriptors). The point is that using markers brings a priori knowledge to bear on the segmentation problem. Keep in mind that humans often aid segmentation and higher-level tasks in everyday vision by using a priori knowledge, one of the most familiar being the use of context. Thus, the fact that segmentation by watersheds offers a framework that can make effective use of this type of knowledge is a significant advantage of this method.

10.8 THE USE OF MOTION IN SEGMENTATION

Motion is a powerful cue used by humans and many animals to extract objects or regions of interest from a background of irrelevant detail. In imaging applications, motion arises from a relative displacement between the sensing system and the scene being viewed, such as in robotic applications, autonomous navigation, and dynamic scene analysis. In the following discussion we consider the use of motion in segmentation both spatially and in the frequency domain.

SPATIAL TECHNIQUES

In what follows, we will consider two approaches for detecting motion, working directly in the spatial domain. The key objective is to give you an idea of how to measure changes in digital images using some straightforward techniques.

A Basic Approach

One of the simplest approaches for detecting changes between two image frames f(x, y, t_i) and f(x, y, t_j), taken at times t_i and t_j, respectively, is to compare the two images pixel by pixel. One procedure for doing this is to form a difference image. Suppose that we have a reference image containing only stationary components. Comparing this image against a subsequent image of the same scene, but including one or more moving objects, results in the difference of the two images canceling the stationary elements, leaving only nonzero entries that correspond to the nonstationary image components.
A difference image of two images (of the same size) taken at times t_i and t_j may be defined as

d_ij(x, y) = 1 if |f(x, y, t_i) − f(x, y, t_j)| > T, and d_ij(x, y) = 0 otherwise    (10-114)

where T is a nonnegative threshold. Note that d_ij(x, y) has a value of 1 at spatial coordinates (x, y) only if the intensity difference between the two images is appreciably different at those coordinates, as determined by T. Note also that coordinates (x, y) in Eq. (10-114) span the dimensions of the two images, so the difference image is of the same size as the images in the sequence.
In the discussion that follows, all pixels in d_ij(x, y) that have value 1 are considered the result of object motion. This approach is applicable only if the two images are registered spatially, and if the illumination is relatively constant within the bounds established by T. In practice, 1-valued entries in d_ij(x, y) may arise as a result of noise also. Typically, these entries are isolated points in the difference image, and a simple approach to their removal is to form 4- or 8-connected regions of 1’s in image d_ij(x, y), then ignore any region that has less than a predetermined number of elements. Although it may result in ignoring small and/or slow-moving objects, this approach improves the chances that the remaining entries in the difference image actually are the result of motion, and not noise.
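A minimal implementation of Eq. (10-114), together with the connected-component size filtering just described, might look as follows. This is an illustrative sketch only; the function names, the threshold T = 25, and the minimum region size of 20 pixels are assumptions, not values from the text.

```python
import numpy as np
from scipy import ndimage as ndi

def difference_image(f_i, f_j, T):
    """Eq. (10-114): 1 where |f(x,y,ti) - f(x,y,tj)| > T, 0 elsewhere."""
    return (np.abs(f_i.astype(float) - f_j.astype(float)) > T).astype(np.uint8)

def remove_small_regions(d, min_size=20, connectivity=8):
    """Suppress isolated 1's that are likely noise: keep only 4- or 8-connected
    regions of 1's containing at least min_size pixels."""
    structure = np.ones((3, 3)) if connectivity == 8 else None  # None -> 4-connectivity
    labels, n = ndi.label(d, structure=structure)
    sizes = ndi.sum(d, labels, index=np.arange(1, n + 1))       # pixels per region
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = sizes >= min_size
    return keep[labels].astype(np.uint8)

# Typical use, with frame_i and frame_j two registered frames of the sequence:
#   d = remove_small_regions(difference_image(frame_i, frame_j, T=25))
```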


Although the method just described is simple, it is used frequently as the basis of imaging systems designed to detect changes in controlled environments, such as in surveillance of parking facilities, buildings, and similar fixed locales.

Accumulative Differences

Consider a sequence of image frames denoted by f(x, y, t_1), f(x, y, t_2), …, f(x, y, t_n), and let f(x, y, t_1) be the reference image. An accumulative difference image (ADI) is formed by comparing this reference image with every subsequent image in the sequence. A counter for each pixel location in the accumulative image is incremented every time a difference occurs at that pixel location between the reference and an image in the sequence. Thus, when the kth frame is being compared with the reference, the entry in a given pixel of the accumulative image gives the number of times the intensity at that position was different [as determined by T in Eq. (10-114)] from the corresponding pixel value in the reference image.
Assuming that the intensity values of the moving objects are greater than the background, we consider three types of ADIs. Let R(x, y) denote the reference image and, to simplify the notation, let k denote t_k so that f(x, y, k) = f(x, y, t_k). We assume that R(x, y) = f(x, y, 1). Then, for any k > 1, and keeping in mind that the values of the ADIs are counts, we define the following accumulative differences for all relevant values of (x, y):

A_k(x, y) = A_{k−1}(x, y) + 1 if |R(x, y) − f(x, y, k)| > T, and A_k(x, y) = A_{k−1}(x, y) otherwise    (10-115)

P_k(x, y) = P_{k−1}(x, y) + 1 if [R(x, y) − f(x, y, k)] > T, and P_k(x, y) = P_{k−1}(x, y) otherwise    (10-116)

and

N_k(x, y) = N_{k−1}(x, y) + 1 if [R(x, y) − f(x, y, k)] < −T, and N_k(x, y) = N_{k−1}(x, y) otherwise    (10-117)

where A_k(x, y), P_k(x, y), and N_k(x, y) are the absolute, positive, and negative ADIs, respectively, computed using the kth image in the sequence. All three ADIs start out with zero counts and are of the same size as the images in the sequence. The order of the inequalities and signs of the thresholds in Eqs. (10-116) and (10-117) are reversed if the intensity values of the background pixels are greater than the values of the moving objects.

FIGURE 10.62 ADIs of a rectangular object moving in a southeasterly direction. (a) Absolute ADI. (b) Positive ADI. (c) Negative ADI.

EXAMPLE 10.27 : Computation of the absolute, positive, and negative accumulative difference images.

Figure 10.62 shows the three ADIs displayed as intensity images for a rectangular object of dimension 75 × 50 pixels that is moving in a southeasterly direction at a speed of 5√2 pixels per frame. The images are of size 256 × 256 pixels. We note the following: (1) The nonzero area of the positive ADI is equal to the size of the moving object; (2) the location of the positive ADI corresponds to the location of the moving object in the reference frame; (3) the number of counts in the positive ADI stops increasing when the moving object is displaced completely with respect to the same object in the reference frame; (4) the absolute ADI contains the regions of the positive and negative ADIs; and (5) the direction and speed of the moving object can be determined from the entries in the absolute and negative ADIs.

Establishing a Reference Image

A key to the success of the techniques just discussed is having a reference image against which subsequent comparisons can be made. The difference between two images in a dynamic imaging problem has the tendency to cancel all stationary components, leaving only image elements that correspond to noise and to the moving objects.
Obtaining a reference image with only stationary elements is not always possible, and building a reference from a set of images containing one or more moving objects becomes necessary. This applies particularly to situations describing busy scenes or in cases where frequent updating is required. One procedure for generating a reference image is as follows. Consider the first image in a sequence to be the reference image. When a nonstationary component has moved completely out of its position in the reference frame, the corresponding background in the present frame can be duplicated in the location originally occupied by the object in the reference frame. When all moving objects have moved completely out of their original positions, a reference image containing only stationary components will have been created. Object displacement can be established by monitoring the changes in the positive ADI, as indicated earlier. The following example illustrates how to build a reference frame using the approach just described.
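The ADI recursions in Eqs. (10-115) through (10-117), which the reference-building procedure just described relies on monitoring, reduce to a few array comparisons per frame. The following sketch is illustrative only; the function name and the threshold value shown in the usage comment are assumptions.

```python
import numpy as np

def update_adis(R, frame, A, P, Nn, T):
    """One update step of the absolute (A), positive (P), and negative (Nn) ADIs of
    Eqs. (10-115)-(10-117) against the reference image R. The integer counter arrays
    are incremented in place (Nn is used instead of N to avoid clashing with the
    image-dimension symbol)."""
    diff = R.astype(float) - frame.astype(float)
    A += (np.abs(diff) > T)   # Eq. (10-115): count any appreciable change
    P += (diff > T)           # Eq. (10-116): reference brighter than current frame
    Nn += (diff < -T)         # Eq. (10-117): reference darker than current frame
    return A, P, Nn

# Typical use, assuming 'frames' is a list of same-size images and frames[0] is
# the reference:
#   R = frames[0]
#   A = np.zeros(R.shape, dtype=int); P = np.zeros_like(A); Nn = np.zeros_like(A)
#   for f in frames[1:]:
#       update_adis(R, f, A, P, Nn, T=25)
```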


EXAMPLE 10.28 : Building a reference image.

Figures 10.63(a) and (b) show two image frames of a traffic intersection. The first image is considered the reference, and the second depicts the same scene some time later. The objective is to remove the principal moving objects in the reference image in order to create a static image. Although there are other smaller moving objects, the principal moving feature is the automobile at the intersection moving from left to right. For illustrative purposes we focus on this object. By monitoring the changes in the positive ADI, it is possible to determine the initial position of a moving object, as explained above. Once the area occupied by this object is identified, the object can be removed from the image by subtraction. By looking at the frame in the sequence at which the positive ADI stopped changing, we can copy from this image the area previously occupied by the moving object in the initial frame. This area then is pasted onto the image from which the object was cut out, thus restoring the background of that area. If this is done for all moving objects, the result is a reference image with only static components against which we can compare subsequent frames for motion detection. The reference image resulting from removing the east-bound moving vehicle and restoring the background is shown in Fig. 10.63(c).

FIGURE 10.63 Building a static reference image. (a) and (b) Two frames in a sequence. (c) Eastbound automobile subtracted from (a), and the background restored from the corresponding area in (b). (Jain and Jain.)

FREQUENCY DOMAIN TECHNIQUES

In this section, we consider the problem of determining motion via a Fourier transform formulation. Consider a sequence f(x, y, t), t = 0, 1, 2, …, K − 1, of K digital image frames of size M × N pixels, generated by a stationary camera. We begin the development by assuming that all frames have a homogeneous background of zero intensity. The exception is a single, 1-pixel object of unit intensity that is moving with constant velocity. Suppose that for frame one (t = 0), the object is at location (x′, y′) and the image plane is projected onto the x-axis; that is, the pixel intensities are summed (for each row) across the columns in the image. This operation yields a 1-D array with M entries that are zero, except at x′, which is the x-coordinate of the single-point object. If we now multiply all the components of the 1-D array by the quantity exp[j2πa_1 xΔt] for x = 0, 1, 2, …, M − 1 and add the results, we obtain the single term exp[j2πa_1 x′Δt] because there is only one nonzero point in the array. In this notation, a_1 is a positive integer, and Δt is the time interval between frames.
Suppose that in frame two (t = 1), the object has moved to coordinates (x′ + 1, y′); that is, it has moved 1 pixel parallel to the x-axis. Then, repeating the projection procedure discussed in the previous paragraph yields the sum exp[j2πa_1(x′ + 1)Δt]. If the object continues to move 1 pixel location per frame then, at any integer instant of time, t, the result will be exp[j2πa_1(x′ + t)Δt], which, using Euler’s formula, may be expressed as

e^{j2πa_1(x′ + t)Δt} = cos[2πa_1(x′ + t)Δt] + j sin[2πa_1(x′ + t)Δt]    (10-118)

for t = 0, 1, 2, …, K − 1. In other words, this procedure yields a complex sinusoid with frequency a_1. If the object were moving V_1 pixels (in the x-direction) between frames, the sinusoid would have frequency V_1 a_1. Because t varies between 0 and K − 1 in integer increments, restricting a_1 to have integer values causes the discrete Fourier transform of the complex sinusoid to have two peaks—one located at frequency V_1 a_1 and the other at K − V_1 a_1. This latter peak is the result of symmetry in the discrete Fourier transform, as discussed in Section 4.6, and may be ignored. Thus a peak search in the Fourier spectrum would yield one peak with value V_1 a_1. Dividing this quantity by a_1 yields V_1, which is the velocity component in the x-direction, as the frame rate is assumed to be known. A similar analysis would yield V_2, the component of velocity in the y-direction.
A sequence of frames in which no motion takes place produces identical exponential terms, whose Fourier transform would consist of a single peak at a frequency of 0 (a single dc term). Therefore, because the operations discussed so far are linear, the general case involving one or more moving objects in an arbitrary static background would have a Fourier transform with a peak at dc corresponding to static image components, and peaks at locations proportional to the velocities of the objects.
These concepts may be summarized as follows. For a sequence of K digital images of size M × N pixels, the sum of the weighted projections onto the x-axis at any integer instant of time is

g_x(t, a_1) = Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} f(x, y, t) e^{j2πa_1 xΔt},    t = 0, 1, …, K − 1    (10-119)

Similarly, the sum of the projections onto the y-axis is

g_y(t, a_2) = Σ_{y=0}^{N−1} Σ_{x=0}^{M−1} f(x, y, t) e^{j2πa_2 yΔt},    t = 0, 1, …, K − 1    (10-120)

where, as noted earlier, a_1 and a_2 are positive integers.
The 1-D Fourier transforms of Eqs. (10-119) and (10-120), respectively, are

G_x(u_1, a_1) = Σ_{t=0}^{K−1} g_x(t, a_1) e^{−j2πu_1 t/K},    u_1 = 0, 1, …, K − 1    (10-121)

and


G_y(u_2, a_2) = Σ_{t=0}^{K−1} g_y(t, a_2) e^{−j2πu_2 t/K},    u_2 = 0, 1, …, K − 1    (10-122)

These transforms are computed using an FFT algorithm, as discussed in Section 4.11. The frequency-velocity relationship is

u_1 = a_1 V_1    (10-123)

and

u_2 = a_2 V_2    (10-124)

In the preceding formulation, the unit of velocity is in pixels per total frame time. For example, V_1 = 10 indicates motion of 10 pixels in K frames. For frames that are taken uniformly, the actual physical speed depends on the frame rate and the distance between pixels. Thus, if V_1 = 10, K = 30, the frame rate is two images per second, and the distance between pixels is 0.5 m, then the actual physical speed in the x-direction is

V_1 = (10 pixels)(0.5 m/pixel)(2 frames/s)/(30 frames) = 1/3 m/s

The sign of the x-component of the velocity is obtained by computing

S_1x = d²Re[g_x(t, a_1)]/dt² |_{t=n}    (10-125)

and

S_2x = d²Im[g_x(t, a_1)]/dt² |_{t=n}    (10-126)

Because g_x is sinusoidal, it can be shown (see Problem 10.53) that S_1x and S_2x will have the same sign at an arbitrary point in time, n, if the velocity component V_1 is positive. Conversely, opposite signs in S_1x and S_2x indicate a negative velocity component. If either S_1x or S_2x is zero, we consider the next closest point in time, t = n ± Δt. Similar comments apply to computing the sign of V_2.

EXAMPLE 10.29 : Detection of a small moving object via frequency-domain analysis.

Figures 10.64 through 10.66 illustrate the effectiveness of the approach just developed. Figure 10.64 shows one of a 32-frame sequence of LANDSAT images generated by adding white noise to a reference image. The sequence contains a superimposed target moving at 0.5 pixel per frame in the x-direction and 1 pixel per frame in the y-direction. The target, shown circled in Fig. 10.65, has a Gaussian intensity distribution spread over a small (9-pixel) area, and is not easily discernible by eye. Figure 10.66 shows the results of computing Eqs. (10-121) and (10-122) with a_1 = 6 and a_2 = 4, respectively. The peak at u_1 = 3 in Fig. 10.66(a) yields V_1 = 0.5 from Eq. (10-123). Similarly, the peak at u_2 = 4 in Fig. 10.66(b) yields V_2 = 1.0 from Eq. (10-124).

FIGURE 10.64 LANDSAT frame. (Cowart, Snyder, and Ruedger.)

FIGURE 10.65 Intensity plot of the image in Fig. 10.64, with the target circled. (Rajala, Riddle, and Snyder.)

Guidelines for selecting a_1 and a_2 can be explained with the aid of Fig. 10.66. For instance, suppose that we had used a_2 = 15 instead of a_2 = 4. In that case, the peaks in Fig. 10.66(b) would now be at u_2 = 15 and 17 because V_2 = 1.0. This would be a seriously aliased result. As discussed in Section 4.5, aliasing is caused by under-sampling (too few frames in the present discussion, as the range of u is determined by K). Because u = aV, one possibility is to select a as the integer closest to a = u_max/V_max, where u_max is the aliasing frequency limitation established by K, and V_max is the maximum expected object velocity.
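The velocity-estimation procedure of Eqs. (10-119) through (10-124) can be sketched in a few lines. The code below is an illustrative, assumption-based sketch (not the implementation behind Example 10.29): it assumes the frames are stacked in an array of shape (K, M, N), that Δt = 1/K so that the spectral peak location equals aV directly, and it returns only the magnitude of the velocity component; the sign would be obtained separately from Eqs. (10-125) and (10-126).

```python
import numpy as np

def velocity_component(frames, a, axis=0):
    """Estimate one velocity component (in pixels per frame, magnitude only) from a
    stack of frames of shape (K, M, N), using the projections of Eqs. (10-119)/(10-120)
    and a peak search in the DFT of Eqs. (10-121)/(10-122).
    'a' is the positive integer a1 (axis=0, x-component) or a2 (axis=1, y-component)."""
    K = frames.shape[0]
    dt = 1.0 / K                                   # assume Delta-t = 1/K

    # Project each frame onto the chosen axis, then form the weighted sum g(t, a).
    proj = frames.sum(axis=2 if axis == 0 else 1)  # (K, M) for x, (K, N) for y
    coords = np.arange(proj.shape[1])
    g = (proj * np.exp(1j * 2 * np.pi * a * coords * dt)).sum(axis=1)

    G = np.fft.fft(g)                              # Eq. (10-121) or (10-122)
    u = np.argmax(np.abs(G[1:K // 2 + 1])) + 1     # skip dc and the symmetric half
    return u / a                                   # Eqs. (10-123)/(10-124): V = u/a
```

For the sequence of Example 10.29 (K = 32, a_1 = 6), a peak at u_1 = 3 would give velocity_component(...) = 0.5 pixel per frame, matching the value quoted above.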

FIGURE 10.66 (a) Spectrum of Eq. (10-121) showing a peak at u_1 = 3. (b) Spectrum of Eq. (10-122) showing a peak at u_2 = 4. (Rajala, Riddle, and Snyder.)

Summary, References, and Further Reading

Because of its central role in autonomous image processing, segmentation is a topic covered in most books dealing with image processing, image analysis, and computer vision. The following books provide complementary and/or supplementary reading for our coverage of this topic: Umbaugh [2010]; Prince [2012]; Nixon and Aguado [2012]; Pratt [2014]; and Petrou and Petrou [2010].
Work dealing with the use of kernels to detect intensity discontinuities (see Section 10.2) has a long history. Numerous kernels have been proposed over the years: Roberts [1965]; Prewitt [1970]; and Kirsh [1971]. The Sobel operators are from [Sobel]; see also Danielsson and Seger [1990]. Our presentation of the zero-crossing properties of the Laplacian is based on Marr [1982]. The Canny edge detector discussed in Section 10.2 is due to Canny [1986]. The basic reference for the Hough transform is Hough [1962]. See Ballard [1981] for a generalization to arbitrary shapes.
Other approaches used to deal with the effects of illumination and reflectance on thresholding are illustrated by the work of Perez and Gonzalez [1987], Drew et al. [1999], and Toro and Funt [2007]. The optimum thresholding approach due to Otsu [1979] has gained considerable acceptance because it combines excellent performance with simplicity of implementation, requiring only estimation of image histograms. The basic idea of using preprocessing to improve thresholding dates back to an early paper by White and Rohrer [1983], which combined thresholding, the gradient, and the Laplacian in the solution of a difficult segmentation problem.
See Fu and Mui [1981] for an early survey on the topic of region-oriented segmentation. The work of Haddon and Boyce [1990] and of Pavlidis and Liow [1990] are among the earliest efforts to integrate region and boundary information for the purpose of segmentation. Region growing is still an active area of research in image processing, as exemplified by Liangjia et al. [2013]. The basic reference on the k-means algorithm presented in Section 10.5 goes way back several decades to an obscure 1957 Bell Labs report by Lloyd, who subsequently published in Lloyd [1982]. This algorithm was already being used in areas such as pattern recognition in the 1960s and ’70s (Tou and Gonzalez [1974]). The superpixel algorithm presented in Section 10.5 is from Achanta et al. [2012]. See their paper for a listing and comparison of other superpixel approaches. The material on graph cuts is based on the paper by Shi and Malik [2000]. See Hochbaum [2010] for an example of faster implementations.
Segmentation by watersheds was shown in Section 10.7 to be a powerful concept. Early references dealing with segmentation by watersheds are Serra [1988], and Beucher and Meyer [1992]. As indicated in our discussion in Section 10.7, one of the key issues with watersheds is the problem of over-segmentation. The papers by Bleau and Leon [2000] and by Gaetano et al. [2015] are illustrative of approaches for dealing with this problem.
The material in Section 10.8 dealing with accumulative differences is from Jain, R. [1981]. See also Jain, Kasturi, and Schunck [1995]. The material dealing with motion via Fourier techniques is from Rajala, Riddle, and Snyder [1983]. The books by Snyder and Qi [2004], and by Chakrabarti et al. [2015], provide additional reading on motion estimation. For details on the software aspects of many of the examples in this chapter, see Gonzalez, Woods, and Eddins [2009].

Problems

Solutions to the problems marked with an asterisk (*) are in the DIP4E Student Support Package (consult the book website: www.ImageProcessingPlace.com).

10.1 * In a Taylor series approximation, the remainder (also called the truncation error) consists of all the terms not used in the approximation. The first term in the remainder of a finite difference approximation is indicative of the error in the approximation. The higher the derivative order of that term is, the lower the error will be in the approximation. All three approximations to the first derivative given in Eqs. (10-4)-(10-6) are computed using the same number of sample points. However, the error of the central difference approximation is less than the other two. Show that this is true.
10.2 Do the following:
(a) * Show how Eq. (10-8) was obtained.
(b) Show how Eq. (10-9) was obtained.
10.3 A binary image contains straight lines oriented horizontally, vertically, at 45°, and at −45°. Give a set of 3 × 3 kernels that can be used to detect one-pixel breaks in these lines. Assume that the intensities of the lines and background are 1 and 0, respectively.
10.4 Propose a technique for detecting gaps of length ranging between 1 and K pixels in line segments of a binary image. Assume that the lines are one pixel thick. Base your technique on 8-neighbor connectivity analysis, rather than attempting to construct kernels for detecting the gaps.
10.5 * With reference to Fig. 10.6, what are the angles (measured with respect to the x-axis of the book axis convention in Fig. 2.19) of the horizontal and vertical lines to which the kernels in Figs. 10.6(a) and (c) are most responsive?
10.6 Refer to Fig. 10.7 in answering the following questions.
(a) * Some of the lines joining the pads and center element in Fig. 10.7(e) are single lines, while others are double lines. Explain why.
(b) Propose a method for eliminating the components in Fig. 10.7(f) that are not part of the line oriented at −45°.
(c)
10.7 With reference to the edge models in Fig. 10.8, answer the following without generating the gradient and angle images. Simply provide sketches of the profiles that show what you would expect the profiles of the magnitude and angle images to look like.
(a) * Suppose that we compute the gradient magnitude of each of these models using the Prewitt kernels in Fig. 10.14. Sketch what a horizontal profile through the center of each gradient image would look like.
(b) Sketch a horizontal profile for each corresponding angle image.
10.8 Consider a horizontal intensity profile through


the middle of a binary image that contains a ver- in Fig. 10.14, and in (a) above, and give iso- sketch the histogram of edge directions. Be 2
2s 2
G(r ) = e − r
tical step edge through the center of the image. tropic results only for horizontal and verti- precise in labeling the height of each compo-
Draw what the profile would look like after the cal edges, and for edges oriented at ± 45°, nent of the histogram.
where r 2 = x 2 + y 2 . The LoG is then derived by
image has been blurred by an averaging kernel respectively. (c) What would the Laplacian of this image look taking the second partial derivative with respect
of size n × n with coefficients equal to 1 n2 . For like based on using Eq. (10-14)? Show all to r: ( 2G(r ) = ∂ 2G(r ) ∂r 2 . Finally, x 2 + y 2 is sub-
10.12 The results obtained by a single pass through an
simplicity, assume that the image was scaled so relevant different pixel values in the Lapla- stituted for r 2 to get the final (incorrect) result:
image of some 2-D kernels can be achieved also
that its intensity levels are 0 on the left of the cian image.
by two passes using 1-D kernels. For example,
edge and 1 on its right. Also, assume that the size
of the kernel is much smaller than the image, so
the same result of using a 3 × 3 smoothing kernel 10.15 Suppose that an image f ( x, y) is convolved with (
( 2G ( x, y ) =  x 2 + y 2 − s 2
 ) s4 

with coefficients 1 9 can be obtained by a pass
that image border effects are not a concern near
the center of the image.
of the kernel [1 1 1] through an image, followed
a kernel of size n × n (with coefficients 1 n2 ) to
produce a smoothed image f ( x, y).  (
exp  − x 2 + y 2 ) 2s 2 

by a pass of the result with the kernel [1 1 1]T .
10.9 * Suppose that we had used the edge models in the The final result is then scaled by 1 9. Show that (a) * Derive an expression for edge strength Derive this result and explain the reason for the
following image, instead of the ramp in Fig. 10.10. the response of Sobel kernels (Fig. 10.14) can (edge magnitude) as a function of n. Assume difference between this expression and Eq. (10-29).
Sketch the gradient and Laplacian of each profile. be implemented similarly by one pass of the that n is odd and that the partial derivatives
10.19 Do the following:
differencing kernel [ −1 0 1] (or its vertical coun- are computed using Eqs. (10-19) and (10-20).
(a) * Derive Eq. (10-33).
terpart) followed by the smoothing kernel [1 2 1] (b) Show that the ratio of the maximum edge
(or its vertical counterpart). strength of the smoothed image to the maxi- (b) Let k = s1 s 2 denote the standard deviation
Image mum edge strength of the original image is ratio discussed in connection with the DoG
10.13 A popular variation of the compass kernels function, and express Eq. (10-33) in terms of
1 n. In other words, edge strength is inversely
shown in Fig. 10.15 is based on using coefficients k and s 2 .
proportional to the size of the smoothing
with values 0, 1, and −1.
kernel, as one would expect. 10.20 In the following, assume that G and f are discrete
(a) * Give the form of the eight compass kernels arrays of size n × n and M × N , respectively.
10.16 With reference to Eq. (10-29),
using these coefficients. As in Fig. 10.15, let N,
Profile of a
NW, . . . denote the direction of the edge that (a) * Show that the average value of the LoG (a) Show that the 2-D convolution of the Gauss-
horizontal line ian function G( x, y) in Eq. (10-27) with an
gives the strongest response. operator, ( 2G( x, y), is zero.
image f ( x, y) can be expressed as a 1-D con-
(b) Specify the gradient vector direction of the (b) Show that the average value of any image volution along the rows (columns) of f ( x, y),
edges detected by each kernel in (a). convolved with this operator also is zero.
10.10 Do the following: followed by a 1-D convolution along the col-
(Hint: Consider solving this problem in the
(a) * Show that the direction of steepest (maxi- 10.14 The rectangle in the following binary image is of umns (rows) of the result. (Hints: See Sec-
frequency domain, using the convolution
mum) ascent of a function f at point ( x, y) size m × n pixels. tion 3.4 regarding discrete convolution and
theorem and the fact that the average value
is given by the vector (f ( x, y) in Eq. (10-16), separability).
of a function is proportional to its Fourier
and that the rate of that descent is (f ( x, y) , transform evaluated at the origin.) (b) * Derive an expression for the computa-
defined in Eq. (10-17). tional advantage using the 1-D convolution
(c) Suppose that we: (1) used the kernel in Fig.
approach in (a) as opposed to implementing
(b) Show that the direction of steepest descent is 10.4(a) to approximate the Laplacian of a
the 2-D convolution directly. Assume that
given by the vector −(f ( x, y), and that the Gaussian, and (2) convolved this result with
G( x, y) is sampled to produce an array of size
rate of the steepest descent is (f ( x, y) . any image. What would be true in general of
n × n and that f ( x, y) is of size M × N . The
(c) Give the description of an image whose gra- the values of the resulting image? Explain.
computational advantage is the ratio of the
dient magnitude image would be the same, (Hint: Take a look at Problem 3.32.)
number of multiplications required for 2-D
whether we computed it using Eq. (10-17) or 10.17 Refer to Fig. 10.22(c). convolution to the number required for 1-D
(10-26). A constant image is not acceptable (a) Explain why the edges form closed contours. convolution. (Hint: Review the subsection
answer. on separable kernels in Section 3.4.)
(a) * What would the magnitude of the gradient (b) * Does the zero-crossing method for finding
10.11 Do the following. edge location always result in closed con- 10.21 Do the following.
of this image look like based on using the
(a) How would you modify the Sobel and approximation in Eq. (10-26)? Assume that tours? Explain. (a) Show that Steps 1 and 2 of the Marr-Hildreth
Prewitt kernels in Fig. 10.14 so that they give g x and g y are obtained using the Sobel ker- 10.18 One often finds in the literature a derivation of algorithm can be implemented using four
their strongest gradient response for edges nels. Show all relevant different pixel values the Laplacian of a Gaussian (LoG) that starts 1-D convolutions. (Hints: Refer to Problem
oriented at ± 45° ? in the gradient image. with the expression 10.20(a) and express the Laplacian operator
(b) * Show that the Sobel and Prewitt kernels (b) With reference to Eq. (10-18) and Fig. 10.12, as the sum of two partial derivatives, given


by Eqs. (10-10) and (10-11), and implement (e) Sketch the horizontal profiles of the angle assume that the images have been preprocessed greater than m2 , and that the initial T is between
each derivative using a 1-D kernel, as in images resulting from using the Canny edge so that they are binary and that all tracks are 1 the max and min image intensities. Give conditions
Problem 10.12.) detector. thick, except at the point of collision from which (in terms of the parameters of these curves) for the
they emanate. Your procedure should be able to following to be true when the algorithm converges:
(b) Derive an expression for the computational 10.24 In Example 10.9, we used a smoothing kernel of
differentiate between tracks that have the same
advantage of using the 1-D convolution size 19 × 19 to generate Fig. 10.26(c) and a kernel (a) * The threshold is equal to (m1 + m2 ) 2.
direction but different origins. (Hint: Base your
approach in (a) as opposed to implementing of size 13 × 13 to generate Fig. 10.26(d). What was
solution on the Hough transform.) (b) * The threshold is to the left of m2 .
the 2-D convolution directly. Assume that the rationale that led to choosing these values?
G( x, y) is sampled to produce an array of (Hint: Observe that both are Gaussian kernels, 10.29 * Restate the basic global thresholding algorithm (c) The threshold is in the interval given by the
size n × n and that f ( x, y) is of size M × N . and refer to the discussion of lowpass Gaussian in Section 10.3 so that it uses the histogram of an equation (m1 + m2 2) < T < m1 .
The computational advantage is the ratio of kernels in Section 3.5.) image instead of the image itself.
10.35 Do the following:
the number of multiplications required for 10.25 Refer to the Hough transform in Section 10.2. 10.30 * Prove that the basic global thresholding algo-
2-D convolution to the number required for (a) * Show how the first line in Eq. (10-60) fol-
rithm in Section 10.3 converges in a finite number
1-D convolution (see Problem 10.20). (a) Propose a general procedure for obtaining lows from Eqs. (10-55), (10-56), and (10-59).
of steps. (Hint: Use the histogram formulation
the normal representation of a line from its (b) Show how the second line in Eq. (10-60)
10.22 Do the following. from Problem 10.29.)
slope-intercept form, y = ax + b. follows from the first.
(a) * Formulate Step 1 and the gradient mag- 10.31 Give an explanation why the initial threshold in
(b) * Find the normal representation of the line 10.36 Show that a maximum value for Eq. (10-63)
nitude image computation in Step 2 of the the basic global thresholding algorithm in Sec-
y = −2 x + 1. always exists for k in the range 0 ≤ k ≤ L − 1.
Canny algorithm using 1-D instead of 2-D tion 10.3 must be between the minimum and
10.26 Refer to the Hough transform in Section 10.2. maximum values in the image. (Hint: Construct 10.37 * With reference to Eq. (10-65), advance an
convolutions.
(a) * Explain why the Hough mapping of the point an example that shows the algorithm failing for a argument that establishes that 0 ≤ h(k ) ≤ 1 for k
(b) What is the computational advantage of labeled 1 in Fig. 10.30(a) is a straight line in threshold value selected outside this range.) in the range 0 ≤ k ≤ L − 1, where the minimum
using the 1-D convolution approach as Fig. 10.30(b). 10.32 *Assume that the initial threshold in the basic is achievable only by images with constant inten-
opposed to implementing a 2-D convolu-
(b) * Is this the only point that would produce that global thresholding algorithm in Section 10.3 is sity, and the maximum occurs only for 2-valued
tion. Assume that the 2-D Gaussian filter in
result? Explain. selected as a value between the minimum and images with values 0 and (L − 1).
Step 1 is sampled into an array of size n × n
maximum intensity values in an image. Do you 10.38 Do the following:
and that the input image is of size M × N . (c) Explain the reflective adjacency relationship
think the final value of the threshold at conver-
Express the computational advantage as illustrated by, for example, the curve labeled (a) * Suppose that the intensities of a digital
gence depends on the specific initial value used?
the ratio of the number of multiplications Q in Fig. 10.30(b). image f ( x, y) are in the range [0, 1] and that
Explain. (You can use a simple image example to
required by each method. a threshold, T, successfully segmented the
10.27 Show that the number of operations required to support your conclusion.)
10.23 With reference to the three vertical edge models implement the accumulator-cell approach dis- image into objects and background. Show
cussed in Section 10.2 is linear in n, the number 10.33 You may assume in both of the following cases that the threshold T ′ = 1 − T will success-
and corresponding profiles in Fig. 10.8 provide
of non-background points in the image plane (i.e., that the initial threshold is in the open interval fully segment the negative of f ( x, y) into the
sketches of the profiles that would result from
the xy-plane). (0, L − 1). same regions. The term negative is used here
each of the following methods. You may sketch
the profiles manually. 10.28 An important application of image segmentation (a) * Show that if the histogram of an image is in the sense defined in Section 3.2.
is in processing images resulting from so-called uniform over all possible intensity levels, (b) The intensity transformation function in
(a) * Suppose that we compute the gradient
bubble chamber events. These images arise from the basic global thresholding algorithm con- (a) that maps an image into its negative is
magnitude of each of the three edge model
experiments in high-energy physics in which a verges to the average intensity of the image. a linear function with negative slope. State
images using the Sobel kernels. Sketch the
horizontal intensity profiles of the three beam of particles of known properties is directed (b) Show that if the histogram of an image is the conditions that an arbitrary intensity
resulting gradient images. onto a target of known nuclei. A typical event con- bimodal, with identical modes that are sym- transformation function must satisfy for the
sists of incoming tracks, any one of which, upon metric about their means, then the basic segmentability of the original image with
(b) Sketch the horizontal intensity profiles that
a collision, branches out into secondary tracks of global thresholding algorithm will converge respect to a threshold, T, to be preserved.
would result from using the 3 × 3 Laplacian
particles emanating from the point of collision. to the point halfway between the means of What would be the value of the threshold
kernel in Fig. 10.10.4(a).
Propose a segmentation approach for detecting the modes. after the intensity transformation?
(c) * Repeat (b) using only the first two steps of all tracks angled at any of the following six direc-
tions off the horizontal: ± 25°, ± 50°, and ± 75°. 10.34 Refer to the basic global thresholding algorithm in 10.39 The objects and background in the image below
the Marr-Hildreth edge detector.
The estimation error allowed in any of these six Section 10.3. Assume that in a given problem, the have a mean intensity of 170 and 60, respectively,
(d) Repeat (b) using the first two steps of the histogram is bimodal with modes that are Gauss- on a [0, 255] scale. The image is corrupted by
directions is ±5°. For a track to be valid it must
Canny edge detector. You may ignore the ian curves of the form A1 exp[ −(z − m1 )2 2s12 ] Gaussian noise with 0 mean and a standard devia-
be at least 100 pixels long and have no more than
angle images. and A2 exp[ −(z − m2 )2 2s 22 ]. Assume that m1 is tion of 10 intensity levels. Propose a thresholding
three gaps, each not exceeding 10 pixels. You may


method capable of a correct segmentation rate of 10.43 Consider the region of 1’s resulting from the 10.50 What would the negative ADI image shown the manufacturer. All that is known is that, dur-
90% or higher. (Recall that 99.7% of the area of segmentation of the sparse regions in the image in Fig. 10.62(c) look like if we tested against T ing the life of the lamps, A(t ) is always greater
a Gaussian curve lies in a ±3s interval about the of the Cygnus Loop in Example 10.21. Propose (instead of testing against −T) in Eq. (10-117)? than the negative component in the preceding
mean, where s is the standard deviation.) a technique for using this region as a mask to 10.51 Are the following statements true or false? Ex- equation because illumination cannot be nega-
isolate the three main components of the image: plain the reason for your answer in each. tive. It has been observed that Otsu’s algorithm
(1) background; (2) dense inner region; and (3) works well when the lamps are new, and their
sparse outer region. (a) * The nonzero entries in the absolute ADI pattern of illumination is nearly constant over the
continue to grow in dimension, provided entire image. However, segmentation perfor-
10.44 Let the pixels in the first row of a 3 × 3 image, like that the object is moving. mance deteriorates with time. Being experimental,
the one in Fig. 10.53(a), be labeled as 1, 2, 3, and
(b) The nonzero entries in the positive ADI the lamps are exceptionally expensive, so you are
the pixels in the second and third rows be labeled
always occupy the same area, regardless of employed as a consultant to help solve the prob-
as 4, 5, 6 and 7, 8, 9, respectively. Let the inten-
the motion undergone by the object. lem using digital image processing techniques to
sity of these pixels be [90, 80, 30; 70, 5, 20; 80 20
compensate for the changes in illumination, and
30] where, for example, the intensity of pixel 2 is (c) The nonzero entries in the negative ADI
thus extend the useful life of the lamps. You are
80 and of pixel 4 it is 70. Compute the weights continue to grow in dimension, provided
given flexibility to install any special markers or
for the edges for the graph in Fig. 10.53(c), using that the object is moving.
other visual cues in the viewing area of the imag-
the formula w(i, j ) = 30[1"A # I (ni ) − I (n j ) # + c B ] 10.52 Suppose that in Example 10.29 motion along the ing cameras. Propose a solution in sufficient detail
explained in the text in connection with that x-axis is set to zero. The object now moves only
10.40 Refer to the intensity ramp image in Fig. 10.34(b) that the engineering plant manager can under-
figure (we scaled the formula by 30 to make the along the y-axis at 1 pixel per frame for 32 frames
and the moving-average algorithm discussed in stand your approach. (Hint: Review the image
numerical results easier to interpret). Let c = 0 and then (instantaneously) reverses direction
Section 10.3. Assume that the image is of size model discussed in Section 2.3 and consider using
in this case. and moves in exactly the opposite direction for
500 × 700 pixels and that its minimum and maxi- one or more targets of known reflectivity.)
mum values are 0 and 1, where 0’s are contained 10.45 * Show how Eqs. (10-106) through (10-108) follow another 32 frames. What would Figs. 10.66(a)
10.55 The speed of a bullet in flight is to be estimated by
only in the first column. from Eq. (10-105). and (b) look like under these conditions?
using high-speed imaging techniques. The method
(a) * What would be the result of segmenting this 10.53 *Advance an argument that demonstrates that of choice involves the use of a CCD camera and
10.46 Demonstrate the validity of Eq. (10-102).
image with the moving-average algorithm when the signs of S1 x and S2 x in Eqs. (10-125) flash that exposes the scene for K seconds. The bul-
using b = 0 and an arbitrary value for n. 10.47 Refer to the discussion in Section 10.7. and (10-126) are the same, velocity component let is 2.5 cm long, 1 cm wide, and its range of speed
V1 is positive.
Explain what the segmented image would (a) * Show that the elements of Cn ( Mi ) and T [ n ] is 750 ± 250 m s. The camera optics produce an
look like. are never replaced during execution of the 10.54 An automated pharmaceutical plant uses image image in which the bullet occupies 10% of the
(b) Now reverse the direction of the ramp so watershed segmentation algorithm. processing to measure the shapes of medication horizontal resolution of a 256 × 256 digital image.
that its leftmost value is 1 and the rightmost tablets for the purpose of quality control. The (a) * Determine the maximum value of K that
(b) Show that the number of elements in sets
value is 0 and repeat (a). segmentation stage of the system is based on will guarantee that the blur from motion
Cn (Mi ) and T [ n] either increases or remains
Otsu’s method. The speed of the inspection lines does not exceed 1 pixel.
(c) Repeat (a) but with b = 1 and n = 2. the same as n increases.
is so high that a very high rate flash illumina-
(d) Repeat (a) but with b = 1 and n = 100. 10.48 You saw in Section 10.7 that the boundaries (b) Determine the minimum number of frames
tion is required to “stop” motion. When new, the
obtained using the watershed segmentation algo- per second that would have to be acquired
illumination lamps project a uniform pattern of
10.41 Propose a region-growing algorithm to segment rithm form closed loops (for example, see Figs. in order to guarantee that at least two com-
light. However, as the lamps age, the illumination
the image in Problem 10.39. 10.59 and 10.61). Advance an argument that estab- plete images of the bullet are obtained dur-
pattern deteriorates as a function of time and
10.42 * Segment the image shown by using the split and lishes whether or not closed boundaries always ing its path through the field of view of the
spatial coordinates according to the equation
merge procedure discussed in Section 10.4. Let result from application of this algorithm. camera.
2 )2 + ( y − N 2 )2 ]
Q ( Ri ) = TRUE if all pixels in Ri have the same 10.49 * Give a step-by-step implementation of the dam-
i( x, y) = A(t ) − t 2 e − [( x − M (c) * Propose a segmentation procedure for
intensity. Show the quadtree corresponding to building procedure for the one-dimensional inten- automatically extracting the bullet from a
your segmentation. sity cross section shown below. Show a drawing where ( M 2, N 2 ) is the center of the viewing sequence of frames.
of the cross section at each step, showing “water” area and t is time measured in increments of
N (d) Propose a method for automatically deter-
levels and dams constructed. months. The lamps are still experimental and
mining the speed of the bullet.
the behavior of A(t ) is not fully understood by
[Figure for Problem 10.49: one-dimensional intensity cross section, with intensity values on a 0-to-7 scale plotted against positions x = 1 through 15.]


11.1 BACKGROUND

11
11.1

Although there is no universally accepted, formal definition of what constitutes an


image feature, there is little argument that, intuitively, we generally think of a fea-
ture as a distinctive attribute or description of “something” we want to label or
differentiate. For our purposes, the key words here are label and differentiate. The
Feature Extraction “something” of interest in this chapter refers either to individual image objects, or
even to entire images or sets of images. Thus, we think of features as attributes that
are going to help us assign unique labels to objects in an image or, more gener-
ally, are going to be of value in differentiating between entire images or families of
images.
Well, but reflect; have we not several times There are two principal aspects of image feature extraction: feature detection, and
acknowledged that names rightly given are the feature description. That is, when we refer to feature extraction, we are referring
likenesses and images of the things which they name? to both detecting the features and then describing them. To be useful, the extrac-
Socrates tion process must encompass both. The terminology you are likely to encounter in
image processing and analysis to describe feature detection and description varies,
but a simple example will help clarify our use of these term. Suppose that we use
object corners as features for some image processing task. In this chapter, detection
refers to finding the corners in a region or image. Description, on the other hand,
refers to assigning quantitative (or sometimes qualitative) attributes to the detected
Preview features, such as corner orientation, and location with respect to other corners. In
After an image has been segmented into regions or their boundaries using methods such as those in other words, knowing that there are corners in an image has limited use without
Chapters 10 and 11, the resulting sets of segmented pixels usually have to be converted into a form suit- additional information that can help us differentiate between objects in an image,
able for further computer processing. Typically, the step after segmentation is feature extraction, which or between images, based on corners and their attributes.
consists of feature detection and feature description. Feature detection refers to finding the features Given that we want to use features for purposes of differentiation, the next ques-
in an image, region, or boundary. Feature description assigns quantitative attributes to the detected tion is: What are the important characteristics that these features must possess in
features. For example, we might detect corners in a region boundary, and describe those corners by the realm of digital image processing? You are already familiar with some of these
their orientation and location, both of which are quantitative attributes. Feature processing methods characteristics. In general, features should be independent of location, rotation, and
discussed in this chapter are subdivided into three principal categories, depending on whether they are scale. Other factors, such as independence of illumination levels and changes caused
applicable to boundaries, regions, or whole images. Some features are applicable to more than one cat- by the viewpoint between the imaging sensor(s) and the scene, also are impor-
egory. Feature descriptors should be as insensitive as possible to variations in parameters such as scale, tant. Whenever possible, preprocessing should be used to normalize input images
translation, rotation, illumination, and viewpoint. The descriptors discussed in this chapter are either before feature extraction. For example, in situations where changes in illumination
insensitive to, or can be normalized to compensate for, variations in one or more of these parameters. are severe enough to cause difficulties in feature detection, it would make sense to
preprocess an image to compensate for those changes. Histogram equalization or
specification come to mind as automatic techniques that we know are helpful in
Upon completion of this chapter, readers should: this regard. The idea is to use as much a priori information as possible to preprocess
Understand the meaning and applicability of Be familiar with the limitations of the various images in order to improve the chances of accurate feature extraction.
a broad class of features suitable for image feature extraction methods discussed. When used in the context of a feature, the word “independent” usually has one of
processing. two meanings: invariant or covariant. A feature descriptor is invariant with respect
Understand the principal steps used in the
to a set of transformations if its value remains unchanged after the application (to
Understand the concepts of feature vectors solution of feature extraction problems.
the entity being described) of any transformation from the family. A feature descrip-
and feature space, and how to relate them Be able to formulate feature extraction algo- tor is covariant with respect to a set of transformations if applying to the entity any
to the various descriptors developed in this rithms. transformation from the set produces the same result in the descriptor. For example,
chapter. Have a “feel” for the types of features that consider this set of affine transformations: {translation, reflection, rotation}, and sup-
See Table 2.3 regarding
Be skilled in the mathematical tools used in have a good chance of success in a given affine transformations. pose that we have an elliptical region to which we assign the feature descriptor area.
feature extraction algorithms. application. Clearly, applying any of these transformations to the region does not change its area.
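As a brief illustration of the preprocessing idea discussed above, the following minimal Python sketch applies histogram equalization to an input image before feature extraction. It assumes OpenCV is available; the file names are hypothetical.

import cv2

# Minimal illumination-normalization sketch (OpenCV assumed; file names are
# hypothetical). Histogram equalization flattens the gray-level distribution so
# that subsequent feature detection is less sensitive to illumination changes.
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # 8-bit, single-channel image
eq = cv2.equalizeHist(img)                             # equalize the gray-level histogram
cv2.imwrite("normalized.png", eq)

Histogram specification can be substituted when a particular target intensity distribution is known in advance.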


Therefore, area is an invariant feature descriptor with respect to the given family of recognition for automated inspection, searching for patterns (e.g., individual faces
transformations. However, if we add the affine transformation scaling to the fam- and/or fingerprints) in image databases, and autonomous applications, such as robot
ily, descriptor area ceases to be invariant with respect to the extended family. The and vehicle navigation. For these applications, numerical features usually are “pack-
descriptor is now covariant with respect to the family, because scaling the area of the aged” in the form of a feature vector, (i.e., a 1 × n or n × 1 matrix) whose elements are
region by any factor scales the value of the descriptor by the same factor. Similarly, the descriptors. An RGB image is one of the simplest examples. As you know from
the descriptor direction (of the principal axis of the region) is covariant because Chapter 6, each pixel of an RGB image can be expressed as 3-D vector,
rotating the region by any angle has the same effect on the value of the descriptor.
Most of the feature descriptors we use in this chapter are covariant in general, in  x1 
the sense that they may be invariant to some transformations of interest, but not to x =  x2 
others that may be equally as important. As you will see shortly, it is good practice to  x3 
normalize as many relevant invariances as possible out of covariances. For instance,
we can compensate for changes in direction of a region by computing its actual in which x1 is the intensity value of the red image at a point, and the other com-
direction and rotating the region so that its principal axis points in a predefined ponents are the intensity values of the green and blue images at the same point. If
direction. If we do this for every region detected in an image, rotation will cease to color is used as a feature, then a region in an RGB image would be represented as
be covariant. a set of feature vectors (points) in 3-D space. When n descriptors are used, feature
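The following minimal Python sketch (NumPy and SciPy assumed; the function name is ours) illustrates the normalization just described: the principal-axis direction of a binary region is estimated from its second-order central moments, and the region is then rotated so that every region ends up with a common axis direction.

import numpy as np
from scipy import ndimage

def normalize_orientation(region):
    # region: 2-D binary array with 1's marking the region pixels.
    r, c = np.nonzero(region)
    rc, cc = r.mean(), c.mean()                        # centroid
    mu20 = ((r - rc) ** 2).mean()                      # second-order central moments
    mu02 = ((c - cc) ** 2).mean()
    mu11 = ((r - rc) * (c - cc)).mean()
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)  # principal-axis angle (radians)
    # Rotating every region by the negative of its estimated angle gives all
    # regions a common principal-axis direction, removing the rotation covariance.
    return ndimage.rotate(region.astype(float), -np.degrees(theta),
                          reshape=True, order=0)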
Another major classification of features is local vs. global. You are likely to see vectors become n-dimensional, and the space containing them is referred to as an
many different attempts to classify features as belonging to one of these two catego- n-dimensional feature space. You may “visualize” a set of n-dimensional feature vec-
ries. What makes this difficult is that a feature may belong to both, depending on the tors as a “hypercloud” of points in n-dimensional Euclidean space.
application. For example, consider the descriptor area again, and suppose that we In this chapter, we group features into three principal categories: boundary,
are applying it to the task of inspecting the degree to which bottles moving past an region, and whole image features. This subdivision is not based on the applicabil-
imaging sensor on a production line are full of liquid. The sensor and its accompany- ity of the methods we are about to discuss; rather, it is based on the fact that some
ing software are capable of generating images of ten bottles at once, in which liquid categories make more sense than others when considered in the context of what is
in each bottle appears as a bright region, and the rest of the image appears as dark being described. For example, it is implied that when we refer to the “length of a
background. The area of a region in this fixed geometry is directly proportional to boundary” we are referring to the “length of the boundary of a region,” but it makes
the amount of liquid in a bottle and, if detected and measured reliably, area is the no sense to refer to the “length” of an image. It will become clear that many of the
only feature we need to solve the inspection problem. Each image has ten regions, so features we will be discussing are applicable to boundaries and regions, and some
we consider area to be a local feature, in the sense that it is applicable to individual apply to whole images as well.
elements (regions) of an image. If the problem were to detect the total amount (area)
of liquid in an image, we would now consider area to be a global descriptor. But the 11.2 BOUNDARY PREPROCESSING
story does not end there. Suppose that the liquid inspection task is redefined so that The segmentation techniques discussed in the previous two chapters yield raw data
it calculates the entire amount of liquid per day passing by the imaging station. We in the form of pixels along a boundary or pixels contained in a region. It is standard
no longer care about the area of individual regions per se. Our units now are images. practice to use schemes that compact the segmented data into representations that
If we know the total area in an image, and we know the number of images, calculat- facilitate the computation of descriptors. In this section, we discuss various bound-
ing the total amount of liquid in a day is trivial. Now the area of an entire image is a ary preprocessing approaches suitable for this purpose.
local feature, and the area of the total at the end of the day is global. Obviously, we
could redefine the task so that the area at the end of a day becomes a local feature BOUNDARY FOLLOWING (TRACING)
descriptor, and the area for all assembly lines becomes a global measure. And so on,
endlessly. In this chapter, we call a feature local if it applies to a member of a set,
You will find it helpful to Several of the algorithms discussed in this chapter require that the points in the
review the discussion in
and global if it applies to the entire set, where “member” and “set” are determined Sections 2.5 on neighbor- boundary of a region be ordered in a clockwise or counterclockwise direction. Con-
by the application.
hoods, adjacency and sequently, we begin our discussion by introducing a boundary-following algorithm
connectivity, and the
Features by themselves are seldom generated for human consumption, except in discussion in Section 9.6 whose output is an ordered sequence of points. We assume (1) that we are work-
applications such as interactive image processing, topics that are not in the main- dealing with connected ing with binary images in which object and background points are labeled 1 and 0,
components.
stream of this book. In fact, as you will see later, some feature extraction meth- respectively; and (2) that images are padded with a border of 0’s to eliminate the
ods generate tens, hundreds, or even thousands of descriptor values that would possibility of an object merging with the image border. For clarity, we limit the dis-
appear meaningless if examined visually. Instead, feature description typically is cussion to single regions. The approach is extended to multiple, disjoint regions by
used as a preprocessing step for higher-level tasks, such as image registration, object processing the regions individually.


a b c d e f
FIGURE 11.1 Illustration of the first few steps in the boundary-following algorithm. The point to be processed next is
labeled in bold, black; the points yet to be processed are gray; and the points found by the algorithm are shaded.
Squares without labels are considered background (0) values.

a b c
FIGURE 11.2 Examples of boundaries that can be processed by the boundary-following algorithm. (a) Closed boundary
with a branch. (b) Self-intersecting boundary. (c) Multiple boundaries (processed one at a time).

The following algorithm traces the boundary of a 1-valued region, R, in a binary a straightforward approach is to extract the holes (see Section 9.6) and treat them
image. as 1-valued regions on a background of 0’s. Applying the boundary-following algo-
rithm to these regions will yield the inner boundaries of the original region.
1. Let the starting point, b0 , be the uppermost-leftmost point† in the image that is We could have stated the algorithm just as easily based on following a boundary
labeled 1. Denote by c0 the west neighbor of b0 [see Fig. 11.1(b)]. Clearly, c0 is in the counterclockwise direction but you will find it easier to have just one algo-
always a background point. Examine the 8-neighbors of b0 , starting at c0 and rithm and then reverse the order of the result to obtain a sequence in the opposite
See Section 2.5 for the
definition of 4-neigh- proceeding in a clockwise direction. Let b1 denote the first neighbor encountered direction. We use both directions interchangeably (but consistently) in the following
bors, 8-neighbors, and whose value is 1, and let c1 be the (background) point immediately preceding b1 sections to help you become familiar with both approaches.
m-neighbors of a point,
in the sequence. Store the locations of b0 for use in Step 5.
2. Let b = b0 and c = c0 . CHAIN CODES
3. Let the 8-neighbors of b, starting at c and proceeding in a clockwise direction, Chain codes are used to represent a boundary by a connected sequence of straight-
be denoted by n1 , n2 , … , n8 . Find the first neighbor labeled 1 and denote it by nk . line segments of specified length and direction. We assume in this section that all
4. Let b = nk and c = nk –1 . curves are closed, simple curves (i.e., curves that are closed and not self intersecting).
5. Repeat Steps 3 and 4 until b = b0 . The sequence of b points found when the
algorithm stops is the set of ordered boundary points. Freeman Chain Codes
Typically, a chain code representation is based on 4- or 8-connectivity of the seg-
Note that c in Step 4 is always a background point because nk is the first 1-valued ments. The direction of each segment is coded by using a numbering scheme, as in Fig.
point found in the clockwise scan. This algorithm is referred to as the Moore bound- 11.3. A boundary code formed as a sequence of such directional numbers is referred
ary tracing algorithm after Edward F. Moore, a pioneer in cellular automata theory. to as a Freeman chain code.
Figure 11.1 illustrates the first few steps of the algorithm. It is easily verified (see Digital images usually are acquired and processed in a grid format with equal
Problem 11.1) that continuing with this procedure will yield the correct boundary, spacing in the x- and y-directions, so a chain code could be generated by following a
shown in Fig. 11.1(f), whose points are ordered in a clockwise sequence. The algo- boundary in, say, a clockwise direction and assigning a direction to the segments con-
rithm works equally well with more complex boundaries, such as the boundary with necting every pair of pixels. This level of detail generally is not used for two principal
an attached branch in Fig. 11.2(a) or the self-intersecting boundary in Fig. 11.2(b). reasons: (1) The resulting chain would be quite long and (2) any small disturbances
Multiple boundaries [Fig. 11.2(c)] are handled by processing one boundary at a time. along the boundary due to noise or imperfect segmentation would cause changes
If we start with a binary region instead of a boundary, the algorithm extracts the in the code that may not be related to the principal shape features of the boundary.
outer boundary of the region. Typically, the resulting boundary will be one pixel An approach used to address these problems is to resample the boundary by
thick, but not always [see Problem 11.1(b)]. If the objective is to find the boundaries selecting a larger grid spacing, as in Fig. 11.4(a). Then, as the boundary is traversed, a
of holes in a region (these are called the inner or interior boundaries of the region), boundary point is assigned to a node of the coarser grid, depending on the proximity

of the original boundary point to that node, as in Fig. 11.4(b). The resampled bound-
As you will see later in this chapter and in Problem 11.8, the uppermost-leftmost point in a 1-valued boundary
has the important property that a polygonal approximation to the boundary has a convex vertex at that location.
ary obtained in this way can be represented by a 4- or 8-code. Figure 11.4(c) shows
Also, the left and north neighbors of the point are guaranteed to be background points. These properties make the coarser boundary points represented by an 8-directional chain code. It is a simple
it a good “standard” point at which to start boundary-following algorithms. matter to convert from an 8-code to a 4-code and vice versa (see Problems 2.15, 9.27,
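The following minimal Python sketch (NumPy assumed; the names are ours) follows the five steps of the boundary-tracing algorithm given above for a single 1-valued region in a binary array padded with a border of 0's.

import numpy as np

# Clockwise 8-neighborhood offsets (row, col), starting from the west neighbor,
# for an image coordinate system with rows increasing downward.
OFFSETS = [(0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1)]

def trace_boundary(img):
    # Moore boundary tracing of a single 1-valued region in a 0-padded binary
    # array. Returns the boundary as an ordered (clockwise) list of (row, col).
    b0 = tuple(np.argwhere(img)[0])            # uppermost-leftmost foreground point
    c0 = (b0[0], b0[1] - 1)                    # its west neighbor (always background)
    boundary = [b0]
    b, c = b0, c0
    while True:
        # Position of c in the clockwise neighbor ring of b.
        start = OFFSETS.index((c[0] - b[0], c[1] - b[1]))
        for k in range(1, 9):                  # examine the 8 neighbors clockwise, after c
            off = OFFSETS[(start + k) % 8]
            n = (b[0] + off[0], b[1] + off[1])
            if img[n] == 1:                    # first 1-valued neighbor found
                prev_off = OFFSETS[(start + k - 1) % 8]
                c = (b[0] + prev_off[0], b[1] + prev_off[1])  # background point preceding it
                b = n
                break
        if b == b0:                            # stopping condition of Step 5
            return boundary
        boundary.append(b)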


a b 1 2 For instance, the first difference of the 4-directional chain code 10103322 is 3133030.
3 1 Size normalization can be achieved by altering the spacing of the resampling grid.
FIGURE 11.3
Direction The normalizations just discussed are exact only if the boundaries themselves
numbers for are invariant to rotation (again, in angles that are integer multiples of the directions
2 0 4 0
(a) 4-directional in Fig. 11.3) and scale change, which seldom is the case in practice. For instance,
chain code, and the same object digitized in two different orientations will have different bound-
(b) 8-directional 5 7 ary shapes in general, with the degree of dissimilarity being proportional to image
chain code.
3 6 resolution. This effect can be reduced by selecting chain elements that are long in
proportion to the distance between pixels in the digitized image, and/or by orienting
the resampling grid along the principal axes of the object to be coded, as discussed
and 9.29). For the same reason mentioned when discussing boundary tracing earlier in Section 11.3, or along its eigen axes, as discussed in Section 11.5.
in this section, we chose the starting point in Fig. 11.4(c) as the uppermost-leftmost
point of the boundary, which gives the chain code 0766…1212. As you might suspect, EXAMPLE 11.1 : Freeman chain code and some of its variations.
the spacing of the resampling grid is determined by the application in which the Figure 11.5(a) shows a 570 × 570-pixel, 8-bit gray-scale image of a circular stroke embedded in small,
chain code is used. randomly distributed specular fragments. The objective of this example is to obtain a Freeman chain
If the sampling grid used to obtain a connected digital curve is a uniform quad- code, the corresponding integer of minimum magnitude, and the first difference of the outer boundary
rilateral (see Fig. 2.19) all points of a Freeman code based on Fig. 11.3 are guaran- of the stroke. Because the object of interest is embedded in small fragments, extracting its boundary
teed to coincide with the points of the curve. The same is true if a digital curve is would result in a noisy curve that would not be descriptive of the general shape of the object. As you
subsampled using the same type of sampling grid, as in Fig. 11.4(b). This is because know, smoothing is a routine process when working with noisy boundaries. Figure 11.5(b) shows the
the samples of curves produced using such grids have the same arrangement as in original image smoothed using a box kernel of size 9 × 9 pixels (see Section 3.5 for a discussion of spa-
Fig. 11.3, so all points are reachable as we traverse a curve from one point to the next tial smoothing), and Fig. 11.5(c) is the result of thresholding this image with a global threshold obtained
to generate the code. using Otsu’s method. Note that the number of regions has been reduced to two (one of which is a dot),
The numerical value of a chain code depends on the starting point. However, the significantly simplifying the problem.
code can be normalized with respect to the starting point by a straightforward pro- Figure 11.5(d) is the outer boundary of the region in Fig. 11.5(c). Obtaining the chain code of this
cedure: We simply treat the chain code as a circular sequence of direction numbers boundary directly would result in a long sequence with small variations that are not representative
and redefine the starting point so that the resulting sequence of numbers forms an of the global shape of the boundary, so we resample it before obtaining its chain code. This reduces
integer of minimum magnitude. We can normalize also for rotation (in angles that insignificant variability. Figure 11.5(e) is the result of using a resampling grid with nodes 50 pixels apart
are integer multiples of the directions in Fig. 11.3) by using the first difference of the (approximately 10% of the image width) and Fig. 11.5(f) is the result of joining the sample points by
chain code instead of the code itself. This difference is obtained by counting the num- straight lines. This simpler approximation retained the principal features of the original boundary.
ber of direction changes (in a counterclockwise direction in Fig. 11.3) that separate The 8-directional Freeman chain code of the simplified boundary is
two adjacent elements of the code. If we treat the code as a circular sequence to nor-
malize it with respect to the starting point, then the first element of the difference is 00006066666666444444242222202202
computed by using the transition between the last and first components of the chain.
The starting point of the boundary is at coordinates (2, 5) in the subsampled grid (remember from
Fig. 2.19 that the origin of an image is at its top, left). This is the uppermost-leftmost point in Fig. 11.5(f).
The integer of minimum magnitude of the code happens in this case to be the same as the chain code:
00006066666666444444242222202202

The first difference of the code is

00062600000006000006260000620626

Using this code to represent the boundary results in a significant reduction in the amount of data
needed to store the boundary. In addition, working with code numbers offers a unified way to analyze
the shape of a boundary, as we discuss in Section 11.3. Finally, keep in mind that the subsampled bound-
ary can be recovered from any of the preceding codes.

a b c
FIGURE 11.4 (a) Digital boundary with resampling grid superimposed. (b) Result of resampling. (c) 8-directional
chain-coded boundary.
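The following minimal Python sketch (plain Python; the function names are ours) implements the two chain-code manipulations used in Example 11.1, the circular first difference and the normalization to the integer of minimum magnitude, and reproduces the sequences listed above.

def first_difference(code, directions=8):
    # Circular first difference: counterclockwise direction changes between
    # adjacent elements, with the last-to-first transition included.
    n = len(code)
    return [(code[(i + 1) % n] - code[i]) % directions for i in range(n)]

def min_magnitude(code):
    # Normalize for the starting point: the circular rotation of the code that
    # forms the integer of minimum magnitude.
    rotations = [code[i:] + code[:i] for i in range(len(code))]
    return min(rotations)

chain = [int(d) for d in "00006066666666444444242222202202"]
print("".join(map(str, first_difference(chain))))   # 00062600000006000006260000620626
print("".join(map(str, min_magnitude(chain))))      # same as the chain code in this case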


Figure 11.6 illustrates how an SCC is generated. The first step is to select the
length of the line segment to use in generating the code [see Fig. 11.6(b)]. Next, a
starting point (the origin) is specified (for an open curve, the logical starting point is
one of its end points). As Fig. 11.6(c) shows, once the origin has been selected, one
end of a line segment is placed at the origin and the other end of the segment is set
to coincide with the curve. This point becomes the starting point of the next line seg-
ment, and we repeat this procedure until the starting point (or end point in the case
of an open curve) is reached. As the figure illustrates, you can think of this process as
a sequence of identical circles (with radius equal to the length of the line segment)
traversing the curve. The intersections of the circles and the curve determine the
nodes of the straight-line approximation to the curve.
Once the intersections of the circles are known, we determine the slope changes
between contiguous line segments. Positive and zero slope changes are normalized
to the open half interval [0, 1), while negative slope changes are normalized to the
open interval (−1, 0). Not allowing slope changes of ±1 eliminates the implementa-
tion issues that result from having to deal with the fact that such changes result in
the same line segment with opposite directions.
The sequence of slope changes is the chain that defines the SCC approximation
to the original curve. For example, the code for the curve in Fig. 11.6(e) is 0.12, 0.20,
0.21, 0.11, −0.11, −0.12, −0.21, −0.22, −0.24, −0.28, −0.28, −0.31, −0.30. The accu-
racy of the slope changes defined in Fig. 11.6(d) is 10 −2 , resulting in an “alphabet”
of 199 possible symbols (slope changes). The accuracy can be changed, of course. For
instance, an accuracy of 10 −1 produces an alphabet of 19 symbols (see Problem 11.6).
a b c Unlike a Freeman code, there is no guarantee that the last point of the coded curve
d e f will coincide with the last point of the curve itself. However, shortening the line
FIGURE 11.5 (a) Noisy image of size 570 × 570 pixels. (b) Image smoothed with a 9 × 9 box kernel. (c) Smoothed
image, thresholded using Otsu’s method. (d) Longest outer boundary of (c). (e) Subsampled boundary (the points
are shown enlarged for clarity). (f) Connected points from (e).

Slope Chain Codes


Using Freeman chain codes generally requires resampling a boundary to smooth
small variations, a process that implies defining a grid and subsequently assigning
all boundary points to their closest neighbors in the grid. An alternative to this
approach is to use slope chain codes (SCCs) (Bribiesca [1992, 2013]). The SCC of a Line segment
2-D curve is obtained by placing straight-line segments of equal length around the
curve, with the end points of the segments touching the curve.
Obtaining an SCC requires calculating the slope changes between contiguous line
segments, and normalizing the changes to the continuous (open) interval (−1, 1).
This approach requires defining the length of the line segments, as opposed to Free-
man codes, which require defining a grid and assigning curve points to it—a much a b c d e
more elaborate procedure. Like Freeman codes, SCCs are independent of rotation,
FIGURE 11.6 (a) An open curve. (b) A straight-line segment. (c) Traversing the curve using circumferences to deter-
but a larger range of possible slope changes provides a more accurate representa- mine slope changes; the dot is the origin (starting point). (d) Range of slope changes in the open interval (−1, 1)
tion under rotation than the rotational independence of the Freeman codes, which is (the arrow in the center of the chart indicates direction of travel). There can be ten subintervals between the slope
limited to the eight directions in Fig. 11.3(b). As with Freeman codes, SCCs are inde- numbers shown.(e) Resulting coded curve showing its corresponding numerical sequence of slope changes. (Cour-
pendent of translation, and can be normalized for scale changes (see Problem 11.8). tesy of Professor Ernesto Bribiesca, IIMAS-UNAM, Mexico.)


length and/or increasing angle resolution often resolves the problem, because the
results of computations are rounded to the nearest integer (remember we work with
integer coordinates).
The inverse of an SCC is another chain of the same length, obtained by reversing
the order of the symbols and their signs. The mirror image of a chain is obtained by
starting at the origin and reversing the signs of the symbols. Finally, we point out
that the preceding discussion is directly applicable to closed curves. Curve following
would start at an arbitrary point (for example, the uppermost-leftmost point of the
curve) and proceed in a clockwise or counterclockwise direction, stopping when the
starting point is reached. We will illustrate a use of SCCs in Example 11.6.

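The following minimal Python sketch (NumPy assumed; the function names are ours) illustrates the idea behind SCCs for an ordered, densely sampled curve. Instead of intersecting circles with the curve exactly, it advances to the first sample at least one segment length away from the current node, an approximation adequate for illustration, and normalizes the slope changes between successive segments to the open interval (−1, 1). Summing their absolute values gives the tortuosity descriptor used in Example 11.6.

import numpy as np

def slope_chain_code(points, seg_len):
    # points: (N, 2) array of ordered curve samples. Returns the SCC-style list
    # of slope changes, each normalized to the open interval (-1, 1).
    points = np.asarray(points, dtype=float)
    nodes = [points[0]]
    i = 0
    while True:
        d = np.linalg.norm(points[i:] - nodes[-1], axis=1)
        nxt = np.nonzero(d >= seg_len)[0]
        if nxt.size == 0:
            break
        i += nxt[0]
        nodes.append(points[i])
    # Turn angles between consecutive straight-line segments, as a fraction of pi.
    # (The text excludes slope changes of exactly +/-1, i.e., full reversals.)
    angles = [np.arctan2(b[0] - a[0], b[1] - a[1]) for a, b in zip(nodes, nodes[1:])]
    changes = np.diff(angles)
    changes = (changes + np.pi) % (2 * np.pi) - np.pi     # wrap to (-pi, pi]
    return list(changes / np.pi)

def tortuosity(scc):
    # Sum of the absolute values of the slope changes (see Section 11.3).
    return float(np.sum(np.abs(scc)))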
BOUNDARY APPROXIMATIONS USING MINIMUM-PERIMETER


POLYGONS
A digital boundary can be approximated with arbitrary accuracy by a polygon. For a
For an open curve, the closed curve, the approximation becomes exact when the number of segments of the
number of segments a b c
of an exact polygonal
polygon is equal to the number of points in the boundary, so each pair of adjacent
FIGURE 11.7 (a) An object boundary. (b) Boundary enclosed by cells (shaded). (c) Minimum-perimeter polygon
approximation is equal points defines a segment of the polygon. The goal of a polygonal approximation obtained by allowing the boundary to shrink. The vertices of the polygon are created by the corners of the inner
to the number of points
minus 1.
is to capture the essence of the shape in a given boundary using the fewest pos- and outer walls of the gray region.
sible number of segments. Generally, this problem is not trivial, and can turn into
a time-consuming iterative search. However, approximation techniques of modest
complexity are well suited for image-processing tasks. Among these, one of the most Figure 11.8(a) shows this shape in dark gray. Suppose that we traverse the bound-
powerful is representing a boundary by a minimum-perimeter polygon (MPP), as ary of the dark gray region in a counterclockwise direction. Every turn encountered
defined in the following discussion. A convex vertex is the
in the traversal will be either a convex or a concave vertex (the angle of a vertex is
center point of a triplet defined as an interior angle of the boundary at that vertex). Convex and concave
Foundation of points that define an
angle in the range
vertices are shown, respectively, as white and blue dots in Fig. 11.8(b). Note that
0° < u < 180°. Similarly, these vertices are the vertices of the inner wall of the light-gray bounding region in
An intuitive approach for computing MPPs is to enclose a boundary [see Fig. 11.7(a)] angles of a concave Fig. 11.8(b), and that every concave (blue) vertex in the dark gray region has a corre-
by a set of concatenated cells, as in Fig. 11.7(b). Think of the boundary as a rubber vertex are in the range
180° < u < 360°. An sponding concave “mirror” vertex in the light gray wall, located diagonally opposite
band contained in the gray cells in Fig. 11.7(b). As it is allowed to shrink, the rubber angle of 180° defines a the vertex. Figure 11.8(c) shows the mirrors of all the concave vertices, with the MPP
band will be constrained by the vertices of the inner and outer walls of the region degenerate vertex (i.e.,
segment of a straight from Fig. 11.7(c) superimposed for reference. We see that the vertices of the MPP
of the gray cells. Ultimately, this shrinking produces the shape of a polygon of mini- line), which cannot be an coincide either with convex vertices in the inner wall (white dots) or with the mir-
mum perimeter (with respect to this geometrical arrangement) that circumscribes MPP-vertex.
rors of the concave vertices (blue dots) in the outer wall. Only convex vertices of the
the region enclosed by the cell strip, as in Fig. 11.7(c). Note in this figure that all the
inner wall and concave vertices of the outer wall can be vertices of the MPP. Thus,
vertices of the MPP coincide with corners of either the inner or the outer wall.
our algorithm needs to focus attention only on those vertices.
The size of the cells determines the accuracy of the polygonal approximation.
In the limit, if the size of each (square) cell corresponds to a pixel in the boundary,
MPP Algorithm
the maximum error in each cell between the boundary and the MPP approxima-
tion would be √2 d, where d is the minimum possible distance between pixels (i.e., The set of cells enclosing a digital boundary [e.g., the gray cells in Fig. 11.7(b)] is
the distance between pixels established by the resolution of the original sampled called a cellular complex. We assume the cellular complexes to be simply connected,
boundary). This error can be reduced by half by forcing each cell in the polygonal in the sense that the boundaries they enclose are not self-intersecting. Based on this
approximation to be centered on its corresponding pixel in the original boundary. assumption, and letting white (W) denote convex vertices, and blue (B) denote mir-
The objective is to use the largest possible cell size acceptable in a given application, rored concave vertices, we state the following observations:
thus producing MPPs with the fewest number of vertices. Our objective in this sec-
1. The MPP bounded by a simply connected cellular complex is not self-intersecting.
tion is to formulate a procedure for finding these MPP vertices.
The cellular approach just described reduces the shape of the object enclosed 2. Every convex vertex of the MPP is a W vertex, but not every W vertex of a bound-
by the original boundary, to the area circumscribed by the gray walls in Fig. 11.7(b). ary is a vertex of the MPP.


Direction of travel Then, it follows from matrix analysis that

det(A) > 0   if (a, b, c) is a counterclockwise sequence
det(A) = 0   if the points are collinear                                    (11-2)
det(A) < 0   if (a, b, c) is a clockwise sequence

where det(A) is the determinant of A. In terms of this equation, movement in a


counterclockwise or clockwise direction is with respect to a right-handed coordinate
system (see the footnote in the discussion of Fig. 2.19). For example, using the image
coordinate system from Fig. 2.19 (in which the origin is at the top left, the positive
x-axis extends vertically downward, and the positive y-axis extends horizontally to
the right), the sequence a = (3, 4), b = (2, 3), and c = (3, 2) is in the counterclockwise
direction. This would give det(A) > 0 when substituted into Eq. (11-2). It is conve-
nient when describing the algorithm to define
a b c
FIGURE 11.8 (a) Region (dark gray) resulting from enclosing the original boundary by cells (see Fig. 11.7). (b) Convex sgn(a, b, c) ≡ det(A) (11-3)
(white dots) and concave (blue dots) vertices obtained by following the boundary of the dark gray region in the
counterclockwise direction. (c) Concave vertices (blue dots) displaced to their diagonal mirror locations in the so that sgn(a, b, c) > 0 for a counterclockwise sequence, sgn(a, b, c) < 0 for a clock-
outer wall of the bounding region; the convex vertices are not changed. The MPP (solid boundary) is superimposed wise sequence, and sgn(a, b, c) = 0 when the points are collinear. Geometrically,
for reference.
sgn(a, b, c) > 0 indicates that point c lies on the positive side of pair (a, b) (i.e., c lies on
the positive side of the line passing through points a and b). Similarly, if sgn(a, b, c) < 0,
point c lies on the negative side of the line. Equations (11-2) and (11-3) give the same
result if the sequence (c, a, b) or (b, c, a) is used because the direction of travel in the
3. Every mirrored concave vertex of the MPP is a B vertex, but not every B vertex sequence is the same as for (a, b, c). However, the geometrical interpretation is differ-
of a boundary is a vertex of the MPP. ent. For example, sgn(c, a, b) > 0 indicates that point b lies on the positive side of the
4. All B vertices are on or outside the MPP, and all W vertices are on or inside the line through points c and a.
MPP. To prepare the data for the MPP algorithm, we form a list of triplets consisting
5. The uppermost-leftmost vertex in a sequence of vertices contained in a cellular of a vertex label (e.g., V0 , V1 , etc.); the coordinates of each vertex; and an additional
complex is always a W vertex of the MPP (see Problem 11.8). element denoting whether the vertex is W or B. It is important that the concave ver-
tices be mirrored, as in Fig. 11.8(c), that the vertices be in sequential order,† and that
These assertions can be proved formally (Sklansky et al. [1972], Sloboda et al. [1998], the first vertex be the uppermost-leftmost vertex, which we know from property 5
and Klette and Rosenfeld [2004]). However, their correctness is evident for our pur- is a W vertex of the MPP. Let V0 denote this vertex. We assume that the vertices are
poses (see Fig. 11.8), so we do not dwell on the proofs here. Unlike the angles of the arranged in the counterclockwise direction. The algorithm for finding MPPs uses
vertices of the dark gray region in Fig. 11.8, the angles sustained by the vertices of two “crawler” points: a white crawler (WC ) and a blue crawler (BC ). WC crawls along
the MPP are not necessarily multiples of 90°. the convex (W) vertices, and BC crawls along the concave (B) vertices. These two
In the discussion that follows, we will need to calculate the orientation of triplets crawler points, the last MPP vertex found, and the vertex being examined are all that
of points. Consider a triplet of points, ( a, b, c ) , and let the coordinates of these points is necessary to implement the algorithm.
be a = (ax , ay ), b = (bx , by ), and c = (cx , cy ). If we arrange these points as the rows of The algorithm starts by setting WC = BC = V0 (recall that V0 is an MPP-vertex).
the matrix Then, at any step in the algorithm, let VL denote the last MPP vertex found, and let
Vk denote the current vertex being examined. One of the following three conditions
can exist between VL , Vk , and the two crawler points:

          ax   ay   1
    A  =  bx   by   1                                                       (11-1)
          cx   cy   1


Vertices of a boundary can be ordered by tracking the boundary using the boundary-following algorithm
discussed earlier.


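The following minimal Python sketch (NumPy assumed) evaluates the orientation test of Eqs. (11-1) through (11-3); the sign of this value is what decides among the three conditions listed next.

import numpy as np

def sgn(a, b, c):
    # Determinant of the 3x3 matrix in Eq. (11-1): > 0 for a counterclockwise
    # triplet, < 0 for a clockwise triplet, 0 if the three points are collinear.
    A = np.array([[a[0], a[1], 1.0],
                  [b[0], b[1], 1.0],
                  [c[0], c[1], 1.0]])
    return np.linalg.det(A)

# The triplet used as an illustration in the text (image coordinate system):
print(sgn((3, 4), (2, 3), (3, 2)))   # positive, a counterclockwise sequence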
(a) Vk is on the positive side of the line through the pair of points (VL , WC ), in which The next vertex is V5 = (7, 1). Using the values from the previous step we obtain sgn(VL , WC , V5 ) = 9,
case sgn (VL , WC , Vk ) > 0. so condition (a) is satisfied. Therefore, we let VL = WC = (4, 1) (this is V4 ) and reinitialize:
(b) Vk is on the negative side of the line though pair (VL , WC ) or is collinear with BC = WC = VL = (4, 1). Note that once we knew that sgn(VL , WC , V5 ) > 0 we did not bother to compute
it; that is sgn (VL , WC , Vk ) ≤ 0. Simultaneously, Vk lies to the positive side of the the other sgn expression. Also, reinitialization means that we start fresh again by examining the next
line through (VL , BC ) or is collinear with it; that is, sgn (VL , BC , Vk ) ≥ 0. vertex following the newly found MPP vertex. In this case, that next vertex is V5 , so we visit it again.
With V5 = ( 7, 1) , and using the new values of VL , WC , and BC , it follows that sgn (VL , WC , V5 ) = 0 and
(c) Vk is on the negative side of the line though pair (VL , BC ) , in which case
sgn (VL , BC , V5 ) = 0, so condition (b) holds. Therefore, we let WC = V5 = ( 7, 1) because V5 is a W vertex.
sgn (VL , BC , Vk ) < 0.
The next vertex is V6 = ( 8, 2 ) and sgn (VL , WC , V6 ) = 3, so condition (a) holds. Thus, we let
If condition (a) holds, the next MPP vertex is WC , and we let VL = WC ; then we VL = WC = ( 7, 1) and reinitialize the algorithm by setting WC = BC = VL .
reinitialize the algorithm by setting WC = BC = VL , and start with the next vertex Because the algorithm was reinitialized at V5 , the next vertex is V6 = (8, 2) again. Using the results
after the newly changed VL . from the previous step gives us sgn(VL , WC , V6 ) = 0 and sgn(VL , BC , V6 ) = 0, so condition (b) holds this
If condition (b) holds, Vk becomes a candidate MPP vertex. In this case, we set time. Because V6 is B we let BC = V6 = (8, 2).
WC = Vk if Vk is convex (i.e., it is a W vertex); otherwise we set BC = Vk . We then Summarizing, we have found three vertices of the MPP up to this point: V1 = (1, 4), V4 = (4, 1), and
continue with the next vertex in the list. V5 = (7, 1). Continuing as above with the remaining vertices results in the MPP vertices in Fig. 11.8(c)
(see Problem 11.9). The mirrored B vertices at (2, 3), (3, 2), and on the lower-right side at (13, 10), are on
If condition (c) holds, the next MPP vertex is BC and we let VL = BC ; then we
the boundary of the MPP. However, they are collinear and thus are not considered vertices of the MPP.
reinitialize the algorithm by setting WC = BC = VL and start with the next vertex
Appropriately, the algorithm did not detect them as such.
after the newly changed VL .
The algorithm stops when it reaches the first vertex again, and thus has processed
all the vertices in the polygon. The VL vertices found by the algorithm are the ver- EXAMPLE 11.3 : Applying the MPP algorithm.
tices of the MPP. Klette and Rosenfeld [2004] have proved that this algorithm finds
Figure 11.9(a) is a 566 × 566 binary image of a maple leaf, and Fig. 11.9(b) is its 8-connected boundary.
all the MPP vertices of a polygon enclosed by a simply connected cellular complex. The sequence in Figs. 11.9(c) through (h) shows MMP representations of this boundary using square
cellular complex cells of sizes 2, 4, 6, 8, 16, and 32, respectively (the vertices in each figure were con-
EXAMPLE 11.2 : A numerical example showing the details of how the MPP algorithm works. nected with straight lines to form a closed boundary). The leaf has two major features: a stem and three
A simple example in which we can follow the algorithm step-by-step will help clarify the preceding con- main lobes. The stem begins to be lost for cell sizes greater than 4 × 4, as Fig. 11.9(e) shows. The three
cepts. Consider the vertices in Fig. 11.8(c). In our image coordinate system, the top-left point of the grid main lobes are preserved reasonably well, even for a cell size of 16 × 16, as Fig. 11.9(g) shows. However,
is at coordinates (0, 0). Assuming unit grid spacing, the first few (counterclockwise) vertices are: we see in Fig. 11.8(h) that by the time the cell size is increased to 32 × 32, this distinctive feature has
been nearly lost.
V0 (1, 4) W  V1 (2, 3) B  V2 (3, 3) W  V3 (3, 2) B  V4 (4, 1) W  V5 (7, 1) W  V6 (8, 2) B  V7 (9, 2) B The number of points in the original boundary [Fig. 11.9(b)] is 1900. The numbers of vertices in
Figs. 11.9(c) through (h) are 206, 127, 92, 66, 32, and 13, respectively. Figure 11.9(e), which has 127 ver-
where the triplets are separated by vertical lines, and the B vertices are mirrored, as required by the tices, retained all the major features of the original boundary while achieving a data reduction of over
algorithm. 90%. So here we see a significant advantage of MMPs for representing a boundary. Another important
The uppermost-leftmost vertex is always the first vertex of the MPP, so we start by letting VL and V0 advantage is that MPPs perform boundary smoothing. As explained in the previous section, this is a
be equal, VL = V0 = (1, 4), and initializing the other variables: WC = BC = VL = (1, 4 ) . usual requirement when representing a boundary by a chain code.
The next vertex is V1 = ( 2, 3) . In this case we have sgn (VL , WC , V1 ) = 0 and sgn (VL , BC , V1 ) = 0, so
condition (b) holds. Because V1 is a B (concave) vertex, we update the blue crawler: BC = V1 = ( 2, 3) . At
this stage, we have VL = (1, 4), WC = (1, 4), and BC = (2, 3). SIGNATURES
Next, we look at V2 = ( 3, 3) . In this case, sgn (VL , WC , V2 ) = 0, and sgn (VL , BC , V2 ) = 1, so condition (b) A signature is a 1-D functional representation of a 2-D boundary and may be gener-
holds. Because V2 is W, we update the white crawler: WC = (3, 3). ated in various ways. One of the simplest is to plot the distance from the centroid
The next vertex is V3 = ( 3, 2 ) . At this junction we have VL = (1, 4), WC = (3, 3), and BC = (2, 3). Then, to the boundary as a function of angle, as illustrated in Fig. 11.10. The basic idea of
sgn (VL , WC , V3 ) = −2 and sgn (VL , BC , V3 ) = 0, so condition (b) holds again. Because V3 is B, we let using signatures is to reduce the boundary representation to a 1-D function that
BC = V3 = (4, 3) and look at the next vertex. presumably is easier to describe than the original 2-D boundary.
The next vertex is V4 = ( 4, 1) . We are working with VL = (1, 4), WC = (3, 3), and BC = (3, 2). The values Based on the assumptions of uniformity in scaling with respect to both axes, and
of sgn are sgn(VL , WC , V4 ) = −3 and sgn(VL , BC , V4 ) = 0. So, condition (b) holds yet again, and we let that sampling is taken at equal intervals of u, changes in the size of a shape result
WC = V4 = (4, 1) because V4 is a W vertex. in changes in the amplitude values of the corresponding signature. One way to


a b
FIGURE 11.10 Distance-versus-angle signatures. In (a), r(u) is constant. In (b), the signature consists of repetitions of
the pattern r(u) = A sec u for 0 ≤ u ≤ π/4, and r(u) = A csc u for π/4 < u ≤ π/2.

tangent-angle values. Because a histogram is a measure of the concentration of val-


a b c d ues, the slope density function responds strongly to sections of the boundary with
e f g h
constant tangent angles (straight or nearly straight segments) and has deep valleys
FIGURE 11.9 (a) 566 × 566 binary image. (b) 8-connected boundary. (c) through (h), MMPs obtained using square cells in sections producing rapidly varying angles (corners or other sharp inflections).
of sizes 2, 4, 6, 8, 16, and 32, respectively (the vertices were joined by straight-line segments for display). The number
of boundary points in (b) is 1900. The numbers of vertices in (c) through (h) are 206, 127, 92, 66, 32, and 13, respec-
tively. Images (b) through (h) are shown as negatives to make the boundaries easier to see. EXAMPLE 11.4 : Signatures of two regions.
Figures 11.11(a) and (d) show two binary objects, and Figs. 11.11(b) and (e) are their boundaries. The
corresponding r(u) signatures in Figs. 11.11(c) and (f) range from 0° to 360° in increments of 1°. The
normalize for this is to scale all functions so that they always span the same range of number of prominent peaks in the signatures is sufficient to differentiate between the shapes of the two
values, e.g., [0, 1]. The main advantage of this method is simplicity, but it has the dis- objects.
advantage that scaling of the entire function depends on only two values: the mini-
mum and maximum. If the shapes are noisy, this can be a source of significant error
SKELETONS, MEDIAL AXES, AND DISTANCE TRANSFORMS
from object to object. A more rugged (but also more computationally intensive)
approach is to divide each sample by the variance of the signature, assuming that Like boundaries, skeletons are related to the shape of a region. Skeletons can be
the variance is not zero—as in the case of Fig. 11.10(a)—or so small that it creates computed from a boundary by filling the area enclosed by the boundary with fore-
computational difficulties. Using the variance yields a variable scaling factor that ground values, and treating the result as a binary region. In other words, a skeleton is
is inversely proportional to changes in size and works much as automatic volume computed using the coordinates of points in the entire region, including its boundary.
control does. Whatever the method used, the central idea is to remove dependency The idea is to reduce a region to a tree or graph by computing its skeleton. As we
on size while preserving the fundamental shape of the waveforms. explained in Section 9.5 (see Fig. 9.25), the skeleton of a region is the set of points in
Distance versus angle is not the only way to generate a signature. For example, the region that are equidistant from the border of the region.
another way is to traverse the boundary and, corresponding to each point on the As is true of thinning, The skeleton is obtained using one of two principal approaches: (1) by succes-
boundary, plot the angle between a line tangent to the boundary at that point and a the MAT is highly sively thinning the region (e.g., using morphological erosion) while preserving end
susceptible to boundary
reference line. The resulting signature, although quite different from the r(u) curves and internal region points and line connectivity (this is called topology-preserving thinning); or (2)
in Fig. 11.10, carries information about basic shape characteristics. For instance, irregularities, so smooth- by computing the medial axis of the region via an efficient implementation of the
ing and other preprocess-
horizontal segments in the curve correspond to straight lines along the boundary ing steps generally are medial axis transform (MAT) proposed by Blum [1967]. We discussed thinning in
because the tangent angle is constant there. A variation of this approach is to use required to obtain a Section 9.5. The MAT of a region R with border B is as follows: For each point p in
clean a binary image.
the so-called slope density function as a signature. This function is a histogram of R, we find its closest neighbor in B. If p has more than one such neighbor, it is said


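The following minimal Python sketch (NumPy assumed; the function name is ours) computes the distance-versus-angle signature described above, sampling r(u) at 1° increments and scaling the result to [0, 1], one of the normalizations discussed.

import numpy as np

def signature(boundary):
    # boundary: (N, 2) array of ordered (row, col) boundary points.
    # Returns 360 samples of the centroid-to-boundary distance versus angle.
    boundary = np.asarray(boundary, dtype=float)
    centroid = boundary.mean(axis=0)
    d = boundary - centroid
    r = np.hypot(d[:, 0], d[:, 1])                            # distance to centroid
    theta = np.mod(np.arctan2(d[:, 0], d[:, 1]), 2 * np.pi)   # angle of each point
    order = np.argsort(theta)
    bins = np.linspace(0, 2 * np.pi, 361)[:-1]                # 1-degree sample angles
    # Linearly interpolate the distance at each sample angle (circular in angle).
    sig = np.interp(bins, theta[order], r[order], period=2 * np.pi)
    rng = sig.max() - sig.min()
    return (sig - sig.min()) / rng if rng > 0 else sig        # scale to [0, 1]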
a b c a b c
d e f FIGURE 11.12
FIGURE 11.11 Medial axes
(a) and (d) Two (dashed) of three
binary regions, simple regions.
(b) and (e) their
external
boundaries, and
(c) and (f) their
corresponding r(u)
signatures. The
horizontal axes
in (c) and (f) cor-
respond to angles pixels to their nearest background (zero) pixels, which constitute the region bound-
from 0° to 360°, in ary. Thus, we compute the distance transform of the complement of the image, as
increments of 1°.
Figs. 11.13(c) and (d) illustrate. By comparing Figs. 11.13(d) and 11.12(a), we see
in the former that the MAT (skeleton) is equivalent to the ridge of the distance
transform [i.e., the ridge in the image in Fig. 11.13(d)]. This ridge is the set of local
maxima [shown bold in Fig. 11.13(d)]. Figures 11.13(e) and (f) show the same effect
on a larger (414 × 708) binary image.
Finding approaches for computing the distance transform efficiently has been a
topic of research for many years. Numerous approaches exist that can compute the
distance transform with linear time complexity, O(K ), for a binary image with K
pixels. For example, the algorithm by Maurer et al. [2003] not only can compute the
to belong to the medial axis of R. The concept of “closest” (and thus the resulting distance transform in O(K ), it can compute it in O(K P ) using P processors.
MAT) depends on the definition of a distance metric (see Section 2.5). Figure 11.12
shows some examples using the Euclidean distance. If the Euclidean distance is used,
the resulting skeleton is the same as what would be obtained by using the maximum a b 0 0 0 0 0 1.41 1 1 1 1.41
disks from Section 9.5. The skeleton of a region is defined as its medial axis. c d 0 1 1 1 0 1 0 0 0 1
e f 0 1 1 1 0 1 0 0 0 1
The MAT of a region has an intuitive interpretation based on the “prairie fire”
FIGURE 11.13 0 0 0 0 0 1.41 1 1 1 1.41
concept discussed in Section 11.3 (see Fig. 11.15). Consider an image region as a
(a) A small
prairie of uniform, dry grass, and suppose that a fire is lit simultaneously along all image and (b) its 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
the points on its border. All fire fronts will advance into the region at the same speed. distance 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0
The MAT of the region is the set of points reached by more than one fire front at transform. Note 0 1 1 1 1 1 1 1 0 0 1 2 2 2 2 2 1 0
the same time. that all 1-valued 0 1 1 1 1 1 1 1 0 0 1 2 3 3 3 2 1 0
In general, the MAT comes considerably closer than thinning to producing skel- pixels in (a) have 0 1 1 1 1 1 1 1 0 0 1 2 2 2 2 2 1 0
corresponding 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0
etons that “make sense.” However, computing the MAT of a region requires cal- 0’s in (b). (c) A
culating the distance from every interior point to every point on the border of the 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
small image, and
region—an impractical endeavor in most applications. Instead, the approach is to (d) the distance
obtain the skeleton equivalently from the distance transform, for which numerous transform of its
efficient algorithms exist. complement. (e) A
larger image, and
The distance transform of a region of foreground pixels in a background of zeros (f) the distance
is the distance from every pixel to the nearest nonzero valued pixel. Figure 11.13(a) transform of its
shows a small binary image, and Fig. 11.13(b) is its distance transform. Observe that complement. The
every 1-valued pixel has a distance transform value of 0 because its closest nonzero Euclidian distance
valued pixel is itself. For the purpose of finding skeletons equivalent to the MAT, was used through-
out.
we are interested in the distance from the pixels of a region of foreground (white)
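The following minimal Python sketch (SciPy assumed; the function name is ours) illustrates the distance-transform route to a skeleton. SciPy's distance_transform_edt returns, for every foreground pixel, the Euclidean distance to the nearest background pixel, which corresponds to the "distance transform of the complement" in the discussion above. The ridge is approximated here by simple 3 × 3 local maxima rather than by the fast-marching implementation referenced in Example 11.5.

import numpy as np
from scipy import ndimage

def skeleton_from_distance_transform(region):
    # region: 2-D binary array (1 = foreground). Returns a binary ridge image.
    dt = ndimage.distance_transform_edt(region)      # Fig. 11.13(d)-style values
    local_max = ndimage.maximum_filter(dt, size=3)    # max over each 3x3 neighborhood
    ridge = (dt > 0) & (dt == local_max)               # keep foreground local maxima
    return ridge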


a b SOME BASIC BOUNDARY DESCRIPTORS


c d The length of a boundary is one of its simplest descriptors. The number of pixels
FIGURE 11.14 along a boundary is an approximation of its length. For a chain-coded curve with
(a) Thresholded
unit spacing in both directions, the number of vertical and horizontal components
image of blood
vessels. plus 2 multiplied by the number of diagonal components gives its exact length. If
(b) Skeleton the boundary is represented by a polygonal curve, the length is equal to the sum of
obtained by the lengths of the polygonal segments.
thinning, shown The diameter of a boundary B is defined as
superimposed
on the image
(note the spurs).
(
diameter(B) = max  D pi , pj 
i, j
) (11-4)
(c) Result of 40
where D is a distance measure (see Section 2.5) and pi and pj are points on the
passes of spur
removal. boundary. The value of the diameter and the orientation of a line segment connect-
(d) Skeleton The major and minor
ing the two extreme points that comprise the diameter is called the major axis (or
obtained using the axes are used also as longest chord) of the boundary. That is, if the major axis is defined by points ( x1 , y1 )
distance regional descriptors. and ( x2 , y2 ), then the length and orientation of the major axis are given by
transform.
lengthm = [( x2 − x1 )^2 + ( y2 − y1 )^2]^(1/2)                             (11-5)

and

anglem = tan^(−1) [( y2 − y1 ) / ( x2 − x1 )]
The minor axis (also called the longest perpendicular chord) of a boundary is defined
as the line perpendicular to the major axis, and of such length that a box passing
EXAMPLE 11.5 : Skeletons obtained using thinning and pruning vs. the distance transform. through the outer four points of intersection of the boundary with the two axes com-
Figure 11.14(a) shows a segmented image of blood vessels, and Fig. 11.14(b) shows the skeleton obtained pletely encloses the boundary. The box just described is called the basic rectangle or
using morphological thinning. As we discussed in Chapter 9, thinning is characteristically accompanied bounding box, and the ratio of the major to the minor axis is called the eccentricity
by spurs, which certainly is the case here. Figure 11.14(c) shows the result of forty passes of spur removal. of the boundary. We give some examples of this descriptor in Section 11.4.
With the exception of the few small spurs visible on the bottom left of the image, pruning did a reason- The curvature of a boundary is defined as the rate of change of slope. In general,
able job of cleaning up the skeleton. One drawback of thinning is the loss of potentially important obtaining reliable measures of curvature at a point of a raw digital boundary is dif-
features. This was not the case here, except the pruned skeleton does not cover the full expanse of the ficult because these boundaries tend to be locally “ragged.” Smoothing can help, but
image. Figure 11.14(c) shows the skeleton obtained using distance transform computations based on fast a more rugged measure of curvature is to use the difference between the slopes of
marching (see Lee et al. [2005] and Shi and Karl [2008]). The way the algorithm we used implements adjacent boundary segments that have been represented as straight lines. Polygonal
branch generation handles ambiguities such as spurs automatically. approximations are well-suited for this approach [see Fig. 11.8(c)], in which case we
The result in Fig. 11.14(d) is slightly superior to the result in Fig. 11.14(c), but both skeletons certainly are concerned only with curvature at the vertices. As we traverse the polygon in the
capture the important features of the image in this case. A key advantage of the thinning approach clockwise direction, a vertex point p is said to be convex if the change in slope at p
is simplicity of implementation, which can be important in dedicated applications. Overall, distance- is nonnegative; otherwise, p is said to be concave. The description can be refined
transform formulations tend to produce skeletons less prone to discontinuities, but overcoming the further by using ranges for the changes of slope. For instance, p could be labeled as
computational burden of the distance transform results in implementations that are considerably more part of a nearly straight line segment if the absolute change of slope at that point is
complex than thinning.
We will discuss corners less than 10°, or it could be labeled as “corner-like” point if the absolute change is
in detail later in this
chapter. in the range 90°, ± 30°.
Descriptors based on changes of slope can be formulated easily by expressing a
11.3 BOUNDARY FEATURE DESCRIPTORS
11.3 boundary in the form of a slope chain code (SSC), as discussed earlier (see Fig. 11.6).
We begin our discussion of feature descriptors by considering several fundamental A particularly useful boundary descriptor that is easily implemented using SSCs is
approaches for describing region boundaries. tortuosity, a measure of the twists and turns of a curve. The tortuosity, t, of a curve


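The following minimal Python sketch (NumPy assumed; the function name is ours) computes the diameter of Eq. (11-4), the major-axis length and orientation of Eq. (11-5), and the eccentricity taken as the ratio of the major to the minor axis; the brute-force pairwise search is for illustration only.

import numpy as np

def basic_descriptors(boundary):
    # boundary: (N, 2) array of (x, y) boundary points.
    b = np.asarray(boundary, dtype=float)
    d = np.linalg.norm(b[:, None, :] - b[None, :, :], axis=2)   # all pairwise distances
    i, j = np.unravel_index(np.argmax(d), d.shape)
    diameter = d[i, j]                                           # Eq. (11-4), major-axis length
    (x1, y1), (x2, y2) = b[i], b[j]
    angle = np.arctan2(y2 - y1, x2 - x1)                         # major-axis orientation
    # Minor axis: extent of the boundary measured perpendicular to the major axis.
    u = np.array([np.cos(angle), np.sin(angle)])                 # unit vector along major axis
    v = np.array([-u[1], u[0]])                                  # perpendicular unit vector
    proj = (b - b[i]) @ v
    minor = proj.max() - proj.min()
    eccentricity = diameter / minor if minor > 0 else np.inf
    return diameter, np.degrees(angle), eccentricity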
represented by an SCC is defined as the sum of the absolute values of the chain ele- SHAPE NUMBERS
ments: The shape number of a Freeman chain-coded boundary, based on the 4-directional
n
t = ∑ ai (11-6) As explained code of Fig. 11.3(a), is defined as the first difference of smallest magnitude. The order,
Section 11.2, the first dif-
i =1
ference of smallest mag-
n, of a shape number is defined as the number of digits in its representation. More-
nitude makes a Freeman over, n is even for a closed boundary, and its value limits the number of possible
where n is the number of elements in the SCC, and ai are the values (slope changes) chain code independent different shapes. Figure 11.16 shows all the shapes of order 4, 6, and 8, along with
of the elements in the code. The next example illustrates one use of this descriptor of the starting point, and
is insensitive to rotation their chain-code representations, first differences, and corresponding shape numbers.
in increments of 90° if a Although the first difference of a 4-directional chain code is independent of rotation
4-directional code is used.
EXAMPLE 11.6 : Using slope chain codes to describe tortuosity. (in increments of 90°), the coded boundary in general depends on the orientation of
An important measures of blood vessel morphology is its tortuosity. This metric can assist in the computer- the grid. One way to normalize the grid orientation is by aligning the chain-code grid
aided diagnosis of Retinopathy of Prematurity (ROP), an eye disease that affects babies born prema- with the sides of the basic rectangle defined in the previous section.
turely (Bribiesca [2013]). ROP causes abnormal blood vessels to grow in the retina (see Section 2.1). This In practice, for a desired shape order, we find the rectangle of order n whose
growth can cause the retina to detach from the back of the eye, potentially leading to blindness. eccentricity (defined in Section 11.4) best approximates that of the basic rectangle,
Figure 11.15(a) shows an image of the retina (called a fundus image) from a newborn baby. Ophthal- and use this new rectangle to establish the grid size. For example, if n = 12, all the
mologists diagnose and make decisions about the initial treatment of ROP based on the appearance of rectangles of order 12 (that is, those whose perimeter length is 12) are of sizes
retinal blood vessels. Dilatation and increased tortuosity of the retinal vessels are signs of highly prob- 2 × 4, 3 × 3, and 1 × 5. If the eccentricity of the 2 × 4 rectangle best matches the
able ROP. Blood vessels denoted A, B, and C in Fig. 11.15 were selected to demonstrate the discrimi- eccentricity of the basic rectangle for a given boundary, we establish a 2 × 4 grid
native potential of SCCs for quantifying tortuosity (each vessel shown is a long, thin region, not a line centered on the basic rectangle and use the procedure outlined in Section 11.2 to
segment). obtain the Freeman chain code. The shape number follows from the first differ-
The border of each vessel was extracted and its length (number of pixels), P, was calculated. To make ence of this code. Although the order of the resulting shape number usually equals
SCC comparisons meaningful, the three boundaries were normalized so that each would have the same n because of the way the grid spacing was selected, boundaries with depressions
comparable to this spacing sometimes yield shape numbers of order greater than n.
number, m, of straight-line segments. The length, L, of the line segment was then computed as L = m P.
In this case, we specify a rectangle of order lower than n, and repeat the procedure
It follows that the number of elements of each SCC is m − 1. The tortuosity, t, of a curve represented by
until the resulting shape number is of order n. The order of a shape number starts
an SCC is defined as the sum of the absolute values of the chain elements, as noted in Eq. (11-6).
at 4 and is always even because we are working with 4-connectivity and require that
The table in Fig. 11.15(b) shows values of t for vessels A, B, and C based on 51 straight-line segments
boundaries be closed.
(as noted above, n = m − 1). The values of tortuosity are in agreement with our visual analysis of the
three vessels, showing B as being slightly “busier” than A, and C as having the fewest twists and turns. FIGURE 11.16 Order 4 Order 6
All shapes of
order 4, 6, and 8.
The directions are
from Fig. 11.3(a),
and the dot Chain code: 0 3 2 1 0 0 3 2 2 1
a b
indicates the
FIGURE 11.15 Difference: 3 3 3 3 3 0 3 3 0 3
starting point.
(a) Fundus image
from a prematurely Shape no.: 3 3 3 3 0 3 3 0 3 3
born baby with ROP.
(b) Tortuosity of Order 8
Curve n T
vessels A, B, and C.
(Courtesy of A 50 2.3770
Professor Ernesto
Bribiesca, IIMAS- B 50 2.5132
UNAM, Mexico.)
C 50 1.6285

Chain code: 0 0 3 3 2 2 1 1 0 3 0 3 2 2 1 1 0 0 0 3 2 2 2 1

Difference: 3 0 3 0 3 0 3 0 3 3 1 3 3 0 3 0 3 0 0 3 3 0 0 3

Shape no.: 0 3 0 3 0 3 0 3 0 3 0 3 3 1 3 3 0 0 3 3 0 0 3 3

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 833 6/16/2017 2:15:09 PM DIP4E_GLOBAL_Print_Ready.indb 834 6/16/2017 2:15:09 PM


11.3 Boundary Feature Descriptors 835 836 Chapter 11 Feature Extraction

FIGURE 11.18 jy
a b
c d A digital
boundary and its
FIGURE 11.17 representation
Steps in the as sequence of

Imaginary axis
generation of a complex numbers.
shape number. The points ( x0 , y0 )
and ( x1 , y1 ) are
(arbitrarily) the y0
first two points in y1
the sequence.

x
x 0 x1
Real axis

of the sequence was restated, the nature of the boundary itself was not changed.
Of course, this representation has one great advantage: It reduces a 2-D to a 1-D

0
1
description problem.
We know from Eq. (4-44) that the discrete Fourier transform (DFT) of s ( k ) is
3
2

Chain code: 0 0 0 0 3 0 0 3 2 2 3 2 2 2 1 2 1 1 K –1
a (u) = ∑ s ( k )e – j 2puk K (11-8)
Difference: 3 0 0 0 3 1 0 3 3 0 1 3 0 0 3 1 3 0 k=0

Shape no.: 0 0 0 3 1 0 3 3 0 1 3 0 0 3 1 3 0 3
for u = 0, 1, 2, …, K − 1. The complex coefficients a ( u ) are called the Fourier descrip-
tors of the boundary. The inverse Fourier transform of these coefficients restores
s ( k ) . That is, from Eq. (4-45),
EXAMPLE 11.7 : Computing shape numbers.
K –1
Suppose that n = 18 is specified for the boundary in Fig. 11.17(a). To obtain a shape number of this order 1
s (k ) = ∑ a (u )e j 2puk K (11-9)
we follow the steps just discussed. First, we find the basic rectangle, as shown in Fig. 11.17(b). Next we find K u=0
the closest rectangle of order 18. It is a 3 × 6 rectangle, requiring the subdivision of the basic rectangle
shown in Fig. 11.17(c). The chain-code directions are aligned with the resulting grid. The final step is to for k = 0, 1, 2,…, K − 1. We know from Chapter 4 that the inverse is identical to the
obtain the chain code and use its first difference to compute the shape number, as shown in Fig. 11.17(d). original input, provided that all the Fourier coefficients are used in Eq. (11-9). How-
ever, suppose that, instead of all the Fourier coefficients, only the first P coefficients
are used. This is equivalent to setting a ( u ) = 0 for u > P – 1 in Eq. (11-9). The result
FOURIER DESCRIPTORS is the following approximation to s ( k ) :
We use the “conven- Figure 11.18 shows a digital boundary in the xy-plane, consisting of K points. Starting P –1
tional” axis system here
at an arbitrary point ( x0 , y0 ) , coordinate pairs ( x0 , y0 ) , ( x1 , y1 ) , ( x2 , y2 ) ,…, ( xK −1 , yK −1 ) 1
for consistency with the sˆ ( k ) = ∑ a (u )e j 2puk K (11-10)
literature. However, the are encountered in traversing the boundary, say, in the counterclockwise direction. K u=0
These coordinates can be expressed in the form x ( k ) = xk and y ( k ) = yk . Using
same result is obtained
if we use the book for k = 0, 1, 2, …, K − 1. Although only P terms are used to obtain each component
image coordinate system this notation, the boundary itself can be represented as the sequence of coordinates of sˆ ( k ) , parameter k still ranges from 0 to K – 1. That is, the same number of points
s ( k ) =  x ( k ) , y ( k ) for k = 0, 1, 2, …, K − 1. Moreover, each coordinate pair can be
whose origin is at the
top left because both are exists in the approximate boundary, but not as many terms are used in the recon-
right-handed coordinate treated as a complex number so that struction of each point.
systems (see Fig. 2.19). In
the latter, the rows and
columns represent the
s ( k ) = x ( k ) + jy ( k ) (11-7) Deleting the high-frequency coefficients is the same as filtering the transform
real and imaginary parts with an ideal lowpass filter. You learned in Chapter 4 that the periodicity of the
of the complex number. for k = 0, 1, 2,…, K − 1. That is, the x-axis is treated as the real axis and the y-axis as DFT requires that we center the transform prior to filtering it by multiplying it by
the imaginary axis of a sequence of complex numbers. Although the interpretation (−1)x . Thus, we use this procedure when implementing Eq. (11-8), and use it again

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 835 6/16/2017 2:15:10 PM DIP4E_GLOBAL_Print_Ready.indb 836 6/16/2017 2:15:12 PM


11.3 Boundary Feature Descriptors 837 838 Chapter 11 Feature Extraction

to reverse the centering when computing the inverse in Eq. (11-10). Because of
symmetry considerations in the DFT, the number of points in the boundary and its
inverse must be even. This implies that the number of coefficients removed (set to 0)
before the inverse is computed must be even. Because the transform is centered, we
set to 0 half the number of coefficients on each end of the transform to preserve
symmetry. Of course, the DFT and its inverse are computed using an FFT algorithm.
Recall from discussions of the Fourier transform in Chapter 4 that high-frequency
components account for fine detail, and low-frequency components determine over-
all shape. Thus, the smaller we make P in Eq. (11-10), the more detail that will be lost
on the boundary, as the following example illustrates.

EXAMPLE 11.8 : Using Fourier descriptors.


Figure 11.19(a) shows the boundary of a human chromosome, consisting of 2868 points. The correspond-
ing 2868 Fourier descriptors were obtained using Eq. (11-8). The objective of this example is to examine
the effects of reconstructing the boundary using fewer Fourier descriptors. Figure 11.19(b) shows the
boundary reconstructed using one-half of the 2868 descriptors in Eq. (11-10). Observe that there is no
perceptible difference between this boundary and the original. Figures 11.19(c) through (h) show the
boundaries reconstructed with the number of Fourier descriptors being 10%, 5%, 2.5%, 1.25%, 0.63%
and 0.28% of 2868, respectively. When rounded to the nearest even integer, these percentages are equal
to 286, 144, 72, 36, 18, and 8 descriptors, respectively. The important point is that 18 descriptors, a mere
six-tenths of one percent of the original 2868 descriptors, were sufficient to retain the principal shape
features of the original boundary: four long protrusions and two deep bays. Figure 11.19(h), obtained
with 8 descriptors, is unacceptable because the principal features are lost. Further reductions to 4 and 2
descriptors would result in an ellipse and a circle, respectively (see Problem 11.18). a b c d
e f g h
FIGURE 11.19 (a) Boundary of a human chromosome (2868 points). (b)–(h) Boundaries reconstructed using 1434,
As the preceding example demonstrates, a few Fourier descriptors can be used 286, 144, 72, 36, 18, and 8 Fourier descriptors, respectively. These numbers are approximately 50%, 10%, 5%, 2.5%,
to capture the essence of a boundary. This property is valuable, because these coef- 1.25%, 0.63%, and 0.28% of 2868, respectively. Images (b)–(h) are shown as negatives to make the boundaries
ficients carry shape information. Thus, forming a feature vector from these coef- easier to see.
ficients can be used to differentiate between boundary shapes, as we will discuss in
Chapter 12.
We have stated several times that descriptors should be as insensitive as pos-
for u = 0, 1, 2, … , K − 1. Thus, rotation simply affects all coefficients equally by a
sible to translation, rotation, and scale changes. In cases where results depend on
multiplicative constant term e ju .
the order in which points are processed, an additional constraint is that descrip-
Table 11.1 summarizes the Fourier descriptors for a boundary sequence s ( k ) that
tors should be insensitive to the starting point. Fourier descriptors are not directly
undergoes rotation, translation, scaling, and changes in the starting point. The sym-
insensitive to these geometrical changes, but changes in these parameters can be
bol ∆ xy is defined as ∆ xy = ∆x + j ∆y, so the notation st ( k ) = s ( k ) + ∆ xy indicates
related to simple transformations on the descriptors. For example, consider rotation
redefining (translating) the sequence as
and recall from basic mathematical analysis that rotation of a point by an angle u
about the origin of the complex plane is accomplished by multiplying the point by
st ( k ) =  x ( k ) + ∆x  + j  y ( k ) + ∆y  (11-12)
e ju . Doing so to every point of s ( k ) rotates the entire sequence about the origin. The
rotated sequence is s ( k ) e ju , whose Fourier descriptors are Recall from Chapter 4
that the Fourier transform Note that translation has no effect on the descriptors, except for u = 0, which has the
value d(0). Finally, the expression sp ( k ) = s ( k – k0 ) means redefining the sequence
of a constant is an
K –1 impulse located at the
ar ( u ) = ∑ s ( k )e jue – j 2puk K origin. Recall also that as
an impulse δ(u) is zero
k= 0 (11-11) everywhere, except when
= a ( u ) e ju u = 0. sp (k ) = x ( k − k0 ) + jy ( k − k0 ) (11-13)

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 837 6/16/2017 2:15:12 PM DIP4E_GLOBAL_Print_Ready.indb 838 6/16/2017 2:15:14 PM


11.3 Boundary Feature Descriptors 839 840 Chapter 11 Feature Extraction

TABLE 11.1 An alternative approach is to normalize the area of g(r ) in Fig. 11.20 to unity and
Transformation Boundary Fourier Descriptor
Some basic
treat it as a histogram. In other words, g(ri ) is now treated as the probability of value
properties of Identity s (k ) a (u )
Fourier ri occurring. In this case, r is treated as the random variable and the moments are
descriptors. Rotation sr ( k ) = s ( k ) e ju ar ( u ) = a ( u ) e ju K –1
mn ( r ) = ∑ (ri – m)n g (ri ) (11-16)
Translation st ( k ) = s ( k ) + ∆ xy at ( u ) = a ( u ) + ∆ xy d ( u ) i=0
where
Scaling ss ( k ) = a s ( k ) as ( u ) = a a ( u ) K –1
m= ∑ ri g ( ri ) (11-17)
Starting point sp ( k ) = s ( k − k0 ) a p ( u ) = a ( u ) e – j 2 p k0 u K
i=0

In these equations, K is the number of points on the boundary, and mn (r ) is related


directly to the shape of signature g(r ). For example, the second moment m2 (r ) mea-
which changes the starting point of the sequence from k = 0 to k = k0 . The last entry sures the spread of the curve about the mean value of r, and the third moment m3 (r )
in Table 11.1 shows that a change in starting point affects all descriptors in a differ- measures its symmetry with respect to the mean.
ent (but known) way, in the sense that the term multiplying a ( u ) depends on u. Although moments are used frequently for characterizing signatures, they are not
the only descriptors used for this purpose. For instance, another approach is to com-
STATISTICAL MOMENTS pute the 1-D discrete Fourier transform of g(r ), obtain its spectrum, and use the first
few components as descriptors. The advantage of moments over other techniques is
Statistical moments of one variable are useful descriptors applicable to 1-D rendi-
We will discuss moments that their implementation is straightforward and they also carry a “physical” inter-
of two variable in tions of 2-D boundaries, such as signatures. To see how this can be accomplished,
pretation of signature (and by implication boundary) shape. The insensitivity of this
Section 11.4. consider Fig. 11.20 which shows the signature from Fig. 11.10(b) sampled, and treated
approach to rotation follows from the fact that signatures are independent of rota-
as an ordinary discrete function g(r ) of one variable, r.
tion, provided that the starting point is always the same along the boundary. Size
Suppose that we treat the amplitude of g as a discrete random variable z and
normalization can be achieved by scaling the values of g and r.
form an amplitude histogram p(zi ), i = 0, 1, 2, … , A − 1, where A is the number of
discrete amplitude increments in which we divide the amplitude scale. If p is normal-
11.4 REGION FEATURE DESCRIPTORS
ized so that the sum of its elements equals 1, then p(zi ) is an estimate of the prob- 11.4

ability of intensity value zi occurring. It then follows from Eq. (3-24) that the nth As we did with boundaries, we begin the discussion of regional features with some
moment of z about its mean is basic region descriptors.
A–1
mn ( z ) = ∑ ( zi – m)n p ( zi ) (11-14) SOME BASIC DESCRIPTORS
i=0
The major and minor axes of a region, as well as the idea of a bounding box, are
where
A–1
as defined earlier for boundaries. The area of a region is defined as the number of
m= ∑ zi p ( zi ) (11-15) pixels in the region. The perimeter of a region is the length of its boundary. When
i=0 area and perimeter are used as descriptors, they generally make sense only when
As you know, m is the mean (average) value of z, and m2 is its variance. Gener- they are normalized (Example 11.9 shows such a use). A more frequent use of these
ally, only the first few moments are required to differentiate between signatures of two descriptors is in measuring compactness of a region, defined as the perimeter
clearly distinct shapes. squared over the area:

p2
compactness = (11-18)
A
FIGURE 11.20 g(r)
Sampled This is a dimensionless measure that is 4p for a circle (its minimum value) and 16
signature from for a square.
Fig. 11.10(b) treat- Sometimes compactness
A similar dimensionless measure is circularity (also called roundness), defined as
ed as an ordinary, is defined as the inverse of
discrete function the circularity. Obviously, 4pA
of one variable. these two measures are circularity = (11-19)
r closely related. p2

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 839 6/16/2017 2:15:16 PM DIP4E_GLOBAL_Print_Ready.indb 840 6/16/2017 2:15:17 PM


11.4 Region Feature Descriptors 841 842 Chapter 11 Feature Extraction

The value of this descriptor is 1 for a circle (its maximum value) and p 4 for a square. where z k is a 2-D vector whose elements are the two spatial coordinates of a point in
Note that these two measures are independent of size, orientation, and translation. the region, K is the total number of points, and z is the mean vector:
Another measure based on a circle is the effective diameter: 1 K
z = ∑ zk (11-22)
A K k =1
de = 2 (11-20) The main diagonal elements of C are the variances of the coordinate values of the
p
points in the region, and the off-diagonal elements are their covariances.
This is the diameter of a circle having the same area, A, as the region being pro- An ellipse oriented in the same direction as the principal axes of the region can be
cessed. This measure is neither dimensionless nor independent of region size, but it interpreted as the intersection of a 2-D Gaussian function with the xy-plane. The ori-
is independent of orientation and translation. It can be normalized for size and made entation of the axes of the ellipse are also in the direction of the eigenvectors of the
dimensionless by dividing it by the largest diameter expected in a given application. covariance matrix, and the distances from the center of the ellipse to its intersection
In a manner analogous to the way we defined compactness and circularity relative with its major and minor axes is equal to the largest and smallest eigenvalues of the
to a circle, we define the eccentricity of a region relative to an ellipse as the eccentric- covariance matrix, respectively, as Fig. 11.21(b) shows. With reference to Fig. 11.21,
ity of an ellipse that has the same second central moments as the region. For 1-D, the and the equation of its eccentricity given above, we see by analogy that the eccen-
second central moment is the variance. For 2-D discrete data, we have to consider tricity of an ellipse with the same second moments as the region is given by
the variance of each variable as well as the covariance between them. These are
the components of the covariance matrix, which is estimated from samples using l22 − l12
eccentricity =
Eq. (11-21) below, with the samples in this case being 2-D vectors representing the l2 (11-23)
coordinates of the data.
Figure 11.21(a) shows an ellipse in standard form (i.e., an ellipse whose major and = 1 − (l1 l2 )2 l2 ≥ l1
minor axes are aligned with the coordinate axes). The eccentricity of such an ellipse For circular regions, l1 = l2 and the eccentricity is 0. For a line, l1 = 0 and the eccen-
is defined as the ratio of the distance between foci (2c in Fig. 11.21), and the length tricity is 1. Thus, values of this descriptor are in the range [0, 1].
of its major axis (2a), which gives the ratio 2c 2a = c a. That is,
EXAMPLE 11.9 : Comparison of feature descriptors.
c a 2 − b2
eccentricity = = = 1 − (b a)2 a ≥ b Figure 11.22 shows values of the preceding descriptors for several region shapes. None of the descriptors
a a
for the circle was exactly equal to its theoretical value because digitizing a circle introduces error into
However, we are interested in the eccentricity of an ellipse that has the same second the computation, and because we approximated the length of a boundary as its number of elements. The
central moments as a given 2-D region, which means that our ellipses can have arbi- eccentricity of the square did have an exact value of 0, because a square with no rotation aligns perfectly
trary orientations. Intuitively, what we are trying to do is approximate our 2-D data with the sampling grid. The other two descriptors for the square were close to their theoretical values also.
by an elliptical region whose axes are aligned with the principal axes of the data, as The values listed in the first two rows of Fig. 11.22 carry the same information. For example, we can
Fig. 11.21(b) illustrates. As you will learn in Section 11.5 (see Example 11.17), the tell that the star is less compact and less circular than the other shapes. Similarly, it is easy to tell from the
Often, you will the
constant in Eq. (11-21) principal axes are the eigenvectors of the covariance matrix, C, of the data, which is numbers listed that the teardrop region has by far the largest eccentricity, but it is harder to differentiate
written as 1/K instead of
given by: from the other shapes using compactness or circularity.
1/K−1. The latter is used
to obtain a statistically- K As we discussed in Section 11.1, feature descriptors typically are arranged in the form of feature
1
unbiased estimate of C.
For our purposes, either
C= ∑
K − 1 k =1
(z k − z )(z k − z )T (11-21) vectors for subsequent processing. Figure 11.23 shows the feature space for the descriptors in Fig. 11.22.
formulation is acceptable.

a b Major axis a b c d
FIGURE 11.21 Minor axis FIGURE 11.22 Descriptor
c 2 = a 2 − b2 e1
(a) An ellipse in Compactness,
standard form. l1 circularity, and
b Binary e 2 l2 e1 l1 and e 2 l2 are the
(b) An ellipse Focus Focus eccentricity of
region eigenvectors and Compactness 10.1701 42.2442 15.9836 13.2308
approximating a c some simple
corresponding eigenvalues
region in arbitrary binary regions. Circularity 1.2356 0.2975 0.7862 0.9478
Centroid of the covariance matrix of
orientation. a of region the coordinates of the region
Eccentricity 0.0411 0.0636 0 0.8117

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 841 6/16/2017 2:15:17 PM DIP4E_GLOBAL_Print_Ready.indb 842 6/16/2017 2:15:18 PM


11.4 Region Feature Descriptors 843 844 Chapter 11 Feature Extraction
FIGURE 11.23 x1 = compactness FIGURE 11.24
The descriptors Infrared images
60
from Fig. 11.22 in of the Americas at
3-D feature space. Star 50
night. (Courtesy
Each dot shown 40 of NOAA.)
corresponds to 30  x1 
a feature vector 20  
x =  x2 
whose compo- 10 Teardrop
0.2  x3 
nents are the three Square 0.2
0.4
0.4
corresponding 0.6 0.6
descriptors in Circle 0.8 0.8
1.0
Fig. 11.22. 1.2
1.0 1.2

x3 = eccentricity
x2 = circularity

Each point in feature space “encapsulates” the three descriptor values for each object. Although we can
tell from looking at the values of the descriptors in the figure that the circle and square are much more
similar than the other two objects, note how much clearer this fact is in feature space. You can imagine
that if we had multiple samples of those objects corrupted by noise, it could become difficult to differ-
entiate between vectors (points) corresponding to squares or circles. In contrast, the star and teardrop
Region no. Ratio of lights per
objects are far from each other, and from the circle and square, so they are less likely to be misclassified (from top) region to total lights
in the presence of noise. Feature space will play an important role in Chapter 12, when we discuss image 1 0.204
pattern classification. 2 0.640
3 0.049
4 0.107
EXAMPLE 11.10 : Using area features.
Even a simple descriptor such as normalized area can be quite useful for extracting information from
images. For instance, Fig. 11.24 shows a night-time satellite infrared image of the Americas. As we dis-
cussed in Section 1.3, such images provide a global inventory of human settlements. The imaging sensors
used to collect these images have the capability to detect visible and near infrared emissions, such as
lights, fires, and flares. The table alongside the images shows (by region from top to bottom) the ratio
of the area occupied by white (the lights) to the total light area in all four regions. A simple measure-
ment like this can give, for example, a relative estimate by region of electrical energy consumption. The
data can be refined by normalizing it with respect to land mass per region, with respect to population
numbers, and so on.

TOPOLOGICAL DESCRIPTORS affects distance, topological properties do not depend on the notion of distance or
Topology is the study of properties of a figure that are unaffected by any defor- any properties implicitly based on the concept of a distance measure.
mation, provided that there is no tearing or joining of the figure (sometimes these Another topological property useful for region description is the number of con-
are called rubber-sheet distortions). For example, Fig. 11.25(a) shows a region with See Sections 2.5 and 9.5 nected components of an image or region. Figure 11.25(b) shows a region with three
regarding connected
two holes. Obviously, a topological descriptor defined as the number of holes in components. connected components. The number of holes H and connected components C in a
the region will not be affected by a stretching or rotation transformation. However, figure can be used to define the Euler number, E :
the number of holes can change if the region is torn or folded. Because stretching
E=C − H (11-24)

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 843 6/16/2017 2:15:19 PM DIP4E_GLOBAL_Print_Ready.indb 844 6/16/2017 2:15:19 PM


11.4 Region Feature Descriptors 845 846 Chapter 11 Feature Extraction

a b FIGURE 11.27
A region Vertex
FIGURE 11.25
containing a
(a) A region with
polygonal
two holes. Face
network.
(b) A region with
three connected
components.

Hole
Edge
The Euler number is also a topological property. The regions shown in Fig. 11.26, for
example, have Euler numbers equal to 0 and −1, respectively, because the “A” has
one connected component and one hole, and the “B” has one connected component
but two holes.
Regions represented by straight-line segments (referred to as polygonal networks)
have a particularly simple interpretation in terms of the Euler number. Figure 11.27 11.28(b). The threshold was selected manually to illustrate the point that it would be impossible in this
shows a polygonal network. Classifying interior regions of such a network into faces case to segment the river by itself without other regions of the image also appearing in the thresholded
and holes is often important. Denoting the number of vertices by V, the number of result.
edges by Q, and the number of faces by F gives the following relationship, called the The image in Fig. 11.28(b) has 1591 connected components (obtained using 8-connectivity) and its
Euler formula: Euler number is 1552, from which we deduce that the number of holes is 39. Figure 11.28(c) shows the
connected component with the largest number of pixels (8479). This is the desired result, which we
V − Q + F =C − H (11-25) already know cannot be segmented by itself from the image using a threshold. Note how clean this result
which, in view of Eq. (11-24), can be expressed as is. The number of holes in the region defined by the connected component just found would give us the
number of land masses within the river. If we wanted to perform measurements, like the length of each
V − Q + F =E (11-26) branch of the river, we could use the skeleton of the connected component [Fig. 11.28(d)] to do so.
The network in Fig. 11.27 has seven vertices, eleven edges, two faces, one connected
region, and three holes; thus the Euler number is −2 (i.e., 7 − 11 + 2 = 1 − 3 = −2). TEXTURE
An important approach to region description is to quantify its texture content.
EXAMPLE 11.11 : Extracting and characterizing the largest feature in a segmented image. While no formal definition of texture exists, intuitively this descriptor provides mea-
Figure 11.28(a) shows a 512 × 512, 8-bit image of Washington, D.C. taken by a NASA LANDSAT satel- sures of properties such as smoothness, coarseness, and regularity (Fig. 11.29 shows
lite. This image is in the near infrared band (see Fig. 1.10 for details). Suppose that we want to segment some examples). In this section, we discuss statistical and spectral approaches for
the river using only this image (as opposed to using several multispectral images, which would simplify describing the texture of a region. Statistical approaches yield characterizations of
the task, as you will see later in this chapter). Because the river is a dark, uniform region relative to textures as smooth, coarse, grainy, and so on. Spectral techniques are based on prop-
the rest of the image, thresholding is an obvious approach to try. The result of thresholding the image erties of the Fourier spectrum and are used primarily to detect global periodicity in
with the highest possible threshold value before the river became a disconnected region is shown in Fig. an image by identifying high-energy, narrow peaks in its spectrum.

Statistical Approaches
a b One of the simplest approaches for describing texture is to use statistical moments
FIGURE 11.26 of the intensity histogram of an image or region. Let z be a random variable denot-
Regions with ing intensity, and let p ( zi ) , i = 0, 1, 2,…, L − 1, be the corresponding normalized his-
Euler numbers togram, where L is the number of distinct intensity levels. From Eq. (3-24), the nth
equal to 0 and −1, moment of z about the mean is
respectively.
L –1
mn ( z ) = ∑ ( zi − m)n p ( zi ) (11-27)
i=0

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 845 6/16/2017 2:15:20 PM DIP4E_GLOBAL_Print_Ready.indb 846 6/16/2017 2:15:20 PM


11.4 Region Feature Descriptors 847 848 Chapter 11 Feature Extraction

a b where m is the mean value of z (i.e., the average intensity of the image or region):
c d
L −1
FIGURE 11.28 m= ∑ zi p ( zi ) (11-28)
(a) Infrared image i=0
of the Washington,
D.C. area. Note from Eq. (11-27) that m0 = 1 and m1 = 0. The second moment [the variance
(b) Thresholded s 2 ( z) = m2 ( z)] is particularly important in texture description. It is a measure of
image. intensity contrast that can be used to establish descriptors of relative intensity
(c) The largest
connected compo- smoothness. For example, the measure
nent of (b).
1
(d) Skeleton of (c). R ( z) = 1 − (11-29)
(Original image 1 + s 2 ( z)
courtesy of NASA.)
is 0 for areas of constant intensity (the variance is zero there) and approaches 1 for
large values of s 2 ( z) . Because variance values tend to be large for grayscale images
with values, for example, in the range 0 to 255, it is a good idea to normalize the vari-
ance to the interval [0, 1] for use in Eq. (11-29). This is done simply by dividing s 2 ( z)
by ( L − 1) in Eq. (11-29). The standard deviation, s(z), also is used frequently as a
2

measure of texture because its values are more intuitive.


For texture, typically we As discussed in Section 2.6, the third moment, m3 ( z) , is a measure of the skewness
are interested in signs
and relative magnitudes.
of the histogram while the fourth moment, m4 ( z) , is a measure of its relative flat-
If, in addition, normaliza- ness. The fifth and higher moments are not so easily related to histogram shape, but
tion proves to be useful,
we normalize the third
they do provide further quantitative discrimination of texture content. Some useful
and fourth moments. additional texture measures based on histograms include a measure of uniformity,
defined as
L −1
U ( z) = ∑ p2 ( zi ) (11-30)
i=0

and a measure of average entropy that, as you may recall from information theory,
is defined as
L −1
a b c e ( z) = – ∑ p ( zi ) log 2 p ( zi ) (11-31)
FIGURE 11.29 i=0
The white squares
mark, from left Because values of p are in the range [0, 1] and their sum equals 1, the value of
to right, smooth, descriptor U is maximum for an image in which all intensity levels are equal (maxi-
coarse, and regular mally uniform), and decreases from there. Entropy is a measure of variability, and is
textures. These are 0 for a constant image.
optical microscope
images of a
superconductor, EXAMPLE 11.12 : Texture descriptors based on histograms.
human cholesterol,
and a microproces- Table 11.2 lists the values of the preceding descriptors for the three types of textures highlighted in
sor. (Courtesy of Fig. 11.29. The mean describes only the average intensity of each region and is useful only as a rough
Dr. Michael W. idea of intensity, not texture. The standard deviation is more informative; the numbers clearly show
Davidson, Florida that the first texture has significantly less variability in intensity (it is smoother) than the other two tex-
State University.)
tures. The coarse texture shows up clearly in this measure. As expected, the same comments hold for R,
because it measures essentially the same thing as the standard deviation. The third moment is useful for

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 847 6/16/2017 2:15:21 PM DIP4E_GLOBAL_Print_Ready.indb 848 6/16/2017 2:15:22 PM


11.4 Region Feature Descriptors 849 850 Chapter 11 Feature Extraction

TABLE 11.2 FIGURE 11.30 1 2 3 4 5 6 7 8


Statistical texture measures for the subimages in Fig. 11.29. How to construct 1 1 2 0 0 0 1 1 0
a co-occurrence
Standard 1 1 7 5 3 2 2 0 0 0 0 1 1 0 0
Texture Mean R (normalized) 3rd moment Uniformity Entropy matrix.
deviation
5 1 6 1 2 5 3 0 1 0 1 0 0 0 0
Smooth 82.64 11.79 0.002 − 0.105 0.026 5.434
8 8 6 8 1 2 4 0 0 1 0 1 0 0 0
Coarse 143.56 74.63 0.079 − 0.151 0.005 7.783
4 3 4 5 5 1 5 2 0 1 0 1 0 0 0
Regular 99.72 33.73 0.017 0.750 0.013 6.674
8 7 8 7 6 2 6 1 3 0 0 0 0 0 1
7 8 6 2 6 2 7 0 0 0 0 1 1 0 2
determining the symmetry of histograms and whether they are skewed to the left (negative value) or the 8 1 0 0 0 0 2 2 1
right (positive value). This gives an indication of whether the intensity levels are biased toward the dark
Image f Co-occurrence matrix G
or light side of the mean. In terms of texture, the information derived from the third moment is useful
only when variations between measurements are large. Looking at the measure of uniformity, we again
conclude that the first subimage is smoother (more uniform than the rest) and that the most random The number of possible intensity levels in the image determines the size
(lowest uniformity) corresponds to the coarse texture. Finally, we see that the entropy values increase as of matrix G. For an 8-bit image (256 possible intensity levels), G will be of size
uniformity decreases, leading us to the same conclusions regarding the texture of the regions as the uni- 256 × 256. This is not a problem when working with one matrix but, as you will see
formity measure did. The first subimage has the lowest variation in intensity levels, and the coarse image in as Example 11.13, co-occurrence matrices sometimes are used in sequences. One
the most. The regular texture is in between the two extremes with respect to both of these measures. approach for reducing computations is to quantize the intensities into a few bands
in order to keep the size of G manageable. For example, in the case of 256 intensities,
Measures of texture computed using only histograms carry no information regard- we can do this by letting the first 32 intensity levels equal to 1, the next 32 equal to 2,
ing spatial relationships between pixels, which is important when describing texture. and so on. This will result in a co-occurrence matrix of size 8 × 8.
One way to incorporate this type of information into the texture-analysis process is The total number, n, of pixel pairs that satisfy Q is equal to the sum of the ele-
to consider not only the distribution of intensities, but also the relative positions of ments of G (n = 30 in the example of Fig. 11.30). Then, the quantity
pixels in an image. gij
Let Q be an operator that defines the position of two pixels relative to each other, pij =
n
and consider an image, f , with L possible intensity levels. Let G be a matrix whose
element gij is the number of times that pixel pairs with intensities zi and zj occur in is an estimate of the probability that a pair of points satisfying Q will have values
Note that we are using
the intensity range [1, L]
image f in the position specified by Q, where 1 ≤ i, j ≤ L. A matrix formed in this ( )
zi , zj . These probabilities are in the range [0, 1] and their sum is 1:
instead of the usual manner is referred to as a graylevel (or intensity) co-occurrence matrix. When the K K
∑ ∑ pij = 1
[0, L− 1]. We do this so
that intensity values will meaning is clear, G is referred to simply as a co-occurrence matrix.
correspond with “tradi- Figure 11.30 shows an example of how to construct a co-occurrence matrix using i =1 j =1
tional” matrix indexing
(i.e., intensity value 1 L = 8 and a position operator Q defined as “one pixel immediately to the right” (i.e., where K is the row and column dimension of square matrix G.
corresponds to the first the neighbor of a pixel is defined as the pixel immediately to its right). The array on Because G depends on Q, the presence of intensity texture patterns can be detected
row and column indices
of G). the left is a small image and the array on the right is matrix G. We see that element by choosing an appropriate position operator and analyzing the elements of G. A set
(1, 1) of G is 1, because there is only one occurrence in f of a pixel valued 1 having of descriptors useful for characterizing the contents of G are listed in Table 11.3. The
a pixel valued 1 immediately to its right. Similarly, element (6, 2) of G is 3, because quantities used in the correlation descriptor (second row) are defined as follows:
there are three occurrences in f of a pixel with a value of 6 having a pixel valued 2
K K
immediately to its right. The other elements of G are similarly computed. If we had mr = ∑ i ∑ pij
defined Q as, say, “one pixel to the right and one pixel above,” then position (1, 1) i =1 j =1
in G would have been 0 because there are no instances in f of a 1 with another 1 in
the position specified by Q. On the other hand, positions (1, 3), (1, 5), and (1, 7) in K K

G would all be 1’s, because intensity value 1 occurs in f with neighbors valued 3, 5, mc = ∑ j ∑ pij
j =1 i =1
and 7 in the position specified by Q—one occurrence of each. As an exercise, you
should compute all the elements of G using this definition of Q. and

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 849 6/16/2017 2:15:24 PM DIP4E_GLOBAL_Print_Ready.indb 850 6/16/2017 2:15:25 PM


11.4 Region Feature Descriptors 851 852 Chapter 11 Feature Extraction

TABLE 11.3 Descriptor Explanation Formula K


Descriptors used mc = ∑ jP ( j )
for characterizing Maximum Measures the strongest response of G. max( pij ) j =1
i, j
co-occurrence probability The range of values is [0, 1].
matrices of size K
( r ) ( j – mc ) pij
Correlation A measure of how correlated a pixel is
sr2 = ∑ ( i – mr ) P ( i )
K K i – m 2
K × K . The term to its neighbor over the entire image. The ∑ ∑
pij is the ij-th term sr sc i =1
range of values is 1 to −1 corresponding i=1 j =1
of G divided by to perfect positive and perfect negative sr ≠ 0; sc ≠ 0 and
the sum of the correlations. This measure is not defined K
elements of G. sc2 = ∑ ( j – mc ) P ( j )
2
if either standard deviation is zero.
j =1
Contrast A measure of intensity contrast between a K K
∑ ∑ (i − j )
2
pixel and its neighbor over the entire image. pij
i =1 j =1 With reference to Eqs. (11-27), (11-28), and to their explanation, we see that mr is
The range of values is 0 (when G is constant)
in the form of a mean computed along rows of the normalized G, and mc is a mean
to (K − 1)2 .
computed along the columns. Similarly, sr and sc are in the form of standard devia-
Uniformity (also A measure of uniformity in the range [0, 1]. K K
tions (square roots of the variances) computed along rows and columns, respectively.
called Energy) Uniformity is 1 for a constant image. ∑ ∑ pij2
i =1 j = 1 Each of these terms is a scalar, independently of the size of G.
Homogeneity Measures the spatial closeness to the diagonal Keep in mind when studying Table 11.3 that “neighbors” are with respect to the
K K pij
of the distribution of elements in G. The range ∑ ∑ way in which Q is defined (i.e., neighbors do not necessarily have to be adjacent),
i =1 j =1 1 + i − j and also that the pij’s are nothing more than normalized counts of the number of
of values is [0, 1], with the maximum being
achieved when G is a diagonal matrix. times that pixels having intensities zi and zj occur in f relative to the position speci-
fied in Q. Thus, all we are doing here is trying to find patterns (texture) in those
Entropy Measures the randomness of the elements of K K
counts.
G. The entropy is 0 when all pij’s are 0, and is – ∑ ∑ pij log 2 pij
i =1 j =1
maximum when the pij’s are uniformly distrib-
uted. The maximum value is thus 2 log 2 K . EXAMPLE 11.13 : Using descriptors to characterize co-occurrence matrices.
Figures 11.31(a) through (c) show images consisting of random, horizontally periodic (sine), and mixed
pixel patterns, respectively. This example has two objectives: (1) to show values of the descriptors in
K K
Table 11.3 for the three co-occurrence matrices, G1 , G 2 ,and G 3 , corresponding (from top to bottom)
sr2 = ∑ ( i – mr ) ∑
2
pij to these images; and (2) to illustrate how sequences of co-occurrence matrices can be used to detect
i =1 j =1 texture patterns in an image.
K K Figure 11.32 shows co-occurrence matrices G1 , G 2 , and G 3 , displayed as images. These matrices were
sc2 = ∑ ( j – mc ) ∑
2
pij obtained using L = 256 and the position operator “one pixel immediately to the right.” The value at
j =1 i =1
coordinates (i, j ) in these images is the number of times that pixel pairs with intensities zi and zj occur
If we let in f in the position specified by Q, so it is not surprising that Fig. 11.32(a) is a random image, given the
nature of the image from which it was obtained.
K
Figure 11.32(b) is more interesting. The first obvious feature is the symmetry about the main diagonal.
P ( i ) = ∑ pij Because of the symmetry of the sine wave, the number of counts for a pair (zi , zj ) is the same as for the
j =1 pair (zj , zi ), which produces a symmetric co-occurrence matrix. The nonzero elements of G 2 are sparse
and because value differences between horizontally adjacent pixels in a horizontal sine wave are relatively
K small. It helps to remember in interpreting these concepts that a digitized sine wave is a staircase, with
P ( j ) = ∑ pij the height and width of each step depending on the frequency of the sine wave and the number of ampli-
i =1
tude levels used in representing the function.
then the preceding equations can be written as The structure of co-occurrence matrix G 3 in Fig. 11.32(c) is more complex. High count values are
K
grouped along the main diagonal also, but their distribution is more dense than for G 2 , a property
mr = ∑ iP ( i ) that is indicative of an image with a rich variation in intensity values, but few large jumps in intensity
i =1 between adjacent pixels. Examining Fig. 11.32(c), we see that there are large areas characterized by low

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 851 6/16/2017 2:15:27 PM DIP4E_GLOBAL_Print_Ready.indb 852 6/16/2017 2:15:29 PM


11.4 Region Feature Descriptors 853 854 Chapter 11 Feature Extraction

a TABLE 11.4
b Descriptors evaluated using the co-occurrence matrices displayed as images in Fig. 11.32.
c
Normalized
FIGURE 11.31 Maximum
Co-occurrence Correlation Contrast Uniformity Homogeneity Entropy
Images whose Probability
Matrix
pixels have
(a) random, G1 n1 0.00006 −0.0005 10838 0.00002 0.0366 15.75
(b) periodic, and
(c) mixed texture G 2 n2 0.01500 0.9650 00570 0.01230 0.0824 06.43
patterns. Each 0.06860 0.8798 01356 0.00480 0.2048 13.58
image is of size G 3 n3
263 × 800 pixels.

The contrast descriptor is highest for G1 and lowest for G 2 . Thus, we see that the less random an
image is, the lower its contrast tends to be. We can see the reason by studying the matrix displayed in
Fig. 11.32. The (i − j )2 terms are differences of integers for 1 ≤ i, j ≤ L, so they are the same for any G.
Therefore, the probabilities of the elements of the normalized co-occurrence matrices are the factors
that determine the value of contrast. Although G1 has the lowest maximum probability, the other two
matrices have many more zero or near-zero probabilities (the dark areas in Fig. 11.32). Because the sum
of the values of G n is 1, it is easy to see why the contrast descriptor tends to increase as a function of
variability in intensities. The high transitions in intensity occur at object boundaries, but these counts randomness.
are low with respect to the moderate intensity transitions over large areas, so they are obscured by the The remaining three descriptors are explained in a similar manner. Uniformity increases as a func-
ability of an image display to show high and low values simultaneously, as we discussed in Chapter 3. tion of the values of the probabilities squared. Thus, the less randomness there is in an image, the higher
The preceding observations are qualitative. To quantify the “content” of co-occurrence matrices, we the uniformity descriptor will be, as the fifth column in Table 11.4 shows. Homogeneity measures the
need descriptors such as those in Table 11.3. Table 11.4 shows values of these descriptors computed concentration of values of G with respect to the main diagonal. The values of the denominator term
for the three co-occurrence matrices in Fig. 11.32. To use these descriptors, the co-occurrence matrices (1 + i − j ) are the same for all three co-occurrence matrices, and they decrease as i and j become closer
must be normalized by dividing them by the sum of their elements, as discussed earlier. The entries in in value (i.e., closer to the main diagonal). Thus, the matrix with the highest values of probabilities
Table 11.4 agree with what one would expect from the images in Fig. 11.31 and their corresponding co- (numerator terms) near the main diagonal will have the highest value of homogeneity. As we discussed
occurrence matrices in Fig. 11.32. For example, consider the Maximum Probability column in Table 11.4. earlier, such a matrix will correspond to images with a “rich” gray-level content and areas of slowly vary-
The highest probability corresponds to the third co-occurrence matrix, which tells us that this matrix ing intensity values. The entries in the sixth column of Table 11.4 are consistent with this interpretation.
has the highest number of counts (largest number of pixel pairs occurring in the image relative to the The entries in the last column of the table are measures of randomness in co-occurrence matrices,
positions in Q) than the other two matrices. This agrees with our analysis of G 3 . The second column indi- which in turn translate into measures of randomness in the corresponding images. As expected, G1 had
cates that the highest correlation corresponds to G 2 , which in turn tells us that the intensities in the sec- the highest value because the image from which it was derived was totally random. The other two
ond image are highly correlated. The repetitiveness of the sinusoidal pattern in Fig. 11.31(b) indicates entries are self-explanatory. Note that the entropy measure for G1 is near the theoretical maximum of
why this is so. Note that the correlation for G1 is essentially zero, indicating that there is virtually no 16 (2 log 2 256 = 16). The image in Fig. 11.31(a) is composed of uniform noise, so each intensity level has
correlation between adjacent pixels, a characteristic of random images such as the image in Fig. 11.31(a). approximately an equal probability of occurrence, which is the condition stated in Table 11.3 for maxi-
mum entropy.
Thus far, we have dealt with single images and their co-occurrence matrices. Suppose that we want
a b c to “discover” (without looking at the images) if there are any sections in these images that contain
FIGURE 11.32 repetitive components (i.e., periodic textures). One way to accomplish this goal is to examine the cor-
256 × 256 relation descriptor for sequences of co-occurrence matrices, derived from these images by increasing
co-occurrence
matrices G1 , G 2 , the distance between neighbors. As mentioned earlier, it is customary when working with sequences of
and G 3 , co-occurrence matrices to quantize the number of intensities in order to reduce matrix size and corre-
corresponding sponding computational load. The following results were obtained using L = 8.
from left to right Figure 11.33 shows plots of the correlation descriptors as a function of horizontal “offset” (i.e., hori-
to the images in zontal distance between neighbors) from 1 (for adjacent pixels) to 50. Figure 11.33(a) shows that all
Fig. 11.31.
correlation values are near 0, indicating that no such patterns were found in the random image. The

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 853 6/16/2017 2:15:32 PM DIP4E_GLOBAL_Print_Ready.indb 854 6/16/2017 2:15:34 PM


11.4 Region Feature Descriptors 855 856 Chapter 11 Feature Extraction

1 purpose of analysis, every periodic pattern is associated with only one peak in the
spectrum, rather than two.
0.5 Detection and interpretation of the spectrum features just mentioned often
Correlation

are simplified by expressing the spectrum in polar coordinates to yield a function


0 S ( r, u ) , where S is the spectrum function, and r and u are the variables in this coor-
dinate system. For each direction u, S ( r, u ) may be considered a 1-D function Su ( r ) .
%0.5 Similarly, for each frequency r, Sr ( u ) is a 1-D function. Analyzing Su ( r ) for a fixed
value of u yields the behavior of the spectrum (e.g., the presence of peaks) along a
%1 radial direction from the origin, whereas analyzing Sr ( u ) for a fixed value of r yields
1 10 20 30 40 50 1 10 20 30 40 50 1 10 20 30 40 50
the behavior along a circle centered on the origin.
Horizontal Offset Horizontal Offset Horizontal Offset
A more global description is obtained by integrating (summing for discrete vari-
a b c
ables) these functions:
FIGURE 11.33 Values of the correlation descriptor as a function of offset (distance between “adjacent” pixels) corre- p
sponding to the (a) noisy, (b) sinusoidal, and (c) circuit board images in Fig. 11.31. S (r ) = ∑ Su (r ) (11-32)
u=0
and
R0
shape of the correlation in Fig. 11.33(b) is a clear indication that the input image is sinusoidal in the hori- S ( u ) = ∑ Sr ( u ) (11-33)
zontal direction. Note that the correlation function starts at a high value, then decreases as the distance r =1
between neighbors increases, and then repeats itself.
Figure 11.33(c) shows that the correlation descriptor associated with the circuit board image where R0 is the radius of a circle centered at the origin.
decreases initially, but has a strong peak for an offset distance of 16 pixels. Analysis of the image in Fig. The results of Eqs. (11-32) and (11-33) constitute a pair of values  S ( r ) , S ( u ) for
11.31(c) shows that the upper solder joints form a repetitive pattern approximately 16 pixels apart (see each pair of coordinates ( r, u ) . By varying these coordinates, we can generate two
Fig. 11.34). The next major peak is at 32, caused by the same pattern, but the amplitude of the peak is 1-D functions, S ( r ) and S ( u ) , that constitute a spectral-energy description of texture
lower because the number of repetitions at this distance is less than at 16 pixels. A similar observation for an entire image or region under consideration. Furthermore, descriptors of these
explains the even smaller peak at an offset of 48 pixels. functions themselves can be computed in order to characterize their behavior quan-
titatively. Descriptors useful for this purpose are the location of the highest value,
the mean and variance of both the amplitude and axial variations, and the distance
Spectral Approaches between the mean and the highest value of the function.
As we discussed in Section 5.4, the Fourier spectrum is ideally suited for describing
the directionality of periodic or semiperiodic 2-D patterns in an image. These global EXAMPLE 11.14 : Spectral texture.
texture patterns are easily distinguishable as concentrations of high-energy bursts in Figure 11.35(a) shows an image containing randomly distributed objects, and Fig. 11.35(b) shows an
the spectrum. Here, we consider three features of the Fourier spectrum that are use- image in which these objects are arranged periodically. Figures 11.35(c) and (d) show the corresponding
ful for texture description: (1) prominent peaks in the spectrum give the principal Fourier spectra. The periodic bursts of energy extending quadrilaterally in two dimensions in both Fou-
direction of the texture patterns; (2) the location of the peaks in the frequency plane rier spectra are due to the periodic texture of the coarse background material on which the objects rest.
gives the fundamental spatial period of the patterns; and (3) eliminating any peri- The other dominant components in the spectra in Fig. 11.35(c) are caused by the random orientation of
odic components via filtering leaves nonperiodic image elements, which can then the object edges in Fig. 11.35(a). On the other hand, the main energy in Fig. 11.35(d) not associated with
be described by statistical techniques. Recall that the spectrum is symmetric about the background is along the horizontal axis, corresponding to the strong vertical edges in Fig. 11.35(b).
the origin, so only half of the frequency plane needs to be considered. Thus, for the Figures 11.36(a) and (b) are plots of S ( r ) and S ( u ) for the random objects, and similarly in (c) and
(d) for the ordered objects. The plot of S ( r ) for the random objects shows no strong periodic compo-
nents (i.e., there are no dominant peaks in the spectrum besides the peak at the origin, which is the dc
FIGURE 11.34 16 pixels component). Conversely, the plot of S ( r ) for the ordered objects shows a strong peak near r = 15 and
A zoomed section a smaller one near r = 25, corresponding to the periodic horizontal repetition of the light (objects) and
of the circuit board dark (background) regions in Fig. 11.35(b). Similarly, the random nature of the energy bursts in Fig.
image showing 11.35(c) is quite apparent in the plot of S ( u ) in Fig. 11.36(b). By contrast, the plot in Fig. 11.36(d) shows
periodicity of
components. strong energy components in the region near the origin and at 90° and 180°. This is consistent with the
energy distribution of the spectrum in Fig. 11.35(d).

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 855 6/16/2017 2:15:35 PM DIP4E_GLOBAL_Print_Ready.indb 856 6/16/2017 2:15:37 PM


11.4 Region Feature Descriptors 857 858 Chapter 11 Feature Extraction

FIGURE 11.35 (a) and (b) Images of random and ordered objects. (c) and (d) Corresponding Fourier spectra. All images are of size 600 × 600 pixels.

FIGURE 11.36 (a) and (b) Plots of S(r) and S(θ) for Fig. 11.35(a). (c) and (d) Plots of S(r) and S(θ) for Fig. 11.35(b). All vertical axes are ×10^5.

MOMENT INVARIANTS

The 2-D moment of order (p + q) of an M × N digital image, f(x, y), is defined as

    m_{pq} = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} x^{p} y^{q} f(x, y)                                    (11-34)

where p = 0, 1, 2, … and q = 0, 1, 2, … are integers. The corresponding central moment of order (p + q) is defined as

    \mu_{pq} = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y)          (11-35)

for p = 0, 1, 2, … and q = 0, 1, 2, …, where

    \bar{x} = \frac{m_{10}}{m_{00}} \quad \text{and} \quad \bar{y} = \frac{m_{01}}{m_{00}}            (11-36)

The normalized central moment of order (p + q), denoted η_pq, is defined as

    \eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}                                                    (11-37)

where

    \gamma = \frac{p + q}{2} + 1                                                                      (11-38)

for p + q = 2, 3, … . A set of seven, 2-D moment invariants can be derived from the second and third normalized central moments:†

    \phi_1 = \eta_{20} + \eta_{02}                                                                    (11-39)

    \phi_2 = (\eta_{20} - \eta_{02})^{2} + 4\eta_{11}^{2}                                             (11-40)

    \phi_3 = (\eta_{30} - 3\eta_{12})^{2} + (3\eta_{21} - \eta_{03})^{2}                              (11-41)

    \phi_4 = (\eta_{30} + \eta_{12})^{2} + (\eta_{21} + \eta_{03})^{2}                                (11-42)

† Derivation of these results requires concepts that are beyond the scope of this discussion. The book by Bell [1965] and the paper by Hu [1962] contain detailed discussions of these concepts. For generating moment invariants of an order higher than seven, see Flusser [2000]. Moment invariants can be generalized to n dimensions (see Mamistvalov [1998]).

    \phi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^{2} - 3(\eta_{21} + \eta_{03})^{2}\right]
             + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^{2} - (\eta_{21} + \eta_{03})^{2}\right]     (11-43)

    \phi_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^{2} - (\eta_{21} + \eta_{03})^{2}\right]
             + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})                                                                   (11-44)

    \phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^{2} - 3(\eta_{21} + \eta_{03})^{2}\right]
             + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^{2} - (\eta_{21} + \eta_{03})^{2}\right]     (11-45)
This set of moments is invariant to translation, scale change, mirroring (within a minus sign), and rotation. We can attach physical meaning to some of the low-order moment invariants. For example, φ1 is the sum of the two second moments with respect to the principal axes of data spread, so this moment can be interpreted as a measure of data spread. Similarly, φ2 is based on the difference of the second moments, and may be interpreted as a measure of “slenderness.” However, as the order of the moment invariants increases, the complexity of their formulation causes physical meaning to be lost. The importance of Eqs. (11-39) through (11-45) is their invariance, not their physical meaning.
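As a concrete illustration of Eqs. (11-34) through (11-45), the sketch below computes the normalized central moments and the seven invariants directly from their definitions using NumPy. It is a minimal reference implementation written for this discussion (the function name is ours); libraries such as OpenCV provide equivalent functionality.

```python
import numpy as np

def moment_invariants(f):
    """Seven moment invariants of Eqs. (11-39)-(11-45) for a 2-D image f."""
    M, N = f.shape
    x = np.arange(M).reshape(-1, 1)          # x indexes rows, y indexes columns,
    y = np.arange(N).reshape(1, -1)          # matching the coordinate convention of the text

    m00 = f.sum()
    xbar = (x * f).sum() / m00               # Eq. (11-36)
    ybar = (y * f).sum() / m00

    def eta(p, q):                           # normalized central moments, Eqs. (11-35)-(11-38)
        mu_pq = (((x - xbar) ** p) * ((y - ybar) ** q) * f).sum()
        gamma = (p + q) / 2 + 1
        return mu_pq / (m00 ** gamma)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)

    phi1 = n20 + n02
    phi2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    phi3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    phi4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    phi5 = ((n30 - 3 * n12) * (n30 + n12) * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
            + (3 * n21 - n03) * (n21 + n03) * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    phi6 = ((n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
            + 4 * n11 * (n30 + n12) * (n21 + n03))
    phi7 = ((3 * n21 - n03) * (n30 + n12) * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
            + (3 * n12 - n30) * (n21 + n03) * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    return np.array([phi1, phi2, phi3, phi4, phi5, phi6, phi7])
```

Applying −sgn(φi) log10|φi| to the returned values reproduces the kind of scaling used in Table 11.5 of the example that follows.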
EXAMPLE 11.15: Moment invariants.

The objective of this example is to compute and compare the preceding moment invariants using the image in Fig. 11.37(a). The black (0) border was added to make all images in this example be of the same size; the zeros do not affect computation of the moment invariants. Figures 11.37(b) through (f) show the original image translated, scaled by 0.5 in both spatial dimensions, mirrored, rotated by 45°, and rotated by 90°, respectively. Table 11.5 summarizes the values of the seven moment invariants for these six images. To reduce dynamic range and thus simplify interpretation, the values shown are scaled using the expression −sgn(φi) log10|φi|. The absolute value is needed to handle any numbers that may be negative. The term sgn(φi) preserves the sign of φi, and the minus sign in front is there to handle fractions in the log computation. The idea is to make the numbers easier to interpret. Interest in this example is on the invariance and relative signs of the moments, not on their actual values. The two key points in Table 11.5 are: (1) the closeness of the values of the moments, independent of translation, scale change, mirroring, and rotation; and (2) the fact that the sign of φ7 is different for the mirrored image.

FIGURE 11.37 (a) Original image. (b)–(f) Images translated, scaled by one-half, mirrored, rotated by 45°, and rotated by 90°, respectively.

TABLE 11.5 Moment invariants for the images in Fig. 11.37.

Moment     Original    Translated   Half Size   Mirrored    Rotated 45°  Rotated 90°
Invariant  Image
φ1          2.8662      2.8662       2.8664      2.8662      2.8661       2.8662
φ2          7.1265      7.1265       7.1257      7.1265      7.1266       7.1265
φ3         10.4109     10.4109      10.4047     10.4109     10.4115      10.4109
φ4         10.3742     10.3742      10.3719     10.3742     10.3742      10.3742
φ5         21.3674     21.3674      21.3924     21.3674     21.3663      21.3674
φ6         13.9417     13.9417      13.9383     13.9417     13.9417      13.9417
φ7        −20.7809    −20.7809     −20.7724     20.7809    −20.7813     −20.7809

11.5 PRINCIPAL COMPONENTS AS FEATURE DESCRIPTORS

(As we show in Example 11.17, principal components can be used also to normalize regions or boundaries for variations in size, translation, and rotation.)

The material in this section is applicable to boundaries and regions. It is different from our discussion thus far, in the sense that features are based on more than one image. Suppose that we are given the three component images of a color image. The three images can be treated as a unit by expressing each group of three corresponding pixels as a vector, as discussed in Section 11.1. If we have a total of n registered

images, then the corresponding pixels at the same spatial location in all images can be arranged as an n-dimensional vector:

    \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}                            (11-46)

Throughout this section, the assumption is that all vectors are column vectors (i.e., matrices of order n × 1). We can write them on a line of text simply by expressing them as x = (x_1, x_2, …, x_n)^T, where T indicates the transpose.

(You may find it helpful to review the tutorials on probability and matrix theory available on the book website.)

We can treat the vectors as random quantities, just like we did when constructing an intensity histogram. The only difference is that, instead of talking about quantities like the mean and variance of the random variables, we now talk about mean vectors and covariance matrices. The mean vector of the population is defined as

    \mathbf{m}_x = E\{\mathbf{x}\}                                                                    (11-47)

where E{x} is the expected value of x, and the subscript denotes that m is associated with the population of x vectors. Recall that the expected value of a vector or matrix is obtained by taking the expected value of each element.

The covariance matrix of the vector population is defined as

    \mathbf{C}_x = E\{(\mathbf{x} - \mathbf{m}_x)(\mathbf{x} - \mathbf{m}_x)^{T}\}                    (11-48)

Because x is n dimensional, C_x is an n × n matrix. Element c_ii of C_x is the variance of x_i, the ith component of the x vectors in the population, and element c_ij of C_x is the covariance between elements x_i and x_j of these vectors. Matrix C_x is real and symmetric. If elements x_i and x_j are uncorrelated, their covariance is zero and, therefore, c_ij = 0, resulting in a diagonal covariance matrix.

Because C_x is real and symmetric, finding a set of n orthonormal eigenvectors is always possible (Noble and Daniel [1988]). Let e_i and λ_i, i = 1, 2, …, n, be the eigenvectors and corresponding eigenvalues of C_x,† arranged (for convenience) in descending order so that λ_j ≥ λ_{j+1} for j = 1, 2, …, n − 1. Let A be a matrix whose rows are formed from the eigenvectors of C_x, arranged in descending value of their eigenvalues, so that the first row of A is the eigenvector corresponding to the largest eigenvalue.

Suppose that we use A as a transformation matrix to map the x’s into vectors denoted by y’s, as follows:

    \mathbf{y} = \mathbf{A}(\mathbf{x} - \mathbf{m}_x)                                                (11-49)

This expression is called the Hotelling transform, which, as you will learn shortly, has some very interesting and useful properties.

(The Hotelling transform is the same as the discrete Karhunen-Loève transform, so the two names are used interchangeably in the literature.)

† By definition, the eigenvectors and eigenvalues of an n × n matrix C satisfy the equation Ce_i = λ_i e_i.

It is not difficult to show (see Problem 11.25) that the mean of the y vectors resulting from this transformation is zero; that is,

    \mathbf{m}_y = E\{\mathbf{y}\} = \mathbf{0}                                                       (11-50)

It follows from basic matrix theory that the covariance matrix of the y’s is given in terms of A and C_x by the expression

    \mathbf{C}_y = \mathbf{A}\mathbf{C}_x\mathbf{A}^{T}                                               (11-51)

Furthermore, because of the way A was formed, C_y is a diagonal matrix whose elements along the main diagonal are the eigenvalues of C_x; that is,

    \mathbf{C}_y = \begin{bmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_n \end{bmatrix}                (11-52)

The off-diagonal elements of this covariance matrix are 0, so the elements of the y vectors are uncorrelated. Keep in mind that the λ_i are the eigenvalues of C_x and that the elements along the main diagonal of a diagonal matrix are its eigenvalues (Noble and Daniel [1988]). Thus, C_x and C_y have the same eigenvalues.

Another important property of the Hotelling transform deals with the reconstruction of x from y. Because the rows of A are orthonormal vectors, it follows that A^{-1} = A^{T}, and any vector x can be recovered from its corresponding y by using the expression

    \mathbf{x} = \mathbf{A}^{T}\mathbf{y} + \mathbf{m}_x                                              (11-53)

But, suppose that, instead of using all the eigenvectors of C_x, we form a matrix A_k from the k eigenvectors corresponding to the k largest eigenvalues, yielding a transformation matrix of order k × n. The y vectors would then be k dimensional, and the reconstruction given in Eq. (11-53) would no longer be exact (this is somewhat analogous to the procedure we used in Section 11.3 to describe a boundary with a few Fourier coefficients). The vector reconstructed by using A_k is

    \hat{\mathbf{x}} = \mathbf{A}_k^{T}\mathbf{y} + \mathbf{m}_x                                      (11-54)

It can be shown that the mean squared error between x and x̂ is given by the expression

    e_{ms} = \sum_{j=1}^{n} \lambda_j - \sum_{j=1}^{k} \lambda_j = \sum_{j=k+1}^{n} \lambda_j         (11-55)

Equation (11-55) indicates that the error is zero if k = n (that is, if all the eigenvectors are used in the transformation). Because the λ_j’s decrease monotonically,

Eq. (11-55) also shows that the error can be minimized by selecting the k eigenvectors associated with the largest eigenvalues. Thus, the Hotelling transform is optimal in the sense that it minimizes the mean squared error between the vectors x and their approximations x̂. Due to this idea of using the eigenvectors corresponding to the largest eigenvalues, the Hotelling transform also is known as the principal components transform.
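The transform of Eqs. (11-47) through (11-54) is easy to prototype for a stack of registered images. The sketch below, written for this discussion (the function name and return values are our own choices), treats each pixel location as one n-dimensional vector x and reconstructs the stack from the k components with the largest eigenvalues.

```python
import numpy as np

def hotelling_transform(stack, k):
    """stack: array of shape (n, M, N) holding n registered images.
    Returns k principal-component images, the reconstructed stack, and the eigenvalues."""
    n, M, N = stack.shape
    X = stack.reshape(n, -1)                      # each column is one x vector, Eq. (11-46)

    mx = X.mean(axis=1, keepdims=True)            # mean vector, Eq. (11-47)
    Cx = (X - mx) @ (X - mx).T / (X.shape[1] - 1) # covariance matrix, Eq. (11-48)

    lam, E = np.linalg.eigh(Cx)                   # eigenvalues (ascending), eigenvectors (columns)
    order = np.argsort(lam)[::-1]                 # sort into descending order
    A = E[:, order].T                             # rows of A are the eigenvectors

    Y = A @ (X - mx)                              # Eq. (11-49)
    Ak = A[:k, :]                                 # keep the k largest-eigenvalue rows
    X_hat = Ak.T @ Y[:k, :] + mx                  # Eq. (11-54)

    pc_images = Y.reshape(n, M, N)[:k]            # principal-component images
    return pc_images, X_hat.reshape(n, M, N), lam[order]
```

For the six-band data of the example that follows, keeping k = 2 retains most of the total variance, which is why the reconstructions in Fig. 11.41 are close to the originals.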

EXAMPLE 11.16: Using principal components for image description.

Figure 11.38 shows six multispectral satellite images corresponding to six spectral bands: visible blue (450–520 nm), visible green (520–600 nm), visible red (630–690 nm), near infrared (760–900 nm), middle infrared (1,550–1,750 nm), and thermal infrared (10,400–12,500 nm). The objective of this example is to illustrate how to use principal components as image features.

Organizing the images as in Fig. 11.39 leads to the formation of a six-element vector x from each set of corresponding pixels in the images, as discussed earlier in this section. The images in this example are of size 564 × 564 pixels, so the population consisted of (564)^2 = 318,096 vectors from which the mean vector, covariance matrix, and corresponding eigenvalues and eigenvectors were computed. The eigenvectors were then used as the rows of matrix A, and a set of y vectors were obtained using Eq. (11-49). Similarly, we used Eq. (11-51) to obtain C_y. Table 11.6 shows the eigenvalues of this matrix. Note the dominance of the first two eigenvalues.

FIGURE 11.38 Multispectral images in the (a) visible blue, (b) visible green, (c) visible red, (d) near infrared, (e) middle infrared, and (f) thermal infrared bands. (Images courtesy of NASA.)

FIGURE 11.39 Forming of a feature vector from corresponding pixels in six images: the pixels from spectral bands 1 through 6 become components x1 through x6 of x.

TABLE 11.6 Eigenvalues of C_x obtained from the images in Fig. 11.38.

λ1 = 10344, λ2 = 2966, λ3 = 1401, λ4 = 203, λ5 = 94, λ6 = 31

A set of principal component images was generated using the y vectors mentioned in the previous paragraph (images are constructed from vectors by applying Fig. 11.39 in reverse). Figure 11.40 shows the results. Figure 11.40(a) was formed from the first component of the 318,096 y vectors, Fig. 11.40(b) from the second component of these vectors, and so on, so these images are of the same size as the original images in Fig. 11.38. The most obvious feature in the principal component images is that a significant portion of the contrast detail is contained in the first two images, and it decreases rapidly from there. The reason can be explained by looking at the eigenvalues. As Table 11.6 shows, the first two eigenvalues are much larger than the others. Because the eigenvalues are the variances of the elements of the y vectors, and variance is a measure of intensity contrast, it is not unexpected that the images formed from the vector components corresponding to the largest eigenvalues would exhibit the highest contrast. In fact, the first two images in Fig. 11.40 account for about 89% of the total variance. The other four images have low contrast detail because they account for only the remaining 11%.

According to Eqs. (11-54) and (11-55), if we used all the eigenvectors in matrix A we could reconstruct the original images from the principal component images with zero error between the original and reconstructed images (i.e., the images would be identical). If the objective is to store and/or transmit the principal component images and the transformation matrix for later reconstruction of the original images, it would make no sense to store and/or transmit all the principal component images because nothing would be gained. Suppose, however, that we keep and/or transmit only the two principal component images. Then there would be significant savings in storage and/or transmission (matrix A would be of size 2 × 6, so its impact would be negligible).

Figure 11.41 shows the results of reconstructing the six multispectral images from the two principal component images corresponding to the largest eigenvalues.

FIGURE 11.40 The six principal component images obtained from vectors computed using Eq. (11-49). Vectors are converted to images by applying Fig. 11.39 in reverse.

FIGURE 11.41 Multispectral images reconstructed using only the two principal component images corresponding to the two principal component vectors with the largest eigenvalues. Compare these images with the originals in Fig. 11.38.

The first five images are quite close in appearance to the originals in Fig. 11.38, but this is not true for the sixth image. The reason is that the original sixth image is actually blurry, but the two principal component images used in the reconstruction are sharp; therefore, the blurry “detail” is lost. Figure 11.42 shows the differences between the original and reconstructed images. The images in Fig. 11.42 were enhanced to highlight the differences between them. If they were shown without enhancement, the first five images would appear almost all black, with the sixth (difference) image showing the most variability.

EXAMPLE 11.17: Using principal components for normalizing for variations in size, translation, and rotation.

As we mentioned earlier in this chapter, feature descriptors should be as independent as possible of variations in size, translation, and rotation. Principal components provide a convenient way to normalize boundaries and/or regions for variations in these three variables. Consider the object in Fig. 11.43, and assume that its size, location, and orientation (rotation) are arbitrary. The points in the region (or its boundary) may be treated as 2-D vectors, x = (x_1, x_2)^T, where x_1 and x_2 are the coordinates of any object point. All the points in the region or boundary constitute a 2-D vector population that can be used to compute the covariance matrix C_x and mean vector m_x. One eigenvector of C_x points in the direction of maximum variance (data spread) of the population, while the second eigenvector is perpendicular to the first, as Fig. 11.43(b) shows. In terms of the present discussion, the principal components transform in Eq. (11-49) accomplishes two things: (1) it establishes the center of the transformed coordinate system as the centroid (mean) of the population because m_x is subtracted from each x; and (2) the y coordinates (vectors) it generates are rotated versions of the x’s, so that the data align with the eigenvectors. If we define a (y_1, y_2) axis system so that y_1 is along the first eigenvector and y_2 is along the second, then the geometry that results is as illustrated in Fig. 11.43(c). That is, the dominant data directions are aligned with the new axis system. The same result will be obtained regardless of the size, translation, or rotation of the object, provided that all points in the region or boundary undergo the same transformation. If we wished to size-normalize the transformed data, we would divide the coordinates by the corresponding eigenvalues.
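The manual illustration that follows can be reproduced with a few lines of NumPy. The sketch below is our own minimal version of the alignment step, written to accompany this example; it applies Eq. (11-49) to the four points used in Fig. 11.44.

```python
import numpy as np

# The four points of Fig. 11.44(a), one point per row.
x = np.array([[1, 1], [2, 4], [4, 2], [5, 5]], dtype=float)

mx = x.mean(axis=0)                      # mean vector, here (3, 3)
Cx = np.cov(x, rowvar=False)             # covariance matrix of the population
lam, e = np.linalg.eigh(Cx)              # eigenvalues (ascending) and eigenvectors (columns)

# Order the eigenvectors by descending eigenvalue and use them as the rows of A.
order = np.argsort(lam)[::-1]
A = e[:, order].T

y = (x - mx) @ A.T                       # Eq. (11-49): y = A(x - mx), applied to every point
# The rows of y are aligned with the principal axes; individual signs may differ from the
# figure because an eigenvector is only defined up to a sign.
```

With these four points, the eigenvalues come out as 5.333 and 1.333 and the transformed points as (±2.828, 0) and (0, ±1.414), matching the values given below.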

Observe in Fig. 11.43(c) that the points in the y-axes system can have both positive and negative values. To convert all coordinates to positive values, we simply subtract the vector (y_{1 min}, y_{2 min})^T from all the y vectors. To displace the resulting points so that they are all greater than 0, as in Fig. 11.43(d), we add to them a vector (a, b)^T, where a and b are greater than 0.

FIGURE 11.43 (a) An object. (b) Object showing eigenvectors of its covariance matrix. (c) Transformed object, obtained using Eq. (11-49). (d) Object translated so that all its coordinate values are greater than 0. (The axis labels in the figure mark the direction of maximum variance, the direction perpendicular to it, and the centroid of the data.)

FIGURE 11.42 Differences between the original and reconstructed images. All images were enhanced by scaling them to the full [0, 255] range to facilitate visual analysis.

Although the preceding discussion is straightforward in principle, the mechanics are a frequent source of confusion. Thus, we conclude this example with a simple manual illustration. Figure 11.44(a) shows

four points with coordinates (1, 1), (2, 4), (4, 2), and (5, 5). The mean vector, covariance matrix, and normalized (unit length) eigenvectors of this population are:

    \mathbf{m}_x = \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \qquad
    \mathbf{C}_x = \begin{bmatrix} 3.333 & 2.00 \\ 2.00 & 3.333 \end{bmatrix}

and

    \mathbf{e}_1 = \begin{bmatrix} 0.707 \\ 0.707 \end{bmatrix}, \qquad
    \mathbf{e}_2 = \begin{bmatrix} -0.707 \\ 0.707 \end{bmatrix}

The corresponding eigenvalues are λ1 = 5.333 and λ2 = 1.333. Figure 11.44(b) shows the eigenvectors superimposed on the data. From Eq. (11-49), the transformed points (the y’s) are (−2.828, 0)^T, (0, −1.414)^T, (0, 1.414)^T, and (2.828, 0)^T. These points are plotted in Fig. 11.44(c). Note that they are aligned with the y-axes and that they have fractional values. When working with images, coordinate values are integers, making it necessary to round all values to their nearest integer value. Figure 11.44(d) shows the points rounded to the nearest integer and their location shifted so that all coordinate values are integers greater than 0, as in the original figure.

When transforming image pixels, keep in mind that image coordinates are the same as matrix coordinates; that is, (x, y) represents (r, c), and the origin is the top left. Axes of the principal components just illustrated are as shown in Figs. 11.43(a) and (d). You need to keep this in mind in interpreting the results of applying a principal components transformation to objects in an image.

11.6 WHOLE-IMAGE FEATURES

The descriptors introduced in Sections 11.2 through 11.4 are well suited for applications (e.g., industrial inspection), in which individual regions can be segmented reliably using methods such as the ones discussed in Chapters 10 and 11. With the exception of the application in Example 11.17, the principal components feature vectors in Section 11.5 are different from the earlier material, in the sense that they are based on multiple images. But even these descriptors are localized to sets of corresponding pixels. In some applications, such as searching image databases for matches (e.g., as in human face recognition), the variability between images is so extensive that the methods in Chapters 10 and 11 are not applicable.

FIGURE 11.44 A manual example. (a) Original points. (b) Eigenvectors of the covariance matrix of the points in (a). (c) Transformed points obtained using Eq. (11-49). (d) Points from (c), rounded and translated so that all coordinate values are integers greater than 0. The dashed lines are included to facilitate viewing. They are not part of the data.

(The discussion in Sections 12.5 through 12.7 dealing with neural networks is also important in terms of processing large numbers of entire images for the purpose of characterizing their content.)

The state of the art in image processing is such that as the complexity of the task increases, the number of techniques suitable for addressing those tasks decreases. This is particularly true when dealing with feature descriptors applicable to entire images that are members of a large family of images. In this section, we discuss two of the principal feature detection methods currently being used for this purpose. One is based on detecting corners, and the other works with entire regions in an image. Then, in Section 11.7 we present a feature detection and description approach designed specifically to work with these types of features.

THE HARRIS-STEPHENS CORNER DETECTOR

(Our use of the term “corner” is broader than just 90° corners; it refers to features that are “corner-like.”)

Intuitively, we think of a corner as a rapid change of direction in a curve. Corners are highly effective features because they are distinctive and reasonably invariant to viewpoint. Because of these characteristics, corners are used routinely for matching image features in applications such as tracking for autonomous navigation, stereo machine vision algorithms, and image database queries.

In this section, we discuss an algorithm for corner detection formulated by Harris and Stephens [1988]. The idea behind the Harris-Stephens (HS) corner detector is illustrated in Fig. 11.45. The basic approach is this: Corners are detected by running a small window over an image, as we did in Chapter 3 for spatial filtering. The detector window is designed to compute intensity changes. We are interested in three scenarios: (1) areas of zero (or small) intensity changes in all directions, which happens when the window is located in a constant (or nearly constant) region, as in location A in Fig. 11.45; (2) areas of changes in one direction but no (or small) changes in the orthogonal direction, which happens when the window spans a boundary between two regions, as in location B; and (3) areas of significant changes in all directions, a condition that happens when the window contains a corner (or isolated points), as in location C. The HS corner detector is a mathematical formulation that attempts to differentiate between these three conditions.

FIGURE 11.45 Illustration of how the Harris-Stephens corner detector operates in the three types of subregions indicated by A (flat), B (edge), and C (corner). The wiggly arrows indicate graphically a directional response in the detector as it moves in the three areas shown.

(A patch is the image area spanned by the detector window at any given time.)

Let f denote an image, and let f(s, t) denote a patch of the image defined by the values of (s, t). A patch of the same size, but shifted by (x, y), is given by f(s + x, t + y). Then, the weighted sum of squared differences between the two patches is given by

    C(x, y) = \sum_{s} \sum_{t} w(s, t)\left[f(s + x, t + y) - f(s, t)\right]^{2}                      (11-56)

where w(s, t) is a weighting function to be discussed shortly. The shifted patch can be approximated by the linear terms of a Taylor expansion

    f(s + x, t + y) \approx f(s, t) + x f_x(s, t) + y f_y(s, t)                                        (11-57)

where f_x(s, t) = ∂f/∂x and f_y(s, t) = ∂f/∂y, both evaluated at (s, t). We can then write Eq. (11-56) as

    C(x, y) = \sum_{s} \sum_{t} w(s, t)\left[x f_x(s, t) + y f_y(s, t)\right]^{2}                      (11-58)

This equation can be written in matrix form as

    C(x, y) = \begin{bmatrix} x & y \end{bmatrix} \mathbf{M} \begin{bmatrix} x \\ y \end{bmatrix}      (11-59)
The detector window is designed to compute intensity changes. We are interested in C( x, y) = [ x y ] M   (11-59)
three scenarios: (1) Areas of zero (or small) intensity changes in all directions, which  y

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 869 6/16/2017 2:15:56 PM DIP4E_GLOBAL_Print_Ready.indb 870 6/16/2017 2:15:57 PM


11.6 Whole-Image Features 871 872 Chapter 11 Feature Extraction

where

M = ∑ ∑ w( s, t ) A (11-60)
s t

and
 fx2 fx fy 
A=  (11-61)
 fx fy fy2 

Matrix M sometimes is called the Harris matrix. It is understood that its terms are
evaluated at ( s, t ). If w( s, t ) is isotropic, then M is symmetric because A is. The
weighting function w( s, t ) used in the HS detector generally has one of two forms:
(1) it is 1 inside the patch and 0 elsewhere (i.e., it has the shape of a box lowpass filter
kernel), or (2) it is an exponential function of the form Flat %1 Straight %1 Corner %1
Edge
2
+ t 2 ) 2s 2
w( s, t ) = e − ( s (11-62)

The box is used when computational speed is paramount and the noise level is low.
The exponential form is used when data smoothing is important.
1 1 1
As illustrated in Fig. 11.45, a corner is characterized by large values in region C, %1
fy
%1
fy
%1
fy
in both spatial directions. However, when the patch spans a boundary there will also
be a response in one direction. The question is: How can we tell the difference? As
we discussed in Section 11.5 (see Example 11.17), the eigenvectors of a real, sym-
lx : small lx : large lx : large
metric matrix (such as M above) point in the direction of maximum data spread, ly : small 1 ly : small 1 ly : large 1
fx fx fx
and the corresponding eigenvalues are proportional to the amount of data spread in
the direction of the eigenvectors. In fact, the eigenvectors are the major axes of an a b c
ellipse fitting the data, and the magnitude of the eigenvalues are the distances from d e f
the center of the ellipse to the points where it intersects the major axes. Figure 11.46 FIGURE 11.46 (a)–(c) Noisy images and image patches (small squares) encompassing image regions similar in content
illustrates how we can use these properties to differentiate between the three cases to those in Fig. 11.45. (d)–(f) Plots of value pairs ( fx , fy ) showing the characteristics of the eigenvalues of M that are
in which we are interested. useful for detecting the presence of a corner in an image patch.
The small image patches in Figs. 11.46(a) through (c) are representative of regions
A, B, and C in Fig. 11.45. In Fig. 11.46(d), we show values of ( fx , fy ) computed using
the derivative kernels wy = [ −1 0 1] and wx = wTy (remember, we use the coordinate
The eigenvalues of the imply the presence of a vertical or horizontal boundary; and (3) two large eigenval-
As noted in Chapter 3, we 2 × 2 matrix M can be
do not use bold notation system defined in Fig. 2.19). Because we compute the derivatives at each point in the expressed in a closed ues imply the presence of a corner or (unfortunately) isolated bright points.
for vectors and matrices patch, variations caused by noise result in scattered values, with the spread of the
form (see Problem 11.31). Thus, we see that the eigenvalues of the matrix formed from derivatives in the
representing spatial However, their computa-
kernels. scatter being directly related to the noise level and its properties. As expected, the tion requires squares and image patch can be used to differentiate between the three scenarios of interest.
derivatives from the flat region form a nearly circular cluster, whose eigenvalues are
square roots, which are However, instead of using the eigenvalues (which are expensive to compute), the HS
expensive to process.
almost identical, yielding a nearly circular fit to the points (we label these eigenvalues detector utilizes a measure of corner response based on the fact that the trace of a
as “small” in relation to the other two plots). Figure 11.46(e) shows the derivatives of square matrix is equal to the sum of its eigenvalues, and its determinant is equal to
the patch containing the edge. Here, the spread is greater along the x-axis, and about The advantage of this for- the product of its eigenvalues. The measure is defined as
mulation is that the trace
nearly the same as Fig. 11.46 (a) in the y-axis. Thus, eigenvalue lx is “large” while ly is is the sum of the main R = lx ly − k(lx + ly )2
“small.” Consequently, the ellipse fitting the data is elongated in the x-direction. Final- diagonal terms of M (just (11-63)
ly, Fig. 11.46(f) shows the derivatives of the patch containing the corner. Here, the
two numbers). The deter- = det(M) − k trace 2 (M)
minant of a 2 × 2 matrix
data is spread along both directions, resulting in two large eigenvalues and a much is the product of the main
where k is a constant to be explained shortly. Measure R has large positive values
diagonal elements minus
larger and nearly circular fitting ellipse. From this we conclude that: (1) two small the product of the cross when both eigenvalues are large, indicating the presence of a corner; it has large
eigenvalues indicate nearly constant intensity; (2) one small and one large eigenvalue elements. These are trivial
negative values when one eigenvalue is large and the other small, indicating an edge;
computations.

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 871 6/16/2017 2:15:58 PM DIP4E_GLOBAL_Print_Ready.indb 872 6/16/2017 2:15:59 PM


11.6 Whole-Image Features 873 874 Chapter 11 Feature Extraction

and its absolute value is small when both eigenvalues are small, indicating that the a b
image patch under consideration is flat. c d
Constant k is determined empirically, and its range of values depends on the imple- FIGURE 11.48
mentation. For example, the MATLAB Image Processing Toolbox uses 0 < k < 0.25. (a) Same as Fig.
You can interpret k as a “sensitivity factor;” the smaller it is, the more likely the detec- 11.47(a), but
corrupted with
tor is to find corners. Typically, R is used with a threshold, T. We say that a corner at Gaussian noise of
an image location has been detected only if R > T for a patch at that location. mean 0 and
variance 0.01.
(b) Result of using
EXAMPLE 11.18 : Applying the HS corner detector.
the HS detector
Figure 11.47(a) shows a noisy image, and Fig. 11.47(b) is the result of using the HS corner detector with k = 0.04 and
with k = 0.04 and T = 0.01 (the default values in our implementation). All corners of the squares were T = 0.01 [compare
with Fig. 11.47(b)].
detected correctly, but the number of false detections is too high (note that all errors occurred on the (c) Result with
right side of the image, where the difference in intensity between squares is less). Figure 11.47(c) shows k = 0.249, (near
the highest value
in our implementa-
tion), and T = 0.01.
(d) Result of using
k = 0.04 and
T = 0.15.

the result obtained by increasing k to 0.1 and leaving T at 0.01. This time, all corners were detected cor-
rectly. As Fig. 11.47(d) shows, increasing the threshold to T = 0.1 yielded the same result. In fact, using
the default value of k and leaving T at 0.1 also produced the same result, as Fig. 11.47(e) shows. The
point of all this is that there is considerable flexibility in the interplay between the values of k and T.
Figure 11.47(f) shows the result obtained using the default value for k and using T = 0.3. As expected,
increasing the value of the threshold eliminated some corners, yielding in this case only the corner of
the squares with larger intensity differences. Increasing the value of k to 0.1 and setting T to its default
value yielded the same result, as did using k = 0.1 and T = 0.3, demonstrating again the flexibility in the
values chosen for these two parameters. However, as the level of noise increases, the range of possible
values becomes narrower, as the results in the next paragraph illustrate.
Figure 11.48(a) shows the checkerboard corrupted by a much higher level of additive Gaussian noise
(see the figure caption). Although this image does not appear much different than Fig. 11.47(a), the
results using the default values of k and T are much worse than before. False corners were detected even
a b c on the left side of the image, where the intensity differences are much stronger. Figure 11.48(c) is the
d e f result of increasing k near the maximum value in our implementation (2.5) while keeping T at its default
FIGURE 11.47 (a) A 600 × 600 image with values in the range [0, 1], corrupted by additive Gaussian noise with 0 mean value. This time, k alone could not overcome the higher noise level. On the other hand, decreasing k to
and variance of 0.006. (b) Result of applying the HS corner detector with k = 0.04 and T = 0.01 (the defaults). Sev- its default value and increasing T to 0.15 produced a perfect result, as Fig. 11.48(d) shows.
eral errors are visible. (c) Result using k = 0.1 and T = 0.01. (d) Result using k = 0.1 and T = 0.1. (e) Result using Figure 11.49(a) shows a more complex image with a significant number of corners embedded in
k = 0.04 and T = 0.1. (f) Result using k = 0.04 and T = 0.3 (only the strongest corners on the left were detected). various ranges of intensities. Figure 11.49(b) is the result obtained using the default values for k and T.

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 873 6/16/2017 2:16:01 PM DIP4E_GLOBAL_Print_Ready.indb 874 6/16/2017 2:16:02 PM


11.6 Whole-Image Features 875 876 Chapter 11 Feature Extraction

a b
FIGURE 11.50
(a) Image
rotated 5°.
(b) Corners
detected using the
parameters used
to obtain
Fig. 11.49(f).

MAXIMALLY STABLE EXTREMAL REGIONS (MSERs)


The Harris-Stephens corner detector discussed in the previous section is useful in
applications characterized by sharp transitions of intensities, such as the intersec-
tion of straight edges, that result in corner-like features in an image. Conversely, the
maximally stable extremal regions (MSERs) introduced by Matas et al. [2002] are
more “blob” oriented. As with the HS corner detector, MSERs are intended to yield
whole image features for the purpose of establishing correspondence between two
or more images.
We know from Fig. 2.18 that a grayscale image can be viewed as a topographic
map, with the xy-axes representing spatial coordinates, and the z-axis representing
a b c
d e f intensities. Imagine that we start thresholding an 8-bit grayscale image one intensity
level at a time. The result of each thresholding is a binary image in which we show
FIGURE 11.49 600 × 600 image of a building. (b) Result of applying the HS corner detector with k = 0.04 and T = 0.01
(the default values in our implementation). Numerous irrelevant corners were detected. (c) Result using k = 0.249 the pixels at or above the threshold in white, and the pixels below the threshold as
and the default value for T. (d) Result using k = 0.17 and T = 0.05. (e) Result using the default value for k and black. When the threshold, T, is 0, the result is a white image (all pixel values are
T = 0.05. (f) Result using the default value of k and T = 0.07. at or above 0). As we start increasing T in increments of one intensity level, we will
begin to see black components in the resulting binary images. These correspond to
local minima in the topographic map view of the image. These black regions may
As you can see, numerous detection errors occurred (see, for example, the large number of wrong corner begin to grow and merge, but they never get smaller from image to image. Finally,
detections in the right edge of the building). Increasing k alone had little effect on the over-detection when we reach T = 255, the resulting image will be black (there are no pixel values
of corners until k was near its maximum value. Using the same values as in Fig. 11.48(c) resulted in the above this level). Because each stage of thresholding results in a binary image, there
image in 11.49(c), which shows a reduced number of erroneous corners, at the expense of missing numer- will be one or more connected components of white pixels in each image. The set of
ous important ones in the front of the building. Reducing k to 0.17 and increasing T to 0.05 did a much all such components resulting from all thresholdings is the set of extremal regions.
better job, as Fig. 11.49(d) show. Parameter k did not play a major role in corner detection for the building Extremal regions that do not change size (number of pixels) appreciably over a
image. In fact, Figs. 11.49(e) and (f) show essentially the same level of performance obtained by reducing range of threshold values are called maximally stable extremal regions.
k to its default value of 0.04, and using T = 0.05 and T = 0.07, respectively. As you will see shortly, the procedure just discussed can be cast in the form of a
Finally, Fig. 11.50 shows corner detection on a rotated image. The result in Fig. 11.50(b) was obtained rooted, connected tree called a component tree, where each level of the tree corre-
using the same parameters we used in Fig. 11.49(f), showing the relative insensitivity of the method to Remember, ∀ sponds to a value of the threshold discussed in the previous paragraph. Each node
rotation. Figures 11.49(f) and 11.50(b) show detection of at least one corner in every major structural means “for any,” ∈ of this tree represents an extremal region, R, defined as
means “belonging to,”
feature of the image, such as the front door, all the windows, and the corners that define the apex of the and a colon, :,
facade. For matching purposes, these are excellent results. is used to
∀p ∈ R and ∀q ∈ boundary(R) : I ( p) > I (q) (11-64)
mean “it is true that.”

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 875 6/16/2017 2:16:03 PM DIP4E_GLOBAL_Print_Ready.indb 876 6/16/2017 2:16:03 PM


11.6 Whole-Image Features 877 878 Chapter 11 Feature Extraction

where I is the image under consideration, and p and q are image points. This equa-
tion indicates that an extremal region R is a region of I, with the property that the 225 175 90 90
intensity of any point in the region is higher than the intensity at any point in the
boundary of the region. As usual, we assume that image intensities are integers, 125 5 90 225
ordered from 0 (black) to the maximum intensity (e.g., 255 for 8-bit images), which
are represented by white. 5 5 5 225
MSERs are found by analyzing the nodes of the component tree. For each con-
nected region in the tree, we compute a stability measure, c, defined as 125 5 225 225
RiT + ( n−1)!T − RkT + ( n+1)!T
c(RTj + n!T ) = (11-65) T + !T = 60
RTj + n!T
Region R1
where R is the size of the area (number of pixels) of connected region R, T is a Area = 11
threshold value in the range T ∈[min( I ), max( I )], and !T is a specified thresh-
old increment. Regions RiT + ( n−1)!T , RTj + n!T , and RkT + ( n+1)!T are connected regions
obtained at threshold levels T + (n − 1)!T , T + n!T , and T + (n + 1)!T , respectively. T + 2!T = 110
In terms of the component tree, regions Ri and Rk are respectively the parent and
Region R2 Region R3 Region R4
child of region Rj . Because T + (n − 1)!T < T + (n + 1)!T , we are guaranteed that
Area = 3 Area = 1 Area = 3
| RiT + ( n−1)!T | ≥ | RkT + ( n+1)!T |. It then follows from Eq. (11-65) that c ≥ 0. MSREs c=3 c=83
are the regions corresponding to the nodes in the tree that have a stability value
that is a local minimum along the path of the tree containing that region. What this
means in practice is that maximally stable regions are regions whose sizes do not T + 3!T = 160
change appreciably across two, 2!T neighboring thresholded images. Region R5 Region R6
Figure 11.51 illustrates the concepts just introduced. The grayscale image at the Area = 2 Area = 3
top consists of some simple regions of constant intensity, with values in the range c=1 c=0
[0, 255]. Based on the explanation of Eqs. (11-64) and (11-65), we used the threshold
T = 10, which is in the range [ min( I ) = 5, max( I ) = 225 ]. Choosing !T = 50 segmen-
T + 4!T = 210
ted all the different regions of the image. The column of binary images on the left con-
tains the results of thresholding the grayscale image with the threshold values shown. Region R7 Region R8
The resulting component tree is on the right. Note that the tree is shown “root up,” Area = 1 Area = 3
which is the way you would normally program it.
All the squares in the grayscale image are of the same size (area); therefore,
regardless of the image size, we can normalize the size of each square to 1. For exam-
ple, if the image is of size 400 × 400 pixels, the size of each square is 100 × 100 = 10 4 FIGURE 11.51 Detecting MSERs. Top: Grayscale image. Left: Thresholded images using T = 10 and !T = 50. Right:
pixels. Normalizing the size to 1 means that size 1 corresponds to 10 4 pixels (one Component tree, showing the individual regions. Only one MSER was detected (see dashed tree node on the
square), size 2 corresponds to 2 × 10 4 pixels (two squares), and so forth. You can rightmost branch of the tree). Each level of the tree is formed from the thresholded image on the left, at that same
arrive at the same conclusion by noticing that the ratio in Eq. (11-65) eliminates the level. Each node of the tree contains one extremal region (connected component) shown in white, and denoted by
a subscripted R.
common 10 4 factor.
The component tree in Fig. 11.51 is a good summary of how the MSER algorithm
works. The first level is the result of thresholding I with T + !T = 60. There is only
regions in the binary image obtained by thresholding I using T + 2!T = 110. As you
one connected component (white pixels) in the thresholded image on the left. The
can see on the left, this image has three connected components, so we create three
size of the connected component is 11 normalized units. As mentioned above, each
nodes in the component tree at the level of the thresholded image. Similarly, the
node of a component tree, denoted by a subscripted R, contains one connected
binary image obtained by thresholding I with T + 3!T = 160 has two connected
component consisting of white pixels. The next level in the tree is formed from the

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 877 6/16/2017 2:16:06 PM DIP4E_GLOBAL_Print_Ready.indb 878 6/16/2017 2:16:06 PM


11.6 Whole-Image Features 879 880 Chapter 11 Feature Extraction

components, so we create two nodes in the tree at this level. These two connected a b
components are children of the connected components in the previous level, so we c d
place the new nodes in the same path as their respective parents. The next level of FIGURE 11.52
the tree is explained in the same manner. Note that the center node in the previous (a) 600 × 570 CT
level had no children, so that path of the tree ends in the second level. slice of a human
head. (b) Image
Because we need to check size variations between parent and child regions to deter- smoothed with a
mine stability, only the two middle regions (corresponding to threshold values of 110 box kernel of size
and 160) are relevant in this example. As you can see in our component tree, only R6 15 × 15 elements. (c)
has a parent and child of similar size (the sizes are identical in this case). Therefore, A extremal region
along the path of the
region R6 is the only MSER detected in this case. Observe that if we had used a single tree containing one
global threshold to detect the brightest regions, region R7 would have been detected MSER.
also (an undesirable result in this context). Thus, we see that although MSERs are (d) The MSER.
based on intensity, they also depend on the nature of the background surrounding a (All MSER regions
were limited to the
region. In this case, R6 was surrounded by a darker background than R7 , and the darker range 10,260 – 34,200
background was thresholded earlier in the tree, allowing the size of R6 to remain con- pixels, correspond-
stant over the two, 2!T neighboring range required for detection as an MSER. ing to a range
In our example, it was easy to detect an MSER as the only region that did not between 3%
and 10% of image
change size, which gave a stability factor 0. A value of zero automatically implies
size.)
that an MSER has been found because the parent and child regions are of the (Original image
same size. When working with more complex images, the values of stability fac- courtesy of Dr.
tors seldom are zero because of variations in intensity caused by variables such David R.
as illumination, viewpoint, and noise. The concept of a local minimum mentioned Pickens, Vanderbilt
University.)
earlier is simply a way of saying that MSERs are extremal regions that do not change size significantly over a 2ΔT thresholding range. What is considered a “significant”
change depends on the application.
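The thresholding-and-tracking idea behind MSERs can be prototyped with standard connected-component labeling. The sketch below is a simplified illustration written for this discussion: it tracks component sizes across a stack of thresholded images and evaluates the stability measure of Eq. (11-65) for the components containing a common seed pixel. Practical MSER implementations, such as the one by Matas et al. [2002], build an efficient component tree instead of this brute-force scan.

```python
import numpy as np
from scipy.ndimage import label

def mser_stability(image, T0=0, dT=10):
    """Simplified illustration of the MSER stability measure of Eq. (11-65).

    Thresholds the image at T0+dT, T0+2dT, ..., labels the bright connected
    components at each level, and returns (level, seed_pixel, area, stability)
    tuples. Regions whose stability is a local minimum along a path of nested
    components are candidate MSERs."""
    levels = np.arange(T0 + dT, image.max() + 1, dT)
    labels = [label(image >= T)[0] for T in levels]

    results = []
    for i in range(1, len(levels) - 1):
        lab = labels[i]
        for region_id in range(1, lab.max() + 1):
            mask = lab == region_id
            area = int(mask.sum())
            seed = tuple(np.argwhere(mask)[0])       # any pixel of the region

            # Areas of the components containing the same seed pixel one
            # threshold level below (parent) and one above (child).
            parent_id = labels[i - 1][seed]
            child_id = labels[i + 1][seed]
            parent_area = int((labels[i - 1] == parent_id).sum())
            child_area = int((labels[i + 1] == child_id).sum()) if child_id != 0 else 0

            stability = (parent_area - child_area) / area   # Eq. (11-65)
            results.append((levels[i], seed, area, stability))
    return results
```

A size filter of the kind used in Example 11.19 can then be applied by discarding any candidate whose area falls outside a chosen range.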
It is not unusual for numerous MSERs to be detected, many of which may not be
meaningful because of their size. One way to control the number of regions detected preprocessing step when !T is relatively small. In this case, we used T = 0 and !T = 10. This increment
is by the choice of !T. Another is to label as insignificant any region whose size is was small enough to require smoothing for proper MSER detection. In addition, we used a “size filter,”
not in a specified size range. We illustrate this in Example 11.19. in the sense that the size (area) of an MSER had to be between 10,262 and 34,200 pixels; these size limits
Matas et al. [2002] indicate that MSERs are affine-covariant (see Section 11.1). are 3% and 10% of the size of the image, respectively.
This follows directly from the fact that area ratios are preserved under affine trans- Figure 11.53 illustrates MSER detection on a more complex image. We used less blurring (a 5 × 5 box
formations, which in turn implies that for an affine transformation the original and kernel) in this image because is has more fine detail. We used the same T and !T as in Fig. 11.52, and
transformed regions are related by that transformation. We illustrate this property a valid MSER size in the range 10,000 to 30,000 pixels, corresponding approximately to 3% and 8% of
in Figs. 11.54 and 11.55. image size, respectively. Two MSERs were detected using these parameters, as Figs. 11.53(c) and (d)
Finally, keep in mind that the preceding MSER formulation is designed to detect show. The composite MSER, shown in Fig. 11.53(e), is a good representation of the front of the building.
bright regions with darker surroundings. The same formulation applied to the nega- Figure 11.54 shows the behavior under rotation of the MSERs detected in Fig. 11.53. Figure 11.54(a)
tive (in the sense defined in Section 3.2) of an image will detect dark regions with is the building image rotated 5° in the conterclockwise direction. The image was cropped after rota-
lighter surroundings. If interest lies in detecting both types of regions simultaneously, tion to eliminate the resulting black areas (see Fig. 2.41), which would change the nature of the image
we form the union of both sets of MSERs. data and thus influence the results. Figure 11.54(b) is the result of performing the same smoothing as
in Fig. 11.53, and Fig. 11.54(c) is the composite MSER detected using the same parameters as in Fig.
11.53(e). As you can see, the composite MSER of the rotated image corresponds quite closely to the
EXAMPLE 11.19 : Extracting MSERs from grayscale images. MSER in Fig. 11.53(e).
Figure 11.52(a) shows a slice image from a CT scan of a human head, and Fig. 11.52(b) shows the result Finally, Fig. 11.55 shows the behavior of the MSER detector under scale changes. Figure 11.55(a) is the
of smoothing Fig. 11.52(a) with a box kernel of size 15 × 15 elements. Smoothing is used routinely as a building image scale to 0.5 of its original dimensions, and Fig. 11.55(b) shows the image smoothed with
a correspondingly smaller box kernel of size 3 × 3. Because the image area is now one-fourth the size

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 879 6/16/2017 2:16:07 PM DIP4E_GLOBAL_Print_Ready.indb 880 6/16/2017 2:16:08 PM


11.7 Scale-Invariant Feature Transform (SIFT) 881 882 Chapter 11 Feature Extraction

a b c
FIGURE 11.54 (a) Building image rotated 5° counterclockwise. (b) Smoothed image using the same kernel as in
Fig. 11.53(b). (c) Composite MSER detected using the same parameters we used to obtain Fig. 11.53(e). The MSERs
of the original and rotated images are almost identical.

to assemble a set of reasonably well-understood individual methods into a “system”


capable of addressing problems that cannot be solved by any single known method
acting alone. Thus, we are forced to determine experimentally the interplay between
the various parameters controlling the performance of more complex systems.
When images are similar in nature (same scale, similar orientation, etc), cor-
ner detection and MSERs are suitable as whole image features. However, in the
presence of variables such as scale changes, rotation, changes in illumination, and
a b
c d e
changes in viewpoint, we are forced to use methods like SIFT.
SIFT features (called keypoints) are invariant to image scale and rotation, and
FIGURE 11.53 (a) Building image of size 600 × 600 pixels. (b) Image smoothed using a 5 × 5 box kernel. (c) and
(d) MSERs detected using T = 0, !T = 10, and MSER size range between 10,000 and 30,000 pixels, corresponding
are robust across a range of affine distortions, changes in 3-D viewpoint, noise, and
approximately to 3% and 8% of the area of the image. (e) Composite image. changes of illumination. The input to SIFT is an image. Its output is an n-dimensional
feature vector whose elements are the invariant feature descriptors. We begin our
discussion by analyzing how scale invariance is achieved by SIFT.
of the original area, we reduced the valid MSER range by one-fourth to 2500 –7500 pixels. Other than
these changes, we used the same parameters as in Fig. 11.53. Figure 11.55(c) shows the resulting MSER.
As you can see, this figure is quite close to the full-size result in Fig. 11.53(e).

11.7 SCALE-INVARIANT FEATURE TRANSFORM (SIFT)


11.7

SIFT is an algorithm developed by Lowe [2004] for extracting invariant features from
an image. It is called a transform because it transforms image data into scale-invariant
coordinates relative to local image features. SIFT is by far the most complex feature
detection and description approach we discuss in this chapter.
As you progress though this section, you will notice the use of a significant num- a b c
ber of experimentally determined parameters. Thus, unlike most of the formulations FIGURE 11.55 (a) Building image reduced to half-size. (b) Image smoothed with a 3 × 3 box
of individual approaches we have discussed thus far, SIFT is strongly heuristic. This kernel. (c) Composite MSER obtained with the same parameters as Fig. 11.53(e), but using a
is a consequence of the fact that our current knowledge is insufficient to tell us how valid MSER region size range of 2,500 -–7,500 pixels.

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 881 6/16/2017 2:16:09 PM DIP4E_GLOBAL_Print_Ready.indb 882 6/16/2017 2:16:09 PM


11.7 Scale-Invariant Feature Transform (SIFT) 883 884 Chapter 11 Feature Extraction

SCALE SPACE FIGURE 11.56


Scale space,
The first stage of the SIFT algorithm is to find image locations that are invariant showing three More octaves
to scale change. This is achieved by searching for stable features across all possible octaves. Because .
scales, using a function of scale known as scale space, which is a multi-scale rep- s = 2 in this case, .
.
resentation suitable for handling image structures at different scales in a consis- each octave has five

6.
4
k .s3
smoothed ..
tent manner. The idea is to have a formalism for handling the fact that objects in ks3
images. A Scale
unconstrained scenes will appear in different ways, depending on the scale at which Gaussian ker- Octave 3 3 = 2s2
s
images are captured. Because these scales may not be known beforehand, a reason- nel was used for k 4.s2 Standard deviations used
able approach is to work with all relevant scales simultaneously. Scale space repre-
sents an image as a one-parameter family of smoothed images, with the objective of
smoothing, so the
space parameter
is s.
ks2
..

s2 = 2s1
6. in the Gaussian lowpass
kernels of each octave (the
same number of images
simulating the loss of detail that would occur as the scale of an image decreases. The Scale
with the same powers of k is
parameter controlling the smoothing is referred to as the scale parameter. Octave 2 generated in each octave)

6
In SIFT, Gaussian kernels are used to implement smoothing, so the scale param- k 4s1
eter is the standard deviation. The reason for using Gaussian kernels in based on
work performed by Lindberg [1994], who showed that the only smoothing kernel
k 3s1
k 2s1
.
that meets a set of important constraints, such as linearity and shift-invariance, is ks1
the Gaussian lowpass kernel. Based on this, the scale space, L( x, y, s), of a grayscale s1
image, f ( x, y),† is produced by convolving f with a variable-scale Gaussian kernel,
G( x, y, s) :
Scale Images smoothed using
As in Chapter 3, “!” Octave 1 Gaussian lowpass kernels
indicates spatial convolu- L( x, y, s) = G( x, y, s) ! f ( x, y) (11-66)
tion.

where the scale is controlled by parameter s, and G is of the form The preceding discussion indicates that the number of smoothed images gener-
ated in an octave is s + 1. However, as you will see in the next section, the smoothed
1 2 2
2s2 images in scale space are used to compute differences of Gaussians [see Eq. (10-32)]
G( x, y, s) = e −( x + y ) (11-67)
2ps 2 which, in order to cover a full octave, implies that an additional two images past the
octave image are required, giving a total of s + 3 images. Because the octave image is
The input image f ( x, y) is successively convolved with Gaussian kernels having always the ( s + 1)th image in the stack (counting from the bottom), it follows that this
standard deviations s, ks, k 2s, k 3s, . . . to generate a “stack” of Gaussian-filtered image is the third image from the top in the expanded sequence of s + 3 images. Each
(smoothed) images that are separated by a constant factor k, as shown in the lower octave in Fig. 11.56 contains five images, indicating that s = 2 was used in this case.
left of Fig. 11.56. The first image in the second octave is formed by downsampling the original
SIFT subdivides scale space into octaves, with each octave corresponding to a image (by skipping every other row and column), and then smoothing it using a
doubling of s, just as an octave in music theory corresponds to doubling the fre- kernel with twice the standard deviation used in the first octave (i.e., s2 = 2s1 ).
quency of a sound signal. SIFT further subdivides each octave into an integer num- Subsequent images in that octave are smoothed using s 2 , with the same sequence
ber, s, of intervals, so that an interval of 1 consists of two images, an interval of 2 of values of k as in the first octave (this is denoted by dots in Fig. 11.56). The same
consists of three images, and so forth. It then follows that the value used in the Gauss- basic procedure is then repeated for subsequent octaves. That is, the first image of
ian kernel that generates the image corresponding to an octave is k ss = 2s which the new octave is formed by: (1) downsampling the original image enough times
Instead of repeatedly
means that k = 21 s. For example, for s = 2, k = 2, and the input image is succes- downsampling the to achieve half the size of the image in the previous octave, and (2) smoothing the
sively smoothed using standard deviations of s, ( 2 ) s, and ( 2 )2 s, so that the third original image, we can
downsampled image with a new standard deviation that is twice the standard devia-
carry the previously
image (i.e., the octave image for s = 2) in the sequence is filtered using a Gaussian downsampled image, tion of the previous octave. The rest of the images in the new octave are obtained by
kernel with standard deviation ( 2 )2 s = 2s. and downsample it
smoothing the downsampled image with the new standard deviation multiplied by
by 2 to obtain the image

required for the next the same sequence of values of k as before.
Experimental results reported by Lowe [2004] suggest that smoothing the original image using a Gaussian octave.
kernel with s = 0.5 and then doubling its size by linear (nearest-neighbor) interpolation improves the number
When k = 2, we can obtain the first image of a new octave without having to
of stable features detected by SIFT. This preprocessing step is an integral part of the algorithm. Images are smooth the downsampled image. This is because, for this value of k, the kernel used
assumed to have values in the range [0, 1]. to smooth the first image of every octave is the same as the kernel used to smooth

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 883 6/16/2017 2:16:11 PM DIP4E_GLOBAL_Print_Ready.indb 884 6/16/2017 2:16:12 PM


11.7 Scale-Invariant Feature Transform (SIFT) 885 886 Chapter 11 Feature Extraction

the third image from the top of the previous octave. Thus, the first image of a new FIGURE 11.57
octave can be obtained directly by downsampling that third image of the previous Illustration using
images of the first k 4s 3
octave by 2. The result will be the same (see Problem 11.36). The third image from three octaves of
the top of any octave is called the octave image because the standard deviation used scale space in
to smooth it is twice (i.e., k 2 = 2) the value of the standard deviation used to smooth SIFT. The entries k 4s 2
k 3s3
the first image in the octave. in the table are
Figure 11.57 uses grayscale images to further illustrate how scale space is con- values of standard
deviation used k 2 s3
structed in SIFT. Because each octave is composed of five images, it follows that at each scale of
we are again using s = 2. We chose s1 = 2 2 = 0.707 and k = 2 = 1.414 for this each octave. For 4
k s1
example so that the numbers would result in familiar multiples. As in Fig. 11.56, the k 3s2
example the ks3
images going up scale space are blurred by using Gaussian kernels with progressively standard
larger standard deviations, and the first image of the second and subsequent octaves deviation used in

Scale
scale 2 of octave 1
is obtained by downsampling the octave image from the previous octave by 2. As is ks1 , which is
s3 = 2s2 = 4s1
you can see, the images become significantly more blurred (and consequently lose equal to 1.0. 2
Octave 3
k s2
more fine detail) as they go up both in scale as well as in octave. The images in the (The images

Book Page Writable Area (45p6 by 37p0)


third octave show significantly fewer details, but their gross appearance is unmistak- of octave 1 are k 3s1
ably that of the same structure. shown slightly
overlapped to
fit in the figure
DETECTING LOCAL EXTREMA space.) ks2

SIFT initially finds the locations of keypoints using the Gaussian filtered images,
then refines the locations and validity of those keypoints using two processing steps.
k 2s1

Scale
Finding the Initial Keypoints s2 = 2s1
Keypoint locations in scale space are found initially by SIFT by detecting extrema Octave 2
in the difference of Gaussians of two adjacent scale-space images in an octave, con-
volved with the input image that corresponds to that octave. For example, to find s1 = 2 2 = 0.707 k= 2 = 1.414
keypoint locations related to the first two levels of octave 1 in scale space, we look
for extrema in the function
ks1
Scale
Octave
D( x, y, s) = [G( x, y, ks) − G( x, y, s)] ! f ( x, y) (11-68) 1 2 3 4 5
1 0.707 1.000 1.414 2.000 2.828
It follows from Eq. (11-66) that
2 1.414 2.000 2.828 4.000 5.657
D( x, y, s) = L( x, y, ks) − L( x, y, s) (11-69)

Scale
3 2.828 4.000 5.657 8.000 11.314
s1
In other words, all we have to do to form function D( x, y, s) is subtract the first two Octave 1
images of octave 1. Recall from the discussion of the Marr-Hildreth edge detector
(Section 10.2) that the difference of Gaussians is an approximation to the Laplacian
of a Gaussian (LoG). Therefore, Eq. (11-69) is nothing more than an approximation
to Eq. (10-30). The key difference is that SIFT looks for extrema in D( x, y, s), where- G( x, y, ks) − G( x, y, s) ≈ (k − 1) s 2 ( 2G (11-70)
as the Marr-Hildreth detector would look for the zero crossings of this function.
Lindberg [1994] showed that true scale invariance in scale space requires that the Therefore, DoGs already have the necessary scaling “built in.” The factor (k − 1) is
LoG be normalized by s 2 (i.e., that s 2 ( 2G be used). It can be shown (see Problem constant over all scales, so it does not influence the process of locating extrema in
11.34) that scale space. Although Eqs. (11-68) and (11-69) are applicable to the first two images

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 885 6/16/2017 2:16:12 PM DIP4E_GLOBAL_Print_Ready.indb 886 6/16/2017 2:16:13 PM


11.7 Scale-Invariant Feature Transform (SIFT) 887 888 Chapter 11 Feature Extraction

of octave 1, the same form of these equations is applicable to any two images from FIGURE 11.59
any octave, provided that the appropriate downsampled image is used, and the DoG Extrema (maxima
or minima) of the
is computed from two adjacent images in the octave.
D( x, y, s) images
Figure 11.58 illustrates the concepts just discussed, using the building image from in an octave are
Fig. 11.57. A total of s + 2 difference functions, D( x, y, s), are formed in each octave detected by
from all adjacent pairs of Gaussian-filtered images in that octave. These difference comparing a pixel
(shown in black)
functions can be viewed as images, and one sample of such an image is shown for each
to its 26 neighbors
of the three octaves in Fig. 11.58. As you might expect from the results in Fig. 11.57, (shown shaded) in
the level of detail in these images decreases the further up we go in scale space. 3 × 3 regions at the
Figure 11.59 shows the procedure used by SIFT to find extrema in a D( x, y, s) current and Scale
adjacent scale
image. At each location (shown in black) in a D( x, y, s) image, the value of the pixel Corresponding sections of three
images. contiguous D( x, y, s) images
at that location is compared to the values of its eight neighbors in the current image
and its nine neighbors in the images above and below. The point is selected as an
extremum (maximum or minimum) point if its value is larger than the values of all
its neighbors, or smaller than all of them. No extrema can be detected in the first extremum (to achieve subpixel accuracy) is to fit an interpolating function at each
(last) scale of an octave because it has no lower (upper) scale image of the same size. extremum point found in the digital function, then look for an improved extremum
location in the interpolated function. SIFT uses the linear and quadratic terms of
Improving the Accuracy of Keypoint Locations a Taylor series expansion of D( x, y, s), shifted so that the origin is located at the
sample point being examined. In vector form, the expression is
When a continuous function is sampled, its true maximum or minimum may actually
be located between sample points. The usual approach used to get closer to the true
∂D T 1 T ∂ ∂D
D(x) = D + a bx + x a bx
∂x 2 ∂x ∂x (11-71)
1
= D + ( (D) x + xT H x
T

Octave 3 where D and its derivatives are evaluated at the sample point, x = ( x, y, s)T is the
offset from that point, ( is the familiar gradient operator,

 ∂D ∂x 
∂D 
(D = =  ∂D ∂y  (11-72)
∂x
Octave 2  ∂D ∂s 

and H is the Hessian matrix

 ∂ 2 D ∂x 2 ∂ 2 D ∂x∂y ∂ 2 D ∂x∂s 
 
H =  ∂ 2 D ∂y∂x ∂ 2 D ∂y 2 ∂ 2 D ∂y∂s  (11-73)
 2 2 2 2 
D( x, y, s)  ∂ D ∂s∂x ∂ D ∂s∂y ∂ D ∂s 
Octave 1 Sample D( x, y, s)

Scale The location of the extremum, xˆ , is found by taking the derivative of Eq. (11-71)
Gaussian-filtered images, L( x, y, s) with respect to x and setting it to zero, which gives us (see Problem 11.37):
Because D and its
FIGURE 11.58 How Eq. (11-69) is implemented in scale space. There are s + 3 L( x, y, s) images and s + 2 corre- derivatives are evalu-
sponding D( x, y, s) images in each octave.
ated at the sample point, x̂ = − H −1 ( (D) (11-74)
they are constants with
respect to x.

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 887 6/16/2017 2:16:14 PM DIP4E_GLOBAL_Print_Ready.indb 888 6/16/2017 2:16:15 PM


11.7 Scale-Invariant Feature Transform (SIFT) 889 890 Chapter 11 Feature Extraction

The Hessian and gradient of D are approximated using differences of neighbor- If the determinant is negative, the curvatures have different signs and the keypoint
ing points, as we did in Section 10.2. The resulting 3 × 3 system of linear equations in question cannot be an extremum, so it is discarded.
is easily solved computationally. If the offset x̂ is greater than 0.5 in any of its three As with the HS corner Let r denote the ratio of the largest to the smallest eigenvalue. Then a = r b and
detector, the advantage
dimensions, we conclude that the extremum lies closer to another sample point, in of this formulation is
which case the sample point is changed and the interpolation is performed about that the trace and deter-
minants of 2 × 2 matrix
[ Tr(H)]2 = (a + b)
2
=
( r b + b ) = ( r + 1)
2 2
(11-78)
that point instead. The final offset x̂ is added to the location of its sample point to H are easy to compute. Det(H ) ab rb2 r
obtain the interpolated estimate of the location of the extremum. See the margin note in
⁄ Eq. (11-63).
The function value at the extremum, D(x), is used by SIFT for rejecting unstable which depends on the ratio of the eigenvalues, rather than their individual values.

extrema with low contrast, where D(x) is obtained by substituting Eq. (11-74) into The minimum of (r + 1)2 r occurs when the eigenvalues are equal, and it increases
Eq. (11-71), giving (see Problem 11.37): with r. Therefore, to check that the ratio of principal curvatures is below some
threshold, r, we only need to check
1

D(x) = D + ( (D)T x⁄ (11-75)
[ Tr(H)]2 (r + 1)
2
2 < (11-79)

Det(H ) r
In the experimental results reported by Lowe [2004], any extrema for which D(x)
was less than 0.03 was rejected, based on all image values being in the range [0, 1]. which is a simple computation. In the experimental results reported by Lowe [2004],
This eliminates keypoints that have low contrast and/or are poorly localized. a value of r = 10 was used, meaning that keypoints with ratios of curvature greater
than 10 were eliminated.
Figure 11.60 shows the SIFT keypoints detected in the building image using the
Eliminating Edge Responses ⁄
approach discussed in this section. Keypoints for which D(x) in Eq. (11-75) was less
Recall from Section 10.2 that using a difference of Gaussians yields edges in an than 0.03 were rejected, as were keypoints that failed to satisfy Eq. (11-79) with
image. But keypoints of interest in SIFT are “corner-like” features, which are signifi- r = 10.
cantly more localized. Thus, intensity transitions caused by edges are eliminated. To
If you display an image quantify the difference between edges and corners, we can look at local curvature. KEYPOINT ORIENTATION
as a topographic map
An edge is characterized by high curvature in one direction, and low curvature in the
(see Fig. 2.18), edges At this point in the process, we have computed keypoints that SIFT considers stable.
will appear as ridges orthogonal direction. Curvature at a point in an image can be estimated from the
Because we know the location of each keypoint in scale space, we have achieved
that have low curvature
2 × 2 Hessian matrix evaluated at that point. Thus, to estimate local curvature of the
along the ridge and high scale independence. The next step is to assign a consistent orientation to each key-
curvature perpendicular DoG at any level in scalar space, we compute the Hessian matrix of D at that level:
to it. point based on local image properties. This allows us to represent a keypoint rela-
tive to its orientation and thus achieve invariance to image rotation. SIFT uses a
 ∂ 2 D ∂x 2 ∂ 2 D ∂x∂y   Dxx Dxy 
H= 2 = (11-76)
 ∂ D ∂y∂x ∂ D ∂y   Dyx
2 2 Dyy  FIGURE 11.60
SIFT keypoints
detected in the
where the form on the right uses the same notation as the A term [Eq. (11-61)] of building image.
the Harris matrix (but note that the main diagonals are different). The eigenvalues The points were
of H are proportional to the curvatures of D. As we explained in connection with the enlarged slightly
Harris-Stephens corner detector, we can avoid direct computation of the eigenvalues to make them
easier to see.
by formulating tests based on the trace and determinant of H, which are equal to
the sum and product of the eigenvalues, respectively. To use notation different from
the HS discussion, let a and b be the eigenvalues of H with the largest and smallest
magnitude, respectively. Using the relationship between the eigenvalues of H and
its trace and determinant we have (remember, H is is symmetric and of size 2 × 2) :

Tr(H ) = Dxx + Dyy = a + b


(11-77)
D et(H ) = Dxx Dyy − (Dxy )2 = ab

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 889 6/16/2017 2:16:17 PM DIP4E_GLOBAL_Print_Ready.indb 890 6/16/2017 2:16:17 PM


11.7 Scale-Invariant Feature Transform (SIFT) 891 892 Chapter 11 Feature Extraction

straightforward approach for this. The scale of the keypoint is used to select the of similar sets of keypoints in the image. For example, observe the keypoints on the
Gaussian smoothed image, L, that is closest to that scale. In this way, all orienta- right, vertical corner of the building. The lengths of the arrows vary, depending on
tion computations are performed in a scale-invariant manner. Then, for each image illumination and image content, but their direction is unmistakably consistent. Plots
sample, L( x, y), at this scale, we compute the gradient magnitude, M( x, y), and ori- of keypoint orientations generally are quite cluttered and are not intended for gen-
See Section 10.2 regard-
ing computation of the entation angle, u( x, y), using pixel differences: eral human interpretation. The value of keypoint orientation is in image matching,
gradient magnitude and 1 as we will illustrate later in our discussion.
angle.
M( x, y) = ( L( x + 1, y) − L( x − 1, y)) + ( L( x, y + 1) − L( x, y − 1))  2
2 2
(11-80)
 
KEYPOINT DESCRIPTORS
and The procedures discussed up to this point are used for assigning an image location,
scale, and orientation to each keypoint, thus providing invariance to these three
u( x, y) = tan −1 ( L( x, y + 1) − L( x, y − 1)) ( L( x + 1, y) − L( x − 1, y)) (11-81) variables. The next step is to compute a descriptor for a local region around each
keypoint that is highly distinctive, but is at the same time as invariant as possible to
A histogram of orientations is formed from the gradient orientations of sample changes in scale, orientation, illumination, and image viewpoint. The idea is to be
points in a neighborhood of each keypoint. The histogram has 36 bins covering the able to use these descriptors to identify matches (similarities) between local regions
360° range of orientations on the image plane. Each sample added to the histogram in two or more images.
is weighed by its gradient magnitude, and by a circular Gaussian function with a stan- The approach used by SIFT to compute descriptors is based on experimental
dard deviation 1.5 times the scale of the keypoint. results suggesting that local image gradients appear to perform a function similar
Peaks in the histogram correspond to dominant local directions of local gradients. to what human vision does for matching and recognizing 3-D objects from different
The highest peak in the histogram is detected and any other local peak that is within viewpoints (Lowe [2004]). Figure 11.62 summarizes the procedure used by SIFT
80% of the highest peak is used also to create another keypoint with that orienta- to generate the descriptors associated with each keypoint. A region of size 16 × 16
tion. Thus, for the locations with multiple peaks of similar magnitude, there will be
multiple keypoints created at the same location and scale, but with different orienta-
tions. SIFT assigns multiple orientations to only about 15% of points with multiple
orientations, but these contribute significant to image matching (to be discussed FIGURE 11.62

}
}
later and in Chapter 12). Finally, a parabola is fit to the three histogram values clos- Approach used to
compute a
est to each peak to interpolate the peak position for better accuracy.
keypoint
Figure 11.61 shows the same keypoints as Fig. 11.60 superimposed on the image descriptor.
and showing keypoint orientations as arrows. Note the consistency of orientation

FIGURE 11.61
The keypoints
from Fig. 11.60 Gradients
superimposed in 16*16
region
on the original
image. The arrows = Keypoint
Gaussian weighting function
indicate keypoint
orientations.

8-directional histogram (the


bins are multiples of 45°)

Keypoint descriptor = 128-dimensional vector

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 891 6/16/2017 2:16:18 PM DIP4E_GLOBAL_Print_Ready.indb 892 6/16/2017 2:16:18 PM


11.7 Scale-Invariant Feature Transform (SIFT) 893 894 Chapter 11 Feature Extraction

pixels is centered on a keypoint, and the gradient magnitude and direction are com- are less likely to affect gradient orientation. SIFT reduces the influence of large
puted at each point in the region using pixel differences. These are shown as ran- gradient magnitudes by thresholding the values of the normalized feature vector
domly oriented arrows in the upper-left of the figure. A Gaussian weighting function so that all components are below the experimentally determined value of 0.2. After
with standard deviation equal to one-half the size of the region is then used to assign thresholding, the feature vector is renormalized to unit length.
a weight that multiplies the magnitude of the gradient at each point. The Gaussian
weighting function is shown as a circle in the figure, but it is understood that it is a SUMMARY OF THE SIFT ALGORITHM
bell-shaped surface whose values (weights) decrease as a function of distance from
the center. The purpose of this function is to reduce sudden changes in the descriptor As the material in the preceding sections shows, SIFT is a complex procedure con-
with small changes in the position of the function. sisting of many parts and empirically determined constants. The following is a step-
Because there is one gradient computation for each point in the region surround- As indicated at the
by-step summary of the method.
ing a keypoint, there are (16)2 gradient directions to process for each keypoint. beginning of this section,
1. Construct the scale space. This is done using the procedure outlined in Figs. 11.56
smoothing and doubling
There are 16 directions in each 4 × 4 subregion. The top-rightmost subregion is the size of the input and 11.57. The parameters that need to be specified are s, s, (k is computed
shown zoomed in the figure to simplify the explanation of the next step, which image is assumed. Input
consists of quantizing all gradient orientations in the 4 × 4 subregion into eight pos- images are assumed to from s), and the number of octaves. Suggested values are s = 1.6, s = 2, and
sible directions differing by 45°. Rather than assigning a directional value as a full
have values in the range
[0, 1].
three octaves.
count to the bin to which it is closest, SIFT performs interpolation that distributes a 2. Obtain the initial keypoints. Compute the difference of Gaussians, D( x, y, s),
histogram entry among all bins proportionally, depending on the distance from that from the smoothed images in scale space, as explained in Fig. 11.58 and Eq. (11-69).
value to the center of each bin. This is done by multiplying each entry into a bin by Find the extrema in each D( x, y, s) image using the method explained in Fig.
a weight of 1 − d, where d is the shortest distance from the value to the center of a 11.59. These are the initial keypoints.
bin, measured in the units of the histogram spacing, so that the maximum possible
distance is 1. For example, the center of the first bin is at 45° 2 = 22.5°, the next cen- 3. Improve the accuracy of the location of the keypoints. Interpolate the values
ter is at 22.5° + 45° = 67.5°, and so on. Suppose that a particular directional value is of D( x, y, s) via a Taylor expansion. The improved key point locations are given
22.5°. The distance from that value to the center of the first histogram bin is 0, so we by Eq. (11-74).
would assign a full entry (i.e., a count of 1) to that bin in the histogram. The distance 4. Delete unsuitable keypoints. Eliminate keypoints that have low contrast and/or
to the next center would be greater than 0, so we would assign a fraction of a full are poorly localized. This is done by evaluating D from Step 3 at the improved
entry, that is 1 * (1 − d), to that bin, and so forth for all bins. In this way, every bin locations, using Eq. (11-75). All keypoints whose values of D are lower than a
gets a proportional fraction of a count, thus avoiding “boundary” effects in which a threshold are deleted. A suggested threshold value is 0.03. Keypoints associated
descriptor changes abruptly as a small change in orientation causes it to be assigned with edges are deleted also, using Eq. (11-79). A value of 10 is suggested for r.
from one bin to another.
Figure 11.62 shows the eight directions of a histogram as a small cluster of vec- 5. Compute keypoint orientations. Use Eqs. (11-80) and (11-81) to compute the
tors, with the length of each vector being equal to the value of its corresponding bin. magnitude and orientation of each keypoint using the histogram-based proce-
Sixteen histograms are computed, one for each 4 × 4 subregion of the 16 × 16 region dure discussed in connection with these equations.
surrounding a keypoint. A descriptor, shown on the lower left of the figure, then con- 6. Compute keypoint descriptors. Use the method summarized in Fig. 11.62 to
sists of a 4 × 4 array, each containing eight directional values. In SIFT, this descriptor compute a feature (descriptor) vector for each keypoint. If a region of size
data is organized as a 128-dimensional vector. 16 × 16 around each keypoint is used, the result will be a 128-dimensional feature
In order to achieve orientation invariance, the coordinates of the descriptor and vector for each keypoint.
the gradient orientations are rotated relative to the keypoint orientation. In order to
reduce the effects of illumination, a feature vector is normalized in two stages. First, The following example illustrates the power of this algorithm.
the vector is normalized to unit length by dividing each component by the vector
norm. A change in image contrast resulting from each pixel value being multiplied EXAMPLE 11.20 : Using SIFT for image matching.
by a constant will multiply the gradients by the same constant, so the change in We illustrate the performance of the SIFT algorithm by using it to find the number of matches between
contrast will be cancelled by the first normalization. A brightness change caused an image of a building and a subimage formed by extracting part of the right corner edge of the building.
by a constant being added to each pixel will not affect the gradient values because We also show results for rotated and scaled-down versions of the image and subimage. This type of pro-
they are computed from pixel differences. Therefore, the descriptor is invariant to cess can be used in applications such as finding correspondences between two images for the purpose of
affine changes in illumination. However, nonlinear illumination changes resulting, image registration, and for finding instances of an image in a database of images.
for example, from camera saturation, can also occur. These types of changes can Figure 11.63(a) shows the keypoints for the building image (this is the same as Fig. 11.61), and the
cause large variations in the relative magnitudes of some of the gradients, but they keypoints for the subimage, which is a separate, much smaller image. The keypoints were computed

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 893 6/16/2017 2:16:19 PM DIP4E_GLOBAL_Print_Ready.indb 894 6/16/2017 2:16:20 PM


11.7 Scale-Invariant Feature Transform (SIFT) 895 896 Chapter 11 Feature Extraction

a b a b
FIGURE 11.63 (a) Keypoints and their directions (shown as gray arrows) for the building image and for a section of FIGURE 11.64 (a) Keypoints for the rotated (by 5°) building image and for a section of the right corner of the building.
the right corner of the building. The subimage is a separate image and was processed as such. (b) Corresponding The subimage is a separate image and was processed as such. (b) Corresponding keypoints between the corner and
key points between the building and the subimage (the straight lines shown connect pairs of matching points). Only the building. Of the 26 matches found, only two are in error.
three of the 36 matches found are incorrect.

that we do not always know a priori when images have been acquired under different conditions and
geometrical arrangements. A more practical test is to compute features for a prototype image and test
using SIFT independently for each image. The building shows 643 keypoints and the subimage 54 key- them against unknown samples. Figure 11.66 shows the results of such tests. Figure 11.66(a) is the origi-
points. Figure 11.63(b) shows the matches found by SIFT between the image and subimage; 36 keypoint nal building image, for which SIFT features vectors were already computed (see Fig. 11.63). SIFT was
matches were found and, as the figure shows, only three were incorrect. Considering the large number used to compare the rotated subimage from Fig. 11.64(a) against the original, unrotated image. As Fig.
of initial keypoints, you can see that keypoint descriptors offer a high degree of accuracy for establishing 11.66(a) shows, 10 matches were found, of which two were incorrect. These are excellent results, con-
correspondences between images. sidering the relatively small size of the subimage, and the fact that it was rotated. Figure 11.66(b) shows
Figure 11.64(a) shows keypoints for the building image after it was rotated by 5° counterclockwise, the results of matching the half-sized subimage against the original image. Eleven matches were found,
and for a subimage extracted from its right corner edge. The rotated image is smaller than the original
because it was cropped to eliminate the constant areas created by rotation (see Fig. 2.41). Here, SIFT
found 547 keypoints for the building and 49 for the subimage. A total of 26 matches were found and, as
Fig. 11.64(b) shows, only two were incorrect.
Figure 11.65 shows the results obtained using SIFT on an image of the building reduced to half the
size in both spatial directions. When SIFT was applied to the downsampled image and a correspond-
ing subimage, no matches were found. This was remedied by brightening the reduced image slightly
by manipulating the intensity gamma. The subimage was extracted from this image. Despite the fact
that SIFT has the capability to handle some degree of changes in intensity, this example indicates that
performance can be improved by enhancing the contrast of an image prior to processing. When work-
ing with a database of images, histogram specification (see Chapter 3) is an excellent tool for normal-
izing the intensity of all images using the characteristics of the image being queried. SIFT found 195
keypoints for the half-size image and 24 keypoints for the corresponding subimage. A total of seven
matches were found between the two images, of which only one was incorrect. a b
The preceding two figures illustrate the insensitivity of SIFT to rotation and scale changes, but they FIGURE 11.65 (a) Keypoints for the half-sized building and a section of the right corner. (b) Corresponding keypoints
are not ideal tests because the reason for seeking insensitivity to these variables in the first place is between the corner and the building. Of the seven matches found, only one is in error.

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 895 6/16/2017 2:16:20 PM DIP4E_GLOBAL_Print_Ready.indb 896 6/16/2017 2:16:21 PM


11.7 Scale-Invariant Feature Transform (SIFT) 897 898 Chapter 11 Feature Extraction

Our discussion of moment-invariants is based on Hu [1962]. For generating moments of arbitrary order, see Flusser
[2000].
Hotelling [1933] was the first to derive and publish the approach that transforms discrete variables into uncor-
related coefficients (Section 11.5). He referred to this technique as the method of principal components. His paper
gives considerable insight into the method and is worth reading. Principal components are still used widely in
numerous fields, including image processing, as evidenced by Xiang et al. [2016]. The corner detector in Section 11.6
is from Harris and Stephens [1988], and our discussion of MSERs is based on Matas et al. [2002]. The SIFT material
in Section 11.7 is from Lowe [2004]. For details on the software aspects of many of the examples in this chapter, see
Gonzalez, Woods, and Eddins [2009].

Problems
Solutions to the problems marked with an asterisk (*) are in the DIP4E Student Support Package (consult the book
website: www.ImageProcessingPlace.com).

11.1 Do the following: (b) Does a chain-coded closed curve always


a b have an even number of segments? If your
(a) * Provide all the missing steps in Fig. 11.1.
answer is yes, prove it. If it is no, give an ex-
FIGURE 11.66 (a) Matches between the original building image and a rotated version of a segment of its right corner. Show your results using the same format as
ample.
Ten matches were found, of which two are incorrect. (b) Matches between the original image and a half-scaled ver- in that figure.
sion of a segment of its right corner. Here, 11 matches were found, of which four were incorrect. (c) Find the normalized starting point of the
(b) When applied to binary regions, the bound-
code 11076765543322.
ary-following algorithm in Section 11.2 typi-
cally yields boundaries that are one pixel 11.4 Do the following:
of which four were incorrect. Again, these are good results, considering the fact that significant detail
thick, but this is not always the case. Give a (a) * Show that the first difference of a chain code
was lost in the subimage when it was rotated or reduced in size. If asked in both cases: Based solely on small image example in which the boundary normalizes it to rotation, as explained in
the matches found by SIFT, from which part of the building did the two subimages come? The obvious is thicker than one pixel in at least one place. Section 11.2.
answer in both is that the subimages are from the right corner of the building. The preceding two tests
11.2 With reference to the Moore boundary-following (b) Compute the first difference of the code
illustrate the adaptability of SIFT to variations in rotation and scale.
algorithm explained in Section 11.2, answer the 0101030303323232212111.
following, using the same grid as in Fig. 11.2 to
11.5 Answer the following:
identify boundary points in your explanation
(a) * Given a one-pixel-thick, open or closed,
Summary, References, and Further Reading [remember, the origin is at (1, 1), instead of our
4-connected simple (does not intersect
usual (0, 0)]. Include the position of points b
Feature extraction is a fundamental process in the operation of most automated image processing applications. and c at each point you mention. itself) digital curve, can a slope chain code
As indicated by the range of feature detection and description techniques covered in this chapter, the choice of be formulated so that it behaves exactly as
one method over another is determined by the problem under consideration. The objective is to choose feature (a) * Give the coordinates in Fig. 11.2(a) at which
a Freeman chain code? If your answer is no,
descriptors that “capture” essential differences between objects, or classes of objects, while maintaining as much the algorithm starts and ends. What would
explain why. If your answer is yes, explain
independence as possible to changes in variables such as location, scale, orientation, illumination, and viewing angle. it do when it arrived at the end point of the
how you would do it, detailing any assump-
The Freeman chain code discussed in Section 11.2 was first proposed by Freeman [1961, 1974], while the slope boundary?
tions you need to make for your answer to
chain code is due to Bribiesca [2013]. See Klette and Rosenfeld [2004] regarding the minimum-perimeter polygon (b) How would the algorithm behave when hold.
algorithm. For additional reading on signatures see Ballard and Brown [1982]. The medial axis transform is gener- it arrived at the intersection point in Fig. (b) Repeat (a) for an 8-connected curve.
ally credited to Blum [1967]. For efficient computation of the Euclidean distance transform used for skeletonizing 11.2(b) for the first time, and then for the
see Maurer et al. [2003]. second time? (c) How would you normalize a slope chain code
For additional reading on the basic boundary feature descriptors in Section 11.3, see Rosenfeld and Kak [1982]. for scale changes?
11.3 Answer the following:
The discussion on shape numbers is based on the work of Bribiesca and Guzman [1980]. For additional reading on 11.6 * Explain why a slope chain code with an angle
Fourier descriptors, see the early paper by Zahn and Roskies [1972]. For an example of current uses of this tech- (a) * Does normalizing the Freeman chain code accuracy of 10 −1 produces 19 symbols.
nique, see Sikic and Konjicila [2016]. The discussion on statistical moments as boundary descriptors is from basic of a closed curve so that the starting point
11.7 Let L be the length of the straight-line segments
probability (for example, see Montgomery and Runger [2011]). is the smallest integer always give a unique
used in a slope chain code. Assume that L is such
For additional reading on the basic region descriptors discussed in Section 11.4, see Rosenfeld and Kak [1982]. starting point?
that an integral number of line segments fit the
For further introductory reading on texture, see Haralick and Shapiro [1992] and Shapiro and Stockman [2001].

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 897 6/16/2017 2:16:21 PM DIP4E_GLOBAL_Print_Ready.indb 898 6/16/2017 2:16:21 PM


Problems 899 900 Chapter 11 Feature Extraction

curve under consideration. Assume also that the the following boundaries, and plot the signatures. 11.19 * Give the smallest number of statistical moment 11.27 For a set of images of size 64 × 64, assume that
angle accuracy is high enough so that it may be (a) * An equilateral triangle. descriptors needed to differentiate between the the covariance matrix given in Eq. (11-52) is
considered infinite for your purposes, answer the signatures of the figures in Fig. 11.10. the identity matrix. What would be the mean
following: (b) A rectangle. squared error between the original images and
11.20 Give two boundary shapes that have the same
(a) * What is the tortuosity of a square boundary (c) An ellipse mean and third statistical moment descriptors, images reconstructed using Eq. (11-54) with only
of size d × d ? but different second moments. half of the original eigenvectors?
11.14 Do the following:
(b) * What is the tortuosity of a circle of radius r? 11.21 * Propose a set of descriptors capable of differen- 11.28 Under what conditions would you expect the
(a) * With reference to Figs. 11.11(c) and (f), give
tiating between the shapes of the characters 0, 1, major axes of a boundary, defined in the discus-
(c) What is the tortuosity of a closed convex a word description of an algorithm for count-
8, 9, and X. (Hint: Use topological descriptors in sion of Eq. (11-4), to be equal to the eigen axes of
curve? ing the peaks in the two waveforms. Such an
conjunction with the convex hull.) that boundary?
11.8 * Advance an argument that explains why the algorithm would allow us to differentiate
between triangles and rectangles. 11.22 Consider a binary image of size 200 × 200 pix- 11.29 *You are contracted to design an image process-
uppermost-leftmost point of a digital closed
els, with a vertical black band extending from ing system for detecting imperfections on the
curve has the property that a polygonal approxi- (b) How can you make your solution indepen-
columns 1 to 99 and a vertical white band extend- inside of certain solid plastic wafers. The wafers
mation to the curve has a convex vertex at that dent of scale changes? You may assume that
ing from columns 100 to 200. are examined using an X-ray imaging system,
point. the scale changes are the same in both direc-
which yields 8-bit images of size 512 × 512. In
11.9 With reference to Example 11.2, start with vertex tions. (a) Obtain the co-occurrence matrix of this the absence of imperfections, the images appear
V7 and apply the MPP algorithm through, and 11.15 Draw the medial axis of: image using the position operator “one pixel uniform, having a mean intensity of 100 and vari-
including, V11 . to the right.” ance of 400. The imperfections appear as blob-
(a) * A circle.
11.10 Do the following: (b) * Normalize this matrix so that its elements like regions in which about 70% of the pixels
(b) * A square.
become probability estimates, as explained have excursions in intensity of 50 intensity levels
(a) * Explain why the rubber-band polygonal (c) An equilateral triangle. in Section 11.4. or less about a mean of 100. A wafer is consid-
approximation approach discussed in Sec-
11.16 For the figure shown, ered defective if such a region occupies an area
tion 11.2 yields a polygon with minimum (c) Use your matrix from (b) to compute the six
exceeding 20 × 20 pixels in size. Propose a system
perimeter for a convex curve. (a) * What is the order of the shape number? descriptors in Table 11.3.
based on texture analysis for solving this prob-
(b) Show that if each cell corresponds to a pixel (b) Obtain the shape number. 11.23 Consider a checkerboard image composed of lem.
on the boundary, the maximum possible alternating black and white squares, each of size
11.30 With reference to Fig. 11.46, answer the following:
error in that cell is 2d, where d is the mini- m × m pixels. Give a position operator that will
mum possible horizontal or vertical distance yield a diagonal co-occurrence matrix. (a) * What is the cause of nearly identical clusters
between adjacent pixels (i.e., the distance near the origin in Figs. 11.46(d)-(f).
11.24 Obtain the gray-level co-occurrence matrix of
between lines in the sampling grid used to an array pattern of alternating single 0’s and 1’s (b) Look carefully, and you will see a single point
produce the digital image). (starting with 0) if: near coordinates (0.8, 0.8) in Fig. 11.46(f).
11.11 Explain how the MPP algorithm in Section 11.2 What caused this point?
11.17 * The procedure discussed in Section 11.3 for using (a) * The position operator Q is defined as “one
behaves under the following conditions: (c) The results in Fig. 11.46(d)–(e) are for
Fourier descriptors consists of expressing the pixel to the right.”
(a) * One-pixel wide, one-pixel deep indentations. the small image patches shown in Figs.
coordinates of a contour as complex numbers, (b) The position operator Q is defined as “two 11.46(a)–(b). What would the results look
(b) * One-pixel wide, two-or-more pixel deep taking the DFT of these numbers, and keeping pixels to the right.” like if we performed the computations over
indentations. only a few components of the DFT as descriptors
11.25 Do the following. the entire image, instead of limiting the com-
of the boundary shape. The inverse DFT is then
(c) One-pixel wide, n-pixel long protrusions. putation to the patches?
an approximation to the original contour. What (a) * Prove the validity of Eqs. (11-50) and (11-51).
11.12 Do the following. class of contour shapes would have a DFT con- 11.31 When we discussed the Harris-Stephens corner
sisting of real numbers, and how would the axis (b) Prove the validity of Eq. (11-52). detector, we mentioned that there is a closed-form
(a) * Plot the signature of a square boundary using
the tangent-angle method discussed in Sec- system in Fig. 11.18 have to be set up to obtain 11.26 * We mentioned in Example 11.16 that a credible formula for computing the eigenvalues of a 2 × 2
tion 11.2. those real numbers? job could be done of reconstructing approxima- matrix.
11.18 Show that if you use only two Fourier descrip- tions to the six original images by using only the (a) * Given matrix M = [a b; c d], give the gen-
(b) Repeat (a) for the slope density function.
tors (u = 0 and u = 1) to reconstruct a bound- two principal-component images associated with eral formula for finding its eigenvalues.
Assume that the square is aligned with the x-
ary with Eq. (11-10), the result will always be a the largest eigenvalues. What would be the mean Express your formula in terms of the trace
and y-axes, and let the x-axis be the reference
circle. (Hint: Use the parametric representation squared error incurred in doing so? Express your and determinant of M.
line. Start at the corner closest to the origin.
of a circle in the complex plane, and express the answer as a percentage of the maximum possible
11.13 Find an expression for the signature of each of error. (b) Give the formula for symmetric matrices of
equation of a circle in polar coordinates.)

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 899 6/16/2017 2:16:22 PM DIP4E_GLOBAL_Print_Ready.indb 900 6/16/2017 2:16:22 PM


Problems 901 902 Chapter 11 Feature Extraction

size 2 × 2 in terms of its four elements, with- 11.38 A company that bottles a variety of industrial image: (1) Determine the ratio of the area occu- your report, state the physical dimensions of the
out using the trace nor the determinant. chemicals employs you to design an approach for pied by bubbles to the total area of the image; smallest bubble your solution can detect. State
detecting when bottles of their product are not and (2) count the number of distinct bubbles. clearly all assumptions that you make and that
11.32 * With reference to the component tree in Fig.
full. As they move along a conveyor line past an Based on the material you have learned up to are likely to impact the solution you propose.
11.51, assume that any pixels extending past the
automatic filling and capping station, the bottles this point, propose a solution to this problem. In
border of the small image are 0. Is region R1 an
extremal region? Explain. appear as shown in the following image. A bottle
is considered imperfectly filled when the level
11.33 With reference to the discussion of maximally of the liquid is below the midway point between
stable extremal regions in Section 11.6, can the the bottom of the neck and the shoulder of the
root of a component tree contain an MSER? bottle. The shoulder is defined as the intersection
Explain. of the sides and slanted portions of the bottle.
11.34 * The well known heat-diffusion equation of a The bottles move at a high rate of speed, but the
temperature function g( x, y, z, t ) of three spatial company has an imaging system equipped with
variables, ( x, y, z), is given by ∂g ∂t − a( 2 g = 0, an illumination flash front end that effectively
where a is the thermal diffusivity and ( 2 is the stops motion, so you will be given images that
Laplacian operator. In terms of our discussion of look very close to the sample shown here. Based
SIFT, the form of this equation is used to estab- on the material you have learned up to this point,
lish a relationship between the difference of propose a solution for detecting bottles that are
Gaussians and the scaled Laplacian, s 2 ( 2 . Show not filled properly. State clearly all assumptions
how this can be done to derive Eq. (11-70). that you make and that are likely to impact the
solution you propose.
11.35 With reference to the SIFT algorithm discussed
in Section 11.7, assume that the input image is
square, of size M × M (with M = 2 n ), and let the
number of intervals per octave be s = 2.
(a) How many smoothed images will there be in
each octave?
(b) * How many octaves could be generated before
it is no longer possible to down-sample the
image by 2? 11.39 Having heard about your success with the
bottle inspection problem, you are contacted by a
(c) If the standard deviation used to smooth fluids company that wishes to automate bubble-
the first image in the first octave is s, what counting in certain processes for quality control.
are the values of standard deviation used to The company has solved the imaging problem
smooth the first image in each of the remain- and can obtain 8-bit images of size 700 × 700 pix-
ing octaves in (b)? els, such as the one shown in the figure below.
11.36 Advance an argument showing that smoothing
an image and then downsampling it by 2 gives
the same result as first downsampling the image
by 2 and then smoothing it with the same kernel.
By downsampling we mean skipping every other
row and column. (Hint: Consider the fact that
convolution is a linear process.)
11.37 Do the following:
(a) * Show how to obtain Eq. (11-74) from Eq.
(11-71).
(b) Show how Eq. (11-75) follows from Eqs. Each image represents an area of 7 cm 2 . The
(11-74) and (11-71). company wishes to do two things with each

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 901 6/16/2017 2:16:24 PM DIP4E_GLOBAL_Print_Ready.indb 902 6/16/2017 2:16:24 PM


904 Chapter 12 Image Pattern Classification

12.1 BACKGROUND

12
12.1

Humans possess the most sophisticated pattern recognition capabilities in the known
biological world. By contrast, the capabilities of current recognition machines pale
in comparison with tasks humans perform routinely, from being able to interpret the
meaning of complex images, to our ability for generalizing knowledge stored in our
Image Pattern Classification brains. But recognition machines play an important, sometimes even crucial role in
everyday life. Imagine what modern life would be like without machines that read
barcodes, process bank checks, inspect the quality of manufactured products, read
fingerprints, sort mail, and recognize speech.
One of the most interesting aspects of the world is that it can be In image pattern recognition, we think of a pattern as a spatial arrangement of
considered to be made up of patterns. features. A pattern class is a set of patterns that share some common properties. Pat-
A pattern is essentially an arrangement. It is characterized by tern recognition by machine encompasses techniques for automatically assigning
the order of the elements of which it is made, rather than by the patterns to their respective classes. That is, given a pattern or sets of patterns whose
intrinsic nature of these elements. class is unknown, the job of a pattern recognition system is to assign a class label to
Norbert Wiener each of its input patterns.
There are four main stages involved in recognition: (1) sensing, (2) preprocessing,
(3) feature extraction, and (4) classification. In terms of image processing, sensing is
concerned with generating signals in a spatial (2-D) or higher-dimensional format.
We covered numerous aspects of image sensing in Chapter 1. Preprocessing deals
with techniques for tasks such as noise reduction, enhancement, restoration, and
Preview segmentation, as discussed in earlier chapters. You learned about feature extraction
We conclude our coverage of digital image processing with an introduction to techniques for image in Chapter 11. Classification, the focus of this chapter, deals with using a set of fea-
pattern classification. The approaches developed in this chapter are divided into three principal catego- tures as the basis for assigning class labels to unknown input image patterns.
ries: classification by prototype matching, classification based on an optimal statistical formulation, and In the following section, we will discuss three basic approaches used for image
pattern classification: (1) classification based on matching unknown patterns against
classification based on neural networks. The first two approaches are used extensively in applications in
specified prototypes, (2) optimum statistical classifiers, and (3) neural networks.
which the nature of the data is well understood, leading to an effective pairing of features and classifier
One way to characterize the differences between these approaches is in the level
design. These approaches often rely on a great deal of engineering to define features and elements of a of “engineering” required to transform raw data into formats suitable for computer
classifier. Approaches based on neural networks rely less on such knowledge, and lend themselves well processing. Ultimately, recognition performance is determined by the discriminative
to applications in which pattern class characteristics (e.g., features) are learned by the system, rather power of the features used.
than being specified a priori by a human designer. The focus of the material in this chapter is on prin- In classification based on prototypes, the objective is to make the features so
ciples, and on how they apply specifically in image pattern classification. unique and easily detectable that classification itself becomes a simple task. A good
example of this are bank-check processors, which use stylized font styles to simplify
Upon completion of this chapter, readers should: machine processing (we will discuss this application in Section 12.3).
In the second category, classification is cast in decision-theoretic, statistical terms,
Understand the meaning of patterns and pat- Understand perceptrons and their history. and the classification approach is based on selecting parameters that can be shown
tern classes, and how they relate to digital to yield optimum classification performance in a statistical sense. Here, emphasis is
Be familiar with the concept of learning from
image processing. placed on both the features used, and the design of the classifier. We will illustrate
training samples.
Be familiar with the basics of minimum-dis- this approach in Section 12.4 by deriving the Bayes pattern classifier, starting from
Understand neural network architectures. basic principles.
tance classification.
Be familiar with the concept of deep learning In the third category, classification is performed using neural networks. As you
Know how to apply image correlation tech-
in fully connected and deep convolutional neu- will learn in Sections 12.5 and 12.6, neural networks can operate using engineered
niques for template matching.
ral networks. In particular, be familiar with the features too, but they have the unique ability of being able to generate, on their own,
Understand the concept of string matching. importance of the latter in digital image pro- representations (features) suitable for recognition. These systems can accomplish
Be familiar with Bayes classifiers. cessing. this using raw data, without the need for engineered features.

903

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 903 6/16/2017 2:16:24 PM DIP4E_GLOBAL_Print_Ready.indb 904 6/16/2017 2:16:24 PM


12.1 Background 905 906 Chapter 12 Image Pattern Classification

One characteristic shared by the preceding three approaches is that they are distributions. Starting with Section 12.5, we will spend the rest of the chapter discuss-
based on parameters that must be either specified or learned from patterns that rep- ing neural networks. We will begin Section 12.5 with a brief introduction to percep-
resent the recognition problem we want to solve. The patterns can be labeled, mean- trons and some historical facts about machine learning. Then, we will introduce the
ing that we know the class of each pattern, or unlabeled, meaning that the data are concept of deep neural networks and derive the equations of backpropagation, the
known to be patterns, but the class of each pattern is unknown. A classic example method of choice for training deep neural nets. These networks are well-suited for
of labeled data is the character recognition problem, in which a set of character applications in which input patterns are vectors. In Section 12.6, we will introduce
samples is collected and the identity of each character is recorded as a label from deep convolutional neural networks, which currently are the preferred approach
the group 0 through 9 and a through z. An example of unlabeled data is when we are when the system inputs are digital images. After deriving the backpropagation equa-
seeking clusters in a data set, with the aim of utilizing the resulting cluster centers as tions used for training convolutional nets, we will give several examples of appli-
being prototypes of the pattern classes contained in the data. cations involving classes of images of various complexities. In addition to working
When working with a labeled data, a given data set generally is subdivided into directly with image inputs, deep convolutional nets are capable of learning, on their
three subsets: a training set, a validation set, and a test set (a typical subdivision might own, image features suitable for classification. This is accomplished starting with raw
Because the examples in
(The data sets used in this chapter are intended to demonstrate basic principles and are not large scale, so we dispense with validation and subdivide the pattern data into training and test sets.)

…be 50% training, and 25% each for the validation and test sets). The process by which a training set is used to generate classifier parameters is called training. In this mode, a classifier is given the class label of each pattern, the objective being to make adjustments in the parameters if the classifier makes a mistake in identifying the class of the given pattern. At this point, we might be working with several candidate designs. At the end of training, we use the validation set to compare the various designs against a performance objective. Typically, several iterations of training/validation are required to establish the design that comes closest to meeting the desired objective. Once a design has been selected, the final step is to determine how it will perform "in the field." For this, we use the test set, which consists of patterns that the system has never "seen" before. If the training and validation sets are truly representative of the data the system will encounter in practice, the results of training/validation should be close to the performance using the test set. If training/validation results are acceptable, but test results are not, we say that training/validation "overfit" the system parameters to the available data, in which case further work on the system architecture is required. Of course, all this assumes that the given data are truly representative of the problem we want to solve, and that the problem in fact can be solved by available technology.

A system that is designed using training data is said to undergo supervised learning. If we are working with unlabeled data, the system learns the pattern classes themselves while in an unsupervised learning mode. In this chapter, we deal only with supervised learning. As you will see in this and the next chapter, supervised learning covers a broad range of approaches, from applications in which a system learns parameters of features whose form is fixed by a designer, to systems that utilize deep learning and large sets of raw data to learn, on their own, the features required for classification. These systems accomplish this task without a human designer having to specify the features a priori. (Generally, we associate the concept of deep learning with large sets of data. These ideas are discussed in more detail later in this section and the next.)

After a brief discussion in the next section of how patterns are formed, and of the nature of pattern classes, we will discuss in Section 12.3 various approaches for prototype-based classification. In Section 12.4, we will start from basic principles and derive the equations of the Bayes classifier, an approach characterized by optimum classification performance on an average basis. We will also discuss supervised training of a Bayes classifier based on the assumption of multivariate Gaussian densities. The last two sections of the chapter deal with neural networks, including deep convolutional networks capable of learning features directly from raw image data, as opposed to the other classification methods discussed in Sections 12.3 and 12.4, which rely on "engineered" features whose form, as noted earlier, is specified a priori by a human designer.

12.2 PATTERNS AND PATTERN CLASSES

In image pattern classification, the two principal pattern arrangements are quantitative and structural. Quantitative patterns are arranged in the form of pattern vectors. Structural patterns typically are composed of symbols, arranged in the form of strings, trees, or, less frequently, as graphs. Most of the work in this chapter is based on pattern vectors, but we will discuss structural patterns briefly at the end of this section, and give an example at the end of Section 12.3.

PATTERN VECTORS

Pattern vectors are represented by lowercase letters, such as x, y, and z, and have the form

    x = [x_1  x_2  …  x_n]^T        (12-1)

where each component, x_i, represents the ith feature descriptor, and n is the total number of such descriptors. We can express a vector in the form of a column, as in Eq. (12-1), or in the equivalent row form x = (x_1, x_2, …, x_n)^T, where T indicates transposition. A pattern vector may be "viewed" as a point in n-dimensional Euclidean space, and a pattern class may be interpreted as a "hypercloud" of points in this pattern space. For the purpose of recognition, we like for our pattern classes to be grouped tightly, and as far away from each other as possible.

Pattern vectors can be formed directly from image pixel intensities by vectorizing the image using, for example, linear indexing, as in Fig. 12.1 (we discussed linear indexing in Section 2.4; see Fig. 2.22). A more common approach is for pattern elements to be features. An early example is the work of Fisher [1936] who, close to a century ago, reported the use of what then was a new


FIGURE 12.1 Using linear indexing to vectorize a grayscale image.

technique called discriminant analysis to recognize three types of iris flowers (Iris setosa, virginica, and versicolor). Fisher described each flower using four features: the length and width of the petals, and similarly for the sepals (sepals are the undergrowth beneath the petals; see Fig. 12.2). This leads to the 4-D vectors shown in the figure. A set of these vectors, obtained for fifty samples of each flower gender, constitutes the three famous Fisher iris pattern classes. Had Fisher been working today, he probably would have added spectral colors and shape features to his measurements, yielding vectors of higher dimensionality. We will be working with the original iris data set later in this chapter.

FIGURE 12.2 Petal and sepal width and length measurements (see arrows) performed on iris flowers for the purpose of data classification. The resulting pattern vector is x = [x_1 x_2 x_3 x_4]^T, with x_1 = petal width, x_2 = petal length, x_3 = sepal width, and x_4 = sepal length. The image shown is of the Iris virginica gender. (Image courtesy of USDA.)

A higher-level representation of patterns is based on feature descriptors of the types you learned in Chapter 11. For instance, pattern vectors formed from descriptors of boundary shape are well-suited for applications in controlled environments, such as industrial inspection. Figure 12.3 illustrates the concept. Here, we are interested in classifying different types of noisy shapes, a sample of which is shown in the figure. If we represent an object by its signature, we would obtain 1-D signals of the form shown in Fig. 12.3(b). We can express a signature as a vector by sampling its amplitude at increments of u, then forming a vector by letting x_i = r(u_i), for i = 1, 2, …, n. Instead of using "raw" sampled signatures, a more common approach is to compute some function, x_i = g(r(u_i)), of the signature samples and use them to form vectors. You learned in Section 11.3 several approaches to do this, such as statistical moments.

FIGURE 12.3 (a) A noisy object boundary, and (b) its corresponding signature, from which a vector with components x_i = g(r(u_i)) is formed.

Vectors can be formed also from features of both boundaries and regions. For example, the objects in Fig. 12.4 can be represented by 3-D vectors whose components capture shape information related to both boundary and region properties of single binary objects. Pattern vectors can be used also to represent properties of image regions. For example, the elements of the 6-D vector in Fig. 12.5 are texture measures based on the feature descriptors in Table 11.3. Figure 12.6 shows an example in which pattern vector elements are features that are invariant to transformations, such as image rotation and scaling (see Section 11.4).

FIGURE 12.4 Pattern vectors whose components capture both boundary and regional characteristics of single binary objects: x_1 = compactness, x_2 = circularity, x_3 = eccentricity.

When working with sequences of registered images, we have the option of using pattern vectors formed from corresponding pixels in those images (see Fig. 12.7). Forming pattern vectors in this way implies that recognition will be based on information extracted from the same spatial location across the images. Although this may seem like a very limiting approach, it is ideally suited for applications such as recognizing regions in multispectral images, as you will see in Section 12.4.

When working with entire images as units, we need the detail afforded by vectors of much-higher dimensionality, such as those we discussed in Section 11.7 in connection with the SIFT algorithm. However, a more powerful approach when working with entire images is to use deep convolutional neural networks. We will discuss neural nets in detail in Sections 12.5 and 12.6.
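The two simplest constructions just mentioned (vectorizing an image by linear indexing, and sampling a boundary signature) are easy to express in code. The sketch below is illustrative only and is not part of the original text; it assumes NumPy, and the sample values shown are arbitrary.

```python
import numpy as np

def vectorize_image(img):
    """Form a pattern vector from a grayscale image by column-major
    (linear-index) scanning, in the spirit of Fig. 12.1."""
    return np.asarray(img).flatten(order='F')

def signature_vector(r_samples, g=None):
    """Form a pattern vector x_i = g(r(u_i)) from sampled signature values.
    If no function g is given, the raw samples are used."""
    r_samples = np.asarray(r_samples, dtype=float)
    return r_samples if g is None else np.asarray([g(s) for s in r_samples])

img = np.array([[ 90, 120],
                [200,  35]])                 # arbitrary 2 x 2 example
x1 = vectorize_image(img)                    # -> [ 90, 200, 120, 35 ]
x2 = signature_vector([1.0, 1.4, 1.0, 1.4], g=lambda s: s ** 2)
```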

STRUCTURAL PATTERNS

Pattern vectors are not suitable for applications in which objects are represented by structural features, such as strings of symbols. Although they are used much less than vectors in image processing applications, patterns containing structural descriptions of objects are important in applications where shape is of interest. Figure 12.8 shows an example. The boundaries of the bottles were approximated by a polygon using the approach explained in Section 11.2. The boundary is subdivided into line segments (denoted by b in the figure), and the interior angle, u, is computed at each intersection of two line segments. A string of sequential symbols is generated as the boundary is traversed in the counterclockwise direction, as the figure shows. Strings of this form are structural patterns, and the objective, as you will see in Section 12.3, is to match a given string against stored string prototypes.

FIGURE 12.5 An example of pattern vectors based on properties of subimages: x_1 = maximum probability, x_2 = correlation, x_3 = contrast, x_4 = uniformity, x_5 = homogeneity, x_6 = entropy. See Table 11.3 for an explanation of the components of x.

FIGURE 12.6 Feature vectors with components that are invariant to transformations such as rotation, scaling, and translation. The vector components are moment invariants.

FIGURE 12.7 Pattern (feature) vectors formed by concatenating corresponding pixels from a set of registered multispectral images. (Original images courtesy of NASA.)

FIGURE 12.8 Symbol string generated from a polygonal approximation of the boundaries of medicine bottles; b denotes a line segment of specified length and u an interior angle.

A tree is another structural representation, suitable for higher-level descriptions of an entire image in terms of its component regions. Basically, most hierarchical ordering schemes lead to tree structures. For example, Fig. 12.9 shows a satellite image of a heavily built downtown area and surrounding residential areas. Let the symbol $ represent the root of a tree. The (upside down) tree shown in the figure was obtained using the structural relationship "composed of." Thus, the root of the tree represents the entire image. The next level indicates that the image is composed of a downtown and residential areas. In turn, the residential areas are composed of housing, highways, and shopping malls. The next level down in the tree further describes the housing and highways. We can continue this type of subdivision until we reach the limit of our ability to resolve different regions in the image.

12.3 PATTERN CLASSIFICATION BY PROTOTYPE MATCHING

Prototype matching involves comparing an unknown pattern against a set of prototypes, and assigning to the unknown pattern the class of the prototype that is the most "similar" to the unknown. Each prototype represents a unique pattern class, but there may be more than one prototype for each class. What distinguishes one matching method from another is the measure used to determine similarity.

MINIMUM-DISTANCE CLASSIFIER

One of the simplest and most widely used prototype matching methods is the minimum-distance classifier (also referred to as the nearest-neighbor classifier) which, as its name implies, computes a distance-based measure between an unknown pattern vector and each of the class prototypes. It then assigns the unknown pattern to the class of its closest prototype.


FIGURE 12.9 Tree representation of a satellite image showing a heavily built downtown area (Washington, D.C.) and surrounding residential areas. (Original image courtesy of NASA.)

The prototype vectors of the minimum-distance classifier usually are the mean vectors of the various pattern classes:

    m_j = (1/n_j) Σ_{x ∈ c_j} x,    j = 1, 2, …, N_c        (12-2)

where n_j is the number of pattern vectors used to compute the jth mean vector, c_j is the jth pattern class, and N_c is the number of classes. If we use the Euclidean distance to determine similarity, the minimum-distance classifier computes the distances

    D_j(x) = || x − m_j ||,    j = 1, 2, …, N_c        (12-3)

where || a || = (a^T a)^(1/2) is the Euclidean norm. The classifier then assigns an unknown pattern x to class c_i if D_i(x) < D_j(x) for j = 1, 2, …, N_c, j ≠ i. Ties [i.e., D_i(x) = D_j(x)] are resolved arbitrarily.

It is not difficult to show (see Problem 12.2) that selecting the smallest distance is equivalent to evaluating the functions

    d_j(x) = m_j^T x − (1/2) m_j^T m_j,    j = 1, 2, …, N_c        (12-4)

and assigning an unknown pattern x to the class whose prototype yielded the largest value of d. That is, x is assigned to class c_i if

    d_i(x) > d_j(x),    j = 1, 2, …, N_c; j ≠ i        (12-5)

When used for recognition, functions of this form are referred to as decision or discriminant functions.

The decision boundary separating class c_i from c_j is given by the values of x for which

    d_i(x) = d_j(x)        (12-6)

or, equivalently, by the values of x for which

    d_i(x) − d_j(x) = 0        (12-7)

The decision boundaries for a minimum-distance classifier follow directly from this equation and Eq. (12-4):

    d_ij(x) = d_i(x) − d_j(x) = (m_i − m_j)^T x − (1/2)(m_i − m_j)^T (m_i + m_j) = 0        (12-8)

The boundary given by Eq. (12-8) is the perpendicular bisector of the line segment joining m_i and m_j (see Problem 12.3). In 2-D (i.e., n = 2) the perpendicular bisector is a line, for n = 3 it is a plane, and for n > 3 it is called a hyperplane.

EXAMPLE 12.1 : Illustration of the minimum-distance classifier for two classes in 2-D.

Figure 12.10 shows scatter plots of petal width and length values for the classes Iris versicolor and Iris setosa. As mentioned in the previous section, pattern vectors in the iris database consist of four measurements for each flower. We show only two here so that you can visualize the pattern classes and the decision boundary between them. We will work with the complete database later in this chapter.

We denote the Iris versicolor and setosa data as classes c_1 and c_2, respectively. The means of the two classes are m_1 = (4.3, 1.3)^T and m_2 = (1.5, 0.3)^T. It then follows from Eq. (12-4) that

    d_1(x) = m_1^T x − (1/2) m_1^T m_1 = 4.3 x_1 + 1.3 x_2 − 10.1

and

    d_2(x) = m_2^T x − (1/2) m_2^T m_2 = 1.5 x_1 + 0.3 x_2 − 1.17

From Eq. (12-8), the equation of the boundary is

    d_12(x) = d_1(x) − d_2(x) = 2.8 x_1 + 1.0 x_2 − 8.9 = 0
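A minimal sketch of these computations, assuming NumPy (the test pattern below is arbitrary and not taken from the text):

```python
import numpy as np

def decision_functions(x, means):
    """d_j(x) = m_j^T x - 0.5 m_j^T m_j for each class mean m_j  [Eq. (12-4)]."""
    x = np.asarray(x, dtype=float)
    return np.array([m @ x - 0.5 * (m @ m) for m in means])

m1 = np.array([4.3, 1.3])        # Iris versicolor mean (petal length, petal width)
m2 = np.array([1.5, 0.3])        # Iris setosa mean

x = np.array([4.5, 1.4])                    # an unknown pattern (illustrative)
d = decision_functions(x, [m1, m2])
label = int(np.argmax(d))                   # class index of the largest d_j(x)

# Boundary coefficients from Eq. (12-8): d_12(x) = w^T x + b = 0
w = m1 - m2                                 # = (2.8, 1.0)
b = -0.5 * (m1 - m2) @ (m1 + m2)            # = -8.92, i.e. -8.9 to the rounding used above
```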


FIGURE 12.10 Decision boundary of a minimum-distance classifier (based on two measurements, petal length and petal width in cm) for the classes of Iris versicolor and Iris setosa. The dark dot and square are the means of the two classes.

Figure 12.10 shows a plot of this boundary. Substituting any pattern vector from class c_1 into this equation would yield d_12(x) > 0. Conversely, any pattern from class c_2 would give d_12(x) < 0. Thus, given an unknown pattern x belonging to one of these two classes, the sign of d_12(x) would be sufficient to determine the class to which that pattern belongs.

The minimum-distance classifier works well when the distance between means is large compared to the spread or randomness of each class with respect to its mean. In Section 12.4 we will show that the minimum-distance classifier yields optimum performance (in terms of minimizing the average loss of misclassification) when the distribution of each class about its mean is in the form of a spherical "hypercloud" in n-dimensional pattern space.

As noted earlier, one of the keys to accurate recognition performance is to specify features that are effective discriminators between classes. As a rule, the better the features are at meeting this objective, the better the recognition performance will be. In the case of the minimum-distance classifier, this implies wide separation between means and tight grouping of the classes.

Systems based on the American Bankers Association E-13B font character set are a classic example of how highly engineered features can be used in conjunction with a simple classifier to achieve superior results. In the mid-1940s, bank checks were processed manually, which was a laborious, costly process prone to mistakes. As the volume of check writing increased in the early 1950s, banks became keenly interested in automating this task. In the middle 1950s, the E-13B font and the system that reads it became the standard solution to the problem. As Fig. 12.11 shows, this font set consists of 14 characters laid out on a 9 × 7 grid. The characters are stylized to maximize the difference between them. The font was designed to be compact and readable by humans, but the overriding purpose was that the characters should be readable by machine, quickly, and with very high accuracy.

FIGURE 12.11 The American Bankers Association E-13B font character set and corresponding waveforms.

In addition to a stylized font design, the operation of the reading system is further enhanced by printing each character using an ink that contains finely ground magnetic material. To improve character detectability in a check being read, the ink is subjected to a magnetic field that accentuates each character against the background. (Appropriately, recognition of magnetized characters is referred to as Magnetic Ink Character Recognition, or MICR.) The stylized design further enhances character detectability. The characters are scanned in a horizontal direction with a single-slit reading head that is narrower but taller than the characters. As a check passes through the head, the sensor produces a 1-D electrical signal (a signature) that is conditioned to be proportional to the rate of increase or decrease of the character area under the head. For example, consider the waveform of the number 0 in Fig. 12.11. As a check moves to the right past the head, the character area seen by the sensor begins to increase, producing a positive derivative (a positive rate of change). As the right leg of the character begins to pass under the head, the character area seen by the sensor begins to decrease, producing a negative derivative. When the head is in the middle zone of the character, the area remains nearly constant, producing a zero derivative. This waveform repeats itself as the other leg of the character enters the head. The design of the font ensures that the waveform of each character is distinct from all others. It also ensures that the peaks and zeros of each waveform occur approximately on the vertical lines of the background grid on which these waveforms are displayed, as the figure shows.


The E-13B font has the property that sampling the waveforms only at these (nine) points yields enough information for their accurate classification. The effectiveness of these highly engineered features is further refined by the magnetized ink, which results in clean waveforms with almost no scatter.

Designing a minimum-distance classifier for this application is straightforward. We simply store the sample values of each waveform at the vertical lines of the grid, and let each set of the resulting samples be represented as a 9-D prototype vector, m_j, j = 1, 2, …, 14. When an unknown character is to be classified, the approach is to scan it in the manner just described, express the grid samples of the waveform as a 9-D vector, x, and identify its class by selecting the class of the prototype vector that yields the highest value in Eq. (12-4). We do not even need a computer to do this. Very high classification speeds can be achieved with analog circuits composed of resistor banks (see Problem 12.4).

The most important lesson in this example is that a recognition problem often can be made trivial if we can control the environment in which the patterns are generated. The development and implementation of the E-13B font reading system is a striking example of this fact. On the other hand, this system would be inadequate if we added the requirement that it has to recognize the textual content and signature written on each check. For this, we need systems that are significantly more complex, such as the convolutional neural networks we will discuss in Section 12.6.

USING CORRELATION FOR 2-D PROTOTYPE MATCHING

We introduced the basic idea of spatial correlation and convolution in Section 3.4, and used these concepts extensively in Chapter 3 for spatial filtering. From Eq. (3-34), we know that correlation of a kernel w with an image f(x, y) is given by

    (w ☆ f)(x, y) = Σ_s Σ_t w(s, t) f(x + s, y + t)        (12-9)

where the limits of summation are taken over the region shared by w and f. This equation is evaluated for all values of the displacement variables x and y so that all elements of w visit every pixel of f. As you know, correlation has its highest value(s) in the region(s) where f and w are equal or nearly equal. In other words, Eq. (12-9) finds locations where w matches a region of f. But this equation has the drawback that the result is sensitive to changes in the amplitude of either function. In order to normalize correlation to amplitude changes in one or both functions, we perform matching using the correlation coefficient instead:

    g(x, y) = Σ_s Σ_t [w(s, t) − w̄][f(x + s, y + t) − f̄_xy] / { Σ_s Σ_t [w(s, t) − w̄]^2  Σ_s Σ_t [f(x + s, y + t) − f̄_xy]^2 }^(1/2)        (12-10)

where the limits of summation are taken over the region shared by w and f, w̄ is the average value of the kernel (computed only once), and f̄_xy is the average value of f in the region coincident with w. In image correlation work, w is often referred to as a template (i.e., a prototype subimage) and correlation is referred to as template matching.

(To be formal, we should refer to correlation, and the correlation coefficient, as cross-correlation when the functions are different, and as autocorrelation when they are the same. However, it is customary to use the generic terms correlation and correlation coefficient, except when the distinction is important, as in deriving equations in which it makes a difference which is being applied.)

It can be shown (see Problem 12.5) that g(x, y) has values in the range [−1, 1] and is thus normalized to changes in the amplitudes of w and f. The maximum value of g occurs when the normalized w and the corresponding normalized region in f are identical. This indicates maximum correlation (the best possible match). The minimum occurs when the two normalized functions exhibit the least similarity in the sense of Eq. (12-10).

FIGURE 12.12 The mechanics of template matching: a template w of size m × n, centered at an arbitrary location (x, y), is moved over a padded image f.

Figure 12.12 illustrates the mechanics of the procedure just described. The border around image f is padding, as explained in Section 3.4. In template matching, values of correlation when the center of the template is past the border of the image generally are of no interest, so the padding is limited to half the kernel width.

The template in Fig. 12.12 is of size m × n, and it is shown with its center at an arbitrary location (x, y). The value of the correlation coefficient at that point is computed using Eq. (12-10). Then, the center of the template is incremented to an adjacent location and the procedure is repeated. Values of the correlation coefficient g(x, y) are obtained by moving the center of the template (i.e., by incrementing x and y) so the center of w visits every pixel in f. At the end of the procedure, we look for the maximum in g(x, y) to find where the best match occurred. It is possible to have multiple locations in g(x, y) with the same maximum value, indicating several matches between w and f.
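The sketch below is an illustrative, direct (and deliberately slow) implementation of Eq. (12-10), assuming NumPy, an odd-sized template, and zero padding of half the template size as in Fig. 12.12. It is not the implementation used to generate the results that follow.

```python
import numpy as np

def correlation_coefficient_map(f, w):
    """Correlation coefficient g(x, y) of template w with image f  [Eq. (12-10)]."""
    f = np.asarray(f, dtype=float)
    w = np.asarray(w, dtype=float)
    m, n = w.shape                          # template size (assumed odd)
    a, b = m // 2, n // 2
    fp = np.pad(f, ((a, a), (b, b)))        # zero padding of half the template size
    wz = w - w.mean()                       # zero-mean template (computed once)
    wnorm = np.sqrt((wz ** 2).sum())
    g = np.zeros(f.shape, dtype=float)
    for x in range(f.shape[0]):
        for y in range(f.shape[1]):
            region = fp[x:x + m, y:y + n]   # region of f coincident with w
            rz = region - region.mean()
            denom = wnorm * np.sqrt((rz ** 2).sum())
            g[x, y] = (wz * rz).sum() / denom if denom > 0 else 0.0
    return g                                # values in [-1, 1]

# g = correlation_coefficient_map(image, template)
# best_match = np.unravel_index(np.argmax(g), g.shape)
```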


EXAMPLE 12.2 : Matching by correlation.

Figure 12.13(a) shows a 913 × 913 satellite image of 1992 Hurricane Andrew, in which the eye of the storm is clearly visible. We want to use correlation to find the location of the best match in Fig. 12.13(a) of the template in Fig. 12.13(b), which is a 31 × 31 subimage of the eye of the storm. Figure 12.13(c) shows the result of computing the correlation coefficient in Eq. (12-10) for all values of x and y in the original image. The size of this image was 943 × 943 pixels due to padding (see Fig. 12.12), but we cropped it to the size of the original image for display. The intensity in this image is proportional to the correlation values, and all negative correlations were clipped at 0 (black) to simplify the visual analysis of the image. The area of highest correlation values appears as a small white region in this image. The brightest point in this region matches with the center of the eye of the storm. Figure 12.13(d) shows as a white dot the location of this maximum correlation value (in this case there was a unique match whose maximum value was 1), which we see corresponds closely with the location of the eye in Fig. 12.13(a).

FIGURE 12.13 (a) 913 × 913 satellite image of Hurricane Andrew. (b) 31 × 31 template of the eye of the storm. (c) Correlation coefficient shown as an image (note the brightest point, indicated by an arrow). (d) Location of the best match (identified by the arrow). This point is a single pixel, but its size was enlarged to make it easier to see. (Original image courtesy of NOAA.)

MATCHING SIFT FEATURES

We discussed the scale-invariant feature transform (SIFT) in Section 11.7. SIFT computes a set of invariant features that can be used for matching between known (prototype) and unknown images. The SIFT implementation in Section 11.7 yields 128-dimensional feature vectors for each local region in an image. SIFT performs matching by looking for correspondences between sets of stored feature vector prototypes and feature vectors computed for an unknown image. Because of the large number of features involved, searching for exact matches is computationally intensive. Instead, the approach is to use a best-bin-first method that can identify the nearest neighbors with high probability using only a limited amount of computation (see Lowe [1999], [2004]). The search is further simplified by looking for clusters of potential solutions using the generalized Hough transform proposed by Ballard [1981]. We know from the discussion in Section 10.2 that the Hough transform simplifies looking for data patterns by utilizing bins that reduce the level of detail with which we look at a data set. We already discussed the SIFT algorithm in Section 11.7. The focus in this section is to further illustrate the capabilities of SIFT for prototype matching.

Figure 12.14 shows the circuit board image we have used several times before. The small rectangle enclosing the rightmost connector on the top of the large image identifies an area from which an image of the connector was extracted. The small image is shown zoomed for clarity. The sizes of the large and small images are shown in the figure caption. Figure 12.15 shows the keypoints found by SIFT, as explained in Section 11.7. They are visible as faint lines on both images. The zoomed view of the subimage shows them a little clearer. It is important to note that the keypoints for the image and subimage were found independently by SIFT. The large image had 2714 keypoints, and the small image had 35.

Figure 12.16 shows the matches between keypoints found by SIFT. A total of 41 matches were found between the two images.

FIGURE 12.14 Circuit board image of size 948 × 915 pixels, and a subimage of one of the connectors. The subimage is of size 212 × 128 pixels, shown zoomed on the right for clarity. (Original image courtesy of Mr. Joseph E. Pascente, Lixi, Inc.)

FIGURE 12.15 Keypoints found by SIFT. The large image has 2714 keypoints (visible as faint gray lines). The subimage has 35 keypoints. This is a separate image, and SIFT found its keypoints independently of the large image. The zoomed section is shown for clarity.


Because there are only 35 keypoints in the small image, obviously at least six matches are either incorrect, or there are multiple matches. Three of the errors are clearly visible as matches with connectors in the middle of the large image. However, if you compare the shape of the connectors in the middle of the large image, you can see that they are virtually identical to parts of the connectors on the right. Therefore, these errors can be explained on that basis. The other three extra matches are easier to explain. All connectors on the top right of the circuit board are identical, and we are comparing one of them against the rest. There is no way for a system to tell the difference between them. In fact, by looking at the connecting lines, we can see that the matches are between the subimage and all five connectors. These in fact are correct matches between the subimage and other connectors that are identical to it.

FIGURE 12.16 Matches found by SIFT between the large and small images. A total of 41 matching pairs were found. They are shown connected by straight lines. Only three of the matches were "real" errors (labeled "Errors" in the figure).

MATCHING STRUCTURAL PROTOTYPES

The techniques discussed up to this point deal with patterns quantitatively, and largely ignore any structural relationships inherent in pattern shapes. The methods discussed in this section seek to achieve pattern recognition by capitalizing precisely on these types of relationships. In this section, we introduce two basic approaches for the recognition of boundary shapes based on string representations, which are the most practical approach in structural pattern recognition.

Matching Shape Numbers

A procedure similar in concept to the minimum-distance classifier introduced earlier for pattern vectors can be formulated for comparing region boundaries that are described by shape numbers. With reference to the discussion in Section 11.3, the degree of similarity, k, between two region boundaries is defined as the largest order for which their shape numbers still coincide. For example, let a and b denote the shape numbers of closed boundaries represented by 4-directional chain codes. These two shapes have a degree of similarity k if

    s_j(a) = s_j(b)   for j = 4, 6, 8, …, k;  and
    s_j(a) ≠ s_j(b)   for j = k + 2, k + 4, …        (12-11)

where s indicates shape number and the subscript indicates shape order. (Parameter j starts at 4 and is always even because we are working with 4-connectivity, and we require that boundaries be closed.) The distance between two shapes a and b is defined as the inverse of their degree of similarity:

    D(a, b) = 1/k        (12-12)

This expression satisfies the following properties:

    D(a, b) ≥ 0
    D(a, b) = 0 if and only if a = b        (12-13)
    D(a, c) ≤ max[ D(a, b), D(b, c) ]

Either k or D may be used to compare two shapes. If the degree of similarity is used, the larger k is, the more similar the shapes are (note that k is infinite for identical shapes). The reverse is true when Eq. (12-12) is used.

EXAMPLE 12.3 : Matching shape numbers.

Suppose we have a shape, f, and want to find its closest match in a set of five shape prototypes, denoted by a, b, c, d, and e, as shown in Fig. 12.17(a). The search may be visualized with the aid of the similarity tree in Fig. 12.17(b). The root of the tree corresponds to the lowest possible degree of similarity, which is 4. Suppose the shapes are identical up to degree 8, with the exception of shape a, whose degree of similarity with respect to all other shapes is 6. Proceeding down the tree, we find that shape d has degree of similarity 8 with respect to all others, and so on. Shapes f and c match uniquely, having a higher degree of similarity than any other two shapes. Conversely, if a had been an unknown shape, all we could have said using this method is that a was similar to the other five shapes with degree of similarity 6. The same information can be summarized in the form of the similarity matrix in Fig. 12.17(c).
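A small illustrative sketch of Eqs. (12-11) and (12-12) follows. It assumes the shape numbers themselves have already been computed (as in Section 11.3) and are supplied as dictionaries mapping each even order to the corresponding shape-number string; the handling of the case in which no order matches is an added assumption, not part of the text.

```python
def degree_of_similarity(sa, sb, max_order):
    """Largest order k for which the shape numbers of a and b coincide [Eq. (12-11)].
    sa, sb: dicts mapping order (4, 6, 8, ...) to a shape-number string."""
    k = 0
    for order in range(4, max_order + 1, 2):      # orders are even and start at 4
        if order in sa and order in sb and sa[order] == sb[order]:
            k = order
        else:
            break
    return k

def shape_distance(sa, sb, max_order):
    """D(a, b) = 1/k  [Eq. (12-12)]; returned as infinity if no order matches."""
    k = degree_of_similarity(sa, sb, max_order)
    return float('inf') if k == 0 else 1.0 / k
```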


FIGURE 12.17 (a) Shapes. (b) Similarity tree. (c) Similarity matrix. (Bribiesca and Guzman.)

String Matching

Suppose two region boundaries, a and b, are coded into strings of symbols, denoted a_1 a_2 … a_n and b_1 b_2 … b_m, respectively. Let α represent the number of matches between the two strings, where a match occurs in the kth position if a_k = b_k. The number of symbols that do not match is

    β = max(|a|, |b|) − α        (12-14)

where |arg| is the length (number of symbols) of the string in the argument. It can be shown that β = 0 if and only if a and b are identical (see Problem 12.7).

An effective measure of similarity is the ratio

    R = α/β = α / [max(|a|, |b|) − α]        (12-15)

We see that R is infinite for a perfect match and 0 when none of the corresponding symbols in a and b match (α = 0 in this case). Because matching is done symbol by symbol, the starting point on each boundary is important in terms of reducing the amount of computation required to perform a match. Any method that normalizes to, or near, the same starting point is helpful if it provides a computational advantage over brute-force matching, which consists of starting at arbitrary points on each string, then shifting one of the strings (with wraparound) and computing Eq. (12-15) for each shift. The largest value of R gives the best match. (Refer to Section 11.2 for examples of how the starting point of a curve can be normalized.)

EXAMPLE 12.4 : String matching.

Figures 12.18(a) and (b) show sample boundaries from each of two object classes, which were approximated by a polygonal fit (see Section 11.2). Figures 12.18(c) and (d) show the polygonal approximations corresponding to the boundaries in Figs. 12.18(a) and (b), respectively. Strings were formed from the polygons by computing the interior angle, u, between segments as each polygon was traversed clockwise. Angles were coded into one of eight possible symbols, corresponding to multiples of 45°; that is, a_1: 0° < u ≤ 45°; a_2: 45° < u ≤ 90°; …; a_8: 315° < u ≤ 360°.

FIGURE 12.18 (a) and (b) Sample boundaries of two different object classes; (c) and (d) their corresponding polygonal approximations; (e)–(g) tabulations of R. (Sze and Yang.)

Figure 12.18(e) shows the results of computing the measure R for six samples of object 1 against themselves. The entries are values of R and, for example, the notation 1.c refers to the third string from object class 1. Figure 12.18(f) shows the results of comparing the strings of the second object class against themselves. Finally, Fig. 12.18(g) shows the R values obtained by comparing strings of one class against the other. These values of R are significantly smaller than any entry in the two preceding tabulations. This indicates that the R measure achieved a high degree of discrimination between the two classes of objects. For example, if the class of string 1.a had been unknown, the smallest value of R resulting from comparing this string against sample (prototype) strings of class 1 would have been 4.7 [see Fig. 12.18(e)]. By contrast, the largest value in comparing it against strings of class 2 would have been 1.24 [see Fig. 12.18(g)]. This result would have led to the conclusion that string 1.a is a member of object class 1. This approach to classification is analogous to the minimum-distance classifier introduced earlier.
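A brief illustrative sketch of Eqs. (12-14) and (12-15), including the brute-force circular shifting described above (the character encoding of the eight angle symbols in the comment is an arbitrary choice, not from the text):

```python
def similarity_R(a, b):
    """R = alpha / beta, with alpha the number of position-wise matches and
    beta = max(|a|, |b|) - alpha  [Eqs. (12-14) and (12-15)]."""
    alpha = sum(1 for x, y in zip(a, b) if x == y)
    beta = max(len(a), len(b)) - alpha
    return float('inf') if beta == 0 else alpha / beta

def best_R(a, b):
    """Best R over all circular shifts of b (wraparound), i.e. brute-force matching."""
    return max(similarity_R(a, b[s:] + b[:s]) for s in range(len(b)))

# Example with the angle symbols a_1, ..., a_8 coded as characters '1'-'8':
# the same boundary coded from a different starting point matches perfectly
# after shifting, e.g. best_R("12344321", "43211234") -> inf.
```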


12.4 OPTIMUM (BAYES) STATISTICAL CLASSIFIERS

In this section, we develop a probabilistic approach to pattern classification. As is true in most fields that deal with measuring and interpreting physical events, probability considerations become important in pattern recognition because of the randomness under which pattern classes normally are generated. As shown in the following discussion, it is possible to derive a classification approach that is optimal in the sense that, on average, it yields the lowest probability of committing classification errors (see Problem 12.12).

DERIVATION OF THE BAYES CLASSIFIER

The probability that a pattern vector x comes from class c_i is denoted by p(c_i | x). If the pattern classifier decides that x came from class c_j when it actually came from c_i, it incurs a loss (to be defined shortly), denoted by L_ij. Because pattern x may belong to any one of N_c possible classes, the average loss incurred in assigning x to class c_j is

    r_j(x) = Σ_{k=1}^{N_c} L_kj p(c_k | x)        (12-16)

Quantity r_j(x) is called the conditional average risk or loss in decision-theory terminology.

We know from Bayes' rule that p(a | b) = [p(a) p(b | a)] / p(b), so we can write Eq. (12-16) as

    r_j(x) = (1/p(x)) Σ_{k=1}^{N_c} L_kj p(x | c_k) P(c_k)        (12-17)

where p(x | c_k) is the probability density function (PDF) of the patterns from class c_k, and P(c_k) is the probability of occurrence of class c_k (sometimes P(c_k) is referred to as the a priori, or simply the prior, probability). Because 1/p(x) is positive and common to all the r_j(x), j = 1, 2, …, N_c, it can be dropped from Eq. (12-17) without affecting the relative order of these functions from the smallest to the largest value. The expression for the average loss then reduces to

    r_j(x) = Σ_{k=1}^{N_c} L_kj p(x | c_k) P(c_k)        (12-18)

Given an unknown pattern, the classifier has N_c possible classes from which to choose. If the classifier computes r_1(x), r_2(x), …, r_{N_c}(x) for each pattern x and assigns the pattern to the class with the smallest loss, the total average loss with respect to all decisions will be minimum. The classifier that minimizes the total average loss is called the Bayes classifier. This classifier assigns an unknown pattern x to class c_i if r_i(x) < r_j(x) for j = 1, 2, …, N_c; j ≠ i. In other words, x is assigned to class c_i if

    Σ_{k=1}^{N_c} L_ki p(x | c_k) P(c_k) < Σ_{q=1}^{N_c} L_qj p(x | c_q) P(c_q)        (12-19)

for all j; j ≠ i. The loss for a correct decision generally is assigned a value of 0, and the loss for any incorrect decision usually is assigned a value of 1. Then, the loss function becomes

    L_ij = 1 − δ_ij        (12-20)

where δ_ij = 1 if i = j, and δ_ij = 0 if i ≠ j. Equation (12-20) indicates a loss of unity for incorrect decisions and a loss of zero for correct decisions. Substituting Eq. (12-20) into Eq. (12-18) yields

    r_j(x) = Σ_{k=1}^{N_c} (1 − δ_kj) p(x | c_k) P(c_k) = p(x) − p(x | c_j) P(c_j)        (12-21)

The Bayes classifier then assigns a pattern x to class c_i if, for all j ≠ i,

    p(x) − p(x | c_i) P(c_i) < p(x) − p(x | c_j) P(c_j)        (12-22)

or, equivalently, if

    p(x | c_i) P(c_i) > p(x | c_j) P(c_j),    j = 1, 2, …, N_c; j ≠ i        (12-23)

Thus, the Bayes classifier for a 0-1 loss function computes decision functions of the form

    d_j(x) = p(x | c_j) P(c_j),    j = 1, 2, …, N_c        (12-24)

and assigns a pattern to class c_i if d_i(x) > d_j(x) for all j ≠ i. This is exactly the same process described in Eq. (12-5), but we are now dealing with decision functions that have been shown to be optimal in the sense that they minimize the average loss in misclassification.

For the optimality of Bayes decision functions to hold, the probability density functions of the patterns in each class, as well as the probability of occurrence of each class, must be known. The latter requirement usually is not a problem. For instance, if all classes are equally likely to occur, then P(c_j) = 1/N_c. Even if this condition is not true, these probabilities generally can be inferred from knowledge of the problem. Estimating the probability density functions p(x | c_j) is more difficult. If the pattern vectors are n-dimensional, then p(x | c_j) is a function of n variables. If the form of p(x | c_j) is not known, estimating it requires using multivariate estimation methods. These methods are difficult to apply in practice, especially if the number of representative patterns from each class is not large, or if the probability density functions are not well behaved. For these reasons, uses of the Bayes classifier often are based on assuming an analytic expression for the density functions. This in turn reduces the problem to one of estimating the necessary parameters from sample patterns from each class using training patterns. By far, the most prevalent form assumed for p(x | c_j) is the Gaussian probability density function. The closer this assumption is to reality, the closer the Bayes classifier approaches the minimum average loss in classification.
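As an illustration only (not from the text), Eq. (12-24) translates directly into code once the class densities and priors are available; here they are assumed to be supplied by the caller as ordinary callables and numbers.

```python
import numpy as np

def bayes_classify(x, class_pdfs, priors):
    """Assign x to the class maximizing d_j(x) = p(x | c_j) P(c_j)  [Eq. (12-24)].
    class_pdfs: list of callables returning p(x | c_j); priors: list of P(c_j)."""
    d = [pdf(x) * P for pdf, P in zip(class_pdfs, priors)]
    return int(np.argmax(d))
```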


BAYES CLASSIFIER FOR GAUSSIAN PATTERN CLASSES

To begin, let us consider a 1-D problem (n = 1) involving two pattern classes (N_c = 2) governed by Gaussian densities, with means m_1 and m_2, and standard deviations s_1 and s_2, respectively. (You may find it helpful to review the tutorial on probability available in the book website.) From Eq. (12-24), the Bayes decision functions have the form

    d_j(x) = p(x | c_j) P(c_j) = [1 / (sqrt(2π) s_j)] exp[ −(x − m_j)^2 / (2 s_j^2) ] P(c_j),    j = 1, 2        (12-25)

where the patterns are now scalars, denoted by x. Figure 12.19 shows a plot of the probability density functions for the two classes. The boundary between the two classes is a single point, x_0, such that d_1(x_0) = d_2(x_0). If the two classes are equally likely to occur, then P(c_1) = P(c_2) = 1/2, and the decision boundary is the value of x_0 for which p(x_0 | c_1) = p(x_0 | c_2). This point is the intersection of the two probability density functions, as shown in Fig. 12.19. Any pattern (point) to the right of x_0 is classified as belonging to class c_1. Similarly, any pattern to the left of x_0 is classified as belonging to class c_2. When the classes are not equally likely to occur, x_0 moves to the left if class c_1 is more likely to occur or, conversely, it moves to the right if class c_2 is more likely to occur. This result is to be expected, because the classifier is trying to minimize the loss of misclassification. For instance, in the extreme case, if class c_2 never occurs, the classifier would never make a mistake by always assigning all patterns to class c_1 (that is, x_0 would move to negative infinity).

FIGURE 12.19 Probability density functions for two 1-D pattern classes. Point x_0 (at the intersection of the two curves) is the Bayes decision boundary if the two classes are equally likely to occur.

In the n-dimensional case, the Gaussian density of the vectors in the jth pattern class has the form

    p(x | c_j) = [1 / ((2π)^(n/2) |C_j|^(1/2))] exp[ −(1/2)(x − m_j)^T C_j^{-1} (x − m_j) ]        (12-26)

where each density is specified completely by its mean vector m_j and covariance matrix C_j, which are defined as

    m_j = E_j{ x }        (12-27)

and

    C_j = E_j{ (x − m_j)(x − m_j)^T }        (12-28)

where E_j{ · } is the expected value of the argument over the patterns of class c_j. In Eq. (12-26), n is the dimensionality of the pattern vectors, and |C_j| is the determinant of matrix C_j. Approximating the expected value E_j by the sample average yields an estimate of the mean vector and covariance matrix:

    m_j = (1/n_j) Σ_{x ∈ c_j} x        (12-29)

and

    C_j = (1/n_j) Σ_{x ∈ c_j} x x^T − m_j m_j^T        (12-30)

where n_j is the number of sample pattern vectors from class c_j, and the summation is taken over these vectors. We will give an example later in this section of how to use these two expressions.

The covariance matrix is symmetric and positive semidefinite. Its kth diagonal element is the variance of the kth element of the pattern vectors. The kjth off-diagonal matrix element is the covariance of elements x_k and x_j in these vectors. The multivariate Gaussian density function reduces to the product of the univariate Gaussian density of each element of x when the off-diagonal elements of the covariance matrix are zero, which happens when the vector elements x_k and x_j are uncorrelated.

From Eq. (12-24), the Bayes decision function for class c_j is d_j(x) = p(x | c_j) P(c_j). However, the exponential form of the Gaussian density allows us to work with the natural logarithm of this decision function, which is more convenient. In other words, we can use the form

    d_j(x) = ln[ p(x | c_j) P(c_j) ] = ln p(x | c_j) + ln P(c_j)        (12-31)

This expression is equivalent to Eq. (12-24) in terms of classification performance because the logarithm is a monotonically increasing function. That is, the numerical order of the decision functions in Eqs. (12-24) and (12-31) is the same. Substituting Eq. (12-26) into Eq. (12-31) yields

    d_j(x) = ln P(c_j) − (n/2) ln 2π − (1/2) ln |C_j| − (1/2)(x − m_j)^T C_j^{-1} (x − m_j)        (12-32)

(As noted in Section 6.7 [see Eq. (6-49)], the square root of the rightmost term in this equation is called the Mahalanobis distance.)

The term (n/2) ln 2π is the same for all classes, so it can be eliminated from Eq. (12-32), which then becomes

    d_j(x) = ln P(c_j) − (1/2) ln |C_j| − (1/2)(x − m_j)^T C_j^{-1} (x − m_j),    j = 1, 2, …, N_c        (12-33)
www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 925 6/16/2017 2:16:51 PM DIP4E_GLOBAL_Print_Ready.indb 926 6/16/2017 2:16:53 PM


12.4 Optimum (Bayes) Statistical Classifiers 927 928 Chapter 12 Image Pattern Classification

FIGURE 12.20 x3
for j = 1, 2, …, Nc . This equation gives the Bayes decision functions for Gaussian
Two simple
pattern classes under the condition of a 0-1 loss function.
pattern classes (0, 0, 1)
The decision functions in Eq. (12-33) are hyperquadrics (quadratic functions in and the portion (0, 1, 1)
n-dimensional space), because no terms higher than the second degree in the com- of their Bayes
ponents of x appear in the equation. Clearly, then, the best that a Bayes classifier decision bound-
for Gaussian patterns can do is to place a second-order decision boundary between ary (shaded) that
intersects the
each pair of pattern classes. If the pattern populations are truly Gaussian, no other cube. (1, 0, 1) (1, 1, 1)
boundary would yield a lesser average loss in classification.
If all covariance matrices are equal, then C j = C for j = 1, 2, …, Nc . By expanding x2
Eq. (12-33), and dropping all terms that do not depend on j, we obtain (0, 0, 0) (0, 1, 0)

( )
d j ( x ) = ln P c j + xT C−1m j −
1 T −1
2
mj C mj (12-34) (1, 0, 0)
(1, 1, 0)
∈ c1
∈ c2
x1
which are linear decision functions (hyperplanes) for j = 1, 2, …, Nc .
If, in addition, C = I, where I is the identity matrix, and also if the classes are 3 1 1 
equally likely (i.e., P(c j ) = 1 Nc for all j), then we can drop the term ln P(c j ) because 1 
C1 = C2 =  1 3 −1
it would be the same for all values of j. Equation (12-34) then becomes 16
1 −1 3 
1 T The inverse of this matrix is
d j ( x ) = mTj x − mj mj j = 1, 2, …, Nc (12-35)
2  8 − 4 −4 
C1−1 = C−2 1 =  −4 8 4 
which we recognize as the decision functions for a minimum-distance classifier [see
Eq. (12-4)]. Thus, as mentioned earlier, the minimum-distance classifier is optimum  −4 4 8 
in the Bayes sense if (1) the pattern classes follow a Gaussian distribution, (2) all
covariance matrices are equal to the identity matrix, and (3) all classes are equally Next, we obtain the decision functions. Equation (12-34) applies because the covariance matrices are
likely to occur. Gaussian pattern classes satisfying these conditions are spherical equal, and we are assuming that the classes are equally likely:
clouds of identical shape in n dimensions (called hyperspheres). The minimum- 1 T −1
distance classifier establishes a hyperplane between every pair of classes, with the d j ( x ) = xT C−1m j − mj C mj
2
property that the hyperplane is the perpendicular bisector of the line segment join-
ing the center of the pair of hyperspheres. In 2-D, the patterns are distributed in cir- Carrying out the vector-matrix expansion, we obtain the two decision functions:
cular regions, and the boundaries become lines that bisect the line segment joining d1 ( x ) = 4 x1 − 1.5 and d2 ( x ) = −4 x1 + 8 x2 + 8 x3 − 5.5
the center of every pair of such circles.
The decision boundary separating the two classes is then
EXAMPLE 12.5 : A Bayes classifier for 3-D patterns.
d1 ( x ) − d2 ( x ) = 8 x1 − 8 x2 − 8 x3 + 4 = 0
We illustrate the mechanics of the preceding development using the simple patterns in Fig. 12.20. We
assume that the patterns are samples from two Gaussian populations, and that the classes are equally Figure 12.20 shows a section of this planar surface. Note that the classes were separated effectively.
likely to occur. Applying Eq. (12-29) to the patterns in the figure results in

 3 1  EXAMPLE 12.6 : Classification of multispectral data using a Bayes classifier.


1  1 
m1 = 1 and m = 3 As discussed in Sections 1.3 and 11.5, a multispectral scanner responds to selected bands of the electro-
3  3 
2
1   3 magnetic energy spectrum, such as the bands: 0.45– 0.52, 0.53– 0.61, 0.63– 0.69, and 0.78– 0.90 microns.
These ranges are in the visible blue, visible green, visible red, and near infrared bands, respectively. A
And, from Eq. (12-30), region on the ground scanned using these multispectral bands produces four digital images of the region,

one for each band. If the images are registered spatially, they can be visualized as being stacked one behind the other, as illustrated in Fig. 12.7. As we explained in that figure, every point on the ground in this example can be represented by a 4-D pattern vector of the form x = (x_1, x_2, x_3, x_4)^T, where x_1 is a shade of blue, x_2 a shade of green, and so on. If the images are of size 512 × 512 pixels, each stack of four multispectral images can be represented by 262,144 four-dimensional pattern vectors. As noted previously, the Bayes classifier for Gaussian patterns requires estimates of the mean vector and covariance matrix for each class. In remote sensing applications, these estimates are obtained using training multispectral data whose classes are known from each region of interest (this knowledge sometimes is referred to as ground truth). The resulting vectors are then used to estimate the required mean vectors and covariance matrices, as in Example 12.5.

Figures 12.21(a) through (d) show four 512 × 512 multispectral images of the Washington, D.C. area, taken in the bands mentioned in the previous paragraph. We are interested in classifying the pixels in these images into one of three pattern classes: water, urban development, or vegetation. The masks in Fig. 12.21(e) were superimposed on the images to extract samples representative of these three classes. Half of the samples were used for training (i.e., for estimating the mean vectors and covariance matrices), and the other half were used for independent testing to assess classifier performance. We assume that the a priori probabilities are equal, P(c_j) = 1/3; j = 1, 2, 3.

Table 12.1 summarizes the classification results we obtained with the training and test data sets. The percentage of training and test pattern vectors recognized correctly was about the same with both data sets, indicating that the learned parameters did not over-fit the classifier to the training data. The largest error in both cases was with patterns from the urban area. This is not unexpected, as vegetation is present there also (note that no patterns in the vegetation or urban areas were misclassified as water). Figure 12.21(f) shows as black dots the training and test patterns that were misclassified, and as white dots the patterns that were classified correctly. No black dots are visible in region 1, because the seven misclassified points are very close to the boundary of the white region. You can compute from the numbers in the table that the correct recognition rate was 96.4% for the training patterns, and 96.1% for the test patterns.

Figures 12.21(g) through (i) are more interesting. Here, we let the system classify all image pixels into one of the three categories. Figure 12.21(g) shows in white all pixels that were classified as water. Pixels not classified as water are shown in black. We see that the Bayes classifier did an excellent job of determining which parts of the image were water. Figure 12.21(h) shows in white all pixels classified as urban development; observe how well the system performed in recognizing urban features, such as the bridges and highways. Figure 12.21(i) shows the pixels classified as vegetation. The center area in Fig. 12.21(h) shows a high concentration of white pixels in the downtown area, with the density decreasing as a function of distance from the center of the image. Figure 12.21(i) shows the opposite effect, indicating the least vegetation toward the center of the image, where urban development is the densest.

FIGURE 12.21 Bayes classification of multispectral data. (a)–(d) Images in the visible blue, visible green, visible red, and near infrared wavelength bands. (e) Masks for regions of water (labeled 1), urban development (labeled 2), and vegetation (labeled 3). (f) Results of classification; the black dots denote points classified incorrectly. The other (white) points were classified correctly. (g) All image pixels classified as water (in white). (h) All image pixels classified as urban development (in white). (i) All image pixels classified as vegetation (in white).

We mentioned in Section 10.3 when discussing Otsu's method that thresholding may be viewed as a Bayes classification problem, which optimally assigns patterns to two or more classes. In fact, as the previous example shows, pixel-by-pixel classification may be viewed as a segmentation that partitions an image into two or more possible types of regions. If only a single variable (e.g., intensity) is used, then Eq. (12-24) becomes an optimum function that similarly partitions an image based on the intensity of its pixels, as we did in Section 10.3. Keep in mind that optimality requires that the PDF and a priori probability of each class be known.


have mentioned previously, estimating these densities is not a trivial task. If assumptions have to be made (e.g., as in assuming Gaussian densities), then the degree of optimality achieved in classification depends on how close the assumptions are to reality.

TABLE 12.1 Bayes classification of multispectral image data. Classes 1, 2, and 3 are water, urban, and vegetation, respectively.

                 Training Patterns                            Test Patterns
Class   No. of    Classified into Class     %       No. of    Classified into Class     %
        Samples     1      2      3      Correct    Samples     1      2      3      Correct
  1       484      482     2      0       99.6        483      478     3      2       98.9
  2       933        0   885     48       94.9        932        0   880     52       94.4
  3       483        0    19    464       96.1        482        0    16    466       96.7

12.5 NEURAL NETWORKS AND DEEP LEARNING

The principal objectives of the material in this section and in Section 12.6 are to present an introduction to deep neural networks, and to derive the equations that are the foundation of deep learning. We will discuss two types of networks. In this section, we focus attention on multilayer, fully connected neural networks, whose inputs are pattern vectors of the form introduced in Section 12.2. In Section 12.6, we will discuss convolutional neural networks, which are capable of accepting images as inputs. We follow the same basic approach in presenting the material in these two sections. That is, we begin by developing the equations that describe how an input is mapped through the networks to generate the outputs that are used to classify that input. Then, we derive the equations of backpropagation, which are the tools used to train both types of networks. We give examples in both sections that illustrate the power of deep neural networks and deep learning for solving complex pattern classification problems.

BACKGROUND

The essence of the material that follows is the use of a multitude of elemental nonlinear computing elements (called artificial neurons), organized as networks whose interconnections are similar in some respects to the way in which neurons are interconnected in the visual cortex of mammals. The resulting models are referred to by various names, including neural networks, neurocomputers, parallel distributed processing models, neuromorphic systems, layered self-adaptive networks, and connectionist models. Here, we use the name neural networks, or neural nets for short. We use these networks as vehicles for adaptively learning the parameters of decision functions via successive presentations of training patterns.

Interest in neural networks dates back to the early 1940s, as exemplified by the work of McCulloch and Pitts [1943], who proposed neuron models in the form of binary thresholding devices, and stochastic algorithms involving sudden 0–1 and 1–0 changes of states, as the basis for modeling neural systems. Subsequent work by Hebb [1949] was based on mathematical models that attempted to capture the concept of learning by reinforcement or association.

During the mid-1950s and early 1960s, a class of so-called learning machines originated by Rosenblatt [1959, 1962] caused a great deal of excitement among researchers and practitioners of pattern recognition. The reason for the interest in these machines, called perceptrons, was the development of mathematical proofs showing that perceptrons, when trained with linearly separable training sets (i.e., training sets separable by a hyperplane), would converge to a solution in a finite number of iterative steps. The solution took the form of parameters (coefficients) of hyperplanes that were capable of correctly separating the classes represented by patterns of the training set.

Unfortunately, the expectations following discovery of what appeared to be a well-founded theoretical model of learning soon met with disappointment. The basic perceptron, and some of its generalizations, were inadequate for most pattern recognition tasks of practical significance. Subsequent attempts to extend the power of perceptron-like machines by considering multiple layers of these devices lacked effective training algorithms, such as those that had created interest in the perceptron itself. The state of the field of learning machines in the mid-1960s was summarized by Nilsson [1965]. A few years later, Minsky and Papert [1969] presented a discouraging analysis of the limitation of perceptron-like machines. This view was held as late as the mid-1980s, as evidenced by comments made by Simon [1986]. In this work, originally published in French in 1984, Simon dismisses the perceptron under the heading "Birth and Death of a Myth."

More recent results by Rumelhart, Hinton, and Williams [1986] dealing with the development of new training algorithms for multilayers of perceptron-like units have changed matters considerably. Their basic method, called backpropagation (backprop for short), provides an effective training method for multilayer networks. Although this training algorithm cannot be shown to converge to a solution in the sense of the proof for the single-layer perceptron, backpropagation is capable of generating results that have revolutionized the field of pattern recognition.

The approaches to pattern recognition we have studied up to this point rely on human-engineered techniques to transform raw data into formats suitable for computer processing. The methods of feature extraction we studied in Chapter 11 are examples of this. Unlike these approaches, neural networks can use backpropagation to automatically learn representations suitable for recognition, starting with raw data. Each layer in the network "refines" the representation into more abstract levels. This type of multilayered learning is commonly referred to as deep learning, and this capability is one of the underlying reasons why applications of neural networks have been so successful. As we noted at the beginning of this section, practical implementations of deep learning generally are associated with large data sets.

Of course, these are not "magical" systems that assemble themselves. Human intervention is still required for specifying parameters such as the number of layers, the number of artificial neurons per layer, and various coefficients that are problem


dependent. Teaching proper recognition to a complex multilayer neural network is reduces the training and operation of neural nets to a simple, straightforward cas-
not a science; rather, it is an art that requires considerable knowledge and experi- cade of matrix multiplications.
mentation on the part of the designer. Countless applications of pattern recogni- After studying several examples of fully connected neural nets, we will follow a
tion, especially in constrained environments, are best handled by more “traditional” similar approach in developing the foundation of CNNs, including how they differ
methods. A good example of this is stylized font recognition. It would be senseless from fully connected neural nets, and how their training is different. This is followed
to develop a neural network to recognize the E-13B font we studied in Fig. 12.11. A by several examples of how CNNs are used for image pattern classification.
minimum-distance classifier implemented on a hard-wired architecture is the ideal
solution to this problem, provided that interest is limited to reading only the E-13B THE PERCEPTRON
font printed on bank checks. On the other hand, neural networks have proved to be A single perceptron unit learns a linear boundary between two linearly separable
the ideal solution if the scope of application is expanded to require that all relevant pattern classes. Figure 12.22(a) shows the simplest possible example in two dimen-
text written on checks, including cursive script, be read with high accuracy. sions: two pattern classes, consisting of a single pattern each. A linear boundary in
Deep learning has shined in applications that defy other methods of solution. In 2-D is a straight line with equation y = ax + b, where coefficient a is the slope and b
the two decades following the introduction of backpropagation, neural networks is the y-intercept. Note that if b = 0, the line goes through the origin. Therefore, the
have been used successfully in a broad range of applications. Some of them, such as function of parameter b is to displace the line from the origin without affecting its
speech recognition, have become an integral part of everyday life. When you speak slope. For this reason, this “floating” coefficient that is not multiplied by a coordi-
into a smart phone, the nearly flawless recognition is performed by a neural network. nate is often referred to as the bias, the bias coefficient, or the bias weight.
This type of performance was unachievable just a few years ago. Other applications We are interested in a line that separates the two classes in Fig. 12.22. This is a line
from which you benefit, perhaps without realizing it, are smart filters that learn user positioned in such a way that pattern ( x1 , y1 ) from class c1 lies on one side of the line,
preferences for rerouting spam and other junk mail from email accounts, and the and pattern ( x2 , y2 ) from class c2 lies on the other. The locus of points ( x, y) that are
systems that read zip codes on postal mail. Often, you see television clips of vehicles on the line, satisfy the equation y − ax − b = 0. It then follows that any point on one
navigating autonomously, and robots that are capable of interacting with their envi- side of the line would yield a positive value when its coordinates are plugged into
ronment. Most are solutions based on neural networks. Less familiar applications this equation, and conversely for a point on the other side.
include the automated discovery of new medicines, the prediction of gene mutations Generally, we work with patterns in much higher dimensions than two, so we need
in DNA research, and advances in natural language understanding. more general notation. Points in n dimensions are vectors. The components of a vec-
Although the list of practical uses of neural nets is long, applications of this tech- tor, x1 , x2 , …, xn , are the coordinates of the point. For the coefficients of the boundary
nology in image pattern classification has been slower in gaining popularity. As separating the two classes, we use the notation w1 , w2 , …, wn , wn+1 , where wn+1 is the
you will learn shortly, using neural nets in image processing is based principally on bias. The general equation of our line using this notation is w1 x1 + w2 x2 + w3 = 0 (we
neural network architectures called convolutional neural nets (denoted by CNNs can express this equation in slope-intercept form as x2 = −(w1/w2)x1 − w3/w2).
or ConvNets). One of the earliest well-known applications of CNNs is the work of Figure 12.22(b) is the same as (a), but using this notation. Comparing the two fig-
LeCun et al. [1989] for reading handwritten U.S. postal zip codes. A number of other ures, we see that y = x2 , x = x1 , a = −w1/w2 , and b = −w3/w2 . Equipped with our more
applications followed shortly thereafter, but it was not until the results of the 2012
ImageNet Challenge were published (e.g., see Krizhevsky, Sutskever, and Hinton [2012]) that CNNs became widely used in image pattern recognition. Today, this is the approach of choice for addressing complex image recognition tasks.
The neural network literature is vast and rapidly evolving, so as usual, our approach is to focus on fundamentals. In this and the following sections, we will establish the foundation of how neural nets are trained, and how they operate after training. We will begin by briefly discussing perceptrons. Although these computing elements are not used per se in current neural network architectures, the operations they perform are almost identical to artificial neurons, which are the basic computing units of neural nets. In fact, an introduction to neural networks would be incomplete without a discussion of perceptrons. We will follow this discussion by developing in detail the theoretical foundation of backpropagation. After developing the basic backpropagation equations, we will recast them in matrix form, which

FIGURE 12.22 (a) The simplest two-class example in 2-D, showing one possible decision boundary (y = ax + b, or y − ax − b = 0) out of an infinite number of such boundaries. (b) Same as (a), but with the decision boundary expressed using more general notation (w1 x1 + w2 x2 + w3 = 0).
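The sign test just described for the boundary of Fig. 12.22(b) is easy to state in code. The following is a minimal Python sketch (the function name and the sample coefficients are illustrative choices, not part of the original text); the particular boundary x1 + x2 − 3 = 0 used in the test calls is the one obtained later in Example 12.7.

    def decision_function(x1, x2, w1, w2, w3):
        # Evaluates d(x1, x2) = w1*x1 + w2*x2 + w3. A positive value places the
        # point on one side of the boundary w1*x1 + w2*x2 + w3 = 0, a negative
        # value on the other, and zero means the point lies on the boundary.
        return w1 * x1 + w2 * x2 + w3

    print(decision_function(3, 3, 1, 1, -3))   #  3 > 0: positive side of x1 + x2 - 3 = 0
    print(decision_function(1, 1, 1, 1, -3))   # -1 < 0: negative side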


general notation, we say that an arbitrary point ( x1 , x2 ) is on the positive side of a 2) If x(k ) ∈ c2 and w T (k )x(k ) + wn+1 (k ) ≥ 0, let
line if w1 x1 + w2 x2 + w3 > 0, and conversely for any point on the negative side. For
points in 3-D, we work with the equation of a plane, w1 x1 + w2 x2 + w3 x3 + w4 = 0, w (k + 1) = w (k ) − a x(k )
(12-41)
but would perform exactly the same test to see if a point lies on the positive or wn+1 (k + 1) = wn+1 (k ) − a
negative side of the plane. For a point in n dimensions, the test would be against a
hyperplane, whose equation is 3) Otherwise, let
w (k + 1) = w (k )
w1 x1 + w2 x2 + ⋯ + wn xn + wn+1 = 0        (12-36)
wn+1 (k + 1) = wn+1 (k )
This equation is expressed in summation form as
The correction in Eq. (12-40) is applied when the pattern is from class c1 and
    ∑_{i=1}^{n} wi xi + wn+1 = 0        (12-37)

Eq. (12-39) does not give a positive response. Similarly, the correction in Eq. (12-41)
is applied when the pattern is from class c2 and Eq. (12-39) does not give a negative
or in vector form as response. As Eq. (12-42) shows, no change is made when Eq. (12-39) gives the cor-
rect response.
wT x + wn+1 = 0 (12-38) The notation in Eqs. (12-40) through (12-42) can be simplified if we add a 1 at
the end of every pattern vector and include the bias in the weight vector. That is,
where w and x are n-dimensional column vectors and wT x is the dot (inner) prod- we definex % [ x1 , x2 ,… , xn , 1]T and w % [w1 , w2 ,… , wn , wn+1 ]T . Then, Eq. (12-39)
uct of the two vectors. Because the inner product is commutative, we can express becomes
Eq. (12-38) in the equivalent form xT w + wn+1 = 0. We refer to w as a weight vector
and, as above, to wn+1 as a bias. Because the bias is a weight that is always multiplied > 0 if x ∈ c1
wT x =  (12-43)
by 1, sometimes we avoid repetition by using the term weights, coefficients, or param- < 0 if x ∈ c2
eters when referring to the bias and the elements of a weight vector collectively.
Stating the class separation problem in general form we say that, given any pat- where both vectors are now (n + 1)-dimensional. In this formulation, x and w are
It is customary to tern vector x from a vector population, we want to find a set of weights with the referred to as augmented pattern and weight vectors, respectively. The algorithm in
associate > with class c1
and < with class c2, but property Eqs. (12-40) through (12-42) then becomes: For any pattern vector, x(k ), at step k
the sense of the
> 0 if x ∈ c1
inequality is arbitrary,
wT x + wn+1 =  (12-39) 1$) If x(k ) ∈ c1 and w T (k ) x(k ) ≤ 0, let
< 0 if x ∈ c2
provided that you are
consistent. Note that this
equation implements a
linear decision function. Finding a line that separates two linearly separable pattern classes in 2-D can be w (k + 1) = w (k ) + ax(k ) (12-44)
done by inspection. Finding a separating plane by visual inspection of 3-D data is
2$) If x(k ) ∈ c2 and w T (k )x(k ) ≥ 0, let
Linearly separable class- more difficult, but it is doable. For n > 3, finding a separating hyperplane by inspec-
es satisfy Eq. (12-39). tion becomes impossible in general. We have to resort instead to an algorithm to find
That is, they are w (k + 1) = w (k ) − ax(k ) (12-45)
separable by single a solution. The perceptron is an implementation of such an algorithm. It attempts
hyperplanes. to find a solution by iteratively stepping through the patterns of each of two classes. 3$) Otherwise, let
It starts with an arbitrary weight vector and bias, and is guaranteed to converge in a
finite number of iterations if the classes are linearly separable. w (k + 1) = w (k ) (12-46)
The perceptron algorithm is simple. Let a > 0 denote a correction increment (also
called the learning increment or the learning rate), let w(1) be a vector with arbi- where the starting weight vector, w(1), is arbitrary and, as above, a is a positive
trary values, and let wn+1 (1) be an arbitrary constant. Then, do the following for constant. The procedure implemented by Eqs. (12-40)–(12-42) or (12-44)–(12-46) is
k = 2, 3, … : For a pattern vector, x(k ), at step k, called the perceptron training algorithm. The perceptron convergence theorem states
that the algorithm is guaranteed to converge to a solution (i.e., a separating hyper-
1) If x(k ) ∈ c1 and w T (k )x(k ) + wn+1 (k ) ≤ 0, let plane) in a finite number of steps if the two pattern classes are linearly separable
(see Problem 12.15). Normally, Eqs. (12-44)–(12-46) are the basis for implementing
w (k + 1) = w (k ) + a x(k ) the perceptron training algorithm, and we will use it in the following paragraphs
(12-40)
wn+1 (k + 1) = wn+1 (k ) + a of this section. However, the notation in Eqs. (12-40)–(12-42), in which the bias is
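In code, the augmented notation amounts to appending a 1 to every pattern vector and folding the bias into the weight vector. A minimal NumPy sketch is shown below; the helper name is an illustrative choice, not part of the text.

    import numpy as np

    def augment(x):
        # Append a 1 to a pattern vector so the bias can be carried as the last
        # element of the weight vector, as in the augmented form of Eq. (12-43).
        return np.append(x, 1.0)

    print(augment(np.array([3.0, 3.0])))   # [3. 3. 1.], the augmented pattern used in Example 12.7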

FIGURE 12.23 x1
w1 1
Schematic of a
perceptron, x2 w2 wT (2)x(2) = [ 3 3 1] 1 = 7
showing the 1
operations it .. +1
.. n
performs.
. ∑ wk xk
k =1
+ wn +1 +1 or − 1 The result is positive when it should have been negative, so Step 2$ applies:
wn −1
xn  3 1   2 
w (3) = w (2) − ax(2) =  3 − (1) 1 =  2 
wn+1

1 1  1  0 

We have gone through a complete training epoch with at least one correction, so we cycle through the
training set again.
shown separately, is more prevalent in neural networks, so you need to be familiar
For k = 3, x(3) = [3 3 1]T ∈ c1 , and w(3) = [ 2 2 0]T . Their inner product is positive (i.e., 6) as it should
with it as well.
be because x(3) ∈c1 . Therefore, Step 3$ applies and the weight vector is not changed:
Figure 12.23 shows a schematic diagram of the perceptron. As you can see, all
this simple “machine” does is form a sum of products of an input pattern using the 2
weights and bias found during training. The output of this operation is a scalar value
Note that the perceptron w (4) = w (3) =  2 
model implements Eq. that is then passed through an activation function to produce the unit’s output. For
(12-39), which is in the perceptron, the activation function is a thresholding function (we will consider  0 
the form of a decision
function. other forms of activation when we discuss neural networks). If the thresholded out-
put is a +1, we say that the pattern belongs to class c1 . Otherwise, a −1 indicates that For k = 4, x(4) = [1 1 1]T ∈ c2 , and w(4) = [ 2 2 0]T . Their inner product is positive (i.e., 4) and it should
the pattern belongs to class c2 . Values 1 and 0 sometimes are used to denote the two have been negative, so Step 2$ applies:
possible states of the output. 2 1   1 
w (5) = w (4) − ax(4) =  2  − (1) 1 =  1 
EXAMPLE 12.7 : Using the perceptron algorithm to learn a decision boundary.  0  1  −1
We illustrate the steps taken by a perceptron in learning the coefficients of a linear boundary by solving
At least one correction was made, so we cycle through the training patterns again. For k = 5, we have
the mini problem in Fig. 12.22. To simplify manual computations, let the pattern vector furthest from the
x(5) = [3 3 1]T ∈ c1 , and, using w(5), we compute their inner product to be 5. This is positive as it should
origin be x = [3 3 1]T , and the other be x = [1 1 1]T , where we augmented the vectors by appending a
be, so Step 3$ applies and we let w (6) = w (5) = [1 1 − 1]T . Following this procedure just discussed, you
1 at the end, as discussed earlier. To match the figure, let these two patterns belong to classes c1 and c2 ,
can show (see Problem 12.13) that the algorithm converges to the solution weight vector
respectively. Also, assume the patterns are “cycled” through the perceptron in that order during training
(one complete iteration through all patterns of the training is called an epoch). To start, we let a = 1 and
1
w(1) = 0 = [0 0 0]T ; then,
For k = 1, x(1) = [3 3 1]T ∈ c1 , and w(1) = [0 0 0]T . Their inner product is zero, w = w (12) =  1 
 −3
 3 which gives the decision boundary
wT (1)x(1) = [ 0 0 0 ]  3 = 0
x1 + x2 − 3 = 0
1 

so Step 1$ of the second version of the training algorithm applies: Figure 12.24(a) shows the boundary defined by this equation. As you can see, it clearly separates the
patterns of the two classes. In terms of the terminology we used in the previous section, the decision
0   3  3 surface learned by the perceptron is d(x) = d( x1 , x2 ) = x1 + x2 − 3, which is a plane. As before, the
w (2) = w (1) + ax(1) =  0  + (1)  3 =  3 decision boundary is the locus of points such that d(x) = d( x1 , x2 ) = 0, which is a line. Another way to
visualize this boundary is that it is the intersection of the decision surface (a plane) with the x1 x2 -plane,
 0  1  1 
as Fig. 12.24(b) shows. All points ( x1 , x2 ) such that d( x1 , x2 ) > 0 are on the positive side of the boundary,
For k = 2, x(2) = [1 1 1]T ∈ c2 and w(2) = [3 3 1]T . Their inner product is and vice versa for d( x1 , x2 ) < 0.


a b d(x) = d( x1 , x2 ) = x1 + x2 − 3
x2
FIGURE 12.24 E E
x1 + x2 − 3 1
(a) Segment +
3 0.50 0.50
of the decision x1 + x2 − 3 = 0
boundary learned
by the perceptron x1 + x2 − 3 = 0 0.5
algorithm. 2 0.25 0.25
(b) Section of the
x2
decision surface. x1 0
The decision 3
3 2
boundary is the 1 2 2
2 0 wx 0 wx
1
intersection of the 1 1 0 1 2 0 1 2
1
decision surface 0
with the x1 x2 - 0
x1 a b c
plane. 0 1 2 3
FIGURE 12.25 Plots of E as a function of wx for r = 1. (a) A value of a that is too small can slow down convergence.
(b) If a is too large, large oscillations or divergence may occur. (c) Shape of the error function in 2-D.

EXAMPLE 12.8 : Using the perceptron to classify two sets of iris data measurements. We find the minimum of E(w) using an iterative gradient descent algorithm, whose
form is
In Fig. 12.10 we showed a reduced set of the iris database in two dimensions, and mentioned that the
only class that was separable from the others is the class of Iris setosa. As another illustration of the Note that the right side  ∂E ( w ) 
perceptron, we now find the full decision boundary between the Iris setosa and the Iris versicolor classes. of this equation is the w ( k + 1) = w(k ) − a   (12-48)
As we mentioned when discussing Fig. 12.10, these are 4-D data sets. Letting a = 0.5, and starting with gradient of E(w).  ∂ w  w = w( k )
all parameters equal to zero, the perceptron converged in only four epochs to the solution weight vector
where the starting weight vector is arbitrary, and a > 0.
w = [0.65, 2.05, − 2.60, − 1.10, 0.50]T , where the last element is wn+1 .
Figure 12.25(a) shows a plot of E for scalar values, w and x, of w and x. We want
to move w incrementally so E(w) approaches a minimum, which implies that E
In practice, linearly separable pattern classes are rare, and a significant amount should stop changing or, equivalently, that ∂E(w) ∂ w = 0. Equation (12-48) does
of research effort during the 1960s and 1970s went into developing techniques for precisely this. If ∂E(w) ∂ w > 0, a portion of this quantity (determined by the value
dealing with nonseparable pattern classes. With recent advances in neural networks, of the learning increment a) is subtracted from w(k ) to create a new, updated value
many of those methods have become items of mere historical interest, and we will w(k + 1), of the weight. The opposite happens if ∂E(w) ∂ w < 0. If ∂E(w) ∂ w = 0,
not dwell on them here. However, we mention briefly one approach because it is rel- the weight is unchanged, meaning that we have arrived at a minimum, which is the
evant to the discussion of neural networks in the next section. The method is based solution we are seeking. The value of a determines the relative magnitude of the
on minimizing the error between the actual and desired response at any training step. correction in weight value. If a is too small, the step changes will be correspond-
Let r denote the response we want the perceptron to have for any pattern during ingly small and the weight would move slowly toward convergence, as Fig. 12.25(a)
training. The output of our perceptron is either +1 or −1, so these are the two pos- illustrates. On the other hand, choosing a too large could cause large oscillations
sible values that r can have. We want to find the augmented weight vector, w, that on either side of the minimum, or even become unstable, as Fig. 12.25(b) illustrates.
minimizes the mean squared error (MSE) between the desired and actual responses There is no general rule for choosing a. However, a logical approach is to start small
of the perceptron. The function should be differentiable and have a unique mini- and experiment by increasing a to determine its influence on a particular set of
mum. The function of choice for this purpose is a quadratic of the form training patterns. Figure 12.25(c) shows the shape of the error function for two vari-
ables.
The 1/2 is used to cancel out the 2 that will result from taking the derivative of this expression. Also, remember that wᵀx is a scalar.

    E(w) = (1/2) (r − wᵀx)²        (12-47)

where E is our error measure, w is the weight vector we are seeking, x is any pattern from the training set, and r is the response we desire for that pattern. Both w and x are augmented vectors.
Because the error function is given analytically and it is differentiable, we can express Eq. (12-48) in a form that does not require computing the gradient explicitly at every step. The partial of E(w) with respect to w is

    ∂E(w)/∂w = −(r − wᵀx) x        (12-49)


Substituting this result into Eq. (12-48) yields


The weight vector at the end of 50 epochs of training was w = [0.098 0.357 − 0.548 − 0.255 0.075]T .
All patterns were classified correctly into their two respective classes using this vector. That is, although
w ( k + 1) = w ( k ) + a r ( k ) − w T ( k ) x ( k ) x ( k ) (12-50) the MSE did not become zero, the resulting weight vector was able to classify all the patterns correctly.
But keep in mind that the LMSE algorithm does not always achieve 100% correct recognition of lin-
which is in terms of known or easily computable terms. As before, w(1) is arbitrary. early separable classes.
Widrow and Stearns [1985] have shown that it is necessary (but not sufficient) As noted earlier, only the Iris setosa samples are linearly separable from the others. But the Iris ver-
for a to be in the range 0 < a < 2 for the algorithm in Eq. (12-50) to converge. A sicolor and virginica samples are not. The perceptron algorithm would not converge when presented
typical range for a is 0.1 < a < 1.0. Although the proof is not shown here, the algo- with these data, whereas the LMSE algorithm does. Figure 12.26(b) is the MSE as a function of training
rithm converges to a solution that minimizes the mean squared error over the pat- epoch for these two data sets, obtained using the same values for w(1) and a as in (a). This time, it took
terns of the training set. For this reason, the algorithm is often referred to as the 900 epochs for the MSE to stabilize at 0.09, which is much higher than before. The resulting weight vec-
least-mean-squared-error (LMSE) algorithm. In practice, we say that the algorithm tor was w = [0.534 0.584 − 0.878 − 1.028 0.651]T . Using this vector resulted in seven misclassification
has converged when the error decreases below a specified threshold. The solution errors out of 100 patterns, giving a recognition rate of 93%.
at convergence may not be a hyperplane that fully partitions two linearly separable
classes. That is, a mean-square-error solution does not imply a solution in the sense of
the perceptron training theorem. This uncertainty is the price of using an algorithm A classic example used to show the limitations of single linear decision boundar-
whose convergence is independent of the linear separability of the pattern classes. ies (and hence single perceptron units) is the XOR classification problem. The table
in Fig. 12.27(a) shows the definition of the XOR operator for two variables. As you
can see, the XOR operation produces a logical true (1) value when either of the
EXAMPLE 12.9 : Using the LMSE algorithm. variables (but not both) is true; otherwise, the result is false (0). The XOR two-class
It will be interesting to compare the performance of the LMSE algorithm using the same set of separa- pattern classification problem is set up by letting each pair of values A and B be a
ble iris data as in Example 12.8. Figure 12.26(a) is a plot of the error [Eq. (12-47)] as a function of epoch point in 2-D space, and letting the true (1) XOR values define one class, and the false
for 50 epochs, using Eq. (12-50) (with a = 0.001) to obtain the weights (we started with w(1) = 0). Each (0) values define the other. In this case, we assigned the class c1 label to patterns
epoch of training consisted of sequentially updating the weights, one pattern at a time, and computing {(0, 0), (1, 1)} , and the c2 label to patterns {(1, 0), (0, 1)} . A classifier capable of solv-
Eq. (12-47) for each weight and the corresponding pattern. At the end of the epoch, the errors were ing the XOR problem must respond with a value, say, 1, when a pattern from class c1
added and divided by 100 (the total number of patterns) to obtain the mean squared error (MSE). This is presented, and a different value, say, 0 or −1, when the input pattern is from class
yielded one point of the curve of Fig. 12.26(a). After increasing and then decreasing rapidly, no appre- c2 . You can tell by inspection of Fig. 12.27(b) that a single linear decision boundary
ciable difference in error occurred after about 20 epochs. For example, the error at the end of the 50th (a straight line) cannot separate the two classes correctly. This means that we cannot
epoch was 0.02 and, at the end of 1,000 epochs, it was 0.0192. Getting smaller error values is possible by solve the problem with a single perceptron. The simplest linear boundary consists
further decreasing a, but at the expense of slower decay in the error, as noted in Fig. 12.25. Keep in mind of two straight lines, as Fig. 12.27(b) shows. A more complex, nonlinear, boundary
also that MSE is not directly proportional to correct recognition rate. capable of solving the problem is a quadratic function, as in Fig. 12.27(c).
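The update of Eq. (12-50) is equally simple to program. The following minimal NumPy sketch (the function name, the default value of a, and the fixed epoch count are illustrative choices, not part of the text) cycles through a set of augmented patterns and records the MSE of Eq. (12-47) for each epoch.

    import numpy as np

    def lmse_train(X, r, alpha=0.1, epochs=100):
        # X holds augmented patterns, one per row; r holds the desired
        # responses (+1 or -1) for those patterns.
        w = np.zeros(X.shape[1])           # w(1) is arbitrary
        mse_per_epoch = []
        for _ in range(epochs):
            errors = []
            for x, r_k in zip(X, r):
                e = r_k - w @ x            # r(k) - w^T(k) x(k)
                w = w + alpha * e * x      # Eq. (12-50)
                errors.append(0.5 * e**2)  # Eq. (12-47) for this pattern
            mse_per_epoch.append(np.mean(errors))
        return w, mse_per_epoch

As the example above illustrates, a small final MSE does not by itself guarantee a separating hyperplane, so the recognition rate should be checked separately.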

x2 x2
+
A B A XOR B
– +
a b 0.3 –
1 1
0 0 0
FIGURE 12.26 –
Mean squared error (MSE)

MSE as a function +
0 1 1
of epoch for:
0.2
(a) the linearly 1 0 1
separable Iris x1 x1
classes (setosa 1 1 0 0 1 0 1
∈ c1
and versicolor);
0.1 ∈ c2
and (b) the
linearly nonsepa- a b c
rable Iris classes FIGURE 12.27 The XOR classification problem in 2-D. (a) Truth table definition of the XOR
(versicolor and operator. (b) 2-D pattern classes formed by assigning the XOR truth values (1) to one pattern
virginica). 0
1 10 20 30 40 50 180 360 540 720 900 class, and false values (0) to another. The simplest decision boundary between the two classes
consists of two straight lines. (c) Nonlinear (quadratic) boundary separating the two classes.
Training epochs Training epochs

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 941 6/16/2017 2:17:12 PM DIP4E_GLOBAL_Print_Ready.indb 942 6/16/2017 2:17:13 PM


12.5 Neural Networks and Deep Learning 943 944 Chapter 12 Image Pattern Classification

a b w1 w1 FIGURE 12.29 a1(& − 1)


x1 x1 wi1 (&)
FIGURE 12.28 w7 Model of an
w2 w2 artificial neuron, wi 2 (&)
(a) Minimum a 2(& − 1)
perceptron solution w3 1 w3 1 showing all the h
to the XOR problem w4 w4 operations it ..
in 2-D. (b) A solution
x2
w5
x2
w5 w8 w 1 performs. The .. zi (&) =
n& −1

∑ wij (&) a j (& − 1) ai (&) = h ( zi (&))


“& ” is used to
.
9
that implements the j =1
+ bi (&)
XOR truth table in denote a
Fig. 12.27(a). w6 1 w6 1 particular layer in win&−1 (&)
a layered an&−1 (& − 1)
bi (&)
network.
Natural questions at this point are: Can more than one perceptron solve the XOR Neuron i in layer &
1
problem? If so, what is the minimum number of units required? We know that a
single perceptron can implement one straight line, and we need to implement two
lines, so the obvious answers are: yes to the first question, and two units to the sec-
ond. Figure 12.28(a) shows the solution for two variables, which requires a total of a large swing in value from +1 to −1. Neural networks are formed from layers of
six coefficients because we need two lines. The solution coefficients are such that, computing units, in which the output of one unit affects the behavior of all units fol-
for either of the two patterns from class c1 , one output is true (1) and the other is lowing it. The perceptron’s sensitivity to the sign of small signals can cause serious
false (0). The opposite condition must hold for either pattern from class c2 . This stability problems in an interconnected system of such units, making perceptrons
solution requires that we analyze two outputs. If we want to implement the truth unsuitable for layered architectures.
table, meaning that a single output should give the same response as the XOR func- The solution is to change the characteristic of the activation function from a hard-
tion [the third column in Fig. 12.27(a)], then we need one additional perceptron. limiter to a smooth function. Figure 12.29 shows an example based on using the
Figure 12.28(b) shows the architecture for this solution. Here, one perceptron in the activation function
first layer maps any input from one class into a 1, and the other perceptron maps a 1
pattern from the other class into a 0. This reduces the four possible inputs into two h(z) = (12-51)
1 + e− z
outputs, which is a two-point problem. As you know from Fig. 12.24, a single percep-
tron can solve this problem. Therefore, we need three perceptrons to implement the where z is the result of the computation performed by the neuron, as shown in Fig.
XOR table, as in Fig. 12.28(b). 12.29. Except for more complicated notation, and the use of a smooth function rath-
With a little work, we could determine by inspection the coefficients needed to er than a hard threshold, this model performs the same sum-of-products operations
implement either solution in Fig. 12.28. However, rather than dwell on that, we focus as in Eq. (12-36) for the perceptron. Note that the bias term is denoted by b instead
attention in the following section on a more general, layered architecture, of which
the XOR solution is a trivial, special case.
1.0 1.0 6
1
MULTILAYER FEEDFORWARD NEURAL NETWORKS h(z) = h(z) = tanh(z) h(z) = max(0, z)
1 + e− z
1 if z > 0
h (z) = h(z) [1 − h(z)] h$(z) = 1 − [ h(z)]
2
In this section, we discuss the architecture and operation of multilayer neural net- $ 0.5 h$(z) = 
works, and derive the equations of backpropagation used to train them. We then 4 0 if z ≤ 0
give several examples illustrating the capabilities of neural nets
0.5 0.0
Model of an Artificial Neuron
2
Neural networks are interconnected perceptron-like computing elements called − 0.5
artificial neurons. These neurons perform the same computations as the perceptron, Sigmoid tanh ReLu
but they differ from the latter in how they process the result of the computations.
As illustrated in Fig. 12.23, the perceptron uses a “hard” thresholding function that 0.0 − 1.0 0
outputs two values, such as +1 and −1, to perform classification. Suppose that in a −6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6

network of perceptrons, the output before thresholding of one of the perceptrons a b c


is infinitesimally greater than zero. When thresholded, this very small signal will be FIGURE 12.30 Various activation functions. (a) Sigmoid. (b) Hyperbolic tangent (also has a sigmoid shape, but it is
turned into a +1. But a similarly small signal with the opposite sign would cause centered about 0 in both dimensions). (c) Rectifier linear unit (ReLU).


of wn+1 , as we do the perceptron. It is customary to use different notation, typically FIGURE 12.31 a1(& − 1)
General model
b, in neural networks to denote the bias term, so we are following convention. The of a feedforward,
a 2(& − 1)
more complicated notation used in Fig. 12.29, which we will explain shortly, is need- fully connected ..
ed because we will be dealing with multilayer arrangements with several neurons neural net. The .. h
per layer. We use the symbol “&” to denote layers. neuron is the . n& −1

As you can see by comparing Figs. 12.29 and 12.23, we use variable z to denote same as in a j (& − 1) ∑ wij (&) aj (& − 1) ai (&) = h ( zi (&))
Fig. 12.29. Note .. zi (&) =
j =1
the sum-of-products computed by the neuron. The output of the unit, denoted by a, how the output of .. + bi (&)

is obtained by passing z through h. We call h the activation function, and refer to its each neuron goes .
output, a = h(z), as the activation value of the unit. Note in Fig. 12.29 that the inputs to the input of all an&−1 (& − 1)
to a neuron are activation values from neurons in the previous layer. Figure 12.30(a) neurons in the
following layer, 1
shows a plot of h(z) from Eq. (12-51). Because this function has the shape of a sig-
hence the name
moid function, the unit in Fig. 12.29 is sometimes called an artificial sigmoid neuron, fully connected
or simply a sigmoid neuron. Its derivative has a very nice form, expressible in terms for this type of Neuron i in hidden layer & Output ai(&) goes to all neurons in layer & + 1
of h(z) [see Problem 12.16(a)]: architecture. a i ( &)
x1
h′(z) = ∂h(z)/∂z = h(z)[1 − h(z)]        (12-52)

Figures 12.30(b) and (c) show two other forms of h(z) used frequently. The hyper- x2
bolic tangent also has the shape of a sigmoid function, but it is symmetric about both
axes. This property can help improve the convergence of the backpropagation algo-
rithm to be discussed later. The function in Fig. 12.30(c) is called the rectifier func-
tion, and a unit using it is referred to as a rectifier linear unit (ReLU). Often, you see
the function itself referred to as the ReLU activation function. Experimental results x3
suggest that this function tends to outperform the other two in deep neural networks.
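For reference, the three activation functions of Fig. 12.30 and the derivative of Eq. (12-52) can be written in a few lines of NumPy; the function names below are illustrative choices.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))       # Eq. (12-51)

    def sigmoid_prime(z):
        h = sigmoid(z)
        return h * (1.0 - h)                  # Eq. (12-52): h'(z) = h(z)[1 - h(z)]

    def tanh_activation(z):
        return np.tanh(z)                     # Fig. 12.30(b)

    def relu(z):
        return np.maximum(0.0, z)             # Fig. 12.30(c): max(0, z)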

Interconnecting Neurons to Form a Fully Connected Neural Network


Figure 12.31 shows a generic diagram of a multilayer neural network. A layer in the
network is the set of nodes (neurons) in a column of the network. As indicated by
the zoomed node in Fig. 12.31, all the nodes in the network are artificial neurons of
the form shown in Fig. 12.29, except for the input layer, whose nodes are the com- xn
ponents of an input pattern vector x. Therefore, the outputs (activation values) of
layer &
the first layer are the values of the elements of x. The outputs of all other nodes are Layer 1
Hidden Layers
Layer L
the activation values of neurons in a particular layer. Each layer in the network can (Input) (Output)
(The number of nodes in
have a different number of nodes, but each node has a single output. The multiple the hidden layers can be
lines shown at the outputs of the neurons in Fig. 12.31 indicate that the output of different from layer to layer )

every node is connected to the input of all nodes in the next layer, to form a fully
connected network. We also require that there be no loops in the network. Such
networks are called feedforward networks. Fully connected, feedforward neural nets sometimes you will see the words “shallow” and “deep” used subjectively to denote
are the only types of networks considered in this section. networks with a “few” and with “many” layers, respectively.
We obviously know the values of the nodes in the first layer, and we can observe We used the notation in Eq. (12-37) to label all the inputs and weights of a per-
the values of the output neurons. All others are hidden neurons, and the layers that ceptron. In a neural network, the notation is more complicated because we have to
contain them are called hidden layers. Generally, we call a neural net with a single account for neuron weights, inputs, and outputs within a layer, and also from layer
hidden layer a shallow neural network, and refer to network with two or more hid- to layer. Ignoring layer notation for a moment, we denote by wij the weight that
den layers as a deep neural network. However, this terminology is not universal, and associates the link connecting the output of neuron j to the input of neuron i. That is,

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 945 6/16/2017 2:17:16 PM DIP4E_GLOBAL_Print_Ready.indb 946 6/16/2017 2:17:16 PM


12.5 Neural Networks and Deep Learning 947 948 Chapter 12 Image Pattern Classification

the first subscript denotes the neuron that receives the signal, and the second refers FORWARD PASS THROUGH A FEEDFORWARD NEURAL NETWORK
to the neuron that sends the signal. Because i precedes j alphabetically, it would A forward pass through a neural network maps the input layer (i.e., values of x) to
seem to make more sense for i to send and for j to receive. The reason we use the the output layer. The values in the output layer are used for determining the class of
notation as stated is to avoid a matrix transposition in the equation that describes an input vector. The equations developed in this section explain how a feedforward
propagation of signals through the network. This notation is convention, but there is neural network carries out the computations that result in its output. Implicit in the
no doubt that it is confusing, so special care is necessary to keep the notation straight. discussion in this section is that the network parameters (weights and biases) are
Remember, a bias is a
Because the biases depend only on the neuron containing it, a single subscript known. The important results in this section will be summarized in Table 12.2 at the
weight that is always that associates a bias with a neuron is sufficient. For example, we use bi to denote the end of our discussion, but understanding the material that gets us there is important
multiplied by 1. bias value associated with the ith neuron in a given layer of the network. Our use of when we discuss training of neural nets in the next section.
b instead of wn+1 (as we did for perceptrons) follows notational convention used in
neural networks. The weights, biases, and activation function(s) completely define a The Equations of a Forward Pass
neural network. Although the activation function of any neuron in a neural network
could be different from the others, there is no convincing evidence to suggest that The outputs of the layer 1 are the components of input vector x:
there is anything to be gained by doing so. We assume in all subsequent discussions a j (1) = x j j = 1, 2, … , n1 (12-53)
that the same form of activation function is used in all neurons.
Let & denote a layer in the network, for & = 1, 2, … , L. With reference to Fig. 12.31, where n1 = n is the dimensionality of x. As illustrated in Figs. 12.29 and 12.31, the
& = 1 denotes the input layer, & = L is the output layer, and all other values of & computation performed by neuron i in layer & is given by
denote hidden layers. The number of neurons in layer & is denoted n& . We have two
n& − 1
options to include layer indexing in the parameters of a neural network. We can do
it as a superscript, for example, wij& and bi& ; or we can use the notation wij (&) and
zi (&) = ∑
j =1
wij (&) a j (& − 1) + bi (&) (12-54)
bi (&). The first option is more prevalent in the literature on neural network. We use
the second option because it is more consistent with the way we describe iterative for i = 1, 2, … , n& and & = 2, … , L. Quantity zi (&) is called the net (or total) input to
expressions in the book, and also because you may find it easier to follow. Using this neuron i in layer &, and is sometimes denoted by neti . The reason for this terminol-
notation, the output (activation value) of neuron k in layer & is denoted ak (&). ogy is that zi (&) is formed using all outputs from layer & − 1. The output (activation
Keep in mind that our objective in using neural networks is the same as for per- value) of neuron i in layer & is given by
ceptrons: to determine the class membership of unknown input patterns. The most ai (&) = h ( zi (&)) i = 1, 2, … , n& (12-55)
common way to perform pattern classification using a neural network is to assign a
class label to each output neuron. Thus, a neural network with nL outputs can clas- where h is an activation function. The value of network output node i is
sify an unknown pattern into one of nL classes. The network assigns an unknown
pattern vector x to class ck if output neuron k has the largest activation value; that is, ai (L) = h ( zi (L)) i = 1, 2, … , nL (12-56)
if ak (L) > a j (L), j = 1, 2, … , nL ; j ≠ k. †
In this and the following section, the number of outputs of our neural networks Equations (12-53) through (12-56) describe all the operations required to map the
will always equal the number of classes. But this is not a requirement. For instance, a input of a fully connected feedforward network to its output.
network for classifying two pattern classes could be structured with a single output
(Problem 12.17 illustrates such a case) because all we need for this task is two states,
EXAMPLE 12.10 : Illustration of a forward pass through a fully connected neural network.
and a single neuron is capable of that. For three and four classes, we need three and
four states, respectively, which can be achieved with two output neurons. Of course, It will be helpful to consider a simple numerical example. Figure 12.32 shows a three-layer neural network
the problem with this approach is that we would need additional logic to decipher consisting of the input layer, one hidden layer, and the output layer. The network accepts three inputs, and
the output combinations. It is simply more practical to have one neuron per output, has two outputs. Thus, this network is capable of classifying 3-D patterns into one of two classes.
and let the neuron with the highest output value determine the class of the input. The numbers shown above the arrow heads on each input to a node are the weights of that node
associated with the outputs from the nodes in the preceding layer. Similarly, the number shown in the
output of each node is the activation value, a, of that node. As noted earlier, there is only one output

value for each node, but it is routed to the input of every node in the next layer. The inputs associated
Instead of a sigmoid or similar function in the final output layer, you will sometimes see a softmax function used
with the 1’s are bias values.
instead. The concept is the same as we explained earlier, but the activation values in a softmax implementation
are given by ai (L) = exp[zi (L)] ∑ k exp[zi (L)], where the summation is over all outputs. In this formulation, the Let us look at the computations performed at each node, starting with the first (top) node in layer 2.
sum of all activations is 1, thus giving the outputs a probabilistic interpretation. We use Eq. (12-54) to compute the net input, z1(2), for that node:

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 947 6/16/2017 2:17:18 PM DIP4E_GLOBAL_Print_Ready.indb 948 6/16/2017 2:17:19 PM


12.5 Neural Networks and Deep Learning 949 950 Chapter 12 Image Pattern Classification

FIGURE 12.32 Matrix Formulation


x1 3
A small,
fully connected, The details of the preceding example reveal that there are numerous individual
0.1
feedforward 0.7858 0.2 computations involved in a pass through a neural network. If you wrote a computer
0.2 0.6982
net with labeled 0.1 program to automate the steps we just discussed, you would find the code to be very
weights, biases,  x1   3  0.6 inefficient because of all the required loop computations, the numerous node and
  0
and outputs. The x =  x2  =  0  x2 0.4
1 0.6
1
layer indexing you would need, and so forth. We can develop a more elegant (and
activation  x3   1  0.4 0.1
function is a 0.3 computationally faster) implementation by using matrix operations. This means
0.4
sigmoid. 0.8176
0.6694 writing Eqs. (12-53) through (12-55) as follows.
0.1 First, note that the number of outputs in layer 1 is always of the same dimension
1
x3 1 1 as an input pattern, x, so its matrix (vector) form is simple:
0.2 0.3

a(1) = x (12-57)

3 Next, we look at Eq. (12-54). We know that the summation term is just the inner
z1 (2) = ∑ w1j (2) a j (1) + b1 (2) = (0.1)(3) + (0.2)(0) + (0.6)(1) + 0.4 = 1.3 product of two vectors [see Eqs. (12-37) and (12-38)]. However, this equation has
j =1 to be evaluated for all nodes in every layer past the first. This implies that a loop is
We obtain the output of this node using Eqs. (12-51) and (12-55): required if we do the computations node by node. The solution is to form a matrix,
W(&), that contains all the weights in layer &. The structure of this matrix is simple—
1
a1 (2) = h ( z1 (2)) = = 0.7858 each of its rows contains the weights for one of the nodes in layer & :
1 + e −1.3
A similar computation gives the value for the output of the second node in the second layer, With reference to our  w11 (&) w12 (&) " w1n&−1 (&) 
earlier discussion on the
 
 w21 (&) w22 (&) " w2 n&−1 (&) 
order of the subscripts
3
z2 (2) = ∑ w2j (2) a j (1) + b2 (2) = (0.4)(3) + (0.3)(0) + (0.1)(1) + 0.2 = 1.5
i and j, if we had let i W(&) =   (12-58)
be the sending node
# # "
j =1 and j the receiver, this  
and matrix would have to be wn& 1 (&) wn& 2 (&) " wn& n&−1 (&)
transposed.
1
a2 (2) = h ( z2 (2)) = = 0.8176
1 + e −1.5 Then, we can obtain all the sum-of-products computations, zi (&), for layer & simulta-
We use the outputs of the nodes in layer 2 to obtain the net values of the neurons in layer 3: neously:
z(&) = W(&) a(& − 1) + b(&) & = 2, 3, … , L (12-59)
2
z1 (3) = ∑ w1j (3) a j (2) + b1 (3) = (0.2)(0.7858) + (0.1)(0.8176) + 0.6 = 0.8389 where a(& − 1) is a column vector of dimension n&−1 × 1 containing the outputs of
j =1
layer & − 1, b(&) is a column vector of dimension n& × 1 containing the bias values
The output of this neuron is of all the neurons in layer &, and z(&) is an n& × 1 column vector containing the net
1 input values, zi (&), i = 1, 2, … , n& , to all the nodes in layer &. You can easily verify
a1 (3) = h ( z1 (3)) = = 0.6982 that Eq. (12-59) is dimensionally correct.
1 + e −0.8389
Because the activation function is applied to each net input independently of the
Similarly, others, the outputs of the network at any layer can be expressed in vector form as:
2
z2 (3) = ∑ w2j (3) a j (2) + b2 (3) = (0.1)(0.7858) + (0.4)(0.8176) + 0.3 = 0.7056  h ( z1 (&)) 
j =1  
and  h ( z2 (&)) 
a(&) = h [ z(&)] =   (12-60)
1 #
a2 (3) = h ( z2 (2)) = = 0.6694  
1 + e −0.7056 (
 h zn& (&) 
 )
If we were using this network to classify the input, we would say that pattern x belongs to class c1
because a1(L) > a2 (L), where L = 3 and nL = 2 in this case. Implementing Eqs. (12-57) through (12-60) requires just a series of matrix opera-
tions, with no loops.
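The matrix operations of Eqs. (12-57) through (12-60) can be verified with a few lines of NumPy. The sketch below repeats the forward pass of Example 12.11 using the parameters of Fig. 12.33; the variable names are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))      # Eq. (12-51)

    W2 = np.array([[0.1, 0.2, 0.6],
                   [0.4, 0.3, 0.1]])         # W(2)
    b2 = np.array([0.4, 0.2])                # b(2)
    W3 = np.array([[0.2, 0.1],
                   [0.1, 0.4]])              # W(3)
    b3 = np.array([0.6, 0.3])                # b(3)

    a1 = np.array([3.0, 0.0, 1.0])           # a(1) = x, Eq. (12-57)
    z2 = W2 @ a1 + b2                        # Eq. (12-59), layer 2
    a2 = sigmoid(z2)                         # Eq. (12-60): approx. [0.7858, 0.8176]
    z3 = W3 @ a2 + b3                        # Eq. (12-59), layer 3
    a3 = sigmoid(z3)                         # approx. [0.6982, 0.6694]
    print(a2, a3)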


FIGURE 12.33  3 capable of processing all patterns in a single forward pass. Extending Eqs. (12-57)
 0.1 0.2 0.6   0.2 0.1
Same as Fig. 12.32, a(1) = x =  0  W(2) =   W(3) =   through (12-60) to this more general formulation is straightforward. We begin by
but using matrix  0.4 0.3 0.1  0.1 0.4 
labeling.  1  arranging all our input pattern vectors as columns of a single matrix, X, of dimension
 0.4   0.6 
x1 b(2) =   b(3) =   n × n p where, as before, n is the dimensionality of the vectors and n p is the number
 0.2   0.3 
of pattern vectors. It follows from Eq. (12-57) that

A(1) = X (12-61)
 0.6982 
x2 a(3) =  
 0.6694  where each column of matrix A(1) contains the initial activation values (i.e., the vec-
tor values) for one pattern. This is a straightforward extension of Eq. (12-57), except
that we are now dealing with an n × n p matrix instead of an n × 1 vector.
 0.7858 
a(2) =   The parameters of a network do not change because we are processing more
x3  0.8176  pattern vectors, so the weight matrix is as given in Eq. (12-58). This matrix is of size
n& × n& −1 . When & = 2, we have that W(2) is of size n2 × n, because n1 is always equal
EXAMPLE 12.11 : Redoing Example 12.10 using matrix operations. to n. Then, extending the product term of Eq. (12-59) to use A(2) instead of a(2),
results in the matrix product W(2)A(2), which is of size (n2 × n)(n × n p ) = n2 × n p .
Figure 12.33 shows the same neural network as in Fig. 12.32, but with all its parameters shown in matrix
form. As you can see, the representation in Fig. 12.33 is more compact. Starting with To this, we have to add the bias vector for the second layer, which is of size n2 × 1.
Obviously, we cannot add a matrix of size n2 × n p and a vector of size n2 × 1. How-
 3 ever, as is true of the weight matrices, the bias vectors do not change because we
a(1) =  0  are processing more pattern vectors. We just have to account for one identical bias
 1  vector, b(2), per input vector. We do this by creating a matrix B(2) of size n2 × n p ,
formed by concatenating column vector b(2) n p times, horizontally. Then, Eq. (12-59)
it follows that
 3 written in matrix becomes Z(2) = W(2)A(1) + B(2). Matrix Z(2) is of size n2 × n p ; it
 0.1 0.2 0.6     0.4  1.3 contains the computation performed by Eq. (12-59), but for all input patterns. That
z(2) = W(2) a(1) + b(2) =    0  +  0.2  = 1.5 
 0.4 0.3 0.1       is, each column of Z(2) is exactly the computation performed by Eq. (12-59) for one
1 input pattern.
Then,
The concept just discussed applies to the transition from any layer to the next
 h ( z1 (2))   h(1.3)  0.7858  in the neural network, provided that we use the weights and bias appropriate for a
a(2) = h [ z(2)] =  = = 
 h ( z2 (2))  h(1.5)  0.8176  particular location in the network. Therefore, the full matrix version of Eq. (12-59) is
With a(2) as input to the next layer, we obtain
Z(&) = W(&)A(& − 1) + B(&) (12-62)
 0.2 0.1  0.7858   0.6   0.8389 
z(3) = W(3) a(2) + b(3) =    +  0.3  =  0.7056 
 0.1 0.4   0.8176      where W(&) is given by Eq. (12-58) and B(&) is an n& × n p matrix whose columns are
duplicates of b(&), the bias vector containing the biases of the neurons in layer &.
and, as before,
All that remains is the matrix formulation of the output of layer &. As Eq. (12-60)
 h ( z1 (3))   h(0.8389)  0.6982  shows, the activation function is applied independently to each element of the vec-
a(3) = h [ z(3)] =  = = 
 h ( z2 (3))  h(0.7056)  0.6694 
tor z(&). Because each column of Z(&) is simply the application of Eq. (12-60) cor-
responding to a particular input vector, it follows that
The clarity of the matrix formulation over the indexed notation used in Example 12.10 is evident.
A(&) = h [ Z(&)] (12-63)
Equations (12-57) through (12-60) are a significant improvement over node-by-
node computations, but they apply only to one pattern. To classify multiple pat- where activation function h is applied to each element of matrix Z(&).
tern vectors, we would have to loop through each pattern using the same set of Summarizing the dimensions in our matrix formulation, we have: X and A(1)
matrix equations per loop iteration. What we are after is one set of matrix equations are of size n × n p , Z(&) is of size n& × n p , W(&) is of size n& × n& −1 , A(& − 1) is of
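Processing all patterns at once, as in Eqs. (12-61) through (12-63), requires only that the input vectors be stacked as columns of a matrix. The following sketch assumes the parameters of Fig. 12.33 and three made-up input patterns; the function and variable names are illustrative.

    import numpy as np

    def sigmoid(Z):
        return 1.0 / (1.0 + np.exp(-Z))

    def forward_pass(X, weights, biases):
        # X is n x np, one pattern vector per column (A(1) = X, Eq. (12-61)).
        # weights and biases hold W(l) and b(l) for l = 2, ..., L.
        A = X
        for W, b in zip(weights, biases):
            Z = W @ A + b.reshape(-1, 1)     # Eq. (12-62); broadcasting repeats b(l)
                                             # across the columns, i.e., it forms B(l)
            A = sigmoid(Z)                   # Eq. (12-63)
        return A                             # A(L): one column of outputs per pattern

    W2 = np.array([[0.1, 0.2, 0.6], [0.4, 0.3, 0.1]]); b2 = np.array([0.4, 0.2])
    W3 = np.array([[0.2, 0.1], [0.1, 0.4]]);           b3 = np.array([0.6, 0.3])
    X = np.array([[3.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 0.0]])          # three 3-D patterns, one per column
    print(forward_pass(X, [W2, W3], [b2, b3]))   # a 2 x 3 matrix A(L)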

www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 951 6/16/2017 2:17:24 PM DIP4E_GLOBAL_Print_Ready.indb 952 6/16/2017 2:17:27 PM


12.5 Neural Networks and Deep Learning 953 954 Chapter 12 Image Pattern Classification

TABLE 12.2 that minimize an error (also called cost or objective) function. Our interest is in
Steps in the matrix computation of a forward pass through a fully connected, feedforward multilayer neural net. classification performance, so we define the error function for a neural network as
Step Description Equations the average of the differences between desired and actual responses. Let r denote
Step 1 Input patterns A(1) = X
the desired response for a given pattern vector, x, and let a(L) denote the actu-
al response of the network to that input. For example, in a ten-class recognition
Step 2 Feedforward For & = 2, … , L, compute Z(&) = W(&)A(& − 1) + B(&) and A(&) = h ( Z(&)) application, r and a(L) would be 10-D column vectors. The ten components of a(L)
would be the ten outputs of the neural network, and the components of r would be
Step 3 Output A(L) = h ( Z(L))
zero, except for the element corresponding to the class of x, which would be 1. For
example, if the input training pattern belongs to class 6, the 6th element of r would
be 1 and the rest would be 0’s.
size n&−1 × n p , B(&) is of size n& × n p , and A(&) is of size n& × n p . Table 12.2 summa- The activation values of neuron j in the output layer is a j (L). We define the error
rizes the matrix formulation for the forward pass through a fully connected, feed- of that neuron as
forward neural network for all pattern vectors. Implementing these operations in a
matrix-oriented language like MATLAB is a trivial undertaking. Performance can 1
( )
2
Ej = rj − a j (L) (12-64)
be improved significantly by using dedicated hardware, such as one or more graphics 2
processing units (GPUs).
The equations in Table 12.2 are used to classify each of a set of patterns into one for j = 1, 2, … , nL , where rj is the desired response of output neuron a j (L) for a
of nL pattern classes. Each column of output matrix A(L) contains the activation given pattern x. The output error with respect to a single x is the sum of the errors of
values of the nL output neurons for a specific pattern vector. The class membership all output neurons with respect to that vector:
of that pattern is given by the location of the output neuron with the highest activa-
nL
1 nL
tion value. Of course, this assumes we know the weights and biases of the network.
( )
2
E = ∑ Ej = ∑ rj − aj (L)
These are obtained during training using backpropagation, as we explain next. See Eqs. (2-50) and
j =1 2 j =1
(2-51) regarding the (12-65)
Euclidean vector norm. 1
USING BACKPROPAGATION TO TRAIN DEEP NEURAL NETWORKS = ! r − a(L) !2
2
A neural network is defined completely by its weights, biases, and activation func-
tion. Training a neural network refers to using one or more sets of training patterns where the second line follows from the definition of the Euclidean vector norm. The
to estimate these parameters. During training, we know the desired response of total network output error over all training patterns is defined as the sum of the errors
every output neuron of a multilayer neural net. However, we have no way of know- When the meaning is
of the individual patterns. We want to find the weights that minimize this total error.
ing what the values of the outputs of hidden neurons should be. In this section, we clear, we sometimes As we did for the LMSE perceptron, we find the solution using gradient descent.
develop the equations of backpropagation, the tool of choice for finding the value include the bias term in
the word “weights.”
However, unlike the perceptron, we have no way for computing the gradients of the
of the weights and biases in a multilayer network. This training by backpropaga- weights in the hidden nodes. The beauty of backpropagation is that we can achieve an
tion involves four basic steps: (1) inputting the pattern vectors; (2) a forward pass equivalent result by propagating the output error back into the network.
through the network to classify all the patterns of the training set and determine the The key objective is to find a scheme to adjust all weights in a network using train-
classification error; (3) a backward (backpropagation) pass that feeds the output ing patterns. In order to do this, we need to know how E changes with respect to the
error back through the network to compute the changes required to update the weights in the network. The weights are contained in the expression for the net input
parameters; and (4) updating the weights and biases in the network. These steps are to each node [see Eq. (12-54)], so the quantity we are after is ∂E ∂zj (&) where, as
repeated until the error reaches an acceptable level. We will provide a summary of defined in Eq. (12-54), zj (&) is the net input to node j in layer &. In order to simplify
all principal results derived in this section at the end of the discussion (see Table the notation later, we use the symbol d j (&) to denote ∂E ∂zj (&). Because backpropa-
12.3). As you will see shortly, the principal mathematical tool needed to derive the gation starts with the output and works backward from there, we look first at
We use “j” generically
equations of backpropagation is the chain rule from basic calculus. to mean any node in the
∂E
network. We are not d j (L) = (12-66)
The Equations of Backpropagation
concerned at the moment ∂zj (L)
with inputs to, or outputs
from, a node.
Given a set of training patterns and a multilayer feedforward neural network archi- We can express this equation in terms of the output a j (L) using the chain rule:
tecture, the approach in the following discussion is to find the network parameters
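As a brief illustration of Eqs. (12-64) and (12-65), the sketch below builds the desired-response vectors r as one-hot columns and evaluates the quadratic error for one pattern. It is not code from the book; the helper names and the zero-based class indexing are assumptions.

    import numpy as np

    def one_hot(labels, num_classes):
        # Columns of R: a 1 in the row of the true class of each pattern, 0 elsewhere.
        R = np.zeros((num_classes, len(labels)))
        R[labels, np.arange(len(labels))] = 1.0
        return R

    def output_error(r, aL):
        # E = (1/2) ||r - a(L)||^2 for a single pattern, Eq. (12-65).
        return 0.5 * np.sum((r - aL) ** 2)

    r = one_hot([5], 10)          # "class 6" of ten classes (row index 5 when counting from 0)
    print(output_error(r, np.zeros((10, 1))))   # 0.5 for an all-zero output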


The key objective is to find a scheme to adjust all weights in a network using training patterns. In order to do this, we need to know how E changes with respect to the weights in the network. The weights are contained in the expression for the net input to each node [see Eq. (12-54)], so the quantity we are after is ∂E/∂zj(ℓ) where, as defined in Eq. (12-54), zj(ℓ) is the net input to node j in layer ℓ. In order to simplify the notation later, we use the symbol δj(ℓ) to denote ∂E/∂zj(ℓ). Because backpropagation starts with the output and works backward from there, we look first at

δj(L) = ∂E/∂zj(L)    (12-66)

(We use "j" generically to mean any node in the network. We are not concerned at the moment with inputs to, or outputs from, a node.) We can express this equation in terms of the output aj(L) using the chain rule:

δj(L) = ∂E/∂zj(L) = [∂E/∂aj(L)][∂aj(L)/∂zj(L)] = [∂E/∂aj(L)][∂h(zj(L))/∂zj(L)] = [∂E/∂aj(L)] h′(zj(L))    (12-67)

where we used Eq. (12-56) to write aj(L) as h(zj(L)). This equation gives us the value of δj(L) in terms of quantities that can be observed or computed. For example, if we use Eq. (12-64) as our error measure, and Eq. (12-52) for h′(zj(L)), then

δj(L) = h(zj(L))[1 − h(zj(L))][aj(L) − rj]    (12-68)

where we interchanged the order of the terms. The h(zj(L)) are computed in the forward pass, aj(L) can be observed in the output of the network, and rj is given along with x during training. Therefore, we can compute δj(L).

Because the relationship between the net input and the output of any neuron in any layer (except the first) is the same, the form of Eq. (12-66) is valid for any node j in any hidden layer:

δj(ℓ) = ∂E/∂zj(ℓ)    (12-69)

This equation tells us how E changes with respect to a change in the net input to any neuron in the network. What we want to do next is express δj(ℓ) in terms of δj(ℓ + 1). Because we will be proceeding backward in the network, this means that if we have this relationship, then we can start with δj(L) and find δj(L − 1). We then use this result to find δj(L − 2), and so on until we arrive at layer 2. We obtain the desired expression using the chain rule (see Problem 12.25):

δj(ℓ) = ∂E/∂zj(ℓ) = Σ_i [∂E/∂zi(ℓ + 1)][∂zi(ℓ + 1)/∂aj(ℓ)][∂aj(ℓ)/∂zj(ℓ)]
      = Σ_i δi(ℓ + 1)[∂zi(ℓ + 1)/∂aj(ℓ)] h′(zj(ℓ))
      = h′(zj(ℓ)) Σ_i wij(ℓ + 1) δi(ℓ + 1)    (12-70)

for ℓ = L − 1, L − 2, …, 2, where we used Eqs. (12-55) and (12-69) to obtain the middle line, and Eq. (12-54), plus some rearranging, to obtain the last line.

The preceding development tells us how we can start with the error in the output (which we can compute) and obtain how that error changes as a function of the net inputs to every node in the network. This is an intermediate step toward our final objective, which is to obtain expressions for ∂E/∂wij(ℓ) and ∂E/∂bi(ℓ) in terms of δj(ℓ) = ∂E/∂zj(ℓ). For this, we use the chain rule again:

∂E/∂wij(ℓ) = [∂E/∂zi(ℓ)][∂zi(ℓ)/∂wij(ℓ)] = δi(ℓ)[∂zi(ℓ)/∂wij(ℓ)] = aj(ℓ − 1) δi(ℓ)    (12-71)

where we used Eq. (12-54), Eq. (12-69), and interchanged the order of the results to clarify matrix formulations later in our discussion. Similarly (see Problem 12.26),

∂E/∂bi(ℓ) = δi(ℓ)    (12-72)

Now we have the rate of change of E with respect to the network weights and biases in terms of quantities we can compute. The last step is to use these results to update the network parameters using gradient descent:

wij(ℓ) = wij(ℓ) − α ∂E/∂wij(ℓ) = wij(ℓ) − α δi(ℓ) aj(ℓ − 1)    (12-73)

and

bi(ℓ) = bi(ℓ) − α ∂E/∂bi(ℓ) = bi(ℓ) − α δi(ℓ)    (12-74)

for ℓ = L − 1, L − 2, …, 2, where the a's are computed in the forward pass, and the δ's are computed during backpropagation. As with the perceptron, α is the learning rate constant used in gradient descent. There are numerous approaches that attempt to find optimal learning rates, but ultimately this is a problem-dependent parameter that must be determined experimentally. A reasonable approach is to start with a small value of α (e.g., 0.01), then experiment with vectors from the training set to determine a suitable value in a given application. Remember, α is used only during training, so it has no effect on post-training operating performance.
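The sketch below applies Eqs. (12-68) and (12-70) through (12-74) to a single pattern, using the sigmoid activation so that h′(z) = h(z)[1 − h(z)]. It is a minimal illustration under our own naming conventions (lists W and b indexed by layer, as in the forward-pass sketch), not the book's implementation.

    import numpy as np

    def backprop_one_pattern(x, r, W, b, alpha,
                             h=lambda z: 1.0 / (1.0 + np.exp(-z))):
        # Forward pass, keeping the activations of every layer (a(1) = x).
        a = [x]
        for Wl, bl in zip(W, b):
            a.append(h(Wl @ a[-1] + bl))
        # Output-layer deltas, Eq. (12-68); for the sigmoid, h'(z) = a(1 - a).
        delta = a[-1] * (1 - a[-1]) * (a[-1] - r)
        # Move backward through the layers.
        for l in reversed(range(len(W))):
            grad_W = delta @ a[l].T              # dE/dw_ij = a_j(l-1) d_i(l), Eq. (12-71)
            grad_b = delta                       # dE/db_i  = d_i(l), Eq. (12-72)
            if l > 0:                            # deltas of the previous layer, Eq. (12-70)
                delta = a[l] * (1 - a[l]) * (W[l].T @ delta)
            W[l] = W[l] - alpha * grad_W         # Eq. (12-73)
            b[l] = b[l] - alpha * grad_b         # Eq. (12-74)
        return W, b

Note that the new deltas are computed with the weights of the layer before those weights are updated, which is the order implied by the derivation above.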


Matrix Formulation

As with the equations that describe the forward pass through a neural network, the equations of backpropagation developed in the previous discussion are excellent for describing how the method works at a fundamental level, but they are clumsy when it comes to implementation. In this section, we follow a procedure similar to the one we used for the forward pass to develop the matrix equations for backpropagation.

As before, we arrange all the pattern vectors as columns of matrix X, and package the weights of layer ℓ as matrix W(ℓ). We use D(ℓ) to denote the matrix equivalent of δ(ℓ), the vector containing the errors in layer ℓ. Our first step is to find an expression for D(L). We begin at the output and proceed backward, as before. From Eq. (12-67),

δ(L) = [δ1(L); δ2(L); … ; δnL(L)]
     = [ (∂E/∂a1(L)) h′(z1(L)); (∂E/∂a2(L)) h′(z2(L)); … ; (∂E/∂anL(L)) h′(znL(L)) ]
     = [∂E/∂a1(L); ∂E/∂a2(L); … ; ∂E/∂anL(L)] ⊙ [h′(z1(L)); h′(z2(L)); … ; h′(znL(L))]    (12-75)

where, as defined in Section 2.6, "⊙" denotes elementwise multiplication (of two vectors in this case). We can write the vector on the left of this symbol as ∂E/∂a(L), and the vector on the right as h′(z(L)). Then, we can write Eq. (12-75) as

δ(L) = [∂E/∂a(L)] ⊙ h′(z(L))    (12-76)

where a(L) is the nL × 1 column vector containing the activation values of all the output neurons for one pattern vector. The only error function we use in this chapter is a quadratic function, which is given in vector form in Eq. (12-65). The partial of that quadratic function with respect to a(L) is (a(L) − r) which, when substituted into Eq. (12-76), gives us

δ(L) = (a(L) − r) ⊙ h′(z(L))    (12-77)

Column vector δ(L) accounts for one pattern vector. To account for all np patterns simultaneously we form a matrix D(L), whose columns are the δ(L) from Eq. (12-77), evaluated for a specific pattern vector. This is equivalent to writing Eq. (12-77) directly in matrix form as

D(L) = (A(L) − R) ⊙ h′(Z(L))    (12-78)

Each column of A(L) is the network output for one pattern. Similarly, each column of R is a binary vector with a 1 in the location corresponding to the class of a particular pattern vector, and 0's elsewhere, as explained earlier. Each column of the difference (A(L) − R) contains the components of ‖a − r‖. Therefore, squaring the elements of a column, adding them, and dividing by 2 is the same as computing the error measure defined in Eq. (12-65), for one pattern. Adding all the column computations gives an average measure of error for all the patterns. Similarly, the columns of matrix h′(Z(L)) are values of the net inputs to all output neurons, with each column corresponding to one pattern vector. All matrices in Eq. (12-78) are of size nL × np.

Following a similar line of reasoning, we can express Eq. (12-70) in matrix form as

D(ℓ) = [Wᵀ(ℓ + 1) D(ℓ + 1)] ⊙ h′(Z(ℓ))    (12-79)

It is easily confirmed by dimensional analysis that the matrix D(ℓ) is of size nℓ × np (see Problem 12.27). Note that Eq. (12-79) uses the weight matrix transposed. This reflects the fact that the inputs to layer ℓ are coming from layer ℓ + 1, because in backpropagation we move in the direction opposite of a forward pass.

We complete the matrix formulation by expressing the weight and bias update equations in matrix form. Considering the weight matrix first, we can tell from Eqs. (12-70) and (12-73) that we are going to need matrices W(ℓ), D(ℓ), and A(ℓ − 1). We already know that W(ℓ) is of size nℓ × nℓ−1 and that D(ℓ) is of size nℓ × np. Each column of matrix A(ℓ − 1) is the set of outputs of the neurons in layer ℓ − 1 for one pattern vector. There are np patterns, so A(ℓ − 1) is of size nℓ−1 × np. From Eq. (12-73) we infer that A post-multiplies D, so we are also going to need Aᵀ(ℓ − 1), which is of size np × nℓ−1. Finally, recall that in a matrix formulation, we construct a matrix B(ℓ) of size nℓ × np whose columns are copies of vector b(ℓ), which contains all the biases in layer ℓ.

Next, we look at updating the biases. We know from Eq. (12-74) that each element bi(ℓ) of b(ℓ) is updated as bi(ℓ) = bi(ℓ) − α δi(ℓ), for i = 1, 2, …, nℓ. Therefore, b(ℓ) = b(ℓ) − α δ(ℓ). But this is for one pattern, and the columns of D(ℓ) are the δ(ℓ)'s for all patterns in the training set. This is handled in a matrix formulation by using the average of the columns of D(ℓ) (this is the average error over all patterns) to update b(ℓ).

Putting it all together results in the following two equations for updating the network parameters:

W(ℓ) = W(ℓ) − α D(ℓ) Aᵀ(ℓ − 1)    (12-80)

and

b(ℓ) = b(ℓ) − α Σ_{k=1}^{np} δk(ℓ)    (12-81)

where δk(ℓ) is the kth column of matrix D(ℓ). As before, we form matrix B(ℓ) of size nℓ × np by concatenating b(ℓ) np times in the horizontal direction:

B(ℓ) = concatenate{b(ℓ)}  (np times)    (12-82)

As we mentioned earlier, backpropagation consists of four principal steps: (1) inputting the patterns, (2) a forward pass, (3) a backpropagation pass, and (4) a parameter update step. The process begins by specifying the initial weights and biases as (small) random numbers. Table 12.3 summarizes the matrix formulations of these four steps. During training, these steps are repeated for a number of specified epochs, or until a predefined measure of error is deemed to be small enough.
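The sketch below strings Eqs. (12-78) through (12-81) together into one training epoch over all patterns at once, following the steps summarized in Table 12.3. It assumes the sigmoid activation (so h′ can be formed from the activations) and uses our own function and variable names; it is an illustration, not the book's code.

    import numpy as np

    def train_epoch(X, R, W, b, alpha, h=lambda z: 1.0 / (1.0 + np.exp(-z))):
        # Steps 1-2: forward pass for all np patterns (A(1) = X).
        A = [X]
        for Wl, bl in zip(W, b):
            A.append(h(Wl @ A[-1] + bl))            # Z(l) then A(l) = h(Z(l))
        hprime = [None] + [Al * (1 - Al) for Al in A[1:]]   # h'(Z(l)) for the sigmoid
        D = (A[-1] - R) * hprime[-1]                # D(L) = (A(L) - R) ⊙ h'(Z(L)), Eq. (12-78)
        # Steps 3-4: backpropagate the deltas and update the parameters.
        for l in reversed(range(len(W))):
            gW = D @ A[l].T                         # D(l) A^T(l-1)
            gb = D.sum(axis=1, keepdims=True)       # sum of the columns of D(l)
            if l > 0:
                D = (W[l].T @ D) * hprime[l]        # Eq. (12-79), using the pre-update weights
            W[l] = W[l] - alpha * gW                # Eq. (12-80)
            b[l] = b[l] - alpha * gb                # Eq. (12-81)
        return W, b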


TABLE 12.3
Matrix formulation for training a feedforward, fully connected multilayer neural network using backpropagation. Steps 1–4 are for one epoch of training. X, R, and the learning rate parameter α are provided to the network for training. The network is initialized by specifying weights, W(1), and biases, B(1), as small random numbers.

Step 1   Input patterns              A(1) = X
Step 2   Forward pass                For ℓ = 2, …, L, compute: Z(ℓ) = W(ℓ)A(ℓ − 1) + B(ℓ); A(ℓ) = h(Z(ℓ)); h′(Z(ℓ)); and D(L) = (A(L) − R) ⊙ h′(Z(L))
Step 3   Backpropagation             For ℓ = L − 1, L − 2, …, 2, compute D(ℓ) = [Wᵀ(ℓ + 1)D(ℓ + 1)] ⊙ h′(Z(ℓ))
Step 4   Update weights and biases   For ℓ = 2, …, L, let W(ℓ) = W(ℓ) − α D(ℓ)Aᵀ(ℓ − 1), b(ℓ) = b(ℓ) − α Σ_{k=1}^{np} δk(ℓ), and B(ℓ) = concatenate{b(ℓ)} (np times), where the δk(ℓ) are the columns of D(ℓ)

There are two major types of errors in which we are interested. One is the classification error, which we compute by counting the number of patterns that were misclassified and dividing by the total number of patterns in the training set. Multiplying the result by 100 gives the percentage of patterns misclassified. Subtracting the result from 1 and multiplying by 100 gives the percent correct recognition. The other is the mean squared error (MSE), which is based on actual values of E. For the error defined in Eq. (12-65), this value is obtained (for one pattern) by squaring the elements of a column of the matrix (A(L) − R), adding them, and dividing the result by 2 (see Problem 12.28). Repeating this operation for all columns and dividing the result by the number of patterns in X gives the MSE over the entire training set.

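Both error measures are easy to evaluate from the output matrix A(L) and the membership matrix R. The sketch below is one way to do it; the function names are our own.

    import numpy as np

    def classification_error(AL, R):
        # Fraction of patterns whose largest output is not in the row of the true class.
        predicted = np.argmax(AL, axis=0)
        true_class = np.argmax(R, axis=0)
        return np.mean(predicted != true_class)        # multiply by 100 for a percentage

    def mean_squared_error(AL, R):
        # Per column: square the elements of (A(L) - R), add them, divide by 2;
        # then average over the number of patterns.
        per_pattern = 0.5 * np.sum((AL - R) ** 2, axis=0)
        return np.mean(per_pattern)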

EXAMPLE 12.12: Using a fully connected neural net to solve the XOR problem.

Figure 12.34(a) shows the XOR classification problem discussed previously (the coordinates were chosen to center the patterns for convenience in indexing, but the spatial relationships are as before). Pattern matrix X and class membership matrix R are:

X = [1  −1  −1  1;  1  −1  1  −1]    and    R = [1  1  0  0;  0  0  1  1]

We specified a neural network having three layers, with two nodes each (see Fig. 12.35). This is the smallest network consistent with our architecture in Fig. 12.31. Comparing it to the minimum perceptron arrangements in Fig. 12.28(a), we see that our neural network performs the same basic function, in the sense that it has two inputs and two outputs.

We used α = 1.0, an initial set of Gaussian random weights of zero mean and standard deviation of 0.02, and the activation function in Eq. (12-51). We then trained the network for 10,000 epochs (we used a large number of epochs to get close to the values in R; we discuss below solutions with fewer epochs). The resulting weights and biases were:

W(2) = [4.792  4.792;  4.486  4.486];   b(2) = [4.590;  −4.486];   W(3) = [−9.180  9.429;  9.178  −9.427];   b(3) = [4.420;  −4.419]

Figure 12.35 shows the neural net based on these values.

FIGURE 12.35 Neural net used to solve the XOR problem, showing the weights and biases learned via training using the equations in Table 12.3.

When presented with the four training patterns after training was completed, the results at the two outputs should have been equal to the values in R. Instead, the values were close:

A(3) = [0.987  0.990  0.010  0.010;  0.013  0.010  0.990  0.990]

These weights and biases, along with the sigmoid activation function, completely specify our trained neural network. To test its performance with values other than the training patterns, which we know it classifies correctly, we created a set of 2-D test patterns by subdividing the pattern space into increments of 0.1, from −1.5 to 1.5 in both directions, and classified the resulting points using a forward pass through the network. If the activation value of output node 1 was greater than the activation value of output node 2, the pattern was assigned to class c1; otherwise, it was assigned to class c2. Figure 12.34(b) is a plot of the results. Solid dots are points classified into class c1, and white dots were classified as belonging to class c2. The boundaries between these two regions (shown as solid black lines) are precisely the boundaries in Fig. 12.27(b). Thus, our small neural network found the simplest boundary between the two classes, performing the same function as the perceptron arrangement in Fig. 12.28(a).

FIGURE 12.34 Neural net solution to the XOR problem. (a) Four patterns in an XOR arrangement. (b) Results of classifying additional points in the range −1.5 to 1.5 in increments of 0.1. All solid points were classified as belonging to class c1 and all open circles were classified as belonging to class c2. Together, the two lines separating the regions constitute the decision boundary [compare with Fig. 12.27(b)]. (c) Decision surface, shown as a mesh. The decision boundary is the pair of dashed, white lines in the intersection of the surface and a plane perpendicular to the vertical axis, intersecting that axis at 0.5. (Figure (c) is shown in a different perspective than (b) in order to make all four patterns visible.)

Figure 12.34(c) shows the decision surface. This figure is analogous to Fig. 12.24(b), but it intersects the plane twice because the patterns are not linearly separable. Our decision boundary is the intersection of the decision surface with a plane perpendicular to the vertical axis, and intersecting that axis at 0.5. This is because the values of the output nodes are in the range [0, 1], and we assign a pattern to the class for which one of the two outputs had the largest value. The plane is shown shaded in the figure, and the decision boundary is shown as dashed white lines. We adjusted the viewing perspective of Fig. 12.34(c) so you can see all the XOR points.

Because classification in this case is based on selecting the largest output, we do not need the outputs to be so close to 1 and 0 as we showed above, provided they are greater for the patterns of class c1 and conversely for the patterns of class c2. This means that we can train the network using fewer epochs and still achieve correct recognition. For example, correct classification of the XOR patterns can be achieved using the parameters learned with as few as 150 epochs. Figure 12.36 shows the reason why this is possible.

FIGURE 12.36 MSE as a function of training epochs for the XOR pattern arrangement.

By the end of the 1000th epoch, the mean squared error has decreased almost to zero, so we would expect it to decrease very little from there to 10,000 epochs. We know from the preceding results that the neural net performed flawlessly using the weights learned with 10,000 epochs. Because the error for 1,000 and 10,000 epochs is close, we can expect the weights to be close as well. At 150 epochs, the error has decreased by close to 90% from its maximum, so the probability that the weights would perform well should be reasonably high, which was true in this case.

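A usage sketch in the spirit of this example, built on the train_epoch, forward_pass, and classification_error helpers sketched earlier, is shown below. The initialization and training loop are our assumptions, chosen to be consistent with the parameters stated in the example (α = 1.0, Gaussian weights of standard deviation 0.02, 10,000 epochs); they are not code from the book.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[1., -1., -1., 1.],
                  [1., -1., 1., -1.]])          # pattern matrix X from Example 12.12
    R = np.array([[1., 1., 0., 0.],
                  [0., 0., 1., 1.]])            # class membership matrix R
    sizes = [2, 2, 2]                           # three layers with two nodes each
    W = [0.02 * rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(2)]
    b = [0.02 * rng.standard_normal((sizes[i + 1], 1)) for i in range(2)]

    for epoch in range(10000):
        W, b = train_epoch(X, R, W, b, alpha=1.0)

    A_out = forward_pass(X, W, b)[-1]
    print(classification_error(A_out, R))       # typically 0.0 once training has converged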

EXAMPLE 12.13: Using neural nets to classify multispectral image data.

In this example, we compare the recognition performance of the Bayes classifier we discussed in Section 12.4 and the multilayer neural nets discussed in this section. The objective here is the same as in Example 12.6: to classify the pixels of multispectral image data into three pattern classes: water, urban, and vegetation. Figure 12.37 shows the four multispectral images used in the experiment, the masks used to extract the training and test samples, and the approach used to generate the 4-D pattern vectors.

FIGURE 12.37 (a) Starting with the leftmost image: blue, green, red, near infrared, and binary mask images. In the mask, the lower region is for water, the center region is for the urban area, and the left mask corresponds to vegetation. All images are of size 512 × 512 pixels. (b) Approach used for generating 4-D pattern vectors x = [x1; x2; x3; x4] from a stack of the four multispectral images (spectral bands 1–4). (Multispectral images courtesy of NASA.)

As in Example 12.6, we extracted a total of 1900 training pattern vectors and 1887 test pattern vectors (see Table 12.1 for a listing of vectors by class). After preliminary runs with the training data to establish that the mean squared error was decreasing as a function of epoch, we determined that a neural net with one hidden layer of two nodes achieved stable learning with α = 0.001 and 1,000 training epochs. Keeping those two parameters fixed, we varied the number of nodes in the internal layer, as listed in Table 12.4. The objective of these preliminary runs was to determine the smallest neural net that would give the best recognition rate. As you can see from the results in the table, [4 3 3] is clearly the architecture of choice in this case. Figure 12.38 shows this neural net, along with the parameters learned during training.

TABLE 12.4
Recognition rate as a function of neural net architecture for α = 0.001 and 1,000 training epochs. The network architecture is defined by the numbers in brackets. The first and last number inside each bracket refer to the number of input and output nodes, respectively. The inner entries give the number of nodes in each hidden layer.

Network architecture:  [4 2 3]  [4 3 3]  [4 4 3]  [4 5 3]  [4 2 2 3]  [4 4 3 3]  [4 4 4 3]  [4 10 3 3]  [4 10 10 3]
Recognition rate:       95.8%    96.2%    95.9%    96.1%    74.6%      90.8%      87.1%      84.9%       89.7%

After the basic architecture was defined, we kept the learning rate constant at α = 0.001 and varied the number of epochs to determine the best recognition rate with the architecture in Fig. 12.38. Table 12.5 shows the results. As you can see, the recognition rate improved slowly as a function of epoch, reaching a plateau at around 50,000 epochs. In fact, as Fig. 12.39 shows, the MSE decreased quickly up to about 800 training epochs and decreased slowly after that, explaining why the correct recognition rate changed so little after about 2,000 epochs. Similar results were obtained with α = 0.01, but decreasing this parameter to α = 0.1 resulted in a drop of the best correct recognition rate to 49.1%. Based on the preceding results, we used α = 0.001 and 50,000 epochs to train the network.

FIGURE 12.38 Neural net architecture used to classify the multispectral image data in Fig. 12.37 into three classes: water, urban, and vegetation. The parameters shown were obtained in 50,000 epochs of training using α = 0.001:

W(2) = [2.393  1.020  1.249  −15.965;  6.599  −2.705  −0.912  14.928;  8.745  0.270  3.358  1.249]
W(3) = [4.093  −10.563  −3.245;  7.045  9.662  6.436;  −7.447  3.931  −6.619]
b(2) = [4.920  −2.002  −3.485]ᵀ
b(3) = [3.277  −14.982  1.582]ᵀ

TABLE 12.5
Recognition performance on the training set as a function of training epochs. The learning rate constant was α = 0.001 in all cases.

Training epochs:    1,000   10,000   20,000   30,000   40,000   50,000   60,000   70,000   80,000
Recognition rate:   95.3%   96.6%    96.7%    96.8%    96.9%    97.0%    97.0%    97.0%    97.0%

FIGURE 12.39 MSE for the network architecture in Fig. 12.38 as a function of the number of training epochs. The learning rate parameter was α = 0.001 in all cases.

The parameters in Fig. 12.38 were the result of training. The recognition rate for the training data using these parameters was 97%. We achieved a recognition rate of 95.6% on the test set using the same parameters. The difference between these two figures, and the 96.4% and 96.2%, respectively, obtained for the same data with the Bayes classifier (see Example 12.6), are statistically insignificant.

The fact that our neural networks achieved results comparable to those obtained with the Bayes classifier is not surprising. It can be shown (Duda, Hart, and Stork [2001]) that a three-layer neural net, trained by backpropagation using a sum of errors squared criterion, approximates the Bayes decision functions in the limit, as the number of training samples approaches infinity. Although our training sets were small, the data were well behaved enough to yield results that are close to what theory predicts.

12.6 DEEP CONVOLUTIONAL NEURAL NETWORKS

Up to this point, we have organized pattern features as vectors. Generally, this assumes that the form of those features has been specified (i.e., "engineered" by a human designer) and extracted from images prior to being input to a neural network (Example 12.13 is an illustration of this approach). But one of the strengths of neural networks is that they are capable of learning pattern features directly from training data. What we would like to do is input a set of training images directly into a neural network, and have the network learn the necessary features on its own. One way to do this would be to convert images to vectors directly by organizing the pixels based on a linear index (see Fig. 12.1), and then letting each element (pixel) of the linear index be an element of the vector. However, this approach does not utilize any spatial relationships that may exist between pixels in an image, such as pixel arrangements into corners, the presence of edge segments, and other features that may help to differentiate one image from another. In this section, we present a class of neural networks called deep convolutional neural networks (CNNs or ConvNets for short) that accept images as inputs and are ideally suited for automatic learning and image classification. In order to differentiate between CNNs and the neural nets we studied in Section 12.5, we will refer to the latter as "fully connected" neural networks.

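For reference, the linear-index vectorization mentioned above amounts to nothing more than stacking the pixels of an image into one column. A one-function sketch is shown below; the column-first ordering is an assumption made for concreteness (any fixed ordering works, as long as it is used consistently).

    import numpy as np

    def image_to_vector(image):
        # Stack the pixels of a 2-D image into a single column vector,
        # running down the columns first (order='F').
        return image.reshape(-1, 1, order='F')

    x = image_to_vector(np.arange(12).reshape(3, 4))   # a 3x4 "image" becomes a 12x1 vector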
A BASIC CNN ARCHITECTURE

In the following discussion, we use a LeNet architecture (see references at the end of this chapter) to introduce convolutional nets. We do this for two main reasons: First, the LeNet architecture is reasonably simple to understand. This makes it ideal for introducing basic CNN concepts. Second, our real interest is in deriving the equations of backpropagation for convolutional networks, a task that is simplified by the intuitiveness of LeNets.

The CNN in Fig. 12.40 contains all the basic elements of a LeNet architecture, and we use it without loss of generality. (To simplify the explanation of the CNN in Fig. 12.40, we focus attention initially on a single image input. Multiple input images are a trivial extension we will consider later in our discussion.) A key difference between this architecture and the neural net architectures we studied in the previous section is that inputs to CNNs are 2-D arrays (images), while inputs to our fully connected neural networks are vectors. However, as you will see shortly, the computations performed by both networks are very similar: (1) a sum of products is formed, (2) a bias value is added, (3) the result is passed through an activation function, and (4) the activation value becomes a single input to a following layer.

FIGURE 12.40 A CNN containing all the basic elements of a LeNet architecture. Points A and B are specific values to be addressed later in this section. The last pooled feature maps are vectorized and serve as the input to a fully connected neural network. The class to which the input image belongs is determined by the output neuron with the highest value. (The stages labeled in the figure are: input image with a receptive field; convolution + bias + activation producing feature maps; subsampling producing pooled feature maps; a second stage of convolution + bias + activation and subsampling; and vectorizing into the fully connected neural net.)

Despite the fact that the computations performed by CNNs and fully connected neural nets are similar, there are some basic differences between the two, beyond their input formats being 2-D versus vectors. An important difference is that CNNs are capable of learning 2-D features directly from raw image data, as mentioned earlier. Because the tools for systematically engineering comprehensive feature sets for complex image recognition tasks do not exist, having a system that can learn its own image features from raw image data is a crucial advantage of CNNs. Another major difference is in the way in which layers are connected. In a fully connected neural net, we feed the output of every neuron in a layer directly into the input of every neuron in the next layer. By contrast, in a CNN we feed into every input of a layer a single value, determined by convolution (hence the name convolutional neural net) over a spatial neighborhood in the output of the previous layer. Therefore, CNNs are not fully connected in the sense defined in the last section. Another difference is that the 2-D arrays from one layer to the next are subsampled to reduce sensitivity to translational variations in the input. These differences and their meaning will become clear as we look at various CNN configurations in the following discussion.

Basics of How a CNN Operates

As noted above, the type of neighborhood processing in CNNs is spatial convolution. We explained the mechanics of spatial convolution in Fig. 3.29, and expressed it mathematically in Eq. (3-35). As that equation shows, convolution computes a sum of products between pixels and a set of kernel weights. This operation is carried out at every spatial location in the input image. The result at each location (x, y) in the input is a scalar value. Think of this value as the output of a neuron in a layer of a fully connected neural net. If we add a bias and pass the result through an activation function (see Fig. 12.29), we have a complete analogy between the basic computations performed by a CNN and those performed by the neural nets discussed in the previous section. (We will discuss in the next subsection the exact form of neural computations in a CNN, and show they are equivalent in form to the computations performed by neurons in a fully connected neural net.)

These remarks are summarized in Fig. 12.40, the leftmost part of which shows a neighborhood at one location in the input image. In CNN terminology, these neighborhoods are called receptive fields. All a receptive field does is select a region of pixels in the input image. As the figure shows, the first operation performed by a CNN is convolution, whose values are generated by moving the receptive field over the image and, at each location, forming a sum of products of a set of weights and the pixels contained in the receptive field. The set of weights, arranged in the shape of the receptive field, is a kernel, as in Chapter 3. The number of spatial increments by which a receptive field is moved is called the stride. Our spatial convolutions in previous chapters had a stride of one, but that is not a requirement of the equations themselves. In CNNs, an important motivation for using strides greater than one is data reduction. For example, changing the stride from one to two reduces the image resolution by one-half in each spatial dimension, resulting in a three-fourths reduction in the amount of data per image. Another important motivation is as a substitute for subsampling which, as we discuss below, is used to reduce system sensitivity to spatial translation.

To each convolution value (sum of products) we add a bias, then pass the result through an activation function to generate a single value. Then, this value is fed to the corresponding (x, y) location in the input of the next layer. When repeated for all locations in the input image, the process just explained results in a 2-D set of values that we store in the next layer as a 2-D array, called a feature map. (In the terminology of Chapter 3, a feature map is a spatially filtered image.) This terminology is motivated by the fact that the role performed by convolution is to extract features such as edges, points, and blobs from the input (remember, convolution is the basis of spatial filtering, which we used in Chapter 3 for tasks such as smoothing, sharpening, and computing edges in an image). The same weights and a single bias are used to generate the convolution (feature map) values corresponding to all locations of the receptive field in the input image. This is done to cause the same feature to be detected at all points in the image. Using the same weights and bias for this purpose is called weight (or parameter) sharing.

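The sketch below generates one feature map in the way just described: slide the receptive field over the image with a given stride, form the sum of products with a shared kernel, add the shared bias, and apply the activation function. The names are our own, the receptive field is kept inside the image, and the kernel is applied as a correlation (no kernel rotation); rotating the kernel first gives convolution in the Chapter 3 sense.

    import numpy as np

    def feature_map(image, kernel, bias, stride=1,
                    h=lambda z: 1.0 / (1.0 + np.exp(-z))):
        m, n = kernel.shape
        rows = (image.shape[0] - m) // stride + 1
        cols = (image.shape[1] - n) // stride + 1
        out = np.zeros((rows, cols))
        for i in range(rows):
            for j in range(cols):
                patch = image[i*stride:i*stride + m, j*stride:j*stride + n]
                out[i, j] = np.sum(kernel * patch) + bias   # shared weights and bias
        return h(out)                                       # activation -> feature map

For a 28 × 28 image and a 5 × 5 kernel with stride 1, the result is a 24 × 24 feature map, which is the case used in the examples that follow.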
Figure 12.40 shows three feature maps in the first layer of the network. The other two feature maps are generated in the manner just explained, but using a different set of weights and bias for each feature map. Because each set of weights and bias is different, each feature map generally will contain a different set of features, all extracted from the same input image. The feature maps are referred to collectively as a convolutional layer. Thus, the CNN in Fig. 12.40 has two convolutional layers.

The process after convolution and activation is subsampling (also called pooling), which is motivated by a model of the mammal visual cortex proposed by Hubel and Wiesel [1959]. Their findings suggest that parts of the visual cortex consist of simple and complex cells. The simple cells perform feature extraction, while the complex cells combine (aggregate) those features into a more meaningful whole. In this model, a reduction in spatial resolution appears to be responsible for achieving translational invariance. Pooling is a way of modeling this reduction in dimensionality. When training a CNN with large image databases, pooling has the additional advantage of reducing the volume of data being processed.

You can think of the results of subsampling as producing pooled feature maps. In other words, a pooled feature map is a feature map of reduced spatial resolution. Pooling is done by subdividing a feature map into a set of small (typically 2 × 2) regions, called pooling neighborhoods, and replacing all elements in such a neighborhood by a single value. We assume that pooling neighborhoods are adjacent (i.e., they do not overlap). (Adjacency is not a requirement of pooling per se. We assume it here for simplicity, and because this is an approach that is used frequently.) There are several ways to compute the pooled values; collectively, the different approaches are called pooling methods. Three common pooling methods are: (1) average pooling, in which the values in each neighborhood are replaced by the average of the values in the neighborhood; (2) max-pooling, which replaces the values in a neighborhood by the maximum value of its elements; and (3) L2 pooling, in which the resulting pooled value is the square root of the sum of the neighborhood values squared. There is one pooled feature map for each feature map. The pooled feature maps are referred to collectively as a pooling layer. In Fig. 12.40 we used 2 × 2 pooling, so each resulting pooled map is one-fourth the size of the preceding feature map. The use of receptive fields, convolution, parameter sharing, and pooling are characteristics unique to CNNs.

Because feature maps are the result of spatial convolution, we know from Chapter 3 that they are simply filtered images. It then follows that pooled feature maps are filtered images of lower resolution. As Fig. 12.40 illustrates, the pooled feature maps in the first layer become the inputs to the next layer in the network. But, whereas we showed a single image as an input to the first layer, we now have multiple pooled feature maps (filtered images) that are inputs into the second layer.

To see how these multiple inputs to the second layer are handled, focus for a moment on one pooled feature map. To generate the values for the first feature map in the second convolutional layer, we perform convolution, add a bias, and use activation, as before. Then, we change the kernel and bias, and repeat the procedure for the second feature map, still using the same input. We do this for every remaining feature map, changing the kernel weights and bias for each. Then, we consider the next pooled feature map input and perform the same procedure (convolution, plus bias, plus activation) for every feature map in the second layer, using yet another set of different kernels and biases. When we are finished, we will have generated three values for the same location in every feature map, with one value coming from the corresponding location in each of the three inputs. The question now is: How do we combine these three individual values into one? The answer lies in the fact that convolution is a linear process, from which it follows that the three individual values are combined into one by superposition (that is, by adding them). (You could interpret the convolution with several input images as 3-D convolution, but with movement only in the spatial, x and y, directions. The result would be identical to summing individual convolutions with each image separately, as we do here.)

In the first layer, we had one input image and three feature maps, so we needed three kernels to complete all required convolutions. In the second layer, we have three inputs and seven feature maps, so the total number of kernels (and biases) needed is 3 × 7 = 21. Each feature map is pooled to generate a corresponding pooled feature map, resulting in seven pooled feature maps. In Fig. 12.40, there are only two layers, so these seven pooled feature maps are the outputs of the last layer.

As usual, the ultimate objective is to use features for classification, so we need a classifier. As Fig. 12.40 shows, in a CNN we perform classification by feeding the values of the last pooled layer into a fully connected neural net, the details of which you learned in Section 12.5. (The parameters of the fully connected neural net are learned during training of the CNN, to be discussed shortly.) But the outputs of a CNN are 2-D arrays (i.e., filtered images of reduced resolution), whereas the inputs to a fully connected net are vectors. Therefore, we have to vectorize the 2-D pooled feature maps in the last layer. We do this using linear indexing (see Fig. 12.1). Each 2-D array in the last layer of the CNN is converted into a vector, then all resulting vectors are concatenated (vertically for a column) to form a single vector. This vector propagates through the neural net, as explained in Section 12.5. In any given application, the number of outputs in the fully connected net is equal to the number of pattern classes being classified. As before, the output with the highest value determines the class of the input.

EXAMPLE 12.14: Receptive fields, pooling neighborhoods, and their corresponding feature maps.

The top row of Fig. 12.41 shows a numerical example of the relative sizes of feature maps and pooled feature maps as a function of the sizes of receptive fields and pooling neighborhoods. The input image is of size 28 × 28 pixels, and the receptive field is of size 5 × 5. If we require that the receptive field be contained in the image during convolution, you know from Section 3.4 that the resulting convolution array (feature map) will be of size 24 × 24. If we use a pooling neighborhood of size 2 × 2, the resulting pooled feature maps will be of size 12 × 12, as the figure shows. As noted earlier, we assume that pooling neighborhoods do not overlap.

As an analogy with fully connected neural nets, think of each element of a 2-D array in the top row of Fig. 12.41 as a neuron. The outputs of the neurons in the input are pixel values. The neurons in the feature map of the first layer have output values generated by convolving with the input image a kernel whose size and shape are the same as the receptive field, and whose coefficients are learned during training. To each convolution value we add a bias and pass the result through an activation function to generate the output value of the corresponding neuron in the feature map. The output values of the neurons in the pooled feature maps are generated by pooling the output values of the neurons in the feature maps.

The second row in Fig. 12.41 illustrates visually how feature maps and pooled feature maps look based on the input image shown in the figure. The kernel shown is as described in the previous paragraph, and its weights (shown as intensity values) were learned from sample images during training of the CNN described later in Example 12.17. Therefore, the nature of the learned features is determined by the learned kernel coefficients. Note that the contents of the feature maps are specific features detected by convolution. For example, some of the features emphasize edges in the character. As mentioned earlier, the pooled features are lower-resolution versions of this effect.

EXAMPLE 12.15: Graphical illustration of the functions performed by the components of a CNN.

Figure 12.42 shows the 28 × 28 image from Fig. 12.41, input into an expanded version of the CNN architecture from Fig. 12.40. The expanded CNN, which we will discuss in more detail in Example 12.17, has six feature maps in the first layer, and twelve in the second. It uses receptive fields of size 5 × 5, and pooling neighborhoods of size 2 × 2. Because the receptive fields are of size 5 × 5, the feature maps in the first layer are of size 24 × 24, as we explained in Example 12.14. Each feature map has its own set of weights and bias, so we will need a total of (5 × 5) × 6 + 6 = 156 parameters (six kernels with twenty-five weights each, and six biases) to generate the feature maps in the first layer. The top row of Fig. 12.43 shows the kernels with the weights learned during training of the CNN displayed as images, with intensity being proportional to kernel values.
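The size and parameter bookkeeping used in Examples 12.14 and 12.15 can be written out in a few lines. The sketch below is ours (the helper names are not from the book); it reproduces the first- and second-layer counts quoted in this example.

    def conv_output_size(input_size, field_size, stride=1):
        # Receptive field required to stay inside the input ("valid" placement).
        return (input_size - field_size) // stride + 1

    def layer_parameters(num_inputs, num_maps, field_size):
        # Each feature map uses one kernel per input array plus a single shared bias.
        return num_maps * (num_inputs * field_size * field_size) + num_maps

    fm1 = conv_output_size(28, 5)        # 24: feature maps in the first layer
    pool1 = fm1 // 2                     # 12: after 2 x 2 pooling
    fm2 = conv_output_size(pool1, 5)     # 8:  feature maps in the second layer
    pool2 = fm2 // 2                     # 4:  pooled feature maps in the second layer
    print(layer_parameters(1, 6, 5))     # 156 parameters in the first layer
    print(layer_parameters(6, 12, 5))    # 1812 parameters in the second layer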

FIGURE 12.41 Top row: How the sizes of receptive fields and pooling neighborhoods affect the sizes of feature maps and pooled feature maps (a 28 × 28 input image, a 5 × 5 receptive field, convolution + bias + activation producing a 24 × 24 feature map, and a 2 × 2 pooling neighborhood producing a 12 × 12 pooled feature map). Bottom row: An image example showing the kernel, the resulting feature map, and the pooled feature map. This figure is explained in more detail in Example 12.17. (Image courtesy of NIST.)

Because we used pooling neighborhoods of size 2 × 2, the pooled feature maps in the first layer of Fig. 12.42 are of size 12 × 12. As we discussed earlier, the number of feature maps and pooled feature maps is the same, so we will have six arrays of size 12 × 12 acting as inputs to the twelve feature maps in the second layer (the number of feature maps generally is different from layer to layer). Each feature map will have its own set of weights and bias, so we will need a total of 6 × (5 × 5) × 12 + 12 = 1812 parameters to generate the feature maps in the second layer (i.e., twelve sets of six kernels with twenty-five weights each, plus twelve biases). The bottom part of Fig. 12.43 shows the kernels as images. Because we are using receptive fields of size 5 × 5, the feature maps in the second layer are of size 8 × 8. Using 2 × 2 pooling neighborhoods resulted in pooled feature maps of size 4 × 4 in the second layer.

FIGURE 12.42 Numerical example illustrating the various functions of a CNN, including recognition of an input image: input image, feature maps, pooled feature maps, a second set of feature maps and pooled feature maps, vectorization, and a fully connected neural net. A sigmoid activation function was used throughout. The ten outputs shown in the figure are 0.21, 0.17, 0.09, 0.10, 0.12, 0.39, 0.88, 0.19, 0.36, and 0.42, the largest value (0.88) occurring at the neuron corresponding to the digit class 6.

FIGURE 12.43 Top: The weights (shown as images of size 5 × 5) corresponding to the six feature maps in the first layer of the CNN in Fig. 12.42. Bottom: The weights corresponding to the twelve feature maps in the second layer.

As we discussed earlier, the pooled feature maps in the last layer have to be vectorized to be able to input them into the fully connected neural net. Each pooled feature map resulted in a column vector of size 16 × 1. There are 12 of these vectors which, when concatenated vertically, resulted in a single vector of size 192 × 1. Therefore, our fully connected neural net has 192 input neurons. There are ten numeral classes, so there are 10 output neurons. As you will see later, we obtained excellent performance by using a neural net with no hidden layers, so our complete neural net had a total of 192 input neurons and 10 output neurons. For the input character shown in Fig. 12.42, the highest value in the output of the fully connected neural net was in the seventh neuron, which corresponds to the class of 6's. Therefore, the input was recognized properly. This is shown in bold text in the figure.
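The pooling and vectorization steps just described are easy to express directly. The sketch below is our own illustration (names assumed, pooling neighborhoods non-overlapping); applied to twelve 4 × 4 pooled feature maps it produces the 192 × 1 vector mentioned above.

    import numpy as np

    def pool(feature_map, size=2, mode="average"):
        # Subdivide into non-overlapping size x size neighborhoods and pool each one.
        r, c = feature_map.shape
        blocks = feature_map[:r - r % size, :c - c % size]
        blocks = blocks.reshape(r // size, size, c // size, size)
        if mode == "average":
            return blocks.mean(axis=(1, 3))
        if mode == "max":
            return blocks.max(axis=(1, 3))
        return np.sqrt((blocks ** 2).sum(axis=(1, 3)))     # L2 pooling

    def vectorize(pooled_maps):
        # Linear-index each pooled map and stack the results into one column vector.
        return np.concatenate([m.reshape(-1, 1) for m in pooled_maps], axis=0)

    maps = [np.zeros((4, 4)) for _ in range(12)]           # last pooled layer
    print(vectorize(maps).shape)                           # (192, 1)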
Figure 12.44 shows graphically what the feature maps look like as the input image propagates through the CNN. Consider the feature maps in the first layer. If you look at each map carefully, you will notice that it highlights a different characteristic of the input. For example, the map on the top of the first column highlights the two principal edges on the top of the character. The second map highlights the edges of the entire inner region, and the third highlights a "blob-like" nature of the digit, almost as if it had been blurred by a lowpass kernel. The other three images show other features. Although the pooled feature maps are lower-resolution versions of the original feature maps, they still retained the key characteristics of the features in the latter. If you look at the first two feature maps in the second layer, and compare them with the first two in the first layer, you can see that they could be interpreted as higher-level abstractions of the top part of the character, in the sense that they show an area flanked on both sides by areas of opposite intensity. These abstractions are not always easy to analyze visually, but as you will see in later examples, they can be very effective. The vectorized version of the last pooled layer is self-explanatory. The output of the fully connected neural net shows dark for low values and white for the highest value, indicating that the input was properly recognized as a number 6. Later in this section, we will show that the simple CNN architecture in Fig. 12.42 is capable of recognizing the correct class of over 70,000 numerical samples with nearly perfect accuracy.

FIGURE 12.44 Visual summary of an input image propagating through the CNN in Fig. 12.42. Shown as images are all the results of convolution (feature maps) and pooling (pooled feature maps) for both layers of the network, followed by the vectorized last layer and the outputs of the fully connected neural net for digit classes 0 through 9. (Example 12.17 contains more details about this figure.)

Neural Computations in a CNN

Recall from Fig. 12.29 that the basic computation performed by an artificial neuron is a sum of products between weights and values from a previous layer. To this we add a bias and call the result the net (total) input to the neuron, which we denoted by zi. As we showed in Eq. (12-54), the sum involved in generating zi is a single sum. The computation performed in a CNN to generate a single value in a feature map is 2-D convolution. As you learned in Chapter 3, this is a double sum of products between the coefficients of a kernel and the corresponding elements of the image array overlapped by the kernel. With reference to Fig. 12.40, let w denote a kernel formed by arranging the weights in the shape of the receptive field we discussed in connection with that figure. For notational consistency with Section 12.5, let ax,y denote image or pooled feature values, depending on the layer. The convolution value at any point (x, y) in the input is given by

w ⋆ ax,y = Σ_l Σ_k w_{l,k} a_{x−l,y−k}    (12-83)

where l and k span the dimensions of the kernel. Suppose that w is of size 3 × 3. Then, we can expand this equation into the following sum of products:

w ⋆ ax,y = Σ_l Σ_k w_{l,k} a_{x−l,y−k}
         = w_{1,1} a_{x−1,y−1} + w_{1,2} a_{x−1,y−2} + ⋯ + w_{3,3} a_{x−3,y−3}    (12-84)

We could relabel the subscripts on w and a, and write instead

w ⋆ ax,y = w1 a1 + w2 a2 + ⋯ + w9 a9 = Σ_{i=1}^{9} wi ai    (12-85)

The results of Eqs. (12-84) and (12-85) are identical. If we add a bias to the latter equation and call the result z, we have

z = Σ_{j=1}^{9} wj aj + b
  = w ⋆ ax,y + b    (12-86)

The form of the first line of this equation is identical to Eq. (12-54). Therefore, we conclude that if we add a bias to the spatial convolution computation performed by a CNN at any fixed position (x, y) in the input, the result can be expressed in a form identical to the computation performed by an artificial neuron in a fully connected neural net. We need the x, y only to account for the fact that we are working in 2-D. If we think of z as the net input to a neuron, the analogy with the neurons discussed in Section 12.5 is completed by passing z through an activation function, h, to get the output of the neuron:

a = h(z)    (12-87)

This is exactly how the value of any point in a feature map (such as the point labeled A in Fig. 12.40) is computed.

Now consider point B in that figure. As mentioned earlier, its value is given by adding three convolution equations:

w(1)_{l,k} ⋆ a(1)_{x,y} + w(2)_{l,k} ⋆ a(2)_{x,y} + w(3)_{l,k} ⋆ a(3)_{x,y}
   = Σ_l Σ_k w(1)_{l,k} a(1)_{x−l,y−k} + Σ_l Σ_k w(2)_{l,k} a(2)_{x−l,y−k} + Σ_l Σ_k w(3)_{l,k} a(3)_{x−l,y−k}    (12-88)

where the superscripts refer to the three pooled feature maps in Fig. 12.40. The values of l, k, x, and y are the same in all three equations because all three kernels are of the same size and they move in unison. We could expand this equation and obtain a sum of products that is lengthier than for point A in Fig. 12.40, but we could still relabel all terms and obtain a sum of products that involves only one summation, exactly as before.
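A two-line numerical check of this equivalence is shown below; it ignores the kernel-flip bookkeeping, which the relabeling in Eq. (12-85) absorbs, and the variable names are ours.

    import numpy as np

    rng = np.random.default_rng(1)
    w = rng.standard_normal((3, 3))          # kernel (receptive-field weights)
    patch = rng.standard_normal((3, 3))      # input values under the receptive field
    bias = 0.5

    z_conv = np.sum(w * patch) + bias                     # 2-D sum of products + bias, Eq. (12-86)
    z_neuron = w.reshape(-1) @ patch.reshape(-1) + bias   # the same value as a single sum, Eq. (12-85)
    print(np.isclose(z_conv, z_neuron))                   # True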
www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 971 6/16/2017 2:17:50 PM DIP4E_GLOBAL_Print_Ready.indb 972 6/16/2017 2:17:52 PM


12.6 Deep Convolutional Neural Networks 973 974 Chapter 12 Image Pattern Classification

The preceding result tells us that the equations used to obtain the value of an element of any feature map in a CNN can be expressed in the form of the computation performed by an artificial neuron. This holds for any feature map, regardless of how many convolutions are involved in the computation of the elements of that feature map, in which case we would simply be dealing with the sum of more convolution equations. The implication is that we can use the basic form of Eqs. (12-86) and (12-87) to describe how the value of an element in any feature map of a CNN is obtained. This means we do not have to account explicitly for the number of different pooled feature maps (and hence the number of different kernels) used in a pooling layer. The result is a significant simplification of the equations that describe forward and backpropagation in a CNN.

Multiple Input Images

The values of a_{x,y} just discussed are pixel values in the first layer but, in layers past the first, a_{x,y} denotes values of pooled features. However, our equations do not differentiate based on what these variables actually represent. For example, suppose we replace the input to Fig. 12.40 with three images, such as the three components of an RGB image. The equations for the value of point A in the figure would now have the same form as those we stated for point B; only the weights and biases would be different. Thus, the results in the previous discussion for one input image are applicable directly to multiple input images. We will give an example of a CNN with three input images later in our discussion.

THE EQUATIONS OF A FORWARD PASS THROUGH A CNN

We concluded in the preceding discussion that we can express the result of convolving a kernel, w, and an input array with values a_{x,y}, as

$$z_{x,y} = \sum_{l}\sum_{k} w_{l,k}\, a_{x-l,\,y-k} + b = w \star a_{x,y} + b \tag{12-89}$$

where l and k span the dimensions of the kernel, x and y span the dimensions of the input, and b is a bias. (As noted earlier, a kernel is formed by organizing the weights in the shape of a corresponding receptive field. Keep in mind also that w and a_{x,y} represent all the weights and corresponding values in a set of input images or pooled features.) The corresponding value of a_{x,y} is

$$a_{x,y} = h\big(z_{x,y}\big) \tag{12-90}$$

But this a_{x,y} is different from the one we used to compute Eq. (12-89), in which a_{x,y} represents values from the previous layer. Thus, we are going to need additional notation to differentiate between layers. As in fully connected neural nets, we use ℓ for this purpose, and write Eqs. (12-89) and (12-90) as

$$z_{x,y}(\ell) = \sum_{l}\sum_{k} w_{l,k}(\ell)\, a_{x-l,\,y-k}(\ell-1) + b(\ell) = w(\ell) \star a_{x,y}(\ell-1) + b(\ell) \tag{12-91}$$

and

$$a_{x,y}(\ell) = h\big(z_{x,y}(\ell)\big) \tag{12-92}$$

for ℓ = 1, 2, …, Lc, where Lc is the number of convolutional layers, and a_{x,y}(ℓ) denotes the values of pooled features in convolutional layer ℓ. When ℓ = 1,

$$a_{x,y}(0) = \{\text{values of pixels in the input image(s)}\} \tag{12-93}$$

When ℓ = Lc,

$$a_{x,y}(L_c) = \{\text{values of pooled features in the last layer of the CNN}\} \tag{12-94}$$

Note that ℓ starts at 1 instead of 2, as we did in Section 12.5. The reason is that we are naming layers, as in "convolutional layer ℓ." It would be confusing to start at convolutional layer 2. Finally, we note that pooling does not require any convolutions. The only function of pooling is to reduce the spatial dimensions of the feature map preceding it, so we do not include explicit pooling equations here.

Equations (12-91) through (12-94) are all we need to compute all values in a forward pass through the convolutional section of a CNN. As described in Fig. 12.40, the values of the pooled features of the last layer are vectorized and fed into a fully connected feedforward neural network, whose forward propagation is explained in Eqs. (12-54) and (12-55) or, in matrix form, in Table 12.2.
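To make the forward-pass equations concrete, the following NumPy sketch (illustrative code, not the book's implementation) evaluates Eqs. (12-91) and (12-92) for a single feature map, using the sigmoid activation used throughout this chapter, and adds the 2 × 2 average pooling used later in Example 12.16. A full layer would repeat this for every kernel and, with multiple input maps, sum the per-map convolutions as described above.

```python
import numpy as np

def h(z):                       # sigmoid activation, as used in the text
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer(a_prev, w, b):
    """'Valid' 2-D convolution plus bias, Eq. (12-91), followed by Eq. (12-92).
    a_prev : (H, W) input values, w : (m, n) kernel, b : scalar bias."""
    m, n = w.shape
    H, W = a_prev.shape
    z = np.zeros((H - m + 1, W - n + 1))
    wr = w[::-1, ::-1]                       # flip kernel: true convolution
    for x in range(z.shape[0]):
        for y in range(z.shape[1]):
            z[x, y] = np.sum(wr * a_prev[x:x + m, y:y + n]) + b
    return z, h(z)

def avg_pool(a, size=2):
    """Average pooling over non-overlapping size x size neighborhoods."""
    H, W = a.shape
    return a[:H - H % size, :W - W % size].reshape(
        H // size, size, W // size, size).mean(axis=(1, 3))

rng = np.random.default_rng(1)
image = rng.random((6, 6))                   # e.g., a 6 x 6 input as in Example 12.16
w, b = rng.standard_normal((3, 3)), 0.1
z, a = conv_layer(image, w, b)               # feature map, 4 x 4
pooled = avg_pool(a)                         # pooled feature map, 2 x 2
print(z.shape, a.shape, pooled.shape)
```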
THE EQUATIONS OF BACKPROPAGATION USED TO TRAIN CNNS

As you saw in the previous section, the feedforward equations of a CNN are similar to those of a fully connected neural net, but with multiplication replaced by convolution, and notation that reflects the fact that CNNs are not fully connected in the sense defined in Section 12.5. As you will see in this section, the equations of backpropagation also are similar in many respects to those in fully connected neural nets.

As in the derivation of backpropagation in Section 12.5, we start with the definition of how the output error of our CNN changes with respect to each neuron in the network. The form of the error is the same as for fully connected neural nets, but now it is a function of x and y instead of j:

$$\delta_{x,y}(\ell) = \frac{\partial E}{\partial z_{x,y}(\ell)} \tag{12-95}$$

As in Section 12.5, we want to relate this quantity to δ_{x,y}(ℓ + 1), which we again do using the chain rule:

$$\delta_{x,y}(\ell) = \frac{\partial E}{\partial z_{x,y}(\ell)} = \sum_{u}\sum_{v} \frac{\partial E}{\partial z_{u,v}(\ell+1)}\, \frac{\partial z_{u,v}(\ell+1)}{\partial z_{x,y}(\ell)} \tag{12-96}$$


where u and v are any two variables of summation over the range of possible values of z. As noted in Section 12.5, these summations result from applying the chain rule. By definition, the first term in the double summation of Eq. (12-96) is δ_{u,v}(ℓ + 1), so we can write this equation as

$$\delta_{x,y}(\ell) = \frac{\partial E}{\partial z_{x,y}(\ell)} = \sum_{u}\sum_{v} \delta_{u,v}(\ell+1)\, \frac{\partial z_{u,v}(\ell+1)}{\partial z_{x,y}(\ell)} \tag{12-97}$$

Substituting Eq. (12-92) into Eq. (12-91), and using the resulting z_{u,v} in Eq. (12-97), we obtain

$$\delta_{x,y}(\ell) = \sum_{u}\sum_{v} \delta_{u,v}(\ell+1)\, \frac{\partial}{\partial z_{x,y}(\ell)}\!\left[\, \sum_{l}\sum_{k} w_{l,k}(\ell+1)\, h\big(z_{u-l,\,v-k}(\ell)\big) + b(\ell+1) \right] \tag{12-98}$$

The derivative of the expression inside the brackets is zero unless u − l = x and v − k = y; the derivative of b(ℓ + 1) with respect to z_{x,y}(ℓ) is also zero. But, if u − l = x and v − k = y, then l = u − x and k = v − y. Therefore, taking the indicated derivative of the expression in brackets, we can write Eq. (12-98) as

$$\delta_{x,y}(\ell) = \sum_{u}\sum_{v} \delta_{u,v}(\ell+1)\left[\, \sum_{u-x}\sum_{v-y} w_{u-x,\,v-y}(\ell+1)\, h'\big(z_{x,y}(\ell)\big) \right] \tag{12-99}$$

Values of x, y, u, and v are specified outside of the terms inside the brackets. Once the values of these variables are fixed, u − x and v − y inside the brackets are simply two constants. Therefore, the double summation inside the brackets evaluates to w_{u−x,v−y}(ℓ + 1) h′(z_{x,y}(ℓ)), and we can write Eq. (12-99) as

$$\begin{aligned}
\delta_{x,y}(\ell) &= \sum_{u}\sum_{v} \delta_{u,v}(\ell+1)\, w_{u-x,\,v-y}(\ell+1)\, h'\big(z_{x,y}(\ell)\big)\\
&= h'\big(z_{x,y}(\ell)\big) \sum_{u}\sum_{v} \delta_{u,v}(\ell+1)\, w_{u-x,\,v-y}(\ell+1)
\end{aligned} \tag{12-100}$$

The double sum expression in the second line of this equation is in the form of a convolution, but the displacements are the negatives of those in Eq. (12-91). Therefore, we can write Eq. (12-100) as

$$\delta_{x,y}(\ell) = h'\big(z_{x,y}(\ell)\big)\,\big[\delta_{x,y}(\ell+1) \star w_{-x,-y}(\ell+1)\big] \tag{12-101}$$

The negatives in the subscripts indicate that w is reflected about both spatial axes. This is the same as rotating w by 180°, as we explained in connection with Eq. (3-35). Using this fact, we finally arrive at an expression for the error at a layer ℓ by writing Eq. (12-101) equivalently as

$$\delta_{x,y}(\ell) = h'\big(z_{x,y}(\ell)\big)\,\big[\delta_{x,y}(\ell+1) \star \operatorname{rot180}\big(w_{x,y}(\ell+1)\big)\big] \tag{12-102}$$

(The 180° rotation is applied to each 2-D kernel in a layer.) But the kernels do not depend on x and y, so we can write this equation as

$$\delta_{x,y}(\ell) = h'\big(z_{x,y}(\ell)\big)\,\big[\delta_{x,y}(\ell+1) \star \operatorname{rot180}\big(w(\ell+1)\big)\big] \tag{12-103}$$
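A direct, minimal implementation of Eq. (12-103) is sketched below (illustrative code, not the book's implementation; the array sizes are assumptions). It evaluates the double sum of Eq. (12-100) explicitly, which is equivalent to convolving δ(ℓ + 1) with the 180°-rotated kernel. In the architecture of Fig. 12.40, the δ(ℓ + 1) values would first have to be upsampled from their pooled map, as discussed later in this section.

```python
import numpy as np

def h_prime(z):                 # derivative of the sigmoid used in the text
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backprop_delta(delta_next, w_next, z_curr):
    """Eq. (12-103): delta(l) = h'(z(l)) [ delta(l+1) conv rot180(w(l+1)) ],
    computed directly from the double sum in Eq. (12-100):
        sum over (u, v) of delta_next[u, v] * w_next[u - x, v - y]."""
    m, n = w_next.shape
    H, W = z_curr.shape
    acc = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            for u in range(x, min(x + m, delta_next.shape[0])):
                for v in range(y, min(y + n, delta_next.shape[1])):
                    acc[x, y] += delta_next[u, v] * w_next[u - x, v - y]
    return h_prime(z_curr) * acc

# Toy sizes: layer l has 6 x 6 net inputs, layer l+1 a 4 x 4 delta, 3 x 3 kernel.
rng = np.random.default_rng(2)
z_l = rng.standard_normal((6, 6))
delta_l1 = rng.standard_normal((4, 4))
w_l1 = rng.standard_normal((3, 3))
print(backprop_delta(delta_l1, w_l1, z_l).shape)   # (6, 6)
```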

As in Section 12.5, our final objective is to compute the change in E with respect to the weights and biases. Following a similar procedure as above, we obtain

$$\begin{aligned}
\frac{\partial E}{\partial w_{l,k}} &= \sum_{x}\sum_{y} \frac{\partial E}{\partial z_{x,y}(\ell)}\, \frac{\partial z_{x,y}(\ell)}{\partial w_{l,k}}\\
&= \sum_{x}\sum_{y} \delta_{x,y}(\ell)\, \frac{\partial z_{x,y}(\ell)}{\partial w_{l,k}}\\
&= \sum_{x}\sum_{y} \delta_{x,y}(\ell)\, \frac{\partial}{\partial w_{l,k}}\!\left[\, \sum_{l}\sum_{k} w_{l,k}(\ell)\, h\big(z_{x-l,\,y-k}(\ell-1)\big) + b(\ell) \right]\\
&= \sum_{x}\sum_{y} \delta_{x,y}(\ell)\, h\big(z_{x-l,\,y-k}(\ell-1)\big)\\
&= \sum_{x}\sum_{y} \delta_{x,y}(\ell)\, a_{x-l,\,y-k}(\ell-1)
\end{aligned} \tag{12-104}$$

where the last line follows from Eq. (12-92). This line is in the form of a convolution but, comparing it to Eq. (12-91), we see there is a sign reversal between the summation variables and their corresponding subscripts. To put it in the form of a convolution, we write the last line of Eq. (12-104) as

$$\begin{aligned}
\frac{\partial E}{\partial w_{l,k}} &= \sum_{x}\sum_{y} \delta_{x,y}(\ell)\, a_{-(l-x),\,-(k-y)}(\ell-1)\\
&= \delta_{l,k}(\ell) \star a_{-l,-k}(\ell-1)\\
&= \delta_{l,k}(\ell) \star \operatorname{rot180}\big(a(\ell-1)\big)
\end{aligned} \tag{12-105}$$

Similarly (see Problem 12.32),

$$\frac{\partial E}{\partial b(\ell)} = \sum_{x}\sum_{y} \delta_{x,y}(\ell) \tag{12-106}$$

Using the preceding two expressions in the gradient descent equations (see Section 12.5), it follows that

$$\begin{aligned}
w_{l,k}(\ell) &= w_{l,k}(\ell) - \alpha\, \frac{\partial E}{\partial w_{l,k}}\\
&= w_{l,k}(\ell) - \alpha\, \delta_{l,k}(\ell) \star \operatorname{rot180}\big(a(\ell-1)\big)
\end{aligned} \tag{12-107}$$


and

$$\begin{aligned}
b(\ell) &= b(\ell) - \alpha\, \frac{\partial E}{\partial b(\ell)}\\
&= b(\ell) - \alpha \sum_{x}\sum_{y} \delta_{x,y}(\ell)
\end{aligned} \tag{12-108}$$

Equations (12-107) and (12-108) update the weights and bias of each convolution layer in a CNN. As we have mentioned before, it is understood that w_{l,k} represents all the weights of a layer. The variables l and k span the spatial dimensions of the 2-D kernels, all of which are of the same size.
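The following sketch (illustrative code with assumed array shapes, not the book's implementation) evaluates Eqs. (12-105) and (12-106) directly from their sum-of-products form and applies the updates of Eqs. (12-107) and (12-108) with learning rate alpha.

```python
import numpy as np

def grad_w(delta, a_prev, m, n):
    """Eq. (12-105): dE/dw[l, k] = sum over (x, y) of delta[x, y] * a_prev[x-l, y-k],
    evaluated for an m x n kernel (terms with out-of-range indices contribute zero)."""
    g = np.zeros((m, n))
    H, W = delta.shape
    for l in range(m):
        for k in range(n):
            for x in range(l, H):
                for y in range(k, W):
                    if x - l < a_prev.shape[0] and y - k < a_prev.shape[1]:
                        g[l, k] += delta[x, y] * a_prev[x - l, y - k]
    return g

def grad_b(delta):
    """Eq. (12-106): dE/db = sum over (x, y) of delta[x, y]."""
    return delta.sum()

# Gradient descent updates, Eqs. (12-107) and (12-108).
rng = np.random.default_rng(3)
a_prev = rng.random((6, 6))            # pooled features (or pixels) from layer l-1
delta = rng.standard_normal((4, 4))    # delta values of layer l
w, b, alpha = rng.standard_normal((3, 3)), 0.0, 1.0

w = w - alpha * grad_w(delta, a_prev, *w.shape)
b = b - alpha * grad_b(delta)
```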
In a forward pass, we went from a convolution layer to a pooled layer. In backpropagation, we are going in the opposite direction. But the pooled feature maps are smaller than their corresponding feature maps (see Fig. 12.40). Therefore, when going in the reverse direction, we upsample (e.g., by pixel replication) each pooled feature map to match the size of the feature map that generated it. Each pooled feature map corresponds to a unique feature map, so the path of backpropagation is clearly defined.
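Pixel-replication upsampling of a pooled feature map can be written in one line with NumPy. This sketch (illustrative, not from the book) shows only the size-matching step described above; the text gives no explicit pooling equations, so any additional scaling a particular pooling scheme might require is not shown.

```python
import numpy as np

def upsample(pooled, size=2):
    """Replicate each pooled value over a size x size block so the result
    matches the spatial dimensions of the feature map that produced it."""
    return np.kron(pooled, np.ones((size, size)))

pooled = np.array([[0.2, 0.7],
                   [0.5, 0.1]])
print(upsample(pooled))        # 4 x 4 array of replicated values
```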
With reference to Fig. 12.40, backpropagation starts at the output of the fully connected neural net. We know from Section 12.5 how to update the weights of this network. When we get to the "interface" between the neural net and the CNN, we have to reverse the vectorization method used to generate input vectors. That is, before we can proceed with backpropagation using Eqs. (12-107) and (12-108), we have to regenerate the individual pooled feature maps from the single vector propagated back by the fully connected neural net.

We summarized in Table 12.3 the backpropagation steps for a fully connected neural net. Table 12.6 summarizes the steps for performing backpropagation in the CNN architecture in Fig. 12.40. The procedure is repeated for a specified number of epochs, or until the output error of the neural net reaches an acceptable value. The error is computed exactly as we did in Section 12.5. It can be the mean squared error, or the recognition error. Keep in mind that the weights in w(ℓ) and the bias value b(ℓ) are different for each feature map in layer ℓ.
TABLE 12.6
The principal steps used to train a CNN. The network is initialized with a set of small random weights and biases. In backpropagation, a vector arriving (from the fully connected net) at the output pooling layer must be converted to 2-D arrays of the same size as the pooled feature maps in that layer. Each pooled feature map is upsampled to match the size of its corresponding feature map. The steps in the table are for one epoch of training.

Step 1 (Input images): a(0) = the set of image pixels in the input to layer 1.

Step 2 (Forward pass): For each neuron corresponding to location (x, y) in each feature map in layer ℓ compute
z_{x,y}(ℓ) = w(ℓ) ⋆ a_{x,y}(ℓ − 1) + b(ℓ) and a_{x,y}(ℓ) = h(z_{x,y}(ℓ)); ℓ = 1, 2, …, Lc.

Step 3 (Backpropagation): For each neuron in each feature map in layer ℓ compute
δ_{x,y}(ℓ) = h′(z_{x,y}(ℓ)) [δ_{x,y}(ℓ + 1) ⋆ rot180(w(ℓ + 1))]; ℓ = Lc − 1, Lc − 2, …, 1.

Step 4 (Update parameters): Update the weights and bias of each feature map using
w_{l,k}(ℓ) = w_{l,k}(ℓ) − α δ_{l,k}(ℓ) ⋆ rot180(a(ℓ − 1)) and b(ℓ) = b(ℓ) − α Σ_x Σ_y δ_{x,y}(ℓ); ℓ = 1, 2, …, Lc.
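The two error measures mentioned above can be computed as in the following sketch, which assumes (as in Section 12.5 and Problem 12.28) that network outputs and one-hot target vectors are stored one pattern per column; the names and the averaging over patterns are illustrative assumptions, not code from the book.

```python
import numpy as np

def mean_squared_error(outputs, targets):
    """Half the squared error per pattern (one column per pattern), averaged."""
    return 0.5 * np.mean(np.sum((outputs - targets) ** 2, axis=0))

def recognition_error(outputs, targets):
    """Fraction of patterns whose largest output is not in the target class."""
    return np.mean(np.argmax(outputs, axis=0) != np.argmax(targets, axis=0))

# Toy check: 3 classes, 4 patterns (columns), one-hot targets.
targets = np.eye(3)[:, [0, 1, 2, 1]]
outputs = np.array([[0.90, 0.10, 0.20, 0.30],
                    [0.05, 0.80, 0.10, 0.40],
                    [0.05, 0.10, 0.70, 0.30]])
print(mean_squared_error(outputs, targets), recognition_error(outputs, targets))
```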

FIGURE 12.45 CNN with one convolutional layer used to learn to recognize the images in Fig. 12.46. (Architecture: image of size 6 × 6, two feature maps of size 4 × 4, two pooled feature maps of size 2 × 2, vectorization into 8 input neurons, and a fully connected two-layer neural net with 3 output neurons.)

EXAMPLE 12.16: Teaching a CNN to recognize some simple images.

We begin our illustrations of CNN performance by teaching the CNN in Fig. 12.45 to recognize the small 6 × 6 images in Fig. 12.46. As you can see on the left of this figure, there are three samples each of images of a horizontal stripe, a small centered square, and a vertical stripe. These images were used as the training set. On the right are noisy samples of images in these three categories. These were used as the test set.

FIGURE 12.46 Left: Training images. Top row: Samples of a dark horizontal stripe. Center row: Samples of a centered dark square. Bottom row: Samples of a dark vertical stripe. Right: Noisy samples of the three categories on the left, created by adding Gaussian noise of zero mean and unit variance to the samples on the left. (All images are 8-bit grayscale images.)

FIGURE 12.47 Training MSE as a function of epoch for the images in Fig. 12.46. Perfect recognition of the training and test sets was achieved after approximately 100 epochs, despite the fact that the MSE was relatively high there. (Plot of mean squared error, 0 to 0.5, versus epochs, 0 to 400.)

As Fig. 12.45 shows, the inputs to our system are single images. We used a receptive field of size 3 × 3, which resulted in feature maps of size 4 × 4. There are two feature maps, which means we need two kernels of size 3 × 3, and two biases. The pooled feature maps were generated using average pooling in neighborhoods of size 2 × 2. This resulted in two pooled feature maps of size 2 × 2, because the feature maps are of size 4 × 4. The two pooled maps contain eight total elements, which were organized as an 8-D column vector to vectorize the output of the last layer. (We used linear indexing of each pooled map, then concatenated the two resulting 4-D vectors into a single 8-D vector.) This vector was then fed into the fully connected neural net on the right, which consists of the input layer and a three-neuron output layer, one neuron per class. Because this network has no hidden layers, it implements linear decision functions (see Problem 12.18). To train the system, we used α = 1.0 and ran the system for 400 epochs. Figure 12.47 is a plot of the MSE as a function of epoch. Perfect recognition of the training set was achieved after approximately 100 epochs of training, despite the fact that the MSE was relatively high there. Recognition of the test set was 100% as well. The kernel and bias values learned by the system were:

w1 = [  3.0132   1.1808  −0.0945
        0.9718   0.7087  −0.9093
        0.7193   0.0230  −0.8833 ],   b1 = −0.2990

w2 = [ −0.7388   1.8832   4.1077
       −1.0027   0.3908   2.0357
       −1.2164  −1.1853  −0.1987 ],   b2 = −0.2834

It is important to note that the CNN learned these parameters automatically from the raw training images. No features in the sense discussed in Chapter 11 were employed.
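The vectorization step described in this example can be sketched as follows. MATLAB-style linear indexing is column-major, but that ordering is an assumption here; the book does not specify it, and any fixed ordering works as long as it is reversed consistently at the CNN/neural-net interface during backpropagation.

```python
import numpy as np

pooled_1 = np.array([[1, 2],
                     [3, 4]])
pooled_2 = np.array([[5, 6],
                     [7, 8]])

# Linear indexing of each 2 x 2 pooled map, then concatenation into one
# 8-D column vector for the fully connected network.
v = np.concatenate([pooled_1.flatten(order="F"), pooled_2.flatten(order="F")])
print(v)

# Reversing the vectorization during backpropagation:
p1 = v[:4].reshape(2, 2, order="F")
p2 = v[4:].reshape(2, 2, order="F")
```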

EXAMPLE 12.17: Using a large training set to teach a CNN to recognize handwritten numerals.

In this example, we look at a more practical application using a database containing 60,000 training and 10,000 test images of handwritten numeric characters. The content of this database, called the MNIST database, is similar to a database from NIST (National Institute of Standards and Technology). The former is a "cleaned up" version of the latter, in which the characters have been centered and formatted into grayscale images of size 28 × 28 pixels. Both databases are freely available online. Figure 12.48 shows examples of typical numeric characters available in the databases. As you can see, there is significant variability in the characters, and this is just a small sampling of the 70,000 characters available for experimentation.

FIGURE 12.48 Samples similar to those available in the NIST and MNIST databases. Each character subimage is of size 28 × 28 pixels. (Individual images courtesy of NIST.)

Figure 12.49 shows the architecture of the CNN we trained to recognize the ten digits in the MNIST database. We trained the system for 200 epochs using α = 1.0. Figure 12.50 shows the training MSE as a function of epoch for the 60,000 training images in the MNIST database.

Training was done using mini-batches of 50 images at a time to improve the learning rate (see the discussion in Section 12.7). We also classified all images of the training set and all images of the test set after each epoch of training. The objective of doing this was to see how quickly the system was learning the characteristics of the data. Figure 12.51 shows the results. A high level of correct recognition performance was achieved after relatively few epochs for both data sets, with approximately 98% correct recognition achieved after about 40 epochs. This is consistent with the training MSE in Fig. 12.50, which dropped quickly, then began a slow descent after about 40 epochs. Another 160 epochs of training were required for the system to achieve recognition of about 99.9%. These are impressive results for such a small CNN.

FIGURE 12.49 CNN used to recognize the ten digits in the MNIST database. The system was trained with 60,000 numerical character images of the same size as the image shown on the left. This architecture is the same as the architecture we used in Fig. 12.42. (Architecture: image of size 28 × 28, 6 feature maps of size 24 × 24, 6 pooled feature maps of size 12 × 12, 12 feature maps of size 8 × 8, 12 pooled feature maps of size 4 × 4, vectorization into 192 input neurons, and a fully connected two-layer neural net with 10 output neurons. Image courtesy of NIST.)
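The sizes quoted in Fig. 12.49 follow from simple bookkeeping, assuming 5 × 5 receptive fields (the kernel size reported for this example) and 2 × 2 pooling. The short sketch below reproduces them; it is an illustration, not code from the book.

```python
def conv_output(size, kernel):      # no padding, stride 1
    return size - kernel + 1

def pool_output(size, pool=2):      # non-overlapping pooling
    return size // pool

size = 28
size = conv_output(size, 5)          # 24 -> 6 feature maps of 24 x 24
size = pool_output(size)             # 12 -> 6 pooled maps of 12 x 12
size = conv_output(size, 5)          # 8  -> 12 feature maps of 8 x 8
size = pool_output(size)             # 4  -> 12 pooled maps of 4 x 4
print(12 * size * size)              # 192 neurons after vectorization
```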


FIGURE 12.50 Training mean squared error as a function of epoch for the 60,000 training digit images in the MNIST database. (Plot of training MSE, 0 to 0.3, versus epoch, 0 to 200.)

Figure 12.52 shows recognition performance on each digit class for both the training and test sets. The most revealing feature of these two graphs is that the CNN did equally well on both sets of data. This is a good indication that the training was successful, and that it generalized well to digits it had not seen before. This is an example of the neural network not "over-fitting" the data in the training set.

Figure 12.53 shows the values of the kernels of the first layer, displayed as intensities. There is one input image and six feature maps, so six kernels are required to generate the feature maps of the first layer. The dimensions of the kernels are the same as the receptive field, which we set at 5 × 5. Thus, the first image on the left in Fig. 12.53 is the 5 × 5 kernel corresponding to the first feature map. Figure 12.54 shows the kernels for the second layer. In this layer, we have six inputs (which are the pooled maps of the first layer) and twelve feature maps, so we need a total of 6 × 12 = 72 kernels and biases to generate the twelve feature maps in the second layer. Each column of Fig. 12.54 shows the six 5 × 5 kernels corresponding to one of the feature maps in the second layer. We used 2 × 2 pooling in both layers, resulting in a 50% reduction of each of the two spatial dimensions of the feature maps.

FIGURE 12.51 (a) Training accuracy (percent correct recognition of the training set) as a function of epoch for the 60,000 training images in the MNIST database. The maximum achieved was 99.36% correct recognition. (b) Accuracy as a function of epoch for the 10,000 test images in the MNIST database. The maximum correct recognition rate was 99.13%. (Both plots span accuracies of 0.86 to 1.00 over epochs 0 to 200.)

FIGURE 12.52 (a) Recognition accuracy of training set by image class. Each bar shows a number between 0 and 1. When multiplied by 100%, these numbers give the correct recognition percentage for that class. (b) Recognition results per class in the test set. In both graphs the recognition rate is above 98%.

Finally, it is of interest to visualize how one input image proceeds through the network, using the kernels learned during training. Figure 12.55 shows an input digit image from the test set, and the computations performed by the CNN at each layer. As before, we display numerical results as intensities. Consider the results of convolution in the first layer. If you look at each resulting feature map carefully, you will notice that it highlights a different characteristic of the input. For example, the feature map on the top of the first column highlights the two vertical edges on the top of the character. The second highlights the edges of the entire inner region, and the third highlights a "blob-like" feature of the digit, as if it had been blurred by a lowpass kernel. The other three feature maps show other features. If you now look at the first two feature maps in the second layer, and compare them with the first feature map in the first layer, you can see that they could be interpreted as higher-level abstractions of the top of the character, in the sense that they show a dark area flanked on each side by white areas. Although these abstractions are not always easy to analyze visually, this example clearly demonstrates that they can be very effective. And, remember the important fact that our simple system learned these features automatically from 60,000 training images. This capability is what makes convolutional networks so powerful when it comes to image pattern classification. In the next example, we will consider even more complex images, and show some of the limitations of our simple CNN architecture.

EXAMPLE 12.18: Using a large image database to teach a CNN to recognize natural images.

In this example, we trained the same CNN architecture as in Fig. 12.49, but using the RGB color images in Fig. 12.56. These images are representative of those found in the CIFAR-10 database, a popular database used to test the performance of image classification systems. Our objective was to test the limitations of the CNN architecture in Fig. 12.49 by training it with data that is significantly more complex than the MNIST images in Example 12.17. The only difference between the architecture needed to process the CIFAR-10 images and the architecture in Fig. 12.49 is that the CIFAR-10 images are RGB color images, and hence have three channels. We worked with these input images using the approach explained in the subsection entitled Multiple Input Images earlier in this section.


FIGURE 12.53 Kernels of the first layer after 200 epochs of training, shown as images.

FIGURE 12.54 Kernels of the second layer after 200 epochs of training, displayed as images of size 5 × 5. There are six inputs (pooled feature maps) into the second layer. Because there are twelve feature maps in the second layer, the CNN learned the weights of 6 × 12 = 72 kernels.

FIGURE 12.55 Results of a forward pass for one digit image through the CNN in Fig. 12.49 after training. The feature maps were generated using the kernels from Figs. 12.53 and 12.54, followed by pooling. The neural net is the two-layer neural network from Fig. 12.49. The output high value (in white) indicates that the CNN recognized the input properly. (This figure is the same as Fig. 12.44. The columns show, from left to right, the feature maps, pooled feature maps, second-layer feature maps and pooled feature maps, the vectorized output, and the ten output neurons, labeled 0 through 9.)

FIGURE 12.56 Mini images of size 32 × 32 pixels, representative of the 50,000 training and 10,000 test images in the CIFAR-10 database (the 10 stands for ten classes). The class names are shown on the right: Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, and Truck. (Images courtesy of Pearson Education.)

FIGURE 12.57 Training mean squared error as a function of the number of epochs for a training set of 50,000 CIFAR-10 images. (Plot of training MSE, 0.20 to 0.45, versus epoch, 0 to 500.)


FIGURE 12.58 (a) Training accuracy (percent correct recognition of the training set) as a function of epoch for the 50,000 training images in the CIFAR-10 database. (b) Accuracy as a function of epoch for the 10,000 CIFAR-10 test images. (Both plots span accuracies of 0 to 1.0 over epochs 0 to 500.)

FIGURE 12.59 (a) CIFAR-10 recognition rate of training set by image class. Each bar shows a number between 0 and 1. When multiplied by 100%, these numbers give the correct recognition percentage for that class. (b) Recognition results per class in the test set.

We trained the modified CNN for 500 epochs using the 50,000 training images of the CIFAR-10 database. Figure 12.57 is a plot of the mean squared error as a function of epoch during the training phase. Observe that the MSE begins to plateau at a value of approximately 0.25. In contrast, the MSE plot in Fig. 12.50 for the MNIST data achieved a much lower final value. This is not unexpected, given that the CIFAR-10 images are significantly more complex, both in the objects of interest as well as their backgrounds. The lower expected recognition performance of the training set is confirmed by the training accuracy plotted in Fig. 12.58(a) as a function of epoch. The recognition rate leveled off around 68% for the training data and about 61% for the test data. Although these results are not nearly as good as those obtained for the MNIST data, they are consistent with what we would expect from a very basic network. It is possible to achieve over 96% accuracy on this database (see Graham [2015]), but that requires a more complex network and a different pooling strategy.

Figure 12.59 shows the recognition accuracy per class for the training and test image sets. With a few exceptions, the highest recognition rate in both the training and test sets was achieved for engineered objects, and the lowest was for small animals. Frogs were an exception, caused most likely by the fact that frog size and shape are more consistent than they are, for example, in dogs and birds. As you can see in Fig. 12.59, if the small animals were removed from the list, recognition performance on the rest of the images would have been considerably higher.

Figures 12.60 and 12.61 show the kernels of the first and second layers. Note that each column in Fig. 12.60 has three 5 × 5 kernels. This is because there are three input channels to the CNN in this example. If you look carefully at the columns in Fig. 12.60, you can detect a similarity in the arrangement and values of the coefficients. Although it is not obvious what the kernels are detecting, it is clear that they are consistent in each column, and that all columns are quite different from each other, indicating a capability to detect different features in the input images. We show Fig. 12.61 for completeness only, as there is little we can infer that deep into the network, especially at this small scale, and considering the complexity of the images in the training set. Finally, Fig. 12.62 shows a complete recognition pass through the CNN using the weights in Figs. 12.60 and 12.61. The input shows the three color channels of the RGB image in the seventh column of the first row in Fig. 12.56. The feature maps in the first column show the various features extracted from the input. The second column shows the pooling results, zoomed to the size of the feature maps for clarity. The third and fourth columns show the results in the second layer, and the fifth column shows the vectorized output. Finally, the last column shows the result of recognition, with white representing a high output, and the others showing much smaller values. The input image was properly recognized as belonging to class 1.

FIGURE 12.60 Weights of the kernels of the first convolution layer after 500 epochs of training.


FIGURE 12.61 Weights of the kernels of the second convolution layer after 500 epochs of training. The interpretation of these kernels is the same as in Fig. 12.54.

FIGURE 12.62 Graphical illustration of a forward pass through the trained CNN. The purpose was to recognize one input image from the set in Fig. 12.56. As the output shows, the image was recognized correctly as belonging to class 1, the class of airplanes. (The columns show the R, G, and B components of the input color image, the feature maps, the pooled feature maps, the second-layer feature maps and pooled feature maps, the vectorized output, and the ten output neurons, labeled 1 through 10. Original image courtesy of Pearson Education.)

12.7 SOME ADDITIONAL DETAILS OF IMPLEMENTATION

We mentioned in the previous section that neural (including convolutional) nets have the ability to learn features directly from training data, thus reducing the need for "engineered" features. While this is a significant advantage, it does not imply that the design of a neural network is free of human input. On the contrary, designing complex neural networks requires significant skill and experimentation.

In the last two sections, our focus was on the development of fundamental concepts in neural nets, with an emphasis on the derivation of backpropagation for both fully connected and convolutional nets. Backpropagation is the backbone of neural net design, but there are other important considerations that influence how well a neural net learns, and then generalizes to patterns it has not seen before. In this section, we discuss briefly some important aspects in the design of fully connected and convolutional neural networks.

One of the first questions when designing a neural net architecture is how many layers to specify for the network. Theoretically, the universal approximation theorem (Cybenko [1989]) tells us that, under mild conditions, arbitrarily complex decision functions can be approximated by a continuous feedforward neural network with a single hidden layer. Although the theorem does not tell us how to compute the parameters of that single hidden layer, it does indicate that structurally simple neural nets can be very powerful. You have seen this in some of the examples in the last two sections. Experimental evidence suggests that deep neural nets (i.e., networks with two or more hidden layers) are better than a single-hidden-layer network at learning abstract representations, which typically is the main point of learning. There is no such thing as an algorithm to determine the "optimum" number of layers to use in a neural network. Therefore, specifying the number of layers generally is determined by a combination of experience and experimentation. "Starting small" is a logical approach to this problem. The more layers a network has, the higher the probability that backpropagation will run into problems such as so-called vanishing gradients, where gradient values are so small that gradient descent ceases to be effective. In convolutional networks, we have the added issue that the size of the inputs decreases as the images propagate through the network. There are two causes for this. The first is a natural size reduction caused by convolution itself, with the amount of reduction being proportional to the size of the receptive fields. One solution is to use padding prior to performing convolution operations, as we discussed in Section 3.4. The second (and most significant) cause of size reduction is pooling. The minimum pooling neighborhood is of size 2 × 2, which reduces the size of feature maps by three-quarters at each layer.


A solution that helps is to upsample the input images, but this must be done with care because the relative sizes of features of interest would increase proportionally, thus influencing the size selected for receptive fields.
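The interplay of receptive-field size, padding, and pooling can be checked with the output-size relation of Problem 12.31, N = (V + 2P − F)/S + 1. The helper below is an illustrative sketch, not code from the book; the example values are assumptions.

```python
def feature_map_width(V, F, P=0, S=1):
    """Width of a (square) feature map for input width V, receptive-field
    width F, padding P, and stride S (see Problem 12.31)."""
    N = (V + 2 * P - F) / S + 1
    if N != int(N):
        raise ValueError("incompatible combination of V, F, P, S")
    return int(N)

# Without padding, a 5 x 5 receptive field shrinks each dimension by 4;
# padding with P = 2 preserves the input size, and 2 x 2 pooling halves it.
print(feature_map_width(28, 5))             # 24
print(feature_map_width(28, 5, P=2))        # 28
print(feature_map_width(28, 5, P=2) // 2)   # 14 after 2 x 2 pooling
```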
After the number of layers has been specified, the next task is to specify the number of neurons per layer. We always know how many neurons are needed in the first and last layers, but the number of neurons for the internal layers is an open question with no theoretical "best" answer. If the objective is to keep the number of layers as small as possible, the power of the network is increased to some degree by increasing the number of neurons per layer.

The main aspects of specifying the architecture of a neural network are completed by specifying the activation function. In this chapter, we worked with sigmoid functions for consistency between examples, but there are applications in which hyperbolic tangent and ReLU activation functions are superior in terms of improving training performance.

Once a network architecture has been specified, training is the central aspect of making the architecture useful. Although the networks we discussed in this chapter are relatively simple, networks applied to very large-scale problems can have millions of nodes and require large blocks of time to train. When available, the parameters of a pretrained network are an ideal starting point for further training, or for validating recognition performance. Another central theme in training neural nets is the use of GPUs to accelerate matrix operations.

An issue often encountered in training is over-fitting, in which recognition of the training set is acceptable, but the recognition rate on samples not used for training is much lower. That is, the net is not able to generalize what it learned and apply it to inputs it has not encountered before. When additional training data is not available, the most common approach is to artificially enlarge the training set using transformations such as geometric distortions and intensity variations. The transformations are carried out while preserving the class membership of the transformed patterns. Another major approach is to use dropout, a technique that randomly drops nodes with their connections from a neural network during training. The idea is to change the architecture slightly to prevent the net from adapting too much to a fixed set of parameters (see Srivastava et al. [2014]).

In addition to computational speed, another important aspect of training is efficiency. Simple things, such as shuffling the input patterns at the beginning of each training epoch, can reduce or eliminate the possibility of "cycling," in which parameter values repeat at regular intervals. Stochastic gradient descent is another important training refinement in which, instead of using the entire training set, samples are selected randomly and input into the network. You can think of this as dividing the training set into mini-batches, and then choosing a single sample from each mini-batch. This approach often results in speedier convergence during training.

In addition to the above topics, a paper by LeCun et al. [2012] is an excellent overview of the types of considerations introduced in the preceding discussion. In fact, the breadth spanned by these topics is extensive enough to be the subject of an entire book (see Montavon et al. [2012]). The neural net architectures we discussed were by necessity limited in scope. You can get a good idea of the practical requirements of implementing large-scale networks by reading a paper by Krizhevsky, Sutskever, and Hinton [2012], which summarizes the design and implementation of a large-scale, deep convolutional neural network. There are a multitude of designs that have been implemented over the past decade, including commercial and free implementations. A quick internet search will reveal a multitude of available architectures.

Summary, References, and Further Reading

Background material for Sections 12.1 through 12.4 is provided by the books by Theodoridis and Koutroumbas [2006], by Duda, Hart, and Stork [2001], and by Tou and Gonzalez [1974]. For additional reading on the material on matching shape numbers, see Bribiesca and Guzman [1980]. On string matching, see Sze and Yang [1981]. A significant portion of this chapter was devoted to neural networks. This is a reflection of the fact that neural nets, and in particular convolutional neural nets, have made significant strides in the past decade in solving image pattern classification problems. As in the rest of the book, our presentation of this topic focused on fundamentals, but the topics covered were thoroughly developed. What you have learned in this chapter is a solid foundation for much of the work being conducted in this area. As we mentioned earlier, the literature on neural nets is vast and quickly growing. As a starting point, a basic book by Nielsen [2015] provides an excellent introduction to the topic. The more advanced book by Goodfellow, Bengio, and Courville [2016] provides more depth into the mathematical underpinning of neural nets. Two classic papers worth reading are by Rumelhart, Hinton, and Williams [1986], and by LeCun, Bengio, and Haffner [1998]. The LeNet architecture we discussed in Section 12.6 was introduced in the latter reference, and it is still a foundation for image pattern classification. A recent survey article by LeCun, Bengio, and Hinton [2015] gives an interesting perspective on the scope of applicability of neural nets in general. The paper by Krizhevsky, Sutskever, and Hinton [2012] was one of the most important catalysts leading to the significant increase in the present interest in convolutional networks, and in their applicability to image pattern classification. This paper is also a good overview of the details and techniques involved in implementing a large-scale convolutional neural network. For details on the software aspects of many of the examples in this chapter, see Gonzalez, Woods, and Eddins [2009].

Problems

Solutions to the problems marked with an asterisk (*) are in the DIP4E Student Support Package (consult the book website: www.ImageProcessingPlace.com).

12.1 Do the following:
(a)* Compute the decision functions of a minimum distance classifier for the patterns in Fig. 12.10. You may obtain the required mean vectors by (careful) inspection.
(b) Sketch the decision boundary implemented by the decision functions in (a).

12.2 * Show that Eqs. (12-3) and (12-4) perform the same function in terms of pattern classification.

12.3 Show that the boundary given by Eq. (12-8) is the perpendicular bisector of the line joining the n-dimensional points m_i and m_j.

12.4 * Show how the minimum distance classifier discussed in connection with Fig. 12.11 could be implemented by using Nc resistor banks (Nc is the number of classes), a summing junction at each bank (for summing currents), and a maximum selector capable of selecting the maximum value of Nc decision functions in order to determine the class membership of a given input.

12.5 * Show that the correlation coefficient of Eq. (12-10) has values in the range [−1, 1]. (Hint: Express g in vector form.)

12.6 Show that the distance measure D(a, b) in Eq. (12-12) satisfies the properties in Eq. (12-13).

12.7 * Show that β = max(|a|, |b|) − α in Eq. (12-14) is 0 if and only if a and b are identical strings.

12.8 Carry out the manual computations that resulted in the mean vector and covariance matrices in Example 12.5.

12.9 * The following pattern classes have Gaussian probability density functions:


c1: {(0, 0)^T, (2, 0)^T, (2, 2)^T, (0, 2)^T}
c2: {(4, 4)^T, (6, 4)^T, (6, 6)^T, (4, 6)^T}
(a) Assume that P(c1) = P(c2) = 1/2 and obtain the equation of the Bayes decision boundary between these two classes.
(b) Sketch the boundary.

12.10 Repeat Problem 12.9, but use the following pattern classes:
c1: {(−1, 0)^T, (0, −1)^T, (1, 0)^T, (0, 1)^T}
c2: {(−2, 0)^T, (0, −2)^T, (2, 0)^T, (0, 2)^T}
Note that the classes are not linearly separable.

12.11 With reference to the results in Table 12.1, compute the overall correct recognition rate for the patterns of the training set. Repeat for the patterns of the test set.

12.12 * We derived the Bayes decision functions d_j(x) = p(x | c_j) P(c_j), j = 1, 2, …, Nc, using a 0-1 loss function. Prove that these decision functions minimize the probability of error. (Hint: The probability of error p(e) is 1 − p(c), where p(c) is the probability of being correct. For a pattern vector x belonging to class c_i, p(c | x) = p(c_i | x). Find p(c) and show that p(c) is maximum [p(e) is minimum] when p(x | c_i) P(c_i) is maximum.)

12.13 Finish the computations started in Example 12.7.

12.14 * The perceptron algorithm given in Eqs. (12-44) through (12-46) can be expressed in a more concise form by multiplying the patterns of class c2 by −1, in which case the correction steps in the algorithm become w(k + 1) = w(k), if w^T(k) y(k) > 0, and w(k + 1) = w(k) + α y(k) otherwise, where we use y instead of x to make it clear that the patterns of class c2 were multiplied by −1. This is one of several perceptron algorithm formulations that can be derived starting from the general gradient descent equation

$$w(k+1) = w(k) - \alpha\left[\frac{\partial J(w, y)}{\partial w}\right]_{w = w(k)}$$

where α > 0, J(w, y) is a criterion function, and the partial derivative is evaluated at w = w(k). Show that the perceptron algorithm in the problem statement can be obtained from this general gradient descent procedure by using the criterion function

$$J(w, y) = \tfrac{1}{2}\big(\lvert w^{T} y \rvert - w^{T} y\big)$$

(Hint: The partial derivative of w^T y with respect to w is y.)

12.15 * Prove that the perceptron training algorithm given in Eqs. (12-44) through (12-46) converges in a finite number of steps if the training pattern sets are linearly separable. [Hint: Multiply the patterns of class c2 by −1 and consider a nonnegative threshold, T0, so that the perceptron training algorithm (with α = 1) is expressed in the form w(k + 1) = w(k), if w^T(k) y(k) > T0, and w(k + 1) = w(k) + α y(k) otherwise. You may need to use the Cauchy-Schwartz inequality: ‖a‖² ‖b‖² ≥ (a^T b)².]

12.16 Derive equations of the derivatives of the following activation functions:
(a) The sigmoid activation function in Fig. 12.30(a).
(b) The hyperbolic tangent activation function in Fig. 12.30(b).
(c)* The ReLU activation function in Fig. 12.30(c).

12.17 * Specify the structure, weights, and bias(es) of the smallest neural network capable of performing exactly the same function as a minimum distance classifier for two pattern classes in n-dimensional space. You may assume that the classes are tightly grouped and are linearly separable.

12.18 What is the decision boundary implemented by a neural network with n inputs, a single output neuron, and no hidden layers? Explain.

12.19 Specify the structure, weights, and bias of a neural network capable of performing exactly the same function as a Bayes classifier for two pattern classes in n-dimensional space. The classes are Gaussian with different means but equal covariance matrices.

12.20 Answer the following:
(a)* Under what conditions are the neural networks in Problems 12.17 and 12.19 identical?
(b) Suppose you specify a neural net architecture identical to the one in Problem 12.17. Would training by backpropagation yield the same weights and bias as that network if trained with a sufficiently large number of samples? Explain.

12.21 Two pattern classes in two dimensions are distributed in such a way that the patterns of class c1 lie randomly along a circle of radius r1. Similarly, the patterns of class c2 lie randomly along a circle of radius r2, where r2 = 2r1. Specify the structure of a neural network with the minimum number of layers and nodes needed to classify properly the patterns of these two classes.

12.22 * If two classes are linearly separable, we can train a perceptron starting with weights and a bias that are all zero, and we would still get a solution. Can you do the same when training a neural network by backpropagation? Explain.

12.23 Label the outputs, weights, and biases for every node in the following neural network using the general notation introduced in Fig. 12.31.

12.24 Answer the following:
(a) The last element of the input vector in Fig. 12.32 is 1. Is this vector augmented? Explain.
(b) Repeat the calculations in Fig. 12.32, but using weight matrices that are 100 times the values of those used in the figure.
(c)* What can you conclude in general from your results in (b)?

12.25 Answer the following:
(a)* The chain rule in Eq. (12-70) shows three terms. However, you are probably more familiar with chain rule expressions that have two terms. Show that if you start with the expression

$$\delta_j(\ell) = \frac{\partial E}{\partial z_j(\ell)} = \sum_i \frac{\partial E}{\partial z_i(\ell+1)}\,\frac{\partial z_i(\ell+1)}{\partial z_j(\ell)}$$

you can arrive at the result in Eq. (12-70).
(b) Show how the middle term in the third line of Eq. (12-70) follows from the middle term in the second.

12.26 Show the validity of Eq. (12-72). (Hint: Use the chain rule.)

12.27 * Show that the dimensions of matrix D(ℓ) in Eq. (12-79) are n_ℓ × n_p. (Hint: Some of the parameters in that equation are computed in forward propagation, so you already know their dimensions.)

12.28 With reference to the discussion following Eq. (12-82), explain why the error for one pattern is obtained by squaring the elements of one column of matrix (A(L) − R), adding them, and dividing the result by 2.

12.29 * The matrix formulation in Table 12.3 contains all patterns as columns of a single matrix X. This is ideal in terms of speed and economy of implementation. It is also well suited when training is done using mini-batches. However, there are applications in which the number of training vectors is too large to hold in memory, and it becomes more practical to loop through each pattern using the vector formulation. Compose a table similar to Table 12.3, but using individual patterns, x, instead of matrix X.

12.30 Consider a CNN whose inputs are RGB color images of size 512 × 512 pixels. The network has two convolutional layers. Using this information, answer the following:
(a)* You are told that the spatial dimensions of the feature maps in the first layer are 504 × 504, and that there are 12 feature maps in the first layer. Assuming that no padding is used, and that the kernels used are square, and of an odd size, what are the spatial dimensions of these kernels?
(b) If subsampling is done using neighborhoods of size 2 × 2, what are the spatial dimensions of the pooled feature maps in the first layer?
(c) What is the depth (number) of the pooled feature maps in the first layer?
(d) The spatial dimensions of the convolution kernels in the second layer are 3 × 3. Assuming no padding, what are the sizes of the feature maps in the second layer?
(e) You are told that the number of feature maps
www.EBooksWorld.ir www.EBooksWorld.ir

DIP4E_GLOBAL_Print_Ready.indb 991 6/16/2017 2:18:13 PM DIP4E_GLOBAL_Print_Ready.indb 992 6/16/2017 2:18:14 PM


Problems 993

in the second layer is 6, and that the size of the pooling neighborhoods is again 2 × 2. What are the dimensions of the vectors that result from vectorizing the last layer of the CNN? Assume that vectorization is done using linear indexing.

12.31 Suppose the input images to a CNN are padded to compensate for the size reduction caused by convolution and subsampling (pooling). Let P denote the thickness of the padding border, let V denote the width of the (square) input images, let S denote the stride, and let F denote the width of the (square) receptive field.
(a) Show that the number, N, of neurons in each row in the resulting feature map is

$$N = \frac{V + 2P - F}{S} + 1$$

(b)* How would you interpret a result using this equation that is not an integer?

12.32 * Show the validity of Eq. (12-106).

12.33 An experiment produces binary images of blobs that are nearly elliptical in shape, as the following example image shows. The blobs are of three sizes, with the average values of the principal axes of the ellipses being (1.3, 0.7), (1.0, 0.5), and (0.75, 0.25). The dimensions of these axes vary ±10% about their average values. Develop an image processing system capable of rejecting incomplete or overlapping ellipses, then classifying the remaining single ellipses into one of the three given size classes. Show your solution in block diagram form, giving specific details regarding the operation of each block. Solve the classification problem using a minimum distance classifier, indicating clearly how you would go about obtaining training samples, and how you would use these samples to train the classifier.

12.34 A factory mass-produces small American flags for sporting events. The quality assurance team has observed that, during periods of peak production, some printing machines have a tendency to drop (randomly) between one and three stars and one or two entire stripes. Aside from these errors, the flags are perfect in every other way. Although the flags containing errors represent a small percentage of total production, the plant manager decides to solve the problem. After much investigation, she concludes that automatic inspection using image processing techniques is the most economical approach. The basic specifications are as follows: The flags are approximately 7.5 cm by 12.5 cm in size. They move lengthwise down the production line (individually, but with a ±15% variation in orientation) at approximately 50 cm/s, with a separation between flags of approximately 5 cm. In all cases, "approximately" means ±5%. The plant manager employs you to design an image processing system for each production line. You are told that cost and simplicity are important parameters in determining the viability of your approach. Design a complete system based on the model of Fig. 1.23. Document your solution (including assumptions and specifications) in a brief (but clear) written report addressed to the plant manager. You can use any of the methods discussed in the book.

