10 Image Segmentation

The whole is equal to the sum of its parts.
    — Euclid

The whole is greater than the sum of its parts.
    — Max Wertheimer

PREVIEW

The material in the previous chapter began a transition from image processing methods whose inputs and outputs are images, to methods in which the inputs are images but the outputs are attributes extracted from those images. Most of the segmentation algorithms in this chapter are based on one of two basic properties of image intensity values: discontinuity and similarity. In the first category, the approach is to partition an image into regions based on abrupt changes in intensity, such as edges. Approaches in the second category are based on partitioning an image into regions that are similar according to a set of predefined criteria. Thresholding, region growing, and region splitting and merging are examples of methods in this category. We show that improvements in segmentation performance can be achieved by combining methods from distinct categories, such as techniques in which edge detection is combined with thresholding. We also discuss image segmentation using clustering and superpixels, and give an introduction to graph cuts, an approach ideally suited for extracting the principal regions of an image. This is followed by a discussion of image segmentation based on morphology, an approach that combines several of the attributes of segmentation based on the techniques presented in the first part of the chapter. We conclude the chapter with a brief discussion on the use of motion cues for segmentation.

Upon completion of this chapter, readers should:

- Understand the characteristics of various types of edges found in practice.
- Understand how to use spatial filtering for edge detection.
- Be familiar with other types of edge detection methods that go beyond spatial filtering.
- Understand image thresholding using several different approaches.
- Know how to combine thresholding and spatial filtering to improve segmentation.
- Be familiar with region-based segmentation, including clustering and superpixels.
- Understand how graph cuts and morphological watersheds are used for segmentation.
- Be familiar with basic techniques for utilizing motion in image segmentation.

10.1 FUNDAMENTALS

Let R represent the entire spatial region occupied by an image. We may view image segmentation as a process that partitions R into n subregions, R_1, R_2, ..., R_n, such that

(a) $\bigcup_{i=1}^{n} R_i = R$;
(b) R_i is a connected set, for i = 1, 2, ..., n;
(c) $R_i \cap R_j = \emptyset$ for all i and j, i ≠ j;
(d) $Q(R_i) = \text{TRUE}$ for i = 1, 2, ..., n;
(e) $Q(R_i \cup R_j) = \text{FALSE}$ for any adjacent regions R_i and R_j;

where Q(R_k) is a logical predicate defined over the points in set R_k, and ∅ is the null set. The symbols ∪ and ∩ represent set union and intersection, respectively, as defined in Section 2.6. Two regions R_i and R_j are said to be adjacent if their union forms a connected set, as defined in Section 2.5. If the set formed by the union of two regions is not connected, the regions are said to be disjoint.

Condition (a) indicates that the segmentation must be complete, in the sense that every pixel must be in a region. Condition (b) requires that points in a region be connected in some predefined sense (e.g., the points must be 8-connected). Condition (c) says that the regions must be disjoint. Condition (d) deals with the properties that must be satisfied by the pixels in a segmented region—for example, Q(R_i) = TRUE if all pixels in R_i have the same intensity. Finally, condition (e) indicates that two adjacent regions R_i and R_j must be different in the sense of predicate Q.†

† In general, Q can be a compound expression such as, "Q(R_i) = TRUE if the average intensity of the pixels in region R_i is less than m_i AND if the standard deviation of their intensity is greater than s_i," where m_i and s_i are specified constants.

Thus, we see that the fundamental problem in segmentation is to partition an image into regions that satisfy the preceding conditions. Segmentation algorithms for monochrome images generally are based on one of two basic categories dealing with properties of intensity values: discontinuity and similarity. In the first category, we assume that boundaries of regions are sufficiently different from each other, and from the background, to allow boundary detection based on local discontinuities in intensity. Edge-based segmentation is the principal approach used in this category. Region-based segmentation approaches in the second category are based on partitioning an image into regions that are similar according to a set of predefined criteria.

Figure 10.1 illustrates the preceding concepts. Figure 10.1(a) shows an image of a region of constant intensity superimposed on a darker background, also of constant intensity. These two regions comprise the overall image. Figure 10.1(b) shows the result of computing the boundary of the inner region based on intensity discontinuities. Points on the inside and outside of the boundary are black (zero) because there are no discontinuities in intensity in those regions. To segment the image, we assign one level (say, white) to the pixels on or inside the boundary, and another level (e.g., black) to all points exterior to the boundary. Figure 10.1(c) shows the result of such a procedure. We see that conditions (a) through (c) stated at the beginning of this section are satisfied by this result. The predicate of condition (d) is: If a pixel is on, or inside, the boundary, label it white; otherwise, label it black. We see that this predicate is TRUE for the points labeled black or white in Fig. 10.1(c). Similarly, the two segmented regions (object and background) satisfy condition (e).

FIGURE 10.1 (a) Image of a constant intensity region. (b) Boundary based on intensity discontinuities. (c) Result of segmentation. (d) Image of a texture region. (e) Result of intensity discontinuity computations (note the large number of small edges). (f) Result of segmentation based on region properties.

The next three images illustrate region-based segmentation. Figure 10.1(d) is similar to Fig. 10.1(a), but the intensities of the inner region form a textured pattern. Figure 10.1(e) shows the result of computing intensity discontinuities in this image. The numerous spurious changes in intensity make it difficult to identify a unique boundary for the original image because many of the nonzero intensity changes are connected to the boundary, so edge-based segmentation is not a suitable approach. However, we note that the outer region is constant, so all we need to solve this segmentation problem is a predicate that differentiates between textured and constant regions. The standard deviation of pixel values is a measure that accomplishes this, because it is nonzero in areas of the texture region and zero otherwise. Figure 10.1(f) shows the result of dividing the original image into subregions of size 8 × 8. Each subregion was then labeled white if the standard deviation of its pixels was positive (i.e., if the predicate was TRUE), and zero otherwise. The result has a "blocky" appearance around the edge of the region because groups of 8 × 8 squares were labeled with the same intensity (smaller squares would have given a smoother region boundary). Finally, note that these results also satisfy the five segmentation conditions stated at the beginning of this section.
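To make the predicate idea concrete, the following is a minimal NumPy sketch of the region-based procedure just described for Fig. 10.1(f): divide the image into 8 × 8 subregions and label each block white where the standard deviation of its pixels is positive. The function name and the synthetic test image are our own illustrative choices, not from the book.

```python
import numpy as np

def blockwise_predicate_segmentation(image, block=8, std_thresh=0.0):
    """Label each block x block subregion white (255) if the standard
    deviation of its pixels exceeds std_thresh, else black (0).
    A sketch of the predicate Q(R_i) used for Fig. 10.1(f)."""
    h, w = image.shape
    out = np.zeros_like(image, dtype=np.uint8)
    for r in range(0, h, block):
        for c in range(0, w, block):
            region = image[r:r + block, c:c + block]
            if region.std() > std_thresh:      # Q(R_i) = TRUE
                out[r:r + block, c:c + block] = 255
    return out

# Hypothetical usage: a flat dark background with a noisy (textured) square.
rng = np.random.default_rng(0)
img = np.full((64, 64), 50.0)
img[16:48, 16:48] += rng.normal(0, 20, (32, 32))   # textured inner region
mask = blockwise_predicate_segmentation(img)
```

As in the figure, the result is blocky at the region boundary because whole 8 × 8 blocks receive the same label; a smaller block size would trace the boundary more smoothly.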
10.2 POINT, LINE, AND EDGE DETECTION

The focus of this section is on segmentation methods that are based on detecting sharp, local changes in intensity. The three types of image characteristics in which we are interested are isolated points, lines, and edges. Edge pixels are pixels at which the intensity of an image changes abruptly, and edges (or edge segments) are sets of connected edge pixels (see Section 2.5 regarding connectivity). Edge detectors are local image processing tools designed to detect edge pixels. A line may be viewed as a (typically) thin edge segment in which the intensity of the background on either side of the line is either much higher or much lower than the intensity of the line pixels. (When we refer to lines, we are referring to thin structures, typically just a few pixels thick. Such lines may correspond, for example, to elements of a digitized architectural drawing, or roads in a satellite image.) In fact, as we will discuss later, lines give rise to so-called "roof edges." Finally, an isolated point may be viewed as a foreground (background) pixel surrounded by background (foreground) pixels.

BACKGROUND

As we saw in Section 3.5, local averaging smoothes an image. Given that averaging is analogous to integration, it is intuitive that abrupt, local changes in intensity can be detected using derivatives. For reasons that will become evident shortly, first- and second-order derivatives are particularly well suited for this purpose.

Derivatives of a digital function are defined in terms of finite differences. There are various ways to compute these differences but, as explained in Section 3.6, we require that any approximation used for first derivatives (1) must be zero in areas of constant intensity; (2) must be nonzero at the onset of an intensity step or ramp; and (3) must be nonzero at points along an intensity ramp. Similarly, we require that an approximation used for second derivatives (1) must be zero in areas of constant intensity; (2) must be nonzero at the onset and end of an intensity step or ramp; and (3) must be zero along intensity ramps. Because we are dealing with digital quantities whose values are finite, the maximum possible intensity change is also finite, and the shortest distance over which a change can occur is between adjacent pixels.

We obtain an approximation to the first-order derivative at an arbitrary point x of a one-dimensional function f(x) by expanding the function f(x + Δx) into a Taylor series about x:

$$f(x + \Delta x) = f(x) + \Delta x\,\frac{\partial f(x)}{\partial x} + \frac{(\Delta x)^2}{2!}\frac{\partial^2 f(x)}{\partial x^2} + \frac{(\Delta x)^3}{3!}\frac{\partial^3 f(x)}{\partial x^3} + \cdots = \sum_{n=0}^{\infty}\frac{(\Delta x)^n}{n!}\frac{\partial^n f(x)}{\partial x^n} \qquad (10\text{-}1)$$

where Δx is the separation between samples of f. (Recall that the notation n! means "n factorial": n! = 1·2·⋯·n.) For our purposes, this separation is measured in pixel units. Thus, following the convention in the book, Δx = 1 for the sample following x and Δx = −1 for the sample preceding x. Although these are expressions of only one variable, we use partial-derivative notation for consistency with the discussion of functions of two variables later in this section. When Δx = 1, Eq. (10-1) becomes

$$f(x+1) = f(x) + \frac{\partial f(x)}{\partial x} + \frac{1}{2!}\frac{\partial^2 f(x)}{\partial x^2} + \frac{1}{3!}\frac{\partial^3 f(x)}{\partial x^3} + \cdots = \sum_{n=0}^{\infty}\frac{1}{n!}\frac{\partial^n f(x)}{\partial x^n} \qquad (10\text{-}2)$$

Similarly, letting Δx = −1 in Eq. (10-1) gives the expansion for the sample preceding x:

$$f(x-1) = f(x) - \frac{\partial f(x)}{\partial x} + \frac{1}{2!}\frac{\partial^2 f(x)}{\partial x^2} - \frac{1}{3!}\frac{\partial^3 f(x)}{\partial x^3} + \cdots \qquad (10\text{-}3)$$

The forward difference is obtained from Eq. (10-2) by retaining only the terms up to the first derivative:

$$\frac{\partial f(x)}{\partial x} = f'(x) = f(x+1) - f(x) \qquad (10\text{-}4)$$

The backward difference follows in the same way from Eq. (10-3):

$$\frac{\partial f(x)}{\partial x} = f'(x) = f(x) - f(x-1) \qquad (10\text{-}5)$$

and the central difference is obtained by subtracting Eq. (10-3) from Eq. (10-2):

$$\frac{\partial f(x)}{\partial x} = f'(x) = \frac{f(x+1) - f(x-1)}{2} \qquad (10\text{-}6)$$

The higher terms of the series that we did not use represent the error between an exact and an approximate derivative expansion. In general, the more terms we use from the Taylor series to represent a derivative, the more accurate the approximation will be. To include more terms implies that more points are used in the approximation, yielding a lower error. However, it turns out that central differences have a lower error for the same number of points (see Problem 10.1). For this reason, derivatives are usually expressed as central differences.
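As a quick numerical check of the claim that central differences have a lower error for the same number of points, the following sketch (our illustration, not from the book) compares the forward difference of Eq. (10-4) with the central difference of Eq. (10-6) on a smooth test function whose derivative is known exactly.

```python
import numpy as np

# Sample f(x) = sin(x) at a small spacing and compare derivative estimates
# with the exact derivative cos(x) at a test point.
dx = 0.1        # sample separation (in images this would be 1 pixel)
x = 2.0
f = np.sin

forward = (f(x + dx) - f(x)) / dx             # Eq. (10-4), error O(dx)
central = (f(x + dx) - f(x - dx)) / (2 * dx)  # Eq. (10-6), error O(dx^2)
exact = np.cos(x)

print(abs(forward - exact))   # ~ 4.5e-2
print(abs(central - exact))   # ~ 6.9e-4, much smaller for the same spacing
```

The odd-order error terms cancel when Eq. (10-3) is subtracted from Eq. (10-2), which is why the central difference is an order of magnitude more accurate here.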
The second-order derivative based on a central difference, ∂²f(x)/∂x², is obtained by adding Eqs. (10-2) and (10-3):

$$\frac{\partial^2 f(x)}{\partial x^2} = f''(x) = f(x+1) - 2f(x) + f(x-1) \qquad (10\text{-}7)$$

To obtain the third-order, central derivative we need one more point on either side of x. That is, we need the Taylor expansions for f(x + 2) and f(x − 2), which we obtain from Eqs. (10-2) and (10-3) with Δx = 2 and Δx = −2, respectively. The strategy is to combine the two Taylor expansions to eliminate all derivatives lower than the third. The result after ignoring all higher-order terms [see Problem 10.2(a)] is

$$\frac{\partial^3 f(x)}{\partial x^3} = f'''(x) = \frac{1}{2}\big[\,f(x+2) - 2f(x+1) + 2f(x-1) - f(x-2)\,\big] \qquad (10\text{-}8)$$

The fourth-order central derivative is obtained in a similar manner. Table 10.1 summarizes the first four central derivatives.

TABLE 10.1 First four central digital derivatives (finite differences) for samples taken uniformly, Δx = 1 units apart.

|            | f(x+2) | f(x+1) | f(x) | f(x−1) | f(x−2) |
|------------|--------|--------|------|--------|--------|
| 2f′(x)     |        | 1      | 0    | −1     |        |
| f″(x)      |        | 1      | −2   | 1      |        |
| 2f‴(x)     | 1      | −2     | 0    | 2      | −1     |
| f⁗(x)      | 1      | −4     | 6    | −4     | 1      |

These results extend directly to functions of two variables by applying them to one variable at a time. In particular, the second-order derivatives in the x- and y-directions are

$$\frac{\partial^2 f(x, y)}{\partial x^2} = f(x+1, y) - 2f(x, y) + f(x-1, y) \qquad (10\text{-}10)$$

and

$$\frac{\partial^2 f(x, y)}{\partial y^2} = f(x, y+1) - 2f(x, y) + f(x, y-1) \qquad (10\text{-}11)$$

It is easily verified that the first- and second-order derivatives in Eqs. (10-4) through (10-7) satisfy the conditions stated at the beginning of this section regarding derivatives of the first and second order. To illustrate this, consider Fig. 10.2. Part (a) shows an image of various objects, a line, and an isolated point. Figure 10.2(b) shows a horizontal intensity profile (scan line) through the center of the image, including the isolated point. Transitions in intensity between the solid objects and the background along the scan line show two types of edges: ramp edges (on the left) and step edges (on the right). As we will discuss later, intensity transitions involving thin objects such as lines often are referred to as roof edges.

Figure 10.2(c) shows a simplified profile, with just enough points to make it possible for us to analyze manually how the first- and second-order derivatives behave as they encounter a point, a line, and the edges of objects. In this diagram the transition in the ramp spans four pixels, the noise point is a single pixel, the line is three pixels thick, and the transition of the step edge takes place between adjacent pixels.
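The central-difference coefficients in Table 10.1 can be applied directly as 1-D convolution kernels. The sketch below is our own illustration (the profile values are made up) checking the stated conditions on a profile containing a constant area, a ramp, and a step.

```python
import numpy as np

# A 1-D profile: constant area, a descending ramp, a constant area, a step.
profile = np.array([5, 5, 5, 4, 3, 2, 1, 1, 1, 1, 6, 6, 6], dtype=float)

# Central differences from Table 10.1. np.convolve flips the kernel, so
# coefficients are listed from the f(x-1) side to the f(x+1) side.
# mode="valid" keeps only interior points, avoiding boundary artifacts.
f1 = np.convolve(profile, [0.5, 0, -0.5], mode="valid")  # f'(x), Eq. (10-6)
f2 = np.convolve(profile, [1, -2, 1], mode="valid")      # f''(x), Eq. (10-7)

print(f1)  # zero in constant areas, nonzero along the entire ramp and at the step
print(f2)  # zero along the ramp interior, nonzero at the ramp's onset and end,
           # and a double (+/-) response at the step
```

The printed values confirm the requirements listed earlier: the first derivative responds everywhere along the ramp, while the second derivative responds only at its onset and end, and produces a sign-changing pair at the step.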
FIGURE 10.2 (a) Image. (b) Horizontal intensity profile that includes the isolated point indicated by the arrow. (c) Subsampled profile; the dashes were added for clarity. The numbers in the boxes are the intensity values of the dots shown in the profile. The derivatives were obtained using the finite-difference approximations discussed in this section.

Moving along the profile, the second derivative has opposite signs (negative to positive or positive to negative) as it transitions into and out of an edge. This "double-edge" effect is an important characteristic that can be used to locate edges, as we will show later in this section. As we move into the edge, the sign of the second derivative is used also to determine whether an edge is a transition from light to dark (negative second derivative), or from dark to light (positive second derivative).

In summary, we arrive at the following conclusions: (1) First-order derivatives generally produce thicker edges. (2) Second-order derivatives have a stronger response to fine detail, such as thin lines, isolated points, and noise. (3) Second-order derivatives produce a double-edge response at ramp and step transitions in intensity. (4) The sign of the second derivative can be used to determine whether a transition into an edge is from light to dark or from dark to light.
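The double-edge effect and the sign behavior just described are easy to see numerically. The sketch below (our illustration, with made-up values) evaluates the second difference of Eq. (10-7) across a dark-to-light ramp edge: the response is positive at the base of the ramp and negative at its top, with zero in between.

```python
import numpy as np

# A dark-to-light ramp edge, in the spirit of Fig. 10.2(c).
profile = np.array([1, 1, 1, 2, 3, 4, 5, 5, 5], dtype=float)

# Second-order central difference, Eq. (10-7), at interior points only.
f2 = np.convolve(profile, [1, -2, 1], mode="valid")
print(f2)  # [ 0.  1.  0.  0.  0. -1.  0.]

# Positive at the onset of the ramp, negative at its end: the "double edge."
# The +/- pattern (positive leading response) marks a dark-to-light
# transition; a light-to-dark edge would give the opposite pattern.
```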
FIGURE 10.3 A general 3 × 3 spatial filter kernel. The w's are the kernel coefficients (weights):

    w1  w2  w3
    w4  w5  w6
    w7  w8  w9

As discussed in Chapter 3, filtering an image with a general 3 × 3 kernel (Fig. 10.3) produces at each point the response

$$Z = \sum_{k=1}^{9} w_k z_k \qquad (10\text{-}12)$$

where z_k is the intensity of the pixel under kernel weight w_k. Detection of isolated points is based on the second derivative; that is, on the Laplacian,

$$\nabla^2 f(x, y) = \frac{\partial^2 f(x, y)}{\partial x^2} + \frac{\partial^2 f(x, y)}{\partial y^2} \qquad (10\text{-}13)$$

where the partial derivatives are computed using the second-order finite differences in Eqs. (10-10) and (10-11). The Laplacian is then

$$\nabla^2 f(x, y) = f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1) - 4f(x, y) \qquad (10\text{-}14)$$

As explained in Section 3.6, this expression can be implemented using the Laplacian kernel in Fig. 10.4(a) (which also incorporates the two diagonal directions, giving a center coefficient of −8). We then say that a point has been detected at a location (x, y) on which the kernel is centered if the absolute value of the response of the filter at that point exceeds a specified threshold. Such points are labeled 1 and all others are labeled 0 in the output image, thus producing a binary image. In other words, we use the expression

$$g(x, y) = \begin{cases} 1 & \text{if } |Z(x, y)| > T \\ 0 & \text{otherwise} \end{cases} \qquad (10\text{-}15)$$

where g(x, y) is the output image, T is a nonnegative threshold, and Z is given by Eq. (10-12). This formulation simply measures the weighted differences between a pixel and its 8-neighbors. Intuitively, the idea is that the intensity of an isolated point will be quite different from its surroundings, and thus will be easily detectable by this type of kernel. Differences in intensity that are considered of interest are those large enough (as determined by T) to be considered isolated points. Note that, as usual for a derivative kernel, the coefficients sum to zero, indicating that the filter response will be zero in areas of constant intensity.

FIGURE 10.4 (a) Laplacian kernel used for point detection:

     1   1   1
     1  −8   1
     1   1   1

(b) X-ray image of a turbine blade with a porosity manifested by a single black pixel. (c) Result of convolving the kernel with the image. (d) Result of using Eq. (10-15): a single point (shown enlarged at the tip of the arrow). (Original image courtesy of X-TEK Systems, Ltd.)

EXAMPLE 10.1: Detection of isolated points in an image.

Figure 10.4(b) is an X-ray image of a turbine blade from a jet engine. The blade has a porosity manifested by a single black pixel in the upper-right quadrant of the image. Figure 10.4(c) is the result of filtering the image with the Laplacian kernel, and Fig. 10.4(d) shows the result of Eq. (10-15) with T equal to 90% of the highest absolute pixel value of the image in Fig. 10.4(c). The single pixel is clearly visible in this image at the tip of the arrow (the pixel was enlarged to enhance its visibility). This type of detection process is specialized because it is based on abrupt intensity changes at single-pixel locations that are surrounded by a homogeneous background in the area of the detector kernel. When this condition is not satisfied, other methods discussed in this chapter are more suitable for detecting intensity changes.
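Here is a minimal sketch of the point-detection procedure of Eq. (10-15), using the 8-neighbor Laplacian kernel of Fig. 10.4(a) and, as in Example 10.1, a threshold equal to 90% of the maximum absolute response. The function name and synthetic test image are our own; scipy.ndimage.convolve is one common way to apply the kernel.

```python
import numpy as np
from scipy.ndimage import convolve

# Laplacian kernel of Fig. 10.4(a): weighted difference between a pixel
# and its 8-neighbors; the coefficients sum to zero.
laplacian = np.array([[1,  1, 1],
                      [1, -8, 1],
                      [1,  1, 1]], dtype=float)

def detect_points(image, frac=0.9):
    """Eq. (10-15): output 1 where |Z(x, y)| > T, with T set to `frac`
    times the maximum absolute filter response (as in Example 10.1)."""
    z = convolve(image.astype(float), laplacian, mode="nearest")
    T = frac * np.abs(z).max()
    return (np.abs(z) > T).astype(np.uint8)

# Synthetic test: a homogeneous image with one isolated dark pixel.
img = np.full((32, 32), 200.0)
img[10, 20] = 0.0
g = detect_points(img)
print(np.argwhere(g == 1))  # -> [[10 20]]
```

Because the kernel coefficients sum to zero, the response is zero everywhere in the constant background, and only the isolated pixel survives the threshold.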
LINE DETECTION

The next level of complexity is line detection. Based on the discussion earlier in this section, we know that for line detection we can expect second derivatives to result in a stronger filter response, and to produce thinner lines, than first derivatives. Thus, we can use the Laplacian kernel in Fig. 10.4(a) for line detection also, keeping in mind that the double-line effect of the second derivative must be handled properly. The following example illustrates the procedure.

EXAMPLE 10.2: Using the Laplacian for line detection.

Figure 10.5(a) shows a 486 × 486 (binary) portion of a wire-bond mask for an electronic circuit, and Fig. 10.5(b) shows its Laplacian image. Because the Laplacian image contains negative values (see the discussion after Example 3.18), scaling is necessary for display. As the magnified section shows, mid gray represents zero, darker shades of gray represent negative values, and lighter shades are positive. The double-line effect is clearly visible in the magnified region.

At first, it might appear that the negative values can be handled simply by taking the absolute value of the Laplacian image. However, as Fig. 10.5(c) shows, this approach doubles the thickness of the lines. A more suitable approach is to use only the positive values of the Laplacian (in noisy situations we use the values that exceed a positive threshold, to eliminate random variations about zero caused by the noise). As Fig. 10.5(d) shows, this approach results in thinner lines that generally are more useful. Note in Figs. 10.5(b) through (d) that when the lines are wide with respect to the size of the Laplacian kernel, the lines are separated by a zero "valley." This is not unexpected. For example, when the 3 × 3 kernel is centered on a line of constant intensity 5 pixels wide, the response will be zero, thus producing the effect just mentioned. When we talk about line detection, the assumption is that lines are thin with respect to the size of the detector. Lines that do not satisfy this assumption are best treated as regions and handled by the edge detection methods discussed in the following section.

FIGURE 10.5 (a) Original image. (b) Laplacian image; the magnified section shows the positive/negative double-line effect characteristic of the Laplacian. (c) Absolute value of the Laplacian. (d) Positive values of the Laplacian.

The Laplacian detector kernel in Fig. 10.4(a) is isotropic, so its response is independent of direction (with respect to the four directions of the 3 × 3 kernel: vertical, horizontal, and two diagonals). Often, interest lies in detecting lines in specified directions. Consider the kernels in Fig. 10.6. Suppose that an image is filtered individually with each of the four kernels. If, at a given point, the absolute value of the response of the horizontal kernel is larger than that of the other three, that point is said to be more likely associated with a horizontal line. If we are interested in detecting all the lines in an image in the direction defined by a given kernel, we simply run the kernel through the image and threshold the absolute value of the result, as in Eq. (10-15). The nonzero points remaining after thresholding are the strongest responses which, for lines one pixel thick, correspond closest to the direction defined by the kernel. The following example illustrates this procedure.

FIGURE 10.6 Line detection kernels. Detection angles are with respect to the axis system in Fig. 2.19, with positive angles measured counterclockwise with respect to the (vertical) x-axis.

    Horizontal        +45°              Vertical          −45°
    −1 −1 −1          2 −1 −1          −1  2 −1          −1 −1  2
     2  2  2         −1  2 −1          −1  2 −1          −1  2 −1
    −1 −1 −1         −1 −1  2          −1  2 −1           2 −1 −1
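The following sketch (our illustration, with a synthetic image) runs the horizontal kernel of Fig. 10.6 over an image containing one horizontal and one vertical line, then thresholds the absolute response as in Eq. (10-15); only the horizontal line survives.

```python
import numpy as np
from scipy.ndimage import convolve

# Horizontal line-detection kernel from Fig. 10.6.
horizontal = np.array([[-1, -1, -1],
                       [ 2,  2,  2],
                       [-1, -1, -1]], dtype=float)

# Synthetic image: one horizontal and one vertical bright line, 1 pixel thick.
img = np.zeros((21, 21))
img[5, :] = 100.0     # horizontal line
img[:, 14] = 100.0    # vertical line

z = convolve(img, horizontal, mode="nearest")
T = 0.5 * np.abs(z).max()             # illustrative threshold
g = (np.abs(z) > T).astype(np.uint8)

rows = np.unique(np.argwhere(g)[:, 0])
print(rows)  # -> [5]: only the horizontal line responds strongly
```

The kernel's rows sum to zero column-wise across a vertical line, so its response there is exactly zero, while pixels on the horizontal line receive the maximum weighted difference.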
                                                                                                                                                           a b c
                                                                                                                                                          FIGURE 10.8
                                                                                                                                                          From left to right,
                                                                                                                                                          models (ideal
                                                                                                                                                          representations) of
                                                                                                                                                          a step, a ramp, and
                                                                                                                                                          a roof edge, and
                                                                                                                                                          their corresponding
                                                                                                                                                          intensity profiles.
FIGURE 10.7 (a) Image of a wire-bond template. (b) Result of processing with the +45° line detector kernel in Fig. 10.6. (c) Zoomed view of the top left region of (b). (d) Zoomed view of the bottom right region of (b). (e) The image in (b) with all negative values set to zero. (f) All points (in white) whose values satisfied the condition g > T, where g is the image in (e) and T = 254 (the maximum pixel value in the image minus 1). (The points in (f) were enlarged to make them easier to see.)

Edge models are classified according to their intensity profiles. A step edge is characterized by a transition between two intensity levels occurring ideally over the distance of one pixel. Figure 10.8(a) shows a section of a vertical step edge and a horizontal intensity profile through the edge. Step edges occur, for example, in images generated by a computer for use in areas such as solid modeling and animation. These clean, ideal edges can occur over the distance of one pixel, provided that no additional processing (such as smoothing) is used to make them look "real." Digital step edges are used frequently as edge models in algorithm development. For example, the Canny edge detection algorithm discussed later in this section was derived originally using a step-edge model.

In practice, digital images have edges that are blurred and noisy, with the degree of blurring determined principally by limitations in the focusing mechanism (e.g., lenses in the case of optical images), and the noise level determined principally by the electronic components of the imaging system. In such situations, edges are more closely modeled as having an intensity ramp profile, such as the edge in Fig. 10.8(b). The slope of the ramp is inversely proportional to the degree to which the edge is blurred. In this model, we no longer have a single "edge point" along the profile. Instead, an edge point now is any point contained in the ramp, and an edge segment would then be a set of such points that are connected.

A third type of edge is the so-called roof edge, having the characteristics illustrated in Fig. 10.8(c). Roof edges are models of lines through a region, with the base (width) of the edge being determined by the thickness and sharpness of the line. In the limit, when its base is one pixel wide, a roof edge is nothing more than a one-pixel-thick line running through a region in an image. Roof edges arise, for example, in range imaging, when thin objects (such as pipes) are closer to the sensor than the background (such as walls). The pipes appear brighter and thus create an image similar to the model in Fig. 10.8(c). Other areas in which roof edges appear routinely are in the digitization of line drawings and also in satellite images, where thin features, such as roads, can be modeled by this type of edge.

It is not unusual to find images that contain all three types of edges. Although blurring and noise result in deviations from the ideal shapes, edges in images that are reasonably sharp and have a moderate amount of noise do resemble the characteristics of the edge models in Fig. 10.8, as the profiles in Fig. 10.9 illustrate. What the models in Fig. 10.8 allow us to do is write mathematical expressions for edges in the development of image processing algorithms. The performance of these algorithms will depend on the differences between actual edges and the models used in developing the algorithms.
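Because the three models are defined by simple intensity profiles, they are straightforward to state in code. The following sketch is an editorial illustration, not part of the original text; it assumes NumPy, and the profile lengths and widths are arbitrary choices made here for demonstration. It synthesizes ideal 1-D step, ramp, and roof profiles of the kind modeled in Fig. 10.8:

import numpy as np

def step_profile(n=40, low=0.0, high=1.0):
    # Ideal step edge: the full transition occurs between two adjacent samples.
    p = np.full(n, low)
    p[n // 2:] = high
    return p

def ramp_profile(n=40, ramp_width=9, low=0.0, high=1.0):
    # Ramp edge: the transition is spread over ramp_width samples.
    # A wider ramp (shallower slope) models a blurrier edge.
    start = (n - ramp_width) // 2
    p = np.full(n, low)
    p[start:start + ramp_width] = np.linspace(low, high, ramp_width)
    p[start + ramp_width:] = high
    return p

def roof_profile(n=40, base=3, low=0.0, high=1.0):
    # Roof edge: a thin bright line; base controls the width of the roof.
    p = np.full(n, low)
    c, half = n // 2, base // 2
    for k in range(-half, half + 1):
        p[c + k] = high - abs(k) * (high - low) / (half + 1)
    return p

Plotting these three arrays reproduces the idealized profiles of Figs. 10.8(a) through (c).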
FIGURE 10.9 A 1508 × 1970 image showing (zoomed) actual ramp (bottom, left), step (top, right), and roof edge profiles. The profiles are from dark to light, in the areas enclosed by the small circles. The ramp and step profiles span 9 pixels and 2 pixels, respectively. The base of the roof edge is 3 pixels. (Original image courtesy of Dr. David R. Pickens, Vanderbilt University.)

FIGURE 10.10 (a) Two regions of constant intensity separated by an ideal ramp edge. (b) Detail near the edge, showing a horizontal intensity profile, and its first and second derivatives.

Figure 10.10(a) shows the image from which the segment in Fig. 10.8(b) was extracted. Figure 10.10(b) shows a horizontal intensity profile. This figure shows also the first and second derivatives of the intensity profile. Moving from left to right along the intensity profile, we note that the first derivative is positive at the onset of the ramp and at points on the ramp, and it is zero in areas of constant intensity. The second derivative is positive at the beginning of the ramp, negative at the end of the ramp, zero at points on the ramp, and zero at points of constant intensity. The signs of the derivatives just discussed would be reversed for an edge that transitions from light to dark. The intersection between the zero intensity axis and a line extending between the extrema of the second derivative marks a point called the zero crossing of the second derivative.

We conclude from these observations that the magnitude of the first derivative can be used to detect the presence of an edge at a point in an image. Similarly, the sign of the second derivative can be used to determine whether an edge pixel lies on the dark or light side of an edge. Two additional properties of the second derivative around an edge are: (1) it produces two values for every edge in an image; and (2) its zero crossings can be used for locating the centers of thick edges, as we will show later in this section. Some edge models utilize a smooth transition into and out of the ramp (see Problem 10.9). However, the conclusions reached using those models are the same as with an ideal ramp, and working with the latter simplifies theoretical formulations. Finally, although attention thus far has been limited to a 1-D horizontal profile, a similar argument applies to an edge of any orientation in an image. We simply define a profile perpendicular to the edge direction at any desired point, and interpret the results in the same manner as for the vertical edge just discussed.

EXAMPLE 10.4: Behavior of the first and second derivatives in the region of a noisy edge.

The edge models in Fig. 10.8 are free of noise. The image segments in the first column in Fig. 10.11 show close-ups of four ramp edges that transition from a black region on the left to a white region on the right (keep in mind that the entire transition from black to white is a single edge). The image segment at the top left is free of noise. The other three images in the first column are corrupted by additive Gaussian noise with zero mean and standard deviation of 0.1, 1.0, and 10.0 intensity levels, respectively. The graph below each image is a horizontal intensity profile passing through the center of the image. All images have 8 bits of intensity resolution, with 0 and 255 representing black and white, respectively.

Consider the image at the top of the center column. As discussed in connection with Fig. 10.10(b), the derivative of the scan line on the left is zero in the constant areas. These are the two black bands shown in the derivative image. The derivatives at points on the ramp are constant and equal to the slope of the ramp. These constant values in the derivative image are shown in gray. As we move down the center column, the derivatives become increasingly different from the noiseless case. In fact, it would be difficult to associate the last profile in the center column with the first derivative of a ramp edge. What makes these results interesting is that the noise is almost visually undetectable in the images on the left column. These examples are good illustrations of the sensitivity of derivatives to noise.

As expected, the second derivative is even more sensitive to noise. The second derivative of the noiseless image is shown at the top of the right column. The thin white and black vertical lines are the positive and negative components of the second derivative, as explained in Fig. 10.10. The gray in these images represents zero (as discussed earlier, scaling causes zero to show as gray). The only noisy second derivative image that barely resembles the noiseless case corresponds to noise with a standard deviation of 0.1. The remaining second-derivative images and profiles clearly illustrate that it would be difficult indeed to detect their positive and negative components, which are the truly useful features of the second derivative in terms of edge detection.

The fact that such little visual noise can have such a significant impact on the two key derivatives used for detecting edges is an important issue to keep in mind. In particular, image smoothing should be a serious consideration prior to the use of derivatives in applications where noise with levels similar to those we have just discussed is likely to be present.

In summary, the three steps performed typically for edge detection are:

1. Image smoothing for noise reduction. The need for this step is illustrated by the results in the second and third columns of Fig. 10.11.
2. Detection of edge points. As mentioned earlier, this is a local operation that extracts from an image all points that are potential edge-point candidates.
3. Edge localization. The objective of this step is to select from the candidate points only the points that are members of the set of points comprising an edge.

The remainder of this section deals with techniques for achieving these objectives.
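The sign pattern of the derivatives, and their sensitivity to noise discussed in Example 10.4, can be verified numerically. The sketch below is an editorial illustration assuming NumPy; the 9-sample ramp and the noise standard deviations echo the values quoted above, but the data are synthetic:

import numpy as np

rng = np.random.default_rng(0)

# Ideal ramp edge: dark (0) on the left, light (255) on the right, 9-sample ramp.
profile = np.concatenate([np.zeros(20),
                          np.linspace(0.0, 255.0, 9),
                          np.full(20, 255.0)])

def derivatives(p):
    d1 = np.diff(p)        # first derivative: positive on the ramp, zero elsewhere
    d2 = np.diff(p, n=2)   # second derivative: positive at ramp onset, negative at its end
    return d1, d2

d1, d2 = derivatives(profile)
print("first derivative, max:", d1.max())                 # constant slope on the ramp
print("second derivative, extrema:", d2.min(), d2.max())  # one + and one - value

# Even mild Gaussian noise disturbs both derivatives, the second one more severely.
for sigma in (0.1, 1.0, 10.0):
    n1, n2 = derivatives(profile + rng.normal(0.0, sigma, profile.size))
    print(f"sigma={sigma:5.1f}   max|d1|={np.abs(n1).max():8.2f}   max|d2|={np.abs(n2).max():8.2f}")

Smoothing the noisy profile before differencing (step 1 of the summary above) brings the derivative estimates much closer to the noiseless case.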
For convenience, we repeat here some of the gradient concepts and equations introduced in Chapter 3. The gradient of f at a point (x, y) is defined as the vector

∇f(x, y) ≡ grad[f(x, y)] ≡ [g_x(x, y), g_y(x, y)]ᵀ = [∂f(x, y)/∂x, ∂f(x, y)/∂y]ᵀ        (10-16)

This vector has the well-known property that it points in the direction of the maximum rate of change of f at (x, y) (see Problem 10.10). Equation (10-16) is valid at an arbitrary (but single) point (x, y). When evaluated for all applicable values of x and y, ∇f(x, y) becomes a vector image, each element of which is a vector given by Eq. (10-16). The magnitude, M(x, y), of this gradient vector at a point (x, y) is given by its Euclidean vector norm:
M(x, y) = ‖∇f(x, y)‖ = √[g_x²(x, y) + g_y²(x, y)]        (10-17)
This is the value of the rate of change in the direction of the gradient vector at point (x, y). Note that M(x, y), ∇f(x, y), g_x(x, y), and g_y(x, y) are arrays of the same size as f, created when x and y are allowed to vary over all pixel locations in f. It is common practice to refer to M(x, y) and ∇f(x, y) as the gradient image, or simply as the gradient when the meaning is clear. The summation, square, and square root operations are elementwise operations, as defined in Section 2.6.
The direction of the gradient vector at a point (x, y) is given by

α(x, y) = tan⁻¹[ g_y(x, y) / g_x(x, y) ]        (10-18)

Angles are measured in the counterclockwise direction with respect to the x-axis (see Fig. 2.19). This is also an image of the same size as f, created by the elementwise division of g_y by g_x over all applicable values of x and y. As the following example illustrates, the direction of an edge at a point (x, y) is orthogonal to the direction, α(x, y), of the gradient vector at that point.
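As a concrete counterpart to Eqs. (10-16) through (10-18), the sketch below (an editorial illustration assuming NumPy and SciPy; the test image is synthetic) computes g_x and g_y with the Sobel kernels of Fig. 10.14, then forms the magnitude and angle images. Whether convolution or correlation is used only flips the signs of g_x and g_y; the magnitude is unaffected:

import numpy as np
from scipy.ndimage import convolve

# Sobel kernels from Fig. 10.14; x increases downward (rows) and y to the right
# (columns), so KX responds to horizontal edges and KY to vertical ones.
KX = np.array([[-1, -2, -1],
               [ 0,  0,  0],
               [ 1,  2,  1]], dtype=float)
KY = KX.T

def sobel_gradient(f):
    gx = convolve(f, KX, mode='nearest')
    gy = convolve(f, KY, mode='nearest')
    magnitude = np.hypot(gx, gy)   # Eq. (10-17), elementwise
    angle = np.arctan2(gy, gx)     # Eq. (10-18); arctan2 avoids division by zero
    return magnitude, angle

# Synthetic vertical step edge: black on the left, white on the right.
f = np.zeros((64, 64))
f[:, 32:] = 1.0
M, alpha = sobel_gradient(f)
print(M.max(), np.rad2deg(alpha[32, 31]))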
FIGURE 10.14 A 3 × 3 region of an image (the z's are intensity values), and various kernels used to compute the gradient at the point labeled z5. The region is

    z1  z2  z3
    z4  z5  z6
    z7  z8  z9

and the kernels are:

Roberts:
    -1   0        0  -1
     0   1        1   0

Prewitt:
    -1  -1  -1       -1   0   1
     0   0   0       -1   0   1
     1   1   1       -1   0   1

Sobel:
    -1  -2  -1       -1   0   1
     0   0   0       -2   0   2
     1   2   1       -1   0   1

Recall the important result in Problem 3.32 that using a kernel whose coefficients sum to zero produces a filtered image whose pixels also sum to zero. This implies in general that some pixels will be negative. Similarly, if the kernel coefficients sum to 1, the sum of pixels in the original and filtered images will be the same (see Problem 3.31).

Using a weight of 2 for the center coefficients, the gradient components at the point labeled z5 are computed as

g_x = ∂f/∂x = (z7 + 2z8 + z9) − (z1 + 2z2 + z3)        (10-24)

and

g_y = ∂f/∂y = (z3 + 2z6 + z9) − (z1 + 2z4 + z7)        (10-25)

It can be demonstrated (see Problem 10.12) that using a 2 in the center location provides image smoothing. Figures 10.14(f) and (g) show the kernels used to implement Eqs. (10-24) and (10-25). These kernels are called the Sobel operators (Sobel [1970]).

The Prewitt kernels are simpler to implement than the Sobel kernels, but the slight computational difference between them typically is not an issue. The fact that the Sobel kernels have better noise-suppression (smoothing) characteristics makes them preferable because, as mentioned earlier in the discussion of Fig. 10.11, noise suppression is an important issue when dealing with derivatives. Note that the coefficients of all the kernels in Fig. 10.14 sum to zero, thus giving a response of zero in areas of constant intensity, as expected of derivative operators.

Any of the pairs of kernels from Fig. 10.14 are convolved with an image to obtain the gradient components g_x and g_y at every pixel location. These two partial derivative arrays are then used to estimate edge strength and direction. Obtaining the magnitude of the gradient requires the computations in Eq. (10-17). This implementation is not always desirable because of the computational burden required by squares and square roots, and an approach used frequently is to approximate the magnitude of the gradient by absolute values:

M(x, y) ≈ |g_x(x, y)| + |g_y(x, y)|        (10-26)

This equation is more attractive computationally, and it still preserves relative changes in intensity levels. The price paid for this advantage is that the resulting filters will not be isotropic (invariant to rotation) in general. However, this is not an issue when kernels such as the Prewitt and Sobel kernels are used to compute g_x and g_y, because these kernels give isotropic results only for vertical and horizontal edges. This means that results would be isotropic only for edges in those two directions anyway, regardless of which of the two equations is used. That is, Eqs. (10-17) and (10-26) give identical results for vertical and horizontal edges when either the Sobel or Prewitt kernels are used (see Problem 10.11).

The 3 × 3 kernels in Fig. 10.14 exhibit their strongest response predominantly for vertical and horizontal edges. The Kirsch compass kernels (Kirsch [1971]) in Fig. 10.15 are designed to detect edge magnitude and direction (angle) in all eight compass directions. Instead of computing the magnitude using Eq. (10-17) and the angle using Eq. (10-18), Kirsch's approach was to determine the edge magnitude by convolving an image with all eight kernels, and to assign the edge magnitude at a point as the response of the kernel that gave the strongest convolution value at that point. The edge angle at that point is then the direction associated with that kernel. For example, if the strongest value at a point in the image resulted from using the north (N) kernel, the edge magnitude at that point would be assigned the response of that kernel, and the direction would be 0° (because compass kernel pairs differ by a rotation of 180°, choosing the maximum response will always result in a positive number). Although when working with, say, the Sobel kernels, we think of a north or south edge as being vertical, the N and S compass kernels differentiate between the two, the difference being the direction of the intensity transitions defining the edge. For example, assuming that intensity values are in the range [0, 1], the binary edge in Fig. 10.8(a) is defined by black (0) on the left and white (1) on the right. When all Kirsch kernels are applied to this edge, the N kernel will yield the highest value, thus indicating an edge oriented in the north direction (at the point of the computation).
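Kirsch's scheme lends itself to a compact implementation, sketched below under stated assumptions (an editorial illustration using NumPy and SciPy). The eight kernels of Fig. 10.15 are generated by rolling the outer ring of the N kernel one step per 45° of rotation. scipy.ndimage.correlate is used so each kernel is applied in exactly the orientation shown in the figure; true convolution would rotate each kernel by 180° and thereby swap opposite compass labels:

import numpy as np
from scipy.ndimage import correlate

# Outer-ring positions of a 3 x 3 kernel, traversed so that rolling the ring
# values one step rotates the compass kernel by 45 degrees.
RING = [(0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2)]
N_RING = np.array([5, -3, -3, -3, -3, -3, 5, 5], dtype=float)  # ring of the N kernel
DIRECTIONS = ['N', 'NW', 'W', 'SW', 'S', 'SE', 'E', 'NE']

def kirsch_kernels():
    kernels = []
    for shift in range(8):
        k = np.zeros((3, 3))
        for (r, c), v in zip(RING, np.roll(N_RING, shift)):
            k[r, c] = v
        kernels.append(k)
    return kernels

def kirsch_edges(f):
    # Edge magnitude = strongest of the eight responses at each pixel;
    # edge direction = the compass label of the winning kernel.
    responses = np.stack([correlate(f, k, mode='nearest') for k in kirsch_kernels()])
    winner = responses.argmax(axis=0)
    return responses.max(axis=0), winner

# Binary vertical edge, black (0) left and white (1) right, as in Fig. 10.8(a):
f = np.zeros((16, 16))
f[:, 8:] = 1.0
mag, winner = kirsch_edges(f)
print(DIRECTIONS[winner[8, 7]])   # the N kernel wins along the edge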
EXAMPLE 10.6: Illustration of the 2-D gradient magnitude and angle.

Figure 10.16 illustrates the Sobel absolute value response of the two components of the gradient, g_x and g_y, as well as the gradient image formed from the sum of these two components. The directionality of the horizontal and vertical components of the gradient is evident in Figs. 10.16(b) and (c). Note, for example, how strong the roof tile, horizontal brick joints, and horizontal segments of the windows are in Fig. 10.16(b) compared to other edges. In contrast, Fig. 10.16(c) favors features such as the vertical components of the façade and windows. It is common terminology to use the term edge map when referring to an image whose principal features are edges, such as gradient magnitude images. The intensities of the image in Fig. 10.16(a) were scaled to the range [0, 1]. We use values in this range to simplify parameter selection in the various methods for edge detection discussed in this section.

Figure 10.17 shows the gradient angle image computed using Eq. (10-18). In general, angle images are not as useful as gradient magnitude images for edge detection, but they do complement the information extracted from an image using the magnitude of the gradient. For instance, the constant intensity areas in Fig. 10.16(a), such as the front edge of the sloping roof and top horizontal bands of the front wall, are constant in Fig. 10.17, indicating that the gradient vector direction at all the pixel locations in those regions is the same. As we will show later in this section, angle information plays a key supporting role in the implementation of the Canny edge detection algorithm, a widely used edge detection scheme.

FIGURE 10.15 Kirsch compass kernels. The edge direction of strongest response of each kernel is labeled below it:

    -3  -3   5     -3   5   5      5   5   5      5   5  -3
    -3   0   5     -3   0   5     -3   0  -3      5   0  -3
    -3  -3   5     -3  -3  -3     -3  -3  -3     -3  -3  -3
        N              NW             W              SW

     5  -3  -3     -3  -3  -3     -3  -3  -3     -3  -3  -3
     5   0  -3      5   0  -3     -3   0  -3     -3   0   5
     5  -3  -3      5   5  -3      5   5   5     -3   5   5
        S              SE             E              NE

FIGURE 10.16 (a) Image of size 834 × 1114 pixels, with intensity values scaled to the range [0, 1]. (b) g_x, the component of the gradient in the x-direction, obtained using the Sobel kernel in Fig. 10.14(f) to filter the image. (c) g_y, obtained using the kernel in Fig. 10.14(g). (d) The gradient image, |g_x| + |g_y|.

FIGURE 10.17 Gradient angle image computed using Eq. (10-18). Areas of constant intensity in this image indicate that the direction of the gradient vector is the same at all the pixel locations in those regions.

The original image in Fig. 10.16(a) is of reasonably high resolution, and at the distance the image was acquired, the contribution made to image detail by the wall bricks is significant. This level of fine detail often is undesirable in edge detection because it tends to act as noise, which is enhanced by derivative computations and thus complicates detection of the principal edges. One way to reduce fine detail is to smooth the image prior to computing the edges. Figure 10.18 shows the same sequence of images as in Fig. 10.16, but with the original image smoothed first using a 5 × 5 averaging filter (see Section 3.5 regarding smoothing filters). The response of each kernel now shows almost no contribution due to the bricks, with the results being dominated mostly by the principal edges in the image.

Figures 10.16 and 10.18 show that the horizontal and vertical Sobel kernels do not differentiate between edges in the ±45° directions. If it is important to emphasize edges oriented in particular diagonal directions, then one of the Kirsch kernels in Fig. 10.15 should be used. Figures 10.19(a) and (b) show the responses of the 45° (NW) and −45° (SW) Kirsch kernels, respectively. The stronger diagonal selectivity of these kernels is evident in these figures. Both kernels have similar responses to horizontal and vertical edges, but the response in these directions is weaker.

Combining the Gradient with Thresholding

The results in Fig. 10.18 show that edge detection can be made more selective by smoothing the image prior to computing the gradient. Another approach aimed at achieving the same objective is to threshold the gradient image. For example, Fig. 10.20(a) shows the gradient image from Fig. 10.16(d), thresholded so that pixels with values greater than or equal to 33% of the maximum value of the gradient image are shown in white, while pixels below the threshold value are shown in black. (The threshold used to generate Fig. 10.20(a) was selected so that most of the small edges caused by the bricks were eliminated; this was the same objective as when the image in Fig. 10.16(a) was smoothed prior to computing the gradient.)
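Both selectivity mechanisms, smoothing and thresholding, fit in a few lines once the gradient components are available. The sketch below is an editorial illustration assuming NumPy and SciPy; the 5 × 5 box filter and the 33% threshold mirror the values used in the figures, and combining the two steps corresponds to producing a result like Fig. 10.20(b):

import numpy as np
from scipy.ndimage import convolve, uniform_filter

KX = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)  # Sobel
KY = KX.T

def edge_map(f, smooth=True, frac=0.33):
    # Optional 5 x 5 averaging filter to suppress fine detail (e.g., brick joints).
    if smooth:
        f = uniform_filter(f, size=5)
    gx = convolve(f, KX, mode='nearest')
    gy = convolve(f, KY, mode='nearest')
    m = np.abs(gx) + np.abs(gy)       # gradient magnitude, Eq. (10-26)
    return m >= frac * m.max()        # boolean edge map: True (white) above threshold

Raising frac suppresses more of the fine detail, at the cost of thinning or breaking the weaker principal edges.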
FIGURE 10.18 Same sequence as in Fig. 10.16, but with the original image smoothed using a 5 × 5 averaging kernel prior to edge detection.

FIGURE 10.20 (a) Result of thresholding Fig. 10.16(d), the gradient of the original image. (b) Result of thresholding Fig. 10.18(d), the gradient of the smoothed image.
                                                                   ∂ 2G( x, y) ∂ 2G( x, y)                                                                       a b                                                ( 2G
                                                   ∇ 2G( x, y) =              +                                                                                  c d
                                                                      ∂x 2        ∂y 2
                                                                                                                                                                FIGURE 10.21
                                                                              x2 + y2                  x2 + y2                                                  (a) 3-D plot of
                                                                 ∂ −x −        2s2
                                                                                             ∂ −y −     2s2
                                                              =   a    e                b +   a    e             b                (10-28)                       the negative of the
                                                                ∂x s 2                      ∂y s 2                                                              LoG.
                                                                                     x2 + y2                            x2 + y2                                 (b) Negative of
                                                                    x2   1         −      2          y2   1           −      2                                  the LoG
                                                              =a       − 2b       e 2s         + a      − 2b         e 2s
                                                                    s4  s                            s4  s                                                      displayed as an
                                                                                                                                                                image.
                                    Collecting terms, we obtain                                                                                                 (c) Cross section
                                                                                                                                                                of (a) showing                                                              y
                                                                                                                                                                                                      x
                                                                                                       x2 + y2                                                  zero crossings.
                                                                                x 2 + y 2 − 2s 2 −                                                              (d) 5 × 5 kernel
                                                             ∇ 2G( x, y) = a                    be
                                                                                                        2s2                       (10-29)                                                                           (2G
                                                                                      s4                                                                        approximation to
                                                                                                                                                                                                                                                              0     0    %1   0     0
                                                                                                                                                                the shape in (a).
                                                                                                                                                                The negative
                                    This expression is called the Laplacian of a Gaussian (LoG).                                                                of this kernel                                                                                0     %1   %2   %1    0
Figures 10.21(a) through (c) show a 3-D plot, image, and cross section of the negative of the LoG function (note that the zero crossings of the LoG occur at x² + y² = 2σ², which defines a circle of radius √2 σ centered on the peak of the Gaussian function). Because of the shape illustrated in Fig. 10.21(a), the LoG function sometimes is called the Mexican hat operator. Figure 10.21(d) shows a 5 × 5 kernel that approximates the shape in Fig. 10.21(a) (normally, we would use the negative of this kernel). This approximation is not unique. Its purpose is to capture the essential shape of the LoG function; in terms of Fig. 10.21(a), this means a positive, central term surrounded by an adjacent, negative region whose values decrease as a function of distance from the origin, and a zero outer region. The coefficients must sum to zero so that the response of the kernel is zero in areas of constant intensity.

FIGURE 10.21 (a) 3-D plot of the negative of the LoG. (b) Negative of the LoG displayed as an image. (c) Cross section of (a) showing zero crossings (separated by 2√2 σ). (d) 5 × 5 kernel approximation to the shape in (a); the negative of this kernel would be used in practice:

     0   0  -1   0   0
     0  -1  -2  -1   0
    -1  -2  16  -2  -1
     0  -1  -2  -1   0
     0   0  -1   0   0
Filter kernels of arbitrary size (but fixed σ) can be generated by sampling Eq. (10-29) and scaling the coefficients so that they sum to zero. A more effective approach for generating a LoG kernel is sampling Eq. (10-27) to the desired size, then convolving the resulting array with a Laplacian kernel, such as the kernel in Fig. 10.4(a). Because convolving an image with a kernel whose coefficients sum to zero yields an image whose elements also sum to zero (see Problems 3.32 and 10.16), this approach automatically satisfies the requirement that the sum of the LoG kernel coefficients be zero. We will discuss size selection for the LoG filter later in this section.
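In code, this second approach can be sketched as follows (a minimal illustration, not the book's implementation; it assumes NumPy and SciPy, and uses a common 3 × 3 Laplacian with a −8 center in place of the kernel of Fig. 10.4(a)):

    import numpy as np
    from scipy.ndimage import convolve

    def log_kernel(sigma):
        n = int(np.ceil(6 * sigma))          # kernel spans about +/- 3*sigma
        n = n + 1 if n % 2 == 0 else n       # force an odd size
        half = n // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        gauss = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))   # sampled Eq. (10-27)
        laplacian = np.array([[1, 1, 1],
                              [1, -8, 1],
                              [1, 1, 1]], dtype=float)      # one common 3x3 Laplacian
        log = convolve(gauss, laplacian)
        return log - log.mean()              # scale so the coefficients sum to zero

    kernel = log_kernel(sigma=4.0)           # 25 x 25, matching Example 10.7
    print(kernel.shape, abs(kernel.sum()) < 1e-9)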
There are two fundamental ideas behind the selection of the operator ∇²G. First, the Gaussian part of the operator blurs the image, thus reducing the intensity of structures (including noise) at scales much smaller than σ. Unlike the averaging filter used in Fig. 10.18, the Gaussian function is smooth in both the spatial and frequency domains (see Section 4.8), and is thus less likely to introduce artifacts (e.g., ringing) not present in the original image. The other idea concerns the second-derivative properties of the Laplacian operator, ∇². Although first derivatives can be used for detecting abrupt changes in intensity, they are directional operators. The Laplacian, on the other hand, has the important advantage of being isotropic (invariant to rotation), which not only corresponds to characteristics of the human visual system (Marr [1982]) but also responds equally to changes in intensity in any kernel direction, thus avoiding having to use multiple kernels to calculate the strongest response at any point in the image.

The Marr-Hildreth algorithm consists of convolving the LoG kernel with an input image,

g(x, y) = \left[ \nabla^2 G(x, y) \right] \star f(x, y)    (10-30)

and then finding the zero crossings of g(x, y) to determine the locations of edges in f(x, y). (This expression can be implemented in the spatial domain using Eq. (3-35), or in the frequency domain using Eq. (4-104).) Because the Laplacian and convolution are linear processes, we can write Eq. (10-30) as

g(x, y) = \nabla^2 \left[ G(x, y) \star f(x, y) \right]    (10-31)

indicating that we can smooth the image first with a Gaussian filter and then compute the Laplacian of the result. These two equations give identical results.

The Marr-Hildreth edge-detection algorithm may be summarized as follows:

1. Filter the input image with an n × n Gaussian lowpass kernel obtained by sampling Eq. (10-27).
2. Compute the Laplacian of the image resulting from Step 1 using, for example, the 3 × 3 kernel in Fig. 10.4(a). [Steps 1 and 2 implement Eq. (10-31).]
3. Find the zero crossings of the image from Step 2.
To specify the size of the Gaussian kernel, recall from our discussion of Fig. 3.35 that the values of a Gaussian function at a distance larger than 3σ from the mean are small enough to be ignored. As discussed in Section 3.5, this implies using a Gaussian kernel of size ⌈6σ⌉ × ⌈6σ⌉, where ⌈6σ⌉ denotes the ceiling of 6σ; that is, the smallest integer not less than 6σ. (As explained in Section 3.5, ⌈·⌉ and ⌊·⌋ denote the ceiling and floor functions, which map a real number to the smallest following or the largest previous integer, respectively.) Because we work with kernels of odd dimensions, we would use the smallest odd integer satisfying this condition. Using a kernel smaller than this will "truncate" the LoG function, with the degree of truncation being inversely proportional to the size of the kernel. Using a larger kernel would make little difference in the result.
One approach for finding the zero crossings at any pixel, p, of the filtered image, g(x, y), is to use a 3 × 3 neighborhood centered at p. A zero crossing at p implies that the signs of at least two of its opposing neighboring pixels must differ. There are four cases to test: left/right, up/down, and the two diagonals. (Attempts to find zero crossings by solving for the coordinates (x, y) at which g(x, y) = 0 exactly are impractical because of noise and other computational inaccuracies.) If the values of g(x, y) are being compared against a threshold (a common approach), then not only must the signs of opposing neighbors be different, but the absolute value of their numerical difference must also exceed the threshold before we can call p a zero-crossing pixel. We illustrate this approach in Example 10.7 and in the code sketch below.

Computing zero crossings is the key feature of the Marr-Hildreth edge-detection method. The approach discussed in the previous paragraph is attractive because of its simplicity of implementation and because it generally gives good results. If the accuracy of the zero-crossing locations found using this method is inadequate in a particular application, then a technique proposed by Huertas and Medioni [1986] for finding zero crossings with subpixel accuracy can be employed.

a b
c d
FIGURE 10.22 (a) Image of size 834 × 1114 pixels, with intensity values scaled to the range [0, 1]. (b) Result of Steps 1 and 2 of the Marr-Hildreth algorithm using σ = 4 and n = 25. (c) Zero crossings of (b) using a threshold of 0 (note the closed-loop edges). (d) Zero crossings found using a threshold equal to 4% of the maximum value of the image in (b). Note the thin edges.
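A minimal sketch of Steps 1 through 3, including the thresholded zero-crossing test just described, might look as follows (not the book's code; it assumes NumPy and SciPy, where gaussian_filter and laplace are standard routines, and performs the opposing-neighbor tests with array shifts):

    import numpy as np
    from scipy.ndimage import gaussian_filter, laplace

    def marr_hildreth(image, sigma, threshold):
        g = laplace(gaussian_filter(image.astype(float), sigma))  # Eq. (10-31)
        edges = np.zeros_like(g, dtype=bool)
        # Opposing-neighbor pairs: left/right, up/down, and the two diagonals.
        pairs = [((0, 1), (0, -1)), ((1, 0), (-1, 0)),
                 ((1, 1), (-1, -1)), ((1, -1), (-1, 1))]
        for (dy1, dx1), (dy2, dx2) in pairs:
            a = np.roll(g, (dy1, dx1), axis=(0, 1))
            b = np.roll(g, (dy2, dx2), axis=(0, 1))
            # Signs must differ AND the difference must exceed the threshold.
            edges |= (np.sign(a) != np.sign(b)) & (np.abs(a - b) > threshold)
        return edges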
EXAMPLE 10.7 : Illustration of the Marr-Hildreth edge-detection method.

Figure 10.22(a) shows the building image used earlier, and Fig. 10.22(b) is the result of Steps 1 and 2 of the Marr-Hildreth algorithm, using σ = 4 (approximately 0.5% of the short dimension of the image) and n = 25 to satisfy the size condition stated above. As in Fig. 10.5, the gray tones in this image are due to scaling. Figure 10.22(c) shows the zero crossings obtained using the 3 × 3 neighborhood approach just discussed, with a threshold of zero. Note that all the edges form closed loops. This so-called "spaghetti effect" is a serious drawback of this method when a threshold value of zero is used (see Problem 10.17). We avoid closed-loop edges by using a positive threshold.

Figure 10.22(d) shows the result of using a threshold approximately equal to 4% of the maximum value of the LoG image. The majority of the principal edges were readily detected, and "irrelevant" features, such as the edges due to the bricks and the tile roof, were filtered out. This type of performance is virtually impossible to obtain using the gradient-based edge-detection techniques discussed earlier. Another important consequence of using zero crossings for edge detection is that the resulting edges are 1 pixel thick. This property simplifies subsequent stages of processing, such as edge linking.
It is possible to approximate the LoG function in Eq. (10-29) by a difference of Gaussians (DoG):

D_G(x, y) = \frac{1}{2\pi\sigma_1^2}\, e^{-\frac{x^2 + y^2}{2\sigma_1^2}} - \frac{1}{2\pi\sigma_2^2}\, e^{-\frac{x^2 + y^2}{2\sigma_2^2}}    (10-32)

with σ1 > σ2. Experimental results suggest that certain "channels" in the human vision system are selective with respect to orientation and frequency, and can be modeled using Eq. (10-32) with a ratio of standard deviations of 1.75:1. Using the ratio 1.6:1 preserves the basic characteristics of these observations and also provides a closer "engineering" approximation to the LoG function (Marr and Hildreth [1980]). In order for the LoG and DoG to have the same zero crossings, the value of σ for the LoG must be selected based on the following equation (see Problem 10.19):

\sigma^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 - \sigma_2^2} \ln\!\left[ \frac{\sigma_1^2}{\sigma_2^2} \right]    (10-33)

Although the zero crossings of the LoG and DoG will be the same when this value of σ is used, their amplitude scales will be different. We can make them compatible by scaling both functions so that they have the same value at the origin.

The profiles in Figs. 10.23(a) and (b) were generated with standard deviation ratios of 1:1.75 and 1:1.6, respectively (by convention, the curves shown are inverted, as in Fig. 10.21). The LoG profiles are the solid lines, and the DoG profiles are dotted. The curves shown are intensity profiles through the centers of the LoG and DoG arrays, generated by sampling Eqs. (10-29) and (10-32), respectively. The amplitudes of all curves at the origin were normalized to 1. As Fig. 10.23(b) shows, the ratio 1:1.6 yielded a slightly closer approximation of the LoG and DoG functions (for example, compare the bottom lobes of the two figures).
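As a quick check of Eq. (10-33), the matching σ can be computed directly (a small sketch assuming NumPy):

    import numpy as np

    def matching_sigma(s1, s2):
        # Eq. (10-33): sigma the LoG must use to share the DoG's zero crossings.
        return np.sqrt((s1**2 * s2**2) / (s1**2 - s2**2) * np.log(s1**2 / s2**2))

    s2 = 1.0
    for ratio in (1.75, 1.6):            # the two ratios discussed above
        print(ratio, matching_sigma(ratio * s2, s2))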
a b
FIGURE 10.23 (a) Negatives of the LoG (solid) and DoG (dotted) profiles using a σ ratio of 1.75:1. (b) Profiles obtained using a ratio of 1.6:1.
Gaussian kernels are separable (see Section 3.4). Therefore, both the LoG and the DoG filtering operations can be implemented with 1-D convolutions instead of using 2-D convolutions directly (see Problem 10.19). For an image of size M × N and a kernel of size n × n, doing so reduces the number of multiplications and additions for each convolution from being proportional to n²MN for 2-D convolution to being proportional to nMN for 1-D convolutions. This implementation difference is significant. For example, if n = 25, a 1-D implementation will require on the order of 12 times fewer multiplication and addition operations than using 2-D convolution.
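The idea can be sketched as follows (a minimal illustration assuming SciPy's convolve1d; gaussian_1d is a helper defined here, not a library routine):

    import numpy as np
    from scipy.ndimage import convolve1d

    def gaussian_1d(sigma):
        half = int(np.ceil(3 * sigma))
        x = np.arange(-half, half + 1)
        g = np.exp(-x**2 / (2.0 * sigma**2))
        return g / g.sum()

    def smooth_separable(image, sigma):
        # A 2-D Gaussian kernel is the outer product of two 1-D Gaussians,
        # so two 1-D passes replace one 2-D pass.
        g = gaussian_1d(sigma)
        tmp = convolve1d(image.astype(float), g, axis=0)  # pass along columns
        return convolve1d(tmp, g, axis=1)                 # pass along rows

The cost is roughly 2n multiplications per pixel instead of n²; for n = 25 that is 50 versus 625, consistent with the order-of-12 saving quoted above.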
The Canny Edge Detector

Although the algorithm is more complex, the performance of the Canny edge detector (Canny [1986]) discussed in this section is superior in general to the edge detectors discussed thus far. Canny's approach is based on three basic objectives:

1. Low error rate. All edges should be found, and there should be no spurious responses.
2. Edge points should be well localized. The edges located must be as close as possible to the true edges. That is, the distance between a point marked as an edge by the detector and the center of the true edge should be minimum.
3. Single edge point response. The detector should return only one point for each true edge point. That is, the number of local maxima around the true edge should be minimum. This means that the detector should not identify multiple edge pixels where only a single edge point exists.
The essence of Canny's work was in expressing the preceding three criteria mathematically, and then attempting to find optimal solutions to these formulations. In general, it is difficult (or impossible) to find a closed-form solution that satisfies all the preceding objectives. However, using numerical optimization with 1-D step edges corrupted by additive white Gaussian noise† led to the conclusion that a good approximation to the optimal step edge detector is the first derivative of a Gaussian,

\frac{d}{dx}\, e^{-\frac{x^2}{2\sigma^2}} = \frac{-x}{\sigma^2}\, e^{-\frac{x^2}{2\sigma^2}}    (10-34)

where the approximation was only about 20% worse than using the optimized numerical solution (a difference of this magnitude generally is visually imperceptible in most applications).

† Recall that white noise is noise having a frequency spectrum that is continuous and uniform over a specified frequency band. White Gaussian noise is white noise in which the distribution of amplitude values is Gaussian. Gaussian white noise is a good approximation of many real-world situations and generates mathematically tractable models. It has the useful property that its values are statistically independent.

Generalizing the preceding result to 2-D involves recognizing that the 1-D approach still applies in the direction of the edge normal (see Fig. 10.12). Because the direction of the normal is unknown beforehand, this would require applying the 1-D edge detector in all possible directions. This task can be approximated by first smoothing the image with a circular 2-D Gaussian function, computing the gradient of the result, and then using the gradient magnitude and direction to estimate edge strength and direction at every point.

Let f(x, y) denote the input image and G(x, y) denote the Gaussian function:

G(x, y) = e^{-\frac{x^2 + y^2}{2\sigma^2}}    (10-35)

We form a smoothed image, f_s(x, y), by convolving f and G:

f_s(x, y) = G(x, y) \star f(x, y)    (10-36)

This operation is followed by computing the gradient magnitude and direction (angle), as discussed earlier:

M_s(x, y) = \|\nabla f_s(x, y)\| = \sqrt{g_x^2(x, y) + g_y^2(x, y)}    (10-37)

and

\alpha(x, y) = \tan^{-1}\!\left[ \frac{g_y(x, y)}{g_x(x, y)} \right]    (10-38)

with g_x(x, y) = ∂f_s(x, y)/∂x and g_y(x, y) = ∂f_s(x, y)/∂y. Any of the derivative filter kernel pairs in Fig. 10.14 can be used to obtain g_x(x, y) and g_y(x, y). Equation (10-36) is implemented using an n × n Gaussian kernel whose size is discussed below. Keep in mind that \|\nabla f_s(x, y)\| and α(x, y) are arrays of the same size as the image from which they are computed.
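A minimal sketch of Eqs. (10-36) through (10-38) follows, using SciPy's Sobel operators to stand in for the derivative kernel pairs of Fig. 10.14 (np.arctan2 is used as a full-range form of the arctangent in Eq. (10-38)):

    import numpy as np
    from scipy.ndimage import gaussian_filter, sobel

    def smoothed_gradient(image, sigma):
        fs = gaussian_filter(image.astype(float), sigma)  # Eq. (10-36)
        gx = sobel(fs, axis=1)                            # derivative across columns
        gy = sobel(fs, axis=0)                            # derivative across rows
        magnitude = np.hypot(gx, gy)                      # Eq. (10-37)
        angle = np.arctan2(gy, gx)                        # Eq. (10-38), full range
        return magnitude, angle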
The gradient image \|\nabla f_s(x, y)\| typically contains wide ridges around local maxima. The next step is to thin those ridges. One approach is to use nonmaxima suppression. The essence of this approach is to specify a number of discrete orientations of the edge normal (gradient vector). For example, in a 3 × 3 region we can define four orientations† for an edge passing through the center point of the region: horizontal, vertical, +45°, and −45°. Figure 10.24(a) shows the situation for the two possible orientations of a horizontal edge. Because we have to quantize all possible edge directions into four ranges, we have to define a range of directions over which we consider an edge to be horizontal. We determine edge direction from the direction of the edge normal, which we obtain directly from the image data using Eq. (10-38).

† Every edge has two possible orientations. For example, an edge whose normal is oriented at 0° and an edge whose normal is oriented at 180° are the same horizontal edge.
a b
c
FIGURE 10.24 (a) Two possible orientations of a horizontal edge (shaded) in a 3 × 3 neighborhood. (b) Range of values (shaded) of α, the direction angle of the edge normal for a horizontal edge. (c) The angle ranges of the edge normals for the four types of edge directions in a 3 × 3 neighborhood. Each edge direction has two ranges, shown in corresponding shades.

As Fig. 10.24(b) shows, if the edge normal is in the range of directions from −22.5° to 22.5° or from −157.5° to 157.5°, we call the edge a horizontal edge. Figure 10.24(c) shows the angle ranges corresponding to the four directions under consideration.

Let d1, d2, d3, and d4 denote the four basic edge directions just discussed for a 3 × 3 region: horizontal, −45°, vertical, and +45°, respectively. We can formulate the following nonmaxima suppression scheme for a 3 × 3 region centered at an arbitrary point (x, y) in α (sketched in code below):

1. Find the direction d_k that is closest to α(x, y).
2. Let K denote the value of \|\nabla f_s\| at (x, y). If K is less than the value of \|\nabla f_s\| at one or both of the neighbors of point (x, y) along d_k, let g_N(x, y) = 0 (suppression); otherwise, let g_N(x, y) = K.

When repeated for all values of x and y, this procedure yields a nonmaxima-suppressed image g_N(x, y) that is of the same size as f_s(x, y). For example, with reference to Fig. 10.24(a), letting (x, y) be at p5 and assuming a horizontal edge through p5, the pixels of interest in Step 2 would be p2 and p8. Image g_N(x, y) contains only the thinned edges; it is equal to image \|\nabla f_s(x, y)\| with the nonmaxima edge points suppressed.
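A minimal sketch of this scheme follows (not the book's code; magnitude and angle are assumed to come from a routine such as smoothed_gradient above, and the four direction buckets are written in terms of the gradient direction under standard array-axis conventions, which may differ from the axis convention of Fig. 10.24):

    import numpy as np

    def nonmaxima_suppress(magnitude, angle):
        gN = np.zeros_like(magnitude)
        deg = np.rad2deg(angle) % 180        # fold the two opposite orientations together
        rows, cols = magnitude.shape
        for y in range(1, rows - 1):
            for x in range(1, cols - 1):
                d = deg[y, x]
                if d < 22.5 or d >= 157.5:   # gradient ~horizontal: compare left/right
                    n1, n2 = magnitude[y, x - 1], magnitude[y, x + 1]
                elif d < 67.5:               # gradient ~+45 deg: one diagonal pair
                    n1, n2 = magnitude[y + 1, x + 1], magnitude[y - 1, x - 1]
                elif d < 112.5:              # gradient ~vertical: compare up/down
                    n1, n2 = magnitude[y - 1, x], magnitude[y + 1, x]
                else:                        # gradient ~-45 deg: other diagonal pair
                    n1, n2 = magnitude[y + 1, x - 1], magnitude[y - 1, x + 1]
                K = magnitude[y, x]
                if K >= n1 and K >= n2:
                    gN[y, x] = K             # keep; otherwise it stays suppressed (zero)
        return gN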
The final operation is to threshold g_N(x, y) to reduce false edge points. In the Marr-Hildreth algorithm we did this using a single threshold, in which all values below the threshold were set to 0. If we set the threshold too low, there will still be some false edges (called false positives). If the threshold is set too high, then valid edge points will be eliminated (false negatives). Canny's algorithm attempts to improve on this situation by using hysteresis thresholding which, as we will discuss in Section 10.3, uses two thresholds: a low threshold, T_L, and a high threshold, T_H. Experimental evidence (Canny [1986]) suggests that the ratio of the high to low threshold should be in the range of 2:1 to 3:1.

We can visualize the thresholding operation as creating two additional images:

g_{NH}(x, y) = g_N(x, y) \ge T_H    (10-39)

and

g_{NL}(x, y) = g_N(x, y) \ge T_L    (10-40)

Initially, g_NH(x, y) and g_NL(x, y) are set to 0. After thresholding, g_NH(x, y) will usually have fewer nonzero pixels than g_NL(x, y), but all the nonzero pixels in g_NH(x, y) will be contained in g_NL(x, y) because the latter image is formed with a lower threshold. We eliminate from g_NL(x, y) all the nonzero pixels from g_NH(x, y) by letting

g_{NL}(x, y) = g_{NL}(x, y) - g_{NH}(x, y)    (10-41)

The nonzero pixels in g_NH(x, y) and g_NL(x, y) may be viewed as being "strong" and "weak" edge pixels, respectively. After the thresholding operations, all strong pixels in g_NH(x, y) are assumed to be valid edge pixels, and are so marked immediately. Depending on the value of T_H, the edges in g_NH(x, y) typically have gaps. Longer edges are formed using the following procedure (a code sketch follows the list):

(a) Locate the next unvisited edge pixel, p, in g_NH(x, y).
(b) Mark as valid edge pixels all the weak pixels in g_NL(x, y) that are connected to p using, say, 8-connectivity.
(c) If all nonzero pixels in g_NH(x, y) have been visited, go to Step (d). Else, return to Step (a).
(d) Set to zero all pixels in g_NL(x, y) that were not marked as valid edge pixels.

At the end of this procedure, the final image output by the Canny algorithm is formed by appending to g_NH(x, y) all the nonzero pixels from g_NL(x, y).
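A minimal sketch of this procedure follows (not the book's code; instead of the explicit visit loop of Steps (a) through (d), it uses 8-connected component labeling from SciPy, which marks exactly the weak pixels connected to strong ones):

    import numpy as np
    from scipy.ndimage import label

    def hysteresis(gN, t_low, t_high):
        strong = gN >= t_high                    # Eq. (10-39)
        weak = (gN >= t_low) & ~strong           # Eqs. (10-40) and (10-41)
        # Label weak-or-strong pixels with 8-connectivity; keep every
        # component that contains at least one strong pixel.
        labels, n = label(strong | weak, structure=np.ones((3, 3)))
        keep = np.zeros(n + 1, dtype=bool)
        keep[np.unique(labels[strong])] = True   # components touching strong pixels
        keep[0] = False                          # background label
        return keep[labels]                      # final boolean edge map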
We used two additional images, g_NH(x, y) and g_NL(x, y), to simplify the discussion. In practice, hysteresis thresholding can be implemented directly during nonmaxima suppression, and thresholding can be implemented directly on g_N(x, y) by forming a list of strong pixels and the weak pixels connected to them.

Summarizing, the Canny edge detection algorithm consists of the following steps (a sketch combining the pieces developed above follows the next paragraph):

1. Smooth the input image with a Gaussian filter.
2. Compute the gradient magnitude and angle images.
3. Apply nonmaxima suppression to the gradient magnitude image.
4. Use double thresholding and connectivity analysis to detect and link edges.

Although the edges after nonmaxima suppression are thinner than raw gradient edges, the former can still be thicker than one pixel. To obtain edges one pixel thick, it is typical to follow Step 4 with one pass of an edge-thinning algorithm (see Section 9.5).
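The four steps can be composed from the sketches above (an illustrative composition only; production code would more typically call a library routine such as skimage.feature.canny, and the threshold scale here depends on the gradient operator used):

    def canny_sketch(image, sigma, t_low, t_high):
        magnitude, angle = smoothed_gradient(image, sigma)  # Steps 1 and 2
        gN = nonmaxima_suppress(magnitude, angle)           # Step 3
        return hysteresis(gN, t_low, t_high)                # Step 4

    # In the spirit of Example 10.8: sigma = 4 and TH = 2.5 * TL, with the
    # thresholds chosen relative to the dynamic range of gN in practice.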
As mentioned earlier, smoothing is accomplished by convolving the input image with a Gaussian kernel whose size, n × n, must be chosen. Once a value of σ has been specified, we can use the approach discussed in connection with the Marr-Hildreth algorithm to determine an odd value of n that provides the "full" smoothing capability of the Gaussian filter for the specified value of σ. (Usually, selecting a suitable value of σ for the first time in an application requires experimentation.)

Some final comments on implementation: As noted earlier in the discussion of the Marr-Hildreth edge detector, the 2-D Gaussian function in Eq. (10-35) is separable into a product of two 1-D Gaussians. Thus, Step 1 of the Canny algorithm can be formulated as 1-D convolutions that operate on the rows (columns) of the image one at a time, and then work on the columns (rows) of the result. Furthermore, if we use the approximations in Eqs. (10-19) and (10-20), we can also implement the gradient computations required for Step 2 as 1-D convolutions (see Problem 10.22).
EXAMPLE 10.8 : Illustration and comparison of the Canny edge-detection method.

Figure 10.25(a) shows the familiar building image. For comparison, Figs. 10.25(b) and (c) show, respectively, the result in Fig. 10.20(b) obtained using the thresholded gradient, and Fig. 10.22(d) using the Marr-Hildreth detector. Recall that the parameters used in generating those two images were selected to detect the principal edges, while attempting to reduce "irrelevant" features, such as the edges of the bricks and the roof tiles.

Figure 10.25(d) shows the result obtained with the Canny algorithm using the parameters T_L = 0.04, T_H = 0.10 (2.5 times the value of the low threshold), σ = 4, and a kernel of size 25 × 25, which corresponds to the smallest odd integer not less than 6σ. These parameters were chosen experimentally to achieve the objectives stated in the previous paragraph for the gradient and Marr-Hildreth images. Comparing the Canny image with the other two images, we see in the Canny result significant improvements in the detail of the principal edges and, at the same time, more rejection of irrelevant features. For example, both edges of the concrete band lining the bricks in the upper section of the image were detected by the Canny algorithm, whereas the thresholded gradient lost both of these edges and the Marr-Hildreth method detected only the upper one. In terms of filtering out irrelevant detail, the Canny image does not contain a single edge due to the roof tiles; this is not true in the other two images. The quality of the lines with regard to continuity, thinness, and straightness is also superior in the Canny image. Results such as these have made the Canny algorithm a tool of choice for edge detection.
a b
c d
FIGURE 10.25 (a) Original image of size 834 × 1114 pixels, with intensity values scaled to the range [0, 1]. (b) Thresholded gradient of the smoothed image. (c) Image obtained using the Marr-Hildreth algorithm. (d) Image obtained using the Canny algorithm. Note the significant improvement of the Canny image compared to the other two.

EXAMPLE 10.9 : Another illustration of the three principal edge-detection methods discussed in this section.

As another comparison of the three principal edge-detection methods discussed in this section, consider Fig. 10.26(a), which shows a 512 × 512 head CT image. Our objective is to extract the edges of the outer contour of the brain (the gray region in the image), the contour of the spinal region (shown directly behind the nose, toward the front of the brain), and the outer contour of the head. We wish to generate the thinnest, continuous contours possible, while eliminating edge details related to the gray content in the eyes and brain areas.

Figure 10.26(b) shows a thresholded gradient image that was first smoothed using a 5 × 5 averaging kernel. The threshold required to achieve the result shown was 15% of the maximum value of the gradient image. Figure 10.26(c) shows the result obtained with the Marr-Hildreth edge-detection algorithm with a threshold of 0.002, σ = 3, and a kernel of size 19 × 19. Figure 10.26(d) was obtained using the Canny algorithm with T_L = 0.05, T_H = 0.15 (3 times the value of the low threshold), σ = 2, and a kernel of size 13 × 13.

a b
c d
FIGURE 10.26 (a) Head CT image of size 512 × 512 pixels, with intensity values scaled to the range [0, 1]. (b) Thresholded gradient of the smoothed image. (c) Image obtained using the Marr-Hildreth algorithm. (d) Image obtained using the Canny algorithm. (Original image courtesy of Dr. David R. Pickens, Vanderbilt University.)
In terms of edge quality and the ability to eliminate irrelevant detail, the results in Fig. 10.26 correspond closely to the results and conclusions in the previous example. Note also that the Canny algorithm was the only procedure capable of yielding a totally unbroken edge for the posterior boundary of the brain, and the closest boundary of the spinal cord. It was also the only procedure capable of finding the cleanest contours, while eliminating all the edges associated with the gray brain matter in the original image.
The price paid for the improved performance of the Canny algorithm is a significantly more complex implementation than the two approaches discussed earlier. In some applications, such as real-time industrial image processing, cost and speed requirements usually dictate the use of simpler techniques, principally the thresholded gradient approach. When edge quality is the driving force, the Marr-Hildreth and Canny algorithms, especially the latter, offer superior alternatives.
                                    LINKING EDGE POINTS                                                                                                                        time applications consists of the following steps:
                                    Ideally, edge detection should yield sets of pixels lying only on edges. In practice,
                                                                                                                                                                                1. Compute the gradient magnitude and angle arrays, M( x, y) and a( x, y), of the
                                    these pixels seldom characterize edges completely because of noise, breaks in the
                                                                                                                                                                                   input image, f ( x, y).
                                    edges caused by nonuniform illumination, and other effects that introduce disconti-
                                    nuities in intensity values. Therefore, edge detection typically is followed by linking                                                     2. Form a binary image, g( x, y), whose value at any point ( x, y) is given by:
                                    algorithms designed to assemble edge pixels into meaningful edges and/or region
                                    boundaries. In this section, we discuss two fundamental approaches to edge linking                                                                                    1    if M( x, y) > TM AND a( x, y) = A ± TA
                                                                                                                                                                                               g( x, y) = 
                                    that are representative of techniques used in practice. The first requires knowledge                                                                                  0    otherwise
                                    about edge points in a local region (e.g., a 3 × 3 neighborhood), and the second
                                    is a global approach that works with an entire edge map. As it turns out, linking                                                               where TM is a threshold, A is a specified angle direction, and ±TA defines a
                                    points along the boundary of a region is also an important aspect of some of the                                                               “band” of acceptable directions about A.
                                    segmentation methods discussed in the next chapter, and in extracting features from                                                         3. Scan the rows of g and fill (set to 1) all gaps (sets of 0’s) in each row that do not
                                    a segmented image, as we will do in Chapter 11. Thus, you will encounter additional                                                            exceed a specified length, L. Note that, by definition, a gap is bounded at both
                                    edge-point linking methods in the next two chapters.                                                                                           ends by one or more 1’s. The rows are processed individually, with no “memory”
                                                                                                                                                                                   kept between them.
                                    Local Processing                                                                                                                            4. To detect gaps in any other direction, u, rotate g by this angle and apply the
                                    A simple approach for linking edge points is to analyze the characteristics of pixels                                                          horizontal scanning procedure in Step 3. Rotate the result back by −u.
                                    in a small neighborhood about every point ( x, y) that has been declared an edge                                                           When interest lies in horizontal and vertical edge linking, Step 4 becomes a simple
                                    point by one of the techniques discussed in the preceding sections. All points that                                                        procedure in which g is rotated ninety degrees, the rows are scanned, and the result
                                    are similar according to predefined criteria are linked, forming an edge of pixels that                                                    is rotated back. This is the application found most frequently in practice and, as the
                                    share common properties according to the specified criteria.                                                                               following example shows, this approach can yield good results. In general, image
                                       The two principal properties used for establishing similarity of edge pixels in this                                                    rotation is an expensive computational process so, when linking in numerous angle
                                    kind of local analysis are (1) the strength (magnitude) and (2) the direction of the                                                       directions is required, it is more practical to combine Steps 3 and 4 into a single,
                                    gradient vector. The first property is based on Eq. (10-17). Let Sxy denote the set of                                                     radial scanning procedure.
                                    coordinates of a neighborhood centered at point ( x, y) in an image. An edge pixel
                                    with coordinates ( s, t ) in Sxy is similar in magnitude to the pixel at ( x, y) if
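Expressed in code, the simplified procedure is short. The following NumPy sketch is illustrative only: central differences (np.gradient) stand in for the gradient operators of Eqs. (10-17) and (10-18), the function and parameter names are ours, and the angle test ignores the ±180° wraparound of gradient directions.

```python
import numpy as np

def link_edges_by_rows(f, TM, A, TA, L):
    """Sketch of Steps 1-3 for a single direction band A +/- TA (degrees).
    For Step 4 (another direction), rotate the image, rerun, rotate back."""
    # Step 1: gradient magnitude M(x, y) and angle alpha(x, y).
    gy, gx = np.gradient(f.astype(float))
    M = np.hypot(gx, gy)
    alpha = np.degrees(np.arctan2(gy, gx))

    # Step 2: binary image g(x, y) of strong edge points whose gradient
    # direction lies in the band A +/- TA (wraparound not handled here).
    g = ((M > TM) & (np.abs(alpha - A) <= TA)).astype(np.uint8)

    # Step 3: in each row, fill gaps (runs of 0's bounded at both ends
    # by 1's) whose length does not exceed L. Rows are independent.
    for row in g:
        ones = np.flatnonzero(row)
        for start, end in zip(ones[:-1], ones[1:]):
            if 1 < end - start <= L + 1:      # a gap of 1 to L zeros
                row[start + 1:end] = 1
    return g
```

For horizontal-and-vertical linking, the ninety-degree rotation of Step 4 can be done losslessly with np.rot90 before and after the row scan.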
EXAMPLE 10.10: Edge linking using local processing.
Figure 10.27(a) shows a 534 × 566 image of the rear of a vehicle. The objective of this example is to illustrate the use of the preceding algorithm for finding rectangles whose sizes make them suitable candidates for license plates. The formation of these rectangles can be accomplished by detecting
a b c
FIGURE 10.29 (a) (ρ, θ) parameterization of a line in the xy-plane. (b) Sinusoidal curves in the ρθ-plane (the curves ρ = xi cos θ + yi sin θ and ρ = xj cos θ + yj sin θ); the point of intersection, (ρ′, θ′), corresponds to the line passing through points (xi, yi) and (xj, yj) in the xy-plane. (c) Division of the ρθ-plane into accumulator cells, with θ ranging over [θmin, θmax] and ρ over [ρmin, ρmax].

a b
FIGURE 10.30 (a) Image of size 101 × 101 pixels, containing five white points (four in the corners and one in the center). (b) Corresponding parameter space.
The ρθ-plane is subdivided into accumulator cells over the ranges −90° ≤ θ ≤ 90° and −D ≤ ρ ≤ D, where D is the maximum distance between opposite corners in an image. The cell at coordinates (i, j), with accumulator value A(i, j), corresponds to the square associated with parameter-space coordinates (ρi, θj). Initially, these cells are set to zero. Then, for every non-background point (xk, yk) in the xy-plane, we let θ equal each of the allowed subdivision values on the θ-axis and solve for the corresponding ρ using the equation ρ = xk cos θ + yk sin θ. The resulting ρ values are then rounded off to the nearest allowed cell value along the ρ-axis. If a choice of θq results in the solution ρp, then we let A(p, q) = A(p, q) + 1. At the end of the procedure, a value of K in a cell A(i, j) means that K points in the xy-plane lie on the line x cos θj + y sin θj = ρi. The number of subdivisions in the ρθ-plane determines the accuracy of the collinearity of these points. It can be shown (see Problem 10.27) that the number of computations in the method just discussed is linear with respect to n, the number of non-background points in the xy-plane.
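Stated in code, the accumulator construction is only a few lines. The following NumPy sketch assumes 1° subdivisions for θ and one-pixel subdivisions for ρ; the names and defaults are illustrative choices, not part of the formulation above.

```python
import numpy as np

def hough_lines(binary_img, theta_step_deg=1.0):
    """Sketch of the accumulator-cell procedure for lines
    x cos(theta) + y sin(theta) = rho. One-pixel rho cells are assumed."""
    rows, cols = binary_img.shape
    D = int(np.ceil(np.hypot(rows - 1, cols - 1)))  # max corner-to-corner distance
    thetas = np.deg2rad(np.arange(-90.0, 90.0, theta_step_deg))
    rhos = np.arange(-D, D + 1)                     # -D <= rho <= D
    A = np.zeros((rhos.size, thetas.size), dtype=np.int64)

    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    ys, xs = np.nonzero(binary_img)                 # non-background points
    for x, y in zip(xs, ys):
        # Solve rho = x cos(theta) + y sin(theta) for every allowed theta,
        # round to the nearest rho cell, and increment A(p, q).
        rho = np.rint(x * cos_t + y * sin_t).astype(int)
        A[rho + D, np.arange(thetas.size)] += 1
    return A, rhos, thetas
```

A count of K in cell A(i, j) then means that K points lie, to within the cell size, on the line x cos θj + y sin θj = ρi.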
For curves with three parameters, the parameter space is 3-D, with cube-like cells and accumulators of the form A(i, j, k). The procedure is to increment c1 and c2, solve for the value of c3 that satisfies Eq. (10-45), and update the accumulator cell associated with the triplet (c1, c2, c3). Clearly, the complexity of the Hough transform depends on the number of coordinates and coefficients in a given functional representation. As noted earlier, generalizations of the Hough transform to detect curves with no simple analytic representations are possible, as is the application of the transform to grayscale images.
   Returning to the edge-linking problem, an approach based on the Hough transform is as follows:

1. Obtain a binary edge map using any of the methods discussed earlier in this section.
2. Specify subdivisions in the ρθ-plane.
3. Examine the counts of the accumulator cells for high pixel concentrations.
4. Examine the relationship (principally for continuity) between pixels in a chosen cell.

Continuity in this case usually is based on computing the distance between disconnected pixels corresponding to a given accumulator cell. A gap in a line associated with a given cell is bridged if the length of the gap is less than a specified threshold. Being able to group lines based on direction is a global concept applicable over the entire image, requiring only that we examine pixels associated with specific accumulator cells. The following example illustrates these concepts (a brief code sketch of Steps 3 and 4 appears first).
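The sketch below continues the hough_lines sketch given earlier: it picks a high-count cell from the accumulator (Step 3) and recovers the edge pixels belonging to that cell (Step 4). The tolerance tol is a hypothetical parameter of ours, and the gap-bridging step itself is only described in comments.

```python
import numpy as np

def pixels_of_cell(edge_map, rho, theta, tol=1.0):
    """Collect the edge pixels lying within tol pixels of the line
    x cos(theta) + y sin(theta) = rho (Step 4 examines these pixels)."""
    ys, xs = np.nonzero(edge_map)
    d = np.abs(xs * np.cos(theta) + ys * np.sin(theta) - rho)
    return xs[d <= tol], ys[d <= tol]

# Step 3, e.g., using the single strongest cell:
# A, rhos, thetas = hough_lines(edge_map)
# i, j = np.unravel_index(np.argmax(A), A.shape)
# xs, ys = pixels_of_cell(edge_map, rhos[i], thetas[j])
# Gaps between consecutive pixels along this line that are shorter than a
# chosen threshold would then be bridged (filled in), as described above.
```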
EXAMPLE 10.12: Using the Hough transform for edge linking.
Figure 10.31(a) shows an aerial image of an airport. The objective of this example is to use the Hough transform to extract the two edges defining the principal runway. A solution to such a problem might be of interest, for instance, in applications involving autonomous air navigation.
   The first step is to obtain an edge map. Figure 10.31(b) shows the edge map obtained using Canny's algorithm with the same parameters and procedure used in Example 10.9. For the purpose of computing the Hough transform, similar results can be obtained using any of the other edge-detection techniques discussed earlier. Figure 10.31(c) shows the Hough parameter space obtained using 1° increments for θ, and one-pixel increments for ρ.
   The runway of interest is oriented approximately 1° off the north direction, so we select the cells corresponding to ±90° and containing the highest count because the runways are the longest lines oriented in these directions. The small boxes on the edges of Fig. 10.31(c) highlight these cells. As mentioned earlier in connection with Fig. 10.30(b), the Hough transform exhibits adjacency at the edges. Another way of interpreting this property is that a line oriented at +90° and a line oriented at −90° are equivalent (i.e., they are both vertical). Figure 10.31(d) shows the lines corresponding to the two accumulator cells just discussed, and Fig. 10.31(e) shows the lines superimposed on the original image. The lines were obtained by joining all gaps not exceeding 20% (approximately 100 pixels) of the image height. These lines clearly correspond to the edges of the runway of interest.
   Note that the only information needed to solve this problem was the orientation of the runway and the observer's position relative to it. In other words, a vehicle navigating autonomously would know that if the runway of interest faces north, and the vehicle's direction of travel also is north, the runway should appear vertically in the image. Other relative orientations are handled in a similar manner. The orientations of runways throughout the world are available in flight charts, and the direction of travel is easily obtainable using GPS (Global Positioning System) information. This information also could be used to compute the distance between the vehicle and the runway, thus allowing estimates of parameters such as expected length of lines relative to image size, as we did in this example.

a b
c d e
FIGURE 10.31 (a) A 502 × 564 aerial image of an airport. (b) Edge map obtained using Canny's algorithm. (c) Hough parameter space (the boxes highlight the points associated with long vertical lines). (d) Lines in the image plane corresponding to the points highlighted by the boxes. (e) Lines superimposed on the original image.

10.3 THRESHOLDING

Because of its intuitive properties, simplicity of implementation, and computational speed, image thresholding enjoys a central position in applications of image segmentation. Thresholding was introduced in Section 3.1, and we have used it in various discussions since then. In this section, we discuss thresholding in a more formal way, and develop techniques that are considerably more general than what has been presented thus far.

FOUNDATION

In the previous section, regions were identified by first finding edge segments, then attempting to link the segments into boundaries. In this section, we discuss
techniques for partitioning images directly into regions based on intensity values and/or properties of these values.

The Basics of Intensity Thresholding

Suppose that the intensity histogram in Fig. 10.32(a) corresponds to an image, f(x, y), composed of light objects on a dark background, in such a way that object and background pixels have intensity values grouped into two dominant modes. (Remember, f(x, y) denotes the intensity of f at coordinates (x, y).) One obvious way to extract the objects from the background is to select a threshold, T, that separates these modes. Then, any point (x, y) in the image at which f(x, y) > T is called an object point. Otherwise, the point is called a background point. In other words, the segmented image, denoted by g(x, y), is given by

              g(x, y) = { 1   if f(x, y) > T
                        { 0   if f(x, y) ≤ T                    (10-46)

Although we follow convention in using 0 intensity for the background and 1 for object pixels, any two distinct values can be used in Eq. (10-46).
   When T is a constant applicable over an entire image, the process given in this equation is referred to as global thresholding. When the value of T changes over an image, we use the term variable thresholding. The terms local or regional thresholding are used sometimes to denote variable thresholding in which the value of T at any point (x, y) in an image depends on properties of a neighborhood of (x, y) (for example, the average intensity of the pixels in the neighborhood). If T depends on the spatial coordinates (x, y) themselves, then variable thresholding is often referred to as dynamic or adaptive thresholding. Use of these terms is not universal.
   Figure 10.32(b) shows a more difficult thresholding problem involving a histogram with three dominant modes corresponding, for example, to two types of light objects on a dark background. Here, multiple thresholding classifies a point (x, y) as belonging to the background if f(x, y) ≤ T1, to one object class if T1 < f(x, y) ≤ T2, and to the other object class if f(x, y) > T2. That is, the segmented image is given by

              g(x, y) = { a   if f(x, y) > T2
                        { b   if T1 < f(x, y) ≤ T2              (10-47)
                        { c   if f(x, y) ≤ T1

where a, b, and c are any three distinct intensity values. We will discuss dual thresholding later in this section. Segmentation problems requiring more than two thresholds are difficult (or often impossible) to solve, and better results usually are obtained using other methods, such as variable thresholding, as will be discussed later in this section, or region growing, as we will discuss in Section 10.4.
   Based on the preceding discussion, we may infer intuitively that the success of intensity thresholding is related directly to the width and depth of the valley(s) separating the histogram modes. In turn, the key factors affecting the properties of the valley(s) are: (1) the separation between peaks (the further apart the peaks are, the better the chances of separating the modes); (2) the noise content in the image (the modes broaden as noise increases); (3) the relative sizes of objects and background; (4) the uniformity of the illumination source; and (5) the uniformity of the reflectance properties of the image.

a b
FIGURE 10.32 Intensity histograms that can be partitioned (a) by a single threshold, and (b) by dual thresholds.

The Role of Noise in Image Thresholding

The simple synthetic image in Fig. 10.33(a) is free of noise, so its histogram consists of two "spike" modes, as Fig. 10.33(d) shows. Segmenting this image into two regions is a trivial task: we just select a threshold anywhere between the two modes. Figure 10.33(b) shows the original image corrupted by Gaussian noise of zero mean and a standard deviation of 10 intensity levels.

a b c
d e f
FIGURE 10.33 (a) Noiseless 8-bit image. (b) Image with additive Gaussian noise of mean 0 and standard deviation of 10 intensity levels. (c) Image with additive Gaussian noise of mean 0 and standard deviation of 50 intensity levels. (d) through (f) Corresponding histograms.
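The effect of noise on the histogram modes is easy to reproduce. The sketch below builds a synthetic two-mode image (all specific intensities and sizes are our own illustrative choices, not those of Fig. 10.33), adds Gaussian noise at the two levels used in the figure, and applies Eq. (10-46) with a threshold midway between the modes.

```python
import numpy as np

rng = np.random.default_rng()

# Synthetic two-mode image: a light (intensity 170) square object on a
# dark (intensity 85) background -- illustrative values only.
f = np.full((256, 256), 85.0)
f[64:192, 64:192] = 170.0

for sigma in (0.0, 10.0, 50.0):          # noise levels as in Fig. 10.33
    noisy = f + rng.normal(0.0, sigma, f.shape)
    g = (noisy > 127).astype(np.uint8)   # Eq. (10-46), T midway between modes
    errors = np.count_nonzero(g != (f > 127))
    print(f"sigma = {sigma:4.0f}: misclassified pixels = {errors}")
```

At a standard deviation of 10 the two modes remain several standard deviations apart and the midpoint threshold misclassifies almost nothing; at 50 the modes overlap heavily and the error count grows accordingly.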
The modes are broader now [see Fig. 10.33(e)], but their separation is enough so that the depth of the valley between them is sufficient to make the modes easy to separate. A threshold placed midway between the two peaks would do the job. Figure 10.33(c) shows the result of corrupting the image with Gaussian noise of zero mean and a standard deviation of 50 intensity levels. As the histogram in Fig. 10.33(f) shows, the situation is much more serious now, as there is no way to differentiate between the two modes. Without additional processing (such as the methods discussed later in this section) we have little hope of finding a suitable threshold for segmenting this image.

The Role of Illumination and Reflectance in Image Thresholding

Figure 10.34 illustrates the effect that illumination can have on the histogram of an image. Figure 10.34(a) is the noisy image from Fig. 10.33(b), and Fig. 10.34(d) shows its histogram. As before, this image is easily segmentable with a single threshold. With reference to the image formation model discussed in Section 2.3, suppose that we multiply the image in Fig. 10.34(a) by a nonuniform intensity function, such as the intensity ramp in Fig. 10.34(b), whose histogram is shown in Fig. 10.34(e). (In theory, the histogram of a ramp image is uniform; in practice, the degree of uniformity depends on the size of the image and the number of intensity levels.) Figure 10.34(c) shows the product of these two images, and Fig. 10.34(f) is the resulting histogram. The deep valley between peaks was corrupted to the point where separation of the modes without additional processing (to be discussed later in this section) is no longer possible. Similar results would be obtained if the illumination was perfectly uniform, but the reflectance of the image was not, as a result, for example, of natural reflectivity variations in the surface of objects and/or background.
   The important point is that illumination and reflectance play a central role in the success of image segmentation using thresholding or other segmentation techniques. Therefore, controlling these factors when possible should be the first step considered in the solution of a segmentation problem. There are three basic approaches to the problem when control over these factors is not possible. The first is to correct the shading pattern directly. For example, nonuniform (but fixed) illumination can be corrected by multiplying the image by the inverse of the pattern, which can be obtained by imaging a flat surface of constant intensity. The second is to attempt to correct the global shading pattern via processing using, for example, the top-hat transformation introduced in Section 9.8. The third approach is to "work around" nonuniformities using variable thresholding, as discussed later in this section.

a b c
d e f
FIGURE 10.34 (a) Noisy image. (b) Intensity ramp in the range [0.2, 0.6]. (c) Product of (a) and (b). (d) through (f) Corresponding histograms.

BASIC GLOBAL THRESHOLDING

When the intensity distributions of objects and background pixels are sufficiently distinct, it is possible to use a single (global) threshold applicable over the entire image. In most applications, there is usually enough variability between images that, even if global thresholding is a suitable approach, an algorithm capable of estimating the threshold value for each image is required. The following iterative algorithm can be used for this purpose (a code sketch is given at the end of this discussion):

1. Select an initial estimate for the global threshold, T.
2. Segment the image using T in Eq. (10-46). This will produce two groups of pixels: G1, consisting of pixels with intensity values > T; and G2, consisting of pixels with values ≤ T.
3. Compute the average (mean) intensity values m1 and m2 for the pixels in G1 and G2, respectively.
4. Compute a new threshold value midway between m1 and m2:

                    T = ½(m1 + m2)

5. Repeat Steps 2 through 4 until the difference between values of T in successive iterations is smaller than a predefined value, ΔT.

The algorithm is stated here in terms of successively thresholding the input image and calculating the means at each step, because it is more intuitive to introduce it in this manner. However, it is possible to develop an equivalent (and more efficient) procedure by expressing all computations in terms of the image histogram, which has to be computed only once (see Problem 10.29).
   The preceding algorithm works well in situations where there is a reasonably clear valley between the modes of the histogram related to objects and background. Parameter ΔT is used to stop iterating when the change in threshold values is small. The initial threshold must be chosen greater than the minimum and less than the maximum intensity level in the image (the average intensity of the image is a good initial choice for T). If this condition is met, the algorithm converges in a finite number of steps, whether or not the modes are separable (see Problem 10.30).
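A direct transcription of Steps 1 through 5 follows. It is a sketch that operates on the image itself (the more efficient histogram-based form is the subject of Problem 10.29), uses the image mean as the initial estimate, and guards against an empty class, an edge case the five steps above do not need to address when the initial threshold is chosen as stated.

```python
import numpy as np

def basic_global_threshold(f, delta_T=0.5):
    """Sketch of the iterative global-threshold algorithm. delta_T plays
    the role of the stopping parameter called DT in the text; the default
    value 0.5 is an arbitrary choice of ours."""
    f = f.astype(float)
    T = f.mean()                           # Step 1: initial estimate
    while True:
        G1, G2 = f[f > T], f[f <= T]       # Step 2: segment via Eq. (10-46)
        m1 = G1.mean() if G1.size else T   # Step 3: class means (guard
        m2 = G2.mean() if G2.size else T   #          against an empty class)
        T_new = 0.5 * (m1 + m2)            # Step 4: midpoint of the means
        if abs(T_new - T) <= delta_T:      # Step 5: stop when change <= delta_T
            return T_new
        T = T_new
```

With delta_T = 0 the loop stops only when two successive thresholds agree exactly, which is the setting used in the example that follows.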
a b c
FIGURE 10.35 (a) Noisy fingerprint. (b) Histogram. (c) Segmented result using a global threshold (thin image border added for clarity). (Original image courtesy of the National Institute of Standards and Technology.)

EXAMPLE 10.13: Global thresholding.
Figure 10.35 shows an example of segmentation using the preceding iterative algorithm. Figure 10.35(a) is the original image and Fig. 10.35(b) is the image histogram, showing a distinct valley. Application of the basic global algorithm resulted in the threshold T = 125.4 after three iterations, starting with T equal to the average intensity of the image, and using ΔT = 0. Figure 10.35(c) shows the result obtained using T = 125 to segment the original image. As expected from the clear separation of modes in the histogram, the segmentation between object and background was perfect.

OPTIMUM GLOBAL THRESHOLDING USING OTSU'S METHOD

Thresholding may be viewed as a statistical-decision theory problem whose objective is to minimize the average error incurred in assigning pixels to two or more groups (also called classes). This problem is known to have an elegant closed-form solution known as the Bayes decision function (see Section 12.4). The solution is based on only two parameters: the probability density function (PDF) of the intensity levels of each class, and the probability that each class occurs in a given application. Unfortunately, estimating PDFs is not a trivial matter, so the problem usually is simplified by making workable assumptions about the form of the PDFs, such as assuming that they are Gaussian functions. Even with simplifications, the process of implementing solutions using these assumptions can be complex and not always well-suited for real-time applications.
   The approach in the following discussion, called Otsu's method (Otsu [1979]), is an attractive alternative. The method is optimum in the sense that it maximizes the between-class variance, a measure of separability between classes that is developed below. In what follows, L denotes the number of distinct intensity levels in the image, and pi, i = 0, 1, 2, …, L − 1, denotes the normalized histogram; that is, pi is the fraction of image pixels having intensity level i.
   Now, suppose that we select a threshold T(k) = k, 0 < k < L − 1, and use it to threshold the input image into two classes, c1 and c2, where c1 consists of all the pixels in the image with intensity values in the range [0, k] and c2 consists of the pixels with values in the range [k + 1, L − 1]. Using this threshold, the probability, P1(k), that a pixel is assigned to (i.e., thresholded into) class c1 is given by the cumulative sum

                    P1(k) = Σ_{i=0}^{k} p_i                              (10-49)

Viewed another way, this is the probability of class c1 occurring. For example, if we set k = 0, the probability of class c1 having any pixels assigned to it is zero. Similarly, the probability of class c2 occurring is

                    P2(k) = Σ_{i=k+1}^{L−1} p_i = 1 − P1(k)              (10-50)

From Eq. (3-25), the mean intensity value of the pixels in c1 is

                    m1(k) = Σ_{i=0}^{k} i P(i | c1) = Σ_{i=0}^{k} i P(c1 | i) P(i)/P(c1)
                          = [1/P1(k)] Σ_{i=0}^{k} i p_i                  (10-51)

where P1(k) is given by Eq. (10-49). The term P(i | c1) in Eq. (10-51) is the probability of intensity value i, given that i comes from class c1. The rightmost term in the first line of the equation follows from Bayes' formula:

                    P(A | B) = P(B | A) P(A)/P(B)

The second line follows from the fact that P(c1 | i), the probability of c1 given i, is 1 because we are dealing only with values of i from class c1. Also, P(i) is the probability of the ith value, which is the ith component of the histogram, pi. Finally, P(c1) is the probability of class c1 which, from Eq. (10-49), is equal to P1(k).
                                            Similarly, the mean intensity value of the pixels assigned to class c2 is                                                                           The first line of this equation follows from Eqs. (10-55), (10-56), and (10-59). The
                                                                                                                                                                                                second line follows from Eqs. (10-50) through (10-54). This form is slightly more
                                                                                         L −1                                                                                                   efficient computationally because the global mean, mG , is computed only once, so
m_2(k) = \sum_{i=k+1}^{L-1} i\,P(i \mid c_2) = \frac{1}{P_2(k)} \sum_{i=k+1}^{L-1} i\,p_i    (10-52)

The cumulative mean (average intensity) up to level k is given by

m(k) = \sum_{i=0}^{k} i\,p_i    (10-53)

and the average intensity of the entire image (i.e., the global mean) is given by

m_G = \sum_{i=0}^{L-1} i\,p_i    (10-54)

The validity of the following two equations can be verified by direct substitution of the preceding results:

P_1 m_1 + P_2 m_2 = m_G    (10-55)

and

P_1 + P_2 = 1    (10-56)

Thus, only two parameters, m_1 and P_1, need to be computed for any value of k.

The first line in Eq. (10-60) indicates that the farther the two means m_1 and m_2 are from each other, the larger σ_B² will be, implying that the between-class variance is a measure of separability between classes. Because σ_G² is a constant, it follows that η also is a measure of separability, and maximizing this metric is equivalent to maximizing σ_B². The objective, then, is to determine the threshold value, k, that maximizes the between-class variance, as stated earlier. Note that Eq. (10-57) assumes implicitly that σ_G² > 0. This variance can be zero only when all the intensity levels in the image are the same, which implies the existence of only one class of pixels. This in turn means that η = 0 for a constant image, because the separability of a single class from itself is zero.

Reintroducing k, we have the final results:

\eta(k) = \frac{\sigma_B^2(k)}{\sigma_G^2}    (10-61)

and

\sigma_B^2(k) = \frac{[m_G P_1(k) - m(k)]^2}{P_1(k)[1 - P_1(k)]}    (10-62)

Then, the optimum threshold is the value, k*, that maximizes σ_B²(k):

\sigma_B^2(k^*) = \max_{0 \le k \le L-1} \sigma_B^2(k)
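To make these results concrete, the following minimal NumPy sketch computes k* and η(k*) from an L-bin image histogram using Eqs. (10-53), (10-54), (10-61), and (10-62). The function name otsu_threshold and the use of NumPy are our own illustrative choices, not notation from the text.

import numpy as np

def otsu_threshold(hist):
    # Normalized histogram: p_i = n_i / MN.
    p = hist.astype(np.float64) / hist.sum()
    L = p.size
    i = np.arange(L)
    P1 = np.cumsum(p)                       # class-1 probability P1(k)
    m = np.cumsum(i * p)                    # cumulative mean m(k), Eq. (10-53)
    mG = m[-1]                              # global mean m_G, Eq. (10-54)
    denom = P1 * (1.0 - P1)
    sigma_B2 = np.zeros(L)
    valid = denom > 0                       # sigma_B2 is undefined for empty classes
    sigma_B2[valid] = (mG * P1[valid] - m[valid])**2 / denom[valid]   # Eq. (10-62)
    k_star = int(np.argmax(sigma_B2))       # optimum threshold k*
    sigma_G2 = np.sum((i - mG)**2 * p)      # global variance
    eta = sigma_B2[k_star] / sigma_G2 if sigma_G2 > 0 else 0.0        # Eq. (10-61)
    return k_star, eta

Note that the sketch returns η = 0 for a constant image (σ_G² = 0), consistent with the discussion above.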
EXAMPLE 10.14 : Optimum global thresholding using Otsu's method.

Figure 10.36(a) shows an optical microscope image of polymersome cells. These are cells artificially engineered using polymers. They are invisible to the human immune system and can be used, for example, to deliver medication to targeted regions of the body. Figure 10.36(b) shows the image histogram. The objective of this example is to segment the molecules from the background. Figure 10.36(c) is the result of using the basic global thresholding algorithm discussed earlier. Because the histogram has no distinct valleys and the intensity difference between the background and objects is small, the algorithm failed to achieve the desired segmentation. Figure 10.36(d) shows the result obtained using Otsu's method. This result obviously is superior to Fig. 10.36(c). The threshold value computed by the basic algorithm was 169, while the threshold computed by Otsu's method was 182, which is closer to the lighter areas in the image defining the cells. The separability measure η* was 0.467.

As a point of interest, applying Otsu's method to the fingerprint image in Example 10.13 yielded a threshold of 125 and a separability measure of 0.944. The threshold is identical to the value (rounded to the nearest integer) obtained with the basic algorithm. This is not unexpected, given the nature of the histogram. In fact, the separability measure is high because of the relatively large separation between modes and the deep valley between them.

USING IMAGE SMOOTHING TO IMPROVE GLOBAL THRESHOLDING

As illustrated in Fig. 10.33, noise can turn a simple thresholding problem into an unsolvable one. When noise cannot be reduced at the source, and thresholding is the preferred segmentation method, a technique that often enhances performance is to smooth the image prior to thresholding. We illustrate this approach with an example.

Figure 10.37(a) is the image from Fig. 10.33(c), Fig. 10.37(b) shows its histogram, and Fig. 10.37(c) is the image thresholded using Otsu's method. Every black point in the white region and every white point in the black region is a thresholding error, so the segmentation was highly unsuccessful. Figure 10.37(d) shows the result of smoothing the noisy image with an averaging kernel of size 5 × 5 (the image is of size 651 × 814 pixels), and Fig. 10.37(e) is its histogram. The improvement in the shape of the histogram as a result of smoothing is evident, and we would expect thresholding of the smoothed image to be nearly perfect. Figure 10.37(f) shows this to be the case. The slight distortion of the boundary between object and background in the segmented, smoothed image was caused by the blurring of the boundary. In fact, the more aggressively we smooth an image, the more boundary errors we should anticipate in the segmented result.
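As a minimal sketch of this smoothing-plus-thresholding pipeline (assuming an 8-bit image and the 5 × 5 averaging kernel used above; smooth_then_otsu is a hypothetical name, and otsu_threshold is the sketch given earlier):

import numpy as np
from scipy import ndimage

def smooth_then_otsu(f):
    # Smooth with a 5 x 5 averaging (box) kernel, then threshold with Otsu.
    fs = ndimage.uniform_filter(f.astype(np.float64), size=5)
    hist, _ = np.histogram(fs, bins=256, range=(0, 256))
    k_star, _ = otsu_threshold(hist)
    return (fs > k_star).astype(np.uint8)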
FIGURE 10.37 (a) Noisy image from Fig. 10.33(c) and (b) its histogram. (c) Result obtained using Otsu's method. (d) Noisy image smoothed using a 5 × 5 averaging kernel and (e) its histogram. (f) Result of thresholding using Otsu's method.

FIGURE 10.38 (a) Noisy image and (b) its histogram. (c) Result obtained using Otsu's method. (d) Noisy image smoothed using a 5 × 5 averaging kernel and (e) its histogram. (f) Result of thresholding using Otsu's method. Thresholding failed in both cases to extract the object of interest. (See Fig. 10.39 for a better solution.)
Next, we investigate the effect of severely reducing the size of the foreground region with respect to the background. Figure 10.38(a) shows the result. The noise in this image is additive Gaussian noise with zero mean and a standard deviation of 10 intensity levels (as opposed to 50 in the previous example). As Fig. 10.38(b) shows, the histogram has no clear valley, so we would expect segmentation to fail, a fact that is confirmed by the result in Fig. 10.38(c). Figure 10.38(d) shows the image smoothed with an averaging kernel of size 5 × 5, and Fig. 10.38(e) is the corresponding histogram. As expected, the net effect was to reduce the spread of the histogram, but the distribution still is unimodal. As Fig. 10.38(f) shows, segmentation failed again. The reason for the failure can be traced to the fact that the region is so small that its contribution to the histogram is insignificant compared to the intensity spread caused by noise. In situations such as this, the approach discussed in the following section is more likely to succeed.

USING EDGES TO IMPROVE GLOBAL THRESHOLDING

Based on the discussion thus far, we conclude that the chances of finding a "good" threshold are enhanced considerably if the histogram peaks are tall, narrow, symmetric, and separated by deep valleys. One approach for improving the shape of histograms is to consider only those pixels that lie on or near the edges between objects and the background. An immediate and obvious improvement is that histograms would be less dependent on the relative sizes of objects and background. For instance, the histogram of an image composed of a small object on a large background area (or vice versa) would be dominated by a large peak because of the high concentration of one type of pixels. We saw in Fig. 10.38 that this can lead to failure in thresholding.

If only the pixels on or near the edges between objects and background were used, the resulting histogram would have peaks of approximately the same height. In addition, the probability that any of those pixels lies on an object would be approximately equal to the probability that it lies on the background, thus improving the symmetry of the histogram modes. Finally, as indicated in the following paragraph, using pixels that satisfy some simple measures based on gradient and Laplacian operators has a tendency to deepen the valley between histogram peaks.

The approach just discussed assumes that the edges between objects and background are known. This information clearly is not available during segmentation, as finding a division between objects and background is precisely what segmentation aims to do. However, an indication of whether a pixel is on an edge may be obtained by computing its gradient or Laplacian. For example, the average value of the Laplacian is 0 at the transition of an edge (see Fig. 10.10), so the valleys of
                                        histograms formed from the pixels selected by a Laplacian criterion can be expected
                                        to be sparsely populated. This property tends to produce the desirable deep valleys
                                        discussed above. In practice, comparable results typically are obtained using either
                                        the gradient or Laplacian images, with the latter being favored because it is compu-
                                        tationally more attractive and is also created using an isotropic edge detector.
The preceding discussion is summarized in the following algorithm, where f(x, y) is the input image:

1. Compute an edge image as either the magnitude of the gradient, or the absolute value of the Laplacian, of f(x, y), using any of the methods in Section 10.2.
2. Specify a threshold value, T.
3. Threshold the image from Step 1 using T from Step 2 to produce a binary image, g_T(x, y). This image is used as a mask image in the following step to select pixels from f(x, y) corresponding to "strong" edge pixels in the mask.
4. Compute a histogram using only the pixels in f(x, y) that correspond to the locations of the 1-valued pixels in g_T(x, y).
5. Use the histogram from Step 4 to segment f(x, y) globally using, for example, Otsu's method.

It is possible to modify this algorithm so that both the magnitude of the gradient and the absolute value of the Laplacian images are used. In this case, we would specify a threshold for each image and form the logical OR of the two results to obtain the marker image. This approach is useful when more control is desired over the points deemed to be valid edge points.

If T is set to any value less than the minimum value of the edge image then, according to Eq. (10-46), g_T(x, y) will consist of all 1's, implying that all pixels of f(x, y) will be used to compute the image histogram. In this case, the preceding algorithm becomes global thresholding using the histogram of the original image. It is customary to specify the value of T to correspond to a percentile, which typically is set high (e.g., in the high 90's) so that few pixels in the gradient/Laplacian image will be used in the computation. (The nth percentile is the smallest number that is greater than n% of the numbers in a given set. For example, if you received a 95 on a test, and this score was greater than the scores of 85% of all the students taking the test, then you would be in the 85th percentile with respect to the test scores.) The following examples illustrate the concepts just discussed. The first example uses the gradient, and the second uses the Laplacian. Similar results can be obtained in both examples using either approach. The important issue is to generate a suitable derivative image.

EXAMPLE 10.15 : Using edge information based on the gradient to improve global thresholding.

Figures 10.39(a) and (b) show the image and histogram from Fig. 10.38. You saw that this image could not be segmented by smoothing followed by thresholding. The objective of this example is to solve the problem using edge information. Figure 10.39(c) is the mask image, g_T(x, y), formed as the gradient magnitude image thresholded at the 99.7 percentile. Figure 10.39(d) is the image formed by multiplying the mask by the input image. Figure 10.39(e) is the histogram of the nonzero elements in Fig. 10.39(d). Note that this histogram has the important features discussed earlier; that is, it has reasonably symmetrical modes separated by a deep valley. Thus, while the histogram of the original noisy image offered no hope for successful thresholding, the histogram in Fig. 10.39(e) indicates that thresholding of the small object from the background is indeed possible. The result in Fig. 10.39(f) shows that this is the case. This image was generated using Otsu's method [to obtain a threshold based on the histogram in Fig. 10.39(e)], and then applying the Otsu threshold globally to the noisy image in Fig. 10.39(a). The result is nearly perfect.

FIGURE 10.39 (a) Noisy image from Fig. 10.38(a) and (b) its histogram. (c) Mask image formed as the gradient magnitude image thresholded at the 99.7 percentile. (d) Image formed as the product of (a) and (c). (e) Histogram of the nonzero pixels in the image in (d). (f) Result of segmenting image (a) with the Otsu threshold based on the histogram in (e). The threshold was 134, which is approximately midway between the peaks in this histogram.

EXAMPLE 10.16 : Using edge information based on the Laplacian to improve global thresholding.

In this example, we consider a more complex thresholding problem. Figure 10.40(a) shows an 8-bit image of yeast cells for which we want to use global thresholding to obtain the regions corresponding to the bright spots. As a starting point, Fig. 10.40(b) shows the image histogram, and Fig. 10.40(c) is the result obtained using Otsu's method directly on the image, based on the histogram shown. We see that Otsu's method failed to achieve the original objective of detecting the bright spots. Although the method was able to isolate some of the cell regions themselves, several of the segmented regions on the right were actually joined. The threshold computed by the Otsu method was 42, and the separability measure was 0.636.

Figure 10.40(d) shows the mask image g_T(x, y) obtained by computing the absolute value of the Laplacian image, then thresholding it with T set to 115 on an intensity scale in the range [0, 255]. This value of T corresponds approximately to the 99.5 percentile of the values in the absolute Laplacian image, so thresholding at this level results in a sparse set of pixels, as Fig. 10.40(d) shows. Note in this image how the points cluster near the edges of the bright spots, as expected from the preceding discussion. Figure 10.40(e) is the histogram of the nonzero pixels in the product of (a) and (d). Finally, Fig. 10.40(f) shows the result of globally segmenting the original image using Otsu's method based on the histogram in Fig. 10.40(e). This result agrees with the locations of the bright spots in the image. The threshold computed by the Otsu method was 115, and the separability measure was 0.762, both of which
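A compact sketch of the five-step procedure, using the gradient-magnitude option of Step 1 and a percentile-based T, is given below. The function name edge_guided_otsu, the Sobel-based gradient, and the 8-bit assumption are our own illustrative choices; otsu_threshold is the sketch given earlier in this section.

import numpy as np
from scipy import ndimage

def edge_guided_otsu(f, percentile=99.7):
    f = f.astype(np.float64)
    # Step 1: edge image as the gradient magnitude (Sobel derivatives).
    gx = ndimage.sobel(f, axis=1)
    gy = ndimage.sobel(f, axis=0)
    edge = np.hypot(gx, gy)
    # Steps 2-3: threshold the edge image at a high percentile to form the mask.
    T = np.percentile(edge, percentile)
    mask = edge > T
    # Step 4: histogram of f restricted to the "strong" edge locations.
    hist, _ = np.histogram(f[mask], bins=256, range=(0, 256))
    # Step 5: segment f globally with the Otsu threshold from that histogram.
    k_star, _ = otsu_threshold(hist)
    return (f > k_star).astype(np.uint8)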
FIGURE 10.40 (a) Image of yeast cells. (b) Histogram of (a). (c) Segmentation of (a) with Otsu's method using the histogram in (b). (d) Mask image formed by thresholding the absolute Laplacian image. (e) Histogram of the nonzero pixels in the product of (a) and (d). (f) Original image thresholded using Otsu's method based on the histogram in (e). (Original image courtesy of Professor Susan L. Forsburg, University of Southern California.)

By varying the percentile at which the threshold is set, we can even improve the segmentation of the complete cell regions. For example, Fig. 10.41 shows the result obtained using the same procedure as in the previous paragraph, but with the threshold set at 55, which is approximately 5% of the maximum value of the absolute Laplacian image. This value is at the 53.9 percentile of the values in that image. This result clearly is superior to the result in Fig. 10.40(c), obtained using Otsu's method with the histogram of the original image.

FIGURE 10.41 Image in Fig. 10.40(a) segmented using the same procedure as explained in Figs. 10.40(d) through (f), but using a lower value to threshold the absolute Laplacian image.

MULTIPLE THRESHOLDS

Thus far, we have focused attention on image segmentation using a single global threshold. Otsu's method can be extended to an arbitrary number of thresholds. For K classes, C_1, C_2, …, C_K, the between-class variance generalizes to

\sigma_B^2 = \sum_{k=1}^{K} P_k (m_k - m_G)^2    (10-66)

where P_k = \sum_{i \in C_k} p_i is the probability of class C_k, and m_k is the mean intensity of the pixels in that class:

m_k = \frac{1}{P_k} \sum_{i \in C_k} i\,p_i    (10-68)

As before, m_G is the global mean given in Eq. (10-54). The K classes are separated by K − 1 thresholds whose values, k_1*, k_2*, …, k_{K−1}*, are the values that maximize Eq. (10-66):

\sigma_B^2(k_1^*, k_2^*, \ldots, k_{K-1}^*) = \max_{0 < k_1 < k_2 < \cdots < k_{K-1} < L-1} \sigma_B^2(k_1, k_2, \ldots, k_{K-1})    (10-69)

Although this result is applicable to an arbitrary number of classes, it begins to lose meaning as the number of classes increases, because we are dealing with only one variable (intensity). In fact, the between-class variance usually is cast in terms of multiple variables expressed as vectors (Fukunaga [1972]). In practice, using multiple global thresholds is considered a viable approach when there is reason to believe that the problem can be solved effectively with two thresholds. Applications that require more than two thresholds generally are solved using more than just intensity values. Instead, the approach is to use additional descriptors (e.g., color), and the application is cast as a pattern recognition problem, as you will learn shortly in the discussion on multivariable thresholding.
m_2 = \frac{1}{P_2} \sum_{i=k_1+1}^{k_2} i\,p_i \quad \text{and} \quad m_3 = \frac{1}{P_3} \sum_{i=k_2+1}^{L-1} i\,p_i    (10-72)

As in Eqs. (10-55) and (10-56), the following relationships hold:

P_1 m_1 + P_2 m_2 + P_3 m_3 = m_G    (10-73)

and

P_1 + P_2 + P_3 = 1    (10-74)

We see from Eqs. (10-71) and (10-72) that the P's and m's, and therefore σ_B², are functions of k_1 and k_2. The two optimum threshold values, k_1* and k_2*, are the values that maximize σ_B²(k_1, k_2). That is, as indicated in Eq. (10-69), we find the optimum thresholds by finding

\sigma_B^2(k_1^*, k_2^*) = \max_{0 < k_1 < k_2 < L-1} \sigma_B^2(k_1, k_2)    (10-75)

The procedure starts by selecting the first value of k_1 (that value is 1 because looking for a threshold at 0 intensity makes no sense; also, keep in mind that the increment values are integers because we are dealing with integer intensity values). Next, k_2 is incremented through all its values greater than k_1 and less than L − 1 (i.e., k_2 = k_1 + 1, …, L − 2). Then, k_1 is incremented to its next value, and k_2 is incremented again through all its values greater than k_1. This procedure is repeated until k_1 = L − 3. The result is a 2-D array, σ_B²(k_1, k_2), and the last step is to look for the maximum value in this array. The values of k_1 and k_2 corresponding to that maximum are the optimum thresholds, k_1* and k_2*.

EXAMPLE 10.17 : Multiple global thresholding.

Figure 10.42(a) shows an image of an iceberg. The objective of this example is to segment the image into three regions: the dark background, the illuminated area of the iceberg, and the area in shadows. It is evident from the image histogram in Fig. 10.42(b) that two thresholds are required to solve this problem. The procedure discussed above resulted in the thresholds k_1* = 80 and k_2* = 177, which we note from Fig. 10.42(b) are near the centers of the two histogram valleys. Figure 10.42(c) is the segmentation that resulted using these two thresholds in Eq. (10-76). The separability measure was 0.954. The principal reason this example worked out so well can be traced to the histogram having three distinct modes separated by reasonably wide, deep valleys. But we can do even better using superpixels, as you will see in Section 10.5.

FIGURE 10.42 (a) Image of an iceberg. (b) Histogram. (c) Image segmented into three regions using dual Otsu thresholds. (Original image courtesy of NOAA.)
VARIABLE THRESHOLDING

As discussed earlier in this section, factors such as noise and nonuniform illumination play a major role in the performance of a thresholding algorithm. We showed that image smoothing and the use of edge information can help significantly. However, sometimes this type of preprocessing is either impractical or ineffective in improving the situation, to the point where the problem cannot be solved by any of the thresholding methods discussed thus far. In such situations, the next level of thresholding complexity involves variable thresholding, as we will illustrate in the following discussion.

Variable Thresholding Based on Local Image Properties

A basic approach to variable thresholding is to compute a threshold at every point, (x, y), in the image, based on one or more specified properties in a neighborhood of (x, y). Although this may seem like a laborious process, modern algorithms and hardware allow for fast neighborhood processing, especially for common functions such as logical and arithmetic operations.

We illustrate the approach using the mean and standard deviation of the pixel values in a neighborhood of every point in an image. These two quantities are useful for determining local thresholds because, as you know from Chapter 3, they are descriptors of average intensity and contrast. Let m_xy and σ_xy denote the mean and standard deviation of the set of pixel values in a neighborhood, S_xy, centered at coordinates (x, y) in an image (see Section 3.3 regarding computation of the local mean and standard deviation; we have simplified the notation slightly from the form used in Eqs. (3-27) and (3-28) by letting the subscript xy imply a neighborhood S centered at coordinates (x, y)). The following are common forms of variable thresholds based on these local image properties:

T_{xy} = a\sigma_{xy} + b m_{xy}    (10-78)

where a and b are nonnegative constants, and

T_{xy} = a\sigma_{xy} + b m_G    (10-79)

where m_G is the global image mean. Note that T_xy is a threshold array of the same size as the image from which it was obtained; the threshold at a location (x, y) in the array is used to segment the value of the image at that location. The segmented image is computed as

g(x, y) = \begin{cases} 1 & \text{if } f(x, y) > T_{xy} \\ 0 & \text{if } f(x, y) \le T_{xy} \end{cases}    (10-80)

where f(x, y) is the input image. This equation is evaluated for all pixel locations in the image, and a different threshold is computed at each location (x, y) using the pixels in the neighborhood S_xy.

Significant power (with a modest increase in computation) can be added to variable thresholding by using predicates based on the parameters computed in the neighborhood of a point (x, y):

g(x, y) = \begin{cases} 1 & \text{if } Q(\text{local parameters}) \text{ is TRUE} \\ 0 & \text{otherwise} \end{cases}    (10-81)

where Q is a predicate based on parameters computed using the pixels in neighborhood S_xy. For example, consider the following predicate, Q(σ_xy, m_xy), based on the local mean and standard deviation:

Q(\sigma_{xy}, m_{xy}) = \begin{cases} \text{TRUE} & \text{if } f(x, y) > a\sigma_{xy} \text{ AND } f(x, y) > b m_{xy} \\ \text{FALSE} & \text{otherwise} \end{cases}    (10-82)

Note that Eq. (10-80) is a special case of Eq. (10-81), obtained by letting Q be TRUE if f(x, y) > T_xy and FALSE otherwise. In this case, the predicate is based simply on the intensity at a point.

FIGURE 10.43 (a) Image from Fig. 10.40. (b) Image segmented using the dual thresholding approach given by Eq. (10-76). (c) Image of local standard deviations. (d) Result obtained using local thresholding.

EXAMPLE 10.18 : Variable thresholding based on local image properties.

Figure 10.43(a) shows the yeast image from Example 10.16. This image has three predominant intensity levels, so it is reasonable to assume that perhaps dual thresholding could be a good segmentation approach. Figure 10.43(b) is the result of using the dual thresholding method summarized in Eq. (10-76). As the figure shows, it was possible to isolate the bright areas from the background, but the mid-gray regions on the right side of the image were not segmented (i.e., separated) properly. To illustrate the use
of local thresholding, we computed the local standard deviation σ_xy for all (x, y) in the input image using
            a neighborhood of size 3 × 3. Figure 10.43(c) shows the result. Note how the faint outer lines correctly
            delineate the boundaries of the cells. Next, we formed a predicate of the form shown in Eq. (10-82), but
            using the global mean instead of mxy . Choosing the global mean generally gives better results when the
            background is nearly constant and all the object intensities are above or below the background intensity.
            The values a = 30 and b = 1.5 were used to complete the specification of the predicate (these values
            were determined experimentally, as is usually the case in applications such as this). The image was then
            segmented using Eq. (10-82). As Fig. 10.43(d) shows, the segmentation was quite successful. Note in par-
            ticular that all the outer regions were segmented properly, and that most of the inner, brighter regions
            were isolated correctly.
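A minimal sketch of this example's predicate-based segmentation, using the values a = 30 and b = 1.5 and the global mean in place of m_xy as described above, is given below; the function name local_property_threshold and the SciPy-based local statistics are our own illustrative choices.

import numpy as np
from scipy import ndimage

def local_property_threshold(f, a=30.0, b=1.5, size=3):
    f = f.astype(np.float64)
    # Local mean and standard deviation over a size-by-size neighborhood S_xy.
    mean = ndimage.uniform_filter(f, size=size)
    mean_sq = ndimage.uniform_filter(f * f, size=size)
    sigma = np.sqrt(np.maximum(mean_sq - mean**2, 0.0))
    mG = f.mean()                           # global mean, Eq. (10-54)
    # Predicate of Eq. (10-82) with m_G in place of m_xy:
    # TRUE where f > a*sigma_xy AND f > b*m_G.
    g = (f > a * sigma) & (f > b * mG)
    return g.astype(np.uint8)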
FIGURE 10.44 (a) Text image corrupted by spot shading. (b) Result of global thresholding using Otsu's method. (c) Result of local thresholding using moving averages.

Variable Thresholding Based on Moving Averages

A special case of the variable thresholding method discussed in the previous section is based on computing a moving average along scan lines of an image. This implementation is useful in applications such as document processing, where speed is a fundamental requirement. The scanning typically is carried out line by line in a zigzag pattern to reduce illumination bias. Let z_{k+1} denote the intensity of the point encountered in the scanning sequence at step k + 1. The moving average (mean intensity) at this new point is given by

m(k+1) = \frac{1}{n} \sum_{i=k+2-n}^{k+1} z_i, \qquad k \ge n-1
       = m(k) + \frac{1}{n}\left[ z_{k+1} - z_{k+1-n} \right], \qquad k \ge n    (10-83)

where n is the number of points used in computing the average, and m(1) = z_1. The conditions imposed on k ensure that all subscripts on z are positive; all this means is that n points must be available for computing the average. When k is less than the limits shown (this happens near the image borders), the averages are formed with the available image points. Because a moving average is computed for every point in the image, segmentation is implemented using Eq. (10-80) with T_xy = c m_xy, where c is a positive scalar, and m_xy is the moving average from Eq. (10-83) at point (x, y) in the input image.
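A minimal sketch of this scheme follows (moving_average_threshold is a hypothetical name). It approximates the zigzag scan by reversing alternate rows and, for simplicity, restarts the running average on each scan line rather than carrying it across line boundaries:

import numpy as np

def moving_average_threshold(f, n=20, c=0.5):
    g = np.zeros(f.shape, dtype=np.uint8)
    for r in range(f.shape[0]):
        z = f[r] if r % 2 == 0 else f[r, ::-1]     # zigzag scan direction
        z = z.astype(np.float64)
        m = np.empty_like(z)
        acc = 0.0
        for k in range(z.size):
            acc += z[k]
            if k >= n:
                acc -= z[k - n]                    # drop the oldest of the n points
            m[k] = acc / min(k + 1, n)             # running mean, Eq. (10-83)
        out = (z > c * m).astype(np.uint8)         # Eq. (10-80) with T = c*m
        g[r] = out if r % 2 == 0 else out[::-1]    # undo the reversal
    return g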
             EXAMPLE 10.19 : Document thresholding using moving averages.
Figure 10.44(a) shows an image of handwritten text shaded by a spot intensity pattern. This form of intensity shading is typical of images obtained using spot illumination (such as a photographic flash). Figure 10.44(b) is the result of segmentation using the Otsu global thresholding method. It is not unexpected that global thresholding could not overcome the intensity variation, because the method generally performs poorly when the areas of interest are embedded in a nonuniform illumination field. Figure 10.44(c) shows successful segmentation with local thresholding using moving averages. For images of written material, a rule of thumb is to let n equal five times the average stroke width. In this case, the average width was 4 pixels, so we let n = 20 in Eq. (10-83) and used c = 0.5.

As another illustration of the effectiveness of this segmentation approach, we used the same parameters as in the previous paragraph to segment the image in Fig. 10.45(a), which is corrupted by a sinusoidal intensity variation typical of the variations that may occur when the power supply in a document scanner is not properly grounded. As Figs. 10.45(b) and (c) show, the segmentation results are comparable to those in Fig. 10.44. Note that successful segmentation results were obtained in both cases using the same values for n and c, which shows the relative ruggedness of the approach. In general, thresholding based on moving averages works well when the objects of interest are small (or thin) with respect to the image size, a condition satisfied by images of typed or handwritten text.

10.4 SEGMENTATION BY REGION GROWING AND BY REGION SPLITTING AND MERGING

(You should review the terminology introduced in Section 10.1 before proceeding.) As we discussed in Section 10.1, the objective of segmentation is to partition an image into regions. In Section 10.2, we approached this problem by attempting to find boundaries between regions based on discontinuities in intensity levels, whereas in Section 10.3, segmentation was accomplished via thresholds based on the distribution of pixel properties, such as intensity values or color. In this section and in Sections 10.5 and 10.6, we discuss segmentation techniques that find the regions directly. In Section 10.7, we will discuss a method that finds the regions and their boundaries simultaneously.

REGION GROWING

As its name implies, region growing is a procedure that groups pixels or subregions into larger regions based on predefined criteria for growth. The basic approach is to start with a set of "seed" points, and from these grow regions by appending to each seed those neighboring pixels that have predefined properties similar to the seed (such as ranges of intensity or color).

Selecting a set of one or more starting points can often be based on the nature of the problem, as we show later in Example 10.20. When a priori information is not
                                                                                                                                                     candidates. However, Step 3 will reject the outer points because they are not 8-connected to the seeds.
                                                                                                                                                     In fact, as Fig. 10.46(i) shows, this step resulted in the correct segmentation, indicating that the use of
                                                                                                                                                     connectivity was a fundamental requirement in this case. Finally, note that in Step 4 we used the same
                                                                                                                                                     value for all the regions found by the algorithm. In this case, it was visually preferable to do so because
                                                                                                                                                     all those regions have the same physical meaning in this application—they all represent porosities.
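One common formulation of this kind of procedure, consistent with the connectivity test described above, thresholds the difference from the seed intensities and then keeps only the 8-connected components that contain a seed. The sketch below assumes seeds is a boolean seed mask and T an intensity-difference threshold; grow_regions and the use of the mean seed intensity are illustrative assumptions, not the book's listing.

import numpy as np
from scipy import ndimage

def grow_regions(f, seeds, T):
    f = f.astype(np.float64)
    # Similarity predicate: intensity within T of the mean seed intensity.
    candidates = np.abs(f - f[seeds].mean()) <= T
    # Label the 8-connected components of the candidate image.
    eight = np.ones((3, 3), dtype=bool)        # 8-connectivity
    labels, _ = ndimage.label(candidates, structure=eight)
    # Keep only components that actually contain a seed (the connectivity test).
    seed_labels = np.unique(labels[seeds & (labels > 0)])
    return np.isin(labels, seed_labels).astype(np.uint8)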
3. Stop when no further merging is possible.

Numerous variations of this basic theme are possible. For example, a significant simplification results if in Step 2 we allow merging of any two adjacent regions Rj and Rk if each one satisfies the predicate individually. This results in a much simpler (and faster) algorithm, because testing of the predicate is limited to individual quadregions. As the following example shows, this simplification is still capable of yielding good segmentation results.

EXAMPLE 10.21: Segmentation by region splitting and merging.

Figure 10.48(a) shows a 566 × 566 X-ray image of the Cygnus Loop supernova. The objective of this example is to segment (extract from the image) the "ring" of less dense matter surrounding the dense inner region. The region of interest has some obvious characteristics that should help in its segmentation. First, we note that the data in this region has a random nature, indicating that its standard deviation should be greater than the standard deviation of the background (which is near 0) and of the large central region, which is smooth. Similarly, the mean value (average intensity) of a region containing data from the outer ring should be greater than the mean of the darker background and less than the mean of the lighter central region. Thus, we should be able to segment the region of interest using the following predicate:

$$Q(R) = \begin{cases} \text{TRUE} & \text{if } \sigma_R > a \text{ AND } 0 < m_R < b \\ \text{FALSE} & \text{otherwise} \end{cases}$$

where $\sigma_R$ and $m_R$ are the standard deviation and mean of the region being processed, and a and b are nonnegative constants.

Analysis of several regions in the outer area of interest revealed that the mean intensity of pixels in those regions did not exceed 125, and the standard deviation was always greater than 10. Figures 10.48(b) through (d) show the results obtained using these values for a and b, and varying the minimum size allowed for the quadregions from 32 to 8. The pixels in a quadregion that satisfied the predicate were set to white; all others in that region were set to black. The best result in terms of capturing the shape of the outer region was obtained using quadregions of size 16 × 16. The small black squares in Fig. 10.48(d) are quadregions of size 8 × 8 whose pixels did not satisfy the predicate. Using smaller quadregions would result in increasing numbers of such black regions. Using quadregions larger than the ones illustrated here would result in a more "block-like" segmentation. Note that in all cases the segmented region (white pixels) was a connected region that completely separates the inner, smoother region from the background. Thus, the segmentation effectively partitioned the image into three distinct areas that correspond to the three principal features in the image: background, a dense region, and a sparse region. Using any of the white regions in Fig. 10.48 as a mask would make it a relatively simple task to extract these regions from the original image (see Problem 10.43). As in Example 10.20, these results could not have been obtained using edge- or threshold-based segmentation.
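The predicate of this example translates directly into code. The following minimal sketch (not the authors' implementation) evaluates Q over a tiling of fixed-size quadregions rather than performing a full quadtree split-and-merge; the values a = 10 and b = 125 follow the analysis above, and the function names are illustrative.

    import numpy as np

    def Q(region, a=10.0, b=125.0):
        # Predicate of Example 10.21: TRUE if the region's standard
        # deviation exceeds a and its mean lies strictly between 0 and b.
        return region.std() > a and 0 < region.mean() < b

    def label_quadregions(image, size=16, a=10.0, b=125.0):
        # Set every pixel of a quadregion to white (255) if the region
        # satisfies Q, and to black (0) otherwise.
        out = np.zeros_like(image, dtype=np.uint8)
        rows, cols = image.shape
        for r in range(0, rows - size + 1, size):
            for c in range(0, cols - size + 1, size):
                block = image[r:r + size, c:c + size].astype(float)
                out[r:r + size, c:c + size] = 255 if Q(block, a, b) else 0
        return out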
a b
c d
FIGURE 10.48 (a) Image of the Cygnus Loop supernova, taken in the X-ray band by NASA's Hubble Telescope. (b) through (d) Results of limiting the smallest allowed quadregion to sizes of 32 × 32, 16 × 16, and 8 × 8 pixels, respectively. (Original image courtesy of NASA.)

As used in the preceding example, properties based on the mean and standard deviation of pixel intensities in a region attempt to quantify the texture of the region (see Section 11.3 for a discussion on texture). The concept of texture segmentation is based on using measures of texture in the predicates. In other words, we can perform texture segmentation by any of the methods discussed in this section simply by specifying predicates based on texture content.

10.5 REGION SEGMENTATION USING CLUSTERING AND SUPERPIXELS

In this section, we discuss two related approaches to region segmentation. The first is a classical approach based on seeking clusters in data, related to such variables as intensity and color. The second approach is significantly more modern, and is based on using clustering to extract "superpixels" from an image.

REGION SEGMENTATION USING K-MEANS CLUSTERING

The basic idea behind the clustering approach used in this chapter is to partition a set, Q, of observations into a specified number, k, of clusters. In k-means clustering, each observation is assigned to the cluster with the nearest mean (hence the name of the method), and each mean is called the prototype of its cluster. A k-means algorithm is an iterative procedure that successively refines the means until convergence is achieved. (A more general form of clustering is unsupervised clustering, in which a clustering algorithm attempts to find a meaningful set of clusters in a given set of samples. We do not address this topic, as our focus in this brief introduction is only to illustrate how supervised clustering is used for image segmentation.)

Let {z1, z2, …, zQ} be a set of vector observations (samples). These vectors have the form
$$\mathbf{z} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} \tag{10-84}$$

In image segmentation, each component of a vector z represents a numerical pixel attribute. For example, if segmentation is based on just grayscale intensity, then z = z is a scalar representing the intensity of a pixel. If we are segmenting RGB color images, z typically is a 3-D vector, each component of which is the intensity of a pixel in one of the three primary color images, as we discussed in Chapter 6. The objective of k-means clustering is to partition the set Q of observations into k (k ≤ Q) disjoint cluster sets C = {C1, C2, …, Ck}, so that the following criterion of optimality is satisfied:†

$$\arg\min_{C} \left( \sum_{i=1}^{k} \sum_{\mathbf{z} \in C_i} \lVert \mathbf{z} - \mathbf{m}_i \rVert^2 \right) \tag{10-85}$$

where m_i is the mean vector (or centroid) of the samples in set Ci, and ‖·‖ is the vector norm of the argument. Typically, the Euclidean norm is used, so the term ‖z − m_i‖ is the familiar Euclidean distance from a sample in Ci to mean m_i. In words, this equation says that we are interested in finding the sets C = {C1, C2, …, Ck} such that the sum of the distances from each point in a set to the mean of that set is minimum.

Unfortunately, finding this minimum is an NP-hard problem for which no practical solution is known. As a result, a number of heuristic methods that attempt to find approximations to the minimum have been proposed over the years. In this section, we discuss what is generally considered to be the "standard" k-means algorithm, which is based on the Euclidean distance (see Section 2.6). Given a set {z1, z2, …, zQ} of vector observations and a specified value of k, the algorithm is as follows:

1. Initialize the algorithm: Specify an initial set of means, m_i(1), i = 1, 2, …, k. (These initial means are the initial cluster centers; they are also called seeds.)

2. Assign samples to clusters: Assign each sample to the cluster set whose mean is the closest (ties are resolved arbitrarily, but samples are assigned to only one cluster):

$$\mathbf{z}_q \rightarrow C_i \quad \text{if } \lVert \mathbf{z}_q - \mathbf{m}_i \rVert^2 < \lVert \mathbf{z}_q - \mathbf{m}_j \rVert^2, \;\; j = 1, 2, \ldots, k \; (j \neq i); \; q = 1, 2, \ldots, Q$$

3. Update the cluster centers (means):

$$\mathbf{m}_i = \frac{1}{|C_i|} \sum_{\mathbf{z} \in C_i} \mathbf{z}, \qquad i = 1, 2, \ldots, k$$

   where |Ci| is the number of samples in cluster set Ci.

4. Test for completion: Compute the Euclidean norms of the differences between the mean vectors in the current and previous steps. Compute the residual error, E, as the sum of the k norms. Stop if E ≤ T, where T is a specified, nonnegative threshold. Else, go back to Step 2.

When T = 0, this algorithm is known to converge in a finite number of iterations to a local minimum. It is not guaranteed to yield the global minimum required to minimize Eq. (10-85). The result at convergence does depend on the initial values chosen for m_i. An approach used frequently in data analysis is to specify the initial means as k randomly chosen samples from the given sample set, and to run the algorithm several times, with a new random set of initial samples each time. This is to test the "stability" of the solution. In image segmentation, the important issue is the value selected for k, because this determines the number of segmented regions; thus, multiple passes are rarely used.

† Remember, min_x(h(x)) is the minimum of h with respect to x, whereas arg min_x(h(x)) is the value (or values) of x at which h is minimum.

EXAMPLE 10.22: Using k-means clustering for segmentation.

Figure 10.49(a) shows an image of size 688 × 688 pixels, and Fig. 10.49(b) is the segmentation obtained using the k-means algorithm with k = 3. As you can see, the algorithm was able to extract all the meaningful regions of this image with high accuracy. For example, compare the quality of the characters in both images. It is important to realize that the entire segmentation was done by clustering of a single variable (intensity). Because k-means works with vector observations in general, its power to discriminate between regions increases as the number of components of vector z in Eq. (10-84) increases.

a b
FIGURE 10.49 (a) Image of size 688 × 688 pixels. (b) Image segmented using the k-means algorithm with k = 3.
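The four steps above map directly onto a few lines of code. The following is a minimal NumPy sketch under the stated conventions (random samples as seeds, Euclidean distances, residual error E as the stopping test); it is an illustration, not an optimized implementation. For an intensity-only segmentation such as the one in Example 10.22, the observations would be z = image.reshape(-1, 1).

    import numpy as np

    def kmeans(z, k, T=0.0, max_iter=100, seed=None):
        # z: (Q, n) array of vector observations; returns (means, labels).
        rng = np.random.default_rng(seed)
        # Step 1: initialize the means as k randomly chosen samples ("seeds").
        means = z[rng.choice(len(z), size=k, replace=False)].astype(float)
        for _ in range(max_iter):
            # Step 2: assign each sample to the cluster with the nearest mean.
            dists = np.linalg.norm(z[:, None, :] - means[None, :, :], axis=2)
            labels = np.argmin(dists, axis=1)
            # Step 3: update each mean as the centroid of its assigned samples.
            new_means = np.array([z[labels == i].mean(axis=0)
                                  if np.any(labels == i) else means[i]
                                  for i in range(k)])
            # Step 4: residual error E = sum of the k mean displacements.
            E = np.linalg.norm(new_means - means, axis=1).sum()
            means = new_means
            if E <= T:
                break
        return means, labels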
REGION SEGMENTATION USING SUPERPIXELS

The idea behind superpixels is to replace the standard pixel grid by grouping pixels into primitive regions that are more perceptually meaningful than individual pixels. The objectives are to lessen computational load, and to improve the performance of segmentation algorithms by reducing irrelevant detail. A simple example will help explain the basic approach of superpixel representations.

Figure 10.50(a) shows an image of size 600 × 800 (480,000) pixels containing various levels of detail that could be described verbally as: "This is an image of two large carved figures in the foreground, and at least three, much smaller, carved figures resting on a fence behind the large figures. The figures are on a beach, with the ocean and sky in the background." Figure 10.50(b) shows the same image represented by 4,000 superpixels and their boundaries (the boundaries are shown for reference—they are not part of the data), and Fig. 10.50(c) shows the superpixel image. One could argue that the level of detail in the superpixel image would lead to the same description as the original, but the former contains only 4,000 primitive units, as opposed to 480,000 in the original. Whether the superpixel representation is "adequate" depends on the application. If the objective is to describe the image at the level of detail mentioned above, then the answer is yes. On the other hand, if the objective is to detect imperfections at pixel-level resolutions, then the answer obviously is no. And there are applications, such as computerized medical diagnosis, in which approximate representations of any kind are not acceptable. Nevertheless, there are numerous application areas, such as image-database queries, autonomous navigation, and certain branches of robotics, in which economy of implementation and potential improvements in segmentation performance far outweigh any appreciable loss of image detail.

a b c
FIGURE 10.50 (a) Image of size 600 × 800 (480,000) pixels. (b) Image composed of 4,000 superpixels (the boundaries between superpixels, in white, are superimposed on the superpixel image for reference—the boundaries are not part of the data). (c) Superpixel image. (Original image courtesy of the U.S. National Park Service.) Figures 10.50(b) and (c) were obtained using a method to be discussed later in this section.

One important requirement of any superpixel representation is adherence to boundaries. This means that boundaries between regions of interest must be preserved in a superpixel image. We can see that this indeed is the case with the image in Fig. 10.50(c). Note, for example, how clear the boundaries between the figures and the background are. The same is true of the boundaries between the beach and the ocean, and between the ocean and the sky. Other important characteristics are the preservation of topological properties and, of course, computational efficiency. The superpixel algorithm discussed in this section meets these requirements.

As another illustration, we show the results of severely decreasing the number of superpixels to 1,000, 500, and 250. The results in Fig. 10.51 show a significant loss of detail compared to Fig. 10.50(a), but the first two images contain most of the detail relevant to the image description discussed earlier. A notable difference is that two of the three small carvings on the fence in the back were eliminated. The 250-element superpixel image even lost the third. However, the boundaries between the principal regions, as well as the basic topology of the images, were preserved.

FIGURE 10.51 Top row: Results of using 1,000, 500, and 250 superpixels in the representation of Fig. 10.50(a). As before, the boundaries between superpixels are superimposed on the images for reference. Bottom row: Superpixel images.

SLIC Superpixel Algorithm

In this section we discuss an algorithm for generating superpixels, called simple linear iterative clustering (SLIC). This algorithm, developed by Achanta et al. [2012], is conceptually simple, and has computational and other performance advantages over other superpixel techniques. SLIC is a modification of the k-means algorithm discussed in the previous section. SLIC observations typically use (but are not limited to) 5-dimensional vectors containing three color components and two spatial coordinates. (As you will learn in Chapter 11, vectors containing image attributes are called feature vectors.) For example, if we are using the RGB color system, the 5-dimensional vector associated with an image pixel has the form

$$\mathbf{z} = \begin{bmatrix} r \\ g \\ b \\ x \\ y \end{bmatrix} \tag{10-86}$$

where (r, g, b) are the three color components of a pixel, and (x, y) are its two spatial coordinates. Let n_sp denote the desired number of superpixels and let n_tp denote the total number of pixels in the image. The initial superpixel centers, m_i = [r_i g_i b_i x_i y_i]^T, i = 1, 2, …, n_sp, are obtained by sampling the image on a regular grid spaced s units apart. To generate superpixels approximately equal in size (i.e., area), the grid spacing interval is selected as s = [n_tp/n_sp]^{1/2}. The algorithm is as follows:
1. Initialize the algorithm: Compute the initial cluster centers, m_i, i = 1, 2, …, n_sp, by sampling the image at regular grid steps, s. Move the cluster centers to the lowest gradient position in a 3 × 3 neighborhood. For each pixel location, p, in the image, set a label L(p) = −1 and a distance d(p) = ∞.

2. Assign samples to cluster centers: For each cluster center m_i, i = 1, 2, …, n_sp, compute the distance, Di(p), between m_i and each pixel p in a 2s × 2s neighborhood about m_i. Then, for each p and i = 1, 2, …, n_sp, if Di < d(p), let d(p) = Di and L(p) = i.

3. Update the cluster centers: Let Ci denote the set of pixels in the image with label L(p) = i. Update m_i:

$$\mathbf{m}_i = \frac{1}{|C_i|} \sum_{\mathbf{z} \in C_i} \mathbf{z}, \qquad i = 1, 2, \ldots, n_{sp}$$

   where |Ci| is the number of pixels in set Ci, and the z's are given by Eq. (10-86).

4. Test for convergence: Compute the Euclidean norms of the differences between the mean vectors in the current and previous steps. Compute the residual error, E, as the sum of the n_sp norms. If E < T, where T is a specified nonnegative threshold, go to Step 5. Else, go back to Step 2.

5. Post-process the superpixel regions: Replace the pixels in each superpixel region, Ci, by their average value, m_i.

Note in Step 5 that superpixels end up as contiguous regions of constant value. The average value is not the only way to compute this constant, but it is the most widely used. For grayscale images, the average is just the average intensity of all the pixels in the region spanned by the superpixel. This algorithm is similar to the k-means algorithm in the previous section, with the exceptions that the distances, Di, are not specified as Euclidean distances (see below), and that these distances are computed for regions of size 2s × 2s, rather than for all the pixels in the image, thus reducing computation time significantly. In practice, SLIC convergence with respect to E can be achieved with fairly large values of T. For example, all results reported by Achanta et al. [2012] were obtained using T = 10.

The distance Di in Step 2 is a composite of a color distance, dc, and a spatial distance, ds. For RGB observations, the color distance is

$$d_c = \left[ (r_j - r_i)^2 + (g_j - g_i)^2 + (b_j - b_i)^2 \right]^{1/2} \tag{10-87}$$

and

$$d_s = \left[ (x_j - x_i)^2 + (y_j - y_i)^2 \right]^{1/2} \tag{10-88}$$

We then define D as the composite distance

$$D = \left[ \left( \frac{d_c}{d_{cm}} \right)^{2} + \left( \frac{d_s}{d_{sm}} \right)^{2} \right]^{1/2} \tag{10-89}$$

where d_cm and d_sm are the maximum expected values of d_c and d_s. The maximum spatial distance should correspond to the sampling interval; that is, d_sm = s = [n_tp/n_sp]^{1/2}. Determining the maximum color distance is not as straightforward, because these distances can vary significantly from cluster to cluster, and from image to image. A solution is to set d_cm to a constant c so that Eq. (10-89) becomes

$$D = \left[ \left( \frac{d_c}{c} \right)^{2} + \left( \frac{d_s}{s} \right)^{2} \right]^{1/2} \tag{10-90}$$

We can write this equation as

$$D = \left[ d_c^{2} + \left( \frac{d_s}{s} \right)^{2} c^{2} \right]^{1/2} \tag{10-91}$$

This is the distance measure used for each cluster in the algorithm. The constant c can be used to weigh the relative importance between color similarity and spatial proximity. When c is large, spatial proximity is more important, and the resulting superpixels are more compact. When c is small, the resulting superpixels adhere more tightly to image boundaries, but have less regular size and shape.

For grayscale images, as in Example 10.23 below, we use

$$d_c = \left[ (l_j - l_i)^2 \right]^{1/2} \tag{10-92}$$
www.EBooksWorld.ir www.EBooksWorld.ir
                                    in Eq. (10-91), where the l’s are intensity levels of the points for which the distance
                                    is being computed.
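As an illustration, the composite distance of Eq. (10-91) is a one-line computation once a cluster center and a pixel are expressed as the 5-D vectors of Eq. (10-86). This is a minimal sketch; the function name and the (r, g, b, x, y) vector layout are the conventions used above.

    import numpy as np

    def slic_distance(center, pixel, s, c):
        # Composite distance of Eq. (10-91) between a cluster center and a
        # pixel, each given as a 5-D vector (r, g, b, x, y).
        center = np.asarray(center, dtype=float)
        pixel = np.asarray(pixel, dtype=float)
        dc2 = np.sum((pixel[:3] - center[:3]) ** 2)   # squared color distance
        ds2 = np.sum((pixel[3:] - center[3:]) ** 2)   # squared spatial distance
        return np.sqrt(dc2 + (ds2 / s ** 2) * c ** 2)

    # Grid spacing for n_sp superpixels in an image of n_tp pixels,
    # so that superpixels are approximately equal in area:
    # s = (n_tp / n_sp) ** 0.5

Increasing c in this sketch makes the spatial term dominate which, as noted above, yields more compact superpixels.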
In 3-D, superpixels become supervoxels, which are handled by defining

$$d_s = \left[ (x_j - x_i)^2 + (y_j - y_i)^2 + (z_j - z_i)^2 \right]^{1/2} \tag{10-93}$$

where the z's are the coordinates of the third spatial dimension. We must also add the third spatial variable, z, to the vector in Eq. (10-86).
                                       Because no provision is made in the algorithm to enforce connectivity, it is pos-
                                    sible for isolated pixels to remain after convergence. These are assigned the label
                                    of the nearest cluster using a connected components algorithm (see Section 9.6).
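The connected-components cleanup can be sketched as follows. This is a simplified illustration (it is not the algorithm of Section 9.6 verbatim, and it assumes nonnegative integer labels): each label's largest connected fragment is kept, and every smaller fragment is reassigned the most common label adjacent to it.

    import numpy as np
    from scipy import ndimage

    def enforce_connectivity(labels):
        out = labels.copy()
        for lab in np.unique(labels):
            comp, n = ndimage.label(labels == lab)   # fragments of this label
            if n <= 1:
                continue
            sizes = np.bincount(comp.ravel())[1:]    # fragment sizes
            keep = 1 + int(np.argmax(sizes))         # largest fragment survives
            for frag in range(1, n + 1):
                if frag == keep:
                    continue
                mask = comp == frag
                # One-pixel ring around the fragment; adopt the most
                # common neighboring label.
                ring = ndimage.binary_dilation(mask) & ~mask
                if out[ring].size:
                    out[mask] = np.bincount(out[ring]).argmax()
        return out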
Although we explained the algorithm in the context of RGB color components, the
method is equally applicable to other color systems. In fact, other components of
                                    vector z in Eq. (10-86) (with the exception of the spatial variables) could be other
                                    real-valued feature values, provided that a meaningful distance measure can be
                                    defined for them.
The types of graphs in which we are interested are undirected graphs whose edges are further characterized by a matrix, W, whose element w(i, j) is a weight associated with the edge that connects nodes i and j. Because the graph is undirected, w(i, j) = w(j, i), which means that W is a symmetric matrix. The weights are selected to be proportional to one or more similarity measures between all pairs of nodes. A graph whose edges are associated with weights is called a weighted graph.

The essence of the material in this section is to represent an image to be segmented as a weighted, undirected graph, where the nodes of the graph are the pixels in the image, and an edge is formed between every pair of nodes. (Superpixels are also well suited for use as graph nodes; thus, when we refer in this section to "pixels" in an image, we are, by implication, also referring to superpixels.) The weight, w(i, j), of each edge is a function of the similarity between nodes i and j. We then seek to partition the nodes of the graph into disjoint subsets V1, V2, …, VK where, by some measure, the similarity among the nodes within a subset is high, and the similarity across the nodes of different subsets is low. The nodes of the partitioned subsets correspond to the regions in the segmented image.

Set V is partitioned into subsets by cutting the graph. A cut of a graph is a partition of V into two subsets A and B such that

$$A \cup B = V \quad \text{and} \quad A \cap B = \emptyset \tag{10-96}$$

where the cut is implemented by removing the edges connecting subgraphs A and B. There are two key aspects of using graph cuts for image segmentation: (1) how to associate a graph with an image; and (2) how to cut the graph in a way that makes sense in terms of partitioning the image into background and foreground (object) pixels. We address these two questions next.

Figure 10.53 shows a simplified approach for generating a graph from an image. The nodes of the graph correspond to the pixels in the image and, to keep the explanation simple, we allow edges only between adjacent pixels using 4-connectivity, which means that there are no diagonal edges linking the pixels. But keep in mind that, in general, edges are specified between every pair of pixels. The weights for the edges typically are formed from spatial relationships (for example, distance from the vertex pixel) and intensity measures (for example, texture and color), consistent with exhibiting similarity between pixels. In this simple example, we define the degree of similarity between two pixels as the inverse of the difference in their intensities. That is, for two nodes (pixels) ni and nj, the weight of the edge between them is

$$w(i, j) = \frac{1}{\lvert I(n_i) - I(n_j) \rvert + c}$$

where I(ni) and I(nj) are the intensities of the two nodes (pixels) and c is a constant included to prevent division by 0. Thus, the closer the values of intensity between adjacent pixels are, the larger the value of w will be.

For illustrative purposes, the thickness of each edge in Fig. 10.53 is shown proportional to the degree of similarity between the pixels that it connects (see Problem 10.44). As you can see in the figure, the edges between the dark pixels are stronger than the edges between dark and light pixels, and vice versa. Conceptually, segmentation is achieved by cutting the graph along its weak edges, as illustrated by the dashed line in Fig. 10.53(d). Figure 10.53(c) shows the segmented image.

a b
c d
FIGURE 10.53 (a) A 3 × 3 image. (b) A corresponding graph. (c) Segmented image. (d) Graph cut. (Diagram labels: Image; Node; Edge; Graph; Cut; Segmentation.)

Although the basic structure in Fig. 10.53 is the focus of the discussion in this section, we mention for completeness another common approach for constructing image graphs. Figure 10.54 shows the same graph as the one we just discussed, but here you see two additional nodes called the source and sink terminal nodes, respectively, each connected to all nodes in the graph via unidirectional links called t-links. The terminal nodes are not part of the image; their role, for example, is to associate with each pixel a probability that it is a background or foreground (object) pixel. The probabilities are the weights of the t-links. In Figs. 10.54(c) and (d), the thickness of each t-link is proportional to the value of the probability that the graph node to which it is connected is a foreground or background pixel (the thicknesses shown are such that the segmentation result would be the same as in Fig. 10.53). Which of the two nodes we call background or foreground is arbitrary.

MINIMUM GRAPH CUTS

Once an image has been expressed as a graph, the next step is to cut the graph into two or more subgraphs. The nodes (pixels) in each resulting subgraph correspond to a region in the segmented image. Approaches based on Fig. 10.54 rely on interpreting the graph as a flow network (of pipes, for example) and obtaining what is commonly referred to as a minimum graph cut. This formulation is based on the so-called Max-Flow, Min-Cut Theorem. This theorem states that, in a flow network, the maximum amount of flow passing from the source to the sink is equal to the minimum cut. This minimum cut is defined as the smallest total weight of the edges that, if removed, would disconnect the sink from the source:

$$\text{cut}(A, B) = \sum_{u \in A,\, v \in B} w(u, v) \tag{10-97}$$

where A and B satisfy Eq. (10-96).
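As a concrete illustration of Eqs. (10-96) and (10-97), the following sketch builds the 4-connected weighted graph just described, using w(i, j) = 1/(|I(ni) − I(nj)| + c), and evaluates the cut value of a given partition. The node numbering and function names are illustrative choices.

    import numpy as np

    def edge_weights_4conn(image, c=1.0):
        # w(i, j) = 1 / (|I(ni) - I(nj)| + c) for 4-connected pixel pairs.
        # Nodes are numbered in row-major order; returns {(i, j): weight}.
        rows, cols = image.shape
        w = {}
        for r in range(rows):
            for col in range(cols):
                for dr, dc in ((0, 1), (1, 0)):      # right and down neighbors
                    r2, c2 = r + dr, col + dc
                    if r2 < rows and c2 < cols:
                        diff = abs(float(image[r, col]) - float(image[r2, c2]))
                        w[(r * cols + col, r2 * cols + c2)] = 1.0 / (diff + c)
        return w

    def cut_value(w, A):
        # Eq. (10-97): total weight of edges with exactly one endpoint in A.
        A = set(A)
        return sum(wt for (i, j), wt in w.items() if (i in A) != (j in A))

For a 3 × 3 image such as the one in Fig. 10.53, a cut that separates the dark pixels from the light ones removes only weak (low-weight) edges, so its cut value is small.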
FIGURE 10.54 Image graph with source and sink terminal nodes. (Panel labels: Graph; Cut; Sink Terminal (Foreground).) The cut separates the graph into distinct regions.

The optimum partition of a graph is the one that minimizes this cut value. There is an exponential number of such partitions, which would present us with an intractable computational problem. However, efficient algorithms that run in polynomial time have been developed for solving max-flow problems. Therefore, based on the Max-Flow, Min-Cut Theorem, we can apply these algorithms to image segmentation, provided that we cast segmentation as a flow problem and select the weights for the edges and t-links such that minimum graph cuts will result in meaningful segmentations.

Although the min-cut approach offers an elegant solution, it can result in groupings that favor cutting small sets of isolated nodes in a graph, leading to improper segmentations. Figure 10.55 shows an example, in which the two regions of interest are characterized by the tightness of the pixel groupings. Meaningful edge weights for such data would be inversely proportional to the distance between points, so a min cut can favor severing a few isolated nodes rather than partitioning the two groupings based on their proximity, such as the partition shown in Fig. 10.55. The approach presented in this section, proposed by Shi and Malik [2000] (see also Hochbaum [2010]), is aimed at avoiding this type of behavior by redefining the concept of a cut.

Instead of looking at the total weight value of the edges that connect two partitions, the idea is to work with a measure of "disassociation" that computes the cost as a fraction of the total edge connections to all nodes in the graph. This measure, called the normalized cut (Ncut), is defined as

$$\text{Ncut}(A, B) = \frac{\text{cut}(A, B)}{\text{assoc}(A, V)} + \frac{\text{cut}(A, B)}{\text{assoc}(B, V)} \tag{10-98}$$

where cut(A, B) is given by Eq. (10-97) and

$$\text{assoc}(A, V) = \sum_{u \in A,\, z \in V} w(u, z) \tag{10-99}$$

is the sum of the weights of all the edges from the nodes of subgraph A to the nodes of the entire graph. Similarly,

$$\text{assoc}(B, V) = \sum_{v \in B,\, z \in V} w(v, z) \tag{10-100}$$

is the sum of the weights of the edges from all the nodes in B to the entire graph. In other words, assoc(A, V) measures the total connection weight from the nodes in A to the graph as a whole, and similarly for assoc(B, V).

By using Ncut(A, B) instead of cut(A, B), the cut that partitions isolated points will no longer have small values. You can see this, for example, by noting in Fig. 10.55 that if A is the single node shown, cut(A, B) and assoc(A, V) will have the same value. Thus, independently of how small cut(A, B) is, Ncut(A, B) will always be greater than 1.
                                    that reflect this property would be inversely proportional to the distance between
                                                                                                                                                                                  than or equal to 1, thus providing normalization for “pathological” cases such as this.
                                    pairs of points. But this would lead to weights that would be smaller for isolated
                                                                                                                                                                                      Based on similar concepts, we can define a measure for total normalized associa-
                                    points, resulting in min cuts such as the example in Fig. 10.55. In fact, any cut that
                                                                                                                                                                                  tion within graph partitions as
                                    partitions out individual points on the left of the figure will have a smaller cut value
                                    in Eq. (10-4) than a cut that properly partitions the points into two groups based on
www.EBooksWorld.ir www.EBooksWorld.ir
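To make Eqs. (10-97) through (10-100) concrete, the following minimal sketch evaluates the cut and the normalized cut for a small, made-up graph stored as a symmetric weight matrix; the graph and the partitions are illustrative only.

```python
import numpy as np

# Symmetric weight matrix of a small 5-node graph; W[i, j] is the weight
# of the edge between nodes i and j (0 means no edge). Values are made up.
W = np.array([[0.0, 0.9, 0.8, 0.1, 0.0],
              [0.9, 0.0, 0.7, 0.0, 0.1],
              [0.8, 0.7, 0.0, 0.2, 0.1],
              [0.1, 0.0, 0.2, 0.0, 0.9],
              [0.0, 0.1, 0.1, 0.9, 0.0]])

A, B = [0, 1, 2], [3, 4]                       # a candidate partition

cut_AB = W[np.ix_(A, B)].sum()                 # Eq. (10-97): weight across the cut
assoc_AV = W[A, :].sum()                       # Eq. (10-99): A to all nodes
assoc_BV = W[B, :].sum()                       # Eq. (10-100): B to all nodes

ncut = cut_AB / assoc_AV + cut_AB / assoc_BV   # Eq. (10-98)
print(f"cut = {cut_AB:.2f}  Ncut = {ncut:.3f}")

# An "isolated point" cut for comparison: Ncut penalizes it heavily,
# because cut(A, B) equals assoc(A, V) when A is a single node.
A1, B1 = [4], [0, 1, 2, 3]
cut1 = W[np.ix_(A1, B1)].sum()
print(cut1 / W[A1, :].sum() + cut1 / W[B1, :].sum())   # >= 1
```

Running this shows the balanced partition yielding a far smaller Ncut than the single-node cut, even though the raw cut value of the single node could be made arbitrarily small.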
Based on similar concepts, we can define a measure for the total normalized association within graph partitions as

$$\mathrm{Nassoc}(A,B)=\frac{\mathrm{assoc}(A,A)}{\mathrm{assoc}(A,V)}+\frac{\mathrm{assoc}(B,B)}{\mathrm{assoc}(B,V)}\tag{10-101}$$

where assoc(A, A) and assoc(B, B) are the total weights connecting the nodes within A and within B, respectively. It is not difficult to show (see Problem 10.46) that Ncut(A, B) = 2 − Nassoc(A, B), so minimizing the normalized cut is equivalent to maximizing the normalized association. Minimizing Ncut(A, B) exactly is computationally intractable, but the minimization can be approximated by solving the generalized eigenvalue formulation in Eq. (10-105).

Eq. (10-105) gives K eigenvalues and K eigenvectors, each eigenvector corresponding to one eigenvalue. The solution to our problem is the eigenvector corresponding to the second smallest eigenvalue. We can convert the preceding generalized eigenvalue formulation into a standard eigenvalue problem by rewriting Eq. (10-105) (see Problem 10.45).
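A brief sketch of the eigenvector computation, assuming Eq. (10-105) has the generalized form (D − W)y = λDy used in Shi and Malik's normalized-cuts formulation, where W is the matrix of edge weights w(u, z) and D is the diagonal matrix of its row sums; the 5-node graph from the previous sketch is reused:

```python
import numpy as np
from scipy.linalg import eigh

W = np.array([[0.0, 0.9, 0.8, 0.1, 0.0],      # same graph as the
              [0.9, 0.0, 0.7, 0.0, 0.1],      # previous sketch
              [0.8, 0.7, 0.0, 0.2, 0.1],
              [0.1, 0.0, 0.2, 0.0, 0.9],
              [0.0, 0.1, 0.1, 0.9, 0.0]])

D = np.diag(W.sum(axis=1))     # diagonal matrix of total node associations
L = D - W                      # (D - W) is the graph Laplacian

# Solve (D - W) y = lambda D y; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = eigh(L, D)
y = eigvecs[:, 1]              # eigenvector of the second smallest eigenvalue

partition = y > 0              # thresholding y bipartitions the nodes
print(partition)               # expect {0, 1, 2} separated from {3, 4}
```

Thresholding the second eigenvector at zero is the simplest way to turn the real-valued solution back into a two-way partition; thresholding at the median, or searching for the split that minimizes Ncut, are common refinements.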
EXAMPLE 10.24: Specifying weights for graph cut segmentation.

In Fig. 10.53, we illustrated how to generate graph weights using intensity values, and in Fig. 10.55 we discussed briefly how to generate weights based on the distance between pixels. In this example, we give a more practical approach for generating weights that takes into account both intensity differences and the distance between pixels, thus introducing the concept of a neighborhood in graph segmentation.

Let ni and nj denote two nodes (image pixels). As mentioned earlier in this section, weights are supposed to reflect the similarity between nodes in a graph. When considering segmentation, one of the principal ways to establish how likely two pixels in an image are to be part of the same region or object is to determine the difference in their intensity values, and how close the pixels are to each other. The weight of the edge between two pixels should be large when the pixels are very close in intensity and proximity (i.e., when the pixels are "similar"), and should decrease as their intensity difference and distance from each other increase. That is, the weight should be a function of how similar the pixels are in both intensity and distance. These two concepts can be embedded into a single weight function, as in the sketch that follows.
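A minimal sketch of such a weight function, assuming the widely used form in which similarity decays as a Gaussian in both the intensity difference and the spatial distance, with weights set to zero beyond a neighborhood radius r; the parameter values are illustrative:

```python
import numpy as np

def edge_weight(img, ni, nj, sigma_i=0.1, sigma_d=4.0, r=8.0):
    """Weight of the edge between pixels ni = (r1, c1) and nj = (r2, c2):
    large when the pixels are close in both intensity and space, and
    zero outside a neighborhood of radius r (so each pixel connects
    only to nearby pixels)."""
    dist = np.hypot(ni[0] - nj[0], ni[1] - nj[1])
    if dist >= r:                               # outside neighborhood: no edge
        return 0.0
    di = float(img[ni]) - float(img[nj])        # intensity difference
    return np.exp(-(di / sigma_i) ** 2) * np.exp(-(dist / sigma_d) ** 2)
```

Restricting edges to a radius r keeps the resulting weight matrix sparse, which is what makes the eigenvector computation of the previous section practical for images.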
FIGURE 10.56 (a) Image of size 600 × 600 pixels. (b) Image smoothed with a 25 × 25 box kernel. (c) Graph cut segmentation obtained by specifying two regions.

Thus far, we have discussed segmentation based on three principal concepts: edge detection, thresholding, and region extraction. Each of these approaches was found to have advantages (for example, speed in the case of global thresholding) and disadvantages (for example, the need for post-processing, such as edge linking, in edge-based segmentation). In this section, we discuss an approach based on the concept of so-called morphological watersheds. Segmentation by watersheds embodies many of the concepts of the other three approaches and, as such, often produces more stable segmentation results, including connected segmentation boundaries. This approach also provides a simple framework for incorporating knowledge-based constraints (see Fig. 1.23) in the segmentation process, as we discuss at the end of this section.

BACKGROUND

The concept of a watershed is based on visualizing an image in three dimensions: two spatial coordinates versus intensity, as in Fig. 2.18(a). In such a "topographic" interpretation, we consider three types of points: (1) points belonging to a regional minimum; (2) points at which a drop of water, if placed at the location of any of those points, would fall with certainty to a single minimum; and (3) points at which water would be equally likely to fall to more than one such minimum. For a particular regional minimum, the set of points satisfying condition (2) is called the catchment basin or watershed of that minimum. The points satisfying condition (3) form crest lines on the topographic surface, and are referred to as divide lines or watershed lines.

FIGURE 10.57 (a) Original image. (b) Topographic view. Only the background is black; the basin on the left is slightly lighter than black. (c) and (d) Two stages of flooding. All constant dark values of gray are intensities in the original image; only constant light gray represents "water." (Courtesy of Dr. S. Beucher, CMM/Ecole des Mines de Paris.)
The principal objective of segmentation algorithms based on these concepts is to find the watershed lines. The method for doing this can be explained with the aid of Fig. 10.57. Figure 10.57(a) shows a gray-scale image and Fig. 10.57(b) is a topographic view, in which the height of the "mountains" is proportional to intensity values in the input image. For ease of interpretation, the backsides of structures are shaded. This is not to be confused with intensity values; only the general topography of the three-dimensional representation is of interest. In order to prevent the rising water from spilling out through the edges of the image, we imagine the perimeter of the entire topography (image) being enclosed by dams that are higher than the highest possible mountain, whose value is determined by the highest possible intensity value in the input image.

Suppose that a hole is punched in each regional minimum [shown as dark areas in Fig. 10.57(b)] and that the entire topography is flooded from below by letting water rise through the holes at a uniform rate. Figure 10.57(c) shows the first stage of flooding, where the "water," shown in light gray, has covered only areas that correspond to the black background in the image. (Because of neighboring contrast, the leftmost basin in Fig. 10.57(c) appears black, but it is a few shades lighter than the black background; the mid-gray in the second basin is a natural gray from the image in Fig. 10.57(a).) In Figs. 10.57(d) and (e) we see that the water now has risen into the first and second catchment basins, respectively. As the water continues to rise, it will eventually overflow from one catchment basin into another. The first indication of this is shown in Fig. 10.57(f). Here, water from the lower part of the left basin overflowed into the basin on the right, and a short "dam" (consisting of single pixels) was built to prevent water from merging at that level of flooding (the mathematical details of dam building are discussed in the following section). The effect is more pronounced as water continues to rise, as shown in Fig. 10.57(g). This figure shows a longer dam between the two catchment basins, and another dam in the top part of the right basin. The latter dam was built to prevent merging of water from that basin with water from areas corresponding to the background. This process is continued until the maximum level of flooding (corresponding to the highest intensity value in the image) is reached. The final dams correspond to the watershed lines, which are the desired segmentation boundaries. The result for this example is shown in Fig. 10.57(h) as dark, one-pixel-thick paths superimposed on the original image. Note the important property that the watershed lines form connected paths, thus giving continuous boundaries between regions.

One of the principal applications of watershed segmentation is in the extraction of nearly uniform (blob-like) objects from the background. Regions characterized by small variations in intensity have small gradient values. Thus, in practice, we often see watershed segmentation applied to the gradient of an image, rather than to the image itself. In this formulation, the regional minima of the catchment basins correlate nicely with the small values of the gradient corresponding to the objects of interest.
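As an illustration of this gradient-based formulation, the following sketch applies a watershed transform to the gradient magnitude of a sample image using scikit-image; the library and the sample image are illustrative choices, and the text's own algorithm is developed in the sections that follow.

```python
from skimage import data, filters, segmentation

img = data.coins()                 # sample grayscale image
grad = filters.sobel(img)          # gradient magnitude of the image

# Flood the gradient surface. With no markers given, every regional
# minimum of the gradient seeds its own catchment basin, which tends to
# oversegment real images (the marker-based remedy is discussed later
# in this section).
labels = segmentation.watershed(grad)
print(labels.max(), "catchment basins found")
```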
FIGURE 10.57 (Continued) (e) Result of further flooding. (f) Beginning of merging of water from two catchment basins (a short dam was built between them). (g) Longer dams. (h) Final watershed (segmentation) lines superimposed on the original image. (Courtesy of Dr. S. Beucher, CMM/Ecole des Mines de Paris.)
DAM CONSTRUCTION

Dam construction is based on binary images, which are members of 2-D integer space Z² (see Sections 2.4 and 2.6). The simplest way to construct dams separating sets of binary points is to use morphological dilation (see Section 9.2).

Figure 10.58 illustrates the basics of dam construction using dilation. Part (a) shows portions of two catchment basins at flooding step n − 1, and Fig. 10.58(b) shows the result at the next flooding step, n. The water has spilled from one basin into the other and, therefore, a dam must be built to keep this from happening. In order to be consistent with notation to be introduced shortly, let M1 and M2 denote the sets of coordinates of points in two regional minima. Then let the sets of coordinates of points in the catchment basins associated with these two minima at stage n − 1 of flooding be denoted by Cn−1(M1) and Cn−1(M2), respectively. These are the two gray regions in Fig. 10.58(a).

FIGURE 10.58 (a) Two partially flooded catchment basins at stage n − 1 of flooding. (b) Flooding at stage n, showing that water has spilled between basins. (c) Structuring element used for dilation: a 3 × 3 array of 1's with its origin at the center. (d) Result of dilation and dam construction, showing the first dilation, the second dilation, and the dam points.

Let C[n − 1] denote the union of these two sets. There are two connected components in Fig. 10.58(a), and only one connected component in Fig. 10.58(b) (see Sections 2.5 and 9.5 regarding connected components). This connected component encompasses the earlier two components, which are shown dashed in the figure. Two connected components having become a single component indicates that water between the two catchment basins merged at flooding step n. Let this connected component be denoted by q. Note that the two components from step n − 1 can be extracted from q by performing a logical AND operation, q ∩ C[n − 1]. Observe also that all points belonging to an individual catchment basin form a single connected component.

Suppose that each of the connected components in Fig. 10.58(a) is dilated by the structuring element in Fig. 10.58(c), subject to two conditions: (1) the dilation has to be constrained to q (this means that the center of the structuring element can be located only at points in q during dilation); and (2) the dilation cannot be performed on points that would cause the sets being dilated to merge (i.e., become a single connected component). Figure 10.58(d) shows that a first dilation pass (shown in light gray) expanded the boundary of each original connected component. Note that condition (1) was satisfied by every point during this dilation, and that condition (2) did not apply to any point; thus, the boundary of each region was expanded uniformly.

In the second dilation, shown in black in Fig. 10.58(d), several points failed condition (1) while meeting condition (2), resulting in the broken perimeter shown in the figure. It is evident that the only points in q that satisfy the two conditions under consideration describe the one-pixel-thick connected path shown crosshatched in Fig. 10.58(d). This path is the desired separating dam at stage n of flooding. Construction of the dam at this level of flooding is completed by setting all the points in the path just determined to a value greater than the maximum possible intensity value of the image (e.g., greater than 255 for an 8-bit image). This will prevent water from crossing over that part of the completed dam as the level of flooding is increased. As noted earlier, dams built by this procedure, which are the desired segmentation boundaries, are connected components. In other words, this method eliminates the problem of broken segmentation lines.

Although the procedure just described is based on a simple example, the method used for more complex situations is exactly the same, including the use of the 3 × 3 symmetric structuring element in Fig. 10.58(c).
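The two conditions lend themselves to a compact sketch using scipy's masked binary dilation. This is an illustrative simplification (the function name and the merge test are assumptions, not the book's procedure): pixels reachable from more than one basin in a dilation pass are flagged as dam points rather than being assigned to either basin.

```python
import numpy as np
from scipy import ndimage as ndi

def dilate_within(components, q, se):
    """One constrained dilation pass, per the two conditions above:
    each labeled component grows by se but (1) only into q, and
    (2) pixels reachable from more than one component are withheld
    as dam points instead of being assigned to any component."""
    grown = [ndi.binary_dilation(components == lab, structure=se, mask=q)
             for lab in range(1, components.max() + 1)]
    claimed = np.sum(grown, axis=0)          # components reaching each pixel
    dam = claimed > 1                        # condition (2) fails here
    new = np.zeros_like(components)
    for lab, g in enumerate(grown, start=1):
        new[g & ~dam] = lab                  # keep uncontested growth only
    return new, dam

# The 3-by-3 structuring element of Fig. 10.58(c):
se = np.ones((3, 3), dtype=bool)
```

The book's procedure repeats such passes at each flooding level and fixes the contested one-pixel-thick path as the dam; the sketch shows only a single pass.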
WATERSHED SEGMENTATION ALGORITHM

Let M1, M2, …, MR be sets denoting the coordinates of the points in the regional minima of an image, g(x, y). As mentioned earlier, this typically will be a gradient image. Let C(Mi) be a set denoting the coordinates of the points in the catchment basin associated with regional minimum Mi (recall that the points in any catchment basin form a connected component). The notation min and max will be used to denote the minimum and maximum values of g(x, y). Finally, let T[n] represent the set of coordinates (s, t) for which g(s, t) < n. That is,

$$T[n]=\left\{(s,t)\;\middle|\;g(s,t)<n\right\}\tag{10-110}$$

Geometrically, T[n] is the set of coordinates of points in g(x, y) lying below the plane g(x, y) = n.

The topography will be flooded in integer flood increments, from n = min + 1 to n = max + 1. At any step n of the flooding process, the algorithm needs to know the number of points below the flood depth. Conceptually, suppose that the coordinates in T[n] are "marked" black, and all other coordinates are marked white. Then, when we look "down" on the xy-plane at any increment n of flooding, we will see a binary image in which black points correspond to points of the function that lie below the plane g(x, y) = n. This interpretation is quite useful, and will make it easier to understand the following discussion.

Let Cn(Mi) denote the set of coordinates of points in the catchment basin associated with minimum Mi that are flooded at stage n. With reference to the discussion in the previous paragraph, we may view Cn(Mi) as a binary image given by

$$C_n(M_i)=C(M_i)\cap T[n]\tag{10-111}$$

In other words, Cn(Mi) = 1 at location (x, y) if (x, y) ∈ C(Mi) AND (x, y) ∈ T[n]; otherwise Cn(Mi) = 0. The geometrical interpretation of this result is straightforward: we are simply using the AND operator to isolate, at stage n of flooding, the portion of the binary image T[n] that is associated with regional minimum Mi.

Next, let B denote the number of flooded catchment basins at stage n, and let C[n] denote the union of these basins at stage n:

$$C[n]=\bigcup_{i=1}^{B}C_n(M_i)\tag{10-112}$$

Then C[max + 1] is the union of all catchment basins:

$$C[\max+1]=\bigcup_{i=1}^{R}C(M_i)\tag{10-113}$$

It can be shown (see Problem 10.47) that the elements in both Cn(Mi) and T[n] are never replaced during execution of the algorithm, and that the number of elements in these two sets either increases or remains the same as n increases. Thus, it follows that C[n − 1] is a subset of C[n]. According to Eqs. (10-112) and (10-113), C[n] is a subset of T[n], so it follows that C[n − 1] is also a subset of T[n]. From this we have the important result that each connected component of C[n − 1] is contained in exactly one connected component of T[n].

The algorithm for finding the watershed lines is initialized by letting C[min + 1] = T[min + 1]. The procedure then proceeds recursively, successively computing C[n] from C[n − 1] using the following approach. Let Q[n] denote the set of connected components in T[n]. Then, for each connected component q ∈ Q[n], there are three possibilities (illustrated in the sketch following this list):

1. q ∩ C[n − 1] is empty.
2. q ∩ C[n − 1] contains one connected component of C[n − 1].
3. q ∩ C[n − 1] contains more than one connected component of C[n − 1].
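A minimal sketch of this classification step, assuming 8-connectivity for the connected components; the helper name and parameters are assumptions made for this illustration:

```python
import numpy as np
from scipy import ndimage as ndi

def classify_flood_components(g, C_prev, n):
    """At flooding stage n, build T[n] per Eq. (10-110) and report which
    of the three cases applies to each connected component q of T[n].
    g: image being flooded (typically a gradient image);
    C_prev: boolean image of C[n - 1]."""
    s8 = np.ones((3, 3), dtype=bool)        # 8-connectivity
    T_n = g < n                             # Eq. (10-110) as a binary image
    q_labels, num_q = ndi.label(T_n, structure=s8)
    cases = {}
    for k in range(1, num_q + 1):
        q = (q_labels == k)
        _, m = ndi.label(q & C_prev, structure=s8)   # q AND C[n - 1]
        cases[k] = 1 if m == 0 else (2 if m == 1 else 3)
    return T_n, cases
```

Case 1 corresponds to a newly encountered minimum, case 2 to a component that lies within a single existing basin, and case 3 to a component where water from two or more basins would merge, which is where a dam must be built.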
FIGURE 10.61 (a) Image showing internal markers (light gray regions) and external markers (watershed lines). (b) Result of segmentation. Note the improvement over Fig. 10.60(b). (Courtesy of Dr. S. Beucher, CMM/Ecole des Mines de Paris.)
Suppose that we define an internal marker as (1) a region that is surrounded by points of higher "altitude"; (2) such that the points in the region form a connected component; and (3) in which all the points in the connected component have the same intensity value. After the image was smoothed, the internal markers resulting from this definition are shown as light gray, blob-like regions in Fig. 10.61(a). Next, the watershed algorithm was applied to the smoothed image, under the restriction that these internal markers be the only allowed regional minima. Figure 10.61(a) shows the resulting watershed lines, which are defined as the external markers. Note that the watershed lines pass along the highest points between neighboring markers.

The external markers in Fig. 10.61(a) effectively partition the image into regions, with each region containing a single internal marker and part of the background. The problem is thus reduced to partitioning each of these regions into two parts: a single object, and its background. We can bring to bear on this simplified problem many of the segmentation techniques discussed earlier in this chapter. Another approach is simply to apply the watershed segmentation algorithm to each individual region. In other words, we simply take the gradient of the smoothed image [as in Fig. 10.59(b)] and restrict the algorithm to operate on a single watershed that contains the marker in that particular region. Figure 10.61(b) shows the result obtained using this approach; the improvement over the image in Fig. 10.60(b) is evident.
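A sketch of this marker-controlled procedure using scikit-image; the library, the sample image, and the use of regional minima of the smoothed image as internal markers are choices made for this illustration:

```python
from scipy import ndimage as ndi
from skimage import data, filters, morphology, segmentation

img = filters.gaussian(data.coins(), sigma=2)   # smooth the image first
grad = filters.sobel(img)                       # gradient image

# Internal markers: one labeled blob per object. Here, regional minima
# of the smoothed image stand in for the book's marker definition.
markers, _ = ndi.label(morphology.local_minima(img))

# Watershed of the gradient, restricted so that the markers are the
# only allowed regional minima.
labels = segmentation.watershed(grad, markers)
```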
Marker selection can range from simple procedures based on intensity values and connectivity, as we just illustrated, to more complex descriptions involving size, shape, location, relative distances, texture content, and so on (see Chapter 11 regarding feature descriptors). The point is that using markers brings a priori knowledge to bear on the segmentation problem. Keep in mind that humans often aid segmentation and higher-level tasks in everyday vision by using a priori knowledge, one of the most familiar being the use of context. Thus, the fact that segmentation by watersheds offers a framework that can make effective use of this type of knowledge is a significant advantage of this method.

Motion is a powerful cue used by humans and many animals to extract objects or regions of interest from a background of irrelevant detail. In imaging applications, motion arises from a relative displacement between the sensing system and the scene being viewed, such as in robotic applications, autonomous navigation, and dynamic scene analysis. In the following discussion we consider the use of motion in segmentation both spatially and in the frequency domain.

SPATIAL TECHNIQUES

In what follows, we consider two approaches for detecting motion that work directly in the spatial domain. The key objective is to give you an idea of how to measure changes in digital images using some straightforward techniques.

A Basic Approach

One of the simplest approaches for detecting changes between two image frames f(x, y, ti) and f(x, y, tj), taken at times ti and tj, respectively, is to compare the two images pixel by pixel. One procedure for doing this is to form a difference image. Suppose that we have a reference image containing only stationary components. Comparing this image against a subsequent image of the same scene that includes one or more moving objects results in the difference of the two images canceling the stationary elements, leaving only nonzero entries that correspond to the nonstationary image components.

A difference image of two images (of the same size) taken at times ti and tj may be defined as

$$d_{ij}(x,y)=\begin{cases}1 & \text{if }\left|\,f(x,y,t_i)-f(x,y,t_j)\,\right|>T\\[2pt] 0 & \text{otherwise}\end{cases}\tag{10-114}$$

where T is a nonnegative threshold. Note that dij(x, y) has a value of 1 at spatial coordinates (x, y) only if the intensity difference between the two images at those coordinates is appreciable, as determined by T. Note also that coordinates (x, y) in Eq. (10-114) span the dimensions of the two images, so the difference image is of the same size as the images in the sequence.

In the discussion that follows, all pixels in dij(x, y) that have value 1 are considered the result of object motion. This approach is applicable only if the two images are registered spatially, and if the illumination is relatively constant within the bounds established by T. In practice, 1-valued entries in dij(x, y) may also arise as a result of noise. Typically, these entries are isolated points in the difference image, and a simple approach to their removal is to form 4- or 8-connected regions of 1's in dij(x, y), then ignore any region that has fewer than a predetermined number of elements. Although this may result in ignoring small and/or slow-moving objects, it improves the chances that the remaining entries in the difference image actually are the result of motion, and not noise.
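A direct implementation of Eq. (10-114), together with the small-region cleanup just described; the threshold and minimum region size are illustrative values:

```python
import numpy as np
from scipy import ndimage as ndi

def difference_image(f_i, f_j, T=25, min_size=10):
    """Eq. (10-114) plus removal of small connected regions of 1's,
    which are likely due to noise rather than motion."""
    d = np.abs(f_i.astype(float) - f_j.astype(float)) > T
    labels, num = ndi.label(d)                          # connected regions of 1's
    sizes = ndi.sum(d, labels, index=range(1, num + 1)) # pixels per region
    keep = np.isin(labels, 1 + np.flatnonzero(sizes >= min_size))
    return keep.astype(np.uint8)
```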
Although the method just described is simple, it is used frequently as the basis of imaging systems designed to detect changes in controlled environments, such as in the surveillance of parking facilities, buildings, and similar fixed locales.

Accumulative Differences

Consider a sequence of image frames denoted by f(x, y, t1), f(x, y, t2), …, f(x, y, tn), and let f(x, y, t1) be the reference image. An accumulative difference image (ADI) is formed by comparing this reference image with every subsequent image in the sequence. A counter for each pixel location in the accumulative image is incremented every time a difference occurs at that pixel location between the reference and an image in the sequence. Thus, when the kth frame is being compared with the reference, the entry at a given pixel of the accumulative image gives the number of times the intensity at that position was different [as determined by T in Eq. (10-114)] from the corresponding pixel value in the reference image.

Assuming that the intensity values of the moving objects are greater than those of the background, we consider three types of ADIs. Let R(x, y) denote the reference image and, to simplify the notation, let k denote tk, so that f(x, y, k) = f(x, y, tk). We assume that R(x, y) = f(x, y, 1). Then, for any k > 1, and keeping in mind that the values of the ADIs are counts, we define the following accumulative differences for all relevant values of (x, y). The absolute ADI is

$$A_k(x,y)=\begin{cases}A_{k-1}(x,y)+1 & \text{if }\left|\,R(x,y)-f(x,y,k)\,\right|>T\\[2pt] A_{k-1}(x,y) & \text{otherwise}\end{cases}\tag{10-115}$$
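A sketch of the three ADIs described here. Eq. (10-115) gives the absolute ADI; the positive and negative ADIs are assumed (consistent with the properties noted in the example below) to count the signed differences R − f > T and R − f < −T, respectively:

```python
import numpy as np

def accumulate_adis(frames, T=25):
    """Absolute, positive, and negative ADIs for a frame sequence.
    frames[0] is the reference image R(x, y); counts per Eq. (10-115)."""
    R = frames[0].astype(float)
    A = np.zeros(R.shape, int)   # absolute ADI
    P = np.zeros(R.shape, int)   # positive ADI (assumed: R - f > T)
    N = np.zeros(R.shape, int)   # negative ADI (assumed: R - f < -T)
    for f in frames[1:]:
        diff = R - f.astype(float)
        A += np.abs(diff) > T
        P += diff > T
        N += diff < -T
    return A, P, N
```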
FIGURE 10.62 ADIs of a rectangular object moving in a southeasterly direction. (a) Absolute ADI. (b) Positive ADI. (c) Negative ADI.

The images in Fig. 10.62 are of size 256 × 256 pixels. We note the following: (1) the nonzero area of the positive ADI is equal to the size of the moving object; (2) the location of the positive ADI corresponds to the location of the moving object in the reference frame; (3) the number of counts in the positive ADI stops increasing when the moving object is displaced completely with respect to the same object in the reference frame; (4) the absolute ADI contains the regions of both the positive and negative ADIs; and (5) the direction and speed of the moving object can be determined from the entries in the absolute and negative ADIs.
EXAMPLE 10.28: Building a reference image.

Figures 10.63(a) and (b) show two image frames of a traffic intersection. The first image is considered the reference, and the second depicts the same scene some time later. The objective is to remove the principal moving objects in the reference image in order to create a static image. Although there are other, smaller moving objects, the principal moving feature is the automobile at the intersection moving from left to right. For illustrative purposes we focus on this object. By monitoring the changes in the positive ADI, it is possible to determine the initial position of a moving object, as explained above. Once the area occupied by this object is identified, the object can be removed from the image by subtraction. By looking at the frame in the sequence at which the positive ADI stopped changing, we can copy from this image the area previously occupied by the moving object in the initial frame. This area then is pasted onto the image from which the object was cut out, thus restoring the background of that area. If this is done for all moving objects, the result is a reference image with only static components against which we can compare subsequent frames for motion detection. The reference image resulting from removing the east-bound moving vehicle and restoring the background is shown in Fig. 10.63(c).

FREQUENCY DOMAIN TECHNIQUES

In this section, we consider the problem of determining motion via a Fourier transform formulation. Consider a sequence f(x, y, t), t = 0, 1, 2, …, K − 1, of K digital image frames of size M × N pixels, generated by a stationary camera. We begin the development by assuming that all frames have a homogeneous background of zero intensity. The exception is a single, 1-pixel object of unit intensity that is moving with constant velocity. Suppose that for frame one (t = 0) the object is at location (x′, y′), and that the image plane is projected onto the x-axis; that is, the pixel intensities are summed (for each row) across the columns in the image. This operation yields a 1-D array with M entries that are zero, except at x′, which is the x-coordinate of the single-point object. If we now multiply all the components of the 1-D array by the quantity exp[j2πa1xΔt], for x = 0, 1, 2, …, M − 1, and add the results, we obtain the single term exp[j2πa1x′Δt], because there is only one nonzero point in the array. In this notation, a1 is a positive integer, and Δt is the time interval between frames.

Suppose that in frame two (t = 1) the object has moved to coordinates (x′ + 1, y′); that is, it has moved 1 pixel parallel to the x-axis. Then, repeating the projection procedure discussed in the previous paragraph yields the sum exp[j2πa1(x′ + 1)Δt]. If the object continues to move 1 pixel location per frame then, at any integer instant of time, t, the result will be exp[j2πa1(x′ + t)Δt], which, using Euler's formula, may be expressed as

$$e^{\,j2\pi a_1(x'+t)\Delta t}=\cos\!\left[2\pi a_1(x'+t)\Delta t\right]+j\sin\!\left[2\pi a_1(x'+t)\Delta t\right]\tag{10-118}$$

for t = 0, 1, 2, …, K − 1. In other words, this procedure yields a complex sinusoid with frequency a1. If the object were moving V1 pixels (in the x-direction) between frames, the sinusoid would have frequency V1a1. Because t varies between 0 and K − 1 in integer increments, restricting a1 to have integer values causes the discrete Fourier transform of the complex sinusoid to have two peaks: one located at frequency V1a1 and the other at K − V1a1. This latter peak is the result of symmetry in the discrete Fourier transform, as discussed in Section 4.6, and may be ignored. Thus, a peak search in the Fourier spectrum would yield one peak with value V1a1. Dividing this quantity by a1 yields V1, the velocity component in the x-direction, as the frame rate is assumed to be known. A similar analysis would yield V2, the component of velocity in the y-direction.

A sequence of frames in which no motion takes place produces identical exponential terms, whose Fourier transform would consist of a single peak at a frequency of 0 (a single dc term). Therefore, because the operations discussed so far are linear, the general case involving one or more moving objects in an arbitrary static background would have a Fourier transform with a peak at dc corresponding to static image components, and peaks at locations proportional to the velocities of the objects.

These concepts may be summarized as follows. For a sequence of K digital images of size M × N pixels, the sum of the weighted projections onto the x-axis at any integer instant of time is

$$g_x(t,a_1)=\sum_{x=0}^{M-1}\sum_{y=0}^{N-1}f(x,y,t)\,e^{\,j2\pi a_1x\Delta t}\qquad t=0,1,\ldots,K-1\tag{10-119}$$
The 1-D Fourier transforms of Eqs. (10-119) and (10-120) with respect to time are, respectively,

    G_x(u_1, a_1) = \sum_{t=0}^{K-1} g_x(t, a_1)\, e^{-j 2\pi u_1 t / K}, \quad u_1 = 0, 1, \ldots, K - 1    (10-121)
and

    G_y(u_2, a_2) = \sum_{t=0}^{K-1} g_y(t, a_2)\, e^{-j 2\pi u_2 t / K}, \quad u_2 = 0, 1, \ldots, K - 1    (10-122)

These transforms are computed using an FFT algorithm, as discussed in Section 4.11.

FIGURE 10.64 LANDSAT frame. (Cowart, Snyder, and Ruedger.)

The frequency-velocity relationship is

    u_1 = a_1 V_1    (10-123)

and

    u_2 = a_2 V_2    (10-124)
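Before continuing, it may help to see Eqs. (10-119), (10-121), and (10-123) at work on synthetic data. The following Python sketch is our illustration, not part of the original development; the frame dimensions, the object's starting position, and the choice Δt = 1/K (which normalizes the total sequence time to 1, so that the peak occurs at the integer location u1 = a1V1) are all assumptions.

    import numpy as np

    # Synthetic K-frame sequence of M x N images containing a single 1-pixel
    # object of unit intensity moving V1 pixels per frame in the x-direction.
    M, N, K = 128, 64, 32
    V1, a1 = 2, 6                 # true velocity and projection parameter
    dt = 1.0 / K                  # assumed: total sequence time normalized to 1

    frames = np.zeros((K, M, N))
    x0, y0 = 5, 10
    for t in range(K):
        frames[t, x0 + V1 * t, y0] = 1.0

    # Eq. (10-119): weighted projection onto the x-axis for each frame.
    x = np.arange(M)
    weights = np.exp(1j * 2 * np.pi * a1 * x * dt)
    gx = np.array([frames[t].sum(axis=1) @ weights for t in range(K)])

    # Eq. (10-121): 1-D DFT of g_x over time, computed with an FFT.
    Gx = np.fft.fft(gx)

    # Peak search over u1 < K/2 (the mirrored peak at K - a1*V1 is ignored).
    u1 = np.argmax(np.abs(Gx[:K // 2]))
    print("estimated V1 =", u1 / a1)      # Eq. (10-123): V1 = u1 / a1

Running this prints an estimated V1 of 2.0, matching the velocity used to generate the frames.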
In the preceding formulation, the unit of velocity is in pixels per total frame time. For example, V1 = 10 indicates motion of 10 pixels in K frames. For frames that are taken uniformly, the actual physical speed depends on the frame rate and the distance between pixels. Thus, if V1 = 10, K = 30, the frame rate is two images per second, and the distance between pixels is 0.5 m, then the actual physical speed in the x-direction is

    V_1 = (10\ \text{pixels})(0.5\ \text{m/pixel})(2\ \text{frames/s})/(30\ \text{frames}) = 1/3\ \text{m/s}

The sign of the x-component of the velocity is obtained by computing

    S_{1x} = \left. \frac{d\, \mathrm{Re}\!\left[ g_x(t, a_1) \right]}{dt} \right|_{t=n}    (10-125)

and

    S_{2x} = \left. \frac{d^2\, \mathrm{Im}\!\left[ g_x(t, a_1) \right]}{dt^2} \right|_{t=n}    (10-126)
Because g_x is sinusoidal, it can be shown (see Problem 10.53) that S_1x and S_2x will have the same sign at an arbitrary point in time, n, if the velocity component V1 is positive. Conversely, opposite signs in S_1x and S_2x indicate a negative velocity component. If either S_1x or S_2x is zero, we consider the next closest point in time, t = n ± Δt. Similar comments apply to computing the sign of V2.

FIGURE 10.65 Intensity plot of the image in Fig. 10.64, with the target circled. (Rajala, Riddle, and Snyder.)
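A quick numerical check of the sign test in Eqs. (10-125) and (10-126) can be made by approximating the derivatives with finite differences. This sketch is our illustration, using the same synthetic g_x as in the previous sketch:

    import numpy as np

    # Standalone check of Eqs. (10-125) and (10-126) on a synthetic g_x.
    K, a1, V1, x0 = 32, 6, 2, 5
    t = np.arange(K)
    gx = np.exp(1j * 2 * np.pi * a1 * (x0 + V1 * t) / K)   # g_x(t, a1), Delta t = 1/K

    n = K // 2
    S1x = (gx[n + 1].real - gx[n - 1].real) / 2.0             # approximates Eq. (10-125)
    S2x = gx[n + 1].imag - 2.0 * gx[n].imag + gx[n - 1].imag  # approximates Eq. (10-126)

    # Same signs imply V1 > 0; opposite signs imply V1 < 0. If either value
    # is zero, move to the next closest point in time, as the text prescribes.
    print("V1 positive?", np.sign(S1x) == np.sign(S2x))

With the values above, both quantities are positive and the script prints True, consistent with the positive velocity used to generate g_x.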
Figures 10.66(a) and (b) show the results of computing Eqs. (10-121) and (10-122) with a1 = 6 and a2 = 4, respectively. The peak at u1 = 3 in Fig. 10.66(a) yields V1 = 0.5 from Eq. (10-123). Similarly, the peak at u2 = 4 in Fig. 10.66(b) yields V2 = 1.0 from Eq. (10-124). In choosing a1 and a2, we must make sure that ai ≤ umax/Vmax, where umax is the aliasing frequency limitation established by K, and Vmax is the maximum expected object velocity.

FIGURE 10.66 (a) Spectrum of Eq. (10-121) showing a peak at u1 = 3. (b) Spectrum of Eq. (10-122) showing a peak at u2 = 4. (Horizontal axes: frequency; vertical axes: magnitude.) (Rajala, Riddle, and Snyder.)
Summary, References, and Further Reading

Because of its central role in autonomous image processing, segmentation is a topic covered in most books dealing with image processing, image analysis, and computer vision. The following books provide complementary and/or supplementary reading for our coverage of this topic: Umbaugh [2010]; Prince [2012]; Nixon and Aguado [2012]; Pratt [2014]; and Petrou and Petrou [2010].

Work dealing with the use of kernels to detect intensity discontinuities (see Section 10.2) has a long history. Numerous kernels have been proposed over the years: Roberts [1965]; Prewitt [1970]; and Kirsch [1971]. The Sobel operators are from [Sobel]; see also Danielsson and Seger [1990]. Our presentation of the zero-crossing properties of the Laplacian is based on Marr [1982]. The Canny edge detector discussed in Section 10.2 is due to Canny [1986]. The basic reference for the Hough transform is Hough [1962]. See Ballard [1981] for a generalization to arbitrary shapes.

Other approaches used to deal with the effects of illumination and reflectance on thresholding are illustrated by the work of Perez and Gonzalez [1987], Drew et al. [1999], and Toro and Funt [2007]. The optimum thresholding approach due to Otsu [1979] has gained considerable acceptance because it combines excellent performance with simplicity of implementation, requiring only estimation of image histograms. The basic idea of using preprocessing to improve thresholding dates back to an early paper by White and Rohrer [1983], which combined thresholding, the gradient, and the Laplacian in the solution of a difficult segmentation problem.

See Fu and Mui [1981] for an early survey on the topic of region-oriented segmentation. The work of Haddon and Boyce [1990] and of Pavlidis and Liow [1990] is among the earliest efforts to integrate region and boundary information for the purpose of segmentation. Region growing is still an active area of research in image processing, as exemplified by Liangjia et al. [2013]. The basic reference on the k-means algorithm presented in Section 10.5 goes back several decades to an obscure 1957 Bell Labs report by Lloyd, who subsequently published it in Lloyd [1982]. This algorithm was already being used in areas such as pattern recognition in the 1960s and '70s (Tou and Gonzalez [1974]).

As noted in Section 10.7, one of the key issues with watersheds is the problem of over-segmentation. The papers by Bleau and Leon [2000] and by Gaetano et al. [2015] are illustrative of approaches for dealing with this problem.

The material in Section 10.8 dealing with accumulative differences is from Jain, R. [1981]. See also Jain, Kasturi, and Schunck [1995]. The material dealing with motion via Fourier techniques is from Rajala, Riddle, and Snyder [1983]. The books by Snyder and Qi [2004], and by Chakrabarti et al. [2015], provide additional reading on motion estimation. For details on the software aspects of many of the examples in this chapter, see Gonzalez, Woods, and Eddins [2009].

Problems

Solutions to the problems marked with an asterisk (*) are in the DIP4E Student Support Package (consult the book website: www.ImageProcessingPlace.com).

10.1 * In a Taylor series approximation, the remainder (also called the truncation error) consists of all the terms not used in the approximation. The first term in the remainder of a finite difference approximation is indicative of the error in the approximation. The higher the derivative order of that term is, the lower the error will be in the approximation. All three approximations to the first derivative given in Eqs. (10-4) through (10-6) are computed using the same number of sample points. However, the error of the central difference approximation is less than that of the other two. Show that this is true.

10.2 Do the following:
(a) * Show how Eq. (10-8) was obtained.
(b) Show how Eq. (10-9) was obtained.

10.3 A binary image contains straight lines oriented horizontally, vertically, at 45°, and at −45°. Give a set of 3 × 3 kernels that can be used to detect one-pixel breaks in these lines. Assume that the intensities of the lines and background are 1 and 0, respectively.

10.4 Propose a technique for detecting gaps of length ranging between 1 and K pixels in line segments of a binary image. Assume that the lines are one pixel thick. Base your technique on 8-neighbor connectivity analysis, rather than attempting to construct kernels for detecting the gaps.

10.5 * With reference to Fig. 10.6, what are the angles (measured with respect to the x-axis of the book axis convention in Fig. 2.19) of the horizontal and vertical lines to which the kernels in Figs. 10.6(a) and (c) are most responsive?

10.6 Refer to Fig. 10.7 in answering the following questions.
(a) * Some of the lines joining the pads and center element in Fig. 10.7(e) are single lines, while others are double lines. Explain why.
(b) Propose a method for eliminating the components in Fig. 10.7(f) that are not part of the line oriented at −45°.
(c)

10.7 With reference to the edge models in Fig. 10.8, answer the following without generating the gradient and angle images. Simply provide sketches of the profiles that show what you would expect the profiles of the magnitude and angle images to look like.
(a) * Suppose that we compute the gradient magnitude of each of these models using the Prewitt kernels in Fig. 10.14. Sketch what a horizontal profile through the center of each gradient image would look like.
(b) Sketch a horizontal profile for each corresponding angle image.

10.8 Consider a horizontal intensity profile through
the middle of a binary image that contains a vertical step edge through the center of the image. Draw what the profile would look like after the image has been blurred by an averaging kernel of size n × n with coefficients equal to 1/n². For simplicity, assume that the image was scaled so that its intensity levels are 0 on the left of the edge and 1 on its right. Also, assume that the size of the kernel is much smaller than the image, so that image border effects are not a concern near the center of the image.

10.9 * Suppose that we had used the edge models in the following image, instead of the ramp in Fig. 10.10. Sketch the gradient and Laplacian of each profile.
[Figure: image and profile of a horizontal line.]

10.10 Do the following:
(a) * Show that the direction of steepest (maximum) ascent of a function f at point (x, y) is given by the vector ∇f(x, y) in Eq. (10-16), and that the rate of that ascent is the magnitude of ∇f(x, y), defined in Eq. (10-17).
(b) Show that the direction of steepest descent is given by the vector −∇f(x, y), and that the rate of the steepest descent is the magnitude of ∇f(x, y).
(c) Give the description of an image whose gradient magnitude image would be the same, whether we computed it using Eq. (10-17) or (10-26). A constant image is not an acceptable answer.

10.11 Do the following.
(a) How would you modify the Sobel and Prewitt kernels in Fig. 10.14 so that they give their strongest gradient response for edges oriented at ±45°?
(b) * Show that the Sobel and Prewitt kernels in Fig. 10.14, and in (a) above, give isotropic results only for horizontal and vertical edges, and for edges oriented at ±45°, respectively.

10.12 The results obtained by a single pass through an image of some 2-D kernels can be achieved also by two passes using 1-D kernels. For example, the same result of using a 3 × 3 smoothing kernel with coefficients 1/9 can be obtained by a pass of the kernel [1 1 1] through an image, followed by a pass of the result with the kernel [1 1 1]ᵀ. The final result is then scaled by 1/9. Show that the response of the Sobel kernels (Fig. 10.14) can be implemented similarly by one pass of the differencing kernel [−1 0 1] (or its vertical counterpart) followed by the smoothing kernel [1 2 1] (or its vertical counterpart). A numerical check of this claim is sketched below.
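The claim in Problem 10.12 is easy to verify numerically. The following sketch is our addition (the test image is random, and SciPy's convolve2d is used for all passes); it compares a single pass of the 3 × 3 horizontal Sobel kernel with the two-pass 1-D implementation described in the problem.

    import numpy as np
    from scipy.signal import convolve2d

    rng = np.random.default_rng(1)
    f = rng.random((64, 64))

    # One pass with the 3x3 horizontal Sobel kernel of Fig. 10.14.
    sobel = np.array([[-1, 0, 1],
                      [-2, 0, 2],
                      [-1, 0, 1]])
    one_pass = convolve2d(f, sobel, mode='same')

    # Two passes: differencing kernel [-1 0 1] along the rows, then the
    # smoothing kernel [1 2 1] (transposed) along the columns.
    diff_k = np.array([[-1, 0, 1]])
    smooth_k = np.array([[1], [2], [1]])
    two_pass = convolve2d(convolve2d(f, diff_k, mode='same'),
                          smooth_k, mode='same')

    print(np.allclose(one_pass, two_pass))

Because convolution is associative, and the 3 × 3 Sobel kernel is exactly the outer product of [1 2 1]ᵀ and [−1 0 1], the two results agree and the script prints True.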
10.13 A popular variation of the compass kernels shown in Fig. 10.15 is based on using coefficients with values 0, 1, and −1.
(a) * Give the form of the eight compass kernels using these coefficients. As in Fig. 10.15, let N, NW, . . . denote the direction of the edge that gives the strongest response.
(b) Specify the gradient vector direction of the edges detected by each kernel in (a).

10.14 The rectangle in the following binary image is of size m × n pixels.
(a) * What would the magnitude of the gradient of this image look like based on using the approximation in Eq. (10-26)? Assume that g_x and g_y are obtained using the Sobel kernels. Show all relevant different pixel values in the gradient image.
(b) With reference to Eq. (10-18) and Fig. 10.12, sketch the histogram of edge directions. Be precise in labeling the height of each component of the histogram.
(c) What would the Laplacian of this image look like based on using Eq. (10-14)? Show all relevant different pixel values in the Laplacian image.

10.15 Suppose that an image f(x, y) is convolved with a kernel of size n × n (with coefficients 1/n²) to produce a smoothed image f̄(x, y).
(a) * Derive an expression for edge strength (edge magnitude) as a function of n. Assume that n is odd and that the partial derivatives are computed using Eqs. (10-19) and (10-20).
(b) Show that the ratio of the maximum edge strength of the smoothed image to the maximum edge strength of the original image is 1/n. In other words, edge strength is inversely proportional to the size of the smoothing kernel, as one would expect.

10.16 With reference to Eq. (10-29),
(a) * Show that the average value of the LoG operator, ∇²G(x, y), is zero.
(b) Show that the average value of any image convolved with this operator also is zero. (Hint: Consider solving this problem in the frequency domain, using the convolution theorem and the fact that the average value of a function is proportional to its Fourier transform evaluated at the origin.)
(c) Suppose that we: (1) used the kernel in Fig. 10.4(a) to approximate the Laplacian of a Gaussian, and (2) convolved this result with any image. What would be true in general of the values of the resulting image? Explain. (Hint: Take a look at Problem 3.32.)

10.17 Refer to Fig. 10.22(c).
(a) Explain why the edges form closed contours.
(b) * Does the zero-crossing method for finding edge location always result in closed contours? Explain.

10.18 One often finds in the literature a derivation of the Laplacian of a Gaussian (LoG) that starts with the expression

    G(r) = e^{-r^2/2\sigma^2}

where r² = x² + y². The LoG is then derived by taking the second partial derivative with respect to r: ∇²G(r) = ∂²G(r)/∂r². Finally, x² + y² is substituted for r² to get the final (incorrect) result:

    \nabla^2 G(x, y) = \left[ \frac{x^2 + y^2 - \sigma^2}{\sigma^4} \right] \exp\!\left[ -\frac{x^2 + y^2}{2\sigma^2} \right]

Derive this result and explain the reason for the difference between this expression and Eq. (10-29). A symbolic check of this expression is sketched below.
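For readers who want to check the algebra in Problem 10.18 symbolically, the following SymPy sketch (our addition) reproduces the "incorrect" expression by differentiating twice with respect to r alone, which is precisely the flawed step the problem describes.

    from sympy import symbols, exp, sqrt, diff, simplify

    x, y, r, s = symbols('x y r sigma', positive=True)
    G = exp(-r**2 / (2 * s**2))

    # Second partial derivative with respect to r only (the flawed step).
    lap_r = diff(G, r, 2)

    # Substituting r = sqrt(x^2 + y^2) reproduces the "incorrect" result:
    # (x^2 + y^2 - sigma^2)/sigma^4 * exp(-(x^2 + y^2)/(2 sigma^2)).
    print(simplify(lap_r.subs(r, sqrt(x**2 + y**2))))

The printed result matches the expression above, confirming the algebra; explaining why it differs from Eq. (10-29) is the point of the problem.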
10.19 Do the following:
(a) * Derive Eq. (10-33).
(b) Let k = σ1/σ2 denote the standard deviation ratio discussed in connection with the DoG function, and express Eq. (10-33) in terms of k and σ2.

10.20 In the following, assume that G and f are discrete arrays of size n × n and M × N, respectively.
(a) Show that the 2-D convolution of the Gaussian function G(x, y) in Eq. (10-27) with an image f(x, y) can be expressed as a 1-D convolution along the rows (columns) of f(x, y), followed by a 1-D convolution along the columns (rows) of the result. (Hint: See Section 3.4 regarding discrete convolution and separability.)
(b) * Derive an expression for the computational advantage of using the 1-D convolution approach in (a), as opposed to implementing the 2-D convolution directly. Assume that G(x, y) is sampled to produce an array of size n × n and that f(x, y) is of size M × N. The computational advantage is the ratio of the number of multiplications required for 2-D convolution to the number required for 1-D convolution. (Hint: Review the subsection on separable kernels in Section 3.4.)

10.21 Do the following.
(a) Show that Steps 1 and 2 of the Marr-Hildreth algorithm can be implemented using four 1-D convolutions. (Hints: Refer to Problem 10.20(a) and express the Laplacian operator as the sum of two partial derivatives, given
by Eqs. (10-10) and (10-11), and implement each derivative using a 1-D kernel, as in Problem 10.12.)
(b) Derive an expression for the computational advantage of using the 1-D convolution approach in (a), as opposed to implementing the 2-D convolution directly. Assume that G(x, y) is sampled to produce an array of size n × n and that f(x, y) is of size M × N. The computational advantage is the ratio of the number of multiplications required for 2-D convolution to the number required for 1-D convolution (see Problem 10.20).

10.22 Do the following.
(a) * Formulate Step 1 and the gradient magnitude image computation in Step 2 of the Canny algorithm using 1-D instead of 2-D convolutions.
(b) What is the computational advantage of using the 1-D convolution approach, as opposed to implementing a 2-D convolution? Assume that the 2-D Gaussian filter in Step 1 is sampled into an array of size n × n and that the input image is of size M × N. Express the computational advantage as the ratio of the number of multiplications required by each method.

10.23 With reference to the three vertical edge models and corresponding profiles in Fig. 10.8, provide sketches of the profiles that would result from each of the following methods. You may sketch the profiles manually.
(a) * Suppose that we compute the gradient magnitude of each of the three edge model images using the Sobel kernels. Sketch the horizontal intensity profiles of the three resulting gradient images.
(b) Sketch the horizontal intensity profiles that would result from using the 3 × 3 Laplacian kernel in Fig. 10.4(a).
(c) * Repeat (b) using only the first two steps of the Marr-Hildreth edge detector.
(d) Repeat (b) using the first two steps of the Canny edge detector. You may ignore the angle images.
(e) Sketch the horizontal profiles of the angle images resulting from using the Canny edge detector.

10.24 In Example 10.9, we used a smoothing kernel of size 19 × 19 to generate Fig. 10.26(c), and a kernel of size 13 × 13 to generate Fig. 10.26(d). What was the rationale that led to choosing these values? (Hint: Observe that both are Gaussian kernels, and refer to the discussion of lowpass Gaussian kernels in Section 3.5.)

10.25 Refer to the Hough transform in Section 10.2.
(a) Propose a general procedure for obtaining the normal representation of a line from its slope-intercept form, y = ax + b.
(b) * Find the normal representation of the line y = −2x + 1.

10.26 Refer to the Hough transform in Section 10.2.
(a) * Explain why the Hough mapping of the point labeled 1 in Fig. 10.30(a) is a straight line in Fig. 10.30(b).
(b) * Is this the only point that would produce that result? Explain.
(c) Explain the reflective adjacency relationship illustrated by, for example, the curve labeled Q in Fig. 10.30(b).

10.27 Show that the number of operations required to implement the accumulator-cell approach discussed in Section 10.2 is linear in n, the number of non-background points in the image plane (i.e., the xy-plane).

10.28 An important application of image segmentation is in processing images resulting from so-called bubble chamber events. These images arise from experiments in high-energy physics in which a beam of particles of known properties is directed onto a target of known nuclei. A typical event consists of incoming tracks, any one of which, upon a collision, branches out into secondary tracks of particles emanating from the point of collision. Propose a segmentation approach for detecting all tracks angled at any of the following six directions off the horizontal: ±25°, ±50°, and ±75°. The estimation error allowed in any of these six directions is ±5°. For a track to be valid, it must be at least 100 pixels long and have no more than three gaps, each not exceeding 10 pixels. You may assume that the images have been preprocessed so that they are binary, and that all tracks are 1 pixel thick, except at the point of collision from which they emanate. Your procedure should be able to differentiate between tracks that have the same direction but different origins. (Hint: Base your solution on the Hough transform.)

10.29 * Restate the basic global thresholding algorithm in Section 10.3 so that it uses the histogram of an image instead of the image itself.

10.30 * Prove that the basic global thresholding algorithm in Section 10.3 converges in a finite number of steps. (Hint: Use the histogram formulation from Problem 10.29.) For reference, the image-based form of the algorithm is sketched below.
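The following Python sketch (our addition) implements the image-based form of the basic global thresholding algorithm from Section 10.3; the histogram-based restatement asked for in Problem 10.29 has the same structure. The stopping parameter delta_T and the use of the image mean as the initial threshold are assumptions.

    import numpy as np

    def basic_global_threshold(f, delta_T=0.5):
        """Iterative global thresholding (Section 10.3): threshold at T, set
        T to the midpoint of the two group means, and repeat until the change
        in T is smaller than delta_T. Assumes f is not a constant image."""
        T = f.mean()                    # initial T between min and max intensities
        while True:
            m1 = f[f > T].mean()        # mean of pixels above the threshold
            m2 = f[f <= T].mean()       # mean of pixels at or below it
            T_new = 0.5 * (m1 + m2)
            if abs(T_new - T) < delta_T:
                return T_new
            T = T_new

Problems 10.31 through 10.34 analyze where this iteration converges for specific initial values and histogram shapes.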
10.31 Give an explanation of why the initial threshold in the basic global thresholding algorithm in Section 10.3 must be between the minimum and maximum values in the image. (Hint: Construct an example that shows the algorithm failing for a threshold value selected outside this range.)

10.32 * Assume that the initial threshold in the basic global thresholding algorithm in Section 10.3 is selected as a value between the minimum and maximum intensity values in an image. Do you think the final value of the threshold at convergence depends on the specific initial value used? Explain. (You can use a simple image example to support your conclusion.)

10.33 You may assume in both of the following cases that the initial threshold is in the open interval (0, L − 1).
(a) * Show that if the histogram of an image is uniform over all possible intensity levels, the basic global thresholding algorithm converges to the average intensity of the image.
(b) Show that if the histogram of an image is bimodal, with identical modes that are symmetric about their means, then the basic global thresholding algorithm will converge to the point halfway between the means of the modes.

10.34 Refer to the basic global thresholding algorithm in Section 10.3. Assume that in a given problem the histogram is bimodal, with modes that are Gaussian curves of the form A1 exp[−(z − m1)²/2σ1²] and A2 exp[−(z − m2)²/2σ2²]. Assume that m1 is greater than m2, and that the initial T is between the max and min image intensities. Give conditions (in terms of the parameters of these curves) for the following to be true when the algorithm converges:
(a) * The threshold is equal to (m1 + m2)/2.
(b) * The threshold is to the left of m2.
(c) The threshold is in the interval (m1 + m2)/2 < T < m1.

10.35 Do the following:
(a) * Show how the first line in Eq. (10-60) follows from Eqs. (10-55), (10-56), and (10-59).
(b) Show how the second line in Eq. (10-60) follows from the first.

10.36 Show that a maximum value for Eq. (10-63) always exists for k in the range 0 ≤ k ≤ L − 1.

10.37 * With reference to Eq. (10-65), advance an argument that establishes that 0 ≤ η(k) ≤ 1 for k in the range 0 ≤ k ≤ L − 1, where the minimum is achievable only by images with constant intensity, and the maximum occurs only for 2-valued images with values 0 and (L − 1).

10.38 Do the following:
(a) * Suppose that the intensities of a digital image f(x, y) are in the range [0, 1], and that a threshold, T, successfully segmented the image into objects and background. Show that the threshold T′ = 1 − T will successfully segment the negative of f(x, y) into the same regions. The term negative is used here in the sense defined in Section 3.2.
(b) The intensity transformation function in (a) that maps an image into its negative is a linear function with negative slope. State the conditions that an arbitrary intensity transformation function must satisfy for the segmentability of the original image, with respect to a threshold, T, to be preserved. What would be the value of the threshold after the intensity transformation?

10.39 The objects and background in the image below have a mean intensity of 170 and 60, respectively, on a [0, 255] scale. The image is corrupted by Gaussian noise with 0 mean and a standard deviation of 10 intensity levels. Propose a thresholding
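Several of these problems refer to the basic global thresholding algorithm of Section 10.3. As a point of reference, here is a minimal NumPy sketch of the iterative procedure the problems describe (initialize T, split the intensities at T, and move T to the midpoint of the two class means until it stabilizes); the array name f and the tolerance delta_T are our illustrative choices, not from the text.

```python
import numpy as np

def basic_global_threshold(f, delta_T=0.5):
    """Iterative global thresholding (a sketch of the Section 10.3
    procedure): T starts at the global mean and is repeatedly moved
    to the midpoint of the mean intensities of the two classes it
    induces, until successive estimates differ by less than delta_T."""
    f = f.astype(np.float64)
    T = f.mean()                         # initial estimate: global mean
    while True:
        lower, upper = f[f <= T], f[f > T]
        if lower.size == 0 or upper.size == 0:
            return T                     # degenerate split; stop
        T_new = 0.5 * (lower.mean() + upper.mean())
        if abs(T_new - T) < delta_T:
            return T_new
        T = T_new
```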
10.39 The objects and background in the image below have a mean intensity of 170 and 60, respectively, on a [0, 255] scale. The image is corrupted by Gaussian noise with 0 mean and a standard deviation of 10 intensity levels. Propose a thresholding method capable of a correct segmentation rate of 90% or higher. (Recall that 99.7% of the area of a Gaussian curve lies in a ±3σ interval about the mean, where σ is the standard deviation.) [Image not reproduced here.]
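A quick check of the numbers in Problem 10.39 (our remark, not part of the original statement): a threshold placed midway between the two means,

$$ T = \frac{170 + 60}{2} = 115, $$

lies $|170 - 115| = |60 - 115| = 55 = 5.5\sigma$ from each mean when $\sigma = 10$. Since 99.7% of a Gaussian's area lies within $\pm 3\sigma$ of its mean, virtually all object and background pixels fall on the correct side of $T$, so even a single fixed threshold at 115 comfortably exceeds the required 90% rate.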
10.40 Refer to the intensity ramp image in Fig. 10.34(b) and the moving-average algorithm discussed in Section 10.3. Assume that the image is of size 500 × 700 pixels and that its minimum and maximum values are 0 and 1, where 0's are contained only in the first column.

(a) * What would be the result of segmenting this image with the moving-average algorithm, using b = 0 and an arbitrary value for n? Explain what the segmented image would look like.

(b) Now reverse the direction of the ramp so that its leftmost value is 1 and the rightmost value is 0, and repeat (a).

(c) Repeat (a), but with b = 1 and n = 2.

(d) Repeat (a), but with b = 1 and n = 100.

10.41 Propose a region-growing algorithm to segment the image in Problem 10.39.

10.42 * Segment the image shown by using the split and merge procedure discussed in Section 10.4. Let Q(Ri) = TRUE if all pixels in Ri have the same intensity. Show the quadtree corresponding to your segmentation. [Image for Problem 10.42: an N × N image; graphic not reproduced here.]

10.43 Consider the region of 1's resulting from the segmentation of the sparse regions in the image of the Cygnus Loop in Example 10.21. Propose a technique for using this region as a mask to isolate the three main components of the image: (1) background; (2) dense inner region; and (3) sparse outer region.

10.44 Let the pixels in the first row of a 3 × 3 image, like the one in Fig. 10.53(a), be labeled as 1, 2, 3, and the pixels in the second and third rows be labeled as 4, 5, 6 and 7, 8, 9, respectively. Let the intensities of these pixels be [90, 80, 30; 70, 5, 20; 80, 20, 30] where, for example, the intensity of pixel 2 is 80 and of pixel 4 it is 70. Compute the weights for the edges of the graph in Fig. 10.53(c), using the formula

$$ w(i, j) = 30\left[\frac{1}{\left|\, I(n_i) - I(n_j) \,\right| + c}\right] $$

explained in the text in connection with that figure (we scaled the formula by 30 to make the numerical results easier to interpret). Let c = 0 in this case.
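The weight computation in Problem 10.44 is mechanical enough to script. The sketch below (function and variable names are ours) evaluates the scaled formula; note that the true edge set is defined by the graph in Fig. 10.53(c), which is not reproduced here, so we assume 4-adjacency purely for illustration.

```python
import numpy as np

# Intensities from Problem 10.44; pixels are labeled 1..9 row by row.
I = np.array([[90, 80, 30],
              [70,  5, 20],
              [80, 20, 30]], dtype=float)

def edge_weight(a, b, c=0.0):
    """w(i, j) = 30 / (|I(n_i) - I(n_j)| + c), the scaled weight
    from Problem 10.44."""
    return 30.0 / (abs(a - b) + c)

rows, cols = I.shape
for r in range(rows):
    for q in range(cols):
        for dr, dq in ((0, 1), (1, 0)):       # right and down neighbors
            r2, q2 = r + dr, q + dq
            if r2 < rows and q2 < cols:
                i = r * cols + q + 1          # label of first pixel
                j = r2 * cols + q2 + 1        # label of its neighbor
                print(f"w({i},{j}) = {edge_weight(I[r, q], I[r2, q2]):.2f}")
```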
10.45 * Show how Eqs. (10-106) through (10-108) follow from Eq. (10-105).

10.46 Demonstrate the validity of Eq. (10-102).

10.47 Refer to the discussion in Section 10.7.

(a) * Show that the elements of $C_n(M_i)$ and $T[n]$ are never replaced during execution of the watershed segmentation algorithm.

(b) Show that the number of elements in sets $C_n(M_i)$ and $T[n]$ either increases or remains the same as n increases.

10.48 You saw in Section 10.7 that the boundaries obtained using the watershed segmentation algorithm form closed loops (for example, see Figs. 10.59 and 10.61). Advance an argument that establishes whether or not closed boundaries always result from application of this algorithm.

10.49 * Give a step-by-step implementation of the dam-building procedure for the one-dimensional intensity cross section shown below. Show a drawing of the cross section at each step, showing "water" levels and dams constructed. [Cross section for Problem 10.49: a 1-D intensity profile with values on a 0–7 scale, plotted over x = 1, 2, …, 15; graphic not reproduced here.]

10.50 What would the negative ADI image shown in Fig. 10.62(c) look like if we tested against T (instead of testing against −T) in Eq. (10-117)?

10.51 Are the following statements true or false? Explain the reason for your answer in each.

(a) * The nonzero entries in the absolute ADI continue to grow in dimension, provided that the object is moving.

(b) The nonzero entries in the positive ADI always occupy the same area, regardless of the motion undergone by the object.

(c) The nonzero entries in the negative ADI continue to grow in dimension, provided that the object is moving.

10.52 Suppose that in Example 10.29 motion along the x-axis is set to zero. The object now moves only along the y-axis at 1 pixel per frame for 32 frames, and then (instantaneously) reverses direction and moves in exactly the opposite direction for another 32 frames. What would Figs. 10.66(a) and (b) look like under these conditions?

10.53 * Advance an argument that demonstrates that when the signs of $S_{1x}$ and $S_{2x}$ in Eqs. (10-125) and (10-126) are the same, velocity component $V_1$ is positive.

10.54 An automated pharmaceutical plant uses image processing to measure the shapes of medication tablets for the purpose of quality control. The segmentation stage of the system is based on Otsu's method. The speed of the inspection lines is so high that a very high rate of flash illumination is required to "stop" motion. When new, the illumination lamps project a uniform pattern of light. However, as the lamps age, the illumination pattern deteriorates as a function of time and spatial coordinates, according to the equation

$$ i(x, y) = A(t) - t^{2}\, e^{-\left[(x - M/2)^{2} + (y - N/2)^{2}\right]} $$

where (M/2, N/2) is the center of the viewing area and t is time measured in increments of months. The lamps are still experimental, and the behavior of A(t) is not fully understood by the manufacturer. All that is known is that, during the life of the lamps, A(t) is always greater than the negative component in the preceding equation, because illumination cannot be negative. It has been observed that Otsu's algorithm works well when the lamps are new and their pattern of illumination is nearly constant over the entire image. However, segmentation performance deteriorates with time. Being experimental, the lamps are exceptionally expensive, so you are employed as a consultant to help solve the problem using digital image processing techniques to compensate for the changes in illumination, and thus extend the useful life of the lamps. You are given flexibility to install any special markers or other visual cues in the viewing area of the imaging cameras. Propose a solution in sufficient detail that the engineering plant manager can understand your approach. (Hint: Review the image model discussed in Section 2.3 and consider using one or more targets of known reflectivity.)

10.55 The speed of a bullet in flight is to be estimated by using high-speed imaging techniques. The method of choice involves the use of a CCD camera and a flash that exposes the scene for K seconds. The bullet is 2.5 cm long, 1 cm wide, and its range of speed is 750 ± 250 m/s. The camera optics produce an image in which the bullet occupies 10% of the horizontal resolution of a 256 × 256 digital image.

(a) * Determine the maximum value of K that will guarantee that the blur from motion does not exceed 1 pixel.

(b) Determine the minimum number of frames per second that would have to be acquired in order to guarantee that at least two complete images of the bullet are obtained during its path through the field of view of the camera.

(c) * Propose a segmentation procedure for automatically extracting the bullet from a sequence of frames.

(d) Propose a method for automatically determining the speed of the bullet.
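A back-of-the-envelope setup for part (a) of Problem 10.55 (our sketch, not the book's solution): the 2.5 cm bullet spans 10% of 256 pixels, about 25.6 pixels, so one pixel corresponds to roughly $2.5 / 25.6 \approx 0.098$ cm. At the maximum speed of 1000 m/s, keeping motion blur below one pixel requires

$$ K \le \frac{0.98 \times 10^{-3}\ \text{m}}{1000\ \text{m/s}} \approx 10^{-6}\ \text{s}, $$

that is, an exposure on the order of one microsecond.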
11 Feature Extraction

11.1 BACKGROUND
Therefore, area is an invariant feature descriptor with respect to the given family of transformations. However, if we add the affine transformation scaling to the family, the descriptor area ceases to be invariant with respect to the extended family. The descriptor is now covariant with respect to the family, because scaling the area of the region by any factor scales the value of the descriptor by the same factor. Similarly, the descriptor direction (of the principal axis of the region) is covariant, because rotating the region by any angle has the same effect on the value of the descriptor. Most of the feature descriptors we use in this chapter are covariant in general, in the sense that they may be invariant to some transformations of interest, but not to others that may be equally important. As you will see shortly, it is good practice to normalize as many relevant invariances as possible out of covariances. For instance, we can compensate for changes in the direction of a region by computing its actual direction and rotating the region so that its principal axis points in a predefined direction. If we do this for every region detected in an image, rotation will cease to be covariant.

Another major classification of features is local vs. global. You are likely to see many different attempts to classify features as belonging to one of these two categories. What makes this difficult is that a feature may belong to both, depending on the application. For example, consider the descriptor area again, and suppose that we are applying it to the task of inspecting the degree to which bottles moving past an imaging sensor on a production line are full of liquid. The sensor and its accompanying software are capable of generating images of ten bottles at once, in which liquid in each bottle appears as a bright region and the rest of the image appears as dark background. The area of a region in this fixed geometry is directly proportional to the amount of liquid in a bottle and, if detected and measured reliably, area is the only feature we need to solve the inspection problem. Each image has ten regions, so we consider area to be a local feature, in the sense that it is applicable to individual elements (regions) of an image. If the problem were to detect the total amount (area) of liquid in an image, we would now consider area to be a global descriptor. But the story does not end there. Suppose that the liquid inspection task is redefined so that it calculates the entire amount of liquid per day passing by the imaging station. We no longer care about the area of individual regions per se. Our units now are images. If we know the total area in an image, and we know the number of images, calculating the total amount of liquid in a day is trivial. Now the area of an entire image is a local feature, and the area of the total at the end of the day is global. Obviously, we could redefine the task so that the area at the end of a day becomes a local feature descriptor, and the area for all assembly lines becomes a global measure. And so on, endlessly. In this chapter, we call a feature local if it applies to a member of a set, and global if it applies to the entire set, where "member" and "set" are determined by the application.

Features by themselves are seldom generated for human consumption, except in applications such as interactive image processing, topics that are not in the mainstream of this book. In fact, as you will see later, some feature extraction methods generate tens, hundreds, or even thousands of descriptor values that would appear meaningless if examined visually. Instead, feature description typically is used as a preprocessing step for higher-level tasks, such as image registration, object recognition for automated inspection, searching for patterns (e.g., individual faces and/or fingerprints) in image databases, and autonomous applications, such as robot and vehicle navigation. For these applications, numerical features usually are "packaged" in the form of a feature vector (i.e., a 1 × n or n × 1 matrix) whose elements are the descriptors. An RGB image is one of the simplest examples. As you know from Chapter 6, each pixel of an RGB image can be expressed as a 3-D vector,

$$ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} $$

in which $x_1$ is the intensity value of the red image at a point, and the other components are the intensity values of the green and blue images at the same point. If color is used as a feature, then a region in an RGB image would be represented as a set of feature vectors (points) in 3-D space. When n descriptors are used, feature vectors become n-dimensional, and the space containing them is referred to as an n-dimensional feature space. You may "visualize" a set of n-dimensional feature vectors as a "hypercloud" of points in n-dimensional Euclidean space.
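To make the feature-vector idea concrete, the following NumPy sketch (the array names rgb, mask, and features are ours) collects the RGB values of a region into a set of 3-D feature vectors, i.e., a point cloud in 3-D feature space:

```python
import numpy as np

# rgb: an H x W x 3 RGB image; mask: an H x W boolean array that is
# True at the pixels belonging to the region of interest.
rgb = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True

# Each masked pixel becomes one feature vector x = (x1, x2, x3): its
# red, green, and blue intensities. The region is then a cloud of
# points in 3-D feature space.
features = rgb[mask].astype(np.float64)    # shape: (num_pixels, 3)
print(features.shape)                      # -> (1024, 3)
```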
In this chapter, we group features into three principal categories: boundary, region, and whole-image features. This subdivision is not based on the applicability of the methods we are about to discuss; rather, it is based on the fact that some categories make more sense than others when considered in the context of what is being described. For example, it is implied that when we refer to the "length of a boundary" we are referring to the "length of the boundary of a region," but it makes no sense to refer to the "length" of an image. It will become clear that many of the features we will be discussing are applicable to boundaries and regions, and some apply to whole images as well.

11.2 BOUNDARY PREPROCESSING

The segmentation techniques discussed in the previous two chapters yield raw data in the form of pixels along a boundary or pixels contained in a region. It is standard practice to use schemes that compact the segmented data into representations that facilitate the computation of descriptors. In this section, we discuss various boundary preprocessing approaches suitable for this purpose.

BOUNDARY FOLLOWING (TRACING)

(You will find it helpful to review the discussion in Section 2.5 on neighborhoods, adjacency, and connectivity, and the discussion in Section 9.6 dealing with connected components.)

Several of the algorithms discussed in this chapter require that the points in the boundary of a region be ordered in a clockwise or counterclockwise direction. Consequently, we begin our discussion by introducing a boundary-following algorithm whose output is an ordered sequence of points. We assume (1) that we are working with binary images in which object and background points are labeled 1 and 0, respectively; and (2) that images are padded with a border of 0's to eliminate the possibility of an object merging with the image border. For clarity, we limit the discussion to single regions. The approach is extended to multiple, disjoint regions by processing the regions individually.
FIGURE 11.1 Illustration of the first few steps in the boundary-following algorithm. The point to be processed next is labeled in bold, black; the points yet to be processed are gray; and the points found by the algorithm are shaded. Squares without labels are considered background (0) values. [Panels (a)–(f) not reproduced here.]

FIGURE 11.2 Examples of boundaries that can be processed by the boundary-following algorithm. (a) Closed boundary with a branch. (b) Self-intersecting boundary. (c) Multiple boundaries (processed one at a time). [Graphic not reproduced here.]
The following algorithm traces the boundary of a 1-valued region, R, in a binary image. (See Section 2.5 for the definition of 4-neighbors, 8-neighbors, and m-neighbors of a point.)

1. Let the starting point, b0, be the uppermost-leftmost point† in the image that is labeled 1. Denote by c0 the west neighbor of b0 [see Fig. 11.1(b)]. Clearly, c0 is always a background point. Examine the 8-neighbors of b0, starting at c0 and proceeding in a clockwise direction. Let b1 denote the first neighbor encountered whose value is 1, and let c1 be the (background) point immediately preceding b1 in the sequence. Store the location of b0 for use in Step 5.
2. Let b = b0 and c = c0.
3. Let the 8-neighbors of b, starting at c and proceeding in a clockwise direction, be denoted by n1, n2, …, n8. Find the first neighbor labeled 1 and denote it by nk.
4. Let b = nk and c = nk−1.
5. Repeat Steps 3 and 4 until b = b0. The sequence of b points found when the algorithm stops is the set of ordered boundary points.

Note that c in Step 4 is always a background point, because nk is the first 1-valued point found in the clockwise scan. This algorithm is referred to as the Moore boundary tracing algorithm, after Edward F. Moore, a pioneer in cellular automata theory.

Figure 11.1 illustrates the first few steps of the algorithm. It is easily verified (see Problem 11.1) that continuing with this procedure will yield the correct boundary, shown in Fig. 11.1(f), whose points are ordered in a clockwise sequence. The algorithm works equally well with more complex boundaries, such as the boundary with an attached branch in Fig. 11.2(a) or the self-intersecting boundary in Fig. 11.2(b). Multiple boundaries [Fig. 11.2(c)] are handled by processing one boundary at a time.

If we start with a binary region instead of a boundary, the algorithm extracts the outer boundary of the region. Typically, the resulting boundary will be one pixel thick, but not always [see Problem 11.1(b)]. If the objective is to find the boundaries of holes in a region (these are called the inner or interior boundaries of the region), a straightforward approach is to extract the holes (see Section 9.6) and treat them as 1-valued regions on a background of 0's. Applying the boundary-following algorithm to these regions will yield the inner boundaries of the original region.

We could have stated the algorithm just as easily based on following a boundary in the counterclockwise direction, but you will find it easier to have just one algorithm and then reverse the order of the result to obtain a sequence in the opposite direction. We use both directions interchangeably (but consistently) in the following sections to help you become familiar with both approaches.

† As you will see later in this chapter and in Problem 11.8, the uppermost-leftmost point in a 1-valued boundary has the important property that a polygonal approximation to the boundary has a convex vertex at that location. Also, the left and north neighbors of the point are guaranteed to be background points. These properties make it a good "standard" point at which to start boundary-following algorithms.
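A minimal NumPy implementation of Steps 1 through 5 is sketched below. The function name moore_trace and the offset table are ours, and the code assumes, as the text does, a 0-padded binary image containing a single 1-valued region (of more than one pixel):

```python
import numpy as np

# Clockwise 8-neighbor offsets starting from the west neighbor:
# W, NW, N, NE, E, SE, S, SW as (row, col) displacements.
OFFSETS = [(0, -1), (-1, -1), (-1, 0), (-1, 1),
           (0, 1), (1, 1), (1, 0), (1, -1)]

def moore_trace(img):
    """Trace the outer boundary of a single 1-valued region in a
    0-padded binary image; returns the boundary points in clockwise
    order (Steps 1-5 above)."""
    # Step 1: b0 is the uppermost-leftmost 1-valued point; c0 is its
    # west neighbor, which is guaranteed to be background.
    rows, cols = np.nonzero(img)
    idx = np.lexsort((cols, rows))[0]          # sort by row, then column
    b0 = (rows[idx], cols[idx])
    b, c = b0, (b0[0], b0[1] - 1)              # Step 2
    boundary = [b0]
    while True:
        # Step 3: scan the 8-neighbors of b clockwise, starting at c.
        start = OFFSETS.index((c[0] - b[0], c[1] - b[1]))
        for k in range(1, 9):
            dr, dc = OFFSETS[(start + k) % 8]
            n = (b[0] + dr, b[1] + dc)
            if img[n] == 1:
                # Step 4: the neighbor preceding n_k becomes the new c.
                pr, pc = OFFSETS[(start + k - 1) % 8]
                c, b = (b[0] + pr, b[1] + pc), n
                break
        if b == b0:                            # Step 5: back at the start
            return boundary
        boundary.append(b)

# Example: trace the boundary of a 4 x 4 square of 1's.
img = np.zeros((6, 6), dtype=np.uint8)
img[1:5, 1:5] = 1
print(moore_trace(img))
```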
                                                                                                                                                                                                      Typically, a chain code representation is based on 4- or 8-connectivity of the seg-
                                      Note that c in Step 4 is always a background point because nk is the first 1-valued                                                                             ments. The direction of each segment is coded by using a numbering scheme, as in Fig.
                                      point found in the clockwise scan. This algorithm is referred to as the Moore bound-                                                                            11.3. A boundary code formed as a sequence of such directional numbers is referred
                                      ary tracing algorithm after Edward F. Moore, a pioneer in cellular automata theory.                                                                             to as a Freeman chain code.
                                         Figure 11.1 illustrates the first few steps of the algorithm. It is easily verified (see                                                                        Digital images usually are acquired and processed in a grid format with equal
                                      Problem 11.1) that continuing with this procedure will yield the correct boundary,                                                                              spacing in the x- and y-directions, so a chain code could be generated by following a
                                      shown in Fig. 11.1(f), whose points are ordered in a clockwise sequence. The algo-                                                                              boundary in, say, a clockwise direction and assigning a direction to the segments con-
                                      rithm works equally well with more complex boundaries, such as the boundary with                                                                                necting every pair of pixels. This level of detail generally is not used for two principal
                                      an attached branch in Fig. 11.2(a) or the self-intersecting boundary in Fig. 11.2(b).                                                                           reasons: (1) The resulting chain would be quite long and (2) any small disturbances
                                      Multiple boundaries [Fig. 11.2(c)] are handled by processing one boundary at a time.                                                                            along the boundary due to noise or imperfect segmentation would cause changes
                                         If we start with a binary region instead of a boundary, the algorithm extracts the                                                                           in the code that may not be related to the principal shape features of the boundary.
                                      outer boundary of the region. Typically, the resulting boundary will be one pixel                                                                                  An approach used to address these problems is to resample the boundary by
                                      thick, but not always [see Problem 11.1(b)]. If the objective is to find the boundaries                                                                         selecting a larger grid spacing, as in Fig. 11.4(a). Then, as the boundary is traversed, a
                                      of holes in a region (these are called the inner or interior boundaries of the region),                                                                         boundary point is assigned to a node of the coarser grid, depending on the proximity
                                      †
                                                                                                                                                                                                      of the original boundary point to that node, as in Fig. 11.4(b). The resampled bound-
                                       As you will see later in this chapter and in Problem 11.8, the uppermost-leftmost point in a 1-valued boundary
                                      has the important property that a polygonal approximation to the boundary has a convex vertex at that location.
                                                                                                                                                                                                      ary obtained in this way can be represented by a 4- or 8-code. Figure 11.4(c) shows
                                      Also, the left and north neighbors of the point are guaranteed to be background points. These properties make                                                   the coarser boundary points represented by an 8-directional chain code. It is a simple
                                      it a good “standard” point at which to start boundary-following algorithms.                                                                                     matter to convert from an 8-code to a 4-code and vice versa (see Problems 2.15, 9.27,
and 9.29). For the same reason mentioned when discussing boundary tracing earlier in this section, we chose the starting point in Fig. 11.4(c) as the uppermost-leftmost point of the boundary, which gives the chain code 0766…1212. As you might suspect, the spacing of the resampling grid is determined by the application in which the chain code is used.

FIGURE 11.3 Direction numbers for (a) 4-directional chain code, and (b) 8-directional chain code. [Graphic not reproduced here.]

FIGURE 11.4 (a) Digital boundary with resampling grid superimposed. (b) Result of resampling. (c) 8-directional chain-coded boundary. [Graphic not reproduced here.]

If the sampling grid used to obtain a connected digital curve is a uniform quadrilateral (see Fig. 2.19), all points of a Freeman code based on Fig. 11.3 are guaranteed to coincide with the points of the curve. The same is true if a digital curve is subsampled using the same type of sampling grid, as in Fig. 11.4(b). This is because the samples of curves produced using such grids have the same arrangement as in Fig. 11.3, so all points are reachable as we traverse a curve from one point to the next to generate the code.

The numerical value of a chain code depends on the starting point. However, the code can be normalized with respect to the starting point by a straightforward procedure: we simply treat the chain code as a circular sequence of direction numbers and redefine the starting point so that the resulting sequence of numbers forms an integer of minimum magnitude. We can normalize also for rotation (in angles that are integer multiples of the directions in Fig. 11.3) by using the first difference of the chain code instead of the code itself. This difference is obtained by counting the number of direction changes (in a counterclockwise direction in Fig. 11.3) that separate two adjacent elements of the code. If we treat the code as a circular sequence to normalize it with respect to the starting point, then the first element of the difference is computed by using the transition between the last and first components of the chain. For instance, the first difference of the 4-directional chain code 10103322 is 3133030. Size normalization can be achieved by altering the spacing of the resampling grid.

The normalizations just discussed are exact only if the boundaries themselves are invariant to rotation (again, in angles that are integer multiples of the directions in Fig. 11.3) and scale change, which seldom is the case in practice. For instance, the same object digitized in two different orientations will have different boundary shapes in general, with the degree of dissimilarity being proportional to image resolution. This effect can be reduced by selecting chain elements that are long in proportion to the distance between pixels in the digitized image, and/or by orienting the resampling grid along the principal axes of the object to be coded, as discussed in Section 11.3, or along its eigen axes, as discussed in Section 11.5.

EXAMPLE 11.1: Freeman chain code and some of its variations.

Figure 11.5(a) shows a 570 × 570-pixel, 8-bit gray-scale image of a circular stroke embedded in small, randomly distributed specular fragments. The objective of this example is to obtain a Freeman chain code, the corresponding integer of minimum magnitude, and the first difference of the outer boundary of the stroke. Because the object of interest is embedded in small fragments, extracting its boundary would result in a noisy curve that would not be descriptive of the general shape of the object. As you know, smoothing is a routine process when working with noisy boundaries. Figure 11.5(b) shows the original image smoothed using a box kernel of size 9 × 9 pixels (see Section 3.5 for a discussion of spatial smoothing), and Fig. 11.5(c) is the result of thresholding this image with a global threshold obtained using Otsu's method. Note that the number of regions has been reduced to two (one of which is a dot), significantly simplifying the problem.

Figure 11.5(d) is the outer boundary of the region in Fig. 11.5(c). Obtaining the chain code of this boundary directly would result in a long sequence with small variations that are not representative of the global shape of the boundary, so we resample it before obtaining its chain code. This reduces insignificant variability. Figure 11.5(e) is the result of using a resampling grid with nodes 50 pixels apart (approximately 10% of the image width), and Fig. 11.5(f) is the result of joining the sample points by straight lines. This simpler approximation retained the principal features of the original boundary.

The 8-directional Freeman chain code of the simplified boundary is

00006066666666444444242222202202

The starting point of the boundary is at coordinates (2, 5) in the subsampled grid (remember from Fig. 2.19 that the origin of an image is at its top left). This is the uppermost-leftmost point in Fig. 11.5(f). The integer of minimum magnitude of the code happens in this case to be the same as the chain code:

00006066666666444444242222202202

The first difference of the code is

00062600000006000006260000620626

Using this code to represent the boundary results in a significant reduction in the amount of data needed to store the boundary. In addition, working with code numbers offers a unified way to analyze the shape of a boundary, as we discuss in Section 11.3. Finally, keep in mind that the subsampled boundary can be recovered from any of the preceding codes.
                                                                                                                                                                                Figure 11.6 illustrates how an SCC is generated. The first step is to select the
                                                                                                                                                                             length of the line segment to use in generating the code [see Fig. 11.6(b)]. Next, a
                                                                                                                                                                             starting point (the origin) is specified (for an open curve, the logical starting point is
                                                                                                                                                                             one of its end points). As Fig. 11.6(c) shows, once the origin has been selected, one
                                                                                                                                                                             end of a line segment is placed at the origin and the other end of the segment is set
                                                                                                                                                                             to coincide with the curve. This point becomes the starting point of the next line seg-
                                                                                                                                                                             ment, and we repeat this procedure until the starting point (or end point in the case
                                                                                                                                                                             of an open curve) is reached. As the figure illustrates, you can think of this process as
                                                                                                                                                                             a sequence of identical circles (with radius equal to the length of the line segment)
                                                                                                                                                                             traversing the curve. The intersections of the circles and the curve determine the
                                                                                                                                                                             nodes of the straight-line approximation to the curve.
                                                                                                                                                                                Once the intersections of the circles are known, we determine the slope changes
                                                                                                                                                                             between contiguous line segments. Positive and zero slope changes are normalized
to the half-open interval [0, 1), while negative slope changes are normalized to the
                                                                                                                                                                             open interval (−1, 0). Not allowing slope changes of ±1 eliminates the implementa-
                                                                                                                                                                             tion issues that result from having to deal with the fact that such changes result in
                                                                                                                                                                             the same line segment with opposite directions.
                                                                                                                                                                                The sequence of slope changes is the chain that defines the SCC approximation
                                                                                                                                                                             to the original curve. For example, the code for the curve in Fig. 11.6(e) is 0.12, 0.20,
                                                                                                                                                                             0.21, 0.11, −0.11, −0.12, −0.21, −0.22, −0.24, −0.28, −0.28, −0.31, −0.30. The accu-
racy of the slope changes defined in Fig. 11.6(d) is 10⁻², resulting in an “alphabet” of 199 possible symbols (slope changes). The accuracy can be changed, of course. For instance, an accuracy of 10⁻¹ produces an alphabet of 19 symbols (see Problem 11.6).
FIGURE 11.5 (a) Noisy image of size 570 × 570 pixels. (b) Image smoothed with a 9 × 9 box kernel. (c) Smoothed image, thresholded using Otsu’s method. (d) Longest outer boundary of (c). (e) Subsampled boundary (the points are shown enlarged for clarity). (f) Connected points from (e).

Unlike a Freeman code, there is no guarantee that the last point of the coded curve will coincide with the last point of the curve itself. However, shortening the line
                                      length and/or increasing angle resolution often resolves the problem, because the
                                      results of computations are rounded to the nearest integer (remember we work with
                                      integer coordinates).
                                         The inverse of an SCC is another chain of the same length, obtained by reversing
                                      the order of the symbols and their signs. The mirror image of a chain is obtained by
                                      starting at the origin and reversing the signs of the symbols. Finally, we point out
                                      that the preceding discussion is directly applicable to closed curves. Curve following
                                      would start at an arbitrary point (for example, the uppermost-leftmost point of the
                                      curve) and proceed in a clockwise or counterclockwise direction, stopping when the
starting point is reached. We will illustrate a use of SCCs in Example 11.6.
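As an illustration of the procedure just described, the following sketch generates the slope changes of an SCC. It assumes the curve is available as a dense (N, 2) array of points, and a linear search stands in for the exact circle-curve intersection; the function name is mine:

```python
import numpy as np

def slope_chain_code(curve, seg_len):
    """SCC sketch: march chords of fixed length seg_len along a densely
    sampled open curve and return the normalized slope changes between
    successive chords."""
    pts = np.asarray(curve, dtype=float)
    nodes = [pts[0]]
    j = 0
    while True:
        # Advance to the first sample at distance >= seg_len from the
        # current node (stand-in for the circle-curve intersection).
        while j < len(pts) and np.hypot(*(pts[j] - nodes[-1])) < seg_len:
            j += 1
        if j >= len(pts):
            break
        nodes.append(pts[j])
    chords = np.diff(np.array(nodes), axis=0)
    turns = np.diff(np.arctan2(chords[:, 1], chords[:, 0]))
    turns = (turns + np.pi) % (2 * np.pi) - np.pi   # wrap to (-pi, pi]
    return turns / np.pi    # fractions of 180 degrees, in (-1, 1]
```

The tortuosity descriptor of Eq. (11-6), used in Example 11.6 later in this chapter, is then just the sum of the absolute values of the returned chain.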
Recall that VL denotes the last MPP vertex found, WC and BC are the current white (convex) and black (mirrored concave) crawler vertices, and sgn(a, b, c) is the determinant whose sign indicates on which side of the directed line through a and b the point c lies. For the vertex Vk currently being examined, exactly one of the following holds:

(a) Vk is on the positive side of the line through the pair of points (VL, WC), in which case sgn(VL, WC, Vk) > 0.

(b) Vk is on the negative side of the line through the pair (VL, WC) or is collinear with it; that is, sgn(VL, WC, Vk) ≤ 0. Simultaneously, Vk lies to the positive side of the line through (VL, BC) or is collinear with it; that is, sgn(VL, BC, Vk) ≥ 0.

(c) Vk is on the negative side of the line through the pair (VL, BC), in which case sgn(VL, BC, Vk) < 0.

If condition (a) holds, the next MPP vertex is WC, and we let VL = WC; then we reinitialize the algorithm by setting WC = BC = VL, and start with the next vertex after the newly changed VL.

If condition (b) holds, Vk becomes a candidate MPP vertex. In this case, we set WC = Vk if Vk is convex (i.e., it is a W vertex); otherwise we set BC = Vk. We then continue with the next vertex in the list.

If condition (c) holds, the next MPP vertex is BC and we let VL = BC; then we reinitialize the algorithm by setting WC = BC = VL and start with the next vertex after the newly changed VL.

The algorithm stops when it reaches the first vertex again, and thus has processed all the vertices in the polygon. The VL vertices found by the algorithm are the vertices of the MPP. Klette and Rosenfeld [2004] have proved that this algorithm finds all the MPP vertices of a polygon enclosed by a simply connected cellular complex.

EXAMPLE 11.2 : A numerical example showing the details of how the MPP algorithm works.

A simple example in which we can follow the algorithm step-by-step will help clarify the preceding concepts. Consider the vertices in Fig. 11.8(c). In our image coordinate system, the top-left point of the grid is at coordinates (0, 0). Assuming unit grid spacing, the first few (counterclockwise) vertices are:

V0 (1, 4) W | V1 (2, 3) B | V2 (3, 3) W | V3 (3, 2) B | V4 (4, 1) W | V5 (7, 1) W | V6 (8, 2) B | V7 (9, 2) B

where the triplets are separated by vertical lines, and the B vertices are mirrored, as required by the algorithm.

The uppermost-leftmost vertex is always the first vertex of the MPP, so we start by letting VL and V0 be equal, VL = V0 = (1, 4), and initializing the other variables: WC = BC = VL = (1, 4).

The next vertex is V1 = (2, 3). In this case we have sgn(VL, WC, V1) = 0 and sgn(VL, BC, V1) = 0, so condition (b) holds. Because V1 is a B (concave) vertex, we update the black crawler: BC = V1 = (2, 3). At this stage, we have VL = (1, 4), WC = (1, 4), and BC = (2, 3).

Next, we look at V2 = (3, 3). In this case, sgn(VL, WC, V2) = 0, and sgn(VL, BC, V2) = 1, so condition (b) holds. Because V2 is W, we update the white crawler: WC = (3, 3).

The next vertex is V3 = (3, 2). At this juncture we have VL = (1, 4), WC = (3, 3), and BC = (2, 3). Then, sgn(VL, WC, V3) = −2 and sgn(VL, BC, V3) = 0, so condition (b) holds again. Because V3 is B, we let BC = V3 = (3, 2) and look at the next vertex.

The next vertex is V4 = (4, 1). We are working with VL = (1, 4), WC = (3, 3), and BC = (3, 2). The values of sgn are sgn(VL, WC, V4) = −3 and sgn(VL, BC, V4) = 0. So, condition (b) holds yet again, and we let WC = V4 = (4, 1) because V4 is a W vertex.

The next vertex is V5 = (7, 1). Using the values from the previous step we obtain sgn(VL, WC, V5) = 9, so condition (a) is satisfied. Therefore, we let VL = WC = (4, 1) (this is V4) and reinitialize: BC = WC = VL = (4, 1). Note that once we knew that sgn(VL, WC, V5) > 0 we did not bother to compute the other sgn expression. Also, reinitialization means that we start fresh again by examining the next vertex following the newly found MPP vertex. In this case, that next vertex is V5, so we visit it again.

With V5 = (7, 1), and using the new values of VL, WC, and BC, it follows that sgn(VL, WC, V5) = 0 and sgn(VL, BC, V5) = 0, so condition (b) holds. Therefore, we let WC = V5 = (7, 1) because V5 is a W vertex.

The next vertex is V6 = (8, 2) and sgn(VL, WC, V6) = 3, so condition (a) holds. Thus, we let VL = WC = (7, 1) and reinitialize the algorithm by setting WC = BC = VL.

Because the algorithm was reinitialized at V5, the next vertex is V6 = (8, 2) again. Using the results from the previous step gives us sgn(VL, WC, V6) = 0 and sgn(VL, BC, V6) = 0, so condition (b) holds this time. Because V6 is B we let BC = V6 = (8, 2).

Summarizing, we have found three vertices of the MPP up to this point: V0 = (1, 4), V4 = (4, 1), and V5 = (7, 1). Continuing as above with the remaining vertices results in the MPP vertices in Fig. 11.8(c) (see Problem 11.9). The mirrored B vertices at (2, 3), (3, 2), and on the lower-right side at (13, 10), are on the boundary of the MPP. However, they are collinear and thus are not considered vertices of the MPP. Appropriately, the algorithm did not detect them as such.

EXAMPLE 11.3 : Applying the MPP algorithm.

Figure 11.9(a) is a 566 × 566 binary image of a maple leaf, and Fig. 11.9(b) is its 8-connected boundary. The sequence in Figs. 11.9(c) through (h) shows MPP representations of this boundary using square cellular complex cells of sizes 2, 4, 6, 8, 16, and 32, respectively (the vertices in each figure were connected with straight lines to form a closed boundary). The leaf has two major features: a stem and three main lobes. The stem begins to be lost for cell sizes greater than 4 × 4, as Fig. 11.9(e) shows. The three main lobes are preserved reasonably well, even for a cell size of 16 × 16, as Fig. 11.9(g) shows. However, we see in Fig. 11.9(h) that by the time the cell size is increased to 32 × 32, this distinctive feature has been nearly lost.

The number of points in the original boundary [Fig. 11.9(b)] is 1900. The numbers of vertices in Figs. 11.9(c) through (h) are 206, 127, 92, 66, 32, and 13, respectively. Figure 11.9(d), which has 127 vertices, retained all the major features of the original boundary while achieving a data reduction of over 90%. So here we see a significant advantage of MPPs for representing a boundary. Another important advantage is that MPPs perform boundary smoothing. As explained in the previous section, this is a usual requirement when representing a boundary by a chain code.
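The decision logic above reduces to a sign-of-determinant test plus the three conditions. A minimal sketch (function names are mine; vertices are (x, y) pairs from the mirrored W/B list):

```python
def sgn(a, b, c):
    """Determinant of [[ax, ay, 1], [bx, by, 1], [cx, cy, 1]]: positive when
    c lies on the positive side of the line through a and b."""
    return (a[0] * (b[1] - c[1]) - a[1] * (b[0] - c[0])
            + (b[0] * c[1] - b[1] * c[0]))

def mpp_step(VL, WC, BC, vk, is_W):
    """One vertex decision. Returns (VL, WC, BC, advance, new_mpp_vertex);
    advance is False when the algorithm reinitializes and must revisit vk."""
    if sgn(VL, WC, vk) > 0:            # condition (a): WC is the next MPP vertex
        VL = WC
        return VL, VL, VL, False, VL
    if sgn(VL, BC, vk) >= 0:           # condition (b): vk is a candidate
        if is_W:
            WC = vk                    # convex (W) vertex moves the white crawler
        else:
            BC = vk                    # mirrored concave (B) vertex moves the black crawler
        return VL, WC, BC, True, None
    VL = BC                            # condition (c): BC is the next MPP vertex
    return VL, VL, VL, False, VL
```

Stepping the mirrored vertex list of Example 11.2 through mpp_step reproduces the updates shown there; for instance, at V5 the test sgn(VL, WC, V5) = 9 > 0 fires condition (a) and emits the MPP vertex (4, 1).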
SIGNATURES

A signature is a 1-D functional representation of a 2-D boundary and may be generated in various ways. One of the simplest is to plot the distance from the centroid to the boundary as a function of angle, as illustrated in Fig. 11.10. The basic idea of using signatures is to reduce the boundary representation to a 1-D function that presumably is easier to describe than the original 2-D boundary.

Based on the assumptions of uniformity in scaling with respect to both axes, and that sampling is taken at equal intervals of θ, changes in the size of a shape result in changes in the amplitude values of the corresponding signature. One way to normalize for this dependence is to scale all functions so that they span the same range of values, e.g., [0, 1].
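A sketch of the centroid-distance signature just described, assuming the boundary is available as an (N, 2) array of (x, y) points, sampled at equal angular increments and scaled to [0, 1] for size normalization (the function name is mine):

```python
import numpy as np

def signature(boundary, n_angles=360):
    """r(theta): distance from the centroid to the boundary as a function
    of angle, sampled at n_angles equal increments and scaled to [0, 1]."""
    xy = np.asarray(boundary, dtype=float)
    d = xy - xy.mean(axis=0)                       # offsets from the centroid
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    r = np.hypot(d[:, 0], d[:, 1])
    bins = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    gap = np.abs(theta[None, :] - bins[:, None])   # angular gap to each bin
    gap = np.minimum(gap, 2 * np.pi - gap)         # wrap around 2*pi
    sig = r[gap.argmin(axis=1)]                    # nearest boundary point per angle
    return sig / sig.max()
```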
FIGURE 11.10 Distance-versus-angle signatures. In (a), r(θ) is constant. In (b), the signature consists of repetitions of the pattern r(θ) = A sec θ for 0 ≤ θ ≤ π/4, and r(θ) = A csc θ for π/4 < θ ≤ π/2.
FIGURE 11.11 (a) and (d) Two binary regions, (b) and (e) their external boundaries, and (c) and (f) their corresponding r(θ) signatures. The horizontal axes in (c) and (f) correspond to angles from 0° to 360°, in increments of 1°.

FIGURE 11.12 Medial axes (dashed) of three simple regions.

The medial axis transform (MAT) of a region R with border B is as follows: for each point p in R, we find its closest neighbor in B; if p has more than one such neighbor, p is said to belong to the medial axis of R. The concept of “closest” (and thus the resulting MAT) depends on the definition of a distance metric (see Section 2.5). Figure 11.12 shows some examples using the Euclidean distance. If the Euclidean distance is used, the resulting skeleton is the same as what would be obtained by using the maximum disks from Section 9.5. The skeleton of a region is defined as its medial axis.

The MAT of a region has an intuitive interpretation based on the “prairie fire” concept discussed in Section 11.3 (see Fig. 11.15). Consider an image region as a prairie of uniform, dry grass, and suppose that a fire is lit simultaneously along all the points on its border. All fire fronts will advance into the region at the same speed. The MAT of the region is the set of points reached by more than one fire front at the same time.

In general, the MAT comes considerably closer than thinning to producing skeletons that “make sense.” However, computing the MAT of a region requires calculating the distance from every interior point to every point on the border of the region—an impractical endeavor in most applications. Instead, the approach is to obtain the skeleton equivalently from the distance transform, for which numerous efficient algorithms exist.

The distance transform of a region of foreground pixels in a background of zeros is the distance from every pixel to the nearest nonzero valued pixel. Figure 11.13(a) shows a small binary image, and Fig. 11.13(b) is its distance transform. Observe that every 1-valued pixel has a distance transform value of 0 because its closest nonzero valued pixel is itself. For the purpose of finding skeletons equivalent to the MAT, we are interested in the distance from the pixels of a region of foreground (white) pixels to their nearest background (zero) pixels, which constitute the region boundary. Thus, we compute the distance transform of the complement of the image, as Figs. 11.13(c) and (d) illustrate. By comparing Figs. 11.13(d) and 11.12(a), we see in the former that the MAT (skeleton) is equivalent to the ridge of the distance transform [i.e., the ridge in the image in Fig. 11.13(d)]. This ridge is the set of local maxima [shown bold in Fig. 11.13(d)]. Figures 11.13(e) and (f) show the same effect on a larger (414 × 708) binary image.

Finding approaches for computing the distance transform efficiently has been a topic of research for many years. Numerous approaches exist that can compute the distance transform with linear time complexity, O(K), for a binary image with K pixels. For example, the algorithm by Maurer et al. [2003] not only can compute the distance transform in O(K), it can compute it in O(K/P) using P processors.

FIGURE 11.13 (a) A small image and (b) its distance transform. Note that all 1-valued pixels in (a) have corresponding 0’s in (b). (c) A small image, and (d) the distance transform of its complement. (e) A larger image, and (f) the distance transform of its complement. The Euclidean distance was used throughout. The values in (a) and (b) are:

(a)  0 0 0 0 0        (b)  1.41 1 1 1 1.41
     0 1 1 1 0             1    0 0 0 1
     0 1 1 1 0             1    0 0 0 1
     0 0 0 0 0             1.41 1 1 1 1.41

and in (c) and (d):

(c)  0 0 0 0 0 0 0 0 0    (d)  0 0 0 0 0 0 0 0 0
     0 1 1 1 1 1 1 1 0         0 1 1 1 1 1 1 1 0
     0 1 1 1 1 1 1 1 0         0 1 2 2 2 2 2 1 0
     0 1 1 1 1 1 1 1 0         0 1 2 3 3 3 2 1 0
     0 1 1 1 1 1 1 1 0         0 1 2 2 2 2 2 1 0
     0 1 1 1 1 1 1 1 0         0 1 1 1 1 1 1 1 0
     0 0 0 0 0 0 0 0 0         0 0 0 0 0 0 0 0 0
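Under this convention, the computation maps directly onto standard library calls. A sketch using SciPy, whose distance_transform_edt assigns to each nonzero pixel its Euclidean distance to the nearest zero pixel, i.e., exactly the distance transform of the complement in the text's terminology:

```python
import numpy as np
from scipy import ndimage

# The small rectangle of Figs. 11.13(c) and (d): 1 = foreground.
img = np.zeros((7, 9), dtype=np.uint8)
img[1:6, 1:8] = 1

# Distance from every foreground pixel to the nearest background pixel.
dt = ndimage.distance_transform_edt(img)
print(dt.astype(int))   # the ridge of local maxima traces the MAT skeleton
```

When the ridge itself is needed, a routine such as skimage.morphology.medial_axis extracts it directly from the same distance transform.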
EXAMPLE 11.5 : Skeletons obtained using thinning and pruning vs. the distance transform.

Figure 11.14(a) shows a segmented image of blood vessels, and Fig. 11.14(b) shows the skeleton obtained using morphological thinning. As we discussed in Chapter 9, thinning is characteristically accompanied by spurs, which certainly is the case here. Figure 11.14(c) shows the result of forty passes of spur removal. With the exception of the few small spurs visible on the bottom left of the image, pruning did a reasonable job of cleaning up the skeleton. One drawback of thinning is the loss of potentially important features. This was not the case here, except that the pruned skeleton does not cover the full expanse of the image. Figure 11.14(d) shows the skeleton obtained using distance transform computations based on fast marching (see Lee et al. [2005] and Shi and Karl [2008]). The way the algorithm we used implements branch generation handles ambiguities such as spurs automatically.

The result in Fig. 11.14(d) is slightly superior to the result in Fig. 11.14(c), but both skeletons certainly capture the important features of the image in this case. A key advantage of the thinning approach is simplicity of implementation, which can be important in dedicated applications. Overall, distance-transform formulations tend to produce skeletons less prone to discontinuities, but overcoming the computational burden of the distance transform results in implementations that are considerably more complex than thinning.

11.3 BOUNDARY FEATURE DESCRIPTORS

We begin our discussion of feature descriptors by considering several fundamental approaches for describing region boundaries.

The major axis of a boundary is the straight-line segment connecting the two points on the boundary that are farthest from each other. If (x1, y1) and (x2, y2) are its end points, the angle of the major axis is

angle = tan⁻¹[(y2 − y1)/(x2 − x1)]

The minor axis (also called the longest perpendicular chord) of a boundary is defined as the line perpendicular to the major axis, and of such length that a box passing through the outer four points of intersection of the boundary with the two axes completely encloses the boundary. The box just described is called the basic rectangle or bounding box, and the ratio of the major to the minor axis is called the eccentricity of the boundary. We give some examples of this descriptor in Section 11.4.

The curvature of a boundary is defined as the rate of change of slope. In general, obtaining reliable measures of curvature at a point of a raw digital boundary is difficult because these boundaries tend to be locally “ragged.” Smoothing can help, but a more rugged measure of curvature is to use the difference between the slopes of adjacent boundary segments that have been represented as straight lines. Polygonal approximations are well-suited for this approach [see Fig. 11.8(c)], in which case we are concerned only with curvature at the vertices. As we traverse the polygon in the clockwise direction, a vertex point p is said to be convex if the change in slope at p is nonnegative; otherwise, p is said to be concave. The description can be refined further by using ranges for the changes of slope. For instance, p could be labeled as part of a nearly straight line segment if the absolute change of slope at that point is less than 10°, or as a “corner-like” point if the absolute change is in the range 90° ± 30°. (We will discuss corners in detail later in this chapter.)

Descriptors based on changes of slope can be formulated easily by expressing a boundary in the form of a slope chain code (SCC), as discussed earlier (see Fig. 11.6). A particularly useful boundary descriptor that is easily implemented using SCCs is tortuosity, a measure of the twists and turns of a curve. The tortuosity, t, of a curve
represented by an SCC is defined as the sum of the absolute values of the chain elements:

t = ∑_{i=1}^{n} |a_i|     (11-6)

where n is the number of elements in the SCC, and a_i are the values (slope changes) of the elements in the code. The next example illustrates one use of this descriptor.

EXAMPLE 11.6 : Using slope chain codes to describe tortuosity.

An important measure of blood vessel morphology is its tortuosity. This metric can assist in the computer-aided diagnosis of Retinopathy of Prematurity (ROP), an eye disease that affects babies born prematurely (Bribiesca [2013]). ROP causes abnormal blood vessels to grow in the retina (see Section 2.1). This growth can cause the retina to detach from the back of the eye, potentially leading to blindness.

Figure 11.15(a) shows an image of the retina (called a fundus image) from a newborn baby. Ophthalmologists diagnose and make decisions about the initial treatment of ROP based on the appearance of retinal blood vessels. Dilatation and increased tortuosity of the retinal vessels are signs of highly probable ROP. Blood vessels denoted A, B, and C in Fig. 11.15 were selected to demonstrate the discriminative potential of SCCs for quantifying tortuosity (each vessel shown is a long, thin region, not a line segment).

The border of each vessel was extracted and its length (number of pixels), P, was calculated. To make SCC comparisons meaningful, the three boundaries were normalized so that each would have the same number, m, of straight-line segments. The length, L, of the line segment was then computed as L = P/m. It follows that the number of elements of each SCC is m − 1. The tortuosity, t, of a curve represented by an SCC is defined as the sum of the absolute values of the chain elements, as noted in Eq. (11-6).

The table in Fig. 11.15(b) shows values of t for vessels A, B, and C based on 51 straight-line segments (as noted above, n = m − 1). The values of tortuosity are in agreement with our visual analysis of the three vessels, showing B as being slightly “busier” than A, and C as having the fewest twists and turns.

FIGURE 11.15 (a) Fundus image from a prematurely born baby with ROP. (b) Tortuosity of vessels A, B, and C. (Courtesy of Professor Ernesto Bribiesca, IIMAS-UNAM, Mexico.)

Curve    n     t
A        50    2.3770
B        50    2.5132
C        50    1.6285

SHAPE NUMBERS

The shape number of a Freeman chain-coded boundary, based on the 4-directional code of Fig. 11.3(a), is defined as the first difference of smallest magnitude. (As explained in Section 11.2, the first difference of smallest magnitude makes a Freeman chain code independent of the starting point, and is insensitive to rotation in increments of 90° if a 4-directional code is used.) The order, n, of a shape number is defined as the number of digits in its representation. Moreover, n is even for a closed boundary, and its value limits the number of possible different shapes. Figure 11.16 shows all the shapes of order 4, 6, and 8, along with their chain-code representations, first differences, and corresponding shape numbers. Although the first difference of a 4-directional chain code is independent of rotation (in increments of 90°), the coded boundary in general depends on the orientation of the grid. One way to normalize the grid orientation is by aligning the chain-code grid with the sides of the basic rectangle defined in the previous section.

In practice, for a desired shape order, we find the rectangle of order n whose eccentricity (defined in Section 11.4) best approximates that of the basic rectangle, and use this new rectangle to establish the grid size. For example, if n = 12, all the rectangles of order 12 (that is, those whose perimeter length is 12) are of sizes 2 × 4, 3 × 3, and 1 × 5. If the eccentricity of the 2 × 4 rectangle best matches the eccentricity of the basic rectangle for a given boundary, we establish a 2 × 4 grid centered on the basic rectangle and use the procedure outlined in Section 11.2 to obtain the Freeman chain code. The shape number follows from the first difference of this code. Although the order of the resulting shape number usually equals n because of the way the grid spacing was selected, boundaries with depressions comparable to this spacing sometimes yield shape numbers of order greater than n. In this case, we specify a rectangle of order lower than n, and repeat the procedure until the resulting shape number is of order n. The order of a shape number starts at 4 and is always even because we are working with 4-connectivity and require that boundaries be closed.

FIGURE 11.16 All shapes of order 4, 6, and 8. The directions are from Fig. 11.3(a), and the dot indicates the starting point.

Order 4:  Chain code: 0 3 2 1
          Difference: 3 3 3 3
          Shape no.:  3 3 3 3

Order 6:  Chain code: 0 0 3 2 2 1
          Difference: 3 0 3 3 0 3
          Shape no.:  0 3 3 0 3 3

Order 8:  Chain codes: 0 0 3 3 2 2 1 1 | 0 3 0 3 2 2 1 1 | 0 0 0 3 2 2 2 1
          Differences: 3 0 3 0 3 0 3 0 | 3 3 1 3 3 0 3 0 | 3 0 0 3 3 0 0 3
          Shape nos.:  0 3 0 3 0 3 0 3 | 0 3 0 3 3 1 3 3 | 0 0 3 3 0 0 3 3
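A sketch of the normalization that produces a shape number from a closed 4-directional chain code (the function name is mine):

```python
def shape_number(code4):
    """Shape number of a closed 4-directional Freeman chain code: the
    circular first difference, rotated to its minimum-magnitude form."""
    n = len(code4)
    diff = [(code4[(i + 1) % n] - code4[i]) % 4 for i in range(n)]
    return min(diff[i:] + diff[:i] for i in range(n))

# The order-6 shape of Fig. 11.16:
print(shape_number([0, 0, 3, 2, 2, 1]))   # -> [0, 3, 3, 0, 3, 3]
```

Because the normalization is circular, the result is independent of the starting point of the code and of where the wrap-around difference is inserted.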
FIGURE 11.17 Steps in the generation of a shape number:

Chain code: 0 0 0 0 3 0 0 3 2 2 3 2 2 2 1 2 1 1
Difference: 3 0 0 0 3 1 0 3 3 0 1 3 0 0 3 1 3 0
Shape no.:  0 0 0 3 1 0 3 3 0 1 3 0 0 3 1 3 0 3

EXAMPLE 11.7 : Computing shape numbers.

Suppose that n = 18 is specified for the boundary in Fig. 11.17(a). To obtain a shape number of this order we follow the steps just discussed. First, we find the basic rectangle, as shown in Fig. 11.17(b). Next we find the closest rectangle of order 18. It is a 3 × 6 rectangle, requiring the subdivision of the basic rectangle shown in Fig. 11.17(c). The chain-code directions are aligned with the resulting grid. The final step is to obtain the chain code and use its first difference to compute the shape number, as shown in Fig. 11.17(d).

FOURIER DESCRIPTORS

Figure 11.18 shows a digital boundary in the xy-plane, consisting of K points. Starting at an arbitrary point (x0, y0), coordinate pairs (x0, y0), (x1, y1), (x2, y2), …, (xK−1, yK−1) are encountered in traversing the boundary, say, in the counterclockwise direction. Each coordinate pair can be treated as a complex number so that s(k) = x(k) + jy(k) for k = 0, 1, 2, …, K − 1; that is, the x-axis is treated as the real axis and the y-axis as the imaginary axis of a sequence of complex numbers. Although the interpretation of the sequence was restated, the nature of the boundary itself was not changed. Of course, this representation has one great advantage: It reduces a 2-D to a 1-D description problem.

FIGURE 11.18 A digital boundary and its representation as a sequence of complex numbers. The points (x0, y0) and (x1, y1) are (arbitrarily) the first two points in the sequence.

We know from Eq. (4-44) that the discrete Fourier transform (DFT) of s(k) is

a(u) = ∑_{k=0}^{K−1} s(k) e^{−j2πuk/K}     (11-8)

for u = 0, 1, 2, …, K − 1. The complex coefficients a(u) are called the Fourier descriptors of the boundary. The inverse Fourier transform of these coefficients restores s(k). That is, from Eq. (4-45),

s(k) = (1/K) ∑_{u=0}^{K−1} a(u) e^{j2πuk/K}     (11-9)

for k = 0, 1, 2, …, K − 1. We know from Chapter 4 that the inverse is identical to the original input, provided that all the Fourier coefficients are used in Eq. (11-9). However, suppose that, instead of all the Fourier coefficients, only the first P coefficients are used. This is equivalent to setting a(u) = 0 for u > P − 1 in Eq. (11-9). The result is the following approximation to s(k):

ŝ(k) = (1/K) ∑_{u=0}^{P−1} a(u) e^{j2πuk/K}

for k = 0, 1, 2, …, K − 1.
            tional” axis system here
                                          at an arbitrary point ( x0 , y0 ) , coordinate pairs ( x0 , y0 ) , ( x1 , y1 ) , ( x2 , y2 ) ,…, ( xK −1 , yK −1 )                                                                                                         1
            for consistency with the                                                                                                                                                                                                                    sˆ ( k ) =       ∑ a (u )e j 2puk K       (11-10)
            literature. However, the      are encountered in traversing the boundary, say, in the counterclockwise direction.                                                                                                                                        K   u=0
                                          These coordinates can be expressed in the form x ( k ) = xk and y ( k ) = yk . Using
            same result is obtained
            if we use the book                                                                                                                                                                              for k = 0, 1, 2, …, K − 1. Although only P terms are used to obtain each component
            image coordinate system       this notation, the boundary itself can be represented as the sequence of coordinates                                                                              of sˆ ( k ) , parameter k still ranges from 0 to K – 1. That is, the same number of points
                                          s ( k ) =  x ( k ) , y ( k ) for k = 0, 1, 2, …, K − 1. Moreover, each coordinate pair can be
            whose origin is at the
            top left because both are                                                                                                                                                                       exists in the approximate boundary, but not as many terms are used in the recon-
            right-handed coordinate       treated as a complex number so that                                                                                                                               struction of each point.
            systems (see Fig. 2.19). In
            the latter, the rows and
            columns represent the
                                                                                     s ( k ) = x ( k ) + jy ( k )                                 (11-7)                                                       Deleting the high-frequency coefficients is the same as filtering the transform
            real and imaginary parts                                                                                                                                                                        with an ideal lowpass filter. You learned in Chapter 4 that the periodicity of the
            of the complex number.        for k = 0, 1, 2,…, K − 1. That is, the x-axis is treated as the real axis and the y-axis as                                                                       DFT requires that we center the transform prior to filtering it by multiplying it by
                                          the imaginary axis of a sequence of complex numbers. Although the interpretation                                                                                  (−1)x . Thus, we use this procedure when implementing Eq. (11-8), and use it again
www.EBooksWorld.ir www.EBooksWorld.ir
                                    to reverse the centering when computing the inverse in Eq. (11-10). Because of
                                    symmetry considerations in the DFT, the number of points in the boundary and its
                                    inverse must be even. This implies that the number of coefficients removed (set to 0)
                                    before the inverse is computed must be even. Because the transform is centered, we
                                    set to 0 half the number of coefficients on each end of the transform to preserve
                                    symmetry. Of course, the DFT and its inverse are computed using an FFT algorithm.
                                        Recall from discussions of the Fourier transform in Chapter 4 that high-frequency
                                    components account for fine detail, and low-frequency components determine over-
                                    all shape. Thus, the smaller we make P in Eq. (11-10), the more detail that will be lost
                                    on the boundary, as the following example illustrates.
www.EBooksWorld.ir www.EBooksWorld.ir
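To make the mechanics concrete, here is a minimal NumPy sketch of Eqs. (11-7) through (11-10). It is illustrative rather than the book's implementation: the boundary is assumed to be a (K, 2) array of (x, y) coordinates with K and P both even, the function names are hypothetical, and centering is done by multiplying s(k) by (-1)^k as described above.

    import numpy as np

    def fourier_descriptors(boundary):
        # boundary: (K, 2) array of (x, y) points traversed counterclockwise.
        s = boundary[:, 0] + 1j * boundary[:, 1]      # Eq. (11-7)
        k = np.arange(s.size)
        return np.fft.fft(s * (-1.0) ** k)            # Eq. (11-8), spectrum centered

    def reconstruct(a, P):
        # Keep the P lowest-frequency coefficients of the centered spectrum,
        # zeroing (K - P)/2 coefficients at each end, then invert: Eq. (11-10).
        K = a.size
        a_hat = np.zeros_like(a)
        lo = (K - P) // 2
        a_hat[lo:lo + P] = a[lo:lo + P]
        k = np.arange(K)
        s_hat = np.fft.ifft(a_hat) * (-1.0) ** k      # undo the centering
        return np.column_stack((s_hat.real, s_hat.imag))

For P = K the boundary is recovered exactly; smaller P discards high-frequency coefficients and therefore fine boundary detail.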
TABLE 11.1 Some basic properties of Fourier descriptors.

  Transformation  | Boundary                   | Fourier Descriptor
  Identity        | s(k)                       | a(u)
  Rotation        | s_r(k) = s(k) e^{jθ}       | a_r(u) = a(u) e^{jθ}
  Translation     | s_t(k) = s(k) + Δ_{xy}     | a_t(u) = a(u) + Δ_{xy} δ(u)
  Scaling         | s_s(k) = α s(k)            | a_s(u) = α a(u)
  Starting point  | s_p(k) = s(k − k_0)        | a_p(u) = a(u) e^{−j 2π k_0 u / K}

...ability of intensity value z_i occurring. It then follows from Eq. (3-24) that the nth moment of z about its mean is

    \mu_n(z) = \sum_{i=0}^{A-1} (z_i - m)^n p(z_i)                    (11-14)

where

    m = \sum_{i=0}^{A-1} z_i p(z_i)                                   (11-15)

As you know, m is the mean (average) value of z, and μ_2 is its variance. Generally, only the first few moments are required to differentiate between signatures of clearly distinct shapes.

An alternative approach is to normalize the area of g(r) in Fig. 11.20 to unity and treat it as a histogram. In other words, g(r_i) is now treated as the probability of value r_i occurring. In this case, r is treated as the random variable and the moments are

    \mu_n(r) = \sum_{i=0}^{K-1} (r_i - m)^n g(r_i)                    (11-16)

where

    m = \sum_{i=0}^{K-1} r_i g(r_i)                                   (11-17)

[FIGURE 11.20 Sampled signature from Fig. 11.10(b) treated as an ordinary, discrete function g(r) of one variable, r.]
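Under this histogram interpretation, Eqs. (11-16) and (11-17) take only a few lines of code. The sketch below is illustrative; the array and function names are assumptions, and g is presumed to have been sampled on the grid r.

    import numpy as np

    def signature_moments(r, g, n):
        # Treat the signature g(r) as a histogram: normalize its area to 1.
        g = g / g.sum()
        m = (r * g).sum()                  # Eq. (11-17), the mean
        return ((r - m) ** n * g).sum()    # Eq. (11-16), nth central moment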
As we did with boundaries, we begin the discussion of regional features with some basic region descriptors.

SOME BASIC DESCRIPTORS

The major and minor axes of a region, as well as the idea of a bounding box, are as defined earlier for boundaries. The area of a region is defined as the number of pixels in the region. The perimeter of a region is the length of its boundary. When area and perimeter are used as descriptors, they generally make sense only when they are normalized (Example 11.9 shows such a use). A more frequent use of these two descriptors is in measuring the compactness of a region, defined as the perimeter squared over the area:

    \text{compactness} = \frac{p^2}{A}                                (11-18)

This is a dimensionless measure that is 4π for a circle (its minimum value) and 16 for a square.

[Margin note: Sometimes compactness is defined as the inverse of circularity. Obviously, these two measures are closely related.]

A similar dimensionless measure is circularity (also called roundness), defined as

    \text{circularity} = \frac{4\pi A}{p^2}                           (11-19)
The value of this descriptor is 1 for a circle (its maximum value) and π/4 for a square. Note that these two measures are independent of size, orientation, and translation. Another measure based on a circle is the effective diameter:

    d_e = 2\sqrt{\frac{A}{\pi}}                                       (11-20)

This is the diameter of a circle having the same area, A, as the region being processed. This measure is neither dimensionless nor independent of region size, but it is independent of orientation and translation. It can be normalized for size and made dimensionless by dividing it by the largest diameter expected in a given application.

In a manner analogous to the way we defined compactness and circularity relative to a circle, we define the eccentricity of a region relative to an ellipse as the eccentricity of an ellipse that has the same second central moments as the region. For 1-D, the second central moment is the variance. For 2-D discrete data, we have to consider the variance of each variable as well as the covariance between them. These are the components of the covariance matrix, which is estimated from samples using Eq. (11-21) below, with the samples in this case being 2-D vectors representing the coordinates of the data.

Figure 11.21(a) shows an ellipse in standard form (i.e., an ellipse whose major and minor axes are aligned with the coordinate axes). The eccentricity of such an ellipse is defined as the ratio of the distance between foci (2c in Fig. 11.21) and the length of its major axis (2a), which gives the ratio 2c/2a = c/a. That is,

    \text{eccentricity} = \frac{c}{a} = \frac{\sqrt{a^2 - b^2}}{a} = \sqrt{1 - (b/a)^2}, \quad a \ge b

However, we are interested in the eccentricity of an ellipse that has the same second central moments as a given 2-D region, which means that our ellipses can have arbitrary orientations. Intuitively, what we are trying to do is approximate our 2-D data by an elliptical region whose axes are aligned with the principal axes of the data, as Fig. 11.21(b) illustrates. As you will learn in Section 11.5 (see Example 11.17), the principal axes are the eigenvectors of the covariance matrix, C, of the data, which is given by

    C = \frac{1}{K-1} \sum_{k=1}^{K} (\mathbf{z}_k - \bar{\mathbf{z}})(\mathbf{z}_k - \bar{\mathbf{z}})^T    (11-21)

[Margin note: Often, you will see the constant in Eq. (11-21) written as 1/K instead of 1/(K−1). The latter is used to obtain a statistically unbiased estimate of C. For our purposes, either formulation is acceptable.]

where z_k is a 2-D vector whose elements are the two spatial coordinates of a point in the region, K is the total number of points, and z̄ is the mean vector:

    \bar{\mathbf{z}} = \frac{1}{K} \sum_{k=1}^{K} \mathbf{z}_k        (11-22)

The main diagonal elements of C are the variances of the coordinate values of the points in the region, and the off-diagonal elements are their covariances.

An ellipse oriented in the same direction as the principal axes of the region can be interpreted as the intersection of a 2-D Gaussian function with the xy-plane. The orientation of the axes of the ellipse is in the direction of the eigenvectors of the covariance matrix, and the distances from the center of the ellipse to its intersections with its major and minor axes are equal to the largest and smallest eigenvalues of the covariance matrix, respectively, as Fig. 11.21(b) shows. With reference to Fig. 11.21, and the equation of eccentricity given above, we see by analogy that the eccentricity of an ellipse with the same second moments as the region is given by

    \text{eccentricity} = \frac{\sqrt{\lambda_2^2 - \lambda_1^2}}{\lambda_2} = \sqrt{1 - (\lambda_1/\lambda_2)^2}, \quad \lambda_2 \ge \lambda_1    (11-23)

For circular regions, λ1 = λ2 and the eccentricity is 0. For a line, λ1 = 0 and the eccentricity is 1. Thus, values of this descriptor are in the range [0, 1].

[FIGURE 11.21 (a) An ellipse in standard form, with foci a distance c from the center, where c² = a² − b². (b) An ellipse approximating a binary region in arbitrary orientation; e1, e2 and λ1, λ2 are the eigenvectors and corresponding eigenvalues of the covariance matrix of the coordinates of the region, and the ellipse is centered on the centroid of the region.]

EXAMPLE 11.9 : Comparison of feature descriptors.

Figure 11.22 shows values of the preceding descriptors for several region shapes. None of the descriptors for the circle was exactly equal to its theoretical value, because digitizing a circle introduces error into the computation, and because we approximated the length of a boundary as its number of elements. The eccentricity of the square did have an exact value of 0, because a square with no rotation aligns perfectly with the sampling grid. The other two descriptors for the square were also close to their theoretical values.

[FIGURE 11.22 Compactness, circularity, and eccentricity of some simple binary regions (from left: a circle, a star, a square, and a teardrop, as identified in the discussion):

  Descriptor   | Circle  | Star    | Square  | Teardrop
  Compactness  | 10.1701 | 42.2442 | 15.9836 | 13.2308
  Circularity  | 1.2356  | 0.2975  | 0.7862  | 0.9478
  Eccentricity | 0.0411  | 0.0636  | 0       | 0.8117]

The values listed in the first two rows of Fig. 11.22 carry the same information. For example, we can tell that the star is less compact and less circular than the other shapes. Similarly, it is easy to tell from the numbers listed that the teardrop region has by far the largest eccentricity, but it is harder to differentiate it from the other shapes using compactness or circularity.

As we discussed in Section 11.1, feature descriptors typically are arranged in the form of feature vectors for subsequent processing. Figure 11.23 shows the feature space for the descriptors in Fig. 11.22.
[FIGURE 11.23 Feature space for the descriptors in Fig. 11.22; the legible axis labels are x2 = circularity and x3 = eccentricity.]

Each point in feature space "encapsulates" the three descriptor values for each object. Although we can tell from looking at the values of the descriptors in the figure that the circle and square are much more similar than the other two objects, note how much clearer this fact is in feature space. You can imagine that, if we had multiple samples of those objects corrupted by noise, it could become difficult to differentiate between vectors (points) corresponding to squares or circles. In contrast, the star and teardrop objects are far from each other, and from the circle and square, so they are less likely to be misclassified in the presence of noise. Feature space will play an important role in Chapter 12, when we discuss image pattern classification.
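For readers who want to experiment, here is a minimal NumPy sketch of Eqs. (11-18), (11-19), and (11-23) for a binary region. It is an illustration, not the book's implementation: the perimeter is approximated by the number of boundary pixels (the same approximation noted in Example 11.9), the function name is hypothetical, and the region is assumed non-degenerate (λ2 > 0).

    import numpy as np

    def region_descriptors(mask):
        # mask: 2-D boolean array, True inside the region.
        area = mask.sum()
        # Boundary pixels: region pixels with at least one 4-neighbor outside.
        padded = np.pad(mask, 1)
        interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                    padded[1:-1, :-2] & padded[1:-1, 2:])
        perimeter = (mask & ~interior).sum()
        compactness = perimeter ** 2 / area                 # Eq. (11-18)
        circularity = 4 * np.pi * area / perimeter ** 2     # Eq. (11-19)
        # Eccentricity from the eigenvalues of the covariance matrix of the
        # pixel coordinates, Eqs. (11-21)-(11-23). np.cov uses 1/(K-1).
        rows, cols = np.nonzero(mask)
        l1, l2 = np.sort(np.linalg.eigvalsh(np.cov(np.stack([cols, rows]))))
        eccentricity = np.sqrt(1.0 - (l1 / l2) ** 2)
        return compactness, circularity, eccentricity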
EXAMPLE 11.10 : Using area features.

Even a simple descriptor such as normalized area can be quite useful for extracting information from images. For instance, Fig. 11.24 shows a night-time satellite infrared image of the Americas. As we discussed in Section 1.3, such images provide a global inventory of human settlements. The imaging sensors used to collect these images have the capability to detect visible and near-infrared emissions, such as lights, fires, and flares. The table alongside the images shows (by region from top to bottom) the ratio of the area occupied by white (the lights) to the total light area in all four regions. A simple measurement like this can give, for example, a relative estimate by region of electrical energy consumption. The data can be refined by normalizing it with respect to land mass per region, with respect to population numbers, and so on.

  Region no. (from top) | Ratio of lights per region to total lights
  1                     | 0.204
  2                     | 0.640
  3                     | 0.049
  4                     | 0.107
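A measurement like the one in this example reduces to counting pixels. The following sketch (names assumed, not from the book) computes the ratio of light pixels in each region to the total number of light pixels:

    import numpy as np

    def light_ratios(lights, region_masks):
        # lights: 2-D boolean array of thresholded "light" pixels.
        # region_masks: list of 2-D boolean arrays, one per region.
        counts = [np.count_nonzero(lights & m) for m in region_masks]
        total = float(sum(counts))
        return [c / total for c in counts]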
TOPOLOGICAL DESCRIPTORS

[Margin note: See Sections 2.5 and 9.5 regarding connected components.]

Topology is the study of properties of a figure that are unaffected by any deformation, provided that there is no tearing or joining of the figure (sometimes these are called rubber-sheet distortions). For example, Fig. 11.25(a) shows a region with two holes. Obviously, a topological descriptor defined as the number of holes in the region will not be affected by a stretching or rotation transformation. However, the number of holes can change if the region is torn or folded. Because stretching affects distance, topological properties do not depend on the notion of distance or on any properties implicitly based on the concept of a distance measure.

Another topological property useful for region description is the number of connected components of an image or region. Figure 11.25(b) shows a region with three connected components. The number of holes H and connected components C in a figure can be used to define the Euler number, E:

    E = C - H                                                         (11-24)
[FIGURE 11.25 (a) A region with two holes. (b) A region with three connected components.]

[FIGURE 11.27 A region containing a polygonal network, with one vertex, edge, face, and hole labeled.]
The Euler number is also a topological property. The regions shown in Fig. 11.26, for example, have Euler numbers equal to 0 and −1, respectively, because the "A" has one connected component and one hole, and the "B" has one connected component but two holes.

[FIGURE 11.26 Regions with Euler numbers equal to 0 and −1, respectively.]

Regions represented by straight-line segments (referred to as polygonal networks) have a particularly simple interpretation in terms of the Euler number. Figure 11.27 shows a polygonal network. Classifying interior regions of such a network into faces and holes is often important. Denoting the number of vertices by V, the number of edges by Q, and the number of faces by F gives the following relationship, called the Euler formula:

    V - Q + F = C - H                                                 (11-25)

which, in view of Eq. (11-24), can be expressed as

    V - Q + F = E                                                     (11-26)

The network in Fig. 11.27 has seven vertices, eleven edges, two faces, one connected region, and three holes; thus its Euler number is −2 (i.e., 7 − 11 + 2 = 1 − 3 = −2).
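In practice, the Euler number of a binary image can be computed directly from Eq. (11-24) by counting components. The sketch below is one way to do it, assuming SciPy is available; it uses 8-connectivity for the foreground and the complementary 4-connectivity for background holes, consistent with the counts reported in Example 11.11.

    import numpy as np
    from scipy import ndimage

    def euler_number(mask):
        # C: connected components of the foreground (8-connectivity).
        _, C = ndimage.label(mask, structure=np.ones((3, 3)))
        # H: background components minus the one touching the border
        # (4-connectivity, the adjacency complementary to the foreground's).
        padded = np.pad(~mask, 1, constant_values=True)
        _, n_bg = ndimage.label(padded)     # default structure is 4-connected
        return C - (n_bg - 1)               # Eq. (11-24): E = C - H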
EXAMPLE 11.11 : Extracting and characterizing the largest feature in a segmented image.

Figure 11.28(a) shows a 512 × 512, 8-bit image of Washington, D.C. taken by a NASA LANDSAT satellite. This image is in the near infrared band (see Fig. 1.10 for details). Suppose that we want to segment the river using only this image (as opposed to using several multispectral images, which would simplify the task, as you will see later in this chapter). Because the river is a dark, uniform region relative to the rest of the image, thresholding is an obvious approach to try. The result of thresholding the image with the highest possible threshold value before the river became a disconnected region is shown in Fig. 11.28(b). The threshold was selected manually to illustrate the point that it would be impossible in this case to segment the river by itself without other regions of the image also appearing in the thresholded result.

The image in Fig. 11.28(b) has 1591 connected components (obtained using 8-connectivity) and its Euler number is 1552, from which we deduce that the number of holes is 39. Figure 11.28(c) shows the connected component with the largest number of pixels (8479). This is the desired result, which we already know cannot be segmented by itself from the image using a threshold. Note how clean this result is. The number of holes in the region defined by the connected component just found would give us the number of land masses within the river. If we wanted to perform measurements, like the length of each branch of the river, we could use the skeleton of the connected component [Fig. 11.28(d)] to do so.

[FIGURE 11.28 (a) Infrared image of the Washington, D.C. area. (b) Thresholded image. (c) The largest connected component of (b). (d) Skeleton of (c). (Original image courtesy of NASA.)]

TEXTURE

An important approach to region description is to quantify its texture content. While no formal definition of texture exists, intuitively this descriptor provides measures of properties such as smoothness, coarseness, and regularity (Fig. 11.29 shows some examples). In this section, we discuss statistical and spectral approaches for describing the texture of a region. Statistical approaches yield characterizations of textures as smooth, coarse, grainy, and so on. Spectral techniques are based on properties of the Fourier spectrum and are used primarily to detect global periodicity in an image by identifying high-energy, narrow peaks in its spectrum.

Statistical Approaches

One of the simplest approaches for describing texture is to use statistical moments of the intensity histogram of an image or region. Let z be a random variable denoting intensity, and let p(z_i), i = 0, 1, 2, ..., L − 1, be the corresponding normalized histogram, where L is the number of distinct intensity levels. From Eq. (3-24), the nth moment of z about the mean is
    \mu_n(z) = \sum_{i=0}^{L-1} (z_i - m)^n p(z_i)                    (11-27)

where m is the mean value of z (i.e., the average intensity of the image or region):

    m = \sum_{i=0}^{L-1} z_i p(z_i)                                   (11-28)

Note from Eq. (11-27) that μ_0 = 1 and μ_1 = 0. The second moment [the variance σ²(z) = μ_2(z)] is particularly important in texture description. It is a measure of intensity contrast that can be used to establish descriptors of relative intensity smoothness. For example, the measure

    R(z) = 1 - \frac{1}{1 + \sigma^2(z)}                              (11-29)

is 0 for areas of constant intensity (the variance is zero there) and approaches 1 for large values of σ²(z). Because variance values tend to be large for grayscale images with values, for example, in the range 0 to 255, it is a good idea to normalize the variance to the interval [0, 1] for use in Eq. (11-29). This is done simply by dividing σ²(z) by (L − 1)² in Eq. (11-29). The standard deviation, σ(z), also is used frequently as a measure of texture. Two other descriptors computed from the histogram are the uniformity, given by

    U(z) = \sum_{i=0}^{L-1} p^2(z_i)                                  (11-30)

and a measure of average entropy that, as you may recall from information theory, is defined as

    e(z) = -\sum_{i=0}^{L-1} p(z_i) \log_2 p(z_i)                     (11-31)

Because the values of p are in the range [0, 1] and their sum equals 1, the value of descriptor U is maximum for an image in which all intensity levels are equal (maximally uniform), and decreases from there. Entropy is a measure of variability, and is 0 for a constant image.
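The descriptors in Eqs. (11-27) through (11-31) all follow from the normalized histogram, so they take only a few lines of code. The sketch below assumes an 8-bit image; the function and variable names are illustrative.

    import numpy as np

    def histogram_texture(image, L=256):
        p, _ = np.histogram(image, bins=L, range=(0, L))
        p = p / p.sum()                           # normalized histogram
        z = np.arange(L)
        m = (z * p).sum()                         # Eq. (11-28), mean
        var = ((z - m) ** 2 * p).sum()            # second moment, Eq. (11-27)
        R = 1 - 1 / (1 + var / (L - 1) ** 2)      # Eq. (11-29), normalized variance
        U = (p ** 2).sum()                        # Eq. (11-30), uniformity
        nz = p[p > 0]
        e = -(nz * np.log2(nz)).sum()             # Eq. (11-31), entropy
        return m, np.sqrt(var), R, U, e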
[FIGURE 11.29 The white squares mark, from left to right, smooth, coarse, and regular textures. These are optical microscope images of a superconductor, human cholesterol, and a microprocessor. (Courtesy of Dr. Michael W. Davidson, Florida State University.)]

EXAMPLE 11.12 : Texture descriptors based on histograms.

Table 11.2 lists the values of the preceding descriptors for the three types of textures highlighted in Fig. 11.29. The mean describes only the average intensity of each region and is useful only as a rough idea of intensity, not texture. The standard deviation is more informative; the numbers clearly show that the first texture has significantly less variability in intensity (it is smoother) than the other two textures. The coarse texture shows up clearly in this measure. As expected, the same comments hold for R, because it measures essentially the same thing as the standard deviation. The third moment is useful for
...G would all be 1's, because intensity value 1 occurs in f with neighbors valued 3, 5, and 7 in the position specified by Q (one occurrence of each). As an exercise, you should compute all the elements of G using this definition of Q.

[Fragment of the descriptor definitions accompanying Table 11.3: m_c = \sum_{j=1}^{K} j \sum_{i=1}^{K} p_{ij}, and ...]
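Building G is mostly bookkeeping. The following sketch (an illustration, not the book's code) constructs and normalizes a co-occurrence matrix for the simplest operator discussed in the text, Q = "one pixel immediately to the right," with intensities assumed to be 0-indexed:

    import numpy as np

    def cooccurrence(image, levels):
        # image: 2-D integer array with values in [0, levels - 1].
        G = np.zeros((levels, levels), dtype=np.int64)
        left = image[:, :-1].ravel()     # pixel i
        right = image[:, 1:].ravel()     # its neighbor one position to the right
        np.add.at(G, (left, right), 1)   # count each (i, j) pair defined by Q
        return G / G.sum()               # normalize so the elements sum to 1

Other choices of Q amount to changing the two slices that select the pixel pairs.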
variability in intensities. The high transitions in intensity occur at object boundaries, but these counts are low with respect to the moderate intensity transitions over large areas, so they are obscured by the inability of an image display to show high and low values simultaneously, as we discussed in Chapter 3.

FIGURE 11.31 Images whose pixels have (a) random, (b) periodic, and (c) mixed texture patterns. Each image is of size 263 × 800 pixels.

FIGURE 11.32 256 × 256 co-occurrence matrices G1, G2, and G3, corresponding from left to right to the images in Fig. 11.31.

TABLE 11.4 Descriptors evaluated using the co-occurrence matrices displayed as images in Fig. 11.32.

Normalized
co-occurrence   Maximum
matrix          probability   Correlation   Contrast   Uniformity   Homogeneity   Entropy
G1/n1           0.00006       −0.0005       10838      0.00002      0.0366        15.75
G2/n2           0.01500        0.9650         570      0.01230      0.0824         6.43
G3/n3           0.06860        0.8798        1356      0.00480      0.2048        13.58

The preceding observations are qualitative. To quantify the "content" of co-occurrence matrices, we need descriptors such as those in Table 11.3. Table 11.4 shows values of these descriptors computed for the three co-occurrence matrices in Fig. 11.32. To use these descriptors, the co-occurrence matrices must be normalized by dividing them by the sum of their elements, as discussed earlier. The entries in Table 11.4 agree with what one would expect from the images in Fig. 11.31 and their corresponding co-occurrence matrices in Fig. 11.32. For example, consider the Maximum Probability column in Table 11.4. The highest probability corresponds to the third co-occurrence matrix, which tells us that this matrix has a higher number of counts (pixel pairs occurring in the image in the positions specified by Q) than the other two matrices. This agrees with our analysis of G3. The second column indicates that the highest correlation corresponds to G2, which in turn tells us that the intensities in the second image are highly correlated. The repetitiveness of the sinusoidal pattern in Fig. 11.31(b) indicates why this is so. Note that the correlation for G1 is essentially zero, indicating that there is virtually no correlation between adjacent pixels, a characteristic of random images such as the image in Fig. 11.31(a).

The contrast descriptor is highest for G1 and lowest for G2. Thus, we see that the less random an image is, the lower its contrast tends to be. We can see the reason by studying the matrices displayed in Fig. 11.32. The (i − j)² terms are differences of integers for 1 ≤ i, j ≤ L, so they are the same for any G. Therefore, the probabilities of the elements of the normalized co-occurrence matrices are the factors that determine the value of contrast. Although G1 has the lowest maximum probability, the other two matrices have many more zero or near-zero probabilities (the dark areas in Fig. 11.32). Because the sum of the values of the normalized matrix G/n is 1, it is easy to see why the contrast descriptor tends to increase as a function of randomness.

The remaining three descriptors are explained in a similar manner. Uniformity increases as a function of the values of the probabilities squared. Thus, the less randomness there is in an image, the higher the uniformity descriptor will be, as the fifth column in Table 11.4 shows. Homogeneity measures the concentration of values of G with respect to the main diagonal. The values of the denominator term (1 + |i − j|) are the same for all three co-occurrence matrices, and they decrease as i and j become closer in value (i.e., closer to the main diagonal). Thus, the matrix with the highest values of probabilities (numerator terms) near the main diagonal will have the highest value of homogeneity. As we discussed earlier, such a matrix will correspond to images with a "rich" gray-level content and areas of slowly varying intensity values. The entries in the sixth column of Table 11.4 are consistent with this interpretation.

The entries in the last column of the table are measures of randomness in co-occurrence matrices, which in turn translate into measures of randomness in the corresponding images. As expected, G1 had the highest value because the image from which it was derived was totally random. The other two entries are self-explanatory. Note that the entropy measure for G1 is near the theoretical maximum of 16 (2 log2 256 = 16). The image in Fig. 11.31(a) is composed of uniform noise, so each intensity level has approximately an equal probability of occurrence, which is the condition stated in Table 11.3 for maximum entropy.
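These descriptor definitions translate directly into code. The following sketch is our own NumPy illustration, not code from the text: the function and variable names are ours, and the position operator Q is fixed to "one pixel immediately to the right."

```python
import numpy as np

def cooccurrence(img, levels=256):
    """Count pixel pairs for Q = "one pixel immediately to the right"."""
    G = np.zeros((levels, levels))
    np.add.at(G, (img[:, :-1].ravel(), img[:, 1:].ravel()), 1)
    return G

def descriptors(G):
    """Evaluate the six descriptors of Table 11.4 on a co-occurrence matrix."""
    p = G / G.sum()                                  # normalize: entries become probabilities
    L = p.shape[0]
    i, j = np.meshgrid(np.arange(1, L + 1), np.arange(1, L + 1), indexing="ij")
    mr, mc = (i * p).sum(), (j * p).sum()            # means of the row and column indices
    sr = np.sqrt(((i - mr) ** 2 * p).sum())          # standard deviations of the indices
    sc = np.sqrt(((j - mc) ** 2 * p).sum())
    nz = p > 0                                       # avoid log2(0) in the entropy sum
    return {
        "max probability": p.max(),
        "correlation": ((i - mr) * (j - mc) * p).sum() / (sr * sc),
        "contrast": ((i - j) ** 2 * p).sum(),
        "uniformity": (p ** 2).sum(),
        "homogeneity": (p / (1 + np.abs(i - j))).sum(),
        "entropy": -(p[nz] * np.log2(p[nz])).sum(),
    }

# A uniform-noise image should give near-zero correlation and entropy close to 16.
rng = np.random.default_rng(0)
print(descriptors(cooccurrence(rng.integers(0, 256, size=(263, 800)))))
```

Run on a uniform-noise image of the size used in Fig. 11.31(a), the printed values should be close to the G1/n1 row of Table 11.4.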
Thus far, we have dealt with single images and their co-occurrence matrices. Suppose that we want to "discover" (without looking at the images) whether there are any sections in these images that contain repetitive components (i.e., periodic textures). One way to accomplish this goal is to examine the correlation descriptor for sequences of co-occurrence matrices derived from these images by increasing the distance between neighbors. As mentioned earlier, it is customary when working with sequences of co-occurrence matrices to quantize the number of intensities in order to reduce matrix size and the corresponding computational load. The following results were obtained using L = 8.

Figure 11.33 shows plots of the correlation descriptors as a function of horizontal "offset" (i.e., horizontal distance between neighbors) from 1 (for adjacent pixels) to 50. Figure 11.33(a) shows that all correlation values are near 0, indicating that no such patterns were found in the random image. The
FIGURE 11.33 Plots of the correlation descriptor as a function of horizontal offset (vertical axis: correlation).

purpose of analysis, every periodic pattern is associated with only one peak in the spectrum, rather than two. Detection and interpretation of the spectrum features just mentioned often
FIGURE 11.35 (a) and (b) Images of random and ordered objects. (c) and (d) Corresponding Fourier spectra. All images are of size 600 × 600 pixels.

FIGURE 11.36 (a) and (b) Plots of S(r) and S(u) for Fig. 11.35(a). (c) and (d) Plots of S(r) and S(u) for Fig. 11.35(b). All vertical axes are ×10^5.

MOMENT INVARIANTS

The 2-D moment of order (p + q) of an M × N digital image, f(x, y), is defined as

\[
m_{pq} = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} x^{p} y^{q} f(x, y) \tag{11-34}
\]

where p = 0, 1, 2, … and q = 0, 1, 2, … are integers. The corresponding central moment of order (p + q) is defined as

\[
\mu_{pq} = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y) \tag{11-35}
\]

where

\[
\bar{x} = \frac{m_{10}}{m_{00}} \quad \text{and} \quad \bar{y} = \frac{m_{01}}{m_{00}} \tag{11-36}
\]

The normalized central moment of order (p + q), denoted η_pq, is defined as

\[
\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}} \tag{11-37}
\]

where

\[
\gamma = \frac{p + q}{2} + 1 \tag{11-38}
\]

for p + q = 2, 3, … . A set of seven 2-D moment invariants can be derived from the second and third normalized central moments:†

\[
\phi_1 = \eta_{20} + \eta_{02} \tag{11-39}
\]

\[
\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \tag{11-40}
\]

\[
\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \tag{11-41}
\]

\[
\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \tag{11-42}
\]

† Derivation of these results requires concepts that are beyond the scope of this discussion. The book by Bell [1965] and the paper by Hu [1962] contain detailed discussions of these concepts. For generating moment invariants of an order higher than seven, see Flusser [2000]. Moment invariants can be generalized to n dimensions (see Mamistvalov [1998]).
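To make the chain from Eq. (11-34) to Eq. (11-42) concrete, here is a small NumPy sketch of our own (not code from the text; the names are ours). It computes φ1 through φ4 for a binary object and checks that they are unchanged when the object is translated.

```python
import numpy as np

def moment_invariants(f):
    """Compute phi1..phi4 from Eqs. (11-34)-(11-42); f is a 2-D array."""
    M, N = f.shape
    x = np.arange(M).reshape(-1, 1)   # row coordinate, as in Eq. (11-34)
    y = np.arange(N).reshape(1, -1)   # column coordinate
    m = lambda p, q: (x**p * y**q * f).sum()                      # Eq. (11-34)
    xbar, ybar = m(1, 0) / m(0, 0), m(0, 1) / m(0, 0)             # Eq. (11-36)
    mu = lambda p, q: ((x - xbar)**p * (y - ybar)**q * f).sum()   # Eq. (11-35)
    eta = lambda p, q: mu(p, q) / mu(0, 0)**((p + q) / 2 + 1)     # Eqs. (11-37)-(11-38)
    phi1 = eta(2, 0) + eta(0, 2)                                  # Eq. (11-39)
    phi2 = (eta(2, 0) - eta(0, 2))**2 + 4 * eta(1, 1)**2          # Eq. (11-40)
    phi3 = (eta(3, 0) - 3 * eta(1, 2))**2 + (3 * eta(2, 1) - eta(0, 3))**2  # Eq. (11-41)
    phi4 = (eta(3, 0) + eta(1, 2))**2 + (eta(2, 1) + eta(0, 3))**2          # Eq. (11-42)
    return phi1, phi2, phi3, phi4

# The invariants should be (numerically) unchanged when the object is translated.
img = np.zeros((64, 64))
img[10:30, 15:40] = 1
shifted = np.roll(img, (20, 10), axis=(0, 1))
print(moment_invariants(img))
print(moment_invariants(shifted))
```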
images, then the corresponding pixels at the same spatial location in all images can be arranged as an n-dimensional vector:

\[
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \tag{11-46}
\]

Throughout this section, the assumption is that all vectors are column vectors (i.e., matrices of order n × 1). We can write them on a line of text simply by expressing them as x = (x1, x2, …, xn)^T, where T indicates the transpose.

We can treat the vectors as random quantities, just as we did when constructing an intensity histogram. The only difference is that, instead of talking about quantities like the mean and variance of the random variables, we now talk about mean vectors and covariance matrices. (You may find it helpful to review the tutorials on probability and matrix theory available on the book website.) The mean vector of the population is defined as

\[
\mathbf{m}_x = E\{\mathbf{x}\} \tag{11-47}
\]

where E{x} is the expected value of x, and the subscript denotes that m is associated with the population of x vectors. Recall that the expected value of a vector or matrix is obtained by taking the expected value of each element.

The covariance matrix of the vector population is defined as

\[
\mathbf{C}_x = E\{(\mathbf{x} - \mathbf{m}_x)(\mathbf{x} - \mathbf{m}_x)^T\} \tag{11-48}
\]

Because x is n-dimensional, Cx is an n × n matrix. Element c_ii of Cx is the variance of x_i, the ith component of the x vectors in the population, and element c_ij of Cx is the covariance between elements x_i and x_j of these vectors. Matrix Cx is real and symmetric. If elements x_i and x_j are uncorrelated, their covariance is zero and, therefore, c_ij = 0, resulting in a diagonal covariance matrix.

Because Cx is real and symmetric, finding a set of n orthonormal eigenvectors is always possible (Noble and Daniel [1988]). Let e_i and λ_i, i = 1, 2, …, n, be the eigenvectors and corresponding eigenvalues of Cx,† arranged (for convenience) in descending order so that λ_j ≥ λ_{j+1} for j = 1, 2, …, n − 1. Let A be a matrix whose rows are formed from the eigenvectors of Cx, arranged in descending value of their eigenvalues, so that the first row of A is the eigenvector corresponding to the largest eigenvalue.

Suppose that we use A as a transformation matrix to map the x's into vectors denoted by y's, as follows:

\[
\mathbf{y} = \mathbf{A}(\mathbf{x} - \mathbf{m}_x) \tag{11-49}
\]

This expression is called the Hotelling transform, which, as you will learn shortly, has some very interesting and useful properties. (The Hotelling transform is the same as the discrete Karhunen-Loève transform, so the two names are used interchangeably in the literature.)

† By definition, the eigenvectors and eigenvalues of an n × n matrix C satisfy the equation C e_i = λ_i e_i.

It is not difficult to show (see Problem 11.25) that the mean of the y vectors resulting from this transformation is zero; that is,

\[
\mathbf{m}_y = E\{\mathbf{y}\} = \mathbf{0} \tag{11-50}
\]

It follows from basic matrix theory that the covariance matrix of the y's is given in terms of A and Cx by the expression

\[
\mathbf{C}_y = \mathbf{A}\mathbf{C}_x\mathbf{A}^T \tag{11-51}
\]

Furthermore, because of the way A was formed, Cy is a diagonal matrix whose elements along the main diagonal are the eigenvalues of Cx; that is,

\[
\mathbf{C}_y = \begin{bmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_n \end{bmatrix} \tag{11-52}
\]

The off-diagonal elements of this covariance matrix are 0, so the elements of the y vectors are uncorrelated. Keep in mind that the λ_i are the eigenvalues of Cx and that the elements along the main diagonal of a diagonal matrix are its eigenvalues (Noble and Daniel [1988]). Thus, Cx and Cy have the same eigenvalues.

Another important property of the Hotelling transform deals with the reconstruction of x from y. Because the rows of A are orthonormal vectors, it follows that A^{-1} = A^T, and any vector x can be recovered from its corresponding y by using the expression

\[
\mathbf{x} = \mathbf{A}^T\mathbf{y} + \mathbf{m}_x \tag{11-53}
\]

But suppose that, instead of using all the eigenvectors of Cx, we form a matrix A_k from the k eigenvectors corresponding to the k largest eigenvalues, yielding a transformation matrix of order k × n. The y vectors would then be k-dimensional, and the reconstruction given in Eq. (11-53) would no longer be exact (this is somewhat analogous to the procedure we used in Section 11.3 to describe a boundary with a few Fourier coefficients).

The vector reconstructed by using A_k is

\[
\hat{\mathbf{x}} = \mathbf{A}_k^T\mathbf{y} + \mathbf{m}_x \tag{11-54}
\]

It can be shown that the mean squared error between x and x̂ is given by the expression

\[
e_{ms} = \sum_{j=1}^{n}\lambda_j - \sum_{j=1}^{k}\lambda_j = \sum_{j=k+1}^{n}\lambda_j \tag{11-55}
\]

Equation (11-55) indicates that the error is zero if k = n (that is, if all the eigenvectors are used in the transformation). Because the λ_j's decrease monotonically,
Eq. (11-55) also shows that the error can be minimized by selecting the k eigenvectors associated with the largest eigenvalues. Thus, the Hotelling transform is optimal in the sense that it minimizes the mean squared error between the vectors x and their approximations x̂. Because it uses the eigenvectors corresponding to the largest eigenvalues, the Hotelling transform also is known as the principal components transform.
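The chain from Eq. (11-47) to Eq. (11-55) can be verified numerically. The sketch below is our own NumPy illustration, not code from the text: the synthetic population, the sample-based estimates of m_x and C_x, and all names are ours. It builds A from the eigenvectors of Cx sorted by descending eigenvalue, confirms that Cy is diagonal, and confirms that the mean squared reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 4)) @ rng.normal(size=(4, 4))  # population of x vectors, one per row

mx = X.mean(axis=0)                        # sample estimate of Eq. (11-47)
Cx = np.cov(X, rowvar=False, bias=True)    # sample estimate of Eq. (11-48)
lam, E = np.linalg.eigh(Cx)                # Cx is real and symmetric
lam, E = lam[::-1], E[:, ::-1]             # sort eigenvalues in descending order
A = E.T                                    # rows of A are the eigenvectors of Cx

Y = (X - mx) @ A.T                         # Eq. (11-49) applied to every sample
Cy = np.cov(Y, rowvar=False, bias=True)
print(np.allclose(Cy, np.diag(lam)))       # Eqs. (11-51)-(11-52): Cy = diag(lam_1 .. lam_n)

k = 2                                      # keep only the two largest eigenvalues
Xhat = Y[:, :k] @ A[:k] + mx               # Eq. (11-54): x_hat = Ak^T y + mx
ems = ((X - Xhat) ** 2).sum(axis=1).mean() # observed mean squared error
print(np.allclose(ems, lam[k:].sum()))     # Eq. (11-55): sum of the discarded eigenvalues
```

Both printed checks should be True, which is the numerical counterpart of the optimality argument just given.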
FIGURE 11.39 Forming of a feature vector from corresponding pixels in six images (spectral bands 1 through 6 contribute components x1 through x6).

EXAMPLE 11.16 : Using principal components for image description.

Figure 11.38 shows six multispectral satellite images corresponding to six spectral bands: visible blue (450–520 nm), visible green (520–600 nm), visible red (630–690 nm), near infrared (760–900 nm), middle infrared (1,550–1,750 nm), and thermal infrared (10,400–12,500 nm). The objective of this example is to illustrate how to use principal components as image features.

Organizing the images as in Fig. 11.39 leads to the formation of a six-element vector x from each set of corresponding pixels in the images, as discussed earlier in this section. The images in this example are of size 564 × 564 pixels, so the population consisted of (564)² = 318,096 vectors, from which the mean vector, covariance matrix, and corresponding eigenvalues and eigenvectors were computed. The eigenvectors were then used as the rows of matrix A, and a set of y vectors was obtained using Eq. (11-49). Similarly, we used Eq. (11-51) to obtain Cy. Table 11.6 shows the eigenvalues of this matrix. Note the dominance of the first two eigenvalues.

A set of principal component images was generated using the y vectors mentioned in the previous paragraph (images are constructed from vectors by applying Fig. 11.39 in reverse). Figure 11.40 shows the results. Figure 11.40(a) was formed from the first component of the 318,096 y vectors, Fig. 11.40(b) from the second component of these vectors, and so on, so these images are of the same size as the original images in Fig. 11.38. The most obvious feature of the principal component images is that a significant portion of the contrast detail is contained in the first two images, and it decreases rapidly from there. The reason can be explained by looking at the eigenvalues. As Table 11.6 shows, the first two eigenvalues are much larger than the others. Because the eigenvalues are the variances of the elements of the y vectors, and variance is a measure of intensity contrast, it is not unexpected that the images formed from the vector components corresponding to the largest eigenvalues would exhibit the highest contrast. In fact, the first two images in Fig. 11.40 account for about 89% of the total variance. The other four images have low contrast detail because they account for only the remaining 11%.

According to Eqs. (11-54) and (11-55), if we used all the eigenvectors in matrix A we could reconstruct the original images from the principal component images with zero error between the original and reconstructed images (i.e., the images would be identical). If the objective is to store and/or transmit the principal component images and the transformation matrix for later reconstruction of the original images, it would make no sense to store and/or transmit all the principal component images because nothing would be gained. Suppose, however, that we keep and/or transmit only the two principal component images. Then there would be significant savings in storage and/or transmission (matrix A would be of size 2 × 6, so its impact would be negligible).

Figure 11.41 shows the results of reconstructing the six multispectral images from the two principal component images corresponding to the largest eigenvalues. The first five images are quite close in appearance to the originals in Fig. 11.38, but this is not true for the sixth image. The reason is that the original sixth image is actually blurry, but the two principal component images used in the reconstruction are sharp; therefore, the blurry "detail" is lost. Figure 11.42 shows the differences between the original and reconstructed images. The images in Fig. 11.42 were enhanced to highlight the differences between them. If they were shown without enhancement, the first five images would appear almost all black, with the sixth (difference) image showing the most variability.

FIGURE 11.38 Multispectral images in the (a) visible blue, (b) visible green, (c) visible red, (d) near infrared, (e) middle infrared, and (f) thermal infrared bands. (Images courtesy of NASA.)

TABLE 11.6 Eigenvalues of Cx obtained from the images in Fig. 11.38.

λ1       λ2      λ3      λ4     λ5    λ6
10344    2966    1401    203    94    31

FIGURE 11.40 The six principal component images obtained from vectors computed using Eq. (11-49). Vectors are converted to images by applying Fig. 11.39 in reverse.

FIGURE 11.41 Multispectral images reconstructed using only the two principal component images corresponding to the two principal component vectors with the largest eigenvalues. Compare these images with the originals in Fig. 11.38.

FIGURE 11.42 Differences between the original and reconstructed images. All images were enhanced by scaling them to the full [0, 255] range to facilitate visual analysis.
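The mechanics of this example are easy to reproduce. The sketch below is our own illustration; the six correlated synthetic "bands" merely stand in for the NASA images, which are not available here. It stacks the bands into 6-D pixel vectors as in Fig. 11.39, applies Eq. (11-49), and reconstructs from the two dominant components as in Eq. (11-54).

```python
import numpy as np

# Six correlated synthetic "spectral bands" stand in for the images of Fig. 11.38.
rng = np.random.default_rng(2)
h, w, n = 64, 64, 6
base = rng.normal(size=(h, w))
bands = np.stack([(i + 1) * base + 0.1 * rng.normal(size=(h, w)) for i in range(n)], axis=-1)

X = bands.reshape(-1, n)                   # one 6-D vector per pixel location (Fig. 11.39)
mx = X.mean(axis=0)
lam, E = np.linalg.eigh(np.cov(X, rowvar=False))
A = E[:, ::-1].T                           # rows = eigenvectors, descending eigenvalues
Y = (X - mx) @ A.T                         # Eq. (11-49)

pc_images = Y.reshape(h, w, n)             # the n principal component images (cf. Fig. 11.40)
k = 2
Xhat = Y[:, :k] @ A[:k] + mx               # Eq. (11-54): reconstruct from two components
reconstructed = Xhat.reshape(h, w, n)      # cf. Fig. 11.41
print("fraction of variance in first two components:",
      lam[::-1][:k].sum() / lam.sum())
```

With real image bands, the printed fraction plays the role of the roughly 89% figure quoted in the example.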
EXAMPLE 11.17 : Using principal components for normalizing for variations in size, translation, and rotation.

As we mentioned earlier in this chapter, feature descriptors should be as independent as possible of variations in size, translation, and rotation. Principal components provide a convenient way to normalize boundaries and/or regions for variations in these three variables. Consider the object in Fig. 11.43, and assume that its size, location, and orientation (rotation) are arbitrary. The points in the region (or its boundary) may be treated as 2-D vectors, x = (x1, x2)^T, where x1 and x2 are the coordinates of any object point. All the points in the region or boundary constitute a 2-D vector population that can be used to compute the covariance matrix Cx and mean vector mx. One eigenvector of Cx points in the direction of maximum variance (data spread) of the population, while the second eigenvector is perpendicular to the first, as Fig. 11.43(b) shows. In terms of the present discussion, the principal components transform in Eq. (11-49) accomplishes two things: (1) it establishes the center of the transformed coordinate system at the centroid (mean) of the population, because mx is subtracted from each x; and (2) the y coordinates (vectors) it generates are rotated versions of the x's, so that the data align with the eigenvectors. If we define a (y1, y2) axis system so that y1 is along the first eigenvector and y2 is along the second, then the geometry that results is as illustrated in Fig. 11.43(c). That is, the dominant data directions are aligned with the new axis system. The same result will be obtained regardless of the size, translation, or rotation of the object, provided that all points in the region or boundary undergo the same transformation. If we wished to size-normalize the transformed data, we would divide the coordinates by the corresponding eigenvalues.

Observe in Fig. 11.43(c) that the points in the y-axis system can have both positive and negative values. To convert all coordinates to positive values, we simply subtract the vector (y1 min, y2 min)^T from all the y vectors. To displace the resulting points so that they are all greater than 0, as in Fig. 11.43(d), we add to them a vector (a, b)^T, where a and b are both greater than 0.

FIGURE 11.43 (a) An object. (b) Object showing the eigenvectors of its covariance matrix (e1 points in the direction of maximum variance; e2 is perpendicular to it). (c) Transformed object, obtained using Eq. (11-49). (d) Object translated so that all its coordinate values are greater than 0.

Although the preceding discussion is straightforward in principle, the mechanics are a frequent source of confusion. Thus, we conclude this example with a simple manual illustration. Figure 11.44(a) shows
four points with coordinates (1, 1), (2, 4), (4, 2), and (5, 5). The mean vector, covariance matrix, and normalized (unit-length) eigenvectors of this population are

\[
\mathbf{m}_x = \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \qquad \mathbf{C}_x = \begin{bmatrix} 3.333 & 2.000 \\ 2.000 & 3.333 \end{bmatrix}
\]

and

\[
\mathbf{e}_1 = \begin{bmatrix} 0.707 \\ 0.707 \end{bmatrix}, \qquad \mathbf{e}_2 = \begin{bmatrix} -0.707 \\ 0.707 \end{bmatrix}
\]

The corresponding eigenvalues are λ1 = 5.333 and λ2 = 1.333. Figure 11.44(b) shows the eigenvectors superimposed on the data. From Eq. (11-49), the transformed points (the y's) are (−2.828, 0)^T, (0, −1.414)^T, (0, 1.414)^T, and (2.828, 0)^T. These points are plotted in Fig. 11.44(c). Note that they are aligned with the y-axes and that they have fractional values. When working with images, coordinate values are integers, making it necessary to round all values to their nearest integer. Figure 11.44(d) shows the points rounded to the nearest integer and their locations shifted so that all coordinate values are integers greater than 0, as in the original figure.

When transforming image pixels, keep in mind that image coordinates are the same as matrix coordinates; that is, (x, y) represents (r, c), and the origin is at the top left. The axes of the principal components just illustrated are oriented as shown in Figs. 11.43(a) and (d). You need to keep this in mind when interpreting the results of applying a principal components transformation to objects in an image.
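These numbers are easy to verify. The following is a minimal sketch of our own (note that an eigensolver may flip the sign of either eigenvector; the signs are arbitrary and only flip the sign of the corresponding y coordinate):

```python
import numpy as np

pts = np.array([[1, 1], [2, 4], [4, 2], [5, 5]], dtype=float)
mx = pts.mean(axis=0)               # mean vector: (3, 3)
Cx = np.cov(pts, rowvar=False)      # [[3.333, 2.000], [2.000, 3.333]]
lam, E = np.linalg.eigh(Cx)         # eigenvalues in ascending order: 1.333, 5.333
A = E[:, ::-1].T                    # rows e1, e2, in descending eigenvalue order
Y = (pts - mx) @ A.T                # Eq. (11-49): (+-2.828, 0) and (0, +-1.414)
print(mx, Cx, lam[::-1], A, Y, sep="\n\n")
```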
                                                                                                                                                                                   11.6
                                                                0.707           − 0.707                                                                                        The descriptors introduced in Sections 11.2 through 11.4 are well suited for appli-
                                                          e1 =         , e 2 =  0.707 
                                                                0.707                                                                                                          cations (e.g., industrial inspection), in which individual regions can be segmented
                                                                                                                                                                                   reliably using methods such as the ones discussed in Chapters 10 and 11. With the
               The corresponding eigenvalues are *1 = 5.333 and * 2 = 1.333. Figure 11.44(b) shows the eigenvec-
                                                                                                                                                                                   exception of the application in Example 11.17, the principal components feature
            tors superimposed on the data. From Eq. (11-49), the transformed points (the y’s) are (−2.828, 0)T ,
                                                                                                                                                                                   vectors in Section 11.5 are different from the earlier material, in the sense that they
            (0, − 1.414)T, (0, 1.414)T , and (2.828, 0)T. These points are plotted in Fig. 11.44(c). Note that they are
                                                                                                                                                                                   are based on multiple images. But even these descriptors are localized to sets of
            aligned with the y-axes and that they have fractional values. When working with images, coordinate
                                                                                                                                                                                   corresponding pixels. In some applications, such as searching image databases for
            values are integers, making it necessary to round all values to their nearest integer value. Figure 11.44(d)
            shows the points rounded to the nearest integer and their location shifted so that all coordinate values                                                               matches (e.g., as in human face recognition), the variability between images is so
            are integers greater than 0, as in the original figure.                                                                                                                extensive that the methods in Chapters 10 and 11 are not applicable.
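These numbers are easy to verify numerically. The following sketch (ours, not part of the book's materials; it assumes NumPy) reproduces the mean vector, covariance matrix, eigenvalues, eigenvectors, and transformed points of this example:

```python
import numpy as np

# The four points of the example, one per row.
x = np.array([[1, 1], [2, 4], [4, 2], [5, 5]], dtype=float)

mx = x.mean(axis=0)            # mean vector: [3, 3]
Cx = np.cov(x, rowvar=False)   # covariance: [[3.333, 2.0], [2.0, 3.333]]

# Eigenanalysis of the symmetric covariance matrix. eigh returns the
# eigenvalues in ascending order, so reverse to get 5.333 before 1.333.
lam, e = np.linalg.eigh(Cx)
lam = lam[::-1]                # [5.333, 1.333]
A = e[:, ::-1].T               # rows of A are e1 and e2 (up to sign)

# Hotelling transform of Eq. (11-49): y = A(x - mx).
y = (x - mx) @ A.T
# Rows of y: (-2.828, 0), (0, 1.414), (0, -1.414), (2.828, 0); the sign
# of the second coordinate depends on the arbitrary sign chosen for e2.
```

Rounding y to the nearest integers and shifting so that all coordinates are positive reproduces Fig. 11.44(d).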
FIGURE 11.44 A manual example. (a) Original points. (b) Eigenvectors of the covariance matrix of the points in (a). (c) Transformed points obtained using Eq. (11-49). (d) Points from (c), rounded and translated so that all coordinate values are integers greater than 0. The dashed lines are included to facilitate viewing; they are not part of the data.

11.6 WHOLE-IMAGE FEATURES

The descriptors introduced in Sections 11.2 through 11.4 are well suited for applications (e.g., industrial inspection) in which individual regions can be segmented reliably using methods such as the ones discussed in Chapters 10 and 11. With the exception of the application in Example 11.17, the principal-components feature vectors in Section 11.5 differ from the earlier material in the sense that they are based on multiple images. But even these descriptors are localized to sets of corresponding pixels. In some applications, such as searching image databases for matches (e.g., as in human face recognition), the variability between images is so extensive that the methods in Chapters 10 and 11 are not applicable.

The state of the art in image processing is such that, as the complexity of a task increases, the number of techniques suitable for addressing that task decreases. This is particularly true when dealing with feature descriptors applicable to entire images that are members of a large family of images. In this section, we discuss two of the principal feature detection methods currently being used for this purpose. One is based on detecting corners, and the other works with entire regions in an image. Then, in Section 11.7, we present a feature detection and description approach designed specifically to work with these types of features.

[Margin note: The discussion in Sections 12.5 through 12.7 dealing with neural networks is also important in terms of processing large numbers of entire images for the purpose of characterizing their content.]

THE HARRIS-STEPHENS CORNER DETECTOR

[Margin note: Our use of the term "corner" is broader than just 90° corners; it refers to features that are "corner-like."]

Intuitively, we think of a corner as a rapid change of direction in a curve. Corners are highly effective features because they are distinctive and reasonably invariant to viewpoint. Because of these characteristics, corners are used routinely for matching image features in applications such as tracking for autonomous navigation, stereo machine vision algorithms, and image database queries.

In this section, we discuss an algorithm for corner detection formulated by Harris and Stephens [1988]. The idea behind the Harris-Stephens (HS) corner detector is illustrated in Fig. 11.45. The basic approach is this: Corners are detected by running a small window over an image, as we did in Chapter 3 for spatial filtering. The detector window is designed to compute intensity changes. We are interested in three scenarios: (1) areas of zero (or small) intensity changes in all directions, which happens when the window is located in a constant (or nearly constant) region, as in location A in Fig. 11.45; (2) areas of changes in one direction but no (or small) changes in the orthogonal direction, which happens when the window spans a boundary between two regions, as in location B; and (3) areas of significant changes in all directions, which happens when the window contains a corner (or isolated points), as in location C. The HS corner detector is a mathematical formulation that attempts to differentiate between these three conditions.

FIGURE 11.45 Illustration of how the Harris-Stephens corner detector operates in the three types of subregions indicated by A (flat), B (edge), and C (corner). The wiggly arrows indicate graphically a directional response in the detector as it moves in the three areas shown.

[Margin note: A patch is the image area spanned by the detector window at any given time.]

Let f denote an image, and let f(s, t) denote a patch of the image defined by the values of (s, t). A patch of the same size, but shifted by (x, y), is given by f(s + x, t + y). Then, the weighted sum of squared differences between the two patches is given by

\[ C(x, y) = \sum_s \sum_t w(s, t)\left[ f(s + x, t + y) - f(s, t) \right]^2 \tag{11-56} \]

where w(s, t) is a weighting function to be discussed shortly. The shifted patch can be approximated by the linear terms of a Taylor expansion,

\[ f(s + x, t + y) \approx f(s, t) + x f_x(s, t) + y f_y(s, t) \tag{11-57} \]

where f_x(s, t) = ∂f/∂x and f_y(s, t) = ∂f/∂y, both evaluated at (s, t). We can then write Eq. (11-56) as

\[ C(x, y) = \sum_s \sum_t w(s, t)\left[ x f_x(s, t) + y f_y(s, t) \right]^2 \tag{11-58} \]

This equation can be written in matrix form as

\[ C(x, y) = \begin{bmatrix} x & y \end{bmatrix} \mathbf{M} \begin{bmatrix} x \\ y \end{bmatrix} \tag{11-59} \]
where

\[ \mathbf{M} = \sum_s \sum_t w(s, t)\,\mathbf{A} \tag{11-60} \]

and

\[ \mathbf{A} = \begin{bmatrix} f_x^2 & f_x f_y \\ f_x f_y & f_y^2 \end{bmatrix} \tag{11-61} \]

Matrix M sometimes is called the Harris matrix. It is understood that its terms are evaluated at (s, t). If w(s, t) is isotropic, then M is symmetric because A is. The weighting function w(s, t) used in the HS detector generally has one of two forms: (1) it is 1 inside the patch and 0 elsewhere (i.e., it has the shape of a box lowpass filter kernel), or (2) it is an exponential function of the form

\[ w(s, t) = e^{-(s^2 + t^2)/2\sigma^2} \tag{11-62} \]

The box is used when computational speed is paramount and the noise level is low. The exponential form is used when data smoothing is important.
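Equations (11-60) through (11-62) translate directly into a few lines of array code. The sketch below (our own helper, not from the book; it assumes NumPy and SciPy) computes the three distinct entries of M at every pixel, implementing the Gaussian form of w(s, t) as a smoothing filter applied to the derivative products:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_matrix_entries(f, sigma=1.5):
    """Per-pixel entries of the Harris matrix M (Eqs. 11-60 and 11-61).
    The weighted sum over (s, t) is implemented as Gaussian smoothing of
    the derivative products, i.e., the exponential w of Eq. (11-62)."""
    f = f.astype(float)
    # Derivatives; x runs along the rows and y along the columns,
    # per the book's coordinate convention (Fig. 2.19).
    fx, fy = np.gradient(f)
    M11 = gaussian_filter(fx * fx, sigma)   # weighted sum of fx^2
    M12 = gaussian_filter(fx * fy, sigma)   # weighted sum of fx*fy
    M22 = gaussian_filter(fy * fy, sigma)   # weighted sum of fy^2
    return M11, M12, M22
```

Replacing gaussian_filter with a uniform (box) filter gives the first form of w(s, t).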
As illustrated in Fig. 11.45, a corner is characterized by large values in region C, in both spatial directions. However, when the patch spans a boundary there will also be a response in one direction. The question is: how can we tell the difference? As we discussed in Section 11.5 (see Example 11.17), the eigenvectors of a real, symmetric matrix (such as M above) point in the direction of maximum data spread, and the corresponding eigenvalues are proportional to the amount of data spread in the direction of the eigenvectors. In fact, the eigenvectors are the major axes of an ellipse fitting the data, and the magnitudes of the eigenvalues are the distances from the center of the ellipse to the points where it intersects the major axes. Figure 11.46 illustrates how we can use these properties to differentiate between the three cases in which we are interested.

[Margin note: As noted in Chapter 3, we do not use bold notation for vectors and matrices representing spatial kernels.]

The small image patches in Figs. 11.46(a) through (c) are representative of regions A, B, and C in Fig. 11.45. In Fig. 11.46(d), we show values of (fx, fy) computed using the derivative kernels wy = [−1 0 1] and wx = wy^T (remember, we use the coordinate system defined in Fig. 2.19). Because we compute the derivatives at each point in the patch, variations caused by noise result in scattered values, with the spread of the scatter being directly related to the noise level and its properties. As expected, the derivatives from the flat region form a nearly circular cluster, whose eigenvalues are almost identical, yielding a nearly circular fit to the points (we label these eigenvalues as "small" in relation to the other two plots). Figure 11.46(e) shows the derivatives of the patch containing the edge. Here, the spread is greater along the x-axis, and about the same as in Fig. 11.46(a) along the y-axis. Thus, eigenvalue λx is "large" while λy is "small." Consequently, the ellipse fitting the data is elongated in the x-direction. Finally, Fig. 11.46(f) shows the derivatives of the patch containing the corner. Here, the data are spread along both directions, resulting in two large eigenvalues and a much larger and nearly circular fitting ellipse. From this we conclude that: (1) two small eigenvalues indicate nearly constant intensity; (2) one small and one large eigenvalue imply the presence of a vertical or horizontal boundary; and (3) two large eigenvalues imply the presence of a corner or (unfortunately) isolated bright points.

FIGURE 11.46 (a)–(c) Noisy images and image patches (small squares) encompassing image regions similar in content to those in Fig. 11.45; the panels are labeled "Flat," "Straight Edge," and "Corner." (d)–(f) Plots of value pairs (fx, fy) showing the characteristics of the eigenvalues of M that are useful for detecting the presence of a corner in an image patch: λx and λy are both small in (d); λx is large and λy is small in (e); both are large in (f).

[Margin note: The eigenvalues of the 2 × 2 matrix M can be expressed in closed form (see Problem 11.31). However, their computation requires squares and square roots, which are expensive to process.]

Thus, we see that the eigenvalues of the matrix formed from derivatives in the image patch can be used to differentiate between the three scenarios of interest. However, instead of using the eigenvalues (which are expensive to compute), the HS detector utilizes a measure of corner response based on the fact that the trace of a square matrix is equal to the sum of its eigenvalues, and its determinant is equal to the product of its eigenvalues.

[Margin note: The advantage of this formulation is that the trace is the sum of the main-diagonal terms of M (just two numbers). The determinant of a 2 × 2 matrix is the product of the main-diagonal elements minus the product of the cross elements. These are trivial computations.]

The measure is defined as

\[ R = \lambda_x \lambda_y - k(\lambda_x + \lambda_y)^2 = \det(\mathbf{M}) - k\,\mathrm{trace}^2(\mathbf{M}) \tag{11-63} \]

where k is a constant to be explained shortly. Measure R has large positive values when both eigenvalues are large, indicating the presence of a corner; it has large negative values when one eigenvalue is large and the other small, indicating an edge;
and its absolute value is small when both eigenvalues are small, indicating that the image patch under consideration is flat.

Constant k is determined empirically, and its range of values depends on the implementation. For example, the MATLAB Image Processing Toolbox uses 0 < k < 0.25. You can interpret k as a "sensitivity factor": the smaller it is, the more likely the detector is to find corners. Typically, R is used with a threshold, T. We say that a corner has been detected at an image location only if R > T for a patch at that location.
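Continuing the sketch started after Eq. (11-62) (harris_matrix_entries is the helper defined there), the response of Eq. (11-63) and the threshold test take only a few more lines. The function names and the scaling of R by its maximum are our assumptions; the text does not spell out how R is normalized before it is compared with T:

```python
def harris_response(f, k=0.04, sigma=1.5):
    """Corner response of Eq. (11-63): R = det(M) - k * trace(M)**2,
    computed at every pixel without forming eigenvalues explicitly.
    (For reference, the eigenvalues themselves have the closed form
    (trace +/- sqrt(trace**2 - 4*det)) / 2, which needs a square root.)"""
    M11, M12, M22 = harris_matrix_entries(f, sigma)
    det = M11 * M22 - M12 * M12   # product of the two eigenvalues
    trace = M11 + M22             # sum of the two eigenvalues
    return det - k * trace ** 2

def detect_corners(f, k=0.04, T=0.01, sigma=1.5):
    """Boolean corner mask: R > T, with R expressed as a fraction of its
    maximum so that thresholds such as T = 0.01 are meaningful.
    Non-maxima suppression (keeping only local peaks) is omitted."""
    R = harris_response(f, k, sigma)
    return R > T * R.max()
```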
EXAMPLE 11.18: Applying the HS corner detector.

Figure 11.47(a) shows a noisy image, and Fig. 11.47(b) is the result of using the HS corner detector with k = 0.04 and T = 0.01 (the default values in our implementation). All corners of the squares were detected correctly, but the number of false detections is too high (note that all errors occurred on the right side of the image, where the difference in intensity between squares is less). Figure 11.47(c) shows the result obtained by increasing k to 0.1 and leaving T at 0.01. This time, all corners were detected correctly. As Fig. 11.47(d) shows, increasing the threshold to T = 0.1 yielded the same result. In fact, using the default value of k and setting T to 0.1 also produced the same result, as Fig. 11.47(e) shows. The point of all this is that there is considerable flexibility in the interplay between the values of k and T. Figure 11.47(f) shows the result obtained using the default value for k and T = 0.3. As expected, increasing the value of the threshold eliminated some corners, yielding in this case only the corners of the squares with larger intensity differences. Increasing the value of k to 0.1 and setting T to its default value yielded the same result, as did using k = 0.1 and T = 0.3, demonstrating again the flexibility in the values chosen for these two parameters. However, as the level of noise increases, the range of usable values becomes narrower, as the results in the next paragraph illustrate.

Figure 11.48(a) shows the checkerboard corrupted by a much higher level of additive Gaussian noise (see the figure caption). Although this image does not appear much different from Fig. 11.47(a), the results using the default values of k and T are much worse than before. False corners were detected even on the left side of the image, where the intensity differences are much stronger. Figure 11.48(c) is the result of increasing k to near the maximum value in our implementation (0.25) while keeping T at its default value. This time, k alone could not overcome the higher noise level. On the other hand, returning k to its default value and increasing T to 0.15 produced a perfect result, as Fig. 11.48(d) shows.

Figure 11.49(a) shows a more complex image with a significant number of corners embedded in various ranges of intensities. Figure 11.49(b) is the result obtained using the default values for k and T.

FIGURE 11.47 (a) A 600 × 600 image with values in the range [0, 1], corrupted by additive Gaussian noise with mean 0 and variance 0.006. (b) Result of applying the HS corner detector with k = 0.04 and T = 0.01 (the defaults). Several errors are visible. (c) Result using k = 0.1 and T = 0.01. (d) Result using k = 0.1 and T = 0.1. (e) Result using k = 0.04 and T = 0.1. (f) Result using k = 0.04 and T = 0.3 (only the strongest corners on the left were detected).

FIGURE 11.48 (a) Same as Fig. 11.47(a), but corrupted with Gaussian noise of mean 0 and variance 0.01. (b) Result of using the HS detector with k = 0.04 and T = 0.01 [compare with Fig. 11.47(b)]. (c) Result with k = 0.249 (near the highest value in our implementation) and T = 0.01. (d) Result of using k = 0.04 and T = 0.15.
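As a usage sketch only, the parameter interplay of this example can be explored by sweeping k and T with the detect_corners helper defined earlier (img is assumed to be a checkerboard image like the one in Fig. 11.47(a), loaded elsewhere):

```python
# Larger k or larger T both prune weaker corner responses.
for k, T in [(0.04, 0.01), (0.1, 0.01), (0.1, 0.1), (0.04, 0.1), (0.04, 0.3)]:
    corners = detect_corners(img, k=k, T=T)
    print(f"k={k}, T={T}: {corners.sum()} corner pixels")
```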
FIGURE 11.50 (a) Image rotated 5°. (b) Corners detected using the parameters used to obtain Fig. 11.49(f).

MAXIMALLY STABLE EXTREMAL REGIONS (MSERs)

\[ I(p) > I(q) \qquad \text{for all } p \in R \text{ and all } q \text{ on the boundary of } R \tag{11-64} \]
where I is the image under consideration, and p and q are image points. This equation indicates that an extremal region R is a region of I with the property that the intensity at any point in the region is higher than the intensity at any point on the boundary of the region. As usual, we assume that image intensities are integers, ordered from 0 (black) to the maximum intensity (e.g., 255 for 8-bit images), which is represented by white.

MSERs are found by analyzing the nodes of the component tree. For each connected region in the tree, we compute a stability measure, c, defined as

\[ c\!\left(R_j^{T+n\Delta T}\right) = \frac{\left|R_i^{T+(n-1)\Delta T}\right| - \left|R_k^{T+(n+1)\Delta T}\right|}{\left|R_j^{T+n\Delta T}\right|} \tag{11-65} \]

where |R| is the size of the area (number of pixels) of connected region R, T is a threshold value in the range T ∈ [min(I), max(I)], and ΔT is a specified threshold increment. Regions R_i^{T+(n−1)ΔT}, R_j^{T+nΔT}, and R_k^{T+(n+1)ΔT} are connected regions obtained at threshold levels T + (n−1)ΔT, T + nΔT, and T + (n+1)ΔT, respectively. In terms of the component tree, regions R_i and R_k are respectively the parent and child of region R_j. Because T + (n−1)ΔT < T + (n+1)ΔT, we are guaranteed that |R_i^{T+(n−1)ΔT}| ≥ |R_k^{T+(n+1)ΔT}|. It then follows from Eq. (11-65) that c ≥ 0. MSERs are the regions corresponding to the nodes in the tree whose stability value is a local minimum along the path of the tree containing that region. What this means in practice is that maximally stable regions are regions whose sizes do not change appreciably across their two neighboring thresholded images (a 2ΔT range).

Figure 11.51 illustrates the concepts just introduced. The grayscale image at the top consists of some simple regions of constant intensity, with values in the range [0, 255]. Based on the explanation of Eqs. (11-64) and (11-65), we used the threshold T = 10, which is in the range [min(I) = 5, max(I) = 225]. Choosing ΔT = 50 segmented all the different regions of the image. The column of binary images on the left contains the results of thresholding the grayscale image with the threshold values shown. The resulting component tree is on the right. Note that the tree is shown "root up," which is the way you would normally program it.

All the squares in the grayscale image are of the same size (area); therefore, regardless of the image size, we can normalize the size of each square to 1. For example, if the image is of size 400 × 400 pixels, the size of each square is 100 × 100 = 10^4 pixels. Normalizing the size to 1 means that size 1 corresponds to 10^4 pixels (one square), size 2 corresponds to 2 × 10^4 pixels (two squares), and so forth. You can arrive at the same conclusion by noticing that the ratio in Eq. (11-65) eliminates the common 10^4 factor.

FIGURE 11.51 Detecting MSERs. Top: grayscale image. Left: thresholded images using T = 10 and ΔT = 50. Right: component tree, showing the individual regions. Only one MSER was detected (see the dashed tree node on the rightmost branch of the tree). Each level of the tree is formed from the thresholded image on the left at that same level. Each node of the tree contains one extremal region (connected component), shown in white and denoted by a subscripted R. The grayscale image values (one entry per square) are

  225  175   90   90
  125    5   90  225
    5    5    5  225
  125    5  225  225

and the tree levels are: T + ΔT = 60: R1 (area 11); T + 2ΔT = 110: R2 (area 3, c = 3), R3 (area 1), R4 (area 3, c = 8/3); T + 3ΔT = 160: R5 (area 2, c = 1), R6 (area 3, c = 0); T + 4ΔT = 210: R7 (area 1), R8 (area 3).

The component tree in Fig. 11.51 is a good summary of how the MSER algorithm works. The first level is the result of thresholding I with T + ΔT = 60. There is only one connected component (white pixels) in the thresholded image on the left. The size of the connected component is 11 normalized units. As mentioned above, each node of a component tree, denoted by a subscripted R, contains one connected component consisting of white pixels. The next level in the tree is formed from the regions in the binary image obtained by thresholding I using T + 2ΔT = 110. As you can see on the left, this image has three connected components, so we create three nodes in the component tree at the level of the thresholded image. Similarly, the binary image obtained by thresholding I with T + 3ΔT = 160 has two connected components, so we create two nodes in the tree at this level. These two connected components are children of the connected components in the previous level, so we place the new nodes in the same paths as their respective parents. The next level of the tree is explained in the same manner. Note that the center node in the previous level had no children, so that path of the tree ends in the second level.
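The stability computation of Eq. (11-65) can be sketched compactly on a stack of thresholded images. The helper below is our own (assuming NumPy and SciPy), not the book's implementation; for simplicity it takes the largest component one level up as "the" child when a node has several:

```python
import numpy as np
from scipy.ndimage import label

def mser_stabilities(I, T=10, dT=50):
    """Stability c of Eq. (11-65) for every region in a component tree
    built by thresholding I at T + dT, T + 2*dT, ...  A region's parent
    is the component containing it one level below; its child is taken,
    for simplicity, as its largest component one level above."""
    levels = np.arange(T + dT, I.max(), dT)
    labeled = [label(I > t)[0] for t in levels]
    out = []
    for n in range(1, len(levels) - 1):
        below, here, above = labeled[n - 1], labeled[n], labeled[n + 1]
        for r in range(1, here.max() + 1):
            mask = here == r
            area_j = mask.sum()
            area_i = (below == below[mask][0]).sum()   # parent's area
            kids = np.unique(above[mask])
            kids = kids[kids != 0]
            if kids.size == 0:
                continue                               # leaf: no child to compare
            area_k = max((above == c).sum() for c in kids)
            out.append((levels[n], r, area_j, (area_i - area_k) / area_j))
    return out
# MSERs correspond to nodes whose c is a local minimum along their tree path.
```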
                                    components, so we create two nodes in the tree at this level. These two connected                              a b
                                    components are children of the connected components in the previous level, so we                               c d
                                    place the new nodes in the same path as their respective parents. The next level of                           FIGURE 11.52
                                    the tree is explained in the same manner. Note that the center node in the previous                           (a) 600 × 570 CT
                                    level had no children, so that path of the tree ends in the second level.                                     slice of a human
                                                                                                                                                  head. (b) Image
                                       Because we need to check size variations between parent and child regions to deter-                        smoothed with a
                                    mine stability, only the two middle regions (corresponding to threshold values of 110                         box kernel of size
                                    and 160) are relevant in this example. As you can see in our component tree, only R6                          15 × 15 elements. (c)
                                    has a parent and child of similar size (the sizes are identical in this case). Therefore,                     A extremal region
                                                                                                                                                  along the path of the
                                    region R6 is the only MSER detected in this case. Observe that if we had used a single                        tree containing one
                                    global threshold to detect the brightest regions, region R7 would have been detected                          MSER.
                                    also (an undesirable result in this context). Thus, we see that although MSERs are                            (d) The MSER.
                                    based on intensity, they also depend on the nature of the background surrounding a                            (All MSER regions
                                                                                                                                                  were limited to the
                                    region. In this case, R6 was surrounded by a darker background than R7 , and the darker                       range 10,260 – 34,200
                                    background was thresholded earlier in the tree, allowing the size of R6 to remain con-                        pixels, correspond-
                                    stant over the two, 2!T neighboring range required for detection as an MSER.                                  ing to a range
                                       In our example, it was easy to detect an MSER as the only region that did not                              between 3%
                                                                                                                                                  and 10% of image
                                    change size, which gave a stability factor 0. A value of zero automatically implies
                                                                                                                                                  size.)
                                    that an MSER has been found because the parent and child regions are of the                                   (Original image
                                    same size. When working with more complex images, the values of stability fac-                                courtesy of Dr.
                                    tors seldom are zero because of variations in intensity caused by variables such                              David R.
                                    as illumination, viewpoint, and noise. The concept of a local minimum mentioned                               Pickens, Vanderbilt
                                                                                                                                                  University.)
                                    earlier is simply a way of saying that MSERs are extremal regions that do change
                                    size significantly over a 2!T thresholding range. What is considered a “significant”
change depends on the application.
   It is not unusual for numerous MSERs to be detected, many of which may not be meaningful because of their size. One way to control the number of regions detected is by the choice of ∆T. Another is to label as insignificant any region whose size is not in a specified size range. We illustrate this in Example 11.19.
   Matas et al. [2002] indicate that MSERs are affine-covariant (see Section 11.1). This follows directly from the fact that area ratios are preserved under affine transformations, which in turn implies that for an affine transformation the original and transformed regions are related by that transformation. We illustrate this property in Figs. 11.54 and 11.55.
   Finally, keep in mind that the preceding MSER formulation is designed to detect bright regions with darker surroundings. The same formulation applied to the negative (in the sense defined in Section 3.2) of an image will detect dark regions with lighter surroundings. If interest lies in detecting both types of regions simultaneously, we form the union of both sets of MSERs.

EXAMPLE 11.19: Extracting MSERs from grayscale images.

Figure 11.52(a) shows a slice image from a CT scan of a human head, and Fig. 11.52(b) shows the result of smoothing Fig. 11.52(a) with a box kernel of size 15 × 15 elements. Smoothing is used routinely as a preprocessing step when ∆T is relatively small. In this case, we used T = 0 and ∆T = 10. This increment was small enough to require smoothing for proper MSER detection. In addition, we used a "size filter," in the sense that the size (area) of an MSER had to be between 10,262 and 34,200 pixels; these size limits are 3% and 10% of the size of the image, respectively.
   Figure 11.53 illustrates MSER detection on a more complex image. We used less blurring (a 5 × 5 box kernel) in this image because it has more fine detail. We used the same T and ∆T as in Fig. 11.52, and a valid MSER size in the range 10,000 to 30,000 pixels, corresponding approximately to 3% and 8% of image size, respectively. Two MSERs were detected using these parameters, as Figs. 11.53(c) and (d) show. The composite MSER, shown in Fig. 11.53(e), is a good representation of the front of the building.
   Figure 11.54 shows the behavior under rotation of the MSERs detected in Fig. 11.53. Figure 11.54(a) is the building image rotated 5° in the counterclockwise direction. The image was cropped after rotation to eliminate the resulting black areas (see Fig. 2.41), which would change the nature of the image data and thus influence the results. Figure 11.54(b) is the result of performing the same smoothing as in Fig. 11.53, and Fig. 11.54(c) is the composite MSER detected using the same parameters as in Fig. 11.53(e). As you can see, the composite MSER of the rotated image corresponds quite closely to the MSER in Fig. 11.53(e).
   Finally, Fig. 11.55 shows the behavior of the MSER detector under scale changes. Figure 11.55(a) is the building image scaled to 0.5 of its original dimensions, and Fig. 11.55(b) shows the image smoothed with a correspondingly smaller box kernel of size 3 × 3. Because the image area is now one-fourth the size of the original, we reduced the valid MSER size range to 2,500 to 7,500 pixels [see Fig. 11.55(c)].
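The smoothing, ∆T, and size-filtering strategy used in these examples is easy to experiment with. The sketch below is a hypothetical illustration (not the implementation used to generate the figures), based on OpenCV's MSER detector; OpenCV's algorithm differs in detail from the formulation given earlier, but its delta parameter plays the role of ∆T, and its min_area/max_area parameters implement the size filter. The file name building.tif is an assumption.

```python
# Hedged sketch of MSER detection with smoothing and a size filter,
# using OpenCV (parameter names as in recent OpenCV 4.x releases).
import cv2

img = cv2.imread("building.tif", cv2.IMREAD_GRAYSCALE)  # hypothetical file
img = cv2.blur(img, (5, 5))            # 5 x 5 box-kernel smoothing, as in Fig. 11.53(b)

area = img.shape[0] * img.shape[1]
mser = cv2.MSER_create(delta=10,                    # threshold increment (role of dT)
                       min_area=int(0.03 * area),   # size filter: reject regions < 3% ...
                       max_area=int(0.08 * area))   # ... or > 8% of the image area

bright, _ = mser.detectRegions(img)        # bright regions with darker surroundings
dark, _ = mser.detectRegions(255 - img)    # dark regions, via the image negative
regions = list(bright) + list(dark)        # union of both types of MSERs
```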
[FIGURE 11.54 (a) Building image rotated 5° counterclockwise. (b) Smoothed image using the same kernel as in Fig. 11.53(b). (c) Composite MSER detected using the same parameters we used to obtain Fig. 11.53(e). The MSERs of the original and rotated images are almost identical.]
[FIGURE 11.55 (a) Building image reduced to half-size. (b) Image smoothed with a 3 × 3 box kernel. (c) Composite MSER obtained with the same parameters as Fig. 11.53(e), but using a valid MSER region size range of 2,500 to 7,500 pixels.]

SIFT is an algorithm developed by Lowe [2004] for extracting invariant features from an image. It is called a transform because it transforms image data into scale-invariant coordinates relative to local image features. SIFT is by far the most complex feature detection and description approach we discuss in this chapter.
   As you progress through this section, you will notice the use of a significant number of experimentally determined parameters. Thus, unlike most of the formulations of individual approaches we have discussed thus far, SIFT is strongly heuristic. This is a consequence of the fact that our current knowledge is insufficient to tell us how to solve problems of this complexity in a more principled way.
SCALE SPACE

The first stage of the SIFT algorithm is to find image locations that are invariant to scale change. This is achieved by searching for stable features across all possible scales, using a function of scale known as scale space, a multiscale representation suitable for handling image structures at different scales in a consistent manner. The idea is to have a formalism for handling the fact that objects in unconstrained scenes will appear in different ways, depending on the scale at which images are captured. Because these scales may not be known beforehand, a reasonable approach is to work with all relevant scales simultaneously. Scale space represents an image as a one-parameter family of smoothed images, with the objective of simulating the loss of detail that would occur as the scale of an image decreases. The parameter controlling the smoothing is referred to as the scale parameter.
   In SIFT, Gaussian kernels are used to implement smoothing, so the scale parameter is the standard deviation. The reason for using Gaussian kernels is based on work performed by Lindeberg [1994], who showed that the only smoothing kernel that meets a set of important constraints, such as linearity and shift-invariance, is the Gaussian lowpass kernel. Based on this, the scale space, L(x, y, σ), of a grayscale image, f(x, y),† is produced by convolving f with a variable-scale Gaussian kernel, G(x, y, σ):

$L(x, y, \sigma) = G(x, y, \sigma) \star f(x, y)$    (11-66)

[Margin note: As in Chapter 3, "★" indicates spatial convolution.]

where the scale is controlled by parameter σ, and G is of the form

$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/2\sigma^2}$    (11-67)

The input image f(x, y) is successively convolved with Gaussian kernels having standard deviations σ, kσ, k²σ, k³σ, … to generate a "stack" of Gaussian-filtered (smoothed) images that are separated by a constant factor k, as shown in the lower left of Fig. 11.56.

[FIGURE 11.56 Scale space, showing three octaves of smoothed images. A Gaussian kernel was used for smoothing, so the scale space parameter is σ. The annotations on the right of each octave list the standard deviations used in its Gaussian lowpass kernels (σ₁, kσ₁, k²σ₁, … in octave 1, with σ₂ = 2σ₁ and σ₃ = 2σ₂); the same number of images, with the same powers of k, is generated in each octave.]

   SIFT subdivides scale space into octaves, with each octave corresponding to a doubling of σ, just as an octave in music theory corresponds to doubling the frequency of a sound signal. SIFT further subdivides each octave into an integer number, s, of intervals, so that an interval of 1 consists of two images, an interval of 2 consists of three images, and so forth. It then follows that the value used in the Gaussian kernel that generates the image corresponding to an octave is kˢσ = 2σ, which means that k = 2^(1/s). For example, for s = 2, k = √2, and the input image is successively smoothed using standard deviations of σ, (√2)σ, and (√2)²σ, so that the third image (i.e., the octave image for s = 2) in the sequence is filtered using a Gaussian kernel with standard deviation (√2)²σ = 2σ.

[Footnote: † Experimental results reported by Lowe [2004] suggest that smoothing the original image using a Gaussian kernel with σ = 0.5, and then doubling its size by linear (nearest-neighbor) interpolation, improves the number of stable features detected by SIFT. This preprocessing step is an integral part of the algorithm. Images are assumed to have values in the range [0, 1].]

   The preceding discussion indicates that the number of smoothed images generated in an octave is s + 1. However, as you will see in the next section, the smoothed images in scale space are used to compute differences of Gaussians [see Eq. (10-32)] which, in order to cover a full octave, implies that an additional two images past the octave image are required, giving a total of s + 3 images. Because the octave image is always the (s + 1)th image in the stack (counting from the bottom), it follows that this image is the third image from the top in the expanded sequence of s + 3 images. Each octave in Fig. 11.56 contains five images, indicating that s = 2 was used in this case.
   The first image in the second octave is formed by downsampling the original image (by skipping every other row and column), and then smoothing it using a kernel with twice the standard deviation used in the first octave (i.e., σ₂ = 2σ₁). Subsequent images in that octave are smoothed using σ₂, with the same sequence of values of k as in the first octave (this is denoted by dots in Fig. 11.56). The same basic procedure is then repeated for subsequent octaves. That is, the first image of the new octave is formed by: (1) downsampling the original image enough times to achieve half the size of the image in the previous octave, and (2) smoothing the downsampled image with a new standard deviation that is twice the standard deviation of the previous octave. The rest of the images in the new octave are obtained by smoothing the downsampled image with the new standard deviation multiplied by the same sequence of values of k as before.

[Margin note: Instead of repeatedly downsampling the original image, we can carry the previously downsampled image, and downsample it by 2 to obtain the image required for the next octave.]

   When k = √2, we can obtain the first image of a new octave without having to smooth the downsampled image. This is because, for this value of k, the kernel used to smooth the first image of every octave is the same as the kernel used to smooth the third image from the top of the previous octave.
Thus, the first image of a new octave can be obtained directly by downsampling that third image of the previous octave by 2. The result will be the same (see Problem 11.36). The third image from the top of any octave is called the octave image because the standard deviation used to smooth it is twice (i.e., k² = 2) the value of the standard deviation used to smooth the first image in the octave.
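As a concrete summary of this construction, the following minimal sketch (our own simplified illustration, not an optimized SIFT implementation; the function name build_scale_space and the use of scipy are assumptions) generates s + 3 Gaussian-smoothed images per octave and seeds each new octave by downsampling the previous octave image by 2.

```python
# Sketch of the scale-space construction just described: s + 3 images per
# octave, separated by the constant factor k = 2**(1/s), each new octave
# seeded by the previous octave image downsampled by 2.
import numpy as np
from scipy.ndimage import gaussian_filter

def build_scale_space(f, s=2, sigma1=0.707, num_octaves=3):
    """Return a list of octaves, each a list of s + 3 smoothed images."""
    k = 2.0 ** (1.0 / s)
    octaves = []
    base, sigma = f.astype(float), sigma1
    for _ in range(num_octaves):
        # Successively smooth with sigma, k*sigma, k**2 * sigma, ...
        octave = [gaussian_filter(base, (k ** i) * sigma) for i in range(s + 3)]
        octaves.append(octave)
        # The octave image (third from the top, blurred with k**s * sigma
        # = 2*sigma) is downsampled by 2 to seed the next octave. For
        # simplicity this sketch re-smooths the downsampled image, rather
        # than tracking the blur it already carries as a practical
        # implementation would.
        base = octave[s][::2, ::2]
        sigma *= 2.0
    return octaves
```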
   Figure 11.57 uses grayscale images to further illustrate how scale space is constructed in SIFT. Because each octave is composed of five images, it follows that we are again using s = 2. We chose σ₁ = √2/2 = 0.707 and k = √2 = 1.414 for this example so that the numbers would result in familiar multiples. As in Fig. 11.56, the images going up scale space are blurred by using Gaussian kernels with progressively larger standard deviations, and the first image of the second and subsequent octaves is obtained by downsampling the octave image from the previous octave by 2. As you can see, the images become significantly more blurred (and consequently lose more fine detail) as they go up both in scale as well as in octave.

[FIGURE 11.57 Illustration using images of the first three octaves of scale space in SIFT. The entries in the table are values of the standard deviation used at each scale of each octave. For example, the standard deviation used in scale 2 of octave 1 is kσ₁, which is equal to 1.0. Here, σ₁ = √2/2 = 0.707 and k = √2 = 1.414, with σ₂ = 2σ₁ and σ₃ = 2σ₂ = 4σ₁.

Octave    Scale 1    Scale 2    Scale 3    Scale 4    Scale 5
  1        0.707      1.000      1.414      2.000      2.828
  2        1.414      2.000      2.828      4.000      5.657
  3        2.828      4.000      5.657      8.000     11.314  ]

DETECTING LOCAL EXTREMA

SIFT initially finds the locations of keypoints using the Gaussian-filtered images, then refines the locations and validity of those keypoints using two processing steps.

Finding the Initial Keypoints

Keypoint locations in scale space are found initially by SIFT by detecting extrema in the difference of Gaussians of two adjacent scale-space images in an octave, convolved with the input image that corresponds to that octave. For example, to find keypoint locations related to the first two levels of octave 1 in scale space, we look for extrema in the function

$D(x, y, \sigma) = [G(x, y, k\sigma) - G(x, y, \sigma)] \star f(x, y)$    (11-68)

It follows from Eq. (11-66) that

$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$    (11-69)

In other words, all we have to do to form function D(x, y, σ) is subtract the first two images of octave 1. Recall from the discussion of the Marr-Hildreth edge detector (Section 10.2) that the difference of Gaussians is an approximation to the Laplacian of a Gaussian (LoG). Therefore, Eq. (11-69) is nothing more than an approximation to Eq. (10-30). The key difference is that SIFT looks for extrema in D(x, y, σ), whereas the Marr-Hildreth detector would look for the zero crossings of this function.
   Lindeberg [1994] showed that true scale invariance in scale space requires that the LoG be normalized by σ² (i.e., that σ²∇²G be used). It can be shown (see Problem 11.34) that

$G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\,\sigma^{2} \nabla^{2} G$    (11-70)

Therefore, DoGs already have the necessary scaling "built in." The factor (k − 1) is constant over all scales, so it does not influence the process of locating extrema in scale space. Although Eqs. (11-68) and (11-69) are applicable to the first two images
of octave 1, the same form of these equations is applicable to any two images from any octave, provided that the appropriate downsampled image is used, and the DoG is computed from two adjacent images in the octave.
   Figure 11.58 illustrates the concepts just discussed, using the building image from Fig. 11.57. A total of s + 2 difference functions, D(x, y, σ), are formed in each octave from all adjacent pairs of Gaussian-filtered images in that octave. These difference functions can be viewed as images, and one sample of such an image is shown for each of the three octaves in Fig. 11.58. As you might expect from the results in Fig. 11.57, the level of detail in these images decreases the further up we go in scale space.

[FIGURE 11.58 How Eq. (11-69) is implemented in scale space. There are s + 3 L(x, y, σ) images and s + 2 corresponding D(x, y, σ) images in each octave.]

   Figure 11.59 shows the procedure used by SIFT to find extrema in a D(x, y, σ) image. At each location (shown in black) in a D(x, y, σ) image, the value of the pixel at that location is compared to the values of its eight neighbors in the current image and its nine neighbors in the images above and below. The point is selected as an extremum (maximum or minimum) point if its value is larger than the values of all its neighbors, or smaller than all of them. No extrema can be detected in the first (last) scale of an octave because it has no lower (upper) scale image of the same size.

[FIGURE 11.59 Extrema (maxima or minima) of the D(x, y, σ) images in an octave are detected by comparing a pixel (shown in black) to its 26 neighbors (shown shaded) in 3 × 3 regions at the current and adjacent scale images.]
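A brute-force sketch of this detection step follows (a simplified illustration under the conventions above, not the book's implementation; it assumes the octave lists produced by the earlier build_scale_space sketch). It forms the s + 2 difference images of Eq. (11-69) and applies the 26-neighbor test of Fig. 11.59.

```python
# Sketch of DoG formation (Eq. 11-69) and extremum detection (Fig. 11.59).
import numpy as np

def detect_extrema(octave):
    """octave: list of s + 3 smoothed images -> list of (scale, row, col)."""
    # Stack the s + 2 difference-of-Gaussian images of this octave.
    D = np.stack([octave[i + 1] - octave[i] for i in range(len(octave) - 1)])
    keypoints = []
    # Skip the first and last D images: they lack a lower/upper scale neighbor.
    for m in range(1, D.shape[0] - 1):
        for i in range(1, D.shape[1] - 1):
            for j in range(1, D.shape[2] - 1):
                cube = D[m - 1:m + 2, i - 1:i + 2, j - 1:j + 2]
                neighbors = np.delete(cube.flatten(), 13)  # the 26 neighbors
                v = D[m, i, j]
                # Keep the point only if it is a strict maximum or minimum.
                if (v > neighbors).all() or (v < neighbors).all():
                    keypoints.append((m, i, j))
    return keypoints
```

The triple loop is written for clarity, not speed; a practical implementation would vectorize the comparison.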
Improving the Accuracy of Keypoint Locations

When a continuous function is sampled, its true maximum or minimum may actually be located between sample points. The usual approach used to get closer to the true extremum (to achieve subpixel accuracy) is to fit an interpolating function at each extremum point found in the digital function, then look for an improved extremum location in the interpolated function. SIFT uses the linear and quadratic terms of a Taylor series expansion of D(x, y, σ), shifted so that the origin is located at the sample point being examined. In vector form, the expression is

$D(\mathbf{x}) = D + \left(\frac{\partial D}{\partial \mathbf{x}}\right)^{\!T}\mathbf{x} + \frac{1}{2}\,\mathbf{x}^{T}\frac{\partial}{\partial \mathbf{x}}\!\left(\frac{\partial D}{\partial \mathbf{x}}\right)\mathbf{x} = D + (\nabla D)^{T}\mathbf{x} + \frac{1}{2}\,\mathbf{x}^{T}\mathbf{H}\,\mathbf{x}$    (11-71)

where D and its derivatives are evaluated at the sample point, $\mathbf{x} = (x, y, \sigma)^{T}$ is the offset from that point, ∇ is the familiar gradient operator,

$\nabla D = \frac{\partial D}{\partial \mathbf{x}} = \begin{bmatrix} \partial D/\partial x \\ \partial D/\partial y \\ \partial D/\partial \sigma \end{bmatrix}$    (11-72)

and H is the Hessian matrix

$\mathbf{H} = \begin{bmatrix} \partial^{2} D/\partial x^{2} & \partial^{2} D/\partial x\,\partial y & \partial^{2} D/\partial x\,\partial \sigma \\ \partial^{2} D/\partial y\,\partial x & \partial^{2} D/\partial y^{2} & \partial^{2} D/\partial y\,\partial \sigma \\ \partial^{2} D/\partial \sigma\,\partial x & \partial^{2} D/\partial \sigma\,\partial y & \partial^{2} D/\partial \sigma^{2} \end{bmatrix}$    (11-73)

The location of the extremum, x̂, is found by taking the derivative of Eq. (11-71) with respect to x and setting it to zero, which gives us (see Problem 11.37):

$\hat{\mathbf{x}} = -\mathbf{H}^{-1}\,\nabla D$    (11-74)

[Margin note: Because D and its derivatives are evaluated at the sample point, they are constants with respect to x.]
   The Hessian and gradient of D are approximated using differences of neighboring points, as we did in Section 10.2. The resulting 3 × 3 system of linear equations is easily solved computationally. If the offset x̂ is greater than 0.5 in any of its three dimensions, we conclude that the extremum lies closer to another sample point, in which case the sample point is changed and the interpolation is performed about that point instead. The final offset x̂ is added to the location of its sample point to obtain the interpolated estimate of the location of the extremum.
   The function value at the extremum, D(x̂), is used by SIFT for rejecting unstable extrema with low contrast, where D(x̂) is obtained by substituting Eq. (11-74) into Eq. (11-71), giving (see Problem 11.37):

$D(\hat{\mathbf{x}}) = D + \frac{1}{2}\,(\nabla D)^{T}\,\hat{\mathbf{x}}$    (11-75)

In the experimental results reported by Lowe [2004], any extremum for which |D(x̂)| was less than 0.03 was rejected, based on all image values being in the range [0, 1]. This eliminates keypoints that have low contrast and/or are poorly localized.
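The refinement step is compact enough to sketch directly. The helper below is an illustration of Eqs. (11-72) through (11-75) under the finite-difference approximation just mentioned (the function name and unit sample spacing in all three variables are assumptions; D is the stacked array of DoG images from the previous sketch).

```python
# Sketch of subpixel keypoint refinement: finite-difference gradient and
# Hessian of D at a detected extremum (m, i, j), then Eqs. (11-74)/(11-75).
import numpy as np

def refine_keypoint(D, m, i, j):
    """D: 3-D array of DoG images (scale, row, col). Returns (x_hat, D_hat)."""
    # Gradient by central differences, ordered (x = col, y = row, sigma = scale)
    # to match Eq. (11-72).
    g = 0.5 * np.array([D[m, i, j + 1] - D[m, i, j - 1],
                        D[m, i + 1, j] - D[m, i - 1, j],
                        D[m + 1, i, j] - D[m - 1, i, j]])
    # Second derivatives and mixed terms for the 3 x 3 Hessian of Eq. (11-73).
    Dxx = D[m, i, j + 1] - 2 * D[m, i, j] + D[m, i, j - 1]
    Dyy = D[m, i + 1, j] - 2 * D[m, i, j] + D[m, i - 1, j]
    Dss = D[m + 1, i, j] - 2 * D[m, i, j] + D[m - 1, i, j]
    Dxy = 0.25 * (D[m, i + 1, j + 1] - D[m, i + 1, j - 1]
                  - D[m, i - 1, j + 1] + D[m, i - 1, j - 1])
    Dxs = 0.25 * (D[m + 1, i, j + 1] - D[m + 1, i, j - 1]
                  - D[m - 1, i, j + 1] + D[m - 1, i, j - 1])
    Dys = 0.25 * (D[m + 1, i + 1, j] - D[m + 1, i - 1, j]
                  - D[m - 1, i + 1, j] + D[m - 1, i - 1, j])
    H = np.array([[Dxx, Dxy, Dxs],
                  [Dxy, Dyy, Dys],
                  [Dxs, Dys, Dss]])
    x_hat = -np.linalg.solve(H, g)        # Eq. (11-74)
    D_hat = D[m, i, j] + 0.5 * g @ x_hat  # Eq. (11-75)
    # Caller rejects the point if abs(D_hat) < 0.03, or re-interpolates about
    # a neighboring sample if any component of x_hat exceeds 0.5.
    return x_hat, D_hat
```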
Eliminating Edge Responses

Recall from Section 10.2 that using a difference of Gaussians yields edges in an image. But keypoints of interest in SIFT are "corner-like" features, which are significantly more localized. Thus, intensity transitions caused by edges are eliminated. To quantify the difference between edges and corners, we can look at local curvature. An edge is characterized by high curvature in one direction, and low curvature in the orthogonal direction. Curvature at a point in an image can be estimated from the 2 × 2 Hessian matrix evaluated at that point. Thus, to estimate local curvature of the DoG at any level in scale space, we compute the Hessian matrix of D at that level:

$\mathbf{H} = \begin{bmatrix} \partial^{2} D/\partial x^{2} & \partial^{2} D/\partial x\,\partial y \\ \partial^{2} D/\partial y\,\partial x & \partial^{2} D/\partial y^{2} \end{bmatrix} = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{yx} & D_{yy} \end{bmatrix}$    (11-76)

[Margin note: If you display an image as a topographic map (see Fig. 2.18), edges will appear as ridges that have low curvature along the ridge and high curvature perpendicular to it.]

where the form on the right uses the same notation as the A term [Eq. (11-61)] of the Harris matrix (but note that the main diagonals are different). The eigenvalues of H are proportional to the curvatures of D. As we explained in connection with the Harris-Stephens corner detector, we can avoid direct computation of the eigenvalues by formulating tests based on the trace and determinant of H, which are equal to the sum and product of the eigenvalues, respectively. To use notation different from the HS discussion, let a and b be the eigenvalues of H with the largest and smallest magnitude, respectively. Using the relationship between the eigenvalues of H and its trace and determinant, we have (remember, H is symmetric and of size 2 × 2):

$\mathrm{Tr}(\mathbf{H}) = D_{xx} + D_{yy} = a + b \qquad \mathrm{Det}(\mathbf{H}) = D_{xx}D_{yy} - D_{xy}^{2} = ab$    (11-77)

[Margin note: As with the HS corner detector, the advantage of this formulation is that the trace and determinant of a 2 × 2 matrix H are easy to compute. See the margin note next to Eq. (11-63).]

If the determinant is negative, the curvatures have different signs and the keypoint in question cannot be an extremum, so it is discarded.
   Let r denote the ratio of the largest to the smallest eigenvalue. Then a = rb and

$\frac{[\mathrm{Tr}(\mathbf{H})]^{2}}{\mathrm{Det}(\mathbf{H})} = \frac{(a + b)^{2}}{ab} = \frac{(rb + b)^{2}}{rb^{2}} = \frac{(r + 1)^{2}}{r}$    (11-78)

which depends on the ratio of the eigenvalues, rather than their individual values. The minimum of (r + 1)²/r occurs when the eigenvalues are equal, and it increases with r. Therefore, to check that the ratio of principal curvatures is below some threshold, r, we only need to check

$\frac{[\mathrm{Tr}(\mathbf{H})]^{2}}{\mathrm{Det}(\mathbf{H})} < \frac{(r + 1)^{2}}{r}$    (11-79)

which is a simple computation. In the experimental results reported by Lowe [2004], a value of r = 10 was used, meaning that keypoints with ratios of curvature greater than 10 were eliminated.
   Figure 11.60 shows the SIFT keypoints detected in the building image using the approach discussed in this section. Keypoints for which |D(x̂)| in Eq. (11-75) was less than 0.03 were rejected, as were keypoints that failed to satisfy Eq. (11-79) with r = 10.

[FIGURE 11.60 SIFT keypoints detected in the building image. The points were enlarged slightly to make them easier to see.]
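The whole test reduces to a few finite differences. The sketch below illustrates Eqs. (11-76) through (11-79) on a single DoG image (the helper name and unit sample spacing are our own assumptions).

```python
# Sketch of the edge-response test: 2 x 2 Hessian of a DoG image by finite
# differences at (i, j), then the curvature-ratio check with threshold r.
import numpy as np

def passes_edge_test(D, i, j, r=10.0):
    """D: a single DoG image; (i, j): keypoint location. True if kept."""
    Dxx = D[i, j + 1] - 2 * D[i, j] + D[i, j - 1]
    Dyy = D[i + 1, j] - 2 * D[i, j] + D[i - 1, j]
    Dxy = 0.25 * (D[i + 1, j + 1] - D[i + 1, j - 1]
                  - D[i - 1, j + 1] + D[i - 1, j - 1])
    tr = Dxx + Dyy                  # sum of the eigenvalues, Eq. (11-77)
    det = Dxx * Dyy - Dxy ** 2      # product of the eigenvalues
    if det <= 0:
        return False                # curvatures of different signs: discard
    return tr ** 2 / det < (r + 1) ** 2 / r     # Eq. (11-79)
```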
KEYPOINT ORIENTATION

At this point in the process, we have computed keypoints that SIFT considers stable. Because we know the location of each keypoint in scale space, we have achieved scale independence. The next step is to assign a consistent orientation to each keypoint based on local image properties. This allows us to represent a keypoint relative to its orientation and thus achieve invariance to image rotation. SIFT uses a straightforward approach for this. The scale of the keypoint is used to select the Gaussian smoothed image, L, that is closest to that scale. In this way, all orientation computations are performed in a scale-invariant manner. Then, for each image sample, L(x, y), at this scale, we compute the gradient magnitude, M(x, y), and orientation angle, θ(x, y), using pixel differences:

$M(x, y) = \left[\left(L(x+1, y) - L(x-1, y)\right)^{2} + \left(L(x, y+1) - L(x, y-1)\right)^{2}\right]^{1/2}$    (11-80)

and

$\theta(x, y) = \tan^{-1}\!\left[\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right]$    (11-81)

[Margin note: See Section 10.2 regarding computation of the gradient magnitude and angle.]
                                       A histogram of orientations is formed from the gradient orientations of sample                                                                              changes in scale, orientation, illumination, and image viewpoint. The idea is to be
                                       points in a neighborhood of each keypoint. The histogram has 36 bins covering the                                                                           able to use these descriptors to identify matches (similarities) between local regions
                                       360° range of orientations on the image plane. Each sample added to the histogram                                                                           in two or more images.
                                       is weighed by its gradient magnitude, and by a circular Gaussian function with a stan-                                                                         The approach used by SIFT to compute descriptors is based on experimental
                                       dard deviation 1.5 times the scale of the keypoint.                                                                                                         results suggesting that local image gradients appear to perform a function similar
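The following NumPy sketch shows Eqs. (11-80) and (11-81) together with the 36-bin magnitude- and Gaussian-weighted histogram just described, for a single keypoint. The function name, the neighborhood radius, and the assumption that (x, y) lies in the interior of L are ours.

```python
import numpy as np

def orientation_histogram(L, y, x, scale, radius=8):
    """Orientation histogram of Eqs. (11-80) and (11-81): 36 bins over
    360 degrees, each sample weighted by its gradient magnitude and by
    a circular Gaussian of standard deviation 1.5 * scale."""
    hist = np.zeros(36)
    sigma = 1.5 * scale
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            dx = L[j, i + 1] - L[j, i - 1]              # pixel differences
            dy = L[j + 1, i] - L[j - 1, i]
            m = np.hypot(dx, dy)                        # Eq. (11-80)
            theta = np.degrees(np.arctan2(dy, dx)) % 360.0  # Eq. (11-81)
            w = np.exp(-((j - y)**2 + (i - x)**2) / (2 * sigma**2))
            hist[int(theta // 10) % 36] += w * m        # 10 degrees per bin
    return hist
```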
Peaks in the histogram correspond to dominant directions of the local gradients. The highest peak in the histogram is detected, and any other local peak that is within 80% of the highest peak is used also to create a keypoint with that orientation. Thus, for locations with multiple peaks of similar magnitude, there will be multiple keypoints created at the same location and scale, but with different orientations. SIFT assigns multiple orientations to only about 15% of keypoints, but these contribute significantly to image matching (to be discussed later and in Chapter 12). Finally, a parabola is fit to the three histogram values closest to each peak to interpolate the peak position for better accuracy.
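The book does not spell out the parabola fit, but a common three-point form (our choice, stated here as an assumption) is the following.

```python
def interpolate_peak(hist, p):
    """Refine the position of histogram peak p by fitting a parabola to
    the peak value and its two neighbors (a standard three-point fit)."""
    n = len(hist)
    left, center, right = hist[(p - 1) % n], hist[p], hist[(p + 1) % n]
    denom = left - 2 * center + right
    offset = 0.0 if denom == 0 else 0.5 * (left - right) / denom
    # Refined orientation in degrees (bin-center convention omitted
    # for simplicity in this sketch).
    return (p + offset) * (360.0 / n)
```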
Figure 11.61 shows the same keypoints as Fig. 11.60 superimposed on the original image, with keypoint orientations shown as arrows. Note the consistency of orientation of similar sets of keypoints in the image. For example, observe the keypoints on the right, vertical corner of the building. The lengths of the arrows vary, depending on illumination and image content, but their direction is unmistakably consistent. Plots of keypoint orientations generally are quite cluttered and are not intended for general human interpretation. The value of keypoint orientation is in image matching, as we will illustrate later in our discussion.

FIGURE 11.61 The keypoints from Fig. 11.60 superimposed on the original image. The arrows indicate keypoint orientations.

KEYPOINT DESCRIPTORS

The procedures discussed up to this point are used for assigning an image location, scale, and orientation to each keypoint, thus providing invariance to these three variables. The next step is to compute a descriptor for a local region around each keypoint that is highly distinctive, but is at the same time as invariant as possible to changes in scale, orientation, illumination, and image viewpoint. The idea is to be able to use these descriptors to identify matches (similarities) between local regions in two or more images.

The approach used by SIFT to compute descriptors is based on experimental results suggesting that local image gradients appear to perform a function similar to what human vision does for matching and recognizing 3-D objects from different viewpoints (Lowe [2004]). Figure 11.62 summarizes the procedure used by SIFT to generate the descriptors associated with each keypoint.

FIGURE 11.62 Approach used to compute a keypoint descriptor. (The diagram shows the gradients in a 16 × 16 region around a keypoint, the circular Gaussian weighting function, and the resulting descriptor histograms.)

A region of size 16 × 16
pixels is centered on a keypoint, and the gradient magnitude and direction are computed at each point in the region using pixel differences. These are shown as randomly oriented arrows in the upper-left of the figure. A Gaussian weighting function with standard deviation equal to one-half the size of the region is then used to assign a weight that multiplies the magnitude of the gradient at each point. The Gaussian weighting function is shown as a circle in the figure, but it is understood that it is a bell-shaped surface whose values (weights) decrease as a function of distance from the center. The purpose of this function is to reduce sudden changes in the descriptor with small changes in the position of the function.
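A minimal sketch of this weighting surface for the 16 × 16 descriptor region, with standard deviation equal to one-half the region size, is shown below; the function name is ours.

```python
import numpy as np

def gaussian_weights(size=16):
    """Circular Gaussian weighting surface for a size x size descriptor
    region, with standard deviation equal to one-half the region size."""
    sigma = size / 2.0
    c = (size - 1) / 2.0                  # center of the region
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - c)**2 + (y - c)**2) / (2 * sigma**2))
```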
Because there is one gradient computation for each point in the region surrounding a keypoint, there are (16)² gradient directions to process for each keypoint, 16 in each 4 × 4 subregion. The top-rightmost subregion is shown zoomed in the figure to simplify the explanation of the next step, which consists of quantizing all gradient orientations in the 4 × 4 subregion into eight possible directions differing by 45°. Rather than assigning a directional value as a full count to the bin to which it is closest, SIFT performs an interpolation that distributes a histogram entry among all bins proportionally, depending on the distance from that value to the center of each bin. This is done by multiplying each entry into a bin by a weight of 1 − d, where d is the shortest distance from the value to the center of the bin, measured in units of the histogram spacing, so that the maximum possible distance is 1. For example, the center of the first bin is at 45°/2 = 22.5°, the next center is at 22.5° + 45° = 67.5°, and so on. Suppose that a particular directional value is 22.5°. The distance from that value to the center of the first histogram bin is 0, so we would assign a full entry (i.e., a count of 1) to that bin in the histogram. The distance to the next center would be greater than 0, so we would assign a fraction of a full entry, that is, 1 × (1 − d), to that bin, and so forth for all bins. In this way, every bin gets a proportional fraction of a count, thus avoiding "boundary" effects in which a descriptor changes abruptly as a small change in orientation causes it to be assigned from one bin to another.
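As a concrete illustration of the 1 − d weighting just described, here is a minimal Python sketch that distributes one gradient sample between the two nearest 45° bins; the function name and the wraparound handling are ours, not from the book.

```python
import math

def distribute_to_bins(theta, magnitude, hist):
    """Distribute one gradient sample (orientation theta, in degrees)
    between the two nearest of the eight 45-degree bins, weighting each
    entry by 1 - d, where d is the distance from the sample to the bin
    center, measured in units of the histogram spacing."""
    width = 45.0                          # bin centers at 22.5, 67.5, ...
    pos = (theta - width / 2.0) / width   # sample position, in bin units
    lo = math.floor(pos)
    d = pos - lo                          # distance to the lower bin center
    hist[lo % 8] += magnitude * (1.0 - d)
    hist[(lo + 1) % 8] += magnitude * d   # the two weights sum to 1
    return hist
```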
Figure 11.62 shows the eight directions of a histogram as a small cluster of vectors, with the length of each vector being equal to the value of its corresponding bin. Sixteen histograms are computed, one for each 4 × 4 subregion of the 16 × 16 region surrounding a keypoint. A descriptor, shown on the lower left of the figure, then consists of a 4 × 4 array, each element of which contains eight directional values. In SIFT, this descriptor data is organized as a 128-dimensional vector.
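Putting the pieces together, the following sketch (reusing distribute_to_bins from above; the array and function names are ours) assembles the 4 × 4 × 8 = 128 descriptor components from 16 × 16 arrays of gradient magnitudes and orientations.

```python
import numpy as np

def assemble_descriptor(mag, theta, weights):
    """Build the 128-dimensional descriptor from 16 x 16 arrays of
    gradient magnitudes (mag) and orientations in degrees (theta),
    with the Gaussian weights applied to the magnitudes. Each 4 x 4
    subregion contributes one eight-bin orientation histogram."""
    mag = mag * weights                   # Gaussian-weighted magnitudes
    desc = np.zeros((4, 4, 8))
    for j in range(16):
        for i in range(16):
            hist = desc[j // 4, i // 4]   # histogram of this subregion
            distribute_to_bins(theta[j, i], mag[j, i], hist)
    return desc.ravel()                   # 4 * 4 * 8 = 128 components
```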
In order to achieve orientation invariance, the coordinates of the descriptor and the gradient orientations are rotated relative to the keypoint orientation. In order to reduce the effects of illumination, a feature vector is normalized in two stages. First, the vector is normalized to unit length by dividing each component by the vector norm. A change in image contrast resulting from each pixel value being multiplied by a constant will multiply the gradients by the same constant, so the change in contrast will be cancelled by the first normalization. A brightness change caused by a constant being added to each pixel will not affect the gradient values because they are computed from pixel differences. Therefore, the descriptor is invariant to affine changes in illumination. However, nonlinear illumination changes resulting, for example, from camera saturation, can also occur. These types of changes can cause large variations in the relative magnitudes of some of the gradients, but they are less likely to affect gradient orientation. SIFT reduces the influence of large gradient magnitudes by thresholding the values of the normalized feature vector so that all components are below the experimentally determined value of 0.2. After thresholding, the feature vector is renormalized to unit length.
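The two-stage normalization just described translates directly into a few lines of NumPy; the small epsilon guarding against division by zero is our addition.

```python
import numpy as np

def normalize_descriptor(v, t=0.2):
    """Two-stage normalization: scale to unit length (cancels contrast
    changes), clip components at the experimentally determined value 0.2
    (damps large gradient magnitudes caused by nonlinear illumination
    changes), then renormalize to unit length."""
    v = v / (np.linalg.norm(v) + 1e-12)
    v = np.minimum(v, t)
    return v / (np.linalg.norm(v) + 1e-12)
```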
SUMMARY OF THE SIFT ALGORITHM

As the material in the preceding sections shows, SIFT is a complex procedure consisting of many parts and empirically determined constants. The following is a step-by-step summary of the method. (As indicated at the beginning of this section, smoothing and doubling the size of the input image is assumed, and input images are assumed to have values in the range [0, 1].)

1. Construct the scale space. This is done using the procedure outlined in Figs. 11.56 and 11.57. The parameters that need to be specified are σ, s (k is computed from s), and the number of octaves. Suggested values are σ = 1.6, s = 2, and three octaves.
2. Obtain the initial keypoints. Compute the difference of Gaussians, D(x, y, σ), from the smoothed images in scale space, as explained in Fig. 11.58 and Eq. (11-69). Find the extrema in each D(x, y, σ) image using the method explained in Fig. 11.59. These are the initial keypoints.
3. Improve the accuracy of the location of the keypoints. Interpolate the values of D(x, y, σ) via a Taylor expansion. The improved keypoint locations are given by Eq. (11-74).
4. Delete unsuitable keypoints. Eliminate keypoints that have low contrast and/or are poorly localized. This is done by evaluating D from Step 3 at the improved locations, using Eq. (11-75). All keypoints whose values of D are lower than a threshold are deleted. A suggested threshold value is 0.03. Keypoints associated with edges are deleted also, using Eq. (11-79). A value of 10 is suggested for r.
5. Compute keypoint orientations. Use Eqs. (11-80) and (11-81) to compute the magnitude and orientation of each keypoint using the histogram-based procedure discussed in connection with these equations.
6. Compute keypoint descriptors. Use the method summarized in Fig. 11.62 to compute a feature (descriptor) vector for each keypoint. If a region of size 16 × 16 around each keypoint is used, the result will be a 128-dimensional feature vector for each keypoint.

A brief programming sketch showing these steps in use follows.
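For readers who want to experiment with the full pipeline, OpenCV ships an implementation of SIFT. The sketch below, with hypothetical file names, detects keypoints and 128-dimensional descriptors in two images and keeps matches that pass the ratio test suggested by Lowe [2004]; parameter choices such as the 0.75 ratio are conventional defaults, not values from this book.

```python
import cv2

# Hypothetical file names; any pair of grayscale images will do.
img1 = cv2.imread('building.png', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('corner.png', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                 # available in OpenCV >= 4.4
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match the 128-D descriptors and keep those passing Lowe's ratio test.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(len(kp1), len(kp2), len(good))     # keypoint and match counts
```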
The following example illustrates the power of this algorithm.

EXAMPLE 11.20: Using SIFT for image matching.

We illustrate the performance of the SIFT algorithm by using it to find the number of matches between an image of a building and a subimage formed by extracting part of the right corner edge of the building. We also show results for rotated and scaled-down versions of the image and subimage. This type of process can be used in applications such as finding correspondences between two images for the purpose of image registration, and for finding instances of an image in a database of images.

Figure 11.63(a) shows the keypoints for the building image (this is the same as Fig. 11.61), and the keypoints for the subimage, which is a separate, much smaller image. The keypoints were computed
using SIFT independently for each image. SIFT found 643 keypoints in the building image and 54 in the subimage. Figure 11.63(b) shows the matches found by SIFT between the image and subimage; 36 keypoint matches were found and, as the figure shows, only three were incorrect. Considering the large number of initial keypoints, you can see that keypoint descriptors offer a high degree of accuracy for establishing correspondences between images.

FIGURE 11.63 (a) Keypoints and their directions (shown as gray arrows) for the building image and for a section of the right corner of the building. The subimage is a separate image and was processed as such. (b) Corresponding keypoints between the building and the subimage (the straight lines shown connect pairs of matching points). Only three of the 36 matches found are incorrect.
Figure 11.64(a) shows keypoints for the building image after it was rotated by 5° counterclockwise, and for a subimage extracted from its right corner edge. The rotated image is smaller than the original because it was cropped to eliminate the constant areas created by rotation (see Fig. 2.41). Here, SIFT found 547 keypoints for the building and 49 for the subimage. A total of 26 matches were found and, as Fig. 11.64(b) shows, only two were incorrect.

FIGURE 11.64 (a) Keypoints for the rotated (by 5°) building image and for a section of the right corner of the building. The subimage is a separate image and was processed as such. (b) Corresponding keypoints between the corner and the building. Of the 26 matches found, only two are in error.

Figure 11.65 shows the results obtained using SIFT on an image of the building reduced to half the size in both spatial directions. When SIFT was applied to the downsampled image and a corresponding subimage, no matches were found. This was remedied by brightening the reduced image slightly by manipulating the intensity gamma; the subimage was then extracted from the brightened image. Although SIFT can handle some degree of intensity change, this result indicates that performance can be improved by enhancing the contrast of an image prior to processing. When working with a database of images, histogram specification (see Chapter 3) is an excellent tool for normalizing the intensity of all images using the characteristics of the image being queried. SIFT found 195 keypoints for the half-size image and 24 keypoints for the corresponding subimage. A total of seven matches were found between the two images, of which only one was incorrect.

FIGURE 11.65 (a) Keypoints for the half-sized building and a section of the right corner. (b) Corresponding keypoints between the corner and the building. Of the seven matches found, only one is in error.
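The gamma manipulation mentioned above might look like the following for an image scaled to the range [0, 1]; the specific gamma value is illustrative, as the example does not state the value used.

```python
import numpy as np

def adjust_gamma(image, gamma=0.8):
    """Brighten an image with values in [0, 1] by gamma correction;
    gamma < 1 brightens midtones (illustrative value, not the book's)."""
    return np.power(image, gamma)
```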
The preceding two figures illustrate the insensitivity of SIFT to rotation and scale changes, but they are not ideal tests, because the reason for seeking insensitivity to these variables in the first place is that we do not always know a priori when images have been acquired under different conditions and geometrical arrangements. A more practical test is to compute features for a prototype image and test them against unknown samples. Figure 11.66 shows the results of such tests. Figure 11.66(a) is the original building image, for which SIFT feature vectors were already computed (see Fig. 11.63). SIFT was used to compare the rotated subimage from Fig. 11.64(a) against the original, unrotated image. As Fig. 11.66(a) shows, 10 matches were found, of which two were incorrect. These are excellent results, considering the relatively small size of the subimage and the fact that it was rotated. Figure 11.66(b) shows the results of matching the half-sized subimage against the original image. Eleven matches were found,
                                                                                                                                                        Our discussion of moment-invariants is based on Hu [1962]. For generating moments of arbitrary order, see Flusser
                                                                                                                                                        [2000].
                                                                                                                                                            Hotelling [1933] was the first to derive and publish the approach that transforms discrete variables into uncor-
                                                                                                                                                        related coefficients (Section 11.5). He referred to this technique as the method of principal components. His paper
                                                                                                                                                        gives considerable insight into the method and is worth reading. Principal components are still used widely in
                                                                                                                                                        numerous fields, including image processing, as evidenced by Xiang et al. [2016]. The corner detector in Section 11.6
                                                                                                                                                        is from Harris and Stephens [1988], and our discussion of MSERs is based on Matas et al. [2002]. The SIFT material
                                                                                                                                                        in Section 11.7 is from Lowe [2004]. For details on the software aspects of many of the examples in this chapter, see
                                                                                                                                                        Gonzalez, Woods, and Eddins [2009].
                                                                                                                                                        Problems
                                                                                                                                                        Solutions to the problems marked with an asterisk (*) are in the DIP4E Student Support Package (consult the book
                                                                                                                                                        website: www.ImageProcessingPlace.com).
curve under consideration. Assuming also that the angle accuracy is high enough that it may be considered infinite for your purposes, answer the following:
      (a)* What is the tortuosity of a square boundary of size d × d?
      (b)* What is the tortuosity of a circle of radius r?
      (c) What is the tortuosity of a closed convex curve?
11.8* Advance an argument that explains why the uppermost-leftmost point of a digital closed curve has the property that a polygonal approximation to the curve has a convex vertex at that point.
11.9 With reference to Example 11.2, start with vertex V7 and apply the MPP algorithm through, and including, V11.
11.10 Do the following:
      (a)* Explain why the rubber-band polygonal approximation approach discussed in Section 11.2 yields a polygon with minimum perimeter for a convex curve.
      (b) Show that if each cell corresponds to a pixel on the boundary, the maximum possible error in that cell is 2d, where d is the minimum possible horizontal or vertical distance between adjacent pixels (i.e., the distance between lines in the sampling grid used to produce the digital image).
11.11 Explain how the MPP algorithm in Section 11.2 behaves under the following conditions:
      (a)* One-pixel wide, one-pixel deep indentations.
      (b)* One-pixel wide, two-or-more pixel deep indentations.
      (c) One-pixel wide, n-pixel long protrusions.
11.12 Do the following:
      (a)* Plot the signature of a square boundary using the tangent-angle method discussed in Section 11.2.
      (b) Repeat (a) for the slope density function. Assume that the square is aligned with the x- and y-axes, and let the x-axis be the reference line. Start at the corner closest to the origin.
11.13 Find an expression for the signature of each of the following boundaries, and plot the signatures:
      (a)* An equilateral triangle.
      (b) A rectangle.
      (c) An ellipse.
11.14 Do the following:
      (a)* With reference to Figs. 11.11(c) and (f), give a word description of an algorithm for counting the peaks in the two waveforms. Such an algorithm would allow us to differentiate between triangles and rectangles.
      (b) How can you make your solution independent of scale changes? You may assume that the scale changes are the same in both directions.
11.15 Draw the medial axis of:
      (a)* A circle.
      (b)* A square.
      (c) An equilateral triangle.
11.16 For the figure shown,
      (a)* What is the order of the shape number?
      (b) Obtain the shape number.
11.17* The procedure discussed in Section 11.3 for using Fourier descriptors consists of expressing the coordinates of a contour as complex numbers, taking the DFT of these numbers, and keeping only a few components of the DFT as descriptors of the boundary shape. The inverse DFT is then an approximation to the original contour. What class of contour shapes would have a DFT consisting of real numbers, and how would the axis system in Fig. 11.18 have to be set up to obtain those real numbers?
11.18 Show that if you use only two Fourier descriptors (u = 0 and u = 1) to reconstruct a boundary with Eq. (11-10), the result will always be a circle. (Hint: Use the parametric representation of a circle in the complex plane, and express the equation of a circle in polar coordinates.)
11.19* Give the smallest number of statistical moment descriptors needed to differentiate between the signatures of the figures in Fig. 11.10.
11.20 Give two boundary shapes that have the same mean and third statistical moment descriptors, but different second moments.
11.21* Propose a set of descriptors capable of differentiating between the shapes of the characters 0, 1, 8, 9, and X. (Hint: Use topological descriptors in conjunction with the convex hull.)
11.22 Consider a binary image of size 200 × 200 pixels, with a vertical black band extending from columns 1 to 99 and a vertical white band extending from columns 100 to 200.
      (a) Obtain the co-occurrence matrix of this image using the position operator "one pixel to the right."
      (b)* Normalize this matrix so that its elements become probability estimates, as explained in Section 11.4.
      (c) Use your matrix from (b) to compute the six descriptors in Table 11.3.
11.23 Consider a checkerboard image composed of alternating black and white squares, each of size m × m pixels. Give a position operator that will yield a diagonal co-occurrence matrix.
11.24 Obtain the gray-level co-occurrence matrix of an array pattern of alternating single 0's and 1's (starting with 0) if:
      (a)* The position operator Q is defined as "one pixel to the right."
      (b) The position operator Q is defined as "two pixels to the right."
11.25 Do the following:
      (a)* Prove the validity of Eqs. (11-50) and (11-51).
      (b) Prove the validity of Eq. (11-52).
11.26* We mentioned in Example 11.16 that a credible job could be done of reconstructing approximations to the six original images by using only the two principal-component images associated with the largest eigenvalues. What would be the mean squared error incurred in doing so? Express your answer as a percentage of the maximum possible error.
11.27 For a set of images of size 64 × 64, assume that the covariance matrix given in Eq. (11-52) is the identity matrix. What would be the mean squared error between the original images and images reconstructed using Eq. (11-54) with only half of the original eigenvectors?
11.28 Under what conditions would you expect the major axes of a boundary, defined in the discussion of Eq. (11-4), to be equal to the eigen axes of that boundary?
11.29* You are contracted to design an image processing system for detecting imperfections on the inside of certain solid plastic wafers. The wafers are examined using an X-ray imaging system, which yields 8-bit images of size 512 × 512. In the absence of imperfections, the images appear uniform, having a mean intensity of 100 and variance of 400. The imperfections appear as blob-like regions in which about 70% of the pixels have excursions in intensity of 50 intensity levels or less about a mean of 100. A wafer is considered defective if such a region occupies an area exceeding 20 × 20 pixels in size. Propose a system based on texture analysis for solving this problem.
11.30 With reference to Fig. 11.46, answer the following:
      (a)* What is the cause of the nearly identical clusters near the origin in Figs. 11.46(d)-(f)?
      (b) Look carefully, and you will see a single point near coordinates (0.8, 0.8) in Fig. 11.46(f). What caused this point?
      (c) The results in Figs. 11.46(d)-(e) are for the small image patches shown in Figs. 11.46(a)-(b). What would the results look like if we performed the computations over the entire image, instead of limiting the computation to the patches?
11.31 When we discussed the Harris-Stephens corner detector, we mentioned that there is a closed-form formula for computing the eigenvalues of a 2 × 2 matrix.
      (a)* Given a matrix M = [a b; c d], give the general formula for finding its eigenvalues. Express your formula in terms of the trace and determinant of M.
      (b) Give the formula for symmetric matrices of size 2 × 2 in terms of its four elements, without using the trace or the determinant.
11.32* With reference to the component tree in Fig. 11.51, assume that any pixels extending past the border of the small image are 0. Is region R1 an extremal region? Explain.
11.33 With reference to the discussion of maximally stable extremal regions in Section 11.6, can the root of a component tree contain an MSER? Explain.
11.34* The well-known heat-diffusion equation of a temperature function g(x, y, z, t) of three spatial variables, (x, y, z), is given by ∂g/∂t − α∇²g = 0, where α is the thermal diffusivity and ∇² is the Laplacian operator. In terms of our discussion of SIFT, the form of this equation is used to establish a relationship between the difference of Gaussians and the scaled Laplacian, σ²∇². Show how this can be done to derive Eq. (11-70).
11.35 With reference to the SIFT algorithm discussed in Section 11.7, assume that the input image is square, of size M × M (with M = 2^n), and let the number of intervals per octave be s = 2.
      (a) How many smoothed images will there be in each octave?
      (b)* How many octaves could be generated before it is no longer possible to down-sample the image by 2?
      (c) If the standard deviation used to smooth the first image in the first octave is σ, what are the values of standard deviation used to smooth the first image in each of the remaining octaves in (b)?
11.36 Advance an argument showing that smoothing an image and then downsampling it by 2 gives the same result as first downsampling the image by 2 and then smoothing it with the same kernel. By downsampling we mean skipping every other row and column. (Hint: Consider the fact that convolution is a linear process.)
11.37 Do the following:
      (a)* Show how to obtain Eq. (11-74) from Eq. (11-71).
      (b) Show how Eq. (11-75) follows from Eqs. (11-74) and (11-71).
11.38 A company that bottles a variety of industrial chemicals employs you to design an approach for detecting when bottles of their product are not full. As they move along a conveyor line past an automatic filling and capping station, the bottles appear as shown in the following image. A bottle is considered imperfectly filled when the level of the liquid is below the midway point between the bottom of the neck and the shoulder of the bottle. The shoulder is defined as the intersection of the sides and slanted portions of the bottle. The bottles move at a high rate of speed, but the company has an imaging system equipped with an illumination flash front end that effectively stops motion, so you will be given images that look very close to the sample shown here. Based on the material you have learned up to this point, propose a solution for detecting bottles that are not filled properly. State clearly all assumptions that you make and that are likely to impact the solution you propose. [The sample image of the bottles is not reproduced here.]
11.39 Having heard about your success with the bottle inspection problem, you are contacted by a fluids company that wishes to automate bubble counting in certain processes for quality control. The company has solved the imaging problem and can obtain 8-bit images of size 700 × 700 pixels, such as the one shown in the figure below. Each image represents an area of 7 cm². The company wishes to do two things with each image: (1) determine the ratio of the area occupied by bubbles to the total area of the image; and (2) count the number of distinct bubbles. Based on the material you have learned up to this point, propose a solution to this problem. In your report, state the physical dimensions of the smallest bubble your solution can detect. State clearly all assumptions that you make and that are likely to impact the solution you propose. [The sample image of the bubbles is not reproduced here.]
12 Image Pattern Classification

   One of the most interesting aspects of the world is that it can be considered to be made up of patterns. A pattern is essentially an arrangement. It is characterized by the order of the elements of which it is made, rather than by the intrinsic nature of these elements.
                                                                Norbert Wiener

Preview
We conclude our coverage of digital image processing with an introduction to techniques for image pattern classification. The approaches developed in this chapter are divided into three principal categories: classification by prototype matching, classification based on an optimal statistical formulation, and classification based on neural networks. The first two approaches are used extensively in applications in which the nature of the data is well understood, leading to an effective pairing of features and classifier design. These approaches often rely on a great deal of engineering to define features and elements of a classifier. Approaches based on neural networks rely less on such knowledge, and lend themselves well to applications in which pattern class characteristics (e.g., features) are learned by the system, rather than being specified a priori by a human designer. The focus of the material in this chapter is on principles, and on how they apply specifically to image pattern classification.

Upon completion of this chapter, readers should:
- Understand the meaning of patterns and pattern classes, and how they relate to digital image processing.
- Be familiar with the basics of minimum-distance classification.
- Understand perceptrons and their history.
- Be familiar with the concept of learning from training samples.

12.1 BACKGROUND
Humans possess the most sophisticated pattern recognition capabilities in the known biological world. By contrast, the capabilities of current recognition machines pale in comparison with tasks humans perform routinely, from being able to interpret the meaning of complex images, to our ability for generalizing knowledge stored in our brains. But recognition machines play an important, sometimes even crucial, role in everyday life. Imagine what modern life would be like without machines that read barcodes, process bank checks, inspect the quality of manufactured products, read fingerprints, sort mail, and recognize speech.

In image pattern recognition, we think of a pattern as a spatial arrangement of features. A pattern class is a set of patterns that share some common properties. Pattern recognition by machine encompasses techniques for automatically assigning patterns to their respective classes. That is, given a pattern or sets of patterns whose class is unknown, the job of a pattern recognition system is to assign a class label to each of its input patterns.

There are four main stages involved in recognition: (1) sensing, (2) preprocessing, (3) feature extraction, and (4) classification. In terms of image processing, sensing is concerned with generating signals in a spatial (2-D) or higher-dimensional format. We covered numerous aspects of image sensing in Chapter 1. Preprocessing deals with techniques for tasks such as noise reduction, enhancement, restoration, and segmentation, as discussed in earlier chapters. You learned about feature extraction in Chapter 11. Classification, the focus of this chapter, deals with using a set of features as the basis for assigning class labels to unknown input image patterns.

In the following section, we will discuss three basic approaches used for image pattern classification: (1) classification based on matching unknown patterns against specified prototypes, (2) optimum statistical classifiers, and (3) neural networks. One way to characterize the differences between these approaches is in the level of "engineering" required to transform raw data into formats suitable for computer processing. Ultimately, recognition performance is determined by the discriminative power of the features used.

In classification based on prototypes, the objective is to make the features so unique and easily detectable that classification itself becomes a simple task. A good example of this is bank-check processors, which use stylized fonts to simplify machine processing (we will discuss this application in Section 12.3).

In the second category, classification is cast in decision-theoretic, statistical terms, and the classification approach is based on selecting parameters that can be shown to yield optimum classification performance in a statistical sense. Here, emphasis is placed on both the features used and the design of the classifier. We will illustrate this approach in Section 12.4 by deriving the Bayes pattern classifier, starting from
                                                                         Understand neural network architectures.                                                                 basic principles.
                tance classification.
                                                                         Be familiar with the concept of deep learning                                                               In the third category, classification is performed using neural networks. As you
                Know how to apply image correlation tech-
                                                                         in fully connected and deep convolutional neu-                                                           will learn in Sections 12.5 and 12.6, neural networks can operate using engineered
                niques for template matching.
                                                                         ral networks. In particular, be familiar with the                                                        features too, but they have the unique ability of being able to generate, on their own,
                Understand the concept of string matching.               importance of the latter in digital image pro-                                                           representations (features) suitable for recognition. These systems can accomplish
                Be familiar with Bayes classifiers.                      cessing.                                                                                                 this using raw data, without the need for engineered features.
One characteristic shared by the preceding three approaches is that they are based on parameters that must be either specified or learned from patterns that represent the recognition problem we want to solve. The patterns can be labeled, meaning that we know the class of each pattern, or unlabeled, meaning that the data are known to be patterns, but the class of each pattern is unknown. A classic example of labeled data is the character recognition problem, in which a set of character samples is collected and the identity of each character is recorded as a label from the group 0 through 9 and a through z. An example of unlabeled data is when we are seeking clusters in a data set, with the aim of using the resulting cluster centers as prototypes of the pattern classes contained in the data.

When working with labeled data, a given data set generally is subdivided into three subsets: a training set, a validation set, and a test set (a typical subdivision might be 50% training, and 25% each for the validation and test sets). The process by which a training set is used to generate classifier parameters is called training. In this mode, a classifier is given the class label of each pattern, the objective being to make adjustments in the parameters if the classifier makes a mistake in identifying the class of the given pattern. At this point, we might be working with several candidate designs. At the end of training, we use the validation set to compare the various designs against a performance objective. Typically, several iterations of training/validation are required to establish the design that comes closest to meeting the desired objective. Once a design has been selected, the final step is to determine how it will perform "in the field." For this, we use the test set, which consists of patterns that the system has never "seen" before. If the training and validation sets are truly representative of the data the system will encounter in practice, the results of training/validation should be close to the performance using the test set. If training/validation results are acceptable but test results are not, we say that training/validation "overfit" the system parameters to the available data, in which case further work on the system architecture is required. Of course, all this assumes that the given data are truly representative of the problem we want to solve, and that the problem in fact can be solved by available technology.

(Because the examples in this chapter are intended to demonstrate basic principles and are not large scale, we dispense with validation and subdivide the pattern data into training and test sets.)
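As an illustration of this subdivision, the following minimal sketch (Python/NumPy; the function name and the one-pattern-per-row array layout are our own, illustrative choices) shuffles a labeled pattern set and splits it 50/25/25:

```python
import numpy as np

def split_patterns(X, y, rng=None):
    """Shuffle a labeled pattern set (one pattern per row of X) and
    subdivide it into 50% training, 25% validation, and 25% test."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(X))            # shuffle before splitting
    n_tr, n_va = len(X) // 2, len(X) // 4    # 50% / 25%; test gets the rest
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```

For the small-scale examples in this chapter, the validation portion would simply be folded back into the training and test sets, as the margin note indicates.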
A system that is designed using training data is said to undergo supervised learning. If we are working with unlabeled data, the system learns the pattern classes themselves while in an unsupervised learning mode. In this chapter, we deal only with supervised learning. As you will see in this and the next chapter, supervised learning covers a broad range of approaches, from applications in which a system learns the parameters of features whose form is fixed by a designer, to systems that utilize deep learning and large sets of raw data to learn, on their own, the features required for classification. These systems accomplish this task without a human designer having to specify the features a priori.

(Generally, we associate the concept of deep learning with large sets of data. These ideas are discussed in more detail later in this section and in the next.)

After a brief discussion in the next section of how patterns are formed, and of the nature of pattern classes, we will discuss in Section 12.3 various approaches for prototype-based classification. In Section 12.4, we will start from basic principles and derive the equations of the Bayes classifier, an approach characterized by optimum classification performance on an average basis. We will also discuss supervised training of a Bayes classifier based on the assumption of multivariate Gaussian distributions. Starting with Section 12.5, we will spend the rest of the chapter discussing neural networks. We will begin Section 12.5 with a brief introduction to perceptrons and some historical facts about machine learning. Then, we will introduce the concept of deep neural networks and derive the equations of backpropagation, the method of choice for training deep neural nets. These networks are well suited for applications in which input patterns are vectors. In Section 12.6, we will introduce deep convolutional neural networks, which currently are the preferred approach when the system inputs are digital images. After deriving the backpropagation equations used for training convolutional nets, we will give several examples of applications involving classes of images of various complexities. In addition to working directly with image inputs, deep convolutional nets are capable of learning, on their own, image features suitable for classification. This is accomplished starting with raw image data, as opposed to the other classification methods discussed in Sections 12.3 and 12.4, which rely on "engineered" features whose form, as noted earlier, is specified a priori by a human designer.

12.2 PATTERNS AND PATTERN CLASSES

In image pattern classification, the two principal pattern arrangements are quantitative and structural. Quantitative patterns are arranged in the form of pattern vectors. Structural patterns typically are composed of symbols, arranged in the form of strings, trees, or, less frequently, as graphs. Most of the work in this chapter is based on pattern vectors, but we will discuss structural patterns briefly at the end of this section, and give an example at the end of Section 12.3.

PATTERN VECTORS

Pattern vectors are represented by lowercase letters, such as x, y, and z, and have the form

x = [x_1, x_2, …, x_n]^T        (12-1)

where each component, x_i, represents the ith feature descriptor, and n is the total number of such descriptors. We can express a vector in the column form of Eq. (12-1), or in the equivalent row form x = (x_1, x_2, …, x_n)^T, where T indicates transposition. A pattern vector may be "viewed" as a point in n-dimensional Euclidean space, and a pattern class may be interpreted as a "hypercloud" of points in this pattern space. For the purpose of recognition, we want our pattern classes to be grouped tightly, and to be as far away from each other as possible.

Pattern vectors can be formed directly from image pixel intensities by vectorizing the image using, for example, linear indexing, as in Fig. 12.1. A more common approach is for the pattern elements to be features.

(We discussed linear indexing in Section 2.4; see Fig. 2.22.)
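The vectorization in Fig. 12.1 is simple to express in code. In the following minimal sketch (Python/NumPy; the intensity values are arbitrary, and the column-by-column scan order is our assumption about the linear indexing of Section 2.4), an M × N grayscale image becomes an MN-dimensional pattern vector:

```python
import numpy as np

# A small grayscale image (intensity values are arbitrary).
f = np.array([[225, 175,  90,  90],
              [125,   5,  90, 225],
              [  5,   5,   5, 175],
              [125,   5, 225, 225]], dtype=np.uint8)

# Linear indexing scans the image column by column, so NumPy's
# column-major ("F") flattening produces the pattern vector.
x = f.flatten(order="F")
print(x[:4])   # first column of f: [225 125   5 125]
```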
FIGURE 12.1  Using linear indexing to vectorize a grayscale image.

FIGURE 12.3  (a) A noisy object boundary, and (b) its corresponding signature r(u), sampled at angles u_1, u_2, …, u_n (for example, at multiples of π/4) to form the pattern vector x = [g(r(u_1)), g(r(u_2)), …, g(r(u_n))]^T.
An early example is the work of Fisher [1936] who, close to a century ago, reported the use of what then was a new technique called discriminant analysis to recognize three types of iris flowers (Iris setosa, virginica, and versicolor). Fisher described each flower using four features: the length and width of the petals, and similarly for the sepals (see Fig. 12.2). This leads to the 4-D vectors shown in the figure. A set of these vectors, obtained for fifty samples of each flower species, constitutes the three famous Fisher iris pattern classes. Had Fisher been working today, he probably would have added spectral colors and shape features to his measurements, yielding vectors of higher dimensionality. We will be working with the original iris data set later in this chapter.

(Sepals are the undergrowth beneath the petals.)

A higher-level representation of patterns is based on feature descriptors of the types you learned about in Chapter 11. For instance, pattern vectors formed from descriptors of boundary shape are well suited for applications in controlled environments, such as industrial inspection. Figure 12.3 illustrates the concept. Here, we are interested in classifying different types of noisy shapes, a sample of which is shown in the figure. If we represent an object by its signature, we obtain 1-D signals of the form shown in Fig. 12.3(b). We can express a signature as a vector by sampling its amplitude at increments of u, then forming a vector by letting x_i = r(u_i), for i = 1, 2, …, n. Instead of using "raw" sampled signatures, a more common approach is to compute some function, x_i = g(r(u_i)), of the signature samples and use the results to form vectors. You learned several approaches for doing this in Section 11.3, such as statistical moments.
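As one concrete (and here hypothetical) choice of the function g, the sketch below forms a pattern vector from the first few central moments of the sampled signature; any of the descriptors of Section 11.3 could be used instead:

```python
import numpy as np

def signature_features(r, n_features=4):
    """Form a pattern vector from a sampled boundary signature r(u),
    using central moments of the samples as the function g(.)."""
    r = np.asarray(r, dtype=float)
    mean = r.mean()
    # x_k = k-th central moment of the signature samples, k = 2, 3, ...
    return np.array([np.mean((r - mean) ** k)
                     for k in range(2, n_features + 2)])

# A circle's signature r(u) is constant, so every central moment
# (and hence every feature) is zero.
u = np.linspace(0, 2 * np.pi, 360, endpoint=False)
print(signature_features(np.ones_like(u)))   # [0. 0. 0. 0.]
```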
FIGURE 12.2  Petal and sepal width and length measurements (see arrows) performed on iris flowers for the purpose of data classification. The image shown is of the species Iris virginica. (Image courtesy of the USDA.) The measurements form the 4-D pattern vector x = [x_1, x_2, x_3, x_4]^T, where x_1 = petal width, x_2 = petal length, x_3 = sepal width, and x_4 = sepal length.

Vectors can be formed also from features of both boundaries and regions. For example, the objects in Fig. 12.4 can be represented by 3-D vectors whose components capture shape information related to both the boundary and region properties of single binary objects. Pattern vectors can be used also to represent properties of image regions. For example, the elements of the 6-D vector in Fig. 12.5 are texture measures based on the feature descriptors in Table 11.3. Figure 12.6 shows an example in which the pattern vector elements are features that are invariant to transformations, such as image rotation and scaling (see Section 11.4).

FIGURE 12.4  (a)–(d) Pattern vectors whose components capture both boundary and regional characteristics: x = [x_1, x_2, x_3]^T, where x_1 = compactness, x_2 = circularity, and x_3 = eccentricity.

When working with sequences of registered images, we have the option of using pattern vectors formed from corresponding pixels in those images (see Fig. 12.7). Forming pattern vectors in this way implies that recognition will be based on information extracted from the same spatial location across the images. Although this may seem like a very limiting approach, it is ideally suited for applications such as recognizing regions in multispectral images, as you will see in Section 12.4.
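A minimal sketch of this per-pixel construction (Python/NumPy; the function name is illustrative, and the bands are assumed to be registered, equally sized arrays):

```python
import numpy as np

def pixel_pattern_vectors(bands):
    """Stack registered, equally sized images so that each pixel
    location yields one pattern vector of per-band intensities."""
    stack = np.stack(bands, axis=-1)            # shape: (M, N, n_bands)
    return stack.reshape(-1, stack.shape[-1])   # one row per pixel

# Example with three synthetic 2 x 2 "spectral bands":
b1, b2, b3 = (np.full((2, 2), v) for v in (10, 20, 30))
X = pixel_pattern_vectors([b1, b2, b3])
print(X[0])   # pattern vector of the top-left pixel: [10 20 30]
```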
When working with entire images as units, we need the detail afforded by vectors of much higher dimensionality, such as those we discussed in Section 11.7 in connection with the SIFT algorithm. However, a more powerful approach when working with entire images is to use deep convolutional neural networks. We will discuss neural nets in detail in Sections 12.5 and 12.6.

STRUCTURAL PATTERNS

Pattern vectors are not suitable for applications in which objects are represented by structural features, such as strings of symbols. Although they are used much less than vectors in image processing applications, patterns containing structural descriptions of objects are important in applications where shape is of interest. Figure 12.8 shows an example. The boundaries of the bottles were approximated by a polygon
FIGURE 12.7  Pattern (feature) vectors formed by concatenating corresponding pixels from a set of registered images; six spectral-band images yield 6-D vectors x = [x_1, …, x_6]^T. (Original images courtesy of NASA.)

12.3 PATTERN CLASSIFICATION BY PROTOTYPE MATCHING

Prototype matching involves comparing an unknown pattern against a set of prototypes, and assigning to the unknown pattern the class of the prototype that is the most "similar" to the unknown. Each prototype represents a unique pattern class, but there may be more than one prototype for each class. What distinguishes one matching method from another is the measure used to determine similarity.

MINIMUM-DISTANCE CLASSIFIER

(The minimum-distance classifier is also referred to as the nearest-neighbor classifier.)

One of the simplest and most widely used prototype matching methods is the minimum-distance classifier which, as its name implies, computes a distance-based measure between an unknown pattern vector and each of the class prototypes. It then assigns the unknown pattern to the class of its closest prototype.
The prototype vectors of the minimum-distance classifier usually are the mean vectors of the various pattern classes:

m_j = (1/n_j) Σ_{x ∈ c_j} x,   j = 1, 2, …, N_c        (12-2)

where n_j is the number of pattern vectors used to compute the jth mean vector, c_j is the jth pattern class, and N_c is the number of classes. If we use the Euclidean distance to determine similarity, the minimum-distance classifier computes the distances

D_j(x) = ||x − m_j||,   j = 1, 2, …, N_c        (12-3)

where ||a|| = (a^T a)^{1/2} is the Euclidean norm. The classifier then assigns an unknown pattern x to class c_i if D_i(x) < D_j(x) for j = 1, 2, …, N_c, j ≠ i. Ties [i.e., D_i(x) = D_j(x)] are resolved arbitrarily.

It is not difficult to show (see Problem 12.2) that selecting the smallest distance is equivalent to evaluating the functions

d_j(x) = m_j^T x − (1/2) m_j^T m_j,   j = 1, 2, …, N_c        (12-4)

and assigning an unknown pattern x to the class whose prototype yielded the largest value of d. That is, x is assigned to class c_i if

d_i(x) > d_j(x),   j = 1, 2, …, N_c;  j ≠ i        (12-5)

When used for recognition, functions of this form are referred to as decision or discriminant functions. The decision boundary separating class c_i from c_j is given by the values of x for which

d_i(x) = d_j(x)        (12-6)

Equivalently, the boundary can be expressed as the single function

d_ij(x) = d_i(x) − d_j(x) = 0        (12-7)

which is positive for patterns of class c_i and negative for patterns of class c_j. Substituting Eq. (12-4) into Eq. (12-7) gives the boundary of the minimum-distance classifier:

d_ij(x) = (m_i − m_j)^T x − (1/2)(m_i − m_j)^T (m_i + m_j) = 0        (12-8)

EXAMPLE 12.1: Illustration of the minimum-distance classifier for two classes in 2-D.

Figure 12.10 shows scatter plots of petal width and length values for the classes Iris versicolor and Iris setosa. As mentioned in the previous section, pattern vectors in the iris database consist of four measurements for each flower. We show only two here so that you can visualize the pattern classes and the decision boundary between them. We will work with the complete database later in this chapter.

We denote the Iris versicolor and setosa data as classes c_1 and c_2, respectively. The means of the two classes are m_1 = (4.3, 1.3)^T and m_2 = (1.5, 0.3)^T. It then follows from Eq. (12-4) that

d_1(x) = m_1^T x − (1/2) m_1^T m_1 = 4.3 x_1 + 1.3 x_2 − 10.1

and

d_2(x) = m_2^T x − (1/2) m_2^T m_2 = 1.5 x_1 + 0.3 x_2 − 1.17

From Eq. (12-8), the equation of the boundary is

d_12(x) = d_1(x) − d_2(x) = 2.8 x_1 + 1.0 x_2 − 8.92 = 0

Figure 12.10 shows a plot of this boundary. If we were presented with an unknown pattern x belonging to one of these two classes, the sign of d_12(x) would be sufficient to determine the class to which that pattern belongs.
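The example is easy to verify numerically. The following minimal sketch (Python/NumPy; the function names and array layout are our own choices) implements Eqs. (12-2) through (12-4) and reproduces the assignments implied by the means above:

```python
import numpy as np

def train_means(X, y, classes):
    """Eq. (12-2): the prototype of each class is its mean vector."""
    return np.array([X[y == c].mean(axis=0) for c in classes])

def classify(x, means):
    """Assign x to the class with the smallest Euclidean distance,
    Eq. (12-3), by maximizing the decision functions of Eq. (12-4):
    d_j(x) = m_j^T x - 0.5 * m_j^T m_j."""
    d = means @ x - 0.5 * np.sum(means * means, axis=1)
    return int(np.argmax(d))

# Class means from Example 12.1 (petal measurements, in cm):
means = np.array([[4.3, 1.3],    # m1: Iris versicolor
                  [1.5, 0.3]])   # m2: Iris setosa
print(classify(np.array([4.0, 1.2]), means))   # 0 -> versicolor
print(classify(np.array([1.4, 0.2]), means))   # 1 -> setosa
```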
The minimum-distance classifier works well when the distance between means is large compared to the spread or randomness of each class with respect to its mean. In Section 12.4 we will show that the minimum-distance classifier yields optimum performance (in terms of minimizing the average loss from misclassification) when the distribution of each class about its mean is in the form of a spherical "hypercloud" in n-dimensional pattern space.

As noted earlier, one of the keys to accurate recognition performance is to specify features that are effective discriminators between classes. As a rule, the better the features are at meeting this objective, the better the recognition performance will be. In the case of the minimum-distance classifier, this implies wide separation between means and tight grouping of the classes.

Systems based on the American Bankers Association E-13B font character set are a classic example of how highly engineered features can be used in conjunction with a simple classifier to achieve superior results. In the mid-1940s, bank checks were processed manually, which was a laborious, costly process prone to mistakes. As the volume of check writing increased in the early 1950s, banks became keenly interested in automating this task. In the mid-1950s, the E-13B font and the system that reads it became the standard solution to the problem. As Fig. 12.11 shows, this font set consists of 14 characters laid out on a 9 × 7 grid. The characters are stylized to maximize the differences between them. The font was designed to be compact and readable by humans, but the overriding purpose was that the characters should be readable by machine, quickly and with very high accuracy.

(Appropriately, recognition of magnetized characters is referred to as Magnetic Ink Character Recognition, or MICR.)

In addition to the stylized font design, the operation of the reading system is further enhanced by printing each character using an ink that contains finely ground magnetic material. To improve character detectability in a check being read, the ink is subjected to a magnetic field that accentuates each character against the background. The stylized design further enhances character detectability. The characters are scanned in a horizontal direction with a single-slit reading head that is narrower but taller than the characters. As a check passes through the head, the sensor produces a 1-D electrical signal (a signature) that is conditioned to be proportional to the rate of increase or decrease of the character area under the head. For example, consider the waveform of the number 0 in Fig. 12.11. As the check moves to the right past the head, the character area seen by the sensor begins to increase, producing a positive derivative (a positive rate of change). As the right leg of the character begins to pass under the head, the area seen by the sensor begins to decrease, producing a negative derivative. When the head is in the middle zone of the character, the area remains nearly constant, producing a zero derivative. This waveform repeats itself as the other leg of the character enters the head. The design of the font ensures that the waveform of each character is distinct from all others. It also ensures that the peaks and zeros of each waveform occur approximately on the vertical lines of the background grid on which these waveforms are displayed, as the figure shows. The E-13B font has the property that sampling the waveforms only at these (nine) points yields enough information for their accurate classification.
                                        points yields enough information for their accurate classification. The effectiveness                                       FIGURE 12.12
                                        of these highly engineered features is further refined by the magnetized ink, which                                         The mechanics of                                                                          (m % 1)/ 2
                                                                                                                                                                    template
                                        results in clean waveforms with almost no scatter.                                                                                                                                     Origin
                                                                                                                                                                    matching.
                                            Designing a minimum-distance classifier for this application is straightforward.                                                                                                                                 (n % 1)/ 2
                                        We simply store the sample values of each waveform at the vertical lines of the grid,                                                                                                                         n
                                        and let each set of the resulting samples be represented as a 9-D prototype vector,                                                                                                                     m
                                        m j , j = 1, 2,…, 14. When an unknown character is to be classified, the approach is                                                                                                                        (x, y)
                                        to scan it in the manner just described, express the grid samples of the waveform as                                                                                                            Template w
                                        a 9-D vector, x, and identify its class by selecting the class of the prototype vector                                                                                                          centered at an arbitrary
                                        that yields the highest value in Eq. (12-4). We do not even need a computer to do                                                                                                               location (x, y)
                                        this. Very high classification speeds can be achieved with analog circuits composed
                                        of resistor banks (see Problem 12.4).                                                                                                                                                                                   Image, f
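As a concrete illustration, the following is a minimal Python sketch of this classifier. It assumes Eq. (12-4) is the usual minimum-distance decision function d_j(x) = x^T m_j − (1/2) m_j^T m_j, which is equivalent to choosing the prototype nearest to x in Euclidean distance; the prototype and sample arrays are hypothetical stand-ins for the stored waveform samples.

```python
# Minimal sketch of a minimum-distance classifier for 9-D waveform samples.
import numpy as np

def classify_min_distance(x, prototypes):
    """x: (9,) sample vector; prototypes: (14, 9), one 9-D mean vector per character."""
    # d_j(x) = x^T m_j - 0.5 * m_j^T m_j; the largest value identifies the class.
    d = prototypes @ x - 0.5 * np.sum(prototypes**2, axis=1)
    return int(np.argmax(d))

# Hypothetical usage: 14 prototype waveforms, each sampled at 9 grid points.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(14, 9))          # stand-ins for stored E-13B samples
x = prototypes[3] + 0.05 * rng.normal(size=9)  # noisy observation of character 3
print(classify_min_distance(x, prototypes))    # -> 3
```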
The most important lesson in this example is that a recognition problem often can be made trivial if we can control the environment in which the patterns are generated. The development and implementation of the E-13B font reading system is a striking example of this fact. On the other hand, this system would be inadequate if we added the requirement that it also recognize the textual content and signature written on each check. For this, we need systems that are significantly more complex, such as the convolutional neural networks we will discuss in Section 12.6.
USING CORRELATION FOR 2-D PROTOTYPE MATCHING
We introduced the basic idea of spatial correlation and convolution in Section 3.4, and used these concepts extensively in Chapter 3 for spatial filtering. From Eq. (3-34), we know that correlation of a kernel w with an image f(x, y) is given by

\[ (w \star f)(x, y) = \sum_s \sum_t w(s, t)\, f(x + s,\, y + t) \tag{12-9} \]

where the limits of summation are taken over the region shared by w and f. This equation is evaluated for all values of the displacement variables x and y, so that all elements of w visit every pixel of f. As you know, correlation has its highest value(s) in the region(s) where f and w are equal or nearly equal. In other words, Eq. (12-9) finds locations where w matches a region of f. But this equation has the drawback that the result is sensitive to changes in the amplitude of either function. In order to normalize correlation to amplitude changes in one or both functions, we perform matching using the correlation coefficient instead:

\[ g(x, y) = \frac{\displaystyle\sum_s \sum_t \big[\, w(s, t) - \bar{w} \,\big]\big[\, f(x + s,\, y + t) - \bar{f}_{xy} \,\big]}{\left\{ \displaystyle\sum_s \sum_t \big[\, w(s, t) - \bar{w} \,\big]^2 \sum_s \sum_t \big[\, f(x + s,\, y + t) - \bar{f}_{xy} \,\big]^2 \right\}^{1/2}} \tag{12-10} \]

where the limits of summation are taken over the region shared by w and f, \(\bar{w}\) is the average value of the kernel (computed only once), and \(\bar{f}_{xy}\) is the average value of f in the region coincident with w. In image correlation work, w is often referred to as a template (i.e., a prototype subimage) and correlation is referred to as template matching.

(To be formal, we should refer to correlation, and the correlation coefficient, as cross-correlation when the functions are different, and as autocorrelation when they are the same. However, it is customary to use the generic terms correlation and correlation coefficient, except when the distinction is important, as in deriving equations, in which it makes a difference which is being applied.)

It can be shown (see Problem 12.5) that g(x, y) has values in the range [−1, 1] and is thus normalized to changes in the amplitudes of w and f. The maximum value of g occurs when the normalized w and the corresponding normalized region in f are identical. This indicates maximum correlation (the best possible match). The minimum occurs when the two normalized functions exhibit the least similarity in the sense of Eq. (12-10).

FIGURE 12.12 The mechanics of template matching. [The figure shows an image f surrounded by a border of padding, and a template w of size m × n with its center at an arbitrary location (x, y); the padding is of width (m − 1)/2 and (n − 1)/2.]

Figure 12.12 illustrates the mechanics of the procedure just described. The border around image f is padding, as explained in Section 3.4. In template matching, values of correlation when the center of the template is past the border of the image generally are of no interest, so the padding is limited to half the kernel width.

The template in Fig. 12.12 is of size m × n, and it is shown with its center at an arbitrary location (x, y). The value of the correlation coefficient at that point is computed using Eq. (12-10). Then, the center of the template is incremented to an adjacent location and the procedure is repeated. Values of the correlation coefficient g(x, y) are obtained by moving the center of the template (i.e., by incrementing x and y) so that the center of w visits every pixel in f. At the end of the procedure, we look for the maximum in g(x, y) to find where the best match occurred. It is possible to have multiple locations in g(x, y) with the same maximum value, indicating several matches between w and f.
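The following Python sketch implements Eq. (12-10) directly, with plain loops for clarity rather than speed; the image and template used at the end are hypothetical stand-ins.

```python
# Minimal sketch of template matching with the correlation coefficient, Eq. (12-10).
import numpy as np

def corr_coef_map(f, w):
    """Return g(x, y) for template w slid over image f (both 2-D arrays)."""
    m, n = w.shape
    # Pad by half the template size so the center of w can visit every pixel of f.
    fp = np.pad(f.astype(float), ((m // 2, m // 2), (n // 2, n // 2)))
    wz = w - w.mean()                        # w - w_bar, computed only once
    g = np.zeros_like(f, dtype=float)
    for x in range(f.shape[0]):
        for y in range(f.shape[1]):
            region = fp[x:x + m, y:y + n]
            rz = region - region.mean()      # f - f_bar_xy over the region under w
            denom = np.sqrt((wz**2).sum() * (rz**2).sum())
            g[x, y] = (wz * rz).sum() / denom if denom > 0 else 0.0
    return g  # values in [-1, 1]; the maximum marks the best match

# Hypothetical usage: find where a small patch best matches inside an image.
img = np.random.default_rng(1).random((64, 64))
tmpl = img[20:31, 30:41]                     # 11 x 11 patch copied from the image
print(np.unravel_index(np.argmax(corr_coef_map(img, tmpl)), img.shape))  # -> (25, 35)
```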
EXAMPLE 12.2: Matching by correlation.
Figure 12.13(a) shows a 913 × 913 satellite image of Hurricane Andrew, taken in 1992, in which the eye of the storm is clearly visible. We want to use correlation to find the location of the best match in Fig. 12.13(a) of the template in Fig. 12.13(b), which is a 31 × 31 subimage of the eye of the storm. Figure 12.13(c) shows the result of computing the correlation coefficient in Eq. (12-10) for all values of x and y in the original image. The size of this image was 943 × 943 pixels due to padding (see Fig. 12.12), but we cropped it to the size of the original image for display. The intensity in this image is proportional to the correlation values, and all negative correlations were clipped at 0 (black) to simplify the visual analysis of the image. The area of highest correlation values appears as a small white region in this image. The brightest point in this region matches with the center of the eye of the storm. Figure 12.13(d) shows as a white dot the location of this maximum correlation value (in this case there was a unique match whose maximum value was 1), which we see corresponds closely with the location of the eye in Fig. 12.13(a).
FIGURE 12.13 (a) 913 × 913 satellite image of Hurricane Andrew. (b) 31 × 31 template of the eye of the storm. (c) Correlation coefficient shown as an image (note the brightest point, indicated by an arrow). (d) Location of the best match, identified by the arrow. This point is a single pixel, but its size was enlarged to make it easier to see. (Original image courtesy of NOAA.)

FIGURE 12.14 Circuit board image of size 948 × 915 pixels, and a subimage of one of the connectors. The subimage is of size 212 × 128 pixels, shown zoomed on the right for clarity. (Original image courtesy of Mr. Joseph E. Pascente, Lixi, Inc.)
MATCHING SIFT FEATURES
We discussed the scale-invariant feature transform (SIFT) in Section 11.7. SIFT computes a set of invariant features that can be used for matching between known (prototype) and unknown images. The SIFT implementation in Section 11.7 yields a 128-dimensional feature vector for each local region in an image. SIFT performs matching by looking for correspondences between sets of stored feature vector prototypes and feature vectors computed for an unknown image. Because of the large number of features involved, searching for exact matches is computationally intensive. Instead, the approach is to use a best-bin-first method that can identify the nearest neighbors with high probability using only a limited amount of computation (see Lowe [1999], [2004]). The search is further simplified by looking for clusters of potential solutions using the generalized Hough transform proposed by Ballard [1981]. We know from the discussion in Section 10.2 that the Hough transform simplifies looking for data patterns by utilizing bins that reduce the level of detail with which we look at a data set. We already discussed the SIFT algorithm in Section 11.7. The focus in this section is to further illustrate the capabilities of SIFT for prototype matching.

Figure 12.14 shows the circuit board image we have used several times before. The small rectangle enclosing the rightmost connector on the top of the large image identifies an area from which an image of the connector was extracted. The small image is shown zoomed for clarity. The sizes of the large and small images are given in the figure caption. Figure 12.15 shows the keypoints found by SIFT, as explained in Section 11.7. They are visible as faint lines on both images. The zoomed view of the subimage shows them a little more clearly. It is important to note that the keypoints for the image and subimage were found independently by SIFT. The large image had 2714 keypoints, and the small image had 35.

FIGURE 12.15 Keypoints found by SIFT. The large image has 2714 keypoints (visible as faint gray lines). The subimage has 35 keypoints. This is a separate image, and SIFT found its keypoints independently of the large image. The zoomed section is shown for clarity.

Figure 12.16 shows the matches between keypoints found by SIFT. A total of 41 matches were found between the two images. Because there are only 35 keypoints in the small image, obviously at least six matches are either incorrect, or there are multiple matches.
FIGURE 12.16 Matches found by SIFT between the large and small images. A total of 41 matching pairs were found; they are shown connected by straight lines. Only three of the matches were "real" errors (labeled "Errors" in the figure).

Three of the errors are clearly visible as matches with connectors in the middle of the large image. However, if you compare the shape of the connectors in the middle of the large image, you can see that they are virtually identical to parts of the connectors on the right. Therefore, these errors can be explained on that basis. The other three extra matches are easier to explain. All connectors on the top right of the circuit board are identical, and we are comparing one of them against the rest. There is no way for a system to tell the difference between them. In fact, by looking at the connecting lines, we can see that the matches are between the subimage and all five connectors. These in fact are correct matches between the subimage and other connectors that are identical to it.
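The following Python sketch illustrates the matching step on stored descriptor arrays. It uses brute-force nearest-neighbor search with the ratio test from Lowe [2004] in place of the best-bin-first search and Hough clustering described above, and the descriptor arrays are random stand-ins, so treat it only as a conceptual outline.

```python
# Minimal sketch of SIFT descriptor matching: for each descriptor in the subimage,
# find its two nearest neighbors among the large image's descriptors and accept the
# match only if the nearest is sufficiently better than the second nearest.
import numpy as np

def match_descriptors(desc_small, desc_large, ratio=0.8):
    """desc_small: (Ns, 128), desc_large: (Nl, 128). Returns list of (i, j) pairs."""
    matches = []
    for i, d in enumerate(desc_small):
        dists = np.linalg.norm(desc_large - d, axis=1)  # Euclidean distances
        j1, j2 = np.argsort(dists)[:2]                  # two nearest neighbors
        if dists[j1] < ratio * dists[j2]:               # ratio test (Lowe [2004])
            matches.append((i, j1))
    return matches

# Hypothetical usage with random stand-ins for 128-D SIFT feature vectors.
rng = np.random.default_rng(2)
large = rng.random((2714, 128))
small = np.vstack([large[:30] + 0.01 * rng.random((30, 128)),
                   rng.random((5, 128))])
print(len(match_descriptors(small, large)))  # most of the first 30 should match
```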
MATCHING STRUCTURAL PROTOTYPES
The techniques discussed up to this point deal with patterns quantitatively, and largely ignore any structural relationships inherent in pattern shapes. The methods discussed in this section seek to achieve pattern recognition by capitalizing precisely on these types of relationships. In this section, we introduce two basic approaches for the recognition of boundary shapes based on string representations, which are the most practical approach in structural pattern recognition.

Matching Shape Numbers
A procedure similar in concept to the minimum-distance classifier introduced earlier for pattern vectors can be formulated for comparing region boundaries that are described by shape numbers. With reference to the discussion in Section 11.3, the degree of similarity, k, between two region boundaries is defined as the largest order for which their shape numbers still coincide. For example, let a and b denote shape numbers of closed boundaries represented by 4-directional chain codes. These two shapes have a degree of similarity k if

\[ s_j(a) = s_j(b) \quad \text{for } j = 4, 6, 8, \dots, k; \text{ and} \qquad s_j(a) \neq s_j(b) \quad \text{for } j = k + 2,\, k + 4, \dots \tag{12-11} \]

where s indicates shape number, and the subscript indicates shape order. (Parameter j starts at 4 and is always even because we are working with 4-connectivity, and we require that boundaries be closed.) The distance between two shapes a and b is defined as the inverse of their degree of similarity:

\[ D(a, b) = \frac{1}{k} \tag{12-12} \]

This expression satisfies the following properties:

\[ D(a, b) \geq 0 \qquad D(a, b) = 0 \;\; \text{if and only if } a = b \qquad D(a, c) \leq \max\big[ D(a, b),\, D(b, c) \big] \tag{12-13} \]

Either k or D may be used to compare two shapes. If the degree of similarity is used, the larger k is, the more similar the shapes are (note that k is infinite for identical shapes). The reverse is true when Eq. (12-12) is used.
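The following Python sketch computes k and D from shape numbers supplied per order. The dictionary representation of the shape numbers is purely hypothetical, chosen only to make the comparison loop explicit.

```python
# Minimal sketch of the degree of similarity k and distance D, Eqs. (12-11)-(12-12).
def degree_of_similarity(sa, sb, max_order=32):
    """sa, sb: dicts mapping order j (4, 6, 8, ...) to the shape number at that order.
    Per Eq. (12-11), shape numbers coincide up to order k and differ beyond it, so
    we stop at the first mismatch."""
    k = 0
    for j in range(4, max_order + 1, 2):   # j = 4, 6, 8, ... (closed boundaries)
        if sa.get(j) != sb.get(j):
            break
        k = j
    return k

def shape_distance(sa, sb):
    """D(a, b) = 1/k, Eq. (12-12); infinite if the shapes differ already at order 4."""
    k = degree_of_similarity(sa, sb)
    return float('inf') if k == 0 else 1.0 / k

# Hypothetical shapes that coincide up to order 8 and differ from order 10 on.
sa = {4: '0303', 6: '003033', 8: '00330033', 10: '0033003033'}
sb = {4: '0303', 6: '003033', 8: '00330033', 10: '0030330033'}
print(degree_of_similarity(sa, sb), shape_distance(sa, sb))  # -> 8 0.125
```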
EXAMPLE 12.3: Matching shape numbers.
Suppose we have a shape, f, and want to find its closest match in a set of five shape prototypes, denoted by a, b, c, d, and e, as shown in Fig. 12.17(a). The search may be visualized with the aid of the similarity tree in Fig. 12.17(b). The root of the tree corresponds to the lowest possible degree of similarity, which is 4. Suppose shapes are identical up to degree 8, with the exception of shape a, whose degree of similarity with respect to all other shapes is 6. Proceeding down the tree, we find that shape d has degree of similarity 8 with respect to all others, and so on. Shapes f and c match uniquely, having a higher degree of similarity than any other two shapes. Conversely, if a had been an unknown shape, all we could have said using this method is that a was similar to the other five shapes with degree of similarity 6. The same information can be summarized in the form of the similarity matrix in Fig. 12.17(c).

FIGURE 12.17 (a) Shapes a through f. (b) Similarity tree. (c) Similarity matrix. (Bribiesca and Guzman.) The tree and matrix are:

    Similarity tree (groups still coinciding at each degree):
      Degree  4:  abcdef
      Degree  6:  abcdef
      Degree  8:  a | bcdef
      Degree 10:  a | d | cf | be
      Degree 12:  a | d | cf | b | e
      Degree 14:  a | d | c | f | b | e

    Similarity matrix (degree of similarity for each pair):
           a   b   c   d   e   f
      a        6   6   6   6   6
      b            8   8  10   8
      c                8   8  12
      d                    8   8
      e                        8

String Matching
Suppose two region boundaries, a and b, are coded into strings of symbols, denoted as a1a2…an and b1b2…bm, respectively. Let α represent the number of matches between the two strings, where a match occurs in the kth position if ak = bk. The number of symbols that do not match is

\[ \beta = \max\big( \|a\|,\, \|b\| \big) - \alpha \tag{12-14} \]
where ‖arg‖ is the length (number of symbols) of the string in the argument. It can be shown that β = 0 if and only if a and b are identical (see Problem 12.7).
An effective measure of similarity is the ratio

\[ R = \frac{\alpha}{\beta} = \frac{\alpha}{\max\big( \|a\|,\, \|b\| \big) - \alpha} \tag{12-15} \]

We see that R is infinite for a perfect match and 0 when none of the corresponding symbols in a and b match (α = 0 in this case). Because matching is done symbol by symbol, the starting point on each boundary is important in terms of reducing the amount of computation required to perform a match. Any method that normalizes to, or near, the same starting point is helpful if it provides a computational advantage over brute-force matching, which consists of starting at arbitrary points on each string, then shifting one of the strings (with wraparound) and computing Eq. (12-15) for each shift. The largest value of R gives the best match. (Refer to Section 11.2 for examples of how the starting point of a curve can be normalized.)
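A minimal Python sketch of Eqs. (12-14) and (12-15), including the brute-force shift-with-wraparound search just described, follows; the symbol strings at the end are hypothetical.

```python
# Minimal sketch of the string similarity measure, Eqs. (12-14) and (12-15).
def similarity_R(a, b):
    """a, b: sequences of symbols. Returns R = alpha/beta for the given alignment."""
    alpha = sum(1 for x, y in zip(a, b) if x == y)       # number of position matches
    beta = max(len(a), len(b)) - alpha                   # Eq. (12-14)
    return float('inf') if beta == 0 else alpha / beta   # Eq. (12-15)

def best_R(a, b):
    """Shift string b circularly and keep the largest R (brute-force matching)."""
    return max(similarity_R(a, b[s:] + b[:s]) for s in range(len(b)))

# Hypothetical usage with 8-symbol angle codes represented as integers 1..8.
a = [1, 2, 2, 3, 5, 8, 8, 7]
b = [8, 7, 1, 2, 2, 3, 5, 8]   # same boundary traversed from a different start
print(similarity_R(a, b))      # 0.0 for the unshifted alignment
print(best_R(a, b))            # inf: a perfect match is found after shifting
```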
                                                                                                                                                                            classes of objects. For example, if the class of string 1.a had been unknown, the smallest value of R result-
                                                                                                                                                                            ing from comparing this string against sample (prototype) strings of class 1 would have been 4.7 [see
             EXAMPLE 12.4 : String matching.                                                                                                                                Fig. 12.18(e)]. By contrast, the largest value in comparing it against strings of class 2 would have been
            Figures 12.18(a) and (b) show sample boundaries from each of two object classes, which were approxi-                                                            1.24 [see Fig. 12.18(g)]. This result would have led to the conclusion that string 1.a is a member of object
            mated by a polygonal fit (see Section 11.2). Figures 12.18(c) and (d) show the polygonal approximations                                                         class 1. This approach to classification is analogous to the minimum-distance classifier introduced earlier.
www.EBooksWorld.ir www.EBooksWorld.ir
\[ r_j(\mathbf{x}) = \frac{1}{p(\mathbf{x})} \sum_{k=1}^{N_c} L_{kj}\, p(\mathbf{x} \mid c_k)\, P(c_k) \tag{12-17} \]

where p(x | ck) is the probability density function (PDF) of the patterns from class ck, and P(ck) is the probability of occurrence of class ck (sometimes P(ck) is referred to as the a priori, or simply the prior, probability). Because 1/p(x) is positive and common to all the rj(x), j = 1, 2, …, Nc, it can be dropped from Eq. (12-17) without affecting the relative order of these functions from the smallest to the largest value. The expression for the average loss then reduces to

\[ r_j(\mathbf{x}) = \sum_{k=1}^{N_c} L_{kj}\, p(\mathbf{x} \mid c_k)\, P(c_k) \tag{12-18} \]

Given an unknown pattern, the classifier has Nc possible classes from which to choose. If the classifier computes r1(x), r2(x), …, rNc(x) for each pattern x and assigns the pattern to the class with the smallest loss, the total average loss with respect to all decisions will be minimum. The classifier that minimizes the total average loss is called the Bayes classifier. This classifier assigns an unknown pattern x to class ci if ri(x) < rj(x) for j = 1, 2, …, Nc; j ≠ i. In other words, x is assigned to class ci if

\[ \sum_{k=1}^{N_c} L_{ki}\, p(\mathbf{x} \mid c_k)\, P(c_k) \; < \; \sum_{q=1}^{N_c} L_{qj}\, p(\mathbf{x} \mid c_q)\, P(c_q) \tag{12-19} \]

In the common special case of the 0-1 loss function (zero loss for correct decisions and unit loss for all incorrect ones), the Bayes classifier reduces to computing the decision functions

\[ d_j(\mathbf{x}) = p(\mathbf{x} \mid c_j)\, P(c_j) \qquad j = 1, 2, \dots, N_c \tag{12-24} \]

and assigns a pattern to class ci if di(x) > dj(x) for all j ≠ i. This is exactly the same process described in Eq. (12-5), but we are now dealing with decision functions that have been shown to be optimal in the sense that they minimize the average loss in misclassification.

For the optimality of Bayes decision functions to hold, the probability density functions of the patterns in each class, as well as the probability of occurrence of each class, must be known. The latter requirement usually is not a problem. For instance, if all classes are equally likely to occur, then P(cj) = 1/Nc. Even if this condition is not true, these probabilities generally can be inferred from knowledge of the problem. Estimating the probability density functions p(x | cj) is more difficult. If the pattern vectors are n-dimensional, then p(x | cj) is a function of n variables. If the form of p(x | cj) is not known, estimating it requires using multivariate estimation methods. These methods are difficult to apply in practice, especially if the number of representative patterns from each class is not large, or if the probability density functions are not well behaved. For these reasons, uses of the Bayes classifier often are based on assuming an analytic expression for the density functions. This in turn reduces the problem to one of estimating the necessary parameters from sample patterns from each class using training patterns. By far, the most prevalent form assumed for p(x | cj) is the Gaussian probability density function. The closer this assumption is to reality, the closer the Bayes classifier approaches the minimum average loss in classification.
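As a sketch of how Eqs. (12-18) and (12-24) are used in practice, the following Python fragment computes the average losses rj(x) for a given pattern; with the 0-1 loss it is equivalent to maximizing dj(x). The likelihoods and priors supplied at the end are hypothetical.

```python
# Minimal sketch of the Bayes classifier of Eqs. (12-18) and (12-24).
import numpy as np

def bayes_classify(likelihoods, priors, L=None):
    """likelihoods: p(x|c_k) for k = 1..Nc; priors: P(c_k); L: Nc x Nc loss matrix."""
    likelihoods, priors = np.asarray(likelihoods), np.asarray(priors)
    if L is None:
        L = 1.0 - np.eye(len(priors))      # 0-1 loss: zero on the diagonal, one elsewhere
    # r_j = sum_k L_kj p(x|c_k) P(c_k), Eq. (12-18); pick the smallest average loss.
    r = L.T @ (likelihoods * priors)
    return int(np.argmin(r))

# Hypothetical two-class example: x is three times more likely under class c_2,
# but class c_1 is assumed four times more likely a priori, so c_1 wins.
print(bayes_classify([0.2, 0.6], [0.8, 0.2]))  # -> 0 (class c_1)
```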
Consider a 1-D problem (n = 1) involving two pattern classes governed by Gaussian densities. The Bayes decision functions of Eq. (12-24) become

\[ d_j(x) = p(x \mid c_j)\, P(c_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{(x - m_j)^2}{2\sigma_j^2}}\, P(c_j) \qquad j = 1, 2 \tag{12-25} \]

where the patterns are now scalars, denoted by x. Figure 12.19 shows a plot of the probability density functions for the two classes. The boundary between the two classes is a single point, x0, such that d1(x0) = d2(x0). If the two classes are equally likely to occur, then P(c1) = P(c2) = 1/2, and the decision boundary is the value of x0 for which p(x0 | c1) = p(x0 | c2). This point is the intersection of the two probability density functions, as shown in Fig. 12.19. Any pattern (point) to the right of x0 is classified as belonging to class c1. Similarly, any pattern to the left of x0 is classified as belonging to class c2. When the classes are not equally likely to occur, x0 moves to the left if class c1 is more likely to occur or, conversely, it moves to the right if class c2 is more likely to occur. This result is to be expected, because the classifier is trying to minimize the loss of misclassification. For instance, in the extreme case, if class c2 never occurs, the classifier would never make a mistake by always assigning all patterns to class c1 (that is, x0 would move to negative infinity).

FIGURE 12.19 [Plot of the probability density functions of the two 1-D pattern classes; the decision boundary x0 lies at the intersection of the two curves when the classes are equally likely.]

In the n-dimensional case, the Gaussian density of the vectors in the jth pattern class has the form

\[ p(\mathbf{x} \mid c_j) = \frac{1}{(2\pi)^{n/2}\, \big|\mathbf{C}_j\big|^{1/2}}\; e^{-\frac{1}{2} (\mathbf{x} - \mathbf{m}_j)^T \mathbf{C}_j^{-1} (\mathbf{x} - \mathbf{m}_j)} \tag{12-26} \]

where each density is specified completely by its mean vector mj and covariance matrix Cj, which are defined as

\[ \mathbf{m}_j = E_j\{\mathbf{x}\} \tag{12-27} \]
and
\[ \mathbf{C}_j = E_j\big\{ (\mathbf{x} - \mathbf{m}_j)(\mathbf{x} - \mathbf{m}_j)^T \big\} \tag{12-28} \]

where Ej{·} is the expected value of the argument over the patterns of class cj. In Eq. (12-26), n is the dimensionality of the pattern vectors, and |Cj| is the determinant of matrix Cj. Approximating the expected value Ej by the sample average yields an estimate of the mean vector and covariance matrix:

\[ \mathbf{m}_j = \frac{1}{n_j} \sum_{\mathbf{x} \in c_j} \mathbf{x} \tag{12-29} \]
and
\[ \mathbf{C}_j = \frac{1}{n_j} \sum_{\mathbf{x} \in c_j} \mathbf{x}\mathbf{x}^T - \mathbf{m}_j \mathbf{m}_j^T \tag{12-30} \]

where nj is the number of sample pattern vectors from class cj, and the summation is taken over these vectors. We will give an example later in this section of how to use these two expressions.

The covariance matrix is symmetric and positive semidefinite. Its kth diagonal element is the variance of the kth element of the pattern vectors. The kjth off-diagonal matrix element is the covariance of elements xk and xj in these vectors. The multivariate Gaussian density function reduces to the product of the univariate Gaussian density of each element of x when the off-diagonal elements of the covariance matrix are zero, which happens when the vector elements xk and xj are uncorrelated.

From Eq. (12-24), the Bayes decision function for class cj is dj(x) = p(x | cj)P(cj). However, the exponential form of the Gaussian density allows us to work with the natural logarithm of this decision function, which is more convenient. In other words, we can use the form

\[ d_j(\mathbf{x}) = \ln\big[\, p(\mathbf{x} \mid c_j)\, P(c_j) \,\big] = \ln p(\mathbf{x} \mid c_j) + \ln P(c_j) \tag{12-31} \]

This expression is equivalent to Eq. (12-24) in terms of classification performance, because the logarithm is a monotonically increasing function. That is, the numerical order of the decision functions in Eqs. (12-24) and (12-31) is the same.
            Probability
            density functions                                                                 p( x c2 )                                                                                                                order of the decision functions in Eqs. (12-24) and (12-31) is the same. Substituting
            for two 1-D                                                                                                                                                                                                Eq. (12-26) into Eq. (12-31) yields
                                                  Probability density
            pattern classes.
            Point x0 (at the
                                                                                                                                                                                                                                              ( )         n        1        1
                                                                                                                                                                                                                                                            ln 2p − ln C j −  x − m j             (              )        (       )
                                                                                                                                                                                                                                                                                                                       C−j 1 x − m j  (12-32)
                                                                                                                                                                                                                                                                                                                   T
            intersection of the                                                                                             p( x c1 )                                                                                           d j ( x ) = ln P c j −
            two curves) is the
                                                                                                                                                                                          As noted in Section 6.7                                         2        2        2                                                      
                                                                                                                                                                                          [see Eq. (6-49)], the
            Bayes decision                                                                                                                                                                square root of the
            boundary if the                                                                                                                                                               rightmost term in this       The term ( n 2 ) ln 2p is the same for all classes, so it can be eliminated from Eq.
                                                                                                                                                                                          equation is called the
            two classes are                                                                                                                                                               Mahalanobis distance.        (12-32), which then becomes
            equally likely to
                                                                                                                                                                                                                                                     ( )         1         1
                                                                                                                                                                                                                                                                                            (          )           (           )
            occur.
                                                                                                                                                                                                                                                                   ln C j −  x − m j                          C−j 1 x − m j 
                                                                                                                                                                                                                                                                                                           T
                                                                                                                                                        x                                                                             d j ( x ) = ln P c j −                                                                           (12-33)
                                                                                   m2    x0            m1                                                                                                                                                        2         2                                               
This equation gives the Bayes decision functions for Gaussian pattern classes under the condition of a 0-1 loss function.

The decision functions in Eq. (12-33) are hyperquadrics (quadratic functions in n-dimensional space), because no terms higher than the second degree in the components of x appear in the equation. Clearly, then, the best that a Bayes classifier for Gaussian patterns can do is to place a second-order decision boundary between each pair of pattern classes. If the pattern populations are truly Gaussian, no other boundary would yield a lesser average loss in classification.

If all covariance matrices are equal, then C_j = C for j = 1, 2, …, N_c. By expanding Eq. (12-33), and dropping all terms that do not depend on j, we obtain

    d_j(\mathbf{x}) = \ln P(c_j) + \mathbf{x}^T \mathbf{C}^{-1}\mathbf{m}_j - \frac{1}{2}\mathbf{m}_j^T \mathbf{C}^{-1}\mathbf{m}_j        (12-34)

which are linear decision functions (hyperplanes) for j = 1, 2, …, N_c.

If, in addition, C = I, where I is the identity matrix, and also if the classes are equally likely (i.e., P(c_j) = 1/N_c for all j), then we can drop the term ln P(c_j) because it would be the same for all values of j. Equation (12-34) then becomes

    d_j(\mathbf{x}) = \mathbf{m}_j^T \mathbf{x} - \frac{1}{2}\mathbf{m}_j^T \mathbf{m}_j, \qquad j = 1, 2, \ldots, N_c        (12-35)

which we recognize as the decision functions for a minimum-distance classifier [see Eq. (12-4)]. Thus, as mentioned earlier, the minimum-distance classifier is optimum in the Bayes sense if (1) the pattern classes follow a Gaussian distribution, (2) all covariance matrices are equal to the identity matrix, and (3) all classes are equally likely to occur. Gaussian pattern classes satisfying these conditions are spherical clouds of identical shape in n dimensions (called hyperspheres). The minimum-distance classifier establishes a hyperplane between every pair of classes, with the property that the hyperplane is the perpendicular bisector of the line segment joining the centers of the pair of hyperspheres. In 2-D, the patterns are distributed in circular regions, and the boundaries become lines that bisect the line segment joining the centers of every pair of such circles.
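To make these decision functions concrete, the following is a minimal NumPy sketch (ours, not from the text; the function names are our own) of the quadratic form in Eq. (12-33) and of its equal-covariance simplification in Eq. (12-34). A pattern is assigned to the class whose decision value is largest; with C = I and equal priors, linear_decision reduces to the minimum-distance form of Eq. (12-35).

```python
import numpy as np

def bayes_decision(x, m, C, prior):
    """Quadratic Gaussian decision function of Eq. (12-33) for one class."""
    d = x - m
    _, logdet = np.linalg.slogdet(C)      # ln|C|, numerically stable
    maha = d @ np.linalg.solve(C, d)      # (x - m)^T C^{-1} (x - m)
    return np.log(prior) - 0.5 * logdet - 0.5 * maha

def linear_decision(x, m, C_inv, prior):
    """Linear decision function of Eq. (12-34); valid when all C_j are equal."""
    return np.log(prior) + x @ (C_inv @ m) - 0.5 * m @ (C_inv @ m)
```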
EXAMPLE 12.5: A Bayes classifier for 3-D patterns.

FIGURE 12.20 Two simple pattern classes and the portion of their Bayes decision boundary (shaded) that intersects the cube.

We illustrate the mechanics of the preceding development using the simple patterns in Fig. 12.20. We assume that the patterns are samples from two Gaussian populations, and that the classes are equally likely to occur. Applying Eq. (12-29) to the patterns in the figure results in

    \mathbf{m}_1 = \frac{1}{4}\begin{bmatrix} 3 \\ 1 \\ 1 \end{bmatrix} \quad \text{and} \quad \mathbf{m}_2 = \frac{1}{4}\begin{bmatrix} 1 \\ 3 \\ 3 \end{bmatrix}

Similarly, applying Eq. (12-30) to the two classes gives

    \mathbf{C}_1 = \mathbf{C}_2 = \frac{1}{16}\begin{bmatrix} 3 & 1 & 1 \\ 1 & 3 & -1 \\ 1 & -1 & 3 \end{bmatrix}

The inverse of this matrix is

    \mathbf{C}_1^{-1} = \mathbf{C}_2^{-1} = \begin{bmatrix} 8 & -4 & -4 \\ -4 & 8 & 4 \\ -4 & 4 & 8 \end{bmatrix}

Next, we obtain the decision functions. Equation (12-34) applies because the covariance matrices are equal, and we are assuming that the classes are equally likely:

    d_j(\mathbf{x}) = \mathbf{x}^T \mathbf{C}^{-1}\mathbf{m}_j - \frac{1}{2}\mathbf{m}_j^T \mathbf{C}^{-1}\mathbf{m}_j

Carrying out the vector-matrix expansion, we obtain the two decision functions:

    d_1(\mathbf{x}) = 4x_1 - 1.5 \quad \text{and} \quad d_2(\mathbf{x}) = -4x_1 + 8x_2 + 8x_3 - 5.5

The decision boundary separating the two classes is then

    d_1(\mathbf{x}) - d_2(\mathbf{x}) = 8x_1 - 8x_2 - 8x_3 + 4 = 0

Figure 12.20 shows a section of this planar surface. Note that the classes were separated effectively.
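As a check on the arithmetic in this example, here is a small NumPy sketch (ours, not the book's) that reproduces the means, the covariance matrix, and the decision-function coefficients. We assume the vertex-to-class assignment shown in Fig. 12.20, with class c1 consisting of (0,0,0), (1,0,0), (1,0,1), and (1,1,0), and class c2 the remaining four cube vertices; this assignment is consistent with the results above.

```python
import numpy as np

# Cube vertices assigned to each class (assumed from Fig. 12.20).
X1 = np.array([[0, 0, 0], [1, 0, 0], [1, 0, 1], [1, 1, 0]], dtype=float)
X2 = np.array([[0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 1, 1]], dtype=float)

def estimates(X):
    """Sample mean and covariance, Eqs. (12-29) and (12-30)."""
    m = X.mean(axis=0)
    C = (X.T @ X) / len(X) - np.outer(m, m)
    return m, C

m1, C1 = estimates(X1)       # m1 = [0.75, 0.25, 0.25]
m2, C2 = estimates(X2)       # m2 = [0.25, 0.75, 0.75]; C2 equals C1
C_inv = np.linalg.inv(C1)

# Linear decision functions of Eq. (12-34), equal priors (ln P dropped):
w1, b1 = C_inv @ m1, -0.5 * m1 @ C_inv @ m1   # d1(x) = 4*x1 - 1.5
w2, b2 = C_inv @ m2, -0.5 * m2 @ C_inv @ m2   # d2(x) = -4*x1 + 8*x2 + 8*x3 - 5.5
print(w1, b1, w2, b2)
```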
is a shade of blue, x2 a shade of green, and so on. If the images are of size 512 × 512 pixels, each stack of four multispectral images can be represented by 262,144 four-dimensional pattern vectors. As noted
            previously, the Bayes classifier for Gaussian patterns requires estimates of the mean vector and covari-
            ance matrix for each class. In remote sensing applications, these estimates are obtained using training
            multispectral data whose classes are known from each region of interest (this knowledge sometimes is
            referred to as ground truth). The resulting vectors are then used to estimate the required mean vectors
            and covariance matrices, as in Example 12.5.
               Figures 12.21(a) through (d) show four 512 × 512 multispectral images of the Washington, D.C. area,
            taken in the bands mentioned in the previous paragraph. We are interested in classifying the pixels in
            these images into one of three pattern classes: water, urban development, or vegetation. The masks in
            Fig. 12.21(e) were superimposed on the images to extract samples representative of these three classes.
            Half of the samples were used for training (i.e., for estimating the mean vectors and covariance matri-
            ces), and the other half were used for independent testing to assess classifier performance. We assume
that the a priori probabilities are equal: P(c_j) = 1/3 for j = 1, 2, 3.
               Table 12.1 summarizes the classification results we obtained with the training and test data sets. The
            percentage of training and test pattern vectors recognized correctly was about the same with both data
sets, indicating that the classifier did not over-fit the training data. The larg-
            est error in both cases was with patterns from the urban area. This is not unexpected, as vegetation is
            present there also (note that no patterns in the vegetation or urban areas were misclassified as water).
            Figure 12.21(f) shows as black dots the training and test patterns that were misclassified, and as white
            dots the patterns that were classified correctly. No black dots are visible in region 1, because the seven
            misclassified points are very close to the boundary of the white region. You can compute from the num-
            bers in the table that the correct recognition rate was 96.4% for the training patterns, and 96.1% for the
            test patterns.
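A sketch of this train-and-test workflow, in our own notation and with hypothetical arrays (the book does not give code), might look as follows: it estimates the per-class statistics from training pixels using Eqs. (12-29)-(12-30), classifies with Eq. (12-33), and tabulates a confusion matrix like Table 12.1.

```python
import numpy as np

def fit_class_stats(X, y, n_classes):
    """Per-class mean and covariance, Eqs. (12-29) and (12-30)."""
    stats = []
    for j in range(n_classes):
        Xj = X[y == j]
        m = Xj.mean(axis=0)
        C = (Xj.T @ Xj) / len(Xj) - np.outer(m, m)
        stats.append((m, C))
    return stats

def classify(X, stats, priors):
    """Assign each pattern to the class maximizing Eq. (12-33)."""
    scores = []
    for (m, C), P in zip(stats, priors):
        d = X - m
        _, logdet = np.linalg.slogdet(C)
        maha = np.einsum('ij,jk,ik->i', d, np.linalg.inv(C), d)
        scores.append(np.log(P) - 0.5 * logdet - 0.5 * maha)
    return np.argmax(scores, axis=0)

# X_train, y_train, X_test, y_test are hypothetical arrays of 4-D pixel
# vectors and labels (0 = water, 1 = urban, 2 = vegetation):
# stats = fit_class_stats(X_train, y_train, 3)
# pred = classify(X_test, stats, priors=[1/3, 1/3, 1/3])
# confusion = np.zeros((3, 3), int)
# np.add.at(confusion, (y_test, pred), 1)  # rows: true class; cols: assigned
```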
               Figures 12.21(g) through (i) are more interesting. Here, we let the system classify all image pixels into
            one of the three categories. Figure 12.21(g) shows in white all pixels that were classified as water. Pixels
            not classified as water are shown in black. We see that the Bayes classifier did an excellent job of deter-
            mining which parts of the image were water. Figure 12.21(h) shows in white all pixels classified as urban
            development; observe how well the system performed in recognizing urban features, such as the bridges
            and highways. Figure 12.21(i) shows the pixels classified as vegetation. The center area in Fig. 12.21(h)
            shows a high concentration of white pixels in the downtown area, with the density decreasing as a func-
            tion of distance from the center of the image. Figure 12.21(i) shows the opposite effect, indicating the
least vegetation toward the center of the image, where urban development is the densest.

FIGURE 12.21 Bayes classification of multispectral data. (a)-(d) Images in the visible blue, visible green, visible red, and near infrared wavelength bands. (e) Masks for regions of water (labeled 1), urban development (labeled 2), and vegetation (labeled 3). (f) Results of classification; the black dots denote points classified incorrectly, and the other (white) points were classified correctly. (g) All image pixels classified as water (in white). (h) All image pixels classified as urban development (in white). (i) All image pixels classified as vegetation (in white).

We mentioned in Section 10.3 when discussing Otsu's method that thresholding may be viewed as a Bayes classification problem, which optimally assigns patterns to two or more classes. In fact, as the previous example shows, pixel-by-pixel classification may be viewed as a segmentation that partitions an image into two or more possible types of regions. If only a single variable (e.g., intensity) is used, then Eq. (12-24) becomes an optimum function that similarly partitions an image based on the intensity of its pixels, as we did in Section 10.3. Keep in mind that optimality requires that the PDF and a priori probability of each class be known. As we
TABLE 12.1 Bayes classification of multispectral image data. Classes 1, 2, and 3 are water, urban, and vegetation, respectively.

Training Patterns:
  Class   No. of Samples   Classified into Class (1 / 2 / 3)   % Correct
    1          484              482 /   2 /   0                  99.6
    2          933                0 / 885 /  48                  94.9
    3          483                0 /  19 / 464                  96.1

Test Patterns:
  Class   No. of Samples   Classified into Class (1 / 2 / 3)   % Correct
    1          483              478 /   3 /   2                  98.9
    2          932                0 / 880 /  52                  94.4
    3          482                0 /  16 / 466                  96.7

have mentioned previously, estimating these densities is not a trivial task. If assumptions have to be made (e.g., as in assuming Gaussian densities), then the degree of optimality achieved in classification depends on how close the assumptions are to reality.
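For the 1-D case illustrated in Fig. 12.19, these ideas can be made concrete with a few lines of code. The sketch below (ours; the Gaussian parameters are made up) finds the Bayes boundary x0 by equating the two weighted densities, which is the rule of Eq. (12-24) restricted to a single intensity variable:

```python
import numpy as np

def bayes_threshold(m1, s1, P1, m2, s2, P2):
    """Intersection(s) of P1*N(m1, s1^2) and P2*N(m2, s2^2): the 1-D
    Bayes decision boundary x0 of Fig. 12.19."""
    # Equating the logs of the two weighted densities gives a quadratic in x.
    a = 1 / (2 * s2**2) - 1 / (2 * s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = (m2**2 / (2 * s2**2) - m1**2 / (2 * s1**2)
         + np.log((P1 * s2) / (P2 * s1)))
    if a == 0:                      # equal variances: a single boundary
        return np.array([-c / b])
    return np.roots([a, b, c])

# Equal priors and equal variances: x0 lies midway between the means.
print(bayes_threshold(m1=80, s1=15, P1=0.5, m2=160, s2=15, P2=0.5))  # [120.]
```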
12.5 NEURAL NETWORKS AND DEEP LEARNING

The principal objectives of the material in this section and in Section 12.6 are to present an introduction to deep neural networks, and to derive the equations that are the foundation of deep learning. We will discuss two types of networks. In this section, we focus attention on multilayer, fully connected neural networks, whose inputs are pattern vectors of the form introduced in Section 12.2. In Section 12.6, we will discuss convolutional neural networks, which are capable of accepting images as inputs. We follow the same basic approach in presenting the material in these two sections. That is, we begin by developing the equations that describe how an input is mapped through the networks to generate the outputs that are used to classify that input. Then, we derive the equations of backpropagation, which are the tools used to train both types of networks. We give examples in both sections that illustrate the power of deep neural networks and deep learning for solving complex pattern classification problems.

BACKGROUND

The essence of the material that follows is the use of a multitude of elemental nonlinear computing elements (called artificial neurons), organized as networks whose interconnections are similar in some respects to the way in which neurons are interconnected in the visual cortex of mammals. The resulting models are referred to by various names, including neural networks, neurocomputers, parallel distributed processing models, neuromorphic systems, layered self-adaptive networks, and connectionist models. Here, we use the name neural networks, or neural nets for short. We use these networks as vehicles for adaptively learning the parameters of decision functions via successive presentations of training patterns.

Interest in neural networks dates back to the early 1940s, as exemplified by the work of McCulloch and Pitts [1943], who proposed neuron models in the form of binary thresholding devices, and stochastic algorithms involving sudden 0-1 and 1-0 changes of states, as the basis for modeling neural systems. Subsequent work by Hebb [1949] was based on mathematical models that attempted to capture the concept of learning by reinforcement or association.

During the mid-1950s and early 1960s, a class of so-called learning machines originated by Rosenblatt [1959, 1962] caused a great deal of excitement among researchers and practitioners of pattern recognition. The reason for the interest in these machines, called perceptrons, was the development of mathematical proofs showing that perceptrons, when trained with linearly separable training sets (i.e., training sets separable by a hyperplane), would converge to a solution in a finite number of iterative steps. The solution took the form of parameters (coefficients) of hyperplanes that were capable of correctly separating the classes represented by patterns of the training set.

Unfortunately, the expectations following discovery of what appeared to be a well-founded theoretical model of learning soon met with disappointment. The basic perceptron, and some of its generalizations, were inadequate for most pattern recognition tasks of practical significance. Subsequent attempts to extend the power of perceptron-like machines by considering multiple layers of these devices lacked effective training algorithms, such as those that had created interest in the perceptron itself. The state of the field of learning machines in the mid-1960s was summarized by Nilsson [1965]. A few years later, Minsky and Papert [1969] presented a discouraging analysis of the limitations of perceptron-like machines. This view was held as late as the mid-1980s, as evidenced by comments made by Simon [1986]. In this work, originally published in French in 1984, Simon dismisses the perceptron under the heading "Birth and Death of a Myth."

More recent results by Rumelhart, Hinton, and Williams [1986] dealing with the development of new training algorithms for multilayers of perceptron-like units have changed matters considerably. Their basic method, called backpropagation (backprop for short), provides an effective training method for multilayer networks. Although this training algorithm cannot be shown to converge to a solution in the sense of the proof for the single-layer perceptron, backpropagation is capable of generating results that have revolutionized the field of pattern recognition.

The approaches to pattern recognition we have studied up to this point rely on human-engineered techniques to transform raw data into formats suitable for computer processing. The methods of feature extraction we studied in Chapter 11 are examples of this. Unlike these approaches, neural networks can use backpropagation to automatically learn representations suitable for recognition, starting with raw data. Each layer in the network "refines" the representation into more abstract levels. This type of multilayered learning is commonly referred to as deep learning, and this capability is one of the underlying reasons why applications of neural networks have been so successful. As we noted at the beginning of this section, practical implementations of deep learning generally are associated with large data sets.

Of course, these are not "magical" systems that assemble themselves. Human intervention is still required for specifying parameters such as the number of layers, the number of artificial neurons per layer, and various coefficients that are problem
dependent. Teaching proper recognition to a complex multilayer neural network is not a science; rather, it is an art that requires considerable knowledge and experimentation on the part of the designer. Countless applications of pattern recognition, especially in constrained environments, are best handled by more "traditional" methods. A good example of this is stylized font recognition. It would be senseless to develop a neural network to recognize the E-13B font we studied in Fig. 12.11. A minimum-distance classifier implemented on a hard-wired architecture is the ideal solution to this problem, provided that interest is limited to reading only the E-13B font printed on bank checks. On the other hand, neural networks have proved to be the ideal solution if the scope of application is expanded to require that all relevant text written on checks, including cursive script, be read with high accuracy.

Deep learning has shined in applications that defy other methods of solution. In the two decades following the introduction of backpropagation, neural networks have been used successfully in a broad range of applications. Some of them, such as speech recognition, have become an integral part of everyday life. When you speak into a smart phone, the nearly flawless recognition is performed by a neural network. This type of performance was unachievable just a few years ago. Other applications from which you benefit, perhaps without realizing it, are smart filters that learn user preferences for rerouting spam and other junk mail from email accounts, and the systems that read zip codes on postal mail. Often, you see television clips of vehicles navigating autonomously, and robots that are capable of interacting with their environment. Most are solutions based on neural networks. Less familiar applications include the automated discovery of new medicines, the prediction of gene mutations in DNA research, and advances in natural language understanding.

Although the list of practical uses of neural nets is long, applications of this technology in image pattern classification have been slower in gaining popularity. As you will learn shortly, using neural nets in image processing is based principally on neural network architectures called convolutional neural nets (denoted by CNNs or ConvNets). One of the earliest well-known applications of CNNs is the work of LeCun et al. [1989] for reading handwritten U.S. postal zip codes. A number of other applications followed shortly thereafter, but it was not until the results of the 2012 ImageNet Challenge were published (e.g., see Krizhevsky, Sutskever, and Hinton [2012]) that CNNs became widely used in image pattern recognition. Today, this is the approach of choice for addressing complex image recognition tasks.

The neural network literature is vast and rapidly evolving, so as usual, our approach is to focus on fundamentals. In this and the following sections, we will establish the foundation of how neural nets are trained, and how they operate after training. We will begin by briefly discussing perceptrons. Although these computing elements are not used per se in current neural network architectures, the operations they perform are almost identical to those of artificial neurons, which are the basic computing units of neural nets. In fact, an introduction to neural networks would be incomplete without a discussion of perceptrons. We will follow this discussion by developing in detail the theoretical foundation of backpropagation. After developing the basic backpropagation equations, we will recast them in matrix form, which reduces the training and operation of neural nets to a simple, straightforward cascade of matrix multiplications.

After studying several examples of fully connected neural nets, we will follow a similar approach in developing the foundation of CNNs, including how they differ from fully connected neural nets, and how their training is different. This is followed by several examples of how CNNs are used for image pattern classification.

THE PERCEPTRON

A single perceptron unit learns a linear boundary between two linearly separable pattern classes. Figure 12.22(a) shows the simplest possible example in two dimensions: two pattern classes, consisting of a single pattern each. A linear boundary in 2-D is a straight line with equation y = ax + b, where coefficient a is the slope and b is the y-intercept. Note that if b = 0, the line goes through the origin. Therefore, the function of parameter b is to displace the line from the origin without affecting its slope. For this reason, this "floating" coefficient that is not multiplied by a coordinate is often referred to as the bias, the bias coefficient, or the bias weight.

FIGURE 12.22 (a) The simplest two-class example in 2-D, showing one possible decision boundary out of an infinite number of such boundaries. (b) Same as (a), but with the decision boundary expressed using more general notation.

We are interested in a line that separates the two classes in Fig. 12.22. This is a line positioned in such a way that pattern (x1, y1) from class c1 lies on one side of the line, and pattern (x2, y2) from class c2 lies on the other. The locus of points (x, y) that are on the line satisfies the equation y − ax − b = 0. It then follows that any point on one side of the line would yield a positive value when its coordinates are plugged into this equation, and conversely for a point on the other side.
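As a quick illustration (ours, with made-up numbers), the sign of y − ax − b tells us on which side of the line a point falls:

```python
# Hypothetical line y = 2x + 1 (a = 2, b = 1) and two test points.
a, b = 2.0, 1.0

def side(x, y):
    """Positive on one side of the line y - ax - b = 0, negative on the other."""
    return y - a * x - b

print(side(0.0, 5.0))   #  4.0 -> one side of the line
print(side(3.0, 2.0))   # -5.0 -> the other side
```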
Generally, we work with patterns in much higher dimensions than two, so we need more general notation. Points in n dimensions are vectors. The components of a vector, x1, x2, …, xn, are the coordinates of the point. For the coefficients of the boundary separating the two classes, we use the notation w1, w2, …, wn, wn+1, where wn+1 is the bias. The general equation of our line using this notation is w1x1 + w2x2 + w3 = 0 (we can express this equation in slope-intercept form as x2 = −(w1/w2)x1 − w3/w2). Figure 12.22(b) is the same as (a), but using this notation. Comparing the two figures, we see that y = x2, x = x1, a = −w1/w2, and b = −w3/w2. Equipped with our more
general notation, we say that an arbitrary point (x1, x2) is on the positive side of a line if w1x1 + w2x2 + w3 > 0, and conversely for any point on the negative side. For points in 3-D, we work with the equation of a plane, w1x1 + w2x2 + w3x3 + w4 = 0, but would perform exactly the same test to see if a point lies on the positive or negative side of the plane. For a point in n dimensions, the test would be against a hyperplane, whose equation is

    w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_{n+1} = 0        (12-36)

This equation is expressed in summation form as

    \sum_{i=1}^{n} w_i x_i + w_{n+1} = 0        (12-37)

or in vector form as

    \mathbf{w}^T \mathbf{x} + w_{n+1} = 0        (12-38)

where w and x are n-dimensional column vectors and w^T x is the dot (inner) product of the two vectors. Because the inner product is commutative, we can express Eq. (12-38) in the equivalent form x^T w + w_{n+1} = 0. We refer to w as a weight vector and, as above, to w_{n+1} as a bias. Because the bias is a weight that is always multiplied by 1, sometimes we avoid repetition by using the term weights, coefficients, or parameters when referring to the bias and the elements of a weight vector collectively.

Stating the class separation problem in general form we say that, given any pattern vector x from a vector population, we want to find a set of weights with the property

    \mathbf{w}^T \mathbf{x} + w_{n+1} \; \begin{cases} > 0 & \text{if } \mathbf{x} \in c_1 \\ < 0 & \text{if } \mathbf{x} \in c_2 \end{cases}        (12-39)

[It is customary to associate > with class c1 and < with class c2, but the sense of the inequality is arbitrary, provided that you are consistent. Note that this equation implements a linear decision function. Linearly separable classes satisfy Eq. (12-39); that is, they are separable by single hyperplanes.]

Finding a line that separates two linearly separable pattern classes in 2-D can be done by inspection. Finding a separating plane by visual inspection of 3-D data is more difficult, but it is doable. For n > 3, finding a separating hyperplane by inspection becomes impossible in general. We have to resort instead to an algorithm to find a solution. The perceptron is an implementation of such an algorithm. It attempts to find a solution by iteratively stepping through the patterns of each of two classes. It starts with an arbitrary weight vector and bias, and is guaranteed to converge in a finite number of iterations if the classes are linearly separable.

The perceptron algorithm is simple. Let α > 0 denote a correction increment (also called the learning increment or the learning rate), let w(1) be a vector with arbitrary values, and let w_{n+1}(1) be an arbitrary constant. Then, do the following for k = 2, 3, …: For a pattern vector, x(k), at step k,

1) If x(k) ∈ c_1 and w^T(k)x(k) + w_{n+1}(k) ≤ 0, let

    \mathbf{w}(k+1) = \mathbf{w}(k) + \alpha\,\mathbf{x}(k), \qquad w_{n+1}(k+1) = w_{n+1}(k) + \alpha        (12-40)

2) If x(k) ∈ c_2 and w^T(k)x(k) + w_{n+1}(k) ≥ 0, let

    \mathbf{w}(k+1) = \mathbf{w}(k) - \alpha\,\mathbf{x}(k), \qquad w_{n+1}(k+1) = w_{n+1}(k) - \alpha        (12-41)

3) Otherwise, let

    \mathbf{w}(k+1) = \mathbf{w}(k), \qquad w_{n+1}(k+1) = w_{n+1}(k)        (12-42)

The correction in Eq. (12-40) is applied when the pattern is from class c1 and Eq. (12-39) does not give a positive response. Similarly, the correction in Eq. (12-41) is applied when the pattern is from class c2 and Eq. (12-39) does not give a negative response. As Eq. (12-42) shows, no change is made when Eq. (12-39) gives the correct response.

The notation in Eqs. (12-40) through (12-42) can be simplified if we add a 1 at the end of every pattern vector and include the bias in the weight vector. That is, we define x ≜ [x1, x2, …, xn, 1]^T and w ≜ [w1, w2, …, wn, wn+1]^T. Then, Eq. (12-39) becomes

    \mathbf{w}^T \mathbf{x} \; \begin{cases} > 0 & \text{if } \mathbf{x} \in c_1 \\ < 0 & \text{if } \mathbf{x} \in c_2 \end{cases}        (12-43)

where both vectors are now (n + 1)-dimensional. In this formulation, x and w are referred to as augmented pattern and weight vectors, respectively. The algorithm in Eqs. (12-40) through (12-42) then becomes: For any pattern vector, x(k), at step k

1′) If x(k) ∈ c_1 and w^T(k)x(k) ≤ 0, let

    \mathbf{w}(k+1) = \mathbf{w}(k) + \alpha\,\mathbf{x}(k)        (12-44)

2′) If x(k) ∈ c_2 and w^T(k)x(k) ≥ 0, let

    \mathbf{w}(k+1) = \mathbf{w}(k) - \alpha\,\mathbf{x}(k)        (12-45)

3′) Otherwise, let

    \mathbf{w}(k+1) = \mathbf{w}(k)        (12-46)

where the starting weight vector, w(1), is arbitrary and, as above, α is a positive constant. The procedure implemented by Eqs. (12-40)–(12-42) or (12-44)–(12-46) is called the perceptron training algorithm. The perceptron convergence theorem states that the algorithm is guaranteed to converge to a solution (i.e., a separating hyperplane) in a finite number of steps if the two pattern classes are linearly separable (see Problem 12.15). Normally, Eqs. (12-44)–(12-46) are the basis for implementing the perceptron training algorithm, and we will use them in the following paragraphs of this section.
                                                                                                                          (12-40)
                                                                        vn+1 (k + 1) = vn+1 (k ) + a                                                                                  of this section. However, the notation in Eqs. (12-40)–(12-42), in which the bias is
www.EBooksWorld.ir www.EBooksWorld.ir
1 1  1  0 
                                                                                                                                                           We have gone through a complete training epoch with at least one correction, so we cycle through the
                                                                                                                                                           training set again.
                                       shown separately, is more prevalent in neural networks, so you need to be familiar
                                                                                                                                                              For k = 3, x(3) = [3 3 1]T ∈ c1 , and w(3) = [ 2 2 0]T . Their inner product is positive (i.e., 6) as it should
                                       with it as well.
                                                                                                                                                           be because x(3) ∈c1 . Therefore, Step 3$ applies and the weight vector is not changed:
                                          Figure 12.23 shows a schematic diagram of the perceptron. As you can see, all
                                       this simple “machine” does is form a sum of products of an input pattern using the                                                                                                  2
                                       weights and bias found during training. The output of this operation is a scalar value
            Note that the perceptron                                                                                                                                                                      w (4) = w (3) =  2 
            model implements Eq.       that is then passed through an activation function to produce the unit’s output. For
            (12-39), which is in       the perceptron, the activation function is a thresholding function (we will consider                                                                                                 0 
            the form of a decision
            function.                  other forms of activation when we discuss neural networks). If the thresholded out-
                                       put is a +1, we say that the pattern belongs to class c1 . Otherwise, a −1 indicates that                           For k = 4, x(4) = [1 1 1]T ∈ c2 , and w(4) = [ 2 2 0]T . Their inner product is positive (i.e., 4) and it should
                                       the pattern belongs to class c2 . Values 1 and 0 sometimes are used to denote the two                               have been negative, so Step 2$ applies:
                                       possible states of the output.                                                                                                                                                2          1   1 
                                                                                                                                                                                           w (5) = w (4) − ax(4) =  2  − (1) 1 =  1 
             EXAMPLE 12.7 : Using the perceptron algorithm to learn a decision boundary.                                                                                                                             0        1  −1
            We illustrate the steps taken by a perceptron in learning the coefficients of a linear boundary by solving
                                                                                                                                                           At least one correction was made, so we cycle through the training patterns again. For k = 5, we have
            the mini problem in Fig. 12.22. To simplify manual computations, let the pattern vector furthest from the
                                                                                                                                                           x(5) = [3 3 1]T ∈ c1 , and, using w(5), we compute their inner product to be 5. This is positive as it should
            origin be x = [3 3 1]T , and the other be x = [1 1 1]T , where we augmented the vectors by appending a
                                                                                                                                                           be, so Step 3$ applies and we let w (6) = w (5) = [1 1 − 1]T . Following this procedure just discussed, you
            1 at the end, as discussed earlier. To match the figure, let these two patterns belong to classes c1 and c2 ,
                                                                                                                                                           can show (see Problem 12.13) that the algorithm converges to the solution weight vector
            respectively. Also, assume the patterns are “cycled” through the perceptron in that order during training
            (one complete iteration through all patterns of the training is called an epoch). To start, we let a = 1 and
                                                                                                                                                                                                                         1
            w(1) = 0 = [0 0 0]T ; then,
               For k = 1, x(1) = [3 3 1]T ∈ c1 , and w(1) = [0 0 0]T . Their inner product is zero,                                                                                                        w = w (12) =  1 
                                                                                                                                                                                                                          −3
                                                                                   3                                                                     which gives the decision boundary
                                                          wT (1)x(1) = [ 0 0 0 ]  3 = 0
                                                                                                                                                                                                             x1 + x2 − 3 = 0
                                                                                  1 
            so Step 1$ of the second version of the training algorithm applies:                                                                               Figure 12.24(a) shows the boundary defined by this equation. As you can see, it clearly separates the
                                                                                                                                                           patterns of the two classes. In terms of the terminology we used in the previous section, the decision
                                                                          0             3  3                                                         surface learned by the perceptron is d(x) = d( x1 , x2 ) = x1 + x2 − 3, which is a plane. As before, the
                                                  w (2) = w (1) + ax(1) =  0  + (1)  3 =  3                                                    decision boundary is the locus of points such that d(x) = d( x1 , x2 ) = 0, which is a line. Another way to
                                                                                                                                                           visualize this boundary is that it is the intersection of the decision surface (a plane) with the x1 x2 -plane,
                                                                            0        1  1 
                                                                                                                                                           as Fig. 12.24(b) shows. All points ( x1 , x2 ) such that d( x1 , x2 ) > 0 are on the positive side of the boundary,
            For k = 2, x(2) = [1 1 1]T ∈ c2 and w(2) = [3 3 1]T . Their inner product is                                                                   and vice versa for d( x1 , x2 ) < 0.
www.EBooksWorld.ir www.EBooksWorld.ir
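To make the iteration concrete, the following is a minimal NumPy sketch of the training rule in Eqs. (12-44) through (12-46), applied to the two augmented patterns of Example 12.7. The function name, the epoch cap, and the use of +1/−1 labels to indicate class membership are our choices for illustration, not part of the text.

```python
import numpy as np

def train_perceptron(patterns, labels, a=1.0, max_epochs=100):
    """Perceptron training in augmented form, Eqs. (12-44)-(12-46).

    patterns : (m, n+1) array of augmented pattern vectors (a 1 appended).
    labels   : length-m sequence, +1 for class c1 and -1 for class c2.
    a        : correction increment (learning rate), a > 0.
    """
    w = np.zeros(patterns.shape[1])       # w(1) = 0; any start vector works
    for _ in range(max_epochs):
        corrections = 0
        for x, label in zip(patterns, labels):
            s = w @ x                     # inner product w^T(k) x(k)
            if label == 1 and s <= 0:     # Step 1': misclassified c1 pattern
                w = w + a * x
                corrections += 1
            elif label == -1 and s >= 0:  # Step 2': misclassified c2 pattern
                w = w - a * x
                corrections += 1
            # Step 3': otherwise w is left unchanged
        if corrections == 0:              # an epoch with no corrections means
            break                         # a separating hyperplane was found
    return w

# The two augmented patterns of Example 12.7:
X = np.array([[3.0, 3.0, 1.0],    # x = [3 3 1]^T, class c1
              [1.0, 1.0, 1.0]])   # x = [1 1 1]^T, class c2
print(train_perceptron(X, labels=[1, -1]))   # [ 1.  1. -3.]
```

Tracing the per-pattern steps inside the loop reproduces the k = 1, 2, … sequence worked out above, and the returned vector matches the solution w(12) = [1 1 −3]^T.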
FIGURE 12.24 (a) Segment of the decision boundary, x1 + x2 − 3 = 0, learned by the perceptron algorithm. (b) Section of the decision surface, d(x) = d(x1, x2) = x1 + x2 − 3. The decision boundary is the intersection of the decision surface with the x1x2-plane.
EXAMPLE 12.8 : Using the perceptron to classify two sets of iris data measurements.

In Fig. 12.10 we showed a reduced set of the iris database in two dimensions, and mentioned that the only class separable from the others is the class of Iris setosa. As another illustration of the perceptron, we now find the full decision boundary between the Iris setosa and Iris versicolor classes. As we mentioned when discussing Fig. 12.10, these are 4-D data sets. Letting a = 0.5, and starting with all parameters equal to zero, the perceptron converged in only four epochs to the solution weight vector

    w = [0.65, 2.05, −2.60, −1.10, 0.50]^T

where the last element is wn+1.

   In practice, linearly separable pattern classes are rare, and a significant amount of research effort during the 1960s and 1970s went into developing techniques for dealing with nonseparable pattern classes. With recent advances in neural networks, many of those methods have become items of mere historical interest, and we will not dwell on them here. However, we mention one approach briefly because it is relevant to the discussion of neural networks in the next section. The method is based on minimizing the error between the actual and desired response at any training step.
   Let r denote the response we want the perceptron to have for any pattern during training. Because the output of our perceptron is either +1 or −1, these are the two possible values that r can have. We want to find the augmented weight vector, w, that minimizes the mean squared error (MSE) between the desired and actual responses of the perceptron. The error function should be differentiable and have a unique minimum. The function of choice for this purpose is a quadratic of the form

    E(w) = (1/2)(r − w^T x)^2          (12-47)

where E is our error measure, w is the weight vector we are seeking, x is any pattern from the training set, and r is the response we desire for that pattern. Both w and x are augmented vectors. (The 1/2 is used to cancel the 2 that results from taking the derivative of this expression; remember also that w^T x is a scalar.)

FIGURE 12.25 Plots of E as a function of wx for r = 1. (a) A value of a that is too small can slow down convergence. (b) If a is too large, large oscillations or divergence may occur. (c) Shape of the error function in 2-D.

   We find the minimum of E(w) using an iterative gradient descent algorithm, whose form is

    w(k + 1) = w(k) − a [∂E(w)/∂w]|_{w = w(k)}          (12-48)

where the starting weight vector is arbitrary, and a > 0. (Note that the bracketed term on the right side of this equation is the gradient of E(w).)
   Figure 12.25(a) shows a plot of E for scalar values, w and x, of w and x. We want to move w incrementally so that E(w) approaches a minimum, which implies that E should stop changing or, equivalently, that ∂E(w)/∂w = 0. Equation (12-48) does precisely this. If ∂E(w)/∂w > 0, a portion of this quantity (determined by the value of the learning increment a) is subtracted from w(k) to create a new, updated value, w(k + 1), of the weight. The opposite happens if ∂E(w)/∂w < 0. If ∂E(w)/∂w = 0, the weight is unchanged, meaning we have arrived at a minimum, which is the solution we are seeking. The value of a determines the relative magnitude of the correction in weight value. If a is too small, the step changes will be correspondingly small and the weight will move slowly toward convergence, as Fig. 12.25(a) illustrates. On the other hand, choosing a too large could cause large oscillations on either side of the minimum, or even make the iteration unstable, as Fig. 12.25(b) illustrates. There is no general rule for choosing a. A logical approach is to start small and experiment by increasing a to determine its influence on a particular set of training patterns. Figure 12.25(c) shows the shape of the error function for two variables.
   Because the error function is given analytically and is differentiable, we can express Eq. (12-48) in a form that does not require computing the gradient explicitly at every step. The partial derivative of E(w) with respect to w is

    ∂E(w)/∂w = −(r − w^T x) x          (12-49)
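Substituting Eq. (12-49) into Eq. (12-48) eliminates the explicit gradient computation and gives the update w(k + 1) = w(k) + a(r − w^T(k)x(k))x(k). The short sketch below iterates this rule for the scalar case plotted in Fig. 12.25 (r = 1, x = 1); the function name and the step counts are illustrative choices, not part of the text.

```python
def descend(a, x=1.0, r=1.0, w0=0.0, steps=20):
    """Iterate w(k+1) = w(k) + a*(r - w*x)*x for scalar w and x,
    i.e., Eq. (12-48) with the gradient of Eq. (12-49) substituted in."""
    w, history = w0, []
    for _ in range(steps):
        w = w + a * (r - w * x) * x    # one gradient-descent step
        history.append(w)
    return history

# With x = r = 1, the minimum of E(w) = (r - w*x)**2 / 2 is at w = 1.
print(descend(a=0.1)[-1])   # ~0.878: a too-small a converges slowly
print(descend(a=1.0)[-1])   # 1.0: this simple quadratic is solved in one step
print(descend(a=2.5)[:3])   # [2.5, -1.25, 4.375]: a too-large a diverges
```

The three calls mirror panels (a) and (b) of Fig. 12.25: a small a creeps toward the minimum, while a large a overshoots it with oscillations of growing amplitude.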
FIGURE 12.26 MSE as a function of epoch for: (a) the linearly separable Iris classes (setosa and versicolor); and (b) the linearly nonseparable Iris classes (versicolor and virginica). [Both panels plot the mean squared error, from 0 to 0.3, against training epochs; panel (a) spans epochs 1 through 50, and panel (b) spans epochs 180 through 900.]

    A    B    A XOR B
    0    0       0
    0    1       1
    1    0       1
    1    1       0

FIGURE 12.27 The XOR classification problem in 2-D. (a) Truth table definition of the XOR operator (shown above). (b) 2-D pattern classes formed by assigning the XOR truth values (1) to one pattern class, and the false values (0) to another. The simplest decision boundary between the two classes consists of two straight lines. (c) Nonlinear (quadratic) boundary separating the two classes.
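The non-separability of the XOR classes is easy to confirm numerically: running the perceptron rule of Eqs. (12-44) through (12-46) on the four augmented XOR patterns never produces a correction-free epoch, because no single line separates the two classes. The sketch below is our construction, not taken from the text:

```python
import numpy as np

# Augmented XOR patterns (x1, x2, 1); label +1 where A XOR B is true.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

w = np.zeros(3)
for epoch in range(1, 26):
    corrections = 0
    for x, label in zip(X, y):
        s = w @ x
        if label == 1 and s <= 0:       # Step 1'
            w += x; corrections += 1
        elif label == -1 and s >= 0:    # Step 2'
            w -= x; corrections += 1
    if corrections == 0:                # never happens for XOR
        break
print(epoch, corrections)               # corrections is still nonzero at epoch 25
```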
of wn+1, as we do for the perceptron. It is customary to use different notation, typically b, in neural networks to denote the bias term, so we are following convention. The more complicated notation used in Fig. 12.29, which we will explain shortly, is needed because we will be dealing with multilayer arrangements with several neurons per layer. We use the symbol "ℓ" to denote layers.

FIGURE 12.31 General model of a feedforward, fully connected neural net. The neuron is the same as in Fig. 12.29. Note how the output of each neuron goes to the input of all neurons in the following layer, hence the name fully connected for this type of architecture. [The figure shows inputs x1, x2, x3 feeding the network, and neuron i in hidden layer ℓ computing zi(ℓ) = Σ wij(ℓ) aj(ℓ − 1) + bi(ℓ), with the sum over j = 1, …, n_{ℓ−1}, and ai(ℓ) = h(zi(ℓ)); the output ai(ℓ) goes to all neurons in layer ℓ + 1.]

   As you can see by comparing Figs. 12.29 and 12.23, we use the variable z to denote the sum of products computed by the neuron. The output of the unit, denoted by a, is obtained by passing z through h. We call h the activation function, and refer to its output, a = h(z), as the activation value of the unit. Note in Fig. 12.29 that the inputs to a neuron are activation values from neurons in the previous layer. Figure 12.30(a) shows a plot of h(z) from Eq. (12-51). Because this function has the shape of a sigmoid, the unit in Fig. 12.29 is sometimes called an artificial sigmoid neuron, or simply a sigmoid neuron. Its derivative has a very nice form, expressible in terms of h(z) [see Problem 12.16(a)]:

    h′(z) = ∂h(z)/∂z = h(z)[1 − h(z)]          (12-52)

Figures 12.30(b) and (c) show two other forms of h(z) used frequently. The hyperbolic tangent also has the shape of a sigmoid function, but it is symmetric about both axes. This property can help improve the convergence of the backpropagation algorithm to be discussed later. The function in Fig. 12.30(c) is called the rectifier function, and a unit using it is referred to as a rectifier linear unit (ReLU); often, you will see the function itself referred to as the ReLU activation function. Experimental results suggest that this function tends to outperform the other two in deep neural networks.
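All three activation functions are one-liners in code. The sketch below defines them and checks the derivative identity of Eq. (12-52) against a centered finite difference; the numerical check is our addition.

```python
import numpy as np

def sigmoid(z):               # h(z) = 1 / (1 + exp(-z)), Eq. (12-51)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                  # hyperbolic tangent, symmetric about both axes
    return np.tanh(z)

def relu(z):                  # the rectifier function used by ReLUs
    return np.maximum(0.0, z)

# Check Eq. (12-52): h'(z) = h(z)[1 - h(z)].
z = 0.7
h = sigmoid(z)
analytic = h * (1.0 - h)
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
print(analytic, numeric)      # both are ~0.2217
```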
every node is connected to the input of all nodes in the next layer, forming a fully connected network. We also require that there be no loops in the network. Such networks are called feedforward networks. Fully connected, feedforward neural nets are the only types of networks considered in this section.
   We obviously know the values of the nodes in the first layer, and we can observe the values of the output neurons. All others are hidden neurons, and the layers that contain them are called hidden layers. Generally, we call a neural net with a single hidden layer a shallow neural network, and refer to a network with two or more hidden layers as a deep neural network. However, this terminology is not universal, and sometimes you will see the words "shallow" and "deep" used subjectively to denote networks with a "few" and with "many" layers, respectively.
   We used the notation in Eq. (12-37) to label all the inputs and weights of a perceptron. In a neural network, the notation is more complicated because we have to account for neuron weights, inputs, and outputs within a layer, and also from layer to layer. Ignoring layer notation for a moment, we denote by wij the weight associated with the link connecting the output of neuron j to the input of neuron i. That is,
the first subscript denotes the neuron that receives the signal, and the second refers to the neuron that sends it. Because i precedes j alphabetically, it would seem to make more sense for i to send and for j to receive. The reason we use the notation as stated is to avoid a matrix transposition in the equation that describes the propagation of signals through the network. This notation is conventional, but there is no doubt that it is confusing, so special care is necessary to keep it straight.
   Because a bias depends only on the neuron containing it, a single subscript that associates a bias with a neuron is sufficient. (Remember, a bias is a weight that is always multiplied by 1.) For example, we use bi to denote the bias value associated with the ith neuron in a given layer of the network. Our use of b instead of wn+1 (as we did for perceptrons) follows the notational convention used in neural networks. The weights, biases, and activation function(s) completely define a neural network. Although the activation function of any neuron in a neural network could be different from the others, there is no convincing evidence to suggest that there is anything to be gained by doing so. We assume in all subsequent discussions that the same form of activation function is used in all neurons.
   Let ℓ denote a layer in the network, for ℓ = 1, 2, …, L. With reference to Fig. 12.31, ℓ = 1 denotes the input layer, ℓ = L is the output layer, and all other values of ℓ denote hidden layers. The number of neurons in layer ℓ is denoted nℓ. We have two options for including layer indexing in the parameters of a neural network. We can do it as a superscript, for example, wijℓ and biℓ; or we can use the notation wij(ℓ) and bi(ℓ). The first option is more prevalent in the literature on neural networks. We use the second option because it is more consistent with the way we describe iterative expressions in the book, and also because you may find it easier to follow. Using this notation, the output (activation value) of neuron k in layer ℓ is denoted ak(ℓ).
   Keep in mind that our objective in using neural networks is the same as for perceptrons: to determine the class membership of unknown input patterns. The most common way to perform pattern classification using a neural network is to assign a class label to each output neuron. Thus, a neural network with nL outputs can classify an unknown pattern into one of nL classes. The network assigns an unknown pattern vector x to class ck if output neuron k has the largest activation value; that is, if ak(L) > aj(L) for j = 1, 2, …, nL; j ≠ k.†
   In this and the following section, the number of outputs of our neural networks will always equal the number of classes. But this is not a requirement. For instance, a network for classifying two pattern classes could be structured with a single output (Problem 12.17 illustrates such a case), because all we need for this task is two states, and a single neuron is capable of that. For three or four classes, we need three or four states, respectively, which can be achieved with two output neurons. Of course, the problem with this approach is that we would need additional logic to decipher the output combinations. It is simply more practical to have one output neuron per class, and let the neuron with the highest output value determine the class of the input.

† Instead of a sigmoid or similar function in the final output layer, you will sometimes see a softmax function used instead. The concept is the same as we explained earlier, but the activation values in a softmax implementation are given by ai(L) = exp[zi(L)] / Σk exp[zk(L)], where the summation is over all outputs. In this formulation, the sum of all activations is 1, thus giving the outputs a probabilistic interpretation.

FORWARD PASS THROUGH A FEEDFORWARD NEURAL NETWORK

A forward pass through a neural network maps the input layer (i.e., the values of x) to the output layer. The values in the output layer are used for determining the class of an input vector. The equations developed in this section explain how a feedforward neural network carries out the computations that result in its output. Implicit in the discussion is that the network parameters (weights and biases) are known. The important results of this section will be summarized in Table 12.2 at the end of our discussion, but understanding the material that gets us there is important when we discuss the training of neural nets in the next section.

The Equations of a Forward Pass

The outputs of layer 1 are the components of the input vector x:

    aj(1) = xj,    j = 1, 2, …, n1          (12-53)

where n1 = n is the dimensionality of x. As illustrated in Figs. 12.29 and 12.31, the computation performed by neuron i in layer ℓ is given by

    zi(ℓ) = Σ_{j=1}^{n_{ℓ−1}} wij(ℓ) aj(ℓ − 1) + bi(ℓ)          (12-54)

for i = 1, 2, …, nℓ and ℓ = 2, …, L. The quantity zi(ℓ) is called the net (or total) input to neuron i in layer ℓ, and is sometimes denoted by neti. The reason for this terminology is that zi(ℓ) is formed using all the outputs from layer ℓ − 1. The output (activation value) of neuron i in layer ℓ is given by

    ai(ℓ) = h(zi(ℓ)),    i = 1, 2, …, nℓ          (12-55)

where h is an activation function. The value of network output node i is then

    ai(L) = h(zi(L)),    i = 1, 2, …, nL          (12-56)

Equations (12-53) through (12-56) describe all the operations required to map the input of a fully connected feedforward network to its output.

EXAMPLE 12.10 : Illustration of a forward pass through a fully connected neural network.

It will be helpful to consider a simple numerical example. Figure 12.32 shows a three-layer neural network consisting of the input layer, one hidden layer, and the output layer. The network accepts three inputs and has two outputs, so it is capable of classifying 3-D patterns into one of two classes.
   The numbers shown above the arrowheads on each input to a node are the weights of that node associated with the outputs from the nodes in the preceding layer. Similarly, the number shown at the output of each node is the activation value, a, of that node. As noted earlier, there is only one output value for each node, but it is routed to the input of every node in the next layer. The inputs associated with the 1's are bias values.
   Let us look at the computations performed at each node, starting with the first (top) node in layer 2. We use Eq. (12-54) to compute the net input, z1(2), for that node:
    z1(2) = Σ_{j=1}^{3} w1j(2) aj(1) + b1(2) = (0.1)(3) + (0.2)(0) + (0.6)(1) + 0.4 = 1.3

We obtain the output of this node using Eqs. (12-51) and (12-55):

    a1(2) = h(z1(2)) = 1/(1 + e^{−1.3}) = 0.7858

A similar computation gives the values for the second node in the second layer:

    z2(2) = Σ_{j=1}^{3} w2j(2) aj(1) + b2(2) = (0.4)(3) + (0.3)(0) + (0.1)(1) + 0.2 = 1.5

and

    a2(2) = h(z2(2)) = 1/(1 + e^{−1.5}) = 0.8176

We use the outputs of the nodes in layer 2 to obtain the net values of the neurons in layer 3:

    z1(3) = Σ_{j=1}^{2} w1j(3) aj(2) + b1(3) = (0.2)(0.7858) + (0.1)(0.8176) + 0.6 = 0.8389

The output of this neuron is

    a1(3) = h(z1(3)) = 1/(1 + e^{−0.8389}) = 0.6982

Similarly,

    z2(3) = Σ_{j=1}^{2} w2j(3) aj(2) + b2(3) = (0.1)(0.7858) + (0.4)(0.8176) + 0.3 = 0.7056

and

    a2(3) = h(z2(3)) = 1/(1 + e^{−0.7056}) = 0.6694

   The computations just illustrated can be written compactly in matrix form. The outputs of layer 1 are the components of x:

    a(1) = x          (12-57)

Next, we look at Eq. (12-54). We know that the summation term is just the inner product of two vectors [see Eqs. (12-37) and (12-38)]. However, this equation has to be evaluated for all nodes in every layer past the first, which implies that a loop is required if we do the computations node by node. The solution is to form a matrix, W(ℓ), that contains all the weights in layer ℓ. The structure of this matrix is simple: each of its rows contains the weights for one of the nodes in layer ℓ,

    W(ℓ) = [ w11(ℓ)    w12(ℓ)    …    w1n_{ℓ−1}(ℓ)
             w21(ℓ)    w22(ℓ)    …    w2n_{ℓ−1}(ℓ)
               ⋮          ⋮                ⋮
             wnℓ1(ℓ)   wnℓ2(ℓ)   …    wnℓn_{ℓ−1}(ℓ) ]          (12-58)

(With reference to our earlier discussion on the order of the subscripts i and j, if we had let i be the sending node and j the receiver, this matrix would have to be transposed.) Then, we can obtain all the sum-of-products computations, zi(ℓ), for layer ℓ simultaneously:

    z(ℓ) = W(ℓ) a(ℓ − 1) + b(ℓ),    ℓ = 2, 3, …, L          (12-59)

where a(ℓ − 1) is a column vector of dimension n_{ℓ−1} × 1 containing the outputs of layer ℓ − 1, b(ℓ) is a column vector of dimension nℓ × 1 containing the bias values of all the neurons in layer ℓ, and z(ℓ) is an nℓ × 1 column vector containing the net input values, zi(ℓ), i = 1, 2, …, nℓ, to all the nodes in layer ℓ. You can easily verify that Eq. (12-59) is dimensionally correct.
   Because the activation function is applied to each net input independently of the others, the outputs of the network at any layer can be expressed in vector form as

    a(ℓ) = h[z(ℓ)] = [ h(z1(ℓ))
                       h(z2(ℓ))
                          ⋮
                       h(znℓ(ℓ)) ]          (12-60)
                                                            a2 (3) = h ( z2 (2)) =                       = 0.6694                                                                                                                                                                    
                                                                                           1 + e −0.7056                                                                                                                                                                     (
                                                                                                                                                                                                                                                                         h zn& (&) 
                                                                                                                                                                                                                                                                                     )
            If we were using this network to classify the input, we would say that pattern x belongs to class c1
            because a1(L) > a2 (L), where L = 3 and nL = 2 in this case.                                                                                                                                               Implementing Eqs. (12-57) through (12-60) requires just a series of matrix opera-
                                                                                                                                                                                                                       tions, with no loops.
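For concreteness, here is a minimal NumPy sketch of that forward pass (Python standing in for any matrix-oriented language; the code is our illustration, not part of the original formulation), with the weights and biases of the example above arranged as in Eq. (12-58):

```python
import numpy as np

def h(z):
    # Sigmoid activation of Eq. (12-52), applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

# Weights and biases of the example above, collected into matrices.
W2 = np.array([[0.1, 0.2, 0.6],
               [0.4, 0.3, 0.1]])      # W(2): n2 x n1 = 2 x 3
b2 = np.array([0.4, 0.2])             # b(2)
W3 = np.array([[0.2, 0.1],
               [0.1, 0.4]])           # W(3): n3 x n2 = 2 x 2
b3 = np.array([0.6, 0.3])             # b(3)

a1 = np.array([3.0, 0.0, 1.0])        # Eq. (12-57): a(1) = x

z2 = W2 @ a1 + b2                     # Eq. (12-59): [1.3, 1.5]
a2 = h(z2)                            # Eq. (12-60): [0.7858, 0.8176]
z3 = W3 @ a2 + b3                     # [0.8389, 0.7056]
a3 = h(z3)                            # [0.6982, 0.6694]: x is assigned to c1
print(a3)
```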
FIGURE 12.33 Same as Fig. 12.32, but using matrix labeling. [Figure: the three-input network with inputs x_1, x_2, x_3, annotated with the following quantities.]

$$\mathbf{a}(1) = \mathbf{x} = \begin{bmatrix} 3 \\ 0 \\ 1 \end{bmatrix},\quad \mathbf{W}(2) = \begin{bmatrix} 0.1 & 0.2 & 0.6 \\ 0.4 & 0.3 & 0.1 \end{bmatrix},\quad \mathbf{b}(2) = \begin{bmatrix} 0.4 \\ 0.2 \end{bmatrix},\quad \mathbf{a}(2) = \begin{bmatrix} 0.7858 \\ 0.8176 \end{bmatrix},$$
$$\mathbf{W}(3) = \begin{bmatrix} 0.2 & 0.1 \\ 0.1 & 0.4 \end{bmatrix},\quad \mathbf{b}(3) = \begin{bmatrix} 0.6 \\ 0.3 \end{bmatrix},\quad \mathbf{a}(3) = \begin{bmatrix} 0.6982 \\ 0.6694 \end{bmatrix}$$

EXAMPLE 12.11: Redoing Example 12.10 using matrix operations.

Figure 12.33 shows the same neural network as in Fig. 12.32, but with all its parameters shown in matrix form. As you can see, the representation in Fig. 12.33 is more compact. Starting with

$$\mathbf{a}(1) = \begin{bmatrix} 3 \\ 0 \\ 1 \end{bmatrix}$$

it follows that

$$\mathbf{z}(2) = \mathbf{W}(2)\,\mathbf{a}(1) + \mathbf{b}(2) = \begin{bmatrix} 0.1 & 0.2 & 0.6 \\ 0.4 & 0.3 & 0.1 \end{bmatrix}\begin{bmatrix} 3 \\ 0 \\ 1 \end{bmatrix} + \begin{bmatrix} 0.4 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 1.3 \\ 1.5 \end{bmatrix}$$

Then,

$$\mathbf{a}(2) = h[\mathbf{z}(2)] = \begin{bmatrix} h(z_1(2)) \\ h(z_2(2)) \end{bmatrix} = \begin{bmatrix} h(1.3) \\ h(1.5) \end{bmatrix} = \begin{bmatrix} 0.7858 \\ 0.8176 \end{bmatrix}$$

With a(2) as input to the next layer, we obtain

$$\mathbf{z}(3) = \mathbf{W}(3)\,\mathbf{a}(2) + \mathbf{b}(3) = \begin{bmatrix} 0.2 & 0.1 \\ 0.1 & 0.4 \end{bmatrix}\begin{bmatrix} 0.7858 \\ 0.8176 \end{bmatrix} + \begin{bmatrix} 0.6 \\ 0.3 \end{bmatrix} = \begin{bmatrix} 0.8389 \\ 0.7056 \end{bmatrix}$$

and, as before,

$$\mathbf{a}(3) = h[\mathbf{z}(3)] = \begin{bmatrix} h(z_1(3)) \\ h(z_2(3)) \end{bmatrix} = \begin{bmatrix} h(0.8389) \\ h(0.7056) \end{bmatrix} = \begin{bmatrix} 0.6982 \\ 0.6694 \end{bmatrix}$$

The clarity of the matrix formulation over the indexed notation used in Example 12.10 is evident.

Equations (12-57) through (12-60) are a significant improvement over node-by-node computations, but they apply only to one pattern. To classify multiple pattern vectors, we would have to loop through each pattern using the same set of matrix equations per loop iteration. What we are after is one set of matrix equations capable of processing all patterns in a single forward pass. Extending Eqs. (12-57) through (12-60) to this more general formulation is straightforward. We begin by arranging all our input pattern vectors as columns of a single matrix, X, of dimension n × n_p where, as before, n is the dimensionality of the vectors and n_p is the number of pattern vectors. It follows from Eq. (12-57) that

$$\mathbf{A}(1) = \mathbf{X} \tag{12-61}$$

where each column of matrix A(1) contains the initial activation values (i.e., the vector values) for one pattern. This is a straightforward extension of Eq. (12-57), except that we are now dealing with an n × n_p matrix instead of an n × 1 vector.

The parameters of a network do not change because we are processing more pattern vectors, so the weight matrix is as given in Eq. (12-58). This matrix is of size n_ℓ × n_{ℓ−1}. When ℓ = 2, we have that W(2) is of size n_2 × n, because n_1 is always equal to n. Then, extending the product term of Eq. (12-59) to use A(1) instead of a(1) results in the matrix product W(2)A(1), which is of size (n_2 × n)(n × n_p) = n_2 × n_p. To this, we have to add the bias vector for the second layer, which is of size n_2 × 1. Obviously, we cannot add a matrix of size n_2 × n_p and a vector of size n_2 × 1. However, as is true of the weight matrices, the bias vectors do not change because we are processing more pattern vectors. We just have to account for one identical bias vector, b(2), per input vector. We do this by creating a matrix B(2) of size n_2 × n_p, formed by concatenating column vector b(2) n_p times, horizontally. Then, Eq. (12-59) written in matrix form becomes Z(2) = W(2)A(1) + B(2). Matrix Z(2) is of size n_2 × n_p; it contains the computation performed by Eq. (12-59), but for all input patterns. That is, each column of Z(2) is exactly the computation performed by Eq. (12-59) for one input pattern.

The concept just discussed applies to the transition from any layer to the next in the neural network, provided that we use the weights and bias appropriate for a particular location in the network. Therefore, the full matrix version of Eq. (12-59) is

$$\mathbf{Z}(\ell) = \mathbf{W}(\ell)\,\mathbf{A}(\ell-1) + \mathbf{B}(\ell) \tag{12-62}$$

where W(ℓ) is given by Eq. (12-58) and B(ℓ) is an n_ℓ × n_p matrix whose columns are duplicates of b(ℓ), the bias vector containing the biases of the neurons in layer ℓ.

All that remains is the matrix formulation of the output of layer ℓ. As Eq. (12-60) shows, the activation function is applied independently to each element of the vector z(ℓ). Because each column of Z(ℓ) is simply the application of Eq. (12-60) corresponding to a particular input vector, it follows that

$$\mathbf{A}(\ell) = h[\mathbf{Z}(\ell)] \tag{12-63}$$

where activation function h is applied to each element of matrix Z(ℓ).

Summarizing the dimensions in our matrix formulation, we have: X and A(1) are of size n × n_p, Z(ℓ) is of size n_ℓ × n_p, W(ℓ) is of size n_ℓ × n_{ℓ−1}, A(ℓ − 1) is of size n_{ℓ−1} × n_p, B(ℓ) is of size n_ℓ × n_p, and A(ℓ) is of size n_ℓ × n_p.
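A minimal sketch of this multi-pattern forward pass, again in NumPy and again only an illustration: note that array broadcasting plays the role of the concatenated bias matrix B(ℓ), so B(ℓ) never has to be formed explicitly.

```python
import numpy as np

def h(Z):
    # Sigmoid activation, applied elementwise as in Eq. (12-63).
    return 1.0 / (1.0 + np.exp(-Z))

def forward_pass(X, weights, biases):
    """Multi-pattern forward pass, Eqs. (12-61) through (12-63).
    X: n x np (one pattern per column);
    weights = [W(2), ..., W(L)]; biases = [b(2), ..., b(L)] as column vectors."""
    A = X                          # Eq. (12-61): A(1) = X
    for W, b in zip(weights, biases):
        Z = W @ A + b              # Eq. (12-62); broadcasting b over the
                                   # np columns stands in for B(l)
        A = h(Z)                   # Eq. (12-63): A(l) = h[Z(l)]
    return A                       # A(L), size nL x np

# The network of Fig. 12.33, with two patterns processed in one pass
# (the second column is an arbitrary pattern added for illustration).
W2 = np.array([[0.1, 0.2, 0.6], [0.4, 0.3, 0.1]]); b2 = np.array([[0.4], [0.2]])
W3 = np.array([[0.2, 0.1], [0.1, 0.4]]);           b3 = np.array([[0.6], [0.3]])
X  = np.array([[3.0, 1.0], [0.0, 2.0], [1.0, 0.0]])
print(forward_pass(X, [W2, W3], [b2, b3]))   # first column: [0.6982, 0.6694]
```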
TABLE 12.2 Steps in the matrix computation of a forward pass through a fully connected, feedforward multilayer neural net.

| Step | Description | Equations |
|------|-------------|-----------|
| Step 1 | Input patterns | A(1) = X |
| Step 2 | Feedforward | For ℓ = 2, …, L, compute Z(ℓ) = W(ℓ)A(ℓ − 1) + B(ℓ) and A(ℓ) = h(Z(ℓ)) |
| Step 3 | Output | A(L) = h(Z(L)) |

Table 12.2 summarizes the matrix formulation for the forward pass through a fully connected, feedforward neural network for all pattern vectors. Implementing these operations in a matrix-oriented language like MATLAB is a trivial undertaking. Performance can be improved significantly by using dedicated hardware, such as one or more graphics processing units (GPUs).

The equations in Table 12.2 are used to classify each of a set of patterns into one of n_L pattern classes. Each column of output matrix A(L) contains the activation values of the n_L output neurons for a specific pattern vector. The class membership of that pattern is given by the location of the output neuron with the highest activation value. Of course, this assumes we know the weights and biases of the network. These are obtained during training using backpropagation, as we explain next.

USING BACKPROPAGATION TO TRAIN DEEP NEURAL NETWORKS

A neural network is defined completely by its weights, biases, and activation function. Training a neural network refers to using one or more sets of training patterns to estimate these parameters. During training, we know the desired response of every output neuron of a multilayer neural net. However, we have no way of knowing what the values of the outputs of hidden neurons should be. In this section, we develop the equations of backpropagation, the tool of choice for finding the values of the weights and biases in a multilayer network. This training by backpropagation involves four basic steps: (1) inputting the pattern vectors; (2) a forward pass through the network to classify all the patterns of the training set and determine the classification error; (3) a backward (backpropagation) pass that feeds the output error back through the network to compute the changes required to update the parameters; and (4) updating the weights and biases in the network. These steps are repeated until the error reaches an acceptable level. We will provide a summary of all principal results derived in this section at the end of the discussion (see Table 12.3). As you will see shortly, the principal mathematical tool needed to derive the equations of backpropagation is the chain rule from basic calculus.

The Equations of Backpropagation

Given a set of training patterns and a multilayer feedforward neural network architecture, the approach in the following discussion is to find the network parameters that minimize an error (also called cost or objective) function. Our interest is in classification performance, so we define the error function for a neural network as the average of the differences between desired and actual responses. Let r denote the desired response for a given pattern vector, x, and let a(L) denote the actual response of the network to that input. For example, in a ten-class recognition application, r and a(L) would be 10-D column vectors. The ten components of a(L) would be the ten outputs of the neural network, and the components of r would be zero, except for the element corresponding to the class of x, which would be 1. For example, if the input training pattern belongs to class 6, the 6th element of r would be 1 and the rest would be 0's.

The activation value of neuron j in the output layer is a_j(L). We define the error of that neuron as

$$E_j = \frac{1}{2}\big(r_j - a_j(L)\big)^2 \tag{12-64}$$

for j = 1, 2, …, n_L, where r_j is the desired response of output neuron a_j(L) for a given pattern x. The output error with respect to a single x is the sum of the errors of all output neurons with respect to that vector:

$$E = \sum_{j=1}^{n_L} E_j = \frac{1}{2}\sum_{j=1}^{n_L}\big(r_j - a_j(L)\big)^2 = \frac{1}{2}\,\lVert \mathbf{r} - \mathbf{a}(L) \rVert^2 \tag{12-65}$$

where the last expression follows from the definition of the Euclidean vector norm. (See Eqs. (2-50) and (2-51) regarding the Euclidean vector norm.) The total network output error over all training patterns is defined as the sum of the errors of the individual patterns. We want to find the weights that minimize this total error. As we did for the LMSE perceptron, we find the solution using gradient descent. However, unlike the perceptron, we have no way of computing the gradients of the weights in the hidden nodes. The beauty of backpropagation is that we can achieve an equivalent result by propagating the output error back into the network. (When the meaning is clear, we sometimes include the bias term in the word "weights.")
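In code, the one-hot vector r, the error of Eqs. (12-64) and (12-65), and the class-membership rule discussed with Table 12.2 take only a few lines. The following NumPy fragment is our illustration (with 0-based indexing, so class 6 occupies index 5):

```python
import numpy as np

# Desired response r for a pattern of class 6 in a ten-class problem:
# all zeros except a 1 in the 6th position.
r = np.zeros(10)
r[5] = 1.0

def pattern_error(r, a_L):
    # Eqs. (12-64)-(12-65): E = (1/2) * || r - a(L) ||^2 for one pattern.
    return 0.5 * np.sum((r - a_L) ** 2)

def classify(A_L):
    # Each column of A(L) holds the nL output activations for one pattern;
    # the predicted class is the row of the largest activation in that
    # column (the rule described with Table 12.2). One label per pattern.
    return np.argmax(A_L, axis=0)
```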
The key objective is to find a scheme to adjust all weights in a network using training patterns. In order to do this, we need to know how E changes with respect to the weights in the network. The weights are contained in the expression for the net input to each node [see Eq. (12-54)], so the quantity we are after is ∂E/∂z_j(ℓ) where, as defined in Eq. (12-54), z_j(ℓ) is the net input to node j in layer ℓ. In order to simplify the notation later, we use the symbol δ_j(ℓ) to denote ∂E/∂z_j(ℓ). (We use "j" generically to mean any node in the network. We are not concerned at the moment with inputs to, or outputs from, a node.) Because backpropagation starts with the output and works backward from there, we look first at

$$\delta_j(L) = \frac{\partial E}{\partial z_j(L)} \tag{12-66}$$

We can express this equation in terms of the output a_j(L) using the chain rule:
$$\delta_j(L) = \frac{\partial E}{\partial z_j(L)} = \frac{\partial E}{\partial a_j(L)}\frac{\partial a_j(L)}{\partial z_j(L)} = \frac{\partial E}{\partial a_j(L)}\frac{\partial h\big(z_j(L)\big)}{\partial z_j(L)} = \frac{\partial E}{\partial a_j(L)}\,h'\big(z_j(L)\big) \tag{12-67}$$

where we used Eq. (12-56) to obtain the last expression. This equation gives us the value of δ_j(L) in terms of quantities that can be observed or computed. For example, if we use Eq. (12-64) as our error measure and the sigmoid of Eq. (12-52) for h, then

$$\delta_j(L) = h\big(z_j(L)\big)\Big[1 - h\big(z_j(L)\big)\Big]\Big[a_j(L) - r_j\Big] \tag{12-68}$$

where we interchanged the order of the terms. The h(z_j(L)) are computed in the forward pass, a_j(L) can be observed in the output of the network, and r_j is given along with x during training. Therefore, we can compute δ_j(L).

Because the relationship between the net input and the output of any neuron in any layer (except the first) is the same, the form of Eq. (12-66) is valid for any node j in any hidden layer:

$$\delta_j(\ell) = \frac{\partial E}{\partial z_j(\ell)} \tag{12-69}$$

This equation tells us how E changes with respect to a change in the net input to any neuron in the network. What we want to do next is express δ_j(ℓ) in terms of δ_j(ℓ + 1). Because we will be proceeding backward in the network, this means that if we have this relationship, then we can start with δ_j(L) and find δ_j(L − 1). We then use this result to find δ_j(L − 2), and so on until we arrive at layer 2. We obtain the desired expression using the chain rule (see Problem 12.25):

$$\delta_j(\ell) = \frac{\partial E}{\partial z_j(\ell)} = \sum_i \frac{\partial E}{\partial z_i(\ell+1)}\frac{\partial z_i(\ell+1)}{\partial a_j(\ell)}\frac{\partial a_j(\ell)}{\partial z_j(\ell)} = \sum_i \delta_i(\ell+1)\,\frac{\partial z_i(\ell+1)}{\partial a_j(\ell)}\,h'\big(z_j(\ell)\big) = h'\big(z_j(\ell)\big)\sum_i w_{ij}(\ell+1)\,\delta_i(\ell+1) \tag{12-70}$$

for ℓ = L − 1, L − 2, …, 2, where we used Eqs. (12-55) and (12-69) to obtain the middle expression, and Eq. (12-54), plus some rearranging, to obtain the last expression.

The preceding development tells us how we can start with the error in the output (which we can compute) and obtain how that error changes as a function of the net inputs to every node in the network. This is an intermediate step toward our final objective, which is to obtain expressions for ∂E/∂w_ij(ℓ) and ∂E/∂b_i(ℓ) in terms of δ_j(ℓ) = ∂E/∂z_j(ℓ). For this, we use the chain rule again:

$$\frac{\partial E}{\partial w_{ij}(\ell)} = \frac{\partial E}{\partial z_i(\ell)}\frac{\partial z_i(\ell)}{\partial w_{ij}(\ell)} = \delta_i(\ell)\,\frac{\partial z_i(\ell)}{\partial w_{ij}(\ell)} = a_j(\ell-1)\,\delta_i(\ell) \tag{12-71}$$

where we used Eq. (12-54), Eq. (12-69), and interchanged the order of the results to clarify matrix formulations later in our discussion. Similarly (see Problem 12.26),

$$\frac{\partial E}{\partial b_i(\ell)} = \delta_i(\ell) \tag{12-72}$$

Now we have the rate of change of E with respect to the network weights and biases in terms of quantities we can compute. The last step is to use these results to update the network parameters using gradient descent:

$$w_{ij}(\ell) = w_{ij}(\ell) - \alpha\,\frac{\partial E}{\partial w_{ij}(\ell)} = w_{ij}(\ell) - \alpha\,\delta_i(\ell)\,a_j(\ell-1) \tag{12-73}$$

and

$$b_i(\ell) = b_i(\ell) - \alpha\,\frac{\partial E}{\partial b_i(\ell)} = b_i(\ell) - \alpha\,\delta_i(\ell) \tag{12-74}$$

for ℓ = L − 1, L − 2, …, 2, where the a's are computed in the forward pass, and the δ's are computed during backpropagation. As with the perceptron, α is the learning rate constant used in gradient descent. There are numerous approaches that attempt to find optimal learning rates, but ultimately this is a problem-dependent parameter that requires experimentation. A reasonable approach is to start with a small value of α (e.g., 0.01), then experiment with vectors from the training set to determine a suitable value for a given application. Remember, α is used only during training, so it has no effect on post-training operating performance.
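Expressed as code, the backpropagation equations derived so far amount to a handful of one-line functions. The sketch below is our NumPy illustration of Eqs. (12-68) and (12-70) through (12-74) for a single pattern, assuming the sigmoid of Eq. (12-52):

```python
import numpy as np

def h(z):
    return 1.0 / (1.0 + np.exp(-z))           # sigmoid, Eq. (12-52)

def h_prime(z):
    return h(z) * (1.0 - h(z))                 # its derivative

def output_delta(z_L, a_L, r):
    # Eq. (12-68): delta_j(L) = h(z_j(L))[1 - h(z_j(L))][a_j(L) - r_j]
    return h_prime(z_L) * (a_L - r)

def hidden_delta(z_l, W_next, delta_next):
    # Eq. (12-70): delta_j(l) = h'(z_j(l)) * sum_i w_ij(l+1) delta_i(l+1);
    # the sum over i is the j-th entry of W(l+1)^T delta(l+1).
    return h_prime(z_l) * (W_next.T @ delta_next)

def update_layer(W_l, b_l, delta_l, a_prev, alpha):
    # Eqs. (12-71)-(12-74): dE/dw_ij = a_j(l-1) delta_i(l); dE/db_i = delta_i(l).
    W_l -= alpha * np.outer(delta_l, a_prev)   # Eq. (12-73)
    b_l -= alpha * delta_l                     # Eq. (12-74)
    return W_l, b_l
```

The matrix formulation developed next packages these same per-pattern quantities into the matrices D(ℓ), A(ℓ), and B(ℓ).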
Matrix Formulation

As with the equations that describe the forward pass through a neural network, the equations of backpropagation developed in the previous discussion are excellent for describing how the method works at a fundamental level, but they are clumsy when it comes to implementation. In this section, we follow a procedure similar to the one we used for the forward pass to develop the matrix equations for backpropagation.
                                      As before, we arrange all the pattern vectors as columns of matrix X, and package                                                                  each column corresponding to one pattern vector. All matrices in Eq. (12-78) are of
                                    the weights of layer & as matrix W(&). We use D(&) to denote the matrix equiva-                                                                      size nL × n p .
                                    lent of Î(&), the vector containing the errors in layer &. Our first step is to find an                                                                 Following a similar line of reasoning, we can express Eq. (12-70) in matrix form as
                                    expression for D(L). We begin at the output and proceed backward, as before. From
                                    Eq. (12-67),                                                                                                                                                                           (                       )
                                                                                                                                                                                                                   D(&) = WT (& + 1)D(& + 1) } h' ( Z(&))                     (12-79)
                                                              ∂E                           ∂E   h$ z (L)                                                                           It is easily confirmed by dimensional analysis that the matrix D(&) is of size n& × n p
                                                              ∂ a (L) h$ ( z1 (L))         ∂ a (L)   ( 1          )
                                                                                                                                                                                        (see Problem 12.27). Note that Eq. (12-79) uses the weight matrix transposed. This
                                                d1 (L)                                   1         
                                                                   1
                                                                                                                                                                                        reflects the fact that the inputs to layer & are coming from layer & + 1, because in
                                                d (L)   ∂E h$ ( z (L))                   ∂E   $
                                                                                                           h ( z   ( L) ) 
                                                          =  ∂ a2 (L)                     ∂ a (L)                                                                                  backpropagation we move in the direction opposite of a forward pass.
                                        D(L) = 
                                                   2                         2                                   2
                                                                                                                          
                                                #                                      = 2        }                
                                                                                                                              (12-75)                                                        We complete the matrix formulation by expressing the weight and bias update
                                                                     #                       #            #                                                                      equations in matrix form. Considering the weight matrix first, we can tell from Eqs.
                                               dnL (L)  ∂E                                                       
                                                                         (
                                                                        h$ znL (L)    )     ∂E   h$ z (L)  (       )                                                                (12-70) and (12-73) that we are going to need matrices W(&), D(&), and A(& − 1).
                                                              ∂ an (L)                     ∂ an (L)        nL
                                                                                                                                                                                        We already know that W(&) is of size n& × n& −1 and that D(&) is of size n& × n p . Each
                                                              L                            L        
                                                                                                                                                                                         column of matrix A(& − 1) is the set of outputs of the neurons in layer & − 1 for one
                                                                                                                                                                                         pattern vector. There are n p patterns, so A(& − 1) is of size n&−1 × n p . From Eq. (12-
                                    where, as defined in Section 2.6, “}” denotes elementwise multiplication (of two                                                                     73) we infer that A post-multiplies D, so we are also going to need AT (& − 1), which
                                    vectors in this case). We can write the vector on the left of this symbol as ∂E ∂ a(L),                                                              is of size n p × n&−1 . Finally, recall that in a matrix formulation, we construct a matrix
                                    and the vector on the right as h$ ( z(L)) . Then, we can write Eq. (12-75) as                                                                        B(&) of size n& × n p whose columns are copies of vector b(&), which contains all the
                                                                                ∂E                                                                                                       biases in layer &.
                                                                     Î(L) =          } h$ ( z(L))                             (12-76)                                                        Next, we look at updating the biases. We know from Eq. (12-74) that each ele-
                                                                              ∂ a(L)
                                                                                                                                                                                         ment bi (&) of b(&) is updated as bi (&) = bi (&) − a di (&), for i = 1, 2, … , n& . Therefore,
                                    This nL × 1 column vector contains the activation values of all the output neurons                                                                   b(&) = b(&) − aÎ (&). But this is for one pattern, and the columns of D(&) are the
                                    for one pattern vector. The only error function we use in this chapter is a quadratic                                                                Î(&)’ s for all patterns in the training set. This is handled in a matrix formulation by
                                    function, which is given in vector form in Eq. (12-65). The partial of that quadratic                                                                using the average of the columns of D(&) (this is the average error over all patterns)
                                    function with respect to a(L) is ( a(L) − r ) which, when substituted into Eq. (12-76),                                                              to update b(&).
gives us

$$\boldsymbol{\delta}(L) = \big(\mathbf{a}(L) - \mathbf{r}\big) \odot h'\big(\mathbf{z}(L)\big) \tag{12-77}$$

Column vector $\boldsymbol{\delta}(L)$ accounts for one pattern vector. To account for all $n_p$ patterns simultaneously, we form a matrix $D(L)$ whose columns are the $\boldsymbol{\delta}(L)$ from Eq. (12-77), each evaluated for a specific pattern vector. This is equivalent to writing Eq. (12-77) directly in matrix form as

$$D(L) = \big(A(L) - R\big) \odot h'\big(Z(L)\big) \tag{12-78}$$

Each column of $A(L)$ is the network output for one pattern. Similarly, each column of $R$ is a binary vector with a 1 in the location corresponding to the class of a particular pattern vector, and 0's elsewhere, as explained earlier. Each column of the difference $(A(L) - R)$ therefore contains the components of $(\mathbf{a} - \mathbf{r})$, so squaring the elements of a column, adding them, and dividing by 2 is the same as computing the error measure defined in Eq. (12-65) for one pattern. Adding all the column computations gives an overall measure of error for all the patterns. Similarly, the columns of matrix $h'(Z(L))$ are the values of $h'$ evaluated at the net inputs to all output neurons, with one column for each pattern. An analogous derivation for the hidden layers (see Step 3 of Table 12.3) gives

$$D(\ell) = \big(W^T(\ell+1)\,D(\ell+1)\big) \odot h'\big(Z(\ell)\big) \tag{12-79}$$

Putting it all together results in the following two equations for updating the network parameters:

$$W(\ell) = W(\ell) - \alpha\, D(\ell)\, A^T(\ell-1) \tag{12-80}$$

and

$$\mathbf{b}(\ell) = \mathbf{b}(\ell) - \alpha \sum_{k=1}^{n_p} \boldsymbol{\delta}_k(\ell) \tag{12-81}$$

where $\boldsymbol{\delta}_k(\ell)$ is the $k$th column of matrix $D(\ell)$. As before, we form matrix $B(\ell)$, of size $n_\ell \times n_p$, by concatenating $\mathbf{b}(\ell)$ $n_p$ times in the horizontal direction:

$$B(\ell) = \underset{n_p\ \text{times}}{\text{concatenate}}\{\mathbf{b}(\ell)\} \tag{12-82}$$

As we mentioned earlier, backpropagation consists of four principal steps: (1) inputting the patterns, (2) a forward pass, (3) a backpropagation pass, and (4) a parameter-update step. The process begins by specifying the initial weights and biases as (small) random numbers. Table 12.3 summarizes the matrix formulations of these four steps. During training, the steps are repeated for a specified number of epochs, or until a predefined measure of error is deemed to be small enough.
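To make these equations concrete, the following is a minimal NumPy sketch of one training epoch, following the four steps summarized in Table 12.3. It is an illustration rather than the book's code: the sigmoid activation, the name train_epoch, and the choice of storing W and B in dictionaries keyed by layer number are our own, and X and R are assumed to hold one pattern and one membership vector per column.

```python
import numpy as np

def h(z):                  # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def h_prime(z):            # derivative of the sigmoid
    s = h(z)
    return s * (1.0 - s)

def train_epoch(X, R, W, B, alpha):
    """One epoch of matrix backpropagation (Table 12.3).
    W and B are dicts keyed by layer number 2..L; B[l] is b(l)
    concatenated n_p times horizontally, as in Eq. (12-82)."""
    L = max(W)                            # index of the output layer
    A = {1: X}                            # Step 1: input the patterns, A(1) = X
    Z = {}
    for l in range(2, L + 1):             # Step 2: forward pass
        Z[l] = W[l] @ A[l - 1] + B[l]
        A[l] = h(Z[l])
    D = {L: (A[L] - R) * h_prime(Z[L])}   # Eq. (12-78)
    for l in range(L - 1, 1, -1):         # Step 3: backpropagation, Eq. (12-79)
        D[l] = (W[l + 1].T @ D[l + 1]) * h_prime(Z[l])
    n_p = X.shape[1]
    for l in range(2, L + 1):             # Step 4: parameter updates
        W[l] = W[l] - alpha * D[l] @ A[l - 1].T                      # Eq. (12-80)
        b = B[l][:, :1] - alpha * D[l].sum(axis=1, keepdims=True)    # Eq. (12-81)
        B[l] = np.tile(b, (1, n_p))                                  # Eq. (12-82)
    return W, B
```

Calling train_epoch repeatedly, with W and B initialized to small random values, implements the epoch loop described in the text.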
TABLE 12.3
Matrix formulation for training a feedforward, fully connected multilayer neural network using backpropagation. Steps 1–4 are for one epoch of training. X, R, and the learning rate parameter α are provided to the network for training. The network is initialized by specifying the weights, W(ℓ), and biases, B(ℓ), as small random numbers.

Step 1 (Input patterns): $A(1) = X$.
Step 2 (Forward pass): For $\ell = 2, \ldots, L$, compute $Z(\ell) = W(\ell)A(\ell-1) + B(\ell)$, $A(\ell) = h(Z(\ell))$, and $h'(Z(\ell))$; for the output layer, compute $D(L) = (A(L) - R) \odot h'(Z(L))$.
Step 3 (Backpropagation): For $\ell = L-1, L-2, \ldots, 2$, compute $D(\ell) = \big(W^T(\ell+1)D(\ell+1)\big) \odot h'(Z(\ell))$.
Step 4 (Update weights and biases): For $\ell = 2, \ldots, L$, let $W(\ell) = W(\ell) - \alpha D(\ell)A^T(\ell-1)$, $\mathbf{b}(\ell) = \mathbf{b}(\ell) - \alpha\sum_{k=1}^{n_p}\boldsymbol{\delta}_k(\ell)$, and $B(\ell) = \text{concatenate}\{\mathbf{b}(\ell)\}$ ($n_p$ times), where the $\boldsymbol{\delta}_k(\ell)$ are the columns of $D(\ell)$.

There are two major types of errors in which we are interested. One is the classification error, which we compute by counting the number of patterns that were misclassified and dividing by the total number of patterns in the training set. Multiplying this fraction by 100 gives the percentage of patterns misclassified; subtracting the fraction from 1 and multiplying by 100 gives the percent correct recognition. The other is the mean squared error (MSE), which is based on actual values of E. For the error defined in Eq. (12-65), this value is obtained (for one pattern) by squaring the elements of a column of the matrix $(A(L) - R)$, adding them, and dividing the result by 2 (see Problem 12.28). Repeating this operation for all columns and dividing the result by the number of patterns in X gives the MSE over the entire training set.
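Both error measures are straightforward to compute from A(L) and R. The sketch below is ours (the helper names are hypothetical), assuming the one-pattern-per-column convention used above.

```python
import numpy as np

def classification_error(A_L, R):
    """Percent of patterns misclassified: compare the index of the largest
    output in each column of A(L) with the index of the 1 in that column of R."""
    wrong = np.argmax(A_L, axis=0) != np.argmax(R, axis=0)
    return 100.0 * np.mean(wrong)

def mse(A_L, R):
    """MSE over the training set: for each pattern (column), sum the squared
    output errors and divide by 2, then average over all patterns."""
    per_pattern = 0.5 * np.sum((A_L - R) ** 2, axis=0)
    return np.mean(per_pattern)
```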
EXAMPLE 12.12: Using a fully connected neural net to solve the XOR problem.

Figure 12.34(a) shows the XOR classification problem discussed previously (the coordinates were chosen to center the patterns for convenience in indexing, but the spatial relationships are as before). Pattern matrix X and class membership matrix R are

$$X = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -1 & 1 & -1 & 1 \end{bmatrix}, \qquad R = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}$$

FIGURE 12.34 Neural net solution to the XOR problem. (a) Four patterns in an XOR arrangement. (b) Results of classifying additional points in the range −1.5 to 1.5 in increments of 0.1. All solid points were classified as belonging to class $c_1$, and all open circles were classified as belonging to class $c_2$. Together, the two lines separating the regions constitute the decision boundary [compare with Fig. 12.27(b)]. (c) Decision surface, shown as a mesh. The decision boundary is the pair of dashed, white lines in the intersection of the surface and a plane perpendicular to the vertical axis, intersecting that axis at 0.5. (Figure (c) is shown in a different perspective than (b) in order to make all four patterns visible.)

Training the network with backpropagation yielded the following parameters:

$$W(2) = \begin{bmatrix} 4.792 & 4.792 \\ 4.486 & 4.486 \end{bmatrix};\quad \mathbf{b}(2) = \begin{bmatrix} 4.590 \\ -4.486 \end{bmatrix};\quad W(3) = \begin{bmatrix} -9.180 & 9.429 \\ 9.178 & -9.427 \end{bmatrix};\quad \mathbf{b}(3) = \begin{bmatrix} 4.420 \\ -4.419 \end{bmatrix}$$

Figure 12.35 shows the neural net based on these values. When presented with the four training patterns after training was completed, the results at the two outputs should have been equal to the values in R. Instead, the values were close:

$$A(3) = \begin{bmatrix} 0.987 & 0.990 & 0.010 & 0.010 \\ 0.013 & 0.010 & 0.990 & 0.990 \end{bmatrix}$$

These weights and biases, along with the sigmoid activation function, completely specify our trained neural network. To test its performance with values other than the training patterns, which we know it classifies correctly, we created a set of 2-D test patterns by subdividing the pattern space into increments of 0.1, from −1.5 to 1.5 in both directions, and classified the resulting points using a forward pass through the network.
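The trained XOR network can be verified numerically: a forward pass through the parameters above, using the sigmoid activation stated in the example, reproduces A(3) to three decimal places. The following check is our own, not part of the original example.

```python
import numpy as np

h = lambda z: 1.0 / (1.0 + np.exp(-z))    # sigmoid activation

X = np.array([[-1.0,  1.0,  1.0, -1.0],   # the four XOR patterns, one per column
              [-1.0,  1.0, -1.0,  1.0]])
W2 = np.array([[4.792, 4.792],
               [4.486, 4.486]])
b2 = np.array([[4.590], [-4.486]])
W3 = np.array([[-9.180,  9.429],
               [ 9.178, -9.427]])
b3 = np.array([[4.420], [-4.419]])

A2 = h(W2 @ X + b2)                       # hidden-layer outputs
A3 = h(W3 @ A2 + b3)                      # network outputs; compare with A(3)
print(np.round(A3, 3))                    # [[0.987 0.99 0.01 0.01] [0.013 0.01 0.99 0.99]]
```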
FIGURE 12.38 Neural net architecture used to classify the multispectral image data in Fig. 12.37 into three classes: water, urban, and vegetation. The network has four inputs, $x_1$ through $x_4$. The parameters shown were obtained in 50,000 epochs of training using α = 0.001, and include the weight values

$$W(2) = \begin{bmatrix} 2.393 & 1.020 & 1.249 & -15.965 \\ 6.599 & -2.705 & -0.912 & 14.928 \end{bmatrix} \qquad W(3) = \begin{bmatrix} 4.093 & -10.563 & -3.245 \\ 7.045 & 9.662 & 6.436 \end{bmatrix}$$

TABLE 12.5
Recognition performance on the training set as a function of training epochs. The learning rate constant was α = 0.001 in all cases.

Training epochs:    1,000   10,000   20,000   30,000   40,000   50,000   60,000   70,000   80,000
Recognition rate:   95.3%   96.6%    96.7%    96.8%    96.9%    97.0%    97.0%    97.0%    97.0%

FIGURE 12.39 MSE for the network architecture in Fig. 12.38 as a function of the number of training epochs (the plot shows mean squared error versus training epochs ×10⁴). The learning rate parameter was α = 0.001 in all cases.

Increasing this parameter to α = 0.1 resulted in a drop of the best correct recognition rate to 49.1%. Based on the preceding results, we used α = 0.001 and 50,000 epochs to train the network.

The parameters in Fig. 12.38 were the result of training. The recognition rate for the training data using these parameters was 97%. We achieved a recognition rate of 95.6% on the test set using the same parameters. The differences between these two figures and the 96.4% and 96.2% obtained, respectively, for the same data with the Bayes classifier (see Example 12.6) are statistically insignificant.

The fact that our neural networks achieved results comparable to those obtained with the Bayes classifier is not surprising. It can be shown (Duda, Hart, and Stork [2001]) that a three-layer neural net, trained by backpropagation using a sum-of-errors-squared criterion, approximates the Bayes decision functions in the limit, as the number of training samples approaches infinity. Although our training sets were small, the data were well behaved enough to yield results close to what theory predicts.

12.6 DEEP CONVOLUTIONAL NEURAL NETWORKS

Up to this point, pattern features have been specified by a human designer and extracted from images prior to being input to a neural network (Example 12.13 is an illustration of this approach). But one of the strengths of neural networks is that they are capable of learning pattern features directly from training data. What we would like to do is input a set of training images directly into a neural network, and have the network learn the necessary features on its own. One way to do this would be to convert images to vectors directly by organizing the pixels based on a linear index (see Fig. 12.1), and then letting each element (pixel) of the linear index be an element of the vector. However, this approach does not utilize any spatial relationships that may exist between pixels in an image, such as pixel arrangements into corners, the presence of edge segments, and other features that may help to differentiate one image from another. In this section, we present a class of neural networks called deep convolutional neural networks (CNNs or ConvNets for short) that accept images as inputs and are ideally suited for automatic learning and image classification. In order to differentiate between CNNs and the neural nets we studied in Section 12.5, we will refer to the latter as "fully connected" neural networks.
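As a sketch of the linear-index vectorization just mentioned (our example; NumPy's row-major ravel serves as the linear index, and the 28 × 28 size is arbitrary):

```python
import numpy as np

img = np.arange(28 * 28).reshape(28, 28)   # stand-in for a 28 x 28 image
x = img.ravel()                            # linear index: pixels become one long vector
print(x.shape)                             # (784,) -- spatial arrangement is discarded
```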
A BASIC CNN ARCHITECTURE

In the following discussion, we use a LeNet architecture (see the references at the end of this chapter) to introduce convolutional nets. We do this for two main reasons. First, the LeNet architecture is reasonably simple to understand, which makes it ideal for introducing basic CNN concepts. Second, our real interest is in deriving the equations of backpropagation for convolutional networks, a task that is simplified by the intuitiveness of LeNets.

The CNN in Fig. 12.40 contains all the basic elements of a LeNet architecture, and we use it without loss of generality. (To simplify the explanation of the CNN in Fig. 12.40, we focus attention initially on a single image input. Multiple input images are a trivial extension we will consider later in our discussion.) A key difference between this architecture and the neural net architectures we studied in the previous section is that inputs to CNNs are 2-D arrays (images), while inputs to our fully connected neural networks are vectors. However, as you will see shortly, the computations performed by both networks are very similar: (1) a sum of products is formed, (2) a bias value is added,
(3) the result is passed through an activation function, and (4) the activation value becomes a single input to a following layer.

Despite the fact that the computations performed by CNNs and fully connected neural nets are similar, there are some basic differences between the two, beyond their input formats being 2-D versus vectors. An important difference is that CNNs are capable of learning 2-D features directly from raw image data, as mentioned earlier. Because tools for systematically engineering comprehensive feature sets for complex image recognition tasks do not exist, having a system that can learn its own image features from raw image data is a crucial advantage of CNNs. Another major difference is in the way in which layers are connected. In a fully connected neural net, we feed the output of every neuron in a layer directly into the input of every neuron in the next layer. By contrast, in a CNN we feed into every input of a layer a single value, determined by a convolution (hence the name convolutional neural net) over a spatial neighborhood in the output of the previous layer. Therefore, CNNs are not fully connected in the sense defined in the last section. Another difference is that the 2-D arrays passed from one layer to the next are subsampled to reduce sensitivity to translational variations in the input. These differences, and their meaning, will become clear as we look at various CNN configurations in the following discussion.

Basics of How a CNN Operates

As noted above, the type of neighborhood processing used in CNNs is spatial convolution. We explained the mechanics of spatial convolution in Fig. 3.29, and expressed it mathematically in Eq. (3-35). As that equation shows, convolution computes a sum of products between pixels and a set of kernel weights. This operation is carried out at every spatial location in the input image, and the result at each location (x, y) is a scalar value. Think of this value as the output of a neuron in a layer of a fully connected neural net. If we add a bias and pass the result through an activation function (see Fig. 12.29), we have a complete analogy between the operation performed at one location in a convolutional layer and the computation performed by an artificial neuron. (We will discuss in the next subsection the exact form of neural computations in a CNN, and show they are equivalent in form to the computations performed by neurons in a fully connected neural net.)

These remarks are summarized in Fig. 12.40, the leftmost part of which shows a neighborhood at one location in the input image. In CNN terminology, these neighborhoods are called receptive fields. All a receptive field does is select a region of pixels in the input image. As the figure shows, the first operation performed by a CNN is convolution, whose values are generated by moving the receptive field over the image and, at each location, forming a sum of products of a set of weights and the pixels contained in the receptive field. The set of weights, arranged in the shape of the receptive field, is a kernel, as in Chapter 3. The number of spatial increments by which a receptive field is moved is called the stride. Our spatial convolutions in previous chapters had a stride of one, but that is not a requirement of the equations themselves. In CNNs, an important motivation for using strides greater than one is data reduction. For example, changing the stride from one to two reduces the image resolution by one-half in each spatial dimension, resulting in a three-fourths reduction in the amount of data per image. Another important motivation is as a substitute for subsampling which, as we discuss below, is used to reduce system sensitivity to spatial translation.

FIGURE 12.40 A CNN containing all the basic elements of a LeNet architecture. Points A and B are specific values to be addressed later in this section. The last pooled feature maps are vectorized and serve as the input to a fully connected neural network. The class to which the input image belongs is determined by the output neuron with the highest value. (The stages shown are: input image → feature maps → pooled feature maps → feature maps → pooled feature maps → fully connected neural net, with the receptive field, activation, subsampling, and vectorizing operations indicated between stages.)

To each convolution value (sum of products) we add a bias, then pass the result through an activation function to generate a single value. This value is fed to the corresponding (x, y) location in the input of the next layer. When repeated for all locations in the input image, the process just explained results in a 2-D set of values that we store in the next layer as a 2-D array, called a feature map. (In the terminology of Chapter 3, a feature map is a spatially filtered image.) This terminology is motivated by the fact that the role performed by convolution is to extract features such as edges, points, and blobs from the input (remember, convolution is the basis of spatial filtering, which we used in Chapter 3 for tasks such as smoothing, sharpening, and computing edges in an image). The same weights and a single bias are used to generate the convolution (feature map) values corresponding to all locations of the receptive field in the input image. This is done to cause the same feature to be detected at all points in the image. Using the same weights and bias for this purpose is called weight (or parameter) sharing.
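The following sketch, with our own function name and a random kernel standing in for learned weights, generates one feature map as just described: the receptive field slides across the image with a given stride and, at each location, a sum of products plus a bias is passed through a sigmoid activation. As is common in CNN implementations, no kernel rotation is performed; for learned kernels the distinction is immaterial.

```python
import numpy as np

def feature_map(image, kernel, bias, stride=1):
    """Slide the receptive field, form a sum of products, add a bias,
    and apply the activation. The receptive field is required to stay
    inside the image, so an m x n kernel on an M x N image gives a map
    of size ((M - m)//stride + 1) x ((N - n)//stride + 1)."""
    h = lambda z: 1.0 / (1.0 + np.exp(-z))        # sigmoid activation
    m, n = kernel.shape
    M, N = image.shape
    fmap = np.zeros(((M - m) // stride + 1, (N - n) // stride + 1))
    for i in range(fmap.shape[0]):
        for j in range(fmap.shape[1]):
            field = image[i*stride:i*stride + m, j*stride:j*stride + n]
            fmap[i, j] = h(np.sum(kernel * field) + bias)  # sum of products + bias
    return fmap

rng = np.random.default_rng(0)
img = rng.random((28, 28))
fm = feature_map(img, rng.random((5, 5)) - 0.5, bias=0.0)
print(fm.shape)     # (24, 24); with stride=2 it would be (12, 12)
```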
Figure 12.40 shows three feature maps in the first layer of the network. The other two feature maps are generated in the manner just explained, but using a different set of weights and bias for each feature map. Because each set of weights and bias is different, each feature map generally will contain a different set of features, all extracted from the same input image. The feature maps are referred to collectively as a convolutional layer. Thus, the CNN in Fig. 12.40 has two convolutional layers.

The process after convolution and activation is subsampling (also called pooling), which is motivated by a model of the mammalian visual cortex proposed by Hubel and Wiesel [1959]. Their findings suggest that parts of the visual cortex consist of simple and complex cells. The simple cells perform feature extraction, while the complex cells combine (aggregate) those features into a more meaningful whole. In this model, a reduction in spatial resolution appears to be responsible for achieving translational invariance. Pooling is a way of modeling this reduction in dimensionality. When training a CNN with large image databases, pooling has the additional
advantage of reducing the volume of data being processed. You can think of the results of subsampling as producing pooled feature maps; in other words, a pooled feature map is a feature map of reduced spatial resolution. Pooling is done by subdividing a feature map into a set of small (typically 2 × 2) regions, called pooling neighborhoods, and replacing all elements in such a neighborhood by a single value. We assume that pooling neighborhoods are adjacent (i.e., they do not overlap). (Adjacency is not a requirement of pooling per se. We assume it here for simplicity, and because it is an approach that is used frequently.) There are several ways to compute the pooled values; collectively, the different approaches are called pooling methods. Three common pooling methods are: (1) average pooling, in which the values in each neighborhood are replaced by the average of the values in the neighborhood; (2) max-pooling, which replaces the values in a neighborhood by the maximum value of its elements; and (3) L2 pooling, in which the resulting pooled value is the square root of the sum of the neighborhood values squared. There is one pooled feature map for each feature map, and the pooled feature maps are referred to collectively as a pooling layer. In Fig. 12.40 we used 2 × 2 pooling, so each resulting pooled map is one-fourth the size of the preceding feature map. The use of receptive fields, convolution, parameter sharing, and pooling are characteristics unique to CNNs.
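A sketch of the three pooling methods over non-overlapping neighborhoods (the function is ours; it assumes the neighborhoods tile the feature map):

```python
import numpy as np

def pool(fmap, size=2, method="max"):
    """Pool non-overlapping size x size neighborhoods of a feature map."""
    M, N = fmap.shape
    # gather each size x size neighborhood into the last two axes
    blocks = fmap[:M - M % size, :N - N % size]
    blocks = blocks.reshape(M // size, size, N // size, size).swapaxes(1, 2)
    if method == "average":
        return blocks.mean(axis=(2, 3))
    if method == "max":
        return blocks.max(axis=(2, 3))
    if method == "L2":                    # square root of the sum of squares
        return np.sqrt((blocks ** 2).sum(axis=(2, 3)))
    raise ValueError(method)

fm = np.arange(16.0).reshape(4, 4)
print(pool(fm, 2, "max"))   # 2 x 2 pooled map, one-fourth the original size
```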
Because feature maps are the result of spatial convolution, we know from Chapter 3 that they are simply filtered images. It then follows that pooled feature maps are filtered images of lower resolution. As Fig. 12.40 illustrates, the pooled feature maps in the first layer become the inputs to the next layer in the network. But, whereas we showed a single image as an input to the first layer, we now have multiple pooled feature maps (filtered images) that are inputs into the second layer.

To see how these multiple inputs to the second layer are handled, focus for a moment on one pooled feature map. To generate the values for the first feature map in the second convolutional layer, we perform convolution, add a bias, and use activation, as before. Then, we change the kernel and bias, and repeat the procedure for the second feature map, still using the same input. We do this for every remaining feature map, changing the kernel weights and bias for each. Then, we consider the next pooled feature map input and perform the same procedure (convolution, plus bias, plus activation) for every feature map in the second layer, using yet another set of different kernels and biases. When we are finished, we will have generated three values for the same location in every feature map, with one value coming from the corresponding location in each of the three inputs. The question now is: How do we combine these three individual values into one? The answer lies in the fact that convolution is a linear process, from which it follows that the three individual values are combined into one by superposition (that is, by adding them). (You could interpret the convolution with several input images as 3-D convolution, but with movement only in the spatial (x and y) directions. The result would be identical to summing the individual convolutions with each image separately, as we do here.)

In the first layer, we had one input image and three feature maps, so we needed three kernels to complete all required convolutions. In the second layer, we have three inputs and seven feature maps, so the total number of kernels (and biases) needed is 3 × 7 = 21. Each feature map is pooled to generate a corresponding pooled feature map, resulting in seven pooled feature maps. In Fig. 12.40, there are only two layers, so these seven pooled feature maps are the outputs of the last layer.
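A sketch of how one second-layer feature map could be generated from several pooled input maps: each input is convolved with its own kernel, the convolution values are summed (the superposition just described), a single bias is added, and the result is passed through the activation function. The function name and the random test data are our own.

```python
import numpy as np

def second_layer_map(inputs, kernels, bias):
    """One feature map in the second layer: convolve each input map with
    its own kernel, sum the results (superposition), add one bias, activate."""
    h = lambda z: 1.0 / (1.0 + np.exp(-z))
    m = kernels[0].shape[0]
    M, N = inputs[0].shape
    z = np.zeros((M - m + 1, N - m + 1))
    for img, ker in zip(inputs, kernels):          # one kernel per input map
        for i in range(z.shape[0]):
            for j in range(z.shape[1]):
                z[i, j] += np.sum(ker * img[i:i + m, j:j + m])
    return h(z + bias)

rng = np.random.default_rng(1)
pooled = [rng.random((12, 12)) for _ in range(3)]   # three pooled input maps
maps = [second_layer_map(pooled,
                         [rng.random((5, 5)) - 0.5 for _ in range(3)], 0.0)
        for _ in range(7)]                          # 7 maps x 3 inputs = 21 kernels
print(maps[0].shape)                                # (8, 8)
```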
As usual, the ultimate objective is to use features for classification, so we need a classifier. As Fig. 12.40 shows, in a CNN we perform classification by feeding the values of the last pooled layer into a fully connected neural net, the details of which you learned in Section 12.5. But the outputs of a CNN are 2-D arrays (i.e., filtered images of reduced resolution), whereas the inputs to a fully connected net are vectors. Therefore, we have to vectorize the 2-D pooled feature maps in the last layer. We do this using linear indexing (see Fig. 12.1). Each 2-D array in the last layer of the CNN is converted into a vector, and all the resulting vectors are concatenated (vertically, to form a column) into a single vector. This vector propagates through the neural net, as explained in Section 12.5. (The parameters of the fully connected neural net are learned during training of the CNN, to be discussed shortly.) In any given application, the number of outputs in the fully connected net is equal to the number of pattern classes being classified. As before, the output with the highest value determines the class of the input.

EXAMPLE 12.14: Receptive fields, pooling neighborhoods, and their corresponding feature maps.

The top row of Fig. 12.41 shows a numerical example of the relative sizes of feature maps and pooled feature maps as a function of the sizes of receptive fields and pooling neighborhoods. The input image is of size 28 × 28 pixels, and the receptive field is of size 5 × 5. If we require that the receptive field be contained in the image during convolution, you know from Section 3.4 that the resulting convolution array (feature map) will be of size 24 × 24. If we use a pooling neighborhood of size 2 × 2, the resulting pooled feature maps will be of size 12 × 12, as the figure shows. As noted earlier, we assume that pooling neighborhoods do not overlap.

As an analogy with fully connected neural nets, think of each element of a 2-D array in the top row of Fig. 12.41 as a neuron. The outputs of the neurons in the input are pixel values. The neurons in the feature map of the first layer have output values generated by convolving the input image with a kernel whose size and shape are the same as the receptive field, and whose coefficients are learned during training. To each convolution value we add a bias and pass the result through an activation function to generate the output value of the corresponding neuron in the feature map. The output values of the neurons in the pooled feature maps are generated by pooling the output values of the neurons in the feature maps.

The second row of Fig. 12.41 illustrates visually how feature maps and pooled feature maps look, based on the input image shown in the figure. The kernel shown is as described in the previous paragraph, and its weights (shown as intensity values) were learned from sample images during the training of the CNN described later in Example 12.17. Therefore, the nature of the learned features is determined by the learned kernel coefficients. Note that the contents of the feature maps are specific features detected by convolution. For example, some of the features emphasize edges in the character. As mentioned earlier, the pooled feature maps are lower-resolution versions of these features.

EXAMPLE 12.15: Graphical illustration of the functions performed by the components of a CNN.

Figure 12.42 shows the 28 × 28 image from Fig. 12.41, input into an expanded version of the CNN architecture from Fig. 12.40. The expanded CNN, which we will discuss in more detail in Example 12.17, has six feature maps in the first layer, and twelve in the second. It uses receptive fields of size 5 × 5, and pooling neighborhoods of size 2 × 2. Because the receptive fields are of size 5 × 5, the feature maps in the first layer are of size 24 × 24, as we explained in Example 12.14. Each feature map has its own set of weights and bias, so we will need a total of (5 × 5) × 6 + 6 = 156 parameters (six kernels with twenty-five weights each, and six biases) to generate the feature maps in the first layer. The top row of Fig. 12.43 shows these kernels, with the weights learned during training of the CNN displayed as images, with intensity proportional to kernel values.
FIGURE 12.43 Top: The weights (shown as images of size 5 × 5) corresponding to the six feature maps in the first layer of the CNN in Fig. 12.42. Bottom: The weights corresponding to the twelve feature maps in the second layer.

Because we used pooling neighborhoods of size 2 × 2, the pooled feature maps in the first layer of Fig. 12.42 are of size 12 × 12. As we discussed earlier, the number of feature maps and pooled feature maps is the same, so we will have six arrays of size 12 × 12 acting as inputs to the twelve feature maps in the second layer (the number of feature maps generally is different from layer to layer). Each feature map will have its own set of weights and bias, so we will need a total of 6 × (5 × 5) × 12 + 12 = 1812 parameters to generate the feature maps in the second layer (i.e., twelve sets of six kernels with twenty-five weights each, plus twelve biases). The bottom part of Fig. 12.43 shows these kernels as images. Because we are using receptive fields of size 5 × 5, the feature maps in the second layer are of size 8 × 8. Using 2 × 2 pooling neighborhoods resulted in pooled feature maps of size 4 × 4 in the second layer.

As we discussed earlier, the pooled feature maps in the last layer have to be vectorized before they can be input into the fully connected neural net. Each pooled feature map resulted in a column vector of size 16 × 1. There are 12 of these vectors which, when concatenated vertically, resulted in a single vector of size 192 × 1. Therefore, our fully connected neural net has 192 input neurons. There are ten numeral classes, so there are 10 output neurons. As you will see later, we obtained excellent performance by using a neural net with no hidden layers, so our complete neural net had a total of 192 input neurons and 10 output neurons.
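The sizes and parameter counts quoted in this and the previous example can be checked with a few lines of arithmetic (the helper names are ours):

```python
def conv_size(n, field):       # receptive field kept inside the image
    return n - field + 1

def pool_size(n, neigh):       # non-overlapping pooling neighborhoods
    return n // neigh

n = 28
n = conv_size(n, 5); print(n)  # 24: first-layer feature maps
n = pool_size(n, 2); print(n)  # 12: first-layer pooled maps
n = conv_size(n, 5); print(n)  # 8:  second-layer feature maps
n = pool_size(n, 2); print(n)  # 4:  second-layer pooled maps

print(5 * 5 * 6 + 6)           # 156 parameters in the first layer
print(6 * (5 * 5) * 12 + 12)   # 1812 parameters in the second layer
print(12 * 4 * 4)              # 192 inputs to the fully connected net
```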
FIGURE 12.42 Graphical illustration of the functions performed by both layers of the network. (Example 12.17 contains more details about this figure.)

The feature maps in the second layer of Fig. 12.42 are higher-level abstractions of the top part of the character, in the sense that they show an area flanked on both sides by areas of opposite intensity. These abstractions are not always easy to analyze visually but, as you will see in later examples, they can be very effective. The vectorized version of the last pooled layer is self-explanatory. The output of the fully connected neural net shows dark for low values and white for the highest value, indicating that the input was properly recognized as a number 6. Later in this section, we will show that the simple CNN architecture in Fig. 12.42 is capable of recognizing the correct class of over 70,000 numerical samples with nearly perfect accuracy.

Neural Computations in a CNN

Recall from Fig. 12.29 that the basic computation performed by an artificial neuron is a sum of products between weights and values from a previous layer. To this we add a bias, and pass the result through an activation function. Compare this with the computation performed by a CNN at a fixed location (x, y) in its input: for a receptive field of size 3 × 3, say, convolution forms a sum of products between the nine kernel weights and the nine values of the input spanned by the receptive field [Eq. (12-84)]. Denoting the kernel weights by $w_i$ and the corresponding values of the input by $a_i$, that sum of products can be written using a single (linear) index as

$$\sum_{i=1}^{9} w_i a_i \tag{12-85}$$

The results of Eqs. (12-84) and (12-85) are identical. If we add a bias to the latter equation and call the result z, we have

$$z = \sum_{j=1}^{9} w_j a_j + b = \mathbf{w}^T \mathbf{a}_{x,y} + b \tag{12-86}$$

where vectors $\mathbf{w}$ and $\mathbf{a}_{x,y}$ contain the kernel weights and the values in the receptive field at (x, y). The first form of this equation is identical to Eq. (12-54). Therefore, we conclude that if we add a bias to the spatial convolution computation performed by a CNN at any fixed position (x, y) in the input, the result can be expressed in a form identical to the computation performed by an artificial neuron in a fully connected neural net. We need the x, y only to account for the fact that we are working in 2-D. If we think of z as the net input to a neuron, the analogy with the neurons discussed in Section 12.5 is completed by passing z through an activation function, h, to get the output of the neuron:

$$a = h(z) \tag{12-87}$$

This is exactly how the value of any point in a feature map (such as the point labeled A in Fig. 12.40) is computed.
                                    is a sum of products between weights and values from a previous layer. To this we                                                                     Now consider point B in that figure. As mentioned earlier, its value is given by
                                    add a bias and call the result the net (total) input to the neuron, which we denoted                                                               adding three convolution equations:
                                    by zi . As we showed in Eq. (12-54), the sum involved in generating zi is a single sum.
                                    The computations performed in a CNN to generate a single value in a feature map
                                                                                                                                                                                               , k ! ax, y + wl , k ! ax, y + wl , k ! ax, y = ∑ ∑ wl , k ax −ll , y − k +
                                                                                                                                                                                                      ( 1)             (2)              ( 3)        ( 1) ( 1)
                                                                                                                                                                                            wl(1)             (2)              (3)
                                    is 2-D convolution. As you learned in Chapter 3, this is a double sum of products                                                                                                                                    l   k
                                                                                                                                                                                                                                                                                                                 (12-88)
                                    between the coefficients of a kernel and the corresponding elements of the image
                                    array overlapped by the kernel. With reference to Fig. 12.40, let w denote a kernel
                                                                                                                                                                                                                                                         ∑l ∑k wl(,2k)a(x2−)l ,y−k + ∑l ∑k wl(,2k)a(x2−)l ,y−k
                                    formed by arranging the weights in the shape of the receptive field we discussed
                                    in connection with that figure. For notational consistency with Section 12.5, let ax, y                                                                where the superscripts refer to the three pooled feature maps in Fig. 12.40. The val-
                                    denote image or pooled feature values, depending on the layer. The convolution                                                                         ues of l , k, x, and y are the same in all three equations because all three kernels are
                                    value at any point ( x, y) in the input is given by                                                                                                    of the same size and they move in unison. We could expand this equation and obtain
                                                                                                                                                                                           a sum of products that is lengthier than for point A in Fig. 12.40, but we could still
                                                                   w ! ax,y = ∑ ∑ wl , k ax − l , y − k                         (12-83)                                                    relabel all terms and obtain a sum of products that involves only one summation,
                                                                                  l   k                                                                                                    exactly as before.
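The equivalence expressed by Eqs. (12-85) through (12-87) is easy to verify numerically. The following sketch is our own illustration, not part of the text; the kernel, patch, bias, and sigmoid activation are arbitrary choices, and the kernel flip inherent in convolution is omitted because the relabeling absorbs it either way:

```python
import numpy as np

# Illustrative sketch: the CNN computation at one position (x, y), a double
# sum of products over the receptive field plus a bias, passed through an
# activation h, matches a fully connected neuron acting on the flattened
# (relabeled) receptive field. All values here are made up.

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))         # 3x3 kernel (receptive-field weights)
patch = rng.standard_normal((3, 3))     # values overlapped by the kernel
b = 0.5                                 # bias

h = lambda z: 1.0 / (1.0 + np.exp(-z))  # sigmoid activation (one choice of h)

z_conv = np.sum(w * patch) + b               # double sum, Eq. (12-86)
z_fc = np.dot(w.ravel(), patch.ravel()) + b  # single relabeled sum, Eq. (12-85)
assert np.isclose(z_conv, z_fc)

a = h(z_conv)                                # neuron output, Eq. (12-87)
```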
The preceding result tells us that the equations used to obtain the value of an element of any feature map in a CNN can be expressed in the form of the computation performed by an artificial neuron. This holds for any feature map, regardless of how many convolutions are involved in the computation of the elements of that feature map, in which case we would simply be dealing with the sum of more convolution equations. The implication is that we can use the basic form of Eqs. (12-86) and (12-87) to describe how the value of an element in any feature map of a CNN is obtained. This means we do not have to account explicitly for the number of different pooled feature maps (and hence the number of different kernels) used in a pooling layer. The result is a significant simplification of the equations that describe forward and backpropagation in a CNN.

Multiple Input Images

The values of $a_{x,y}$ just discussed are pixel values in the first layer but, in layers past the first, $a_{x,y}$ denotes values of pooled features. However, our equations do not differentiate based on what these variables actually represent. For example, suppose we replace the input to Fig. 12.40 with three images, such as the three components of an RGB image. The equations for the value of point A in the figure would now have the same form as those we stated for point B; only the weights and biases would be different. Thus, the results in the previous discussion for one input image are applicable directly to multiple input images. We will give an example of a CNN with three input images later in our discussion.

THE EQUATIONS OF A FORWARD PASS THROUGH A CNN

[Margin note] As noted earlier, a kernel is formed by organizing the weights in the shape of a corresponding receptive field. Also keep in mind that w and $a_{x,y}$ represent all the weights and corresponding values in a set of input images or pooled features.

We concluded in the preceding discussion that we can express the result of convolving a kernel, w, and an input array with values $a_{x,y}$, as

$$\begin{aligned} z_{x,y} &= \sum_{l}\sum_{k} w_{l,k}\, a_{x-l,\,y-k} + b \\ &= w \star a_{x,y} + b \end{aligned} \tag{12-89}$$

where l and k span the dimensions of the kernel, x and y span the dimensions of the input, and b is a bias. The corresponding value of $a_{x,y}$ is

$$a_{x,y} = h\!\left(z_{x,y}\right) \tag{12-90}$$

But this $a_{x,y}$ is different from the one we used to compute Eq. (12-89), in which $a_{x,y}$ represents values from the previous layer. Thus, we are going to need additional notation to differentiate between layers. As in fully connected neural nets, we use $\ell$ for this purpose, and write Eqs. (12-89) and (12-90) as

$$\begin{aligned} z_{x,y}(\ell) &= \sum_{l}\sum_{k} w_{l,k}(\ell)\, a_{x-l,\,y-k}(\ell-1) + b(\ell) \\ &= w(\ell) \star a_{x,y}(\ell-1) + b(\ell) \end{aligned} \tag{12-91}$$

and

$$a_{x,y}(\ell) = h\!\left(z_{x,y}(\ell)\right) \tag{12-92}$$

for $\ell = 1, 2, \ldots, L_c$, where $L_c$ is the number of convolutional layers, and $a_{x,y}(\ell)$ denotes the values of pooled features in convolutional layer $\ell$. When $\ell = 1$,

$$a_{x,y}(0) = \{\text{values of pixels in the input image(s)}\} \tag{12-93}$$

When $\ell = L_c$,

$$a_{x,y}(L_c) = \{\text{values of pooled features in the last layer of the CNN}\} \tag{12-94}$$

Note that $\ell$ starts at 1 instead of 2, as we did in Section 12.5. The reason is that we are naming layers, as in "convolutional layer $\ell$." It would be confusing to start at convolutional layer 2. Finally, we note that pooling does not require any convolutions. The only function of pooling is to reduce the spatial dimensions of the feature map preceding it, so we do not include explicit pooling equations here.

Equations (12-91) through (12-94) are all we need to compute all values in a forward pass through the convolutional section of a CNN. As described in Fig. 12.40, the values of the pooled features of the last layer are vectorized and fed into a fully connected feedforward neural network, whose forward propagation is explained in Eqs. (12-54) and (12-55) or, in matrix form, in Table 12.2.
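To make the forward-pass equations concrete, here is a minimal sketch of Eqs. (12-91) and (12-92) for one convolutional layer, followed by 2 × 2 average pooling of the kind used in the examples later in this section. The function names and the sigmoid activation are our own illustrative choices:

```python
import numpy as np
from scipy.signal import convolve2d

def conv_layer_forward(a_prev, kernels, biases,
                       h=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """a_prev: 2-D array of pixel or pooled-feature values, a_{x,y}(l-1).
    kernels: list of 2-D kernels w(l), one per feature map in layer l.
    biases:  list of scalars b(l), one per feature map."""
    feature_maps = []
    for w, b in zip(kernels, biases):
        z = convolve2d(a_prev, w, mode="valid") + b   # Eq. (12-91)
        feature_maps.append(h(z))                     # Eq. (12-92)
    return feature_maps

def avg_pool(a, size=2):
    """Average pooling: reduces each spatial dimension by `size`."""
    m, n = a.shape[0] // size, a.shape[1] // size
    return a[:m * size, :n * size].reshape(m, size, n, size).mean(axis=(1, 3))

# Example: a 6x6 input and two 3x3 kernels give two 4x4 feature maps,
# then two 2x2 pooled maps (the sizes used later, in Fig. 12.45).
rng = np.random.default_rng(1)
a0 = rng.random((6, 6))
kernels = [rng.standard_normal((3, 3)) for _ in range(2)]
pooled = [avg_pool(fm) for fm in conv_layer_forward(a0, kernels, [0.0, 0.0])]
print([p.shape for p in pooled])   # [(2, 2), (2, 2)]
```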
THE EQUATIONS OF BACKPROPAGATION USED TO TRAIN CNNS

As you saw in the previous section, the feedforward equations of a CNN are similar to those of a fully connected neural net, but with multiplication replaced by convolution, and with notation that reflects the fact that CNNs are not fully connected in the sense defined in Section 12.5. As you will see in this section, the equations of backpropagation also are similar in many respects to those in fully connected neural nets.

As in the derivation of backpropagation in Section 12.5, we start with the definition of how the output error of our CNN changes with respect to each neuron in the network. The form of the error is the same as for fully connected neural nets, but now it is a function of x and y instead of j:

$$\delta_{x,y}(\ell) = \frac{\partial E}{\partial z_{x,y}(\ell)} \tag{12-95}$$

As in Section 12.5, we want to relate this quantity to $\delta_{x,y}(\ell+1)$, which we again do using the chain rule:

$$\delta_{x,y}(\ell) = \frac{\partial E}{\partial z_{x,y}(\ell)} = \sum_{u}\sum_{v} \frac{\partial E}{\partial z_{u,v}(\ell+1)}\,\frac{\partial z_{u,v}(\ell+1)}{\partial z_{x,y}(\ell)} \tag{12-96}$$
where u and v are any two variables of summation over the range of possible values of z. As noted in Section 12.5, these summations result from applying the chain rule.

By definition, the first term of the double summation of Eq. (12-96) is $\delta_{u,v}(\ell+1)$, so we can write this equation as

$$\delta_{x,y}(\ell) = \frac{\partial E}{\partial z_{x,y}(\ell)} = \sum_{u}\sum_{v} \delta_{u,v}(\ell+1)\,\frac{\partial z_{u,v}(\ell+1)}{\partial z_{x,y}(\ell)} \tag{12-97}$$

Substituting Eq. (12-92) into Eq. (12-91), and using the resulting $z_{u,v}$ in Eq. (12-97), we obtain

$$\delta_{x,y}(\ell) = \sum_{u}\sum_{v}\delta_{u,v}(\ell+1)\,\frac{\partial}{\partial z_{x,y}(\ell)}\!\left[\,\sum_{l}\sum_{k} w_{l,k}(\ell+1)\,h\!\left(z_{u-l,\,v-k}(\ell)\right)+b(\ell+1)\right] \tag{12-98}$$

The derivative of the expression inside the brackets is zero unless $u-l=x$ and $v-k=y$ (the derivative of $b(\ell+1)$ with respect to $z_{x,y}(\ell)$ is also zero). If $u-l=x$ and $v-k=y$, then $l=u-x$ and $k=v-y$. Therefore, taking the indicated derivative of the expression in brackets, we can write Eq. (12-98) as

$$\delta_{x,y}(\ell) = \sum_{u}\sum_{v}\delta_{u,v}(\ell+1)\!\left[\,\sum_{u-x}\sum_{v-y} w_{u-x,\,v-y}(\ell+1)\,h'\!\left(z_{x,y}(\ell)\right)\right] \tag{12-99}$$

Values of x, y, u, and v are specified outside of the terms inside the brackets. Once the values of these variables are fixed, $u-x$ and $v-y$ inside the brackets are simply two constants. Therefore, the double summation evaluates to $w_{u-x,\,v-y}(\ell+1)\,h'\!\left(z_{x,y}(\ell)\right)$, and we can write Eq. (12-99) as

$$\begin{aligned}\delta_{x,y}(\ell) &= \sum_{u}\sum_{v}\delta_{u,v}(\ell+1)\,w_{u-x,\,v-y}(\ell+1)\,h'\!\left(z_{x,y}(\ell)\right)\\ &= h'\!\left(z_{x,y}(\ell)\right)\sum_{u}\sum_{v}\delta_{u,v}(\ell+1)\,w_{u-x,\,v-y}(\ell+1)\end{aligned} \tag{12-100}$$

The double sum expression in the second line of this equation is in the form of a convolution, but the displacements are the negatives of those in Eq. (12-91). Therefore, we can write Eq. (12-100) as

$$\delta_{x,y}(\ell) = h'\!\left(z_{x,y}(\ell)\right)\,\delta_{x,y}(\ell+1)\star w_{-x,-y}(\ell+1) \tag{12-101}$$

The negatives in the subscripts indicate that w is reflected about both spatial axes. This is the same as rotating w by 180°, as we explained in connection with Eq. (3-35). Using this fact, we finally arrive at an expression for the error at layer $\ell$ by writing Eq. (12-101) equivalently as

$$\delta_{x,y}(\ell) = h'\!\left(z_{x,y}(\ell)\right)\,\delta_{x,y}(\ell+1)\star \mathrm{rot180}\!\left(w_{x,y}(\ell+1)\right) \tag{12-102}$$

[Margin note] The 180° rotation is for each 2-D kernel in a layer.

But the kernels do not depend on x and y, so we can write this equation as

$$\delta_{x,y}(\ell) = h'\!\left(z_{x,y}(\ell)\right)\,\delta_{x,y}(\ell+1)\star \mathrm{rot180}\!\left(w(\ell+1)\right) \tag{12-103}$$
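Equation (12-103) translates directly into code. The sketch below is our own illustration, not from the text; it assumes a sigmoid activation and a "valid" forward convolution, in which case the backward step is a "full" convolution that restores the spatial size of layer $\ell$, and np.rot90(w, 2) performs the 180° rotation:

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_delta(delta_next, w_next, z):
    """Eq. (12-103): delta(l) = h'(z(l)) [delta(l+1) conv rot180(w(l+1))].
    delta_next: delta of layer l+1; w_next: a kernel of layer l+1;
    z: net input of layer l (same shape as the returned delta)."""
    return sigmoid_prime(z) * convolve2d(delta_next,
                                         np.rot90(w_next, 2), mode="full")
```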
As in Section 12.5, our final objective is to compute the change in E with respect to the weights and biases. Following a procedure similar to the one above, we obtain

$$\begin{aligned}\frac{\partial E}{\partial w_{l,k}} &= \sum_{x}\sum_{y}\frac{\partial E}{\partial z_{x,y}(\ell)}\,\frac{\partial z_{x,y}(\ell)}{\partial w_{l,k}}\\ &= \sum_{x}\sum_{y}\delta_{x,y}(\ell)\,\frac{\partial z_{x,y}(\ell)}{\partial w_{l,k}}\\ &= \sum_{x}\sum_{y}\delta_{x,y}(\ell)\,\frac{\partial}{\partial w_{l,k}}\!\left[\,\sum_{l}\sum_{k} w_{l,k}(\ell)\,h\!\left(z_{x-l,\,y-k}(\ell-1)\right)+b(\ell)\right]\\ &= \sum_{x}\sum_{y}\delta_{x,y}(\ell)\,h\!\left(z_{x-l,\,y-k}(\ell-1)\right)\\ &= \sum_{x}\sum_{y}\delta_{x,y}(\ell)\,a_{x-l,\,y-k}(\ell-1)\end{aligned} \tag{12-104}$$

where the last line follows from Eq. (12-92). This line is in the form of a convolution but, comparing it to Eq. (12-91), we see there is a sign reversal between the summation variables and their corresponding subscripts. To put it in the form of a convolution, we write the last line of Eq. (12-104) as

$$\frac{\partial E}{\partial w_{l,k}} = \sum_{x}\sum_{y}\delta_{x,y}(\ell)\,a_{-(l-x),\,-(k-y)}(\ell-1) \tag{12-105}$$

Similarly (see Problem 12.32),

$$\frac{\partial E}{\partial b(\ell)} = \sum_{x}\sum_{y}\delta_{x,y}(\ell) \tag{12-106}$$

Using the preceding two expressions in the gradient descent equations (see Section 12.5), it follows that

$$\begin{aligned} w_{l,k}(\ell) &= w_{l,k}(\ell) - \alpha\,\frac{\partial E}{\partial w_{l,k}}\\ &= w_{l,k}(\ell) - \alpha\,\delta_{l,k}(\ell)\star \mathrm{rot180}\!\left(a(\ell-1)\right)\end{aligned} \tag{12-107}$$
and

$$\begin{aligned} b(\ell) &= b(\ell) - \alpha\,\frac{\partial E}{\partial b(\ell)}\\ &= b(\ell) - \alpha\sum_{x}\sum_{y}\delta_{x,y}(\ell)\end{aligned} \tag{12-108}$$

Equations (12-107) and (12-108) update the weights and bias of each convolution layer in a CNN. As we have mentioned before, it is understood that $w_{l,k}$ represents all the weights of a layer. The variables l and k span the spatial dimensions of the 2-D kernels, all of which are of the same size.

In a forward pass, we went from a convolution layer to a pooled layer. In backpropagation, we go in the opposite direction. But the pooled feature maps are smaller than their corresponding feature maps (see Fig. 12.40). Therefore, when going in the reverse direction, we upsample (e.g., by pixel replication) each pooled feature map to match the size of the feature map that generated it. Each pooled feature map corresponds to a unique feature map, so the path of backpropagation is clearly defined.
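A companion sketch (again ours, under the same assumptions as the earlier one) of Eqs. (12-105) through (12-108), together with the upsampling step just described for 2 × 2 average pooling:

```python
import numpy as np
from scipy.signal import convolve2d

def weight_gradient(delta, a_prev):
    # Eq. (12-105): dE/dw_{l,k} = sum_x sum_y delta_{x,y} a_{x-l,y-k}.
    # Correlating the layer input with delta and rotating the result by
    # 180 degrees reproduces this subscript ordering.
    return np.rot90(convolve2d(a_prev, np.rot90(delta, 2), mode="valid"), 2)

def bias_gradient(delta):
    # Eq. (12-106): sum of delta over all spatial positions.
    return delta.sum()

def update(w, b, delta, a_prev, alpha):
    # Eqs. (12-107) and (12-108): gradient descent on one kernel and bias.
    return (w - alpha * weight_gradient(delta, a_prev),
            b - alpha * bias_gradient(delta))

def upsample_delta(delta_pooled, size=2):
    # Pixel replication via np.kron expands each delta value into a
    # size-by-size block; for average pooling, each input contributed
    # 1/(size*size) to the pooled value, hence the division.
    return np.kron(delta_pooled, np.ones((size, size))) / (size * size)
```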
With reference to Fig. 12.40, backpropagation starts at the output of the fully connected neural net. We know from Section 12.5 how to update the weights of this network. When we get to the "interface" between the neural net and the CNN, we have to reverse the vectorization method used to generate the input vectors. That is, before we can proceed with backpropagation using Eqs. (12-107) and (12-108), we have to regenerate the individual pooled feature maps from the single vector propagated back by the fully connected neural net.

We summarized in Table 12.3 the backpropagation steps for a fully connected neural net. Table 12.6 summarizes the steps for performing backpropagation in the CNN architecture in Fig. 12.40. The procedure is repeated for a specified number of epochs, or until the output error of the neural net reaches an acceptable value. The error is computed exactly as we did in Section 12.5. It can be the mean squared error, or the recognition error. Keep in mind that the weights in $w(\ell)$ and the bias value $b(\ell)$ are different for each feature map in layer $\ell$.

TABLE 12.6 The principal steps used to train a CNN. The network is initialized with a set of small random weights and biases. In backpropagation, a vector arriving (from the fully connected net) at the output pooling layer must be converted to 2-D arrays of the same size as the pooled feature maps in that layer. Each pooled feature map is upsampled to match the size of its corresponding feature map. The steps in the table are for one epoch of training.

FIGURE 12.45 (caption, fragment): … layer used to learn to recognize the images in Fig. 12.46. (Figure labels: image of size 6 × 6; two feature maps of size 4 × 4; two pooled feature maps of size 2 × 2; vectorization; 8 input neurons; fully connected two-layer neural net; 3 output neurons.)

EXAMPLE 12.16: Teaching a CNN to recognize some simple images.

We begin our illustrations of CNN performance by teaching the CNN in Fig. 12.45 to recognize the small 6 × 6 images in Fig. 12.46. As you can see on the left of this figure, there are three samples each of images of a horizontal stripe, a small centered square, and a vertical stripe. These images were used as the training set. On the right are noisy samples of images in these three categories. These were used as the test set.
As Fig. 12.45 shows, the inputs to our system are single images. We used a receptive field of size 3 × 3, which resulted in feature maps of size 4 × 4. There are two feature maps, which means we need two kernels of size 3 × 3, and two biases. The pooled feature maps were generated using average pooling in neighborhoods of size 2 × 2. This resulted in two pooled feature maps of size 2 × 2, because the feature maps are of size 4 × 4. The two pooled maps contain eight total elements, which were organized as an 8-D column vector to vectorize the output of the last layer. (We used linear indexing of each image, then concatenated the two resulting 4-D vectors into a single 8-D vector.) This vector was then fed into the fully connected neural net on the right, which consists of the input layer and a three-neuron output layer, one neuron per class. Because this network has no hidden layers, it implements linear decision functions (see Problem 12.18). To train the system, we used α = 1.0 and ran the system for 400 epochs. Figure 12.47 is a plot of the MSE as a function of epoch. Perfect recognition of the training set was achieved after approximately 100 epochs of training, despite the fact that the MSE was relatively high there. Recognition of the test set was 100% as well. The kernel and bias values learned by the system were:

$$w_1 = \begin{bmatrix} 3.0132 & 1.1808 & -0.0945 \\ 0.9718 & 0.7087 & -0.9093 \\ 0.7193 & 0.0230 & -0.8833 \end{bmatrix},\quad b_1 = -0.2990 \qquad w_2 = \begin{bmatrix} -0.7388 & 1.8832 & 4.1077 \\ -1.0027 & 0.3908 & 2.0357 \\ -1.2164 & -1.1853 & -0.1987 \end{bmatrix},\quad b_2 = -0.2834$$

It is important to note that the CNN learned these parameters automatically from the raw training images. No features in the sense discussed in Chapter 11 were employed.
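The vectorization described above, and its reversal during backpropagation, can be sketched as follows. Column-major linear indexing is our own assumption here; any fixed ordering works, as long as backpropagation reverses the same one:

```python
import numpy as np

# Each 2x2 pooled feature map is linearly indexed into a 4-D vector, and
# the two vectors are concatenated into the 8-D column vector fed to the
# fully connected net. Values below are made up for illustration.

pooled_1 = np.array([[0.1, 0.2],
                     [0.3, 0.4]])   # first pooled feature map (2x2)
pooled_2 = np.array([[0.5, 0.6],
                     [0.7, 0.8]])   # second pooled feature map (2x2)

x = np.concatenate([pooled_1.ravel(order="F"),
                    pooled_2.ravel(order="F")])
print(x.shape)                       # (8,)

# Reversing the vectorization during backpropagation:
d1 = x[:4].reshape(2, 2, order="F")
d2 = x[4:].reshape(2, 2, order="F")
```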
EXAMPLE 12.17: Using a large training set to teach a CNN to recognize handwritten numerals.

In this example, we look at a more practical application using a database containing 60,000 training and 10,000 test images of handwritten numeric characters. The content of this database, called the MNIST database, is similar to a database from NIST (National Institute of Standards and Technology). The former is a "cleaned up" version of the latter, in which the characters have been centered and formatted into grayscale images of size 28 × 28 pixels. Both databases are freely available online. Figure 12.48 shows examples of typical numeric characters available in the databases. As you can see, there is significant variability in the characters, and this is just a small sampling of the 70,000 characters available for experimentation.

Figure 12.49 shows the architecture of the CNN we trained to recognize the ten digits in the MNIST database. We trained the system for 200 epochs using α = 1.0. Figure 12.50 shows the training MSE as a function of epoch for the 60,000 training images in the MNIST database.

Training was done using mini-batches of 50 images at a time to improve the learning rate (see the discussion in Section 12.7). We also classified all images of the training set and all images of the test set after each epoch of training. The objective of doing this was to see how quickly the system was learning the characteristics of the data. Figure 12.51 shows the results. A high level of correct recognition performance was achieved after relatively few epochs for both data sets, with approximately 98% correct recognition achieved after about 40 epochs. This is consistent with the training MSE in Fig. 12.50, which dropped quickly, then began a slow descent after about 40 epochs. Another 160 epochs of training were required for the system to achieve recognition of about 99.9%. These are impressive results for such a small CNN.

FIGURE 12.49 CNN used to recognize the ten digits in the MNIST database. The system was trained with 60,000 numerical character images of the same size as the image shown on the left. This architecture is the same as the architecture we used in Fig. 12.42. (Image courtesy of NIST.) (Figure labels: image of size 28 × 28; 6 feature maps of size 24 × 24; 6 pooled feature maps of size 12 × 12; 12 feature maps of size 8 × 8; 12 pooled feature maps of size 4 × 4; vectorization; 192 input neurons; fully connected two-layer neural net; 10 output neurons.)
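The layer sizes quoted in Fig. 12.49 are easy to verify. The arithmetic below is our own check, using the 5 × 5 receptive fields and 2 × 2 pooling described later in this example:

```python
# Size bookkeeping for the architecture in Fig. 12.49, assuming 5x5
# kernels with "valid" convolution and 2x2 pooling.

def conv_out(n, k=5):
    return n - k + 1      # valid convolution shrinks by k - 1

def pool_out(n, s=2):
    return n // s         # 2x2 pooling halves each dimension

n = 28                    # input image is 28 x 28
n = conv_out(n)           # 24 x 24 feature maps (6 of them)
n = pool_out(n)           # 12 x 12 pooled maps
n = conv_out(n)           # 8 x 8 feature maps (12 of them)
n = pool_out(n)           # 4 x 4 pooled maps
print(12 * n * n)         # 192 input neurons after vectorization
```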
FIGURE 12.50 Training MSE as a function of epoch for the 60,000 training digit images in the MNIST database.

FIGURE 12.52 (a) Recognition accuracy of the training set by image class. Each bar shows a number between 0 and 1. When multiplied by 100%, these numbers give the correct recognition percentage for that class. (b) Recognition results per class in the test set. In both graphs the recognition rate is above 98%.

Figure 12.52 shows recognition performance on each digit class for both the training and test sets. The most revealing feature of these two graphs is that the CNN did equally well on both sets of data. This is a good indication that the training was successful, and that it generalized well to digits it had not seen before. This is an example of the neural network not "over-fitting" the data in the training set.

Figure 12.53 shows the values of the kernels for the first feature map, displayed as intensities. There is one input image and six feature maps, so six kernels are required to generate the feature maps of the first layer. The dimensions of the kernels are the same as the receptive field, which we set at 5 × 5. Thus, the first image on the left in Fig. 12.53 is the 5 × 5 kernel corresponding to the first feature map. Figure 12.54 shows the kernels for the second layer. In this layer, we have six inputs (which are the pooled maps of the first layer) and twelve feature maps, so we need a total of 6 × 12 = 72 kernels and biases to generate the twelve feature maps in the second layer. Each column of Fig. 12.54 shows the six 5 × 5 kernels corresponding to one of the feature maps in the second layer. We used 2 × 2 pooling in both layers, resulting in a 50% reduction of each of the two spatial dimensions of the feature maps.

Finally, it is of interest to visualize how one input image proceeds through the network, using the kernels learned during training. Figure 12.55 shows an input digit image from the test set, and the computations performed by the CNN at each layer. As before, we display numerical results as intensities.

Consider the results of convolution in the first layer. If you look at each resulting feature map carefully, you will notice that it highlights a different characteristic of the input. For example, the feature map on the top of the first column highlights the two vertical edges on the top of the character. The second highlights the edges of the entire inner region, and the third highlights a "blob-like" feature of the digit, as if it had been blurred by a lowpass kernel. The other three feature maps show other features. If you now look at the first two feature maps in the second layer, and compare them with the first feature map in the first layer, you can see that they could be interpreted as higher-level abstractions of the top of the character, in the sense that they show a dark area flanked on each side by white areas. Although these abstractions are not always easy to analyze visually, this example clearly demonstrates that they can be very effective. And, remember the important fact that our simple system learned these features automatically from 60,000 training images. This capability is what makes convolutional networks so powerful when it comes to image pattern classification. In the next example, we will consider even more complex images, and show some of the limitations of our simple CNN architecture.

FIGURE 12.51 (a) Training accuracy (percent correct recognition of the training set) as a function of epoch for the 60,000 training images in the MNIST database. The maximum achieved was 99.36% correct recognition. (b) Accuracy as a function of epoch for the 10,000 test images in the MNIST database. The maximum correct recognition rate was 99.13%.

EXAMPLE 12.18: Using a large image database to teach a CNN to recognize natural images.

In this example, we trained the same CNN architecture as in Fig. 12.49, but using the RGB color images in Fig. 12.56. These images are representative of those found in the CIFAR-10 database, a popular database used to test the performance of image classification systems. Our objective was to test the limitations of the CNN architecture in Fig. 12.49 by training it with data that is significantly more complex than the MNIST images in Example 12.17. The only difference between the architecture needed to
FIGURE 12.54 Kernels of the second layer after 200 epochs of training, displayed as images of size 5 × 5. There are six inputs (pooled feature maps) into the second layer. Because there are twelve feature maps in the second layer, the CNN learned the weights of 6 × 12 = 72 kernels.

[FIGURE 12.56: sample CIFAR-10 images; the visible row labels are Dog, Frog, Horse, and Ship. (Images courtesy of Pearson Education.)]

[Additional figures showed the training MSE as a function of epoch, and a comparison of recognition accuracies in the range 0.64 to 0.72.]

FIGURE 12.61 Weights of the kernels of the second convolution layer after 500 epochs of training. The interpretation of these kernels is the same as in Fig. 12.54.
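To make the kernel bookkeeping in Figs. 12.54 and 12.61 concrete, the following NumPy sketch (ours, not code from the book) builds a second convolution layer with six pooled input maps and twelve feature maps. The 12 × 12 input size and 5 × 5 kernels are assumptions consistent with the architecture discussed above; each feature map sums the convolutions of all six inputs with its own six kernels, which is where the 6 × 12 = 72 kernel count comes from.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2-D cross-correlation: no padding, stride 1."""
    H, W = x.shape
    m, n = k.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(H - m + 1):
        for j in range(W - n + 1):
            out[i, j] = np.sum(x[i:i + m, j:j + n] * k)
    return out

rng = np.random.default_rng(0)
pooled = rng.random((6, 12, 12))      # six pooled maps from layer 1 (assumed 12 x 12)
kernels = rng.random((12, 6, 5, 5))   # 12 maps x 6 inputs = 72 kernels of size 5 x 5
biases = rng.random(12)               # one bias per feature map

# Each of the twelve feature maps sums the convolutions of all six
# inputs with its own six kernels, then adds a single bias.
maps = np.array([
    sum(conv2d_valid(pooled[i], kernels[f, i]) for i in range(6)) + biases[f]
    for f in range(12)
])
print(maps.shape)                           # (12, 8, 8): 12 - 5 + 1 = 8
print(kernels.shape[0] * kernels.shape[1])  # 72, matching Fig. 12.54
```

In a trained network the kernels would of course be learned by backpropagation rather than drawn at random; the sketch only illustrates the shape arithmetic.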
…images, but this must be done with care because the relative sizes of features of interest would increase proportionally, thus influencing the size selected for receptive fields.

After the number of layers has been specified, the next task is to specify the number of neurons per layer. We always know how many neurons are needed in the first and last layers, but the number of neurons for the internal layers is also an open question with no theoretical "best" answer. If the objective is to keep the number of layers as small as possible, the power of the network can be increased to some degree by increasing the number of neurons per layer.

The main aspects of specifying the architecture of a neural network are completed by specifying the activation function. In this chapter, we worked with sigmoid functions for consistency between examples, but there are applications in which hyperbolic tangent and ReLU activation functions are superior in terms of improving training performance.
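The three activation functions just mentioned, and the derivatives that backpropagation needs (Problem 12.16 asks you to derive them), can be written in a few lines. This is a minimal sketch of the standard definitions, not code from the book:

```python
import numpy as np

def sigmoid(z):
    """h(z) = 1 / (1 + e^{-z}); derivative h'(z) = h(z)(1 - h(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """h(z) = tanh(z); derivative h'(z) = 1 - h(z)^2."""
    return np.tanh(z)

def relu(z):
    """h(z) = max(0, z); derivative is 1 for z > 0 and 0 otherwise."""
    return np.maximum(0.0, z)

z = np.linspace(-4, 4, 9)
for h, dh in [(sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))),
              (tanh,    lambda z: 1 - np.tanh(z) ** 2),
              (relu,    lambda z: (z > 0).astype(float))]:
    print(h.__name__, h(z).round(2), dh(z).round(2))
```

Note how the ReLU derivative does not saturate for large positive z, which is one reason it often trains faster than the sigmoid.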
Once a network architecture has been specified, training is the central aspect of making the architecture useful. Although the networks we discussed in this chapter are relatively simple, networks applied to very large-scale problems can have millions of nodes and require large blocks of time to train. When available, the parameters of a pretrained network are an ideal starting point for further training, or for validating recognition performance. Another central theme in training neural nets is the use of GPUs to accelerate matrix operations.

An issue often encountered in training is overfitting, in which recognition of the training set is acceptable, but the recognition rate on samples not used for training is much lower. That is, the net is not able to generalize what it learned and apply it to inputs it has not encountered before. When additional training data are not available, the most common approach is to artificially enlarge the training set using transformations such as geometric distortions and intensity variations, carried out in a way that preserves the class membership of the transformed patterns. Another major approach is dropout, a technique that randomly drops nodes, with their connections, from a neural network during training. The idea is to change the architecture slightly to prevent the net from adapting too much to a fixed set of parameters (see Srivastava et al. [2014]); a sketch of the idea follows.
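The following is a minimal sketch of one common way to implement dropout ("inverted" dropout, our illustration; the book gives no code). During training, each activation is zeroed with probability p_drop and the survivors are rescaled so the expected activation is unchanged; at test time the layer is left untouched:

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout_forward(a, p_drop=0.5, training=True):
    """Randomly zero each activation with probability p_drop during
    training, scaling the survivors by 1/(1 - p_drop) so the expected
    value is unchanged. At test time, pass activations through as-is."""
    if not training:
        return a
    mask = (rng.random(a.shape) >= p_drop) / (1.0 - p_drop)
    return a * mask

a = rng.random((4, 3))                     # activations of some hidden layer
print(dropout_forward(a))                  # roughly half the entries zeroed
print(dropout_forward(a, training=False))  # unchanged at test time
```

Because a different random mask is drawn at every training step, the network effectively trains an ensemble of thinned architectures, which is what discourages co-adaptation of the weights.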
In addition to computational speed, another important aspect of training is efficiency. Simple measures, such as shuffling the input patterns at the beginning of each training epoch, can reduce or eliminate the possibility of "cycling," in which parameter values repeat at regular intervals. Stochastic gradient descent is another important training refinement in which, instead of using the entire training set, samples are selected randomly and input into the network. You can think of this as dividing the training set into mini-batches, and then choosing a single sample from each mini-batch. This approach often results in speedier convergence during training.
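In schematic form, the shuffling and mini-batch selection just described might look as follows. This is our sketch under the convention (as in Table 12.3) that patterns are stored as columns of X; the update step is a placeholder for a forward pass, backpropagation, and a weight update:

```python
import numpy as np

rng = np.random.default_rng(2)

def sgd_epochs(X, update_fn, batch_size=32, epochs=5):
    """One pass ('epoch') at a time: shuffle the pattern indices to avoid
    cycling, split them into mini-batches, and feed each batch to the
    parameter-update step supplied by the caller."""
    n = X.shape[1]                      # patterns stored as columns
    for _ in range(epochs):
        order = rng.permutation(n)      # fresh shuffle every epoch
        for start in range(0, n, batch_size):
            batch = X[:, order[start:start + batch_size]]
            update_fn(batch)            # e.g., forward pass + backprop + update

# Hypothetical usage: count how many updates occur in one epoch.
X = rng.random((784, 60000))            # an MNIST-sized training set
counter = []
sgd_epochs(X, update_fn=lambda b: counter.append(b.shape[1]), epochs=1)
print(len(counter), "mini-batch updates in one epoch")  # 1875 for batch_size 32
```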
In addition to the above topics, a paper by LeCun et al. [2012] is an excellent overview of the types of considerations introduced in the preceding discussion. In fact, the breadth spanned by these topics is extensive enough to be the subject of an entire book (see Montavon et al. [2012]). The neural net architectures we discussed were by necessity limited in scope. You can get a good idea of the practical requirements of implementing real-world networks by reading a paper by Krizhevsky, Sutskever, and Hinton [2012], which summarizes the design and implementation of a large-scale, deep convolutional neural network. There are a multitude of designs that have been implemented over the past decade, including commercial and free implementations. A quick internet search will reveal the many architectures available.

Summary, References, and Further Reading

Background material for Sections 12.1 through 12.4 is found in the books by Theodoridis and Koutroumbas [2006], by Duda, Hart, and Stork [2001], and by Tou and Gonzalez [1974]. For additional reading on matching shape numbers, see Bribiesca and Guzman [1980]. On string matching, see Sze and Yang [1981]. A significant portion of this chapter was devoted to neural networks. This is a reflection of the fact that neural nets, and in particular convolutional neural nets, have made significant strides in the past decade in solving image pattern classification problems. As in the rest of the book, our presentation of this topic focused on fundamentals, but the topics covered were thoroughly developed. What you have learned in this chapter is a solid foundation for much of the work being conducted in this area. As we mentioned earlier, the literature on neural nets is vast and quickly growing. As a starting point, the basic book by Nielsen [2015] provides an excellent introduction to the topic. The more advanced book by Goodfellow, Bengio, and Courville [2016] provides more depth into the mathematical underpinnings of neural nets. Two classic papers worth reading are by Rumelhart, Hinton, and Williams [1986], and by LeCun, Bengio, and Haffner [1998]. The LeNet architecture we discussed in Section 12.6 was introduced in the latter reference, and it is still a foundation for image pattern classification. A recent survey article by LeCun, Bengio, and Hinton [2015] gives an interesting perspective on the scope of applicability of neural nets in general. The paper by Krizhevsky, Sutskever, and Hinton [2012] was one of the most important catalysts leading to the significant increase in the present interest in convolutional networks, and in their applicability to image pattern classification. That paper is also a good overview of the details and techniques involved in implementing a large-scale convolutional neural network. For details on the software aspects of many of the examples in this chapter, see Gonzalez, Woods, and Eddins [2009].

Problems

Solutions to the problems marked with an asterisk (*) are in the DIP4E Student Support Package (consult the book website: www.ImageProcessingPlace.com).

12.1 Do the following:
      (a) * Compute the decision functions of a minimum distance classifier for the patterns in Fig. 12.10. You may obtain the required mean vectors by (careful) inspection.
      (b) Sketch the decision boundary implemented by the decision functions in (a).

12.2 * Show that Eqs. (12-3) and (12-4) perform the same function in terms of pattern classification.

12.3 Show that the boundary given by Eq. (12-8) is the perpendicular bisector of the line joining the n-dimensional points mᵢ and mⱼ.

12.4 * Show how the minimum distance classifier discussed in connection with Fig. 12.11 could be implemented by using Nc resistor banks (Nc is the number of classes), a summing junction at each bank (for summing currents), and a maximum selector capable of selecting the maximum value of Nc decision functions in order to determine the class membership of a given input.

12.5 * Show that the correlation coefficient of Eq. (12-10) has values in the range [−1, 1]. (Hint: Express g in vector form.)

12.6 Show that the distance measure D(a, b) in Eq. (12-12) satisfies the properties in Eq. (12-13).

12.7 * Show that β = max(|a|, |b|) − α in Eq. (12-14) is 0 if and only if a and b are identical strings.

12.8 Carry out the manual computations that resulted in the mean vector and covariance matrices in Example 12.5.

12.9 * The following pattern classes have Gaussian probability density functions:
      c1: {(0, 0)ᵀ, (2, 0)ᵀ, (2, 2)ᵀ, (0, 2)ᵀ}
      c2: {(4, 4)ᵀ, (6, 4)ᵀ, (6, 6)ᵀ, (4, 6)ᵀ}
      (a) Assume that P(c1) = P(c2) = 1/2 and obtain the equation of the Bayes decision boundary between these two classes.
      (b) Sketch the boundary.

12.10 Repeat Problem 12.9, but use the following pattern classes:
      c1: {(−1, 0)ᵀ, (0, −1)ᵀ, (1, 0)ᵀ, (0, 1)ᵀ}
      c2: {(−2, 0)ᵀ, (0, −2)ᵀ, (2, 0)ᵀ, (0, 2)ᵀ}
      Note that the classes are not linearly separable.

12.11 With reference to the results in Table 12.1, compute the overall correct recognition rate for the patterns of the training set. Repeat for the patterns of the test set.

12.12 * We derived the Bayes decision functions dⱼ(x) = p(x|cⱼ)P(cⱼ), j = 1, 2, …, Nc, using a 0-1 loss function. Prove that these decision functions minimize the probability of error. (Hint: The probability of error p(e) is 1 − p(c), where p(c) is the probability of being correct. For a pattern vector x belonging to class cᵢ, p(c|x) = p(cᵢ|x). Find p(c) and show that p(c) is maximum [p(e) is minimum] when p(x|cᵢ)P(cᵢ) is maximum.)

12.13 Finish the computations started in Example 12.7.

12.14 * The perceptron algorithm given in Eqs. (12-44) through (12-46) can be expressed in a more concise form by multiplying the patterns of class c2 by −1, in which case the correction steps in the algorithm become w(k + 1) = w(k), if wᵀ(k)y(k) > 0, and w(k + 1) = w(k) + αy(k) otherwise, where we use y instead of x to make it clear that the patterns of class c2 were multiplied by −1. This is one of several perceptron algorithm formulations that can be derived starting from the general gradient descent equation

            w(k + 1) = w(k) − α [∂J(w, y)/∂w]|w = w(k)

      where α > 0, J(w, y) is a criterion function, and the partial derivative is evaluated at w = w(k). Show that the perceptron algorithm in the problem statement can be obtained from this general gradient descent procedure by using the criterion function

            J(w, y) = ½(|wᵀy| − wᵀy)

      (Hint: The partial derivative of wᵀy with respect to w is y.)

12.15 * Prove that the perceptron training algorithm given in Eqs. (12-44) through (12-46) converges in a finite number of steps if the training pattern sets are linearly separable. [Hint: Multiply the patterns of class c2 by −1 and consider a nonnegative threshold T0, so that the perceptron training algorithm (with α = 1) is expressed in the form w(k + 1) = w(k), if wᵀ(k)y(k) > T0, and w(k + 1) = w(k) + αy(k) otherwise. You may need to use the Cauchy-Schwartz inequality: ‖a‖²‖b‖² ≥ (aᵀb)².]

12.16 Derive the equations for the derivatives of the following activation functions:
      (a) The sigmoid activation function in Fig. 12.30(a).
      (b) The hyperbolic tangent activation function in Fig. 12.30(b).
      (c) * The ReLU activation function in Fig. 12.30(c).

12.17 * Specify the structure, weights, and bias(es) of the smallest neural network capable of performing exactly the same function as a minimum distance classifier for two pattern classes in n-dimensional space. You may assume that the classes are tightly grouped and are linearly separable.

12.18 What is the decision boundary implemented by a neural network with n inputs, a single output neuron, and no hidden layers? Explain.

12.19 Specify the structure, weights, and bias of a neural network capable of performing exactly the same function as a Bayes classifier for two pattern classes in n-dimensional space. The classes are Gaussian with different means but equal covariance matrices.

12.20 Answer the following:
      (a) * Under what conditions are the neural networks in Problems 12.17 and 12.19 identical?
      (b) Suppose you specify a neural net architecture identical to the one in Problem 12.17. Would training by backpropagation yield the same weights and bias as that network if trained with a sufficiently large number of samples? Explain.

12.21 Two pattern classes in two dimensions are distributed in such a way that the patterns of class c1 lie randomly along a circle of radius r1. Similarly, the patterns of class c2 lie randomly along a circle of radius r2, where r2 = 2r1. Specify the structure of a neural network with the minimum number of layers and nodes needed to classify properly the patterns of these two classes.

12.22 * If two classes are linearly separable, we can train a perceptron starting with weights and a bias that are all zero, and we would still get a solution. Can you do the same when training a neural network by backpropagation? Explain.

12.23 Label the outputs, weights, and biases for every node in the following neural network using the general notation introduced in Fig. 12.31. [The network diagram appeared here; its bias inputs are labeled 1.]

12.24 Answer the following:
      (a) The last element of the input vector in Fig. 12.32 is 1. Is this vector augmented? Explain.
      (b) Repeat the calculations in Fig. 12.32, but using weight matrices that are 100 times the values of those used in the figure.
      (c) * What can you conclude in general from your results in (b)?

12.25 Answer the following:
      (a) * The chain rule in Eq. (12-70) shows three terms. However, you are probably more familiar with chain rule expressions that have two terms. Show that if you start with the expression

            δⱼ(ℓ) = ∂E/∂zⱼ(ℓ) = Σᵢ [∂E/∂zᵢ(ℓ + 1)] [∂zᵢ(ℓ + 1)/∂zⱼ(ℓ)]

            you can arrive at the result in Eq. (12-70).
      (b) Show how the middle term in the third line of Eq. (12-70) follows from the middle term in the second.

12.26 Show the validity of Eq. (12-72). (Hint: Use the chain rule.)

12.27 * Show that the dimensions of matrix D(ℓ) in Eq. (12-79) are nℓ × np. (Hint: Some of the parameters in that equation are computed in forward propagation, so you already know their dimensions.)

12.28 With reference to the discussion following Eq. (12-82), explain why the error for one pattern is obtained by squaring the elements of one column of matrix (A(L) − R), adding them, and dividing the result by 2.

12.29 * The matrix formulation in Table 12.3 contains all patterns as columns of a single matrix X. This is ideal in terms of speed and economy of implementation. It is also well suited when training is done using mini-batches. However, there are applications in which the number of training vectors is too large to hold in memory, and it becomes more practical to loop through each pattern using the vector formulation. Compose a table similar to Table 12.3, but using individual patterns, x, instead of matrix X.

12.30 Consider a CNN whose inputs are RGB color images of size 512 × 512 pixels. The network has two convolutional layers. Using this information, answer the following:
      (a) * You are told that the spatial dimensions of the feature maps in the first layer are 504 × 504, and that there are 12 feature maps in the first layer. Assuming that no padding is used, and that the kernels used are square and of odd size, what are the spatial dimensions of these kernels?
      (b) If subsampling is done using neighborhoods of size 2 × 2, what are the spatial dimensions of the pooled feature maps in the first layer?
      (c) What is the depth (number) of the pooled feature maps in the first layer?
      (d) The spatial dimensions of the convolution kernels in the second layer are 3 × 3. Assuming no padding, what are the sizes of the feature maps in the second layer?
      (e) You are told that the number of feature maps in the second layer is 6, and that the size of the pooling neighborhoods is again 2 × 2. What are the dimensions of the vectors that result from vectorizing the last layer of the CNN? Assume that vectorization is done using linear indexing.

12.31 Suppose the input images to a CNN are padded to compensate for the size reduction caused by convolution and subsampling (pooling). Let P denote the thickness of the padding border, let V denote the width of the (square) input images, let S denote the stride, and let F denote the width of the (square) receptive field.
      (a) Show that the number, N, of neurons in each row of the resulting feature map is

            N = (V + 2P − F)/S + 1

      (b) * How would you interpret a result of this equation that is not an integer?
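As a quick numerical check of the formula in part (a), the following sketch uses values we chose for illustration (it is not part of the problem statement):

```python
def feature_map_width(V, P, F, S):
    """N = (V + 2P - F)/S + 1: neurons per row of the feature map."""
    return (V + 2 * P - F) / S + 1

# Example: 28 x 28 input, 5 x 5 receptive field, no padding, stride 1.
print(feature_map_width(V=28, P=0, F=5, S=1))   # 24.0, as in the MNIST CNN
# The same input with stride 2 gives a non-integer result; see part (b).
print(feature_map_width(V=28, P=0, F=5, S=2))   # 12.5
```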
12.32 * Show the validity of Eq. (12-106).

12.33 An experiment produces binary images of blobs that are nearly elliptical in shape, as the following example image shows. The blobs are of three sizes, with the average values of the principal axes of the ellipses being (1.3, 0.7), (1.0, 0.5), and (0.75, 0.25). The dimensions of these axes vary ±10% about their average values. Develop an image processing system capable of rejecting incomplete or overlapping ellipses, then classifying the remaining single ellipses into one of the three given size classes. Show your solution in block diagram form, giving specific details regarding the operation of each block. Solve the classification problem using a minimum distance classifier, indicating clearly how you would go about obtaining training samples, and how you would use these samples to train the classifier.

12.34 A factory mass-produces small American flags for sporting events. The quality assurance team has observed that, during periods of peak production, some printing machines have a tendency to drop (randomly) between one and three stars and one or two entire stripes. Aside from these errors, the flags are perfect in every other way. Although the flags containing errors represent a small percentage of total production, the plant manager decides to solve the problem. After much investigation, she concludes that automatic inspection using image processing techniques is the most economical approach. The basic specifications are as follows: The flags are approximately 7.5 cm by 12.5 cm in size. They move lengthwise down the production line (individually, but with a ±15% variation in orientation) at approximately 50 cm/s, with a separation between flags of approximately 5 cm. In all cases, "approximately" means ±5%. The plant manager employs you to design an image processing system for each production line. You are told that cost and simplicity are important parameters in determining the viability of your approach. Design a complete system based on the model of Fig. 1.23. Document your solution (including assumptions and specifications) in a brief (but clear) written report addressed to the plant manager. You can use any of the methods discussed in the book.