Metamorphic Runtime Checking of Applications Without Test Oracles

Jonathan Bell, Columbia University
Christian Murphy, University of Pennsylvania
Gail Kaiser, Columbia University

Abstract. For some applications, it is impossible or impractical to know what the correct output should be for an arbitrary input, making testing difficult. Many machine-learning applications for “big data”, bioinformatics and cyberphysical systems fall in this scope: they do not have a test oracle. Metamorphic Testing, a simple testing technique that does not require a test oracle, has been shown to be effective for testing such applications. We present Metamorphic Runtime Checking, a novel approach that conducts metamorphic testing of both the entire application and individual functions during a program’s execution. We have applied Metamorphic Runtime Checking to 9 machine-learning applications, finding it to be on average 170% more effective than traditional metamorphic testing at only the full application level.

Introduction

During software testing, a “test oracle” [1] is required to indicate whether the output is correct for the given input. Despite recent interest in the testing community in creating and evaluating test oracles, there are still a variety of problem domains for which a practical and complete test oracle does not exist.

Many emerging application domains fall into a category of software that Weyuker describes as “Programs which were written in order to determine the answer in the first place. There would be no need to write such programs, if the correct answer were known” [2]. Thus, in the general case, it is not possible to know the correct output in advance for arbitrary input. In other domains, such as optimization, determining whether the output is correct is at least as difficult as deriving the output in the first place, and creating an efficient, practical oracle may not be feasible.

Some faults in such programs are easily found, such as those that cause the program to crash or produce results that are obviously wrong to someone who knows the domain, and partial oracles may exist for a subset of the input domain. However, subtle errors in performing calculations or in adhering to specifications can be much more difficult to identify without a practical, general oracle.

Much recent research addressing the so-called “oracle problem” has focused on the use of metamorphic testing [3]. In metamorphic testing, changes are made to existing test inputs in such a way (based on the program’s “metamorphic properties”) that it is possible to predict what the change to the output should be without a test oracle. That is, the input is modified in such a manner that the change to the original output O (if any) can be predicted. In cases where the correctness of the original output O cannot be determined, i.e., if there is no test oracle, program defects can still be detected if the new output O' is not as expected when using the new input.

For a simple example of metamorphic testing (where we do have a test oracle), consider a function that calculates the standard deviation of a set of numbers. Certain transformations of the set would be expected to produce the same result: for instance, permuting the order of the elements should not affect the calculation, nor should multiplying each value by -1. Other transformations should alter the output, but in a predictable way: if each value in the set were multiplied by 2, then the standard deviation should be twice that of the original set.

Through our own past investigations into metamorphic testing [4] [5] [6], we have garnered three key insights. First, the metamorphic properties of individual functions are often different from those of the application as a whole. Thus, by checking for additional and different relationships, we can reveal defects that would not be detected using only the metamorphic properties of the full application. Second, the metamorphic properties of individual functions can be checked in the course of executing metamorphic tests on the full application. This addresses the problem of generating test cases from which to derive new inputs, since we can simply use those inputs with which the functions happened to be invoked within the full application. Third, when conducting tests of individual functions within the full running application in this manner, checking the metamorphic properties of one function can sometimes detect defects in other functions, which may not have any known metamorphic properties, because the functions share application state.

Approach

In order to realize these improvements, we present a solution based on checking the metamorphic properties of the entire program and those of individual functions (methods, procedures, subroutines, etc.) as the full program runs. That is, the program under test is not treated only as a black box; metamorphic testing also occurs within the program, at the function level, in the context of the running program. This allows more tests to be executed and also makes it possible to check for subtle faults inside the code that may not violate the full program’s metamorphic properties and that lead to apparently reasonable output (remember that we cannot check whether that output is correct, since there is no test oracle).

In our new approach, additional metamorphic tests are logically attached to the individual functions for which metamorphic properties have been specified. When such a function happens to be invoked within the full program, the corresponding function-level tests are executed as well: the arguments are modified according to the function’s metamorphic properties, the function is run again (in a sandbox) in the same program state as the original, and the output of the function with the original input is compared to that of the function with the modified input. If the result is not as expected according to the metamorphic property, then a fault has been exposed.
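The function-level checking described above can be illustrated with a small Python sketch built on the standard deviation example from the Introduction. This is not the authors' C or Java framework: the `metamorphic` decorator, the transformation and relation helpers, and the seeded fault are all hypothetical names introduced for illustration, and a deep copy of the arguments stands in for the sandbox.

```python
import copy
import functools
import math
import statistics


def metamorphic(*properties):
    """Attach (transform, relation) pairs to a function (hypothetical helper)."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(values):
            original = func(values)
            for transform, relation in properties:
                # Assumption: a deep copy is a crude stand-in for the sandbox.
                follow_up = func(transform(copy.deepcopy(values)))
                if not relation(original, follow_up):
                    wrapper.violations.append(
                        (transform.__name__, original, follow_up))
            return original  # the caller still sees the original result
        wrapper.violations = []
        return wrapper
    return decorate


# Input transformations taken from the standard deviation example.
def permute(xs):
    return list(reversed(xs))


def negate(xs):
    return [-x for x in xs]


def double(xs):
    return [2 * x for x in xs]


# Expected relations between the original and the follow-up output.
def unchanged(old, new):
    return math.isclose(old, new, rel_tol=1e-9)


def doubled(old, new):
    return math.isclose(2 * old, new, rel_tol=1e-9)


PROPERTIES = ((permute, unchanged), (negate, unchanged), (double, doubled))


@metamorphic(*PROPERTIES)
def std_dev(values):
    return statistics.pstdev(values)


@metamorphic(*PROPERTIES)
def buggy_std_dev(values):
    # Seeded fault for illustration: the last element is omitted.
    return statistics.pstdev(values[:-1])


data = [1, 2, 3, 10]
std_dev(data)
buggy_std_dev(data)
print([name for name, _, _ in std_dev.violations])
print([name for name, _, _ in buggy_std_dev.violations])
```

In this sketch the correct function violates none of the properties, while the faulty one is caught only by the permutation property: negating or doubling the values still yields a consistent (though wrong) result. No expected value of the standard deviation is needed to expose the fault.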
                                                                                                                                     CrossTalk—March/April 2015 9
TEST AND DIAGNOSTICS
[...] property simply states that the quality of the solutions should be increasing with subsequent generations. Even though the value of the fitness is incorrect, it would still be increasing (unless the omitted element had a very large effect on the result, which is unlikely), and the property would not be violated.

Performance Overhead

Although Metamorphic Runtime Checking using function-level properties is able to detect faults not found by metamorphic testing based on application-level properties alone, this runtime checking of the properties comes at a cost, particularly if the tests are run frequently. In application-level metamorphic testing, the program needs to be run one more time with the transformed input, and then each metamorphic property is checked exactly once (at the end of the program execution). In Metamorphic Runtime Checking, however, each property can be checked numerous times, depending on the number of times each function is called, and the overhead can grow to be much higher.

During the studies discussed above, we measured the performance overhead of our C and Java implementations of the Metamorphic Runtime Checking framework. Tests were conducted on a server with a quad-core 3 GHz CPU and 2 GB RAM running Ubuntu 7.10. On average, the performance overhead for the Java applications was around 3.5 ms per test; for C, it was only 0.4 ms per test. This cost is mostly attributed to the time it takes to create sandboxes (so that the side effects of function-level metamorphic testing do not impact application-level testing).

This impact can be substantial from a percentage overhead point of view if many tests are run in a short-lived program. For instance, for C4.5, the overhead was on the order of 10x, even though in absolute terms it was well under a second. However, for most programs we investigated in our study, the overhead was typically less than a few minutes, which we consider a small price to pay for being able to detect faults in programs with no test oracle.

Future work could investigate techniques for improving the performance of a Metamorphic Runtime Checking framework. Previously we considered an approach whereby tests were only executed in application states that had not previously been encountered, and showed that performance could be improved even when the functions are invoked with new parameters up to 90% of the time [12]. It may be possible to reduce the overhead even more, for instance by running tests probabilistically (our framework already allows the tester to specify a probability for checking each function-level metamorphic property, but we turned that off for the studies presented here).

Limitations

We used Daikon to create the program invariants for runtime assertion checking. Although in practice invariants are typically generated by hand, and some researchers have questioned the usefulness of Daikon-generated invariants compared to those generated by humans [13], we chose to use the tool so that we could eliminate any human bias or human error in creating the invariants.

Additionally, others have independently shown that metamorphic properties are more effective at detecting defects than manually identified invariants [14], though for programs on a smaller scale than those in our experiment (a few hundred lines, as opposed to thousands as in many of the programs we studied).

The ability of metamorphic testing to reveal failures is clearly dependent on the selection of metamorphic properties. However, we have shown that a basic set of metamorphic properties can be used without a particularly strong understanding of the implementation: the authors knew essentially nothing about the target systems or their domains beyond textbook generalities. The use of domain-specific properties from the developers of these systems might reveal even more failures [15].
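Returning to the probabilistic checking mentioned in the performance discussion above, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `checked_call`, its parameters, and the example property are hypothetical and do not reflect the interface of the authors' framework.

```python
import random


def checked_call(func, args, prop, probability, rng=random):
    """Call func(args); with the given probability, also check one property.

    Hypothetical helper. Returns (result, checked, ok), where ok is True
    whenever the check was skipped or the property held.
    """
    result = func(args)
    if rng.random() >= probability:
        return result, False, True  # check skipped on this invocation
    transform, relation = prop
    follow_up = func(transform(list(args)))
    return result, True, relation(result, follow_up)


def double(xs):
    return [2 * x for x in xs]


# Example property: doubling every element doubles the sum.
sum_property = (double, lambda old, new: new == 2 * old)

rng = random.Random(0)  # seeded only to make the run repeatable
checks = 0
for _ in range(1000):
    _, checked, ok = checked_call(sum, [1, 2, 3], sum_property, 0.1, rng)
    checks += checked
    assert ok
print(f"property checked on {checks} of 1000 invocations")
```

With a probability of 0.1, roughly one invocation in ten pays the cost of the follow-up execution, trading some fault-detection opportunities for lower overhead.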

Conclusion

As shown in our empirical studies, Metamorphic Runtime Checking has three distinct advantages over metamorphic testing using application-level properties alone. First, we are able to increase the scope of metamorphic testing, by identifying properties for individual functions in addition to those of the entire application. Second, we increase the scale of metamorphic testing by running more tests for a given input to the program. And third, we can increase the sensitivity of metamorphic testing by checking the properties of individual functions, making it possible to reveal subtle faults that may otherwise go unnoticed.

Acknowledgements

We would like to thank T.Y. Chen, Lori Clarke, Lee Osterweil, Sal Stolfo, and Junfeng Yang for their guidance and assistance. Sahar Hasan, Lifeng Hu, Kuang Shen, and Ian Vo contributed to the implementation of the Metamorphic Runtime Checking framework.

Bell and Kaiser are members of the Programming Systems Laboratory, funded in part by NSF CCF-1302269, NSF CCF-1161079, NSF CNS-0905246, and NIH U54 CA121852.