How Effective Are Mutation Testing Tools? An Empirical Analysis of Java Mutation Testing Tools with Manual Analysis and Real Faults
Abstract
Mutation analysis is a popular fault-based testing technique. It requires testers
to design tests based on a set of artificial defects, which support the testing process
by measuring the ratio of defects that the candidate tests reveal.
Unfortunately, applying mutation to real-world programs requires automated tools
due to the vast number of defects involved. In such cases, the strength of
the method depends heavily on the particular tools employed, whose
implementation inadequacies can lead to
inaccurate results. To deal with this issue, we cross-evaluate three popular mutation
testing tools for Java, namely muJava, Major and the research version of
PIT, PITRV, with respect to their fault detection capabilities. We investigate the
strengths of the tools based on: a) a set of real faults and b) manual analysis of the
mutants they introduce. We find that there are large differences between the tools'
effectiveness and demonstrate that no tool is able to subsume the others. We also
provide results indicating the application cost of the method. Overall, we find that
PITRV achieves the best results. In particular, PITRV outperforms both muJava
and Major by finding 6% more faults than the other two tools together.
1 Introduction
Software testing is the most popular practice for identifying software defects [1]. It
is performed by exercising the software under test with test cases that check whether its
behaviour is as expected. To analyse test thoroughness, several criteria have been proposed
that specify the requirements of testing, i.e., what constitutes a good test suite.
When the criteria requirements have been fulfilled, they provide confidence in the correct
functioning of the tested systems.
Empirical studies have demonstrated that mutation testing is effective in revealing
faults [7], and capable of subsuming, or probably subsuming, almost all the structural
testing techniques [1, 19]. Mutation testing requires test cases that reveal the artificially
injected defects. This practice is particularly powerful as it has been shown that when
test cases are capable of distinguishing the behaviours of the original (non-mutated) and
the defective (mutant) programs, they are also capable of distinguishing the expected
behaviours from the faulty ones [7].
The defective program versions are called mutants and they are typically introduced
using syntactic transformations. Clearly, the effectiveness of the technique depends
on the mutants that are employed. For instance, if the mutants are trivial, i.e.,
they are killed by almost every test that exercises them, they do not contribute to the testing
process. Therefore, testers performing mutation testing should be cautious about
the mutants they use. Recent research has demonstrated that the method is so sensitive
to the employed mutants that it can lead experiments to incorrect conclusions
[41]. Therefore, particular care has to be taken when selecting mutants in order to
avoid potential threats to validity. Similarly, the use of mutation testing tools can lead
to additional threats to validity or misleading results, due to the peculiarities of the
tools' implementations.
To date, many mutation testing tools have been developed and used by researchers
and practitioners [41]. However, a key open question is how effective these tools are
and how reliable the research results based on them are. Thus, in this paper, we seek
to investigate the fault revelation ability of popular mutation testing tools with the goal
of identifying their differences, weaknesses and strengths. In short, our aim is threefold:
a) to inform practitioners about the effectiveness and relative cost of the studied
mutation testing tools, b) to provide constructive feedback to tool developers on how
to improve their tools, and c) to make researchers aware of the tools' inadequacies.
To investigate these issues, we compare the fault revelation ability of three widely-used
mutation testing tools for Java, namely muJava, Major and PITRV, on a set
of real faults. We complement our analysis with a manual analysis and comparison
of the tools. Our results demonstrate that one tool, the research version of PIT [8],
named PITRV [18], is significantly more effective than the others, managing to reveal
approximately 6% more real faults than the other two tools together. However, due to
some known limitations of PITRV, it does not fully subsume the other tools.
Regarding a reference effectiveness measure (a control comparison at the 100% coverage
level), we found that PITRV scores best with 91%, followed by muJava with 85%
and Major with 80%. These results suggest that existing tools are considerably less
effective than they should be, or than researchers believe them to be. Therefore,
our findings emphasise the need to build a reference mutation testing tool that is
strong enough to at least subsume the existing mutation testing tools.
Another concern when using mutation is its application cost. This is mainly due
to the manual effort involved in constructing test cases and in deciding when to stop
testing. The former point regards the need for generating test cases, while the latter
pertains to the identification of the so-called equivalent mutants, i.e., mutants that are
functionally equivalent to the original program. Both tasks are labour-intensive and
must be performed manually. Our study shows that muJava leads to 138 tests, Major
to 97 and PITRV to 105. With respect to the number of equivalent mutants, muJava,
Major and PITRV produced 203, 94 and 382, respectively.
This paper forms an extended study of our previous one [26], published in the
International Working Conference on Source Code Analysis and Manipulation, which
investigated the effectiveness of the tools based on manual analysis. We extend this
previous study by investigating the actual fault revelation ability of the tools, based on
a benchmark set of real faults and by considering the research version of the PIT tool,
which was realised after the previous study [18]. The extended results demonstrate that
PIT RV forms the most prominent choice as it significantly outperforms the other tools
both in terms of fault revelation and mutant revelation. Overall, the contributions of
the present paper can be summarised in the following points:
The rest of the paper is organised as follows: Section 2 presents the necessary
background information and Section 3 outlines our study's motivation. In Section 4,
we present the posed research questions and the adopted experimental procedure and,
in Section 5, we describe the obtained results. In Section 6, we discuss potential threats
to the validity of this study along with mitigating actions, and in Section 7 we review previous
research studies. Finally, Section 8 concludes the paper, summarising the key findings.
2 Background
This section details mutation testing and presents the studied mutation testing tools.
Another problem of mutation testing is that it produces many mutants that are redundant,
i.e., they are killed whenever other mutants are killed. These mutants can inflate
the mutation score, skewing the measurement. Indeed, previous research has shown that such mutants
can have harmful effects on the mutation score measurement, to the extent of
leading experiments to incorrect conclusions [41]. Therefore, when mutation testing is
used as a comparison basis, there is a need to deflate the mutation score measurement.
This can be done by using the subset of subsuming mutants [41, 2] or disjoint mutants
[24]. Disjoint mutants approximate the minimum "subset of mutants that need to be
killed in order to reciprocally kill the original set" [24]. We use the term disjoint
mutation score for the ratio of the disjoint mutants that are killed by the test cases under
assessment (which, in our case, are those designed to kill the studied tools'
mutants).
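To make this measure precise, a minimal formalisation (our notation, not taken from the original text): let D denote the set of disjoint mutants of the merged mutant pool and T the test suite under assessment; then

dMS(T) = \frac{|\{\, m \in D \mid \exists\, t \in T \text{ such that } t \text{ kills } m \,\}|}{|D|}

For example, if D contains 10 disjoint mutants and the assessed tests kill 7 of them, dMS(T) = 0.7, independently of how many subsumed or duplicated mutants those tests also kill.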
Mutation's effectiveness depends largely on the mutants that are used [1]. Thus,
the actual implementation of mutation testing tools can impact the effectiveness of the
technique. Indeed, many different mutation testing tools exist, based on different
architectural designs and implementations. As a consequence, it is not possible for
researchers and practitioners alike to make an informed decision on which tool to use
and on the strengths and weaknesses of the tools.
This paper addresses this issue by analysing the effectiveness of
three widely-used mutation testing tools for the Java programming language, namely
muJava, Major and PITRV, based on the results of an extensive manual study. Before
presenting the conducted empirical study, the considered tools and their implementation
details are introduced.
Table 1: Mutation operators of muJava

AORB (Arithmetic Operator Replacement Binary):    {(op1, op2) | op1, op2 ∈ {+, -, *, /, %} ∧ op1 ≠ op2}
AORS (Arithmetic Operator Replacement Short-Cut): {(op1, op2) | op1, op2 ∈ {++, --} ∧ op1 ≠ op2}
AOIU (Arithmetic Operator Insertion Unary):       {(v, -v)}
AOIS (Arithmetic Operator Insertion Short-Cut):   {(v, --v), (v, v--), (v, ++v), (v, v++)}
AODU (Arithmetic Operator Deletion Unary):        {(+v, v), (-v, v)}
AODS (Arithmetic Operator Deletion Short-Cut):    {(--v, v), (v--, v), (++v, v), (v++, v)}
ROR (Relational Operator Replacement):            {((a op b), false), ((a op b), true), (op1, op2) | op1, op2 ∈ {>, >=, <, <=, ==, !=} ∧ op1 ≠ op2}
COR (Conditional Operator Replacement):           {(op1, op2) | op1, op2 ∈ {&&, ||, ^} ∧ op1 ≠ op2}
COD (Conditional Operator Deletion):              {(!cond, cond)}
COI (Conditional Operator Insertion):             {(cond, !cond)}
SOR (Shift Operator Replacement):                 {(op1, op2) | op1, op2 ∈ {>>, >>>, <<} ∧ op1 ≠ op2}
LOR (Logical Operator Replacement):               {(op1, op2) | op1, op2 ∈ {&, |, ^} ∧ op1 ≠ op2}
LOI (Logical Operator Insertion):                 {(v, ~v)}
LOD (Logical Operator Deletion):                  {(~v, v)}
ASRS (Short-Cut Assignment Operator Replacement): {(op1, op2) | op1, op2 ∈ {+=, -=, *=, /=, %=, &=, |=, ^=, >>=, >>>=, <<=} ∧ op1 ≠ op2}
The mutants of PITRV's Bitwise Operator Mutation (OBBN) operator are a subset of those of muJava's Logical
Operator Replacement (LOR). Additionally, PITRV employs mutation operators that are
not implemented in muJava, e.g., the Void Method Calls (VMC) and Constructor
Calls (CC) operators. Finally, it should be mentioned that, since PITRV's changes are
performed at the bytecode level, they cannot always be mapped onto source-code ones.
Table 2: Mutation operators of PITRV
3 Motivation
Mutation testing is important since it is considered one of the most effective testing
techniques. Its fundamental premise, as coined by Geist et al. [15], is that:
“If the software contains a fault, it is likely that there is a mutant that can
only be killed by a test case that also reveals the fault.”
This premise has been empirically investigated by many research studies, which have
shown that mutation-adequate test suites, i.e., test suites that kill all killable mutants, are
more effective than test suites generated to cover various control- and data-flow coverage
criteria [19, 7]. Therefore, researchers use mutation testing either as a baseline for comparing
other testing techniques or as a target for automation.
Overall, a recent study by Papadakis et al. [41] shows that mutation testing is
popular and widely used in research (probably due to its remarkable effectiveness).
In view of this, it is mandatory to ensure that mutation testing tools are powerful and
do not bias the existing research (due to implementation inadequacies or missing mutation
operators).
To reliably compare the selected tools, it is mandatory to account for mutant sub-
sumption [41] when performing a complete testing process, i.e., using mutation-adequate
tests. Accounting for mutant subsumption is necessary in order to avoid bias from
subsumed mutants [41], while complete testing ensures the accurate estimation of the
tools' effectiveness. An inaccurate estimation may happen when failing to kill some
killable mutants, which consequently results in failing to design tests (to kill these mutants)
and, thus, in underestimating effectiveness. Even worse, the use of non-adequate test
suites ignores hard-to-kill mutants, which are important [3, 43] and among those that
(probably) contribute to the test process. Since we know that very few mutants contribute
to the test process [41], the use of non-adequate test suites can result in a major
degradation of the measured effectiveness.

Table 3: Mutation operators of Major

AOR (Arithmetic Operator Replacement):  {(op1, op2) | op1, op2 ∈ {+, -, *, /, %} ∧ op1 ≠ op2}
LOR (Logical Operator Replacement):     {(op1, op2) | op1, op2 ∈ {&, |, ^} ∧ op1 ≠ op2}
COR (Conditional Operator Replacement): {(&&, op1), (||, op2) | op1 ∈ {==, LHS, RHS, false}, op2 ∈ {!=, LHS, RHS, true}}
ROR (Relational Operator Replacement):  {(>, op1), (<, op2), (>=, op3), (<=, op4), (==, op5), (!=, op6) | op1 ∈ {>=, !=, false}, op2 ∈ {<=, !=, false}, op3 ∈ {>, ==, true}, op4 ∈ {<, ==, true}, op5 ∈ {<=, >=, false, LHS, RHS}, op6 ∈ {<, >, true, LHS, RHS}}
SOR (Shift Operator Replacement):       {(op1, op2) | op1, op2 ∈ {>>, >>>, <<} ∧ op1 ≠ op2}
ORU (Operator Replacement Unary):       {(op1, op2) | op1, op2 ∈ {+, -, ~} ∧ op1 ≠ op2}
STD (Statement Deletion Operator):      {(--v, v), (v--, v), (++v, v), (v++, v), (aMethodCall(), ∅), (a op1 b, ∅) | op1 ∈ {+=, -=, *=, /=, %=, &=, |=, ^=, >>=, >>>=, <<=}}
LVR (Literal Value Replacement):        {(c1, c2) | (c1, c2) ∈ {(0, 1), (0, -1), (c1, -c1), (c1, 0), (true, false), (false, true)}}

Unfortunately, generating test suites that kill all killable mutants is practically
infeasible because of the inherent undecidability of the problem [40]. Therefore, from a
practical point of view, examining the partial relationship for the case of non-adequate
test suites is important. Thus, we need to consider both scenarios in order to adequately
compare the tools we study. For all these reasons, we use both mutation adequate test
suites specially designed for each tool that we study (using a small sample of programs)
and non-adequate test suites (using large real world programs).
4 Empirical Study
This section presents the settings of our study by detailing the research questions, the
procedure we followed and the design of our experiments.
4.1 Research Questions
Mutation testing's aim is to help testers design high-quality test suites. Therefore, the
first question to ask is whether there is a tool that is more effective than, or at least as effective
as, the other tools in real-world cases. In other words, we want to measure how effective
the studied tools are at finding real faults. Since we investigate real faults, we are forced
to study the partial relationship between the tools under the "practical" scenario. Hence
we ask:
RQ1: How do the studied tools perform in terms of real fault detection?
This comparison enables checking whether mutation testing tools have different
fault revelation capabilities when applied to large real-world projects. If we find
that the tools differ significantly in terms of fault detection, we demonstrate
that the choice of mutation testing tool really matters. Given that we find significant
differences between the tools, a natural question to ask is:
RQ2: Does any mutation testing tool lead to tests that subsume the others in terms of
real fault detection? If not, which is the relatively most effective tool to use?
This comparison enables the ranking of the tools with respect to their fault revelation
capabilities (on the benchmark set we use) and the identification of the most
effective mutation testing tool. It also quantifies the effectiveness differences between
the tools in real-world settings. Given that the effectiveness ranking offered by the
above comparison is bound to the reference fault set and the automatically generated
test suites used, an emerging question is how the tools compare with each other under
complete testing, i.e., using adequate test suites. In other words, we seek to investigate
how effective the studied tools are at killing the mutants of the other tools. Hence we
ask:
RQ3: Does any mutation testing tool lead to tests that kill all the killable mutants
produced by the other tools?
This comparison enables checking whether there is a tool that is capable of subsuming
the others, i.e., whether the mutation adequate tests of one tool can kill all the
killable mutants of the others. A positive answer to the above question indicates that
a single tool is superior to the others in terms of effectiveness. A negative answer to
this question indicates that the tools are generally incomparable, meaning that there are
mutants not covered by the tools. We view these missed mutants as weaknesses of the
tools. The main difference from RQ1 and RQ2 is that we perform an objective
comparison under complete testing, which helps reduce potential threats to validity.
To further compare the mutation testing tools and identify their weaknesses, we need to
assess the quality of the test suites that they lead to. This requires either an effectiveness
measure that is independent of the used mutants or a form of "ground truth", i.e., a golden set
of mutants. Since neither is available, we constructed a reference mutant set, the set of
disjoint mutants, from the superset of mutants produced by all the studied tools together
and all generated test cases. We use the disjoint set of mutants to avoid inflating the
reference set with duplicated mutants, i.e., mutants equivalent to each other but not to the
original program [40], and redundant mutants, i.e., mutants subsumed by other mutants
of the merged set [41]. Both duplicated and redundant mutants inflate the
mutation score measurement, with the unfortunate result of committing Type I errors
[41]. Since in our case these types of mutants are expected to be numerous, as the tools
support many common types of mutants, the use of disjoint mutants was imperative.
Therefore we ask:
RQ4: How do the studied tools perform compared to a reference mutant set? Which
is the relatively most effective tool to use?
This comparison enables the ranking of the tools with respect to their effective-
ness. The use of the reference mutant set also helps aggregate all the data and quantify
the relative strengths and weaknesses of the studied tools in one measure (the disjoint
mutation score). Given the effectiveness ranking offered by this comparison, we can
identify the most effective mutation testing tool and quantify the effectiveness differ-
ences between the tools.
This information is important when choosing a tool to use, but it does not provide any constructive
insight into the weaknesses of the tools. Furthermore, it fails to provide
researchers and tool developers with constructive feedback on how to build future tools
or strengthen the existing ones. Therefore, we seek to analyse the observed weaknesses
and ask:
RQ5: Are there any actionable findings on how to improve the effectiveness of the
studied tools?
Our focus thus far has been on the relative effectiveness of the
tools. While this is important when using mutation, another major concern is the cost
of its application. Mutation testing is considered expensive due to the manual
effort involved in identifying equivalent mutants and designing test cases. Since we
manually assess and apply the mutation testing practice of the studied tools, we ask:
RQ6: What is the relative cost, measured by the number of tests and number of equiv-
alent mutants, of applying mutation testing with the studied tools?
An answer to this question can provide useful information to both testers and re-
searchers regarding the trade-offs between cost and effectiveness. Also, this analysis
will better reflect the differences of the tools from the cost perspective.
Table 4: Fault Benchmark Set Details
Test Subject Description LoC #Real Faults #Gen. Tests #Faults Found
JFreeChart A chart library 79,949 26 3,758 17
Closure Closure compiler 91,168 133 3,552 12
Commons Lang Java utilities library 45,639 65 6,408 30
Commons Math Mathematics library 22,746 106 8,034 53
Mockito A mocking framework 5,506 38 - -
Joda-Time A date and time library 79,227 27 1,667 13
Total - 324,235 395 23,419 125
these tests reveal the faults. Furthermore, these tests were not generated using any controlled
or known procedure and can thus introduce several threats to the validity
of our results: they are few and may only kill trivial mutants, underestimating our
measurements. To circumvent this problem, we simulated the mutation-based test process
using multiple test suites generated by two state-of-the-art test generation tools,
namely EvoSuite [13] and Randoop [35]. Although this practice may introduce the
same threats to validity as the developer test suites, it has several benefits: the tests
are generated with a specific procedure, they are numerous and they represent our current
ability to generate automated test suites.
1. For each fault, we ran EvoSuite and Randoop on the fixed version of the project
to generate three test suites, two with EvoSuite and one with Randoop, for the classes
that were modified in order to fix the corresponding buggy version.
2. We systematically removed problematic test cases from the generated test suites,
e.g., test cases that produced inconsistent results when run multiple times, using
the available scripts of Defects4J.
3. We ran the generated test suites against the buggy versions of the projects to identify
the faults they reveal.
The aforementioned procedure identified the faults (from all the faults of Defects4J)
that could be discovered by our automatically generated test suites. In the remainder
of the paper, we use the term "triggering test suite" for a test suite that contains at least one test
case capable of revealing the fault to which it refers. Table 4 presents more details
about the results of this procedure. More precisely, column
"#Gen. Tests" presents the number of test cases in the triggering test suites per project
and column "#Faults Found" the number of the corresponding discovered faults. Note
that we did not include any results for the Mockito project because most of the resulting
test cases were problematic. In total, our real fault set consists of 125 real faults
from 5 open source projects, and our triggering test suites are composed of 23,419 test
cases.
4.3 Manual Assessment
To complement our analysis we manually applied the tools to parts of several real-
world projects. Since manual analysis requires considerable resources, analysing a
complete project is infeasible. Thus, we picked and analysed 12 methods from 6 Java
test subjects three independent times, once per studied tool. In total, we manually
analysed 36 methods and 5,831 mutants, which constitutes one of the largest studies in
the mutation testing literature; e.g., Yao et al. [44] consider 4,181 mutants and Baker
and Habli [4] consider 2,555.
to consider manually analysed mutants when comparing the effectiveness of different
mutation testing tools (see also Section 7). The rest of this section discusses the test
subjects, tool configuration and the manual procedure we followed in order to perform
mutation testing.
We selected 12 methods to perform our experiment; 10 of them were randomly
picked from 4 real-world projects (Commons-Math, Commons-Lang, Pamvotis and
XStream) and another 2 (Triangle and Bisect) from the mutation testing literature [1].
Details regarding the selected subjects are presented in Table 5. The table presents
the name of the test subjects, their source code lines as reported by the cloc tool, the
names of the studied methods, the number of generated and disjoint mutants per tool
and the number of the resulting mutants of the reference mutant set.
• The selected methods were given to students attending the “Software Validation,
Verification and Maintenance” course (Spring 2015 and Fall 2015), taught by
Prof. Malevris, in order to analyse the mutants of the studied tools, as part of
their coursework. The participating students were selected based on their overall
performance and their grades at the programming courses. Additionally, they
all attended an introductory lecture on mutation testing and appropriate tutorials
before the beginning of their coursework. To facilitate the smooth completion of
their projects and the correct application of mutation, the students were closely
supervised, with regular team meetings throughout the semester.
Table 5: Test Subject Details: Generated Mutants, Disjoint Mutants and Reference Mutant Set

                                        #Generated Mutants        #Disjoint Mutants        #Mutants
Test Subject   LoC      Method          Major   PITRV   muJava    Major   PITRV   muJava   Reference Set
Commons-Math   16,489   gcd             133     392     237       7       12      9        10
                        orthogonal      120     392     155       11      11      11       11
Commons-Lang   17,294   toMap           23      115     32        6       6       5        10
                        subarray        25      95      64        6       6       3        6
                        lastIndexOf     29      139     81        11      18      13       17
                        capitalize      37      132     69        5       4       4        5
                        wrap            71      328     198       12      12      16       13
Pamvotis       5,505    addNode         89      447     318       16      23      29       37
                        removeNode      18      109     55        7       6       6        6
Triangle       47       classify        139     486     354       31      26      31       55
XStream        15,048   decodeName      73      315     156       8       9       8        10
Bisect         37       sqrt            51      219     135       7       6       7        6
Total          54,420   -               808     3,169   1,854     127     139     142      186
• The designed test cases and detected equivalent mutants were manually analysed
and carefully verified by at least one of the authors.
To generate the mutation adequate test suites, the students were first instructed to
generate branch adequate test suites and then to randomly pick a live mutant and at-
tempt to kill it based on the RIP Model [1]. Although the detection of killable mutants
is an objective process, i.e., the produced test case either kills the corresponding mutant
or not, the detection of equivalent ones is a subjective one. To deal with this issue, all
students were familiarised with the RIP Model [1] and the sub-categories of equiva-
lent mutants described by Yao et al. [44]. Also, all detected equivalent mutants were
independently verified.
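To give a flavour of why equivalence judgements require care, consider the following hypothetical AOIS mutant (our own illustration, not one of the studied mutants), reasoned about in terms of the RIP model:

```java
// Hypothetical illustration of an equivalent mutant in the sense of the RIP model:
// the mutated statement is reached and infects the local state, but the infection
// cannot propagate to any observable output.
public class EquivalentMutantExample {

    // Original method.
    static int lastElement(int[] values) {
        int index = values.length - 1;
        return values[index];
    }

    // AOIS mutant: a post-increment is inserted on "index" at its last use.
    // The increment happens after the array access, and the local variable is
    // never read again, so every test observes exactly the same behaviour:
    // the mutant is equivalent to the original method.
    static int lastElementMutant(int[] values) {
        int index = values.length - 1;
        return values[index++];
    }
}
```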
It should be noted that, for the PITRV mutants, one of the authors of this paper
extended our previous manual analysis of the PIT mutants [26] by designing new test
cases that kill (or identify as equivalent) the PITRV mutants that remained alive after
the application of PIT’s mutation adequate test suites. To support replication and wider
scrutiny of our manual analysis, we made all its results publicly available [27].
4.4 Methodology
To answer the stated RQs, we applied mutation testing by independently using each
one of the selected tools. As the empirical study involves two parts, an experiment on
open source projects with real faults and a manual analysis on sampled functions, we
simulate the mutation testing process in two ways.
For the large-scale experiment (using open source projects with real faults), we
constructed a test pool composed of multiple test suites that were generated by Evo-
Suite [13] and Randoop [35]. We then identified the triggering test suites, i.e., the test
suites that contain at least one test case that reveals the studied faults and discarded all
the Defects4J faults for which no triggering test suite was generated.
Related to the sampled functions, we manually generated three mutation adequate
test sets (one for every analysed tool) per studied subject. We then minimised these
test sets by checking, for each contained test case, whether its removal would result
in a decreased mutation score [1]. If the removal of a test does not decrease the
mutation score, the test is redundant (according to the used mutation testing
tool) and has to be removed. As redundant tests do not satisfy any of the criterion
requirements, they can artificially result in overestimating the strength of the test suites
[1]. We used the resulting tests and computed the set of disjoint mutants produced by
each one of the tools. We then constructed the reference mutant set by identifying the
disjoint mutants (using all the produced tests) of the mutant set composed of all mutants
of all the studied tools. To compute the disjoint mutant set we need a matrix that records,
for every mutant, all the test cases that kill it. Such a matrix is readily available in the
case of PITRV. For muJava, we extended the corresponding script to handle certain
cases in which it failed, e.g., when a class belonging to a package was
given as input. Finally, in the case of Major, we utilised the scripts accompanying
Defects4J to produce this matrix. The disjoint set of mutants was computed using the
"Subsuming Mutants Identification" process described in the study of Papadakis
et al. [41]. Here, we use the term "disjoint" mutants, instead of "subsuming" ones,
since this was the original term used by the first study that introduced them
and suggested their use as a metric that can accurately measure test effectiveness [24].
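A minimal sketch of the kind of computation this step involves (our own illustration, not the scripts used in the study): given a kill matrix mapping each mutant to the set of tests that kill it, mutants with identical kill sets are mutually subsuming, and a group is disjoint/subsuming when no other killed group is killed by a strict subset of its killing tests.

```java
import java.util.*;

// Sketch of identifying disjoint (subsuming) mutants from a kill matrix
// (mutant -> set of killing tests), and of the disjoint mutation score.
public class DisjointMutants {

    // Returns one representative mutant per disjoint group.
    static List<String> disjoint(Map<String, Set<String>> killMatrix) {
        // Keep only killed mutants and group mutants with identical kill sets
        // (mutually subsuming mutants form a single group).
        Map<Set<String>, String> groups = new LinkedHashMap<>();
        killMatrix.forEach((mutant, killers) -> {
            if (!killers.isEmpty()) groups.putIfAbsent(killers, mutant);
        });

        // A group is subsumed if some other group is killed by a strict
        // subset of its killing tests; the remaining groups are disjoint.
        List<String> result = new ArrayList<>();
        for (Set<String> killers : groups.keySet()) {
            boolean subsumed = groups.keySet().stream()
                    .anyMatch(other -> !other.equals(killers) && killers.containsAll(other));
            if (!subsumed) result.add(groups.get(killers));
        }
        return result;
    }

    // Disjoint mutation score of a test set: fraction of disjoint mutants it kills.
    static double disjointScore(Map<String, Set<String>> killMatrix,
                                List<String> disjointMutants, Set<String> tests) {
        long killed = disjointMutants.stream()
                .filter(m -> killMatrix.get(m).stream().anyMatch(tests::contains))
                .count();
        return disjointMutants.isEmpty() ? 1.0 : (double) killed / disjointMutants.size();
    }
}
```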
In RQ1 we are interested in the fault revelation ability of mutation-based test suites.
Thus, we want to see whether mutation-driven test cases can reveal our faults. We
measure the fault revelation ability of the studied mutation testing tools by evaluating
the fault revelation of the test cases that kill their mutants. More precisely, we consider
a fault as revealed when there is at least one mutant that is killed only by triggering
test cases for that particular fault. Thus, based on our generated test suites, if such a
mutant is killed, a test case that reveals the respective fault is bound to have been generated.
To answer the research question, we compare the number of revealed faults per tool
and project and rank the tools accordingly, thereby also answering RQ2.
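Under our reading of this criterion, the check can be sketched as follows (illustrative code, not the study's scripts): a fault counts as revealed by a tool if the tool produces at least one killed mutant whose killing tests are all triggering tests of that fault.

```java
import java.util.*;

// Illustrative sketch of the RQ1 fault-revelation criterion.
public class FaultRevelation {

    static boolean reveals(Map<String, Set<String>> toolKillMatrix,  // mutant -> killing tests
                           Set<String> triggeringTests) {            // tests exposing the fault
        return toolKillMatrix.values().stream()
                .anyMatch(killers -> !killers.isEmpty()
                        && triggeringTests.containsAll(killers));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> kills = new HashMap<>();
        kills.put("m1", Set.of("t1", "t2"));  // killed only by triggering tests
        kills.put("m2", Set.of("t3"));        // also killed by a non-triggering test
        System.out.println(reveals(kills, Set.of("t1", "t2")));  // prints true
    }
}
```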
To answer RQ3, we used the selected methods and the manually generated test
suites. For each selected tool we used its mutation adequate test suite and calculated
the mutation score and disjoint mutation score that it achieves when it is evaluated with
the mutants produced by the other tools. This process can be viewed as an objective
comparison between the tools, i.e., a comparison that evaluates how the tests designed
for one tool perform when evaluated in terms of the other tool. In case the tests of one
tool can kill all the mutants produced by the other tool, then this tool subsumes the
other. Otherwise, the two tools are incomparable.
To answer RQ4, we used the tests that were specifically designed for each one
of the studied tools (from the manually analysed subjects) and measured the score
they achieve when evaluated against the reference mutant set. This score provides the
common ground to compare the tools and rank them with respect to their effectiveness
and identify the most effective tool.
To answer RQ5, for each tool we manually analysed the mutants that were not
killed by the tests of the other tools with the intention of identifying inadequacies in
the tools’ mutant sets. We then gathered all these instances and identified how we could
complement each one of the tools in order to improve its effectiveness and reach the
level of the reference mutant set. Finally, to answer RQ6, we measured and report the
number of tests and equivalent mutants that we found.
5 Empirical Findings
This section presents the empirical findings of our study per posed research question.
evident that none of the tools subsumes the others; there are faults that are revealed only
by test cases killing mutants generated by one tool and not the others
and, as a consequence, none of the tools alone can reveal all the studied faults. Specifically,
Major reveals 2 unique faults (Time-27 and Math-55) compared to PITRV
and 29 faults compared to muJava. muJava reveals one unique fault (Math-27) compared
to Major; when compared to PITRV, all muJava-revealed faults are also revealed
by PITRV. Finally, PITRV reveals 9 unique faults (Lang-56, Math-6, Math-22,
Math-27, Math-89, Math-105, Closure-27, Closure-49, Closure-52) compared
to Major and 31 unique faults compared to muJava. One interesting finding is that 2
faults (Math-75, Math-90) are not revealed by any tool. Overall, PITRV managed to
reveal 121 real faults out of the 125 for which it ran successfully, Major revealed 114 out
of 125 and muJava 34 out of 66.
Table 6: Major's, PITRV's and muJava's fault revelation on the studied real faults
Table 7: Tools' Cross-Evaluation Results

Mutants of:   Major                      PITRV                      muJava
Test suite:   PITRV-TS     muJava-TS     Major-TS     muJava-TS     Major-TS     PITRV-TS
Method        All    Dis.  All    Dis.   All    Dis.  All    Dis.   All    Dis.  All    Dis.
gcd 97.4% 87.5% 97.4% 71.4% 99.4% 58.3% 98.8% 58.3% 99.5% 88.9% 100.0% 100.0%
orthogonal 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 99.5% 90.9% 100.0% 100.0% 100.0% 100.0%
toMap 100.0% 100.0% 77.8% 83.3% 97.9% 66.6% 92.8% 83.3% 100.0% 100.0% 100.0% 100.0%
subarray 90.0% 66.6% 85.0% 50.0% 100.0% 100.0% 97.6% 83.3% 100.0% 100.0% 100.0% 100.0%
lastIndexOf 100.0% 100.0% 92.6% 91.0% 100.0% 100.0% 97.7% 83.3% 100.0% 100.0% 100.0% 100.0%
capitalize 93.5% 60.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 98.2% 75.0%
wrap 100.0% 100.0% 100.0% 100.0% 99.0% 75.0% 99.3% 83.3% 98.9% 93.8% 100.0% 100.0%
addNode 100.0% 100.0% 100.0% 100.0% 81.8% 21.7% 99.5% 95.7% 76.5% 37.9% 100.0% 100.0%
removeNode 100.0% 100.0% 100.0% 100.0% 93.7% 33.3% 100.0% 100.0% 91.7% 50.0% 100.0% 100.0%
classify 97.0% 87.0% 100.0% 100.0% 97.1% 88.5% 100.0% 100.0% 98.1% 90.3% 99.1% 90.3%
decodeName 95.9% 80.0% 100.0% 100.0% 95.0% 55.5% 98.4% 55.5% 95.3% 62.5% 99.2% 87.5%
sqrt 100.0% 100.0% 100.0% 100.0% 99.0% 66.7% 99.5% 50.0% 100.0% 100.0% 100.0% 100.0%
Average 97.8% 90.1% 96.1% 91.3% 96.9% 72.1% 98.6% 82.0% 96.7% 85.3% 99.7% 96.1%
[Figure 1 (plot): percentage of covered mutants with respect to the reference front (reference mutant set), per method, for Major, PITRV and muJava.]
is that all tools have important inadequacies, which range from 0% to 66%. On average, the
differences are 20%, 15% and 9% for Major, muJava and PITRV, respectively.
Overall, we found that PITRV is the top-ranked tool, followed by muJava and
Major. PITRV achieves a higher mutation score (w.r.t. the reference mutant set) than
muJava in 9 cases, an equal score in 1 and a lower score in 2. Compared to Major, PITRV performs
better in 7 cases, equally in 2 and worse in 3. Therefore, we conclude that, according to
our sample, PITRV is the most effective tool.
Figure 2: Cross-Evaluation Experiment: Number of Alive Mutants per Mutation Operator, Test Suite and Tool (series: PITRV-TS vs. Major, PITRV-TS vs. muJava, Major-TS vs. PITRV, Major-TS vs. muJava, muJava-TS vs. PITRV, muJava-TS vs. Major)
changes in the bytecode. Additionally, PITRV's CRCR operator can be enhanced because
it misses certain cases to which Major's LVR applies, due to the aforementioned
problem. For instance, at line 410 of the gcd method of the Commons-Math test subject,
Major mutates the statement if (u > 0) to if (u > 1). This change is not made
by PITRV's CRCR operator because, in the bytecode, the zero constant is never pushed
onto the stack. To make PITRV more effective, such cases should be handled
accordingly.
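To see why the zero constant is invisible at the bytecode level, consider the (approximate) bytecode that javac emits for the two comparisons; this is a general property of the Java compiler rather than a result reported in the study:

```java
// Comparison against zero vs. a non-zero constant: javac encodes the former in
// the branch instruction itself, so no constant is pushed that a bytecode
// constant-replacement operator could mutate.
public class ZeroComparison {

    static boolean positive(int u) {
        // "u > 0" compiles (approximately) to:
        //   iload_0
        //   ifle <false-branch>      // the comparison with 0 is part of the opcode
        return u > 0;
    }

    static boolean greaterThanOne(int u) {
        // "u > 1" compiles (approximately) to:
        //   iload_0
        //   iconst_1                 // the constant 1 is pushed and can be mutated
        //   if_icmple <false-branch>
        return u > 1;
    }
}
```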
Table 8: Tools’ Application Cost: Number of Equivalent Mutants and Required Tests
                 Major            PITRV            muJava
Method           #Eq.   #Tests    #Eq.   #Tests    #Eq.   #Tests
gcd 17 6 70 7 23 7
orthogonal 3 8 22 8 5 9
toMap 5 7 18 6 7 5
subarray 5 6 12 5 8 6
lastIndexOf 2 8 10 8 4 12
capitalize 6 5 22 4 14 9
wrap 8 10 36 7 19 7
addNode 11 8 57 22 33 34
removeNode 2 5 14 5 7 6
classify 7 25 42 22 38 27
decodeName 24 5 57 7 28 10
sqrt 4 4 22 4 17 6
Total 94 97 382 105 203 138
[Figure 3 (plot): number of generated killable and equivalent mutants per mutation operator, for muJava, Major and PITRV.]
test suites and muJava the greatest. It is interesting to notice that, although PITRV
generates a high number of mutants, it requires a considerably small number of mutation
adequate test cases, indicating that the current version of the tool suffers from high mutant
redundancy, which in turn suggests that the efficiency of the tool can be greatly improved.
Future versions of PITRV should take this finding into account.
The previously-described results indicate that Major is the most efficient tool and
PITRV the most expensive one, with muJava standing in the middle. Considering
that PITRV was found to be the most effective tool, both in terms of fault revelation and
mutant killability, it is no surprise that it is the least efficient one. Analogously, Major
requires less effort, a fact justified by its lower performance.
To better understand the nature of the generated equivalent mutants, Figure 3 illustrates
the contribution of each mutation operator to the generated killable and equivalent
mutants per tool. In the case of muJava, AOIS and ROR generate most of the
tool's equivalent mutants. For Major, ROR generates most of the equivalent mutants,
followed by LVR, AOR and COR. In the case of PITRV, UOI generates the most
equivalent mutants, followed by ROR, CRCR and ABS.
6 Threats to Validity
Like every empirical study, this one faces several threats to its validity. Here, we discuss
these threats along with the actions we took to mitigate them.
External Validity. External validity refers to the ability of a study’s results to gen-
eralise. Such threats are likely due to the programs, faults or test suites that we use,
as they might not be representative of actual cases. To mitigate the underlying threats,
we utilised a publicly available benchmark (Defects4J), which was built independently
from our work and consists of 6 real world open source projects and real faults. We
also used 6 additional test subjects and manually analysed 12 methods whose applica-
tion domain varies. Although we cannot claim that our results are generalisable, our
findings indicate specific inadequacies in the mutants produced by the studied tools.
These involve incorrectly implemented or unsupported mutation operators, evidence
that is unlikely to be case-specific.
Internal Validity. Internal validity includes potential threats to the conclusions we
draw. Our conclusions are based on a benchmark fault set, automatically generated
test suites and manual analysis, i.e., on the identified equivalent mutants and mutation
adequate test suites. Thus, the use of the tools might have introduced errors in our
measurements. For instance, it could be the case that the test oracles generated by the
tools are weak and cannot detect the mutants or the studied faults. We also performed
our experiments on the clean (fixed) program versions, whose behaviour may differ from that
of the buggy versions [7], because the existing Java tools only operate on passing test
suites; moreover, this is common practice in this type of experiment. To mitigate
these threats, we carefully checked our scripts, verified some of our results, performed
sanity checks and generated multiple test suites using two different tools. However, we
consider these threats to be of no substantial importance since our results are consistent in
both the manual and automated scenarios we analyse.
Other threats are due to the manual analysis we performed. To mitigate them,
we ensured that this analysis was performed by different persons, to avoid any bias in
the results, and that all results produced by students were independently checked for
correctness by at least one of the authors. Another potential threat is due to the fact that
we did not control the test suite size. However, our study focuses on investigating the
effectiveness of the studied tools when used as a means to generate strong tests [36].
To cater for wider scrutiny, we made publicly available all the data of this study [27].
Construct Validity. Construct validity pertains to the appropriateness of the mea-
sures utilised in our experiments. For the effectiveness comparison, we used fault
detection (using real faults), mutation score and disjoint mutation score measurements.
These are well-established measures in mutation testing literature [41]. Another threat
originates from evaluating the tools’ effectiveness based on the reference fault and mu-
tant set that are revealed by the manually generated or automatically generated test
suites. We deemed this particular measure appropriate because it constitutes a metric
that combines the overall performance of the tools and enables their ranking. Finally,
the number of equivalent mutants and generated tests might not reflect the actual cost
of applying mutation. We adopted these metrics because they involve manual analysis
which is a dominant cost factor when testing.
7 Related Work
Mutation testing is a well-studied technique with a rich history of publications, as
recorded in the surveys of Offutt [34] and Jia and Harman [19].
Mutation was originally suggested as a method to help programmers generate
effective test cases [10]. Since then, researchers have used it to support various other
software engineering tasks [34]. In particular, mutation analysis has been employed
in: test generation [36], test oracle selection and assessment [14], debugging [38], test
assessment [41] and regression testing [46]. It has also been applied to artefacts
other than source code, such as models [12] and software product line configurations
[17].
The main problems of mutation testing are the large number of mutants and the
so-called equivalent mutant problem [40, 22]. To tackle these problems, several mutant
selection strategies have been suggested. Mutant sampling is perhaps the simplest and most
effective way of doing so. Depending on the sampling ratio, it provides several trade-offs
between the number of mutants and the loss of effectiveness (fault detection) [37];
e.g., sampling ratios of 10% to 60% result in a fault detection loss of 26% to 6%.
Selective mutation [33] is another form of mutant reduction that only applies specific
types of mutants. However, recent research has shown that there are no significant
differences between selective mutation and random sampling [45, 28]. To deal with
the equivalent mutant problem, researchers have adopted compiler optimisations [40],
constraint-based techniques [32] and verification techniques [5]. However, despite these
efforts, the problem remains open, especially for the case of Java. This is the main
reason why we manually identified and report on the equivalent mutants produced by
the tools.
Another problem related to mutation testing regards the generation of redundant
mutants. These mutants do not contribute to the testing process, while at the same time
they introduce noise to the mutation score measurement. Papadakis et al. [41] demonstrated
experimentally that there is a good chance (approximately 60%) of drawing wrong conclusions
in arbitrary experiments when measuring test thoroughness using all mutants rather
than only the disjoint/subsuming ones. Unfortunately, this result implies that an experiment
may conclude that one testing method is superior to another when in fact it is not.
The problem of redundant mutants was initially identified by Kintis et al. [24] with the
notion of disjoint mutants. Later, Ammann et al. [2] formalised the concept based on the
notion of dynamic subsumption. Unfortunately, these techniques focus on the undesirable
effects of redundant mutants and not on their identification. Perhaps the only available
technique that is capable of identifying such mutants is "Trivial Compiler Equivalence" (TCE) [40]. TCE
is based on compiler optimisations and identifies duplicated mutants (mutants that are
mutually equivalent but differ from the original program). According to the study of
Papadakis et al. [40], 21% of the mutants are duplicated and can easily be removed
based on compiler optimisations. All these studies identified the problems caused by
trivial/redundant mutants, but none of them studied the particular weaknesses of modern
mutation testing tools as we do here. Additionally, to deal with trivial/redundant
mutants, we used (1) the mutation score, (2) the disjoint mutation score and (3)
fault detection as effectiveness measures.
Manual analysis has been used extensively in the mutation testing literature. Yao
et al. [44] analysed 4,181 mutants to provide insights into the nature of equivalent
and stubborn mutants. Li et al. [29] manually analysed 2,919 mutants to compare
test cases generated for mutation testing with those generated for various control-
and data-flow coverage criteria. Deng et al. [11] analysed 5,807 mutants generated by
muJava to investigate the effectiveness of the SDL mutation operator. Papadakis et
al. [39] used manual analysis to study mutant classification strategies and found that
such techniques are helpful only for partially improving test suites (of low quality). Older
studies on mutant selection involved manual analysis to identify equivalent mutants
and generate adequate test suites [33].
Previous work on the differences of mutation testing frameworks for Java is due
to Delahaye and Du Bousquet [9]. Delahaye and Du Bousquet compare several tools
based on various criteria, such as the supported mutation operators, implementation
differences and ease of usage. The study concluded that different mutation testing
tools are appropriate for different scenarios. A similar study was performed by Rani
et al. [42], who compared several Java mutation testing tools based on a set of
manually generated test cases. The authors concluded that PIT generated the smallest
number of mutants, most of which were killed by the employed test suite (only 2%
survived), whereas muJava generated the largest number of mutants, 30% of which
survived.
Gopinath et al. [16] investigated the effectiveness of mutation testing tools by us-
ing various metrics, e.g., comparing the mutation score (obtained by the test subjects’
accompanying test suites) and number of disjoint/minimal mutants that they produce.
They found that the examined tools exhibit considerable variation in their performance
and that no single tool is consistently better than the others.
The main differences between our study and the aforementioned ones are that we
compare the tools based on their real fault revelation ability and that we cross-evaluate their
effectiveness based on the results of a complete manual analysis. The manual analysis
constitutes a mandatory requirement (see Section 3) for performing a reliable effective-
ness comparison between the tools. This twofold comparison is one of the strengths
of the present paper as it is the first one in the literature to compare mutation testing
tools in such a way. Further, we identified specific limitations of the tools and provided
actionable recommendations on how each of the tools can be improved. Lastly, we
analysed and reported the number and characteristics of equivalent mutants produced
by each tool.
8 Conclusions
Mutation testing tools are widely used as a means to support research. This practice
intensifies the need for reliable, effective and robust mutation testing tools. Today,
most of the tools are mature and robust; hence, the emerging question concerns their
effectiveness, which is currently unknown.
In this paper, we reported results from a controlled study that involved manual
analysis (on a sample of program functions selected from open source programs) and
simulation experiments on open source projects with real faults. Our results showed
that one tool, PITRV, the research version of PIT, performs significantly better than the
other studied tools, namely muJava and Major. At the same time, our results showed
that the studied tools are generally incomparable, as none of them always subsumes the
others. Nevertheless, we identified the few deficiencies of PITRV and made actionable
recommendations on how to strengthen it and improve its practical application.
Overall, our results demonstrate that PITRV is the most prominent choice of mutation
testing tool for Java, as it successfully revealed 97% of the real faults we studied
and performed best in our manual analysis experiment.
References
[1] Ammann P, Offutt J (2008) Introduction to Software Testing, 1st edn. Cambridge
University Press, New York, NY, USA
[2] Ammann P, Delamaro ME, Offutt J (2014) Establishing theoretical minimal sets
of mutants. In: Seventh IEEE International Conference on Software Testing, Ver-
ification and Validation, ICST 2014, March 31 2014-April 4, 2014, Cleveland,
Ohio, USA, pp 21–30, DOI 10.1109/ICST.2014.13
[3] Andrews J, Briand L, Labiche Y, Namin A (2006) Using mutation analysis for
assessing and comparing testing coverage criteria. Software Engineering, IEEE
Transactions on 32(8):608–624, DOI 10.1109/TSE.2006.83
[4] Baker R, Habli I (2013) An empirical evaluation of mutation testing for improving
the test quality of safety-critical software. Software Engineering, IEEE Transac-
tions on 39(6):787–805, DOI 10.1109/TSE.2012.56
[6] Budd TA, Angluin D (1982) Two notions of correctness and their relation to test-
ing. Acta Informatica 18(1):31–45, DOI 10.1007/BF00625279
[7] Chekam TT, Papadakis M, Traon YL, Harman M (2017) Empirical study on mu-
tation, statement and branch coverage fault revelation that avoids the unreliable
clean program assumption. In: ICSE
[8] Coles H (2010) The PIT mutation testing tool. URL http://pitest.org/, last accessed June 2016
[9] Delahaye M, Du Bousquet L (2015) Selecting a software engineering tool:
lessons learnt from mutation analysis. Software: Practice and Experience
45(7):875–891, DOI 10.1002/spe.2312
[10] DeMillo RA, Lipton RJ, Sayward FG (1978) Hints on test data selection: Help
for the practicing programmer. IEEE Computer 11(4):34–41, DOI 10.1109/C-M.
1978.218136, URL http://dx.doi.org/10.1109/C-M.1978.218136
[16] Gopinath R, Ahmed I, Alipour MA, Jensen C, Groce A (2016) Does choice
of mutation tool matter? Software Quality Journal pp 1–50, DOI 10.1007/
s11219-016-9317-7
[17] Henard C, Papadakis M, Traon YL (2014) Mutation-based generation of software
product line test configurations. In: Search-Based Software Engineering - 6th
International Symposium, SSBSE 2014, Fortaleza, Brazil, August 26-29, 2014.
Proceedings, pp 92–106, DOI 10.1007/978-3-319-09940-8_7, URL http://dx.
doi.org/10.1007/978-3-319-09940-8_7
[18] Coles H, Laurent T, Henard C, Papadakis M, Ventresque A (2016) PIT: a practical
mutation testing tool for Java (demo). In: ISSTA, pp 449–452, DOI 10.1145/2931037.2948707
[19] Jia Y, Harman M (2011) An analysis and survey of the development of mutation
testing. Software Engineering, IEEE Transactions on 37(5):649–678, DOI 10.
1109/TSE.2010.62
[20] Just R, Schweiggert F, Kapfhammer GM (2011) MAJOR: an efficient and exten-
sible tool for mutation analysis in a java compiler. In: 26th IEEE/ACM Interna-
tional Conference on Automated Software Engineering (ASE 2011), Lawrence,
KS, USA, November 6-10, 2011, pp 612–615, DOI 10.1109/ASE.2011.6100138,
URL http://dx.doi.org/10.1109/ASE.2011.6100138
[21] Just R, Jalali D, Ernst MD (2014) Defects4j: a database of existing faults to
enable controlled testing studies for java programs. In: International Sympo-
sium on Software Testing and Analysis, ISSTA ’14, San Jose, CA, USA -
July 21 - 26, 2014, pp 437–440, DOI 10.1145/2610384.2628055, URL http:
//doi.acm.org/10.1145/2610384.2628055
[22] Kintis M (2016) Effective methods to tackle the equivalent mutant problem when
testing software with mutation. PhD thesis, Department of Informatics, Athens
University of Economics and Business
[23] Kintis M, Malevris N (2015) MEDIC: A static analysis framework for equivalent
mutant identification. Information and Software Technology 68:1 – 17, DOI 10.
1016/j.infsof.2015.07.009
[24] Kintis M, Papadakis M, Malevris N (2010) Evaluating mutation testing alterna-
tives: A collateral experiment. In: Proceedings of the 17th Asia-Pacific Software
Engineering Conference, pp 300–309, DOI 10.1109/APSEC.2010.42
[25] Kintis M, Papadakis M, Malevris N (2015) Employing second-order mutation for
isolating first-order equivalent mutants. Software Testing, Verification and Relia-
bility (STVR) 25(5-7):508–535, DOI 10.1002/stvr.1529
[26] Kintis M, Papadakis M, Papadopoulos A, Valvis E, Malevris N (2016) Analysing
and comparing the effectiveness of mutation testing tools: A manual study. In:
International Working Conference on Source Code Analysis and Manipulation,
pp 147–156
[27] Kintis M, Papadakis M, Papadopoulos A, Valvis E, Malevris N, Traon YL (2017)
Supporting site for the paper: How effective are mutation testing tools? An empirical
analysis of Java mutation testing tools with manual analysis and real faults.
URL http://pages.cs.aueb.gr/~kintism/papers/mttoolscomp
[28] Kurtz B, Ammann P, Offutt J, Delamaro ME, Kurtz M, Gökçe N (2016) Analyz-
ing the validity of selective mutation with dominator mutants. In: Proceedings of
the 24th ACM SIGSOFT International Symposium on Foundations of Software
Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016, pp 571–
582, DOI 10.1145/2950290.2950322, URL http://doi.acm.org/10.1145/
2950290.2950322
[29] Li N, Praphamontripong U, Offutt J (2009) An experimental comparison of four
unit test criteria: Mutation, edge-pair, all-uses and prime path coverage. In: Soft-
ware Testing, Verification and Validation Workshops, International Conference
on, pp 220–229, DOI 10.1109/ICSTW.2009.30
[30] Lindström B, Márki A (2016) On strong mutation and subsuming mutants. In:
Procs. 11th Inter. Workshop on Mutation Analysis
[31] Ma YS, Offutt J, Kwon YR (2005) MuJava: an automated class mutation system.
Software Testing, Verification and Reliability 15(2):97–133, DOI 10.1002/stvr.
308
[32] Offutt AJ, Pan J (1997) Automatically detecting equivalent mutants
and infeasible paths. Softw Test, Verif Reliab 7(3):165–192, DOI
10.1002/(SICI)1099-1689(199709)7:3<165::AID-STVR143>3.0.CO;2-U,
URL http://dx.doi.org/10.1002/(SICI)1099-1689(199709)7:
3<165::AID-STVR143>3.0.CO;2-U
[33] Offutt AJ, Lee A, Rothermel G, Untch RH, Zapf C (1996) An experimental de-
termination of sufficient mutant operators. ACM Transactions on Software Engi-
neering and Methodology 5(2):99–118
[34] Offutt J (2011) A mutation carol: Past, present and future. Information & Soft-
ware Technology 53(10):1098–1107, DOI 10.1016/j.infsof.2011.03.007
[35] Pacheco C, Ernst MD (2007) Randoop: feedback-directed random testing for
java. In: Companion to the 22nd Annual ACM SIGPLAN Conference on Object-
Oriented Programming, Systems, Languages, and Applications, OOPSLA 2007,
October 21-25, 2007, Montreal, Quebec, Canada, pp 815–816, DOI 10.1145/
1297846.1297902, URL http://doi.acm.org/10.1145/1297846.1297902
[36] Papadakis M, Malevris N (2010) Automatic mutation test case generation via
dynamic symbolic execution. In: Software Reliability Engineering, 21st Interna-
tional Symposium on, pp 121–130, DOI 10.1109/ISSRE.2010.38
[37] Papadakis M, Malevris N (2010) An empirical evaluation of the first and second
order mutation testing strategies. In: Third International Conference on Software
Testing, Verification and Validation, ICST 2010, Paris, France, April 7-9, 2010,
Workshops Proceedings, pp 90–99, DOI 10.1109/ICSTW.2010.50, URL http:
//dx.doi.org/10.1109/ICSTW.2010.50
[39] Papadakis M, Delamaro ME, Traon YL (2014) Mitigating the effects of equiva-
lent mutants with mutant classification strategies. Sci Comput Program 95:298–
319, DOI 10.1016/j.scico.2014.05.012, URL http://dx.doi.org/10.1016/
j.scico.2014.05.012
[40] Papadakis M, Jia Y, Harman M, Traon YL (2015) Trivial compiler equivalence:
A large scale empirical study of a simple, fast and effective equivalent mutant
detection technique. In: 37th International Conference on Software Engineering,
vol 1, pp 936–946, DOI 10.1109/ICSE.2015.103
[41] Papadakis M, Henard C, Harman M, Jia Y, Traon YL (2016) Threats to the valid-
ity of mutation-based test assessment. In: Proceedings of the 2016 International
Symposium on Software Testing and Analysis, ACM, New York, NY, USA, IS-
STA 2016