0% found this document useful (0 votes)
6 views8 pages

2021 Cec

This document presents a coverage-based grammar-guided genetic programming algorithm for generating test suites aimed at improving software testing efficiency and fault detection. The proposed method utilizes Finite State Machines (FSMs) to model system specifications and employs coverage-based fitness functions to enhance test suite effectiveness. Experimental results indicate that this approach outperforms traditional methods in both performance and execution time while maintaining or improving fault detection capabilities.

Uploaded by

diptikalyan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views8 pages

2021 Cec

This document presents a coverage-based grammar-guided genetic programming algorithm for generating test suites aimed at improving software testing efficiency and fault detection. The proposed method utilizes Finite State Machines (FSMs) to model system specifications and employs coverage-based fitness functions to enhance test suite effectiveness. Experimental results indicate that this approach outperforms traditional methods in both performance and execution time while maintaining or improving fault detection capabilities.

Uploaded by

diptikalyan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Coverage-Based Grammar-Guided Genetic

Programming Generation of Test Suites


Alfredo Ibias Pablo Vazquez-Gomis Miguel Benito-Parejo
DTRS research group DTRS research group DTRS research group
Universidad Complutense de Madrid Universidad Complutense de Madrid Universidad Complutense de Madrid
28040, Madrid, Spain 28040, Madrid, Spain 28040, Madrid, Spain
aibias@ucm.es pavazq01@ucm.es mibeni01@ucm.es

Abstract—Software testing is fundamental to ensure the re- Another problem of software testing is the so called Oracle
liability of software. To properly test software, it is critical to problem [2], [25]. This problem focuses on how to decide
generate test suites with high fault finding ability. We propose that a SUT is correct or not given the obtained outputs, and
a new method to generate such test suites: a coverage-based
grammar-guide genetic programming algorithm. This evolution- for extension, on how well a test suite detects a fault in the
ary computation based method allows us to generate test suites program. For this task a lot of approaches had been tried,
that conform with respect to a specification of the system under from classical deterministic solutions [7], [11] to more evolved
test using the coverage of such test suites as a guide. We ones, like those based on genetic algorithms [3], [4], [31]. In
considered scenarios for both black-box testing and white-box this paper we use Finite State Machines (FSMs) to represent
testing, depending on the different criteria we work with at
each situation. Our experiments show that our proposed method the specification of the SUT, and we used a coverage-based
outperforms other baseline methods, both in performance and criterion to decide how good the test suites are detecting faults.
execution time. We also use the mutation score to compare our approach
Index Terms—Genetic Programming, Coverage, Software Test- to others from the literature. Mutation score is a measure
ing from mutation testing that calculates the percentage of mutants
killed by a test suite. We say that a test kills a mutant when
I. I NTRODUCTION
the mutation has been discovered when executing the test.
Testing software is a fundamental step on every software Evolutionary computation algorithms are a well known
development process. The goal is to improve the quality of family of algorithms and meta-heuristics that focus on the
the software searching for faults in it, using as few resources evolution of a bunch of individual solutions to obtain an ap-
as possible. However, due to its criticality, testing can cost proximately optimal solution. This family ranges from the Ge-
more than 50% of the development budget [30]. Therefore, netic Algorithms [34] based on the evolution of the genomes
good, cheap and effective methods for testing software are to the Ant Colony Optimisation algorithm [9] based on the
fundamental for the software product cycle. One of the main organisation of an ant colony, passing by the Particle Swarm
techniques in the software testing field [1], [30] is to try to find Optimisation algorithm [5], [23] based on the development of
faults through the execution of input sequences and comparing a flock of birds. Genetic algorithms [34] are an approximation
the outputs obtained with the expected ones. The combination method to find good or nearly good solutions to computation-
of the input sequence and the expected outputs is a test suite, ally exponential problems. They focus on generating random
and the goal of the method is to use the test suites with higher solutions (called individuals) and improving them through the
fault finding ability. In order to generate such test suites, the mixture and mutation of the best generated ones. To decide
most widely known approach is mutation testing [22], [32], which individual is better they use a previously chosen criteria
which uses mutants (i.e. modified versions) of the System Un- (called fitness function) that should guide the evolution. Ge-
der Test (SUT) (or more usually, its specification) to generate netic programming [24] is an extension of genetic algorithms
such test suites. However, one of the main issues with this that manage to deal with structured types, being able to find
method is its computational cost. More feasible approaches solutions to a wider range of problems. Specifically, genetic
focus on finding good test suites with respect to a chosen programming used to work with tree-like structures, which can
criteria, reducing the computational and resources costs but be totally free or bounded to a grammar [27]. In our work we
diminishing at the same time the effectiveness of the method. use a genetic programming algorithm whose individuals are
In this paper we present one of such approaches. bounded to a grammar that will ensure that they will conform
This work has been supported by the Spanish MINECO/FEDER project to valid test suites of the SUT.
FAME (RTI2018-093608-B-C31); the Region of Madrid project FORTE- In this paper we present a Grammar-Guided Genetic Pro-
CM (S2018/TCS-4314) co-funded by EIE Funds of the European Union; gramming algorithm to generate test suites. This algorithm
the Region of Madrid - Complutense University of Madrid (grant number
PR65/19-22452); and the Santander – Complutense University of Madrid uses a coverage-based measure as a guide, and therefore
(grant number CT63/19-CT64/19). obtains test suites that have a high coverage of the FSM that
models the SUT. The goal of this algorithm is to generate actions, O is a finite set of output actions, and T ⊆ Q × (I ×
test suites with a high fault detection ability, and it uses O) × Q is the transition relation. The meaning of a transition
coverage-based fitness functions to that end. Specifically, we (q, (i, o), q 0 ) ∈ T , also denoted by (q, i/o, q 0 ), is that if M
use measures based on t-way coverage, which measures how receives input action i when in state q then it can move to
many groups of t consecutive elements of the SUT are covered state q 0 and produce output action o.
by the test suite. These coverage-based fitness functions are We say that M is deterministic if for all q ∈ Q and i ∈
prepared for two scenarios: a white-box scenario where we I there exists at most one pair (q 0 , o) ∈ Q × O such that
have information about the internal structure of the SUT (like (q, i/o, q 0 ) ∈ T .
states) and a black-box scenario where we do not have such We assume that FSMs are deterministic. This simplifies our
information and we can only observe input and outputs. We scenario so we can use genetic algorithms, without losing
performed experiments to compare this approach to more applicability. However, our algorithm can be used on non-
traditional ones and to different variations of itself. Specif- deterministic FSMs with some adaptations to the specification.
ically, we compared our algorithm with a grammar-guided An FSM can be represented by a diagram in which nodes
genetic programming algorithm that uses as fitness function represent states and transitions are represented by arcs between
the mutation score of the individuals. In these experiments the nodes. In our case, all states are final as long as they are
we obtained that our solutions generated using coverage-based reachable from the initial state.
fitness functions were quite effective, as they took less time We will assume the minimal test hypothesis [21]: the SUT
to be computed, while not losing a lot of (or even having can be modelled as an (unknown) object described in the same
better) finding faults potential. We found that the coverage formalism as the specification (here, an FSM).
criteria based on 1-way transition coverage is the preferable We say that a mutant M 0 = (Q, qin , I, O, T 0 ) of an FSM
choice when generating test suites if we are in a black-box M = (Q, qin , I, O, T ) is another FSM such that T and T 0
scenario. For a white-box scenario, the 2-way state coverage only differ in one transition in the form of (q, (i, o), q 0 ) ∈ T
is preferred. and (q, (i, o), q 00 ) ∈ T 0 with q 0 6= q 00 .
In this paper, Section II introduces the basic concepts over Our main goal while testing is to decide whether the be-
which our algorithm is developed. Section III introduces our haviour of an SUT conforms to the specification of the system
coverage-based grammar-guided genetic programming algo- that we would like to build. In order to detect differences
rithm for generating test suites. Later, Section IV presents our between specifications and SUTs, we need to compare their
experiments evaluating our algorithm. In Section V we discuss behaviours, and the main notion to define such behaviours is
some aspects of our algorithm and our experiments. Section VI given by the concept of a trace.
evaluates the threats to the validity of our experiments. Finally, Definition 2: Let M = (Q, qin , I, O, T ) be an FSM, σ =
Section VII contains the conclusions of our work and lines for (i1 , o1 ) . . . (ik , ok ) ∈ (I × O)∗ be a sequence of pairs and
future work. q ∈ Q be a state. We say that M can perform σ from q if
there exist states q1 . . . qk ∈ Q such that for all 1 ≤ j ≤ k we
II. T HEORETICAL BACKGROUND have (qj−1 , ij /oj , qj ) ∈ T , where q0 = q. If q = qin then we
We model systems as Finite State Machines (FSMs). In order say that σ is a trace of M . We denote by traces(M ) the set
to define an FSM, we first introduce some notation. of traces of M . Note that  ∈ traces(M ) for every FSM M .
Given set A, A∗ denotes the set of finite sequences of ele- Next we define the notion of test. As previously explained,
ments of A; A+ denotes the set of non-empty finite sequences a test is a sequence of (input action, output action) pairs. A
of elements of A; and  ∈ A∗ denotes the empty sequence. test suite will be a set of tests.
We let |A| denote the size of set A. Given a sequence σ ∈ A∗ , Definition 3: Let M = (Q, qin , I, O, T ) be an FSM. We
|σ| denotes its length. Given a sequence σ ∈ A∗ and a ∈ A, say that t = (i1 , o1 ) . . . (ik , ok ) ∈ (I × O)+ is a test for
we have that σa denotes the sequence σ followed by a and M if t ∈ traces(M ). The length of t is the length of the
aσ denotes the sequence σ preceded by a. sequence, that is, |t| = k. In addition, the sequence of input
Throughout this paper we let I be the set of input actions actions of t is λ = i1 . . . ik and the sequence of output actions
and O be the set of output actions. It is important to differen- of t is µ = o1 . . . ok . We will sometimes use the notation
tiate between input actions and inputs of the system. An input t = (λ, µ) ∈ (I + × O+ ). We write (i, o) ∈ t to denote that
of a system will be a non-empty sequence of input actions, the pair (i, o) appears in the test t; (i, o) ∈n t denotes that the
that is, an element of I + (similarly for outputs and output pair (i, o) appears n times in the test t.
actions). A test suite for M is a set of tests for M . Given a test suite
An FSM is a (finite) labelled transition system in which T = {t1 , . . . , tn }, the length of the test P suite is the sum of
every transitions is labelled with an input/output pair (a pair the lengths of its tests, that is, |T | = i=1,...,n |ti |.
containing an input action and an output action). We use this Let t = (λ, µ) be a test for M . We say that the application
formalism to define specifications. of t to an FSM M 0 fails if there exists µ0 such that (λ, µ0 ) ∈
Definition 1: A Finite State Machine (FSM) is represented traces(M 0 ) and µ 6= µ0 . Similarly, let T be a test suite for
by a tuple M = (Q, qin , I, O, T ) in which Q is a finite set M . We say that the application of T to an FSM M 0 fails if
of states, qin ∈ Q is the initial state, I is a finite set of input there exists t ∈ T such that the application of t to M 0 fails.
The notion of coverage is quite simple in our context: it
represent how much of the FSM a test suite traverses. We
define three different coverage criteria based on the t-way
coverage definition:
• t-way transition coverage: the percentage of sets of t
consecutive transition labels that the test suite traverses.
We define transition label as a pair input/output of the
FSM (that is, a pair input/output that corresponds to the
execution of a transition).
• t-way state coverage: the percentage of sets of t consec-
utive states that the test suite visits. We define state as a Fig. 1. GA flowchart
state of the FSM.
• t-way action coverage: the percentage of sets of t consec-
utive actions that the test suite executes. We define action In our work, t ranges from 1 to 3. Along this paper we will
as an input of the FSM. call G set to the grouping of the transition labels/states/actions
We differentiate between these 3 types of coverage as they in sets of t consecutive elements of an FSM.
are the most widely used in the literature [26], [33]. Each Genetic programming is a meta-heuristic that is often used
coverage type focus on a different element of the FSM and to obtain good enough solutions to complicated optimisation
therefore it is mandatory to compare between the three to problems. They are non-deterministic algorithms that con-
see which one yields better results. In order to use state sider multiple possible solutions or individuals at a time,
coverage we need to be aware of the internal state of the and combine their information in order to obtain different
FSM, therefore it can only be considered within a white-box solutions each iteration. Since the objective is to improve the
scenario. However, action coverage and transition coverage final solution, a fitness function evaluates each individual to
only needs observable information and can also be used in prioritise those possible solutions with a better score. This
a black-box scenario. Finally, for a notion of coverage based method is derived from the classical genetic algorithm, with
in transitions (as defined in Def. 1), we require a white-box some adaptations to work with tree-like structures that we use
scenario as well, since the structure of the FSM is also needed. in our work. Usually, a genetic algorithm is divided in 5 steps
This last case will be studied in detail in section IV. structured as follows:
The t-way coverage groups the transition
labels/states/actions of the FSM in sets with exactly t • The initialisation step generates the initial population,
elements. These sets contains t consecutive elements, that is, acting as a seed for the whole process. Such an initialisa-
there exists a sequence of transitions of the FSM such that the tion is usually random, in order not to bias the behaviour
transition labels/states/actions involved in it are exactly the of the algorithm.
ones in the set. For example, for 1-way state coverage, the • The selection step is focused on obtaining the most suited
states of the FSM are group in sets with only one state, and individuals to perform the following steps, and achieving
therefore there are as many sets as states. In the case of 2-way a better solution in next generations.
state coverage however, the states of the FSM are grouped • The crossover step consists of pairing the individuals
in sets of pairs of states. Specifically each set contains two obtained in the selection step, and exchanging parts of
consecutive states, that is, two states that are connected by the structure within each couple.
a transition. Then, in this case there are as many sets as • The mutation step considers each individual after the
transitions, but not as many as combinations of two states, crossover step, and with a small probability performs
because if two states are not connected trough a transition, slight variations or mutations. This process, although
then they are not consecutive. might seem counter-intuitive, it tends to avoid obtaining
With the t-way sets of transition labels/states/actions of the local maximum solutions, by possibly substituting the
FSM, we can check the number of them that appear (without negative-impact elements of the individual for new ones.
repetitions) in a test suite. The percentage of these sets that • The replacement step, finally takes the current population
appear in the test suite will be its t-way transition/state/action and its offspring, and decides which individuals amongst
coverage score. This score represents how much coverage of them conform the following generation.
the SUT this test suite will provide. The idea of genetic algorithms is to iterate the process to
Formally, we define the t-way transition/state/action cover- make the population evolve and produce a better solution.
age score as follows. Therefore, it requires a termination criteria, which usually
Definition 4: Given an FSM M , a grouping G (with |G| considers a bound on the number of iterations.
elements) of its transition labels/states/actions in sets of t It is important to note that the loop for a genetic algorithm
consecutive elements, and a test suite that traverses s of only considers the selection, crossover, mutation and replace-
such sets (without repetitions), the t-way transition/state/action ment steps, as only one initialisation is required. A general
s
coverage score is |G| . flowchart of genetic programming is shown in Fig. 1.
For our genetic programming algorithm we use a grammar • A production rule A −→0 null0 for each state A to a
to guide the generation of the test suites. A grammar is a set of terminal to represent the end of the test.
symbols and rules that restrict the generation capabilities of the With this components, we will produce the nodes of the trees
algorithm in order to ensure the correctness of the generated that represent test suites.
individuals with respect to a chosen criteria (in our case, that
In the initialisation step, we generate random test suites,
they are valid test suites for the SUT).
avoiding duplicated tests. The idea is to generate an initial
III. C OVERAGE - BASED G RAMMAR - GUIDED G ENETIC population as diverse as possible, in order to have a wider
P ROGRAMMING ALGORITHM spectrum to explore with the following steps.
In this section, we describe the specific elements of our To evaluate and compare the individuals of our population,
genetic algorithm, and the structure of the steps we use, to we need a fitness function. As the main contribution of our
present a global idea of our contribution in a more detailed work, we propose several coverage-based fitness functions,
manner. considering the notion of coverage defined in Def. 4. These
Our proposed algorithm is a genetic programming algo- fitness functions are built over coverage criteria to determine
rithm that works with a population of individuals of the which test suites have a higher fault finding capability. We
same size. Each individual is a test suite consisting of n developed 12 fitness functions using t-way coverage criterion.
input/outputs pairs. However, a grammar-guided genetic pro- In particular, we considered transition, state, action t-way
gramming works with tree-like structures (i.e. data structures coverage, and an extension of transition t-way coverage that
that can be abstracted as trees) that conform to a grammar. includes the initial and final states of each transition (i.e. it
Thus, we have to adapt our test suites to be abstractly repre- considered the actual transitions). We instantiated these criteria
sented as grammar-guided trees. We do such a transformation with t ∈ {1, 2, 3}. We will later on compare the results that
defining each node of the tree as an input/output pair (i.e. we obtain with each of these fitness functions.
each node contains an input and an output) with a grammatical In the selection step, we aim at an exploration goal, rather
symbol, then each succession of nodes would be a succession than an exploitation one. In that sense, we consider that our
of input/output pairs, which corresponds to a test. Finally, in selection method should take into account the whole popula-
order to combine the tests of the test suite into a tree, we tion, in order not to lose any genetic diversity. Therefore, our
join them with a dummy node that would be the root of the selection method simply matches pairs of individuals for the
tree. This way, each test would be a branch of the tree. It following steps.
is important to note that we work with a constrained size, For the grammar-guided crossover, we implemented a vari-
which all trees share, in order not to obtain huge trees that ation on the standard crossover, where we select a random
compromise the score of the results, as a bigger size in a test node from each parent such that both nodes have the same
suite would induce a better score. Along the rest of the paper grammatical symbol. Then, we exchange the selected subtrees
we would use input/output pair and node interchangeably, and while maintaining the grammatical correctness of the whole
we would do the same for test and branch, and for test suite tree. This means that the resulting individuals still represent a
and tree. Finally, remark that a subtree of these trees would valid test suite for the given SUT. However, in order to keep
be a section of a test. the length of the test suite (note that such length is the sum of
Since we decided to use a grammar-guided genetic program- the length of each test), we will have to modify the resulting
ming approach, we need to generate a grammar that allows trees accordingly. We have two different scenarios: first, we
to generate test suites that conforms to the FSM. For this have to extend test suites that provide a longer subtree while
grammar, we define the following components [17]: receiving a shorter one; and second we have to bound test
• A start non-terminal symbol S that starts the grammar. suites that receive a longer subtree than the one they provide.
• A non-terminal symbol T S that introduces each test of This is necessary to control the length of a test suite, avoiding
the test suite. an incremental increase of the size of the individuals. The
• A non-terminal symbol A for each state, where A ∈ A reason to keep the length invariant is to have a reasonable
is the state number. comparison between the solutions, as a bigger test suite would
0 0 produce inherently a better score.
• A terminal symbol a/b for each input/output pair present
on the FSM, where a is the input and b is the output. We perform the extension of a subtree by randomly gen-
0 0
• A terminal symbol null to represent the end of a test. erating a grammatically valid continuation of the subtree.
• A production rule S −→ T S to generate the initial test. That means that we expand from the leaf of the subtree
• A production rule T S −→ T S + T S to introduce a new generating new nodes (i.e. input/output pairs) until reaching
test. the adequate length. In the case that we are unable to extend
• A production rule T S −→ 0 to start each test in the such subtree up to the desired length, we generate a new
FSM initial state. random test to substitute the remaining nodes. For the bound
0 0
• A production rule A −→ a/b + A for each transition on larger subtrees, we simply eliminate its final nodes, in order
from the left hand side state A to the right hand side state for the test suite to match the required length. Due to this
A with input/output pair (a, b). variation of the standard crossover, in most cases we add a
little extra genetic information apart from the one belonging suite of fixed length 1000, and therefore the solutions will be
to the parents. test suites for the given FSM. Then, we compare the solution
Our grammar-guided mutation step consists in randomly from both algorithms using mutation score, as it is a well
replacing tests of the test suite for newly generated ones. established method to compare test suites [32].
The idea of this mutation, is to incorporate different tests The experiments were developed as follows: for each FSM
that have not been considered before for the test suite. With we generated two test suites (one using coverage algorithm
this procedure, we add new genetic information (previously and other using mutation score algorithm). Then, we generated
unseen) to the individual and avoid reaching local optimum. 100 mutants of the FSM and computed the mutation score of
We do not consider a mutation on the test suite as such, but each test suite. We repeated this procedure for each of the 100
on each test in the test suite. Since each test is represented FSMs from the experiment subjects set, and we computed the
by a branch of the tree, we remove such test by deleting the average values for all the FSMs. Finally, we performed this
branch, and we include the new test by adding a new branch whole experiment 10 times to obtain a final mean value that
to the dummy root node. It is important to note that a test is is less prone to the randomisation influence. We display these
either fully removed, or is not modified, as we do not interfere final averages on Table I, where:
with partial branches of the tree. • The Success columns indicate the percentage of times
Finally, the last step of our loop combines the best elements where coverage algorithm performed better than the
of both the parent and the offspring populations, to prepare the mutation score algorithm (and vice-versa).
final individuals (either for following iterations, or the end of • The Mutants Killed columns indicate the percentage of
the process). We divided this step in several phases, starting mutants each algorithm was able to kill.
by obtaining the average score for the offspring population. • The Execution Time columns indicate the time that each
With it, we automatically consider the individuals that improve algorithm took to run.
such average value to be in the final population. Next, for The results of the experiments are displayed on Table I.
the remaining individuals, which have a worse score than the There we can see that the three of t-way transition coverage,
average, there is a probability for them to be maintained. To t-way state coverage and t-way action coverage beat mutation
finish, in order to complete the size of the population, that score to a extent, both in fault finding capability and in
should remain invariant, we randomly select among the best computation time. Moreover, we can observe an interesting
individuals from the parent generation. phenomena: the mutation score of the different coverage
The termination criteria that we consider is to perform 100 notions is not uniform with respect to the value of t. This
iterations. However, if during the execution we find that the implies that a greater t does not imply that the resulting test
last 20% of iterations do not improve the score, we terminate suite will have a higher mutation score.
the process. We added such extra condition as we want to
From these results arises the question of what would the
avoid an excessive amount of iterations and computing power
situation be if we computed the t-way transition coverage
that will not yield a better result.
using transitions instead of transition labels. We call this
IV. E XPERIMENTS notion t-way extended transition coverage. It is obvious that
the transitions have more information about the underlying
The experiments we performed aimed to compare our
FSM and its structure, so they could perform better, while not
algorithm with another test suite generation algorithm based on
needing a lot of extra time as they follow a similar concept
mutation score. For these experiments we took as experiment
of coverage. However, the transitions can only be generated if
subjects a set of 100 randomly generated FSMs, each one
we have an oracle (an FSM representing the SUT) or we are
with 100 states. In them, each state has between 5 and 50
in a white-box testing scenario, which limits its applicability.
outgoing transitions, each one with a label conformed by an
We repeated the experiment using the transitions and obtained
input/output pair. These pairs are generated from both input
the results displayed on Table II. As we can observe there, the
and output alphabets, each one with 50 elements. This set of
results are in the line of the ones from the other coverage types,
FSM is generated with the idea of stretching the capabilities
without a huge difference in computation time. Therefore, we
of the proposed algorithm, being big enough SUTs and trying
can conclude that the extra requirements are not worth it.
to be as representative of real-world applications as possible.
Finally, we performed a statistical hypothesis test over all
The idea of the experiments was to compare our proposed
the results. The null hypothesis was that the coverage-based
algorithm (coverage algorithm) to a genetic programming al-
fitness functions and the mutation score fitness function gave
gorithm whose fitness function is the mutation score (mutation
similar results, that is, both produced test suites with similar
score algorithm). Mutation score algorithm uses the same
mutation score. We applied a one-way ANOVA test1 where
grammar-guided genetic programming algorithm as coverage
we tested whether the values of both fitness functions were,
algorithm, the only real change is in the fitness function, and
on average, similar. Then, we computed the p-value for the
therefore in the obtained solutions. Specifically, it generates
experiments. Here, we observed an interesting situation: for
a new set of 10 random mutants each iteration to calculate
the mutant score of the individuals of the population. In these 1 Note that we could use the ANOVA test because we performed an
experiments each individual of the population represents a test homogeneity of variance check and it raised a positive result.
TABLE I
R ESULTS OF THE COMPARISON WITH MUTATION SCORE .

Success Success Mutants Killed Mutants Killed Execution Time Execution Time
Coverage Coverage Mutation Score Coverage Mutation Score Coverage Mutation Score
Type (Percentage) (Percentage) (Percentage) (Percentage) (Seconds) (Seconds)
1 − way transition 0.5644 0.4356 0.3781 0.3660 31.0099 83.9273
coverage
2 − way transition 0.5585 0.4415 0.3738 0.3652 29.1904 84.5565
coverage
3 − way transition 0.5131 0.4869 0.3673 0.3660 9.8079 84.2539
coverage
1 − way state 0.4915 0.5085 0.3660 0.3666 6.8651 85.6319
coverage
2 − way state 0.6167 0.3833 0.3827 0.3633 31.7231 86.0107
coverage
3 − way state 0.5452 0.4548 0.3734 0.3636 31.2126 86.5856
coverage
1 − way action 0.4995 0.5005 0.3663 0.3670 6.8224 85.9126
coverage
2 − way action 0.5074 0.4926 0.3672 0.3654 31.4185 85.9683
coverage
3 − way action 0.4995 0.5005 0.3684 0.3667 10.0687 86.1290
coverage

TABLE II
R ESULTS OF THE COMPARISON WITH MUTATION SCORE ( USING TRANSITIONS ).

Success Success Mutants Killed Mutants Killed Execution Time Execution Time
Coverage Coverage Mutation Score Coverage Mutation Score Coverage Mutation Score
Type (Percentage) (Percentage) (Percentage) (Percentage) (Seconds) (Seconds)
1 − way extended 0.6028 0.3972 0.3823 0.3663 31.2724 87.5435
transition coverage
2 − way extended 0.5458 0.4542 0.3722 0.3651 29.0016 86.3313
transition coverage
3 − way extended 0.4973 0.5027 0.3674 0.3656 10.1902 86.2759
transition coverage

1-way and 2-way transition and extended transition coverage, to the ones that confirmed the null hypothesis. This lack of
and for 2-way and 3-way state coverage the obtained p-values improvement (either because we reached the 100%, or the
were lower than 0.05. However, for the other measures the p- improvement is negligible) triggers the second condition of
values were higher than 0.05. Therefore, we can deny the null the termination criteria, stopping the execution before the 100
hypothesis for the experiments with the first coverage notions iterations are produced, and therefore reducing the execution
with a confidence higher than 0.95, and we have to accept time. This behaviour shows that not all the coverage notions
the null hypothesis for the experiments with the second set that we propose in our work are useful for the average testing
of coverage notions. In order to double-check our results, we practices, because some will arise pointless results.
performed a t-test and obtained the same p-values.
V. D ISCUSSION
We can conclude that some of our coverage-based fitness During the development of our algorithm there were some
functions are better than a mutation score function, both critical decisions we had to make and that can arise some
regarding performance and computation time. However, there questions from an experienced reader. Therefore, in this sec-
is something more that we have to discuss. More precisely, tion we will address such decisions. We discuss in detail our
we detected that the computation time in some experiments decision of using only one kind of coverage measures and our
was lower by a huge margin than for others. After a careful decision of using a genetic algorithm with mutation score as
analysis, we concluded that this behaviour was produced by fitness function as a baseline algorithm for comparing with
the different orders of magnitude between the different G sets. our proposed algorithm. Finally, we also address the decision
For example, the number of sets of 3-way transitions, actions of which crossover to use for our genetic algorithm.
and extended transitions are so huge, that with a test suite
of length 1000 the variation on percentage between two test A. Coverage measures choice
suites is almost null. Alternatively, we have that the number In traditional coverage-based literature there are two kind
of sets of 1-way states and actions are very small (in fact, of coverage metrics: the first one includes in the coverage
these numbers are 100 and 50 in our experimental subjects, metric all the elements a test traverses, and the second one
respectively) and therefore with test suites of length 1000 only includes the last element (or set of elements) of each
is really easy to cover all the sets, obtaining a coverage of test. We decided to stick to the first kind of coverage metrics
100%. It is important to note that these two groups correspond and forget about the second one because the characteristics of
the first kind would help better to the evolution of the genetic was limited. In fact, we could only find two crossovers that
algorithm. We chose the first kind of coverage metrics because conform to those restrictions.
it is easier to obtain different scores for two test suites with The first crossover we considered was simpler but also
the same amount of tests, while the second kind of coverage harder to produce: it consisted in finding the same grammatical
metrics would yield the same score for those two test suites symbol with the same length to the leaf in both trees. This
(if there are no repetitions). For example, let us say we have situation happens very rarely and therefore the amount of
two test suites with only one test: the test of the first test suite crossovers produced was less than optimal (even when trying
has 17 input/output pairs and the test of the second test suite to produce the crossover for each pair of trees). This crossover
has 5 input/output pairs. Let us say also consider the 2-way conformed to the required restrictions because the length was
action coverage. Then, if we use the first kind of metrics the maintained through the interchange of subtrees with the same
first test suite will cover 16 sets of actions and the second test number of nodes, and the grammatical correctness was ensured
suite will cover 4 sets of actions, while if we use the second because both subtrees started with the same grammatical
kind of metrics both test suites cover only 1 set of actions. symbol. However, it obtained worse results than our selected
However, it is clear that we should prefer the first test suite crossover.
over the second, as suggested by the first kind of coverage The second crossover was our selected crossover. It is a bit
measure. more complex but at the same time easier to occur: it consisted
in finding the same grammatical symbol at both trees, exchang-
B. Baseline algorithm choice ing the subtrees that start at such symbols, and then solving
Mutation score is a traditional and widely known measure the possible problems with the length of the offspring. It is
used in mutation testing. It is typically used as a measure in this last step where new genetic information was generated
to compare different test suite generation algorithms, as it in order to extend the shorter tree into the desired length.
has been shown to be highly correlated with the fault finding This crossover conforms to the required restrictions because
ability of a test suite [32]. That’s why we use it to compare the grammatical correctness was ensured due to both subtrees
between different coverage measures and to compare with the starting with the same grammatical symbol, and the length
baseline algorithm. The choice we want to discuss here is why being maintained by generating or bounding the offspring.
we used it as a fitness function of our baseline algorithm. This crossover does not have problems of incapability to be
The reasoning behind this choice is that a genetic algo- produced and therefore it obtains better results than other
rithm (in this case a grammar-guided genetic programming options.
algorithm) whose evolution is guided by mutation score as
fitness function will obtain test suites that obtain a high VI. T HREATS TO VALIDITY
mutation score, and therefore, will obtain the best scores later In this section we briefly discuss some of the possible threats
when comparing with another algorithm using mutation score. to the validity of the results of our experiments. Concerning
However, such algorithm will take a lot of computation time threats to internal validity (results validity), the main threat
due to the high computational cost of generating and using the is associated with the possible faults in the developed experi-
bunch of mutants needed to obtain a mutation score. Therefore, ments because they could lead to misleading results. In order
this kind of algorithm should be a valid baseline with which to reduce the impact of this threat we tested our code with
to compare our proposed algorithm. carefully constructed examples for which we could manually
We could have used other state-of-the-art algorithms to check the results. In addition, we repeated the experiments
which compare our algorithm, like genetic algorithms using many times in order to get a mean so that the randomisation
fitness functions like Test Set Diameter [10], [17], or more impact is reduced. Finally, there is the choice of baseline
classical algorithms like W [11] or Wp [7] methods. However, measure, that if poorly chosen can arise better results than
we considered that a genetic algorithm using mutation score the real ones. This concern is discussed in Section V and we
as fitness function will perform better as a baseline measure. consider it properly addressed.
Anyway, the comparison with these other methods would be The main threat to external validity (results generalisation),
matter of future work, in order to ensure the suitability of our is the different possible systems to which we could apply our
proposed algorithm. algorithm. Such a threat cannot be entirely addressed since
the population of possible systems is unknown and it is not
C. Crossover choice possible to sample from this (unknown) population. In order to
Concerning the crossover selected for our proposed diminish this risk, we perform our experiments over randomly
grammar-guided genetic programming algorithm, our choice generated FSMs prepared to stretch the capabilities of the
arises some concern due to its capability to include new (previ- algorithm.
ously unseen) genetic information into the offspring. We chose Finally, we considered threats to construct validity (exper-
this crossover because we wanted a crossover with two clear iments reality), that is, whether our experiments reflect real-
restrictions: first, the offspring had to be grammatically valid, world situations or not. In our work, the main construct threat
and second the offspring had to keep the same length than its is what would happen if we used our algorithm with much
parents. Then, the spectrum of options that we had available more complex methods. In order to address this concern, we
have performed experiments with huge randomly generated [10] R. Feldt, S. M. Poulding, D. Clark, and S. Yoo. Test set diameter:
FSMs, but there is still room for improvement and it will be Quantifying the diversity of sets of test cases. In 9th IEEE Int. Conf. on
Software Testing, Verification and Validation, ICST’16, pages 223–233.
matter of future work. IEEE Computer Society, 2016.
[11] S. Fujiwara, G. von Bochmann, F. Khendek, M. Amalou, and
VII. C ONCLUSIONS A. Ghedamsi. Test selection based on finite-state models. IEEE
Transactions on Software Engineering, 17(6):591–603, 1991.
Generating test suites with a high fault finding ability is [12] P. Gómez-Abajo, E. Guerra, J. de Lara, and M. G. Merayo. A
tool for domain-independent model mutation. Science of Computer
fundamental for ensuring the quality of software. Moreover, Programming, 163:85–92, 2018.
performing this task in a quick and efficient manner is critical [13] P. Gómez-Abajo, E. Guerra, J. de Lara, and M. G. Merayo. Wodel-Test:
for software budgets. In this paper we have presented a a model-based framework for language-independent mutation testing.
Software and Systems Modeling (in press), 2021.
grammar-guided Genetic Programming algorithm to generate [14] D. Griñán and A. Ibias. Generating tree inputs for testing using evolu-
such tests suites, based on coverage criteria fitness functions. tionary computation techniques. In 22nd IEEE Congress on Evolutionary
We compared our proposal with another method that happens Computation, CEC’20, pages E–24267: 1–8. IEEE Computer Society,
2020.
to be worst than ours. We also found that 1-way transition [15] D. Griñán, A. Ibias, and M. Núñez. Grammar-based tree swarm
coverage is the preferable choice when generating test suites optimization. In 2019 IEEE Int. Conf. on Systems, Man and Cybernetics,
if we are in a black-box scenario, and 2-way state coverage if SMC’19, pages 76–81. IEEE Press, 2019.
[16] R. M. Hierons, M. G. Merayo, and M. Núñez. Bounded reordering
we are in a white-box scenario. in the distributed test architecture. IEEE Transactions on Reliability,
For future work, we would like to explore a wider range 67(2):522–537, 2018.
of coverage notions, specifically t-way coverage with t > 3. [17] A. Ibias, D. Griñán, and M. Núñez. GPTSG: a Genetic Programming
Test Suite Generator using Information Theory measures. In 15th Int.
Evolving over this line of work, we would like to explore Work-Conf. on Artificial Neural Networks, IWANN’19, LNCS 11506,
the significance of the relation between the size of the G pages 716–728. Springer, 2019.
set and the length of the test suites. We would like to deal [18] A. Ibias, R. M. Hierons, and M. Núñez. Using Squeeziness to test
component-based systems defined as Finite State Machines. Information
with bigger numbers of mutants. Therefore, we would like to & Software Technology, 112:132–147, 2019.
include recent work on producing and managing big sets of [19] A. Ibias and M. Núñez. SqSelect: Automatic assessment of failed
mutants [6], [8], [12], [13] into our framework. We would error propagation in state-based systems. Expert Systems with Applica-
tions, 174:114748, 2021.
also like to explore the possibility of using another kind [20] A. Ibias, M. Núñez, and R. M. Hierons. Using mutual information to
of evolutionary computation algorithms, instead of a genetic test from Finite State Machines: Test suite selection. Information &
programming algorithm, like tree swarm optimisation [14], Software Technology, 132:106498, 2021.
[21] ISO/IEC JTCI/SC21/WG7, ITU-T SG 10/Q.8. Information Retrieval,
[15]. In another line of work, we would like to use our recent Transfer and Management for OSI; Framework: Formal Methods in
work on testing using Information Theory concepts [18]–[20] Conformance Testing. Committee Draft CD 13245-1, ITU-T proposed
to implement genetic algorithms using the induced measures recommendation Z.500. ISO – ITU-T, 1996.
[22] Y. Jia and M. Harman. An analysis and survey of the development
as fitness functions. Finally, we would like to extend our of mutation testing. IEEE Transactions on Software Engineering,
framework to deal with FSMs that can represent systems with 37(5):649–678, 2011.
distributed components [16], [28], [29]. [23] J. Kennedy and R. Eberhart. Particle swarm optimization. In 3rd Int.
Conf. on Neural Networks, ICNN’95, pages 1942–1948. IEEE Computer
Society, 1995.
R EFERENCES [24] J. R. Koza. Genetic programming. MIT Press, 1993.
[25] H. Liu, F.-C. Kuo, D. Towey, and T. Y. Chen. How effectively does
[1] P. Ammann and J. Offutt. Introduction to Software Testing. Cambridge metamorphic testing alleviate the oracle problem? IEEE Transactions
University Press, 2nd edition, 2017. on Software Engineering, 40(1):4–22, 2014.
[2] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. The oracle [26] J. D. McGregor and D. A. Sykes. A Practical Guide to Testing Object-
problem in software testing: A survey. IEEE Transactions on Software Oriented Software. Addison Wesley object technology series. Pearson /
Engineering, 41(5):507–525, 2015. Prentice Hall, 2001.
[3] M. Benito-Parejo, I. Medina-Bulo, M. G. Merayo, and M. Núñez. Using [27] R. I. McKay, N. X. Hoai, P. A. Whigham, Y. S., and M. O’Neill.
genetic algorithms to generate test suites for FSMs. In 15th Int. Work- Grammar-based genetic programming: a survey. Genetic Programming
Conf. on Artificial Neural Networks, IWANN’19, LNCS 11506, pages and Evolvable Machines, 11(3-4):365–396, 2010.
741–752. Springer, 2019. [28] M. G. Merayo, R. M. Hierons, and M. Núñez. Passive testing with
[4] M. Benito-Parejo and M. G. Merayo. An evolutionary algorithm asynchronous communications and timestamps. Distributed Computing,
for selection of test cases. In 22nd IEEE Congress on Evolutionary 31(5):327–342, 2018.
Computation, CEC’20, pages E–24535: 1–8. IEEE Computer Society, [29] M. G. Merayo, R. M. Hierons, and M. Núñez. A tool supported
2020. methodology to passively test asynchronous systems with multiple users.
[5] C. Blum and D. Merkle, editors. Swarm Intelligence: Introduction and Information & Software Technology, 104:162–178, 2018.
Applications. Springer, 2008. [30] G. J. Myers, C. Sandler, and T. Badgett. The Art of Software Testing.
[6] P. C. Cañizares, A. Núñez, and M. G. Merayo. Mutomvo: Mutation John Wiley & Sons, 3rd edition, 2011.
testing framework for simulated cloud and HPC environments. Journal [31] A. Núñez, M. G. Merayo, R. M. Hierons, and M. Núñez. Using genetic
of Systems and Software, 143:187–207, 2018. algorithms to generate test sequences for complex timed systems. Soft
[7] T. S. Chow. Testing software design modeled by finite state machines. Computing, 17(2):301–315, 2013.
IEEE Transactions on Software Engineering, 4:178–187, 1978. [32] M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. L. Traon, and M. Harman.
[8] P. Delgado-Pérez and I. Medina-Bulo. Search-based mutant selection Mutation testing advances: An analysis and survey. volume 112 of
for efficient test suite improvement: Evaluation and results. Information Advances in Computers, pages 275 – 378. Elsevier, 2019.
and Software Technology, 104:130–143, 2018. [33] S. Splaine and S. P. Jaskiel. The web testing handbook. STQE Pub.,
[9] M. Dorigo, V. Maniezzo, and A. Colorni. The ant system: Optimization 2001.
by a colony of cooperating agents. IEEE Transactions on Systems, Man [34] M. Srinivas and L. M. Patnaik. Genetic algorithms: A survey. IEEE
and Cybernetics B, 26(1):29–41, 1996. Computer, 27:17–27, 1994.

You might also like