Fault-Based Testing and the Pesticide Paradox
Istvan Forgacs
Posted On: December 14, 2022
In some sense, testing can be more difficult than coding, because validating the efficiency of the test cases (i.e., the ‘goodness’ of your tests) can be much harder than validating code correctness. In practice, tests are just executed, with no validation beyond the pass/fail verdict, whereas the code is (hopefully) always validated by testing. After designing and executing the test cases, all we know is that some tests have passed and others have failed. Testers do not know how many bugs remain in the code, nor how good their tests are at revealing bugs.
Unfortunately, there are no online courses where, after you enter your test cases, the platform tells you which tests are missing (if any) and why. Thus, testers cannot measure their technical abilities (ISTQB measures non-technical knowledge only). The consequence is that testers have only a vague idea of their test design efficiency.
Fortunately, the efficiency of tests can be measured. Suppose you want to know how efficient your test cases are. You could insert 100 artificial but realistic bugs into your application; if the test cases find 80 of them, you might conclude that your test efficiency is about 80%. Unfortunately, bugs influence each other, i.e., some bugs can mask others. Therefore, you should instead create 100 alternative versions of the application, each with a single seeded bug, and run your tests against each of them. If you now find 80 of the artificial bugs, your efficiency is close to 80%, provided the seeded bugs are realistic. This is how mutation testing, a form of fault-based testing, works.
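To make the single-fault-per-version procedure concrete, here is a minimal, hypothetical Java sketch (not from the article): a tiny test suite, a few alternative versions each containing one seeded fault, and the efficiency computed as the fraction of seeded faults the suite detects.

```java
import java.util.List;
import java.util.function.IntUnaryOperator;

public class SeededFaultEfficiencyDemo {
    // A tiny test suite: returns true if the given program version passes all tests.
    // The "application" is expected to double its input.
    static boolean testSuite(IntUnaryOperator version) {
        return version.applyAsInt(0) == 0 && version.applyAsInt(3) == 6;
    }

    public static void main(String[] args) {
        // The original application passes the suite
        System.out.println("Original passes: " + testSuite(x -> 2 * x)); // true

        // Alternative versions of the application, each containing one seeded fault
        List<IntUnaryOperator> faultyVersions = List.of(
                x -> 2 * x + 1,              // off-by-one fault
                x -> x,                      // doubling forgotten
                x -> (x == 7) ? 0 : 2 * x    // fault that only shows up for x == 7, which the suite never tries
        );

        long detected = faultyVersions.stream().filter(v -> !testSuite(v)).count();
        double efficiency = 100.0 * detected / faultyVersions.size();
        System.out.printf("Detected %d of %d seeded faults (%.0f%%)%n",
                detected, faultyVersions.size(), efficiency); // Detected 2 of 3 (67%)
    }
}
```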
This is the basic concept of fault-based testing: selecting test cases that would distinguish the program under test from alternative programs that contain hypothetical faults. If the program code contains a fault, then the output (behavior) of some execution must differ. Therefore, to distinguish the correct code from its alternatives, test cases should be designed so that, for each faulty alternative, some output differs from that of the correct code. Each alternative is a textual modification of the code. However, there is an unmanageable number of alternatives, so we cannot validate the ‘goodness’ of our test set against all of them. Fortunately, Offutt showed that by testing against a certain restricted class of faults, a much wider class of faults can also be found. The set of faults is commonly restricted by two principles: the competent programmer hypothesis and the coupling effect.
The competent programmer hypothesis states that programmers write programs that are close to being correct, so the faults they introduce are mostly small ones. The coupling effect hypothesis states that complex faults are coupled with simple faults in such a way that a test data set detecting all simple faults in a program will also detect a high percentage of the complex faults. A simple fault is one that can be fixed by making a single change to a source statement; a complex fault is one that cannot. If this hypothesis holds, then it is enough to consider simple faults, i.e., faults where the correct code is modified (mutated) by a single change (a mutation operator). Thus, we can apply mutation testing to obtain an efficient test design technique with an appropriate test selection criterion.
Mutation testing
As mentioned, mutation testing is the most common form of fault-based testing: by slightly modifying the original code, we create several mutants. A reliable test data set should then differentiate the original code from the well-selected mutants. In mutation testing, we introduce faults into the code to evaluate the reliability of our test design. Therefore, mutation testing is essentially not testing, but “testing the tests”. A reliable test data set has to “kill” all the mutants. A test kills a mutant if the original code and the mutant behave differently on that test. For example, if the code is y = x and the mutant is y = 2 * x, then the test case x = 0 does not kill the mutant, while x = 1 does.
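The example can be written down directly; the following minimal Java sketch (the class and method names are just for illustration) shows the original, the mutant, and why one input kills the mutant while the other does not.

```java
public class MutantKillDemo {
    // Original code under test: y = x
    static int original(int x) { return x; }

    // First-order mutant: y = 2 * x (a single change to the statement)
    static int mutant(int x) { return 2 * x; }

    public static void main(String[] args) {
        // x = 0: both versions return 0 -> outputs agree -> the mutant survives this test
        System.out.println("x=0 kills the mutant: " + (original(0) != mutant(0))); // false
        // x = 1: original returns 1, mutant returns 2 -> outputs differ -> mutant killed
        System.out.println("x=1 kills the mutant: " + (original(1) != mutant(1))); // true
    }
}
```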
If a mutant hasn’t been killed, then the reasons can be the following:
- Our test case set is not good enough and we should add a test that will kill the mutant.
- The mutant is equivalent to the original code. Deciding equivalence is not an easy task; in fact, the problem is undecidable in general (a typical case is sketched below).
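The following hypothetical Java sketch shows a typical equivalent mutant: changing >= to > in a max function cannot alter the result for any input, because when the two operands are equal, both branches return the same value.

```java
public class EquivalentMutantDemo {
    // Original: returns the larger of a and b
    static int maxOriginal(int a, int b) {
        return (a >= b) ? a : b;
    }

    // Mutant: '>=' changed to '>'. When a == b, both branches yield the same
    // value, so no input can make the outputs differ: the mutant is equivalent.
    static int maxMutant(int a, int b) {
        return (a > b) ? a : b;
    }

    public static void main(String[] args) {
        // Even the boundary case a == b produces identical behavior
        System.out.println(maxOriginal(3, 3) == maxMutant(3, 3)); // true
    }
}
```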
In the case of a first-order mutant, the code is modified in one place. In the case of a second-order mutant, the code is modified in two places, and during execution both modifications are exercised. Offutt showed that the coupling effect holds for first- and second-order mutants: only a very small percentage of second-order mutants remained alive when all the first-order mutants had been killed. More precisely, the measured second-order mutant killing efficiency was between 99.94% and 99.99%. This means that a test set that kills all the first-order mutants will also kill 99.94-99.99% of the second-order mutants. Thus, it is enough to consider only the simple (first-order) mutants, which is much easier, and there are far fewer of them.
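As a single hand-made illustration (not Offutt’s experiment), the sketch below combines two first-order mutations into a second-order mutant; the test input that kills both first-order mutants also kills their combination, as the coupling effect predicts.

```java
public class SecondOrderMutantDemo {
    // Original: y = x + 1
    static int original(int x) { return x + 1; }

    // First-order mutants: one change each
    static int mutantA(int x) { return x - 1; }   // '+' changed to '-'
    static int mutantB(int x) { return x + 2; }   // constant 1 changed to 2

    // Second-order mutant: both changes combined
    static int mutantAB(int x) { return x - 2; }

    public static void main(String[] args) {
        int x = 5; // a single test input
        // The test kills mutantA (6 vs 4) and mutantB (6 vs 7) ...
        System.out.println(original(x) != mutantA(x));  // true
        System.out.println(original(x) != mutantB(x));  // true
        // ... and, as the coupling effect suggests, the second-order mutant too (6 vs 3)
        System.out.println(original(x) != mutantAB(x)); // true
    }
}
```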
The real advantage is that if we have a test design technique that kills the first-order mutants, then this technique kills the second- and higher-order mutants as well. In any case, we can assume that the bigger the difference between the correct and the incorrect code, the higher the chance of finding the bug.
Excellent, we have a very strong hypothesis: if we have a good test design technique that finds the first-order mutants as bugs, we can find almost all the bugs, and our software becomes very high quality. Consider the numbers again. Based on the comprehensive book by Jones and Bonsignour (2011), the number of potential source code bugs in 1000 lines of (Java, JavaScript, or similar) code is about 35-36 on average. Thus, a system with one million lines of code may contain 35,000-36,000 bugs. A test design technique that finds at least 99.94% of the bugs would miss only about 22 faults (0.06% of 36,000 ≈ 22) in this huge code base.
OK, we performed mutation testing and found that 95% of the mutants have been killed. What’s next? The first solution is to create the missing test cases. We take the original code and a non-killed mutant and try to find an input on which the mutant is killed. This is our new test case. Here the method is reversed compared with ordinary testing: we start from the faulty code (the mutant) and search for a test that reveals this bug. If we cannot find such a test even though we know the fault, then we can assume that a user without this knowledge will not detect the bug either. In other words, we consider this mutant equivalent even if we cannot prove it.
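A minimal sketch of this reversed approach, with hypothetical code: starting from a surviving mutant, we simply search the input domain for a value on which the two versions disagree, and that value becomes the new test case.

```java
public class ReverseTestSearchDemo {
    // Original code under test
    static boolean isAdult(int age) { return age >= 18; }

    // Non-killed mutant: '>=' changed to '>'. Existing tests (say age 10 and age 30)
    // do not distinguish the two versions, so the mutant survived.
    static boolean isAdultMutant(int age) { return age > 18; }

    public static void main(String[] args) {
        // Starting from the mutant, search for an input on which the outputs differ
        for (int age = 0; age <= 120; age++) {
            if (isAdult(age) != isAdultMutant(age)) {
                System.out.println("New test case found: age = " + age); // prints 18
                return;
            }
        }
        // If no such input exists in the searched domain, we treat the mutant
        // as (probably) equivalent, even though we cannot prove it.
        System.out.println("No distinguishing input found; mutant treated as equivalent");
    }
}
```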
If the tester hasn’t killed some mutants, those non-killed mutants point directly at the missing test cases. Knowing your missing tests, you can improve your test design skills.
Sometimes, finding the missing test cases for the non-killed mutants is too costly. Unfortunately, knowing that we killed, for example, 95% of the mutants doesn’t mean that our test efficiency is 95% (i.e., that we can find 95% of the bugs in the code); generated mutants are quite different from the mistakes developers actually make. However, historical data can help. If in the past our mutant-killing efficiency was 95% and the code quality was good enough for a given risk level, but now it is only 80% for the same risk, then our test case set is not good enough. In this case, we should add test cases based on the surviving mutants until we reach 95% again.
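As a sketch with made-up numbers, comparing the current mutation score against such a historical baseline could look like this:

```java
public class MutationScoreCheck {
    public static void main(String[] args) {
        // Hypothetical results from the current mutation testing run
        int totalMutants  = 400;
        int killedMutants = 320;

        double score = 100.0 * killedMutants / totalMutants;  // 80.0
        // Score that was sufficient for this risk level in past releases (assumed)
        double historicalBaseline = 95.0;

        System.out.printf("Mutation score: %.1f%%%n", score);
        if (score < historicalBaseline) {
            System.out.println("Below baseline: add tests for surviving mutants until "
                    + historicalBaseline + "% is reached.");
        }
    }
}
```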
We note that fault-based testing differs from defect-based testing, since the latter is a technique in which test cases are developed from what is known about a specific defect type (see the ISTQB glossary).
Pesticide paradox
Most testers know and believe in ‘the 7 principles of software testing’. One of them is the pesticide paradox. Originally, Beizer (1990) wrote: “Every method you use to prevent or find bugs leaves a residue of subtler bugs against which those methods are ineffectual.” Now this paradox is described as ‘If the same set of repetitive tests is conducted, the method will be useless for discovering new defects.’ or ‘if the same set of test cases are executed again and again over the period of time then these set of tests are not capable enough to identify new defects in the system.’ In simple words: tests wear out.
Let us assume that we have a good test case set that kills all the first-order mutants. This is not entirely perfect, as a small fraction of bugs (about 0.06%) can still escape. Assume that only one bug in the code cannot be found with our tests. Now assume that the specification remains unchanged but the code is refactored several times. Except for that one bug, every bug introduced during the code modifications will be detected. Now let us also assume that the test case set fails to kill exactly one mutant. The argument is the same: any faulty code modification will be detected, except for these two bugs.
Therefore, we can say that the error-revealing capacity of a test set is determined by its mutation-killing efficiency, which remains constant as long as the requirements are unchanged. This holds whether the mutation testing is actually performed or merely could have been. Therefore, if the cause of the new defects is faulty refactoring and the requirements remain unchanged, then a good test set remains as good for the modified code as it was for the original. On the other hand, when the requirements change, or new requirements are added, new or modified tests must be written. But that does not mean the original tests wear out; on the contrary, we use (some of) them in our regression test pack. The conclusion is that the pesticide paradox principle is not valid.
References
Beizer, B. (1990), Software Testing Techniques, 2nd Edition, ITP Media, ISBN-10: 1850328803
Jones, C. and Bonsignour, O. (2011), The Economics of Software Quality, Addison-Wesley, ISBN-10: 0132582201