In my last article, I introduced the first of the hypothesis generators – techniques you can use to generate problem-solving ideas if you get stuck.
The next hypothesis generator to consider is “Make it look like this”. You can use this one if you have two separate deployments of the same software system where one deployment works and one deployment doesn’t. The idea is to make incremental changes to the failing system to make it look like the working system.
In some ways, this is a troubleshooter’s dream scenario. Since you have one working system, you know that the reason for the problem must be a difference between the two systems. You can generate hypotheses by identifying all of the differences. Each difference is a potential reason for the problem, and you can test for this by eliminating the differences on the failing system one at a time.
You don’t need to focus only on the failing system. You can also make the working system look more like the failing system, to see if you can induce the failure. The process is the same – generate hypotheses by looking at the differences between the two systems.
What kinds of differences can you look for? The obvious ones would include differences in:
- Operating system
- Database vendor
- Deployment architecture (number of machines and their roles)
- Hardware architecture (CPUs, memory, disk, network)
- Configuration parameters
There can be less obvious differences such as the data in the system, the length of time the system has been in operation, and the workload applied to the system (including the specific use cases being executed as well as level of load). Every system has its own unique history, and that history can be an important difference to consider.
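If you can capture each deployment's settings as key/value pairs, listing the differences can be partly automated. The following is a minimal sketch in Python, assuming each system is described by a flat dictionary; the keys and values shown are hypothetical and only illustrate the idea. Differences in history, data, and workload will not show up in an inventory like this and still need to be considered by hand.

```python
def find_differences(working, failing):
    """Return {key: (working_value, failing_value)} for every setting that differs."""
    missing = object()  # sentinel for a key that exists on only one system
    differences = {}
    for key in sorted(working.keys() | failing.keys()):
        w = working.get(key, missing)
        f = failing.get(key, missing)
        if w != f:
            # Report a setting that is absent on one side as None
            differences[key] = (
                None if w is missing else w,
                None if f is missing else f,
            )
    return differences


# Hypothetical inventories, for illustration only
working = {"os": "Linux", "db_vendor": "PostgreSQL", "app_servers": 2, "pool_size": 50}
failing = {"os": "Linux", "db_vendor": "PostgreSQL", "app_servers": 4, "pool_size": 10,
           "cache": "on"}

for name, (w, f) in find_differences(working, failing).items():
    print(f"{name}: working={w!r} failing={f!r}")
```

Each entry in the result is one hypothesis to test.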
Let’s say you’ve identified all of the differences. There is one more potential complication – the problem might have been caused by an interaction between the differences. You might need to make two or more changes simultaneously before you see a change in system behavior. If you had dozens of differences, would you need to first look at each difference by itself, and then look at pairs of differences, and then groups of three and so on? Wouldn’t this get expensive?
Fortunately, there is a field of study that looks at just these questions: design of experiments. We can apply the sparsity of effects principle:
“When there are several variables, the system or process is likely to be driven primarily by some of the main effects and low order interactions.”
Douglas C. Montgomery, Design and Analysis of Experiments, p. 372.
If we apply this to a software system, that means it is unlikely that a combination of three or more differences is responsible for our problem. We can focus our energy on looking at individual differences, or at worst, a combination of two differences.
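To get a feel for how much work this saves, here is a rough count in Python. The figure of 11 differences is an arbitrary assumption for illustration.

```python
from math import comb

n = 11  # hypothetical number of differences between the two systems

singles = comb(n, 1)       # each difference on its own
pairs = comb(n, 2)         # every combination of two differences
triples = comb(n, 3)       # every combination of three differences
all_combinations = 2 ** n  # every possible mix of "working" and "failing" settings

print(singles, pairs, triples, all_combinations)  # 11 55 165 2048
```

Stopping at pairs means at most 66 candidate explanations (11 singles plus 55 pairs) instead of 2048 possible configurations.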
Let’s say we’ve identified dozens of differences, and let’s further assume that it is time-consuming to make changes to our system and run tests. You would normally make one change at a time, which could lead to a large number of runs if you are unlucky. Is this the most cost-effective approach?
No! If you vary all of the differences at once in a specific way, you can learn more about the effects of those differences in fewer runs. In experimental design terms, our differences are called factors, and we have two values (or levels) for each factor. Applying the sparsity of effects principle, we can create a fractional factorial experiment. You can see some examples of fractional factorial designs on this site. One example for a system with 11 factors requires only 16 runs (vs. the full combination of 2048 runs – 2 to the 11th).
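As a sketch of how a 16-run design for 11 factors can be built, the Python below starts from a full factorial in four base factors and derives the other seven factors as products of the base columns. The factor names and the particular generators are assumptions chosen for illustration, and they are not necessarily the ones used in any published design table. Here -1 could stand for the working system's setting and +1 for the failing system's setting.

```python
from itertools import product
from math import prod

BASE = ["A", "B", "C", "D"]  # four base factors, varied in all 2**4 = 16 combinations

# Assumed generators: each extra factor's level is the product of the listed base columns
GENERATORS = {
    "E": "ABC", "F": "BCD", "G": "ACD", "H": "ABD",
    "J": "ABCD", "K": "AB", "L": "AC",
}


def fractional_factorial():
    """Return 16 runs, each a dict mapping factor name to level (-1 or +1)."""
    runs = []
    for levels in product((-1, 1), repeat=len(BASE)):
        run = dict(zip(BASE, levels))
        for name, word in GENERATORS.items():
            run[name] = prod(run[b] for b in word)
        runs.append(run)
    return runs


design = fractional_factorial()
print(len(design), "runs for", len(design[0]), "factors")  # 16 runs for 11 factors
```

With this many factors packed into 16 runs, each main effect can be estimated separately from the other main effects, but it is mixed up (aliased) with some two-factor interactions, so a surprising result may need a few follow-up runs to untangle.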
On the other hand, if your test runs are cheap and changes are easy, there is no need to take on the extra complexity of a formal experimental design. Run through the differences one at a time, and only resort to pairs if you can’t isolate the difference that caused the problem. It is just a matter of time – all you need to do is “make it look like this”!
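The simple approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the configurations are flat dictionaries, and `passes` is a hypothetical function standing in for "apply this configuration and run the test", which in practice is the expensive step.

```python
from itertools import combinations


def isolate(failing, working, differing_keys, passes, max_order=2):
    """Copy working values into the failing config one key at a time, then in pairs.

    Returns the first group of keys whose change makes `passes` return True,
    or None if no single change or pair of changes is enough.
    """
    for order in range(1, max_order + 1):
        for group in combinations(differing_keys, order):
            candidate = dict(failing)
            for key in group:
                candidate[key] = working[key]
            if passes(candidate):
                return group
    return None


# Hypothetical configurations and test, for illustration only
working = {"pool_size": 50, "timeout_s": 30, "log_level": "info"}
failing = {"pool_size": 10, "timeout_s": 5, "log_level": "debug"}


def passes(config):
    # Stand-in for a real test run: this pretend system needs both settings fixed
    return config["pool_size"] >= 20 and config["timeout_s"] >= 30


print(isolate(failing, working, ["log_level", "pool_size", "timeout_s"], passes))
# ('pool_size', 'timeout_s')
```

In this made-up case no single change is enough, so the search falls through to pairs, as described above.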