When a web site goes down, the troublehackers tasked with bringing it back up are under tremendous pressure to act. When the pressure is high enough, you may have to take action before you really understand the problem. Sometimes these actions uncover new information that leads to a solution, but sometimes they just add to the chaos.
There is a better way. Instead of taking random action, we can apply the scientific method.
In brief, the scientific method is a way of acquiring knowledge that involves these principles:
- Careful observation
- Formulation of hypotheses to explain observations
- Testing of hypotheses through experimentation
- Refinement (or discarding!) of hypotheses based on experimental results
The scientific method has a built-in awareness of the effect of mental models on perception, and calls for rigorous skepticism in the interpretation of observations.
Science and troublehacking
It is straightforward to apply these principles to troublehacking.
We make observations of a software system by watching what happens in the user interface, or looking for error messages in log files. We might measure performance, either of the application or of the underlying operating system. We might even listen to user complaints!
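As a small concrete example, here is a sketch of one such observation: counting the error lines in a log file and keeping the most recent few for closer reading. The log path is a hypothetical placeholder, and real observations would of course also include performance metrics and the user reports themselves.

```python
# A crude observation: how many error lines are in the log, and what the
# most recent ones say. LOG_PATH is a hypothetical placeholder.
import collections

LOG_PATH = "/var/log/app.log"

def observe_errors(log_path, tail=5):
    """Count 'ERROR' lines and keep the last few for closer reading."""
    count, recent = 0, collections.deque(maxlen=tail)
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if "ERROR" in line:
                count += 1
                recent.append(line.rstrip())
    return count, list(recent)

if __name__ == "__main__":
    count, recent = observe_errors(LOG_PATH)
    print(f"{count} error lines; most recent:")
    for line in recent:
        print(" ", line)
```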
We can then make a guess about what is causing our trouble – in other words, we come up with a hypothesis. With our hypothesis in hand, we can define actions that will give us more data to support (or disprove) our hypothesis.
We then carry out those actions in a controlled way and make more observations. We might change a configuration parameter, or turn on additional logging. We might roll back a change, or ask customers to run through a particular set of actions. We might also try to reproduce the problem in a test environment.
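As one example of a controlled action, the sketch below raises log verbosity for a single suspect component rather than for the whole application, so any new observations can be traced back to that one change. The logger name payments.db is a hypothetical example.

```python
# One controlled action: more detail from a single suspect component,
# everything else left quiet. The logger name "payments.db" is hypothetical.
import logging

logging.basicConfig(
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    level=logging.WARNING,   # baseline: only warnings and errors
)

# Turn on DEBUG only where the current hypothesis points.
logging.getLogger("payments.db").setLevel(logging.DEBUG)
```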
The point is that we want our actions to result in observations that give us evidence that supports (or disproves) the hypothesis. If we disprove a hypothesis, or we see behaviors that don’t make sense in light of our mental models, then we generate a new hypothesis and start again.
As a consequence, our actions are never random. Each action tells us more about the problem and leads us to the action that will eventually resolve the problem. Even a dead end is useful, because that means we’ve eliminated something.
Example: applying the scientific method to software
Let’s say that users begin to report performance problems after upgrading to a new version of an application. To be appropriately skeptical, we should really decompose this into two observations: first, the application was recently upgraded; second, end users are reporting performance problems. After all, we don’t really know that the two observations are related. They might be, but if we build that into our mental model from the outset, it will influence our actions. That’s OK, but if we do that, let’s do it intentionally.
Well, why not? Let’s formulate an initial hypothesis: the upgrade has introduced a performance problem. We can make a prediction from this hypothesis: the prior version of the software did not suffer from this issue. And that leads to a test: roll back the upgrade and look at the performance again.
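The test itself is worth pinning down: a single spot check of performance is easy to misread, so one option is to sample the same endpoint repeatedly and record a summary that can be compared before and after the rollback. The endpoint URL and sample count below are arbitrary choices for illustration.

```python
# Sample one endpoint repeatedly so the post-rollback numbers can be compared
# with those gathered before it. The URL and sample count are illustrative.
import time
import urllib.request
from statistics import median

URL = "http://localhost:8080/api/orders"   # hypothetical endpoint under test

def sample_latency(url, n=20):
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        urllib.request.urlopen(url, timeout=10).read()
        samples.append(time.perf_counter() - start)
    return samples

if __name__ == "__main__":
    samples = sample_latency(URL)
    print(f"median latency: {median(samples) * 1000:.0f} ms over {len(samples)} requests")
```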
Let’s say the performance problem goes away after the rollback. We’ve proven the hypothesis! Well, wait a minute – are we being sufficiently skeptical, or is this confirmation bias? There wasn’t much context about the performance problems in the initial problem description, so we don’t really know how often they happened. We would need to observe the system for longer before we could draw a final conclusion. But if the hypothesis holds up, we have narrowed the problem considerably. Now we can look at the changes that went into the latest release and develop new hypotheses about which change might be the culprit. And the cycle continues.
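As a sketch of that narrowing step, assuming the release history lives in a git repository, we could list the commits between the two release tags so each one can become a hypothesis of its own. The tag names are hypothetical.

```python
# List the commits between two release tags; each one is a candidate
# explanation to investigate. The tag names are hypothetical.
import subprocess

OLD_TAG, NEW_TAG = "v2.3.0", "v2.4.0"

result = subprocess.run(
    ["git", "log", "--oneline", f"{OLD_TAG}..{NEW_TAG}"],
    capture_output=True, text=True, check=True,
)
for commit in result.stdout.splitlines():
    print(commit)
```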
What if we roll back the new version and the performance problem persists? We have disproven the hypothesis! We have narrowed the problem by eliminating the new software from suspicion. We now need a new hypothesis.
If we stay skeptical, we might still suspect that the upgrade is not entirely in the clear. When an upgrade involves downtime, a customer often decides to make other changes to their infrastructure. If we wanted to pursue that hypothesis, the next action would be to ask about other things that happened right around the time performance changed.
It might also be just a matter of coincidence that the performance problem occurred around the time of the upgrade. We might then speculate that there is a bottleneck somewhere in the system that is impacting performance. To pursue this hypothesis, we would collect performance and utilization metrics and look for anything that is maxed out.
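Here is a sketch of that bottleneck hunt, assuming the third-party psutil package is available: take a snapshot of utilization and flag anything near its ceiling. The 90 percent threshold is an arbitrary starting point, not a rule.

```python
# Snapshot utilization and flag anything near its ceiling. Assumes the
# third-party psutil package; the 90% threshold is an arbitrary starting point.
import psutil

def utilization_snapshot():
    return {
        "cpu %": psutil.cpu_percent(interval=1),
        "memory %": psutil.virtual_memory().percent,
        "swap %": psutil.swap_memory().percent,
        "disk / %": psutil.disk_usage("/").percent,
    }

if __name__ == "__main__":
    for name, value in utilization_snapshot().items():
        flag = "  <-- possible bottleneck" if value >= 90 else ""
        print(f"{name:>10}: {value:5.1f}{flag}")
```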
And the cycle would continue. Each action has a purpose, and you learn something from the result of every one. You treat each observation skeptically so that you don’t draw conclusions based on what you hope to see. And in the end, you’ll solve your problem with only the necessary amount of chaos!