Marty, you’re not thinking fourth-dimensionally!
Doc Brown, Back to the Future Part III
It is easy to underestimate the power of time. In fact, one of the big differences between a test system in the lab and a deployed system is time. As time passes, the number of people using a system can grow, along with the amount of persistent data managed by the system. And a system might go through several software upgrades.
There is variation over shorter time frames, too. For example, system load varies throughout the day, increasing as people start their day and decreasing as they go home. Or a website might suddenly get more traffic because of some kind of event (think Amazon.com on Prime Day!).
It helps to think of a system as always being in motion, and to remember that our observations are of points in time. Incorporating time into our mental model can remind us that a system has history, and history can sometimes provide key insight when trouble hacking.
Let’s look at an example.
Memory allocation in Java
I deal mostly with software written in Java. Although Java frees up memory automatically, it is still possible to run out. When this happens, you can have the Java Virtual Machine (JVM) generate memory and thread dumps at the time of the out-of-memory event.
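Exactly how those dumps are produced varies by JVM. As a rough sketch, a HotSpot-based JVM can be asked for a heap dump on an OutOfMemoryError with startup flags like the ones below; IBM's J9-based JVMs typically produce their own dump artifacts (such as a javacore and a heap dump) on an OutOfMemoryError out of the box, configurable through -Xdump. The path and jar name here are placeholders.

    # HotSpot: ask the JVM to write a heap dump when an OutOfMemoryError is thrown
    java -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/tmp/dumps \
         -jar myapp.jar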
These are tremendously useful diagnostic tools, but they capture the state of the system at a single point in time: the instant when the system ran out of memory. Now, you can usually get some information about the method that was trying to allocate memory when the exception was generated. But remember: the time dimension has been stripped away. The method that triggered the exception may have nothing to do with the root cause.
In other words, we know something about where the system ended up, but we don’t know much about how it got there.
For a problem like this, it can be helpful to take a longer view and look at memory usage over time. One option for Java is to gather verbose GC logs, and then use a tool like IBM's Garbage Collection and Memory Visualizer (GCMV) to get the historical view.
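Collecting those logs is a matter of JVM startup options. The option names vary by JVM vendor and version, so treat the lines below as a rough sketch rather than a recipe; gc.log and myapp.jar are placeholder names.

    # IBM J9 / OpenJ9: write verbose GC records to a file that GCMV can read
    java -Xverbosegclog:gc.log -jar myapp.jar

    # HotSpot (Java 9 and later): unified GC logging to a file
    java -Xlog:gc*:file=gc.log -jar myapp.jar

Loaded into GCMV, a log like this can be plotted as heap usage over time. We might see a pattern something like this: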
This longer view of memory usage gives us more context for the out-of-memory error. We can see that memory usage has been climbing steadily. There are a few drops in usage as the garbage collector gets aggressive, but the general trend is upwards.
That means that the out-of-memory error was not caused by a single request. Memory was leaking steadily the whole time, and the problem only became severe when the heap was nearly exhausted. That tells us that the activity in the system at the time of the failure may not be relevant to finding the root cause.
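To make that concrete, here is a hypothetical sketch of the kind of code that produces this pattern; the class and method names are invented for illustration. A long-lived collection that is never trimmed leaks a little on every request, and the OutOfMemoryError is eventually thrown from whatever innocent allocation happens to run once the heap is nearly exhausted.

    import java.util.ArrayList;
    import java.util.List;

    public class LeakyService {

        // The real problem: a long-lived collection that only ever grows.
        private static final List<byte[]> AUDIT_CACHE = new ArrayList<>();

        // Called on every request; each call pins another megabyte for the life of the JVM.
        static void handleRequest() {
            AUDIT_CACHE.add(new byte[1024 * 1024]);
        }

        // Unrelated work. Once the heap is nearly full, the OutOfMemoryError is
        // just as likely to surface here, so the stack trace in the dump points
        // at formatReport() rather than at the leak.
        static String formatReport() {
            return "report at " + System.currentTimeMillis();
        }

        public static void main(String[] args) {
            while (true) {
                handleRequest();
                formatReport();
            }
        }
    }

A heap dump from this program would show the memory being held by AUDIT_CACHE, but the stack trace attached to the OutOfMemoryError could name almost any method in the application. Only the historical view reveals the steady climb.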
All software systems have a history, and in order to think fourth-dimensionally, we have to remember that what we are seeing in the present is shaped by what happened in the past.