The principle of recent change

Recently, I’ve been writing about hypothesis generators – techniques you can use to generate problem-solving ideas if you get stuck.

The next hypothesis generator is the principle of recent change. You can use this one if you are looking at a system that worked at one time but is not working now. The principle of recent change states that the most likely reason a system has stopped working is that something has changed. So, you can generate hypotheses to test by identifying recent changes.

If you are developing software, you frequently see a problem appear between one build and another. Since you are almost certainly using a source code control system, you can find a list of recent code changes by looking at the change history in the source tree. You can base your hypotheses on which files have been updated between builds.
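With git, for example, the comparison is a one-liner. Here is a minimal sketch, assuming the last good build and the first bad build were tagged `build-41` and `build-42` (hypothetical tag names – substitute whatever your team uses). The setup block creates a throwaway demo repository so the sketch runs anywhere:

```shell
# --- demo setup: a throwaway repo with two tagged builds ---
# (hypothetical tags build-41 and build-42 stand in for your real build labels)
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "baseline"
git tag build-41
echo "retry_limit = 5" > config.ini
git add config.ini
git -c user.email=demo@example.com -c user.name=demo \
    commit -q -m "tune retry limit"
git tag build-42

# Commits made between the two builds:
git log --oneline build-41..build-42

# Files touched between the two builds -- your hypothesis candidates:
git diff --name-only build-41 build-42
```

Each file that `git diff --name-only` reports is a candidate hypothesis: "the problem was introduced by the change to this file."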

Something similar applies if the team administering a deployed system uses a change management process. For a managed system, a change request is created for each system change. So, you can identify recent changes by getting a list of change requests.

There is overhead associated with a change management process, so not all systems are formally managed. For unmanaged systems, it can be harder to identify recent changes – you’ll need to ask the system administrators. You may need to jog their memories by asking:

  • Has any part of the system been upgraded recently?
  • Has there been any change in the configuration parameters of the system?
  • Has the deployment architecture changed? Have any new components or integrations been activated? Has anything been retired?
  • If the system is hosted on a virtualization infrastructure, has the virtualization infrastructure changed?
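The system itself often remembers upgrades that the administrators have forgotten. As a minimal sketch, on a Debian-style host the package manager logs every upgrade to /var/log/dpkg.log (other distributions keep equivalent records, e.g. `rpm -qa --last`). The demo below greps a local copy of such a log so it runs anywhere; point it at the real file on an actual system:

```shell
# --- demo data standing in for /var/log/dpkg.log on a Debian-style host ---
cat > dpkg.log <<'EOF'
2024-03-01 10:00:00 upgrade openssl:amd64 3.0.11-1 3.0.13-1
2024-03-01 10:00:02 status installed openssl:amd64 3.0.13-1
2024-03-04 14:21:09 upgrade libpq5:amd64 15.5-0 15.6-0
EOF

# Packages upgraded recently, with old and new versions --
# concrete answers to "has any part of the system been upgraded?"
grep ' upgrade ' dpkg.log
```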

For a complex system deployed in a large organization, there are likely to be multiple levels of administration, so you won’t get a complete picture of recent changes by talking to just one person. Different IT teams are often responsible for the applications, the database servers, the virtualization infrastructure, and the network. If you don’t get any traction with one team, move on to another.

So far I’ve been focused on direct changes to the system – the things that a system administrator might do. And that’s a good place to start. But there is another potential source of change: the users of the system. As time passes, a system changes as people use it, and people change how they use a system. You can ask another set of questions to identify these sorts of changes:

  • Has the system load changed significantly? Have new users been onboarded?
  • Has the usage pattern changed? Are users taking advantage of new features? Have they changed the way that they work?
  • Has the amount of data in the system changed? Have there been migrations or large imports?

These questions are often beyond the scope of the system administrators. In a large organization, you’d need to find the business owners – the people who understand how the application is being used to deliver business value.

If an IT team has invested in a monitoring infrastructure, you can get data on usage patterns that you can analyze to see what’s changed. Tools like Dynatrace or New Relic can collect data from your application and display it in dashboards. You can generate alerts if key metrics exceed a threshold. Usage metrics from monitoring tools can be invaluable if the business owners are not completely familiar with what the user population is doing.
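Even without an APM tool, raw access logs can answer some of the usage questions above. Here is a minimal sketch, assuming a web server log in the common/combined format (the log path and format are assumptions – adjust for your stack). The demo writes a tiny sample log so the commands run anywhere:

```shell
# --- demo data standing in for something like /var/log/nginx/access.log ---
cat > access.log <<'EOF'
10.0.0.1 - - [01/Mar/2024:09:00:01 +0000] "GET /api/orders HTTP/1.1" 200 512
10.0.0.2 - - [01/Mar/2024:09:05:12 +0000] "GET /api/orders HTTP/1.1" 200 512
10.0.0.3 - - [02/Mar/2024:09:00:44 +0000] "GET /api/orders HTTP/1.1" 200 512
10.0.0.3 - - [02/Mar/2024:09:01:13 +0000] "GET /api/export HTTP/1.1" 200 9001
10.0.0.3 - - [02/Mar/2024:09:02:55 +0000] "GET /api/export HTTP/1.1" 200 9001
EOF

# Requests per day -- a jump suggests the load has changed:
awk -F'[][]' '{ split($2, t, ":"); print t[1] }' access.log | sort | uniq -c

# Requests per endpoint -- a new entry suggests the usage pattern has changed:
awk '{ print $7 }' access.log | sort | uniq -c | sort -rn
```

In this sample, the appearance of /api/export on the second day is exactly the kind of recent change in usage you’d want to turn into a hypothesis.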

If you can combine a change management process with a monitoring tool, you’ll be able to identify recent changes easily. And those recent changes are the prime suspects in your troublehacking investigations!