I was running a performance test recently, and I saw some odd behavior. My usual process is to restart the application, let it finish initializing, and then run a simulation for about an hour. That's usually long enough to collect a good number of samples and get a solid picture of how the various requests perform.
My one-hour test looked pretty good, so I decided to run the test overnight. When I came back in the morning, performance had become much worse. In fact, there was a fairly sharp increase in response times after about 3 hours. What happened?
I looked at the operating system metrics during the run and noticed something interesting. This was a Linux system, and the size of the Linux page cache was decreasing over time. The page cache was using almost 22G of RAM at the beginning of the test, but this had dropped to about 7G by the time performance degraded.
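If you want to watch this yourself, the page cache size is reported as the "Cached" line in /proc/meminfo. Here's a minimal sketch (not the monitoring I actually used) that samples it once a minute, so a shrinking cache shows up in a log alongside the load test:

```python
import time

def page_cache_kb():
    """Return the 'Cached' value from /proc/meminfo, in kilobytes."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Cached:"):
                return int(line.split()[1])  # /proc/meminfo reports this in kB
    return 0

while True:
    gb = page_cache_kb() / 1024 / 1024
    print(f"{time.strftime('%H:%M:%S')} page cache: {gb:.1f} GB")
    time.sleep(60)
```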
As it turns out, the application I was testing relies on memory-mapped files to access information stored locally on the server, and on Linux, memory-mapped files are backed by the page cache. But the page cache can shrink when other applications need memory. Since I had restarted my application at the beginning of the test, it started out with a small footprint and grew its memory usage over time. As my application grew, the page cache shrank.
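To make the mechanism concrete, here's a small sketch of a memory-mapped read (the file name data.bin is just a placeholder, not the application's actual data). The mmap call maps the file into the process's address space, and each access is served straight from the page cache if the page is resident; if it isn't, the kernel has to read it from disk first:

```python
import mmap

with open("data.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Fast only if these pages are already in the page cache;
        # otherwise this slice triggers disk I/O via a page fault.
        record = mm[4096:8192]
```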
When the size of the page cache dropped below a critical level, it became less effective. The memory-mapped files were no longer fully resident in memory, so more information had to come off disk, which made operations slow. As operations slowed down, a bottleneck formed on the server, which in turn produced the sharp increase in response times.
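One way to confirm this kind of degradation is to watch the major page fault count for the process, which the kernel exposes in /proc/&lt;pid&gt;/stat. A climbing count means mapped pages are being read back from disk. A minimal sketch, assuming a hypothetical process id:

```python
PID = 12345  # hypothetical: the application's process id

def major_faults(pid):
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # Skip past the parenthesized command name; majflt is the 10th
    # field after the closing ')'.
    fields = stat.rsplit(")", 1)[1].split()
    return int(fields[9])

print(f"major page faults so far: {major_faults(PID)}")
```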
So here’s a case where the future was not like the past. This system did not reach a steady state for several hours, so the results from the one-hour test were misleading. The server was in fact undersized for the workload: more memory was required to make performance sustainable over long periods.