Why you need to revisit decisions

Decisions get made, but sometimes they need to get unmade

May 26, 2023

Several years back I was working for a company whose system was having a lot of performance issues. Page loads in the app were exceptionally slow, and the background job queue, which handled thousands of tasks every minute, would occasionally get backed up.

Some Background

This all ran on a single MySQL database. No replication. But lots of caching to Redis.

So I spent a lot of time trying to understand what was going on. I eventually discovered that the people who set up the system were using Redis to do object-level caching everywhere, and the API was pulling records from the Redis cache rather than the MySQL database. That left me wondering how they were keeping the Redis cache from getting stale.

When I got my answer, I had to recheck myself several times, because I couldn’t believe what I was seeing.

The system was set up so that after every API call, each record in the response was fetched individually from the database by ID to refresh its cache entry. That is, an API call to list 100 records hit Redis and pulled 100 records, and then 100 tasks were added to the job queue to make 100 database reads for individual rows.
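To make the pattern concrete, here is a minimal sketch of it as I've described it. It is not the original code: plain dicts stand in for MySQL and Redis, a deque stands in for the job queue, and every name in it (list_records, refresh_record, and so on) is hypothetical.

```python
from collections import deque

# Hypothetical stand-ins: dicts play the roles of MySQL and Redis,
# and a deque plays the role of the background job queue.
database = {i: {"id": i, "name": f"record-{i}"} for i in range(1, 101)}
cache = {i: dict(row) for i, row in database.items()}  # one entry per record
job_queue = deque()
db_reads = 0


def refresh_record(record_id):
    """Background task: re-read a single row and overwrite its cache entry."""
    global db_reads
    db_reads += 1
    cache[record_id] = dict(database[record_id])


def list_records(record_ids):
    """API handler: serve from the cache, then queue one refresh per record."""
    records = [cache[record_id] for record_id in record_ids]
    for record_id in record_ids:
        job_queue.append(record_id)
    return records


def drain_queue():
    """Worker loop: run every queued refresh task."""
    while job_queue:
        refresh_record(job_queue.popleft())


records = list_records(list(range(1, 101)))
queued = len(job_queue)
drain_queue()
print(len(records), queued, db_reads)
```

One list call served entirely from the cache still ends up costing one database read per record it returned.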

However, there was another catch: if we got a page request whose query parameters hadn't been added to the cache yet, we'd need to call down to the DB anyhow, and then we'd still add X tasks to the queue to make X database calls.

We were literally beating our own system to death with excessive database calls, at roughly 100x our user load.
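The arithmetic behind that multiplier can be sketched in a few lines. This assumes a cache miss costs exactly one extra query for the page itself; the story only says we had to call down to the DB, so treat that count as an assumption, and the function name as hypothetical.

```python
def db_reads_for_list_call(records_returned, cache_hit):
    """Database reads caused by one list call under the refresh-per-record scheme."""
    # Every record in the response gets its own queued refresh read.
    refresh_reads = records_returned
    # Assumption: a miss on the query parameters costs one extra query up front.
    miss_reads = 0 if cache_hit else 1
    return miss_reads + refresh_reads


print(db_reads_for_list_call(100, cache_hit=True))
print(db_reads_for_list_call(100, cache_hit=False))
```

Under these assumptions, a request for 100 records costs 100 reads on a hit and 101 on a miss, so the cache never reduces database load below one read per record returned.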

The answer was simple: remove the object caching and replace it with page caching.
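For anyone who hasn't seen it, page caching in its simplest form looks something like the sketch below. This is an illustration and not what we shipped: the time-based expiry, the 60-second TTL, the dict standing in for Redis, and all of the names are assumptions.

```python
import time

PAGE_TTL_SECONDS = 60  # assumed expiry; pick whatever staleness you can tolerate
page_cache = {}        # hypothetical stand-in for Redis
db_queries = 0


def query_database(params):
    """Stand-in for the single list query that builds a whole page."""
    global db_queries
    db_queries += 1
    limit = params.get("limit", 100)
    return [{"id": i} for i in range(1, limit + 1)]


def cache_key(params):
    """Build a key that ignores the order of the query parameters."""
    return tuple(sorted(params.items()))


def get_page(params, now=None):
    """Return the cached page if it is still fresh, else rebuild it with one query."""
    now = time.monotonic() if now is None else now
    key = cache_key(params)
    entry = page_cache.get(key)
    if entry is not None and entry[0] > now:
        return entry[1]
    page = query_database(params)
    page_cache[key] = (now + PAGE_TTL_SECONDS, page)
    return page


first = get_page({"limit": 100, "sort": "name"}, now=0.0)
second = get_page({"sort": "name", "limit": 100}, now=30.0)  # same page, still fresh
third = get_page({"limit": 100, "sort": "name"}, now=61.0)   # expired, one new query
print(db_queries)
```

The whole response is stored under its query parameters, so a hit costs zero database reads and a miss or expiry costs one, no matter how many records are on the page.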

The hardest problem in this case wasn't solving the technical problem. The hardest problem became the highly rational engineers.

Why?

People like to assume that decisions made in the past were either one of two things: right or wrong. It's difficult to convince people that a right decision in the past has become wrong now. It's the "status quo bias" talking.

Additionally, incrementally correct decisions are assumed to lead to a net correct decision. That only holds in a static environment, which is not what most startups encounter, but it's VERY hard for humans to accept that incremental steps in a direction that used to be right could have added up to a net wrong decision.

Systems don't care. They'll gladly keep going in the direction you sent them.

So it's on you to observe and re-assess over time. Observability and the DORA metrics can help point you in the right direction, but they won't make the choices for you. That's still on you as an engineering org.

Returning to my story

It took about 3 months of prodding to convince engineering leadership and the other engineers that we should change this. Once we decided to solve the problem, the solution rolled out relatively quickly: it took about a week or so to incrementally rewrite the caching as traditional page caching, after which our performance problems disappeared.

Robert Roskam's Newsletter
