If you’ve ever spoken to me you know I love “dumb” questions. They tend to drive problem solving in the direction needed before we dive into any hardcore analysis of the issue using our educated guesses as guides. They don’t get asked often enough at the beginning but I guarantee they’ll get asked at the end.
So, here is a story from early in my career of just that.
The problem
Our compute cluster was not performing to the expected level given the size of the machines. The CPU and I/O stats pointed to a problem with our software. We traced it to delays querying the authentication system. That is, authentication was running way over capacity (~10k req/m) and had been for a while.
On one hand, this was a good problem to have. It meant our users were using the platform. On the other hand, it meant we had hit a bottleneck in a portion of our code base we never thought we’d hit.
The Solution
After running the system analysis the group of developers assigned to the case concluded even basic user sharding wouldn’t help. It required a total re-architecture to handle the forecast platform growth over the next 6 months. Any delay jeopardized the stability and growth trajectory of the offering. A new multi-threaded authentication server would be spun up backed by a MongoDB caching layer.
The Question Not Asked
How many users did we have and how many of them were active at any time on the platform?
That’s it. That was the question no one asked. If they had they would have heard 100 and 10.
How could 10 users produce a staggering 10k req/m?! Could it be that we had forgotten to pass user permissions every time we requested data? Could it be that every node on the platform was thus making hundreds of permission requests using the user credentials instead of caching and reusing the previously requested perms? The answer to both is “yes.”
The Aftermath
After the developers proudly showed off the new auth system to much fanfare a non-developer pointed out the absurdity of the situation by asking “why?” Why couldn’t the original platform handle more than 10 users at a time? How did we engineer something like that? Seemed silly.
Leadership pointed out the problem had been “fixed” and the compute clusters were finally operating much closer to their expected level. This was worthy of praise, so that’s what we did. We congratulated them for all the hard work they had put in.
Internally, we raised the same questions. A senior developer then changed a few lines of code to pass user permissions into the computational framework we’d built. The load on the auth system subsequently disappeared. We no longer needed the new system …but we didn’t need to tell anyone it was gone shortly after it had been created.
In Conclusion
Don’t be scared to ask dumb questions. Sometimes the most obvious questions should be asked. I didn’t ask because that wasn’t my part of the system. I had accepted their prognosis and assumed we must have gained a TON of users.
Now that I’m older and, I hope, wiser I make a point to ask.