With all of the shiny new graphing tools that exist – and all the pontificating on the visual display of quantitative information – sometimes we forget the power of the simplest form of structured information that lies at the heart of all of the slick infographics: the humble list.
In any real-world production environment, you're going to have incidents, issues, outages, and problems. The application will catastrophically fail to load a new configuration. A network device's NVRAM will fail and lose its brains. Your integration partners will suddenly send you nothing but malformed requests. There will be a fiber cut in Nevada. Your best sysadmin will fat-finger a command during a maintenance window and reboot the wrong set of boxes. When recovering from a production problem, it's tempting to drop everything and immediately address the problem at hand. Reacting to every single issue as it happens is one way to operate, but this method shows how to prioritize among the issues you've seen before, so you can give the most attention to the problems that cause you the most pain.
A detailed root cause analysis is helpful for establishing the timeline and contributing factors, but most production problems trace back to a small set of causes:

- redundancy or capacity was insufficient
- configuration was not validated prior to activation
- software failed to handle a normal case correctly
- multiple software components did not interoperate as intended
- a service provider you depend on had a problem that impacted your service
- a procedure did not exist or was not followed correctly by a human

You can make a list of those categories and keep track of how many times each one happens.
Start by creating your set of categories and cataloging each production problem under one of them. The set doesn't have to be perfect; iterate on it and recategorize your historical tickets over time as you refine your problem categories.
| Problem | Category |
| --- | --- |
| Bob forgot to pull the latest config before restarting Apache | Procedure Failure |
| Unexplained packet loss to NY site | Provider Problem |
| App crashes when username starts with @ | Software Failure |
| Backend API endpoint throttles on large profile pages; users see 500s | Interop Problem |
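The catalog above can live in something as simple as a list of tuples, tallied with a few lines of Python. Here's a minimal sketch; the incident entries are made up for illustration and would come from your ticketing system in practice:

```python
from collections import Counter

# Hypothetical incident log: (description, root-cause category) pairs,
# using the fixed category set described above.
incidents = [
    ("Bob forgot to pull the latest config before restarting Apache", "Procedure Failure"),
    ("Unexplained packet loss to NY site", "Provider Problem"),
    ("App crashes when username starts with @", "Software Failure"),
    ("Backend API endpoint throttles on large profile pages", "Interop Problem"),
]

# Tally incidents per root-cause category, most frequent first.
by_category = Counter(category for _, category in incidents)

for category, count in by_category.most_common():
    print(f"{category}: {count}")
```

With real data accumulated over weeks, the `most_common()` ordering is the prioritized list you're after.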
After you collect enough of these, trends in the root cause will start to emerge, and after a couple of weeks you can tally the incidents in each category.

This gives you some interesting data. Since you've done a root cause analysis for all 45 of your problems in the past time period (a week? Hope not! A month? Quarter? Year?), you get the idea that you should probably be better about testing your software components together. Maybe it's time to look at the staging environment and see whether it is similar enough to production to keep your battery of integration tests up to date.
But what if you have multiple software projects, environments, or teams contributing to your production environment? You'll need to figure out which component is behind each problem. Again, lists to the rescue: borrowing a concept from SQL (GROUP BY multiple columns), we can track the component at fault alongside the root cause.
| Problem | Category | Component |
| --- | --- | --- |
| Bob forgot to pull the latest config before restarting Apache | Procedure Failure | Apache |
| Unexplained packet loss to NY site | Provider Problem | National ISP |
| App crashes when username starts with @ | Software Failure | iOS app |
| App shows 500s when API endpoint times out loading large profile pages | Interop Problem | Back-end app |
After you have this list, it’s a simple query to create a list grouped by component and problem to tell you where you should be focusing your energy to solve issues. You might see something like this:
| Component | Category | Count |
| --- | --- | --- |
| iOS app | Interop Problem | 5 |
| Back-end app | Interop Problem | 4 |
| National ISP | Provider Problem | 4 |
| Apache | Procedure Failure | 4 |
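The GROUP BY over two columns translates directly into a tally keyed on a (component, cause) pair. A minimal sketch, again with made-up incidents rather than real data:

```python
from collections import Counter

# Hypothetical incident log: (description, root-cause category, component).
incidents = [
    ("Config not pulled before Apache restart", "Procedure Failure", "Apache"),
    ("Packet loss to NY site", "Provider Problem", "National ISP"),
    ("Crash on usernames starting with @", "Software Failure", "iOS app"),
    ("500s when the API times out on large profiles", "Interop Problem", "iOS app"),
    ("Feed fails to parse a truncated API response", "Interop Problem", "iOS app"),
    ("API returns 500 on an oversized upload from the app", "Interop Problem", "Back-end app"),
]

# GROUP BY (component, cause): count incidents per pair.
grouped = Counter((component, cause) for _, cause, component in incidents)

# Worst offenders first, like the grouped table above.
for (component, cause), count in grouped.most_common(3):
    print(f"{component} | {cause} | {count}")
```

The same result could come from an actual `SELECT component, cause, COUNT(*) ... GROUP BY component, cause` if your tickets already live in a database.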
Organized this way, the list shows that the iOS app and the back-end aren't playing nicely together, and that's your biggest problem; it also suggests it would be worth getting a new ISP and putting some change control around web server configuration changes. This is just a mock-up, but the table above accounts for 17 of the 45 issues presented earlier, meaning that if you focus on doing this handful of things well, you'll keep your biggest problems from coming back. Using this simple grouped-list method, you'll save time by resolving the core issues first instead of spinning around trying to solve every new issue as it comes up. Good luck!
Posted by Andrew Dibble, group manager, production engineering.