Debugging production systems is a critical business process. Modern civilization depends completely on software. The most fundamental infrastructure that makes up the fabric of our lives runs on it, from electrical power to clean drinking water to systems that monitor the very air we breathe. All the software managing these basic necessities of life must run at all times: 24/7/365. Even for “non-critical” software such as commercial websites, defects can cause severe damage in lost sales, lost customers, and lost reputation.
Some estimates put the cost of downtime at $5,600 per minute, so the process of creating software includes exhaustive testing to prevent production bugs. But for all the safeguards you may put in place, production software is still defective. It’s not a matter of IF, but rather WHEN you will discover a defect. Some defects are small and can be fixed behind the scenes without anyone noticing, but others are big enough to crash company stocks and knock spaceships out of orbit.
So I’ll say it again: debugging production systems is critical. Once you discover a bug in production, you need to fix it before it impacts your business, and if your customers are already feeling it, the urgency is even greater. In this post, I will touch on some of the tools and methods currently used in production debugging and why they are insufficient. Then, I will describe the fundamental pillars that a true and effective production debugger stands on.
Why is production debugging so hard?
Code that is still in development is under the complete control of the developer. The environment is known; the scenario is clear; the developer can put breakpoints anywhere in the code to stop and examine its state and make inline changes to see how they affect the outcome. Debugging production systems is quite different.
- The developer usually can’t recreate the exact environment or scale in which the error occurred.
- Reproducing the error can range from difficult to nearly impossible.
- Often, an error is manifested at one location in the code, but the root cause is somewhere completely different. It can be in a different module, in a specific instance of a microservice, or even in serverless code that triggers an error and then disappears.
- The information relevant to the error is distributed among a set of multiple disparate sources such as log files, the call stack, event queues, database queries, local variables, and more.
- The developer usually can’t just put breakpoints in the code and step through the error scenario, since that would stop service to the end users (assuming the nightmare of a downed system has not already materialized).
An effective production debugger needs to overcome these challenges.
Production debugging wannabes
There are several product categories within the production debugging neighborhood. While they all do something to help debug production systems, none are sufficient to effectively determine root cause to enable a complete and final fix.
Memory dumps are one of the oldest methods used to debug production systems: get a memory dump of the system when an error occurs and try to decipher it. The problem with memory dumps is that, at best, they are cryptic and hard to decipher, and at worst, they provide a snapshot of the computer’s memory at only one point in the flow of execution, while the root cause of the error may be somewhere completely different.
Log analyzers have also been around for a while now. While they are great at helping to make sense of log files, their usefulness in production debugging is limited. To expect a developer to understand what happened after taking a first look at the logs presupposes that the developer knew beforehand where the error was going to occur. And of course, errors are accidents; we never know when and where they will happen. So usually, the developer has to guess what really happened, add log entries to verify the guess, and try to reproduce the error over several such iterations. This can be a long and arduous process that may result in only a partial fix of the error.
Application Performance Monitors and Exception Monitors
Application Performance Monitors (APMs) have been around for about the last ten years. They do a great job of identifying bottlenecks in resource usage that impact application performance (especially around networking and databases), as their name indicates. But they do more. They provide alerts, and some can even home in on exceptions and show you the call stack when one is thrown. But that’s pretty much the depth of information you’ll get, and it’s a snapshot of a very specific moment in time. It certainly helps but is still not the best solution to find the root cause of a problem. Exception monitors are a subset of APMs in that they can provide information about exceptions that occur. They fall short of effective production debugging because, like APMs, they only provide a snapshot of your application at the time of the exception and don’t enable code-level debugging.
The pillars of production debugging
As I see it, the enormous potential cost of production bugs leads to the primary goal of an effective production debugger which has three parts to it:
Fix the bug…at its root…as quickly as possible.
Let’s look at those three parts.
Fix the bug: Well, kind of obvious. However the bug was manifested, you don’t want that to happen anymore.
at its root: A bug can manifest itself in different ways at different times. Fixing just one manifestation of the bug is like taking a pill to alleviate a recurring ailment rather than curing the underlying cause. You’ll find yourself grappling with the same bug time after time as it manifests itself in different ways. You need to be able to track any manifestation of the bug back in the code execution flow to find its root cause. Once you address the root cause of a bug, you’ll know it’s truly and finally fixed.
as quickly as possible: As I’ve already mentioned, time is of the essence with production debugging. Either the bug you’ve detected hasn’t impacted your customers yet, so you want to fix it before it does, or worse, it’s already impacting your customers, and every minute is costing you dearly. An effective production debugger meets this goal by aggregating the functionality of those “wannabes” I mentioned and then adding some capabilities that none of them have.
Autonomous exception capture
You can’t fix what you don’t know about. A production debugger should be a sentry that operates independently, is constantly on guard to catch any exception your software throws, and notifies you about it immediately. This is your starting point; you need to know there’s something to debug, whether it’s already impacting your users or not. Just from the alert, you can already gain some insights. The number of times an exception occurs can provide some indication of the severity of a bug and its user impact. But that’s not enough. Knowing about an exception doesn’t help to debug it. The production debugger needs to capture the whole context in which the exception happened, including the environment and the relevant code, so the developer doesn’t have to work hard to try and reproduce the error. Capturing the error along with all of its associated information (log files, call stack, event timelines, network requests, database queries, etc.) is effectively a perfect bug report encompassing everything the developer needs to fix the bug.
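To make the idea of capturing an exception together with its context more concrete, here is a minimal Python sketch. It is only an illustration, not how any particular production debugger works: the names `build_report`, `capture_hook`, `install_capture_hook`, and the in-memory `captured_reports` list are all hypothetical, and a real tool would also collect logs, network requests, and database queries.

```python
import sys
import traceback
from datetime import datetime, timezone

# Hypothetical in-memory sink; a real tool would send reports to a backend.
captured_reports = []


def safe_repr(value):
    """repr() that never raises, so capturing cannot itself crash the app."""
    try:
        return repr(value)
    except Exception:
        return "<unrepresentable>"


def build_report(exc_type, exc_value, tb):
    """Collect the call stack and every frame's local variables."""
    frames = []
    for frame, lineno in traceback.walk_tb(tb):
        frames.append({
            "function": frame.f_code.co_name,
            "file": frame.f_code.co_filename,
            "line": lineno,
            "locals": {k: safe_repr(v) for k, v in frame.f_locals.items()},
        })
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "type": exc_type.__name__,
        "message": str(exc_value),
        "frames": frames,  # outermost caller first, failing frame last
    }


def capture_hook(exc_type, exc_value, tb):
    """Record the report, then fall through to Python's default handler."""
    captured_reports.append(build_report(exc_type, exc_value, tb))
    sys.__excepthook__(exc_type, exc_value, tb)


def install_capture_hook():
    """Call once at startup to catch uncaught exceptions on the main thread."""
    sys.excepthook = capture_hook
```

Calling `install_capture_hook()` once at startup is enough for this sketch to record every uncaught exception on the main thread. Exceptions that are caught and handled, or raised in other threads, would need additional hooks.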
Observability into your code
To really determine the root cause of an error, you need to be able to trace and view your application’s state for every line of code that was executed, from where the error manifested itself as an exception back through the complete chain of events to the original trigger. At the most basic level, that means local variables, method parameters, and return values across the whole call stack. But that’s only half the story. In today’s world of distributed computing with asynchronous event queues, microservices, and serverless code, the trigger of the error may happen in a different thread, module, or service than where the corresponding exception is thrown. So the immediate call stack of the service that manifests the error is not enough. You need to be able to trace the sequence of events back across the different microservices, network requests, database queries, etc., that participated in the error from trigger to manifestation. With this kind of radical observability, you should be able to see the exact place and time in your code where something went wrong.
And then, there are the log files. A typical application generates gigabytes, if not terabytes, of log data that is stored on your servers. While this is a wealth of information, it also presents challenges. In today’s typical applications, the logs are as distributed as the application is. The different modules and microservices generate separate log files that need to be pieced together, like a jigsaw puzzle, to get an idea of what happened in the code execution flow of the error in question. And, no matter how hard your developers try, they can never anticipate where an error will occur, so the log entries you start off with are never enough. You’ll always need to do an initial analysis, guess where the real problem is, and add more logs to verify your guess. Then you rebuild, redeploy, and look at the logs again. If you were right, great, but often, you’ll need several such iterations.
While log analyzers can do much to help you understand the content of your log files, they don’t do enough to point you to the relevant log entries and do a root cause analysis.
The role of an effective production debugger is to extract the logs relevant to the code execution flow of an error. Once you have only the relevant logs aggregated into one view as you step through the code, you’re headed toward your root cause.
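One common way to pull the relevant entries out of distributed logs is to tag every entry with a shared request or trace ID and then filter and merge by it. The Python sketch below assumes exactly that. The log line format, the service names, the `trace=` field, and the sample entries are all made up for illustration; real logs will look different.

```python
from datetime import datetime

# Made-up sample logs from two hypothetical services.
# Assumed format: "<ISO timestamp> <service> trace=<id> <message>"
SAMPLE_LOGS = {
    "orders": [
        "2024-05-01T10:00:00.100 orders trace=abc123 received order 17",
        "2024-05-01T10:00:00.200 orders trace=zzz999 received order 18",
        "2024-05-01T10:00:00.450 orders trace=abc123 payment call failed",
    ],
    "payments": [
        "2024-05-01T10:00:00.300 payments trace=abc123 card token missing",
        "2024-05-01T10:00:00.350 payments trace=zzz999 charge accepted",
    ],
}


def parse_line(line):
    """Split one log line in the assumed format into its fields."""
    timestamp, service, trace_field, message = line.split(" ", 3)
    return {
        "time": datetime.fromisoformat(timestamp),
        "service": service,
        "trace_id": trace_field.split("=", 1)[1],
        "message": message,
    }


def relevant_logs(log_sources, trace_id):
    """Merge all sources, keep one trace's entries, and order them by time."""
    entries = [
        parse_line(line)
        for lines in log_sources.values()
        for line in lines
    ]
    matching = [entry for entry in entries if entry["trace_id"] == trace_id]
    return sorted(matching, key=lambda entry: entry["time"])


if __name__ == "__main__":
    for entry in relevant_logs(SAMPLE_LOGS, "abc123"):
        print(entry["time"].time(), entry["service"], entry["message"])
```

With the sample data, the view for trace `abc123` shows the order arriving, the payments service reporting a missing card token, and only then the failure in the orders service, which is the order you would want to read them in when hunting for the root cause.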
A snapshot of your code showing the values of variables at the time an exception was thrown is helpful but is not a complete picture of what happened along the way. It’s a bit like finding a relic in a cave and then trying to figure out how it got there. A variable may be invalid at one point in time, but out of scope by the time the exception is thrown. To really understand what led to that exception, you need to be able to track back from the exception, through every line of code that was executed (in any module or service that was involved), and view the values of all the variables, method parameters, and return values at every step of the way. It’s like being able to visualize that relic, see the cavemen pick it up and walk backward out of the cave, watch them migrate backward to a different land, unpack a few belongings from their animal skin, and then sit down at their fireplace. That’s what I call debugging with true time-travel fidelity.
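As a toy illustration of stepping backward through recorded state, the Python sketch below uses the standard `sys.settrace` hook to save the local variables before every executed line, so the history can be read in reverse after a failure. This is a sketch within a single process and almost certainly far too slow for real production use; the `record_execution` helper and the two sample functions are hypothetical.

```python
import sys


def record_execution(func, *args):
    """Run func(*args), recording (function, line, locals) before each line."""
    history = []

    def tracer(frame, event, arg):
        if event == "line":
            # Copy the locals so later changes don't rewrite history.
            history.append(
                (frame.f_code.co_name, frame.f_lineno, dict(frame.f_locals))
            )
        return tracer  # keep tracing inside nested calls

    previous = sys.gettrace()
    sys.settrace(tracer)
    error = None
    try:
        func(*args)
    except Exception as exc:
        error = exc
    finally:
        sys.settrace(previous)
    return history, error


# Hypothetical buggy code: blank input silently becomes a quantity of 0.
def parse_quantity(text):
    cleaned = text.strip()
    quantity = int(cleaned or 0)
    return quantity


def unit_price(total, text):
    quantity = parse_quantity(text)
    return total / quantity


if __name__ == "__main__":
    history, error = record_execution(unit_price, 100, "  ")
    print("error:", repr(error))
    # Walk backward from the failure toward the trigger.
    for name, line, local_vars in reversed(history):
        print(name, line, local_vars)
```

Reading the history in reverse starts at the division in `unit_price`, where `quantity` is already 0, and leads back into `parse_quantity`, where the blank input was turned into 0. The exception is thrown in one function, but the trigger sits in another.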
The widespread adoption of DevOps has brought great benefits to software development and significantly reduced development cycle times. You want your production debugger to fit into the rhythm of your DevOps processes and help maintain those gains.
Three DevOps pillars are automation, collaboration, and CI/CD. Autonomous exception capture sits comfortably with the notion of automation. Once your production debugger is installed, it automatically catches exceptions and notifies you immediately. But what about collaboration? An effective production debugger should promote collaboration between team members, helping them work together towards fixing a bug. That kind of collaboration involves focused communication that easily points team members to something significant in the debugging process. And finally, your production debugger should fit into your CI/CD process. Since your production debugger knows how many times an exception is thrown, this count can serve as a quality gate through which builds must pass before being promoted from one level of your DevOps pipeline to the next.
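A quality gate based on exception counts could look something like the Python sketch below. It assumes you can export the exception types seen for a candidate build and for a baseline build as simple lists. The `gate_decision` function, its `allowed_growth` parameter, and the rule that any new exception type fails the gate are illustrative choices, not a standard.

```python
from collections import Counter


def gate_decision(candidate_exceptions, baseline_exceptions, allowed_growth=1.0):
    """Compare exception counts of a candidate build against a baseline.

    Returns (passed, reasons). The gate fails if the candidate throws an
    exception type the baseline never threw, or if a known type occurs
    more than allowed_growth times as often as in the baseline.
    """
    candidate = Counter(candidate_exceptions)
    baseline = Counter(baseline_exceptions)
    reasons = []
    for exc_type, count in sorted(candidate.items()):
        previous = baseline.get(exc_type, 0)
        if previous == 0:
            reasons.append(f"new exception type: {exc_type} ({count}x)")
        elif count > previous * allowed_growth:
            reasons.append(f"{exc_type} rose from {previous} to {count}")
    return (not reasons, reasons)


if __name__ == "__main__":
    passed, reasons = gate_decision(
        candidate_exceptions=["TimeoutError", "KeyError"],
        baseline_exceptions=["TimeoutError", "TimeoutError"],
    )
    print("promote build" if passed else "block build", reasons)
```

In a pipeline, a step like this would typically exit with a non-zero status when the gate fails, so the build is not promoted to the next stage.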
Pillars are what your production debugging will stand on
The production debugging neighborhood is slowly being populated with a variety of tools. But, while debugging production takes up a significant portion of developer time, most debugging tools do not provide all four pillars of an effective production debugger. These tools do have a place in the debugging neighborhood and can play nicely together, providing a wealth of useful information. Still, if you want to fix bugs at their roots as quickly as possible, there’s no substitute for autonomous exception capture, observability, time-travel fidelity, and DevOps integration.