There are many articles floating around in cyberspace that quote the Wikipedia definition of observability. If we apply this term, borrowed from control theory, to the realm of software, observability refers to our ability to detect behavior we are not happy with in a production system and track down its cause. Naturally, observability is important in QA and staging environments too, but it becomes a critical business need when bad things start happening in Production. The need to understand what’s happening under the hood is not new, and over the years, industry giants like Google and Facebook established what are now known as the three pillars of observability: logs, metrics, and traces.
Not quite pillars
Observability tools such as Application Performance Monitors, error monitors, and log analyzers have become part of the standard stack that most enterprises use to obtain the pillars of observability they need to watch over their production systems. But while the vendors of these tools tout their problem-solving capabilities, when a code-level error is detected in Production, SREs and DevOps practitioners often find themselves scrambling for a solution. They end up collecting logs, metrics, and traces, and call on developers to fix the error. But to fix a code-level error, developers need code-level production data, which they usually can’t access. There is a DevOps/development chasm that must be bridged between the error detected in Production and the developers who have to fix it.

There’s no lack of reports of websites going down, and there are voices challenging the effectiveness of observability tools that have become industry standards. The discussion can get very technical, going into cardinality and sampling rates for metrics, storage costs of log files, and metadata attached to traces. The validity of these arguments begins to erode the previously perceived stalwartness of those three pillars. Now, nobody is saying that there’s no value in logs, metrics, and traces, but rather than considering them pillars, we should perhaps consider them as supporting observability.
What is Code-Level Observability for Debugging?
Drawing on that Wikipedia definition for observability, let’s take a crack at defining software observability for debugging.
Observability for debugging is a measure of how well the internal error states of a software system can be inferred from its external outputs.
Translated into plain English: how well you can determine the root cause of an error from the way it manifests.
Since an error is something the developers of the system did not anticipate (otherwise, of course, they would have written the code to avoid it in the first place), it often manifests as an unhandled exception. Many of the tools that support observability claim they can help determine root cause. One may argue that they do help in solving certain problems, but production debugging is a different matter. None of these tools really enables the code-level analysis needed to determine the root cause of production bugs. They may do what they do very well, but to debug production systems, you need to go a bit deeper. So, what are the pillars of software observability for debugging?
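To make that concrete, here is a minimal sketch (all function and key names are illustrative, not from any specific system) of why an unhandled exception is a poor pointer to root cause: the exception surfaces frames away from the line that actually went wrong.

```python
def load_discount(user_settings):
    # Root cause: a missing key silently becomes None here.
    return user_settings.get("discount")

def apply_discount(price, user_settings):
    discount = load_discount(user_settings)
    # The failure only manifests here, two calls away from the real bug:
    return price * (1 - discount)  # TypeError if discount is None

try:
    apply_discount(100.0, {})  # settings lack the "discount" key
except TypeError as exc:
    # This is all a production error monitor typically sees.
    print(f"unhandled in production as: {exc!r}")
```

The stack trace points at the arithmetic, but the bug is the silent `None` upstream; without visibility into variable values, the trace alone does not tell you that.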
- Time-travel debug information
- Relevant logs
- Error execution flow
Time-travel debug information
While aggregative metrics may be useful for identifying performance bottlenecks or problems with scalability, they don’t necessarily point you to a problem in the code that caused an error under very particular circumstances. In fact, aggregative metrics may skip over a problematic scenario entirely if the sampling rate is not high enough. That is inadequate for code-level debugging. What you need is the time-travel debug information you get when stepping through the code, line by line, with visibility into the values of every variable, method parameter, and return value across the whole call stack. It’s this code-level visibility, provided through time travel, that enables an effective root cause analysis, so you really understand what went wrong.
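A minimal sketch of the idea, assuming nothing beyond the Python standard library: when a call fails, walk the traceback and snapshot every frame’s local variables, then attach that snapshot to the exception. A real time-travel debugger captures far more (every step, not just the failure point); this only illustrates the kind of per-frame state involved. All function names are illustrative.

```python
def snapshot_on_error(func, *args):
    """Run func; on failure, attach a per-frame snapshot of local variables."""
    try:
        return func(*args)
    except Exception as exc:
        frames = []
        tb = exc.__traceback__
        while tb is not None:
            frame = tb.tb_frame
            frames.append({
                "function": frame.f_code.co_name,
                "line": tb.tb_lineno,
                "locals": dict(frame.f_locals),  # variable values in this frame
            })
            tb = tb.tb_next
        exc.debug_frames = frames  # carry the snapshot with the exception
        raise

def parse_price(raw):
    cleaned = raw.strip("$")
    return float(cleaned)  # raises ValueError on input like "N/A"

def totals(items):
    return sum(parse_price(i) for i in items)
```

Calling `snapshot_on_error(totals, ["$10", "N/A"])` re-raises the `ValueError`, but `exc.debug_frames` now shows, among others, the `parse_price` frame with `cleaned == "N/A"` — the code-level state that explains the failure.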
Relevant logs
… with an emphasis on the word “relevant.” Applications typically generate volumes and volumes of logs. Even if you have an APM or log analyzer to aggregate them and create colorful reports, sifting the relevant log entries from the clutter can be extremely challenging. In the context of software observability for debugging, an effective production debugger will do all the sifting for you and aggregate only those log entries relevant to the error. By itself, this is not usually enough for a root cause analysis, because the nature of errors is that you never know where they are going to occur, so usually you have to add logs, redeploy, and reproduce the error to get more insights. Nevertheless, getting the logs relevant to the error takes you leaps and bounds towards a resolution.
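The core of that sifting can be sketched in a few lines: tag every entry with a request ID, and when a request fails, pull out only that request’s entries from the interleaved stream. The field names and in-memory list are illustrative stand-ins for a real log pipeline.

```python
LOG = []  # stand-in for a log aggregator's store

def log(request_id, level, message):
    LOG.append({"request_id": request_id, "level": level, "msg": message})

def logs_for_request(request_id):
    # Everything this request logged, in order -- and nothing else.
    return [e for e in LOG if e["request_id"] == request_id]

# Two interleaved requests; only one fails.
log("req-1", "INFO", "checkout started")
log("req-2", "INFO", "checkout started")
log("req-1", "ERROR", "payment gateway timeout")
```

Here `logs_for_request("req-1")` returns just the two entries for the failing request, while the noise from `req-2` never reaches the person debugging.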
Error execution flow
You might compare this to traces that you get on APMs, but it’s much, much more. One of the issues with tracing is the overhead it generates during collection. That could be mitigated by sampling specific traces, but then, you might miss the trace relevant to the error you’re investigating. Getting the exact trace relevant to that error is like using sampling to reduce overhead but knowing exactly which traces to sample. The error execution flow transcends microservices, serverless code, event queues, and any other fork in the road that your code may encounter. You are able to step through the code, line by line, from the interaction trigger, through the root cause of the error, to the exact place that threw the exception. Combine that with the time-travel debug information and relevant logs, and you have yourself a recipe for success (rapid resolution of the bug).
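A toy sketch of how such a flow can be stitched across service boundaries: each hop records its step under a trace ID shared by the whole interaction, so when the last service throws, the full path from trigger to exception can be replayed in order. The service names and in-memory store are illustrative; real systems propagate the ID over the wire.

```python
import uuid

TRACE_STORE = {}  # trace_id -> ordered list of (service, step)

def record(trace_id, service, step):
    TRACE_STORE.setdefault(trace_id, []).append((service, step))

def service_c(trace_id, payload):
    record(trace_id, "service_c", "charge_card")
    raise RuntimeError("card declined")  # the exception surfaces here

def service_b(trace_id, payload):
    record(trace_id, "service_b", "reserve_stock")
    service_c(trace_id, payload)

def service_a(payload):
    trace_id = str(uuid.uuid4())  # one ID for the whole flow
    record(trace_id, "service_a", "checkout")
    try:
        service_b(trace_id, payload)
    except RuntimeError:
        # The full flow, from the triggering interaction to the throwing
        # line, reconstructed in order:
        return trace_id, TRACE_STORE[trace_id]
```

Calling `service_a({})` yields the ordered path `service_a → service_b → service_c`, ending at the step that threw, which is exactly the flow you want to step through.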
Code-level observability is radical
I call the code-level software observability provided by time-travel debug information, together with relevant logs and error execution flow “radical” because it gives you visibility to unprecedented depths in your code. At any step of the error execution flow, you can drill down to view any variable, across the whole call stack, through microservices, etc. There is no deeper observability available today, and that’s why it is one of the pillars of production debugging. Only this depth of visibility into your code can point out the most esoteric of error states that could easily be missed by other tools. This is what you need to fix bugs, not colorful charts.
Time-travel debug information
View the true state of your code anywhere in the execution flow of an error.

Relevant logs
Don’t sink under a sea of log files. Only analyze log entries relevant to the error.

Error execution flow
Trace an error from an exception back to the root cause with full visibility into your code.