Observability platforms play a vital role in an enterprise’s tool stack, providing DevOps/SREs and Production Support staff with a system-level view of the health and performance of their applications and infrastructure. By alerting the right DevOps and engineering staff to performance bottlenecks and live-site incidents, observability platforms help keep the company’s systems running smoothly and maintain business continuity. However, the observability that these platforms provide is primarily at a system level. When a software error surfaces in Production, the people tasked with fixing it are developers. To resolve Production errors, developers need actionable data that enables them to reproduce and debug those errors, and that’s where the current state-of-the-art observability platforms fall short. They barely scratch the surface of the code-level observability that developers need. The performance metrics and stack traces that DevOps/SREs work with are ineffective and frustrating for developers who have to fix an urgent Production issue, and they only create friction where collaboration is needed.
Ozcode Production Debugger introduces both a paradigm shift and a cultural shift in the realm of resolving Production incidents. By providing developers with the code-level observability they need, Ozcode turns Production Debugging into a monitoring discipline in which developers are empowered to actively participate as part of their day-to-day responsibilities. The rest of this post describes how Ozcode provides developers with the code-level observability they need to do an effective root-cause analysis of Production errors leading to a rapid resolution. It’s important to note that Ozcode Production Debugger and traditional observability platforms are not mutually exclusive but rather complement each other and are equally vital components of an enterprise tool stack.
Taking observability beyond the system and down to code-level
APMs such as New Relic, Dynatrace, AppDynamics, Datadog, and others have developed in recent years into full-fledged observability platforms. They now offer an enriched feature set with capabilities like log analysis, real user monitoring, synthetic testing, error monitoring, and more. However, none of these platforms enable code-level observability, which is the crucial missing piece that allows developers and DevOps/SREs to collaborate, quickly resolve issues, and prevent faulty deployments from reaching Production.
Ozcode Production Debugger presents a new paradigm for troubleshooting Production errors by turning Production Debugging into a monitoring discipline through the following key capabilities.
Autonomous exception capture replaces reproducing an error
Ozcode Production Debugger uses an agent that runs next to and monitors your application. When your application throws an exception, the Ozcode agent adds byte-code instrumentation to the code along the entire execution path from the initial interaction that triggered the exception to the line of code that threw it. Now, observability platforms also operate by adding instrumentation to your applications and infrastructure, but here’s the difference. Ozcode captures and records code-level Production data that is specific to the complete error execution flow that caused the exception. No observability platform provides that level of data. The Ozcode agent then transmits this debug data to the Ozcode server, where the relevant developer can analyze it independently of the live Production system.
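To make the idea of capturing an error's execution flow more concrete, here is a minimal Python sketch. It is not Ozcode's implementation or API, and it does not use byte-code instrumentation. It only walks the traceback of a caught exception and records the local variables of every frame on the path to the failing line. The names `capture_exception`, `handle_request`, `checkout`, and `apply_discount` are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def capture_exception(exc: BaseException) -> List[Dict[str, object]]:
    """Record function name, line, and locals for each frame in the traceback.

    Frames are returned from the outermost caller to the frame that raised.
    Values are stored as repr() strings so the capture can be shipped
    elsewhere and inspected independently of the running process.
    """
    snapshots = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        snapshots.append({
            "function": frame.f_code.co_name,
            "line": tb.tb_lineno,
            "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        })
        tb = tb.tb_next
    return snapshots


# Hypothetical application code with a failure deep in the call chain.
def apply_discount(price: float, percent: int) -> float:
    if percent > 100:
        raise ValueError("discount exceeds 100%")
    return price * (1 - percent / 100)


def checkout(cart: List[float], percent: int) -> float:
    total = sum(cart)
    return apply_discount(total, percent)


def handle_request() -> List[Dict[str, object]]:
    try:
        checkout([20.0, 30.0], 150)
    except ValueError as exc:
        return capture_exception(exc)
    return []


for snapshot in handle_request():
    print(snapshot["function"], snapshot["line"], snapshot["locals"])
```

Even this simple capture shows the values that led to the failure (for example, `percent` in `apply_discount`) without anyone having to rerun the scenario. A production-grade agent would also need to limit overhead and protect sensitive data, which this sketch ignores.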
Reproducing a Production error can be extremely challenging for several reasons:
- Matching the scale and structure of Production in a parallel environment is usually not feasible (if possible at all)
- Reproducing the exact user scenario that caused the error may be impossible
- Matching the right source code to the current Production binary may be impossible
- The ephemeral nature of microservices and serverless functions makes reproduction even more difficult for Production systems built on those technologies
By capturing the code execution flow of an error, Ozcode removes the need to reproduce an error. You debug the actual Production code where the error manifested in a completely non-intrusive way.
Time-travel debugging: the “Development experience” on Production code
During development, developers are used to debugging errors by stepping through their code with full visibility into their application’s runtime data. Doing the same in Production ranges from difficult to impossible, and observability platforms don’t even try to address this issue.
One option you might consider is attaching a remote debugger. There are tools on the market that let you connect to a live application for debugging. However, setting them up can be very complex, and company policies often forbid such connections for security reasons.
And then, even if you were able to set up a remote connection to your Production systems, stepping through the code would require stopping the application flow with breakpoints. Again, company policies usually forbid this activity as it stops the application flow not only for the debugging developer but also for customers.
Ozcode Production Debugger delivers the “Development experience” on Production with time-travel debugging.
The detailed debug data stored in Ozcode exception captures provides the same code-level observability that developers are used to getting in their development environments. They can step back and forth through the complete error execution flow across the whole call stack with full visibility into:
- Local variables
- Method parameters and return values
- Network requests
- Database queries
- Relevant log entries
- Event trace across microservices
Moreover, Ozcode makes it easier for developers to understand what happened in the error execution flow through various visual aids. These include red/green coloring for conditional statements, greyed-out text for code that was not executed, annotations showing the values of variables and method parameters within the body of the code, and more.
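The core idea behind stepping back and forth through a recorded execution can be sketched in a few lines of Python. This is an illustration only, not how Ozcode works internally: it uses the standard `sys.settrace` hook to snapshot the local variables before each line of one function runs, then lets you move a cursor over the recorded timeline after the fact. The names `record`, `TimeTravelCursor`, and `parse_total` are hypothetical.

```python
import sys
from typing import Any, Callable, Dict, List, Optional, Tuple

# One timeline entry: (line offset inside the function, copy of its locals).
Snapshot = Tuple[int, Dict[str, Any]]


def record(func: Callable, *args: Any) -> Tuple[List[Snapshot], Optional[Exception]]:
    """Run func under a trace hook and snapshot locals before each executed line."""
    timeline: List[Snapshot] = []
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # do not trace lines of other functions
        if event == "line":
            offset = frame.f_lineno - target.co_firstlineno
            timeline.append((offset, dict(frame.f_locals)))  # shallow copy
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    error: Optional[Exception] = None
    try:
        func(*args)
    except Exception as exc:  # keep the error; the timeline shows how we got here
        error = exc
    finally:
        sys.settrace(previous)
    return timeline, error


class TimeTravelCursor:
    """Move backward and forward over a recorded timeline."""

    def __init__(self, timeline: List[Snapshot]) -> None:
        self.timeline = timeline
        self.position = len(timeline) - 1  # start at the last recorded line

    def current(self) -> Snapshot:
        return self.timeline[self.position]

    def back(self) -> Snapshot:
        self.position = max(0, self.position - 1)
        return self.current()

    def forward(self) -> Snapshot:
        self.position = min(len(self.timeline) - 1, self.position + 1)
        return self.current()


# Hypothetical function that fails on bad input partway through a loop.
def parse_total(items):
    total = 0
    for item in items:
        total += int(item)
    return total


timeline, error = record(parse_total, ["3", "4", "x"])
cursor = TimeTravelCursor(timeline)
print("error:", repr(error))
print("at failure:", cursor.current())
print("one step back:", cursor.back())
```

Because the recording happens once and the inspection happens later, nothing has to be paused while a developer steps through the timeline, which is the property that matters for Production. Recording every line this way is expensive, so treat it as a teaching aid rather than a technique to run against live traffic.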
Using dynamic logging with tracepoints to make log-based debugging effective
Observability platforms may offer advanced log analysis as part of their enriched feature set. While the traditional way of troubleshooting with logs is better than nothing, it is NOT an effective way to debug Production issues for the following reasons:
- Insufficiency – it’s impossible to predict exactly where an error will occur, so it’s equally impossible to ensure that you have diagnostic log entries in the right places of your code.
- Distribution – logs from the multiple components of today’s complex software systems may be distributed between various sources, including files and databases. Piecing together the right logs to understand an error is extremely difficult.
- Process – since there are never enough log entries present to debug an error, debugging-by-logs is a tedious, iterative process that goes something like this: analyze existing logs → formulate a theory for the root cause of the error → add logs to test the theory → rebuild → redeploy → reproduce the error. Usually, several iterations of this process are required before the root cause is determined and a fix can be put in place.
Ozcode makes debugging-by-logs highly effective by addressing each of these issues:
- Dynamic logging with tracepoints solves “insufficiency” – with Ozcode, there’s no need to predict where an error will occur. You can add tracepoints anywhere in the code and set up structured dynamic logs to output any data item for analysis or examine the application state every time the code passes through a tracepoint.
- Log aggregation solves “distribution” – Ozcode assembles the log entries relevant to the error execution flow into one place, so there’s no need to dig through multiple sources to extract the relevant log entries.
- Autonomous exception capture with time-travel debugging solves “process” – the process using Ozcode is completely different. Autonomous exception capture gives you the complete error execution flow, so there is no need to reproduce an error. Time-travel debugging lets you step through the error execution flow with code-level observability. Using tracepoints and dynamic logs, you can add and remove log output and examine application state anywhere in your code at will without having to rebuild and redeploy.
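As a rough illustration of dynamic logging with tracepoints, the Python sketch below attaches structured log output to a line of a running function without editing or redeploying that function, and removes it again at will. It relies on the standard `sys.settrace` hook and is not Ozcode's mechanism or API. The `TracepointLogger` class and the `charge` function are hypothetical, and tracepoint locations are given as line offsets from the function's `def` line.

```python
import sys
from typing import Any, Callable, Dict, List, Tuple


class TracepointLogger:
    """Emit structured records when execution reaches a registered line."""

    def __init__(self) -> None:
        # (code object, line offset) -> names of variables to log
        self.tracepoints: Dict[Tuple[Any, int], List[str]] = {}
        self.records: List[Dict[str, Any]] = []
        self._previous = None

    def add(self, func: Callable, line_offset: int, variables: List[str]) -> None:
        self.tracepoints[(func.__code__, line_offset)] = list(variables)

    def remove(self, func: Callable, line_offset: int) -> None:
        self.tracepoints.pop((func.__code__, line_offset), None)

    def _trace(self, frame, event, arg):
        code = frame.f_code
        if event == "call":
            # Only trace functions that currently have a tracepoint.
            wanted = any(c is code for c, _ in self.tracepoints)
            return self._trace if wanted else None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            names = self.tracepoints.get((code, offset))
            if names is not None:
                record = {"function": code.co_name, "line_offset": offset}
                for name in names:
                    record[name] = frame.f_locals.get(name)
                self.records.append(record)
        return self._trace

    def install(self) -> None:
        self._previous = sys.gettrace()
        sys.settrace(self._trace)

    def uninstall(self) -> None:
        sys.settrace(self._previous)


# Hypothetical application function; note that it contains no logging calls.
def charge(order_id, amount_cents, fee_cents):
    total = amount_cents + fee_cents
    return total


logger = TracepointLogger()
logger.add(charge, 2, ["order_id", "total"])  # offset 2 is the `return total` line
logger.install()
try:
    charge("A-1", 10000, 500)   # logged
    logger.remove(charge, 2)    # tracepoint removed at runtime
    charge("A-2", 200, 10)      # not logged
finally:
    logger.uninstall()

print(logger.records)
```

The point of the sketch is the workflow: the log statement lives outside the application code, so adding or removing it does not require a rebuild or a redeploy.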
Observability platforms and Ozcode, better together
Ultimately, observability platforms may alert you to an error in your application and even point you to where in the code you might start looking; however, from there, it’s a guessing game. These tools do not provide the code-level observability into the error execution flow needed to do an effective root cause analysis. Metrics and dashboards cannot replace the time-travel debug information required to fix Production errors.
Nevertheless, it’s important to emphasize again that Ozcode Production Debugger and observability platforms are complementary tools, and both are vital components of an enterprise tool stack. Observability platforms monitor your systems for performance metrics and resource usage to ensure a good level of business continuity during normal operation. However, Production errors are inevitable, and when they occur, they disrupt business continuity. That’s when Ozcode Production Debugger jumps in to restore it.
In other words…
Observability platforms maintain business continuity during normal operations. Ozcode Production Debugger restores business continuity when errors occur.