A production debugger plays a critical role in your tool stack because errors in production can be extremely damaging to your business. In the best-case scenario, you catch them before they impact your customers. But once customers have been affected by an error, it really starts costing you. Either way, when a production error hits, your developers are in a crunch to fix it and can’t spend their time doing what they love most: writing awesome code. Production errors are also harder to fix than bugs detected earlier in the deployment pipeline. Not only is it difficult to recreate a production environment, but developers also don’t have convenient tools for production like the IDEs they use for debugging in development. Consequently, a typical “fix” for a production error is to roll back to a previous version, which, of course, makes great new features roll away.
Therefore, enterprises invest significant time and resources to ensure application health under normal operations, and if an error does hit production, they want to fix it as quickly as possible, before customers notice. While there are many monitoring tools available on the market, none of them provides an adequate solution for debugging. The best approach is to choose the best tool for each job and combine them into a complete solution that includes monitoring, alerts, logging, and production debugging.
The Enterprise Application Health Ecosystem
The three main categories of tools used in the enterprise to monitor and maintain application health are Error Monitors, Log Analyzers, and APMs. There is some overlap in the capabilities of these tools:
All do some degree of monitoring.
All claim to help resolve production issues.
All of them fall short of being an effective solution to find the root cause of production bugs.
Errors can occur anywhere in a production system. Error monitoring tools excel in aggregating all the errors into dashboards so they can be tracked and handled with ease. These tools may also provide the stack trace that led to an error along with some context that can point the developer in the right direction.
Log analyzers try to go a step further by analyzing all logs generated in your system to provide visibility into what’s happening. Looking at logs is a common way for engineers to try to understand and debug distributed systems. The capabilities of log analyzers begin with searching and filtering logs but can also stretch to applying AI algorithms to detect anomalies.
Application Performance Monitors (APMs) keep an eye on application performance by merging and analyzing logs and errors while measuring a variety of application metrics. These metrics can be divided into two categories: user/application metrics (such as number of users, number of transactions, and response time), and infrastructure metrics (such as CPU time, memory usage, and storage loads). To avoid impacting application performance through their measurements, APMs typically use periodic sampling to reduce load on the production systems. So, for example, a report on how long an HTTP request took to complete may be an average value of several samples taken over a given time interval.
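To make the sampling point concrete, here is a minimal Python sketch. The response times and the "every third request" sampling rule are invented for illustration and don't reflect how any particular APM samples; the only point is that an average over samples can look perfectly healthy while a single slow request goes unseen.

```python
from statistics import mean

# Hypothetical response times (ms) for ten HTTP requests in one interval.
response_times_ms = [120, 118, 125, 119, 2400, 121, 117, 123, 122, 120]

# Assumed sampling rule: keep every third request (indices 0, 3, 6, 9).
sampled = response_times_ms[::3]

# What the dashboard would report versus what actually happened.
reported_average = mean(sampled)
true_max = max(response_times_ms)

print(f"samples:          {sampled}")
print(f"reported average: {reported_average} ms")
print(f"slowest request:  {true_max} ms")
```

In this made-up interval the 2,400 ms request falls between samples, so the reported figure sits close to the typical request time.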
Metrics and dashboards can’t replace real production debugging
The metrics and dashboards that error monitors, log analyzers, and APMs provide go a long way in alerting you about errors in your system. They may even offer clues as to where the error lies, but that’s where it ends: with clues. Software errors usually manifest as exceptions that the system throws. To understand what’s really causing the error (the root cause), you need the ability to trace your code from the point where the exception was thrown all the way back to the original interaction. This is where these tools fall short of a production debugger.
Aggregated errors, even with context and a complete stack trace, do not provide all the variable values along the error execution flow. Moreover, microservices and serverless architectures make it even more difficult to track the flow of an error. The different metrics provided by APMs don’t provide enough information for a code-level analysis, so the best you can do is form a theory which then needs to be tested in a new deployment to production. You might think that with logs, you can output anything you want, but logs are a game of hit and miss. There are never enough. You usually end up guessing what happened and adding more log entries to verify your guess. But then, you must also rebuild, redeploy, and reproduce the error, which can be extremely complex in a highly scaled production environment. The conclusion is that nothing can replace a code-level analysis of the real-time scenario that caused the error. As a simplistic example, consider the code below:
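The original listing isn't reproduced here, so the snippet below is a hypothetical stand-in written in Python. The function names, the DISCOUNTS table, and the order data are all invented for illustration. It shows a small pricing function that fails, and the stack trace an error monitor would typically record.

```python
import traceback

# Hypothetical discount table: tier name -> fraction taken off the total.
DISCOUNTS = {"gold": 0.10, "silver": 0.05}

def load_discount(customer):
    # Returns None when the customer's tier is not in the table.
    return DISCOUNTS.get(customer.get("tier"))

def final_price(order):
    discount = load_discount(order["customer"])
    # Raises TypeError when discount is None.
    return order["total"] * (1 - discount)

order = {"id": 1017, "total": 250, "customer": {"name": "Dana", "tier": "platinum"}}

try:
    final_price(order)
    trace_text = ""
except TypeError:
    # Roughly what an error monitor would store: the formatted stack trace.
    trace_text = traceback.format_exc()

print(trace_text)
```

The trace names the failing function and the TypeError, but it doesn't say which order was being priced, which tier was looked up, or that load_discount returned None.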
What this example emphasizes is that a stack trace may alert you to the fact that an exception was thrown, but this is not enough. The developer still needs to figure out the values of the variables and function return values along the way from other sources such as logs, HTTP requests, and other metrics in order to build a complete picture of what happened. A production debugger does all that work for you and displays all the relevant values just where you need them.
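As a rough illustration of what "displaying the relevant values" involves, the Python sketch below walks an exception's traceback and copies the local variables of every frame along the way. It is a toy under stated assumptions, not a description of how Ozcode or any other production debugger is implemented; capture_locals, handle_request, and the order data are hypothetical names introduced here.

```python
def capture_locals(exc):
    """Return (function name, line number, copy of locals) per traceback frame."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        frames.append((frame.f_code.co_name, tb.tb_lineno, dict(frame.f_locals)))
        tb = tb.tb_next
    return frames

def apply_discount(total, discount):
    # Raises TypeError when discount is None.
    return total * (1 - discount)

def final_price(order):
    discount = order.get("discount")  # None when the key is missing
    return apply_discount(order["total"], discount)

def handle_request(order):
    # Returns (price, []) on success or (None, captured frames) on failure.
    try:
        return final_price(order), []
    except TypeError as exc:
        return None, capture_locals(exc)

result, snapshot = handle_request({"id": 42, "total": 100})
for name, lineno, local_vars in snapshot:
    print(f"{name} (line {lineno}): {local_vars}")
```

With the values captured next to each frame, it is immediately visible that discount was None by the time apply_discount ran, which is the piece of information the bare stack trace leaves out.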
Production debugging fits right in
A production debugger isn’t meant to replace any of these tools. Enterprises need error monitors, log analyzers, and APMs. What a production debugger does is add to their arsenal when something bad happens in their production systems. APMs and error monitors can serve as early warning systems, indicating that something is wrong and giving you a chance to tweak and maintain system equilibrium before customers notice anything. Log analyzers can do a great job of helping you understand the numerous and disparate log files that are distributed around the different modules in your system. But when it comes to really understanding the root cause of an error, you need a tool that provides the four pillars of production debugging: autonomous exception capture, code-level observability, time-travel fidelity, and DevOps integration.
Ozcode Production Debugger
The cost of downtime can be enormous. Some estimates put it as high as $5,600 per minute. So clearly, you want to avoid downtime, but just as clearly, websites go down all the time. This only emphasizes how critical production debugging is. Anything that cuts down the time it takes to fix production errors is money in the bank. While error monitors, log analyzers, and APMs can help point you in the right direction, they can’t take you all the way to the root cause. Think of it like navigating to an address in the middle of a city you’re unfamiliar with. Those other tools might take you to the outskirts of the city, but to get to the exact address you need, a production debugger is what’s going to give you precise directions.