The technological advances of the past decade have forced companies to undergo a digital transformation. It was the CTO's job to navigate that transformation successfully, which highlights one of a CTO's most important roles: using technology to generate value for the company and achieve its business goals.
So, let me show you how live debugging serves that role. The TL;DR you need to know is:
You need it because bugs in production are consuming valuable developer resources and costing you money.
It’s efficient and effective in helping you make your production software more robust and enabling you to recover more quickly from production errors.
But before we go into all the reasons why, let’s look briefly at what a live debugger is.
What is a live debugger?
A live debugger is a tool or platform you use to resolve errors in live environments: production, and pre-production environments such as QA and staging. The production use case is naturally the most acute one, where resolving errors can become an emergency. However, applying live debugging in QA and staging will also accelerate your developer velocity and make your production software more robust while doing wonders for your DevOps KPIs.
The idea of debugging in production is not new. Over the years, tools like event viewers, log files, dump files, Application Performance Monitors (APMs), and others have helped developers resolve errors in production. However, none of these tools are ideally suited for the job. They are either intrusive to your production environment, requiring downtime and dramatically impacting performance, or they’re not very effective in determining the root cause. They don’t provide enough data and may require multiple CI/CD cycles to deploy new builds specifically designated for debugging.
Modern live debuggers modify the code in your live environments non-intrusively using bytecode instrumentation. They generate the data developers need in two ways: recording the complete error execution flow of an exception along with all the debug data, and adding dynamic log entries on the fly, without having to rebuild the application. While APMs and observability platforms have been using bytecode instrumentation for the last ten years to generate various metrics displayed in beautiful graphs and charts, applying it to generate debug data is relatively new.
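To make the idea of a dynamic log entry concrete, here is a minimal sketch in Python. It does not use bytecode instrumentation; it swaps in Python's built-in sys.settrace hook as a stand-in, and the names make_tracepoint, with_tracepoint, and order_total are hypothetical, chosen only for this illustration. It shows a log entry being attached to a line of a running function without editing or redeploying that function.

```python
import sys


def make_tracepoint(func_name, line_offset, expression, sink):
    """Build a trace hook that logs `expression` when `func_name` reaches a line.

    `line_offset` counts lines from the function's `def` line (assumption:
    the function has no decorators, which would shift the first line).
    """
    def local_trace(frame, event, arg):
        if event == "line":
            offset = frame.f_lineno - frame.f_code.co_firstlineno
            if offset == line_offset:
                # Evaluate the expression against the live frame state.
                value = eval(expression, frame.f_globals, dict(frame.f_locals))
                sink.append(f"{func_name}+{line_offset}: {expression} = {value!r}")
        return local_trace

    def global_trace(frame, event, arg):
        # Only trace line events inside the targeted function.
        if event == "call" and frame.f_code.co_name == func_name:
            return local_trace
        return None

    return global_trace


def with_tracepoint(tracer, func, *args):
    """Run `func` with the tracepoint active, then restore the previous hook."""
    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        return func(*args)
    finally:
        sys.settrace(previous)


# Application code: untouched by the tracepoint.
def order_total(prices, tax_rate):
    subtotal = sum(prices)
    return subtotal * (1 + tax_rate)


if __name__ == "__main__":
    logs = []
    tracer = make_tracepoint("order_total", 2, "subtotal", logs)
    print(with_tracepoint(tracer, order_total, [10, 20], 0.5))
    print(logs)
```

A tracing hook like this slows the traced code down considerably, which is one reason production-grade tools are described as relying on instrumentation instead. The sketch only illustrates the observable behavior: a log entry appears without a rebuild.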
Why you need a live debugger
Amazon CTO Werner Vogels famously said, “Everything fails all the time.” But this is not new. In fact, it’s as old as the original Murphy’s law, “Anything that can go wrong, will go wrong.” No matter how many safeguards you have in place, you will have errors in production. It happens to the best of us. From UI glitches…
…to crashing company stocks and knocking spaceships out of orbit. A quick look at downdetector.com will show you that at any time, household names like Comcast, YouTube, Instagram, AT&T, Verizon, Microsoft, and others experience outages that Gartner estimates can cost companies thousands of dollars per minute.
Live debugging is efficient and effective
Debugging in production is hard. Production environments are usually far too complex to recreate, and reproducing production errors for debugging can be impossible. You can’t put breakpoints in production. Even with modern, sophisticated log analyzers and distributed tracing, log files don’t usually contain enough data to determine the root cause of an error. And in most cases, developers don’t even have access to the production environments where the errors occurred.
A live debugger overcomes all of these hurdles. Capturing exceptions along with the complete error execution flow removes the need to reproduce production errors. The full application state along the error execution path is available for a developer to step through, much as she would in development. This is what we call time-travel debugging. Dynamic logging with tracepoints (also known as non-breaking breakpoints) provides a similar degree of observability into the code for logical errors that don’t generate exceptions. The developer can simply add log entries to investigate the application state anywhere in the code without having to rebuild and redeploy the application.
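As a rough illustration of what "capturing the error execution flow" means, the sketch below walks a Python exception's traceback and snapshots the local variables of every frame on the path to the failure. It is a simplified stand-in, not how any particular live debugger is implemented, and the names capture_error_flow, checkout, and apply_discount are hypothetical.

```python
def capture_error_flow(exc):
    """Return one snapshot per frame on the exception's path, outermost first."""
    snapshots = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        snapshots.append({
            "function": frame.f_code.co_name,
            "line": tb.tb_lineno,
            # repr() keeps the snapshot serializable and detached from live objects.
            "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        })
        tb = tb.tb_next
    return snapshots


# Application code with a latent bug: a zero discount divides by zero.
def apply_discount(price, percent):
    factor = 100 / percent
    return price / factor


def checkout(cart_total):
    discount = 0
    return apply_discount(cart_total, discount)


def run_example():
    try:
        checkout(120)
    except ZeroDivisionError as exc:
        return capture_error_flow(exc)


if __name__ == "__main__":
    for step in run_example():
        print(step["function"], step["line"], step["locals"])
```

With the snapshots in hand, a developer can see that apply_discount was entered with percent set to 0 and that checkout supplied it, without having to reproduce the failure.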
Between autonomous exception capture and dynamic logging, a live debugger offers the developer a fast path to the root cause of an error, which can cut debugging time by up to 80%.
A live debugger overcomes the challenges of modern software architectures
Modern software architectures present special challenges when debugging live systems. It’s almost impossible to follow the complex code execution path of errors as they traverse a multitude of redundantly deployed, ephemeral microservices, with intermittent database requests, networking, and messaging, all while generating terabytes of log entries.
Fortunately, autonomous exception capture tracks execution flow across microservices, so you can follow the path of an error from one microservice to another and examine the application state at each step as the error unfolds. Similarly, for logical errors, you can place dynamic log entries anywhere in your code and trace the execution of an action across all the microservices in your application.
A live debugger is secure and compliant
A developer needs access to production data to understand the nature of an error. However, a variety of privacy regulations place restrictions on the production data you are allowed to expose. A live debugger balances these two competing needs with highly configurable PII redaction capabilities and enhanced data controls. Data can be redacted according to regular expressions, identifier names, or whole classes and namespaces, providing granular control over what is exposed to the developer. Data configured for redaction is masked before it ever leaves the production environment, and as a backup, data is redacted again at the front end to cover possible changes in redaction configuration. Moreover, the live debugger admin has complete control over data retention policies and can explore an exhaustive audit trail of data access.
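The redaction rules described above can be sketched in a few lines of Python. This is an illustrative assumption of how such rules might be expressed, not a real product's configuration format: the identifier list SENSITIVE_NAMES, the EMAIL pattern, and the redact function are all hypothetical. Values are masked by identifier name first, then string values are scanned with a regular expression.

```python
import re

# Hypothetical rules: mask by identifier name, and mask anything matching a pattern.
SENSITIVE_NAMES = {"password", "ssn", "credit_card"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MASK = "***"


def redact(snapshot):
    """Return a copy of a captured variable snapshot with sensitive data masked."""
    redacted = {}
    for name, value in snapshot.items():
        if name.lower() in SENSITIVE_NAMES:
            # Rule 1: the identifier name alone marks the value as sensitive.
            redacted[name] = MASK
        elif isinstance(value, dict):
            # Nested objects are redacted recursively.
            redacted[name] = redact(value)
        elif isinstance(value, str):
            # Rule 2: mask any substring matching the pattern.
            redacted[name] = EMAIL.sub(MASK, value)
        else:
            redacted[name] = value
    return redacted


if __name__ == "__main__":
    captured = {
        "user": {"email": "ann@example.com", "password": "hunter2"},
        "note": "contact ann@example.com today",
        "retries": 3,
    }
    print(redact(captured))
```

Running the masking step inside the production process, before a snapshot is serialized or sent anywhere, is what keeps the raw values from ever leaving the environment.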
A live debugger complements and integrates with observability platforms
You may wonder why you need a live debugger if you already have a modern observability platform (such as New Relic, Datadog, Logz.io, Dynatrace, etc.) closely monitoring your systems. These sophisticated platforms provide many capabilities like log analysis, performance monitoring, error monitoring, and more, displaying a host of system metrics you can follow. While these capabilities are indispensable for the daily maintenance and monitoring of modern software systems, none of these platforms provides the code-level observability needed to do an effective root cause analysis of production errors. At best, they will register an exception and show you the relevant stack trace, but only a live debugger will let you drill down into the complete error execution flow and analyze the error, step by step, to determine its root cause. True, a live debugger cannot replace your observability platform; rather, it is complementary and provides data at the next level of detail needed to resolve production errors. Observability platforms will point you in the right direction, but you need a live debugger to drill down and determine the root cause of errors.
The business of technology
As a CTO, you make strategic decisions about your company’s technology stack. Out of the countless tools available, you have to choose those that will make the most impact in driving your business forward. I may be biased, but it seems clear that a live debugger is a no-brainer. It’s necessary infrastructure that’s as important as your observability platform, if not more so. Not only will it make your products more robust by preventing faulty deployments from reaching production, but it will also drastically cut the time to resolve those errors that do slip through the barricades of your DevOps pipeline. Whether it’s reducing the time your developers spend on debugging in production (liberating them to add new features and value to your products) or reducing the number of customers impacted by an error, the ROI is huge and will immediately be reflected in your company’s bottom line.