Over the years, the software industry has undergone many paradigm shifts that sprouted new technologies. The move to distributed architectures like microservices and serverless brought us products like Docker, Kubernetes, AWS Lambda, and Azure Functions. Performance and error monitoring tools evolved into APMs, which have since become full-blown observability platforms. Today we are in the midst of a revolution in debugging live systems.
Debugging in development is pretty much the same as when I was a developer over 20 years ago. Reproduce the error, place breakpoints in your code, and hit F5, F10, and F11 (or some other key combination) to start the debugger and step through your code to understand what went wrong. Not so for production.
Debugging in production has always been vastly different. For one, developers don’t usually have access to their production environments, so they have to try to reproduce errors in their development IDEs, a near-impossible feat in the age of distributed cloud computing. And even if developers were given access to production, they wouldn’t be able to place breakpoints in the code to examine the application state.
All that is changing.
Over the last few years, a new industry segment has emerged that focuses on changing how we debug live applications in production, making the process as straightforward as it is in development. Several pioneering companies have entered this new field, aiming to accelerate the resolution of production errors (which still do, and always will, occur) without degrading service. As a new field, the industry hasn’t yet settled on a name, so you may have heard about:
- Continuous debugging
- Live debugging
- Modern debugging
- Remote debugging
- Code-level observability
- Autonomous debugging
- Software understandability
The rest of this article summarizes the sentiments of these trailblazers in the art of debugging across the SDLC in their responses to Adam’s questions. For the complete recording, scroll down to the end of the article.
Why has live debugging in production become so difficult?
Software has changed in every respect: how it’s developed, how it’s deployed, and even what we expect of software developers. Monoliths have been replaced by microservices and serverless; waterfall has been replaced by agile, which then sprouted DevOps. The whole infrastructure is different, and a LEMP stack won’t cut it anymore. You now have multiple instances of your software running in different locations, managed by a load balancer, at an insane scale.
While we keep doing bigger and greater things with these new software architectures, one of the consequences of this increasing complexity is that it’s extremely difficult for a programmer to understand where in the chain of causality things break down and why. The typical debugging paradigm is several cycles of adding more logs, going through a lengthy CI/CD cycle, and shipping out a patch for debugging.
But developers don’t encounter these issues only in production. Today’s pre-production environments are also so complex that developers cannot recreate them on their local machines. Between large Kubernetes clusters, external dependencies, 3rd party libraries, and APIs, replicating an environment to reproduce a bug is very cumbersome, and fixing bugs is a long and frustrating process for engineers.
And yet, today, we live in an environment of high expectations. Customers demand performance and new functionality delivered frequently; applications must run smoothly 24/7, and maintenance downtime is not acceptable. A slip on any of these parameters immediately affects the business. These high demands place a heavy burden on developers’ shoulders. But when it comes to debugging, the tooling has not kept pace and doesn’t address all these new challenges that developers face. There’s a faulty feedback loop that doesn’t provide developers with enough information.
To compound the problems developers face, software is much more diverse than it was 10 – 15 years ago. Teams choose the best language and platform to develop any particular service, so developers need a much broader base of knowledge and capabilities than before.
As time goes by, applications are not properly maintained. Technical debt creeps in and teams change until, finally, nobody really understands how the application works. Enterprises are hesitant, even scared, to touch legacy code, and chasing down bugs in it can be daunting.
The challenges of debugging “the old way”
The first challenge a developer faces when fixing a bug is trying to reproduce it. Sometimes, even the most detailed QA reports don’t provide enough data, and QA engineers find themselves arm-wrestling with developers over the proverbial “It works on my machine.” This kind of development/QA friction, even for pre-production environments, wastes valuable development resources and ultimately results in lower quality code. It’s even worse for bugs in production.
The increased complexity of modern production environments makes it nearly impossible to reproduce production issues in a developer’s local environment. Cloud computing poses many obstacles, and between the growing use of feature flags and customer-specific configurations, it’s often not even feasible to recreate the environment where a production error occurred.
The most common production debugging tool in current use is static logging; there’s a built-in, traditional reliance on logs. While logs can give first responders many insights, they’re not a productive tool for fixing production errors. As customer demands push companies to ship versions to production at an ever-increasing pace, developer velocity must increase correspondingly. But to fix bugs quickly, developers need feedback. In development, they get it first from their compilers, then from static code analysis tools, then from their CI/CD pipeline. When it comes to debugging in production with static logs, however, the feedback loop becomes prohibitively painful: developers never have the logs they need in place and have to go through multiple cycles of log-only builds to get them.
Modern debugging tames logs
Static logging has been commoditized by companies that are making it cheaper, faster, and more robust. But this mindset of “log everything and analyze later” incentivizes engineers to write reams of logs without investing enough thought in what they’re logging. The result is very noisy output, most of which is never viewed or analyzed anyway. When a problem arises, engineers race to their logging tools, only to be swamped by an over-abundance of logs that don’t provide the right data.
Logging needs to be more ergonomic for developers. The tools should feel familiar, like the “GitHubs” of the world, so developers are comfortable with them. Developers need easy access to the information they need without having to endlessly scroll through a browser window.
Modern debugging tools take a different approach and treat logs dynamically. They empower developers to add logs to live code, when and where they need them, simply and securely. This lets them capture the data they need without the noise of reams of useless static logs. These tools close the gap between errors in production and the code that caused them, providing developers with the data they need to fix those errors.
The magic behind dynamic logging
The magic behind modern debugging tools can be broadly categorized as dynamic instrumentation, a technology borrowed from the cybersecurity space that changes applications at runtime. By manipulating bytecode, modern debugging tools alter live code to output logs anywhere in the application the developer wants to inspect. Essentially, at the click of a button, developers can get the data they need without adding code, without stopping the application or affecting its performance, and without needing anyone else in the organization.
The front end of these tools is as important as the back end. They use UX patterns and paradigms that developers are familiar with, either integrating directly with popular IDEs or presenting themselves in an IDE-like user interface. The act of adding a dynamic log entry is virtually identical to adding a breakpoint, which developers are so familiar and comfortable with. Indeed, the different terms used to name this feature include non-breaking breakpoints, tracepoints, snapshots, data points, and more.
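To make the idea concrete, here is a minimal sketch of what a non-breaking breakpoint does conceptually, using CPython’s `sys.settrace` hook to emit a snapshot when a chosen line executes, without pausing the program. This is an illustration only: commercial agents typically rewrite bytecode for far lower overhead, and `make_tracer`, `captured`, and `compute` are hypothetical names, not any vendor’s API.

```python
import sys

captured = []  # dynamic "log" records land here instead of a log backend

def make_tracer(target_func, target_line):
    """Build a trace function that snapshots locals when target_line runs."""
    def tracer(frame, event, arg):
        if event == "call":
            # Only trace inside the function we care about.
            return tracer if frame.f_code is target_func.__code__ else None
        if event == "line" and frame.f_lineno == target_line:
            captured.append(dict(frame.f_locals))  # snapshot; execution continues
        return tracer
    return tracer

def compute(x):
    y = x * 2
    z = y + 1  # pretend the developer placed a non-breaking breakpoint here
    return z

# The "z = y + 1" statement sits two lines below the def statement.
target_line = compute.__code__.co_firstlineno + 2
sys.settrace(make_tracer(compute, target_line))
result = compute(5)
sys.settrace(None)

print(result, captured)  # locals are captured just before the chosen line runs
```

Note that the snapshot is taken before the line executes, mirroring how a breakpoint would show state at that point, and the program never stops.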
Modern debugging, observability, and understandability
The concept of observability has been widely embraced, with several companies becoming well-entrenched in the industry. The premise of modern debugging tools is also around observability, but at the code level. Traditional observability tools will show you the state of your machines and maybe even your applications. They may identify a spike or a crash and pinpoint it to a server, cluster, or application instance. But these parameters don’t provide enough data to debug production issues.
Think of it as the contextual data an engineer wants to see in a JIRA ticket in order to make decisions and solve problems: line-by-line data such as local variables, method parameters and return values, and stack traces. This data helps the engineer understand exactly what caused that spike or crash, get to the root cause, and fix it quickly.
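As a rough illustration of what such a code-level snapshot might contain, the Python standard library alone can collect the stack trace plus the locals in every frame of a failed call. The names `capture_snapshot` and `charge` below are hypothetical, not any product’s API; real tools gather this data continuously and attach it to the error automatically.

```python
import traceback

def capture_snapshot(exc):
    """Collect the context a developer wants on a ticket: the stack trace
    plus the local variables in every frame along the way."""
    frames, tb = [], exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        frames.append({
            "function": frame.f_code.co_name,
            "line": tb.tb_lineno,
            "locals": {k: repr(v) for k, v in frame.f_locals.items()},
        })
        tb = tb.tb_next
    return {
        "error": repr(exc),
        "stack": "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
        "frames": frames,
    }

def charge(amount, discount):
    rate = 1 - discount
    return amount / rate  # blows up when discount == 1 (a 100% discount)

try:
    charge(100, 1)
except ZeroDivisionError as exc:
    snapshot = capture_snapshot(exc)

# The innermost frame shows exactly why the division failed.
print(snapshot["frames"][-1])
```

The innermost frame’s locals (`amount`, `discount`, `rate`) are precisely the kind of line-level evidence that turns a bare “500 error” alert into an actionable ticket.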
The difference between traditional observability and code-level observability mirrors the difference between how IT/Ops and developers handle incidents. IT/Ops and SREs may be the first responders to alerts on traditional observability platforms, but when they can’t fix an issue and conclude the problem runs deeper than a cluster or a machine, there’s the proverbial “throwing over the wall.” Developers, however, can’t act on the logs, metrics, and traces that the SREs throw at them; they need debuggers so they can delve deeper. So there’s a chasm between IT/Ops and developers. As the integration between traditional observability and code-level observability improves, this chasm will be bridged and collaboration between the teams will improve.

But the collaboration extends beyond Dev and Ops; there’s also Dev/QA and even Dev/Dev collaboration. As the different teams intensify their collaborative efforts, development organizations will become more powerful, because developers don’t work alone. There’s a fallacy that debugging is a solitary effort: a developer sitting in front of a bright screen in a dark room, trying to figure out complex interactions between the moving parts of an application. In reality, putting IT/Ops, Dev, and QA in the same context helps developers assemble the different pieces of data to solve the issues they’re debugging.
Hurdles to overcome
In spite of the clear benefits this technology brings to software, companies in this space encounter a lot of resistance.
In general, people resist change, and introducing new technology to software organizations is no different. Developers are so used to reproducing errors in their local environments and then adding logs to try to understand what caused them that it’s their first course of action when tackling production issues. Even when agents do get installed in production environments, developers have a knee-jerk reaction of, “We don’t have access to production.” However, as the complexities of the cloud make this approach less and less feasible, awareness of the alternatives will grow, and adoption will increase.
Integration with observability platforms
Observability platforms were in a similar situation about ten years ago, but with widespread adoption, they have become the first line of defense for production issues. Modern debugging tools aren’t trying to replace observability platforms (and anyway, no enterprise is going to abandon observability). These are complementary technologies, and modern debuggers will have to play nicely with observability platforms. There are different ways to make these two classes of tools work together. Some approaches focus on keeping the products separate, communicating via APIs; others go for full-fledged integrations. Either way, working together with observability platforms with a smooth and intuitive workflow is a must.
Maturity and security
Because this is an emerging category, companies are very concerned with both the maturity and the security of modern debugging tools. All of these tools require installing agents in the customer’s production environment, and, fundamentally, they modify the customer’s code.
The pioneers in this market are well aware of and understand the security concerns, so they build security into their products as a primary feature from the ground up. Modern debuggers offer highly configurable redaction capabilities for personal data. They store and transfer data in compliance with the strictest security requirements, offer fine-grained access control, and audit access to sensitive data.
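As a rough sketch of what redaction can look like in practice, a logging filter can mask PII patterns before any record reaches a handler or leaves the process. The `RedactingFilter` name and the two regexes below are illustrative assumptions only; real products ship far more configurable, audited redaction rules.

```python
import logging
import re

# Illustrative patterns only; production tools use configurable rule sets.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

class RedactingFilter(logging.Filter):
    """Mask common PII patterns before a record reaches any handler."""
    def filter(self, record):
        msg = EMAIL_RE.sub("<email>", record.getMessage())
        record.msg, record.args = CARD_RE.sub("<card>", msg), None
        return True  # keep the (now redacted) record

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.warning("retry for alice@example.com, card 4111 1111 1111 1111")
# logged as: retry for <email>, card <card>
```

Attaching the filter at the handler ensures every record passing through is scrubbed, regardless of which code path emitted it, which is the same guarantee a debugging agent needs before shipping captured data off-host.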
Resistance is futile
The need is there. The technology is evolving. Resistance is futile.