DevOps has been a revolution in the software industry. Companies that have embraced DevOps and implemented it correctly have seen tremendous improvements in the velocity at which they can deliver high-quality software. There’s no lack of DevOps tools and resources for the .NET ecosystem, but if you include Live Debugging in Production as part of your .NET DevOps pipeline, then you must come to the following sad conclusion:
DevOps is broken.
It works fine… until you deploy your applications to Production. Then it breaks down.
That is because a successful DevOps implementation requires fast feedback loops that enable a rapid cadence of continuously improving deployments. In companies that do not implement DevOps correctly, these feedback loops break, and the chasm between Development and Operations that was always there haunts developers who continue to chase elusive Production errors.
Live Debugging in Production is a whole different ball game
A pre-Production system is a controlled environment, so when something goes wrong, it’s relatively easy to reproduce the error and get all the debug information needed to fix it. In many cases, the developer can run the relevant code in her local debugger and step through the error execution flow to find the root cause. In Production, things change dramatically.
Production systems run on a whole different scale. The number of user interactions and the volume of data create scenarios you never expected. Throw in the cloud, microservices, and serverless code, and things begin to get very tricky. The actual point of failure where the application throws an exception may be clear, but that’s not enough. You can’t put breakpoints in your Production environment and step through the code. And in the case of microservices or serverless, the service or function with the root cause may not even be running anymore.
Traditional .NET debugging tools can’t cut it
Traditionally, the way to approach Production errors was through log files, what we call “printf debugging.” But logs are not very effective in determining the root cause. The nature of errors is that they occur where you least expect them, so you rarely happen to have the right log entries in place. You try your best to understand what happened, add more logs to verify your hypothesis, and then have to redeploy and wait for the error to recur. You’ll usually need several iterations before you really solve the problem.
There are observability tools available that help. They provide snapshots with information about locals and variables at the point of failure. Still, we already established that the root cause of the error is often somewhere way back in the execution flow. So the information these tools provide only shows you where to start looking.
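To make the limitation concrete, here is a minimal sketch of a point-of-failure snapshot. It is written in Python purely for illustration and does not represent how any particular observability tool or .NET agent works; the functions parse_quantity, unit_price, and snapshot_at_failure are hypothetical names invented for this example.

```python
import sys


def parse_quantity(raw):
    # The root cause lives here: bad input is silently mapped to 0.
    try:
        return int(raw)
    except ValueError:
        return 0


def unit_price(total, raw_quantity):
    quantity = parse_quantity(raw_quantity)
    return total / quantity  # point of failure: ZeroDivisionError


def snapshot_at_failure(func, *args):
    """Run func; if it raises, return a copy of the locals of the frame that raised."""
    try:
        func(*args)
    except Exception:
        tb = sys.exc_info()[2]
        while tb.tb_next is not None:  # walk to the innermost frame
            tb = tb.tb_next
        return dict(tb.tb_frame.f_locals)
    return None


snapshot = snapshot_at_failure(unit_price, 100, "three")
print(snapshot)
```

The snapshot contains the locals of unit_price, including a quantity of 0, but nothing from parse_quantity. That frame had already returned by the time the exception was thrown, so the decision that produced the 0 is not visible in the snapshot.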
There are also .NET debugging tools touting debugging capabilities that use methods similar to Visual Studio’s Tracepoints, but all they’re doing is enabling you to add logs dynamically to your Production code. This is much better than the traditional logging-based debugging techniques since it removes the need to rebuild and redeploy your changes each time. But is that enough?
Not really. In the end, what you’re getting is hints at what happened at different points along the way. If you decide to log the right variables, you may see what went wrong, but it’s still an exercise of guessing what you need to log.
There is a better way.
Time-Travel debugging for .NET
There is a good reason that Time-Travel fidelity is one of the pillars of Production debugging. Not only does it provide a complete picture of the application state at the point of failure, but it does so at each step of the way back to the root cause. With Time-Travel fidelity, you can step back and forth through the code and see the values of all locals, variables, and method return values across the whole call stack along the complete error execution flow. No need to guess what to log where – it’s all there for the taking.
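The idea of recording state at every step can be sketched in a few lines. The example below is an illustration only, written in Python with the standard sys.settrace hook; it is not Ozcode's implementation and says nothing about how a .NET agent instruments code. The names record_execution, apply_discount, and checkout are hypothetical.

```python
import sys


def record_execution(func, *args):
    """Run func under a tracer and return (history, error).

    history holds one (function name, line number, copy of locals) entry
    per executed line, oldest first. Each entry is captured just before
    its line runs.
    """
    history = []

    def tracer(frame, event, arg):
        if event == "line":
            history.append(
                (frame.f_code.co_name, frame.f_lineno, dict(frame.f_locals))
            )
        return tracer  # keep tracing inside nested calls

    error = None
    sys.settrace(tracer)
    try:
        func(*args)
    except Exception as exc:
        error = exc
    finally:
        sys.settrace(None)
    return history, error


def apply_discount(price, percent):
    discount = price * percent  # bug: percent should be divided by 100
    return price - discount


def checkout(price, percent):
    final = apply_discount(price, percent)
    if final < 0:
        raise ValueError("negative total")
    return final


history, error = record_execution(checkout, 50, 20)

# Step backwards from the point of failure towards the root cause.
for name, lineno, local_vars in reversed(history):
    print(name, lineno, local_vars)
```

Walking the history backwards from the exception in checkout leads to the step inside apply_discount where discount already holds an impossible value, even though that function returned long before the error was raised. Recording every line like this is expensive, which is one reason a Production tool would want to limit how much instrumentation it adds.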
Ozcode provides Time-Travel fidelity in steps for maximal effectiveness with minimal instrumentation.
The first time an error occurs and an exception is thrown, the Ozcode agent reports it in the dashboard and adds primary instrumentation. The next time the error occurs, Ozcode shows the point of failure and the corresponding local variables. (Error monitors provide this information too, but for them, that’s where it ends.) In many cases, that is enough to solve the error, but the scale and complexity of Production systems often require deeper investigation. You can now ask Ozcode to add more instrumentation to the error execution flow, so that the next time the error occurs, you’ll get the full time-travel capture that provides debug information across the call stack.
Let’s see what this looks like in Ozcode Live Debugger.
Time-Travel debug information completes the feedback loop
While logs (whether static or dynamic) and observability tools go some of the way toward helping developers fix Production errors, they still leave a wide gap between the location in the code where an exception is thrown and the root cause of the error. Detailed time-travel debug information fills that gap with code-level visibility into the complete error execution flow. By putting this Production data into the developers’ hands, time-travel debug information closes the feedback loop from Production back to development to complete the DevOps cycle for Live Debugging in Production.
Ozcode Live Debugger
Autonomous exception capture
Don’t chase after exceptions. They automatically appear in your dashboard.
Get insights into your code down to the deepest levels.
View the true state of your code anywhere in the execution flow of an error.