Remember the days when a software release was a media event? Announcements went out to the press, plastic discs were shipped to customers, and the company celebrated the end of a half-year journey to get the release out the door. One of the reasons releases took so long is that they needed massive amounts of QA and testing because once those CDs were burned, there was no going back. And yet, errors were still found in production, and patches had to be released through the same cumbersome process of burning discs and shipping them worldwide.
Fast-forward to 2021, and things have changed. SaaS rules, with over 80% of workloads running in the cloud. DevOps has been widely adopted, and companies are releasing software updates at a frenzied pace.
But for all the safety measures and quality gates that a build has to pass before being deployed to production, bugs still get through. And when a serious bug is discovered in production, it’s like a fire in the building. Everyone goes into emergency mode. One way to minimize the number of flare-ups is testing in production.
Testing releases with exception tracking, feature flags, and canary deployments
Companies that have accepted that production errors are inevitable are embracing testing in production to reduce the number of “fires” that break out in their production environments. In her series of articles on testing in production, Cindy Sridharan mentions canary deployments, feature flags, and exception tracking as three techniques used in the “release” phase of production.
Testing in production with exception tracking
Exceptions that turn up in a new release reflect the differences between production and staging. You simply cannot faithfully reproduce the structure, scale, and complexity of your production workloads in pre-production. It’s precisely those traffic spikes, unique user scenarios, and weird code paths that generate the exceptions you could not have foreseen (otherwise, you would have fixed them before releasing).
Exception tracking solutions have been around for a while, from error monitoring tools to APMs to full-blown observability platforms. These tools typically catch exceptions and show you the stack trace and the local variables at the point where the exception was thrown. While this information helps, it’s usually not enough. Developers find themselves digging through log files to try to understand what went wrong, but even logs don’t usually provide enough information. Developers often have to go through several CI/CD cycles whose only purpose is to add the log data they need to understand and fix the issue. But now, consider how testing in production and error resolution might look if your exception tracking tool could laser-guide you to the solution.
Ozcode’s exception tracking goes much deeper, using autonomous exception capture.
When your application throws an exception, the Ozcode agent uses dynamic instrumentation to record a full time-travel recording of the error’s execution flow. That means that not only do you get the exception, stack trace, and locals that those other tools provide, but you also get method parameters and return values, relevant log entries, database queries, and HTTP request/response payloads. This methodology is much better than reproducing the error in a different environment (if you even manage to do that) because it replays the error as it unfolded, providing you with real production data line by line, every step of the way. There’s no better source of information a developer could ask for.
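To make the idea concrete, here is a minimal sketch, in Python rather than .NET, of capturing rich context when an exception is thrown. This is not how the Ozcode agent works internally (it uses dynamic instrumentation, not decorators), and all names here are invented; it only illustrates why having the arguments and stack trace recorded at throw time beats digging through logs afterwards.

```python
import functools
import traceback

CAPTURES = []  # a real agent would ship these records to a backend, not a list

def capture_on_exception(func):
    """Record arguments and the stack trace when func raises, then re-raise.

    A crude stand-in for agent-based capture; real tools also record
    locals per frame, database queries, and HTTP payloads.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            CAPTURES.append({
                "function": func.__name__,
                "args": args,
                "kwargs": kwargs,
                "exception": repr(exc),
                "stack": traceback.format_exc(),
            })
            raise  # capture is passive; the error still propagates
    return wrapper

@capture_on_exception
def divide(a, b):
    return a / b
```

If `divide(1, 0)` is called, `CAPTURES` ends up holding the function name, the offending arguments `(1, 0)`, and the formatted stack trace, with no extra logging deploys required.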
Debugging in production with feature flags and canary deployments
Feature flags are a great tool for implementing canary deployments. They allow you to enable code for a distinct and defined subset of your users, letting you gradually roll out new features to more and more customers until they’re available to your complete user base.
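A common way to implement this kind of gradual rollout is deterministic bucketing: hash the user ID so each user lands in a stable bucket, then enable the flag for buckets below the rollout percentage. The sketch below is a simplified illustration, not any particular vendor’s API; the flag name and checkout functions are made up.

```python
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into the canary cohort.

    Hashing flag name + user id gives each user a stable bucket in 0-99,
    so the same user keeps seeing the same variant as the rollout grows.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

def checkout(user_id: str) -> str:
    # "fast_checkout" is a hypothetical flag gating a new code path
    if flag_enabled("fast_checkout", user_id, rollout_percent=5):
        return "new checkout flow"      # canary path, ~5% of users
    return "current checkout flow"      # everyone else
```

Raising `rollout_percent` from 5 to 50 to 100 widens the cohort without ever flipping a user who already has the feature back to the old path.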
But in addition to managing feature rollout, feature flags are also data, data that can help developers resolve issues that suddenly turn up when a flag is enabled. With Ozcode, you can see the state of all your feature flags for each exception that is thrown.
In the above example, three feature flags are displayed as contextual data in the Exception Distribution panel. While the “Fast Checkout” and “New Sign In” feature flags don’t always coincide with the exception, the “Reg CTA” feature flag does, 100% of the time. Whenever that feature flag is enabled, we get this exception, raising suspicion of a bug in the code it exposes. You can now select that exception and debug it to understand what went wrong.
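The correlation described above is simple to compute once each exception event carries the flag state captured at throw time. Here is a small sketch with invented data mirroring the three flags in the example; it is an illustration of the idea, not Ozcode’s implementation.

```python
from collections import defaultdict

# Each exception event carries the feature-flag state recorded when it was
# thrown. Flag names and values are made up to mirror the example above.
events = [
    {"flags": {"Fast Checkout": True,  "New Sign In": False, "Reg CTA": True}},
    {"flags": {"Fast Checkout": False, "New Sign In": True,  "Reg CTA": True}},
    {"flags": {"Fast Checkout": True,  "New Sign In": True,  "Reg CTA": True}},
]

def flag_correlation(events):
    """Fraction of exception events in which each flag was enabled."""
    counts = defaultdict(int)
    for event in events:
        for flag, enabled in event["flags"].items():
            counts[flag] += enabled  # booleans sum as 0/1
    return {flag: count / len(events) for flag, count in counts.items()}

correlation = flag_correlation(events)
# "Reg CTA" is enabled in every event here, making it the prime suspect
```

A flag enabled in 100% of the events for an exception is exactly the signal that points you at the code the flag exposes.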
When you expand the rollout to the next stage and a new exception suddenly turns up, you’re faced with a dilemma. Leave the feature flag on so you can debug the issue and collect more data in the logs? Or switch it off so your users don’t suffer any degradation of service? With Ozcode, you don’t have to make that decision, because you have the full time-travel recording of the exception. You can disable the offending feature flag and still debug the error.
But what if the feature you’re rolling out causes an error without throwing an exception? Something’s wrong with the flow. That’s where feature flags combined with tracepoints can point you in the right direction.
Ozcode tracepoints provide a way to compare the correctly functioning “current” version with the canary deployed new version with the bug. Here’s how to do it.
- Whenever you enable a feature flag, add it as an item of contextual data (e.g. FF (Reg CTA) = Enabled)
- Do a canary deployment with your feature flag enabled for a small subset of your user base (one that will manifest the error).
- Put a tracepoint at the suspicious location in the code, near where you think the error is.
- Run the application to ensure you get the corresponding tracepoint hits.
- Now filter your tracepoint hits down to one agent running the current code and another agent running your canary deployment.
- Duplicate that browser window and view it side-by-side.
- On the left, select a tracepoint hit on the canary deployed version. You can verify the feature flag by displaying it as a column in the Tracepoint Hits panel.
- On the right, select the corresponding tracepoint hit on the current version.
- Now you can step back through the code from the tracepoint hit in both versions to see where the flow breaks. This is your classic “delta debugging” technique comparing a “good” example of code execution to a “bad” one where the bug occurs, and playing a game of “spot the difference.”
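The side-by-side comparison in the last step can be sketched programmatically: walk two streams of tracepoint hits in lockstep and report the first snapshot where the captured locals diverge. The hit data below is invented for illustration; the real workflow happens in the Ozcode UI, not in code you write.

```python
# Two tracepoint-hit streams: snapshots of locals at the same code location,
# one from an agent on the current version, one from the canary.
# All field names and values here are hypothetical.
current_hits = [
    {"step": "validate", "cart_total": 42.0, "discount": 0.0},
    {"step": "charge",   "cart_total": 42.0, "discount": 0.0},
]
canary_hits = [
    {"step": "validate", "cart_total": 42.0, "discount": 0.0},
    {"step": "charge",   "cart_total": 0.0,  "discount": 42.0},  # flow diverges
]

def first_divergence(a_hits, b_hits):
    """Return (index, {field: (a_value, b_value)}) at the first differing hit."""
    for i, (a, b) in enumerate(zip(a_hits, b_hits)):
        diff = {key: (a.get(key), b.get(key))
                for key in a.keys() | b.keys()
                if a.get(key) != b.get(key)}
        if diff:
            return i, diff
    return None  # the two flows never diverged
```

This is the “spot the difference” game made explicit: the first hit where a “good” run and a “bad” run disagree is where to start stepping back through the code.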
Here’s what getting started looks like:
Debugging in production to mitigate the risks of testing in production
Testing in production with canary deployments is all about mitigating the risks of deploying new code to your user base, and it works on two levels. First, the new code is only released to a subset of your users; second, if the code causes an error, you can disable it immediately by switching off the corresponding feature flag. The problem is that once you disable the feature flag, you can’t debug the issue because it no longer occurs. Still, you have to resolve the issue as quickly as possible. You didn’t invest valuable resources in building a shiny new feature only to hide it behind a disabled feature flag. So, before you switch off that feature flag, try Ozcode’s time-travel debugging on your canary deployment. If you’re cutting 80% off your debugging time, you may not have to disable that feature flag after all.