DevOps has become mainstream. No doubt about it. Everyone’s doing DevOps these days. Or so they say. And yet, in many companies, DevOps has meant nothing more than creating a position called something like “DevOps Engineer” or “Head of Delivery,” and assigning it to someone in the Ops team. If you ask developers in these companies what DevOps is, they will probably start talking about IT, infrastructure as code, configuring build and release automation tools, maintaining and monitoring production deployments, etc. The different teams continue to work in silos, and all that has happened is that the Ops/Development chasm that existed earlier has evolved into a DevOps/Development chasm.
That’s not DevOps.
DevOps is not just a role, position, or specific function. It’s an organizational culture, a set of best practices and tools, a complete mindset that companies need to adopt in order to deliver better software faster. One of the tenets of a successful DevOps adoption is having close collaboration between Operations and Development. This post will show how building and maintaining a DevOps/Development bridge to cross that chasm can help teams perform better and improve several classic DevOps KPIs.
DevOps is broken - the DevOps/Development chasm
Different studies repeatedly show that companies that successfully adopt DevOps consistently deliver better software faster. What that means is that they’re performing better on a set of DevOps KPIs. For example, the diagram below, taken from the Accelerate State of DevOps 2019 Report, shows that companies evaluated as “Elite” adopters of DevOps deploy on demand, get a change into Production in less than a day, and keep their change failure rate at 15% or lower. And if service is disrupted, they can restore it within an hour. On the other end of the spectrum, “Low” level adopters take between one and six months to get a change into Production. They deploy new versions once a month at best, about half of the deployments fail, and it can take up to a month to restore service.
Source: Accelerate State of DevOps 2019 Report, DORA (Google Cloud)
One of the reasons that companies on the lower end of the spectrum don’t perform as well is the lack of DevOps/Development communication. The different teams still work in silos. Developers push changes to source control. Automated systems then run tests, build versions, and promote those versions up the pipeline through QA and Staging until they finally reach DevOps for deployment to Production. But by that time, the developers who pushed the changes for that build are already several versions ahead or working on a different project altogether.
Similarly, when something goes wrong in Production, DevOps uses observability tools to gather data snapshots of logs, metrics, and traces (the pillars of observability) and throws them over the proverbial wall back at developers. But developers don’t operate at the level of logs, metrics, and traces. They may not know how to leverage that kind of data, they may not have access to the DevOps engineers who produced it, and they certainly don’t have access to Production systems. They can only work with what has been dumped on them. Developers operate at the code level. To do an effective root cause analysis of Production errors, they need code-level observability into those Production systems, which they do not get. In this sense, DevOps is broken in companies with low adoption because the continuous cycle of feedback and communication between DevOps and Development is broken.
Bridging the chasm with the four pillars of production debugging
Let’s examine how the four pillars of production debugging can bridge the DevOps/development chasm and fix the broken DevOps cycle.
Autonomous exception capture means errors are detected as soon as they occur and are captured together with all the relevant data. Developers don’t have to spend any time or energy trying to reproduce an error that happened in Production on their development machines to understand what went wrong. They don’t need anything from the DevOps engineers.
Code-level observability provides developers with the Production data that they need. Regardless of the scale of the Production system or the complexity of the scenario that caused the error, the developer has the exact error execution flow along with all relevant line-by-line data to examine. There’s no need for DevOps to collect anything and share it with developers.
Time-travel debugging lets developers replay the error. With code-level data and the complete error execution flow, developers can step through the Production code line-by-line to evaluate the root cause of the error, just like they’re used to doing on their development builds. Again, the feedback from Production is inherent in the error capture. Developers know exactly where the exception was thrown and have all the data they need built into the system across the complete call stack.
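To make the idea of capturing an error "together with all the relevant data" more concrete, here is a minimal Python sketch that walks an exception's traceback and records the local variables of every frame in the call stack. It is an illustration only and does not describe how any particular Production Debugger works; the names `FrameSnapshot`, `capture_exception`, `apply_discount`, `checkout`, and `main` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FrameSnapshot:
    """Hypothetical record of one call-stack frame at the time of the error."""
    function: str
    line: int
    local_vars: dict[str, str]


def capture_exception(exc: BaseException) -> list[FrameSnapshot]:
    """Walk the traceback from the catching frame down to the raising frame."""
    snapshots = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        snapshots.append(FrameSnapshot(
            function=frame.f_code.co_name,
            line=tb.tb_lineno,
            # Store repr() strings so the snapshot is safe to log or serialize.
            local_vars={name: repr(value) for name, value in frame.f_locals.items()},
        ))
        tb = tb.tb_next
    return snapshots


def apply_discount(price, percent):
    factor = 100 / percent  # raises ZeroDivisionError when percent is 0
    return price / factor


def checkout(cart_total):
    discount_percent = 0
    return apply_discount(cart_total, discount_percent)


def main():
    try:
        checkout(250)
    except ZeroDivisionError as exc:
        return capture_exception(exc)


if __name__ == "__main__":
    for snapshot in main():
        print(snapshot.function, snapshot.line, snapshot.local_vars)
```

With a record like this, a developer can see the values each function was working with when the exception was thrown, without having to reproduce the failure locally.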
Boosting DevOps KPIs
Change Failure Rate
This metric is the number of deployments that fail divided by the total number of deployments. As you increase your deployment frequency (the “faster” part of “better software faster”), you may find your change failure rate increasing. If that happens, you should consider scaling down your deployment frequency and look more carefully at the issues in your code, especially if they recur from one deployment to the next. The term “Production Debugger” is a little bit misleading because you can install one in your pre-Production systems too. Running a Production Debugger in your Staging and even in your QA environment can shift the detection of flaws in your deployments to the left. The closer your Staging environment is to your Production environment, the more effective it will be in surfacing errors before they reach Production. As a result, more deployments will succeed, and your change failure rate should decrease.
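The calculation described above is simple enough to sketch in a few lines of Python. The function name and the sample numbers are made up for illustration.

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Failed deployments divided by total deployments, as a fraction from 0 to 1."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return failed_deployments / total_deployments


# Hypothetical month: 40 deployments, 6 of which failed.
rate = change_failure_rate(failed_deployments=6, total_deployments=40)
print(f"Change failure rate: {rate:.0%}")  # prints "Change failure rate: 15%"
```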
Mean Time to Detection (MTTD)
This KPI measures how long it takes you to detect an issue once it has occurred. It’s easy to see how a Production Debugger that uses autonomous exception capture can reduce MTTD for Production software failures. As soon as an exception is thrown, it appears on your dashboard and is immediately detected.
Mean Time Between Failures (MTBF)
This metric is an indicator of the quality of your Production software. The fewer bugs in Production, the longer the time between failures. Just as detecting bugs in your pre-Production systems can improve your change failure rate, it can also improve your MTBF. When you shift your handling of software issues to the left, fewer bugs reach Production, and MTBF should increase.
Mean Time to Recovery (MTTR)
MTTR reflects the average time it takes you to recover from a software failure and resolve the issue. Now, it’s well known that the later in the CI/CD pipeline a bug is detected, the more difficult it is to fix. In other words, it’s much easier to fix a bug in your development environment than one in Production. But what if you could approach Production issues in the same way you approach bugs in your development IDE? That’s exactly the edge that a Production Debugger gives you. Time-travel debugging with code-level observability gives a development-like experience and can cut debugging time by up to 80%, which in turn brings this critical KPI down. When Production downtime can cost thousands of dollars a minute, every improvement in MTTR directly impacts the company’s bottom line.
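The three time-based KPIs above can be computed from the same incident records. The Python sketch below assumes each incident has a start, detection, and resolution timestamp. The `Incident` type, the sample data, and the exact interval boundaries are assumptions: teams define these metrics in slightly different ways (for example, measuring MTTR from detection rather than from the start of the failure).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the failure began
    detected: datetime  # when someone (or something) noticed it
    resolved: datetime  # when service was restored


def mean(durations: list[timedelta]) -> timedelta:
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time from the start of a failure to its detection."""
    return mean([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from the start of a failure to its resolution (assumed definition)."""
    return mean([i.resolved - i.started for i in incidents])


def mtbf(incidents: list[Incident]) -> timedelta:
    """Mean failure-free time between one resolution and the next failure."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


# Made-up incident log for illustration.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 11, 0)),
    Incident(datetime(2024, 1, 3, 11, 0), datetime(2024, 1, 3, 11, 20), datetime(2024, 1, 3, 12, 30)),
    Incident(datetime(2024, 1, 6, 12, 30), datetime(2024, 1, 6, 12, 45), datetime(2024, 1, 6, 13, 0)),
]

print("MTTD:", mttd(incidents))  # 0:15:00
print("MTTR:", mttr(incidents))  # 1:00:00
print("MTBF:", mtbf(incidents))  # 2 days, 12:00:00
```

Tracking all three from one data source makes it easier to see whether faster detection is actually translating into faster recovery.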
Defect Escape Rate
This is another metric that can be managed by shifting your production debugging to the left. Defect escape rate is the ratio of bugs found in Production to the total number of bugs found in both Production and pre-Production systems. If your defect escape rate is too high, you might just be deploying too frequently and not putting enough emphasis on code quality in QA and Staging. Testing more in QA and using a Production Debugger on your Staging environment can reduce your defect escape rate.
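As a quick worked example of that ratio, here is a small Python sketch. The function name and the bug counts are hypothetical.

```python
def defect_escape_rate(production_bugs: int, pre_production_bugs: int) -> float:
    """Bugs found in Production divided by all bugs found, as a fraction from 0 to 1."""
    if production_bugs < 0 or pre_production_bugs < 0:
        raise ValueError("bug counts must not be negative")
    total = production_bugs + pre_production_bugs
    if total == 0:
        return 0.0  # no bugs found anywhere, so nothing escaped
    return production_bugs / total


# Hypothetical release: 45 bugs caught in QA and Staging, 5 found in Production.
rate = defect_escape_rate(production_bugs=5, pre_production_bugs=45)
print(f"Defect escape rate: {rate:.0%}")  # prints "Defect escape rate: 10%"
```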
Customer Tickets
At the end of the day, it’s all about keeping your customers happy. Happy customers translate into more revenue from sales and fewer expenses on addressing issues. If you use a Production Debugger to prevent issues from getting to Production and fix them as quickly as possible if they do, then you should be seeing fewer customer tickets. While you may want to qualify which types of customer tickets to include for this metric, it can be a good overall assessment of how successfully you have adopted DevOps.
The role of a Production Debugger in bridging the chasm
There are two factors to the DevOps/Development chasm that is breaking DevOps. One is the lack of code-level observability that developers need in order to explain the logs, metrics, and traces that DevOps collects when things go wrong in Production. The other is the DevOps/Development disconnect when these two groups work in silos. It basically amounts to a poor level of feedback from DevOps back to development, both at a code level and at a human communication level. The role of a Production Debugger is to bridge the chasm on both factors by shifting left to detect and handle issues in pre-Production environments and providing the observability and means for effective communication on issues that do crop up in Production (and we know that they will). At the end of the day, bridging that chasm enables you to release faster while baking quality into your code, and you’ll see the effects as those classic DevOps KPIs improve over time.