Twelve years after Patrick Debois coined the term “DevOps,” it’s clear that DevOps is here to stay. Not all DevOps adoptions are equal, though. The 2019 Accelerate State of DevOps Report showed that companies at the most successful end of the spectrum could deploy changes on demand within an hour with a failure rate of less than 15%. When DevOps adoption falters, however, changes can take months to deploy, and about half of them fail. To put yourself on the right end of that spectrum, you must track your performance against industry-standard DevOps KPIs. When you think about improving your DevOps adoption, your mind probably jumps to testing, automation, collaboration, and the other pillars of DevOps. I’m here to tell you that adopting a live debugger is becoming a trend among DevOps engineers, who find that this tool has a real, positive impact on DevOps KPIs.
What is live debugging?
Live systems have errors, and those errors are often severe enough to cause an outage. In some cases, the errors are at the system level, and the observability platforms available on the market provide enough information to determine their root cause and fix them. In many cases, however, the errors are at the code level.
Bugs in production
To fix these bugs, you need the four pillars of live debugging so you can closely examine your production code and data along the error execution flow. And assuming a bug has not crashed your application (which doesn’t make it any less severe), you need to debug it without interrupting your customers’ experience in any way. Let’s see how live debugging in production and pre-production environments can do wonders for your DevOps KPIs.
The live debugging connection to DevOps KPIs
There are many KPIs you could be monitoring to assess your DevOps adoption, and using a live debugger can dramatically improve several of them.
Change Failure Rate
This KPI is one of four key metrics identified by Google’s DevOps Research and Assessment (DORA) team that indicate how a software team is performing. It measures the percentage of deployments that cause a failure in production and is an indication of the product’s stability.
In the 2019 Accelerate State of DevOps Report, DORA found that Elite DevOps teams had a seven times lower change failure rate than low-performing teams (i.e., deployments are only 1/7th as likely to fail).
Source: 2019 Accelerate State of DevOps Report
How does a live debugger help reduce change failure rate?
By shifting debugging left to pre-production. Here’s what you can do.
Install the Ozcode agent alongside your application on your staging environment. Your application will, most likely, throw exceptions. Any of those exceptions could cause a deployment in production to fail and increase your change failure rate. Ozcode will catch all of those exceptions autonomously and provide you with the full time-travel debugging information to fix those errors ON STAGING.
Fewer bugs on staging means fewer bugs in production and a lower change failure rate.
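To make the metric concrete, here’s a minimal sketch of the Change Failure Rate calculation (the function name and sample numbers are illustrative, not from any standard tooling):

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Percentage of deployments that caused a failure in production."""
    return 100 * failed_deployments / total_deployments

# e.g., 3 failed deployments out of 40 in a given period
print(change_failure_rate(3, 40))  # 7.5
```

An Elite team in DORA’s terms would keep this well under 15%.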
Defect Escape Rate
This term is quite self-explanatory: it measures the defects that “escape” your pre-production systems and make it into production. The calculation is quite simple:

Defect Escape Rate = (defects found in production ÷ total defects found) × 100
If your Defect Escape Rate is too high, you should re-evaluate your deployment frequency. You might find that you’re rushing through QA and staging to meet release deadlines. The consequence is more buggy code (ergo, a less stable application) in production, which can cause anything from a loss of reputation to a direct loss of revenue. As in the case of Change Failure Rate, using Ozcode on your staging environment, and even on QA, can reduce the number of bugs that escape and make it to production, hence lowering the Defect Escape Rate.
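As a quick sketch, the Defect Escape Rate is the share of defects found in production out of all defects found anywhere (the function name and sample numbers here are mine, for illustration):

```python
def defect_escape_rate(escaped_to_production: int, caught_pre_production: int) -> float:
    """Share of all known defects that escaped into production, as a percentage."""
    total_defects = escaped_to_production + caught_pre_production
    return 100 * escaped_to_production / total_defects

# e.g., 8 defects found in production vs. 72 caught in QA/staging
print(defect_escape_rate(8, 72))  # 10.0
```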
Mean Time to Detection
How long will a defect exist in production before you detect it? You want this number to be as low as possible because the earlier you detect a defect, the earlier you can fix it, so your customers will be less likely to experience it.
There are two primary factors affecting MTTD:
- When an incident occurs
- How long it takes you to detect it
You have no control over when an incident occurs. An incident is unexpected; otherwise, you would have already implemented a fix to prevent it. Once you detect an incident, you can review your log files or monitoring systems to timestamp when it first occurred. With that data, it’s a simple calculation of detection time minus start time. MTTD is an average over any time interval you choose. Consider this example of an organization detecting three incidents:
| Start time | Detection time | Elapsed time (min) |
|------------|----------------|--------------------|
| 4:26 pm    | 5:02 pm        | 36                 |
| 3:05 pm    | 8:51 pm        | 346                |
| 10:15 am   | 10:17 am       | 2                  |
For the time interval of this sample, MTTD = (36+346+2)/3 = 128 min.
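The same computation can be sketched in a few lines of Python, using the times from the sample above (this assumes each incident is detected on the same day it starts):

```python
from datetime import datetime

# (start, detection) pairs from the sample incidents
incidents = [
    ("4:26 pm", "5:02 pm"),
    ("3:05 pm", "8:51 pm"),
    ("10:15 am", "10:17 am"),
]

def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two same-day clock times."""
    fmt = "%I:%M %p"  # 12-hour clock with am/pm
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60

elapsed = [minutes_between(start, detection) for start, detection in incidents]
mttd = sum(elapsed) / len(elapsed)
print(elapsed, mttd)  # [36.0, 346.0, 2.0] 128.0
```

In practice you would pull full timestamps from your logs or monitoring system rather than bare clock times.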
Now, there are different ways you can assess your MTTD, for example, by removing outlier values or segmenting by incident severity, but that’s a topic for another post.
The time taken to detect an incident depends on whether it’s caused by a logical bug or a software error. You may only detect a logical bug once a user (or preferably, one of your own QA staff who is testing in production) reports an issue. Typically, MTTD for this kind of bug will be longer. On the other hand, a software error usually throws an exception, and in these cases, Ozcode Live Debugger can dramatically reduce MTTD.
As soon as your application throws an exception, Ozcode captures the error execution flow and displays the exception on the dashboard.
While you can get similar detection capabilities from modern APMs, with Ozcode’s Live Debugger, you can just click one of those exceptions to debug it directly. Essentially, MTTD for exceptions drops to zero, and from there, you’re in a race to reduce MTTR, which I discuss below.
Mean Time Between Failures (MTBF)
MTBF is another indicator of your software’s quality. It stands to reason that the more robust your software, the less likely it is to fail and the more available it will be to your customers. Here too, the calculation is quite simple:

MTBF = (total time − total downtime) ÷ number of failures
For example, if a system failed 4 times in 24 hours, and the total outage time was 2 hours, then for that 24-hour period:
MTBF = (24-2)/4 = 5.5 hours
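Sketched in Python (the function name and sample numbers are illustrative):

```python
def mtbf(total_hours: float, downtime_hours: float, failures: int) -> float:
    """Mean time between failures: total operating time divided by failure count."""
    return (total_hours - downtime_hours) / failures

# 4 failures and 2 hours of total outage over a 24-hour period
print(mtbf(24, 2, 4))  # 5.5
```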
MTBF goes hand-in-hand with Defect Escape Rate and Change Failure Rate in that improving those KPIs is likely to have a positive effect on MTBF. Using Ozcode to reduce the number of bugs in your pre-production environments will help deploy more robust releases and thus improve (i.e., increase) MTBF, but Ozcode can also have a direct effect on MTBF. By reducing your system downtime (i.e., the recovery time – see MTTR in the next section), Ozcode directly increases total operating time, and therefore, your MTBF.
If your system uptime is approaching your MTBF, start taking extra care with new deployments and closely monitor your operations.
Mean Time to Recovery (MTTR)
MTTR measures how quickly you get your service running again after it crashes. This is probably one of the best-known DevOps KPIs because it relates to managing an emergency. Business stakeholders are watching the clock, pagers are beeping in Operations, and developers are getting phone calls in the middle of the night. Basically, everyone who cares is burning the midnight oil to get your systems up and running again. Here’s the calculation:

MTTR = total downtime ÷ number of failures
Using our MTBF example again, MTTR for that 24-hour period is 2 hours/4 failures = 0.5 hours.
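And as a minimal sketch (function name and sample numbers are illustrative):

```python
def mttr(total_downtime_hours: float, failures: int) -> float:
    """Mean time to recovery: total downtime divided by failure count."""
    return total_downtime_hours / failures

# 2 hours of total outage across 4 failures in the 24-hour period
print(mttr(2, 4))  # 0.5
```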
MTTR is an indication of how quickly you can respond to an outage and fix it. The quicker you can debug an issue that crashed your system, the lower your MTTR will be, and this is where Ozcode can help. Ozcode can reduce the time to debug an issue by up to 80% because:
- There’s no need to reproduce the issue. Ozcode records the complete error execution flow directly on your production environment.
- The time-travel debug information that Ozcode captures provides the developer who has to debug the issue with all the production data they need.
Providing the developer with this kind of code-level observability into the production system’s error state and allowing them to step through the error execution flow is another form of “shift-left debugging,” and it dramatically reduces the time from failure to recovery.
Bridging Dev and Ops with live debugging
DevOps engineers have recognized the value of having a live debugger in their enterprise tool stack. While they are the first responders to production incidents, they understand the need to bridge the gap to the developers who must fix the bugs that cause those incidents. To understand exactly what went wrong, developers need access to production data so they can step through the execution flow of an error exactly as it happened, with full visibility into production data at the code level. With that data at their fingertips, developers can improve the quality of code in production without sacrificing release velocity, thereby improving DevOps KPIs.