Datadog is an industry-leading observability platform and brings a wide variety of observability data into one integrated view. From details captured in your processed logs, Datadog lets you switch to traces to see how the corresponding user request was executed. In case of an error in your software, Datadog displays the full stack trace and then lets you use faceted search to drill down into the corresponding traces and logs to determine the cause of the issue. Datadog continuously monitors your production environment and provides system-level alerts such as traffic spikes, elevated latency, or looming bottlenecks to help you troubleshoot issues and keep things running smoothly.
The system-level data Datadog provides goes a long way toward determining the root cause of errors, but in many cases, observability at the level of logs, metrics, and traces does not provide enough information to understand what really went wrong with your application. Think of it like a car. If you see the temperature gauge rising, you might guess that you need to top up the radiator fluid. But are you sure that’s really why your engine is overheating? Is there a leak in your cooling system, or is the engine overheating because you’re losing oil? To find out, you have to pop the hood with Ozcode.
Popping the hood on your production environment
When Datadog shows you something has happened with your software, Ozcode pops the hood and takes you on an observability journey from system level down to code level. Datadog can provide a great starting point, showing you anomalies in metrics and even the stack trace of exceptions. From there, you go to Ozcode.
To investigate anomalies surfaced by Datadog, Ozcode lets you add dynamic logging using tracepoints. You can add these log entries on the fly to your live running code without having to deploy a new build through your CI/CD pipeline. Using dynamic logs to reveal the values of local variables, method parameters, and return values anywhere in your code goes a long way toward exposing the root cause of an incident.
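To make the idea concrete, here is a minimal Python sketch of what a tracepoint's output is roughly equivalent to: a log line that a developer would otherwise have to write by hand and ship in a new build. It is an illustration only, not Ozcode's API. The names render_tracepoint and fill_order, the message template, and the variables are all hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")


def render_tracepoint(template: str, local_vars: dict) -> str:
    """Fill a message template with the values captured at the tracepoint.

    Hypothetical helper: it only illustrates how a dynamic log message
    could be built from the variables in scope.
    """
    return template.format(**local_vars)


def fill_order(order_id: str, quantity: int, in_stock: int) -> bool:
    """Hypothetical order-filling method used for illustration."""
    can_fill = quantity <= in_stock
    # A tracepoint set on this line would produce output similar to the
    # hand-written log call below, but without a code change or redeploy.
    logger.info(
        render_tracepoint(
            "fill_order: order_id={order_id} quantity={quantity} "
            "in_stock={in_stock} can_fill={can_fill}",
            locals(),
        )
    )
    return can_fill


if __name__ == "__main__":
    fill_order("A-1001", quantity=3, in_stock=2)
```

The point of the sketch is the contrast: the hand-written version requires editing the method and redeploying, whereas a tracepoint attaches the same kind of message to running code.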
Ozcode also pops the hood on exceptions. Ozcode autonomously captures any exception that your application throws along with full, time-travel debug information so you can step through the error execution flow with full visibility into your production data at every step of the way. This is what we call code-level observability.
When the impossible happens
Let’s see how this integration might work with an eCommerce nightmare.
Black Friday or some other purchasing frenzy is just around the corner. All systems are GO. Everything has been tested, retested, and reinforced.
And then, the impossible happens. Customers can’t complete checkout.
Everybody’s face-palming, and phones and pagers are going off everywhere in IT/Ops.
The first place your DevOps engineers go to is your observability platform. Datadog to the rescue.
A quick look at the service map shows which service is throwing errors.
Let’s drill down into the App Analytics screen for that service and investigate the errors.
Time to pop the hood
Ozcode steps in when you need to start working with code. Setting up Ozcode to work with Datadog is easy – just install the Ozcode extension from the Datadog marketplace and get the Ozcode agent installed on your servers. Once you’re set up, Ozcode will show you all the exceptions you saw in Datadog, and now you can time-travel debug them with full visibility into the error execution flow on your live production environment.
But that’s not always enough. We also saw that even in cases where the request returned a 200 OK, customers can’t seem to check out. Let’s dig a little deeper.
Observability hits code level
Going back to your Datadog dashboard, you discover that some critical requests are showing unusually high latency.
Let’s set a tracepoint (a.k.a. dynamic log) in the method that tries to fill orders.
The Log View correlates the Ozcode dynamic log entry to the trace of the request that generated it. Analyzing this visual representation of the internal workings of our application shows us exactly where the application is spending time and why checkout is taking too long.
Having discovered the problematic variable, you may now want to monitor it for a while to make sure a fix you implement is working correctly.
Let’s go back up the observability path to Datadog.
Since Ozcode pipes dynamic log output back to Datadog, you can use Live Tail and watch how your variables change in real time. In fact, you can use all of the platform’s analytics capabilities for your new live log entries.
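As a rough illustration of what such a log entry might look like when it reaches Datadog, the Python sketch below builds a structured JSON log line. It assumes the entry carries the dd.trace_id and dd.span_id attributes that Datadog uses to connect logs to traces. The function name, the "vars" field, and the sample values are hypothetical and do not describe the actual format Ozcode emits.

```python
import json
from datetime import datetime, timezone


def dynamic_log_entry(message: str, variables: dict,
                      trace_id: str, span_id: str) -> str:
    """Build a JSON log line for a dynamic log entry (illustrative only)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "status": "info",
        "message": message,
        # Correlation attributes: assumed here so the entry can be tied
        # to the trace of the request that generated it.
        "dd.trace_id": trace_id,
        "dd.span_id": span_id,
        # Hypothetical field holding the captured variable values, which
        # could later be charted or filtered on.
        "vars": variables,
    }
    return json.dumps(entry)


if __name__ == "__main__":
    print(dynamic_log_entry(
        "fill_order tracepoint hit",
        {"quantity": 3, "in_stock": 2},
        trace_id="1234567890",
        span_id="987654321",
    ))
```

Keeping the captured values as structured fields rather than embedding them in the message text is what makes them easy to filter, facet, and chart once the entry is indexed.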
Using dynamic logging to pipe variables back into Datadog opens up a world of opportunity. You can watch how anything changes in real time on a new chart you define for your dashboard. To return to the car analogy, you’ve added gauges that measure your radiator fluid and oil level in real time with no effort.
From system to code and back
Observability is critical to keep systems running smoothly and fix them when they don’t. Our journey into observability started at the system level when Datadog’s service map showed that one of our services was throwing errors. A look at the App Analytics screen revealed what the error was and even gave us the stack trace. To understand the root cause of the error, we first used Ozcode to time-travel debug an error and then drilled down by adding tracepoints on the fly. These tracepoints generated dynamic logs, which we fed back into Datadog, where we even created ad-hoc metrics and visuals to monitor suspicious variables. As soon as a variable went off the scale somewhere, we could examine the live application state that caused it in great detail, which took us directly to the root cause of the error.
When you’re thinking about observability, you need to think about the full round trip: from system, down to the code, and back.