Making Sure You’re You
https://oz-code.com/blog/production-debugging/making-sure-youre-you
Sun, 12 Sep 2021

Ozcode Live Debugger now offers best-of-breed authentication with MFA, SSO, and SAML.


We’ve been hacked!

Those are probably the three words that any CEO, CTO, CSO, CISO, or “VP whatever” dreads more than any other. But it’s bigger than that. While those with Cs and Vs in their titles will be the ones answering tough questions very soon, those three words will often mobilize the whole organization. Nobody is going to get much sleep until the breach is contained.

With the average cost of a data breach reaching $4.24 million, it’s no surprise that global cybersecurity spending is skyrocketing and is forecast to reach $345 billion by 2026. Still, for all the safeguards that well-meaning companies put in place, in the end, data breaches are a “people problem,” with 95% of cybersecurity breaches caused by human error.

Passwords suck!

By far, the most common “secret key” needed to access your account on any web service is still your password. And in most cases, it’s the only key needed. But passwords are insecure for many reasons. In fact, most passwords can be hacked within 13 seconds, with “123456” being the most popular password found in data breaches in 2020. And even if you’re vigilant and always use a strong password (which is easy enough with a password manager like LastPass, 1Password, etc.), there are plenty of ways malicious hackers can steal your credentials, whether through social engineering such as phishing attacks or by using malware like password dumpers. It’s exactly for this reason that we have upped our security posture here at Ozcode and vastly upgraded authentication on Ozcode Live Debugger.

Beyond passwords

Ozcode Live Debugger now offers best-of-breed authentication, providing different ways to authenticate users in your organization. We have upgraded all our servers to provide enterprise-grade security for your valuable data, and if you haven’t noticed it already, you’ll see the new login screen next time you sign in to your Ozcode account.

Ozcode Login

Let’s learn about the different ways you can now be authenticated and access your Ozcode Live Debugger account.

MFA - Multi-Factor Authentication

Authenticating with passwords is based on a secret you’re supposed to keep to yourself, or “what you know.” Since, as we’ve seen, we’re not very good at keeping secrets, modern systems ask for an additional means of authentication based on something you physically possess. Some of us have used hardware keys, such as YubiKeys, to log in to secure systems, but these are only viable for enterprises, not the general public. But everyone has a phone today.

SMS is the most common form of MFA in use today. Most of us have already encountered OTPs, one-time passwords texted to us when trying to access our credit card statements online, or some other sensitive site from a new device. But the truth is, SMS is not secure. Messages are not encrypted, may travel through different networks, and the security of the underlying infrastructure is questionable. A more secure form of MFA is through an authenticator application such as Google or Microsoft Authenticator, which is what Ozcode offers today.
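
For the curious, here is roughly what an authenticator app computes under the hood. This is a minimal sketch of the TOTP algorithm (RFC 6238) in C#; the shared secret and its encoding are made up for illustration and are not how Ozcode provisions keys.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class TotpSketch
{
    // Compute a 6-digit TOTP code for the current 30-second time window (RFC 6238).
    static string ComputeTotp(byte[] secret, DateTimeOffset now)
    {
        long timeStep = now.ToUnixTimeSeconds() / 30;             // 30-second window counter
        byte[] counter = BitConverter.GetBytes(timeStep);
        if (BitConverter.IsLittleEndian) Array.Reverse(counter);  // RFC mandates big-endian

        using var hmac = new HMACSHA1(secret);
        byte[] hash = hmac.ComputeHash(counter);

        int offset = hash[^1] & 0x0F;                             // dynamic truncation
        int binary = ((hash[offset] & 0x7F) << 24)
                   | (hash[offset + 1] << 16)
                   | (hash[offset + 2] << 8)
                   | hash[offset + 3];
        return (binary % 1_000_000).ToString("D6");               // left-pad to 6 digits
    }

    static void Main()
    {
        // Hypothetical shared secret; real apps receive it via a QR code at enrollment.
        byte[] secret = Encoding.ASCII.GetBytes("hypothetical-shared-key");
        Console.WriteLine(ComputeTotp(secret, DateTimeOffset.UtcNow));
    }
}
```

Both the server and the app compute this code independently from the shared secret and the current time, so possession of the enrolled phone becomes the second factor.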

MFA is optional, but any user can (and should) enable it for their account. As an administrator, you can enforce MFA, and I highly recommend you do so to make sure nobody gets unauthorized access to your source code and data.

Ozcode Live Debugger - MFA Policy

SSO - once you’re in, you’re in

While MFA does provide a high level of security, it still requires users to have a password for Ozcode. In an enterprise setting, users may have to log in to many different applications, and ensuring every employee safely manages all those passwords with a password manager becomes impractical. That’s why many enterprises enforce SSO – Single Sign-On. Today, Ozcode supports SSO using any authentication provider, including Azure Active Directory, Google, and others. Once an Ozcode administrator connects the organization’s Live Debugger account to the authentication provider, all users are automatically logged in once authenticated with any other application connected to the SSO provider.

Ozcode Live Debugger - SSO

Ozcode also supports SSO for on-premises instances of Active Directory. Just select SAML as your IdP when configuring SSO.

Additional security measures

While MFA and SSO are the most significant updates in this release, a well-rounded security posture would not be complete without the following measures that Ozcode also supports now:

  1. To prevent brute-force attacks, you can configure the maximum number of incorrect password entries before users get locked out of their accounts (see the sketch after this list).
  2. To prevent users from repeating previous passwords, you can specify how many passwords back to keep track of.
  3. Audit logs maintain a record of all user activity connected with security and authentication.
    Ozcode Live Debugger - Audit Logs
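
To make the first two policies concrete, here is a minimal sketch of how a lockout counter and a password-history check might work. All names are hypothetical; this is not Ozcode’s implementation, just the logic the settings above control.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the two policies above: account lockout and password history.
class PasswordPolicy
{
    private readonly int _maxFailures;   // lockout threshold
    private readonly int _historyDepth;  // how many old password hashes to remember
    private readonly Dictionary<string, int> _failures = new();
    private readonly Dictionary<string, List<string>> _history = new();

    public PasswordPolicy(int maxFailures, int historyDepth) =>
        (_maxFailures, _historyDepth) = (maxFailures, historyDepth);

    // Returns true if this failure should lock the account.
    public bool RegisterFailedLogin(string user)
    {
        _failures[user] = _failures.GetValueOrDefault(user) + 1;
        return _failures[user] >= _maxFailures;
    }

    public void RegisterSuccessfulLogin(string user) => _failures.Remove(user);

    // Returns false if the (hashed) password was used within the last N passwords.
    public bool IsNewPasswordAllowed(string user, string passwordHash)
    {
        var past = _history.GetValueOrDefault(user) ?? new List<string>();
        if (past.Contains(passwordHash)) return false;
        past.Add(passwordHash);
        if (past.Count > _historyDepth) past.RemoveAt(0);   // keep only the last N
        _history[user] = past;
        return true;
    }
}
```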

Learn more about Ozcode Live Debugger security in our white paper.

The Common Thread Between Live Debugging and Testing in Production
https://oz-code.com/blog/production-debugging/the-common-thread-between-live-debugging-and-testing-in-production
Sun, 22 Aug 2021

Learn how to use feature flags and canary deployments with autonomous exception capture and tracepoints to debug your live applications in production.


Remember the days when a software release was a media event? Announcements went out to the press, plastic discs were shipped to customers, and the company celebrated the end of a half-year journey to get the release out the door. One of the reasons releases took so long is that they needed massive amounts of QA and testing because once those CDs were burned, there was no going back. And yet, errors were still found in production, and patches had to be released through the same cumbersome process of burning discs and shipping them worldwide.

Fast-forward to 2021, and things have changed. SaaS rules with over 80% of workloads executing in the cloud. DevOps has been widely adopted, and companies are releasing software updates at a frenzied pace.

DevOps is ubiquitous in software development and has achieved such widespread adoption that it’s easy to forget this wasn’t always the case.

Opening sentence of the Puppet 2021 State of DevOps Report

But for all the safety measures and quality gates that a build has to pass before being deployed to production, bugs still get through. And when a serious bug is discovered in production, it’s like a fire in the building. Everyone goes into emergency mode. One way to minimize the number of flare-ups is testing in production.

Testing releases with exception tracking, feature flags, and canary deployments

Companies that have accepted that production errors are inevitable are embracing testing in production to reduce the number of “fires” that break out in their production environments. In her series of articles on testing in production, Cindy Sridharan mentions canary deployments, feature flags, and exception tracking as three techniques used in the “release” phase of production.

Phases of Production - Ozcode

Testing in production with exception tracking

Exceptions that turn up in a new release reflect the differences between production and staging. You simply cannot faithfully reproduce the structure, scale, and complexity of your production workloads in pre-production. It’s precisely those traffic spikes, unique user scenarios, and weird code paths that generate exceptions that could not have been foreseen (otherwise, you would have fixed them before releasing).

Exception tracking solutions have been around for a while, from error monitoring tools to APMs to full-blown observability platforms. These tools typically catch exceptions and show you the stack trace and locals where the exception was thrown. While this information helps in investigating an issue, it’s not usually enough. Developers typically find themselves digging through log files to try and understand what went wrong, but even logs don’t usually provide enough information. Developers usually have to go through several “log-only” CI/CD cycles to add the data they need to understand and fix the issue. But now, consider how testing in production and error resolution might look if your exception tracking tool could laser-guide you to the solution.

Ozcode’s exception tracking goes much deeper, using autonomous exception capture to record every exception your application throws.

Docker Dashboard - Ozcode

When your application throws an exception, the Ozcode agent uses dynamic instrumentation to record a full time-travel recording of the complete error execution flow. That means that not only do you get the exception, stack trace, and locals that those other tools provide, but you also get method parameters and return values, relevant log entries, database queries, and HTTP request/response payloads. This methodology is much better than reproducing the error in a different environment (if you manage to do that in the first place) because it replays the error as it unfolded, providing you with real production data line by line every step of the way. There’s no better source of information a developer could want.

Debugging in production with feature flags and canary deployments

Feature flags are a great tool to implement canary deployments. They allow you to enable code for a distinct and defined subset of your users, providing a way to fine-tune how you can deploy new features to more and more customers until they’re widely available to the complete user base.
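
As a rough illustration, here is what gating a canary feature behind a flag can look like in C#. The flag name and the percentage-rollout strategy are hypothetical; they stand in for whatever feature-flag system you use.

```csharp
using System;

public interface IFeatureFlags
{
    // True if the flag is enabled for this user's cohort.
    bool IsEnabled(string flag, string userId);
}

// Toy rollout strategy: enable the flag for a fixed percentage of users.
public class PercentageRollout : IFeatureFlags
{
    private readonly int _percent;
    public PercentageRollout(int percent) => _percent = percent;

    public bool IsEnabled(string flag, string userId)
    {
        // Deterministic toy hash so a given user always lands in the same cohort.
        int h = 0;
        foreach (char c in flag + userId) h = (h * 31 + c) & 0x7FFFFFFF;
        return h % 100 < _percent;
    }
}

public class CheckoutService
{
    private readonly IFeatureFlags _flags;
    public CheckoutService(IFeatureFlags flags) => _flags = flags;

    public void Checkout(string userId)
    {
        if (_flags.IsEnabled("fast-checkout", userId))   // canary cohort only
            Console.WriteLine("new fast checkout path");
        else
            Console.WriteLine("legacy checkout path");
    }
}
```

Starting the rollout with, say, `new PercentageRollout(5)` exposes the new path to roughly 5% of users; widening the rollout is just a configuration change.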

But in addition to managing feature rollout, feature flags are also data, data that can help developers resolve issues that suddenly turn up when a flag is enabled. With Ozcode, you can see the state of all your feature flags for each exception that is thrown.

Filtering exceptions by feature flag - Ozcode

In the above example, three feature flags are displayed as contextual data in the Exception Distribution panel. While the “Fast Checkout” and “New Sign In” feature flags aren’t always enabled when the exception occurs, the “Reg CTA” feature flag is, 100% of the time. Whenever that feature flag is enabled, we get this exception, raising suspicion of a bug in the code the flag exposes. You can now select that exception and debug it to understand what went wrong.

Debug exception filtered by feature flag - Ozcode

When you expand the rollout to the next stage, and a new exception suddenly turns up, you’re faced with a dilemma. Do you leave the feature flag turned on so you can collect more data in the logs and debug the issue? Or do you switch it off, so your users don’t get any degradation of service? With Ozcode, you don’t have to make that decision because you have the full time-travel recording of the exception. You can disable the offending feature flag and still debug the error.

But what if the feature you’re rolling out causes an error without throwing an exception? Something’s wrong with the flow. That’s where feature flags with tracepoints can point you in the right direction.

Ozcode tracepoints provide a way to compare the correctly functioning “current” version with the canary deployed new version with the bug. Here’s how to do it.

  1. Whenever you enable a feature flag, add it as an item of contextual data (e.g., FF (Reg CTA) = Enabled).
  2. Do a canary deployment with your feature flag enabled for a small subset of your user base (one that will manifest the error).
  3. Put a tracepoint at the suspicious location in the code – near where you think the error is.
  4. Run the application to ensure you get the corresponding tracepoint hits.
  5. Now you can filter all your tracepoint hits to include one agent that is running the current code and another agent that is running your canary deployment.
  6. Duplicate that browser window and view the two windows side-by-side.
  7. On the left, select a tracepoint hit on the canary-deployed version. You can verify the feature flag by displaying it as a column in the Tracepoint Hits panel. On the right, select the corresponding tracepoint hit on the current version.
  8. Now you can step back through the code from the tracepoint hit in both versions to see where the flow breaks. This is your classic “delta debugging” technique: comparing a “good” example of code execution to a “bad” one where the bug occurs, and playing a game of “spot the difference.”

Here’s what it looks like getting started:

Debugging in production to mitigate the risks of testing in production

Testing in production with canary deployments is all about mitigating the risks of deploying new code to your user base, and it works on two levels. First, the new code is only released to a subset of your users; second, if the code causes an error, you can disable it immediately by switching off the corresponding feature flag. The problem with that is if you disable the feature flag, you can’t debug the issue as it no longer occurs. Still, you have to resolve the issue as quickly as possible. You didn’t invest valuable resources to build a shiny new feature, only to hide it behind a disabled feature flag. So, before you switch off that feature flag, try Ozcode’s time-travel debugging on your canary deployment. If you’re cutting 80% off your debugging time, you may not have to disable that feature flag after all.


Fine-Tuning Live Debugging with Conditional and Time-travel Tracepoints
https://oz-code.com/blog/production-debugging/fine-tuning-live-debugging-with-conditional-and-time-travel-tracepoints
Sun, 08 Aug 2021

Dynamic logging with tracepoints ushered in a new era of live debugging. Conditional and time-travel tracepoints are the next generation and take live debugging to new heights.


Developers have this love/hate relationship with logging. Logs are one of the pillars of observability and the first line of defense against production errors. But using them for debugging in production is very cumbersome. Dynamic logging is much more effective. We introduced Tracepoints into Ozcode Live Debugger several months ago, so you can use dynamic logging to debug elusive bugs and production issues – the ones that don’t throw an exception but make your application misbehave.

The imperfections of dynamic logging

Dynamic logs are a giant leap towards effective incident resolution compared to static logs. First and foremost, you can update logs on the fly without going through a complete CI/CD process. Storage is not a problem since dynamic logs can be switched off as soon as the issue at hand is resolved. Dynamic logs also never go stale. They are easily removed once they’re not needed, keeping your source code clean and focused on the actual business logic at hand.

However, you still need to address the haystack and the unknowns.

The haystack

In some cases, the error you are investigating only happens under a very particular set of conditions. If your dynamic log entry fires for every tracepoint hit, you may find yourself digging through many logs before you identify the relevant one. While this may feel familiar to those who are used to digging through mountains of static logs, it’s exactly what we’re trying to avoid.

The unknowns

While a snapshot of application data at a tracepoint is helpful, it’s often not enough. The values of variables at the line of code where you placed the tracepoint provide some insights, but you still don’t know how those variables got there. How the code execution flow affected the application state line-by-line until you got to the tracepoint is still unknown. You have to mentally step back through the code to try and figure out which conditional execution paths were traversed and the value of each variable at every step.


Conditional tracepoints to the rescue

Welcome to the next generation of tracepoints in Ozcode Live Debugger. Conditional tracepoints give you a lot more control over when to capture a tracepoint and output a log. When setting a tracepoint, you can define a set of conditions based on any variable in scope to determine when that tracepoint should actually fire a log entry to the output stream.

As the video above shows, autocomplete helps you select the variables to output, and you can build complex conditional expressions using the AND or OR operators.
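
To make the idea concrete, a conditional tracepoint behaves like a guarded log statement injected into the running process, without a redeploy. Here is an equivalent written out as ordinary C#; the condition and the property names are hypothetical.

```csharp
using Microsoft.Extensions.Logging;

public record Order(string Id, decimal Total, string Region);

public class OrderProcessor
{
    private readonly ILogger<OrderProcessor> _logger;
    public OrderProcessor(ILogger<OrderProcessor> logger) => _logger = logger;

    public void Process(Order order)
    {
        // ... business logic ...

        // A conditional tracepoint placed on this line is logically equivalent to:
        if (order.Total > 1000 && order.Region == "EU")   // the tracepoint condition
        {
            _logger.LogInformation(
                "Tracepoint hit: order {OrderId}, total {Total}",  // fires only when the condition holds
                order.Id, order.Total);
        }
    }
}
```

The difference is that with a tracepoint, the guard and the log entry are added and removed at runtime; nothing like this ever has to be committed to your source code.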

Time-travel data per tracepoint

As for understanding how a variable “develops” along the code execution flow, we’ve taken care of that too. You can now record time-travel debug information along with the stack trace for any tracepoint. It’s all about understanding “how we got here.” When examining a tracepoint hit, color-coded conditional statements clearly show how the code executed, and annotations display the value of each variable at every step of the way, from the beginning of the method down to the tracepoint. This step-by-step data provides deep insights into the chain of causality of the error in question.

Let’s see how this can be helpful.

Debugging logic errors in microservices

Microservices are great. As small, distinct pieces of code, they’re relatively easy to develop. But once they’re deployed into your distributed architecture, things can get tricky. Logical bugs that only appear under special circumstances of your complex production environment can be tough to pin down. This is where conditional tracepoints with time-travel data can really help. Here are a few tactics you can use.

  • Use the stack trace to understand the code execution flow when an error occurs.
  • Add tracepoints at each level in the stack trace so you can monitor relevant application data to see when something goes wrong. The time travel data within the scope of each tracepoint will be very helpful in showing how data changes with the error execution flow.
  • Identify the conditions under which an error occurs and use conditional tracepoints to provide data – but only when the errors occur. No need to add straw to the haystack! Note that conditions can be based on contextual data such as customer name, machine name, etc., as well as the values of local variables.
  • Use the Agents filter to ignore any tracepoint hits from services that aren’t connected to the error you’re debugging.
  • Once you start homing in on the source of the error, you can add columns to the Tracepoint Hits panel and filter on the service causing the error.

The real thing (almost)

Developers need ten fingers to write code, but only one to debug it in their IDEs. That’s right: F5 (Start debugger), F10 (Step over), and F11 (Step into). OK, not quite, but you get the point. The thing is, they want the same kind of experience when debugging in production. Digging through reams of static log files is unacceptable. Stepping through decompiled code to examine application state around tracepoints feels much more natural. You can do that with one finger.


Accelerating Developer Velocity with Time-Travel Debugging
https://oz-code.com/blog/production-debugging/accelerating-developer-velocity-with-time-travel-debugging
Thu, 15 Jul 2021

Every company should want to increase its developer velocity. Time-travel debugging is one of the best-in-class tools that are a primary driver of developer velocity and a top contributor to business success.


Ten years ago, software started eating the world. Today, every company is a software company. In every industry segment, from IT (of course), through medical, financial, energy, retail, … everything, companies depend on software, and therefore, software developers, to achieve their business goals. In the never-ending quest to improve, industry leaders have coined the term “Developer Velocity” as the ability to improve business performance through software development. So, every company should want to increase its developer velocity. But what exactly does that mean?

During 2020, in the midst of the global COVID-19 crisis, McKinsey sought to identify and quantify what drives developer velocity. They asked hundreds of engineering and technology executives to rate their company’s performance on 46 drivers across 13 capability areas in the three broad categories of technology, working practices, and organizational enablement. The weighted average of scores across all the drivers was defined as the Developer Velocity Index (DVI).

Not surprisingly, McKinsey found that companies with a high DVI outperform others in the market by four to five times.

Revenue CAGR - Ozcode

Source: McKinsey

Drilling down into the numbers, McKinsey also found that tools in general, and development tools in particular, were among the drivers that had the greatest impact on business performance, which brings me to tools for debugging in production.

Studies have shown that 43% of developers spend 25% of their time debugging in production.

That’s time spent fixing errors instead of delivering more value that contributes to business performance. This is why time-travel debugging in production is one of those tools that can help every business accelerate developer velocity.

Impact on Developer Velocity - Ozcode

Source: McKinsey


Alternatives aren’t good enough

While there are several alternatives for resolving errors in production, none of them provide the same level of production data and insights as time-travel debugging.

Legacy tools

Resolving errors in production is not new. Developers have had to grapple with production issues since the dawn of computing. Over the years, many tools have been introduced, including dump files, post-mortem analysis tools, profilers, remote debuggers, and more.

While all of these tools are better than nothing, either they don’t provide enough data for an effective root cause analysis, or they incur an unacceptable impact on performance. For example, remote debuggers may provide exception information, but they block the server, which is unacceptable in production. Dump files may not block your production servers, but they don’t show you the latest logs or HTTP requests, and they only provide local variables and source code if the code is not optimized (and production code usually is optimized). Furthermore, dump files only represent a single point in time in the program’s history, which is not usually enough to understand that all-elusive chain of causality that caused things to break down.

So, these legacy tools don’t quite cut it.

Observability platforms

The last decade has seen the rise of Application Performance Monitoring tools, which have evolved into full-blown observability platforms. These sophisticated tools have moved monitoring and error resolution in production systems forward by leaps and bounds and work well with modern architectures like microservices and serverless. However, none of these platforms provide the code-level observability you get with time-travel debugging. They are supremely suited for system-level errors, detecting overloaded microservices, or downed virtual machines, but they do not provide the insights needed to resolve exceptions and logical errors that only manifest under unique circumstances of production systems.

Traditional log-based debugging

There isn’t a developer out there who doesn’t write log entries. It’s, by far, the most common way developers try to debug production errors. It’s just so easy to write something like:

```csharp
Logger.LogInformation("About to invoke transaction {id} on table {table}", transactions.Id, table.Name);
```

But log-based debugging is both inefficient and ineffective. Inefficient because developers write way too many log entries and typically never observe or analyze 99.9% of them. Ineffective because, for all the log lines they write, they never have the right data when and where they need it. This is the paradox of static logs: If you know what to log, you’ve already solved the bug. Therefore, debugging in production with logs is an arduous, time-consuming process that usually requires several iterations.

Debugging with logs, the traditional way - Ozcode

How time-travel debugging drives developer velocity

Time-travel debugging can slash the time developers spend on resolving production errors by up to 80%, which means that those developers can spend more time delivering business value.

It’s like any type of problem-solving. To resolve production errors effectively, developers need data. In development, they have all the data they need at their fingertips right in their IDE’s debugger. Not so in production.

To begin with, production errors can be very hard to reproduce in the first place. You can’t place breakpoints in production since that would interrupt service to your customers. Matching your production code to the right source code version is not as trivial as it may seem, and modern microservices and serverless architectures where the offending code is running one moment and gone the next only complicate matters. In most cases, developers don’t even have access to the production environments they need to debug. So, usually, they rely on log files, and we’ve just been through how inefficient and ineffective those are.

Time-travel debugging provides the development experience in production.

When your application throws an exception, Ozcode Live Debugger automatically captures a vast amount of data related to that exception. It starts with a complete recording of the code execution flow of the error across microservices (or serverless code) from user interaction to the line of code that threw the exception. This means that developers can step through the error execution flow, line by line, with full visibility into the call stack, locals, method parameters and return values, HTTP requests, and database queries at every step of the way. Ozcode makes this data available without interrupting service and with no noticeable impact on your production systems.


But not all software errors generate exceptions. Logical bugs can make your application display incorrect behavior without throwing an exception. Here, Ozcode’s dynamic logging with tracepoints provides developers with the production data they need to resolve errors. By placing tracepoints in strategic locations in the code where they suspect the error originates, they can add and remove log entries on the fly without impacting performance. In addition to the dynamic log entries, Ozcode also adds time-travel debug information to the methods that contain tracepoints so that developers can track production data step by step through the lines of code. Doesn’t that sound familiar? It’s exactly the experience a developer gets when debugging in her local environment – only now, it’s in production.

Time-travel debugging, backed by science

McKinsey’s research showed that best-in-class tools are the primary driver of developer velocity and a top contributor to business success. This is exactly the category in which Ozcode’s live, time-travel debugger sits, given its ability to slash 80% off the time taken to resolve production errors. McKinsey’s research fully supports this assertion:

Additional areas that executives believe will accelerate software innovation and impact in the future include increased usage of product telemetry to make product decisions and automation in detecting and remediating production issues.

Source: McKinsey

So, if you want to be in that quartile with 5x business performance, time-travel debugging might not be the only change you need, but it’s a great place to start.


What Every CTO Needs to Know About Live Debugging
https://oz-code.com/blog/production-debugging/what-every-cto-needs-to-know-about-live-debugging
Wed, 16 Jun 2021

Live debugging supports one of a CTO's most important roles: using technology to generate value for the company and drive the business.


The technological advances of the past decade have forced companies to undergo a digital transformation. It was the CTO’s job to navigate that transformation successfully, highlighting one of a CTO’s most important roles: to use technology to generate value for the company and achieve its business goals.

So, let me show you how live debugging serves that role. The TL;DR you need to know is:

  • You need it because bugs in production are consuming valuable developer resources and costing you money.
  • It’s efficient and effective in helping you make your production software more robust and enabling you to recover more quickly from production errors.
  • It works with modern software architectures.
  • It’s secure and compliant.
  • It integrates with observability platforms.

But before we go into all the reasons why, let’s look briefly at what a live debugger is.

What is a live debugger?

A live debugger is a tool or platform you use to resolve errors in your live production and pre-production environments like QA and staging. The production use case is naturally the most acute one, where resolving errors can become an emergency. However, applying live debugging in QA and staging will also accelerate your developer velocity and make your production software more robust while doing wonders for your DevOps KPIs.

The idea of debugging in production is not new. Over the years, tools like event viewers, log files, dump files, Application Performance Monitors (APMs), and others have helped developers resolve errors in production. However, none of these tools are ideally suited for the job. They are either intrusive to your production environment, requiring downtime and dramatically impacting performance, or they’re not very effective in determining the root cause. They don’t provide enough data and may require multiple CI/CD cycles to deploy new builds specifically designated for debugging.

Modern live debuggers modify the code in your live environments non-intrusively using byte-code instrumentation to generate the data developers need in two ways: recording the complete error execution flow of an exception along with all the debug data, and adding dynamic log entries on the fly, without having to rebuild the application. While APMs and observability platforms have been using byte-code instrumentation for the last ten years to generate various metrics displayed in beautiful graphs and charts, applying it to generate debug data is relatively new.
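
To give a feel for what instrumentation does, here is a conceptual sketch, written as ordinary C#, of what a method “looks like” after an agent has instrumented it. Real agents rewrite IL via the CLR Profiling API rather than editing source, and `DebuggerAgent` and `Capture` here are hypothetical stand-ins, not Ozcode’s actual API.

```csharp
using System;

// Hypothetical stand-ins for the agent's recording machinery.
static class DebuggerAgent
{
    public static Capture BeginCapture(string method, params object[] args)
    {
        Console.WriteLine($"[capture] enter {method}({string.Join(", ", args)})");
        return new Capture();
    }
}

class Capture
{
    public void RecordReturn(object value) => Console.WriteLine($"[capture] return {value}");
    public void RecordException(Exception ex) => Console.WriteLine($"[capture] threw {ex.GetType().Name}");
}

class PricingService
{
    // What an instrumented method conceptually looks like after byte-code rewriting:
    public decimal ApplyDiscount(decimal total, decimal rate)
    {
        var capture = DebuggerAgent.BeginCapture(nameof(ApplyDiscount), total, rate);
        try
        {
            decimal result = total * (1 - rate);   // the original method body
            capture.RecordReturn(result);
            return result;
        }
        catch (Exception ex)
        {
            capture.RecordException(ex);           // snapshot state for time-travel
            throw;                                 // rethrow: application behavior is unchanged
        }
    }
}
```

The key point is that the original code path is untouched when nothing goes wrong; the capture machinery only records data on the side.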

Why you need a live debugger

Amazon CTO Werner Vogels famously said, “Everything fails all the time.” But this is not new. In fact, it’s as old as the original Murphy’s law, “Anything that can go wrong, will go wrong.” No matter how many safeguards you have in place, you will have errors in production. It happens to the best of us. From UI glitches…

Live Debugger for CTOs - Ozcode
UI Glitches in TripAdvisor, Southwest Airlines, and Amazon

Source: Applitools.com

…to crashing company stocks and knocking spaceships out of orbit. A quick look at downdetector.com will show you that at any time, household names like Comcast, YouTube, Instagram, AT&T, Verizon, Microsoft, and others experience outages that Gartner estimates can cost companies thousands of dollars per minute.


Live debugging is efficient and effective

Debugging in production is hard. Production environments are usually far too complex to recreate, and reproducing production errors for debugging can be impossible. You can’t put breakpoints in production. Even with modern, sophisticated log analyzers and distributed tracing, log files don’t usually contain enough data to determine the root cause of an error. In most cases, you don’t have access to the production environments running your applications where the errors occurred.

A live debugger overcomes all of these hurdles. Capturing exceptions along with the complete error execution flow removes the need to reproduce production errors. The full application state along the error execution path is available for a developer to step through very much as she would do in development. This is what we call time-travel debugging. Dynamic logging with tracepoints (a.k.a. non-breaking breakpoints) provides a similar degree of observability into the code for logical errors that don’t generate exceptions. The developer can simply add log entries to investigate the application state anywhere in the code without having to rebuild and redeploy the application.

Between autonomous exception capture and dynamic logging, a live debugger offers the developer a fast path to the root cause of an error which slashes debugging time by up to 80%.

A live debugger overcomes the challenges of modern software architectures

Modern software architectures present special challenges when debugging live systems. It’s almost impossible to follow the complex code execution path of errors as they traverse a multitude of redundantly deployed, ephemeral microservices, with intermittent database requests, networking, and messaging, all while generating terabytes of log entries.

Fortunately, autonomous exception capture tracks execution flow across microservices, so you can follow the path of an error from one microservice to another and examine the application state at each step as the error unfolds. Similarly, for logical errors, you can place dynamic log entries anywhere in your code and trace the execution of an action across all the microservices in your application.

A live debugger is secure and compliant

A developer needs access to production data to understand the nature of an error. However, a variety of privacy regulations place restrictions on the production data you are allowed to expose. A live debugger finds the optimal balance between these two opposing forces with highly configurable PII redaction capabilities and enhanced data controls. Data can be redacted according to regular expressions, identifier names, or whole classes and namespaces, providing granular control over what is exposed to the developer. Data configured for redaction is masked before it ever leaves the production environments, and as a backup, data is redacted again at the front end to cover possible changes in redaction configuration. Moreover, the live debugger admin has complete control over data retention policies and can explore an exhaustive audit trail of data access.

Ozcode System Architecture with PII Redaction


A live debugger complements and integrates with observability platforms

You may wonder why you need a live debugger if you already have a modern observability platform (such as New Relic, DataDog, Logz.IO, Dynatrace, etc.) closely monitoring your systems. These sophisticated platforms provide many capabilities like log analysis, performance monitoring, error monitoring, and more, displaying a host of system metrics you can follow. While these capabilities are indispensable for the daily maintenance and monitoring of modern software systems, none of these platforms provide the code-level observability needed to do an effective root cause analysis of production errors. At best, they will register an exception and show you the relevant stack trace, but only a live debugger will let you drill down into the complete error execution flow to analyze the error, step-by-step, to determine its root cause. True, a live debugger cannot replace your observability platform; rather, it is complementary and provides data at the next level of detail needed to resolve production errors. Observability platforms will point you in the right direction, but you need a live debugger to drill down and determine the root cause of errors.

Ozcode Live Debugger Complements APMs

The business of technology

As a CTO, you make strategic decisions about your company’s technology stack. Out of the countless tools available, you have to choose those that will make the most impact in driving your business forward. I may be biased, but it seems clear that a live debugger is a no-brainer. It’s necessary infrastructure that’s as important as your observability platform, if not more. Not only will it make your products more robust by preventing faulty deployments from reaching production, but it will also drastically cut the time to resolve those errors that do slip through the barricades of your DevOps pipeline. Whether it’s reducing the time your developers spend on debugging in production (liberating them to add new features and value to your products) or reducing the number of customers impacted by an error, the ROI is huge and will immediately be reflected in your company’s bottom line.


5 DevOps KPIs Live Debugging Can Improve
https://oz-code.com/blog/devops/5-devops-kpis-live-debugging-can-improve
Thu, 03 Jun 2021

A live debugger can improve DevOps KPIs. Learn what code-level observability can do for MTBF, MTTD, and MTTR.


Twelve years after Patrick Debois coined the term “DevOps,” it’s clear that DevOps is here to stay. However, not all DevOps adoptions are equal. The 2019 Accelerate State of DevOps Report showed that companies on the most successful end of the spectrum could deploy changes on demand within an hour with a failure rate of less than 15%. However, when DevOps adoption falters, changes can take months to deploy, and about half of them fail. To put yourself on the right end of that spectrum, you must keep track of your performance against industry-standard DevOps KPIs. Thoughts of improving your DevOps adoption instinctively take you to testing, automation, collaboration, and other pillars of DevOps. I’m here to tell you that adopting a Live Debugger is increasingly becoming a trend among DevOps engineers, who find that this tool has a real positive impact on DevOps KPIs.

What is live debugging?

Live systems have errors that are often severe enough to cause an outage. In some cases, the errors are at a system level, and the different observability platforms available on the market provide enough information to determine their root cause and fix them. However, in many cases, the errors are at code level.

Bugs in production

Bugs in Production - Ozcode

To fix these bugs, you need the four pillars of live debugging so you can closely examine your production code and data along the error execution flow. And assuming a bug has not crashed your application (which doesn’t necessarily reduce its severity), you need to debug it without interrupting your customers’ experience in any way. Let’s see how live debugging in production and pre-production environments can do wonders for your DevOps KPIs.


The live debugging connection to DevOps KPIs

There are many KPIs you could be monitoring to assess your DevOps adoption, and using a live debugger can dramatically improve several of them.

DevOps KPIs can be improved with a Live Debugger - Ozcode

Change Failure Rate

This KPI is one of four key metrics identified by Google’s DevOps Research and Assessment (DORA) team that indicates how a software team is performing. It measures the percentage of deployments that cause a failure in production and is an indication of the product’s stability.

\(\text{Change Failure Rate} = \frac{\text{Deployments causing a failure in production}}{\text{Total deployments}} \times 100\%\)
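
For example, if three out of 60 deployments in a quarter caused a failure in production, your change failure rate for that quarter is 3/60 × 100% = 5%.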

In the 2019 Accelerate State of DevOps Report, DORA found that Elite DevOps teams had a seven times lower change failure rate than low performing teams (i.e., deployments are only 1/7th as likely to fail).

Source: 2019 Accelerate State of DevOps Report

How does a live debugger help reduce change failure rate?

By shifting debugging left to pre-production. Here’s what you can do.

Install the Ozcode agent alongside your application on your staging environment. Your application will, most likely, throw exceptions. Any of those exceptions could cause a deployment in production to fail and increase your change failure rate. Ozcode will catch all of those exceptions autonomously and provide you with the full time-travel debugging information to fix those errors ON STAGING.

Fewer bugs on staging means fewer bugs in production and a lower change failure rate.

Defect Escape Rate

This term is also quite self-explanatory and is a measure of defects that “escape” your pre-production systems and get into production. The calculation is quite simple:

\(\text{Defect Escape Rate} = \frac{\text{Bugs found in production}}{\text{Bugs found in production} + \text{Bugs found in pre-production}}\)
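
For example, if QA and staging caught 40 bugs during a release cycle and another 10 bugs surfaced in production, your defect escape rate is 10 / (10 + 40) = 20%.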

If your Defect Escape Rate is too high, you should re-evaluate your deployment frequency. You might find that you’re rushing through QA and staging to meet release deadlines. The consequence is more buggy code (ergo, a less stable application) in production, which can cause anything from a loss of reputation to direct loss of revenue. As in the case of Change Failure Rate, using Ozcode on your staging environment and even on QA can reduce the number of bugs that escape and make it to production, hence lowering the Defect Escape Rate.

Mean Time to Detection

How long will a defect exist in production before you detect it? You want this number to be as low as possible because the earlier you detect a defect, the earlier you can fix it, so your customers will be less likely to experience it.

There are two primary factors affecting MTTD:

  1. When an incident occurs
  2. How long it takes you to detect it.

You have no control over when an incident occurs. An incident is unexpected; otherwise, you would have already implemented a fix to prevent it. Once you detect an incident, you can review your log files or monitoring systems to timestamp when it first occurred. With that data, it’s a simple calculation of detection time minus start time. MTTD is an average over any time interval you choose. Consider this example of an organization detecting three incidents:

Start time    Detection time    Elapsed time (min)
4:26 pm       5:02 pm           36
3:05 pm       8:51 pm           346
10:15 am      10:17 am          2

For the time interval of this sample, MTTD = (36+346+2)/3 = 128 min.
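
If you track incident timestamps programmatically, the same calculation is a one-liner. Here is a minimal C# sketch using the three incidents from the table; the dates are made up, since the table only gives times.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class MttdCalculator
{
    static void Main()
    {
        // (start, detection) pairs for the three incidents above; dates are hypothetical.
        var incidents = new List<(DateTime Start, DateTime Detected)>
        {
            (new DateTime(2021, 5, 1, 16, 26, 0), new DateTime(2021, 5, 1, 17, 2, 0)),
            (new DateTime(2021, 5, 2, 15, 5, 0),  new DateTime(2021, 5, 2, 20, 51, 0)),
            (new DateTime(2021, 5, 3, 10, 15, 0), new DateTime(2021, 5, 3, 10, 17, 0)),
        };

        double mttd = incidents.Average(i => (i.Detected - i.Start).TotalMinutes);
        Console.WriteLine($"MTTD = {mttd:F0} min");   // prints: MTTD = 128 min
    }
}
```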

Now, there are different ways you can assess your MTTD, for example, by removing outlier values or segmenting by incident severity, but that’s a topic for another post.

The time taken to detect an incident depends on whether it’s caused by a logical bug or a software error. You may only detect a logical bug once a user (or preferably, one of your own QA staff who is testing in production) reports an issue. Typically, MTTD for this kind of bug will be longer. On the other hand, a software error usually throws an exception, and in these cases, Ozcode Live Debugger can dramatically reduce MTTD.

As soon as your application throws an exception, Ozcode captures the error execution flow and displays the exception on the dashboard.

While you can get similar detection capabilities from modern APMs, with Ozcode’s Live Debugger, you can just click one of those exceptions to debug it directly. Essentially, MTTD for exceptions should evaluate to zero, and from here, you’re in a race to reduce MTTR, which I discuss below.

Ozcode Production Debugger

Mean Time Between Failures (MTBF)

MTBF is another indicator of your software’s quality. It stands to reason that the more robust your software, the less likely it is to fail and the more available it will be to your customers. Here too, the calculation is quite simple:

\(\text{MTBF} = \frac{\text{Total operating time}}{\text{Number of failures}}\)

For example, if a system failed 4 times in 24 hours, and the total outage time was 2 hours, then for that 24-hour period:

MTBF = (24-2)/4 = 5.5 hours

MTBF goes hand-in-hand with Defect Escape Rate and Change Failure Rate in that improving those KPIs is likely to have a positive effect on MTBF. Using Ozcode to reduce the number of bugs in your pre-production environments will help deploy more robust releases and thus improve (i.e., increase) MTBF, but Ozcode can also have a direct effect on MTBF. By reducing your system downtime (i.e., the recovery time – see MTTR in the next section), Ozcode directly increases total operating time, and therefore, your MTBF.

TIP:
If your system uptime is approaching your MTBF, start taking extra care with new deployments and closely monitor your operations.

Mean Time to Recovery (MTTR)

MTTR measures how quickly you get your service running again after it crashes. This is probably one of the best-known DevOps KPIs because it relates to managing an emergency situation. Business stakeholders are watching the clock, pagers are beeping in Operations, and developers are getting phone calls in the middle of the night. Basically, anyone who cares is burning the midnight oil to get your systems up and running again. Here’s the calculation:

\(\text{MTTR} = \frac{\text{Total downtime due to failures}}{\text{Number of failures}}\)

Using our MTBF example again, MTTR for that 24-hour period is 2 hours/4 failures = 0.5 hours.

MTTR is an indication of how quickly you can respond to an outage and fix it. The quicker you can debug an issue that crashed your system, the lower your MTTR will be, and this is where Ozcode can help. Ozcode can reduce the time to debug an issue by up to 80% because:

  • There’s no need to reproduce the issue. Ozcode records the complete error execution flow directly on your production environment.
  • The time-travel debug information that Ozcode captures provides the developer who has to debug the issue with all the production data they need.

Providing the developer with this kind of code-level observability into the production system’s error state and allowing them to step through the error execution flow is another form of “shift-left debugging,” and it dramatically reduces the time from failure to recovery.

Bridging Dev and Ops with live debugging

DevOps engineers have recognized the value of having a live debugger in their enterprise tool stack. While they are the first responders to production incidents, they understand the need to bridge the gap to developers who must fix the bugs that cause those incidents. To understand exactly what went wrong, developers need access to production data so they can step through the execution flow of an error exactly how it happened with full visibility into production data at code level. With that data at their fingertips, developers can improve the quality of code in production without sacrificing release velocity, thereby improving DevOps KPIs.


The Road to Observability from System Down to Code
https://oz-code.com/blog/production-debugging/the-road-to-observability-from-system-down-to-code
Mon, 24 May 2021

When Datadog shows you something has happened with your software at a system level, Ozcode takes you on an observability journey down to code level.


Datadog is an industry-leading observability platform and brings a wide variety of observability data into one integrated view. From details captured in your processed logs, Datadog lets you switch to traces to see how the corresponding user request was executed.  In case of an error in your software, Datadog displays the full stack trace and then lets you use faceted search to drill down into the corresponding traces and logs to determine the cause of the issue. Datadog continuously monitors your production environment and provides system-level alerts such as traffic spikes, elevated latency, or looming bottlenecks to help you troubleshoot issues and keep things running smoothly.

The system-level data Datadog provides goes a long way to determining the root cause of errors, but in many cases, observability at the level of logs, metrics, and traces does not provide enough information to understand what really went wrong with your application. Think of it like a car. If you see the temperature gauge rising, you might guess that you need to top up the radiator fluid. But are you sure that’s really why your engine is overheating? Is there a leak in your cooling system, or is the engine overheating because you’re losing oil? To find out, you have to pop the hood with Ozcode.

Popping the hood on your production environment

When Datadog shows you something has happened with your software, Ozcode pops the hood and takes you on an observability journey from system level down to code level. Datadog can provide a great starting point, showing you anomalies in metrics and even the stack trace of exceptions. From there, you go to Ozcode.

To investigate anomalies surfaced by Datadog, Ozcode lets you add dynamic logging using tracepoints. You can add these log entries on the fly to your live running code without having to deploy a new build through your CI/CD pipeline. Using dynamic logs to reveal the value of locals, variables, method parameters, and return values anywhere in your code goes a long way to exposing the root cause of an incident.

Ozcode also pops the hood on exceptions. Ozcode autonomously captures any exception that your application throws along with full, time-travel debug information so you can step through the error execution flow with full visibility into your production data at every step of the way. This is what we call code-level observability.

When the impossible happens

Let’s see how this integration might work with an eCommerce nightmare.

Black Friday or some other purchasing frenzy is just around the corner. All systems are GO. Everything has been tested, retested, and reinforced.

And then, the impossible happens. Customers can’t complete checkout.

Everybody’s face-palming, and phones and pagers are going off everywhere in IT/Ops.

The first place your DevOps engineers go to is your observability platform. Datadog to the rescue.

A quick look at the service map shows which service is throwing errors.

Datadog Service Map
Image source: Datadog

Let’s drill down into the App Analytics screen for that service and investigate the errors.

Image source: Datadog
While the HTTP request to “checkout” returns a 200 OK, you see many errors and can even see the exception that is thrown. But what now? Now it’s time for developers to dig down into the code, and the collaborative features of both Datadog and Ozcode help break the silos between IT/Ops and developers to get them working together.


Time to pop the hood

Ozcode steps in when you need to start working with code. Setting up Ozcode to work with Datadog is easy – just install the Ozcode extension from the Datadog marketplace and get the Ozcode agent installed on your servers. Once you’re set up, Ozcode will show you all the exceptions you saw in Datadog, and now you can time-travel debug them with full visibility into the error execution flow on your live production environment.

But that’s not always enough. We also saw that even in cases where the request returned a 200 OK, customers couldn’t seem to check out. Let’s dig a little deeper.

Observability hits code level

Going back to your Datadog dashboard, you discover that some critical requests are showing unusually high latency.

Datadog showing latency
Image source: Datadog

Let’s set a tracepoint (a.k.a dynamic log) in the method that tries to fill orders.

Set Tracepoint
Now, as customers continue trying to check out, you’ll start collecting tracepoint hits; only now, you’ll have source code and will be able to view all locals and variables in scope for each tracepoint. With the new integration, the Ozcode app is embedded right inside the Datadog platform, so you never have to leave.
Ozcode tracepoints in Datadog
Need even more data? No problem. You can keep adding tracepoints without worrying about performance until you have all the data you need. No need to rebuild and redeploy.


Let’s examine one of those dynamic log entries inside Datadog’s Log View.

The Log View correlates the Ozcode dynamic log entry to the trace of the request that generated it. Analyzing this visual representation of the internal workings of our application shows us exactly where the application is spending time and why checkout is taking too long.

Having discovered the problematic variable, you may now want to monitor it for a while to make sure a fix you implement is working correctly.

Let’s go back up the observability path to Datadog.

Since Ozcode pipes dynamic log output back to Datadog, you can use Live tail and watch how your variables change in real-time. In fact, you can use all of the platform’s analytics capabilities for your new live log entries.

Datadog LIvetail
Image source: Datadog

Using dynamic logging to pipe variables back into Datadog opens up a world of opportunity. You can watch how anything changes in real-time on a new chart you define for your dashboard. Taking the car analogy, you’ve added gauges to measure your radiator fluid and oil level in real-time with no effort.

From system to code and back

Observability is critical to keep systems running smoothly and fix them when they don’t. Our journey into observability started at the system level when Datadog’s Service Map showed that one of our services was throwing errors. A look at the Analytics Panel revealed what the error was and even gave us the stack trace. To understand the root cause of the error, we first used Ozcode to time-travel debug an error and then drilled down by adding tracepoints on the fly. These tracepoints generated dynamic logs, which we fed back into Datadog, and even created ad-hoc metrics and visuals to monitor suspicious variables. As soon as a variable went off the scale somewhere, we could examine the live application state that caused it in great detail to take us directly to the root cause of the error.

When you’re thinking about observability, you need to think about the full round trip; from system, down to the code, and back.


The post The Road to Observability from System Down to Code appeared first on Ozcode.

Frictionless On-Premises Incident Resolution. Don’t Rub Your Customers the Wrong Way https://oz-code.com/blog/production-debugging/frictionless-on-premises-incident-resolution-dont-rub-your-customers-the-wrong-way Wed, 28 Apr 2021 11:36:34 +0000 https://oz-code.com/?p=18186 On-premises deployments create hurdles when something goes wrong making incident resolution in production a painstaking game of trial and error. But there's hope in sight.

The post Frictionless On-Premises Incident Resolution. Don’t Rub Your Customers the Wrong Way appeared first on Ozcode.


Cloud computing is all the rage. Yes, the simplicity, agility, and scalability of the cloud are the driving forces of the digital transformation many companies are undergoing. Struggling with a remote workforce in the aftermath of COVID-19 only accelerated this trend, with Gartner estimating that public cloud spending will reach over $360 Billion by 2022. But this does not mean that on-premises workloads are going away any time soon. Many applications are natively on-premises, and there is even a swing back, with some services being repatriated from the cloud to on-premises infrastructure. It seems like there’s a 12-lane highway between cloud and on-prem, with workloads moving in both directions. So, on-prem is here to stay. The problem is that on-premises deployments create hurdles when something goes wrong. The difficulty of accessing your customer’s infrastructure makes incident resolution in production a painstaking game of trial and error. But there’s hope in sight.

The reasons for going on-prem

Here are some of the reasons companies remain on-prem:

Security: There’s an ongoing debate about the security of the public cloud compared to an on-premises private cloud. Many are opting for on-prem, especially in sensitive industries like finance, military, and health care.

Regulatory compliance: While the public cloud sports many certifications, not every cloud can satisfy every industry. The ultimate responsibility for data privacy and governance remains with you, and the cloud cannot always provide enough availability regions or guard against human error.

Cost: The pay-per-use model with zero CapEx of the public cloud is appealing, but many companies moving in that direction quickly realize that just lifting and shifting workloads to the cloud does not bring the cost benefits the cloud promises, and they find themselves with a cloud hangover.

Edge computing: The number of connected devices we use is exploding, from smart cars to business analytics to automated factories. With more and more data being created by devices, there is a growing need to analyze that data at on-prem data centers near the compute edge.


Resolving incidents on-premises usually means a lot of customer friction

With so many companies keeping or moving their workloads on-prem, it’s likely that at least some of your customers will run your software at their on-prem data centers.

And then your support team gets that call.

Something’s not working right with your software, and your customer wants an urgent fix.

If your customer is willing to give you access to the servers on which your software is running, you can go about your investigation, but that’s not usually the case, especially in sensitive industries. So, you ask your customer to send you logs since you don’t have much else to go on. You try to figure out what went wrong and add more logs to validate your theory, but now you have to reproduce the error with the new logs. You send your customer a hotfix and ask them to deploy it to their production environment. This process causes a lot of friction with your customer. It requires a great deal of their time and interaction, not to mention unplanned deployments to production, which may take days to happen. Worse, you rarely get it right the first time and will have to go through several iterations like this with your customer. More time, more aggravation. By the time you really figure out the problem, you’ve lost quite a bit of trust, and your customer’s upcoming license renewal may be very shaky.

Traditional log-based on-prem incident resolution


The frictionless approach to on-premises live incident resolution

Ozcode supports on-premises installations. That means you can deploy the Ozcode agent alongside your customer’s software and install the Ozcode server at your customer’s site. If your customer’s site is truly air-gapped, you’ll have to log on to your favorite travel site, book a ticket and a hotel, and get on a plane. If you’re lucky, you might be able to drive there. Without the ability to create a connection to the world outside the customer’s network, there’s no other option. That’s why Ozcode is adding a “technician mode” to its Live Debugger: anyone with access to the on-site Ozcode server will be able to export exception captures and tracepoint sessions with a single click, so your engineers can import them into your local Ozcode installation and time-travel debug from the comfort of their own desks.

Frictionless on-prem incident resolution

This approach to resolving production incidents related to your on-premises software will go much more smoothly with your customer. There are no repeated hotfixes to deploy just to get more logs, and no downtime. Your engineers don’t have to reproduce the issue; they can just play back the autonomous exception capture to analyze and time-travel debug it. And if they need more logs, no problem. They can use tracepoints to add dynamic logs wherever they’re needed in the code – no redeployment required. The bottom line is that you’ll solve that gnarly bug much more quickly without rubbing your customer the wrong way.


Supercharging Web Apps by Testing and Debugging in Production https://oz-code.com/blog/production-debugging/supercharging-web-apps-testing-debugging-in-production Wed, 31 Mar 2021 14:08:05 +0000 https://oz-code.com/?p=17999 Two ways your web application can break are UI bugs and deep-down logical bugs in the server. You can detect and resolve both types with Selenium Test Automation and debugging in Production.

The post Supercharging Web Apps by Testing and Debugging in Production appeared first on Ozcode.


This post is co-authored by Himanshu Sheth, Senior Manager, Technical Content Marketing at LambdaTest.

“Move fast and break things,” goes the famous saying by Mark Zuckerberg. But developers know there’s a delicate balance between your release velocity and how robust your application will be. When bugs slip through Staging and get to Production, they start affecting your customers. When that happens (and it certainly does), it’s going to get everybody’s attention and become your top priority. Two ways your web application can break are UI bugs and deep-down logical bugs in the server.

In this post, I’ll show how you can detect and resolve both of these types of Production bugs, hopefully before your customers notice them. First, I’ll show how to use Selenium test automation on the LambdaTest Cloud Grid to run a web app simultaneously on multiple browsers and catch UI glitches.

With a cloud-based Selenium Grid, you can catch UI issues way ahead of time by testing the features across a range of browser and platform combinations. The fierce battle of quality vs. time can be won by testing on a cloud-based Selenium Grid!

Then I’ll show how Ozcode Live Debugger’s time-travel debugging digs deep into your live server code to help you debug exceptions and logical errors in Production. My reference application is this mock eCommerce site where you can purchase all sorts of goodies related to .NET.

Selenium test automation is a necessity, not a luxury

One of the biggest challenges faced by web developers is uniformity of the UI across different browser, device, and platform combinations. Cross-browser compatibility issues can seriously degrade the user experience, especially if you have not tested the UI on the browsers & platforms that are widely used by your target audience. You do not want to irk your customers with misplaced buttons, overlapping text, and other such usability issues that would drive them away from your website (or web application). However, it is impossible to cover the entire gamut of browsers and operating systems since the list is endless. Did you know that despite the dominance of Chrome and Firefox, Internet Explorer is still relevant, even today? Free your IT team from the unnecessary burden of constantly maintaining an in-house Selenium Grid that is high on maintenance and yields low returns. Instead, prioritize the browser & OS combinations on which you intend to perform testing, and kick-start with testing on a reliable & scalable cloud-based Selenium Grid by LambdaTest.

How to get started with Automated Browser Testing using a cloud-based Selenium Grid

If you’re not already familiar with Selenium and how it works, I would recommend reading the “What is Selenium” guide by LambdaTest.

If you’ve worked with Selenium before and prefer it for automating browser interactions, you should give LambdaTest a spin. It frees your automation test scripts from the infrastructure limitations of running them locally.

To run your existing Selenium test script over the LambdaTest Grid, you will need to change the Hub URL from your local machine to the LambdaTest cloud. You can do that by declaring your username and access key in the remote Hub URL to successfully authenticate your access to the LambdaTest cloud servers. Your LambdaTest username & access key can be obtained from the Profile Page.

Here is the brief set of steps to perform automation testing on LambdaTest: create your account, grab your username and access key from the Profile Page, update your test scripts to point at the LambdaTest Hub URL (as shown in the next section), and run them.

You can monitor the status of the tests run on the LambdaTest Grid by navigating to the Automation Dashboard.

Now that you have set up the account on LambdaTest, it’s time to port the existing, working test implementation to LambdaTest. Suppose you have used the NUnit framework in Selenium C# for writing the automation tests. The changes mostly involve the method implemented under the [SetUp] attribute. This is where you instantiate the browser on which the test needs to be performed.

Here is the code snippet which showcases the instantiation of the Chrome browser on a local Selenium Grid:

using System;
using NUnit.Framework;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace NUnitTest
{
    public class NUnitTest
    {
        String test_url = "test_url";
        public IWebDriver driver;

        [SetUp]
        public void start_Browser()
        {
            /* Local Selenium WebDriver */
            driver = new ChromeDriver();
            driver.Url = test_url;
            driver.Manage().Window.Maximize();
        }
        /* Tests follow here */
    }
}

As seen above, the start_Browser() method instantiates the Chrome browser, after which the URL under test is set. The test(s) themselves are implemented in method(s) under the [Test] attribute.
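For completeness, here’s a minimal sketch of a test method that could slot into the class above – the assertions are placeholders, not the actual demo-site checks:

[Test]
public void homePage_Should_Load()
{
    // driver was already pointed at test_url in start_Browser().
    Assert.IsNotEmpty(driver.Title);
    Assert.IsTrue(driver.FindElement(By.TagName("body")).Displayed);
}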

Before running the tests, generate the desired browser capabilities using the LambdaTest Capabilities Generator. As shown below, select the appropriate browser, browser version, and platform on which you intend to perform the test:

So, how do we port this implementation so that the existing tests run on the cloud-based Selenium Grid from LambdaTest? Well, the only changes are in the method implemented under the [SetUp] attribute. Instead of a local Selenium WebDriver, we use a RemoteWebDriver that passes the test request to the LambdaTest Hub [@hub.lambdatest.com/wd/hub].

using System;
using NUnit.Framework;
using OpenQA.Selenium;
using OpenQA.Selenium.Remote;

namespace NUnitTest
{
    public class NUnitTest
    {
        String test_url = "test_url";
        public IWebDriver driver;

        /* LambdaTest credentials and Grid URL */
        String username = "user-name";
        String accesskey = "access-key";
        String gridURL = "@hub.lambdatest.com/wd/hub";

        [SetUp]
        public void start_Browser()
        {
            /* The desired capabilities tell the grid which browser/OS combination to provision */
            DesiredCapabilities capabilities = new DesiredCapabilities();

            capabilities.SetCapability("user", username);
            capabilities.SetCapability("accessKey", accesskey);
            capabilities.SetCapability("build", "[C#] Demo of LambdaTest Grid");
            capabilities.SetCapability("name", "[C#] Demo of LambdaTest Grid");
            capabilities.SetCapability("platform", "Windows 10");
            capabilities.SetCapability("browserName", "Chrome");
            capabilities.SetCapability("version", "latest");

            /* A RemoteWebDriver pointed at the LambdaTest Hub replaces the local ChromeDriver */
            driver = new RemoteWebDriver(new Uri("https://" + username + ":" + accesskey + gridURL), capabilities, TimeSpan.FromSeconds(600));
            driver.Url = test_url;
            driver.Manage().Window.Maximize();
        }
        /* Tests follow here */
    }
}

With this, you are all set to run your tests on the LambdaTest Selenium Grid. On execution, you can visit the Automation Dashboard to keep an eye on the status of the tests.
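One practical addition: the grid can’t tell on its own whether your assertions passed, so it’s common to report the result back from a [TearDown] method. Here’s a sketch based on LambdaTest’s JavaScript-executor convention (the “lambda-status” hook is LambdaTest-specific; check their docs for the exact contract):

[TearDown]
public void report_Status_And_Quit()
{
    // Mark the session passed/failed on the LambdaTest Automation Dashboard.
    bool passed = TestContext.CurrentContext.Result.Outcome.Status
                  == NUnit.Framework.Interfaces.TestStatus.Passed;
    ((IJavaScriptExecutor)driver).ExecuteScript(
        "lambda-status=" + (passed ? "passed" : "failed"));
    driver.Quit();
}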

Have a look at how your website (or web app) can render differently on different browsers (and browser versions):

Shown below is a cross-browser test performed on IE 8 (running on Windows 7). Not only is the rendering messed up, but the “Next” button (which is in SVG format) is also displayed incorrectly.

Compare this with a working test run on the Chrome 89 + Windows 10 combination. There are no issues whatsoever in the rendering of the web page.

The key takeaway is that cross-browser testing at scale should feature in your automation testing checklist. With it, your customers will be greeted with a lasting product experience that works like a charm on the browsers and devices they love to use!

An online Selenium Grid such as LambdaTest makes it super-easy to ensure a cross-browser compatible experience without having to worry about the infrastructure limitations that curtail browser and test coverage. LambdaTest offers the scalability and reliability needed to perform cross-browser tests at scale!

Debugging logic in Production with Time-Travel Fidelity

Let’s now look at that other type of bug I mentioned earlier – a logical bug. Our mock eCommerce site offers a Buy 2 Get 1 FREE deal with some bugs built in. When I chose 2 of those nifty sweatshirts, the site automatically gave me a third one. Well, they’re cool, but not that cool, so I decided to bag it. But when I updated the quantity back to 0, the site threw an exception.

Watch this.

Ozcode automatically catches the exception and displays it in the dashboard. We can see it’s an ArgumentOutOfRangeException.

To debug the exception, I click the Debug button.

Ozcode shows where the exception was thrown in the code, and you immediately understand why: there’s an OutOfRange guard clause, and the value of Input is -1.
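The guard in question looks something like this – a sketch, where the class and parameter names are assumptions rather than the demo app’s actual source:

using System;

public static class Guard
{
    // Throws the ArgumentOutOfRangeException we saw in the Ozcode dashboard.
    public static void AgainstNegativeQuantity(int input)
    {
        if (input < 0)
            throw new ArgumentOutOfRangeException(nameof(input), input,
                "Quantity must be zero or greater.");
    }
}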

Now let’s climb up the call stack a bit and look at the method AdjustQuantity, where we implemented the Buy 2 Get 1 Free deal.

First off, from the red/green color coding, we see exactly which parts of this method were executed in this error flow. The first “if” statement handles the Buy 2 Get 1 Free deal.

if (newQuantity == 2)
{
	newQuantity = 3;
}

But that handles the case when a customer modifies the number of items from 1 to 2.

In this case, I’ve changed my mind and updated the quantity back to 0, so the second “if” statement is executed (as we can easily see because it’s green).

if (currentQuantity > 2 && newQuantity < 2)
{
	newQuantity--;
}

But no one considered that newQuantity could be 0; when I zeroed out my order, the decrement took it to -1, and we get our exception.
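With that visibility, one possible fix is to make the deal-reversal branch aware of a zero quantity – a sketch of the idea, not necessarily how the demo app was eventually patched:

if (currentQuantity > 2 && newQuantity < 2)
{
	// Strip the free item only while a paid item remains – never go below 0.
	newQuantity = Math.Max(newQuantity - 1, 0);
}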

Now, there are any number of APMs and error monitoring tools that will show you that ArgumentOutOfRangeException with the invalid input of -1. None of them will show you the code across the whole call stack, along with the values of all the locals, method parameters, and return values that reveal exactly HOW you got to that invalid input. It’s only once you have that data that the fix for this bug becomes trivial.

Now, you may be thinking, “this was a simple example; real life is more complicated.” You may be right, but even for an example like this, you may have found yourself guessing at the solution, adding log entries to validate it, and rebuilding to test. This kind of observability into the code along the whole execution flow of the error is what makes it easy (or at least much easier) to fix any bug, whether it’s in a monolithic application, a redundant microservice, or a serverless function that runs for a microsecond and is then gone – let’s see you reproduce that. With Ozcode, there’s no need to reproduce it. It’s all recorded for you.

Testing and debugging in production, better together

Testing and debugging are two inseparable facets of delivering robust, working software. For Dev/QA collaboration to be frictionless, developers need as much information about errors as possible, and it’s up to QA to provide it. A while ago, I maintained that a perfect bug report could be provided as a single shareable link, and that’s true for server-side logic bugs. If we now consider UX, we need a bit more data, and that’s what LambdaTest provides to complete the picture. LambdaTest can simultaneously test your UI across a large matrix of OSs and browser versions. If one of those combinations generates an exception, data about the exact scenario, configurations, and versions can be provided to Ozcode Production Debugger, where you can take the analysis down to code level. Being able to debug errors connected to specific OS/browser combinations at a code level will drastically cut down the time it takes to understand where the problem is and fix your code. This is truly end-to-end error resolution.

Finding the Bug in the Haystack: Hunting down Exceptions in Production https://oz-code.com/blog/production-debugging/finding-the-bug-in-the-haystack-hunting-down-exceptions-in-production Wed, 24 Mar 2021 06:00:08 +0000 https://oz-code.com/?p=17756 As companies move fast and break things, they then have to fix all those things they have broken. With machine learning you can find the bugs that matter, and with time-travel debugging you can then fix them.

The post Finding the Bug in the Haystack: Hunting down Exceptions in Production appeared first on Ozcode.


This post is co-published by Logz.io and is co-authored by Omer Raviv, Co-founder & CTO @ Ozcode, and Dotan Horovits, Product Evangelist @ Logz.io.

Software companies are in constant pursuit to optimize their delivery flow and increase release velocity. But as they get better at CI/CD in the spirit of “move fast and break things,” they are also being forced to have a very sobering conversation about “how do we fix all those things we’ve been breaking so fast?”

As a result, today’s cloud-native world is fraught with production errors and in dire need of observability.

Climbing the ELK Stack Everest

The depth and breadth of production errors in today’s cloud-native world are apparent from the vast number of exceptions that these applications generate. And how do companies address the issue?

Logs, logs, and more logs.

Modern applications generate mountains of logs, and those logs are generously peppered with exceptions. The sheer magnitude of exceptions makes it extremely difficult to weed out just the right ones. Which exceptions are new? Which are just noise? Which contain important information, such as an error in a newly deployed feature or a customer that’s having a terrible experience and is about to churn?

Let machine learning find the needle in a haystack of errors in Kibana with Logz.io

Let’s take a look at a real-world scenario. If you’ve ever worked at an eCommerce company, this will sound familiar.

The end of November rolls around.

Your friends and family are giddy about all the neat things they’re going to buy.

You are somewhere between stressed and having a full-blown panic attack. It’s your company’s biggest day of the year for sales. Your infrastructure and code had better be up for the task.

Black Friday hits, your website traffic is peaking, and the nightmare begins.

Despite all of your best efforts and meticulous testing, your “buy 2 get 1 free” coupon code simply DOES NOT WORK.

What now?

Let’s look at some logs.

I already mentioned that your logs are likely to contain loads of exceptions. How are you going to pick out the ones related to your coupon code? The open-source ELK Stack is popular for ingesting those mountains of logs and slicing and dicing them in Kibana Discover to understand the scenario at hand. Each log entry can contain structured data, so you can filter on a specific field, and logs can also be enriched with additional contextual data, such as a user’s email or browser type.
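As an aside, attaching that kind of context is cheap with a structured logging library – for example, Serilog (one option among many; the field names here are illustrative):

using Serilog;

public static class CheckoutLogging
{
    public static void LogCouponFailure(string userEmail, string couponCode)
    {
        // Each ForContext property becomes a structured, filterable
        // field on the log entry once it lands in Kibana.
        Log.ForContext("UserEmail", userEmail)
           .ForContext("CouponCode", couponCode)
           .Error("Coupon code failed to apply at checkout");
    }
}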

In our Black Friday nightmare scenario, you might filter on the particular services that are flaking out, the relevant time frame, and your coupon code field – for example, a query along the lines of service:checkout AND coupon_code:BUY2GET1 (the field names depend on your own log schema):

A typical investigation in Kibana Discover involves an iterative process of filtering and querying to narrow down the search context, which can be tedious and time-consuming when there are so many outstanding exceptions in the environment.

Logz.io offers a Log Management service based on the ELK Stack that saves you the hassle of managing the open source yourself at scale. But it does much more than that. Logz.io’s Exceptions tab within Kibana Discover does a fantastic job of doing what no human can – looking through the hundreds of thousands of log lines that contain exceptions and using machine learning smarts (Logz.io’s Insights Engine) to group them into a concise, aggregated view that can be filtered in all the same useful ways we apply filters in Kibana Discover.

In our Black Friday incident, even after filtering, we’re faced with more than half a million log hits. However, Logz.io’s Exceptions tab in Kibana flags only 17 clustered exceptions in this search context. Let’s take a closer look at these errors:

In the Exceptions tab, we immediately spot a new exception – ArgumentOutOfRangeException – that started firing intensively during the incident time window. In a real-world, cloud-native system, this would filter out the noise and let you home in on the right exceptions.

You now know where to start your final assault. But where do you go from here?

Ozcode – see the code behind the logs

The logs are the telemetry of our software’s black box; they record what the system tells us it is doing. Now that we’ve used Logz.io’s Insights Engine to find out which exception we should focus on, we’d like to open up the black box and get a code-level understanding of that exception. This is where Ozcode’s exception capture comes in. An Ozcode Live Debugger exception capture includes all the data we need: you can time travel through line-by-line code execution up to the point where your application threw the exception, viewing local variables, method parameters and return values, network requests, database queries, and more.

The ArgumentOutOfRangeException exception and call stack we saw in Logz.io’s Kibana don’t provide enough data for us to understand what happened. However, by simply jumping over to the Ozcode dashboard and filtering for the specific exception type and time range, we can delve deeper…

The Ozcode recording gives us a visual look at the code execution that led to the bug – every expression that was false is highlighted in red, every expression that was true is highlighted in green, and every variable and method call shows its exact value. We can see we had a simple calculation error in our “Buy 2 Get 1 free” sale, which made us think the customer wanted to buy a negative number of items.

Now that we understand what happened, it’s an easy fix! There’s no need to reproduce the issue on a local dev machine to solve the mystery.

Zoom in fast and fix things

The ELK Stack, and Kibana in particular, gives us tremendously powerful tools to investigate logs. Using Logz.io’s machine learning-based insights, we can surface the relevant exceptions and related logs inside Kibana out of the endless noise of the millions of logs that modern cloud-based systems generate. Ozcode Live Debugger enhances this experience even further by giving us code-level observability and a time-travel recording to quickly understand the root cause behind each exception. You can combine that with additional telemetry such as metrics and traces to increase your system’s observability and enhance your troubleshooting capabilities.

