The Benefits of Test Observability for DevSecOps
Matt Heusser
Posted On: April 5, 2023
Imagine for a moment that you are working on an internet of things product. It could be a doorbell tied to a security alarm, or perhaps a connection from your phone to your automobile. Either way, the system consists of a complex series of web services. These products generally have a local “hub” (your router or vehicle) that connects to external services, plus a back-end at your data center that requires authentication. Now imagine something goes wrong. The neighbor presses the doorbell, or you click to unlock the car doors, and … nothing happens. What went wrong?
It could be a problem with the handheld mobile device, the router, the device, the internet, the connection to the back-end, the back-end services, or perhaps even some other dependency we don’t know about. Most of us looking for testing help are building complex products. When something goes wrong, we need to understand what component failed, what state it was in, and what input it was sent.
Enter observability. Observability is the extent to which a system’s internal state can be understood from a given interaction. A highly observable system is the opposite of a “black box”; you can see inside of it. This blog will dive into the benefits of observability, starting with debugging and testing.
Debugging
This is the first step – what component failed? Was it the hardware at the door, the router, or the internet? With simple programs, the programmer has a “call stack”, to trace where the error occurred, what methods were called, and what call values were passed in. An API call stack isn’t that different – we can see the message went from the device to the router, and then … nothing.
Of course, sometimes the software will register a legitimate error. Other times, things simply take too long. On one automotive project, we would see everything work, but a door unlock or a horn honk might take two minutes to process. Without observability, the defect ticket is “car door unlock is slow.” With observability, we can see when the messages left at each step of the process. To improve performance, we need this data, so we can find the step of the process that takes the longest but shouldn’t, and reduce it. Without that, the team, fundamentally, is left to guess, poke and prod.
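As a minimal sketch of what that per-step timing data makes possible, the snippet below takes the timestamps at which a message left each component and reports the slowest step. The component names and timings are made up for illustration; a real tracing tool would supply them.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    component: str   # hypothetical component name
    sent_at: float   # seconds since the request started, when the message left this component


def slowest_step(hops):
    """Return (from_component, to_component, seconds) for the largest gap between consecutive hops."""
    if len(hops) < 2:
        raise ValueError("need at least two hops to measure a step")
    ordered = sorted(hops, key=lambda h: h.sent_at)
    steps = [
        (a.component, b.component, b.sent_at - a.sent_at)
        for a, b in zip(ordered, ordered[1:])
    ]
    return max(steps, key=lambda s: s[2])


# Made-up trace of a door-unlock request that took about two minutes overall.
trace = [
    Hop("phone", 0.0),
    Hop("backend-auth", 0.4),
    Hop("backend-command-queue", 0.9),
    Hop("vehicle-hub", 118.2),
    Hop("door-actuator", 118.5),
]

src, dst, seconds = slowest_step(trace)
print(f"slowest step: {src} -> {dst} took {seconds:.1f}s")
```

With this data the ticket is no longer “car door unlock is slow”; it names the one step between two components where nearly all of the time went.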
Test setup
Without observability, we don’t know what went wrong, exactly. Instead, the programmer has to try to repeat the scenario and re-run the exercise until something fails, testing the software as a system. Sometimes, the problem could be the network, the wireless connection, or the router, leading to “flaky tests” or “unable to reproduce” bugs.
Observability gives us the entire trace of the software. With that trace we might see, for example, that whenever the API is called for the GETNEXTDAY function on a leap year, the API itself locks up. Don’t laugh too hard; this has caused failures at top companies.
Repeatable errors make for great automated checks. Once the check is in place (and it fails), the programmer can fix the code – and the “code is done when the test runs.” That means when a problem happens in production, the programmer can start by writing the test, then write the code to fix it, then run the regression suite, approaching a time-to-fix that is essentially continuous delivery.
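Once the failing input is known, the automated check can be very small. The sketch below is an assumption-heavy illustration: `get_next_day` is a hypothetical stand-in for the GETNEXTDAY call (here backed by Python’s `datetime`), and the dates are simply the leap-year edge cases one would pin down first.

```python
import datetime


def get_next_day(year, month, day):
    """Hypothetical stand-in for GETNEXTDAY: return the following calendar date."""
    return datetime.date(year, month, day) + datetime.timedelta(days=1)


def test_feb_28_in_leap_year_rolls_to_feb_29():
    assert get_next_day(2024, 2, 28) == datetime.date(2024, 2, 29)


def test_feb_29_rolls_to_march_1():
    assert get_next_day(2024, 2, 29) == datetime.date(2024, 3, 1)


def test_dec_31_in_leap_year_rolls_to_new_year():
    assert get_next_day(2024, 12, 31) == datetime.date(2025, 1, 1)


# Run the checks directly; a test runner such as pytest would also collect them by name.
for check in (
    test_feb_28_in_leap_year_rolls_to_feb_29,
    test_feb_29_rolls_to_march_1,
    test_dec_31_in_leap_year_rolls_to_new_year,
):
    check()
print("all leap-year checks passed")
```

Against the broken implementation these checks would fail first; they then stay in the regression suite after the fix.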
To do that, the programmer needs to know what component failed, on what input. To get that information quickly (and sometimes at all), the team needs observability.
Scaling and growth
Any simple system, such as an automobile or bicycle, is only as strong as the weakest piece. The first thing to go – a tire that ruptures or a chain that breaks – will take down the entire system. The same is true for software systems, especially for performance. A single component that gets overloaded, like a database, can bring down the entire website. Before that component breaks, it will show stress: it will get slow. The customers might not complain; if they do, customer service won’t be able to do much. Most observability tools provide a performance dashboard, so you can see what subsystems are slowing down. Sort by time to respond (average, or better yet, median), look at outliers, or even calculate deceleration – how much the module is slowing down.
This provides the data for accurate performance testing, but also the data for accurate performance improvement. Imagine reinforcing a bike chain, or replacing tires, before the incident that forces you off the road. In the case of software, we can calculate the value of the improvement using the cost of delay. That means we can calculate the return on investment of the observability project!
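The dashboard numbers mentioned above (median response time, outliers, and “deceleration”) are simple to compute from raw samples. This is a rough sketch, not how any particular tool does it: the subsystem names and timings are invented, the 3x-median outlier rule is an arbitrary choice, and deceleration is approximated as the newer half’s median minus the older half’s.

```python
from statistics import median


def summarize(samples_ms):
    """Summarize response times (milliseconds, oldest first) for one subsystem."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    mid = median(samples_ms)
    # Outliers: samples more than 3x the median (an arbitrary illustrative rule).
    outliers = [s for s in samples_ms if s > 3 * mid]
    # "Deceleration": median of the newer half minus median of the older half.
    half = len(samples_ms) // 2
    slowdown = median(samples_ms[half:]) - median(samples_ms[:half])
    return {"median_ms": mid, "outliers": outliers, "slowdown_ms": slowdown}


# Made-up response times per subsystem, oldest sample first.
metrics = {
    "auth": [40, 42, 41, 43, 40, 44],
    "database": [80, 85, 90, 400, 140, 160],
    "notifications": [25, 24, 26, 25, 27, 26],
}

reports = {name: summarize(samples) for name, samples in metrics.items()}
# Rank subsystems by how much they are slowing down, worst first.
ranked = sorted(reports, key=lambda name: reports[name]["slowdown_ms"], reverse=True)
for name in ranked:
    print(name, reports[name])
```

In this invented data set the database rises to the top of the list well before it would actually fall over.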
Another advantage of the traffic graph is a defense against man-in-the-middle and other attacks. The graph can show traffic that is leaving the website and allow you to drill down into it. A well-configured system could even alert as soon as the first invalid packet is transmitted, as it might be in a man-in-the-middle attack.
Building resilience
“High availability” is quickly becoming less of a competitive advantage and more of a cost of doing business. The way most companies get to high availability is by increasing Mean Time to Failure (MTTF), that is, by making failures rarer. That likely means delays between deploys along with more rigorous testing. That testing is, well, expensive. The company cannot capture the value of the software until it is delivered. Continuous delivery becomes impossible.
Another way to accomplish that result is to focus on reducing Mean Time To Discovery (MTTD) and Mean Time To Recovery (MTTR). A traditional Scrum team that fixes bugs every two weeks, and deploys at the end of each sprint, will be roughly three hundred times slower (two weeks is 336 hours) than a team that can find and fix defects in an hour. The team that can respond more quickly could have thirty times the defects – and yet have one-tenth the negative customer experience of the classic Scrum team.
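The arithmetic behind that comparison can be checked in a few lines. The model is deliberately crude and rests on stated assumptions: customer pain is treated as proportional to defect-hours (defects multiplied by how long each stays live), a sprint-bound defect is assumed to stay live for the full two weeks, and the baseline of 10 defects is an invented number.

```python
HOURS_PER_SPRINT = 14 * 24   # a two-week sprint is 336 hours
FAST_FIX_HOURS = 1           # the team that finds and fixes a defect within an hour

# How much slower the sprint-bound team is at removing a defect.
speed_ratio = HOURS_PER_SPRINT / FAST_FIX_HOURS


def exposure_hours(defects, hours_to_fix):
    """Defect-hours: a crude proxy for negative customer experience (assumption)."""
    return defects * hours_to_fix


baseline_defects = 10  # invented baseline for illustration
scrum_exposure = exposure_hours(baseline_defects, HOURS_PER_SPRINT)
fast_exposure = exposure_hours(baseline_defects * 30, FAST_FIX_HOURS)

print(f"speed ratio: {speed_ratio:.0f}x")
print(f"scrum team exposure: {scrum_exposure} defect-hours")
print(f"fast team exposure:  {fast_exposure} defect-hours")
print(f"fast / scrum: {fast_exposure / scrum_exposure:.3f}")
```

Under these assumptions the fast team ships thirty times the defects yet accumulates a little under one-tenth of the defect-hours.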
That one hour of downtime sounds ambitious – but imagine a dashboard that reports 500 errors, login errors, and other API errors as they appear. Not only reports but ALERTS with a text. After all, a 500 error means something is broken. This might mean more operations time, but if debugging and finger-pointing are eliminated, it probably actually means less.
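An alert of that kind can be as simple as counting server errors in a sliding time window. The sketch below is one possible shape, not a specific product’s behavior: the class name, threshold, window, and the `notify` hook (which would send the text message) are all assumptions, and it watches only 5xx responses, although login errors could be tracked the same way.

```python
from collections import deque


class ServerErrorAlert:
    """Fire an alert when too many 5xx responses land inside a sliding time window."""

    def __init__(self, threshold, window_seconds, notify):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.notify = notify      # hypothetical hook that sends the text message
        self._errors = deque()    # timestamps of recent 5xx responses

    def record(self, timestamp, status_code):
        if not 500 <= status_code <= 599:
            return
        self._errors.append(timestamp)
        # Drop errors that have aged out of the window.
        while self._errors and timestamp - self._errors[0] > self.window_seconds:
            self._errors.popleft()
        if len(self._errors) >= self.threshold:
            self.notify(f"{len(self._errors)} server errors in the last {self.window_seconds}s")
            self._errors.clear()  # avoid re-alerting on the same burst


# Made-up (timestamp in seconds, HTTP status) pairs; alerts are collected in a list here.
sent = []
alert = ServerErrorAlert(threshold=3, window_seconds=60, notify=sent.append)
events = [(0, 200), (5, 500), (20, 503), (30, 401), (45, 500), (200, 500)]
for ts, status in events:
    alert.record(ts, status)
print(sent)
```

Three server errors inside one minute trigger a single message; the isolated error later on does not.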
The bottom line
Even a simple modern web application consists of components – the web page itself, the JavaScript glue, front-end APIs, back-end APIs, third-party authentication, and more. That is a distributed system, and distributed systems have multiple points of failure. If we observe these points of failure, we can find and fix problems fast. On the other hand, if we treat the entire system like a black box, when something breaks, all we can do is poke and prod.
The lack of observability is a pattern in recent failures. For example, in January when the Notice to Air Missions failure halted all airline departures in the United States for twenty hours, no one knew exactly what went wrong. With observability, failures take moments to minutes to isolate and find. Imagine if commercial aviation had been down for an hour or thirty minutes. A few people would be late, but most of the schedules could have been made up for in the air.
Would you rather have your site down for a half-hour – or a day?
The choice is yours.