Building for Failure — Best Practices for Easy Production Debugging
Quite a few years ago, I was maintaining a database-driven system and ran into a weird production bug. The column I was reading from had a null value, but this wasn’t allowed in the code, and there was no place where that value could have been null. The database was corrupt in a bad way, and we didn’t have anything to go on. Yes, there were logs. But due to privacy concerns, you can’t log everything. Even if we could, how would we know what to look for?
Programs fail; that’s inevitable. We strive to reduce failures, but failure will happen. We also have another effort that gets less attention: failure analysis. There are some best practices and common approaches, most famously logging. I’ve often said that logs are pre-cognitive debugging, but how do we create an application that is easier to debug?
How do we build the system so that when it fails like that, we would know what went wrong?
A common military axiom goes, “Difficult training makes combat easy.” Assuming the development stage is the “training,” any work we do here will be harder as we don’t yet know the bugs we might face in production. But that work is valuable as we arrive prepared for production.
This preparation goes beyond testing and QA. It means preparing our code and our infrastructure for that point where a problem occurs. That point is where both testing and QA fail us. By definition, this is preparation for the unexpected.
Defining a Failure
We first need to define the scope of a failure. When I talk about production failures, people automatically assume crashes, websites going down, and disaster-level events. In practice, those are rare. Ops and system engineers handle the vast majority of these cases.
When I ask developers to describe the last production problem they ran into, they often stumble and can’t recall one. Then, after some discussion, it turns out they recently dealt with a bug that a customer reported in production. They had to reproduce it locally somehow or review information to fix it. We don’t think of these as production bugs, but they are. The need to reproduce failures that have already happened in the real world makes our job harder.
What if we could understand the problem just by looking at how it failed in production?
Simplicity
The rule of simplicity is common and obvious, but people use it to argue both sides. Simple is subjective. Is this block of code simple?
return obj.method(val).compare(otherObj.method(otherVal));
Or is this block simple?
var resultA = obj.method(val);
var resultB = otherObj.method(otherVal);
return resultA.compare(resultB);
In terms of lines of code, the first example seems simpler, and indeed many developers will prefer that. This would probably be a mistake. Notice that the first example includes multiple points of failure in a single line. The objects might be invalid. There are three methods that can fail. If a failure occurs, it might be unclear what part failed.
Furthermore, we can’t log the results properly. We can’t debug the code easily as we must step into individual methods. If a failure occurs within a method, the stack trace should lead us to the right location, even in the first example. Would that be enough?
Imagine if the methods we invoked there changed state. Was obj.method(val) invoked before otherObj.method(otherVal)?
With the second example, this is instantly visible and hard to miss. Furthermore, the intermediate state can be inspected and logged as the values of resultA and resultB.
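As a rough sketch of the second style, here is how the intermediate values could be logged. The CompareStep class and the normalize helper are placeholders I made up to stand in for obj.method(val); java.util.logging is used only because it ships with the JDK.

```java
import java.util.logging.Logger;

class CompareStep {
    private static final Logger LOG = Logger.getLogger(CompareStep.class.getName());

    // Hypothetical stand-in for obj.method(val) in the example above.
    static String normalize(String raw) {
        return raw.trim().toLowerCase();
    }

    static int compareNormalized(String left, String right) {
        // Each call gets its own line, so the order of evaluation is explicit
        // and a stack trace points at exactly one call.
        var resultA = normalize(left);
        var resultB = normalize(right);

        // The intermediate state is available for logging or a breakpoint.
        // Only log values that are safe to retain (no personal data).
        LOG.fine(() -> "resultA=" + resultA + ", resultB=" + resultB);

        return resultA.compareTo(resultB);
    }
}
```

The supplier form of LOG.fine only builds the message when that log level is enabled, so the extra visibility costs very little when it is switched off.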
Let’s inspect a common example:
var result = list.stream()
    .map(MyClass::convert)
    .collect(Collectors.toList());
That’s pretty common code, and it is roughly equivalent to this loop:
var result = new ArrayList<OtherType>();
for (MyClass c : list) {
    result.add(c.convert());
}
There are advantages to both approaches in terms of debuggability, and our decision can have a significant impact on long-term quality. A subtle difference appears if we end the stream with Stream.toList() (Java 16+) or Collectors.toUnmodifiableList() instead: the returned list is then unmodifiable. (Collectors.toList() itself makes no guarantee about mutability.) This is a boon and a problem. Unmodifiable lists fail at runtime when we try to change them. That’s a potential risk of failure. However, the failure is clear. We know what failed.
A change to the list returned by the second example can create a cascading problem but might also solve a problem without failing in production.
Which should we pick?
The read-only list is the better choice. It promotes the fail-fast principle, which is a major advantage when debugging a production issue. When failing fast, we reduce the probability of a cascading failure. Those are the worst failures we can get in production as they require a deep understanding of the application state, which is complex in production.
When building big applications, the word “robust” gets thrown around frequently. Systems should be robust, but that robustness should come from the layers around your code. The code itself should fail fast.
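To make the fail-fast difference concrete, here is a minimal sketch that assumes Java 16 or newer, where Stream.toList() returns an unmodifiable list. The FailFastDemo class and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FailFastDemo {
    // Stream.toList() (Java 16+) returns an unmodifiable list.
    static List<Integer> lengthsUnmodifiable(List<String> words) {
        return words.stream().map(String::length).toList();
    }

    // The loop version returns a plain, modifiable ArrayList.
    static List<Integer> lengthsModifiable(List<String> words) {
        var result = new ArrayList<Integer>();
        for (String word : words) {
            result.add(word.length());
        }
        return result;
    }

    // Returns true when a write is rejected immediately instead of
    // silently changing state that other code may depend on.
    static boolean rejectsWrites(List<Integer> list) {
        try {
            list.add(0);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

With the unmodifiable list, the unexpected write fails on the exact line that attempted it. With the ArrayList, the write succeeds and any damage shows up later, somewhere else.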
Consistency
In my talk about logging best practices, I mention that every company I ever worked for had a style guide for code or at least aligned with a well-known style. Very few had a guide for logging, where we should log, what we should log, etc. This is a sad state of affairs.
We need consistency that goes deeper than code formatting. When debugging, we need to know what to expect. If specific packages are prohibited, I expect this to apply to the entire code base. If a specific practice in coding is discouraged, I’d expect this to be universal.
Thankfully, with CI, these consistency rules are easy to enforce without burdening our review process. Automated tools such as SonarQube are pluggable and can be extended with custom detection code. We can tune these tools to enforce our own rules, for example limiting the use of an API to a particular subset of the code or requiring proper logging.
Every rule has an exception, and we shouldn’t be bound to overly strict rules. That’s why the ability to override such tools and merge a change with a developer review is important.
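To give a feel for what such a rule checks, here is a deliberately naive sketch in plain Java. It is not a SonarQube plugin and does not use any SonarQube API; the ProhibitedImportCheck class is a made-up stand-in that a CI step could run over source files to flag a prohibited package.

```java
import java.util.List;
import java.util.stream.Collectors;

class ProhibitedImportCheck {
    // Returns every import line that pulls in the prohibited package prefix.
    // Naive on purpose: it works on raw text and ignores "import static",
    // comments, and fully qualified names used inline.
    static List<String> violations(List<String> sourceLines, String prohibitedPrefix) {
        return sourceLines.stream()
            .map(String::trim)
            .filter(line -> line.startsWith("import " + prohibitedPrefix))
            .collect(Collectors.toList());
    }
}
```

A real static analysis tool works on the parsed syntax tree rather than on text, which is why extending an existing tool is usually the better route. An override mechanism, such as an allow-list of files, would cover the exceptions mentioned above.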
Double Verification
Debugging is the process of verifying assumptions as we circle the area of the bug. Typically, this happens very quickly. We see what’s broken, verify, and fix it. But sometimes, we spend an inordinate amount of time tracking a bug—especially a hard-to-reproduce bug or a bug that only manifests in production.
As a bug becomes elusive, it’s important to take a step back. Usually, it means that one of our assumptions was wrong. In this case, it might mean that how we verified the assumption was faulty. Double verification aims to test the assumption that failed using a different approach to ensure the result is correct.
Typically, we want to verify both sides of the bug. For example, let’s assume I have a problem in the backend that expresses itself in the frontend, where the data is incorrect. To narrow down the bug, I start with two assumptions:
- The frontend displays the data correctly from the backend
- The database query returned the right data
I can open a browser to verify these assumptions and look at the data. I can inspect responses with the web developer tools to ensure the data displayed is what the server query returned. For the backend, I can issue the query directly against the database and see if the values are correct.
But that’s only one way of verifying this data. Ideally, we would want a second way. What if a cache returned the wrong result? What if the SQL made the wrong assumption?
The second way should ideally be different enough that it wouldn’t simply repeat the failures of the first way. Our knee-jerk reaction would be to try a tool like cURL for the frontend code. That’s good, and we probably should try that. But a better way might be to look at logged data on the server or invoke the web service that underlies the frontend.
Similarly, we would want to see the data returned from within the application for the backend. This is a core concept in observability. An observable system is a system for expressing questions and getting answers. During development, we aim our observability level at two ways to answer a question.
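The idea of answering the same question in two independent ways can be sketched in a few lines. The DoubleVerification class below is hypothetical; the two suppliers stand in for whatever two routes we have to the same value, e.g., a direct database query and the value the application actually served.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.function.Supplier;

class DoubleVerification {
    // Asks the same question through two independent routes and reports
    // a mismatch. An empty result means both checks agree.
    static <T> Optional<String> mismatch(String assumption,
                                         Supplier<T> firstWay,
                                         Supplier<T> secondWay) {
        T first = firstWay.get();
        T second = secondWay.get();
        if (Objects.equals(first, second)) {
            return Optional.empty();
        }
        // The message includes both values, so keep personal data out of it.
        return Optional.of(assumption + ": first check saw " + first
            + ", second check saw " + second);
    }
}
```

A mismatch doesn’t tell us which route is wrong, only that one of our assumptions about them is. That is usually enough to know where to look next.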
Why Not Three Ways To Verify?
We don’t want more than two ways because that would mean we’re observing too much, and as a result, our costs can go up while performance goes down. We need to limit the information we collect to a reasonable amount, especially given the risks of personal information retention, which is an important aspect to keep in mind!
Observability is often defined through its tools, pillars, or similar surface area features. This is a mistake. Observability should be defined by the access it provides us. We decide what to log and what to monitor. We decide the spans of the traces. We decide the information’s granularity and whether we wish to deploy a developer observability tool.
We need to make sure that our production system will be properly observed. We need to run failure scenarios and possibly chaos game days to do that. When running such scenarios, we need to think about the process of solving the issues that come up. What sort of questions would we have for the system? How could we answer such a question?
For example, when a particular problem occurs, we often want to know how many users are actively modifying data in the system. As a result, we can add a metric for that information.
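A minimal sketch of such a metric follows, using only the JDK. The ActiveEditorsMetric class and its method names are assumptions for illustration; in a real system, the value would typically be exposed as a gauge through whatever metrics library is already in place.

```java
import java.util.concurrent.ConcurrentHashMap;

class ActiveEditorsMetric {
    // Number of in-flight edits per user; a user is "active" while present.
    private final ConcurrentHashMap<String, Integer> inFlight = new ConcurrentHashMap<>();

    // Call when a user starts modifying data.
    void editStarted(String userId) {
        inFlight.merge(userId, 1, Integer::sum);
    }

    // Call when the modification completes (ideally from a finally block).
    // Returning null from the remapping function removes the entry.
    void editFinished(String userId) {
        inFlight.computeIfPresent(userId, (id, count) -> count > 1 ? count - 1 : null);
    }

    // The value we would publish as a gauge.
    int activeUsers() {
        return inFlight.size();
    }
}
```

Only the count leaves this class. The user IDs stay in memory and are dropped as soon as the edit finishes, which keeps the metric clear of personal information retention.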
Verifying With Feature Flags
We can verify an assumption using observability tools, but we can also use more creative verification tools. One unexpected tool is the feature flag system. A feature flag solution can often be manipulated with very fine granularity. For example, we can disable or modify a feature only for a specific user.
This is very powerful. If a specific piece of code is wrapped in a flag, toggling that flag can verify a specific behavior. I don’t suggest spreading feature flags all over the code, but the ability to pull levers and change the system in production is a powerful debugging tool that is often underutilized as such.
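Here is a small in-memory sketch of what per-user granularity means. The FeatureFlags class is made up for illustration and isn’t modeled on any particular feature flag product; real solutions add persistence, targeting rules, and an admin UI on top of the same idea.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FeatureFlags {
    private final Map<String, Boolean> defaults = new ConcurrentHashMap<>();
    private final Map<String, Map<String, Boolean>> perUser = new ConcurrentHashMap<>();

    void setDefault(String flag, boolean enabled) {
        defaults.put(flag, enabled);
    }

    // Overrides the flag for a single user, leaving everyone else untouched.
    void overrideForUser(String flag, String userId, boolean enabled) {
        perUser.computeIfAbsent(flag, f -> new ConcurrentHashMap<>()).put(userId, enabled);
    }

    // A per-user override wins over the default; unknown flags are off.
    // The userId is expected to be non-null.
    boolean isEnabled(String flag, String userId) {
        Boolean override = perUser.getOrDefault(flag, Map.of()).get(userId);
        if (override != null) {
            return override;
        }
        return defaults.getOrDefault(flag, false);
    }
}
```

If a customer reports stale data and we suspect a cache that sits behind a flag, we can switch it off for that one user and see whether the problem disappears, without affecting anyone else.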
Bug Debriefs
Back in the 90s, I developed flight simulators and worked with many fighter pilots. They instilled in me a culture of debriefing. Until then, I thought of these things only for discussing failures, but fighter pilots go to debrief immediately after the flight, whether a successful or failed mission.
There are a few important points we need to learn here:
- Immediate — we need this information fresh in our minds. Some things get lost if we wait, and our recollection changes significantly.
- On success and failure — Every mission gets things right and wrong. We must understand what went wrong and what went right, especially in successful cases.
When we fix a bug, we just want to go home. We often don’t want to discuss it anymore. Even if we want to “show off,” it’s often our broken recollection of the tracking process. By conducting an open discussion of what we did right and wrong, with no judgment, we can create an understanding of our current status. This information can then be used to improve our results when tracking issues.
Such debriefs can highlight gaps in our observability data, inconsistencies, and problematic processes. A common problem in many teams is indeed in the process. When an issue is raised, it is often:
- Encountered by the customer
- Reported to support
- Checked by ops
- Passed to R&D
If you’re in R&D, you’re four steps away from the customer and receive an issue that might not include the necessary information. Refining these processes isn’t a part of the code, but we can include tools within the code to make it easier to locate a problem. A common trick is to add a unique key to every exception object. This propagates to the UI in case of a failure.
When a customer reports an issue, there’s a good possibility they will include the error key, which R&D can find within the logs. These are the types of process refinements that often arise through such debriefs.
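A minimal sketch of the error key trick might look like this. The TrackedException class and its method names are assumptions for illustration; the key here is a random UUID, but any unique, non-sensitive identifier would do.

```java
import java.util.UUID;

class TrackedException extends RuntimeException {
    // Generated once per exception instance and shared by the log and the UI.
    private final String errorKey = UUID.randomUUID().toString();

    TrackedException(String message, Throwable cause) {
        super(message, cause);
    }

    String getErrorKey() {
        return errorKey;
    }

    // Goes to the server log, together with the technical details.
    String toLogLine() {
        return "errorKey=" + errorKey + " message=" + getMessage();
    }

    // Goes to the customer: the key only, no internal details.
    String toUserMessage() {
        return "Something went wrong. Please quote error key " + errorKey
            + " when contacting support.";
    }
}
```

The customer sees only the key, support passes it along unchanged, and R&D can search the logs for that exact string regardless of how many hands the report went through.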
Review Successful Logs and Dashboards
Waiting for failure is a problematic concept. We need to review logs, dashboards, etc., regularly to track potential bugs that aren’t manifesting and to get a sense of a “baseline.” What does a healthy dashboard or log look like?
We have errors in a normal log. If, during a bug hunt, we spend time looking at a benign error, then we’re wasting our time. Ideally, we want to minimize the number of these errors as they make the logs harder to read. The reality of server development is that we can’t always do that. But we can minimize the time spent on this through familiarity and proper source code comments.
I went into more detail in the logging best practices post and talk.
Final Word
A couple of years after founding Codename One, our Google App Engine bill suddenly jumped to a level that would trigger bankruptcy within days. This was a sudden regression due to a change on their backend.
This was caused by uncached data, but due to the way App Engine worked at the time, there was no way to know the specific area of the code triggering the problem. There was no ability to debug the problem, and the only way to check if the issue was resolved was to deploy a server update and wait a long time.
We solved this through dumb luck, by caching everything we could think of in every single place. To this day, I don’t know what triggered the problem and what solved it.
What I do know is this:
I made a mistake when I decided to pick “App Engine.” It didn’t provide proper observability and left major blind spots. Had I taken the time before the deployment to review the observability capabilities, I would have known that. We lucked out, but I could have saved much of our cash early on had we been more prepared.
Building for Failure — Best Practices for Easy Production Debugging was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.