Recently I asked myself what, apart from switching from PHP to Kotlin, had changed my view of programming the most. The answer is observability. I’ll share my journey with the concept, along with some experiences and recommendations for anyone who wants to get started.
What is observability?
There are various definitions of observability found on the Internet, but I would like to provide an analogy to explain my understanding of it.
Imagine a carrot field with numerous ready-to-be-harvested carrots. As a farmer, your goal is to make a profit by selling these carrots. Therefore, it would be detrimental if the carrots became sick or were stolen. A wise farmer would regularly inspect the condition of the carrots and apply appropriate remedies if any symptoms arise. However, is this approach truly effective? Not quite. It requires significant resources, and the larger the field, the more resources it consumes (in this case, the resource being time).
Now, envision a scenario where instead of manually checking the condition of each carrot, they could be planted in special pots that automatically report their condition to a central software system. By having constant and ubiquitous information about the condition of the carrots, the farmer would save a considerable amount of time. This is what we mean by monitoring the field.
As the farm expands, more fields are maintained, and it’s no longer solely about carrots. The farmer now desires to understand the overall, aggregated condition of all the fields on the farm. Achieving this would be challenging if we focused solely on individual fields, but it becomes effortless if we adopt a top-down approach and consider the entire IT system as a whole. That, to me, is the essence of observability — looking at the entire IT system from a higher-level perspective.
Why has it changed my life as a developer?
A decade ago, I was a farmer with one field, and monitoring was more than enough to maintain the few modular monoliths I created and worked with. Some logs in a text file were all I needed. The systems I work with now are much more complex, and running them as a single monolith would be a very bad idea. I wouldn’t have learned about any of this without the tremendous help of my coworker, who pushed hard to adopt the concept within our company. My team was the first to play with the new toys, and the best part is that you never get bored playing with them. Every day you can learn and discover new ideas for working with the spans and traces you collect from your microservices.
My favorite toy is the OpenTelemetry framework. I can’t tell you how much I appreciate the work the project is doing for the entire IT industry. Integrations are ready or almost ready for the major languages currently used in web and mobile development. Thanks to distributed tracing, we can capture spans all the way from the user’s click in the mobile/front-end app to asynchronous event processing on a message broker. Reconstructing the same story manually would probably take several hours of finding and matching the right logs (if they were all stored in the first place).
Of course, collecting traces isn’t enough; we also need a tool to help us understand the raw data. We decided to use Honeycomb.io, and it’s the last toy that changed my life as a developer. Being able to query any field within a chosen time range without preparing special indexes is awesome. It lets you build very complex queries and execute them quickly. It saves many hours of debugging and usually helps to identify the broken microservice in just a few minutes.
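To make the idea of one trace spanning several services a bit more concrete, here is a minimal sketch of how a trace ID can travel between services. It follows the layout of the W3C Trace Context `traceparent` header (version, trace ID, parent span ID, flags). The function names are hypothetical and exist only for illustration; in practice the OpenTelemetry SDKs and agents do this propagation for you.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: version 00, random trace and span IDs, 'sampled' flag 01."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Continue the caller's trace: keep its trace ID, mint a fresh span ID."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(traceparent: str) -> str:
    """Extract the trace ID, the value that ties all spans of one request together."""
    return traceparent.split("-")[1]


# The user's click in the mobile app starts the trace...
mobile_header = new_traceparent()
# ...the backend API continues it when it receives the header...
api_header = child_traceparent(mobile_header)
# ...and so does the consumer that picks the event up from the message broker.
consumer_header = child_traceparent(api_header)
```

Every hop gets its own span ID, but all three headers share the same trace ID, and that shared ID is what lets a tracing backend stitch the spans into a single story without anyone matching logs by hand.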
Having all the toys together has drastically reduced the resolution time for any production issue in my company. I’ve gotten so used to having the right observability setup that it would be very hard to take it off my production readiness list.
Do I need to adopt the concept in my team/company?
As usual, it depends. If your company is a small business running a few services that rarely communicate with each other, the benefits are probably not that great. Even then, the tools can help you build proper SLIs/SLOs and set up alerting on top of them, which in my opinion works much better than alerting based on raw metrics alone.
If you are running multiple microservices that communicate with each other, even if they are written in different languages, observability is the way to go. The only downside is that you need to add the OpenTelemetry APIs/agents to all of them so that no spans are missing from your traces. It is best to start with a non-critical, preferably standalone, service and then systematically add the integration to more critical ones. This approach can help you avoid unnecessary production disruptions while you learn how to work with the new tools.
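The SLIs/SLOs mentioned above boil down to fairly simple arithmetic once you can count requests and failures from your traces. The sketch below is only an illustration under my own assumptions: the function names, the 99.9% target and the 25% alerting threshold are hypothetical, not something any particular tool prescribes.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Share of requests that succeeded (the SLI)."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is broken."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def should_alert(total_requests: int, failed_requests: int,
                 slo_target: float, min_budget: float = 0.25) -> bool:
    """Alert when less than `min_budget` of the error budget is left (assumed threshold)."""
    return error_budget_remaining(total_requests, failed_requests, slo_target) < min_budget
```

For example, with 100,000 requests, 50 failures and a 99.9% target, the budget is about 100 failed requests, so roughly half of it is still left and no alert fires. At 90 failures only about 10% remains and `should_alert` returns `True`.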
Okay, so this all sounds great, but what are the downsides?
Tools like Honeycomb.io can get expensive if you have a lot of traffic. Sending every trace directly to the API is probably not the best idea. To reduce the volume, you’ll need to introduce sampling. This can be done in several ways, such as using an OpenTelemetry Collector and/or Honeycomb Refinery. You can define sampling rules and ratios to focus on the important traces and drop those that are irrelevant or redundant. For example, you can keep every trace whose HTTP response status is >=400 (dropping 0% of them) while dropping 80% of everything else. With this setup, you can still identify the errors and track the problematic samples without losing much context from the successful traffic. For some small integrations, it may be sufficient to use sampling within the SDKs.
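The sampling rule from the example can be sketched in a few lines. I am reading the ratios as drop ratios here, so none of the error traces are dropped and 80% of the rest are. That reading, the constant names and the function name are my assumptions. This is not how the OpenTelemetry Collector or Refinery are actually configured; it only shows the decision they would make for you.

```python
import hashlib

ERROR_KEEP_RATIO = 1.0    # assumption: keep every trace that ended with status >= 400
DEFAULT_KEEP_RATIO = 0.2  # assumption: drop 80% of all other traces


def keep_trace(trace_id: str, http_status: int) -> bool:
    """Decide whether a finished trace should be sent to the backend."""
    ratio = ERROR_KEEP_RATIO if http_status >= 400 else DEFAULT_KEEP_RATIO
    # Hash the trace ID into a 64-bit bucket so that the verdict is deterministic:
    # every component that sees the same trace ID makes the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(ratio * 2**64)
    return bucket < threshold
```

Because the rule depends on the response status, the decision can only be made once the request has finished. That is why this kind of rule usually lives in a collector or in Refinery rather than at the start of a span inside the SDK.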
Another fairly obvious drawback is the increased complexity of the architecture. You have to use several external libraries that need to be maintained. You may need to integrate the collector/refinery into your infrastructure and maintain it like any other microservice. Especially at the beginning of the journey, you may face several issues that need your attention.
You also need to be careful, as with logging, not to expose sensitive data within the traces. Writing proper rules within the services or collector is a must.
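As an illustration of what such a rule might look like inside a service, here is a small sketch that scrubs span attributes before they are exported. The attribute keys, the placeholder strings and the email pattern are hypothetical examples; a real setup would use whatever list your own data requires, and the same kind of rule can also be expressed in the collector’s configuration.

```python
import re

# Hypothetical attribute keys that must never leave the service.
SENSITIVE_KEYS = {"password", "authorization", "user.email", "credit_card"}

# Deliberately simple pattern; real-world email matching is messier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub_attributes(attributes: dict) -> dict:
    """Return a copy of the span attributes that is safe to export."""
    clean = {}
    for key, value in attributes.items():
        if key.lower() in SENSITIVE_KEYS:
            # Known sensitive key: replace the whole value.
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Free text: mask anything that looks like an email address.
            clean[key] = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
        else:
            clean[key] = value
    return clean
```

Calling `scrub_attributes({"password": "hunter2", "note": "contact bob@example.com now"})` returns `{"password": "[REDACTED]", "note": "contact [REDACTED_EMAIL] now"}`, while non-sensitive attributes pass through unchanged.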
Observability may be a success indicator
Observability is a critical concept in modern systems engineering and software development that focuses on gaining insight into complex, distributed systems. It goes beyond traditional monitoring approaches by emphasizing the ability to understand and reason about systems’ internal states and behaviors from the outside. By employing a combination of monitoring, logging, tracing, and metrics, observability enables engineers to gain a holistic understanding of system performance, identify and troubleshoot issues, and make informed decisions to improve system reliability, efficiency, and user experience. In today’s increasingly interconnected and dynamic technological landscape, observability plays a vital role in enabling organizations to build and maintain robust, scalable, and resilient systems. If your company is struggling to resolve production issues quickly, or if it’s hard for your teams to find them in the first place, give it a try and improve customer satisfaction for good.
How Observability Changed My (developer) Life was originally published in Better Programming on Medium.