Observability

Observability is both a quality attribute or property of complex systems, and it is also a set of strategies and tools for understanding and managing complex systems. Observability is a subset of visibility and complements traditional monitoring strategies.

As a quality attribute, the observability of a system can be improved in many ways. If you add more logs and metrics, you will improve observability. And if you train your team in using monitoring tools, you will improve observability that way, too.

But more commonly we talk about observability as a set of tools and strategies that provide data and capabilities for us to be able to ask questions about our systems – questions that, perhaps, we didn’t know we needed to ask from outside the system. With observability tools, you can discover the questions you want to ask about your system by exploring your system’s telemetry data.

This is done by using telemetry data to analyze how the internal state of a system changes over time, and to diagnose unexpected behaviors. The goal is to gain insights into how and why a system behaves in certain ways, enabling deeper analysis and supporting troubleshooting activities.

While monitoring helps us to deal with the known unknowns, observability helps us to discover the unknown unknowns.

Observability strategies originated in distributed software design. In complex systems like distributed ones, failures often result from complex interactions between components rather than from isolated events. Distributed software fail in new and unexpected ways all the time. You are often dealing with "unknown unknowns", because the system is constantly evolving, with different teams incrementing different components of the system at different times. It is impossible to predict every way that such systems might fail.

In such complex systems, it is necessary to collect and analyze data from multiple sources to gain a complete view of overall system behavior.

Observability has been described as the business intelligence of developer tools. Specifically, observability encompasses three types of tools – known as the three pillars of observability:

  • Logging: Recording discrete events or states.

  • Metrics: Aggregating numerical data points over time, and monitoring changing patterns in those metrics.

  • Tracing: Tracking the flow of requests through the system.

Observability strategies may emphasize one or more of these pillars, depending on the specific needs of the system. Metrics are ideal for recording event occurrences, item counts, action durations, or reporting the status of resources like CPU and memory. Logs are more suited to capturing detailed narratives of events, especially for errors and warnings. Traces offer a window into the journey of individual request-response cycles, and are particularly beneficial to track changing state through asynchronous messaging systems.

In a well-understood system, it may be sufficient to prioritize high quality metrics over detailed (and expensive) logs and complete traces. But the more complex the system, the more difficult it is to anticipate all the information you will need, and therefore to define the required metrics up-front. Therefore, visibility of such systems is going to lean more heavily on logs and traces, and less on monitoring and metrics.

Tracing is the solution to the difficulties associated with monitoring and debugging distributed software. Tracing helps us to understand the cause and effect of events in complex distributed software. What effect is there of a component publishing an event on the rest of the system? Likewise, how do you understand what happened upstream?

For this reason, the most important aspect of observability in a distributed, asynchronous system is distributed tracing. This is the most important of the three pillars.

In most use cases, some combination of all three strategies – metrics, logging, and tracing – will be required. Add in traditional monitoring strategies such as application performance management, and you will have comprehensive visibility of the internal workings of a system. For example, you should be able to detect problems through metrics, diagnose them through logs, and use traces to understand the flow of requests that led to the issue. Without a combination of all three strategies, you may miss important context or details that help you to understand the underlying causes of issues.

The relative balance between the three strategies – metrics, logging, and tracing – will change over time, and they will each inform and shape the others. For example, imagine that a business-critical database, in the order processing services, goes offline. You diagnose the issue using logs, and discover that the database has run out of disk space. You fix the issue and restart the database. Then you put in place monitoring and alerting on metrics for disk usage, so you can proactively resolve the issue before it happens next time.

Besides the trade-offs between the three pillars of observability, there are also trade-offs to be made in the depth and breadth of the telemetry data collected. More data will provide greater insights, but with additional cost for storage and processing. In large-scale system, you may not be able to afford capturing every piece of available data, else you could end up with a US$65m bill. It may be necessary to strike a reasonable balance between the cost of collecting and analyzing some kinds of telemetry data, and the value it provides.

The best approach, usually, is to start with a basic set of metrics, logs, and traces, and then iterate as you learn more about your system and its behavior. This [iterative] approach to monitoring and observability will help you to refine your strategy over time, and to ensure that you are collecting the most valuable data to understand and optimize your system.

Observability tools

  • Elastic — Elastic is a suite of tools for search, analytics, and observability.

  • Honeycomb — Observability service.

  • Splunk — Now owned by Cisco.