At Interlock we focus on customer experience above all – our service’s availability and performance are our top priority. That requires a strong culture of observability across our teams and systems.
As a result, we invest a lot in the reliability of our application. But unpredictable failures are inevitable, and when they happen it’s humans who fix them.
We operate a socio-technical system, and its ability to recover when faced with adversity is called resilience. One of the crucial components of resilience is observability, the steps we take to enable humans to “look” inside the systems they run.
This post will explore the road to building a stronger culture of observability, and the lessons we’ve learned along the way.
What do we mean by observability?
At Interlock, we ship to learn. Our production environment is where our code, infrastructure, third-party dependencies, and our customers come together to create an objective reality – it’s the only place to learn and validate the impact of our work. We define observability as a continuous process of humans asking questions about production, and getting answers*.
Let’s break that down a little more:
Continuous process: Successful observability means that folks observe as frequently as possible.
Questions about production: We wanted our definition to be wide, generic, and representative of the broad scope of workflows we cater for.
Answers*: Note the asterisk. No tool will give you answers, only offer leads you can follow to find the real answers. You have to use your own mental models and understanding of the systems you run.
Stage 1: Problem and solution
Armed with our own definition of observability, we assessed our existing practices and formulated a problem statement. Until recently, our observability tooling was primarily based on metrics. A typical workflow involved looking at a dashboard full of charts, with metrics sliced and diced by various attribute combinations. Folks would look for correlations but often leave without a satisfying answer.
“Metrics are easy to add and understand, but they are missing high-cardinality attributes (e.g. Customer ID), making it difficult to complete an investigation.”
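To make that limitation concrete, here is a minimal sketch in Python. It is not our production tooling: the event fields (`endpoint`, `status`, `customer_id`) and the `count_by` helper are hypothetical names chosen for illustration. It compares a metric-style aggregate, which keeps only low-cardinality labels, with the same data when the customer ID is retained.

```python
from collections import Counter

# Hypothetical request events; field names and values are illustrative only.
events = [
    {"endpoint": "/inbox", "status": 200, "customer_id": "c-1"},
    {"endpoint": "/inbox", "status": 200, "customer_id": "c-2"},
    {"endpoint": "/inbox", "status": 500, "customer_id": "c-3"},
    {"endpoint": "/inbox", "status": 500, "customer_id": "c-3"},
    {"endpoint": "/search", "status": 200, "customer_id": "c-1"},
]


def count_by(events, *keys):
    """Count events grouped by the given attributes, like a counter metric with labels."""
    return Counter(tuple(event[key] for key in keys) for event in events)


# Metric view: only low-cardinality labels survive aggregation.
# It shows that /inbox is returning errors, but not who is affected.
metric = count_by(events, "endpoint", "status")

# Event view: the customer ID is still attached, so the next question has an answer.
errors = [event for event in events if event["status"] == 500]
errors_by_customer = count_by(errors, "customer_id")

# Adding customer_id as a metric label would multiply the number of series
# by the number of customers, which is why metrics usually leave it out.
series_with_customer = count_by(events, "endpoint", "status", "customer_id")
```

In this toy data the metric view can say that `/inbox` had two errors, while only the event view can say that both came from the same customer. With a real customer base, the number of label combinations in `series_with_customer` grows with every new customer, which is the trade-off the quote above describes.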