Data downtime is a hot topic in data at the moment, and for obvious reasons. The cost of data downtime – a term coined by Monte Carlo to refer to periods when data is partial, erroneous, missing or otherwise inaccurate – can be significant for companies that rely on behavioural data for decision making.
If making key strategic decisions based on inaccurate data, or wasting valuable time finding and diagnosing data issues, sounds commonplace, then your company suffers from data downtime.
But how exactly does data downtime occur? And what can we do to eliminate it?
A real-life example of data downtime at Acme
Every Monday morning at 9am, a weekly strategy meeting takes place at Acme, with attendees dialling in from around the world. Ralph, the SVP of Commerce, runs through the numbers for the past week, and key decisions are made for the week and month ahead. The report includes data from multiple sources: online and offline sales, payments, promotions and so on.
The report lands in Ralph’s inbox ahead of the meeting every Monday, giving him time to look through the data and prepare. This week, however, there is a problem. Ralph believes the numbers look off; he was expecting much better performance last week. He sends an urgent email to the entire Data team questioning the accuracy of the data and requesting that it be resolved as soon as possible.
The team frantically tries to find the problem. Was Ralph correct? Is the data inaccurate, or is data missing? The matter is made worse by the complex data stack at Acme: multiple sources, pipelines, data modelling jobs and siloed teams, all feeding this one business-critical report. Where is the issue or the bottleneck?
It takes valuable time to find, root-cause and resolve the issue, and by the time it is fixed, the weekly strategy meeting has already taken place. Ralph has lost confidence in the data, and so has the rest of the global Leadership team, who also had to fly blind in this week’s strategy meeting.
The spiralling cost of detecting data quality issues too late
This kind of scenario is not uncommon, but what is most damaging is how far downstream the data quality issue is detected, which makes it significantly more costly to Acme. It is far better to spot, debug and resolve an issue at the point it occurs, as far upstream as possible, in order to minimise data downtime.
You do not want Ralph, or any other data consumer, spotting your data quality issues, or worse, using incomplete or inaccurate data to drive key decisions. Once the data is downstream and being used by a plethora of data consumers, the damage is difficult to contain, at least without eroding trust in the data.
The same applies to data used in real time. If your product recommendation engine isn’t using the freshest data, your users will be served outdated recommendations, negatively impacting the user experience and harming your bottom line.
The need for data observability
The challenge outlined above is the exact problem that data observability aims to fix. Data observability gives you transparency and control over the health of your data pipeline, such that when an issue does occur you can quickly understand:
- Where is the problem?
- Who needs to resolve it?
Knowing this makes it possible to find and resolve issues far more quickly and minimise data downtime.
But how is data observability any different from monitoring?
The best way to describe the difference is that monitoring covers the ‘known unknowns’, whereas observability covers the ‘unknown unknowns’.
To take one example: as a Data Engineer, I know that I need to monitor the CPU usage of a microservice. But what is the complete landscape of things that could go wrong that could impact the delivery of complete and accurate data?
It is impossible to predict every issue that could arise, and this is where observability steps in. Data observability assumes we don’t know all the questions to ask, and instead gives us visibility into the things that really matter, so that when something does go wrong we can investigate and resolve it quickly.
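To make the distinction concrete, here is a minimal, hypothetical Python sketch (not tied to any particular tool, and the names and threshold are illustrative): the monitoring check answers one pre-defined question, while the observability-style telemetry records rich context for every unit of work so that questions we did not anticipate can still be answered later.

```python
import time

# Monitoring: a check for a 'known unknown' – a question we knew to ask up front.
CPU_ALERT_THRESHOLD = 90.0  # hypothetical alert threshold, in percent

def check_cpu(cpu_percent):
    """Return alerts only if CPU usage crosses the pre-defined threshold."""
    if cpu_percent > CPU_ALERT_THRESHOLD:
        return [f"CPU usage high: {cpu_percent:.1f}%"]
    return []

# Observability: emit rich, structured telemetry for every batch of work, so
# questions we did NOT anticipate can still be answered later by querying it.
def emit_telemetry(stage, records_in, records_out, duration_ms):
    """Build a telemetry event; in practice this would be shipped to a backend."""
    return {
        "stage": stage,
        "records_in": records_in,
        "records_out": records_out,   # records_in - records_out = records lost
        "duration_ms": duration_ms,
        "emitted_at": time.time(),
    }
```

With a fixed threshold you only ever learn about high CPU; with the telemetry event you can later ask, say, which stage silently dropped records last Sunday night, even though nobody wrote a check for that in advance.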
Our approach to observability at Snowplow
What Ralph and the rest of the business really care about is whether they can trust the data: is it complete, and is it fresh? Crucially, data observability should tie these technical signals to business outcomes, so that business and engineering teams speak the same language and move in the same direction.
Our approach to observability at Snowplow is to focus on two key metrics: throughput and latency. These are emitted from each part of the pipeline, so if a bottleneck occurs at any point it is far simpler to diagnose, and you can take corrective action immediately.
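As an illustration of the idea (a hypothetical Python sketch, not Snowplow's actual implementation), each pipeline stage can time the batch it processes and report throughput alongside latency; comparing the figures across stages then points at the bottleneck:

```python
import time
from dataclasses import dataclass

@dataclass
class StageMetrics:
    stage: str
    records_processed: int
    elapsed_s: float  # latency for this batch

    @property
    def throughput(self) -> float:
        # Records per second through this stage; 0 if nothing was timed.
        return self.records_processed / self.elapsed_s if self.elapsed_s else 0.0

def run_stage(stage, records, process):
    """Process a batch through one pipeline stage and emit its metrics."""
    start = time.monotonic()
    out = [process(r) for r in records]
    metrics = StageMetrics(stage, len(out), time.monotonic() - start)
    return out, metrics
```

If, say, an "enrich" stage reports markedly lower throughput than the stages on either side of it, that is where to start investigating.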
Our plan is to make Snowplow the most observable behavioural data pipeline available. We have already added observability to our BigQuery Loader, and we’ll soon be launching it within our RDB Loader and Enrich assets too.
This is a 5-part series
Chapter 1 The challenges of working in data in 2021
Chapter 2 A guide to data team structures with examples
Chapter 3 Breaking communication barriers with a universal language
Chapter 4 Reducing data downtime with data observability
Chapter 5 How data storytelling can make your insights more effective