Snowplow, the global leader in customer data infrastructure (CDI) for AI, enables every organization to own and unlock the value of its customer behavioral data to fuel AI-driven marketing, digital products and services, customer experiences, and fraud mitigation.
One of the features that makes Snowplow unique is that we actually report bad data: any data that hits the Snowplow pipeline and fails to be processed successfully. This is incredibly valuable, because it means you can:
Quickly spot data tracking issues as they emerge, and address them at source
Have a correspondingly high degree of confidence that trends in the data reflect trends in the business and not data issues
Recently we extended Snowplow so that you can load bad data into Elasticsearch. In this guide, we will walk through how to use Kibana and Elasticsearch to:
Monitor the number of bad rows
Spot problems that emerge
Quickly diagnose the root causes of the issues, so that they can be addressed upstream
The Kibana discover UI: a great interface for diagnosing bad data
The Kibana discover interface provides a great UI for debugging your bad rows.
At the top of the page we have a graph showing the number of bad rows per run. Underneath we have a selection of data, and the ability to drill in and explore any of them in more detail:
Pick your time horizon
The first thing to do is choose the time horizon you want to look at your bad rows over. By default, Kibana will load with a time horizon of only 15 minutes. It can be useful to look at bad rows over a much longer horizon. By clicking on the top right corner of the screen we can select the time period to view:
We then select 30 days' worth of data:
Filtering out bad rows that we need not worry about
One of the interesting things that jumps out when you monitor bad rows is that a fair amount of them are generated not by Snowplow trackers or webhooks sending data that fails to process (which would mean lost data), but by malicious bots pinging the internet looking for security vulnerabilities. The below is an example:
That was a request to our trial-collector.snplow.com/admin. The next request was generated by a bot trying to ping trial-collector.snplow.com/freepbx/admin/changes:
As these bad rows do not represent data that we want but failed to process, we can safely ignore them. To do that, we simply filter these out by entering the following query in the Kibana search box at the top of the screen:
This removes all rows that represent requests to paths that the collector does not support.
Another type of bad row that you need not worry about looks like the following:
They are caused by OPTIONS requests that are made when users of the JavaScript tracker send data via POST. (It is necessary for the browser to send an OPTIONS request prior to issuing the POST request with the actual data – this is a CORS requirement.) Again, they can safely be ignored as they don’t represent failed attempts to send data: they represent a necessary step in the process of sending data from the JavaScript tracker to the collector. We can filter these out, along with the bad rows generated by malicious bots, using the following query in Kibana:
The remaining rows should all be genuine bad data, i.e. data generated by the trackers or webhooks, so we need to drill into what is left and unpick the errors.
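If you keep a list of error messages that are safe to ignore, the exclusion query for the Kibana search box can be assembled rather than typed by hand. The following is a minimal sketch, not the exact queries used above: it assumes the error text lives in a field called `errors`, and the two phrases are placeholders that you would replace with text copied from your own ignorable bad rows.

```python
def escape_phrase(phrase: str) -> str:
    """Escape backslashes and double quotes so the text can sit inside a quoted phrase."""
    return phrase.replace("\\", "\\\\").replace('"', '\\"')


def exclude_errors(phrases, field="errors"):
    """Build a query string that excludes rows whose `field` matches any of the phrases.

    `field` defaults to "errors", which is an assumption about how the bad rows are indexed.
    """
    return " AND ".join(f'NOT {field}:"{escape_phrase(p)}"' for p in phrases)


# Placeholder phrases: substitute the error text seen in your own bad rows.
ignorable = [
    "placeholder text of the unsupported-path error",
    "placeholder text of the OPTIONS request error",
]
print(exclude_errors(ignorable))
```

Pasting the printed string into the search box should hide both kinds of row at once, and adding another ignorable error later only means appending to the list.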
Diagnosing underlying data collection problems
Now that we’ve filtered out bad rows that we need not worry about, we can identify real issues with our tracking.
I recommend working through the following process:
1. Identify the first error listed
Inspecting the first error, we might find something like the following:
The above error message is caused by a failure to validate data against the associated schema. Specifically, two fields have been included in a JSON sent into Snowplow that are not allowed:
buildIsReleased, and
buildVarient
2. Identify how many bad rows are caused by this error
Now we’ve identified a tracker error, we want to understand how prevalent this is. We can do that by simply updating our Kibana query to return rows with this type of error message i.e.
In our case, we can see that this error was only introduced yesterday, but that since then it accounts for almost 2500 bad rows:
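The same count can be reproduced outside Kibana if you export the bad rows. This is a rough sketch under stated assumptions: each bad row is a dict with an `errors` list of message strings and an ISO 8601 `failure_tstamp` string. Both field names are assumptions about the export, and the sample rows are invented for illustration.

```python
from collections import Counter


def count_rows_by_day(bad_rows, phrase):
    """Count bad rows per calendar day whose error messages contain `phrase`."""
    counts = Counter()
    for row in bad_rows:
        if any(phrase in message for message in row["errors"]):
            day = row["failure_tstamp"][:10]  # 'YYYY-MM-DD' prefix of the timestamp
            counts[day] += 1
    return dict(sorted(counts.items()))


# Invented sample rows, only to show the expected shape of the input.
sample_rows = [
    {"errors": ["properties not allowed: buildIsReleased"], "failure_tstamp": "2015-07-01T09:00:00Z"},
    {"errors": ["properties not allowed: buildIsReleased"], "failure_tstamp": "2015-07-01T10:30:00Z"},
    {"errors": ["some other error"], "failure_tstamp": "2015-07-01T11:00:00Z"},
]
print(count_rows_by_day(sample_rows, "buildIsReleased"))
```

A per-day breakdown like this makes it easy to see when an error was first introduced, which is the same signal the Kibana histogram gives.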
Addressing this issue is essential. Fortunately, this should be straightforward: most likely we need to create a new version of the schema that allows for the two fields, and update the tracker code to send in a reference to the new schema version in the different self-describing JSONs.
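As an illustration of the schema change described above, the sketch below adds the two fields to a copy of a self-describing JSON Schema and bumps the last part of its version. It assumes the schema carries its version under `self.version` in MODEL-REVISION-ADDITION form, and that adding optional fields is treated as an ADDITION. The vendor, schema name, existing property and the types chosen for the two new fields are all hypothetical.

```python
import copy


def add_optional_fields(schema, new_fields):
    """Return a copy of `schema` with extra optional properties and the ADDITION number bumped."""
    updated = copy.deepcopy(schema)
    updated.setdefault("properties", {}).update(new_fields)
    model, revision, addition = (int(part) for part in updated["self"]["version"].split("-"))
    updated["self"]["version"] = f"{model}-{revision}-{addition + 1}"
    return updated


# Hypothetical existing schema; only the overall shape matters here.
current_schema = {
    "self": {"vendor": "com.example", "name": "build_context", "format": "jsonschema", "version": "1-0-0"},
    "type": "object",
    "properties": {"buildNumber": {"type": "integer"}},
    "additionalProperties": False,
}

# The field types are guesses; check them against what the tracker actually sends.
new_schema = add_optional_fields(
    current_schema,
    {"buildIsReleased": {"type": "boolean"}, "buildVarient": {"type": "string"}},
)
print(new_schema["self"]["version"])
```

The tracker code would then need to reference the new version in the self-describing JSONs it sends, as noted above.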
3. Filter out those bad rows and repeat
Now that we’ve dealt with the first source of bad rows, let’s identify the second. This is easy: we update our Kibana query to filter out the bad rows we were exploring above:
and in addition filter out the bad rows that we did not need to worry about:
We repeat the above process until all the sources of bad data have been identified and dealt with!
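The three-step loop above can also be sketched in code against an export of bad rows. This is only an outline of the process, assuming each row is a dict with a non-empty `errors` list of message strings (an assumption about the export format); the error texts in the usage example are invented.

```python
def triage_order(bad_rows, ignorable_phrases):
    """Walk the bad rows the way the manual process does and return (error, row_count) pairs."""
    # Drop rows that match any phrase we have decided to ignore (bots, OPTIONS requests).
    remaining = [
        row for row in bad_rows
        if not any(phrase in message for phrase in ignorable_phrases for message in row["errors"])
    ]
    order = []
    while remaining:
        first_error = remaining[0]["errors"][0]                          # 1. first error listed
        matching = [row for row in remaining if first_error in row["errors"]]
        order.append((first_error, len(matching)))                       # 2. how many rows it causes
        remaining = [row for row in remaining if first_error not in row["errors"]]  # 3. filter and repeat
    return order


# Invented rows, only to show the shape of the input and output.
rows = [
    {"errors": ["bot probe"]},
    {"errors": ["schema error A"]},
    {"errors": ["schema error B"]},
    {"errors": ["schema error A"]},
]
for error, count in triage_order(rows, ["bot probe"]):
    print(count, error)
```

Each pair in the result corresponds to one pass through the process, so the list doubles as a to-do list of tracking issues to fix upstream.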
Further reading
We need to talk about bad data, our blog post explaining why we have to confront and manage bad data, rather than pretending it does not exist