Snowplow R118 beta release with new bad row format

We are excited to release Snowplow R118 Morgantina, named after the archaeological site in east central Sicily, southern Italy.
This Snowplow beta release includes the long-awaited new bad row format. We have updated the microservices that make up the Snowplow pipeline to emit much more structured bad rows, which should be significantly easier to query and debug. It will now be simpler for companies running Snowplow to reliably monitor data quality issues, spot them when they arise, and then quickly diagnose the source of a problem, the event or entity it impacts, the size of the problem and the app it occurred on. For more information on the thinking behind this new functionality, see the original RFC.
In addition, this release includes important refactoring in the libraries that we use (e.g. cats instead of scalaz, and circe instead of json4s), a bump of the Beam version to 2.11.0, an updated Iglu Client with a different validator and a fixed caching mechanism, as well as an improvement to the referer parser enrichment.
The Scala Stream Collector and the enrich jobs (Stream Enrich and Beam Enrich) are bumped to version 1.0.0.
As introduced in the RFC on batch deprecation, from this release onwards we are no longer adding new features to the Spark-based batch pipeline. This release, and all subsequent releases, will focus on our streaming solution.
1. Beta release
This release represents a major upgrade of the core Snowplow technology. We are currently testing this tech internally – no small feat because of the size and scope of the release. Whilst we are running those tests, we have decided to make the code available to open source users to test themselves: we would welcome any feedback on the new functionality, which we hope to release in production in R119.
If you are an open source user who would like to test this new functionality, we would be happy to provide high-priority OSS support during this testing phase (covering setup, debugging, recovery or other aspects). We are committed to iterating our release candidates for R119 rapidly.
2. New bad row format
2.1. Why update the bad row format?
The Snowplow pipeline has always been non-lossy: when there is an issue processing an event, rather than silently discarding it, the pipeline saves it as a “bad row”. This is only half the challenge! Snowplow also gives you the ability to monitor the number of bad rows, inspect them, fix them and reprocess them to complete your data set.
The previous format for these bad rows (generic JSON) was not straightforward to use. Specifically:
- It was hard to distinguish “bad rows” that mattered (i.e. real data that was not successfully processed) from rubbish data (e.g. generated by bots crawling the collector endpoint)
- Where there were genuine issues, it was hard to
- aggregate over the bad row data to size those issues,
- figure out what type of data caused the issue (e.g. for a schema validation error, what schema the issue was with, and what field the issue was with), and
- identify the source of the error so that it can be quickly fixed e.g. is it an issue with a specific mobile app?
To address these issues, we have reworked the format from the ground up. More background on that thinking can be found in the RFC.
2.2. New bad row format
Bad rows emitted by the Snowplow pipeline are now self-describing JSONs. All of their schemas can be found on Iglu Central.
Events that failed to be processed for different reasons at different points in the Snowplow pipeline are emitted as bad rows with different schemas. The further down the pipeline a failure occurs, the more processing the event has undergone, and the richer the data that can be exposed as part of the bad row for querying.
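As an illustration, the type of failure can be read straight from the schema of the self-describing envelope. Below is a minimal sketch, assuming only circe is available (the snowplow-badrows library described in section 2.5 offers a much richer way to work with bad rows); the exact schema URIs and versions are listed on Iglu Central.
import io.circe.parser

// Sketch only: read the "schema" key of the self-describing envelope to see
// which kind of failure a bad row describes, e.g. a URI such as
// iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/1-0-0.
def badRowSchemaKey(jsonStr: String): Option[String] =
  parser.parse(jsonStr).toOption.flatMap(_.hcursor.get[String]("schema").toOption)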
An overview of the new types of bad rows is provided below.
2.3. Bad rows emitted by the Scala Stream Collector
2.3.1. Size violation
The payload received by the collector can exceed the maximum size allowed by the message queue being used (1MB for Kinesis on AWS and 10MB for PubSub on GCP). In this scenario, the payload can’t be stored in its entirety, and a size violation bad row is emitted. For example, this could be the case if a large image is sent via a form.
This type of bad row is generally unrecoverable.
2.4. Bad rows emitted by the enrich jobs
2.4.1. Collector payload format violation
If something goes wrong when the enrich job tries to deserialize the payload serialized by the collector, a bad row of this type is emitted. For instance, this could happen in the case of:
- Malformed HTTP
- Truncation
- Invalid query string encoding in URL
- Path not respecting /vendor/version
2.4.2. Adapter failure
We support many webhooks that can be connected to the Snowplow pipeline. Whenever the event sent by a webhook doesn’t respect the expected structure and list of fields for this webhook, an adapter failure is emitted.
This could happen for instance if a webhook is updated and stops sending a field that it was sending before.
2.4.3. Tracker protocol violation
Events sent by a Snowplow tracker are expected to follow our tracker protocol. Whenever this protocol is not correctly respected, a tracker protocol violation bad row is emitted, usually revealing an error on the tracker side.
2.4.4. Schema violation
The Snowplow pipeline can receive self-describing events. To be processed successfully, there need to be corresponding schemas for these events in the user’s schema registry. In addition, it is possible to attach additional information to any event recorded in Snowplow through the use of custom entities (contexts), again defined as self-describing JSONs.
As part of the processing of each event, every self-describing JSON (whether event or context) is validated against its schema. Whenever a self-describing JSON is invalid in either of these two cases, a schema violation bad row is emitted.
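For illustration, here is the shape of such a self-describing JSON; the schema URI and field below are hypothetical, but the “schema”/“data” envelope is what gets validated.
import io.circe.Json

// Hypothetical self-describing event: "schema" points at a JSON Schema in the
// user's registry, and "data" must conform to it. If the schema cannot be
// resolved or the data does not validate, a schema violation bad row is emitted.
val selfDescribingEvent: Json = Json.obj(
  "schema" -> Json.fromString("iglu:com.acme/button_click/jsonschema/1-0-0"),
  "data"   -> Json.obj("buttonId" -> Json.fromString("checkout"))
)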
2.4.5. Enrichment failure
This type of bad row is emitted whenever something goes wrong in the enrichments. This could happen for instance if the credentials used for the SQL enrichment are wrong, or because of a bad link to the MaxMind database for IP lookup.
For those familiar with the bad rows previously emitted in case of enrichment failure, please note that with this new version, when an error occurs while enriching an event (which may come from a collector payload containing several events), the bad row will contain only that event; the other events in the collector payload can still be successfully enriched. The same applies to schema violations.
2.4.6. Size violation
As for the Scala Stream Collector, this type of bad row is emitted whenever the size of the enriched event is too big for the message queue. It can also be emitted if another kind of bad row is itself too big for the message queue; in that case the original bad row is truncated and wrapped in a size violation bad row instead.
2.5. Working with bad rows in Scala
The Scala library snowplow-badrows contains all the case classes defining the new bad rows and makes it easier to work with them. By using the same version of the library as the application producing the bad rows, we can be sure that there won’t be any incompatibility when processing them.
To parse a String containing a bad row, one could write:
import com.snowplowanalytics.snowplow.badrows._
import com.snowplowanalytics.iglu.core.circe.instances._
import com.snowplowanalytics.iglu.core.SelfDescribingData
import cats.implicits._
import io.circe.parser
import io.circe.syntax._
def parseBadRow(jsonStr: String): Either[String, BadRow] =
  for {
    json <- parser.parse(jsonStr).leftMap(_.getMessage)
    sdj <- SelfDescribingData.parse(json).leftMap(_.code)
    badrow <- sdj.data.as[BadRow].leftMap(_.getMessage)
  } yield badrow
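As a hypothetical usage example (the helper name below is ours, not part of the library), rows that fail to parse can simply be dropped or logged while the rest are collected:
// Keep the bad rows that parsed successfully, e.g. from lines read out of the
// bad row stream or bucket.
def collectBadRows(lines: List[String]): List[BadRow] =
  lines.flatMap(line => parseBadRow(line).toOption)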
It is then straightforward to determine the type of bad row with pattern matching and to access its fields:
val badRow: BadRow = ???
badRow match {
  case BadRow.SchemaViolations(_, failure, _) =>
    val timestamp = failure.timestamp
    doSomethingWithTimestamp(timestamp)
  case BadRow.EnrichmentFailures(_, _, payload) =>
    val maybeCountry = payload.enriched.geo_country
    doSomethingWithCountry(maybeCountry)
  case _ => ???
}
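This also makes the aggregation mentioned in section 2.1 straightforward. As a minimal sketch (again, the helper below is illustrative, not part of the library), data quality issues can be sized by counting bad rows per type:
// Count bad rows per type, using the case class name as the grouping key.
def countByType(badRows: List[BadRow]): Map[String, Int] =
  badRows.groupBy(_.getClass.getSimpleName).map { case (tpe, rows) => tpe -> rows.size }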
3. Referer parser enrichment improvement
The referer parser enrichment requires a database file with a list of referers to do its job.
From this release, this file no longer needs to be embedded: its URI can be specified as an input parameter of the enrichment.
The database can thus be updated without any modification to the code.
4. Upgrading
The upgrading guide can be found on our wiki page.
5. Roadmap
While R118 contains release candidates of the components emitting the new bad rows (Scala Stream Collector and enrich jobs), R119 will have the final versions of these components, after further testing from our side and from the OSS community.
Along with R119, new releases of the components processing these bad rows will complete the ecosystem:
- S3 loader and GCS loader will partition the bad rows on disk according to their type
- ES loader will contain a lot of refactoring and a better indexing of the bad rows
- Event recovery will recover the new bad rows
It should also be stressed that, as we speak, the loaders are also being prepared to emit the new bad row format.
Stay tuned for announcements of more upcoming Snowplow releases soon!
6. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum. Open source users will receive high-priority support for components of this release.