Snowplow 0.8.12 released with a variety of improvements to the Scalding Enrichment process
We are very pleased to announce the immediate availability of Snowplow 0.8.12. We have quite a packed schedule of releases planned over the next few weeks – and we are kicking off with 0.8.12, which consists of various small improvements to our Scalding-based Enrichment process, plus some architectural re-work to prepare for the coming releases (in particular, Amazon Kinesis support).
- Background on this release
- Scalding Enrichment improvements
- Re-architecting our Enrichment process
- Installing this release
This release has two core objectives:
- To make a set of small improvements to our Scalding-based Enrichment process
- To re-architect our Enrichment process to make it usable from Amazon Kinesis
We will detail both of these after the jump.
The improvements made to our Scalding Enrichment process in this release are as follows:
- We have updated our user-agent parsing library to its latest version, thanks to Rob Kingston for the suggestion (#416)
- We have fixed an issue where Snowplow raw events without a page URI were automatically sent to the bad bucket. As a general purpose event analytics platform, raw events are no longer automatically expected to have an associated page URI. Thanks to Simon Rumble for this suggestion (#399)
- We have added missing validation for fields which the Tracker Protocol expects to be set to ‘0’ or ‘1’ (#408)
- We have added missing validation of the numeric fields in ecommerce transactions (#400)
- We have tweaked the code which parses CloudFront access logs to make it a little more permissive of missing fields if they are not required by Snowplow (#410)
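To give a flavour of the new ‘0’/‘1’ validation (#408), here is a minimal, dependency-free Scala sketch; the actual enrichment code has its own validation utilities and error messages, so treat the function and field names below as illustrative only:

```scala
// Illustrative only: converting a Tracker Protocol "0"/"1" field to a Boolean,
// rejecting anything else. The real enrichment code has its own validation
// utilities and error formats - this is just a sketch of the idea.
object BooleanFieldValidation {

  // Returns Right(value) for a valid field, Left(error message) otherwise
  def stringToBoolean(field: String, name: String): Either[String, Boolean] =
    field match {
      case "1" => Right(true)
      case "0" => Right(false)
      case _   => Left(s"Field [$name]: cannot convert [$field] to Boolean")
    }

  def main(args: Array[String]): Unit = {
    println(stringToBoolean("1", "f_pdf"))   // Right(true)
    println(stringToBoolean("yes", "f_pdf")) // Left(Field [f_pdf]: cannot convert [yes] to Boolean)
  }
}
```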
Additionally, we have upgraded some of the Enrichment process’s underlying components: Scala to 2.10.3, Scalding to 0.8.11, SBT to 0.13.0 and sbt-assembly to 0.10.0.
Finally, although not strictly an Enrichment improvement, we have also fixed a bug in cube-pages.sql, bumping it to version 0.1.1. Many thanks to community member Matt Walker for this fix!
Here at Snowplow we are hugely excited about the recent release of Amazon Kinesis, a fully managed service for continuous data processing. We plan to use Kinesis to enable real-time collection and processing of Snowplow event data: as well as enabling us to deliver real-time reporting via Amazon Redshift, this also opens up the possibility of building operational systems on top of the Snowplow event stream.
As a first step, Brandon, one of our Snowplow winterns, is working on a new Snowplow collector which will collect raw Snowplow events and sink them onto a further Kinesis stream; for more details on this collector see this blog post.
The next logical step is to create a “Kinesis application”, which reads raw events off one Kinesis stream, enriches them using our existing Scala Enrichment code, and then writes the enriched events back to another Kinesis stream for further processing or storage (e.g. drip feeding into Amazon Redshift).
The only problem? Our existing Scala Enrichment code was tightly coupled to our Scalding/Cascading Scala project, making it hard to re-use in a future Kinesis application. This 0.8.12 release fixes this with some ‘software surgery’: we have extracted the core enrichment code out of Scala Hadoop Enrich into a new shared library, Scala Common Enrich, for processing raw Snowplow events into validated and enriched Snowplow events.
Common Enrich is designed to be used within a “host” enrichment process: initially our existing Scala Hadoop Enrich process, but it should be relatively straightforward to also embed it in a Kinesis application.
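To illustrate the idea of a “host” process, here is a hypothetical Scala sketch; the types and function names (RawEvent, EnrichedEvent, enrichRawEvent) are stand-ins rather than the actual Scala Common Enrich API:

```scala
// Hypothetical sketch of a "host" process driving a shared enrichment library.
// RawEvent, EnrichedEvent and enrichRawEvent are stand-ins, not the actual
// Scala Common Enrich API.
object EnrichHostSketch {

  case class RawEvent(payload: String)
  case class EnrichedEvent(fields: List[String])

  // Stand-in for the shared enrichment call: validate and enrich one raw event,
  // returning either a list of errors or an enriched event
  def enrichRawEvent(raw: RawEvent): Either[List[String], EnrichedEvent] =
    if (raw.payload.nonEmpty) Right(EnrichedEvent(List(raw.payload)))
    else Left(List("Empty event payload"))

  // The host (a Scalding job today, or a future Kinesis application) sources
  // raw events, hands each one to the shared library, and sinks good and bad
  // rows separately
  def process(batch: Seq[RawEvent]): (Seq[EnrichedEvent], Seq[List[String]]) = {
    val results = batch.map(enrichRawEvent)
    (results.collect { case Right(e) => e },
     results.collect { case Left(errs) => errs })
  }

  def main(args: Array[String]): Unit = {
    val (good, bad) = process(Seq(RawEvent("pv"), RawEvent("")))
    println(s"Good: $good; Bad: $bad")
  }
}
```

The key design point is that the shared library only knows how to turn one raw event into an enriched event (or a set of errors); sourcing raw events and sinking the results stays with the host, whether that host is a Scalding job or a Kinesis application.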
If you are using the existing Scalding-based Enrichment process, the only difference you should notice is the new composite v_etl value for Snowplow events: “hadoop-0.3.6-common-0.1.0”.
Assuming you are using EmrEtlRunner, you simply need to update your configuration file, config.yml, to use the latest version of the Hadoop ETL:
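A minimal sketch of the relevant fragment, assuming the Hadoop ETL version is configured under a :snowplow: section (the exact key layout may differ in your config.yml):

```yaml
:snowplow:
  :hadoop_etl_version: 0.3.6 # Version of the Scala Hadoop Enrich (Hadoop ETL) job
```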
And that’s it! As always, you can find more detail on the tickets included in this release under the Snowplow v0.8.12 release on GitHub.