Snowplow 0.8.5 released with ETL bug fixes


We are pleased to announce the immediate availability of Snowplow 0.8.5. This is a bug fixing release, following on from our launch last week of Snowplow 0.8.4 with geo-IP lookups.
This release fixes one showstopper issue with Snowplow 0.8.4, and also includes a set of smaller enhancements to help the Scalding ETL better handle “bad quality” event data from webpages. We recommend everybody on the Snowplow 0.8.x series upgrade to this version.
Many thanks to community members Peter van Wesep and Gabor Ratky for their help identifying and debugging the issues fixed in this release!
In this post we will cover:
1. The showstopper bug fix
2. Other improvements
3. Upgrading
4. Getting help
1. The showstopper bug fix
Many thanks to Peter van Wesep for spotting the showstopper issue in the Snowplow 0.8.4 release: when the Snowplow ETL process was run from an Amazon Web Services account other than Snowplow’s own, the Hadoop ETL code was unable to read the MaxMind geo-IP data file from an S3:// link hosted from a Snowplow public bucket. This issue did not affect users who are self-hosting the ETL assets.
This has now been fixed – we now provide the MaxMind geo-IP file on an HTTP:// link, and the Scalding ETL downloads it and adds it to HDFS before running.
2. Other improvements
We have made a series of other improvements to the Scalding ETL, to make it more robust. These improvements are:
- We have widened the `page_urlport` and `refr_urlport` fields
- We now strip control characters (e.g. nulls) from fields alongside tabs and newlines, to prevent Redshift load errors
- The ETL no longer dies if a huge (larger than an integer) numeric value is sent in for a screen/view dimension
- We have increased the size of `se_value` from a float to a double
- `se_value` is now always output as a plain string, never in scientific notation, to prevent Redshift load errors
- It is now possible to build the ETL locally (we added a missing dependency to the project configuration)
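Two of these fixes are easy to illustrate. Below is a minimal sketch, in Python rather than the Scalding ETL's Scala and purely for illustration, of stripping control characters from a field and of rendering a double-precision value as a plain decimal string instead of scientific notation:

```python
import unicodedata
from decimal import Decimal


def clean_field(field: str) -> str:
    """Drop control characters (including nulls, tabs and newlines) so
    the tab-separated output loads cleanly into Redshift."""
    return "".join(c for c in field if unicodedata.category(c) != "Cc")


def plain_string(value: float) -> str:
    """Render a float as a plain decimal string, never in scientific
    notation, which Redshift would reject on load."""
    return format(Decimal(repr(value)), "f")


print(clean_field("abc\x00def\tghi\n"))  # -> abcdefghi
print(plain_string(1.23e-8))             # -> 0.0000000123
```

Without the plain-string step, a very small `se_value` would serialize as `1.23e-08`, which is exactly the kind of value that tripped up Redshift loads.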
3. Upgrading
There are three components to upgrade in this release:
- The Scalding ETL, to version 0.3.1
- EmrEtlRunner, to version 0.2.1
- The Redshift events table, to version 0.2.1
Alternatively, if you are still using Infobright with the legacy Hive ETL, you can upgrade your Infobright events table, to version 0.0.9.
Let’s take these in turn:
Hadoop ETL
If you are using EmrEtlRunner, you need to update your configuration file, `config.yml`, to use the latest version of the Hadoop ETL:

```yaml
:snowplow:
  :hadoop_etl_version: 0.3.1 # Version of the Hadoop ETL
```
EmrEtlRunner
You need to upgrade your EmrEtlRunner installation to the latest code (0.8.5 release) on GitHub:
```bash
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.8.5
```
Redshift events table
We have updated the Redshift table definition – you can find the latest version in the GitHub repository.
If you already have your Snowplow data in the previous version of the Redshift events table, we have written a migration script to handle the upgrade. Please review this script carefully before running and check that you are happy with how it handles the upgrade.
Infobright events table
If you are storing your events in Infobright Community Edition, you can also update your table definition. To make this easier for you, we have created a script:
`4-storage/infobright-storage/migrate_008_to_009.sh`
Running this script will create a new table, `events_009` (version 0.0.9 of the Infobright table definition), in your `snowplow` database, copying across all your data from your existing `events_008` table, which will not be modified in any way.
Once you have run this, don't forget to update your StorageLoader's `config.yml` to load into the new `events_009` table, not your old `events_008` table:
```yaml
:storage:
  :table: events_009 # NOT "events_008" any more
```
Done!
4. Getting help
As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.
You can see the full list of issues delivered in Snowplow 0.8.5 on GitHub.