Snowplow 77 Great Auk released with EMR 4.x series support


Snowplow release 77 Great Auk is now available! This release focuses on the command-line applications used to orchestrate Snowplow, bringing Snowplow up to date with the new 4.x series of Elastic MapReduce releases.
- Elastic MapReduce AMI 4.x series compatibility
- Moving towards running StorageLoader on Hadoop
- Retrying the job in the face of bootstrap failures
- Monitoring improvements
- Removal of snowplow-emr-etl-runner.sh and snowplow-storage-loader.sh
- Bug fixes and other improvements
- Upgrading
- Roadmap
- Getting help
1. Elastic MapReduce AMI 4.x series compatibility
Snowplow is now capable of running on version 4.x of Amazon EMR. The 4.x release series of Amazon EMR has some great new features, including support for Apache Spark 1.6, Apache Zeppelin and private VPCs.
Achieving this involved four changes:
- A new bootstrap action to put the correct resources on the classpath
- Minor changes to Scala Hadoop Shred to prevent a `NullPointerException` thrown by the `java.net.URL` class post-upgrade
- Upgrading to the latest version of the Elasticity library
- Switching from `javax.script` to `org.mozilla.javascript` for the JavaScript Script Enrichment, to prevent compatibility issues
To get up to date with the latest AMI version, change the `ami_version` field of your configuration YAML to `4.3.0`. Make sure you also change the `hadoop_shred` field to at least `0.8.0` to get a compatible version of Scala Hadoop Shred.
2. Moving towards running StorageLoader on Hadoop
At the moment, processing raw data using Snowplow involves two commands: you need to run both EmrEtlRunner, to process the data on Elastic MapReduce, and StorageLoader, to load the processed data into Redshift or Postgres.
In the future, StorageLoader will be invisible to the end user – it will become simply a custom jar step in the jobflow on EMR. In this release we have moved towards this goal in two ways.
Getting credentials from EC2
When running StorageLoader on EC2, you no longer need to configure it with your AWS credentials. Instead you can set the credentials fields to “iam”:
```yaml
aws:
  access_key_id: iam
  secret_access_key: iam
```
StorageLoader will then look up the credentials using the EC2 instance metadata.
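Under the hood, this relies on the standard EC2 instance metadata service. For illustration only, the temporary credentials for the instance's IAM role can be fetched like so (`my-role` is a hypothetical role name):

```bash
# Omitting the final path segment lists the role names available on the instance
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role
```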
Base64-encoded configuration
It is now possible to pass a Base64-encoded configuration string as a command line argument instead of the path to the configuration file. For example:
```bash
./snowplow-storage-loader --base64-config-string $(base64 -w0 path/to/config.yml)
```
This will make it easier for us to invoke StorageLoader from Hadoop in the future.
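If you want to sanity-check the encoding locally first, you can round-trip it. Note that `-w0`, which disables line wrapping, is a GNU coreutils flag; the BSD/macOS `base64` does not wrap by default:

```bash
# No output from diff means the round-trip is lossless
base64 -w0 path/to/config.yml | base64 -d | diff - path/to/config.yml
```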
3. Retrying the job in the face of bootstrap failures
Sometimes the process of bootstrapping the cluster before the job starts can fail. This release improves the ability of EmrEtlRunner to recognise these bootstrap failures and restart the job.
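As a minimal sketch of the pattern (not EmrEtlRunner's actual code), assuming a hypothetical `submit_and_wait_for_jobflow` method and a simple message-based heuristic for spotting bootstrap failures:

```ruby
MAX_ATTEMPTS = 3 # hypothetical retry limit

def run_with_bootstrap_retries
  attempts = 0
  begin
    attempts += 1
    submit_and_wait_for_jobflow # hypothetical; raises on jobflow failure
  rescue StandardError => e
    # Assumed heuristic: failures mentioning bootstrap actions are transient
    retry if e.message.include?("bootstrap") && attempts < MAX_ATTEMPTS
    raise
  end
end
```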
4. Monitoring improvements
EmrEtlRunner’s internal Snowplow monitoring can be configured with name-value tags which are sent to a Snowplow collector with every monitoring event. In this release, we also attach those tags to the EMR job itself, so that you can see them in the EMR web UI.
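For example, tags are set in the monitoring section of the configuration YAML; a minimal sketch, assuming the `monitoring: tags:` layout from the sample config.yml (the tag names here are hypothetical):

```yaml
monitoring:
  tags:
    environment: production # example name-value pairs; choose your own
    pipeline: batch
```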
We have also upgraded both apps to the latest version (0.5.2) of the Snowplow Ruby Tracker.
5. Removal of snowplow-emr-etl-runner.sh and snowplow-storage-loader.sh
These scripts were originally used to run EmrEtlRunner and StorageLoader as native Ruby apps using RVM. Now that those apps are available on Bintray as easy-to-deploy JRuby jars, these scripts are no longer necessary.
Running EmrEtlRunner and StorageLoader as Ruby (rather than JRuby) apps is no longer actively supported.
6. Bug fixes and other improvements
Snowplow R77 Great Auk also includes some important bug fixes and improvements:
- We fixed a serious error in the Currency Conversion Enrichment, whereby an exception would be thrown (failing the overall event) if you attempted to convert from and to the same currency, such as attempting to convert €9.99 to euros (#2437)
- Historically, StorageLoader has performed `ANALYZE` statements immediately after the `COPY` statements (in fact in the same transaction), and before any `VACUUM` statements. It is more correct to perform `ANALYZE` after `VACUUM`, so we have reversed the order. Many thanks to Ryan Doherty for flagging this! (#1361)
- EmrEtlRunner now supports an optional `aws:emr:additional_info` field in `config.yml`, which you can use to access beta EMR features; see the sketch after this list (#2211)
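For the `additional_info` field, a minimal sketch (the JSON payload here is a hypothetical placeholder; EMR passes the string through as its AdditionalInfo parameter):

```yaml
aws:
  emr:
    additional_info: '{"someBetaFeature": true}' # hypothetical payload
```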
7. Upgrading
The latest versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
The recommended AMI version to run Snowplow is now 4.3.0 – update your configuration YAML as follows:
```yaml
emr:
  ami_version: 4.3.0 # WAS 3.7.0
```
You will need to update the jar versions in the same section:
```yaml
versions:
  hadoop_enrich: 1.6.0 # WAS 1.5.1
  hadoop_shred: 0.8.0 # WAS 0.7.0
  hadoop_elasticsearch: 0.1.0 # UNCHANGED
```
For a complete example, see our sample `config.yml` template.
Note also that the `snowplow-runner-and-loader.sh` script has been updated to use the JRuby binaries rather than the raw Ruby project.
8. Roadmap
Upcoming Snowplow releases include:
- Release 78 Great Hornbill, which will bring the Kinesis pipeline up to date with the most recent Scala Common Enrich releases. This will also include click redirect support in the Scala Stream Collector
- Release 79 Black Swan, which will allow enriching an event by requesting data from a third-party API
Note that these releases are always subject to change between now and the actual release date.
9. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.