Snowplow 87 Chichen Itza released


We are pleased to announce the immediate availability of Snowplow 87 Chichen Itza.
This release contains a wide array of new features, stability enhancements and performance improvements for EmrEtlRunner and StorageLoader. As of this release, EmrEtlRunner lets you specify EBS volumes for your Hadoop worker nodes; meanwhile StorageLoader now writes to a dedicated manifest table to record each load.
Continuing with this release series named for archaeological sites, Release 87 is Chichen Itza, the ancient Mayan city in the Yucatan Peninsula in Mexico.
Read on after the fold for:
- Specifying EBS volumes for Hadoop in EmrEtlRunner
- EmrEtlRunner stability and performance improvements
- A load manifest for Redshift
- StorageLoader stability improvements
- Upgrading
- Roadmap
- Getting help
1. Specifying EBS volumes for Hadoop in EmrEtlRunner
A recurring request from the Snowplow community has been for increased control over how the Snowplow batch pipeline runs on Elastic MapReduce.
Over time, our plan is to give you total control over this by migrating from EmrEtlRunner to our new Dataflow Runner, as per our RFC. However, this migration will take some time, and in the meantime we are continuing to invest in improving EmrEtlRunner.
In this release we add the ability to specify an EBS volume to attach to each core instance in your EMR cluster. This is particularly powerful for two scenarios:
- When you want to use new EBS-only instance types, such as the `c4` series, for your EMR jobs
- When you have significant event volumes and you would like much more fine-grained control over the amount of disk you are allocating to the HDFS cluster that is formed out of the core nodes
EmrEtlRunner lets you attach one EBS volume to each node, broadly exposing the functionality described in the EMR documentation for Amazon EBS volumes. For an example, please see the upgrade section 5.2 Updating config.yml below.
2. EmrEtlRunner stability and performance improvements
We have made a variety of “under-the-hood” improvements to EmrEtlRunner.
Most noticeably, we have migrated the archival code for raw collector payloads from EmrEtlRunner into the EMR cluster itself, where the work is performed by the S3DistCp distributed tool. This should reduce the strain on your server running EmrEtlRunner, and should improve the speed of that step. Note that as a result of this, the raw files are now archived in the same way as the enriched and shredded files, using `run=` sub-folders.
For more robust monitoring of EMR while waiting for jobflow completion, EmrEtlRunner now anticipates and recovers from additional Elasticity errors (`ThrottlingException` and `ArgumentError`).
For users running Snowplow in a Lambda architecture, we have removed the `UnmatchedLzoFilesError` check, which would prevent EMR from starting even though a missing LZO index file in the processing folder is in fact benign.
In the case that a previous run has failed or is still ongoing, EmrEtlRunner now exits with a dedicated return code (`4`).
Finally, we have bumped the JRuby version for EmrEtlRunner to 9.1.6.0, and upgraded the key Elasticity dependency to 6.0.10.
3. A load manifest for Redshift
As of this release, StorageLoader now populates a manifest table as part of the Redshift load. The table is simply called `manifest` and lives in the same schema as your `events` and other tables.
Here are the last 5 loads for one of our internal pipelines:
```
snplow=# select * from atomic.manifest order by etl_tstamp desc limit 5;
       etl_tstamp        |       commit_tstamp        | event_count | shredded_cardinality
-------------------------+----------------------------+-------------+----------------------
 2017-02-21 19:01:49.648 | 2017-02-21 19:28:33.394898 |        4340 |                    8
 2017-02-21 15:02:02.03  | 2017-02-21 15:27:24.534884 |        2891 |                    7
 2017-02-21 11:02:09.63  | 2017-02-21 11:28:00.317476 |        2210 |                    7
 2017-02-21 08:02:14.957 | 2017-02-21 08:27:59.829461 |        1945 |                    6
 2017-02-21 03:01:39.476 | 2017-02-21 03:27:03.493547 |        1986 |                    8
(5 rows)
```
The fields are as follows:
- `etl_tstamp` is the time at which the Snowplow pipeline run started
- `commit_tstamp` is the time at which the load transaction started
- `event_count` is the number of events loaded into the `events` table as part of this load transaction
- `shredded_cardinality` is how many different self-describing event and context tables were loaded as part of this load transaction
At the moment this manifest table is only informational; in the future, however, we want to use it proactively, for example to prevent a batch of events from being accidentally double-loaded into Redshift.
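To illustrate the kind of check the manifest enables, a query along the following lines (a sketch, not something shipped with this release) would surface any pipeline run that appears in the manifest more than once:

```sql
-- Sketch only: flag pipeline runs whose events appear to have been loaded
-- more than once. Assumes the manifest lives in the atomic schema, as in
-- the example above.
SELECT etl_tstamp, COUNT(*) AS load_count
FROM atomic.manifest
GROUP BY etl_tstamp
HAVING COUNT(*) > 1;
```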
4. StorageLoader stability improvements
As with EmrEtlRunner, we have bumped the JRuby version for StorageLoader to 9.1.6.0.
We have also fixed a critical bug for loading events into Postgres via StorageLoader (#2888).
5. Upgrading
5.1 Upgrading EmrEtlRunner and StorageLoader
The latest versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
5.2 Updating config.yml
To make use of the new ability to specify EBS volumes for your EMR cluster’s core nodes, update your configuration YAML like so:
```yaml
  jobflow:
    master_instance_type: m1.medium
    core_instance_count: 1
    core_instance_type: c4.2xlarge
    core_instance_ebs:     # Optional. Attach an EBS volume to each core instance.
      volume_size: 200     # Gigabytes
      volume_type: "io1"
      volume_iops: 400     # Optional. Will only be used if volume_type is "io1"
      ebs_optimized: false # Optional. Will default to true
```
The above configuration will attach an EBS volume of 200 GiB to each core instance in your EMR cluster; the volumes will be Provisioned IOPS (SSD) with 400 IOPS provisioned, and will not be EBS optimized. Note that this configuration has finally allowed us to use the EBS-only `c4` instance types for our core nodes.
For a complete example, see our sample `config.yml` template.
5.3 Upgrading Redshift
You will also need to deploy the following manifest table for Redshift:
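The definition below is only an indicative sketch, reconstructed from the fields described in section 3; the column types here are assumptions, so use the table definition that ships with the Snowplow repository as the authoritative source:

```sql
-- Indicative sketch of the manifest table: column names follow section 3,
-- but the types are assumptions rather than the canonical Snowplow DDL.
CREATE TABLE atomic.manifest (
  etl_tstamp           TIMESTAMP, -- when the Snowplow pipeline run started
  commit_tstamp        TIMESTAMP, -- when the load transaction started
  event_count          BIGINT,    -- events loaded in this transaction
  shredded_cardinality INT        -- distinct shredded tables loaded
);
```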
This table should be deployed into the same schema as your `events` and other tables.
6. Roadmap
Upcoming Snowplow releases include:
- R8x [HAD] Cross-batch natural deduplication, removing duplicates across ETL runs using an event manifest in DynamoDB
- R9x [HAD] Spark port, which will port our Hadoop Enrich and Hadoop Shred jobs from Scalding to Apache Spark
- R9x [HAD] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)
Note that these releases are always subject to change between now and the actual release date.
7. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.