We are pleased to announce the release of Snowplow 83 Bald Eagle. This release introduces our powerful new SQL Query Enrichment, long-awaited support for the EU Frankfurt AWS region, plus POST support for our Iglu webhook adapter.
- SQL Query Enrichment
- Support for eu-central-1 (Frankfurt)
- POST support for the Iglu webhook adapter
- Other improvements
- Upgrading
- Roadmap
- Getting help
1. SQL Query Enrichment
The SQL Query Enrichment lets you effectively join arbitrary entities to your events during the enrichment process, as opposed to attaching the data in your tracker or in your event data warehouse. This is very powerful, not least for the real-time use case where performing a relational database join post-enrichment is impractical.
The query is plain SQL: it can span multiple tables, alias returned columns and apply arbitrary WHERE clauses driven by data extracted from any field found in the Snowplow enriched event, or indeed any JSON property found within the unstruct_event, contexts or derived_contexts fields. The enrichment will retrieve one or more rows from your targeted database as one or more self-describing JSONs, ready for adding back into the derived_contexts field.
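To make the idea concrete, here is a minimal Python sketch of the same pattern: a WHERE clause driven by a field of the event, with each returned row wrapped as a self-describing JSON. It is not the enrichment's implementation or its configuration format; the users table, its columns, the user_id key and the com.acme schema URI are all hypothetical, and SQLite stands in for your real database.

```python
import sqlite3

# Hypothetical Iglu schema URI for the joined entity (not from the release).
USER_SCHEMA = "iglu:com.acme/user/jsonschema/1-0-0"


def lookup_entities(conn, event):
    """Fetch rows keyed on a field of the event and wrap each one
    as a self-describing JSON (a schema plus the row's data)."""
    cursor = conn.execute(
        "SELECT u.plan AS plan, u.country AS country "
        "FROM users AS u WHERE u.user_id = ?",
        (event["user_id"],),  # placeholder filled from the event field
    )
    columns = [col[0] for col in cursor.description]
    return [
        {"schema": USER_SCHEMA, "data": dict(zip(columns, row))}
        for row in cursor.fetchall()
    ]


def demo():
    """Build a throwaway in-memory table and enrich one sample event."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id TEXT, plan TEXT, country TEXT)")
    conn.execute("INSERT INTO users VALUES ('u-1', 'pro', 'DE')")
    event = {"event_id": "e-1", "user_id": "u-1"}
    return lookup_entities(conn, event)
```

Calling demo() returns a list with one self-describing JSON per matching row; an event whose user_id has no match yields an empty list.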
For a detailed walk-through of the SQL Query Enrichment, check out our new tutorial, How to enrich events with MySQL data using the SQL Query Enrichment.
You can also find out more on the SQL Query Enrichment page on the Snowplow wiki.
2. Support for eu-central-1 (Frankfurt)
We are delighted to be finally adding support for the EU Frankfurt (eu-central-1) AWS region in this release; this has been one of the most requested features by the Snowplow community for some time now.
To implement this we made various changes to our EmrEtlRunner and StorageLoader applications, as well as to our central hosting of code artifacts for Elastic MapReduce and Redshift loading.
AWS has a healthy roadmap of new data center regions opening over the coming months; we are committed to Snowplow supporting these new regions as they become available.
3. POST support for the Iglu webhook adapter
Our Iglu webhook adapter is one of our most powerful webhooks. It lets you track events sent into Snowplow via a GET request, where the name-value pairs on the request are composed into a self-describing JSON, with an Iglu-compatible schema parameter being used to describe the JSON.
Previously this adapter only supported GET requests; as of this release the adapter also supports POST requests. You can send in your data in the POST request body, either formatted as a JSON or as a form body; the schema parameter should be part of the request body.
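The composition step described above can be sketched in a few lines of Python. This illustrates the idea only and is not the adapter's code: the com.acme schema URI and the field names are hypothetical, and the flat layout of the JSON body (schema sitting alongside the data fields) is an assumption, so check the setup guide for the exact format the adapter accepts.

```python
import json
from urllib.parse import parse_qsl, urlencode


def compose_self_describing(pairs):
    """Turn name-value pairs into a self-describing JSON: the `schema`
    pair names the Iglu schema, every other pair becomes data."""
    fields = dict(pairs)
    schema = fields.pop("schema")  # raises KeyError if schema is missing
    return {"schema": schema, "data": fields}


def from_urlencoded(text):
    """Handle a GET querystring or a form-encoded POST body."""
    return compose_self_describing(parse_qsl(text))


def from_json_body(body):
    """Handle a JSON POST body (assumed here to be one flat object)."""
    return compose_self_describing(json.loads(body).items())


# Hypothetical schema and fields, for illustration only.
SCHEMA = "iglu:com.acme/campaign/jsonschema/1-0-0"
payload = {"schema": SCHEMA, "name": "download", "source": "email"}

query = urlencode(payload)      # what a GET request or form body would carry
json_body = json.dumps(payload)  # what a JSON POST body would carry
```

Under these assumptions, from_urlencoded(query) and from_json_body(json_body) produce the same self-describing JSON, which is the point of the new POST support: the transport changes, the resulting event does not.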
Many thanks to community member Mike Robins at Snowplow partner Snowflake Analytics for contributing this feature!
For information on the new POST-based capability, please check out the setup guide for the Iglu webhook adapter.
4. Other improvements
This release also contains further improvements to EmrEtlRunner and StorageLoader:
- In EmrEtlRunner, we now pass the GZIP compression argument to S3DistCp as “gz” not “gzip” (#2679). This makes it easier to query enriched events from Apache Spark, which does not recognize .gzip as a file extension for GZIP compressed files
- Also in EmrEtlRunner, we fixed a bug where files were being double compressed as the output of the Hadoop Shred step if the Hadoop Enrich step was skipped (#2586)
- In StorageLoader, we opted to use the Northern Virginia endpoint instead of the global endpoint for us-east-1 (#2748). This may have some benefits in terms of improved eventual consistency behavior (still under observation)
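To illustrate why the file suffix in the first item matters, here is a simplified Python sketch of suffix-based codec detection. It is a stand-in written for this post, not Spark's or Hadoop's actual code, and the registry contents are illustrative.

```python
import os

# Illustrative suffix-to-codec registry; a real reader's table will differ.
CODECS_BY_SUFFIX = {".gz": "gzip", ".bz2": "bzip2"}


def detect_codec(path):
    """Return the codec name for a path, or None if the suffix is unknown."""
    _, suffix = os.path.splitext(path)
    return CODECS_BY_SUFFIX.get(suffix)
```

With a registry like this, a file named part-00000.gz resolves to the gzip codec, while part-00000.gzip resolves to nothing, so a reader working this way would not decompress it.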
5. Upgrading
Upgrading is simple: update the hadoop_enrich job version in your configuration YAML.
For a complete example, see our sample configuration file.
6. Roadmap
We have renamed the upcoming milestones for Snowplow to be more flexible around the ultimate sequencing of releases. Upcoming Snowplow releases, in no particular order, include:
- R8x [HAD] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)
- R8x [RT] ES 2.x support, which will add support for Elasticsearch 2.x to our real-time pipeline, and also add the SQL Query Enrichment to this pipeline
- R8x [HAD] Spark data modeling, which will allow arbitrary Spark jobs to be added to the EMR jobflow to perform data modeling prior to (or instead of) Redshift
- R8x [HAD] Synthetic dedupe, which will deduplicate event_ids with different event_fingerprints (synthetic duplicates) in Hadoop Shred
Note that these releases are always subject to change between now and the actual release date.
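As an illustration of what the Synthetic dedupe milestone targets, the following Python sketch flags event_ids that occur with more than one event_fingerprint. It covers detection only; how the release will resolve such duplicates is not described here, and the dictionary-based event shape is an assumption made for the example.

```python
from collections import defaultdict


def find_synthetic_duplicates(events):
    """Return the event_ids that appear with more than one distinct
    event_fingerprint, sorted for stable output."""
    fingerprints = defaultdict(set)
    for event in events:
        fingerprints[event["event_id"]].add(event["event_fingerprint"])
    return sorted(eid for eid, fps in fingerprints.items() if len(fps) > 1)


# Hypothetical sample: "a" repeats with two fingerprints (synthetic),
# "b" repeats with the same fingerprint (an ordinary duplicate).
sample_events = [
    {"event_id": "a", "event_fingerprint": "f1"},
    {"event_id": "a", "event_fingerprint": "f2"},
    {"event_id": "b", "event_fingerprint": "f3"},
    {"event_id": "b", "event_fingerprint": "f3"},
]
```

Here only "a" counts as a synthetic duplicate, because its two occurrences carry different fingerprints.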
7. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.