We are pleased to announce the immediate availability of Snowplow 61, Pygmy Parrot.
This release has a variety of new features, operational enhancements and bug fixes. The major additions are:
- You can now parse Amazon CloudFront access logs using Snowplow
- The latest Clojure Collector version supports Tomcat 8 and CORS, ready for cross-domain
- EmrEtlRunner’s failure handling and Clojure Collector log handling have been improved
The rest of this post will cover the following topics:
- CloudFront access log parsing
- Clojure Collector updates
- Operational improvements to EmrEtlRunner
- Bug fixes in Scala Common Enrich
- Getting help
1. CloudFront access log parsing
We have added the ability to parse Amazon CloudFront access logs (web distribution format only) to the Snowplow Hadoop-based pipeline.
If you use CloudFront as your CDN for web content, you can now use Snowplow to process your CloudFront access logs. Snowplow will enrich these logs with the user-agent, page URI fragments and geo-location as standard.
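To give a feel for the input format, here is a minimal Python sketch of reading a CloudFront web distribution access log. It is not Snowplow's parser (that lives in the Hadoop-based pipeline); the function name `parse_cloudfront_log` is made up for illustration. It assumes only the general file layout: header lines starting with `#`, a `#Fields:` line naming the columns, and tab-separated data rows.

```python
from typing import Dict, Iterable, Iterator, List
from urllib.parse import unquote


def parse_cloudfront_log(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per request row of a CloudFront web distribution log.

    Column names are read from the "#Fields:" header in the file itself,
    so this sketch does not hard-code a particular column layout.
    """
    fields: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#Fields:"):
            # Header names are separated by whitespace.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            # Other header lines, such as "#Version: 1.0".
            continue
        values = line.split("\t")  # data rows are tab-separated
        yield dict(zip(fields, values))


# Illustrative input with a shortened, made-up column list.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status cs(User-Agent)",
    "2015-02-01\t10:15:30\t192.0.2.10\tGET\t/index.html\t200\tMozilla/5.0%20(X11)",
]

for row in parse_cloudfront_log(sample):
    # The user-agent field arrives percent-encoded, so decode it before use.
    print(row["cs-uri-stem"], unquote(row["cs(User-Agent)"]))
```

Real log files carry many more columns than this sample; reading the names from the header keeps the sketch independent of the exact list.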
To process CloudFront access logs, first create a new EmrEtlRunner
- Set your :raw:in:bucket to where your logs are written
- Set your
- Provide new bucket paths and a new job name, to prevent this job from clashing with your existing Snowplow job(s)
If you are running the Snowplow batch (Hadoop) flow with Amazon Redshift, you should deploy the relevant event table into your Amazon Redshift database. You can find the table definition here:
You can either load these events into your existing atomic.events table or, if you prefer, load them into an all-new database or schema. If you load into your existing atomic.events table, make sure to schedule these loads so that they don’t clash with your existing loads.
2. Clojure Collector updates
We have updated the Clojure Collector to run using Tomcat 8, which is now the default Tomcat version when creating a new Tomcat application on Elastic Beanstalk.
As of this release the Clojure Collector supports CORS and the CORS equivalent for ActionScript; this will allow the Clojure Collector to support events being
We have also added the ability to disable the setting of third-party cookies altogether: simply configure the cookie duration to 0 and the Clojure Collector will not set its third-party cookie.
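The "duration of 0 means no cookie" rule can be pictured with a small Python sketch. This is not the Clojure Collector's code: the function name, the cookie name `sp` and the choice to treat negative durations like 0 are all assumptions made for illustration.

```python
from http.cookies import SimpleCookie
from typing import Optional

COOKIE_NAME = "sp"  # hypothetical cookie name, for illustration only


def third_party_cookie_header(user_id: str, duration_seconds: int) -> Optional[str]:
    """Return a Set-Cookie header value, or None when the cookie is disabled."""
    if duration_seconds <= 0:
        # A duration of 0 disables the cookie; this sketch treats
        # negative values the same way.
        return None
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = user_id
    cookie[COOKIE_NAME]["max-age"] = duration_seconds
    cookie[COOKIE_NAME]["path"] = "/"
    return cookie[COOKIE_NAME].OutputString()
```

A caller would add the `Set-Cookie` header to the response only when the function returns a value.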
3. Operational improvements to EmrEtlRunner
We have made various operational improvements to EmrEtlRunner.
If there are no raw event files to process, EmrEtlRunner will now exit with a specific return code. This return code is then detected by snowplow-runner-and-loader.sh, which will then exit without failure. In other words: an absence of files to process no longer causes snowplow-runner-and-loader.sh to exit with failure.
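The exit-code handling described above follows a common wrapper pattern, sketched here in Python rather than shell. This is not the contents of snowplow-runner-and-loader.sh: the function name is made up, and the value `3` is a placeholder because this post does not state the actual return code.

```python
import subprocess
from typing import List

# Hypothetical value: the post does not state the actual return code.
NO_DATA_EXIT_CODE = 3


def run_etl_then_load(etl_cmd: List[str], load_cmd: List[str]) -> int:
    """Run the ETL step, then the load step, returning a shell-style exit code."""
    etl = subprocess.run(etl_cmd)
    if etl.returncode == NO_DATA_EXIT_CODE:
        # Nothing to process: skip the load step and report success.
        return 0
    if etl.returncode != 0:
        # Any other non-zero code is a genuine failure.
        return etl.returncode
    return subprocess.run(load_cmd).returncode
```

Reserving a dedicated code for "nothing to do" lets a scheduler such as cron tell an empty run apart from a broken one.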
We have updated EmrEtlRunner’s handling of Clojure Collector logs. The logs now get renamed on move to:
This new filename format means that the raw logs will be archived to /yyyy-MM-dd sub-folders, just as the CloudFront Collector’s logs are.
Finally, we have updated EmrEtlRunner to also check that the enriched events bucket is empty prior to moving any raw logs into staging. If any enriched events are found, then the move to staging will not start. This makes for much smoother operation when you are running your enrichment process very frequently (e.g. hourly).
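The pre-flight check can be sketched as follows, using local directories as stand-ins for the S3 buckets. The function name and arguments are hypothetical, and EmrEtlRunner's real implementation works against S3 and is not shown here.

```python
import shutil
from pathlib import Path


def stage_raw_logs(raw_in: Path, staging: Path, enriched: Path) -> bool:
    """Move raw logs into staging only if the enriched location is empty.

    Returns True if the move happened, False if it was skipped.
    """
    if any(enriched.iterdir()):
        # Output from a previous run is still present, so leave the
        # raw logs where they are.
        return False
    # sorted() materialises the listing before any file is moved.
    for log_file in sorted(raw_in.iterdir()):
        shutil.move(str(log_file), str(staging / log_file.name))
    return True
```

Checking before moving anything means a skipped run leaves the raw logs untouched, so the next scheduled run can pick them up.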
4. Bug fixes in Scala Common Enrich
We have fixed various bugs in Scala Common Enrich, mostly related to character encoding issues:
- We fixed a bug where our Base64 decoding did not specify the UTF-8 charset, causing problems with Unicode text on EMR where the default character set is
- We removed an incorrect extra layer of URL decoding from non-Base64-encoded JSONs (#1396)
- There was a mismatch between the Snowplow Tracker Protocol, which mandated ti_nm for transaction item names, and Scala Common Enrich, which was expecting ti_na for the same. Scala Common Enrich now supports both options (#1401)
- We have updated the SnowplowAdapter component to accept “charset=UTF-8” as a content-type parameter, because some web browsers always attach this content-type parameter to their
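Two of the fixes above are easy to picture in code. The sketch below is in Python, not the Scala of Scala Common Enrich, and both function names are made up. The first helper always decodes Base64 payload bytes as UTF-8 instead of relying on a platform default; the second accepts either ti_nm or ti_na, with ti_nm winning when both are present (that precedence is an assumption of this sketch).

```python
import base64
from typing import Mapping, Optional


def decode_base64_utf8(encoded: str) -> str:
    """Decode Base64 text, always interpreting the bytes as UTF-8."""
    return base64.b64decode(encoded).decode("utf-8")


def transaction_item_name(params: Mapping[str, str]) -> Optional[str]:
    """Accept either ti_nm or ti_na for a transaction item's name."""
    # Precedence of ti_nm over ti_na is an assumption of this sketch.
    return params.get("ti_nm", params.get("ti_na"))


encoded = base64.b64encode("café".encode("utf-8")).decode("ascii")
print(decode_base64_utf8(encoded))
print(transaction_item_name({"ti_na": "Blue t-shirt"}))
```

Decoding the same bytes with a single-byte charset such as Latin-1 would not raise an error; it would silently produce mangled text, which is why naming the charset explicitly matters.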
5. Upgrading
You need to update EmrEtlRunner to the latest code (0.12.0) on GitHub:
If you currently use snowplow-runner-and-loader.sh, upgrade to the latest version too.
This release bumps the Hadoop Enrichment process to version 0.13.0.
In your config.yml file, update your hadoop_shred jobs’ versions like so:
For a complete example, see our sample
This release bumps the Clojure Collector to version 1.0.0.
You will not be able to upgrade an existing Tomcat 7 cluster to use this version. Instead, to upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting “Save As…”
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector’s application
- Click the “Launch New Environment” action
- Click “Upload New Version” and upload your warfile
When you are confident that the new collector is performing as expected, you can choose the “Swap Environment URLs” action to put the new collector live.
6. Getting help
For more details on this release, please check out the r61 Pygmy Parrot Release Notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.