Snowplow R109 Lambaesis real-time pipeline upgrade
We are pleased to announce the release of Snowplow R109 Lambaesis, named after [the archaeological site in north-eastern Algeria](https://en.wikipedia.org/wiki/Lambaesis). This release focuses on upgrading the AWS real-time pipeline components, although it also updates EmrEtlRunner and Spark Enrich for batch pipeline users.
This release is one of the most community-driven releases in the history of Snowplow Analytics. As such, we would like to give a huge shout-out to each of the contributors who made it possible:
- Kevin Irwin and Rick Bolkey from OneSpot
- Saeed Zareian from the Globe and Mail
- Arihant Surana from HiPages
- Dani Solà from Simply Business
- Robert Kingston from Mint Metrics
Please read on after the fold for:
- Enrichment process updates
- Scala Stream Collector updates
- EmrEtlRunner bugfix
- Supporting community contributions
- Upgrading
- Roadmap
- Getting help
1. Enrichment process updates
1.1 Externalizing the file used for the user agent parser enrichment
Up until this release, the User Agent Parser Enrichment relied on a “database” of user agent regexes that was embedded alongside the code. With this release, we have externalized this file to decouple updates to the file from updates to the library, which gives us a lot more flexibility.
This User Agent Parser Enrichment update is available for both batch and real-time users, and we’ll be doing the same thing for the Referer Parser Enrichment as well.
Huge thanks to Kevin Irwin for contributing this change!
1.2 More flexible Iglu webhook
Up to this release, if you were to POST a JSON array to the Iglu webhook, such as:
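(A sketch of such a payload; the field names are invented for illustration.)

```json
[
  { "item": "laptop", "quantity": 1 },
  { "item": "monitor", "quantity": 2 }
]
```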
The Iglu webhook would assume you were sending a singleton event with an array of objects at its root; the schema would look like the following:
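(A sketch of what such a self-describing JSON Schema would need to look like, using a hypothetical com.acme vendor and event name.)

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "self": {
    "vendor": "com.acme",
    "name": "order_items",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "item": { "type": "string" },
      "quantity": { "type": "integer" }
    }
  }
}
```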
We have now changed this behavior to instead treat an incoming array as multiple events which, in our case, would each have the following schema:
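(Continuing the same hypothetical example, each event is now validated against an object-rooted schema.)

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "self": {
    "vendor": "com.acme",
    "name": "order_item",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "item": { "type": "string" },
    "quantity": { "type": "integer" }
  }
}
```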
This should make it easier to work with event sources which need to POST
events to Snowplow in bulk.
1.3 Handle a comma-separated list of IP addresses
We have seen Snowplow users and customers encountering X-Forwarded-For headers containing a comma-separated list of IP addresses, which occurs when the request has gone through multiple load balancers or proxies. The header in the raw event payload will accumulate the different IP addresses, for example:
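(An illustrative header using reserved documentation addresses.)

```
X-Forwarded-For: 203.0.113.24, 198.51.100.7, 192.0.2.44
```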
According to the specification for this header, the first address is supposed to be the original client IP address whereas the following ones correspond to the successive proxies.
Based on this, we have chosen to keep only the first IP address when we encounter a comma-separated list.
1.4 Stream Enrich updates
This section is for updates that apply to the real-time pipeline only.
Before this release, the Kinesis endpoint for Stream Enrich was determined by the AWS region you wanted to run in. Unfortunately, this didn’t allow for the use of projects like LocalStack, which lets you mimic AWS services locally.
Thanks to Arihant Surana, it is now possible to optionally specify a custom endpoint directly through the customEndpoint
configuration.
Note that this feature is also available for the Scala Stream Collector.
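As a sketch, assuming a local Kinesis mock such as LocalStack (the surrounding key names may differ slightly depending on your configuration version):

```hocon
# Hypothetical fragment of the Stream Enrich (or Scala Stream Collector) HOCON
# configuration: point the Kinesis client at a local endpoint instead of the
# regional AWS one. The URL below is an example for a local Kinesis mock.
customEndpoint = "http://localhost:4568"
```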
1.5 Spark Enrich updates
This section is for updates that apply to the batch pipeline only.
This release introduces support for the 26-field CloudFront access log format that was released in January, for Snowplow users processing CloudFront access logs.
You can find more information in the AWS documentation; thanks to Moshe Demri for flagging the issue.
We have also taken advantage of our work on CloudFront to leverage the x-forwarded-for
field to populate the user’s IP address. Thanks a lot to Dani Solà for contributing this change!
1.6 Miscellaneous updates
Thanks a lot to Saeed Zareian for a flurry of build dependency updates and Robert Kingston for example updates.
2. Scala Stream Collector updates
2.1 Reject requests with “do not track” cookies
The Scala Stream Collector can now reject requests which contain a cookie with a specified name and value. If the request is rejected based on this cookie, no tracking will happen: no events will be sent downstream and no cookies will be sent back.
The configuration takes the following form:
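Something along these lines (a sketch; the cookie name and value shown are placeholders):

```hocon
doNotTrackCookie {
  enabled = true
  # requests carrying a cookie with this name and value will not be tracked
  name = "sp-do-not-track"
  value = "true"
}
```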
You will have to set this cookie yourself, on a domain which the Scala Stream Collector can read.
2.2 Customize the response from the root route
It is now possible to customize what is sent back when hitting the /
route of the Scala Stream Collector. Whereas the collector always sent a 404 before, you can now customize it through the following configuration:
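For example (a sketch; the status code and body shown are placeholders):

```hocon
rootResponse {
  enabled = true
  statusCode = 200
  # the body served when hitting the collector's root URL
  body = "This endpoint collects behavioural data for example.com."
}
```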
This neat feature lets you provide an information page about your event collection and processing on the collector’s root URL, ready for site visitors to review.
2.3 Support for HEAD requests
The Scala Stream Collector now supports HEAD
requests wherever GET
requests were supported previously.
2.4 Allow for multiple domains in crossdomain.xml
You can now specify an array of domains when specifying your /crossdomain.xml
route:
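For instance (the domain values are placeholders):

```hocon
crossDomain {
  enabled = true
  # serve a /crossdomain.xml allowing these domains
  domains = [ "*.acme.com", "*.example.com" ]
  secure = true
}
```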
3. EmrEtlRunner bugfix
In R108 we started leveraging the official AWS Ruby SDK in EmrEtlRunner and replaced our deprecated Sluice library.
Unfortunately, the functions we wrote to run the different empty-file checks were recursive and could blow up the stack if you have a large number of empty files in S3 (more than 5,000 files in our tests).
This issue can prevent the Elastic MapReduce job from being launched.
We’ve now fixed this by making those functions iterative.
On a side note: we now encourage everyone to use s3a when referencing buckets in the EmrEtlRunner configuration because, when using s3a, those problematic empty files are simply not generated.
4. Supporting community contributions
We have taken advantage of this release to improve how we support our community of open source developers and other contributors. This initiative translates into:
- A new Gitter room for Snowplow, where you can chat with the Snowplow engineers and share ideas on contributions you would like to make to the project
- A new contributing guide
- New issue and pull request templates to give better guidance if you are looking to contribute
5. Upgrading
5.1 Upgrading Stream Enrich
A new version of Stream Enrich incorporating the changes discussed above can be found on our Bintray here.
5.2 Upgrading Spark Enrich
If you are a batch pipeline user, you’ll need to either update your EmrEtlRunner configuration to the following:
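(Only the relevant fragment of config.yml is shown, assuming the usual enrich/versions block.)

```yaml
enrich:
  versions:
    spark_enrich: 1.16.0 # version of the Spark Enrichment process
```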
or directly make use of the new Spark Enrich available at:
s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.16.0.jar
5.3 Upgrading the User Agent Parser Enrichment
To make use of an external user agent database, you can update your enrichment file to the following:
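Something along these lines (a sketch; the exact parameter names and hosted-assets path should be checked against the example in our repository, linked below):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/ua_parser_config/jsonschema/1-0-1",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "ua_parser_config",
    "enabled": true,
    "parameters": {
      "database": "regexes-latest.yaml",
      "uri": "s3://snowplow-hosted-assets/third-party/ua-parser"
    }
  }
}
```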
Note the bump to version 1-0-1, as well as the specification of the location of the user agent database. The database is the one maintained in the uap-core repository.
An example can be found in our repository.
We will be keeping the external user agent database that we host in Amazon S3 up-to-date as the upstream project releases new versions of it.
5.4 Upgrading the Scala Stream Collector
A new version of the Scala Stream Collector incorporating the changes discussed above can be found on our Bintray here.
To make use of this new version, you’ll need to amend your configuration in the following ways:
- Add a doNotTrackCookie section
- Add a rootResponse section
- Turn crossDomain.domain into crossDomain.domains

These three changes are sketched together below.
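(A sketch combining the three additions; the values shown are placeholders.)

```hocon
collector {
  doNotTrackCookie {
    enabled = true
    name = "sp-do-not-track"
    value = "true"
  }

  rootResponse {
    enabled = true
    statusCode = 200
    body = "Event collection endpoint"
  }

  crossDomain {
    enabled = true
    domains = [ "*.acme.com" ]
    secure = true
  }
}
```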
A full configuration can be found in the repository.
5.5 Upgrading EmrEtlRunner
The latest version of EmrEtlRunner is available from our Bintray here.
We also encourage you to switch all of your bucket paths to s3a, which will prevent the pipeline’s S3DistCp steps from creating empty files, like so:
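(A fragment of config.yml with placeholder bucket names; your configuration will contain additional buckets.)

```yaml
aws:
  s3:
    buckets:
      enriched:
        good: s3a://my-pipeline-bucket/enriched/good
        bad: s3a://my-pipeline-bucket/enriched/bad
        archive: s3a://my-pipeline-bucket/enriched/archive
      shredded:
        good: s3a://my-pipeline-bucket/shredded/good
        bad: s3a://my-pipeline-bucket/shredded/bad
        archive: s3a://my-pipeline-bucket/shredded/archive
```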
6. Roadmap
Upcoming Snowplow releases include:
- R110 Valle dei Templi, porting our streaming enrichment process to Google Cloud Dataflow, leveraging the Apache Beam APIs
- R11x [BAT] Increased stability, improving batch pipeline stability
7. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problem, please visit our Discourse forum.