Snowplow R103 Paestum released with IP Lookups Enrichment upgrade
We are proud to announce the release of Snowplow R103 Paestum. This release centers on upgrading the IP Lookups Enrichment for both the batch and streaming pipelines, given the impending end of life of MaxMind’s legacy databases.
It also ships with a security improvement for cross-domain policy management on the Clojure Collector.
Read on for more information on R103 Paestum, named after [the ancient city in Italy][paestum]:
- Upgrading the IP Lookups Enrichment
- Cross-domain policy management for the Clojure collector
- PII enrichment for the batch pipeline
- Community contributions
- Upgrading
- Roadmap
- Getting help
1. Upgrading the IP Lookups Enrichment
As described in our Discourse post, MaxMind will not provide monthly updates to their now-legacy databases starting April 2nd.
To tackle this issue and keep the IP Lookups Enrichment as accurate as possible, we are releasing a new version of the enrichment, for both the batch and streaming pipelines, which works with MaxMind’s new GeoIP2 database format.
A special thanks to Tiago Macedo and Andrew Korzhuev, who worked on the scala-maxmind-iplookups library upgrade, without which this enrichment upgrade wouldn’t have been possible.
2. Cross-domain policy management for the Clojure collector
On the security side of things, we have made the cross-domain policy of the Clojure Collector configurable; this change is in line with the updates made to the Scala Stream Collector back in Release 98 Argentomagus.
First, what is a Flash cross-domain policy? Quoting the Adobe website:
A cross-domain policy file is an XML document that grants a web client, such as Adobe Flash Player or Adobe Acrobat (though not necessarily limited to these), permission to handle data across domains. When clients request content hosted on a particular source domain and that content make requests directed towards a domain other than its own, the remote domain needs to host a cross-domain policy file that grants access to the source domain, allowing the client to continue the transaction.
To allow a Flash media player hosted on another web server to access content from the Adobe Media Server web server, we require a crossdomain.xml file. A typical use case will be HTTP streaming (VOD or Live) to a Flash Player. The crossdomain.xml file grants a web client the required permission to handle data across multiple domains.
A cross-domain policy file gives the necessary permissions when, for example, you are trying to make a request to a Snowplow collector from a Flash game given that both are running on different hosts.
Until now, the Clojure Collector embedded a very permissive cross-domain policy file, giving permission to any domain and not enforcing HTTPS.
With this release, we’re completely removing the /crossdomain.xml route by default. Should you need it, manually re-enable it by adding the two following environment properties to your Elastic Beanstalk application:
- SP_CDP_DOMAIN: the domain that is granted access; *.acme.com will match both http://acme.com and http://sub.acme.com.
- SP_CDP_SECURE: a boolean indicating whether to only grant access to HTTPS or both HTTPS and HTTP sources
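To make the effect of the two properties concrete, here is a minimal Python sketch of how a policy file could be rendered from them. This is an illustration only, not the Clojure Collector's actual code: the function names are hypothetical, and we assume SP_CDP_SECURE is read as the string "true" or "false" and that the route stays disabled unless both properties are set.

```python
import os
from typing import Mapping, Optional
from xml.sax.saxutils import quoteattr


def render_cross_domain_policy(domain: str, secure: bool) -> str:
    """Render a Flash cross-domain policy granting access to one domain pattern."""
    secure_attr = "true" if secure else "false"
    return (
        '<?xml version="1.0"?>\n'
        "<cross-domain-policy>\n"
        # quoteattr escapes the value and wraps it in quotes for us.
        f'  <allow-access-from domain={quoteattr(domain)} secure="{secure_attr}" />\n'
        "</cross-domain-policy>\n"
    )


def policy_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the policy XML, or None when the route should stay disabled.

    Assumption: both properties must be present to re-enable the route.
    """
    domain = env.get("SP_CDP_DOMAIN")
    secure = env.get("SP_CDP_SECURE")
    if domain is None or secure is None:
        return None
    return render_cross_domain_policy(domain, secure.strip().lower() == "true")


if __name__ == "__main__":
    # Grant access to acme.com and its subdomains, HTTPS sources only.
    print(policy_from_env({"SP_CDP_DOMAIN": "*.acme.com", "SP_CDP_SECURE": "true"}))
```

Under these assumptions, a domain of * combined with a secure value of false would reproduce the kind of permissive policy described above, which is why the route is now off by default.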
3. PII enrichment for the batch pipeline
This release also marks the availability of the PII enrichment for the batch pipeline; check out the dedicated blog post to learn more.
4. Community contributions
This release contains quite a few community contributions which we’d like to highlight. Huge thanks to everyone involved!
4.1 Improvement to the IP address extractor
Thanks to Mike Robins from Snowflake Analytics, extracting IP addresses from collector payloads originating from the Scala Stream Collector has gotten better.
Snowplow now successfully extracts IPv6 IPs from these Scala Stream Collector payloads, and now inspects the Forwarded header in addition to the historically supported X-Forwarded-For header.
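As a rough illustration of what handling both headers involves, the following Python sketch pulls a client IP out of either header, including bracketed IPv6 addresses with a port. It is not the extractor shipped in Snowplow: the function names are hypothetical, and the precedence given to X-Forwarded-For over Forwarded is an assumption made for the example.

```python
import ipaddress
import re
from typing import Mapping, Optional

# Matches the for= parameter of a Forwarded header element (RFC 7239),
# with or without surrounding double quotes.
_FOR_PARAM = re.compile(r'\bfor=("[^"]*"|[^;,\s]+)', re.IGNORECASE)


def _clean_ip(candidate: str) -> Optional[str]:
    """Strip quotes, brackets and an optional port; return a valid IP or None."""
    value = candidate.strip().strip('"')
    if value.startswith("["):
        # Bracketed IPv6, optionally followed by :port
        value = value[1:].split("]", 1)[0]
    elif value.count(":") == 1:
        # IPv4 followed by :port
        value = value.split(":", 1)[0]
    try:
        return str(ipaddress.ip_address(value))
    except ValueError:
        # Covers obfuscated identifiers such as "unknown" or "_hidden"
        return None


def extract_client_ip(headers: Mapping[str, str]) -> Optional[str]:
    """Return the first client IP found in X-Forwarded-For, then Forwarded."""
    lowered = {name.lower(): value for name, value in headers.items()}
    xff = lowered.get("x-forwarded-for")
    if xff:
        # The left-most entry is the originating client.
        ip = _clean_ip(xff.split(",")[0])
        if ip:
            return ip
    forwarded = lowered.get("forwarded")
    if forwarded:
        match = _FOR_PARAM.search(forwarded)
        if match:
            return _clean_ip(match.group(1))
    return None


if __name__ == "__main__":
    print(extract_client_ip({"Forwarded": 'for="[2001:db8:cafe::17]:4711";proto=https'}))
    print(extract_client_ip({"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}))
```

Validating candidates with the standard ipaddress module means IPv4 and IPv6 values go through the same code path, and anything that is not a real address is simply ignored.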
4.2 Improvements to the Mandrill integration
An unexpected subaccount property in the Mandrill events format has meant that many Mandrill events have been failing enrichment.
To resolve this, community member Adam Gray has authored new 1-0-1 schemas for our Mandrill events, and updated the adapter to emit these new versions.
4.3 Documentation improvements
Finally, thanks to Kristoffer Snabb and Thales Mello for improving the repo-embedded documentation, as follows:
- Redirecting our users to Discourse for support requests in our CONTRIBUTING.md
- Renaming Caravel to Superset in our README.md
5. Upgrading
5.1 Upgrading the IP Lookups Enrichment
Whether you are using the batch or streaming pipeline, it is important to perform this upgrade if you make use of the MaxMind IP Lookups Enrichment.
To make use of the new enrichment, you will need to update your ip_lookups.json so that it conforms to the new 2-0-0 schema.
An example is provided in the GitHub repository.
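If you manage several enrichment configurations, a quick check like the Python sketch below can flag files still on an older schema. The example in the GitHub repository remains the authoritative reference: the helper names here are hypothetical, and the shape of the sample configuration (the Iglu URI layout, the geo parameter, the database file name and the placeholder URI) is assumed for illustration.

```python
import json

# Assumed Iglu URI for the new schema; confirm it against the repository example.
NEW_SCHEMA = "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0"


def schema_version(config: dict) -> str:
    """Return the SchemaVer found at the end of the configuration's Iglu URI."""
    return config["schema"].rsplit("/", 1)[-1]


def needs_upgrade(config: dict) -> bool:
    """True when the configuration targets a schema older than 2-0-0."""
    major = int(schema_version(config).split("-")[0])
    return major < 2


# Illustrative configuration only: field names and values are assumptions.
example = {
    "schema": NEW_SCHEMA,
    "data": {
        "name": "ip_lookups",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": True,
        "parameters": {
            "geo": {
                "database": "GeoLite2-City.mmdb",
                "uri": "http://example.com/maxmind",  # hypothetical location
            }
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(example, indent=2))
    print("needs upgrade:", needs_upgrade(example))
```

To check a real file, load it with json.load and pass the resulting dictionary to needs_upgrade.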
5.1.1 Stream Enrich
If you are a streaming pipeline user, a version of Stream Enrich incorporating the upgraded IP Lookups Enrichment can be found on our Bintray.
5.1.2 Spark Enrich
If you are a batch pipeline user, you’ll need to either update the Spark Enrich version in your EmrEtlRunner configuration to 1.13.0, or directly make use of the new Spark Enrich available at:
s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.13.0.jar
5.2 Upgrading the Clojure Collector
The new Clojure Collector is available in S3 at:
s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.0.0-standalone.war
To re-enable the /crossdomain.xml path, make sure to specify the SP_CDP_DOMAIN and SP_CDP_SECURE environment properties as described above.
6. Roadmap
We have a packed schedule of new and improved features coming for Snowplow. Upcoming Snowplow releases will include:
- R104 Stoplesteinan, fixing some issues in EmrEtlRunner’s “Stream Enrich mode” which were identified following the release of R102
- R10x [STR] PII Enrichment phase 2, enhancing our recently-released GDPR-focused PII Enrichment for the realtime pipeline
- R10x [STR] New webhooks and enrichment, featuring Marketo and Vero webhook adapters from our partners at Snowflake Analytics
- R10x Vallei dei Templi, porting our streaming enrichment process to Google Cloud Dataflow, leveraging the Apache Beam APIs
7. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.