Snowplow 72 Great Spotted Kiwi released
We are pleased to announce the release of Snowplow version 72 Great Spotted Kiwi. This release adds the ability to track clicks through the Snowplow Clojure Collector, introduces a new cookie extractor enrichment, and ships new deduplication queries that leverage R71's event fingerprint.
The rest of this post will cover the following topics:
- Click tracking
- New cookie extractor enrichment
- New deduplication queries
- Upgrading
- Getting help
- Upcoming releases
1. Click tracking
Although the Snowplow JavaScript Tracker offers link click tracking, there are scenarios where you want to track a link click without having access to JavaScript. Two common examples are: tracking clicks on ad units, and users downloading files using `curl` or `wget`.
To support these use cases we have added a new URI redirect mode into the Clojure Collector. You update your link’s URI to point to your event collector, and the collector receives the click, logs a URI redirect event and then performs a 302 redirect to the intended URI. This is the exact model followed by ad servers to track ad clicks.
To use this functionality:
- Set your collector path to `/r/tp2?` – the `/r/tp2` tells Snowplow that you are attempting a URI redirect
- Add a `&u=` argument to your collector URI, set to the URL-encoded final URI to redirect to
- On clicking this link, the collector will register the click and then perform a 302 redirect to the supplied URI
- As well as the `&u=` parameter, you can populate the collector URI with any other fields from the Snowplow Tracker Protocol
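For example, assuming a collector running at the hypothetical domain collector.acme.com, a click-through to http://www.example.com/product which also sets the application ID (the Tracker Protocol's `aid` field) would look like this:

```text
http://collector.acme.com/r/tp2?u=http%3A%2F%2Fwww.example.com%2Fproduct&aid=acme-site
```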
The URI redirection will be recorded using the JSON Schema `com.snowplowanalytics.snowplow/uri_redirect/jsonschema/1-0-0`.
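Behind the scenes, the redirect is captured as a self-describing event. A minimal sketch of the payload, assuming the 1-0-0 schema simply records the target URI, would be:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/uri_redirect/jsonschema/1-0-0",
  "data": {
    "uri": "http://www.example.com/product"
  }
}
```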
For more information on how this functionality works, check out the Click tracking section in our Pixel Tracker documentation.
We will be adding this capability into the Scala Stream Collector in Release 74.
2. New cookie extractor enrichment
One powerful attribute of having Snowplow event collection on your own domain (e.g. `events.acme.com`) is the ability to capture first-party cookies set by other services on your domain, such as ad servers or CMSes; these cookies are stored as HTTP headers in the Thrift raw event payload by the Scala Stream Collector.
Prior to this release there was no way of accessing these cookies in the Snowplow Enrichment process – until now, with Snowplow community member Kacper Bielecki’s new Cookie Extractor Enrichment. This is our first community-contributed enrichment – a huge milestone and hopefully the first of many! Thanks so much Kacper.
The example configuration JSON for this enrichment, following the standard enrichment configuration format (name, vendor, enabled flag and parameters), looks like this:
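```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/cookie_extractor_config/jsonschema/1-0-0",
  "data": {
    "name": "cookie_extractor_config",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "cookies": [
        "sp"
      ]
    }
  }
}
```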
This default configuration is capturing the Scala Stream Collector's own `sp` cookie – in practice you would probably extract other more valuable cookies available on your company domain. Each extracted cookie will end up as a single derived context following the JSON Schema `org.ietf/http_cookie/jsonschema/1-0-0`.
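For example, a captured `sp` cookie would surface as a derived context along these lines (illustrative identifier value):

```json
{
  "schema": "iglu:org.ietf/http_cookie/jsonschema/1-0-0",
  "data": {
    "name": "sp",
    "value": "cdd02382-0e51-4c6e-b3cd-1d4e9dc933a6"
  }
}
```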
For more information see the Cookie extractor enrichment page on the Snowplow wiki.
Please note that this enrichment only works with events recorded by the Scala Stream Collector – the CloudFront and Clojure Collectors do not capture HTTP headers.
3. New deduplication queries
This release comes with three new SQL scripts that deduplicate events in Redshift using the event fingerprint that was introduced in Snowplow R71. For more information on duplicates, see our recent blog post that explores the phenomenon in more detail.
The first script deduplicates rows with the same `event_id` and `event_fingerprint`. Because these events are identical, the script leaves the earliest one in `atomic` and moves all others to a separate schema. There is an optional last step that also moves all remaining duplicates (same `event_id` but different `event_fingerprint`). Note that this could delete legitimate events from `atomic`.
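To make the mechanics concrete, here is a minimal sketch (not the shipped script) of how these natural duplicates can be identified in Redshift:

```sql
-- Minimal sketch: list event_id/event_fingerprint pairs occurring more than
-- once in atomic.events. The shipped script goes on to keep the earliest row
-- per pair and move the rest out of atomic.
SELECT event_id,
       event_fingerprint,
       COUNT(*) AS copies
FROM atomic.events
GROUP BY event_id, event_fingerprint
HAVING COUNT(*) > 1;
```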
The second is an optional script that deduplicates rows with the same `event_id` where at least one row has no `event_fingerprint` (older events). The script is identical to the first script, except that an event fingerprint is generated in SQL.
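As an illustration of that idea (the actual script hashes a different, carefully chosen set of fields), a fingerprint could be backfilled in Redshift along these lines:

```sql
-- Illustrative only: approximate an event fingerprint for rows that predate
-- the fingerprint enrichment by hashing a few tracker-protocol fields
SELECT event_id,
       MD5(
         COALESCE(useragent, '') ||
         COALESCE(user_ipaddress, '') ||
         COALESCE(page_url, '') ||
         COALESCE(domain_userid, '')
       ) AS event_fingerprint
FROM atomic.events
WHERE event_fingerprint IS NULL;
```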
The third script is a template that can be used to deduplicate unstructured event or custom context tables. Note that contexts can have legitimate duplicates (e.g. two or more product contexts that join to the same parent event). If that is the case, make sure that the context is defined in such a way that no two identical contexts are ever sent with the same event. The script combines rows when all fields but `root_tstamp` are equal. There is an optional last step that moves all remaining duplicates (same `root_id` but at least one field other than `root_tstamp` is different) from `atomic` to `duplicates`. Note that this could delete legitimate events from `atomic`.
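As a sketch of the approach, for a hypothetical context table `atomic.com_acme_product_1` with a single `sku` column (a real table would also include its `schema_*` and `ref_*` columns in the grouping):

```sql
-- Collapse rows that agree on every field except root_tstamp,
-- keeping the earliest timestamp for each group
SELECT root_id,
       MIN(root_tstamp) AS root_tstamp,
       sku
FROM atomic.com_acme_product_1
GROUP BY root_id, sku;
```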
These scripts can be run after each load using SQL Runner. Make sure to run the setup queries first.
4. Upgrading
Upgrading the Clojure Collector
This release bumps the Clojure Collector to version 1.1.0.
To upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting “Save As…”
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector’s application
- Click the “Upload New Version” button and upload your warfile
Updating the configuration files
You need to update the version of the Enrich jar in your configuration file:
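The setting to bump lives under the `enrich` section of your EmrEtlRunner `config.yml`, sketched here with a placeholder version (match the key style of your existing file):

```yaml
enrich:
  versions:
    hadoop_enrich: x.y.z # Bump to the Hadoop Enrich version given in the R72 release notes
```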
If you wish to use the new cookie extractor enrichment, write a configuration JSON and add it to your enrichments folder. The example JSON can be found here.
Updating your database
Install the following tables in Redshift as required:
- For the new URI redirect functionality, `com_snowplowanalytics_snowplow_uri_redirect_1`
- For the new cookie extractor enrichment, `org_ietf_http_cookie_1`
5. Getting help
For more details on this release, please check out the R72 Great Spotted Kiwi release notes on GitHub. Specific documentation on the two new features is available here:
- The Click tracking section in our Pixel Tracker documentation
- The Cookie extractor enrichment page
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.
6. Upcoming releases
By popular request, we are adding a section to these release blog posts to trail upcoming Snowplow releases. Note that these releases are always subject to change between now and the actual release date.
Upcoming releases are:
- Release 73 Cuban Macaw, which removes the JSON fields from `atomic.events` and adds the ability to load bad rows into Elasticsearch
- Release 74 Bird TBC, which brings the Kinesis pipeline up-to-date with the most recent Scala Common Enrich releases. This will also include click redirect support in the Scala Stream Collector
Other milestones being actively worked on include Avro support, Weather enrichment and Snowplow CLI.