Snowplow 71 Stork-Billed Kingfisher released
We are pleased to announce the release of Snowplow version 71 Stork-Billed Kingfisher. This release significantly overhauls Snowplow’s handling of time and introduces event fingerprinting to support deduplication efforts. It also brings our validation of unstructured events and custom context JSONs “upstream” from our Hadoop Shred process into our Hadoop Enrich process.
The rest of this post will cover the following topics:
- Better handling of event time
- JSON validation in Scala Common Enrich
- New unstructured event fields in enriched events
- New event fingerprint enrichment
- New CloudFront access log fields
- More performant handling of missing schemas
- Using SSL in the StorageLoader
- New approach to atomic.events upgrades
- Other improvements
- Upgrading
- Getting help
1. Better handling of event time
This release implements our new approach to determining when events occurred, as introduced in the recent blog post Improving Snowplow’s understanding of time.
Specifically, this release:
- Renames dvce_tstamp to dvce_created_tstamp to remove ambiguity
- Adds the derived_tstamp field to our Canonical Event Model
- Adds the true_tstamp field, in readiness for our trackers adding support for this
- Implements the algorithm set out in that blog post to calculate the most accurate derived_tstamp available
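As a rough illustration of that algorithm (a minimal sketch, not the production Scala Common Enrich code; the field names simply mirror atomic.events), the derived timestamp corrects the collector timestamp for the device's clock skew:

```scala
import java.time.{Duration, Instant}

// Minimal sketch of the derived timestamp calculation described in the
// "Improving Snowplow's understanding of time" post. Not the actual
// Scala Common Enrich implementation.
def deriveTstamp(
  trueTstamp: Option[Instant],         // set explicitly by the tracker, if supported
  collectorTstamp: Instant,            // stamped by the collector (trusted clock)
  dvceCreatedTstamp: Option[Instant],  // device clock when the event was created
  dvceSentTstamp: Option[Instant]      // device clock when the event was sent
): Instant = (trueTstamp, dvceCreatedTstamp, dvceSentTstamp) match {
  // A true timestamp, when present, always wins
  case (Some(tt), _, _) => tt
  // Otherwise subtract the device-side delay from the collector timestamp,
  // cancelling out device clock skew: derived = collector - (sent - created)
  case (_, Some(created), Some(sent)) =>
    collectorTstamp.minus(Duration.between(created, sent))
  // Fall back to the collector timestamp
  case _ => collectorTstamp
}
```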
2. JSON validation in Scala Common Enrich
Previously, validation of unstructured events and custom context self-describing JSONs was only performed in our Hadoop Shred process, in preparation for loading Redshift. With self-describing JSONs growing more and more central to event modeling within Snowplow, it became increasingly important to bring this validation “upstream” into Scala Common Enrich.
Thanks to Dani Sola, the Scala Hadoop Shred validation code for unstructured event and custom context JSONs is now also executed from within Scala Common Enrich.
This means that Scala Hadoop Enrich now validates unstructured event and custom context JSONs; in the next Kinesis pipeline release, Scala Kinesis Enrich will validate these JSONs too.
Please note: if the unstructured event or any of the custom contexts fail validation against their respective JSON Schemas in Iglu, then the event will be failed and written to the bad bucket.
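For context, unstructured events and custom contexts are self-describing JSONs, so validation means resolving each referenced schema from Iglu and validating the inner data against it. A hypothetical payload is shown below; com.acme/button_click is a made-up schema used purely for illustration:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
  "data": {
    "schema": "iglu:com.acme/button_click/jsonschema/1-0-0",
    "data": {
      "buttonId": "checkout"
    }
  }
}
```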
3. New unstructured event fields in enriched events
Now that we are validating unstructured events in Scala Common Enrich (rather than simply passing them through), we can extract some key information about the unstructured event for storage in our Canonical Event Model.
Therefore, Dani has added event_vendor, event_name, event_format, and event_version fields to our enriched event model. This makes it a lot easier to analyze the distribution of your event types just by looking at atomic.events. Many thanks, Dani!
These are the values of the new event fields for our five “legacy” event types which aren’t (yet) modeled using self-describing JSON:
| Legacy event type | event_name | event_vendor | event_format | event_version |
|---|---|---|---|---|
| Page view | page_view | com.snowplowanalytics.snowplow | jsonschema | 1-0-0 |
| Page ping | page_ping | com.snowplowanalytics.snowplow | jsonschema | 1-0-0 |
| Transaction | transaction | com.snowplowanalytics.snowplow | jsonschema | 1-0-0 |
| Transaction item | transaction_item | com.snowplowanalytics.snowplow | jsonschema | 1-0-0 |
| Structured event | event | com.google.analytics | jsonschema | 1-0-0 |
4. New event fingerprint enrichment
Duplicate events are a hot topic in the Snowplow community – see the recent blog post Dealing with duplicate event IDs for a detailed exploration of the phenomenon.
As a first step in making it easier to identify and quarantine duplicates, this release introduces a new Event fingerprint enrichment.
The new enrichment creates a fingerprint from a hash of the Tracker Protocol fields set in an event’s querystring (for GET requests) or body (for POST requests). You can configure a list of Tracker Protocol fields to exclude from the hash generation. For example, in our default configuration we exclude:
- “eid” (event_id), because we will typically review event IDs separately when investigating duplicates
- “stm” (dvce_sent_tstamp), since this field could change between two different attempts to send the same event
- “nuid” (network_userid), because a single event that is sent twice to a collector on a computer that does not accept third-party cookies would be assigned different network_userids (despite being a duplicate)
- “cv” (v_collector), because this is attached by the Clojure Collector rather than by the tracker
The configuration JSON for this enrichment uses the standard self-describing enrichment format. A representative example, reflecting the default exclusions described above (check the enrichment's example JSON in the Snowplow repository for the canonical version), is as follows:
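```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/event_fingerprint_config/jsonschema/1-0-0",
  "data": {
    "name": "event_fingerprint_config",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "excludeParameters": ["cv", "eid", "nuid", "stm"],
      "hashAlgorithm": "MD5"
    }
  }
}
```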
5. New CloudFront access log fields
In July, an AWS update added four new fields to the CloudFront access log format.
The Snowplow CloudFront access log input format (not to be confused with the CloudFront Collector) now supports these new fields. You can use this migration script to upgrade your Redshift table accordingly.
6. More performant handling of missing schemas
Previously the Scala Hadoop Shred process would take an extremely long time to complete if a JSON Schema referenced across many events could not be found in any Iglu repository.
This was because, although our underlying Iglu client cached successfully-found schemas, it did not remember which schemas it had already failed to find; this led to an expensive HTTP lookup on every missing schema instance. The latest release fixes this problem.
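As a simplified sketch of the fix (illustrative only, not the actual Iglu client code), caching failed lookups as well as successful ones caps the cost of a missing schema at a single HTTP round trip:

```scala
import scala.collection.mutable

// Illustration of "negative caching": remember schema lookups that failed
// as well as those that succeeded, so a schema missing from every Iglu
// repository is fetched over HTTP once, not once per event.
// Not the actual Iglu client implementation.
class CachingResolver(underlying: String => Either[String, String]) {
  // The cache stores failures (Left) alongside successes (Right)
  private val cache = mutable.Map.empty[String, Either[String, String]]

  def lookup(schemaKey: String): Either[String, String] =
    cache.getOrElseUpdate(schemaKey, underlying(schemaKey))
}
```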
7. Using SSL in the StorageLoader
Snowplow community member Dennis Waldron has contributed the ability to connect to Postgres and Redshift using SSL. To do this, add an “ssl_mode” field to each target in your configuration YAML:
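A minimal sketch of a target with the new field is shown below; all fields other than ssl_mode simply stand in for your existing StorageLoader target definition, and supported values include, for example, disable and require:

```yaml
:targets:
  - :name: "My Redshift database"
    :type: redshift
    :host: "xxx.us-east-1.redshift.amazonaws.com"   # illustrative endpoint
    :database: snowplow
    :port: 5439
    :ssl_mode: disable   # new field, e.g. disable or require
    :username: storageloader
    :password: secret
    :table: atomic.events
```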
Thanks Dennis!
8. New approach to atomic.events upgrades
Starting in this release, we are taking a new approach to upgrading the atomic.events table. Previous upgrades would typically rename the existing table to “atomic.events_{{old version}}”, create a new table with the new structure, and copy all events over.

From this release onwards, our upgrades to atomic.events will only ever mutate the existing table using ALTER statements. This is intended to make upgrades to existing Redshift databases much faster.

To prevent confusion about the version of a particular atomic.events table, the table creation and migration scripts now record the version against the table using the COMMENT statement.
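For example, assuming a hypothetical version string of 0.8.0 (the official creation and migration scripts set the real value), the convention looks like this:

```sql
-- Records the table definition version as a comment on the table.
-- '0.8.0' is a placeholder; the official script sets the real version string.
COMMENT ON TABLE atomic.events IS '0.8.0';
```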
9. Other improvements
We have also:
- Upgraded Scala Hadoop Shred to use Hadoop version 2.4 (#1720)
- Added validation for v_collector and collector_tstamp (#1611)
- Upgraded to version 0.2.4 of the referer-parser (#1839)
- Upgraded to version 1.16 of user-agent-utils (#1905)
- Changed the BadRow class to use ProcessingMessages rather than Strings (#1936)
- Added an exception handler around the whole of Scala Common Enrich (#1954)
- Updated our web-incremental data models so that failure is recoverable (#1974)
- Fixed a bug where Scala Hadoop Enrich didn't correctly attach the original Thrift payloads to bad rows (#1950)
10. Upgrading
Installing EmrEtlRunner and StorageLoader
The latest versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
Unzip this file to a sensible location (e.g. /opt/snowplow-r71).
Updating the configuration files
You should update the versions of the Enrich and Shred jars in your configuration file:
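A sketch of the relevant fragment is below; the surrounding structure should match your existing file, and the version numbers are placeholders, so use the values given in the R71 release notes:

```yaml
  :versions:
    :hadoop_enrich: x.y.z   # placeholder - use the Hadoop Enrich version for R71
    :hadoop_shred: x.y.z    # placeholder - use the Hadoop Shred version for R71
```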
You should also update the AMI version field:
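Again as a placeholder sketch, in the EMR section of the same file:

```yaml
:emr:
  :ami_version: x.y.z   # placeholder - use the AMI version given in the R71 release notes
```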
For each of your database targets, you must add the new ssl_mode field:
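For example (only ssl_mode is new; the other lines stand in for your existing target definition, as in the snippet in section 7 above):

```yaml
  - :name: "My Redshift database"
    # ...existing target fields...
    :ssl_mode: disable   # e.g. disable or require
```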
If you wish to use the new event fingerprint enrichment, write a configuration JSON and add it to your enrichments folder. The example JSON can be found here.
Updating your database
Use the appropriate migration script to update your version of the atomic.events table to the latest schema:
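To give a feel for what the migration involves, here is an illustrative sketch based on the fields discussed above; the column types are assumptions, and the official script for your current table version (which also handles encodings and the version comment) is authoritative:

```sql
-- Illustrative sketch only; run the official migration script instead.
ALTER TABLE atomic.events RENAME COLUMN dvce_tstamp TO dvce_created_tstamp;
ALTER TABLE atomic.events ADD COLUMN event_vendor varchar(1000);
ALTER TABLE atomic.events ADD COLUMN event_name varchar(1000);
ALTER TABLE atomic.events ADD COLUMN event_format varchar(128);
ALTER TABLE atomic.events ADD COLUMN event_version varchar(128);
ALTER TABLE atomic.events ADD COLUMN event_fingerprint varchar(128);
ALTER TABLE atomic.events ADD COLUMN true_tstamp timestamp;
ALTER TABLE atomic.events ADD COLUMN derived_tstamp timestamp;
```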
If you are ingesting CloudFront access logs with Snowplow, use the CloudFront access log migration script to update your com_amazon_aws_cloudfront_wd_access_log_1 table.
11. Getting help
For more details on this release, please check out the R71 Stork-Billed Kingfisher release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.