Snowplow 63 Red-Cheeked Cordon-Bleu released
We are pleased to announce the immediate availability of Snowplow 63, Red-Cheeked Cordon-Bleu. This is a major release which adds two new enrichments, upgrades existing enrichments and significantly extends and improves our Canonical Event Model for loading into Redshift, Elasticsearch and Postgres.
The new and upgraded enrichments are as follows:
- New enrichment: parsing useragent strings using the ua_parser library
- New enrichment: converting the money amounts in e-commerce transactions into a base currency using Open Exchange Rates
- Upgraded: extracting click IDs in our campaign attribution enrichment, so that Snowplow event data can be more precisely joined with campaign data
- Upgraded: our existing MaxMind-powered IP lookups
- Upgraded: useragent parsing using the user_agent_utils library can now be disabled
This release has been a huge team effort – with particular thanks going to Snowplow winterns Aalekh Nigam (2014/15) and Jiawen Zhou (2013/14) for their work on the new enrichments and the foundational scala-forex library respectively.
Table of contents:
- New enrichment: useragent parsing using ua_parser
- New enrichment: currency conversion for e-commerce transactions
- Upgraded enrichment: click ID extraction for campaign attribution
- Upgraded enrichment: timezone lookup from IP address using MaxMind
- Upgraded enrichment: useragent parsing using user_agent_utils
- Other improvements to Scala Common Enrich
- Updates to atomic.events
- Updates to the Kinesis applications
- Upgrading your Snowplow pipeline
- Getting help
1. New enrichment: useragent parsing using ua_parser
Since close to its inception, Snowplow has used the user-agent-utils Java library to perform useragent parsing. Various limitations with that library have led us to explore and evaluate other options, including the ua-parser project with its uap-java library for the JVM. Testing suggests that this library handles some useragent strings (such as mobile app useragents) better than user-agent-utils. We particularly like that the ‘database’ which ua-parser checks useragent strings against is a YAML file, making it straightforward for users to update the treatment of new useragents themselves as they emerge, rather than waiting on a new release of user-agent-utils.
As part of our move towards pluggable enrichments, from this release Snowplow users can employ user-agent-utils, or ua-parser, or both, or neither. We believe we are the first analytics platform to give users such a high degree of choice in their Enrichment process.
The behavior of user-agent-utils, if enabled, is unchanged; if ua-parser is enabled, the Snowplow Enrichment process will write its results into a new context, ua_parser_context.
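Enabling ua-parser, like the other configurable enrichments in this release, is done by dropping a configuration JSON into your folder of enrichments (see section 9 below). A minimal sketch, assuming the standard enrichment configuration format; check the wiki page for the authoritative version:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/ua_parser_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "ua_parser_config",
    "enabled": true,
    "parameters": {}
  }
}
```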
For more details on this enrichment, see the ua parser enrichment wiki page.
2. New enrichment: currency conversion for e-commerce transactions
Since early 2014, Snowplow trackers including the JavaScript Tracker have let you record the currency in which an e-commerce transaction took place. This is the first release of Snowplow which extracts those currency fields into our Canonical Event Model – and we have gone further: this release also includes a new enrichment which will automatically convert these transactions into your “base” or “home” currency for reporting and analytics.
Working with Open Exchange Rates, our new currency conversion enrichment will look up the end-of-day (“EOD”) rate between your transaction’s currency and your preferred base currency for the day prior to the e-commerce transaction, and use this to convert all monetary amounts in your e-commerce currency into your base currency. The converted values are all stored in new fields, so you can continue to work with the original amounts as well.
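For example, if a transaction is recorded in GBP on 2015-02-10 and your base currency is USD, the enrichment applies the EOD rate for 2015-02-09; at a purely hypothetical rate of 1 GBP = 1.52 USD, a tr_total of 10.00 would produce a tr_total_base of 15.20.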
To take advantage of this new enrichment, you will need to sign up for an account at Open Exchange Rates and provide your API key in the enrichment’s JSON configuration file.
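A minimal sketch of that configuration file is below. The schema URI and parameter names follow the enrichment configuration format as we understand it (an account type, your Open Exchange Rates API key, your base currency and which rate to use); treat it as illustrative and check the wiki page for the authoritative version:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/currency_conversion_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "currency_conversion_config",
    "enabled": true,
    "parameters": {
      "accountType": "DEVELOPER",
      "apiKey": "YOUR_OER_API_KEY",
      "baseCurrency": "EUR",
      "rateAt": "EOD_PRIOR"
    }
  }
}
```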
For more details on this enrichment, see the Currency conversion enrichment wiki page.
3. Upgraded enrichment: click ID extraction for campaign attribution
Many advertising systems attach a globally unique “click ID” tracking parameter to destination URIs to help advertisers attribute clicks to campaigns. The most well-known of these click IDs are gclid (Google), msclkid (Microsoft) and dclid (DoubleClick).
We have extended our existing campaign attribution enrichment to look for and extract the value assigned to this click ID, populating the mkt_clickid field with the click ID and the mkt_network field with the name of the network. This makes it possible to join Snowplow data back to campaign data from the marketing channel that drove the user to your site much more precisely than with the existing source / medium / term / campaign / content parameters, so you can report exactly what you paid for that click and then calculate the return on investment for it based on the user’s subsequent actions.
By default the campaign attribution enrichment identifies the three click IDs given above, but you can also configure the enrichment with your own list of click IDs and network names:
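For illustration, a configuration along these lines should work. The schema version and parameter names reflect our best understanding of the enrichment’s configuration format, and the "examplid" entry is a purely hypothetical custom click ID; check the campaign attribution enrichment wiki page for the authoritative version:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/campaign_attribution/jsonschema/1-0-1",
  "data": {
    "name": "campaign_attribution",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "mapping": "static",
      "fields": {
        "mktMedium": ["utm_medium"],
        "mktSource": ["utm_source"],
        "mktTerm": ["utm_term"],
        "mktContent": ["utm_content"],
        "mktCampaign": ["utm_campaign"],
        "mktClickId": {
          "gclid": "Google",
          "msclkid": "Microsoft",
          "dclid": "DoubleClick",
          "examplid": "ExampleAdNetwork"
        }
      }
    }
  }
}
```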
If you know of another click ID that would be useful to the wider Snowplow community, please do add it to this ticket.
For more details on this enrichment, see the campaign attribution enrichment wiki page.
4. Upgraded enrichment: timezone lookup from IP address using MaxMind
We have extended the IP lookup enrichment to extract the timezone information that MaxMind provides about an IP address. This timezone information is stored in a new field in the Canonical Event Model, geo_timezone.
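The new field holds a timezone name such as 'Europe/London'. As a quick, purely illustrative sketch of how you might use it once events are loaded into Redshift:

```sql
-- Illustrative only: count events by the timezone MaxMind derived from the IP address
SELECT geo_timezone, COUNT(*) AS events
FROM atomic.events
WHERE geo_timezone IS NOT NULL
GROUP BY geo_timezone
ORDER BY events DESC;
```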
For more details on this enrichment, see the IP lookups enrichment wiki page.
5. Upgraded enrichment: useragent parsing using user_agent_utils
Our existing useragent parsing enrichment built on user-agent-utils is no longer hardcoded to run – instead, it is now a user-configurable enrichment. The fields that it populates in atomic.events are unchanged.
To enable it to run as before, you must add a JSON configuration file to your folder of enrichments. See 9.1.1 Configuring enrichments for details.
For more details on this enrichment, see the user-agent-utils enrichment wiki page.
6. Other improvements to Scala Common Enrich
A set of smaller new features and capabilities have been added to Scala Common Enrich in this release:
- Netaporter’s more permissive URI library is used to parse querystrings if the Apache Commons httpclient fails. Many thanks to Dani Solà for this contribution! (#1429)
- The refr_domain_userid and refr_dvce_tstamp fields as set by the JavaScript Tracker’s new cross-domain linker are now extracted (#1461). This makes it possible for users running a network of websites to stitch user actions across domains using only first-party cookies; see the example after this list. (We will cover how to do this in a future blog post.)
- The domain_sessionid field is now populated based on the “sid” parameter. Session ID is a client-side generated UUID to complement the existing session index (#1541)
- The dvce_sent_tstamp field is now populated based on the “stm” parameter. This is useful for determining when a tracker sent an event (versus creating that event) (#1383)
- We have bumped referer-parser to version 0.2.3 (#670)
- The original IP address is now extracted from the CollectorPayload’s headers (#1372)
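As an illustration of the cross-domain linking mentioned above (a sketch based on our understanding of the JavaScript Tracker’s cross-domain linker; the _sp parameter name and value format should be verified against the tracker documentation), the linker decorates outbound links with the source site’s domain_userid and a device timestamp, which Scala Common Enrich now extracts into the two new fields:

```
Decorated link:       http://other.example.com/landing?_sp=7c1e2a4b-9d3f-4e8a-b2c5-0f1a2b3c4d5e.1423567890123

refr_domain_userid →  7c1e2a4b-9d3f-4e8a-b2c5-0f1a2b3c4d5e   (domain_userid on the referring site)
refr_dvce_tstamp   →  1423567890123                           (device timestamp when the link was clicked)
```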
7. Updates to atomic.events
This release makes a comprehensive set of updates to the atomic.events table (whether Redshift or Postgres), specifically:
- New fields as per the updated Canonical Event Model. These new fields are largely for the new enrichments, but we are also aiming to somewhat “future-proof” atomic.events by adding new fields which we plan on using in the near future
- Updates to existing fields, primarily so Snowplow can record a wider range of values in those fields
The new fields required in atomic.events (whether Redshift or Postgres) are as follows:
Column name | Data type (1) | From (2) | Description |
---|---|---|---|
tr_currency | char(3) | Tracker | Currency for e-commerce transaction |
tr_total_base | dec(18, 2) | CCE | Conversion to base currency |
tr_tax_base | dec(18, 2) | CCE | Conversion to base currency |
tr_shipping_base | dec(18, 2) | CCE | Conversion to base currency |
ti_currency | char(3) | Tracker | Currency for e-commerce transaction item |
ti_price_base | dec(18, 2) | CCE | Conversion to base currency |
base_currency | char(3) | CCE | CCE configuration option |
geo_timezone | varchar(64) | ILE | Timezone for IP address |
mkt_clickid | varchar(64) | CAE | Unique ID for advertising click |
mkt_network | varchar(64) | CAE | Advertising network of click ID |
etl_tags | varchar(500) | Enrich | Tags describing this run. Not yet implemented |
dvce_sent_tstamp | timestamp | Tracker | When device sent event |
refr_domain_userid | varchar(36) | Tracker | Extracted from cross-domain linker |
refr_dvce_tstamp | timestamp | Tracker | Extracted from cross-domain linker |
derived_contexts | varchar(15000) | Enrich | Contexts derived in the Enrich process |
domain_sessionid | char(36) | Tracker | Client-side session ID, complements index |
derived_tstamp | timestamp | Enrich | Calculated timestamp. Not yet implemented |
(1) The data type is taken from Redshift; data types for some columns in Postgres are different
(2) Where:
- CCE = Currency conversion enrichment
- ILE = IP lookups enrichment
- CAE = Campaign attribution enrichment
We have also made the following changes to the table definitions:
Column name | New data type (1) | Old data type | Reason |
---|---|---|---|
user_ipaddress | varchar(45) | varchar(19) | To support IPv6 addresses |
unstruct_event | varchar(15000) * | varchar(10000) | To support larger JSONs |
domain_userid | varchar(36) | varchar(16) | To support UUIDs as domain user IDs |
contexts | varchar(15000) * | varchar(10000) | To support more custom contexts |
page_urlpath | varchar(3000) | varchar(1000) | To support longer page URL paths |
page_urlquery | varchar(6000) | varchar(3000) | To support longer page URL queries (e.g. from ad servers) |
page_urlfragment | varchar(3000) | varchar(255) | To support longer page fragments |
refr_urlpath | varchar(3000) | varchar(1000) | To support longer page URL paths |
refr_urlquery | varchar(6000) | varchar(3000) | |
refr_urlfragment | varchar(3000) | varchar(255) | To support longer page fragments |
se_category | varchar(1000) | varchar(255) | To support longer category labels |
se_action | varchar(1000) | varchar(255) | |
se_label | varchar(1000) | varchar(255) | |
se_property | varchar(1000) | varchar(255) | |

* Also changed column encoding in Redshift from raw to lzo
(1) The data type is taken from Redshift; data types for some columns in Postgres are different
In addition to these changes, for Postgres we have removed the primary key constraint on event_id (#1187).
Finally, we have also added a foreign key constraint to all Redshift shredded JSON tables to make the joins back to the parent atomic.events table more performant (#1365).
8. Updates to the Kinesis applications
The main update to both Kinesis applications is to support the new enriched event format (see 7. Updates to atomic.events for details). Other noteworthy updates to the Scala Kinesis Enrich:
- The Scala Kinesis Enrich application now uses Scala Common Enrich 0.13.0, the latest version (#1369). Previously it was using Scala Common Enrich 0.11.0. This means that you can take advantage of all the enrichment updates in the Kinesis flow, and it also brings the Kinesis flow up-to-date with the various encoding-related fixes implemented in Scala Common Enrich 0.12.0
- Community member Kacper Bielecki updated the Scala Kinesis Enrich’s logging configuration (#1367)
There is also an important update to the Kinesis Elasticsearch Sink: we have stopped verifying the number of fields found in enriched event (#1333). This should make the Elasticsearch Sink more tolerant of potential future updates to Scala Kinesis Enrich.
9. Upgrading your Snowplow pipeline
This section contains upgrading instructions which are common to both our Elastic MapReduce and Kinesis pipelines.
To continue parsing useragent strings using the user_agent_utils library, you must add a new JSON configuration file into your folder of enrichment JSONs:
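A minimal sketch of such a file follows; the schema URI assumes the standard enrichment configuration format, so check the wiki for the authoritative version:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/user_agent_utils_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "user_agent_utils_config",
    "enabled": true,
    "parameters": {}
  }
}
```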
The name of the file is not important but must end in .json.
Configuring other enrichments is at your discretion. Useful resources here are the wiki pages for the individual enrichments linked in the sections above.
There are two steps to upgrading the EMR pipeline:
- Upgrade your EmrEtlRunner to use the latest Hadoop job versions
- Upgrade your Redshift and/or Postgres
atomic.events
table to the latest definitions
This release bumps:
- The Hadoop Enrichment process to version 0.14.0
- The Hadoop Shredding process to version 0.4.0
In your EmrEtlRunner’s config.yml file, update your Hadoop jobs versions like so:
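A sketch of the relevant snippet is below; the surrounding key names reflect our understanding of the EmrEtlRunner configuration format at the time of this release and may differ slightly in your file, but the version numbers are what matters:

```yaml
:etl:
  :versions:
    :hadoop_enrich: 0.14.0   # Version of the Hadoop Enrichment process
    :hadoop_shred: 0.4.0     # Version of the Hadoop Shredding process
```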
For a complete example, see our sample config.yml template.
You need to use the appropriate migration script to update to the new table definition:
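For example, assuming you are running Redshift and have downloaded the relevant migration script from the Snowplow repository, you could apply it with psql. The host, database, user and script name below are purely hypothetical placeholders:

```bash
# Hypothetical values throughout: substitute your own cluster endpoint, database,
# user and the migration script matching your current table version
psql -h my-cluster.abc123.eu-west-1.redshift.amazonaws.com -p 5439 \
     -U admin -d snowplow \
     -f migrate_atomic_events_for_r63.sql
```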
And that’s it – you should be fully upgraded.
If you want to make use of the new ua_parser based useragent parsing enrichment in Redshift, you must also deploy the new table for ua_parser_context into your atomic schema.
This release updates:
- Scala Kinesis Enrich, to version 0.4.0
- Kinesis Elasticsearch Sink, to version 0.2.0
The new version of the Kinesis pipeline is available on Bintray as snowplow_kinesis_r63_red_cheeked_cordon_bleu.zip. The download contains the latest versions of all of the Kinesis apps (Scala Stream Collector, Scala Kinesis Enrich, Kinesis Elasticsearch Sink, and Kinesis S3 Sink).
The components in the Kinesis topology updated in this release are highlighted in this graph:
Our recommended approach for upgrading is as follows:
- Kill your Scala Kinesis Enrich cluster
- Leave your Kinesis Elasticsearch Sink cluster running until all remaining enriched events are loaded, then kill this cluster too
- Upgrade your Scala Kinesis Enrich cluster to the new version
- Upgrade your Kinesis Elasticsearch Sink cluster to the new version
- Restart your Scala Kinesis Enrich cluster
- Restart your Kinesis Elasticsearch Sink cluster
10. Getting help
For more details on this release, please check out the r63 Red-Cheeked Cordon-Bleu Release Notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.