Snowplow 0.7.3 released, tracking additional data
We’re excited to announce the release of Snowplow version 0.7.3. This release adds a set of 16 all-new fields to our event model:
- A new Event Vendor field
- The Page URL split out into its component parts (scheme, host, port, path, querystring, fragment/anchor)
- The web page’s character set
- The web page’s width and height
- The browser’s viewport (i.e. visible width and height)
- For page pings, we are now tracking the user’s scrolling during the last ping period (four fields)
These fields should make possible a new set of analyses on Snowplow data, including analysing how deeply users engage with different web pages (e.g. what percentage of a web page they have viewed, and how fast). They should also make some existing analyses easier, e.g. aggregating (and comparing) metrics by page and by domain.
In addition, the new release includes some minor bug fixes. In this post we will cover:
1. New fields
We are hugely excited to be including 16 new fields in this release – we believe that these fields should unlock a whole host of new analyses on Snowplow data.
For completeness, we list out all of the new fields below. Note that all of the new fields are available in both the S3 (aka Hive) and Infobright (aka non-Hive) storage outputs:
Field | Datatype | Description |
---|---|---|
event_vendor | string | Which company or org. defined this event type |
page_urlscheme | string | Scheme aka protocol, e.g. “https” |
page_urlhost | string | Host aka domain, e.g. “www.snowplowio.site.strattic.io” |
page_urlport | int | Port if specified, 80 if not |
page_urlpath | string | Path to page, e.g. “/product/index.html” |
page_urlquery | string | Querystring, e.g. “id=GTM-DLRG” |
page_urlfragment | string | Fragment aka anchor, e.g. “4-conclusion” |
br_viewwidth | integer | The width of the browser’s viewport in pixels |
br_viewheight | integer | The height of the browser’s viewport in pixels |
doc_charset | string | The page’s character encoding, e.g. “UTF-8” |
doc_width | integer | The total width of the page (incl. non-viewed area) |
doc_height | integer | The total height of the page (incl. non-viewed area) |
pp_xoffset_min | integer | Minimum page x offset seen in the last ping period |
pp_xoffset_max | integer | Maximum page x offset seen in the last ping period |
pp_yoffset_min | integer | Minimum page y offset seen in the last ping period |
pp_yoffset_max | integer | Maximum page y offset seen in the last ping period |
Don’t worry if some of these new fields don’t make immediate sense based on the descriptions above – we will take a look at each of these fields in the sub-sections below:
1.1 Event vendor
As we have previously blogged, we are in the process of developing the Snowplow event model: the list of first-class events for which we’ve defined a structured data model. As we stressed in that blog post, we understand well that different models will be appropriate for different websites and applications, and that the model we develop will not be ideal for everyone. In the future, we plan to enable different companies to develop their own first-class data model within Snowplow. As a first step in this direction, we have added the Event vendor field to the Snowplow data model: when a company develops its own event data model, its events will be attributable to that company via this field. This will open up the possibility of:
- Ingesting proprietary events from third-party systems (e.g. event_vendor="com.sendgrid" or "com.appnexus")
- Ingesting clickstream events from other analytics services (e.g. event_vendor="com.adobe" or "com.mixpanel")
- Tracking custom events defined by a specific Snowplow user (e.g. event_vendor="au.com.asnowplowuser")
At the moment, however, all events will have an event_vendor field that will be “com.snowplowanalytics” (using the Java package-style naming convention).
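To make the naming convention concrete, here is a minimal sketch that derives a package-style vendor string by reversing the labels of a domain name. The helper `vendorFromDomain` is hypothetical and is not part of Snowplow; it only illustrates how a value like “au.com.asnowplowuser” relates to the domain it comes from.

```javascript
// Hypothetical helper (not part of Snowplow): build a Java package-style
// vendor string by reversing the dot-separated labels of a domain name.
function vendorFromDomain(domain) {
  return domain.trim().toLowerCase().split(".").reverse().join(".");
}

console.log(vendorFromDomain("snowplowanalytics.com")); // com.snowplowanalytics
console.log(vendorFromDomain("asnowplowuser.com.au"));  // au.com.asnowplowuser
```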
1.2 Page URL components
We have split the page_url into its six component parts (the unprocessed page_url field is left unchanged). Having these fields broken out should make it much easier to do page URL-based analyses, such as aggregating data for specific page_urls (ignoring query strings) or investigating HTTPS traffic.
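As a rough illustration of what each of the six fields holds, the sketch below splits a URL using the standard `URL` class found in browsers and Node.js. It is not the code Snowplow’s ETL uses; the function name `splitPageUrl` and the example URL are assumptions, and the default port of 80 simply follows the field table above.

```javascript
// Illustrative only: map a page URL onto the six new component fields
// using the standard WHATWG URL class. Not Snowplow's ETL implementation.
function splitPageUrl(pageUrl) {
  const url = new URL(pageUrl);
  return {
    page_urlscheme: url.protocol.replace(/:$/, ""),
    page_urlhost: url.hostname,
    // URL.port is "" when no port is given (or when it is the scheme's
    // default); the field table says to record 80 in that case.
    page_urlport: url.port === "" ? 80 : Number(url.port),
    page_urlpath: url.pathname,
    page_urlquery: url.search.replace(/^\?/, ""),
    page_urlfragment: url.hash.replace(/^#/, ""),
  };
}

const parts = splitPageUrl(
  "https://www.example.com/product/index.html?id=GTM-DLRG#4-conclusion"
);
console.log(parts.page_urlhost + parts.page_urlpath);
// www.example.com/product/index.html
```

Grouping on host plus path, as in the last line, is one way to aggregate by page while ignoring query strings.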
1.3 Viewport fields
Each event now tracks the current viewport of the browser – in other words, the viewable area (width x height) currently available within the browser.
This will enable analysts to distinguish browsing behavior based on viewport size, and see if there are specific events on a customer journey that trigger a user resizing his / her browser. (Which is a useful user-experience indicator.)
1.4 Document width and height
We are now tracking the complete width and height of the current document (aka web page) on each event. This tells you the total width and height of the current page, as perceived by the browser. This measures the whole document – i.e. including the non-viewable part of the document.
This can be used in conjunction with the new viewport fields (above) and page ping offsets (below) to analyse what fraction of a document a user has engaged with, and over what time period.
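One possible way to combine these fields is sketched below: it estimates what fraction of a page’s height was visible by the end of a ping period. The function name `fractionOfPageSeen` and the sample values are made up for illustration, and the sketch assumes offsets are pixels measured from the top of the document.

```javascript
// Sketch: estimate the fraction of a page's height seen in one ping period.
// Assumes pp_yoffset_max is in pixels from the top of the document, so the
// lowest pixel the user could see is pp_yoffset_max + br_viewheight.
function fractionOfPageSeen(event) {
  const { doc_height, br_viewheight, pp_yoffset_max } = event;
  if (!(doc_height > 0)) return null; // no usable document height
  const lowestPixelSeen = pp_yoffset_max + br_viewheight;
  return Math.min(1, lowestPixelSeen / doc_height);
}

// Made-up page ping: a 4000px page in an 800px viewport, scrolled to 1200px.
const ping = { doc_height: 4000, br_viewheight: 800, pp_yoffset_min: 0, pp_yoffset_max: 1200 };
console.log(fractionOfPageSeen(ping)); // 0.5
```

Taking the maximum of this value across all of a page view’s pings would give an overall scroll depth; comparing it against ping timestamps would show how quickly that depth was reached.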
1.5 Page ping offsets
These four new offset fields are perhaps the most complex new additions. First of all: these fields are only set if you enableActivityTracking() on your site. In a nutshell, activity tracking:
- Silently checks for user activity (mouse movements, scrolling, key presses etc) on a page for a specified time period (e.g. 10 seconds)
- If any user activity was detected in those 10 seconds, the tracker sends a “page ping” back to Snowplow. (No user activity, no page ping)
- This is then repeated for each new time period, until the user navigates away from the page
In this release we are sending four new offset fields along with each page ping event. These offsets track the minimum and maximum horizontal and vertical page offsets scrolled to by the user in the last page ping period. In other words: these four fields tell you how far left/right and up/down the user scrolled during the last ping period.
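The behaviour described above can be modelled in a few lines. This is a simplified sketch rather than the tracker’s actual code: the class `PingPeriodMonitor` and its methods are hypothetical, and the caller is assumed to invoke `endPeriod()` once per time period.

```javascript
// Simplified model of activity tracking for one page; not the tracker's code.
class PingPeriodMonitor {
  constructor() {
    this.reset();
  }

  // Clear the activity flag and the offsets seen so far.
  reset() {
    this.active = false;
    this.minX = this.maxX = this.minY = this.maxY = null;
  }

  // Call on any user activity, passing the page's current scroll offsets.
  recordActivity(xOffset, yOffset) {
    this.minX = this.active ? Math.min(this.minX, xOffset) : xOffset;
    this.maxX = this.active ? Math.max(this.maxX, xOffset) : xOffset;
    this.minY = this.active ? Math.min(this.minY, yOffset) : yOffset;
    this.maxY = this.active ? Math.max(this.maxY, yOffset) : yOffset;
    this.active = true;
  }

  // Call at the end of each period: returns the page ping's offset fields,
  // or null if there was no activity. Monitoring carries on either way.
  endPeriod() {
    const ping = this.active
      ? {
          pp_xoffset_min: this.minX,
          pp_xoffset_max: this.maxX,
          pp_yoffset_min: this.minY,
          pp_yoffset_max: this.maxY,
        }
      : null;
    this.reset();
    return ping;
  }
}

const monitor = new PingPeriodMonitor();
monitor.recordActivity(0, 100);
monitor.recordActivity(0, 900);
monitor.recordActivity(0, 400);
const firstPing = monitor.endPeriod();  // y offsets: min 100, max 900
const idlePing = monitor.endPeriod();   // null: no activity, so no page ping
monitor.recordActivity(0, 50);
const thirdPing = monitor.endPeriod();  // still sent after the idle period
console.log(firstPing, idlePing, thirdPing);
```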
Simply put: these new offset fields are designed to provide a clear view of how your users scroll around your webpages over time. (Especially when combined with the viewport and document width and height fields also listed above.)
Huge thanks to Rob Kingston for providing the original idea and implementation around page ping offsets, and helping us to test our implementation!
1.6 Document character set
Each event now tracks the document’s charset where available (not all browsers set this).
As well as the new fields introduced above, this release also includes a small set of bug fixes in the JavaScript tracker which are worth noting:
- Our logImpression() method was not working (it was using the wrong argument names) – this has now been fixed.
- The activity tracking (page ping) behavior was too fragile: if a single monitoring period elapsed with no activity, then all future monitoring would be cancelled. This could easily lead to on-page activity not being recorded. This has now been fixed.
The following table tracks the breaking changes and deprecations in this version. When upgrading to the latest version of the JavaScript tracker (0.10.0), please update your JavaScript tags as per the instructions below to avoid problems:
Type of change | Component | Change | Comment |
---|---|---|---|
Breaking change | JavaScript tracker | setAccount() removed | Use setCollectorCf() instead |
Breaking change | JavaScript tracker | setTracker() removed | Use getTrackerCf() instead |
Breaking change | JavaScript tracker | setHeartBeatTimer() removed | Use enableActivityTracking() instead |
Deprecation | JavaScript tracker | trackEvent() deprecated | Use trackStructEvent() instead |
Data change | S3 & Infobright storage | event="custom" changed | Changed to event="struct" |
The first three changes are simply cleanup: we are removing tracker methods which we deprecated some time ago.
The last two changes are us starting to re-structure our event tracking – we are making space in our event model to support unstructured events, which will be coming soon. Please check out our previous blog post, Help us build out the Snowplow Event Model for more background on this.
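If you want to audit your existing tags against the table above, something like the following sketch may help. The helper `findOutdatedCalls` is hypothetical, and it assumes you can list your tracker calls as arrays of the form [methodName, ...args]; the argument values shown are placeholders.

```javascript
// Replacement table copied from the breaking changes and deprecations above.
const REPLACEMENTS = {
  setAccount: { use: "setCollectorCf", kind: "removed" },
  setTracker: { use: "getTrackerCf", kind: "removed" },
  setHeartBeatTimer: { use: "enableActivityTracking", kind: "removed" },
  trackEvent: { use: "trackStructEvent", kind: "deprecated" },
};

// Hypothetical helper: `commands` is assumed to be an array of
// [methodName, ...args] arrays describing the tracker calls in your tags.
function findOutdatedCalls(commands) {
  return commands
    .filter(([method]) => Object.prototype.hasOwnProperty.call(REPLACEMENTS, method))
    .map(([method]) => ({ method, ...REPLACEMENTS[method] }));
}

// Placeholder calls for illustration only.
const tagCalls = [
  ["setAccount", "YOUR-CLOUDFRONT-SUBDOMAIN"],
  ["trackPageView"],
  ["trackEvent", "Videos", "Play"],
];

for (const call of findOutdatedCalls(tagCalls)) {
  console.log(`${call.method}() is ${call.kind}: use ${call.use}() instead`);
}
```

The sketch only reports method names. Whether arguments carry over unchanged to the replacement methods is something to confirm against the tracker documentation.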
Upgrading is a three-step process:
4.1 JavaScript tracker
Please update your website(s) to use the latest version of the JavaScript tracker, which is version 0.10.0. As always, the updated minified tracker is available here:
http(s)://d1fc8wv8zag5ca.cloudfront.net/0.10.0/sp.js
Don’t forget to update your Snowplow tags as per the breaking changes and deprecations above.
4.2 ETL
If you are using EmrEtlRunner, you need to update your configuration file, config.yml, to use the latest versions of the Hive serde and HiveQL scripts:

    :snowplow:
      :serde_version: 0.5.4
      :hive_hiveql_version: 0.5.5
      :non_hive_hiveql_version: 0.0.6
4.3 Infobright
If you are using Infobright Community Edition for analysis, you will need to update your table definition. To make this easier for you, we have created two scripts:
    4-storage/infobright-storage/migrate_004_to_006.sh
    4-storage/infobright-storage/migrate_005_to_006.sh
Choose the appropriate script depending on whether your current events table is events_004 or events_005.
Running this script will create a new table, events_006 (version 0.0.6 of the Infobright table definition) in your snowplow database, copying across all your data from your existing events table, which will not be modified in any way.
Once you have run this, don’t forget to update your StorageLoader’s config.yml to load into the new events_006 table, not your old events table:

    :storage:
      :type: infobright
      :database: snowplow
      :table: events_006 # NOT "events_004" or "events_005" any more
Done!
5. Getting help
As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.