How to get Snowplow data models to do more for you
It’s been over two years since the first commit of our snowplow-web dbt package, and almost three years since the first commit of our SQL Runner web models. In that time we have added countless features, increased performance, and are now used by over 100 unique dbt projects.
The latest versions of our web and utils packages are the result of months of work with a single goal – to make it easier and faster for our users to derive value from their data. This was done in two ways: first, to reduce the need to build custom models with our packages, and second, to expose the core logic of our packages in a flexible way so you can take advantage of the years of work and testing we have put into them.
With these latest releases, we enable you to:
- Add any fields from the atomic events table, including entities, directly to our derived page_views, sessions and users tables.
- Change the package’s session and user identifiers simply by changing a variable.
- Create and customize your own snowplow-sessionized models with custom identifiers via a series of macros.
Passthrough fields
Until version 0.16.0 of snowplow-web, if you wanted to use a field from the atomic events table or a custom entity in one of our derived tables, you had to disable our model and create a custom model where you connected to snowplow_web_base_events_this_run
to get the field. This was inefficient and often resulted in creating custom models just to add a single column.
Now, adding these fields is as simple as setting a variable. For example, to add the context_mycompany_important_entity_1
column in its entirety to the page views table, you need to set:
vars:
snowplow_web:
snowplow__page_view_passthroughs: ['context_mycompany_important_entity_1']
This change will then be reflected in the page views table for all future page views, without the need for a full update. More details and examples can be found here.
Before we delve into the other new features, it’s important to understand how our packages go beyond the standard incremental approach of dbt-core and why this is so important.
Aside – incremental sessionization
Imagine that you have set your dbt job to run once a day at exactly midnight, which means that all events from the previous day are processed in a single batch. This works fine for any session that occurs only on that day, but what about a user who starts a session at 23:55 and doesn’t finish it until 00:15 the next day? In that case, you would only have the first five minutes of events when the job runs.
Don’t worry, when it runs the next night, you’ll have the full 20 minutes of events to process. However, since dbt’s incremental approach only considers new events, you only process the new 15 minutes. Your aggregate session information is based on those last 15 minutes, overwriting the data from the first five minutes.
You could include a look-back buffer, but that would just shift the problem to a different cut-off time. You could build the model so that all aggregates are carefully incremented using the existing and new data, but a single error in a run would mean that a full update would be required. This is where our package logic comes in.
Our packages use a set of manifest tables to keep track of which sessions have been processed, and to ensure that all events in a session are reprocessed when new events occur in a run. To do this, we need to:
- Identify all new events since the last run.
- Identify the domain_sessionid of these events.
- Collect all events from these sessions.
- Process these events in the package.
This may sound simple, but we also quarantine long-running sessions to reduce costs, account for late-arriving events to increase accuracy, track session start and end times to optimize queries, and ensure that each session has only a single domain_userid
associated with it to avoid downstream problems. Trying to implement all of this yourself would be time-consuming and error-prone (trust us, we’ve already made and fixed most of these errors).
Customizable identifiers
Previously, the logic for modeling web data was set to domain_sessionid
as the session identifier, collector_tstamp
as the incremental timestamp, and domain_userid
as the user identifier. This meant that if you wanted to use a field other than these, you had to create your own version (or at least change ours). Not anymore.
With version 0.16.0 of snowplow-web, we have introduced snowplow_session_identifiers
, snowplow_session_timestamp
, and snowplow_user_identifiers
💡 Using load_tstamp for efficiency
If you have load_tstamp for all your data, and especially if it is the partition key of your atomic events table, then setting this as your snowplow__session_timestamp will not only reduce the cost of running our package, but also allow you to reduce the look-back window to near-zero as load_tstamp will never be earlier than the date from the last run for any new records, saving even more compute!
As an example, if you have a custom entity that you use to keep track of a session_id, and only want to fall back to domain_sessionid when it is not populated, then you can use this configuration:
vars:
snowplow_web:
snowplow__session_identifiers: [{'schema': 'com_mycompany_session_identifier_1', 'field': 'session_id', 'prefix': 'si'}, {'schema': 'atomic', 'field': 'domain_sessionid'}]
This would set your session identifier to be session_id
if it is not null, and to domain_session_id
if it is. More details and examples can be found here, including how to use custom SQL to transform your identifiers before using them.
In the web package, we have set this identifier to overwrite the domain_sessionid
field in all of the tables, to reduce the number of breaking changes we would make.
Snowplow “base” macros
To allow for this flexibility in our core package logic, we have combined all of this logic into a set of macros, which we collectively refer to as base
macros. Together, these six macros encapsulate the core sessionization logic of our packages and have a significant number of arguments that you can use to customize and optimize how they work to meet your needs.
These macros produce the new session_identifier
and user_identifier
fields to avoid conflicts, and are how we will create all new packages in the future. Using these macros, you can build custom models based on our package logic, using whatever identifiers you are tracking. Since these can be quite complex, we have created a repository of out-of-the-box examples that you can use here, and you can follow our quick start guide here.
What’s next?
With the addition of passthrough fields and custom session identifiers, we’ve made the biggest leap in the flexibility of our packages since their release, but we’re not stopping there.
Over the next year, we plan to add these features to all of our packages that use the same sessionization approach, as well as develop new packages, such as one to unify web and mobile data. We don’t have specific dates for these packages yet, but keep an eye on our Discourse for news as they are released!
You can find the latest versions of all our packages on the dbt hub and more information about each package in our documentation.