Introducing a new generation of our web data model


We are very excited to announce the release of Redshift Web model v1.1. This is the first of a series of planned releases intended to address a hugely important need for Snowplow: extensible, scalable, incremental data modeling. Next, we will be working on BigQuery and Snowflake versions of this model, as well as a standard mobile model.
A brief history of data modeling with Snowplow
One of the biggest advantages of using Snowplow is that it gives you full ownership of your raw, unopinionated data. This data is then aggregated and transformed using business logic in order to produce insights. It is hugely advantageous to have full control over this modeling logic so that you can tailor it to the nuances of your business. Therefore, one of the key drivers of deriving value with Snowplow is building out a data modeling process.
Our initial approach to helping our customers and community with data modeling was to release example drop-and-recompute web models that aggregated the out-of-the-box tracking from the Snowplow JavaScript tracker into a set of derived tables (page views, sessions and users), with the expectation that users could adapt them to their needs. Over time, however, some of the challenges with this approach have become apparent. The drop-and-recompute structure isn’t suitable for large datasets, and developing an incremental structure isn’t necessarily straightforward. Additionally, without a solid understanding of some of the nuances of how the data is tracked and processed, certain parts of the logic are difficult to reason about.
Over the years, we have developed various incremental models to address these challenges for Snowplow BDP customers. In guiding our customers through customizing and expanding on these models, further challenges have come to light. Firstly, maintaining customized SQL is incredibly difficult: once a model has been edited, rolling out changes or bug fixes to it is almost impossible. Secondly, building on a complex structure written by someone else is incredibly difficult even for the most skilled Analytics Engineer, creating a barrier to customers benefiting from our models to the full extent.
Why data modeling is important
There have been a few shifts in recent years that have hugely impacted the role data modeling plays in a company’s data strategy:
- Companies increasingly move away from packaged analytics vendors towards a data stack assembled from best-in-class tools
- Companies no longer use data for reporting only, but to inform decisions and trigger actions
- Companies collect more data than ever before, capturing user behaviour in granular detail
These shifts have meant that data modeling has become a core part of a company’s data infrastructure. Therefore, the ability to test models properly, to easily maintain and upgrade models as tracking and business goals change, and to keep track of model versions is now crucial.
What the new model brings
This new generation of the web model attempts to address these challenges. Specifically, it is designed to implement a SQL-as-software structure:
- We establish core modules which can be thought of as source code
- Each module has an explicit input and output (each module also has side effects; this is unavoidable)
- Each module has an ‘entry point’ for custom logic, which can be treated as a plugin
- Each module is testable in isolation
- Tests can be extended to custom modules
This structure allows us to separate out the ‘heavy lifting’ of an incremental Snowplow model by extracting the incremental logic into its own ‘base’ module. The base module produces a table which contains only the events relevant to this run of the incremental logic: both new events and those events that require recomputing (for example because they are part of an ongoing session). The same structure can then be applied to all three tables, i.e. the page views model acts as the base module for the sessions model, and so on.
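
To make this concrete, here is a minimal sketch of what the base module’s incremental selection could look like in Redshift SQL. The manifest and scratch table names (derived.base_manifest, scratch.events_this_run) are assumptions for illustration, not the model’s actual schema; atomic.events, collector_tstamp and domain_sessionid are standard Snowplow names.

```sql
-- Minimal sketch of the base module's incremental selection (assumed names).
CREATE TABLE scratch.events_this_run AS
WITH sessions_to_process AS (
  -- Any session that received a new event is reprocessed in full,
  -- so ongoing sessions are recomputed rather than left stale.
  SELECT DISTINCT domain_sessionid
  FROM atomic.events
  WHERE collector_tstamp > (SELECT MAX(max_processed_tstamp) FROM derived.base_manifest)
)

SELECT e.*
FROM atomic.events AS e
JOIN sessions_to_process AS s
  ON e.domain_sessionid = s.domain_sessionid;
```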


This structure and approach have two key benefits. It removes the complexity from customization: all subsequent logic can operate on this input as if it were a simple drop-and-recompute model, while the model’s structure ensures an efficient incremental update. This means that the end user only needs to be concerned with the aggregation logic they care about, rather than expending effort on how to make that logic work within a complex structure. It also simplifies maintenance and upgrades, as the standard (Snowplow-maintained) and custom aspects of the model are separate modules.
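
As a rough illustration, a custom module plugged into this structure might look like the sketch below. The table names here (scratch.page_views_this_run as the standard module’s limited output, scratch.session_content_this_run as the custom output) are hypothetical; the point is that the aggregation reads like a simple drop-and-recompute query, while the surrounding model keeps the update incremental.

```sql
-- Sketch of a custom module: aggregate the limited input as if it were
-- the full dataset; the standard model handles the incremental merge.
CREATE TABLE scratch.session_content_this_run AS
SELECT
  domain_sessionid,
  COUNT(*) AS page_views,
  SUM(CASE WHEN page_urlpath LIKE '/blog/%' THEN 1 ELSE 0 END) AS blog_page_views
FROM scratch.page_views_this_run
GROUP BY 1;
```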
Additional features introduced
We have also introduced some smaller but promising features, such as feature flags, metadata logging, and a more robust testing framework built on the excellent Great Expectations framework.
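
As a purely illustrative sketch (the actual implementation and schema may differ), metadata logging could amount to appending one row per run to a small audit table:

```sql
-- Hypothetical metadata logging: record which module ran, when, and over
-- how many rows. All table and column names here are illustrative.
CREATE TABLE IF NOT EXISTS derived.datamodel_metadata (
  run_id        VARCHAR(64),
  module        VARCHAR(64),
  rows_this_run BIGINT,
  run_tstamp    TIMESTAMP
);

INSERT INTO derived.datamodel_metadata
SELECT
  'example_run_id',   -- in practice supplied by the orchestration tool
  'page_views',
  (SELECT COUNT(*) FROM scratch.page_views_this_run),
  GETDATE();
```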
More information
For more information on the model structure and a quickstart guide, take a look at the technical documentation as well as the README in the GitHub repository.
For a general introduction to Snowplow’s approach to data modeling, check out our 4-part webinar series.