At Snowplow we describe data modeling as a sequence of SQL steps run on a schedule in your cloud data warehouse.
What is data modeling?
However definitions do vary and some organizations do run their scheduled sequence of steps in a different language – often to the detriment of code readability but associated with some kind benefits e.g. reduced cloud costs or easier implementation of specific advanced data transformations. For the vast majority of organizations the simplicity of declarative languages like SQL is hard to overstate, and comes with a whole host of other benefits such as accessibility and reduced code run times.
What is the purpose of data modeling?
The purpose of this sequence of SQL steps is to automate away repeatable tasks. These repeatable tasks are data transformations that are designed to reduce complexity so that time to value for new and existing users of the data model is minimised.
Viewing data modeling in this light leads to the realization that data models are data products that should be considered to be analogous to any other tech asset or software product that adds significant value to an organisation.
Why is data modeling important for behavioural data?
Behavioural data is messy. Any public-facing application can expect to be interacted by a whole host of entities, each with their own agenda, a short off the top of the head list might include the following entities:
- The intended users of the application
- Internal users
- Integrated testing tools
- Bots (set up by competitors or customers)
All of these different types of users and more will be performing both expected and unexpected actions on your website or application. Much of the noise is filtered out as it is never tracked, but even with the best tracking design and protocols part of this noise can end up in the final dataset.
It is the job of the data model owner to understand what noise is inherent in the final dataset and to make decisions around what should be done as a result.
How To Approach Data modeling
Stating our goal to be “the minimization of time to value for the end user” naturally leads to the question of “How?”.
This is achieved by standardizing common operations that are required in every query that queries the raw event level data. These common operations are as follows:
Event level data can be difficult for the casual user to understand as the concept of an event is relatively abstract, aggregating up to more familiar higher order concepts can help to promote understanding of the data. For example if a user wanted to understand in-session behaviour over time they might be more inclined to query a ‘sessions’ table over an ‘events’ table.
Equally important is to consider that many data warehouses charge based on the volume of data scanned in a query or the compute resources used. By querying aggregate data instead of event level data the cost of data access can be significantly reduced.
In the simplest case, if upfront validation is not performed on event-level data before it lands in the data warehouse then poor quality data will need to be filtered out of the final dataset. An example could be an anomalous transaction event not being filtered out of a marketing attribution model resulting in inefficient allocation of marketing spend.
The events table contains a large number of fields – only the fields that are relevant to a particular data model table should be selected, this helps to reduce the signal to noise ratio for any downstream analysis.
By default internal users and bots should be removed from any final analytics dataset. Your SQL data model is the ideal place to define what constitutes an internal user or bot (often this decision is informed by the Snowplow IAB enrichment).
The Snowplow tracker comes with built-in semantics to ensure that every event is sent at least once, this will inevitably result in a small number of duplicate events in the final dataset, these duplicate events should be filtered out prior to data consumption.
- Application of Business opinion
The Snowplow trackers deliver unopinionated data. It is up to each individual organization to apply opinion to this data to make it specific their business model.
There is the classic example of a user visiting a webpage as a result of marketing activity, they will land on the webpage with UTM parameters in their querystring – it is trivial to parse out these parameters, but the collecting organization is the only one with the nuanced understanding of how to classify these parameters into different marketing channels.
At Snowplow we commonly get the question “what happens if my business opinion changes” – this is expected behavior the contents of data models is expected to be transient and change over time. The raw event level data is immutable and opinionated whereas the data models can be deleted and recomputed to reflect an updated understanding of the specific parameters each organization operates under.
4. Join together multiple datasets
Your behavioral data is incredibly valuable; this value can be multiplied by enriching it with other data. If a transaction is recorded on the client side it is generally not possible, or advisable, to track the associated margin client side too. However, if a transaction ID is recorded then this can be used as a join key to combine your client side and server side data – this particular example is a simple paradigm shift that can drastically increase the value derived from marketing attribution models by attributing credit for margin, not just revenue to marketing campaigns.
5. Perform Operations Incrementally
The data volumes associated with behavioural data are typically large enough to justify investing the extra effort into ensuring that the previous 4 operations are performed incrementally. This will ensure that the benefits above are realised without breaking the bank to create and maintain the data models.
Data modeling in action: An example
Internally at Snowplow we are very opinionated about what the “bounce rate” metric means, as an organization we simply do not believe that it is correct to attribute the same value to the following 2 sessions:
- A user visits the homepage and then abruptly leaves
- A user visits a specific article, scrolls to the bottom of the page over a 5 minutes period and then leaves
Without data modeling an analyst might have to construct the following query to understand at a basic level what sessions on a particular day are truly bounced sessions and which sessions are quality sessions.
WITH sessions AS( SELECT DISTINCT domain_sessionid AS session_id , FIRST_VALUE(page_title) OVER(PARTITION BY domain_sessionid ORDER BY derived_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_page_visited , COUNT(1) OVER(PARTITION BY domain_sessionid) AS page_views_in_session FROM atomic.events AS ev WHERE event_name = 'page_view' AND DATE(collector_tstamp) = ‘2021-03-18’ ) , time_in_session AS( SELECT domain_sessionid AS session_id , COUNT(DISTINCT event_id) * 10 AS time_in_session FROM atomic.events WHERE event_name = 'page_ping' AND DATE(collector_tstamp) = ‘2021-03-18’ GROUP BY 1 ) , pre_agg AS( SELECT pv.session_id , CASE WHEN s.page_views_in_session = 1 AND s.first_page_visited = 'homepage' THEN 'bounce' WHEN s.page_views_in_session = 1 AND s.first_page_visited = 'how-to-guide-article’' AND tis.time_in_session < 60 THEN 'bounce' ELSE 'quality_session' END AS bounce FROM sessions AS s JOIN time_in_session AS tis ON s.session_id = tis.session_id ) SELECT bounce , count(1) FROM pre_agg GROUP BY 1
This is a relatively complex query that has multiple CTEs, each one querying a different event type and applying different operations to the event level data. These CTEs then have to be joined together into a final table that contains a case statement that is required to classify sessions as either quality sessions or bounced sessions.
There are a multitude of problems with this query, including but not limited to:
- The level of SQL for this basic analysis is too advanced, non SQL fluent users would not be able to build such a query;
- Introducing multiple steps into a query means there are more places where mistakes can be made;
- The query is not optimized and contains expensive and slow to run window functions;
- There is no version control on the case statement, every user who wants to analyze bounce rate has to have knowledge of where to find the latest version of the case statement in order to perform similar analysis self sufficiently;
- The query directly queries the events table meaning it unnecessarily scans a large amount of data every time it runs.
The end result of all of this is that any reporting or visualisation that is based on querying only event level data is likely to result in a very difficult to maintain and likely very expensive reporting setup. A better approach is needed.
This better approach would be to codify all of this logic in a central, versioned data model that might allow for the following query:
SELECT bounce , count(1) AS sessions FROM derived.sessions WHERE (session_start_date) = '2021-03-18' GROUP BY 1
Which allows the user to use simple SQL or even a drag and drop tool to calculate bounce rate for a specific date or date range with minimal effort.
Data model Architecture
The events table contains our immutable unopinionated event log. Each table is dependent on the table below it, and data is aggregated and filtered incrementally in the operations that take place between each step of the model.
This is a preparatory data model that contains core business opinion such as what marketing parameters constitute what channel, what constitutes a bounced session, and what constitutes a conversion.
This data model is a good starting point for an organization that tracks web data only. But for any organization that has customer touchpoints outside of the web it is extremely valuable to integrate these into the data model and build a single customer view. An example of this is provided below where a mobile data model analogous to the web data model has been integrated with the example above to build such a single customer view.
Single customer views like this that capture customer touchpoints across a variety of media are hugely valuable due to the unparalleled insight they can offer into customer behavior.
For example, any attribution model that is built to combine both web and mobile touchpoints and capture the whole customer journey will be orders of magnitude better than an attribution model that is only able to attribute marketing credit to single device customer journeys.
What value does a data model add to an organization?
With the examples discussed above it is possible to see how data models can add value to an organization in a variety of ways. Primarily data model products are solutions that enable self service for a variety of teams throughout any organization.
If an organization truly wants to become data-driven then effort must be invested into developing datamodels that democratize access to clean, feature rich, opinionated, pre-aggregated data to create a platform that is readily accessible to:
- Less technical users for simple self service.
- Analysts as a jumping off point for more advanced analytics (e.g. attribution).
- Data-scientists as an input to machine learning models.
- Data engineers as a component of a real time application.
How to get started
Use of a dedicated data modeling tool is highly recommended when getting started with data modeling, there are a variety of tools available that fit the bill such as: Airflow, Dataform, Databricks, dbt or even Snowplow’s open-sourced, in-house tool SQL-runner. Your choice of tool will primarily depend on the pre-existing cloud environment and data warehouse that are native to your organization.
Someone with SQL fluency such as an analytics engineer, a senior analyst, a data scientist or data engineer can then use this tool to start building out different data models.
In general it is not advisable to start from scratch as many organizations (including Snowplow) have faced similar challenges in the past and have developed open source data models that can be used as a robust starting point. See the recently released V1 Snowplow data models for Redshift, Bigquery & Snowflake for example starting points.
By building a data model you are making an upfront investment in a foundational product to democratize and standardize data and insights across your organization – one that will prevent the analytics team from becoming a bottleneck to the organization and that will yield dividends for all other teams in the organization for years to come.
Investing in your strategic data asset is one of the best things you can do to build a competitive advantage in today’s competitive landscape. In our next chapter, we’ll explore how Snowplow can help organizations take advantage of their behavioral data from web.
This is an 8-part series
Click below to navigate to the next chapter:
Chapter 1 The state of web analytics in 2021
Chapter 2 Privacy updates, ad blockers and the need for 1st-party tracking
Chapter 3 Building a web analytics stack: packaged vs modular
Chapter 4 The best-in-class tools for web analytics
Chapter 5 Redefining web analytics metrics
Chapter 6 Data modeling for web analytics
Chapter 7 Snowplow for web analytics
Chapter 8 How Welcome to the Jungle took ownership of their web data with Snowplow