Modeling behavioral data. Your questions, answered
Behavioral data is a goldmine of information, but until you take that raw, behavioral event data and apply your own business logic to it, you can’t use it to inform key decisions. That’s where Snowplow and dbt come in.
Recently we sat down with Will Warner, Analytics Engineer at Snowplow and Sanjana San, another Analytics Engineer at dbt Labs, alongside Archit Goyal from Snowplow’s Product team for a robust discussion around behavioral data modeling. Will and Sanjana shared their best practices for data transformation, including concrete examples of how to make the most of dbt and Snowplow in the data modeling workflow.
If you missed it, fear not. We’ve compiled answers to your most pressing questions below so you can catch up on the conversation below.
Click here for the full recording.
What common mistakes do you see being bad in the data modeling process and specifically in attribution modeling?
When it comes to rolling out Snowplow and dbt, the most important thing is to involve the right people. With behavioral data, you’re going to need to involve your analytics engineer, a representative from marketing and a representative from product, and ideally a front-end engineer who deeply understands how the website or mobile application works.. They are going to be the ones who help you troubleshoot when your data outputs aren’t as expected.
A lack of rigorous documentation is also a common misstep in the data modeling workflow.
Laying the groundwork in terms of documenting your events, data models, and all your end tables, is crucial. Without this, visibility can break down quickly and it can become much harder to resolve issues in the process.
Auditing is another often overlooked part of the data modeling cycle that is hugely important.
Ideally, audits should take place early and at multiple checkpoints. This is even more important with behavioral data than other types of data.
Finally, try not to do too much at once with your behavioral data. The temptation with Snowplow data, because it’s so rich and because they’re so many insights you can gain from it, is to try and leverage as much value as you can, in the shortest possible time – leading to a lot of ad-hoc queries written against your events tables. While initially this will give you lots of answers, this won’t scale particularly well with a high volume events table (behavioral data is voluminous by nature).
On the flip side, by planning about how you’ll be leveraging your event data, you can come up with a series of derived tables, like your session tables , user tables,and more to minimize the amount of queries on your atomic events table – a golden rule when dealing with behavioral data.
What are the most common issues you see in audits?
Often there is a ‘tug of war’ that takes place when comparing Snowplow data against data from a packaged analytics solution like Google Analytics. Google Analytics (GA) is a black box – giving you no visibility as to what’s going on in terms of processing. This can severely limit your ability to understand variances between the data sets. It’s worth deciding as a team what your acceptable variances are. For example, 3-5% variance from GA data might be a good start, but it’s ultimately up to you and your organization what your tolerance is.
Another solution that can help here is Snowplow inspector. This is a powerful tool that allows you to see events firing in your test environment and later in production. It’s incredibly useful to see your behavioral data being delivered in real time, and to check your triggers are working as expected.
How do you balance the decision to apply business logic in the visualization layer, vs applying it in the data model for a single source of truth?
This is a tough decision for any data team. It’s worth remembering that, while data modeling is a challenging process, as far as possible you want to ensure that your models are broad and can serve a variety of use cases.
Your aim is for your data models to be able to answer a multitude of questions. The final models – monthly, daily aggregations should be pushed out to your Business Intelligence (BI) tool. Ideally, it’s only the final additions to your models that should be outsourced to your BI tools.
And while simple aggregations are fine for the BI tool, complex transformations should be left for the main data model. Major changes should be part of the main data model as part of your transformation workflow, to ensure your models remain iterative.
I’ve worked in companies before where you have nested calculations in Tableau, and it’s such a pain to work out how these calculations are being performed. And 95% of the time you could quite easily integrate those processes into the model itself, which gives clarity to all the other analysts in the business. – Will Warner, Analytics Engineer at Snowplow
Would you try to keep the number of different tables as low as possible, or create multiple tables that get really specific?
This depends heavily on the purpose your tables are serving. If there’s an opportunity to consolidate tables, it can absolutely be worth reducing the number of tables you’re working with. In some instances, however, it’s necessary to create multiple tables and not easily avoided.
This isn’t necessarily the end of the world. If they’re high volume tables you can leverage dbt’s incrementalization to mitigate performance bottlenecks and reduce cloud costs you might otherwise experience with multiple tables.
What’s the importance of incrementalization?
Fundamentally, incrementalization enables you to process only the minimum amount of data required. Typically you’re only processing new events, which reduces the overhead on your warehouse.
The challenge analytics engineers face, particularly with web data with sessionized models contain a series of activities based over a period of time. This means we’re dealing with behavioral data over an entire session, however long that might be, and you need all the events within that session to calculate certain actions properly.
This can present challenges when it comes to working with historical data sets. For example, bot sessions can take place over a year or even two years; you don’t want to have to look back over two years worth of behavioral data to account for bots. Instead, aim to put safeguards in place, such as keeping session lengths to three days, to prevent data quality compromises as far as possible.
What’s the best next step after attribution modeling?
Before deploying your attribution model, it’s worth taking a step back and considering how it will deliver real value in your organization. Is the model answering the right questions? Has it taken our business model into account?
For instance, if an attribution model doesn’t account for the fact that, say 90% of deals close over the phone, with no real way of tracking phone calls – it’s not going to be effective. This is a collaborative exercise that needs to happen between the data team and business stakeholders to make the attribution model as relevant as possible. That being said, it’s important to remember that no attribution model will ever be perfect.
How does Snowplow compare to streaming GA4 event data into BigQuery?
GA4 is an excellent tool, taking event streaming into the mainstream and allowing people to stream event data into BigQuery. At Snowplow, we see GA broadly as an excellent tool for organizations getting started with behavioral data. When you’re starting off with fundamental questions about your users and how they’re interacting with your website and mobile app, because it has loads of reports built into the UI.
Where Snowplow comes in is at the point where the requirements of your data start to increase as you become more data mature, with more sophisticated questions to be answered or more complex use cases. This is where Snowplow data becomes far more effective.
This is because Snowplow was built from the ground up for data governance and to enable data teams to build assurance in the quality of their data. Snowplow data is validated upfront against JSON schemas and so it arrives in your data warehouse in a format that’s highly expected. That’s great for your dbt models because you can treat your dbt models almost like a product – you have expected outputs and you know there should be no surprises in terms of what the data looks like.
There’s also the governance aspect of Snowplow, in that data teams can control what’s being sent through into the warehouse. Finally, Snowplow data is rich by nature. When you send Snowplow data, you’re delivering 100s of properties into the warehouse with each event. That means you have to do far fewer joins on the data itself to gain valuable insights.
Is there a dbt package for iOs and Android trackers on the roadmap?
Absolutely! The product team at Snowplow is working hard behind the scenes to release these packages as soon as possible. In the meantime, you can get up and running with our SQL-Runner mobile model here.