Snowplow vs. Rudderstack: the dating-app challenge

Let’s begin

Hi! I’m Ryan, an analytics engineer at Snowplow. I joined about 6 months ago.

One of the things we often get asked is how Snowplow compares to CDPs like Rudderstack. While I had been told the reasons, I wanted a more hands-on understanding, so I designed an app and built some tracking to put the two head to head.

I wanted to build something I knew, so as a 20-something single person living in London, I of course designed a dating-app.

What does good data look like?

I’ve been working with data all my career, both consumption and creation, so I had a few red flags I would be looking out for as I went:

Are there multiple ways to do something that lead to different results, i.e., no single source of truth?
Is it easy to accidentally generate bad/incorrect data? (I, of course, am perfect but not everyone is.)
Is it hard to discover and explore the data in the warehouse?
Am I duplicating my code in multiple places?
If the person who knew everything left, how hard would it be to learn what they knew? (If you aren’t worried about this person leaving your team, then either you work in a fantastically documented job, or more likely you _are_ the person.)

These aren’t exactly the quantitative measures I like to work with, but they are the things that have caused me the most problems over the years, and when it comes to dating apps you have to follow your heart.

Key differences between Snowplow and Rudderstack

Before we begin, I should clarify that Rudderstack is actually a different type of tool – a Customer Data Platform (CDP) – whereas Snowplow is a Behavioral Data Platform (BDP).

As a CDP, Rudderstack is focused on forwarding data to multiple locations and performing some CDP use cases, such as activating audience segments.

As a BDP, Snowplow is focused on advanced analytics and AI, i.e., the sophisticated usage of data. As part of this, Snowplow takes a warehouse-first approach, with analysis taking place on data modeled in your central storage destination.

Event forwarding is, however, possible from Snowplow via GTM server-side or by using Snowbridge, so you’re not stuck in the warehouse. Other Modern Data Stack tools are then used to create solutions such as composable CDPs.

We have a page exploring top-level differences between Rudderstack and Snowplow.

My Journey – creating and tracking a dating app

Comparing two tools like Snowplow and Rudderstack is no easy task, so I started simple, using the OS Rudderstack tools and a basic Snowplow OS pipeline.

I began by building a mockup of a dating app which I called DatingXS, using next.js to generate some behavioral events which I sent to BigQuery. My original plan was to send this data to Databricks, but unfortunately the self-hosted destination hub from Rudderstack (Control Plane Lite https://github.com/rudderlabs/config-generator) is no longer supported and doesn’t offer Databricks as a destination.

I had 12 events (11 custom and a page view), which were:

Both tools could handle basic tracking on a simple app, with the tracking code being similar between them. One big difference in tracking was the use of schemas and entities with Snowplow, which I’ll talk about lower down. After tracking some events from my app (sadly I did not find love), I was ready to compare the outputs and experience.

Rudderstack creates more tables than an IKEA showroom

Snowplow has a single atomic events table, with all the attributes of my tracked events and the events themselves stored in the same place.

Rudderstack instead generates a table (and a view) for every event type. Even with my XS implementation, this started to make my analytics queries across multiple events extremely long and full of joins – they weren’t necessarily more complex, but they were more tedious and harder to read. Beyond that, having events in their own table AND the unified (but thin) tracks table meant that I had multiple ways to answer a question, and those ways could give different results.

That to me is a red flag.

Rudderstack’s table structure

A typical dating app query

A common query that an analyst might be asked is something like “do premium users get more matches?”

An easy question to answer: we have an ‘is_premium’ field in our ‘match’ table, so we just need to group by that and average the count of rows, right?

Wrong.

Doing that, we would be excluding any users who had never got a match.

If we start with the tracks table instead we would have all events…

Maybe we could start with the users table?

As you can see, we now have multiple options with different answers, and all for a seemingly simple query! With Snowplow, as there is only a single table, we do not have to deal with this confusion – I could still write the query wrong, but it is less likely.
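To make the pitfall concrete, here is a minimal sketch using Python's built-in sqlite3 and a handful of made-up rows. The table and column names (users, matches, user_id, is_premium) are simplified assumptions for illustration, not the actual schema Rudderstack generated for me.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT, is_premium INTEGER);
    CREATE TABLE matches (user_id TEXT, is_premium INTEGER);
    INSERT INTO users VALUES ('a', 1), ('b', 1), ('c', 0), ('d', 0);
    -- users 'b' and 'd' never got a match, so they have no rows here
    INSERT INTO matches VALUES ('a', 1), ('a', 1), ('c', 0);
""")

# Naive: average matches per user, starting from the match table.
naive = dict(conn.execute("""
    SELECT is_premium, AVG(n) FROM (
        SELECT user_id, is_premium, COUNT(*) AS n
        FROM matches
        GROUP BY user_id, is_premium
    ) GROUP BY is_premium
""").fetchall())

# Safer: start from all users, so users without a match count as 0.
full = dict(conn.execute("""
    SELECT is_premium, AVG(n) FROM (
        SELECT u.user_id AS user_id, u.is_premium AS is_premium,
               COUNT(m.user_id) AS n
        FROM users u
        LEFT JOIN matches m ON m.user_id = u.user_id
        GROUP BY u.user_id, u.is_premium
    ) GROUP BY is_premium
""").fetchall())

print("from matches:", naive)
print("from users:  ", full)
```

With these rows, the premium average is 2.0 when starting from matches and 1.0 when starting from users. Both queries look plausible; only one answers the question that was asked.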

Snowplow’s atomic data table is used to create all other tables

Rudderstack created columns seemingly at random

With Rudderstack, the columns were also ordered in what appeared to be a completely random way in each table.

What is column 3 in one table is column 45 in another.

All these issues meant that I found data discovery and query creation a real challenge, as visually exploring the data in a preview was not at all helpful.

Another red flag.

With Rudderstack, data seems to switch columns randomly between tables

What is an entity and why are they useful?

Snowplow pioneered the concept of entities – the nouns of my data – which can be reused across different events.

In my app I had 2 entities, user and profile, each with their own schema to define them.

This meant I could attach my user information to all my events (using a global context), and attach the profile to just those that it was relevant to.

Rudderstack has no support for such an idea. They do have the concept of a user, but I had to manually add all the profile attributes to each event, mixing them in with the event attributes. Unless this is done, I can’t slice and dice my data as effectively, I can’t easily roll out changes, and the data is simply more confusing.

With Snowplow, entities not only had the benefit of being easily reusable and consistent across events, but because Snowplow bundles these all into a single nested column, it was easy to understand which attributes were for which context and event. Rudderstack, with its randomly ordered columns, meant identifying what came from an event vs. the context the event took place in just from looking at the data was impossible.
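As a rough illustration of the idea, the sketch below builds two events that share one user entity. It uses plain Python dictionaries in the self-describing JSON shape (a schema URI plus a data payload) rather than a real tracker SDK. The vendor name com.datingxs, the event names, and every field are hypothetical.

```python
def entity(schema: str, data: dict) -> dict:
    """Wrap data in a self-describing envelope: schema URI plus payload."""
    return {"schema": schema, "data": data}

# Hypothetical schema URIs; "com.datingxs" is an assumed vendor name.
USER_SCHEMA = "iglu:com.datingxs/user/jsonschema/1-0-0"
PROFILE_SCHEMA = "iglu:com.datingxs/profile/jsonschema/1-0-0"

# Each entity is defined once...
user = entity(USER_SCHEMA, {"user_id": "u1", "is_premium": True})
profile = entity(PROFILE_SCHEMA, {"profile_id": "p9", "age": 27})

def build_event(name: str, data: dict, entities: list) -> dict:
    """Combine an event payload with whichever entities are relevant."""
    event_schema = f"iglu:com.datingxs/{name}/jsonschema/1-0-0"
    return {"event": entity(event_schema, data), "context": entities}

# ...and reused: the user goes on every event, the profile only where relevant.
swipe = build_event("swipe", {"direction": "right"}, [user, profile])
login = build_event("login", {"method": "email"}, [user])
```

Because the event attributes and the entities live in separate, labeled structures, it stays obvious which fields describe the action and which describe the things involved in it.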

In the context of my personal app, since I was both the tracker and the analyst, these weren’t yet huge issues, but I was already getting frustrated having to duplicate my code and manually adapt my data to be usable, both red flags.

Data richness

In terms of the number of out-of-the-box fields that come with events, Rudderstack provides about 45 unique fields across their tracks, users, and event tables (with many of those fields duplicated across each table). Snowplow provides anywhere from 80-140+ depending on what one-line contexts you choose to enable.

Under the hood of those numbers, Snowplow provides additional information through extras such as the Geo enrichment, IP lookups (optional), and marketing information based on url parsing. Snowplow also has an increased number of flags for what a browser has enabled, and tracks vertical and horizontal scroll depth throughout – this is huge for effective content analytics.

All this out-of-the-box information is super helpful: I don’t have to track these things myself with my events, and the variety of identifiers (event, domain, network session and user) means I should be able to answer a lot of questions easily.

The bonus here is that Snowplow has page ping (or heartbeat) tracking out of the box. I can accurately track my engaged time on page, not just absolute, which is generally the most predictive metric for things such as conversion or recommendations.

This accuracy of tracking for time on page simply isn’t available with Rudderstack.
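To show why heartbeats matter, here is a simplified sketch of how page pings could be turned into engaged time. It assumes a 10-second heartbeat and counts the distinct heartbeat windows that contain a ping. This is an illustration of the principle only, not the calculation Snowplow's data models actually perform.

```python
HEARTBEAT_S = 10  # assumed interval between page pings while the user is active

def engaged_seconds(ping_times_s: list, heartbeat_s: int = HEARTBEAT_S) -> int:
    """Count distinct heartbeat windows with a ping, scaled to seconds."""
    windows = {int(t // heartbeat_s) for t in ping_times_s}
    return len(windows) * heartbeat_s

def absolute_seconds(page_view_s: float, last_event_s: float) -> float:
    """Wall-clock time between the page view and the last event seen."""
    return last_event_s - page_view_s

# Page viewed at t=0. The user is active for 30 seconds, leaves the tab
# open in the background, then comes back briefly around t=300.
pings = [10, 20, 30, 300, 310]

print(engaged_seconds(pings))      # time the user was actually active
print(absolute_seconds(0, 310))    # time the page was merely open
```

In this made-up session the absolute time on page is 310 seconds, while the engaged time is 50 seconds, which is a very different signal for content analytics.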

No red flags for either product here, as both have documented their standard columns well, but Snowplow giving both wider and deeper data is a green flag.

Rudderstack’s definitions and validation were ineffective

When setting up the tracking, I had to define schemas for my events and entities for Snowplow. This did slow down my initial setup by a few hours, but it gave me time to think about my event structure and define limits on what is allowed (e.g., the types of data).
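For a feel of what defining limits buys you, the sketch below checks an event payload against a small schema. The schema is written in a JSON Schema style, but the validator is a deliberately tiny hand-rolled stand-in, not a real JSON Schema implementation and not Snowplow's validation. The match event and its fields are hypothetical.

```python
# A JSON Schema style definition for a hypothetical "match" event.
MATCH_SCHEMA = {
    "properties": {
        "match_id": {"type": "string"},
        "compatibility_score": {"type": "integer"},
    },
    "required": ["match_id"],
    "additionalProperties": False,
}

# Only the types this sketch needs.
TYPES = {"string": str, "integer": int, "number": (int, float)}

def validate(data: dict, schema: dict) -> list:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    props = schema["properties"]
    for key in schema.get("required", []):
        if key not in data:
            errors.append(f"missing required field: {key}")
    for key, value in data.items():
        if key not in props:
            if not schema.get("additionalProperties", True):
                errors.append(f"unknown field: {key}")
            continue
        expected = props[key]["type"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPES[expected]):
            errors.append(f"{key}: expected {expected}")
    return errors

ok = validate({"match_id": "m1", "compatibility_score": 87}, MATCH_SCHEMA)
typo = validate({"match_id": "m1", "compatability_score": 87}, MATCH_SCHEMA)
wrong_type = validate({"match_id": "m1", "compatibility_score": "87"}, MATCH_SCHEMA)
```

A misspelled field name or a string where an integer belongs is reported as an error instead of quietly becoming a new column.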

Defining event structure with Rudderstack is not required, so anyone can effectively create an event with any naming convention and productionize it immediately. New columns are automatically added to the table, new events generate new tables, etc. While this sounds appealing for the sake of speed, the result was actually more work in the end.

Mistakes do happen after all.

It turns out I am not as perfect as I claimed: multiple times I made a typo in the event name or an event attribute. This led to a new table being created that I had to merge into my old one then delete, or a new column that I had to do the same with. With my XS app these were easy to notice and deal with quickly, but that won’t always be the case.

In fact, there is something far worse at the heart of this, data evolution…

I won’t talk here about schema versioning, but with Snowplow if I change a field from an integer to a string, that is a breaking change and is dealt with accordingly. I tried this in Rudderstack: the event loaded but the field was null, and the data for that field was sent to its own table. I could try to create my own alerting system for when this happens, but it won’t be easy, and I would then have to manually alter my tables to accept this new type.

But the worst was still to come…

I now decided to try something more subtle.

I changed an attribute from an integer to a decimal (remember that unlike Snowplow, Rudderstack didn’t ask me to define types in the first place, so it may even be that the tracking only happened to send integers at first but was always able to send doubles).

Rudderstack happily tracked these values, put them into the table, but silently rounded down the number.

I want to say that again – Rudderstack altered my data with no logging, no alerting, no warning, and no way to recover the original event. If this field is relevant for finance calculations, strategy decisions, or even legal audits, this could have far-reaching consequences, and it might take months to notice this was happening.
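The failure mode is easy to reproduce in miniature. The sketch below contrasts a loader that forces a value into an existing integer column with one that refuses anything that is not already an integer. Both functions are hypothetical illustrations of the two behaviors, not code from either product.

```python
import math

def lenient_load(value):
    """Force the value into an integer column; the fraction is silently lost."""
    return math.floor(value)

def strict_load(value):
    """Refuse values that are not already integers, so the caller finds out."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected integer, got {value!r}")
    return value

price = 19.99

print(lenient_load(price))  # stored without complaint, but no longer 19.99

try:
    strict_load(price)
except TypeError as err:
    print(f"rejected: {err}")  # the problem surfaces immediately
```

The lenient path never raises, which is exactly why the altered value can sit unnoticed in a table for months.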

This is a red flag so big I ended the date right there.

n.b., Rudderstack do offer their RudderTyper tool, which forces a strongly typed approach, although validation is done at compile time rather than on each sent event. However, this wasn’t a requirement to get up and running like Snowplow’s schemas, and being limited to their Enterprise plan I was unable to test it.

Extra tooling to test my data

The good part about Rudderstack OS, despite some destination limitations and being limited to 30-minute batch processing, was that I was able to run it entirely locally, whereas with Snowplow there was the need to deploy the pipeline to a cloud environment.

However, with Rudderstack I had no option (other than setting up a whole new destination) to test the data I was sending before it landed (which was where I really felt that 30 minutes); Snowplow, however, has a tool called Snowplow Micro.

Micro enabled me to test sending events entirely locally, with full schema validation, and identify issues BEFORE they hit my warehouse and were stored forever (until I delete them). This meant I could test everything locally and get immediate feedback.
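A check like this can be scripted. The sketch below assumes Micro is running locally on port 9090 and that its /micro/all endpoint returns a JSON summary with total, good, and bad counts. That reflects my understanding of Micro's REST API, so treat the address and the response shape as assumptions to confirm against the Micro documentation.

```python
import json
from urllib.request import urlopen

MICRO_URL = "http://localhost:9090"  # assumed local address for Micro

def fetch_counts(base_url: str = MICRO_URL) -> dict:
    """Read the event totals from Micro's summary endpoint (assumed shape)."""
    with urlopen(f"{base_url}/micro/all") as response:
        return json.load(response)

def assert_no_bad_events(counts: dict) -> None:
    """Fail loudly if any tracked event did not pass validation."""
    bad = counts.get("bad", 0)
    if bad > 0:
        raise AssertionError(f"{bad} event(s) failed validation")

# Typical use after exercising the app against a running Micro instance:
#   assert_no_bad_events(fetch_counts())
```

Keeping the network call separate from the check means the check itself can run in a test suite without a Micro instance.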

Key takeaways

Nearly all of these red flags, I could convince myself, were manageable at this small scale.
Essentially, I was pretending to be a one-man-band, the only data engineer in my small company with everything housed inside my own head, so it didn’t really matter that discovering the data was hard, or that it was easy to track the same thing in different ways.

Erroneous data? ‘Meh’, I’ll spot that when it happens.

But what if I were no longer alone? How bad would things get? Let’s see how these issues could get far worse as my little startup, Dating XS, turns into an SME called Dating XL…

Rudderstack – mo’ data mo’ problems
(how small issues snowball at scale)

I. Rudderstack creates one table per event

With Rudderstack, as the volume of event types grows, so too does the number of tables (and the length of queries). The warehouse becomes visually bloated and difficult to navigate. This also means that new tables will be harder to discover, and noticing errors in event names is ever more likely.

Imagine 100 different data apps – or ‘use cases’ – running simultaneously. If each of these apps contains 50 different event types, we now have to manage 5,000 tables in our warehouse.

Now imagine that the data guru who understands everything leaves and someone new has to manage all of these tables and make sure they are understood around the business – realistically this is a glass ceiling for growth and collaboration, and a huge risk.

Snowplow’s single table (combined with schemas) means that the warehouse is not bloated, queries remain short and optimized, and the tracking information is within the schema, not a person’s head.

II. Where are the entities??!

As Dating XS becomes Dating XL, we could very reasonably go from using an entity in a few types of events and on a few pages to hundreds of types of events across hundreds of thousands of pages and screens. We’ve also gone from just me to whole teams for tracking and analyzing.

Not only does this increase the chance of errors, but without entities you have to manually track which entity-type attributes should go with which events, and implement them in every single event that needs them – with no way to identify which events share the same entity.

I’ve lived this, and it ends up as a spreadsheet on SharePoint – painful.

Using Snowplow’s entities means you can make them once, and use them as many times as you like.

At small numbers Rudderstack’s lack of entities was manageable, even if it was just in my head, but at this scale not knowing which event types share entities is going to be a huge time sink. Let’s just hope no one accidentally deletes some rows from that spreadsheet…

III. Data richness limitations

Here is the difference that decreases with a larger business. When you have hundreds of entities and contexts, the out-of-the-box features make up a small set of them. While you’d still have to implement custom page pings in Rudderstack, this would be a small part of your overall tracking.

The difficulty is in ensuring it’s done the same way across all your apps and sites. If you failed to create this in a performant and uniform way, your most predictive metric would be compromised across the business.

IV. Rudderstack’s definitions and validation were ineffective

With every new event and new attributes, the chance of a breaking change (that wasn’t meant to be a breaking change) slipping through grows. And as we grow, the chance of some major report or business decision being made off this data grows too.

No one wants to have spent millions on a marketing campaign just to find out you based it on poorly validated data…

For well established organizations you might have reviews and approval processes, but these are still manual. This means the team’s effort is going into making sure the tracking was implemented right, rather than spending time designing the best tracking or analyzing that data.

With Snowplow, all those checks are automated and versioned, so people can spend their time doing the stuff that only they can.

V. The cost of fixing these mistakes

Again, when it was just me, fixing these mistakes was pretty straightforward – I was in charge of the tracking and of the data – I could easily make changes to the tracking and I had full rights to do whatever I wanted with the data warehouse – I was the only one using it. It would still have taken time, but it wouldn’t have been days.

For large organizations and enterprises, changes to tracking are going to need to go through some prioritization and approvals process, and analysts almost certainly don’t have the power to manipulate the source data tables (and if they do, they shouldn’t!).

All this and the above combined means that each error is likely going to take longer to notice, be harder and more impactful to fix, and likely take days if not weeks before it is fully resolved. With Snowplow, the data doesn’t even land in the warehouse if it’s wrong, and tracking can be validated with Snowplow Micro so that mistakes are fixed before they ever ship.

Final Summary

The problems which I encountered with my small implementation (Dating XS) were a headache, but I was able to hack my way around them with some additional effort. As you can see above though, when an implementation is scaled up, these cracks become canyons and a lot of value is compromised.

As an analytics engineer, trust in the data is my number 1 priority – without it, teams are just running in circles. Snowplow’s focus on trustworthy data constantly showed up during my experiment. With Rudderstack, what seemed like attractive time savings early on became huge time sinks due to confusion and errors.

