Making it easy to work with event-level mobile data


In the last month we released new versions of the Snowplow iOS and Android SDKs. These new versions deliver a data set that is significantly easier to model and work with, helping companies use mobile data to better inform product development and marketing decisions.
In this post, we’ll explain why the data delivered by these new mobile trackers is easier to work with, and how it enables analysts, data scientists, marketers and product managers to do more with that data.
A quick recap: what is data modeling?
Event data pipelines like Snowplow deliver a very granular data set, directly into your data warehouse, that describes exactly what has happened in your business. For companies running mobile apps, for example, each line of data would correspond to an action or event that occurred for one of your users in that mobile app.
We refer to the data set that is delivered “out-of-the-box” as the “atomic” data set. It is an un-opinionated description of everything that happened in your business. Whilst it is a great, comprehensive record, it is not easy for end users (including data analysts, data scientists, marketers and product managers) to consume. It is data that has been optimized for auditability and completeness, rather than for specific patterns of querying.
As a result, the majority of Snowplow users run a “data modeling process” to generate a second data set: one that is easy to query. This can be directly ingested in a BI or other analytics tool, from where it can be socialized around the business. The modeled data provides easy-to-consume tables that summarize key units of analysis. You’d expect to find:
- A user table: with a line of data per user. This might include data points that describe when you first saw this user, what initially drove the user to your app, when you last saw this user, how active this user is, how valuable this user is.
- A session table: with a line of data per session. This might contain data points that describe what triggered the session to start (e.g. was it a particular marketing campaign), when it finished, the session length, the value of the session, any session categorization, any A/B testing information and any important goals accomplished e.g. signup.
- A screen views table: with a line of data for each screen viewed by the user. This might include information like: what was the previous screen, what is the next screen, how long did the user spend on the screen, what version of the screen did the user see (if an A/B test was being conducted on one of the screens), what actions the user performed on the screen and more.
These modeled data tables are pivotable: it should be possible to load them into any business intelligence tool and start slicing and dicing different combinations of dimensions (i.e. grouping by different column values) against different metrics (computed by counting rows or aggregating values in particular columns).
For example, if a product manager wants to check that a new feature that is being A/B tested increases propensity to sign up, she could pivot the sessions table to compare sign up rates for sessions in the experiment group with the control group.
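To make that pivot concrete, here is a minimal sketch in Python. The sessions table, its column names (`ab_group`, `signed_up`) and the sample rows are all hypothetical, invented purely for illustration; in practice this would usually be a single group-by in your BI tool or warehouse.

```python
from collections import defaultdict

# Hypothetical modeled sessions table: one dict per session.
# Column names and values are assumptions made for this sketch.
sessions = [
    {"session_id": "s1", "ab_group": "control", "signed_up": False},
    {"session_id": "s2", "ab_group": "control", "signed_up": True},
    {"session_id": "s3", "ab_group": "experiment", "signed_up": True},
    {"session_id": "s4", "ab_group": "experiment", "signed_up": True},
    {"session_id": "s5", "ab_group": "experiment", "signed_up": False},
    {"session_id": "s6", "ab_group": "control", "signed_up": False},
]


def signup_rate_by_group(rows):
    """Group sessions by A/B group and compute the share that signed up."""
    totals = defaultdict(int)
    signups = defaultdict(int)
    for row in rows:
        totals[row["ab_group"]] += 1
        signups[row["ab_group"]] += int(row["signed_up"])
    return {group: signups[group] / totals[group] for group in totals}


rates = signup_rate_by_group(sessions)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%}")
```

Because each session is already a single row, the dimension (the A/B group) and the metric (the sign-up rate) can be combined without touching the underlying events.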
Data modeling is essential
As the above section hopefully makes clear, data modeling is an essential step in socializing an event-level data set like that generated by Snowplow around all the different teams in a company.
Data modeling is difficult
Companies typically write data models as SQL jobs that run inside their data warehouse. There are a number of reasons why these jobs are difficult to write and maintain, but one big issue is that aggregating event-level data is difficult. That is because:
- We’re often interested in understanding user journeys that are made of multiple different event types, with different fields collected across those different event types. These cannot be aggregated over in a straightforward way.
- We’re often interested in the sequence of events. Knowing that a user performed A and B and C is often not enough: we care whether she did A then B then C, or A then C then B. SQL functions are typically not good at identifying different sequences, or at letting analysts aggregate over different sequences.
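The sequencing problem can be illustrated with a small Python sketch. The events, field names and timestamps below are hypothetical. A plain aggregation (here, collecting the set of event types per user) cannot tell the two users apart; only an ordered pass over the events can.

```python
# Hypothetical events for two users: same event types, different order.
events = [
    {"user_id": "u1", "event": "A", "ts": 1},
    {"user_id": "u1", "event": "B", "ts": 2},
    {"user_id": "u1", "event": "C", "ts": 3},
    {"user_id": "u2", "event": "A", "ts": 1},
    {"user_id": "u2", "event": "C", "ts": 2},
    {"user_id": "u2", "event": "B", "ts": 3},
]


def event_sets(rows):
    """Order-blind aggregation: which event types did each user perform?"""
    out = {}
    for row in rows:
        out.setdefault(row["user_id"], set()).add(row["event"])
    return out


def event_sequences(rows):
    """Order-aware aggregation: sort by user and timestamp, then collect."""
    out = {}
    for row in sorted(rows, key=lambda r: (r["user_id"], r["ts"])):
        out.setdefault(row["user_id"], []).append(row["event"])
    return out


sets = event_sets(events)
sequences = event_sequences(events)
print(sets["u1"] == sets["u2"])            # the two users look identical
print(sequences["u1"], sequences["u2"])    # the sequences differ
```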
Snowplow data from our new mobile trackers is easier to model
The big benefit of our new mobile trackers to Snowplow users is that they deliver data that is significantly easier to model. We’ve worked with many companies across different industries to build data models that incorporate mobile data and learnt first hand how hard this process is. We’ve taken that learning and updated our mobile trackers to make that process easier.
Because the data delivered by the updated mobile trackers is easier to model, companies adopting the new tracker versions should find it easier to socialize their mobile data to the marketing, product and other teams that want to use it, and to empower those teams to perform deep analysis on the data. Let’s look at how that is accomplished in detail, by diving into the new approach taken to screen view tracking in particular.
Introducing the new screen context
Screen views are an essential unit of analysis for many companies working with mobile data. Many mobile data models start by aggregating event-level data to screen-level data (so we can understand the journey as described as the sequence of screens viewed and actions taken on those screens), and then aggregate over that sequence of screens to understand the session as a whole. Making it easy to aggregate mobile data by screen, then, is an essential first step in building a mobile data model.
The new tracker versions can be configured to automatically send a screen context with every event recorded in the mobile tracker. Whilst this contains a number of important fields, two are critical:
- The “name” field identifies the name of the screen. This should be a unique value for every screen in your mobile app, so each can be unambiguously identified. Because this is captured with every single event recorded in the mobile app, it is easy to identify on what screen each event takes place.
- The “id” field identifies the unique screen view ID generated when this screen was loaded. If a user enters your app on Screen A, navigates to Screen B, then Screen C, then returns to Screen A, a new ID will be generated with each screen view. This makes it easy to distinguish events that occurred the first time the user was on Screen A from those that occurred the second time.
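The behaviour of these two fields can be sketched in Python. This is a toy model, not the tracker's implementation: the `ScreenState` class and the event names are hypothetical, and only the “name” and “id” fields come from the description above.

```python
import uuid


class ScreenState:
    """Hypothetical stand-in for the tracker's notion of the current screen."""

    def __init__(self):
        self.name = None
        self.id = None

    def view(self, name):
        # A fresh ID is generated on every screen view, even for a repeat visit.
        self.name = name
        self.id = str(uuid.uuid4())

    def tag(self, event_name):
        # Attach the current screen context to an event.
        return {"event": event_name, "screen": {"name": self.name, "id": self.id}}


state = ScreenState()
events = []
for screen in ["A", "B", "C", "A"]:
    state.view(screen)
    events.append(state.tag("screen_view"))
    events.append(state.tag("button_tap"))

# Screen A was viewed twice, so its events carry two distinct screen view IDs.
ids_for_a = {e["screen"]["id"] for e in events if e["screen"]["name"] == "A"}
print(len(ids_for_a))
```

The name tells you which screen an event happened on; the ID tells you which visit to that screen it belongs to.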
How does the screen context make modeling mobile data easier?
Generally when modeling event data, we recommend starting from the smallest unit of analysis and working up to the largest. So with mobile data, you would typically:
- Start by generating a “screens” table. This would be computed by aggregating all the events that took place on a particular screen.
- Use that to generate a “sessions” table, by aggregating over all the different screens that each session was composed of, and summarising each session as a line of data.
- Use that to generate a “users” table, by aggregating over the sessions table to identify all the sessions for each user and generate user-level summaries.
The screen context makes the first step very easy: trivial in fact! We can simply group all the events by the screen view ID.
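Here is a minimal Python sketch of that bottom-up approach, assuming the screen context has been flattened into `screen_id` and `screen_name` columns. Those column names, the sample events and the summary fields are all assumptions made for illustration; a production model would more likely be written as SQL in the warehouse.

```python
from collections import defaultdict

# Hypothetical atomic events with the screen context flattened into columns.
events = [
    {"user_id": "u1", "session_id": "s1", "screen_id": "v1", "screen_name": "Home", "ts": 0},
    {"user_id": "u1", "session_id": "s1", "screen_id": "v1", "screen_name": "Home", "ts": 5},
    {"user_id": "u1", "session_id": "s1", "screen_id": "v2", "screen_name": "Cart", "ts": 10},
    {"user_id": "u1", "session_id": "s1", "screen_id": "v2", "screen_name": "Cart", "ts": 12},
    {"user_id": "u1", "session_id": "s2", "screen_id": "v3", "screen_name": "Home", "ts": 100},
]


def group_by(rows, key):
    """Collect rows into lists keyed by the value of one column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped


def build_screens(rows):
    """One row per screen view: group events by screen view ID."""
    return [
        {
            "screen_id": screen_id,
            "screen_name": evs[0]["screen_name"],
            "session_id": evs[0]["session_id"],
            "user_id": evs[0]["user_id"],
            "event_count": len(evs),
        }
        for screen_id, evs in group_by(rows, "screen_id").items()
    ]


def build_sessions(screens):
    """One row per session: aggregate over the screens table."""
    return [
        {
            "session_id": session_id,
            "user_id": views[0]["user_id"],
            "screen_views": len(views),
            "event_count": sum(v["event_count"] for v in views),
        }
        for session_id, views in group_by(screens, "session_id").items()
    ]


def build_users(sessions):
    """One row per user: aggregate over the sessions table."""
    return [
        {
            "user_id": user_id,
            "sessions": len(rows),
            "screen_views": sum(r["screen_views"] for r in rows),
        }
        for user_id, rows in group_by(sessions, "user_id").items()
    ]


screens = build_screens(events)
sessions = build_sessions(screens)
users = build_users(sessions)
print(users)
```

Each table is built only from the one below it, so the grouping by screen view ID is the single place where the model has to touch the raw events.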
Note how hard this simple act of grouping all the events that occur on a particular screen view is if we don’t have a screen view ID in a dedicated screen context. In this case we would have to:
- Examine our list of screen view events
- Run a window function to identify the timestamp of the subsequent screen view event. This would enable us to tell how long a user spends on each screen – except for cases when the user backgrounds or exits the app.
- Run a window function to identify the timestamp of any other subsequent events that might signal the screen is no longer in the foreground (e.g. an app_background event)
- Use those timestamps to determine the period that this particular screen was foregrounded on a user’s mobile device
- For each non-screen-view event, examine the timestamp and user ID and match it to the screen view in question based on its timestamp falling in the appropriate range for that screen view
Not only is the above SQL complicated to write, manage and maintain, it is also very computationally expensive. This is the type of process that can lock up Redshift for hours on end, or cost BigQuery or Snowflake users significant sums of money to run.
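For comparison, here is a rough Python sketch of the logic those steps have to reproduce when no screen view ID is available. The event names (`screen_view`, `app_background`, `tap`), field names and sample data are hypothetical, and the handling of backgrounding is deliberately simplified. Note that it needs a full sort of every event by user and timestamp plus a stateful pass over the result, which is roughly the work the window functions do in SQL.

```python
# Hypothetical events WITHOUT a screen context: only screen_view events
# know which screen they refer to.
events = [
    {"user_id": "u1", "event": "tap", "ts": 5},
    {"user_id": "u1", "event": "screen_view", "screen_name": "Home", "ts": 0},
    {"user_id": "u1", "event": "app_background", "ts": 8},
    {"user_id": "u1", "event": "notification_open", "ts": 50},
    {"user_id": "u1", "event": "screen_view", "screen_name": "Cart", "ts": 60},
    {"user_id": "u1", "event": "tap", "ts": 62},
]


def assign_screens(rows):
    """Attribute each event to the screen that was in the foreground.

    Simplification: a screen stays in the foreground until the next
    screen_view or an app_background event from the same user.
    """
    current = {}  # user_id -> name of the screen currently in the foreground
    out = []
    for row in sorted(rows, key=lambda r: (r["user_id"], r["ts"])):
        user = row["user_id"]
        if row["event"] == "screen_view":
            current[user] = row["screen_name"]
        out.append({**row, "inferred_screen": current.get(user)})
        if row["event"] == "app_background":
            # After backgrounding, no screen is in the foreground.
            current[user] = None
    return out


for row in assign_screens(events):
    print(row["ts"], row["event"], row["inferred_screen"])
```

Even in this simplified form, the result depends on every edge case being handled correctly (backgrounding, missing screen views, clock skew), whereas with a screen view ID on every event none of this inference is needed.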
Understanding how users of a mobile app move through the app
Understanding how users move through a mobile app is the foundation for many different types of analysis performed on mobile data. Often this boils down to:
- Which screens did a user visit on his / her journey?
- In what sequence did the user visit those screens?
As we flagged earlier, it is not straightforward to work out in SQL the sequence in which events occur: this typically involves hard-to-write and expensive-to-run window functions. Fortunately, the new mobile trackers ship with an updated screen view event that includes the following data points:
- previousName: the name of the previous screen viewed
- previousId: the ID of the previous screen view
This makes it easy for the person writing the data model to ensure screens are properly sequenced. That sequence can then be summarised, e.g. as an array of screens in the sessions table, which makes it easy to identify the more common journeys through the app, for example.
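As a sketch of how that sequencing might work, the Python below follows the previousId links to rebuild the order of screens in one session. The `previousName` and `previousId` fields are the ones described above; the `id` and `name` keys, the sample data and the `order_screens` helper are assumptions made for illustration.

```python
# Hypothetical screen view events for one session, in arbitrary order.
screen_views = [
    {"id": "v3", "name": "C", "previousId": "v2", "previousName": "B"},
    {"id": "v1", "name": "A", "previousId": None, "previousName": None},
    {"id": "v4", "name": "A", "previousId": "v3", "previousName": "C"},
    {"id": "v2", "name": "B", "previousId": "v1", "previousName": "A"},
]


def order_screens(views):
    """Return screen names in viewing order by following previousId links."""
    ids = {v["id"] for v in views}
    by_previous = {v["previousId"]: v for v in views}
    # The first screen is the one whose previousId is not a view in this session.
    starts = [v for v in views if v["previousId"] not in ids]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting screen view")
    ordered = [starts[0]]
    while ordered[-1]["id"] in by_previous:
        ordered.append(by_previous[ordered[-1]["id"]])
    return [v["name"] for v in ordered]


journey = order_screens(screen_views)
print(journey)
```

The resulting list is the kind of value that could be stored as an array column in the sessions table, so that journeys can be counted and compared with a simple group-by.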
We are committed to doing the boring work of data preparation, so you can do the hard work of driving value from that data
Mobile event data is extremely valuable, and any company that is serious about mobile needs to collect this data to understand how to improve and position that mobile proposition. Whilst there are many ways to collect that data, many of them result in a data set that is hard to work with to drive insight and action.
At Snowplow, we are completely focused on delivering a high-quality, rich data set that sets you free to use the data to drive value: socializing the data, driving insight from it and driving action from that insight.