From Raw Events to Actionable Insights: Mastering Behavioral Data Products with Snowplow
In 2006, mathematician Clive Humby declared that “data is the new oil.” The quote came on the cusp of the Big Data era, when it was novel to hook together dozens, if not hundreds, of machines to power massive Hadoop-based analytical computations over terabytes or petabytes of corporate data. Data, like oil, powered this revolution. And as with every revolution, at some point order must be restored for everyone to prosper. In this case, the data fueling Big Data analytics needed to be organized, packaged, reused and perhaps even sold. Enter the phrase “data products”.
What is a Data Product?
At the core, data products are a curated, referenceable data set. This data set has a name, a shape (a schema), an owner, and a set of metadata that provides extra important information about the data. This metadata might include owner, time created, version, lineage/root source(s) of the data, and more. This information is stored in some type of lookup repository, a data set library, for others to discover and use.
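As a sketch of the idea, the kind of catalog entry described above might look like the following. The field names here are illustrative assumptions, not a specific Snowplow or catalog API:

```typescript
// Illustrative sketch of a data-product catalog entry.
// Field names are hypothetical, chosen to mirror the metadata listed above.
interface DataProductEntry {
  name: string;                    // human-readable name used for discovery
  owner: string;                   // email of the accountable owner
  createdAt: string;               // ISO-8601 creation timestamp
  version: string;                 // version of the product's schema
  schema: Record<string, string>;  // column name -> type (the "shape")
  lineage: string[];               // root source(s) the data is derived from
}

const productSearches: DataProductEntry = {
  name: "product_searches",
  owner: "analytics@example.com",
  createdAt: "2024-05-01T09:00:00Z",
  version: "1-0-0",
  schema: { search_term: "string", results_count: "integer", session_id: "string" },
  lineage: ["atomic.events"],
};

console.log(productSearches.name); // "product_searches"
```

An entry like this is what makes the dataset discoverable: a consumer can find it by name in the repository, see who owns it, and know its shape before querying it.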
A Harvard Business Review article also sums up data products nicely:
“A data product delivers a high-quality, ready-to-use set of data that people across an organization can easily access and apply to different business challenges.”
Today, the phrase “data product”, much like the phrase “Big Data” a decade ago, is used by every data-related company in order to latch onto this burgeoning trend of “data is the new oil”. A quick Google search suggests the following as examples of data products:
- A data warehouse.
- A curated list of data.
- A company KPI dashboard.
- Google Maps.
Not surprisingly, here at Snowplow, we too believe that you should treat data as a product; hence, a core part of our value proposition is delivering “data products” for our customers. Our primary goal at Snowplow is to enable our customers to build, manage and effectively use behavioral data products to power AI, ML and analytical data applications.
Snowplow’s Customer Data Infrastructure (CDI) captures first-party customer behavioral data from a variety of sources. The data collected are behavioral events – actions and observations captured from a client or server endpoint. For example, Snowplow’s event tracking technology can capture behavioral events from websites, mobile phones, video, internet TVs, airplane seat entertainment systems, factory floor IoT devices, and more. Events from these customer endpoints, often billions a month, are captured in real time and stored in an atomic events table in a destination data store, such as Snowflake, Databricks or even cloud data lakes like S3.
This atomic events table houses data from one or more data products in a consistent way. This is a powerful concept – this table is a singular view of all the different behaviors of all your customers, all stored in one place. As a result, this events table data provides the foundation for Snowplow’s behavioral data products.
Good decisions need good data
The Snowplow pipeline captures and enriches behavioral events with additional customer-specific processing (perhaps stripping out PII) and data (perhaps customer ID enrichments). This ensures that the data entering the atomic events table is of high quality.
Snowplow also provides tools to create specific datasets for web, media, mobile, and commerce.
What are Behavioral Data Products?
A Snowplow Behavioral Data Product includes:
- Behavioral event tracking: This defines the customer behavior that you wish to capture. It is a set of event specifications – schemas and associated semantics – including the rules for when the events are triggered and the properties contained within the events.
- Data Product metadata: This includes the product’s name, description, owner’s email address, its set of events and event schemas, and the location of the collected (modeled) behavioral event data.
- Data Model: This includes a domain-specific model that aggregates all the “micro behaviors” found in the atomic events table – for example, extracting and aggregating behaviors such as a user clicking a web page link, entering a search term, clicking the search button, scrolling down a list of items, or adding an item to the basket.
The result is an aggregate unit of analysis, e.g. “product searches”, “shopping journeys” or “sessions”. These are queryable tables that can be used to answer business questions, such as “how much revenue are we missing out on because we don’t have the right stock for the things our customers are looking for?”. These models can also power AI applications – for example, personalizing search to maximize the chance that a user finds what she wants, driving up conversion rate, basket size and customer lifetime value.
- Behavioral event data: This is the transformed data stored in specific tables in your data lake or warehouse.
The two important aspects of a Snowplow behavioral data product are tracking definitions and the data model.
It is important to define the set of events as well as the shape, the schema, of each event that you want to capture. Snowplow uses the schema of each event type to validate individual incoming events.
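To make the validation idea concrete, here is a minimal sketch in TypeScript. Snowplow actually validates events against self-describing JSON Schemas; this hand-rolled required-fields-and-types check is only illustrative:

```typescript
// Minimal sketch: check an incoming event payload against a declared shape.
// Snowplow's real pipeline validates against JSON Schemas; this simplified
// check only illustrates the concept.
type FieldType = "string" | "number" | "boolean";

interface EventSpec {
  name: string;                       // e.g. "search"
  fields: Record<string, FieldType>;  // required properties and their types
}

function isValid(spec: EventSpec, payload: Record<string, unknown>): boolean {
  // Every declared field must be present with the declared type.
  return Object.entries(spec.fields).every(
    ([field, type]) => typeof payload[field] === type
  );
}

const searchSpec: EventSpec = {
  name: "search",
  fields: { term: "string", results: "number" },
};

console.log(isValid(searchSpec, { term: "red shoes", results: 42 })); // true
console.log(isValid(searchSpec, { term: "red shoes" }));              // false
```

Events that fail this kind of check are not silently loaded; keeping malformed events out of the atomic table is what makes the downstream data trustworthy.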
The domain-specific models, defined by a set of SQL DDL and queries, are executed to create a collection of derived tables from the atomic events. These models filter, aggregate and process the raw events and create meaningful tables that are domain-specific, capturing behavior for e-commerce, web behavior, marketing attribution, and media/video player, among others.
These domain-specific models are purpose-built to produce behavioral analytics for that particular activity.
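As an illustration of what such a model does conceptually – here sketched in TypeScript rather than the SQL that Snowplow’s models actually run – micro-behaviors can be grouped and aggregated into a session-level unit of analysis. Field names are assumptions for illustration:

```typescript
// Conceptual sketch of a data model: aggregate raw micro-behavior events
// into one derived row per session. Snowplow's real models do this in SQL
// over the atomic events table; these field names are illustrative.
interface AtomicEvent {
  sessionId: string;
  eventName: string;   // e.g. "search", "add_to_basket"
  timestamp: number;   // epoch milliseconds
}

interface SessionRow {
  sessionId: string;
  searches: number;
  basketAdds: number;
  durationMs: number;
}

function buildSessions(events: AtomicEvent[]): SessionRow[] {
  // Group events by session (GROUP BY session_id in SQL terms).
  const bySession = new Map<string, AtomicEvent[]>();
  for (const e of events) {
    const group = bySession.get(e.sessionId) ?? [];
    group.push(e);
    bySession.set(e.sessionId, group);
  }
  // Aggregate each group into one derived row.
  const rows: SessionRow[] = [];
  bySession.forEach((group, sessionId) => {
    const times = group.map((e) => e.timestamp);
    rows.push({
      sessionId,
      searches: group.filter((e) => e.eventName === "search").length,
      basketAdds: group.filter((e) => e.eventName === "add_to_basket").length,
      durationMs: Math.max(...times) - Math.min(...times),
    });
  });
  return rows;
}

const rows = buildSessions([
  { sessionId: "s1", eventName: "search", timestamp: 0 },
  { sessionId: "s1", eventName: "add_to_basket", timestamp: 30000 },
  { sessionId: "s2", eventName: "search", timestamp: 5000 },
]);
console.log(rows.length); // 2 sessions
```

The derived "sessions" table produced this way is far easier to query for business questions than the raw event stream it came from.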
Once you have captured events and processed the data with the data models, you now have the core components of a Snowplow Behavioral Data Product:
- You have a well-defined set of events.
- You have deployed your automatically generated event tracker code to your endpoints.
- You have collected, validated and enriched the captured behavioral events.
- You have extracted derived tables, using your data model, from the raw behavioral events.
How do you create Snowplow Behavioral Data Products?
Snowplow Data Product Studio allows anyone in your organization to easily create high quality behavioral data products. These data products are then available to the rest of the business, to drive analytic and data applications.
First, create a set of well-defined event definitions.
You can create a standard product for e-commerce or media with just a few clicks. Note that you can easily augment our standard event packages by adding new events or modifying existing events to suit your needs. These out-of-the-box event definitions make a great starting point for behavioral data capture. However, if you wish to create a custom data product, creating your own set of custom events, that too is easily accomplished in the Data Product Studio.
Once you have defined your events, you have the opportunity to specify metadata for this future behavioral data product.
Metadata describes the data product being built. This information is useful for the people implementing the tracking, but Snowplow will also use it to:
- Automate as much of the instrumentation as possible with Snowtype (see next section).
- Make it easy for the developer to validate that the tracking has been set up correctly, using Snowtype and Snowplow Micro.
- Help others to discover this data set, via search or data catalog, so that others can understand and make use of this product.
The metadata that you can define includes:
- Name: A meaningful name for the data in this behavioral data product.
- Owner: The person, identified by email address, who is responsible for this data.
- Description: Information about the data contained within this behavioral data product.
- Events: The shape and schema of the data.
- Access instructions: Guidance on how to use the data.
Your second step is to deploy your tracking code.
Defining your events and event schemas can be the hardest part if you don’t start with the standard events supplied by Snowplow and instead have to define many events yourself. Don’t fret, though: once your events are defined, Snowplow’s Snowtype utility can generate the code needed to capture them in your customer-facing applications.
Snowtype generates instructions, functions and methods that make it easier for developers to instrument tracking in TypeScript, Go, Kotlin and Swift. Once generated, simply add the code to your websites, mobile apps, etc. When you deploy those changes, you will start collecting behavioral events.
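Snowtype’s output is specific to your event definitions, so the following is only a rough, hypothetical sketch of the idea: a generated helper gives the developer a typed function per event instead of a hand-built payload. The function name, schema URI, and the stubbed tracker below are all assumptions for illustration, not actual Snowtype output:

```typescript
// Hypothetical sketch of a generated, typed tracking helper.
// The schema URI and names are illustrative, not real Snowtype output.
interface SelfDescribingEvent {
  schema: string;                 // reference to the event's schema
  data: Record<string, unknown>;  // the event properties
}

// Stand-in stub for the tracker's generic entry point; a real tracker
// would send the event to the Snowplow collector instead.
const sent: SelfDescribingEvent[] = [];
function trackSelfDescribingEvent(event: SelfDescribingEvent): void {
  sent.push(event);
}

// A generated helper: the developer gets a typed signature, and the
// schema reference is filled in automatically from the event definition.
function trackSearch(input: { term: string; results: number }): void {
  trackSelfDescribingEvent({
    schema: "iglu:com.example/search/jsonschema/1-0-0", // illustrative URI
    data: input,
  });
}

trackSearch({ term: "red shoes", results: 42 });
console.log(sent.length); // 1
```

The benefit of generated helpers is that typos in property names or missing fields become compile-time errors rather than invalid events discovered later in the pipeline.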
Your third step is to collect, validate and enrich the captured behavioral events.
This step assumes that, prior to defining and deploying your tracking, you have already deployed a Snowplow pipeline. Once your tracking code is deployed, it will start sending events to the Snowplow collector, the public entry point designed to capture these events. When an event enters the Snowplow pipeline, Snowplow validates that the incoming event data matches the shape and schema of the event you defined in the Data Product Studio (in step 1).
The Snowplow pipeline also allows you to configure custom enrichments to be applied to your events. This is an opportunity to process each event before it enters the events table in your destination, a data lake or warehouse. Enrichment processing can add data to an event, perhaps a unified customer ID, or strip or mask data, such as PII. Snowplow provides a set of standard enrichments, and also allows custom enrichments to be applied.
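Conceptually, an enrichment is a per-event transformation applied between validation and loading. Here is a toy sketch in TypeScript of the two effects described above, adding a unified customer ID and stripping PII; the field names and lookup table are assumptions for illustration, not Snowplow’s enrichment API:

```typescript
// Toy sketch of enrichment: add a unified customer id and strip PII
// before the event lands in the warehouse. Field names are illustrative.
interface RawEvent {
  userEmail?: string;
  customerId?: string;
  [key: string]: unknown;
}

// Hypothetical lookup from email to a unified customer id.
const customerIds = new Map<string, string>([["jane@example.com", "cust-001"]]);

function enrich(event: RawEvent): RawEvent {
  const enriched = { ...event };
  if (enriched.userEmail) {
    // Add data: resolve a unified customer id from the email.
    enriched.customerId = customerIds.get(enriched.userEmail) ?? "unknown";
    // Strip data: drop the raw email (PII) before loading.
    delete enriched.userEmail;
  }
  return enriched;
}

const out = enrich({ userEmail: "jane@example.com", page: "/checkout" });
console.log(out); // customerId present, userEmail removed, page untouched
```

The same shape applies whether the transformation is one of Snowplow’s standard enrichments or a custom one: event in, modified event out, before anything is written to the destination.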
Your fourth and final step is to run your data models to extract derived tables from the raw behavioral events.
All the behavioral data that you have collected, validated, and enriched resides in the atomic events table in your target destination. This table has over 130 columns and likely millions of events. You will want to apply and run a data model on this data to extract derived, domain-specific data from this raw data set.
It’s easy to do with Snowplow. Snowplow provides a set of out-of-the-box data models for popular use cases such as web and mobile applications, media, e-commerce, and marketing attribution analytics. Further, you can easily set up custom modeling to suit your needs. To apply and configure data models for your events, navigate to the “Data models” view in the Snowplow Data Product Studio. From here you can add out-of-the-box Snowplow-supplied models or one of your own custom data models.
A typical Snowplow pipeline might service multiple behavioral data products. The Snowplow Data Product Studio lists all the data products defined for your organization.
How do you use Snowplow Behavioral Data Products?
Once you have created your behavioral data product – either from an out-of-the-box Snowplow-supplied definition or as a custom behavioral data product – and deployed your Snowplow-generated tracking code to your endpoints, you can put it to use.
There are a variety of ways to make use of this behavioral data product. Here’s a brief list, but the possibilities are endless and depend on your business needs and objectives:
- If you are using a Snowplow-defined standard behavioral data product you can use one of the pre-built behavioral data apps created by Snowplow.
- Build your own dashboards and analytics using a standard BI tool like Tableau or Looker.
- Use the behavioral data product to train AI and ML models.
- Use the behavioral data product to create features to store in your ML Feature Store.
- Build your own custom data applications that capture insights from your customers’ behavior.
Snowplow’s Behavioral Data Products
A behavioral data product is a well-documented dataset. It explains how the data is collected, what it means, and how to use it. This transparency encourages better teamwork among departments. Snowplow’s data products are underpinned by the concept of a data contract: a formal agreement between the producers and consumers of data products that supports better collaboration around the data being created.
Snowplow’s Data Product Studio enables data professionals to rapidly define behavioral data products – from supplying pre-built tracking plans and automatically generating tracking code to easily defining and configuring the data modeling that processes the data into a consumable, AI-ready form.
Snowplow behavioral data products take the concept of data products to the next level, bringing enhanced quality, governance, and discoverability to the data that you create.
Want to Learn More About Snowplow’s Data Product Studio?
Snowplow Solutions Engineer Freddie Day will walk users through Snowplow’s Data Product Studio on Thursday, August 15th. Register here to explore the comprehensive features of the Data Product Studio, including advanced enrichments (PII, IP anonymization, JavaScript, API and SQL enrichments), data structures tooling, and data modeling management. Attendees will leave the webinar with a concrete understanding of how to leverage these tools to enhance their data strategy and deliver better business outcomes.