Build powerful AI data applications and composable CDPs with Snowplow and Databricks

As the world’s first and leading data lakehouse, Databricks enables over 7,000 organisations to unify their data storage and power AI. Snowplow creates behavioral data, which is the strongest predictor of customer behaviour.

Because Databricks plays such an integral role within the modern data stack, it was critical for Snowplow to offer our customers and open source community the ability to use Databricks as the end destination for the rich, predictive and AI-ready behavioral data that they create with Snowplow.

From today, Snowplow users can seamlessly load the data they create with Snowplow into Databricks and use the new out-of-the-box web models to power AI/ML data applications and advanced analytics, and to automate insight as part of a composable CDP, directly from their Databricks lakehouse.

How to get started: 

  • Snowplow customers, please contact your Customer Success rep.

The power and challenge of behavioral data for AI

The aim of AI is often to make accurate predictions about future behavior, whether that be the behavior of machines, people or the natural environment. There is no better predictor of future behavior than past behavior – our actions today are the best indicator of what we will do tomorrow.

However, many companies have traditionally decided against the large-scale use of behavioral data due to the complexity involved. Much of this complexity comes from the extensive data preparation which is necessary before feature engineering and model training can take place. 

“Dirty data is the Achilles heel of artificial intelligence tools. There’s a common adage among machine learning and artificial intelligence experts that a 10% improvement in data is more impactful than a 100% improvement in the effectiveness of algorithms.” (Forbes)

So how can we get our data at the right altitude to train AI models?

Introducing Data Creation for AI with Snowplow

Snowplow’s pioneering approach to Data Creation removes the need for extensive data preparation. Instead, it empowers data teams to manage behavioral data creation end to end, with AI-ready data landing in their data lakehouse (or other storage destination).

Snowplow modeled data has the following 5 traits, which enable data teams to focus on feature engineering and model training as well as advanced analytics. 

  • Reliable: data must arrive in a timely and reliable manner in order to be useful for data applications. Without knowing how old data is, many applications are impossible to run effectively.
  • Explainable: each metric is tightly defined by JSON schemas and is both human- and machine-readable. This enables all stakeholders to deeply understand the meaning of the data.
  • Compliant: the full lineage of the data is observable and auditable, the basis for capture can be recorded with each event, and PII can be hashed.
  • Accurate: the data faithfully describes the user journey as it actually happened; 400 page views must mean 400 page views.
  • Predictive: unlimited entities and properties customised for a specific data application help ML algorithms find unlikely correlations.
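
To make the "PII can be hashed" point above more concrete, here is a minimal Python sketch of pseudonymizing selected fields with a keyed SHA-256 digest. It is an illustration only and not Snowplow's own implementation: the field names, the `pseudonymize` helper and the salt value are all assumptions made for this example.

```python
import hashlib
import hmac

# Hypothetical field names chosen for illustration; they are not taken from the article.
PII_FIELDS = {"user_id", "user_ipaddress"}


def pseudonymize(event: dict, salt: bytes, pii_fields=PII_FIELDS) -> dict:
    """Return a copy of `event` with PII values replaced by keyed SHA-256 digests."""
    out = dict(event)  # shallow copy so the caller's event is left untouched
    for field in pii_fields:
        value = out.get(field)
        if value is not None:
            digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()
    return out


# Example event (made-up values) and an example salt (an assumption, not a real secret).
event = {
    "event_name": "page_view",
    "user_id": "alice@example.com",
    "user_ipaddress": "203.0.113.7",
}
hashed = pseudonymize(event, salt=b"example-salt")
```

Because the digest is deterministic for a given salt, the same user still maps to the same hashed value, so the data remains joinable for analytics and ML without exposing the raw identifier.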

Being AI-ready means data created with Snowplow can be loaded directly to Databricks without preparation, where it can be used to power AI and ML data applications as well as advanced BI. Snowplow’s model of deployment in our customers’ own cloud environments also allows for data observability, 2-year tracking – despite ITP restrictions – and complete governance.

Learn more about Data Creation with Snowplow

Snowplow and Databricks for AI and Advanced Analytics

The main value that Databricks adds on top of this Snowplow data is faster time to value for AI and ML projects, along with a single destination that serves as both your data warehouse and your data lake.

Databricks is a powerful tool, allowing users to standardize the full ML lifecycle from experimentation to production. With Snowplow’s AI-ready data in Databricks, data engineers and scientists can focus on training and productionizing models with MLflow and AutoML rather than cleaning and preparing data.

Databricks users can also access machine learning environments with one click, with groundbreaking frameworks like TensorFlow, scikit-learn and PyTorch. Experiments can be tracked from a central repository, and runs can be managed and reproduced collaboratively. This completes a chain of governance that starts with Snowplow, where the data is generated, enhanced and modeled, and runs all the way to the creation of data-intensive applications in Databricks.

Snowplow and Databricks as part of a Composable CDP

Added to the flexibility offered by Snowplow and Databricks is the concept of a composable CDP. Fundamentally, this allows businesses to create a single view of their customer within their data lake or warehouse. It also means companies can activate data through integrations with other tools, and natively within their own platform. 

Many businesses are turning away from off-the-shelf CDPs due to the fundamental shift in how data is stored, i.e. orbiting around the data lake or data warehouse with a strong focus on customer privacy. Off-the-shelf CDPs create a separate data silo for your business, while the composable CDP route means adopters get best-in-class products at each stage in the pipeline, from data creation to storage, modeling and activation.

Start using Snowplow and Databricks to power custom data applications

New to Snowplow? Book a demo or schedule a chat.

OS customers, learn how to implement Snowplow’s Databricks Loader.

BDP customers, reach out to Customer Success and we’ll get you set up.

Technical Overview

  1. Snowplow dbt Models for Databricks

These dbt models read off the atomic data table created by Snowplow in your storage destination – we call this the canonical event model.

Learn more about Snowplow’s dbt models
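
As a rough idea of what a model reading off the atomic events table does, the sketch below rolls raw events up into one row per session. It is not the actual Snowplow dbt package: an in-memory SQLite database stands in for Databricks so the example can run anywhere, and the table name, column names and sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the lakehouse; this is an assumption for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        event_id TEXT PRIMARY KEY,
        event_name TEXT,
        domain_sessionid TEXT,
        collector_tstamp TEXT
    )
    """
)

# Made-up sample events: one row per event, as in the atomic table.
rows = [
    ("e1", "page_view", "s1", "2022-06-01 10:00:00"),
    ("e2", "page_view", "s1", "2022-06-01 10:01:30"),
    ("e3", "link_click", "s1", "2022-06-01 10:02:00"),
    ("e4", "page_view", "s2", "2022-06-02 09:15:00"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# A simplified "sessions" model: aggregate events into one row per session.
SESSIONS_SQL = """
    SELECT
        domain_sessionid,
        MIN(collector_tstamp) AS session_start,
        MAX(collector_tstamp) AS session_end,
        SUM(CASE WHEN event_name = 'page_view' THEN 1 ELSE 0 END) AS page_views
    FROM events
    GROUP BY domain_sessionid
    ORDER BY domain_sessionid
"""
sessions = conn.execute(SESSIONS_SQL).fetchall()
for session in sessions:
    print(session)
```

The real models are considerably more involved, but the shape is the same: event-level rows go in, and analysis-ready tables come out.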

  2. Snowplow Databricks Loader

The Databricks Loader is part of Snowplow’s RDB Loader framework, which offers many benefits for Open Source Users. Learn more about the RDB Loader framework.

BDP customers don’t need to worry about setup or configuration; simply contact your Customer Success rep.

  • Cloud accounts supported: Currently only available for AWS pipelines. We are working on porting the RDB Loader framework to GCP later this year.
  • Latency: Latency SLAs are available for BDP customers on Ascent and Summit for guaranteed data freshness.
  • Monitoring & observability: Alarms are raised if batches fail to load or if the warehouse is unhealthy, offering an unprecedented overview of the health of your pipeline.
  • Reliability: Retry and timeout logic that self-heals after a failed load, providing a failsafe.
  • Partitioning: Data is partitioned by date (collector timestamp). This can be configured, but we recommend this setup to avoid breaking the dbt models downstream.
  • Format details: All events are loaded into a single atomic events table backed by Databricks’ Delta tables. We call this a “wide-row table”, with one row per event and one column for each type of entity/property and self-describing event. Explore Snowplow Data.
  • Versions supported: RDB Loader version 4.0.3 and later.
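
The partitioning and wide-row format described above can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the loader's code: the `entity_` column prefix, the `partition_date` column name, the normalization to UTC and the sample event are all hypothetical.

```python
from datetime import datetime, timezone


def partition_date(collector_tstamp: str) -> str:
    """Derive a date partition value (YYYY-MM-DD) from an ISO 8601 collector timestamp.

    Assumption for this sketch: timestamps without an offset are treated as UTC,
    and timestamps with an offset are converted to UTC before taking the date.
    """
    ts = datetime.fromisoformat(collector_tstamp)
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).date().isoformat()


def to_wide_row(event: dict) -> dict:
    """Flatten one event into a single row with one column per entity type."""
    # Copy the standard fields, leaving the nested entities out for now.
    row = {key: value for key, value in event.items() if key != "entities"}
    # One column per entity type (hypothetical "entity_" prefix).
    for entity_type, payload in event.get("entities", {}).items():
        row[f"entity_{entity_type}"] = payload
    # Hypothetical partition column derived from the collector timestamp.
    row["partition_date"] = partition_date(event["collector_tstamp"])
    return row


# Made-up example event with one attached entity.
event = {
    "event_id": "e1",
    "event_name": "page_view",
    "collector_tstamp": "2022-06-02T01:30:00+02:00",
    "entities": {"web_page": [{"id": "p1"}]},
}
row = to_wide_row(event)
```

Each event stays on a single row however many entity types are attached, and all rows that share a `partition_date` would be stored together, which is what lets downstream models scan only the dates they need.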

More about the author

Phil Western

Product Marketing Manager at Snowplow
