Identity Stitching in Snowplow: A Q&A for Data Engineers

Snowplow Team

November 3, 2023


Identity stitching is a foundational technique for any organization looking to develop a reliable single customer view from behavioral data. In this blog post, we answer the most common technical questions around identity stitching in Snowplow, drawing on community contributions and Snowplow best practices.

What is identity stitching in Snowplow?

Identity stitching refers to the process of linking individual behavioral events in your Snowplow data to unique users across sessions, devices, and platforms. It involves building and applying a mapping of user identifiers that resolves to a canonical user ID — enabling robust user-level analytics.

Why is identity stitching important?

Without identity stitching, behavioral events are fragmented across cookies, devices, and platforms — making it impossible to:

Track full customer journeys
Measure attribution accurately
Understand conversion paths
Power personalization or LTV modeling

Snowplow's flexibility and transparency make it an ideal tool for implementing a precise identity stitching pipeline.

How do I implement identity stitching with Snowplow data?

Step 1: Track as many identifiers as possible with each event

Snowplow trackers are designed to collect a wide range of identifiers automatically. These include:

Collector-provided fields:

network_userid: Cookie ID set by the collector (a third-party cookie unless the collector runs on your own domain)
user_ipaddress: User's IP address

All trackers:

user_id: Your own identifier, e.g. from login

JavaScript tracker:

domain_userid: First-party cookie ID
domain_sessionid: Session cookie
user_fingerprint: Browser fingerprint (older JavaScript tracker versions only)

Mobile trackers:

open_idfa, apple_idfa, apple_idfv, android_idfa
client_session.user_id and client_session.session_id

Custom contexts:

You can define your own schema (e.g., com.mycompany/user_context) to pass additional identifiers such as:

{
  "id": "string",
  "email": "string",
  "twitterHandle": "string",
  "facebookId": "string"
}
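As a rough illustration, the identifiers above could be wrapped in a self-describing JSON before being attached to an event. This is a minimal sketch, not a Snowplow tracker API: the schema URI `iglu:com.mycompany/user_context/jsonschema/1-0-0` and the helper `build_user_context` are assumptions made up for this example.

```python
import json
from typing import Optional

# Hypothetical Iglu schema URI for the custom context described above.
USER_CONTEXT_SCHEMA = "iglu:com.mycompany/user_context/jsonschema/1-0-0"


def build_user_context(
    internal_id: str,
    email: Optional[str] = None,
    twitter_handle: Optional[str] = None,
    facebook_id: Optional[str] = None,
) -> dict:
    """Return a self-describing JSON (schema + data) holding the known identifiers."""
    data = {
        "id": internal_id,
        "email": email,
        "twitterHandle": twitter_handle,
        "facebookId": facebook_id,
    }
    # Drop identifiers that were not supplied so the payload stays minimal.
    data = {key: value for key, value in data.items() if value is not None}
    return {"schema": USER_CONTEXT_SCHEMA, "data": data}


context = build_user_context("u-123", email="jane@example.com")
print(json.dumps(context, indent=2))
```

The resulting dictionary would then be passed to whichever tracker you use as a context entity; how that is done depends on the tracker and is not shown here.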

Step 2: Build a user mapping table

Identify events that include multiple identifiers — especially login events that include both domain_userid and user_id. Use these to construct a user mapping table:

CREATE TABLE derived.user_mapping AS (
  -- One row per distinct (cookie, login ID) pair seen on the same event
  SELECT domain_userid, user_id
  FROM atomic.events
  WHERE domain_userid IS NOT NULL AND user_id IS NOT NULL
  GROUP BY 1, 2
);

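This table maps anonymous identifiers (cookies) to authenticated user IDs. A cookie that has been seen with more than one user_id produces several rows, so deduplicate the table (for example, by keeping the most recent user_id per cookie) before joining it to your events, or the join will duplicate rows.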
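The same mapping logic can be sketched outside the warehouse. The example below is a minimal illustration, assuming events arrive as dictionaries with `domain_userid`, `user_id`, and an ISO 8601 `collector_tstamp`; the helper `build_user_mapping` is hypothetical. Unlike the SQL above, it keeps only the most recent user_id per cookie, which is one possible way to guarantee a single row per domain_userid.

```python
from typing import Dict, Iterable, Optional, Tuple

Event = Dict[str, Optional[str]]


def build_user_mapping(events: Iterable[Event]) -> Dict[str, str]:
    """Map each domain_userid to the user_id most recently seen alongside it."""
    latest: Dict[str, Tuple[str, str]] = {}  # cookie -> (timestamp, user_id)
    for event in events:
        cookie = event.get("domain_userid")
        user = event.get("user_id")
        tstamp = event.get("collector_tstamp") or ""
        if cookie is None or user is None:
            continue  # only events carrying both identifiers can link them
        seen = latest.get(cookie)
        # ISO 8601 timestamps in the same timezone sort chronologically as text.
        if seen is None or tstamp > seen[0]:
            latest[cookie] = (tstamp, user)
    return {cookie: user for cookie, (_, user) in latest.items()}


sample_events = [
    {"domain_userid": "c1", "user_id": None, "collector_tstamp": "2023-01-01T10:00:00Z"},
    {"domain_userid": "c1", "user_id": "alice", "collector_tstamp": "2023-01-01T10:05:00Z"},
    {"domain_userid": "c1", "user_id": "bob", "collector_tstamp": "2023-01-02T09:00:00Z"},
    {"domain_userid": "c2", "user_id": "carol", "collector_tstamp": "2023-01-01T11:00:00Z"},
]
user_mapping = build_user_mapping(sample_events)
print(user_mapping)
```

Whether "most recent wins" is the right rule depends on your business logic; first-seen or most-frequent are equally plausible choices.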

Step 3: Apply the mapping to your atomic events

Enrich your events dataset with the canonical user_id using a LEFT JOIN:

SELECT
  -- Prefer the mapped ID; fall back to the ID recorded on the event itself
  COALESCE(um.user_id, e.user_id) AS resolved_user_id,
  e.*
FROM atomic.events e
LEFT JOIN derived.user_mapping um
  ON e.domain_userid = um.domain_userid;

This ensures that even pre-login events (where user_id is null) are assigned to a known user where possible.
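To make the effect of the join concrete, here is a small sketch that applies a mapping to in-memory events. It assumes the mapping holds exactly one user_id per domain_userid; the function `resolve_user_id` and the sample data are hypothetical.

```python
from typing import Dict, List, Optional

Event = Dict[str, Optional[str]]


def resolve_user_id(event: Event, user_mapping: Dict[str, str]) -> Optional[str]:
    """Mirror COALESCE(um.user_id, e.user_id): mapped ID first, event ID second."""
    cookie = event.get("domain_userid")
    mapped = user_mapping.get(cookie) if cookie is not None else None
    return mapped if mapped is not None else event.get("user_id")


# Assumed to contain a single user_id per cookie.
mapping = {"c1": "alice"}

events: List[Event] = [
    {"event_id": "e1", "domain_userid": "c1", "user_id": None},     # pre-login
    {"event_id": "e2", "domain_userid": "c1", "user_id": "alice"},  # post-login
    {"event_id": "e3", "domain_userid": "c9", "user_id": None},     # never logged in
]

stitched = [{**e, "resolved_user_id": resolve_user_id(e, mapping)} for e in events]
for row in stitched:
    print(row["event_id"], row["resolved_user_id"])
```

The pre-login event `e1` picks up the user from the mapping, while `e3` stays unresolved because its cookie was never seen with a login.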

Can I stitch across platforms like mobile and web?

Yes. You can expand your mapping logic to include additional identifiers from mobile contexts (e.g., apple_idfv, android_idfa) or cross-device linking via internal identifiers (e.g., user_id, email hashes). Make sure to:

Normalize all identifiers (e.g., trim whitespace and lowercase emails before hashing)
Consider mapping across combinations of cookies, device IDs, and login IDs
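One way to map across combinations of identifiers is to treat every identifier as a node in a graph and merge nodes that appear on the same event. The sketch below uses a small union-find structure for that. It is an illustration under stated assumptions, not a Snowplow feature: the event shape, the `normalize` rule (trim and lowercase everything), and the helper names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

Node = Tuple[str, str]  # (identifier type, normalized value)


def normalize(kind: str, value: str) -> Node:
    # Assumption: identifiers are case-insensitive and may carry stray whitespace.
    return (kind, value.strip().lower())


class UnionFind:
    """Minimal disjoint-set structure keyed by identifier nodes."""

    def __init__(self) -> None:
        self.parent: Dict[Node, Node] = {}

    def find(self, x: Node) -> Node:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: Node, b: Node) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_identifiers(events: Iterable[Dict[str, Optional[str]]]) -> List[Set[Node]]:
    """Group identifiers that are connected through shared events."""
    uf = UnionFind()
    for event in events:
        nodes = [normalize(kind, value) for kind, value in event.items() if value]
        for node in nodes:
            uf.union(nodes[0], node)  # link every identifier seen on the same event
    clusters: Dict[Node, Set[Node]] = defaultdict(set)
    for node in list(uf.parent):
        clusters[uf.find(node)].add(node)
    return list(clusters.values())


sample_events = [
    {"domain_userid": " Cookie-A", "user_id": "u-1"},  # web login
    {"apple_idfv": "IDFV-9", "user_id": "U-1"},        # mobile login, same user
    {"domain_userid": "cookie-b"},                      # anonymous web visitor
]
clusters = cluster_identifiers(sample_events)
for cluster in clusters:
    print(sorted(cluster))
```

Here the web cookie and the mobile device ID end up in one cluster because both were observed with the same login ID, while the anonymous cookie remains on its own.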

What if users share devices (e.g., tablets, public computers)?

In shared-device scenarios, deterministic stitching may misattribute events. Consider:

Supplementing with probabilistic models (e.g., IP + UA + time proximity)
Excluding shared-device identifiers from mapping logic
Logging uncertainty or confidence scores in your mapping graph
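Excluding shared-device identifiers can start with a simple heuristic: flag any cookie that has been linked to more logins than a threshold allows. The sketch below assumes the mapping is available as (domain_userid, user_id) pairs; the threshold `MAX_USERS_PER_COOKIE` and the helper `find_shared_cookies` are hypothetical and would need tuning against your own data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical threshold: cookies linked to more users than this are treated as shared.
MAX_USERS_PER_COOKIE = 2


def find_shared_cookies(
    pairs: Iterable[Tuple[str, str]], max_users: int = MAX_USERS_PER_COOKIE
) -> Set[str]:
    """Return the cookies associated with more than max_users distinct user IDs."""
    users_by_cookie: Dict[str, Set[str]] = defaultdict(set)
    for cookie, user in pairs:
        users_by_cookie[cookie].add(user)
    return {cookie for cookie, users in users_by_cookie.items() if len(users) > max_users}


mapping_pairs: List[Tuple[str, str]] = [
    ("kiosk-1", "u-1"),
    ("kiosk-1", "u-2"),
    ("kiosk-1", "u-3"),
    ("laptop-7", "u-1"),
]

shared = find_shared_cookies(mapping_pairs)
# Keep only pairs whose cookie is not flagged as a shared device.
safe_pairs = [pair for pair in mapping_pairs if pair[0] not in shared]
print(shared, safe_pairs)
```

Events from flagged cookies are then attributed only when the event itself carries a user_id, rather than through the mapping.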

Can I include 3rd-party marketing identifiers like GCLID or CID?

Yes. These can be captured via custom contexts or read from the query string of the page URL that Snowplow records with each event. For example, the Google Ads gclid can be parsed into a context and joined in your model. Capture these identifiers early and propagate them across sessions where possible.
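As a minimal sketch of the parsing step, the example below pulls click identifiers out of a page URL using the Python standard library. The parameter list in `CLICK_ID_PARAMS` and the helper `extract_click_ids` are assumptions for illustration; adjust them to the ad networks you actually use.

```python
from typing import Dict
from urllib.parse import parse_qs, urlparse

# Assumed set of click-ID query parameters to look for.
CLICK_ID_PARAMS = ("gclid", "msclkid", "dclid")


def extract_click_ids(page_url: str) -> Dict[str, str]:
    """Return the click identifiers present in the URL's query string."""
    query = parse_qs(urlparse(page_url).query)
    # parse_qs returns a list per parameter; keep the first value of each.
    return {name: query[name][0] for name in CLICK_ID_PARAMS if name in query}


landing_url = "https://example.com/landing?utm_source=google&gclid=abc123"
click_ids = extract_click_ids(landing_url)
print(click_ids)
```

The extracted values can be stored alongside the session's first event and carried forward to later sessions through the same user mapping used for other identifiers.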

Are there resources or tools for advanced identity stitching?

Yes — Snowplow customers and open-source users often leverage tools like:

dbt: for modular SQL data modeling
Kafka / AWS MSK: for real-time streaming and enrichment
Spark / Beam: for large-scale graph-based identity mapping
Snowplow Enrich / Loader: for passing through and shaping identifier payloads

Final thoughts

Identity stitching is not a one-size-fits-all operation. Snowplow provides the raw, granular behavioral data and the flexibility needed to implement stitching tailored to your business logic and tech stack.

For the most reliable results:

Collect identifiers consistently across platforms
Start simple and add complexity only as needed
Monitor for data quality and edge cases

To explore more use cases and technical patterns, visit our Knowledge Base or reach out to the Snowplow team.

