Identity Stitching in Snowplow: A Q&A for Data Engineers
Identity stitching is a foundational technique for any organization looking to develop a reliable single customer view from behavioral data. In this blog post, we answer the most common technical questions around identity stitching in Snowplow, drawing on community contributions and Snowplow best practices.
What is identity stitching in Snowplow?
Identity stitching refers to the process of linking individual behavioral events in your Snowplow data to unique users across sessions, devices, and platforms. It involves building and applying a mapping of user identifiers that resolves to a canonical user ID — enabling robust user-level analytics.
Why is identity stitching important?
Without identity stitching, behavioral events stay fragmented across cookies, devices, and platforms, which makes it difficult to:
- Track full customer journeys
- Measure attribution accurately
- Understand conversion paths
- Power personalization or LTV modeling
Because Snowplow exposes every identifier on each event, you can implement stitching logic in your own warehouse and inspect exactly how each user was resolved.
How do I implement identity stitching with Snowplow data?
Step 1: Track as many identifiers as possible with each event
Snowplow trackers and the collector record a wide range of identifiers, most of them automatically. These include:
Collector-provided fields:
- network_userid: Cookie ID set by the collector (a third-party cookie unless the collector runs on your own domain)
- user_ipaddress: User's IP address
All trackers:
- user_id: Your own identifier, set explicitly (e.g. with setUserId in the JavaScript tracker) once the user logs in
JavaScript tracker:
- domain_userid: First-party cookie ID
- domain_sessionid: Session ID stored in the first-party cookie
- user_fingerprint: Browser fingerprint (older JavaScript tracker versions only)
Mobile trackers:
- open_idfa, apple_idfa, apple_idfv, android_idfa
- client_session.user_id and client_session.session_id
Custom contexts:
You can define your own schema (e.g., com.mycompany/user_context) to pass additional identifiers such as:
{
  "id": "string",
  "email": "string",
  "twitterHandle": "string",
  "facebookId": "string"
}
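To make the shape concrete, here is a minimal Python sketch of how such a context could be wrapped as a self-describing JSON before it is attached to an event. The Iglu URI (vendor com.mycompany, name user_context, version 1-0-0) and the helper build_user_context are hypothetical; substitute your own schema and tracker call.

```python
import json

# Hypothetical schema URI: the vendor, name, and version are assumptions.
USER_CONTEXT_SCHEMA = "iglu:com.mycompany/user_context/jsonschema/1-0-0"


def build_user_context(user_id=None, email=None, twitter_handle=None, facebook_id=None):
    """Wrap the known identifiers as a self-describing JSON, omitting unknown ones."""
    data = {
        "id": user_id,
        "email": email,
        "twitterHandle": twitter_handle,
        "facebookId": facebook_id,
    }
    return {
        "schema": USER_CONTEXT_SCHEMA,
        # Drop identifiers we do not have rather than sending nulls.
        "data": {key: value for key, value in data.items() if value is not None},
    }


context = build_user_context(user_id="u-123", email="jane@example.com")
print(json.dumps(context, indent=2))
```

Omitting unknown identifiers keeps the payload small; whether the schema should instead allow explicit nulls depends on how you define it.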
Step 2: Build a user mapping table
Identify events that include multiple identifiers — especially login events that include both domain_userid and user_id. Use these to construct a user mapping table:
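CREATE TABLE derived.user_mapping AS (
  -- Keep one row per cookie (latest user_id) so the Step 3 join cannot duplicate events.
  SELECT domain_userid, user_id
  FROM (
    SELECT domain_userid, user_id,
      ROW_NUMBER() OVER (PARTITION BY domain_userid ORDER BY MAX(collector_tstamp) DESC) AS rn
    FROM atomic.events
    WHERE domain_userid IS NOT NULL AND user_id IS NOT NULL
    GROUP BY 1, 2
  ) AS ranked
  WHERE rn = 1
);
This table maps each anonymous identifier (cookie) to a single authenticated user ID. Choosing the most recently seen user_id is one possible tie-break; adjust it if your business logic calls for a different rule.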
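The mapping logic can also be sketched in Python on a handful of in-memory events, which is handy for unit-testing the rule before running it in the warehouse. This is an illustration under one stated assumption: when a cookie has been seen with several user IDs, the most recent one (by collector_tstamp) wins. The function name build_user_mapping and the sample events are hypothetical.

```python
from datetime import datetime


def build_user_mapping(events):
    """Return {domain_userid: user_id}, keeping the most recently seen user_id."""
    latest = {}  # domain_userid -> (collector_tstamp, user_id)
    for event in events:
        cookie = event.get("domain_userid")
        user = event.get("user_id")
        if cookie is None or user is None:
            continue  # only events carrying both identifiers can link them
        seen = latest.get(cookie)
        if seen is None or event["collector_tstamp"] > seen[0]:
            latest[cookie] = (event["collector_tstamp"], user)
    return {cookie: user for cookie, (_, user) in latest.items()}


events = [
    {"domain_userid": "c1", "user_id": None, "collector_tstamp": datetime(2024, 1, 1, 9)},
    {"domain_userid": "c1", "user_id": "alice", "collector_tstamp": datetime(2024, 1, 1, 10)},
    {"domain_userid": "c1", "user_id": "bob", "collector_tstamp": datetime(2024, 1, 2, 10)},
    {"domain_userid": "c2", "user_id": None, "collector_tstamp": datetime(2024, 1, 3, 8)},
]
mapping = build_user_mapping(events)
print(mapping)  # {'c1': 'bob'}
```

Cookie c2 never appears with a user_id, so it is absent from the mapping and its events stay anonymous.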
Step 3: Apply the mapping to your atomic events
Enrich your events dataset with the canonical user_id using a LEFT JOIN:
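SELECT
  -- Prefer the user_id recorded on the event itself; fall back to the
  -- mapped ID for pre-login events on the same cookie.
  COALESCE(e.user_id, um.user_id) AS resolved_user_id,
  e.*
FROM atomic.events e
LEFT JOIN derived.user_mapping um
  ON e.domain_userid = um.domain_userid;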
This ensures that even pre-login events (where user_id is null) are assigned to a known user where possible.
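As a rough illustration of what the join does row by row, the Python sketch below applies a cookie-to-user mapping to a few events. It assumes the mapping holds at most one user_id per cookie and that an event's own user_id takes precedence over the mapped one; both are assumptions of this sketch, and the function name stitch is hypothetical.

```python
def stitch(events, mapping):
    """Return copies of the events with a resolved_user_id field added."""
    stitched = []
    for event in events:
        # Use the event's own user_id when present (an empty string counts
        # as missing); otherwise look the cookie up in the mapping.
        resolved = event.get("user_id") or mapping.get(event.get("domain_userid"))
        stitched.append({**event, "resolved_user_id": resolved})
    return stitched


mapping = {"c1": "alice"}
events = [
    {"event_id": 1, "domain_userid": "c1", "user_id": None},     # pre-login
    {"event_id": 2, "domain_userid": "c1", "user_id": "alice"},  # logged in
    {"event_id": 3, "domain_userid": "c2", "user_id": None},     # unknown cookie
]
stitched = stitch(events, mapping)

# A one-row-per-cookie mapping must never change the number of events.
assert len(stitched) == len(events)
print([event["resolved_user_id"] for event in stitched])  # ['alice', 'alice', None]
```

The row-count assertion mirrors a check worth running in the warehouse as well: if the stitched table has more rows than atomic.events, the mapping contains duplicate cookies.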
Can I stitch across platforms like mobile and web?
Yes. You can expand your mapping logic to include additional identifiers from mobile contexts (e.g., apple_idfv, android_idfa) or cross-device linking via internal identifiers (e.g., user_id, email hashes). Make sure to:
- Normalize all identifiers (e.g., trim whitespace and lowercase emails before hashing them)
- Consider mapping across combinations of cookies, device IDs, and login IDs
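One way to map across combinations of identifiers is to treat each identifier as a node in a graph and merge nodes that appear together on the same event. The sketch below uses a small union-find structure for this. It is a minimal illustration, not Snowplow functionality: the normalization rules, the IdentityGraph class, and the sample observations are all assumptions.

```python
import hashlib


def normalize(id_type, value):
    """Return a comparable (type, value) node for one identifier."""
    value = value.strip()
    if id_type == "email":
        # Hash the lowercased email so raw addresses never enter the graph.
        value = hashlib.sha256(value.lower().encode("utf-8")).hexdigest()
    elif id_type in {"apple_idfv", "android_idfa"}:
        value = value.lower()  # assumption: device IDs are case-insensitive UUIDs
    return (id_type, value)


class IdentityGraph:
    """Union-find over identifier nodes; linked nodes share one root."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]  # path halving
            node = self.parent[node]
        return node

    def link(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller so results are deterministic.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


# Each inner list holds the identifiers observed together on one event.
observations = [
    [("domain_userid", "c1"), ("user_id", "u-42")],       # web login
    [("apple_idfv", "ABCD-1234"), ("user_id", "u-42")],   # iOS login
    [("domain_userid", "c9")],                            # anonymous web visitor
]
graph = IdentityGraph()
for identifiers in observations:
    nodes = [normalize(id_type, value) for id_type, value in identifiers]
    for node in nodes:
        graph.find(node)  # register the node even if it links to nothing
    for other in nodes[1:]:
        graph.link(nodes[0], other)

web = graph.find(normalize("domain_userid", "c1"))
ios = graph.find(normalize("apple_idfv", "abcd-1234"))
print(web == ios)  # True: both resolve to the same canonical node
```

The root node returned by find can serve as the canonical ID. At warehouse scale the same idea is usually expressed as iterative SQL or a graph job rather than an in-memory structure.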
What if users share devices (e.g., tablets, public computers)?
In shared-device scenarios, deterministic stitching may misattribute events. Consider:
- Supplementing with probabilistic models (e.g., IP address + user agent + time proximity)
- Excluding shared-device identifiers from mapping logic
- Logging uncertainty or confidence scores in your mapping graph
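The last two points can be combined in a simple heuristic: score each cookie-to-user link by how many distinct users share the cookie, and keep only links above a threshold. The Python sketch below is one naive way to do this; the scoring rule (1 divided by the number of distinct users), the 0.5 threshold, and the function name score_mapping are assumptions chosen for illustration, not a recommended model.

```python
from collections import defaultdict


def score_mapping(pairs, min_confidence=0.5):
    """Split (cookie, user) links into kept and excluded rows with a confidence."""
    users_by_cookie = defaultdict(set)
    for cookie, user in pairs:
        users_by_cookie[cookie].add(user)

    kept, excluded = [], []
    for cookie in sorted(users_by_cookie):
        users = users_by_cookie[cookie]
        # Naive score: a cookie shared by n distinct users gets 1/n per link.
        confidence = 1 / len(users)
        for user in sorted(users):
            row = {"domain_userid": cookie, "user_id": user, "confidence": confidence}
            (kept if confidence > min_confidence else excluded).append(row)
    return kept, excluded


pairs = [
    ("c1", "alice"),
    ("c1", "alice"),   # repeat logins do not lower the score
    ("kiosk", "bob"),
    ("kiosk", "carol"),
    ("kiosk", "dave"),
]
kept, excluded = score_mapping(pairs)
print(kept)           # [{'domain_userid': 'c1', 'user_id': 'alice', 'confidence': 1.0}]
print(len(excluded))  # 3
```

Keeping the excluded rows, rather than discarding them, leaves an audit trail of which devices were treated as shared.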
Can I include 3rd-party marketing identifiers like GCLID or CID?
Yes. These can be captured via custom contexts or read from the query string of the page URL (the page_urlquery field). For example, the Google Ads gclid parameter can be parsed into a context and joined in your model. Capture these identifiers on the first page view of a session and propagate them across sessions where possible.
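As a small illustration of the parsing step, the Python sketch below pulls click identifiers out of a landing-page URL using only the standard library. The list of parameter names and the function name extract_click_ids are assumptions; extend the list to whatever your ad platforms append.

```python
from urllib.parse import parse_qs, urlparse

# Query parameters treated as click IDs in this sketch (an assumption).
CLICK_ID_PARAMS = ("gclid", "msclkid", "dclid")


def extract_click_ids(page_url):
    """Return {param: value} for each click-ID parameter present in the URL."""
    query = parse_qs(urlparse(page_url).query)  # blank values are dropped
    return {param: query[param][0] for param in CLICK_ID_PARAMS if param in query}


url = "https://shop.example.com/landing?utm_source=google&gclid=abc123"
print(extract_click_ids(url))  # {'gclid': 'abc123'}
```

The resulting dictionary can be attached to the event as a custom context or joined to the session it belongs to during modeling.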
Are there resources or tools for advanced identity stitching?
Yes. Snowplow customers and open-source users often use tools such as:
- dbt: for modular SQL data modeling
- Kafka / Amazon MSK: for real-time streaming and enrichment
- Spark / Beam: for large-scale graph-based identity mapping
- Snowplow Enrich / Loader: for passing through and shaping identifier payloads
Final thoughts
Identity stitching is not a one-size-fits-all operation. Snowplow provides the raw, granular behavioral data and the flexibility needed to implement stitching tailored to your business logic and tech stack.
For the most reliable results:
- Collect identifiers consistently across platforms
- Start simple and add complexity only as needed
- Monitor for data quality and edge cases
To explore more use cases and technical patterns, visit our Knowledge Base or reach out to the Snowplow team.