
Identity Stitching in Snowplow: A Q&A for Data Engineers

By Snowplow Team
November 3, 2023

Identity stitching is a foundational technique for any organization looking to develop a reliable single customer view from behavioral data. In this blog post, we answer the most common technical questions around identity stitching in Snowplow, drawing on community contributions and Snowplow best practices.

What is identity stitching in Snowplow?

Identity stitching refers to the process of linking individual behavioral events in your Snowplow data to unique users across sessions, devices, and platforms. It involves building and applying a mapping of user identifiers that resolves to a canonical user ID — enabling robust user-level analytics.

Why is identity stitching important?

Without identity stitching, behavioral events are fragmented across cookies, devices, and platforms, which makes it difficult to:

  • Track full customer journeys

  • Measure attribution accurately

  • Understand conversion paths

  • Power personalization or LTV modeling

Snowplow's flexibility and transparency make it an ideal tool for implementing a precise identity stitching pipeline.

How do I implement identity stitching with Snowplow data?

Step 1: Track as many identifiers as possible with each event

Snowplow trackers are designed to collect a wide range of identifiers automatically. These include:

Collector-provided fields:

  • network_userid: ID from the cookie set by the collector (a third-party cookie unless the collector runs on your own domain)

  • user_ipaddress: User's IP address

All trackers:

  • user_id: Your own identifier, e.g. from login

JavaScript tracker:

  • domain_userid: First-party cookie ID

  • domain_sessionid: Session cookie

  • user_fingerprint: Browser fingerprint (older tracker versions only; fingerprinting was removed in version 3 of the JavaScript tracker)

Mobile trackers:

  • open_idfa, apple_idfa, apple_idfv, android_idfa

  • client_session.user_id and client_session.session_id

Custom contexts:

You can define your own schema (e.g., com.mycompany/user_context) to pass additional identifiers such as:

{
  "id": "string",
  "email": "string",
  "twitterHandle": "string",
  "facebookId": "string"
}
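As an illustration, an entity with these fields could be assembled as a self-describing JSON before it is attached to an event. This is a sketch only: the schema URI `iglu:com.mycompany/user_context/jsonschema/1-0-0` and the helper `build_user_context` are hypothetical names, and dropping unknown identifiers instead of sending nulls is an assumption about how the schema is defined.

```python
import json

# Hypothetical schema URI; use the one registered in your own Iglu registry.
USER_CONTEXT_SCHEMA = "iglu:com.mycompany/user_context/jsonschema/1-0-0"


def build_user_context(user_id=None, email=None, twitter_handle=None, facebook_id=None):
    """Return a self-describing JSON dict, omitting identifiers that are unknown."""
    data = {
        "id": user_id,
        "email": email,
        "twitterHandle": twitter_handle,
        "facebookId": facebook_id,
    }
    return {
        "schema": USER_CONTEXT_SCHEMA,
        "data": {key: value for key, value in data.items() if value is not None},
    }


context = build_user_context(user_id="u-123", email="jane@example.com")
print(json.dumps(context, indent=2))
```

The resulting dict has the `schema` / `data` shape that Snowplow trackers accept for custom contexts; how it is passed to the tracker depends on the tracker you use.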

Step 2: Build a user mapping table

Identify events that include multiple identifiers — especially login events that include both domain_userid and user_id. Use these to construct a user mapping table:

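CREATE TABLE derived.user_mapping AS (
  -- Keep one row per cookie: the user_id most recently seen with each domain_userid
  SELECT domain_userid, user_id
  FROM (
    SELECT domain_userid, user_id,
           ROW_NUMBER() OVER (PARTITION BY domain_userid ORDER BY collector_tstamp DESC) AS rn
    FROM atomic.events
    WHERE domain_userid IS NOT NULL AND user_id IS NOT NULL
  ) ranked
  WHERE rn = 1
);

This table maps anonymous identifiers (cookies) to authenticated user IDs. Keeping a single user_id per domain_userid (here, the most recent one) prevents the join in the next step from duplicating events when several users have logged in from the same browser.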

Step 3: Apply the mapping to your atomic events

Enrich your events dataset with the canonical user_id using a LEFT JOIN:

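SELECT
  -- Prefer the user_id mapped from the cookie; fall back to the event's own user_id
  COALESCE(um.user_id, e.user_id) AS resolved_user_id,
  e.*
FROM atomic.events e
LEFT JOIN derived.user_mapping um  -- LEFT JOIN keeps events whose cookie has no mapping
  ON e.domain_userid = um.domain_userid;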

This ensures that even pre-login events (where user_id is null) are assigned to a known user where possible.
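To make the behavior of Steps 2 and 3 concrete, here is a small in-memory sketch on three sample events. It is not how you would run stitching at warehouse scale. The function names are hypothetical, and the rule that the most recent user_id wins when a cookie has been seen with several user_ids is an assumption of this sketch.

```python
def build_user_mapping(events):
    """Map each domain_userid to the user_id most recently seen with it."""
    latest = {}  # domain_userid -> (collector_tstamp, user_id)
    for event in events:
        cookie, user = event.get("domain_userid"), event.get("user_id")
        if cookie is None or user is None:
            continue  # only events carrying both identifiers can link them
        tstamp = event["collector_tstamp"]
        if cookie not in latest or tstamp > latest[cookie][0]:
            latest[cookie] = (tstamp, user)
    return {cookie: user for cookie, (_, user) in latest.items()}


def stitch(events, mapping):
    """Add resolved_user_id, mirroring COALESCE(um.user_id, e.user_id)."""
    stitched = []
    for event in events:
        mapped = mapping.get(event.get("domain_userid"))
        resolved = mapped if mapped is not None else event.get("user_id")
        stitched.append({**event, "resolved_user_id": resolved})
    return stitched


# ISO 8601 timestamps in the same timezone compare correctly as strings.
events = [
    {"event_id": "e1", "domain_userid": "c1", "user_id": None,
     "collector_tstamp": "2023-11-01T10:00:00Z"},
    {"event_id": "e2", "domain_userid": "c1", "user_id": "alice",
     "collector_tstamp": "2023-11-01T10:05:00Z"},
    {"event_id": "e3", "domain_userid": "c2", "user_id": None,
     "collector_tstamp": "2023-11-01T11:00:00Z"},
]

mapping = build_user_mapping(events)
stitched = stitch(events, mapping)
for row in stitched:
    print(row["event_id"], row["resolved_user_id"])
```

The pre-login event e1 is attributed to alice because it shares a cookie with her login event, while e3 stays unresolved because its cookie was never seen with a user_id.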

Can I stitch across platforms like mobile and web?

Yes. You can expand your mapping logic to include additional identifiers from mobile contexts (e.g., apple_idfv, android_idfa) or cross-device linking via internal identifiers (e.g., user_id, email hashes). Make sure to:

  • Normalize all identifiers (for example, trim whitespace and lowercase emails before hashing them)

  • Consider mapping across combinations of cookies, device IDs, and login IDs
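One way to map across combinations of identifiers is to treat each identifier as a node in a graph and merge nodes that appear on the same event. The sketch below uses a union-find structure for that. The class `IdentityGraph`, the `ID_FIELDS` list, the `email` field, and the normalization rule (trim and lowercase) are all assumptions for illustration; lowercasing in particular is only safe if your IDs are case-insensitive.

```python
import hashlib

# Assumption: these are the identifier columns available on your events.
ID_FIELDS = ("domain_userid", "apple_idfv", "android_idfa", "user_id")


def normalize(value):
    """Trim whitespace and lowercase so the same ID always compares equal."""
    return value.strip().lower()


def hash_email(email):
    """Hash a normalized email so the raw address never enters the graph."""
    return hashlib.sha256(normalize(email).encode("utf-8")).hexdigest()


class IdentityGraph:
    """Union-find over (identifier_type, value) nodes."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        """Return the representative node of the group that node belongs to."""
        self.parent.setdefault(node, node)
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[node] != root:
            self.parent[node], node = root, self.parent[node]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def add_event(self, event):
        """Link every identifier that appears together on one event."""
        nodes = [(f, normalize(event[f])) for f in ID_FIELDS if event.get(f)]
        if event.get("email"):
            nodes.append(("email_sha256", hash_email(event["email"])))
        for node in nodes:
            self.find(node)  # register the node even if it links to nothing
        for other in nodes[1:]:
            self.union(nodes[0], other)


graph = IdentityGraph()
graph.add_event({"domain_userid": "c1", "user_id": "Alice "})    # web login
graph.add_event({"apple_idfv": "IDFV-1", "user_id": "alice"})    # mobile login
graph.add_event({"domain_userid": "c2"})                         # anonymous web visit

same_user = graph.find(("domain_userid", "c1")) == graph.find(("apple_idfv", "idfv-1"))
print(same_user)
```

Here the web cookie c1 and the iOS vendor ID end up in the same group because both were seen with the same normalized user_id, while cookie c2 remains on its own.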

What if users share devices (e.g., tablets, public computers)?

In shared-device scenarios, deterministic stitching may misattribute events. Consider:

  • Supplementing with probabilistic models (e.g., IP + UA + time proximity)

  • Excluding shared-device identifiers from mapping logic

  • Logging uncertainty or confidence scores in your mapping graph
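The last two points can be sketched as follows: flag cookies that have been seen with more than one user_id, and record a simple confidence score next to each mapping. The function name, the `shared_device` flag, and the score itself (the share of a cookie's identified events that belong to its most frequent user_id) are assumptions for illustration, not a Snowplow feature.

```python
from collections import Counter, defaultdict


def score_mapping(events, max_users_per_cookie=1):
    """Return one mapping row per cookie with a naive confidence score."""
    users_by_cookie = defaultdict(Counter)  # domain_userid -> Counter of user_id
    for event in events:
        cookie, user = event.get("domain_userid"), event.get("user_id")
        if cookie and user:
            users_by_cookie[cookie][user] += 1

    rows = []
    for cookie, users in users_by_cookie.items():
        top_user, top_count = users.most_common(1)[0]
        rows.append({
            "domain_userid": cookie,
            "user_id": top_user,
            # Share of identified events on this cookie that belong to top_user.
            "confidence": top_count / sum(users.values()),
            # More distinct users than allowed suggests a shared device.
            "shared_device": len(users) > max_users_per_cookie,
        })
    return rows


events = (
    [{"domain_userid": "c1", "user_id": "alice"}] * 3
    + [{"domain_userid": "c1", "user_id": "bob"}]
    + [{"domain_userid": "c2", "user_id": "carol"}] * 2
)

rows = {row["domain_userid"]: row for row in score_mapping(events)}
print(rows["c1"])
print(rows["c2"])

# Keep only cookies that look like single-user devices for deterministic stitching.
trusted = [row for row in rows.values() if not row["shared_device"]]
```

Rows flagged as shared can then be excluded from the mapping table or handled by a separate, more cautious rule.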

Can I include 3rd-party marketing identifiers like GCLID or CID?

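Yes. These can be captured via custom contexts or parsed from the query string of the page URL. For example, the Google Ads gclid can be parsed into a context and joined in your model, and Snowplow's campaign attribution enrichment can populate the mkt_clickid and mkt_network fields from click IDs such as gclid. Capture them early and propagate them across sessions where possible.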
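If you parse click IDs yourself, a minimal sketch using Python's standard library could look like this. The helper `extract_click_ids` and the list of parameter names are assumptions; adjust the list to the ad networks you actually use.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: the click ID parameters you care about.
CLICK_ID_PARAMS = ("gclid", "msclkid", "dclid")


def extract_click_ids(page_url):
    """Return the click IDs found in the query string of a page URL."""
    query = parse_qs(urlsplit(page_url).query)
    # parse_qs returns a list per parameter; keep the first value.
    return {param: query[param][0] for param in CLICK_ID_PARAMS if param in query}


click_ids = extract_click_ids(
    "https://example.com/landing?utm_source=google&gclid=abc123"
)
print(click_ids)
```

The extracted values can be stored alongside domain_userid so that later events from the same cookie can be joined back to the original click.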

Are there resources or tools for advanced identity stitching?

Yes — Snowplow customers and open-source users often leverage tools like:

  • dbt: for modular SQL data modeling

  • Kafka / AWS MSK: for real-time streaming and enrichment

  • Spark / Beam: for large-scale graph-based identity mapping

  • Snowplow Enrich / Loader: for passing through and shaping identifier payloads


Final thoughts

Identity stitching is not a one-size-fits-all operation. Snowplow provides the raw, granular behavioral data and the flexibility needed to implement stitching tailored to your business logic and tech stack.

For the most reliable results:

  • Collect identifiers consistently across platforms

  • Start simple and add complexity only as needed

  • Monitor for data quality and edge cases

To explore more use cases and technical patterns, visit our Knowledge Base or reach out to the Snowplow team.
