How to Approach Identity Stitching with Snowplow Data
Identity stitching is the process of tying fragmented user identifiers together so you can recognize the same person across sessions, devices, and domains. It’s how businesses create a unified customer profile from a pile of cookies, device IDs, email addresses, and phone numbers.
It sounds simple. But in practice, it’s one of the hardest problems in the modern data stack to solve.
When we first wrote this post back in 2020, the conversation was mainly about batch stitching in the warehouse. SQL jobs, daily rebuilds, good enough for reporting. But a lot has changed since then.
Safari and Firefox have both clamped down hard on client-side cookies, while Chrome’s third-party cookie story is still a moving target.
You have users who bounce between logged-in and anonymous states constantly, and they bring more devices with them each year. In fact, the average US household now has 21 connected devices across 13 device type categories, according to Deloitte.
The use cases for identity stitching have changed too. It’s no longer just about tidy analytics reports tomorrow morning. It’s about personalizing the user experience right now. But that requires feeding a current and unified view of the user into an ML model or AI agent mid-session. Or passing resolved user identities into marketing activation tools like Braze while the user is still on your site or using your app.
Batch-based stitching simply can’t keep up with any of that.
So in this post we'll cover two things: First, we'll walk through the original approach to identity stitching with Snowplow data: the SQL-heavy, warehouse-level logic that still holds up for batch use cases, with a few 2026 updates.
Then we'll look at what's changed with the launch of Snowplow Identities, which moves identity resolution out of the warehouse and into the data pipeline itself, in real time. Let’s dive in.
What identity stitching actually involves
Identity stitching is the process of combining various user identifiers into a single, stable ID so you can track a user accurately across their full customer journey.
Done correctly, it allows businesses to create personalized experiences that don’t crumble the moment a customer switches devices.
Without it, you can’t reliably connect a session on mobile to the same person opening an email on desktop. With it, you can personalize experiences based on what this actual human being has done, not what “Device ID 6954” looked at last Tuesday.
The technical pieces of a stitching system typically include:
- A set of event properties that can identify a user (common identifiers like user_id, domain_userid, network_userid, android_idfa, email addresses, phone numbers)
- A model of how these identifiers relate to each other
- A way to apply that mapping to every event, even when some identifiers are missing
The third piece is where it gets business-specific. How you resolve a user depends on your industry, tracking strategy, device mix, and what "the same user" means to you. A streaming service, a news publisher, a retail bank, and a multi-brand retailer will all draw the line differently.
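To make that second piece concrete, here's a minimal sketch of what an identifier mapping table can look like in a warehouse. The table and column names are illustrative, not a Snowplow schema:

-- Illustrative mapping table: one row per raw identifier, tied to the canonical ID you resolve to
CREATE TABLE id_map (
  canonical_user_id VARCHAR,   -- the stable ID you resolve to (e.g. user_id or a CRM ID)
  identifier_type   VARCHAR,   -- 'network_userid', 'android_idfa', 'email', ...
  identifier_value  VARCHAR,   -- the raw identifier value seen on events
  first_seen_tstamp TIMESTAMP, -- when this link was first observed
  last_seen_tstamp  TIMESTAMP  -- useful for ageing out stale links
);

Each row ties one raw identifier to the canonical ID you've chosen; the timestamps give you a way to age out links that no longer hold.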
Batch vs. real-time stitching: picking the right approach
On the surface, your identity stitching implementation looks different depending on whether you're running batches of data in the warehouse or handling events in real time. But most of the difference is logistics, not logic. Tools, cadence, plumbing. The underlying approach is the same either way.
Real-time stitching can serve almost any use case, but it's not always the most cost-effective option. Batch works fine for retrospective analytics and reporting. It falls over the moment you need stitched identity to drive something live, such as a personalized recommendation, the context window for an AI agent, or a triggered message.
If you already have a real-time pipeline for one time-sensitive use case, it's usually worth scaling it to cover others, even the ones that don't strictly need low latency. That way you avoid maintaining two stitching systems doing the same work at different speeds.
Which Snowplow event properties are user identifiers?
Out of the box, with Snowplow's web and mobile tracking, you'll get the following identifiers on every event collected:
- domain_userid, set via a first-party client-side cookie by the Snowplow JavaScript tracker
- network_userid, set server-side by the Snowplow collector as a first-party or third-party cookie
- android_idfa / apple_idfa / apple_idfv, depending on the mobile device
- user_id, a custom identifier you set when tracking an event (often the authenticated email or a CRM ID)
There are other properties you can use to varying degrees, such as refr_domain_userid, user_fingerprint and user_ipaddress. You'll likely have your own custom identifiers alongside user_id too, such as email_address, client_id, or a hashed loyalty ID.
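Before deciding which of these identifiers to lean on, it's worth checking how often each one is actually populated in your data. A quick sketch against the standard atomic.events table (Snowflake-flavored syntax; adjust the date function for your warehouse):

-- COUNT(column) only counts non-null values, so each column shows identifier coverage
SELECT
  COUNT(*)              AS total_events,
  COUNT(domain_userid)  AS events_with_domain_userid,
  COUNT(network_userid) AS events_with_network_userid,
  COUNT(user_id)        AS events_with_user_id
FROM atomic.events
WHERE collector_tstamp > DATEADD(day, -30, CURRENT_TIMESTAMP);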
A 2026 note on client-side cookies. Since this post was first written, Safari has dramatically cut client-side cookie lifetimes and Firefox has followed, while Chrome has been the outlier. If your identity graph leans heavily on domain_userid, you'll want to pair it with a more durable server-side identifier (like network_userid in a first-party context) and, where possible, an authenticated user_id. Don't build a stitching strategy that silently depends on a cookie that might only live seven days.
How to map your user identifiers together
Let's use the same example as the original post: a company with a website and an Android app that wants to identify users across platforms and devices, whether or not they're logged in.
On web, network_userid set as a first-party cookie is usually the most reliable long-term identifier. On mobile, android_idfa (or the Apple equivalents) plays the same role. We'll assume user_id is tracked on both and set to the email used for login.
These identifiers relate to each other in predictable ways:
- The same user_id can be associated with multiple network_userid values (different computers, different browsers, cookie resets)
- The same user_id can be associated with multiple android_idfa values (different phones or tablets)
- The same network_userid can be associated with multiple user_id values (shared devices, or one person with two accounts)
- network_userid and android_idfa have no direct relationship. They need a user_id or similar to bridge them
All the SQL examples below assume data is loaded into a warehouse and you're doing batch-based stitching. The same logic applies in a streaming setup, but with different plumbing, typically a low-latency mapping store like DynamoDB.
Some values are shown as integers where you'd normally see a UUID, for readability.
Multiple network_userids per user_id
What does the data say?
SELECT
user_id,
COUNT(DISTINCT network_userid) AS network_userid_count
FROM atomic.events
WHERE user_id IS NOT NULL
GROUP BY 1
ORDER BY 2;
This query will often reveal that many user_id values have more than one network_userid associated with them.
What does it mean?
One of three things:
- The person associated with a specific user_id is visiting the website from different devices (for example a home and work computer).
- The person is visiting the website in different browsers (for example Chrome and Safari), potentially on multiple devices.
- The person is always on the same device and browser, but the cookies have been reset, forcing a new network_userid to be assigned.
How can you use this information?
If your goal is to identify this user across all their devices and browsers, including when they're not logged in, you can create a mapping table where each of the network_userid values is associated with the same user_id. Whenever you see any of these network_userid values, you'll know it's the user with that specific user_id.
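One minimal way to build that mapping table (called users here, to match the join below) is to take the distinct pairs of identifiers seen together on logged-in events. Snowflake-style syntax shown; this simple version assumes each network_userid maps to one user_id, and the shared-device case is covered later:

-- Every (network_userid, user_id) pair observed on logged-in events
CREATE OR REPLACE TABLE users AS
SELECT DISTINCT
  network_userid,
  user_id
FROM atomic.events
WHERE user_id IS NOT NULL
  AND network_userid IS NOT NULL;

With that table in place, you can stamp the resolved user_id onto every event: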
SELECT
  a.event_id,
  b.user_id
FROM atomic.events AS a
JOIN users AS b
  ON a.network_userid = b.network_userid;
You'll often have events where you only have the network_userid for a user but not their user_id, because those events happened before the user logged in. The reverse isn't true. Events generated by a logged-in user should always have both fields populated. For that reason, the reverse lookup (user_id to network_userid) isn't needed for identity stitching.
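One practical variant: the inner join above only returns events whose network_userid appears in the mapping table, which drops visitors who have never logged in. If you want to keep every event, a LEFT JOIN with a COALESCE fallback to the event's own user_id is a common pattern, sketched here:

SELECT
  a.event_id,
  -- Fall back to whatever user_id the event itself carries (often NULL for anonymous traffic)
  COALESCE(b.user_id, a.user_id) AS stitched_user_id
FROM atomic.events AS a
LEFT JOIN users AS b
  ON a.network_userid = b.network_userid;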
The other cases, in brief
The same mapping pattern extends in predictable ways. The same network_userid can be associated with multiple user_id values (shared devices, household accounts).
Mobile identifiers like android_idfa play the same role on mobile that network_userid plays on web.
Each case needs its own mapping table, its own SQL, and a judgment call about how to handle ambiguity. Do you ignore shared-device users? Always attribute to the latest seen user_id? Pick the most frequent one?
There are no wrong answers, only tradeoffs. And every tradeoff has a cost: some events get attributed to the wrong user, some users drop out of your analysis, some logic gets harder to maintain as your identifier set grows.
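As one concrete example of those judgment calls, here's a sketch of the "latest seen user_id" rule for shared devices, using a window function over the standard atomic.events columns. Treat it as one option among several, not the correct answer:

-- For each network_userid, keep the user_id seen most recently
SELECT
  network_userid,
  user_id AS resolved_user_id
FROM (
  SELECT
    network_userid,
    user_id,
    ROW_NUMBER() OVER (
      PARTITION BY network_userid
      ORDER BY collector_tstamp DESC
    ) AS recency_rank
  FROM atomic.events
  WHERE user_id IS NOT NULL
    AND network_userid IS NOT NULL
) AS ranked
WHERE recency_rank = 1;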
It's also worth noting: in batch, your mapping table and your atomic tables can be rebuilt on different cadences. In a real-time setup, you have to do two things at once for every event: look up the mapping store to identify the user, and update it with what you just learned. That's where the warehouse approach starts to creak. Recursive SQL to walk an identity graph gets expensive at scale, lookback windows fight against shrinking cookie lifetimes, and maintenance gets heavier every time your identifier logic changes.
What changes when identity resolution moves into the pipeline
The SQL approaches above work. They've worked for years, and plenty of data teams still use them. But all of them share the same architectural limit: they happen in the warehouse, after the fact. The stitched user ID only exists after the next model run.
That's fine if your use case can wait for tomorrow. It's a problem if it can't.
In April 2026 we launched Snowplow Identities to solve exactly this.
Identities resolves user identifiers deterministically and in real time, inside the Snowplow pipeline, before events land in your warehouse. It uses a graph-based identity model, not probabilistic matching, and outputs a stable snowplow_id on every enriched event.
Here's how it changes the picture:
- Pipeline-level resolution. Your events arrive in the warehouse already stitched. You don't need a nightly job to reconcile IDs. You don't need a reverse ETL tool to push resolved profiles back out to activation platforms.
- Deterministic, not probabilistic. Every merge is triggered by a specific event linking two identifiers. Every merge is auditable.
- Stable Snowplow ID across the full customer journey. The ID follows a user from their very first anonymous interaction. The moment they log in or convert, their anonymous history is merged into the known identity. No orphaned pre-authentication behavior.
- You still own the graph. The identity data sits in your own cloud, in a managed Postgres database deployed alongside your pipeline. It's not in a vendor's black box. It's not on a third-party server.
- Handles the edge cases DIY SQL struggles with. Shared devices, cross-domain tracking, multiple users on one device, unique identifiers to prevent incorrect merges. The graph model handles them natively rather than with fragile SQL heuristics.
And because the unified Snowplow ID is attached to every event in the pipeline, it feeds directly into everything downstream. Your warehouse. Snowplow Profiles. Snowplow Event forwarding to activation tools. Essentially, any AI agent or decisioning system that needs identity-aware context. In other words, identity is resolved once, at the source, rather than rebuilt from scratch by every tool that needs it.
When DIY stitching still makes sense
Of course, not every use case needs real-time stitching. If your use cases are all retrospective, such as monthly attribution reports, quarterly LTV analysis, or batch audience exports, warehouse-level stitching with the SQL patterns above is perfectly reasonable.
You own the logic, it runs on infrastructure you already pay for, and you're not adding another system to your stack.
The honest test is this: does anything in your business need a stitched view of the user inside the same session they're currently in? Think personalization and recommendation engines, AI agents, mid-session activation, fraud scoring. Those are all real-time use cases. If you have any of them, warehouse stitching will always be a step behind. That's the gap Snowplow Identities closes.
Take ownership of your data
It’s important to note that there’s no single right way to do identity stitching. The right approach really does depend on your use case, your latency requirements, and how much of the heavy lifting you want to do yourself.
Since Snowplow gives you full ownership and control of the real-time data stream and the data in the warehouse, you can run whichever stitching strategy fits: batch, real-time, or both at once.
Whichever path you take, identity stitching handles personal data. That means GDPR, CCPA, and the growing patchwork of US state privacy laws all apply.
At Snowplow, our aim is to give you more control over how personal user data is handled, whether that's pseudonymization at collection or deletion support at the identity graph level.
If you'd like to see the pipeline-level approach in action, you can read the Snowplow Identities launch post or explore the product page.