Real-Time vs Batch Identity Resolution for Data Teams
TL;DR
Identity resolution is the process of tying every identifier a user generates — cookies, device IDs, hashed email addresses, phone numbers, login IDs — into one unified customer profile. Batch identity resolution does this on a schedule within your data warehouse, usually overnight. In contrast, real-time identity resolution does it in the data pipeline, in real time, as events arrive. That choice changes when and what your personalization, AI and data agents, and activation systems can do with the result.
Key takeaways
- Batch identity resolution runs on a delay. The graph in your warehouse this morning reflects who your customers were yesterday, not who's on your site right now.
- Real-time identity resolution runs at event time. By the time an event lands in storage, it already carries a resolved user ID and a stitched profile reference.
- The choice between the two defines which downstream use cases are possible. Real-time personalization, agentic AI, and live activation all need a resolved profile in the moment.
- Batch is fine for retrospective reporting, but it falls apart the second you need stitched identity to drive something live. That now includes data agents like Snowflake Intelligence, which read from the same warehouse and inherit the same lag.
- Most teams end up running both. The reason is usually historical, not architectural. Batch came first, and the time-sensitive workload has not yet moved into the pipeline.

Why timing is the real question
Most posts about identity resolution stop at “what it is” and “deterministic vs probabilistic.” These are useful framings, and we’ve covered them in our guide to identity graphs and our deeper walkthrough of identity stitching with Snowplow data.
Once you’ve decided you need an identity graph, the next question is when you stitch. That single decision, overnight in the warehouse or in real time in the pipeline, determines whether your personalization, your AI and data agents, and your activation tools work in the moment or arrive a day late. It also quietly sets the ceiling on what your data team can deliver to the rest of the business.
This post unpacks the practical differences. What each approach actually does. Where the gap shows up in a concrete signup flow. Which use cases break under batch. And how to tell whether your current setup is holding you back without anyone noticing it’s the identity layer that’s at fault.
Batch and real-time identity resolution can use the same underlying logic. Both stitch user identifiers using verifiable signals like logged-in IDs, hashed emails, and device identifiers. Implementations vary. Some are fully deterministic. Others lean on probabilistic matching to fill gaps. The graph you build looks much the same whether you build it in the warehouse or in the pipeline.
What differs is when the merge happens, and what your downstream systems see in the meantime.
That gap is where most identity resolution problems live.
What batch identity resolution actually does
Batch identity resolution runs on a schedule. You’ll typically run it overnight, sometimes hourly, and occasionally every fifteen minutes. The job reads raw events from your storage, applies stitching logic (usually recursive SQL via dbt models), and writes a refreshed identifier mapping table back to your data warehouse.
When the job finishes, your unified profiles reflect everything that happened up to the cutoff. Anything that has happened since sits unstitched in raw form, waiting for the next run.
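The batch job described above can be sketched in a few lines. This is a minimal illustration in Python rather than the recursive SQL a dbt model would use, and the event shape (`ts`, `ids`) and the function name `batch_stitch` are assumptions made for the example, not part of any Snowplow package. It uses a union-find structure to group identifiers that appear together on the same event, and it ignores anything after the cutoff.

```python
from datetime import datetime


def batch_stitch(events, cutoff):
    """Rebuild the identifier mapping from every event at or before `cutoff`."""
    parent = {}  # identifier -> parent identifier (union-find forest)

    def find(identifier):
        parent.setdefault(identifier, identifier)
        while parent[identifier] != identifier:
            parent[identifier] = parent[parent[identifier]]  # path halving
            identifier = parent[identifier]
        return identifier

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller string as root so reruns give the same output.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep

    for event in events:
        if event["ts"] > cutoff:
            continue  # arrived after the cutoff; waits for the next run
        ids = [f"{kind}:{value}" for kind, value in event["ids"].items() if value]
        for identifier in ids:
            find(identifier)  # register identifiers that appear alone
        for other in ids[1:]:
            union(ids[0], other)

    return {identifier: find(identifier) for identifier in list(parent)}


# Hypothetical events: two before the overnight cutoff, one after it.
events = [
    {"ts": datetime(2024, 1, 1, 10, 0), "ids": {"domain_userid": "c1"}},
    {"ts": datetime(2024, 1, 1, 10, 5), "ids": {"domain_userid": "c1", "email_hash": "h1"}},
    {"ts": datetime(2024, 1, 2, 9, 0), "ids": {"domain_userid": "c2", "email_hash": "h1"}},
]
mapping = batch_stitch(events, cutoff=datetime(2024, 1, 2, 0, 0))
```

In this run the cookie `c1` and the hashed email `h1` resolve to the same profile, while cookie `c2` is absent from the mapping until the next scheduled run picks up the third event.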
This approach works for a specific kind of question, such as: “How many unique users converted last week, across all devices?” Batch identity resolution usually answers that cleanly. Your warehouse will rebuild the graph, your dashboards will refresh, and reports will go out.
Batch identity resolution also works if your activation cadence is daily. So if your email platform pulls a fresh audience export every morning, batch-stitched segments are good enough. In this case, your user data is twelve hours old, but the email is going out tomorrow anyway.
Where batch falls apart is when you need a resolved profile for something live.
What real-time identity resolution actually does
Real-time identity resolution runs in your data pipeline, typically between the event collector and the warehouse. Every event passes through the resolution layer before it hits long-term storage. So by the time it arrives in your warehouse, it already carries a stable user ID and a stitched profile reference.
With this approach, your graph updates incrementally. When a new identifier appears, it links to the right user node. When two existing nodes turn out to belong to the same person, a merge event fires and downstream systems can react.
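As a rough sketch of that incremental behavior, the toy class below updates an in-memory graph one event at a time and calls a callback whenever two existing profiles turn out to be the same person. The class name, the `resolve` method, the merge payload keys, and the rule that the first profile found on the event survives are all assumptions chosen for illustration; a production resolver would persist state and apply its own survivorship rules.

```python
class IdentityGraph:
    """Toy in-memory identity graph that is updated one event at a time."""

    def __init__(self, on_merge=None):
        self._parent = {}  # identifier -> parent identifier
        self._on_merge = on_merge or (lambda merge: None)

    def _find(self, identifier):
        while self._parent[identifier] != identifier:
            identifier = self._parent[identifier]
        return identifier

    def resolve(self, event_id, identifiers):
        """Link the identifiers seen on one event and return the profile ID."""
        if not identifiers:
            raise ValueError("an event needs at least one identifier")

        # Collect the distinct profiles these identifiers already belong to.
        roots = []
        for identifier in identifiers:
            if identifier in self._parent:
                root = self._find(identifier)
                if root not in roots:
                    roots.append(root)

        # Assumption: the first known profile survives; otherwise start a new one.
        survivor = roots[0] if roots else identifiers[0]
        self._parent.setdefault(survivor, survivor)

        # New identifiers attach to the surviving profile.
        for identifier in identifiers:
            self._parent.setdefault(identifier, survivor)

        # Any other existing profile is merged in, and subscribers are told.
        for absorbed in roots[1:]:
            self._parent[absorbed] = survivor
            self._on_merge(
                {"survivor": survivor, "absorbed": absorbed, "trigger_event": event_id}
            )
        return survivor


merges = []
graph = IdentityGraph(on_merge=merges.append)
graph.resolve("e1", ["cookie:c1"])               # new anonymous profile
graph.resolve("e2", ["cookie:c2"])               # second device, still separate
graph.resolve("e3", ["cookie:c1", "email:h1"])   # new identifier links to c1
resolved = graph.resolve("e4", ["cookie:c2", "email:h1"])  # two profiles merge
```

After the fourth event, `merges` holds a single record naming the surviving profile, the absorbed one, and `e4` as the trigger, and any later event carrying `cookie:c1` resolves to the merged profile.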
This might sound like a small change. In practice, it completely reshapes what you can build.
A recommendation engine reading a user profile mid-session sees the same person it would have seen at the end of the day if it depended on batch. A personalization tool can change a web page based on what a user did in the last fifteen seconds, not what their cookie ID did last week. An AI agent has a 360-degree view of the customer the moment it’s asked to act.
Both approaches have the same data points and identifiers. The difference is the resolution time.
A scenario where the difference shows up
To bring this difference to life, let’s explore a hypothetical example of a meal-kit signup funnel. This is a fairly typical pattern across retail, streaming, and direct-to-consumer brands.
First, a new visitor lands on the meal-kit homepage. They click through a quiz, pick their favorite meal types, set a delivery frequency, and choose a starting plan. So far, none of this involves a login. The whole journey is anonymous, tied to a single domain_userid cookie.
At the bottom of the funnel, they eventually create an account. They do so by submitting their email address, phone number, and shipping details. That’s the first time the funnel sees a verifiable identifier.
If you have a batch identity resolution stack, here’s what happens.
The anonymous quiz events sit in the raw event table with a domain_userid and no user_id. The account-creation event arrives later in the same session with both a domain_userid and an email. Both are written to storage. The stitching job runs overnight, links the cookie to the new user, and tomorrow's warehouse query shows one unified customer profile.
The problem is that your personalization engine on the welcome page has nothing to work with in the moment. The agent that’s supposed to send a tailored onboarding message doesn’t know what meals the customer picked. The activation tool pushing a "welcome" audience to Braze doesn't yet recognize this person as the same one who completed the quiz.
If you have a real-time identity resolution stack, here's what happens.
The merge fires the moment the account is created. The event carrying the email and the domain_userid triggers the graph to link the two. The next event from the same session — the welcome page view — carries the resolved user ID, and every downstream system sees a complete picture.
The data sources and identifiers are identical. The resolution time differs. The outcome differs with it.
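The signup scenario can be replayed as a small simulation. The sketch below assumes a simplified event shape with integer timestamps and hypothetical field names (`type`, `meal`), and asks one question at the moment the welcome page loads: which quiz answers can be reached for the new user through the cookie-to-user mapping? The only thing that changes between the two calls is how recent the mapping is.

```python
def build_mapping(events, as_of):
    """Cookie -> user_id links visible from events with ts <= as_of."""
    return {
        e["domain_userid"]: e["user_id"]
        for e in events
        if e["ts"] <= as_of and e.get("user_id")
    }


def quiz_answers_for(user_id, events, mapping):
    """Quiz answers reachable for user_id through the cookies mapped to it."""
    cookies = {cookie for cookie, user in mapping.items() if user == user_id}
    return [
        e["meal"]
        for e in events
        if e["type"] == "quiz_answer" and e["domain_userid"] in cookies
    ]


# Hypothetical session: timestamps are minutes since last night's batch run.
events = [
    {"ts": 100, "type": "quiz_answer", "domain_userid": "c1", "meal": "vegetarian"},
    {"ts": 101, "type": "quiz_answer", "domain_userid": "c1", "meal": "pescatarian"},
    {"ts": 102, "type": "account_created", "domain_userid": "c1", "user_id": "u1"},
    {"ts": 103, "type": "page_view", "domain_userid": "c1", "page": "welcome"},
]

LAST_BATCH_RUN = 0       # mapping as it stood after the overnight job
WELCOME_PAGE_VIEW = 103  # mapping as it stands when the welcome page loads

batch_view = quiz_answers_for("u1", events, build_mapping(events, LAST_BATCH_RUN))
realtime_view = quiz_answers_for("u1", events, build_mapping(events, WELCOME_PAGE_VIEW))
```

With the overnight mapping, `batch_view` is empty because the cookie has not yet been linked to `u1`. With the mapping updated at event time, `realtime_view` contains both meal choices from the quiz.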

Where the difference matters most
Three use cases sit at the front of every Snowplow customer conversation about identity resolution. Each one stops working without real time.
Real-time personalization
A recommendation engine fed by your data warehouse only knows what a user did before last night’s batch run. Anything they’ve done today is invisible to it until tomorrow.
With pipeline-level resolution, the recommendation engine works from the full picture, including the meals the user just picked, the page they’re on now, and the device they switched to mid-session. Cross-device personalization stops being a roadmap item and starts showing up in the live experience.
Agentic AI
Customer-facing agents, internal copilots, orchestration agents, and data agents like Snowflake Intelligence all act on customer context. If that context is yesterday's, the agent will confidently use stale information. McKinsey estimates AI agents will create $2.6 to $4.4 trillion in annual value, but only when they have reliable context to act on. Real-time identity resolution is the layer that gives agents a stable view of who they're talking to. Solutions like Snowplow Signals sit on top of this to feed agents directly.
Audience activation
Activation tools work from audiences that are only as fresh as the upstream identity mapping. A "high-intent abandoners" segment built from yesterday's stitched data will miss everyone who became high-intent in the last twelve hours. How identity resolution feeds your activation pipelines and event forwarding flows determines whether customer engagement responds to what someone did this morning or to what they did last week.
For these three use cases, batch is always behind. The personalization decision has already been made, the agent has already answered, the audience has already shipped.
When batch identity resolution is the right answer
So when should you use batch identity resolution? Retrospective attribution. Cohort analysis. Lifetime value modeling. Anything that asks who customers were over a period that has already closed, rather than who they are right now.
Batch identity resolution is also cheaper to run for use cases that don’t need the freshness. Spinning up real-time infrastructure for a dashboard that refreshes once a day is overkill.
The trap is using batch for use cases that need real time. For instance, a "personalized" homepage that loads its context from a daily-refreshed table is technically personalized. It's also a worse experience than one that responds to what someone did in the last session. Customers can tell the difference even if they can't name it.
If your team is running batch identity resolution today and you’re being asked to power personalization, agents, or activation, the gap shows up in specific ways. A welcome page that recommends meals the user just rejected in the funnel. A data agent that answers a question with yesterday’s customer count. A campaign that ships the same person into the same audience twice because the merge hasn’t happened yet.

How Snowplow Identities does real-time identity resolution
Snowplow Identities resolves identity in the data pipeline, before events reach the warehouse. It runs as a managed service. Private Managed Cloud customers can deploy it inside their own VPC, so first-party data never leaves their infrastructure. Either way, the resolved output lands in your warehouse. It builds the identity graph using deterministic matching against verifiable identifiers and supports custom identifier priorities so you can model how your business defines "the same user."
When a merge happens, Identities emits a merge event you can subscribe to. That event carries the IDs involved and the triggering event, so the merge is auditable. Data governance and data engineering teams can explain every link in the graph back to the event that caused it.
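To show how a subscriber might use such an event for auditing, here is a hedged sketch. The `MergeEvent` fields and the `MergeAuditLog` class are hypothetical names invented for this example; they are not the Snowplow Identities schema, which is described in the product documentation. The idea is simply to index each merge by every profile it touched, so any link can be traced back to the event that caused it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MergeEvent:
    # Hypothetical field names for illustration only.
    surviving_id: str
    merged_ids: tuple
    trigger_event_id: str


class MergeAuditLog:
    """Keeps every merge indexed by the profile IDs it involved."""

    def __init__(self):
        self._by_profile = defaultdict(list)

    def record(self, merge):
        for profile_id in (merge.surviving_id, *merge.merged_ids):
            self._by_profile[profile_id].append(merge)

    def explain(self, profile_id):
        """Return the triggering event IDs behind every merge for a profile."""
        return [m.trigger_event_id for m in self._by_profile.get(profile_id, [])]


log = MergeAuditLog()
log.record(MergeEvent("p1", ("p2",), "evt-9"))
log.record(MergeEvent("p1", ("p3",), "evt-12"))
```

Calling `log.explain("p1")` lists both triggering events in the order they were recorded, while `log.explain("p2")` returns only the one merge that absorbed that profile.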
The resolved output lands in your atomic events table, where it can be modeled using a set of dbt models (snowplow_id_mapping, snowplow_id_mapping_scd, identifier_mapping, and related tables). Those models plug into the warehouse-native queries your team already runs, so identity resolution powers your existing warehouse use cases and lets you build a true customer 360 without the headache of complex stitching SQL.
If you want to go further
The Snowplow Identities documentation covers the architecture, the dbt models, and where Identities sits in the pipeline. If you'd rather talk it through with someone on the team, please contact us here and we’ll be in touch.
Frequently asked questions
Does real-time identity resolution replace batch identity resolution?
Not necessarily. Most teams run real-time for live use cases and keep batch processes for retrospective reporting. The two can share the same underlying identity graphs and the same warehouse output.
How do customer data platforms handle identity resolution?
Most customer data platforms run identity resolution inside their own platform and serve unified profiles back through their APIs. That works when marketing owns the workflow. For companies that want a unified foundation and view of the customer, with the graph in their own warehouse and full control over the merge logic, pipeline-level resolution is usually a better fit.
What's the difference between deterministic matching and probabilistic matching here?
Deterministic matching uses identifiers you can verify, such as a logged-in user ID or a hashed email address. Probabilistic matching infers a link from signals like IP address, device ID, and fingerprinting data. Real-time and batch can both use either approach. Snowplow Identities only uses deterministic matching.
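The distinction can be made concrete with a small sketch. Everything here is illustrative: the key names, the integer weights, and the threshold are arbitrary assumptions, not values used by any product. Deterministic matching links two records only when a verified identifier is equal on both; probabilistic matching adds up points for weaker signals and links when the total clears a threshold.

```python
# Assumed identifier names; verified identifiers drive deterministic matching.
VERIFIED_KEYS = ("user_id", "email_hash")

# Assumed weights for weaker signals; the numbers are arbitrary.
WEAK_SIGNAL_POINTS = {"ip": 2, "device_id": 2, "fingerprint": 1}


def _same(a, b, key):
    """True when both records carry the same non-empty value for key."""
    return a.get(key) is not None and a.get(key) == b.get(key)


def deterministic_match(a, b):
    """Link only when at least one verified identifier is identical."""
    return any(_same(a, b, key) for key in VERIFIED_KEYS)


def probabilistic_score(a, b):
    """Sum the points for every weak signal the two records share."""
    return sum(points for key, points in WEAK_SIGNAL_POINTS.items() if _same(a, b, key))


def probabilistic_match(a, b, threshold=4):
    """Link when shared weak signals add up to the threshold or more."""
    return probabilistic_score(a, b) >= threshold


laptop = {"email_hash": "h1", "ip": "203.0.113.7", "device_id": "d1"}
phone = {"email_hash": "h1", "ip": "198.51.100.4", "device_id": "d2"}
guest = {"ip": "203.0.113.7", "device_id": "d1", "fingerprint": "f9"}
```

Here `laptop` and `phone` match deterministically through the shared hashed email even though no weak signal agrees. `laptop` and `guest` share no verified identifier, so only the probabilistic rule links them, and that link is an inference rather than a verified fact.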
How fresh is "real time" in practice?
For Snowplow Identities, the merge typically completes inside the same event-processing window. End-to-end latency from event arrival to resolved output usually sits in low single-digit seconds. The exact figure depends on your pipeline configuration and downstream destinations.
Can I move from batch to real time without rebuilding everything?
Yes, in most cases. Because the warehouse output schema stays familiar, the dbt models and analytics queries you already run against batch-stitched data work against real-time output too. Most of the change is upstream of the warehouse, not downstream.