What is an Identity Graph?

Adam Roche

May 11, 2026

TL;DR

An identity graph is a data structure that connects every identifier a single person generates (browser cookies, hashed emails, device IDs, login IDs, IP addresses) into one unified customer profile. Identifiers are stored as nodes, the relationships between them as edges. The graph updates as new events arrive, so the same person stays recognizable across sessions, devices, and channels.

Key takeaways

An identity graph is a graph-style data model, not a single row of identifiers per customer. The relationships between identifiers are what make it useful.
Two matching approaches exist: deterministic (verified identifiers like a logged-in user_id) and probabilistic (statistical guesses based on signals like IP and device fingerprint).
The hard part is handling shared devices, cross-domain navigation, and merge conflicts without silently corrupting your data.
You have three options: build the graph yourself in SQL, buy a CDP black box, or resolve identity in the data pipeline before events hit your warehouse.
Real-time pipeline-level resolution is the only approach that supports use cases like real-time personalization and agentic AI.

What is an identity graph?

You have a prospect. They click an ad on their phone. They then research your product on a laptop the next day, and convert on a tablet a week later. Three devices. One customer journey. Three different rows in your warehouse.

This is an extremely common scenario that companies have to contend with. According to recent cross-device analytics research, around 67% of consumers move sequentially between devices when shopping online. Multi-device journeys now account for as much as 65% of all online purchases. This multi-device behavior is very much the norm, and any system that treats each device as a separate visitor is likely to be producing fiction.

This is the problem an identity graph is designed to fix. An identity graph is a data structure that connects every identifier a person generates into a unified profile you can trust.

In this post, we’ll explore what an identity graph is, how it works, where deterministic and probabilistic matching fit, the edge cases that break naive implementations, and what to look for when you’re building an identity graph or buying one.

What is an identity graph, exactly?

An identity graph is a data structure that stores customer identifiers as nodes and the relationships between them as edges. Each user is the central node a cluster of identifiers connects to: a domain_userid from a browser cookie, a hashed email address from a form fill, a user_id from your auth system, a device ID from the mobile app, an IP address from a server-side event.

The structure can sit on top of different storage technologies. Some implementations use a dedicated graph database. Others, including Snowplow Identities, use a managed Postgres database with graph-style data modeling on top. Either way, what matters is the model, not the underlying storage.

What separates this from a simpler approach is the relational layer. A single-row-per-customer table flattens the picture and struggles with many-to-many relationships, merges over time, and conflict resolution when shared devices and household accounts blur identity. A graph model captures all of that.

Most articles you read on this topic tend to describe an identity graph as a “database of identifiers.” This isn’t wrong, but it misses what makes the structure useful. The point of a graph is the edges. They tell you which identifiers belong together, when the connection was made, and whether it’s safe to merge two profiles into one. Without that relational layer, you’ve just got a list.
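
To make the node-and-edge model concrete, here is a minimal Python sketch. The class and method names (Identifier, Edge, IdentityGraph, link, users_for) are hypothetical and chosen for illustration only; this is not how Snowplow Identities or any particular product stores its graph. The point is that each edge records which identifier belongs to which user node and when the link was made.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Identifier:
    """A node for one identifier, e.g. kind="domain_userid", value="d-123"."""
    kind: str
    value: str


@dataclass
class Edge:
    """Links an identifier to a user node and records when the link was made."""
    identifier: Identifier
    user_node: str
    linked_at: datetime


@dataclass
class IdentityGraph:
    edges: list = field(default_factory=list)

    def link(self, identifier, user_node, when):
        self.edges.append(Edge(identifier, user_node, when))

    def identifiers_for(self, user_node):
        return {e.identifier for e in self.edges if e.user_node == user_node}

    def users_for(self, identifier):
        # More than one result means the identifier is shared (many-to-many).
        return {e.user_node for e in self.edges if e.identifier == identifier}


# Usage with made-up identifier values.
graph = IdentityGraph()
now = datetime(2026, 5, 11, tzinfo=timezone.utc)
cookie = Identifier("domain_userid", "d-123")
login = Identifier("user_id", "u-42")
graph.link(cookie, "sp_001", now)
graph.link(login, "sp_001", now)
```

A single-row-per-customer table has no natural place for linked_at or for an identifier that points at two user nodes. In this sketch both fall out of the edge list.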


How does an identity graph work?

The mechanics of an identity graph are pretty easy to describe, but hard to get right at high event volumes.

To explain, let’s assume you have an event that arrives carrying one or more identifiers. An identity graph will check whether any of those identifiers are already attached to a user node. If they are, the new identifier links to that node. If they aren’t, a new user node is created and identifiers attach to it.

Where it gets interesting is when an event carries identifiers that belong to two different existing user nodes. That’s your signal that what your system thought were two separate customers are actually one. As a result, the graph records a merge.

Here's a more concrete example.

A user browses your site anonymously. Their domain_userid cookie attaches to a fresh user node, call it sp_001. During that visit they enter their email in a form, so a user_id also attaches to sp_001.

Later they install your mobile app and use it logged out. The apple_idfv from their phone attaches to a separate user node, sp_002.

Then they log into the app. The next event carries both the apple_idfv and a user_id that matches the email they entered on your website last week.

The graph sees that user_id already lives on sp_001. It merges sp_002 into sp_001, and from that point on every event from either device resolves to the same user.
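
The lookup, link, create, and merge steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the Snowplow Identities implementation: the Resolver class and its fields are hypothetical, the oldest node is assumed to survive a merge, and the replay assumes the website visit captured a user_id, as the walkthrough implies. It also ignores the shared-device case covered later in this post.

```python
from itertools import count


class Resolver:
    """Toy deterministic resolver: any shared identifier implies the same user."""

    def __init__(self):
        self.owner = {}        # (kind, value) -> user node id
        self.merged_into = {}  # retired node id -> surviving node id
        self._ids = count(1)

    def _canonical(self, node):
        # Follow merge records until reaching a surviving node.
        while node in self.merged_into:
            node = self.merged_into[node]
        return node

    def resolve(self, identifiers):
        """identifiers: dict such as {"domain_userid": "d-1", "user_id": "u-42"}."""
        keys = list(identifiers.items())
        matches = sorted(
            {self._canonical(self.owner[k]) for k in keys if k in self.owner}
        )
        if not matches:
            # No identifier is known yet: create a new user node.
            survivor = f"sp_{next(self._ids):03d}"
        else:
            # Zero-padded ids sort by creation order, so the oldest node survives.
            survivor = matches[0]
            for other in matches[1:]:
                self.merged_into[other] = survivor  # record the merge
        for k in keys:
            self.owner[k] = survivor  # link every identifier on the event
        return survivor


# Replay of the walkthrough with made-up identifier values.
r = Resolver()
web = r.resolve({"domain_userid": "d-1"})
form = r.resolve({"domain_userid": "d-1", "user_id": "u-42"})
app = r.resolve({"apple_idfv": "idfv-9"})
login = r.resolve({"apple_idfv": "idfv-9", "user_id": "u-42"})
later = r.resolve({"apple_idfv": "idfv-9"})
```

Keeping merged_into as an explicit record, rather than rewriting history in place, is one way to keep merges auditable after the fact.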

This technical walkthrough lives in the Snowplow Identities documentation if you want the full sequence.

That merge is the moment a fragmented customer journey becomes one. It's also the moment a poorly designed identity graph silently corrupts your data.

Deterministic vs probabilistic matching: what's the difference?

There are two ways to decide whether two identifiers belong to the same person.

Deterministic matching uses identifiers you can verify with certainty, such as a logged-in user_id, a hashed email address, or a loyalty number. If two records carry the same deterministic identifier, the same human is behind them. Confidence is effectively 100%.

Probabilistic matching uses signals that suggest a connection without proving it. Same IP address, same device fingerprint, similar browsing times, overlapping content patterns. A model assigns a confidence score and links the records if the score is high enough.
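
The two approaches can be contrasted in a short Python sketch. Everything here is illustrative: the signal names, the weights, and the threshold of 80 are made-up assumptions, and a real probabilistic model would be trained and validated, not hand-weighted. Scores are integer points out of 100 to keep the arithmetic exact.

```python
# Hypothetical weights in points out of 100. Illustrative only.
WEIGHTS = {
    "same_ip": 35,
    "same_device_fingerprint": 45,
    "similar_browsing_times": 10,
    "overlapping_content": 10,
}

DETERMINISTIC_KEYS = ("user_id", "hashed_email", "loyalty_number")


def deterministic_match(a, b):
    """True when both records share a verified identifier."""
    return any(
        a.get(key) is not None and a.get(key) == b.get(key)
        for key in DETERMINISTIC_KEYS
    )


def probabilistic_score(signals):
    """Sum the weights of the signals that fired (0 to 100)."""
    return sum(weight for name, weight in WEIGHTS.items() if signals.get(name))


def should_link(a, b, signals, threshold=80):
    """Return (link?, confidence). Deterministic evidence always wins."""
    if deterministic_match(a, b):
        return True, 100
    score = probabilistic_score(signals)
    return score >= threshold, score


# Usage with made-up records and signals.
exact = should_link({"user_id": "u-42"}, {"user_id": "u-42"}, {})
strong = should_link({}, {}, {"same_ip": True, "same_device_fingerprint": True})
weak = should_link({}, {}, {"same_ip": True})
```

The deterministic branch can point to the identifier both records share. The probabilistic branch can only report a score.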

Probabilistic matching has its place. It can fill gaps when deterministic identifiers are missing. But it has costs, and it should be treated as a complement to deterministic identity resolution rather than a substitute.

It’s less precise, requires ongoing validation, and produces merges that are hard to explain after the fact. When an attribution model or an AI agent asks why two records ended up linked to the same user, "the model thought they were the same with 87% probability" is not an answer it can act on.

As third-party cookies become less reliable and first-party data remains a priority for B2B marketers, deterministic identifiers that customers actually provide (logins, hashed emails, app IDs) become the spine of any graph worth using.

A note on the cookie timeline. Google reversed its planned third-party-cookie deprecation in 2024 and later confirmed it would not force a full phase-out in Chrome. That doesn’t make probabilistic matching more reliable. Cookies were always a poor proxy for users, and Safari and Firefox have long applied stronger third-party-cookie restrictions than Chrome.

What happens when devices are shared?

Most identity graphs handle the easy cases. But they tend to struggle with the edge cases where customers interact with your product in ways your data model didn’t predict.

The obvious one is shared devices. Alice logs into your mobile app on her phone. Her apple_idfv and user_id link to the same user node. She hands the phone to her partner, Bob, who logs in with his own credentials. Now you have a single apple_idfv linked to two different user_id values.

As a result, a naive graph either merges Alice and Bob into one profile (disaster!) or keeps creating new nodes in every session (analytics chaos). In contrast, a well-designed graph treats user_id as a unique identifier: a node can hold only one user_id value, so two different values never trigger a merge. When Bob logs in, the graph creates a new node for him and shares the device identifier between the two profiles.

When the device sees an anonymous session later, the graph can't attribute it to either user with confidence. So it creates an anonymous node and waits for a login to disambiguate.
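
The shared-device rule can be sketched as follows. This is a simplified illustration, not a description of any product's merge logic: the SharedDeviceGraph class is hypothetical, it tracks a single device identifier per event, and it leaves open what happens to a pending anonymous node after an already-known user logs in.

```python
from itertools import count


class SharedDeviceGraph:
    """Toy graph where user_id is unique per node and devices can be shared."""

    def __init__(self):
        self.user_id_of = {}    # node id -> user_id, or None while anonymous
        self.device_nodes = {}  # device id -> list of node ids seen on it
        self._ids = count(1)

    def _new_node(self, device, user_id=None):
        node = f"sp_{next(self._ids):03d}"
        self.user_id_of[node] = user_id
        self.device_nodes.setdefault(device, []).append(node)
        return node

    def resolve(self, device, user_id=None):
        nodes = self.device_nodes.get(device, [])
        anonymous = [n for n in nodes if self.user_id_of[n] is None]
        if user_id is None:
            if len(nodes) == 1:
                return nodes[0]            # only one candidate: safe to attribute
            if anonymous:
                return anonymous[0]        # reuse the pending anonymous node
            return self._new_node(device)  # shared or unseen device: stay anonymous
        # Logged-in event: find the node that already owns this user_id.
        for node, owner in self.user_id_of.items():
            if owner == user_id:
                if node not in nodes:
                    self.device_nodes.setdefault(device, []).append(node)
                return node
        if anonymous:
            node = anonymous[0]            # first login claims the anonymous node
            self.user_id_of[node] = user_id
            return node
        # A different user_id on a known device never merges: create a new node.
        return self._new_node(device, user_id)


# Alice and Bob share one phone (made-up identifier values).
g = SharedDeviceGraph()
alice = g.resolve("idfv-9", "alice")
bob = g.resolve("idfv-9", "bob")
anon = g.resolve("idfv-9")
anon_again = g.resolve("idfv-9")
alice_again = g.resolve("idfv-9", "alice")
```

The device ends up linked to three nodes, Alice, Bob, and one pending anonymous node, and repeated anonymous sessions reuse that single node instead of creating a new one each time.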

This is what separates identity graphs that hold up under real customer data from ones that look great in a vendor demo and break the first time a household shares a tablet.

What are identity graphs used for?

Once events resolve to the right user, every downstream system gets better:

Attribution and analytics. A customer journey that touches paid, organic, email, and direct gets tracked as one path instead of four disconnected ones. Conversion paths actually reconstruct.
Personalization and product recommendations. Recommendation engines work from full behavior history, not the slice of it captured in the current session. Real-time product personalization needs a unified profile or it's just guessing.
Audience targeting. Deduplicated profiles mean you stop spending ad budget retargeting customers who already converted on a different device.
AI agents and real-time decisioning. Context-aware AI agents need to act on a unified profile. McKinsey estimates generative AI could create $2.6–4.4 trillion in annual value across business use cases, but agents can only capture that value when they have reliable customer context to act on. Tools like Snowplow Signals and Event Forwarding get significantly more useful when the events flowing through them carry resolved identities.
Customer support. Support agents see the full history across web, app, and email instead of a fragment from whichever channel the customer arrived through.

The identity resolution market itself reflects this. Analysts at Cognitive Market Research value identity resolution software at around $1.99 billion in 2025, growing to over $5 billion by 2034. Most of that growth is being driven by downstream demand from AI, analytics, and personalization use cases that fall over without a working graph underneath.

How do you build an identity graph?

There are roughly three options when you set out to build an identity graph for your business.

1. Roll your own in SQL

Identifier mapping tables in dbt, batch jobs that recompute the graph nightly, custom logic for edge cases. This approach works and you own every piece of it. But it gets expensive to maintain as identifier types multiply, and it’s batch by definition, which rules out real-time use cases. Our latest guide to identity stitching with Snowplow data walks through the SQL patterns if you want to see what that looks like in practice.

2. Buy a CDP

Segment, Amperity, and similar tools resolve identity inside their own platform. The graph is hidden behind their UI. You don't see the edges, can't audit the merges, and can't extend the matching logic for your own customer data quirks. Because of this, teams hit limitations and can't stress-test the logic. Business teams end up acting on incomplete data, and data engineers have their hands tied.

3. Resolve identity in the pipeline

Tools like Snowplow Identities build the graph on the event stream before data lands in the warehouse. Resolution is real-time and deterministic. The graph runs in your own cloud account, the merge logic is configurable, and the resolved Snowplow ID arrives on every event already attached. That's the architecture we've been building toward, and it's the one we recommend if your use cases include real-time personalization, agentic AI, or anything else that needs user behavior context within milliseconds.

[Figure: where the graph lives and when resolution happens in each approach. SQL/DIY: graph in the warehouse, updated in batch. CDP: graph in the vendor's cloud, with profiles exported back to the warehouse. Pipeline: graph in your own cloud, in line with the event stream.]

Where this leaves you

Most data teams running batch identity resolution today aren't doing it wrong. They're answering the questions their warehouse was built to answer, on the schedule it was built to run on, which is still useful work.

The shift is in what gets asked of the same team. Personalization that responds in the session. Agents that need current customer context to reason from. Activation tools waiting on profile updates that arrive in seconds, not hours. None of these are well served by a graph that rebuilds overnight.

It's worth asking yourself which of those your team is being asked to power, and how soon. That answer decides whether your current setup holds up.

Where to go from here

Real-time identity resolution is worth it for some use cases and overkill for others. If you want a second pair of eyes on which is which in your stack, drop us a line. No demo unless you want one.

If you'd rather start with the technical detail, the Snowplow Identities documentation covers the dbt models, the merge events, and where Identities sits in the pipeline.

Frequently asked questions about identity graphs

What's the difference between an identity graph and identity resolution?

An identity graph is the data structure. Identity resolution is the process. The graph stores identifiers as nodes and their relationships as edges. Resolution is the work of evaluating each new event against the graph, deciding whether it belongs to an existing user, and merging profiles where needed. You can't have one without the other, but they're not the same thing.

What is an identity graph database?

An identity graph database is a database optimized for storing and querying the relationships between customer identifiers, not just the identifiers themselves. Most modern implementations use either a dedicated graph database or a managed Postgres (such as Aurora on AWS or AlloyDB on GCP) with graph-style data modeling on top. The choice usually comes down to query latency requirements and how much identifier volume you're handling.

How does an identity graph handle anonymous-to-known users?

When a user is anonymous, only cookie or device-level identifiers exist on their events. Those identifiers attach to an anonymous user node. When that user later logs in, the event carries both the anonymous identifier and the now-known user_id. The graph sees that the anonymous identifier is already attached to a node, links the new user_id to it, and from that point on every event (including the prior anonymous session history) resolves to the known user. With a real-time graph, this happens at the moment of authentication, not in tomorrow's batch job.

Do I still need an identity graph if I have a CDP?

Most CDPs include their own identity graph, but you don't own it and you can't see how it works. The matching logic lives inside the vendor's platform, the graph itself is stored on their infrastructure, and exporting profiles back to your warehouse usually loses the relational structure. If your team needs auditability, custom merge rules, or the graph in your own cloud account, the CDP graph is not enough.

Can I build an identity graph in SQL?

Yes. Identifier mapping tables in dbt, with logic for joining domain_userid to user_id and propagating known IDs back across anonymous sessions, will produce a working graph. The downsides are that it's batch by default, gets expensive as identifier types multiply, and edge cases like shared devices need bespoke logic. SQL graphs work for retrospective analytics. They don't work for real-time personalization or in-session agent decisions.
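
As a rough illustration of the batch pattern, the sketch below runs the SQL through Python's built-in sqlite3 module so it can be tried without a warehouse. The table and column names (events, user_mapping, stitched_user_id) are assumptions made for this example, not the schema of any Snowplow dbt package, and the mapping assumes each domain_userid has at most one event at its latest logged-in timestamp.

```python
import sqlite3


def stitch(events):
    """events: iterable of (event_id, tstamp, domain_userid, user_id or None)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE events "
        "(event_id TEXT, tstamp INTEGER, domain_userid TEXT, user_id TEXT)"
    )
    con.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)

    # Mapping table: the most recent known user_id for each domain_userid.
    con.execute("""
        CREATE TABLE user_mapping AS
        SELECT e.domain_userid, e.user_id
        FROM events e
        JOIN (
            SELECT domain_userid, MAX(tstamp) AS last_seen
            FROM events
            WHERE user_id IS NOT NULL
            GROUP BY domain_userid
        ) latest
          ON latest.domain_userid = e.domain_userid
         AND latest.last_seen = e.tstamp
        WHERE e.user_id IS NOT NULL
    """)

    # Propagate the known ID back across earlier anonymous events.
    rows = con.execute("""
        SELECT e.event_id,
               COALESCE(m.user_id, e.domain_userid) AS stitched_user_id
        FROM events e
        LEFT JOIN user_mapping m ON m.domain_userid = e.domain_userid
        ORDER BY e.tstamp
    """).fetchall()
    con.close()
    return rows


# Usage with made-up events: cookie_a logs in on its second event.
stitched = stitch([
    ("e1", 1, "cookie_a", None),
    ("e2", 2, "cookie_a", "u_42"),
    ("e3", 3, "cookie_b", None),
])
```

Here e1 is attributed to u_42 only after the job has run, which is the batch limitation described above: the anonymous history is stitched retrospectively, not at the moment of login.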

How do identity graphs handle GDPR and privacy compliance?

A well-designed identity graph supports per-identifier consent tracking and propagates deletion requests across all linked nodes. When a customer requests deletion under GDPR or CCPA, the graph removes every identifier connected to that user node, not just the one specified in the request. Graphs that run in your own cloud (rather than a vendor's) make this straightforward because you control the storage layer. Graphs hidden inside CDPs require trusting that the vendor's deletion logic actually fires.
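
Deletion propagation can be sketched as below. The function and its data shapes are hypothetical, and the sketch assumes each identifier belongs to exactly one user node (no shared devices). It covers only the graph itself; a real deletion workflow would also need to remove the underlying events from storage.

```python
def delete_user(identifier_to_node, consent, requested_identifier):
    """Erase every identifier linked to the same user node as the requested one.

    identifier_to_node: dict mapping (kind, value) -> user node id
    consent: dict mapping (kind, value) -> bool, per-identifier consent flags
    Returns the sorted list of identifiers that were removed.
    """
    node = identifier_to_node.get(requested_identifier)
    if node is None:
        return []  # unknown identifier: nothing to delete
    linked = [
        ident for ident, owner in identifier_to_node.items() if owner == node
    ]
    for ident in linked:
        del identifier_to_node[ident]
        consent.pop(ident, None)  # drop the consent record with the identifier
    return sorted(linked)


# Usage with made-up data: the request names only the hashed email.
graph = {
    ("hashed_email", "abc"): "sp_001",
    ("domain_userid", "d-1"): "sp_001",
    ("domain_userid", "d-2"): "sp_002",
}
consent = {("hashed_email", "abc"): True, ("domain_userid", "d-1"): True}
removed = delete_user(graph, consent, ("hashed_email", "abc"))
```

The request names one identifier, but the cookie attached to the same user node is removed too, while the unrelated node sp_002 is left untouched.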
