Blog

How to approach identity stitching with Snowplow data

By
Dilyan Damyanov
November 24, 2020
Share this post

Identity stitching, also known as user stitching, is the process of tying together user identifiers to build a more complete view of user behavior. Companies might have different reasons to stitch user identifiers together, and some of the most popular ones include:

  • You want to identify users of your website or application across multiple devices, e.g. multiple computers, mobile devices, smart TVs, etc;
  • You want to identify users across multiple platforms, e.g. your website and mobile app;
  • You want to identify users before and after they log in to your website or application.

This complete view, known also as a 360-degree customer view or a single customer view, can then allow you to personalize the experience users have with your brand or product.

Typically, user stitching involves three key ingredients:

  • A set of event properties that can identify a user;
  • An understanding of how these user identifiers map onto each other;
  • A way to use the mapping and identify all events for a user, even when not all of these properties are available.

The final part can be highly specific to a business, depending on its industry, use cases, tracking strategy and platforms among many other factors. 

In this post, we'll take a look at the first two ingredients, and how you would approach these two steps using Snowplow events. We’ll pay special attention to the most common questions that come up in the process, and a few potential answers.

The difference between batch and real-time user stitching, and how to select the right approach

On the surface, your identity stitching implementation may appear different depending on whether you have a "batch" workflow where you process big batches of data at a time, or "in-stream", which is more suitable for real-time applications. However the difference is mostly in the logistics: what tools and technologies you use, how you connect them to each other and how often your process runs. The underlying logic will largely be the same, built on top of the ingredients mentioned above.

Usually, a real-time approach would suit any use case, though it might not be the most cost-effective option in all situations. On the other hand, a batch approach cannot be applied in use cases where timing is essential, such as real-time recommendation engines. If you are already building an in-stream solution for a time-sensitive use case, it might make sense to scale it so it covers other use cases as well, even if you do not need low latency and provided that it is economical.

User stitching with Snowplow: which properties of a Snowplow event are user identifiers and what are the main differences between them?

Let's take a look at the user identifiers first.

Out of the box, with Snowplow's standard web and mobile tracking, you'll get the following properties:

  • domain_userid, which is set via a first-party client-side cookie by the Snowplow JavaScript tracker;
  • network_userid, which is set server-side by the Snowplow collector as either a first-party or third-party cookie;
  • android_idfa / apple_idfa / apple_idfv, depending on the mobile device;
  • user_id, which is a custom identifier that you set when tracking an event with a Snowplow tracker.

There are other properties in a Snowplow event that can be used to various degrees to identify users, such as refr_domain_userid, user_fingerprint and user_ipaddress. In addition, you might also have other custom identifiers alongside user_id, such as email_address, client_id, etc. The above are just the most popular ones that you can expect to see (under different names) in many tracking implementations, regardless of whether they are based on Snowplow or not.

How to map your user identifiers together

Let's take the example of a company that has a website and an Android mobile app, and wants to identify users across both of these platforms, and across different devices, both when they are logged in and not. In this section we'll take a look at how that company might approach the problem. What questions does it need to answer? What decisions does it have to make?

We'll limit the example to the identifiers mentioned above. On web, network_userid is set as a first-party cookie, making it a reliable long-term identifier. (If you are using network_userid in a third-party context, you can swap in domain_userid but keep in mind that client-side cookies' lifetime has recently been dramatically cut, most notably in Safari.) 

The android_idfa uniquely identifies an user's Android mobile device; for iOS the Apple identifiers play the same role. Let's also assume that the user_id property is being tracked on both web and mobile, and its value is the email address used for logging in.

How do these properties map onto each other? It helps to consider what types of relationships (one-to-one or one-to-many) can exist between the different properties:

  • the same user_id can be associated with multiple network_userid values if the user is on different computers or in different browsers;
  • the same user_id can be associated with multiple android_idfa values if the user is on different mobile devices;
  • the same network_userid can be associated with multiple user_id values, if an user has two accounts or two people use the same computer;
  • the network_userid doesn't have a direct relationship to the android_idfa;
  • the same android_idfa can be associated with multiple user_id values, if an user has two accounts or two people use the same device.

All queries below assume that data is loaded in Redshift. This implies a batch workflow. However, most of the approaches can also be implemented in a real-time setup. Sticking to AWS as the cloud provider, the mapping tables described below can be DynamoDB tables, which will allow very low latency in obtaining the correct mapping.

For simplicity, some values are represented as integers where you would normally find a UUID in real-world Snowplow data.

Multiple network_userids per user_id

What does the data say?

SELECT
user_id,
COUNT(DISTINCT network_userid) AS network_userid_count
FROM atomic.events
WHERE user_id IS NOT NULL
GROUP BY 1
ORDER BY 2;

This query will often reveal that for many user_id there is more than one network_userid associated with them.

What does it mean?

This suggests one of three things:

  1. The person associated with a specific user_id is visiting the website from different devices (for example a home and work computer).
  2. The person associated with a specific user_id is visiting the website in different browsers (for example Chrome and Safari), potentially on multiple devices.
  3. The person associated with a specific user_id is always visiting the website from the same device and browser, but the cookies in that browser have been reset, forcing a new network_userid to be assigned.

How can you use this information? 

If your goal is to identify this user across all their devices / browsers, when they are not logged in, then you can create a mapping table, in which each of the network_userid values is associated with the same user_id. Then, whenever you see any one of these network_userids, you'd know it's the user with that specific user_id.

network_useriduser_idCA9C57DE-EEC1-4A47-9B75-3074C8E778B8user@email.com97D44551-58AF-4B8A-8D2D-CA46F728AD41user@email.com25CA22BC-B3D8-4736-833B-809F3CDEC688user@email.comUsers table

event_idnetwork_useriduser_id1CA9C57DE-EEC1-4A47-9B75-3074C8E778B8null297D44551-58AF-4B8A-8D2D-CA46F728AD41null325CA22BC-B3D8-4736-833B-809F3CDEC688nullEvents table

SELECT
a.event_id,
b.user_id
FROM events AS a
JOIN users AS b
ON a.network_userid = b.network_userid

event_iduser_id1user@email.com2user@email.com3user@email.com

You may have many events where you only have the network_userid for a user but not their user_id, because those events happened before the user was logged in. But the opposite should not be true. Events generated by a logged-in user should always have both fields populated. For this reason, the reverse lookup (user_id to network_userid) is unnecessary for identity stitching.

Multiple user_ids per network_userid

What does the data say?

SELECT
network_userid,
COUNT(DISTINCT user_id) AS user_id_count
FROM atomic.events
GROUP BY 1
ORDER BY 2

This query might reveal cases where there are more than one user_id associated with a specific network_userid.

What does it mean?

The possible reasons include:

  1. A user can have two accounts, under different email addresses.
  2. More than one user is visiting the website from the same device and browser.

How can you use this information?

To establish if these are different users or the same user with multiple accounts, you might need to join the event data to other information that you have about your registered users, such as full name or postal address.

user_idfirst_namelast_nameaddresssteve@email.comSteveHarris22, Acacia Avenues.harris@email.comSteveHarris22, Acacia AvenueUser_registry table

network_useriduser_idCA9C57DE-EEC1-4A47-9B75-3074C8E778B8steve@email.comCA9C57DE-EEC1-4A47-9B75-3074C8E778B8s.harris@email.comUsers table

event_idnetwork_useriduser_id1CA9C57DE-EEC1-4A47-9B75-3074C8E778B8null2CA9C57DE-EEC1-4A47-9B75-3074C8E778B8null3CA9C57DE-EEC1-4A47-9B75-3074C8E778B8nullEvents table

SELECT
a.event_id,
c.first_name || ' ' || c.last_name || ' of ' || c.address AS user
FROM events AS a
JOIN users AS b
ON a.network_userid = b.network_userid
JOIN user_registry AS c
ON b.user_id = c.user_id
GROUP BY 1, 2

event_iduser1Steve Harris of 22, Acacia Avenue2Steve Harris of 22, Acacia Avenue3Steve Harris of 22, Acacia Avenue

But what if they are different users, who maybe live in the same household and use the same device?

user_idfirst_namelast_nameaddresssteve@email.comSteveHarris22, Acacia Avenuenicko@gmail.comNickoMcBrain22, Acacia AvenueUser_registry table

network_useriduser_idCA9C57DE-EEC1-4A47-9B75-3074C8E778B8steve@email.comCA9C57DE-EEC1-4A47-9B75-3074C8E778B8nicko@email.comUsers table

You probably won't be able to determine with certainty which one is visiting the website at any particular moment. You can deal with this in several different ways:

1. Ignore such users.

If they are a small proportion of your overall user base, this might be the simplest and most efficient solution. It's also the only 100% accurate one, since every user who is not excluded can have their identity stitched without ambiguity. The downside is that you will be leaving out some portion of your user base.

2. Associate the network_userid with the latest seen user_id

network_useriduser_idtstampCA9C57DE-EEC1-4A47-9B75-3074C8E778B8nicko@email.com2020-10-22 14:27:00CA9C57DE-EEC1-4A47-9B75-3074C8E778B8steve@email.com2020-10-21 11:45:33Users table

event_idnetwork_useriduser_id1CA9C57DE-EEC1-4A47-9B75-3074C8E778B8null2CA9C57DE-EEC1-4A47-9B75-3074C8E778B8null3CA9C57DE-EEC1-4A47-9B75-3074C8E778B8nullEvents table

WITH users AS (
SELECT
network_userid,
LAST_VALUE(user_id IGNORE NULLS) OVER
(
PARTITION BY network_userid ORDER BY tstamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) AS user_id
FROM users
GROUP BY 1, 2
)

SELECT
a.event_id,
b.user_id
FROM events AS a
JOIN users AS b
ON a.network_userid = b.network_userid

event_iduser_id1nicko@email.com2nicko@email.com3nicko@email.com

Of course, in this case you are making an assumption about the real identity of the user, so your data will not be 100% accurate.

3. Associate the network_userid with the most frequently appearing user_id.

Consider whether you want to use the user_id with most sessions or the one with most events. The example below uses sessions.

network_useriduser_idsession_idCA9C57DE-EEC1-4A47-9B75-3074C8E778B8steve@email.com1CA9C57DE-EEC1-4A47-9B75-3074C8E778B8nicko@email.com2CA9C57DE-EEC1-4A47-9B75-3074C8E778B8steve@email.com3Users table

event_idnetwork_useriduser_id1CA9C57DE-EEC1-4A47-9B75-3074C8E778B8null2CA9C57DE-EEC1-4A47-9B75-3074C8E778B8null3CA9C57DE-EEC1-4A47-9B75-3074C8E778B8nullEvents table

WITH prep_users AS (
SELECT
network_userid,
user_id,
COUNT(DISTINCT session_id) AS sessions
FROM users
GROUP BY 1, 2
),

users AS (
SELECT
network_userid,
FIRST_VALUE(user_id) OVER
(
PARTITION BY network_userid ORDER BY sessions DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as user_id
FROM prep_users
GROUP BY 1, 2
)

SELECT
a.event_id,
b.user_id
FROM events AS a
JOIN users AS b
ON a.network_userid = b.network_userid

event_iduser_id1steve@email.com2steve@email.com3steve@email.com

Similar to the second option above, the results in the final derived table will not be 100% accurate, so some events might be attributed to the wrong user_id.

Multiple android_idfa values per user_id and multiple user_id values per android_idfa

Sticking to our example above, on mobile devices the android_idfa property will play the same role as network_userid plays on web. It identifies a device. The same considerations as above are applicable in these scenarios as well; and the query examples can be easily adapted.

As a final note, you may want to add a lookback window to the WHERE clause of some of the queries above, to ensure you're only using the latest data and not your entire historical records dating back months or even years.

Identity stitching as a real-time workflow

All the examples so far have been about batch processing of tables in a data warehouse. With streaming data, you would normally deal with each event as it comes in. The same principles as above will apply, but the "logistics" will be a little bit different. 

The most notable difference will be that in the warehouse creating a mapping table and then using it in conjunction with atomic tables and tables elsewhere in the warehouse can be separate steps. They can even happen at different frequencies, for example if you update the mapping tables daily via a scheduled process but run queries that use them throughout the day. In-stream, as you handle each event, you will need to accomplish two things at once: lookup your DynamoDB mapping table to identify the user in the current event *and* update the DynamoDB mapping table with information from the current event.

Take ownership of your data

There is no right or wrong approach to user identity stitching. The key consideration when deciding between a batch or real-time approach depends on your use case, what you want to achieve through user identity stitching and how it will benefit your product, user experience or business. Since Snowplow gives you full ownership and control of the real-time data stream and the data in the warehouse, you can use Snowplow for both batch, and real-time workflows. 

However, regardless of which approach you take, identity stitching involves using personal data, which means businesses should always make sure to comply with data privacy regulations. At Snowplow, our aim is to give our users more control over how personal user data is handled so they can better manage how personal data is used at every stage. 

To find out more about how to gain a complete customer view with Snowplow, you can get in touch with us here.

Subscribe to our newsletter

Get the latest blog posts to your inbox every week.

Get Started

Unlock the value of your behavioral data with customer data infrastructure for AI, advanced analytics, and personalized experiences