
De-duplicating Events in Redshift and Hadoop: A Technical Tutorial for Snowplow Users

By Snowplow Team
October 4, 2024

Snowplow's pipeline is designed to maximize data integrity, but as many experienced users know, duplicate events can still occur under certain conditions. While rare, duplicate event_ids can lead to problematic joins—especially when working in Redshift—resulting in cartesian products and skewed analyses.

In this tutorial, we share best practices for identifying, handling, and removing duplicates using the Snowplow batch pipeline, Redshift SQL, and Hadoop de-duplication jobs. The techniques covered here reflect the collective knowledge of the Snowplow team and its community of users, and are tailored for data engineers, modelers, and platform maintainers running Snowplow in production.

1. What Causes Duplicates in Snowplow?

There are two primary causes of duplicated events:

1.1 Tracker Retry Logic

Snowplow’s trackers are built to prioritize data collection integrity. If a tracker doesn’t receive confirmation from the collector, it may retry sending the event. This can lead to true duplicates:

  • Identical payloads are sent multiple times.

  • The etl_tstamp may differ, but the payload and event_id are the same.

1.2 JavaScript Tracker Edge Cases

Bots, spiders, and some content blockers can cause the JavaScript tracker to emit events with non-unique (or constant) event_ids:

  • Some bots do not return a proper UUID.

  • All events from a given session may have the same event_id, creating thousands of pseudo-duplicates.

2. De-duplication in Hadoop

Starting with Snowplow R76, the batch pipeline introduced an upstream de-duplication job that runs on Hadoop. This process:

  • Partitions events on event_id and event_fingerprint.

  • Removes all but the first occurrence within a batch.

  • Prevents 95–99% of duplicates from reaching Redshift.
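
Conceptually, the in-batch rule is equivalent to the following SQL sketch. This is an illustration only: the real job runs on Hadoop over a batch's enriched events, not in Redshift, and batch_events is a hypothetical stand-in for one batch.

-- Keep one row per (event_id, event_fingerprint) pair within a batch.
-- For true duplicates the payloads are identical, so which occurrence is kept does not matter.
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY event_id, event_fingerprint
           ORDER BY collector_tstamp
         ) AS occurrence
  FROM batch_events
) AS ranked
WHERE occurrence = 1;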

Limitations:

  • Does not deduplicate across batches.

  • Does not handle synthetic duplicates (e.g., multiple events with the same event_id but different payloads).

Follow-up enhancements to address these limitations were released in later versions of the batch pipeline.

3. Handling Existing Duplicates in Redshift

For data already in Redshift, deduplication can be handled in multiple ways depending on your architecture and performance requirements.

3.1 Join on Both event_id and collector_tstamp

Many duplicates share the same event_id but differ in collector_tstamp. When joining tables, ensure you join on both:

ON a.event_id = b.root_id AND a.collector_tstamp = b.root_tstamp
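
In a fuller query this looks like the sketch below; the derived table com_acme_link_click_1 and its target_url column are illustrative:

-- atomic.events aliased as a, the derived/context table as b, joined on both keys
SELECT
  a.event_id,
  a.collector_tstamp,
  b.target_url
FROM atomic.events AS a
JOIN atomic.com_acme_link_click_1 AS b
  ON  a.event_id         = b.root_id
  AND a.collector_tstamp = b.root_tstamp;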

Benefits:

  • Simple and performant.

  • Often sufficient to prevent cartesian joins.

To gauge how widespread duplicates are, count how many event_ids in atomic.events occur once, twice, and so on:

SELECT occurrences, COUNT(*) AS event_ids
FROM (
  SELECT event_id, COUNT(*) AS occurrences
  FROM atomic.events
  GROUP BY 1
) AS counts
GROUP BY 1
ORDER BY 1;

3.2 De-duplicate in the Modeling Layer

Most Snowplow users transform atomic.events into modeled tables. Add a deduplication step using ROW_NUMBER() in your dbt models:

SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY collector_tstamp) AS row_num
  FROM atomic.events
) AS deduped
WHERE row_num = 1;

Benefits:

  • Keeps raw data untouched.

  • Easier to maintain via dbt, especially in modular pipelines.
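
Wrapped in a dbt model, the same logic might look like the sketch below. The model and source names are illustrative and assume a dbt source pointing at atomic.events has been configured:

-- models/staging/stg_snowplow_events_deduped.sql (hypothetical model name)
with ranked as (

    select
        *,
        row_number() over (
            partition by event_id
            order by collector_tstamp
        ) as row_num
    from {{ source('atomic', 'events') }}

)

select *
from ranked
where row_num = 1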

3.3 Remove Duplicates Using SQL

To permanently move duplicates to a separate schema, use Snowplow-provided SQL scripts from snowplow/sql-runner:

Setup

CREATE SCHEMA duplicates;
CREATE TABLE duplicates.events (LIKE atomic.events);

Run either of:

  • 01-events.sql (for events with a fingerprint)

  • 02-events-without-fingerprint.sql (for events without one)

Optionally, run step (d) to remove events with differing fingerprints (use with caution).
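
Once the scripts have run, a quick sanity check (a sketch, assuming event_fingerprint is populated) is to confirm that no event_id / event_fingerprint pair still appears more than once:

-- Should return zero rows after duplicates with matching fingerprints
-- have been moved out of atomic.events.
SELECT event_id, event_fingerprint, COUNT(*) AS occurrences
FROM atomic.events
GROUP BY 1, 2
HAVING COUNT(*) > 1;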

Unstructured Events and Context Tables

For derived tables (e.g., com_acme_unstructured_event), follow this pattern:

  1. Create a duplicates version of the table.

  2. Use a window function to rank rows by collector_tstamp within each event_id.

  3. Keep event_number = 1 in the source table, and move rows with event_number > 1 to duplicates (see the sketch after this list).
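
A minimal sketch of this pattern for the hypothetical atomic.com_acme_unstructured_event table follows (in derived tables, event_id and collector_tstamp surface as root_id and root_tstamp). Because the staging table carries the extra event_number column, the final two inserts must enumerate the real table's columns, which this sketch cannot know:

-- 1. Create a duplicates version of the table
CREATE TABLE duplicates.com_acme_unstructured_event
  (LIKE atomic.com_acme_unstructured_event);

-- 2. Rank rows by root_tstamp within each root_id, for affected events only
CREATE TEMP TABLE ranked AS
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY root_id ORDER BY root_tstamp) AS event_number
FROM atomic.com_acme_unstructured_event
WHERE root_id IN (
  SELECT root_id
  FROM atomic.com_acme_unstructured_event
  GROUP BY 1
  HAVING COUNT(*) > 1
);

-- 3. Keep event_number = 1 in the source table, move the rest to duplicates
DELETE FROM atomic.com_acme_unstructured_event
WHERE root_id IN (SELECT root_id FROM ranked);

INSERT INTO atomic.com_acme_unstructured_event
SELECT root_id, root_tstamp /* , remaining columns of the table */
FROM ranked
WHERE event_number = 1;

INSERT INTO duplicates.com_acme_unstructured_event
SELECT root_id, root_tstamp /* , remaining columns of the table */
FROM ranked
WHERE event_number > 1;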

4. Best Practices and Next Steps

  • Minimize Real User Loss: Start by updating joins to use both event_id and collector_tstamp instead of full deletion.

  • Stage Deduplication: If still necessary, run deduplication scripts in two phases—first without step (d), then with it.

  • Use event_fingerprint: Customize fingerprint logic if certain fields (e.g., dvce_sent_tstamp, user_ipaddress) are inconsistent but non-critical.

  • Bots & Anomalies: Investigate anomalies where duplicate event_ids vary in user agent or IP. These could be bots or misconfigured trackers; the query below can help surface them.
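
For example (useragent and user_ipaddress are standard atomic.events columns):

-- event_ids that repeat across many rows and several user agents or IPs are a
-- common signature of bot traffic or a misconfigured tracker.
SELECT
  event_id,
  COUNT(*)                       AS occurrences,
  COUNT(DISTINCT useragent)      AS distinct_user_agents,
  COUNT(DISTINCT user_ipaddress) AS distinct_ips
FROM atomic.events
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY occurrences DESC
LIMIT 100;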
