GDPR Tutorial: Deleting Customer Data from Amazon Redshift

By Snowplow Team
September 10, 2024

The introduction of the General Data Protection Regulation (GDPR) transformed how organizations manage personal data. Among the most impactful changes is the data subject’s “right to be forgotten,” requiring companies to delete personal data upon request — across all data stores.

For companies running Snowplow, the good news is that our highly structured event data model makes this process significantly more manageable.

In this tutorial, we’ll walk through a reliable and repeatable process to delete all data associated with a specific user from Amazon Redshift, leveraging Snowplow’s atomic data model.

Assumptions

This guide assumes the following:

  • A data subject has requested deletion of all data associated with them.

  • You are using user_id as the identifier (this approach also applies to other identifying fields such as domain_userid, network_userid, or user_ipaddress).

  • In-batch deduplication is enabled (i.e., you're running Snowplow R76+).

  • Your data models are rebuilt daily from atomic data — not incrementally maintained. If using incremental models, additional steps are required (outside the scope of this tutorial).

Step 1: Identify Events to Delete

Create a scratch table with all event_id and collector_tstamp combinations tied to the data subject:

CREATE SCHEMA IF NOT EXISTS scratch;
DROP TABLE IF EXISTS scratch.deletion_fact_table;
CREATE TABLE scratch.deletion_fact_table
  DISTKEY(event_id)
  SORTKEY(collector_tstamp) AS (
  SELECT
    event_id,
    collector_tstamp
  FROM atomic.events
  WHERE user_id = 'DATA_SUBJECT_ID'  -- replace with actual identifier
);
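
If the data subject is known by more than one identifier (see the assumptions above), the WHERE clause can be widened to match on any of them. A sketch with hypothetical placeholder values:

DROP TABLE IF EXISTS scratch.deletion_fact_table;
CREATE TABLE scratch.deletion_fact_table
  DISTKEY(event_id)
  SORTKEY(collector_tstamp) AS (
  SELECT
    event_id,
    collector_tstamp
  FROM atomic.events
  WHERE user_id = 'DATA_SUBJECT_ID'          -- hypothetical placeholder
     OR domain_userid = 'DATA_SUBJECT_DUID'  -- hypothetical placeholder
);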

Sanity check:

SELECT COUNT(*) FROM scratch.deletion_fact_table;

Confirm that the row count aligns with expectations before proceeding.

Step 2: Identify Affected Tables

Snowplow shreds self-describing events and contexts into their own Redshift tables alongside atomic.events. To find which tables reference the data subject's events, run:

SELECT DISTINCT table_name 
FROM information_schema.columns 
WHERE table_schema = 'atomic' 
AND column_name = 'root_id';

This query returns all tables in the atomic schema that include root_id, which links shredded data back to the original event.

Step 3: Delete from Derived Tables

Using the deletion_fact_table, delete the user’s data from each of the tables identified in Step 2 (atomic.events itself has no root_id column and is handled separately in Step 4):

DELETE FROM atomic.com_yourcompany_eventname_1
USING scratch.deletion_fact_table AS d
WHERE root_id = d.event_id
  AND root_tstamp = d.collector_tstamp;
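
To see how many rows a given DELETE will remove (or to verify it removed them afterwards), the same join can be counted first, using the same placeholder table name:

SELECT COUNT(*)
FROM atomic.com_yourcompany_eventname_1 AS m
JOIN scratch.deletion_fact_table AS d
  ON m.root_id = d.event_id
 AND m.root_tstamp = d.collector_tstamp;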

💡 Tip: Script this for all relevant tables to ensure consistency and avoid omissions.
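
One way to script it is to generate the DELETE statements directly from the Step 2 query. A sketch (review the generated SQL before running it; atomic.events is excluded automatically, since it has no root_id column):

-- Emit one DELETE statement per derived table containing root_id
SELECT DISTINCT
  'DELETE FROM atomic.' || table_name
  || ' USING scratch.deletion_fact_table AS d'
  || ' WHERE root_id = d.event_id AND root_tstamp = d.collector_tstamp;'
FROM information_schema.columns
WHERE table_schema = 'atomic'
  AND column_name = 'root_id';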

Step 4: Delete from atomic.events

Once all dependent tables are cleared, remove the events from the main events table:

DELETE FROM atomic.events
USING scratch.deletion_fact_table AS d
WHERE atomic.events.event_id = d.event_id
  AND atomic.events.collector_tstamp = d.collector_tstamp;
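
A final check should now return zero for the identifier used in Step 1:

SELECT COUNT(*) FROM atomic.events WHERE user_id = 'DATA_SUBJECT_ID';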

Step 5: Post-Deletion Cleanup

Optional: Log Deleted Time Range

Before dropping your scratch table, it can be useful to log the time range of deleted events:

SELECT
  MIN(collector_tstamp) AS min_tstamp,
  MAX(collector_tstamp) AS max_tstamp
FROM scratch.deletion_fact_table;
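
If you keep an audit trail, this range can be written to a non-identifying log table before the scratch table is dropped. A minimal sketch, assuming a hypothetical audit.gdpr_deletions table:

CREATE SCHEMA IF NOT EXISTS audit;
CREATE TABLE IF NOT EXISTS audit.gdpr_deletions (
  deleted_at     TIMESTAMP,
  events_deleted BIGINT,
  min_tstamp     TIMESTAMP,
  max_tstamp     TIMESTAMP
);

-- Record when the deletion ran, how many events it covered, and their
-- time range; no personal identifiers are stored
INSERT INTO audit.gdpr_deletions
SELECT GETDATE(), COUNT(*), MIN(collector_tstamp), MAX(collector_tstamp)
FROM scratch.deletion_fact_table;

DROP TABLE scratch.deletion_fact_table;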

Vacuum Tables

To reclaim storage and maintain performance, run:

VACUUM;

This is especially important if you don’t already run routine vacuum operations.
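
On large clusters a full VACUUM can take a long time; a targeted pass over just the affected tables reclaims the deleted rows more cheaply (using the same placeholder table name as above):

VACUUM DELETE ONLY atomic.events;
VACUUM DELETE ONLY atomic.com_yourcompany_eventname_1;

-- Refresh planner statistics after the deletes
ANALYZE atomic.events;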

Important Considerations: Snapshots and Backups

Amazon Redshift Snapshots

Redshift’s automated and manual snapshots can retain copies of the deleted data even after the process above completes.

  • Automated snapshots are created regularly and retained based on your cluster’s settings (default: 1 day).

  • Manual snapshots are not deleted automatically and persist even if the cluster is removed.

To stay compliant:

  • Review your snapshot retention policy.

  • Delete any manual snapshots containing data that must be forgotten.

  • Check for cross-region backups or shared snapshots stored in S3.

Best Practices for GDPR Compliance with Snowplow

  • Use consistent identifiers: Ensure a single source of truth (e.g., user_id, network_userid) to make deletions accurate and complete.

  • Automate your deletion pipeline: Build a repeatable process using scheduled queries or orchestration tools like Airflow or dbt.

  • Audit and log deletions: Retain non-identifying logs of what was deleted (e.g., timestamps) for compliance audits.

  • Review backups regularly: Ensure forgotten data isn’t lurking in unmanaged snapshots.

Conclusion

While GDPR compliance presents operational challenges, Snowplow’s structured data and event model simplify user data deletions in Redshift. By isolating event metadata and separating enriched data into derived tables, it’s possible to identify and delete user data systematically and confidently.

For teams building real-time behavioral pipelines, GDPR doesn’t have to be a blocker — especially when your stack is built with privacy-by-design in mind.
