Blog

Make Big Data Small Again with Redshift ZSTD Compression

By
Snowplow Team
&
November 13, 2024
Share this post

How Can Redshift ZSTD Compression Help Snowplow Users?

Amazon Redshift introduced support for Zstandard (ZSTD) compression in early 2017, unlocking substantial storage and performance efficiencies for Snowplow users working with large datasets. This guide explains how to implement ZSTD compression with Snowplow's atomic.events and other derived tables, what benefits you can expect, and common pitfalls to avoid.

What Is ZSTD and Why Does It Matter for Snowplow?

Zstandard, or ZSTD, is a modern compression algorithm developed by Facebook that combines Lempel-Ziv and tANS encoding. Compared to older algorithms like LZO or RUNLENGTH, ZSTD typically offers:

  • Higher compression ratios (typically 2–3x over Snowplow’s standard setup)

  • Faster decompression speeds

  • Support across all Redshift-compatible data types

These characteristics make ZSTD ideal for compressing large VARCHAR columns—especially useful for verbose fields like unstruct_event, contexts, or any schema with large JSON strings.

How Do I Apply ZSTD Compression in Redshift?

There are two main approaches:

1. Analyze Existing Tables

Use Redshift's ANALYZE COMPRESSION to profile a table and get encoding recommendations.

ANALYZE COMPRESSION atomic.events;

Pros:

  • Automated suggestions

  • Fast for small/medium tables

Cons:

  • Sampling limited to 1 billion rows

  • Recommends compression for all columns (including SORTKEYs, which is not ideal)

2. Manual Deep Copy with Explicit Encodings

Create a new table manually and explicitly define encodings based on your workload and query patterns. Then, deep copy the data:

CREATE TABLE atomic.events_new (...);  -- with ZSTD encodings
INSERT INTO atomic.events_new SELECT * FROM atomic.events;

Pros:

  • Full control over compression types

  • Opportunity to clean, sort, and reoptimize data layout

Cons:

  • Requires disk headroom (50%+ recommended)

  • Need to recreate views, permissions, and dependencies

Should I Compress Sort Keys?

Short answer: No.

Compressing SORTKEY columns can negatively affect query performance. AWS and the Snowplow team recommend using ENCODE RAW for these columns (e.g., collector_tstamp, event_id).

What Compression Ratios Can I Expect?

Snowplow users have reported the following when switching from default compression to ZSTD:

  • ~3x reduction in disk space across datasets of 10 million to 5 billion rows

  • Negligible query performance impact, though this depends on workload and access patterns

What Are the Steps to Migrate Safely?

  1. Run ANALYZE COMPRESSION or analyze workloads manually

  2. Create new table with ZSTD-compressed columns (except SORTKEYs)

  3. Deep copy data into the new table

  4. Validate row counts and data consistency

  5. Recreate views, grants, foreign keys, and constraints

  6. Swap tables (rename atomic.events_new to atomic.events)

Be sure to:

  • Pause pipeline if needed

  • Check disk space before copying

  • Reapply table ownership and ACLs

Use the SQL query below to extract user grants before migration:

-- Permission recovery query omitted for brevity (see source)

What’s the Status of Tooling Support in Snowplow?

  • igluctl: Now supports ZSTD as a compression option for DDL generation (since 2018)

  • Schema Guru: Doesn’t auto-generate ZSTD yet, but can be integrated with igluctl workflows

For managed deployments, Snowplow BDP users can get support from the team for planning and executing ZSTD migrations.

Final Recommendations

  • Use ZSTD for large VARCHAR and JSON-heavy fields

  • Avoid compressing SORTKEYs

  • Benchmark both compression ratio and query performance before deploying to production

  • Always validate and backup before dropping or renaming key tables

Subscribe to our newsletter

Get the latest content to your inbox monthly.

Get Started

Whether you’re modernizing your customer data infrastructure or building AI-powered applications, Snowplow helps eliminate engineering complexity so you can focus on delivering smarter customer experiences.