Make Big Data Small Again with Redshift ZSTD Compression
How Can Redshift ZSTD Compression Help Snowplow Users?
Amazon Redshift introduced support for Zstandard (ZSTD) compression in early 2017, unlocking substantial storage and performance efficiencies for Snowplow users working with large datasets. This guide explains how to implement ZSTD compression with Snowplow's atomic.events and other derived tables, what benefits you can expect, and common pitfalls to avoid.
What Is ZSTD and Why Does It Matter for Snowplow?
Zstandard, or ZSTD, is a modern compression algorithm developed at Facebook that combines a Lempel-Ziv dictionary stage with tANS entropy encoding. Compared to the older encodings Redshift offers, such as LZO or RUNLENGTH, ZSTD typically delivers:
- Higher compression ratios (typically 2–3x over Snowplow’s standard setup)
- Faster decompression speeds
- Support across all Redshift-compatible data types
These characteristics make ZSTD ideal for compressing large VARCHAR columns—especially useful for verbose fields like unstruct_event, contexts, or any schema with large JSON strings.
How Do I Apply ZSTD Compression in Redshift?
There are two main approaches:
1. Analyze Existing Tables
Use Redshift's ANALYZE COMPRESSION to profile a table and get encoding recommendations.
ANALYZE COMPRESSION atomic.events;
Pros:
- Automated suggestions
- Fast for small/medium tables
Cons:
- Sampling is capped at 1 billion rows, so very large tables are profiled on a subset of the data
- Recommends an encoding for every column, including SORTKEY columns, which is not ideal
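If the default sample is too coarse, COMPROWS raises the number of rows inspected (up to the 1 billion cap above), and a column list restricts the analysis to the fields you actually care about. A quick example using a few standard atomic.events columns; adjust to your own table:

-- Profile a larger sample, and only the verbose columns of interest.
-- Note: ANALYZE COMPRESSION takes an exclusive lock on the table while it runs.
ANALYZE COMPRESSION atomic.events (unstruct_event, contexts, useragent) COMPROWS 1000000;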
2. Manual Deep Copy with Explicit Encodings
Create a new table manually and explicitly define encodings based on your workload and query patterns, then deep copy the data (a fuller sketch follows the pros and cons below):
CREATE TABLE atomic.events_new (...); -- with ZSTD encodings
INSERT INTO atomic.events_new SELECT * FROM atomic.events;
Pros:
- Full control over compression types
- Opportunity to clean, sort, and reoptimize data layout
Cons:
- Requires disk headroom (50%+ recommended)
- Need to recreate views, permissions, and dependencies
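For illustration, here is a minimal sketch of that deep copy with a handful of atomic.events columns (the real table has well over a hundred); the types, lengths, and encodings shown are plausible choices rather than Snowplow's canonical DDL:

-- Sketch only: the real atomic.events has far more columns
CREATE TABLE atomic.events_new (
  event_id         CHAR(36)       ENCODE RAW,   -- distribution key, left uncompressed
  collector_tstamp TIMESTAMP      ENCODE RAW,   -- SORTKEY column, keep RAW
  app_id           VARCHAR(255)   ENCODE ZSTD,
  useragent        VARCHAR(1000)  ENCODE ZSTD,
  unstruct_event   VARCHAR(65535) ENCODE ZSTD,  -- large JSON strings compress very well
  contexts         VARCHAR(65535) ENCODE ZSTD
)
DISTSTYLE KEY
DISTKEY (event_id)
SORTKEY (collector_tstamp);

-- Deep copy, sorting on the way in so the new table lands in sort key order
INSERT INTO atomic.events_new
SELECT event_id, collector_tstamp, app_id, useragent, unstruct_event, contexts
FROM atomic.events
ORDER BY collector_tstamp;

Because the INSERT orders rows by the sort key into an empty table, the copy should arrive largely sorted and a full VACUUM can often be skipped; running ANALYZE on the new table afterwards is still worthwhile.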
Should I Compress Sort Keys?
Short answer: No.
Compressing SORTKEY columns can hurt query performance: a heavily compressed sort key packs far more rows into each 1 MB block, so range-restricted scans end up reading many more rows than they need. AWS and the Snowplow team recommend ENCODE RAW for these columns (e.g. collector_tstamp, event_id), as in the deep-copy sketch above.
What Compression Ratios Can I Expect?
Snowplow users have reported the following when switching from default compression to ZSTD:
- ~3x reduction in disk space across datasets of 10 million to 5 billion rows
- Negligible query performance impact, though this depends on workload and access patterns
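To verify the ratio on your own cluster, compare the old and new tables' footprints; svv_table_info reports size in 1 MB blocks along with row counts:

-- Compare on-disk size (1 MB blocks) and row counts of the old and new tables
SELECT "table", size, tbl_rows
FROM svv_table_info
WHERE schema = 'atomic'
  AND "table" IN ('events', 'events_new');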
What Are the Steps to Migrate Safely?
- Run ANALYZE COMPRESSION or analyze workloads manually
- Create new table with ZSTD-compressed columns (except SORTKEYs)
- Deep copy data into the new table
- Validate row counts and data consistency
- Recreate views, grants, foreign keys, and constraints
- Swap tables (rename atomic.events_new to atomic.events); a sketch of the validation and swap follows this list
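A minimal sketch of the validation and swap, assuming the deep copy has completed and keeping the old table around as a safety net:

-- 1. Validate: both counts should match before swapping
SELECT COUNT(*) FROM atomic.events;
SELECT COUNT(*) FROM atomic.events_new;

-- 2. Swap inside a transaction so readers never see a missing table
BEGIN;
ALTER TABLE atomic.events RENAME TO events_old;
ALTER TABLE atomic.events_new RENAME TO events;
COMMIT;

-- 3. Drop atomic.events_old only once grants, views, and downstream jobs are verified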
Be sure to:
- Pause pipeline if needed
- Check disk space before copying (a quick check query follows this list)
- Reapply table ownership and ACLs
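For the disk check, a rough cluster-wide usage figure can be pulled from stv_partitions (visible to superusers only); a deep copy temporarily holds two full copies of the table, so leave yourself plenty of headroom:

-- Rough cluster-wide disk usage; keep this comfortably low before a deep copy
SELECT SUM(used)::FLOAT / SUM(capacity) * 100 AS pct_disk_used
FROM stv_partitions;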
Use the SQL query below to extract user grants before migration:
-- Permission recovery query omitted for brevity (see source)
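That original query isn't reproduced here; as a rough stand-in, the built-in HAS_TABLE_PRIVILEGE function can snapshot per-user privileges before the swap (extend the privilege list as needed):

-- Not the original query: a simple snapshot of who can do what on atomic.events
SELECT u.usename,
       HAS_TABLE_PRIVILEGE(u.usename, 'atomic.events', 'select') AS can_select,
       HAS_TABLE_PRIVILEGE(u.usename, 'atomic.events', 'insert') AS can_insert,
       HAS_TABLE_PRIVILEGE(u.usename, 'atomic.events', 'delete') AS can_delete
FROM pg_user u
ORDER BY u.usename;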
What’s the Status of Tooling Support in Snowplow?
- igluctl: Now supports ZSTD as a compression option for DDL generation (since 2018)
- Schema Guru: Doesn’t auto-generate ZSTD yet, but can be integrated with igluctl workflows
For managed deployments, Snowplow BDP users can get support from the team for planning and executing ZSTD migrations.
Final Recommendations
- Use ZSTD for large VARCHAR and JSON-heavy fields
- Avoid compressing SORTKEYs
- Benchmark both compression ratio and query performance before deploying to production
- Always validate and backup before dropping or renaming key tables