Replacing Amazon Redshift with Apache Spark for Event Data Modeling
In the early days of Snowplow, we used Amazon Redshift as the default platform for transforming raw event data into structured tables for analysis. This worked well at first—but as our customers’ data volumes grew and the need for real-time or near-real-time processing increased, the limitations of Redshift's batch-oriented SQL processing became more apparent.
In this post, we’ll walk through how to migrate from Redshift-based modeling to Apache Spark. We’ll cover:
- Why Redshift may no longer meet your modeling needs
- The benefits of switching to Spark
- How to rewrite your modeling pipeline in Spark
- Tips for running your Spark jobs on EMR
Let’s dive in.
Why Move Away from Redshift?
Amazon Redshift is a great data warehouse, but it's optimized for OLAP-style queries, not for large-scale data transformation pipelines. Over time, we saw several challenges:
- Performance bottlenecks: Modeling large Snowplow datasets in Redshift led to long runtimes and occasional timeouts.
- Data mutability limitations: Redshift favors append-only workloads, and updates or deletes are expensive. Handling late-arriving events or reprocessing historical data becomes difficult.
- Lack of flexibility: Complex transformations, especially those involving custom logic or joins across large datasets, are difficult to optimize in SQL alone.
For customers running large-scale Snowplow pipelines (tens or hundreds of millions of events per day), we needed something more scalable and flexible.
Why Spark?
Apache Spark is a distributed computing engine designed for high-throughput, fault-tolerant data processing. It excels at:
- Handling large-scale transformations with ease
- Allowing rich, expressive logic via Python, Scala, or SQL
- Integrating seamlessly with cloud storage like S3
- Supporting both batch and streaming modes
Crucially, Spark fits well into Snowplow's modular architecture: events are loaded from S3 (via Snowplow's batch pipeline), transformed in Spark, and then written to downstream destinations like Snowflake or BigQuery.
Overview of the Spark-Based Modeling Approach
Let’s walk through the core components of a Spark-based modeling pipeline using Scala.
Step 1: Define the Data Model
Start by outlining your data model. For example, you might want to extract key dimensions and facts from raw Snowplow events, such as:
- Session-level aggregates
- User acquisition source
- Product views and purchases
Previously, this logic might have been implemented as SQL views in Redshift. Now, we'll define it as transformations in Spark.
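As a rough illustration, you can sketch the target shapes as plain case classes before writing any transformations. The field names below are hypothetical, not a Snowplow-defined schema:

// Hypothetical output shapes for the derived tables described above
case class SessionAggregate(
  domainUserid: String,
  sessionStart: java.sql.Timestamp,
  sessionEnd: java.sql.Timestamp,
  eventCount: Long
)

case class UserAcquisition(
  domainUserid: String,
  firstReferrer: Option[String],
  firstCampaign: Option[String]
)

Writing these shapes down makes it easier to compare the Spark output against the tables your existing Redshift views produce.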
Step 2: Load the Raw Snowplow Data
Snowplow stores enriched events in JSON format in S3. To load them in Spark:
// Read one run's worth of enriched events and cache them, since several
// models will be derived from the same DataFrame
val enriched = spark.read
  .json("s3://your-bucket/enriched/run=2025-05-01/")
  .cache()
You can optionally infer the schema, but for production use cases, we recommend defining the schema explicitly for stability and performance.
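For example, a trimmed-down explicit schema covering only the fields used in this post might look like the sketch below (the real Snowplow enriched event has many more fields):

import org.apache.spark.sql.types._

// Only the columns used in the examples that follow; extend as needed
val enrichedSchema = StructType(Seq(
  StructField("event", StringType),
  StructField("user_id", StringType),
  StructField("domain_userid", StringType),
  StructField("dvce_created_tstamp", TimestampType)
))

Passing this to the reader with .schema(enrichedSchema) skips the inference pass over S3 and keeps column types stable from run to run.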
Step 3: Transform the Data
Let’s say you want to extract page views per user:
import org.apache.spark.sql.functions._   // count, min, max, window, etc.
import spark.implicits._                  // enables the $"column" syntax

// Count page view events per user
val pageViews = enriched
  .filter($"event" === "page_view")
  .groupBy($"user_id")
  .agg(count("*").as("page_view_count"))
Or maybe you want to build a session table:
// Bucket each user's events into fixed 30-minute tumbling windows as a
// simple approximation of sessionization
val sessions = enriched
  .withColumn("event_tstamp", $"dvce_created_tstamp".cast("timestamp"))
  .withColumn("session_start", window($"event_tstamp", "30 minutes"))
  .groupBy("domain_userid", "session_start")
  .agg(
    min("event_tstamp").as("session_start_time"),
    max("event_tstamp").as("session_end_time"),
    count("*").as("event_count")
  )
These are the kinds of queries that are difficult to optimize in Redshift but run efficiently in Spark, especially when partitioned and cached properly.
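For example, if several models group by the same key, it can pay to shuffle the data once and reuse the prepared DataFrame, rather than letting each aggregation repartition the full dataset independently. A minimal sketch, reusing the columns from the examples above:

// Co-locate each user's events once, then reuse the prepared DataFrame
// across several aggregations
val prepared = enriched
  .repartition($"domain_userid")
  .persist()

Whether this actually helps depends on your data volume and cluster size, so check the Spark UI before and after.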
Step 4: Write the Results
You can write the transformed tables back to S3, a data warehouse, or a database:
// Persist each derived table as Parquet under its own S3 prefix
pageViews.write
  .mode("overwrite")
  .parquet("s3://your-bucket/models/page_views/")

sessions.write
  .mode("overwrite")
  .parquet("s3://your-bucket/models/sessions/")
Or use Spark connectors to write directly to Snowflake or BigQuery.
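For example, with the Snowflake Spark connector on the classpath, a write might look like the sketch below. The connection options are placeholders, and in practice you'd pull credentials from a secrets manager rather than hard-coding them:

// Placeholder connection details for the Snowflake Spark connector
val sfOptions = Map(
  "sfURL"       -> "your-account.snowflakecomputing.com",
  "sfUser"      -> "LOADER",
  "sfPassword"  -> sys.env("SNOWFLAKE_PASSWORD"),
  "sfDatabase"  -> "ANALYTICS",
  "sfSchema"    -> "DERIVED",
  "sfWarehouse" -> "TRANSFORM_WH"
)

pageViews.write
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "page_views")
  .mode("overwrite")
  .save()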
Running Spark Jobs on EMR
If you're using AWS, Amazon EMR is a great way to run Spark jobs at scale. Here's how we typically set it up:
- Define jobs as JARs (for Scala/Java) or Python scripts.
- Launch EMR clusters on-demand via the AWS CLI or Step Functions.
- Use Spot instances to save costs.
- Use S3 as the source and sink for all data.
- Log output to CloudWatch or S3 for observability.
You can automate this pipeline via Airflow, Step Functions, or even shell scripts, depending on your stack.
Tips for a Smooth Migration
- Start small: Begin by rewriting a single modeling step in Spark and validating the output.
- Use version control: Treat your data modeling logic like application code.
- Leverage schema inference cautiously: For stability, define schemas explicitly in production.
- Monitor and optimize: Use the Spark UI to find slow stages and optimize joins and shuffles (see the broadcast join sketch after this list).
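One common shuffle optimization is broadcasting a small lookup table into a join so the large event DataFrame never has to be shuffled. A sketch, where campaigns is a hypothetical small DataFrame of acquisition campaigns keyed by user_id:

import org.apache.spark.sql.functions.broadcast

// Broadcasting the small side avoids shuffling the large pageViews DataFrame
val pageViewsWithCampaigns = pageViews
  .join(broadcast(campaigns), Seq("user_id"), "left")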
Conclusion
Apache Spark offers a powerful, flexible, and scalable alternative to Redshift for event data modeling—especially for teams processing large Snowplow datasets. It enables more expressive logic, better performance, and lower operational headaches once set up.
By adopting Spark, you’ll unlock faster iterations and deeper insights from your behavioral data.