Replacing Amazon Redshift with Apache Spark for Event Data Modeling
In the early days of Snowplow, we used Amazon Redshift as the default platform for transforming raw event data into structured tables for analysis. This worked well at first—but as our customers’ data volumes grew and the need for real-time or near-real-time processing increased, the limitations of Redshift's batch-oriented SQL processing became more apparent.
In this post, we’ll walk through how to migrate from Redshift-based modeling to Apache Spark. We’ll cover:
- Why Redshift may no longer meet your modeling needs
- The benefits of switching to Spark
- How to rewrite your modeling pipeline in Spark
- Tips for running your Spark jobs on EMR
Let’s dive in.
Why Move Away from Redshift?
Amazon Redshift is a great data warehouse, but it's optimized for OLAP-style queries, not for large-scale data transformation pipelines. Over time, we saw several challenges:
- Performance bottlenecks: Modeling large Snowplow datasets in Redshift led to long runtimes and occasional timeouts.
- Data mutability limitations: Redshift favors append-only workloads, and updates or deletes are expensive. Handling late-arriving events or reprocessing historical data becomes difficult.
- Lack of flexibility: Complex transformations, especially those involving custom logic or joins across large datasets, are difficult to optimize in SQL alone.
For customers running large-scale Snowplow pipelines (tens or hundreds of millions of events per day), we needed something more scalable and flexible.
Why Spark?
Apache Spark is a distributed computing engine designed for high-throughput, fault-tolerant data processing. It excels at:
- Handling large-scale transformations with ease
- Allowing rich, expressive logic via Python, Scala, or SQL
- Integrating seamlessly with cloud storage like S3
- Supporting both batch and streaming modes
Crucially, Spark fits well into Snowplow's modular architecture: events are loaded from S3 (via Snowplow's batch pipeline), transformed in Spark, and then written to downstream destinations like Snowflake or BigQuery.
Overview of the Spark-Based Modeling Approach
Let’s walk through the core components of a Spark-based modeling pipeline using Scala.
Step 1: Define the Data Model
Start by outlining your data model. For example, you might want to extract key dimensions and facts from raw Snowplow events, such as:
- Session-level aggregates
- User acquisition source
- Product views and purchases
Previously, this logic might have been implemented as SQL views in Redshift. Now, we'll define it as transformations in Spark.
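As a rough illustration, you can sketch the target shapes as plain case classes before writing any transformations. The field names below are hypothetical, not a Snowplow-defined schema:

// Hypothetical output shapes for the derived tables described above
case class SessionAggregate(
  domainUserid: String,
  sessionStart: java.sql.Timestamp,
  sessionEnd: java.sql.Timestamp,
  eventCount: Long
)

case class UserAcquisition(
  domainUserid: String,
  firstReferrer: Option[String],
  firstCampaign: Option[String]
)

Writing these shapes down makes it easier to compare the Spark output against the tables your existing Redshift views produce.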
Step 2: Load the Raw Snowplow Data
Snowplow stores enriched events in JSON format in S3. To load them in Spark:
// Read one run's worth of enriched events and cache them, since several
// models will be derived from the same DataFrame
val enriched = spark.read
  .json("s3://your-bucket/enriched/run=2025-05-01/")
  .cache()
You can optionally infer the schema, but for production use cases, we recommend defining the schema explicitly for stability and performance.
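For example, a trimmed-down explicit schema covering only the fields used in this post might look like the sketch below (the real Snowplow enriched event has many more fields):

import org.apache.spark.sql.types._

// Only the columns used in the examples that follow; extend as needed
val enrichedSchema = StructType(Seq(
  StructField("event", StringType),
  StructField("user_id", StringType),
  StructField("domain_userid", StringType),
  StructField("dvce_created_tstamp", TimestampType)
))

Passing this to the reader with .schema(enrichedSchema) skips the inference pass over S3 and keeps column types stable from run to run.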
Step 3: Transform the Data
Let’s say you want to extract page views per user:
import org.apache.spark.sql.functions._   // count, min, max, window, etc.
import spark.implicits._                  // enables the $"column" syntax

// Count page view events per user
val pageViews = enriched
  .filter($"event" === "page_view")
  .groupBy($"user_id")
  .agg(count("*").as("page_view_count"))
Or maybe you want to build a session table:
// Bucket each user's events into fixed 30-minute tumbling windows as a
// simple approximation of sessionization
val sessions = enriched
  .withColumn("event_tstamp", $"dvce_created_tstamp".cast("timestamp"))
  .withColumn("session_start", window($"event_tstamp", "30 minutes"))
  .groupBy("domain_userid", "session_start")
  .agg(
    min("event_tstamp").as("session_start_time"),
    max("event_tstamp").as("session_end_time"),
    count("*").as("event_count")
  )
These are the kinds of queries that are difficult to optimize in Redshift but run efficiently in Spark, especially when partitioned and cached properly.
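For example, if several models group by the same key, it can pay to shuffle the data once and reuse the prepared DataFrame, rather than letting each aggregation repartition the full dataset independently. A minimal sketch, reusing the columns from the examples above:

// Co-locate each user's events once, then reuse the prepared DataFrame
// across several aggregations
val prepared = enriched
  .repartition($"domain_userid")
  .persist()

Whether this actually helps depends on your data volume and cluster size, so check the Spark UI before and after.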
Step 4: Write the Results
You can write the transformed tables back to S3, a data warehouse, or a database:
// Persist each derived table as Parquet under its own S3 prefix
pageViews.write
  .mode("overwrite")
  .parquet("s3://your-bucket/models/page_views/")

sessions.write
  .mode("overwrite")
  .parquet("s3://your-bucket/models/sessions/")
Or use Spark connectors to write directly to Snowflake or BigQuery.
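For example, with the Snowflake Spark connector on the classpath, a write might look like the sketch below. The connection options are placeholders, and in practice you'd pull credentials from a secrets manager rather than hard-coding them:

// Placeholder connection details for the Snowflake Spark connector
val sfOptions = Map(
  "sfURL"       -> "your-account.snowflakecomputing.com",
  "sfUser"      -> "LOADER",
  "sfPassword"  -> sys.env("SNOWFLAKE_PASSWORD"),
  "sfDatabase"  -> "ANALYTICS",
  "sfSchema"    -> "DERIVED",
  "sfWarehouse" -> "TRANSFORM_WH"
)

pageViews.write
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "page_views")
  .mode("overwrite")
  .save()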
Running Spark Jobs on EMR
If you're using AWS, Amazon EMR is a great way to run Spark jobs at scale. Here's how we typically set it up:
- Define jobs as JARs (for Scala/Java) or Python scripts.
- Launch EMR clusters on-demand via the AWS CLI or Step Functions.
- Use Spot instances to save costs.
- Use S3 as the source and sink for all data.
- Log output to CloudWatch or S3 for observability.
You can automate this pipeline via Airflow, Step Functions, or even shell scripts, depending on your stack.
Tips for a Smooth Migration
- Start small: Begin by rewriting a single modeling step in Spark and validating the output.
- Use version control: Treat your data modeling logic like application code.
- Leverage schema inference cautiously: For stability, define schemas explicitly in production.
- Monitor and optimize: Use the Spark UI to find slow stages and optimize joins and shuffles (see the broadcast join sketch after this list).
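One common shuffle optimization is broadcasting a small lookup table into a join so the large event DataFrame never has to be shuffled. A sketch, where campaigns is a hypothetical small DataFrame of acquisition campaigns keyed by user_id:

import org.apache.spark.sql.functions.broadcast

// Broadcasting the small side avoids shuffling the large pageViews DataFrame
val pageViewsWithCampaigns = pageViews
  .join(broadcast(campaigns), Seq("user_id"), "left")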
Conclusion
Apache Spark offers a powerful, flexible, and scalable alternative to Redshift for event data modeling—especially for teams processing large Snowplow datasets. It enables more expressive logic, better performance, and lower operational headaches once set up.
By adopting Spark, you’ll unlock faster iterations and deeper insights from your behavioral data.