Deploying Snowplow on Kubernetes: Technical Q&A for Data Engineers

By Snowplow Team
September 10, 2024

As data pipelines continue to move toward containerized, cloud-native infrastructure, many Snowplow users are looking to run their full pipeline on Kubernetes. From AWS EKS to GKE to on-prem clusters, this guide distills real-world insights and frequently asked questions from the Snowplow community to help you succeed with Kubernetes deployments.

Is it possible to run the entire Snowplow pipeline on Kubernetes?

Yes, Snowplow components work well in Kubernetes. Several community members and Snowplow engineers have successfully deployed the full pipeline—including collectors, enrichers, loaders, and real-time processing infrastructure—on Kubernetes clusters.

Popular components for containerized pipelines include:

  • stream-collector (Scala)

  • stream-enrich

  • Kafka-based message brokers (e.g., self-managed Apache Kafka, AWS MSK, or in-cluster Kafka via Strimzi)

  • Snowplow loaders (e.g., BigQuery, Postgres, Redshift)

  • Spark-based batch jobs (optional but supported)

One Snowplow user confirmed in 2022 that they were running the entire stack on Kubernetes, though they noted the complexity and custom engineering required to get everything working properly.
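To make this concrete, here is a minimal sketch of a Deployment for the stream collector. The image tag, port, and the `snowplow-collector-config` ConfigMap (assumed to hold a `collector.hocon`) are illustrative assumptions, not a blessed reference:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: snowplow-collector
spec:
  replicas: 2
  selector:
    matchLabels:
      app: snowplow-collector
  template:
    metadata:
      labels:
        app: snowplow-collector
    spec:
      containers:
        - name: collector
          image: snowplow/scala-stream-collector-kafka:3.2.0  # tag is an assumption
          args: ["--config", "/snowplow/config/collector.hocon"]
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: collector-config
              mountPath: /snowplow/config
      volumes:
        - name: collector-config
          configMap:
            name: snowplow-collector-config  # hypothetical ConfigMap with collector.hocon
```

The enrichers and loaders follow the same pattern: a Deployment per component, with the component's HOCON configuration mounted from a ConfigMap or Secret.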

Are there any existing Kubernetes YAML files or Helm charts available?

There are a few community and experimental resources available:

  • Helm Chart by @lukaspastva – This community-developed chart provides a starting point, though it currently lacks support for Kafka input in the Postgres loader.

  • Generic Helm charts used internally by Snowplow – These charts aren't yet published in a unified form, but Snowplow has confirmed that all components should now work on EKS.

  • Example YAMLs and Helm discussions in GitHub issues and Discourse posts.

For production-grade deployments, it’s common to customize charts to handle secrets, logging, monitoring (e.g., Prometheus, Grafana), and platform-specific settings.
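As a sketch of such customization, a values file might wire in secrets and Prometheus scrape annotations alongside the component settings. Every key name below is hypothetical (community charts differ); it only illustrates the shape of the overrides:

```yaml
collector:
  replicas: 3
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  envFrom:
    - secretRef:
        name: snowplow-credentials  # hypothetical Secret holding broker/cloud credentials
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
```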

What are some common challenges with deploying Snowplow on Kubernetes?

1. IAM Role Binding Issues (EKS):
The snowplow-collector-scala did not initially support IAM Roles for Service Accounts (IRSA), but this can be addressed by explicitly setting the pod's service account or enabling OIDC-based IAM roles on the cluster.
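A common workaround on EKS is to annotate a dedicated service account with an IAM role and have the pod run under it. The role ARN below is a placeholder:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: snowplow-collector
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/snowplow-collector  # placeholder ARN
```

The collector's pod template then sets `serviceAccountName: snowplow-collector`, and the EKS OIDC provider injects the credentials.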

2. Lack of Kafka Support in Some Loaders:
As noted by Lukas, the snowplow-postgres-loader currently does not support consuming from Kafka directly, which complicates pure Kubernetes + Kafka deployments.

3. Custom Logging & Metrics:
One user running Snowplow in their own data center noted having to extend Docker images to get correct log formatting and still faced challenges collecting metrics from all components.
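Since the Snowplow components are JVM applications using Logback, one lighter-weight alternative to rebuilding images is mounting a custom Logback configuration and pointing the JVM at it via an environment variable. The file path and volume name here are assumptions:

```yaml
          env:
            - name: JDK_JAVA_OPTIONS
              value: "-Dlogback.configurationFile=/snowplow/config/logback.xml"
          volumeMounts:
            - name: logging-config  # hypothetical ConfigMap containing logback.xml
              mountPath: /snowplow/config
```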

4. Lack of Unified Helm Charts:
Snowplow does not yet maintain a comprehensive Helm chart for the entire pipeline. Most users resort to managing individual components manually or writing custom charts.

What’s the best way to get started with Snowplow on Kubernetes?

1. Define your target stack:
Decide if you’re using Kafka (or Kinesis/PubSub), which loader (BigQuery, Redshift, Postgres), and where Spark fits in (if at all). This will help you decide which components to deploy.

2. Explore and adapt community charts:
Start with lukaspastva’s Helm chart or your own Helm templates. Even if incomplete, these provide a baseline for your setup.

3. Use IRSA and OIDC on EKS:
If you're deploying on AWS, follow best practices for IAM role bindings via OIDC to grant Kubernetes workloads access to S3, Kinesis, etc.
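Concretely, each workload that needs AWS access references an IRSA-enabled service account (one annotated with `eks.amazonaws.com/role-arn`) from its pod spec. The names and tag below are illustrative:

```yaml
    spec:
      serviceAccountName: snowplow-enrich  # service account annotated with an IAM role ARN
      containers:
        - name: enrich
          image: snowplow/snowplow-enrich-kafka:5.0.0  # tag is an assumption
```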

4. Monitor Discourse and GitHub:
Snowplow community discussions and GitHub issues often surface solutions to Kubernetes-specific edge cases (e.g., role bindings, container configuration, health checks).

Final Thoughts

Running Snowplow on Kubernetes is possible, increasingly common, and aligns well with cloud-native infrastructure best practices. However, success requires:

  • Careful orchestration of distributed components

  • Managing secrets and IAM correctly

  • Possibly building or extending Helm charts

  • Monitoring performance and metrics across the stack

While Snowplow does not yet offer an official Helm chart for the full pipeline, internal progress and community contributions continue to improve the deployment experience on Kubernetes.
