
Kinesis Streams vs. S3 Buckets: What’s the Best Choice for Your Snowplow Pipeline?

June 27, 2024

Snowplow pipelines built on AWS often leverage both Kinesis Streams and S3 Buckets for data processing and storage. But when should you use each, and how do they work together in a typical Snowplow setup?

In this guide, we break down the use cases, best practices, and technical considerations for integrating Kinesis Streams and S3 Buckets in your Snowplow pipeline.

Q: What are the key differences between Kinesis Streams and S3 Buckets?

Kinesis Streams and S3 Buckets serve different purposes in the data pipeline:

| Feature | Kinesis Streams | S3 Buckets |
| --- | --- | --- |
| Purpose | Real-time data streaming | Persistent data storage |
| Data Retention | 24 hours by default, extendable up to 365 days | Indefinite |
| Data Format | Serialized JSON, Protobuf, etc. | Raw files (JSON, Parquet, Avro, etc.) |
| Access Method | Real-time consumers (Lambda, Firehose) | Batch processing or data lake queries |
| Cost Structure | Pay per shard-hour + data transfer | Pay per storage volume + data requests |
| Latency | Milliseconds to seconds | Minutes to hours (depending on processing) |
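
To make the access pattern difference concrete, here is a minimal Python (boto3) sketch that writes the same event once as a Kinesis record and once as an S3 object. The stream name, bucket name, and object key are placeholders for illustration, not part of any standard Snowplow setup.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

event = {"event_id": "e1", "app_id": "web", "page_url": "https://example.com"}
payload = json.dumps(event).encode("utf-8")

# Kinesis: the record becomes available to stream consumers within milliseconds,
# but only for the stream's retention window.
kinesis.put_record(
    StreamName="snowplow-raw-good",   # placeholder stream name
    Data=payload,
    PartitionKey=event["event_id"],   # controls which shard the record lands on
)

# S3: the object is stored durably and indefinitely, but is read back in batch
# (by a loader or query engine) rather than pushed to consumers.
s3.put_object(
    Bucket="snowplow-raw-events",     # placeholder bucket name
    Key="raw/2024/06/27/e1.json",
    Body=payload,
)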

Q: How does Snowplow use Kinesis Streams in its real-time pipeline?

In a standard Snowplow real-time pipeline, Kinesis Streams are used to:

  1. Collect Raw Events

    • The Scala Stream Collector sends incoming events to a raw event stream (good stream).

  2. Handle Oversized or Invalid Events

    • Events that are too large or malformed are routed to a separate bad stream for monitoring and debugging.

  3. Stream Enrichment

    • The Stream Enrich application reads events from the raw event stream, processes them, and outputs enriched data to another Kinesis stream.

Why use Kinesis here?

  • Low latency, real-time processing
  • Fault-tolerant architecture
  • Supports multiple consumers (e.g., Stream Enrich, S3 Loader, Firehose); a minimal consumer sketch follows below
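
As a rough illustration of what a downstream consumer sees, the following Python (boto3) sketch reads a handful of records from an enriched stream. The stream name is a placeholder, and production consumers (KCL-based applications such as Stream Enrich or the S3 Loader) additionally handle checkpointing and resharding.

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "snowplow-enriched-good"  # placeholder stream name

# Each shard is read independently via its own iterator.
shards = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"]

for shard in shards:
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest retained record
    )["ShardIterator"]

    batch = kinesis.get_records(ShardIterator=iterator, Limit=10)
    for record in batch["Records"]:
        # Enriched Snowplow events arrive as tab-separated values in the payload.
        print(record["SequenceNumber"], record["Data"][:80])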

Q: Why use S3 in a Snowplow pipeline?

While Kinesis is optimized for streaming data, S3 is essential for persistent storage and batch processing.

Common S3 Use Cases in Snowplow Pipelines:

  1. Raw Event Backup:

    • Before processing, raw events can be persisted to S3 as a fallback to prevent data loss in case of stream processing issues.

  2. Enriched Data Storage:

    • After enrichment, events are written to S3 to serve as input for downstream processing (e.g., data modeling, Snowflake loader).

  3. Data Lake Integration:

    • S3 can act as a central data lake, making both raw and enriched data easy to access for data warehousing or machine learning; a short read-back sketch follows below.
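
The sketch below (Python, boto3) shows the read-back side of this: listing enriched files under an S3 prefix and pulling one object for inspection. The bucket and prefix are placeholders that would come from your own S3 Loader configuration, and the gzip handling is an assumption about the loader's output format.

import gzip
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

BUCKET = "snowplow-enriched-data"  # placeholder bucket name
PREFIX = "enriched/"               # placeholder prefix

# List a few of the enriched files written by the S3 Loader.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Read one object back; enriched Snowplow events are TSV lines, and may be
# gzip-compressed depending on the loader's output settings.
if response.get("Contents"):
    key = response["Contents"][0]["Key"]
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    if key.endswith(".gz"):
        body = gzip.decompress(body)
    print(body.decode("utf-8").splitlines()[:3])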

Q: How do I handle oversized events in a Kinesis Stream?

Snowplow handles oversized events by routing them to the bad stream. However, if an event is too large to fit into a single Kinesis record (the payload limit is 1 MB), consider these options:

  • S3 Loader: Configure the S3 Loader to capture these large events and store them in S3, where they can be reviewed and processed later.

  • Kinesis Firehose: If using Firehose, configure it to write these events to S3 automatically; delivery latency then depends on Firehose's buffering settings.

Example S3 Loader Configuration:

sink {
  enabled = s3
  region = "us-east-1"
  bucket = "snowplow-raw-events"
  directory = "oversized-events/"
  format = "json"
}

Q: Can I use Kinesis Firehose to write directly to S3?

Yes, you can use Kinesis Firehose to write directly to S3, but there are some caveats:

  • Format handling: By default, Firehose delivers record bytes as-is; converting incoming JSON to columnar formats such as Parquet requires enabling record format conversion explicitly.

  • Batch processing: Firehose buffers records before sending them to S3, adding potential latency.

  • Transformation: Firehose can optionally use Lambda functions to process or reformat records before writing them to S3.

Recommended Use Case: Use Firehose for simple data storage needs. For complex transformations or enrichment, stick to the Snowplow S3 Loader.
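
To show where the buffering described above is configured, here is a hedged Python (boto3) sketch that creates a Firehose delivery stream reading from a Kinesis stream and writing gzip-compressed batches to S3. The stream, bucket, and role ARNs are placeholders, and the IAM role must already exist with read access to the stream and write access to the bucket.

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Placeholder ARNs for illustration only.
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/snowplow-enriched-good"
ROLE_ARN = "arn:aws:iam::123456789012:role/firehose-delivery-role"
BUCKET_ARN = "arn:aws:s3:::snowplow-enriched-data"

firehose.create_delivery_stream(
    DeliveryStreamName="snowplow-enriched-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": STREAM_ARN,
        "RoleARN": ROLE_ARN,
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": ROLE_ARN,
        "BucketARN": BUCKET_ARN,
        "Prefix": "enriched/firehose/",
        # Buffering is where the extra latency comes from: Firehose flushes to S3
        # when either the size or the time threshold is reached.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)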

Q: Where should I run the S3 Loader in a Snowplow pipeline?

The S3 Loader can be deployed in various configurations:

  • Same instance as the collector: If the volume of data is low and latency is not a concern, the S3 Loader can run on the same EC2 instance as the collector.

  • Dedicated EC2 instance: For higher throughput, use a separate instance or auto-scaling group to handle large data volumes.

  • Containerized deployments: Consider running the S3 Loader as a Docker container or within AWS Fargate, especially if your architecture is containerized.

Q: Do I need separate instances for each S3 Loader?

Not necessarily. You can run multiple S3 Loaders on the same instance as long as you configure each to target different Kinesis streams or S3 directories. However, for high throughput, consider isolating loaders by event type (e.g., good, bad, enriched).

Example S3 Loader Configuration for Enriched Data:

sink {
  enabled = s3
  region = "us-east-1"
  bucket = "snowplow-enriched-data"
  directory = "enriched/"
  format = "json"
}

Final Thoughts

Kinesis Streams and S3 Buckets are complementary components in a Snowplow pipeline:

  • Kinesis Streams handle real-time data processing with low latency, ensuring fast ingestion and enrichment.

  • S3 Buckets provide reliable, long-term storage for both raw and enriched data, enabling downstream processing and data lake integration.

By understanding the strengths and best practices for each service, you can design a robust, scalable, and fault-tolerant Snowplow pipeline that meets your data needs.
