
Snowplow Docker images released

By Snowplow Team
October 13, 2017

We are thrilled to announce the first batch of official Docker images for Snowplow. This first release focuses on laying the foundations for running a Snowplow real-time pipeline in a Docker-containerized environment. As a result, this release includes images for:

  • the Scala Stream Collector
  • Stream Enrich
  • the Snowplow Elasticsearch Loader
  • the Snowplow S3 Loader

Bringing Docker support to Snowplow has been a real community effort - huge thanks to Joshua Cox, Tamas Szuromi and last but not least Daniel Zohar for their heroic contributions here.

In this post, we will cover:

  1. Why provide Docker images?
  2. A foundation common to all images
  3. Real-time pipeline images
  4. Docker Compose example
  5. Future work
  6. Contributing

1. Why provide Docker images?

Snowplow community members have been experimenting with building their own Docker images for Snowplow for some time. Our decision to bring this “in house” and start publishing and maintaining our own official images is based on a few factors.

An important reason is the ease of distributing and scheduling Snowplow through container orchestrators such as Kubernetes, Nomad, OpenShift or Docker Swarm. Providing officially supported images should help reduce the friction in adopting these platforms for Snowplow real-time pipeline users.

Another argument can be made for resource efficiency. For example, running two instances of Stream Enrich traditionally requires two separate boxes, each incurring its own OS overhead. Moving to containers should allow you to run both instances on the same box, giving higher resource utilization.

But most fundamentally, providing Docker images for the Snowplow real-time pipeline is part of a broader move on our side to formalize it as an asynchronous, micro-services-based architecture.

Micro-services architectures are growing in popularity, and the Snowplow real-time pipeline is an example of a platform built out of a set of asynchronously connected micro-services. Asynchronous means that none of our apps have any direct coupling with each other - instead they all rely on an overarching streaming abstraction such as Kinesis or Apache Kafka to communicate. These kinds of architectures are very often containerized using Docker to ease deployment and scheduling.

Official Docker images have been a long-requested feature - we’re excited to finally be providing these to the community!

2. A foundation common to all Snowplow images

In this section, we’ll detail a few technical aspects we’ve taken care of to ensure reliable and performant images.

Every Snowplow Docker image is based on our own base image, which leverages the Java 8 Alpine image.

Thanks to this base image, every component runs under dumb-init, which handles reaping zombie processes and forwarding signals to all processes running in the container. The images also use su-exec as a sudo replacement, running each component as the non-root snowplow user.
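A quick way to see this in practice is to list the processes inside a running container: you should see dumb-init as PID 1 and the component's JVM running as the snowplow user. The container name below is a hypothetical placeholder:

# "my-collector" is a placeholder - substitute the name of a running container
docker exec my-collector ps -o pid,user,args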

Each container exposes the /snowplow/config volume to store the component’s configuration. If this folder is bind-mounted then ownership will be changed to the snowplow user.

The -XX:+UnlockExperimentalVMOptions and -XX:+UseCGroupMemoryLimitForHeap JVM options are automatically provided when launching any component in order to make the JVM adhere to the memory limits imposed by Docker; for more information, see this article.

Finally, if you want to manually tune certain aspects of the JVM, additional options can be set through the SP_JAVA_OPTS environment variable when launching a container.
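For example, here is a sketch of combining a Docker memory limit with custom JVM options; the 1g limit, heap sizes and GC flag are arbitrary placeholders rather than recommendations:

# Cap the container at 1 GB of memory; the cgroup-aware options set by the image
# keep the JVM heap within this limit
docker run -d --memory 1g \
  -v ${PWD}/config:/snowplow/config \
  -p 80:80 \
  -e 'SP_JAVA_OPTS=-Xms512m -Xmx512m -XX:+PrintGCDetails' \
  snowplow-docker-registry.bintray.io/snowplow/scala-stream-collector:0.10.0 \
  --config /snowplow/config/config.hocon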

3. Real-time pipeline images

As mentioned above, this release includes images for the Snowplow real-time pipeline. In this section, we’ll cover each of these in turn.

Note that all of these images are hosted in our snowplow-docker-registry.bintray.io Docker registry.

3.1 Scala Stream Collector

You can pull and run the image with:

docker pull snowplow-docker-registry.bintray.io/snowplow/scala-stream-collector:0.10.0
docker run -d \
  -v ${PWD}/config:/snowplow/config \
  -p 80:80 \
  -e 'SP_JAVA_OPTS=-Xms512m -Xmx512m' \
  snowplow-docker-registry.bintray.io/snowplow/scala-stream-collector:0.10.0 \
  --config /snowplow/config/config.hocon

In the above, we’re assuming that there is a valid Scala Stream Collector configuration located in the config folder in the current directory.

Alternatively, you can build the image yourself:

docker pull snowplow-docker-registry.bintray.io/snowplow/base:0.1.0
docker build -t snowplow/scala-stream-collector:0.10.0 scala-stream-collector/0.10.0

The above assumes that you’ve cloned the repository.

This image was contributed by Joshua Cox, huge thanks Josh!

3.2 Stream Enrich

We can pull the image and launch a container with:

docker pull snowplow-docker-registry.bintray.io/snowplow/stream-enrich:0.11.1
docker run -d \
  -v ${PWD}/config:/snowplow/config \
  -e 'SP_JAVA_OPTS=-Xms512m -Xmx512m' \
  snowplow-docker-registry.bintray.io/snowplow/stream-enrich:0.11.1 \
  --config /snowplow/config/config.hocon \
  --resolver file:/snowplow/config/resolver.json \
  --enrichments file:/snowplow/config/enrichments/ \
  --force-ip-lookups-download

Here we’re assuming a valid Stream Enrich configuration as well as a resolver and enrichments in the config directory.
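Concretely, the bind-mounted config directory for the command above would be laid out along these lines (the enrichment file name is just an example):

config/
├── config.hocon        # Stream Enrich configuration
├── resolver.json       # Iglu resolver configuration
└── enrichments/
    └── ip_lookups.json # one JSON file per enabled enrichment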

The Stream Enrich image was written by Daniel Zohar. Big thanks to Daniel for this image and all the advice that he’s given us on our Docker journey!

3.3 Snowplow Elasticsearch Loader

As before, we can pull and run the image with the following:

docker pull snowplow-docker-registry.bintray.io/snowplow/elasticsearch-loader:0.10.1
docker run -d \
  -v ${PWD}/config:/snowplow/config \
  -e 'SP_JAVA_OPTS=-Xms512m -Xmx512m' \
  snowplow-docker-registry.bintray.io/snowplow/elasticsearch-loader:0.10.1 \
  --config /snowplow/config/config.hocon

Refer to the Elasticsearch Loader configuration example as required.

3.4 Snowplow S3 Loader

As before:

docker pull snowplow-docker-registry.bintray.io/snowplow/s3-loader:0.6.0
docker run -d \
  -v ${PWD}/config:/snowplow/config \
  -e 'SP_JAVA_OPTS=-Xms512m -Xmx512m' \
  snowplow-docker-registry.bintray.io/snowplow/s3-loader:0.6.0 \
  --config /snowplow/config/config.hocon

Check out the S3 Loader config example to remind yourself of the format.

4. Docker Compose example

To help you get started, there is also a Docker Compose example, which runs one container for the Scala Stream Collector and another for Stream Enrich.

As is, the provided configurations make the following assumptions (example commands for creating the streams follow the list):

  • The snowplow-raw Kinesis stream exists and is used to store the collected events
  • The snowplow-enriched Kinesis stream exists and is used to store the enriched events
  • The snowplow-bad Kinesis stream exists and is used to store the events which failed validation
  • All those streams are located in the us-east-1 region
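If these Kinesis streams do not exist yet, they can be created with the AWS CLI; the shard counts below are placeholders:

aws kinesis create-stream --stream-name snowplow-raw --shard-count 1 --region us-east-1
aws kinesis create-stream --stream-name snowplow-enriched --shard-count 1 --region us-east-1
aws kinesis create-stream --stream-name snowplow-bad --shard-count 1 --region us-east-1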

Feel free to modify the given configuration files to suit your needs. This Docker Compose example is provided to illustrate how you can start to compose our Snowplow containers together; it is not intended to be a reference or production-ready deployment.
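For reference, a docker-compose.yml wiring these two containers together might look roughly like the sketch below; the service names, volume paths and options are illustrative assumptions, and the actual file in the repository may differ:

version: "3"

services:
  scala-stream-collector:
    image: snowplow-docker-registry.bintray.io/snowplow/scala-stream-collector:0.10.0
    command: [ "--config", "/snowplow/config/config.hocon" ]
    ports:
      - "80:80"
    volumes:
      - ./scala-stream-collector-config:/snowplow/config
    environment:
      - "SP_JAVA_OPTS=-Xms512m -Xmx512m"

  stream-enrich:
    image: snowplow-docker-registry.bintray.io/snowplow/stream-enrich:0.11.1
    command: [ "--config", "/snowplow/config/config.hocon",
               "--resolver", "file:/snowplow/config/resolver.json" ]
    volumes:
      - ./stream-enrich-config:/snowplow/config
    environment:
      - "SP_JAVA_OPTS=-Xms512m -Xmx512m"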

The containers can be launched with:

docker swarm init
docker stack deploy -c docker-compose.yml snowplow-realtime

A Scala Stream Collector and a Stream Enrich container are now running!
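You can check that both services are up with:

docker stack services snowplow-realtime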

If you want to stop them:

docker stack rm snowplow-realtime

The Docker Compose example was contributed by Tamas Szuromi, thanks Tamas!

5. Future work

This release is just the beginning of a huge amount of experimentation around Docker, containerized environments and container scheduling that we are embarking on here at Snowplow.

Within the snowplow-docker project, we are planning on publishing our images directly to Docker Hub, as well as adding other images related to Iglu and RDB Loader.

Within Snowplow Mini, we have firm plans to swap out our current architecture for a Docker Compose-based composition of the various services that make up Snowplow Mini. See the Docker milestone for more details.

And of course if there are other aspects of containerization that you would like us to explore, please let us know!

6. Contributing

Please check out the repository and the open issues if you’d like to get involved!

If you have any questions or run into any problems, please visit our Discourse forum.
