Snowplow Google Cloud Storage Loader 0.1.0 released


We are pleased to release the first version of the Snowplow Google Cloud Storage Loader. This application reads data from a Google Pub/Sub topic and writes it to a Google Cloud Storage bucket. It is an essential component of the Snowplow for GCP stack we are launching: it enables users to sink any bad data from Pub/Sub to Cloud Storage, from where it can be reprocessed, and to sink the raw or enriched data to Cloud Storage as a permanent backup.
Please read on after the fold for:
- An overview of the Snowplow Google Cloud Storage Loader
- Running the Snowplow Google Cloud Storage Loader
- The GCP roadmap
- Getting help
1. An overview of the Snowplow Google Cloud Storage Loader
The Snowplow Google Cloud Storage Loader is a Cloud Dataflow job which:
- Consumes the contents of a Pub/Sub topic through an input subscription
- Groups the records by a configurable time window
- Writes the records into a Cloud Storage bucket
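To make these three steps concrete, here is a minimal sketch of such a pipeline using the Apache Beam Java SDK. This is illustrative only and is not the loader's actual source code; the subscription, bucket, and five-minute window below are placeholder values.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class GcsLoaderSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        // 1. Consume the contents of a Pub/Sub topic through an input subscription
        .apply("ReadFromPubsub", PubsubIO.readStrings()
            .fromSubscription("projects/placeholder-project/subscriptions/placeholder-subscription"))
        // 2. Group the records into a configurable time window (5 minutes here)
        .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
        // 3. Write the records into a Cloud Storage bucket, one file per window and shard
        .apply("WriteToGcs", TextIO.write()
            .to("gs://placeholder-bucket/output")
            .withSuffix(".txt")
            .withWindowedWrites()
            .withNumShards(1));

    pipeline.run();
  }
}
```

Note that Beam's file writes on a streaming (unbounded) input require windowed writes and an explicit shard count, which is one reason the loader exposes the window duration and number of shards as options.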
As part of a Snowplow installation on GCP, this loader is particularly useful to archive data from the raw, enriched, or bad streams to long-term storage.
It can additionally partition the output bucket by date (up to the hour), making it faster and less expensive to query the data over particular time periods. The following is an example layout of the output bucket:
- gs://output-bucket/
  - 2018/
    - 11/
      - 01/
        - output-2018-11-01T15:25:00.000Z-2018-11-01T15:30:00.000Z-pane-0-last-00000-of-00001.txt
Note that every part of the filename is configurable:
- `output` corresponds to the filename prefix
- `2018-11-01T15:25:00.000Z-2018-11-01T15:30:00.000Z-pane-0-last-00000-of-00001` is the shard template, which can be further broken down as:
  - `2018-11-01T15:25:00.000Z-2018-11-01T15:30:00.000Z`, the time window
  - `pane-0-last`, the pane label, where panes refer to the data actually emitted after aggregation in a time window
  - `00000-of-00001`, the shard index and the total number of shards respectively, where shards refer to the number of files produced per window, which is also configurable
- `.txt` is the filename suffix
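As an illustration of where these pieces plug in, the sketch below configures a Beam-style windowed text write with a prefix, shard template, and suffix. It is illustrative only, with placeholder paths and values, and is not the loader's actual wiring.

```java
import org.apache.beam.sdk.io.TextIO;

public class FilenameConfigSketch {
  // Illustrative only: how prefix, shard template and suffix combine for windowed writes.
  // In the shard template, W is replaced by the time window, P by the pane label,
  // and SSSSS-of-NNNNN by the shard index and total number of shards.
  static TextIO.Write configureWrite() {
    return TextIO.write()
        .to("gs://output-bucket/2018/11/01/output")   // filename prefix, under the date-partitioned path
        .withShardNameTemplate("-W-P-SSSSS-of-NNNNN") // shard template
        .withSuffix(".txt")                           // filename suffix
        .withWindowedWrites()
        .withNumShards(1);                            // number of files produced per window
  }
}
```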
If the notions of windows or panes are still relatively new to you, we recommend reading Tyler Akidau's Streaming 101 and Streaming 102 article series, which details the different Cloud Dataflow capabilities with regard to streaming and windowing.
Finally, the loader can optionally compress data in gzip or bz2. Note that bz2-compressed data can’t be loaded directly into BigQuery.
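For reference, this is roughly how output compression is selected on a Beam text write; again a hedged sketch with placeholder values, not the loader's own code.

```java
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;

public class CompressionSketch {
  // Sketch: gzip-compressed windowed text output. Compression.BZIP2 is the bz2
  // counterpart, but (as noted above) bz2 files cannot be loaded directly into BigQuery.
  static TextIO.Write gzipWrite() {
    return TextIO.write()
        .to("gs://output-bucket/output")
        .withCompression(Compression.GZIP)
        .withWindowedWrites();
  }
}
```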
2. Running the Snowplow Google Cloud Storage Loader
The Google Cloud Storage Loader comes as a ZIP archive, a Docker image, or a Cloud Dataflow template; feel free to choose whichever best fits your use case.
2.1 Running through the template
You can run Dataflow templates using a variety of means:
- Using the GCP console
- Using `gcloud` at the command line
- Using the REST API
Refer to the documentation on executing templates to learn more.
Here, we provide an example using `gcloud`:
```
gcloud dataflow jobs run ${JOB_NAME} \
  --gcs-location gs://snowplow-hosted-assets/4-storage/snowplow-google-cloud-storage-loader/0.1.0/SnowplowGoogleCloudStorageLoaderTemplate-0.1.0 \
  --parameters \
    inputSubscription=projects/${PROJECT}/subscriptions/${SUBSCRIPTION},\
    outputDirectory=gs://${BUCKET}/YYYY/MM/dd/HH/,\ # partitions by date
    outputFilenamePrefix=output,\ # optional
    shardTemplate=-W-P-SSSSS-of-NNNNN,\ # optional
    outputFilenameSuffix=.txt,\ # optional
    windowDuration=5,\ # optional, in minutes
    compression=none,\ # optional, gzip, bz2 or none
    numShards=1 # optional
```
Make sure to set all the `${}` environment variables included above.
2.2 Running through the zip archive
You can find the archive hosted on our Bintray.
Once unzipped, the artifact can be run as follows:
```
./bin/snowplow-google-cloud-storage-loader \
  --runner=DataFlowRunner \
  --project=${PROJECT} \
  --streaming=true \
  --zone=europe-west2-a \
  --inputSubscription=projects/${PROJECT}/subscriptions/${SUBSCRIPTION} \
  --outputDirectory=gs://${BUCKET}/YYYY/MM/dd/HH/ \ # partitions by date
  --outputFilenamePrefix=output \ # optional
  --shardTemplate=-W-P-SSSSS-of-NNNNN \ # optional
  --outputFilenameSuffix=.txt \ # optional
  --windowDuration=5 \ # optional, in minutes
  --compression=none \ # optional, gzip, bz2 or none
  --numShards=1 # optional
```
Make sure to set all the `${}` environment variables included above.
To display the help message:
```
./bin/snowplow-google-cloud-storage-loader --help
```
To display documentation about Cloud Storage Loader-specific options:
```
./bin/snowplow-google-cloud-storage-loader --help=com.snowplowanalytics.storage.googlecloudstorage.loader.Options
```
2.3 Running through the Docker image
You can also find the Docker image on our Bintray.
A container can be run as follows:
```
docker run \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \ # if running outside GCP
  snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 \
  --runner=DataFlowRunner \
  --job-name=${JOB_NAME} \
  --project=${PROJECT} \
  --streaming=true \
  --zone=${ZONE} \
  --inputSubscription=projects/${PROJECT}/subscriptions/${SUBSCRIPTION} \
  --outputDirectory=gs://${BUCKET}/YYYY/MM/dd/HH/ \ # partitions by date
  --outputFilenamePrefix=output \ # optional
  --shardTemplate=-W-P-SSSSS-of-NNNNN \ # optional
  --outputFilenameSuffix=.txt \ # optional
  --windowDuration=5 \ # optional, in minutes
  --compression=none \ # optional, gzip, bz2 or none
  --numShards=1 # optional
```
Make sure to set all the `${}` environment variables included above.
To display the help message:
```
docker run snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 --help
```
To display documentation about Cloud Storage Loader-specific options:
```
docker run snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 --help=com.snowplowanalytics.storage.googlecloudstorage.loader.Options
```
2.4 Additional options
A full list of all the Beam CLI options can be found in the documentation for Google Cloud Dataflow.
3. GCP roadmap
For those of you following our Google Cloud Platform progress, make sure you read our official GCP launch announcement and our BigQuery Loader release post.
We plan to shortly release:
- A new event recovery workflow
- A new release of Beam Enrich, which integrates with the Cloud Dataflow templates mentioned above; you can check out the original blog post for Snowplow R110 Valle dei Templi to find out more about Beam Enrich
4. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.