Snowplow R101 Neapolis released with initial GCP support


We are tremendously excited to announce the release of Snowplow R101 Neapolis. This realtime release marks the first step in our journey towards making Snowplow Google Cloud Platform-native, evolving it into a truly multi-cloud platform.
Read on for more information on R101 Neapolis, named after the archeological site in Sicily, Italy where the Greek theatre of Syracuse is located:
- Why bring Google Cloud Platform support to Snowplow?
- Adding GCP support to Stream Collector and Stream Enrich
- Dedicated artifacts for each platform
- Miscellaneous changes
- Upgrading
- Roadmap
- Getting help
1. Why bring Google Cloud Platform support to Snowplow?
Historically, the Snowplow platform has been closely tied to Amazon Web Services and, to a lesser extent, to on-premise deployments (largely through Apache Kafka support). To make the platform accessible to all, it is important to make it as cloud-agnostic as possible.
This process begins with porting the realtime pipeline to Google Cloud Platform, the hugely popular public cloud offering – and this release is the first step on this journey.
The next releases on our journey to GCP will focus on porting the streaming enrichment process to Google Cloud Dataflow and making loading Snowplow events into BigQuery a reality.
For more information regarding our overall plans for Google Cloud Platform, please check out our RFC on the subject.
2. Adding GCP support to Stream Collector and Stream Enrich
To those familiar with the current Snowplow streaming pipeline architecture, this release will look straightforward: we simply take our existing components and add support for publishing and subscribing to Google Cloud PubSub topics.
Specifically, we took our Scala Stream Collector and added support for publishing raw Snowplow events to a Google Cloud PubSub topic. Similarly, we updated our Stream Enrich component to read those raw events off Google Cloud PubSub, enrich them and publish them back to another PubSub topic.
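To give a feel for the PubSub side of this, here is a minimal sketch of publishing a payload to a topic with the Google Cloud PubSub Java client. It is not taken from the Snowplow codebase, and the project and topic names are hypothetical placeholders:

```scala
import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{PubsubMessage, TopicName}

object RawEventPublisher extends App {
  // Hypothetical project and topic names, for illustration only
  val topic     = TopicName.of("my-gcp-project", "snowplow-raw")
  val publisher = Publisher.newBuilder(topic).build()
  try {
    val payload = ByteString.copyFromUtf8("serialized-raw-event")
    val message = PubsubMessage.newBuilder().setData(payload).build()
    // publish() returns an ApiFuture holding the PubSub message id
    val messageId = publisher.publish(message).get()
    println(s"published message $messageId")
  } finally publisher.shutdown()
}
```

Stream Enrich sits on the other side of such a topic: it subscribes to the raw stream, enriches each event and publishes the result to a second topic.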
We have written dedicated setup guides for these microservices on our wiki.
Huge thanks to our former intern, Guilherme Pires, for laying the foundations for this release.
3. Dedicated artifacts for each platform
As we go multi-cloud and support a growing number of different platforms, it is becoming increasingly important to split the different artifacts according to their targeted streaming technology, in order to:
- Keep the JAR sizes from getting out of hand
- Prevent a combinatorial explosion of different source and sink technologies requiring testing (e.g. Amazon Kinesis source to Google Cloud PubSub sink)
Therefore, from this release onwards, there will be five different artifacts for the Scala Stream Collector, and five for Stream Enrich.
For the Scala Stream Collector:
| JAR | Targeted platform |
|---|---|
| snowplow-stream-collector-google-pubsub-version.jar | Google Cloud PubSub |
| snowplow-stream-collector-kinesis-version.jar | Amazon Kinesis |
| snowplow-stream-collector-kafka-version.jar | Apache Kafka |
| snowplow-stream-collector-nsq-version.jar | NSQ |
| snowplow-stream-collector-stdout-version.jar | stdout |
For Stream Enrich:
| JAR | Targeted platform |
|---|---|
| snowplow-stream-enrich-google-pubsub-version.jar | Google Cloud PubSub |
| snowplow-stream-enrich-kinesis-version.jar | Amazon Kinesis |
| snowplow-stream-enrich-kafka-version.jar | Apache Kafka |
| snowplow-stream-enrich-nsq-version.jar | NSQ |
| snowplow-stream-enrich-stdin-version.jar | stdin/stdout |
This approach reduces artifact size and simplifies testing, at the cost of some flexibility for Stream Enrich. If you were previously running a “hybrid-cloud” Stream Enrich (reading and writing to different streaming technologies), then we suggest setting up a dedicated app downstream of Stream Enrich to bridge the enriched events to the other stream system.
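For those in that situation, the bridge app can be very small. The sketch below is purely illustrative and not a Snowplow component: the `Source` and `Sink` traits are hypothetical stand-ins for real Kinesis, Kafka, NSQ or PubSub clients, with in-memory implementations so the example runs on its own.

```scala
// Conceptual sketch of a "bridge" app that reads enriched events from one
// stream technology and forwards them to another.
trait Source { def poll(): Seq[String] }
trait Sink   { def write(events: Seq[String]): Unit }

// In-memory stand-ins for real stream clients
final class InMemorySource(batches: Iterator[Seq[String]]) extends Source {
  def poll(): Seq[String] = if (batches.hasNext) batches.next() else Seq.empty
}
final class LoggingSink extends Sink {
  def write(events: Seq[String]): Unit = events.foreach(e => println(s"forwarded: $e"))
}

object EnrichedEventBridge extends App {
  val source: Source = new InMemorySource(Iterator(Seq("event-1", "event-2")))
  val sink: Sink     = new LoggingSink
  // A real bridge would loop forever; here it simply drains the test batches
  Iterator.continually(source.poll()).takeWhile(_.nonEmpty).foreach(sink.write)
}
```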
4. Miscellaneous changes
4.1 Exposing the number of requests made to the collector through JMX
Thanks to GitHub user jspc, the Scala Stream Collector now exposes some valuable metrics through JMX via the new MBean `com.snowplowanalytics.snowplow:type=StreamCollector`, containing the following attributes:
- `Requests`: total number of requests
- `SuccessfulRequests`: total number of successful requests
- `FailedRequests`: total number of failed requests
You can turn on JMX by launching the collector in the following manner:
```bash
java -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.local.only=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -jar snowplow-stream-collector-google-pubsub-0.13.0.jar --config config.hocon
```
For more information on setting up JMX, refer to this guide.
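As an illustration only (not part of the release), the following sketch reads those attributes over a remote JMX connection, assuming the collector was launched with the flags above (port 9010, authentication and SSL disabled):

```scala
import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

object CollectorMetrics extends App {
  // Assumes the collector is running locally with remote JMX on port 9010
  val url       = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi")
  val connector = JMXConnectorFactory.connect(url)
  try {
    val server = connector.getMBeanServerConnection
    val bean   = new ObjectName("com.snowplowanalytics.snowplow:type=StreamCollector")
    // The three attributes exposed by the collector
    Seq("Requests", "SuccessfulRequests", "FailedRequests").foreach { attr =>
      println(s"$attr = ${server.getAttribute(bean, attr)}")
    }
  } finally connector.close()
}
```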
4.2 Upgrading to Kafka 1.0.1
We’ve taken advantage of this release to upgrade the Kafka artifacts to Kafka 1.0.1.
5. Upgrading
5.1 Scala Stream Collector
The latest version of the Scala Stream Collector is available from our Bintray here.
A complete setup guide for running the Scala Stream Collector on GCP can be found on our wiki.
5.1.1 Updating the configuration
For non-Google Cloud PubSub users, the only minor change was made to the collector.crossDomain
section: it’s now non-optional but has an enabled
flag:
```hocon
crossDomain {
  enabled = false
  domain = "acme.com"
  secure = true
}
```
However, if you want to leverage Google Cloud PubSub, you’ll need to change the `collector.streams.sink` section to something akin to the following:
```hocon
sink {
  enabled = googlepubsub
  googleProjectId = ID
  # values are in milliseconds
  backoffPolicy {
    minBackoff = 50
    maxBackoff = 1000
    totalBackoff = 10000 # must be >= 10000
    multiplier = 2
  }
}
```
For a complete example, see our sample config.hocon template.
If you’re running the collector from a GCP instance in the same project, authentication will be transparently taken care of for you.
If not, you’ll need to run the following to authenticate using GCP’s CLI gcloud:
```bash
gcloud auth login
gcloud auth application-default login
```
Regarding `backoffPolicy`: if sinking a raw event to PubSub fails, the first retry will happen after `minBackoff` milliseconds. For subsequent failures, the backoff is multiplied by `multiplier` each time until it reaches its cap of `maxBackoff` milliseconds. If the total time spent backing off exceeds `totalBackoff` milliseconds, the application will shut down.
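As a purely illustrative sketch (not the collector’s actual code), the following shows how the retry delays grow under the sample values above (`minBackoff = 50`, `maxBackoff = 1000`, `multiplier = 2`, `totalBackoff = 10000`):

```scala
object BackoffSchedule extends App {
  val minBackoff   = 50L     // first retry delay, ms
  val maxBackoff   = 1000L   // cap on any single delay, ms
  val multiplier   = 2L      // growth factor between retries
  val totalBackoff = 10000L  // give up once the cumulative delay exceeds this

  // 50, 100, 200, 400, 800, then capped at 1000
  val delays = Iterator.iterate(minBackoff)(d => (d * multiplier).min(maxBackoff))

  var elapsed = 0L
  delays.takeWhile(_ => elapsed <= totalBackoff).zipWithIndex.foreach {
    case (delay, attempt) =>
      elapsed += delay
      println(f"retry ${attempt + 1}%2d after $delay%4d ms (cumulative $elapsed%5d ms)")
  }
  // Once the cumulative delay exceeds totalBackoff, the application shuts down
}
```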
5.1.2 Launching
As explained in section 3, there is now one JAR per targeted platform; as such, you’ll need to use one of the following commands to launch the collector:
```bash
java -jar snowplow-stream-collector-google-pubsub-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-kinesis-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-kafka-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-nsq-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-stdout-0.13.0.jar --config config.hocon
```
5.2 Stream Enrich
The latest version of Stream Enrich is available from our Bintray here.
A complete setup guide for running Stream Enrich on GCP can be found on our wiki.
5.2.1 Updating the configuration
The Stream Enrich configuration has been remodeled so that the source and the sink must use the same platform; it is no longer possible, for example, to read events from a Kafka topic and write enriched events out to a Kinesis stream.
As such, if you’re using Kinesis, the configuration now looks like this:
```hocon
enrich {
  streams {
    in { ... }       # UNCHANGED
    out { ... }      # UNCHANGED
    sourceSink {     # NEW SECTION
      enabled = kinesis
      region = eu-west-1
      aws {
        accessKey = iam
        secretKey = iam
      }
      maxRecords = 10000
      initialPosition = TRIM_HORIZON
      backoffPolicy {
        minBackoff = 50
        maxBackoff = 1000
      }
    }
    buffer { ... }   # UNCHANGED
    appName = ""     # UNCHANGED
  }
  monitoring { ... } # UNCHANGED
}
```
If you want to leverage Google Cloud PubSub, it should look like the following:
```hocon
enrich {
  streams {
    in { ... }       # UNCHANGED
    out { ... }      # UNCHANGED
    sourceSink {     # NEW SECTION
      enabled = googlepubsub
      googleProjectId = id
      threadPoolSize = 4
      backoffPolicy {
        minBackoff = 50
        maxBackoff = 1000
        totalBackoff = 10000 # must be >= 10000
        multiplier = 2
      }
    }
    buffer { ... }   # UNCHANGED
    appName = ""     # UNCHANGED
  }
  monitoring { ... } # UNCHANGED
}
```
For a complete example, see our sample config.hocon template.
If you’re running Stream Enrich from a GCP instance in the same project, authentication will be transparently taken care of for you.
If not, you’ll need to run the following to authenticate using GCP’s CLI gcloud:
```bash
gcloud auth login
gcloud auth application-default login
```
Regarding `backoffPolicy`: if sinking an enriched event to PubSub fails, the first retry will happen after `minBackoff` milliseconds. For subsequent failures, the backoff is multiplied by `multiplier` each time until it reaches its cap of `maxBackoff` milliseconds. If the total time spent backing off exceeds `totalBackoff` milliseconds, the application will shut down.
`threadPoolSize` refers to the number of threads available to the PubSub Subscriber.
5.2.2 Launching
As with the collector, there is now one JAR per targeted platform:
```bash
java -jar snowplow-stream-enrich-google-pubsub-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-kinesis-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-kafka-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-nsq-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-stdin-0.15.0.jar --config config.hocon --resolver file:iglu.json
```
6. Roadmap
Upcoming Snowplow releases will include:
- R102 Afontova Gora, various stability, security and data quality improvements for the batch pipeline
- R103 [STR] PII Enrichment phase 2, enhancing our recently-released GDPR-focused PII Enrichment for the realtime pipeline
- R10x [STR & BAT] IP Lookups Enrichment upgrade, moving us away from using the legacy MaxMind database format, which is being sunsetted on 2nd April 2018
Furthermore, this release is only the beginning for Google Cloud Platform support in Snowplow!
As discussed in our RFC, we plan on porting our streaming enrichment process to Google Cloud Dataflow, leveraging the Apache Beam APIs (see this milestone for details). In parallel, we are also busy designing our new Snowplow event loader for BigQuery.
We look forward to your feedback as we continue to roll out and extend our GCP capabilities!
7. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.