Snowplow R113 Filitosa real-time pipeline improvements


Snowplow R113 Filitosa, named after the megalithic site in Southern Corsica, is a release focusing on improvements to the Scala Stream Collector as well as new features for Scala Common Enrich, the library powering all the different enrichment platforms.
This release is almost entirely made of community contributions. Shoutout to all the contributors:
- LiveIntent for:
  - adding Prometheus support to the Scala Stream Collector
  - making it possible to use POST requests in the API request enrichment
  - other improvements to the Scala Stream Collector
- Peter Zhu and Mike from Poplin Data for:
  - the HubSpot webhook integration
  - other improvements to the existing webhook integrations and the Scala Stream Collector
- Sven Pfenning and Mirko Prescha for the improvements made to our Kafka support
- Arun Manivannan, Saeed Zareian from the Globe and Mail, and Toni Cebrián for the build improvements
Thanks a lot to everyone involved!
Please read on after the fold for:
1. Scala Stream Collector improvements
2. Scala Common Enrich improvements
3. Upgrading
4. Roadmap
5. Getting help
(Photo: Jean-Pol Grandmont, CC BY-SA 3.0)
1. Scala Stream Collector improvements
1.1 Prometheus metrics support
Thanks to LiveIntent, the Scala Stream Collector now publishes Prometheus metrics at the `/metrics` endpoint. The following metrics are published there:
- `http_requests_total`: the total count of requests
- `http_request_duration_seconds`: the time spent handling requests

You will be able to slice and dice the metrics by endpoint, method and/or response code.
Additional information will also be available, such as the Java and Scala versions as well as the version of the Scala Stream Collector artifact.
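To start collecting these metrics, a Prometheus server can scrape the collector's endpoint. Here is a minimal scrape configuration sketch; the job name and the collector host/port are made-up values for illustration:

```yaml
scrape_configs:
  - job_name: snowplow-collector        # hypothetical job name
    metrics_path: /metrics              # the endpoint exposed by the collector
    static_configs:
      - targets: ['collector.example.com:8080']  # assumed collector host:port
```

With this in place, the metrics above can be queried and graphed in Prometheus, for example grouping request counts by response code.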
1.2 Improved Kafka support
It is now possible to specify arbitrary Kafka producer configurations for the collector through the `collector.streams.sink.producerConf` configuration setting. Additionally, the Kafka client library has been upgraded to the latest version to leverage the latest features.
Note that those changes also apply to Stream Enrich for Kafka, through the `enrich.streams.sourceSink.{producerConf, consumerConf}` configurations.
Thanks a lot to Sven Pfenning and Mirko Prescha for those two awesome features!
1.3 Other improvements
For people using the do-not-track cookie feature of the Scala Stream Collector, LiveIntent has improved it by letting you specify a regular expression for the cookie value.
Mike from Poplin Data has introduced a configurable `Access-Control-Max-Age` header, which lets clients cache the result of an `OPTIONS` request. This results in fewer requests and faster `POST` requests: there is no need to make a preflight request if the result is already cached.
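As an illustration, assuming a collector configured with a 5-second cache duration (the origin and path below are made up), the preflight exchange could look like this:

```
OPTIONS /com.snowplowanalytics.snowplow/tp2 HTTP/1.1
Origin: https://shop.example.com
Access-Control-Request-Method: POST

HTTP/1.1 200 OK
Access-Control-Allow-Origin: https://shop.example.com
Access-Control-Max-Age: 5
```

For the next 5 seconds, the browser can send `POST` requests to the collector directly, without repeating the preflight.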
2. Scala Common Enrich improvements
2.1 HubSpot webhook integration
Peter Zhu from Poplin Data built the HubSpot webhook integration from scratch for this release. Huge props to Peter!
You’ll now be able to track the following HubSpot events in your Snowplow pipeline:
- Deal creation
- Deal change
- Deal deletion
- Contact creation
- Contact change
- Contact deletion
- Company creation
- Company change
- Company deletion
Peter has also made small improvements to the Marketo and CallRail integrations.
2.2 POST support in the API request enrichment
It is now possible to use `POST` requests to interact with the API leveraged in the API request enrichment. Thanks to LiveIntent for this feature.
This is useful if you have to leverage an API which isn't necessarily RESTful.
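As an illustrative sketch, a configuration using `POST` might look like the following. Only the `method` value is the point here; the field names follow the general shape of the API request enrichment configuration, and the URI, input, and output values are made up, so check the enrichment documentation for the exact schema and version:

```json
{
  "enabled": true,
  "parameters": {
    "inputs": [
      { "key": "user", "pojo": { "field": "user_id" } }
    ],
    "api": {
      "http": {
        "method": "POST",
        "uri": "https://api.example.com/lookup",
        "timeout": 5000,
        "authentication": {}
      }
    },
    "outputs": [
      {
        "schema": "iglu:com.example/user_data/jsonschema/1-0-0",
        "json": { "jsonPath": "$.record" }
      }
    ],
    "cache": { "size": 3000, "ttl": 60 }
  }
}
```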
3. Upgrading
3.1 Upgrading the Scala Stream Collector
A new version of the Scala Stream Collector incorporating the changes discussed above can be found on our Bintray.
To make use of this new version, you’ll need to amend your configuration in the following ways:
- Add a `collector.cors` section to specify the `Access-Control-Max-Age` duration:

```
cors {
  accessControlMaxAge = 5 seconds # -1 seconds disables the cache
}
```

- Add a `collector.prometheusMetrics` section:

```
prometheusMetrics {
  enabled = false
  durationBucketsInSeconds = [0.1, 3, 10] # optional buckets in which to group the `http_request_duration_seconds` metric
}
```

- Modify the `collector.doNotTrackCookie` section if you want to make use of a regex:

```
doNotTrackCookie {
  enabled = true
  name = cookie-name
  value = ".+cookie-value.+"
}
```

- Add the optional `collector.streams.sink.producerConf` if you want to specify additional Kafka producer configuration:

```
producerConf {
  acks = all
}
```

This also holds true for Stream Enrich through `enrich.streams.sourceSink.{producerConf, consumerConf}`.
A full example configuration can be found in the repository.
3.2 Upgrading your enrichment platform
If you are a GCP pipeline user, a new Beam Enrich can be found on Bintray:
- as a ZIP archive
- as a Docker image
If you are a Kinesis or Kafka pipeline user, a new Stream Enrich can be found on Bintray.
Finally, if you are a batch pipeline user, a new Spark Enrich can be used by setting the new version in your EmrEtlRunner configuration:
```
enrich:
  version:
    spark_enrich: 1.17.0 # WAS 1.16.0
```
or directly make use of the new Spark Enrich available at:
s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.17.0.jar
For the batch pipeline, we’ve also extended the timeout recovery introduced in R112. A new version of EmrEtlRunner incorporating those improvements is available from our Bintray here.
4. Roadmap
Upcoming Snowplow releases include:
- R114 New bad row format, a release which will incorporate the new bad row format discussed in the dedicated RFC.
Stay tuned for announcements of more upcoming Snowplow releases soon!
5. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problem, please visit our Discourse forum.