Snowplow 93 Virunum released


We are tremendously excited to announce the release of Snowplow 93 Virunum.
This release focuses on a much-needed refresh of the real-time pipeline components: the Scala Stream Collector as well as Stream Enrich. It also fixes some long-standing annoyances regarding the Scala Stream Collector.
If you’d like to know more about R93 Virunum, named after [the ancient Roman city in Austria][virunum], please read on after the fold:
- Scala Stream Collector: detecting blocked third-party cookies through cookie bounce
- Scala Stream Collector: relaxing the parsing of HTTP request components
- Stream Enrich: enriching events from a certain point in time with `AT_TIMESTAMP`
- Stream Enrich: forcing the download of the MaxMind IP lookups database
- Stream Enrich: partitioning the output stream according to event properties
- Kafka improvements
- Under the hood
- Upgrading
- Roadmap
- Help
1. Scala Stream Collector: detecting blocked third-party cookies through cookie bounce
Following on from Christoph Buente’s RFC, the Scala Stream Collector now provides a mechanism to test if third-party cookies are blocked and reacts appropriately. Huge thanks to Christoph and the team at LiveIntent for contributing this sophisticated new feature.
Simply put, the new “cookie bounce” mechanism:
- Checks if the cookie named according to the value of the `cookie.name` configuration is present
  - if it is present, uses it and processes the request
  - if not, issues a redirect to itself with a `Set-Cookie` header
    - if the cookie is now present, third-party cookies are allowed, and the Scala Stream Collector processes the request with the cookie value
    - if the cookie is still missing, then we infer that third-party cookies are not allowed, and the Scala Stream Collector processes the request with the placeholder network user id defined in `cookieBounce.fallbackNetworkUserId`
To enable this feature, you can change the `cookieBounce.enabled` configuration setting to `true`.
Be careful though: the redirects mentioned above can significantly increase the number of requests that your collectors have to handle.
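As an illustration, here is a minimal sketch of the relevant collector configuration, reusing the `cookieBounce` block and example values from the upgrading section below:

```
collector {
  cookieBounce {
    # Enable the cookie bounce mechanism
    enabled = true
    # Name of the cookie used for the bounce test
    name = "n3pc"
    # Placeholder network user id used when third-party cookies are blocked
    fallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"
  }
}
```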
2. Scala Stream Collector: relaxing the parsing of HTTP request components
The Scala Stream Collector was previously too restrictive when it came to parsing elements of an HTTP request, and would reject certain events despite their intrinsic correctness, most notably due to:
- Query string parameters with non-URL-encoded reserved characters in their values (#3272)
- User agents not conforming to RFC 2616 (#2970)
Those shortcomings have been fixed in the new version of the collector as part of our ongoing focus on removing any possible data loss scenarios across the pipeline.
Additionally, the enrich stream processing application won’t reject events for which `page_url` contains more than one `#` character (#2893).
3. Stream Enrich: enriching events from a certain point in time with `AT_TIMESTAMP`
If you are using Kinesis with Stream Enrich, you previously had two choices when it came to enriching your raw event stream:
- Starting from the beginning through `TRIM_HORIZON`
- Starting from the latest message with `LATEST`
With R93, you are now able to consume your raw event stream from an arbitrary point in time by specifying `AT_TIMESTAMP` as the `streams.kinesis.initialPosition` configuration setting. Additionally, you’ll need to specify an actual timestamp in `streams.kinesis.initialTimestamp`.
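For example, a minimal sketch of these two settings, following the Stream Enrich configuration layout and example timestamp shown in the upgrading section below:

```
enrich {
  streams {
    kinesis {
      # Start consuming the raw stream from a specific point in time
      initialPosition = AT_TIMESTAMP
      # ISO-8601 timestamp at which to start, required with AT_TIMESTAMP
      initialTimestamp = "2017-05-17T10:00:00Z"
    }
  }
}
```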
4. Stream Enrich: forcing the download of the MaxMind IP lookups database
Before R93, when you launched Stream Enrich with the IP lookups enrichment, the MaxMind IP lookups database was downloaded locally; any subsequent launch would then reuse this local copy of the database.
R93 introduces a command line flag, `--force-ip-lookups-download`, to download a fresh version of the IP lookups database every time Stream Enrich is launched.
There are plans, tracked in issue #3407, to introduce a time-to-live for this database so that it can be re-downloaded while Stream Enrich is running.
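For instance, appending the new flag to the Stream Enrich launch command shown in section 8.2.2:

```
java -jar snowplow-stream-enrich-0.11.0.jar --config config.hocon --resolver file:resolver.json --force-ip-lookups-download
```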
5. Stream Enrich: partitioning the output stream according to event properties
Before R93, it was only possible to use `user_ipaddress` as a partition key for the enriched event stream emitted by Stream Enrich. This release extends the realm of possibilities by introducing the `streams.out.partitionKey` configuration setting, which lets you specify which event property to use to partition the output stream of Stream Enrich.
The available properties have been selected based on their fitness as a partition key (i.e. good distribution and usefulness):
- `domain_userid`
- `network_userid`
- `domain_sessionid`
- `user_ipaddress`
- `event_id`
- `event_fingerprint`
- `user_fingerprint`
If none of these is specified, a random UUID will be generated for each event to serve as its partition key.
As a reminder, in Kinesis and Kafka, two events having the same partition key are guaranteed to end up in the same shard or partition respectively.
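As a sketch, choosing `domain_userid` purely as an example and following the configuration layout from the upgrading section below:

```
enrich {
  streams {
    out {
      # Event property used as the partition key for the enriched stream
      partitionKey = domain_userid
    }
  }
}
```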
6. Kafka improvements
Improvements have also been made regarding how both the Scala Stream Collector and Stream Enrich interact with Kafka. In particular:
- This release exposes the `streams.kafka.retries` configuration setting for both the Scala Stream Collector and Stream Enrich, allowing the Kafka producer to resend a record that failed to send up to the specified number of times
- Prior to this release, the `streams.buffer.byteLimit` setting was used as the size of the batch being sent to Kafka, which didn’t make a lot of sense. It now corresponds to the amount of memory the Kafka producer can use to buffer records before sending them
- Finally, this release makes use of the callback-based API to be notified of errors when producing messages to a Kafka topic
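A minimal sketch of these two settings, following the configuration layout shown in the upgrading section; the values here are illustrative only:

```
streams {
  buffer {
    # Now the amount of memory (in bytes) the Kafka producer may use to
    # buffer records before sending them, no longer the batch size
    byteLimit = 1000000   # example value
  }
  kafka {
    # Number of times the producer will resend a record whose send failed
    retries = 3           # example value
  }
}
```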
7. Under the hood
This release also includes a big set of other updates which are part of the modernization effort around the realtime pipeline, most notably:
- The Scala Stream Collector and Stream Enrich use Java 8 (#3328 and #3392)
- They run on Scala 2.11 (#3311 and #3388)
- Akka HTTP has replaced Spray for the Scala Stream Collector (#3299)
- A flurry of other library updates
8. Upgrading
8.1 Scala Stream Collector
The latest version of the Scala Stream Collector is available from our Bintray here.
8.1.1 Updating the configuration
```
collector {
  cookieBounce {                                   # NEW
    enabled = false
    name = "n3pc"
    fallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"
  }
  sink = kinesis                                   # WAS sink.enabled
  streams {                                        # REORGANIZED
    good = good-stream
    bad = bad-stream
    kinesis {
      // ...
    }
    kafka {
      // ...
      retries = 0                                  # NEW
    }
  }
}
akka {
  http.server {                                    # WAS spray.can.server
    // ...
  }
}
```
For a complete example, see our sample `config.hocon` template.
8.1.2 Launching
The Scala Stream Collector is no longer an executable JAR file. As a result, it has to be launched as:
```
java -jar snowplow-stream-collector-0.10.0.jar --config config.hocon
```
8.2 Stream Enrich
The latest version of Stream Enrich is available from our Bintray here.
8.2.1 Updating the configuration
```
enrich {
  // ...
  streams {
    // ...
    out {
      // ...
      partitionKey = user_ipaddress                # NEW
    }
    kinesis {                                      # REORGANIZED
      // ...
      initialTimestamp = "2017-05-17T10:00:00Z"    # NEW but optional
      backoffPolicy {                              # MOVED
        // ...
      }
    }
    kafka {
      // ...
      retries = 0                                  # NEW
    }
  }
}
```
For a complete example, see our sample `config.hocon` template.
8.2.2 Launching
Stream Enrich is no longer an executable JAR file. As a result, it will have to be launched as:
```
java -jar snowplow-stream-enrich-0.11.0.jar --config config.hocon --resolver file:resolver.json
```
Additionally, a new `--force-ip-lookups-download` flag has been introduced, as mentioned above.
9. Roadmap
Upcoming Snowplow releases will include:
- R94 [BAT] Ellora, enhancing our Redshift event storage with ZSTD encoding, plus various bug fixes for the batch pipeline
- R95 [STR] Zeugma, which will add support for NSQ to the stream processing pipeline, ready for adoption in Snowplow Mini
- R9x [STR] Priority fixes, removing the potential for data loss in the stream processing pipeline
- R9x [BAT] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)
10. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.