Snowplow 65 Scarlet Rosefinch released
We are pleased to announce the release of Snowplow 65, Scarlet Rosefinch. This release greatly improves the speed, efficiency, and reliability of Snowplow’s real-time Kinesis pipeline.
Table of contents:
- Enhanced performance
- CORS support
- Increased reliability
- Loading configuration from DynamoDB
- Randomized partition keys for bad streams
- Removal of automatic stream creation
- Improved Elasticsearch index initialization
- Other changes
- Upgrading
- Getting help
1. Enhanced performance
Kinesis’ new PutRecords API enabled the biggest performance improvement: rather than sending events to Kinesis one at a time, we can now send batches of up to 500. This greatly increases the event volumes which the Scala Stream Collector and Scala Kinesis Enrich can handle.
You might not want to always wait for a full 500 events before sending the stored records to Kinesis, so the configuration for both applications now has a buffer section which provides greater control over when the stored records get sent.
It has three fields:
- byte-limit: if the stored records total at least this many bytes, flush the buffer
- record-limit: if at least this many records are in the buffer, flush the buffer
- time-limit: if at least this many milliseconds have passed since the buffer was last flushed, flush the buffer
An example with sensible defaults (the exact values below are illustrative; tune them to your event volumes):
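```
buffer {
  byte-limit: 4500000   # illustrative: flush once ~4.5 MB of records are buffered
  record-limit: 500     # flush at the PutRecords maximum of 500 records
  time-limit: 5000      # flush at least every 5 seconds regardless
}
```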
Additionally, the Scala Stream Collector has a ShutdownHook which sends all stored records. This prevents stored events from being lost when the collector is shut down cleanly.
2. CORS support
The Scala Stream Collector now supports CORS requests, so you can send events to it from Snowplow’s client-side JavaScript Tracker using POST rather than GET. This means that your requests are no longer subject to Internet Explorer’s querystring size limit.
The Scala Stream Collector also now supports cross-origin requests from the Snowplow ActionScript 3.0 Tracker.
3. Increased reliability
If an attempt to write records to Kinesis failed, previous versions of the Kinesis apps would just log an error message. The new release implements an exponential backoff strategy with jitter for failed PutRecords requests, preventing events from being lost when Kinesis is temporarily unreachable.
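To illustrate the idea (a minimal sketch, not Snowplow’s actual implementation):

```scala
import scala.util.Random

// Exponential backoff with jitter: the nth retry waits a random interval
// between minBackoff and an exponentially growing cap, never exceeding
// maxBackoff. Names and logic here are illustrative assumptions.
def backoffMillis(attempt: Int, minBackoff: Long, maxBackoff: Long): Long = {
  val cap = math.min(maxBackoff.toDouble, minBackoff * math.pow(2, attempt))
  (minBackoff + Random.nextDouble() * (cap - minBackoff)).toLong
}
```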
The minimum and maximum backoffs to use are configurable in the backoffPolicy section for the Scala Stream Collector and Scala Kinesis Enrich.
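For example (field names and values here are illustrative):

```
backoffPolicy {
  minBackoff: 3000    # illustrative: wait at least 3 seconds before the first retry
  maxBackoff: 600000  # illustrative: cap the wait at 10 minutes
}
```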
4. Loading configuration from DynamoDB
The command-line arguments to Scala Kinesis Enrich have also changed. Previously you provided a --config argument, pointing to a HOCON file with configuration for the app, together with an optional --enrichments argument, pointing to a directory containing the JSON configurations for the enrichments you wanted to use, along these lines (binary name and paths are illustrative):
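```
# illustrative invocation
./scala-kinesis-enrich --config my.conf --enrichments path/to/enrichments/
```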
Scarlet Rosefinch makes three changes:
- The “resolver” section of the configuration HOCON has been split into a separate JSON file which should be specified using the new --resolver command line argument
- If you want to get the resolver JSON and the enrichment JSONs from the local filesystem, you need to preface their filepaths with “file:”
- You can now also load the resolver JSON and enrichment JSONs from DynamoDB
To recreate the pre-r65 behavior, convert the resolver section of your configuration HOCON to JSON and put it in its own file, “iglu_resolver.json”. Then start the enricher along these lines (paths are illustrative):
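```
# illustrative paths and binary name
./scala-kinesis-enrich --config my.conf \
  --resolver file:iglu_resolver.json \
  --enrichments file:path/to/enrichments/
```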
To get the resolver from DynamoDB, create a table named “snowplow_config” with hashkey “id” and add an item to the table of the following form (the “json” attribute name below is illustrative; it holds the resolver JSON as a string):
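```
{
  "id": "iglu_resolver",
  "json": "{ ... the contents of iglu_resolver.json as a string ... }"
}
```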
Then provide the --resolver argument as follows:
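The argument points at the region, table, and item key (a sketch, following the dynamodb:{region}/{table}/{key} form):

```
--resolver dynamodb:us-east-1/snowplow_config/iglu_resolver
```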
To get the enrichments from DynamoDB, the enrichment JSONs must all be stored in a table – you can reuse the “snowplow_config” table. The enrichments’ hash keys should have a common prefix, for example “enrich_”:
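For example (the enrichment names shown are illustrative):

```
{ "id": "enrich_ip_lookups", "json": "{ ... }" }
{ "id": "enrich_user_agent_utils", "json": "{ ... }" }
```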
Then provide the --enrichments argument as follows:
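As before this is a sketch; note that the final path component is the common key prefix rather than a full key:

```
--enrichments dynamodb:us-east-1/snowplow_config/enrich_
```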
If you are using a different AWS region, replace “us-east-1” accordingly.
The full command, using the illustrative names from above:
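```
# assumes the illustrative binary name and DynamoDB layout shown above
./scala-kinesis-enrich --config my.conf \
  --resolver dynamodb:us-east-1/snowplow_config/iglu_resolver \
  --enrichments dynamodb:us-east-1/snowplow_config/enrich_
```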
5. Randomized partition keys for bad streams
When sending a record to Kinesis, you provide a partition key. The hash of this key determines which shard will process the record. The good events generated by Scala Kinesis Enrich use the user_ipaddress field of the event as the key, so that events from a single user will all be in the right order in the same shard.
Previously, the bad events generated by Scala Kinesis Enrich and the Kinesis Elasticsearch Sink all had the same partition key. This meant that no matter how many shards the stream for bad records contained, all bad records would be processed by the same shard. Scarlet Rosefinch fixes this by using random partition keys for bad events.
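To illustrate the difference, here is a minimal sketch using the AWS Java SDK (the helper names are assumptions, not Snowplow’s actual code):

```scala
import java.nio.ByteBuffer
import java.util.UUID
import com.amazonaws.services.kinesis.model.PutRecordsRequestEntry

// Good events keep the user's IP address as the partition key, preserving
// per-user ordering on a single shard.
def goodEntry(userIpAddress: String, payload: Array[Byte]): PutRecordsRequestEntry =
  new PutRecordsRequestEntry()
    .withPartitionKey(userIpAddress)
    .withData(ByteBuffer.wrap(payload))

// Bad events get a random partition key, spreading them evenly across all
// shards of the bad stream instead of concentrating them on one.
def badEntry(payload: Array[Byte]): PutRecordsRequestEntry =
  new PutRecordsRequestEntry()
    .withPartitionKey(UUID.randomUUID().toString)
    .withData(ByteBuffer.wrap(payload))
```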
6. Removal of automatic stream creation
The Kinesis apps will no longer automatically create a stream if they detect that the configured stream does not exist.
This means that if you make a typo when configuring the stream name, the mistake will immediately cause an error rather than creating an incorrectly named new stream which no other app knows about.
7. Improved Elasticsearch index initialization
We now recommend that when setting up your Elasticsearch index, you turn off tokenization of string fields. You can do this by choosing “keyword” as the default analyzer; a sketch of the index-creation request follows (host and index name are placeholders):
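```
# index name "snowplow" and host are placeholders
curl -XPUT 'http://localhost:9200/snowplow' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "keyword"
        }
      }
    }
  }
}'
```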
This has two positive effects:
- URLs are no longer incorrectly tokenized
- Not having to tokenize every string field improves the performance of the Elasticsearch cluster
8. Other changes
We have also:
- Started logging the names of the streams to which the Scala Stream Collector and Scala Kinesis Enrich write events (#1503, #1493)
- Added macros to the “config.hocon.sample” sample configuration files (#1471, #1472, #1513, #1515)
- Fixed a bug which caused the Kinesis Elasticsearch Sink to silently drop inputs containing fewer than 24 tab-separated fields (#1584)
- Fixed a bug which prevented the applications from using a DynamoDB table in the configured region (#1576, #1582, #1583)
- Added the ability to prevent the Scala Stream Collector from setting 3rd-party cookies by setting the cookie expiration field to 0 (#1363)
- Bumped the version of Scala Common Enrich used by Scala Kinesis Enrich to 0.13.1 (#1618)
- Bumped the version of Scalazon we use to 0.11 to access PutRecords (#1492, #1504)
- Stopped Scala Kinesis Enrich from outputting records of over 50kB, because they exceed Kinesis’ size limit (#1649)
9. Upgrading
The Kinesis apps for r65 Scarlet Rosefinch are now all available in a single zip file here:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r65_scarlet_rosefinch.zip
Upgrading will require various configuration changes to each of the four applications:
Scala Stream Collector
- Add backoffPolicy and buffer fields to the configuration HOCON
Scala Kinesis Enrich
- Add backoffPolicy and buffer fields to the configuration HOCON
- Extract the resolver from the configuration HOCON into its own JSON file, which can be stored locally or in DynamoDB
- Update the command line arguments as detailed above
Kinesis LZO S3 Sink
- Rename the outermost key in the configuration HOCON from “connector” to “sink”
- Replace the “s3/endpoint” field with an “s3/region” field (such as “us-east-1”); see the sketch below
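A hypothetical before-and-after excerpt of the configuration HOCON (surrounding settings will differ in your file):

```
# Before
connector {
  s3 {
    endpoint: "..."
  }
}

# After
sink {
  s3 {
    region: "us-east-1"
  }
}
```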
Kinesis Elasticsearch Sink
- Rename the outermost key in the configuration HOCON from “connector” to “sink”
And that’s it – you should now be fully upgraded!
10. Getting help
For more details on this release, please check out the r65 Scarlet Rosefinch release notes on GitHub.
Documentation for all the Kinesis apps is available on the wiki.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.