Kinesis S3 0.4.0 released with gzip support


We are pleased to announce the release of Kinesis S3 version 0.4.0. Many thanks to Kacper Bielecki from Avari for his contribution to this release!
1. gzip support
Kinesis S3 now supports gzip as a second storage/compression option for the files it writes out to S3. Using this format, each record is treated as a byte array containing a UTF-8 encoded string (whether CSV, JSON or TSV). The records are then written to files as strings, one record per line and gzipped.
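As a rough illustration of the format described above, here is a minimal Python sketch (not the actual Kinesis S3 implementation, which is a separate JVM application). The function names `write_gzip_batch` and `read_gzip_batch` are hypothetical and exist only for this example; the sketch assumes every record is a valid UTF-8 string with no embedded newlines.

```python
import gzip
import io


def write_gzip_batch(records):
    """Write byte-array records as UTF-8 strings, one per line, then gzip them."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for record in records:
            # Decoding validates that the record really is UTF-8 text.
            line = record.decode("utf-8")
            gz.write(line.encode("utf-8") + b"\n")
    return buffer.getvalue()


def read_gzip_batch(payload):
    """Decompress a gzipped batch and return one string per line."""
    with gzip.GzipFile(fileobj=io.BytesIO(payload), mode="rb") as gz:
        return [line.rstrip(b"\n").decode("utf-8") for line in gz]
```

Because the output is plain gzipped text, a file in this format can be inspected with any standard gzip tool once downloaded from S3.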
Big thanks go to Kacper Bielecki for contributing this storage option! For more information please see Kacper’s pull request.
Snowplow users please note: you must continue to use the LZO format for storing raw Snowplow events.
2. Infinite loops
With the recent Amazon S3 outage in us-east-1, an issue was discovered where Kinesis S3 was unable to recover the connection to S3 even after the service was restored. This resulted in an infinite loop of failures to PUT any records into S3. To fix this, we had to manually restart all Kinesis S3 instances.
To prevent this recurring, Kinesis S3 now supports a failure timeout: if failures extend beyond this timeout, then Kinesis S3 will self-terminate. You can specify this timeout in the configuration file:
// Failure allowed for one minute
sink.s3.max-timeout: 60000
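The idea behind a failure timeout can be sketched in a few lines of Python. This is an illustration of the general technique, not the application's actual code: `put_with_failure_timeout`, `FailureTimeoutExceeded` and the fixed retry delay are all assumptions made for this example. The clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class FailureTimeoutExceeded(Exception):
    """Raised when consecutive failures have lasted longer than the timeout."""


def put_with_failure_timeout(put, max_timeout_ms, retry_delay_s=1.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Retry put() until it succeeds or failures span max_timeout_ms."""
    first_failure = None
    while True:
        try:
            return put()
        except Exception as error:
            now = clock()
            if first_failure is None:
                # Remember when the current run of failures started.
                first_failure = now
            if (now - first_failure) * 1000 >= max_timeout_ms:
                raise FailureTimeoutExceeded(
                    "failures exceeded %d ms" % max_timeout_ms) from error
            sleep(retry_delay_s)
```

A caller would typically catch `FailureTimeoutExceeded` at the top level and exit with a non-zero status, which is what lets an external supervisor notice the failure.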
This feature can be neatly coupled with an automated restart wrapper to ensure that the application will recover without human intervention.
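One possible shape for such a restart wrapper is sketched below. It is only an illustration: `run_with_restarts` is a hypothetical helper, `launch` stands in for whatever starts the application and returns its exit code, and the restart limit and delay are arbitrary assumptions.

```python
import time


def run_with_restarts(launch, max_restarts=5, delay_s=5.0, sleep=time.sleep):
    """Call launch() repeatedly until it exits cleanly or restarts run out.

    launch() must run the application to completion and return its exit code.
    Returns the list of exit codes observed, in order.
    """
    exit_codes = []
    while True:
        code = launch()
        exit_codes.append(code)
        # Stop on a clean exit, or once the restart budget is used up.
        if code == 0 or len(exit_codes) > max_restarts:
            return exit_codes
        sleep(delay_s)
```

In practice `launch` might wrap `subprocess.call` around the command you use to start the sink, although a dedicated process supervisor is often a more robust choice than a hand-written loop.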
3. Safer record batching
In the previous release post we discussed potential out-of-memory problems for this application. To improve things further, we have implemented a new configuration option, max-records, which specifies how many records the application is allowed to read per GetRecords call. This helps prevent the application from exhausting the heap during sudden traffic spikes.
// Amount of records per GetRecords call
sink.kinesis.in.max-records: 10000
Unless you are experiencing out-of-memory issues, we recommend using the default of 10000. Please note that 10000 is, for the moment, also the maximum setting. If it is set any higher, an InvalidArgumentException will be thrown.
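To reason about what a given setting means for memory, a back-of-the-envelope estimate can help. The Python sketch below is illustrative only: the helper names are hypothetical, the lower bound of 1 and the `ValueError` are stand-ins chosen for this example (the application itself throws an InvalidArgumentException), and the record size is something you would need to measure for your own stream.

```python
MAX_RECORDS_LIMIT = 10000  # maximum setting stated in this post


def validate_max_records(max_records):
    """Reject settings outside the supported range (lower bound assumed)."""
    if not 1 <= max_records <= MAX_RECORDS_LIMIT:
        raise ValueError(
            "max-records must be between 1 and %d" % MAX_RECORDS_LIMIT)
    return max_records


def worst_case_batch_bytes(max_records, max_record_bytes):
    """Upper bound on raw payload bytes held from a single GetRecords call."""
    return validate_max_records(max_records) * max_record_bytes
```

For example, with records of at most 1024 bytes, the default setting bounds a single batch at 10000 × 1024 bytes of raw payload, before any per-object overhead on the heap.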
4. Bug fixes
We have also:
- Fixed a bug where the Snowplow Tracker was using the wrong event type for write_failures (#45)
- Added logging for OutOfMemoryErrors so it is easier to debug in the future (#29)
5. Upgrading
The Kinesis S3 application is available in a single zip file here:
http://bintray.com/artifact/download/snowplow/snowplow-generic/kinesis_s3_0.4.0.zip
Upgrading will require various configuration changes to the application’s HOCON configuration file:
- Add max-records to the sink.kinesis.in section and configure how many records you want the application to get at any one time
- Add format to the sink.s3 section and select either lzo or gzip to control what format files are written in
- Add max-timeout to the sink.s3 section and enter the maximum timeout in ms for the application
And that’s it – you should now be fully upgraded!
6. Getting help
For more details on this release, please check out the Kinesis S3 0.4.0 release on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.