Snowplow R104 Stoplesteinan released with important EmrEtlRunner bugfixes
We are pleased to announce the release of Snowplow R104 Stoplesteinan. This release brings a few critical stability-related bug fixes to the new Stream Enrich mode introduced in EmrEtlRunner in R102 Afontova Gora.
Read on for more information on R104 Stoplesteinan, named after the ancient stone circle located in southwestern Norway:
1. Bugs in R102 Stream Enrich mode
In R102 Afontova Gora we presented a new Stream Enrich mode for EmrEtlRunner, evolving the Snowplow Lambda architecture towards something more performant and cost-effective.
Unfortunately, several critical bugs were introduced in the recovery process of pipelines with Stream Enrich mode enabled; these issues combined can lead to folders becoming “stalled” in enriched.good or archived without proper shredding and loading (though no data should be lost).
In Stream Enrich mode, EmrEtlRunner has a new skippable step, staging_stream_enrich, which replaces both enrich steps from the classic Batch Enrich mode.
The problem is that EmrEtlRunner R102 running in Stream Enrich mode still accepted the inappropriate enrich steps as valid skip values; recovery scripts which were not updated to skip staging_stream_enrich instead of
- Silently swallow the
- Run an unwanted staging_stream_enrich step, which would incorrectly stage new enriched data into an enriched.good folder
- This new folder of enriched data would be “stalled” in enriched.good and never processed
Another related bug was EmrEtlRunner returning a false negative for the “ongoing run” check when enriched event folders had stalled in
These issues have been addressed in R104 Stoplesteinan.
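To make the class of check concrete, here is a minimal sketch in Python of rejecting skip values that do not apply to the current mode, rather than silently accepting them. This is not EmrEtlRunner's code: the function names are hypothetical, and the step sets contain only names that appear in this post (the real step list is longer).

```python
# Hypothetical sketch, not EmrEtlRunner's implementation.
# Step names are limited to those mentioned in this post.
BATCH_ONLY_STEPS = {"staging", "enrich"}          # assumed from "--skip staging,enrich"
STREAM_ONLY_STEPS = {"staging_stream_enrich"}


def invalid_skips(skips, stream_enrich_mode):
    """Return the skip values that do not apply to the given mode, sorted."""
    wrong = BATCH_ONLY_STEPS if stream_enrich_mode else STREAM_ONLY_STEPS
    return sorted(set(skips) & wrong)


def check_skips(skips, stream_enrich_mode):
    """Raise instead of silently ignoring inapplicable skip values."""
    bad = invalid_skips(skips, stream_enrich_mode)
    if bad:
        mode = "Stream Enrich" if stream_enrich_mode else "Batch Enrich"
        raise ValueError(f"skip values not valid in {mode} mode: {', '.join(bad)}")
    return list(skips)
```

With a check like this, an old recovery script passing staging and enrich in Stream Enrich mode would fail fast instead of staging fresh data into enriched.good.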
2. Who is affected
The bugs described above impact only Stream Enrich mode and do not cause issues in classic Batch Enrich mode. A pipeline running in Stream Enrich mode was likely affected by these bugs if a recovery attempt was made with R102:
- If a recovery process was launched which incorrectly skipped enrich, you should check enriched.good for leftover folders
- If a recovery process was launched which also skipped rdb_load, you should check if Redshift is missing any data from folders present in
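Both checks come down to comparing sets of run folders. The sketch below works on plain lists of object keys, however you obtain them (for example from an S3 listing). It assumes run folders are named like run=2018-05-01-12-00-00; the function names and example keys are hypothetical.

```python
import re

# Assumption: run folders are named "run=YYYY-MM-DD-hh-mm-ss".
RUN_FOLDER = re.compile(r"run=\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}")


def run_ids(keys):
    """Collect the distinct run folder names appearing in a list of object keys."""
    found = set()
    for key in keys:
        match = RUN_FOLDER.search(key)
        if match:
            found.add(match.group(0))
    return found


def leftover_runs(enriched_good_keys):
    """Run folders still sitting in enriched.good (candidates for stalled data)."""
    return sorted(run_ids(enriched_good_keys))


def runs_missing_downstream(upstream_keys, downstream_keys):
    """Run folders present in one location but absent from a later one."""
    return sorted(run_ids(upstream_keys) - run_ids(downstream_keys))
```

Any run folder reported by leftover_runs outside of an active run deserves a closer look; runs_missing_downstream can be pointed at two locations of your choice to spot runs that never made it further down the pipeline.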
3. How to recover
If you find that you are missing data in Redshift and in shredded.archive, then first upgrade to R104.
To recover the data, you can simply restage data from the run folders to the enriched.stream folder, to be staged and processed during your next launch.
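Before copying anything, it can help to build and review a plan of what would move where. The following is a rough sketch only: it assumes that just the file name is kept and the run folder itself is not recreated under enriched.stream, which you should verify against your own bucket layout. The function name and example paths are hypothetical, and the actual copy is left to your S3 tool of choice.

```python
import posixpath


def restage_plan(run_folder_keys, stream_prefix):
    """Map each object key under a run folder to a destination under stream_prefix.

    Assumption: only the file name is kept; the run folder is not recreated.
    """
    plan = {}
    used = set()
    for key in run_folder_keys:
        name = posixpath.basename(key)
        if not name:  # skip "directory" placeholder keys ending in "/"
            continue
        dest = posixpath.join(stream_prefix, name)
        if dest in used:
            # Two run folders contain a file with the same name; do not overwrite.
            raise ValueError(f"name collision while restaging: {dest}")
        used.add(dest)
        plan[key] = dest
    return plan
```

Raising on a name collision is deliberate: silently overwriting one file with another would turn a recovery into data loss.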
4. Upgrading
The latest version of EmrEtlRunner is available from our Bintray here.
There are no configuration-level changes in this release.
When you upgrade, make sure to update any recovery scripts you have which previously featured --skip staging,enrich and change them to either --resume-from shred or
5. Roadmap
Upcoming Snowplow releases will include:
- R105 Pompeii, fixing an urgent duplication issue which was introduced in R101 Neapolis (when we introduced the initial GCP support)
- R106 Acropolis, enhancing our recently-released GDPR-focused PII Enrichment for the realtime pipeline
- R10x [STR] New webhooks and enrichment, featuring Marketo and Vero webhook adapters from our partners at Snowflake Analytics
- R10x Vallei dei Templi, porting our streaming enrichment process to Google Cloud Dataflow, leveraging the Apache Beam APIs
6. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.