Snowplow RDB Loader R30 released


We are pleased to announce the release of Snowplow RDB Loader R30 which adds multiple usability and stability improvements.
Please read on after the fold for:
- Load manifest improvements
- Configuration refactoring
- Logging improvements
- Snowplow Processing Manifest
- Other improvements
- Upgrading
- Getting help
1. Load manifest improvements
In R87 Chichen Itza we introduced the Redshift load manifest table, widely known as atomic.manifest
. Each row in this table represents metadata about a loading “job” performed by RDB Loader.
Until now, neither RDB Loader nor any other Snowplow software has used this table for any purpose – it was purely informational. In this release, we have changed that and implemented several defensive checks that work on top of the manifest
table and add additional safety to the loading process.
Protection from double-loading
The first issue we have addressed using load manifest is accidental double-loading, which can happen if the Snowplow operator isn’t very experienced with the recovery process. R30 implements the following load algorithm:
- Start a DB transaction
- Load data into
atomic.events
table - Check latest
etl_tstamp
inatomic.events
- If latest record in
atomic.manifest
has sameetl_tstamp
, abort the transaction - Otherwise, proceed to shredded data, write new record to the manifest, and commit the transaction
Before R30 we didn’t have the third and fourth steps so nothing in RDB Loader prevented loading the same data multiple times.
Historical loading
RDB Loader has always assumed that the most recent etl_tstamp
in atomic.events
is the timestamp of the current “run” folder.
This is normally the case – but sometimes we need to load archived data using the --folder
option introduced in RDB Loader 0.13.0</a>); this could happen when the operator discovers that a given folder was not loaded (in turn happening perhaps due to mistakenly using --resume-from
in EmrEtlRunner).
R30 introduces a “transitional load” mode, activating only when the --folder
option is passed. In this mode, RDB Loader loads atomic data into a temporary created table, with a schema identical to atomic.events
, in order to get precise information about the dataset that it is about to load.
With transitional load, RDB Loader can inspect the temporary table and get the correct etl_tstamp
without being confused by existing data. Only if etl_tstamp
does not exist in the load manifest will the data be moved over to the atomic.events
table and the temporary table be dropped; otherwise, the transaction will be aborted.
Notice that in order to use this feature, your database user should have permissions to create and drop tables.
New skippable steps
R30 also introduces three new steps that the user can skip during and RDB Loader job:
load_manifest_check
– skipping the check described above, meaning that any data, even duplicated data, will be loadable into Redshiftload_manifest
– skipping all the load manifest interactions, i.e. checking and writing to the manifesttransitional_load
– skipping the “transitional load” mode introduced above, even when--folder
is specified
Note that as of R102 Afontova Gora, EmrEtlRunner will skip load_manifest
for any pipeline with “stream enrich” mode running, as the etl_tstamp
does not bear any useful information within real-time pipeline.
Skip these steps at your own risk and when you fully understand the consequences, as a failure during load_manifest_check
almost always will mean double-loading.
2. Configuration refactoring
Historically, RDB Loader used PostgreSQL-compatible options to configure an SSL connection, which could have one of following available settings: DISABLE
, REQUIRE
, VERIFY_CA
or VERIFY_FULL
.
However, this method does not match the Redshift configuration, where sslMode
could be only verify-ca
or verify-full
.
In this version we have re-worked the configuration file to instead use direct JDBC configuration options in a new jdbc
object, which includes most of the available Redshift JDBC options. You can see full list of available settings in configuring JSON Schema.
Note that the root-level sslMode
is not a valid setting anymore.
We also took this opportunity to make some forward-looking changes and clean-ups:
- All optional root-level settings cannot now be omitted, and must be set to
null
id
setting is not an optional anymore, and must be set to a (random) UUID- A new optional
processingManifest
option has been added (more on this later)
All mandatory changes are set out in the upgrading section below.
3. Logging improvements
Historically, RDB Loader pr
ovided very concise log output, showing only what steps were successfully executed. Using this output for debugging purposes was challenging, and operators had to rely on third-party sources.
With R30, RDB Loader provides very detailed information about the load process, in particular adding:
- List of discovered folders (usually just one) with a full list of shredded types found there
- Eventual consistency check delays
- Improved failure messages, showing possible resolution steps
- Timestamps for all messages, helping to highlight which steps take the most time
- List of timestamped and truncated
COPY INTO
statements instdout
– this won’t be printed by EmrEtlRunner, but will be present in EMR logs
4. Snowplow Processing Manifest
Another major new feature of this release isthe introduction of Snowplow Processing Manifest – not to be confused with the Redshift atomic.manifest
table discussed above.
Snowplow Processing Manifest is a library and journal allowing jobs to keep a record of all significant processing steps in a pipeline. It has been designed as a very generic mechanism, independent from specific processing engines and target databases.
Currently only AWS DynamoDB is available as a backend for the manifest, and only Redshift (through RDB Shredder and Loader) is supported as target database, however a similar approach is used in the Snowplow Snowflake Loader, and we have plans to migrate Snowflake Loader to use the Snowplow Processing Manifest, which will ultimately evolve into a “universal glue” between pipeline components.
The most important features of Snowplow Processing Manifest include:
- Tracking of all shredded types in a folder, in order to skip S3 consistency check delays
- Tracking of all RDB Shredder and Loader runs with their respective timestamps and exit statuses
- Locking mechanism, preventing double-loading and race conditions
Right now, the Snowplow Processing Manifest as embedded in RDB Loader is considered beta, and it does not need to be enabled in order to use RDB Loader, nor make use of any of the new functionality outlined above. The manifest is disabled by default.
Stay tuned for more information on Snowplow Processing Manifest, coming in an official announcement soon.
5. Other improvements
- RDB Shredder with enabled cross-batch deduplication does not automatically create DynamoDB event manifest anymore – you will need to create this table manually #62
- We fixed a bug where the Loader would fail if the JDBC password could be interpreted as an invalid regular expression #87
6. Upgrading
To make use of the new version, you will need to update your EmrEtlRunner configuration, and also the storage target configuration for either Redshift or Postgres.
EmrEtlRunner
If you are using EmrEtlRunner, you’ll need to update your config.yml
file:
storage: versions: rdb_shredder: 0.14.0 # WAS 0.13.1 rdb_loader: 0.15.0 # WAS 0.14.0
Redshift
In the storage target configuration for Redshift, you’ll need to do following changes:
- Switch SchemaVer to
3-0-0
- Remove
sslMode
and addjdbc
JSON object instead - In case you had
"sslMode": "DISABLE"
– add"ssl": false
tojdbc</code>; if you had
"sslMode": "REQUIRE"
- add"ssl": true
tojdbc
- Assign random UUID to
id
property (add it if it didn’t exist) - Add
"sshTunnel": null
unless you already have configured SSL tunnel, introduced in R28 - Add
"processingManifest": null
unless you’re going to use the processing manifest (to be covered in a future release announcement)
PostgreSQL
If you’re loading data to PostgreSQL, you’ll need to make following changes in respective config:
- Switch SchemaVer to
2-0-0
- Assign random UUID to
id
property (add it if it didn’t exist) - Add
"sslTunnel": null
unless you already have configured SSL tunnel, introduced in R28 - Add
"processingManifest": null
unless you’re going to use the processing manifest (to be covered in a future release announcement)
7. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problem, please visit our Discourse forum.