Iglu Schema Registry 5 Scinde Dawk released


We are pleased to announce the fifth release of the Iglu Schema Registry System, with an initial release of igluctl
– an Iglu command-line tool and Schema DDL as part of Iglu project.
Read on for more information on Release 5 Scinde Dawk, named after the first postage stamp in Asia:
1. igluctl
The main feature of this release is our new igluctl
command-line application, which collects various separate Iglu-related tools into a single easy-to-use CLI app.
At launch, igluctl
includes three commands:
static generate
– which started life as Schema Guru’sschema-guru ddl
commandstatic push
– which originated as Iglu’s Registry Syncer bash scriptlint
– a brand new command, which performs syntax and consistency checking for JSON Schemas
1.1 static generate
From its 0.3.0 release, Schema Guru has included a ddl
subcommand, which reads JSON Schemas and creates corresponding Redshift table definitions plus JSON Paths files to load these tables.
To centralize all tools related to Iglu in one place, we decided to factor out this functionality from Schema Guru and embed it into igluctl
. In this release there are no new features and the command’s interface remains the same as it was for Schema Guru, except schema-guru ddl
has been replaced with igluctl static generate
.
1.2 static push
Another ported command is static push
, which was previously a dedicated bash script inside Iglu project on GitHub. It allows you to upload set of JSON Schemas from a local static registry to Iglu’s Scala schema registry.
static push
accepts three required positional arguments: input
, host
and apikey
:
input
is just a directory containing JSON Schemashost
is domain name or IP address of your Scala schema registryapikey
is the master API key, which you must create manually, and willl be used to create temporary read and write keys (they will be automatically deleted after command completed)
You can find out more Iglu’s Scala schema registry and how to set it up on its dedicated wiki page.
1.3 lint
The third and most exciting subcommand of igluctl
is lint
, which allows you to perform syntax and consistency checking of your JSON Schemas. lint
is an all-new command, and we’re not aware of anything similar outside of the Snowplow ecosystem.
igluctl lint
accepts one required argument, input
, which is a path to a local static registry or a single JSON Schema, and one optional argument, --skip-warnings
which forces igluctl
to omit warnings about unknown properties if required.
A typical use of the lint
command would look like following:
$ igluctl lint /path/to/static/registry
We strongly advise you to use igluctl lint
to increase quality of your JSON Schemas! This command can surface difficult to detect mistakes in the schema’ing process across the following categories:
Syntax errors. These are the most obvious errors. Each JSON Schema must conform to the JSON Meta Schema, which states for example that the value of property maximum
must be a positive integer, the value of required
must be a non-empty array and so on.
Consistency check. Unfortunately, the current specification of JSON Schema does not include some checks. For example, if you have the key foo
inside required
property, some predefined set of keys in properties
without foo
and at the same time additionalProperties
is false
, it’s still a valid JSON Schema, but it cannot validate any possible JSON instance. igluctl
can identify these and other inconsistencies.
Iglu-specific errors. These are errors that don’t make a JSON Schema invalid or unusable itself, but they do make their behavior unpredictable inside Iglu-aware applications. The most notable example is when the schema’s path within the Iglu static registry conflicts with the schema’s self-describing metadata.
Minor errors. These are errors that other validation tools mark as warnings (and can be omitted by lint
using --skip-warnings
), but in fact can prove critical. The most common case is an unknown property: JSON Schema tolerates this and states that it is up to the validating application to implement some property, but this latitude can leads to confusions and further mistakes. For example, we’ve seen people confuse the maximum
property with maxValue
from JSON Schema pre-DraftV4, which means that the JSON Schema will validate incorrect instances, and tools like static generate
will then produce incorrect DDL.
Performing these kinds of check will help you to maintain quality of your JSON Schemas at a high level, which can reduce data loss and increase the stability of your data pipeline. In future versions we’re planning to introduce severity levels to handle even more subtle things that can possibly lead to undesired behavior.
2. Schema DDL
The standalone Schema DDL library has been in use inside Snowplow for about year, providing a partial abstract syntax tree for Redshift tables. As part of the restructuring of Iglu, we are moving Schema DDL into the main Iglu project.
As part of this move, the main package is com.snowplowanalytics.iglu.schemaddl
, instead of the previous com.snowplowanalytics.schemaddl
. This breaking change allowed us to reorganize the existing package stucture, making the Redshift DDL AST available as com.snowplowanalytics.iglu.schemaddl.redshift
, the first of many. This highlights the purpose of the Schema DDL project: to contain abstract syntax trees (ASTs) and related functions for various data definition languages and schema formats.
And to double down on this, in this release we also introduce a new AST for JSON Schema, available at com.snowplowanalytics.iglu.schemaddl.jsonschema
. JSON Schema’s AST can be used to parse arbitray JSON into typesafe AST and drives schema linting and DDL derivation for JSON Schema; in future it will be used more widely for various Iglu-related tasks.
The Schema DDL artifact now is also available on JCenter and Maven Central, and can be included into SBT project as follows:
"com.snowplowanalytics" %% "schema-ddl" % "0.4.0"
3. Migration guide
Given that Redshift table and JSON Paths file generation is now available as part of igluctl
, we will be deprecating schema-guru ddl
command – of course, everything related to JSON Schema derivation remains. It means we strongly encourage you to switch to igluctl
as soon as possible for DDL generation.
You can download igluctl
from Bintray using the following link:
$ wget http://dl.bintray.com/snowplow/snowplow-generic/igluctl_0.1.0.zip $ unzip -j igluctl_0.1.0.zip
Migration should be fairly easy: you just need to replace ./schema-guru ddl
with ./igluctl static generate
. All options remain the same with only two minor behavioral changes:
- The command now exits with status 1 if any error has been encountered in any JSON Schema
- The default Redshift encoding for BOOLEAN column is now
RUNLENGTH
instead ofRAW
4. Iglu roadmap
At this moment, we have two major independent goals for Iglu:
- First-class support for database table definitions and mappings between these definitions and corresponding schemas. This should allow users of the Snowplow platform to concentrate on schema definitions and forget about tedious table deployments and manual data migrations
- Schema inference. This is one more step towards making Iglu “just work”, without users having to do exhaustive upfront schema definition
Stay tuned!
5. Getting help
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.