We are pleased to announce the release of Schema Guru 0.6.0, with long-awaited initial support for database migrations in SQL. This release is an important step in allowing Iglu users to easily and safely upgrade Redshift table definitions as they evolve their underlying JSON Schemas.
This release post will cover the following topics:
- Introducing migrations
- Redshift migrations in Schema Guru
- New –force flag
- Minor CLI changes
- Getting help
- Plans for future releases
1. Introducing migrations
In data-sophisticated companies, data models or schemas can evolve rapidly. To help to support that rapid evolution, at Snowplow we introduced the SchemaVer versioning system to allow schemas to evolve in a safe way.
The most common form of schema evolution is an
ADDITION, where we for example bump the SchemaVer from 1-0-0 to 1-0-1. An
ADDITION is where we can guarantee that:
- All existing data is still compatible with the updated schema
- All existing data consumers are still compatible with the new data
ADDITIONs consist simply of the addition of one or more new optional properties to the schema. This is easy to handle in a columnar database like Amazon Redshift: we can simply apply
ADD COLUMN statements to add the new properties to the end of the existing table.
Previously, Iglu users had to handle schema updates manually, by comparing the schemas, writing SQL migrations by hand and keeping an eye on the all-important column order. Schema Guru was little help here, because running the
schema-guru ddl command on a schema version 1-0-4 (say) would generate a “clean slate” Redshift table which ignored the previous column orders.
We have seen first-hand that this was a very error-prone process. Taking these capabilities of JSON Schema and Redshift together, this release can now generate SQL migration scripts for existing Redshift tables, along with full SQL table definitions.
2. Redshift migrations in Schema Guru
With this release, Schema Guru’s
ddl command will generate a Redshift SQL migration file between all
ADDITION versions, as well as a SQL file to create the highest
ADDITION version from scratch.
For example, running
schema-guru ddl on the following JSON Schemas:
Will result in the following output:
From this we can see that Schema Guru generated a list of migration scripts across all
Here is an example migration script, which updates
com.amazon.aws.cloudfront/wd_access_log/jsonschema from version 1-0-0 directly to version 1-0-2:
Warning: this new migration capability is experimental: please exercise caution with this feature and always visually inspect any migration script before applying it to a Redshift database.
This migration capability is also incomplete: to date it only supports the addition of new optional columns. We have an open ticket, #140, to track other possible migration scenarios – please add your suggestions/priorities to that ticket.
3. New –force flag
Schema Guru is steadily getting smarter at generat
ing table definitions – however, users will still encounter rare cases where it is just not possible to generate the correct DDL automatically. In these cases users tend to edit their DDL files manually.
Once a user has manually edited schemas, he or she is at risk of accidentally overwriting those schemas by re-running Schema Guru. As of this release, Schema Guru will not silently overwrite DDL files if a given file is already present on disk and holds different contents to Schema Guru’s new output; indeed Schema Guru only checks the actual SQL code – not comments or formatting.
Instead of silently overwriting the file, Schema Guru will report a warning for that file. To override this behavior a user may use the new
--force flag to update all files regardless of manual changes.
4. Minor CLI changes
In this release we also introduced two minor CLI changes:
- In the Spark Job,
--enum-setsis no longer an option, but instead a flag which can be used to tell Schema Guru to check all known predefined enum sets. It is equivalent to
schema-guru schema --enum-sets allin the CLI. This doesn’t affect a CLI app
- Schema Guru no longer expects the
inputargument to be in the last position for the
schemacommands. The following command is now valid:
schema-guru ddl --output /event-dictionary /json-schemas --raw-mode, whereas in previous versions users have to move
/json-schemasto the end of line
Schema Guru CLI
Simply download the latest Schema Guru from Bintray:
Assuming you have a recent JVM installed, running should be as simple as:
Schema Guru’s Web UI and Spark Job
Schema Guru’s Web UI and Spark Job have no new features in this release, but as stated in CLI update section, if you were deriving enum sets with the Spark Job, you now must specify
--enum-sets without parameters. Also, you can safely use 0.4.0 versions of both Spark Job and Web UI without a fear to miss new features or bugfixes.
6. Getting help</a>
For more details on this release, please check out the Schema Guru 0.6.0 on GitHub.
More details on the technical architecture of Schema Guru can be found on the For Developers page of the Schema Guru wiki.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.
7. Plans for future releases</a>
With the new features introduced in this release for the
ddl command, this side of Schema Guru has evolved into a de facto Iglu static schema registry generator (a little like Jekyll but for Iglu). This use case is very separate from Schema Guru’s original purpose, the
schema command, which aims to derive JSON Schemas from a collection of instances.
The current design also bundles many different dependencies and features into Schema Guru, making it harder to follow “Do One Thing and Do It Well” philosophy.
Given the above, we are now planning to move all features related to the
ddl command into a separate project inside iglu repository. Schema Guru will revert to its initial purpose – we have no plans to change the
schema command capabilities.