Schema Guru 0.3.0 released for generating Redshift tables from JSON Schemas
We are pleased to announce the release of Schema Guru 0.3.0 and Schema DDL 0.1.0, our tools to work with JSON Schemas.
This release post will cover the following new topics:
- Meet the Schema DDL library
- Commands and CLI changes
- Overview of the ddl command
- ddl command for Snowplow users
- Advanced options for ddl command
- Upgrading
- Getting help
- Plans for next release
1. Meet the Schema DDL library
Schema DDL is a new Scala library which lets you generate Redshift table definitions and JSON Paths mapping files from JSON Schemas. It supports both “vanilla” JSON Schema and our own Self-describing Schema variant.
Currently there’s only one DDL generator, schemaddl.generators.redshift
, and one utility generator, schemaddl.generators.SchemaFlattener
for representing JSON Schemas as a flat data structure. In future releases we plan on adding generators for other database targets, including PostgreSQL, Vertica, Azure SQL Data Warehouse and Google BigQuery.
If you are interested in the code, check out Ddl.scala, the AST-like set of case classes that let us model Redshift’s CREATE TABLE
DDL.
As a first step, Schema DDL 0.1.0 has been added to the new release of Schema Guru to power the new ddl
subcommand there. In the future, we plan on embedding Schema DDL into our Iglu repository server, so that database table definitions can be auto-generated from JSON Schemas.
2. Commands and CLI changes
With the addition of DDL generation functionality to Schema Guru we moved everything related to schema derivation into the schema
subcommand; all new functionality related to DDL generation is part of the new ddl
subcommand.
To make both of these commands consistent we also made few changes in schema
options. For example now you don’t need to specify whether your input is a file or directory full of schemas – just use the positional parameter with input and Schema Guru will decide how to process it. Both examples are valid:
Note that the positional parameter must be placed after all options.
In previous releases you had to specify different output options for a single schema output (--output
) versus multiple schemas (--output-dir
). Now it’s the single --output
option, which is required when you process multiple JSON Schemas but can be omitted when you are generating a single schema (which will be printed to stdout).
The ddl
command works in a similar way: input is a required positional parameter, and you also may specify path to output dir; if you didn’t it will create a sql
directory (and potentially a jsonpaths
directory) in your current directory. Other ddl
options will be covered in following sections.
Note that if you specify a directory as input it will be processed recursively for both commands, so ensure that this directory contains only files you want to process, otherwise your JSON Schema may be polluted (schema
command) or you will see error messages (ddl
command).
3. Overview of the ddl command
Let’s imagine that we have a set of JSONs that we want to load into Redshift, and these JSONs are modeled by a set of JSON Schemas.
We can use the new ddl
subcommand to generate Redshift CREATE TABLE
statements for each of these JSON Schemas, plus accompanying JSON Paths files to support Redshift’s COPY from JSON command.
Let’s take some example JSON Schemas from Alex’s Unified Log Processing book:
What just happened? First we fed five JSON Schemas into Schema Guru’s ddl
command. Here’s the truck_arrives.json
JSON Schema:
Then Schema Guru generated five matching Redshift table definitions. Here is truck_arrives.sql
:
Note that all nested pro
perties are flattened into dot-separated column name, all camelCased keys are transformed to snake_cased, and columns are ordered by nullness and then alphabetically.
And because we ran with --with-json-paths
, Schema Guru also generated five JSON Paths files to map the JSON instances to the new Redshift table definition. Here is truck_arrives.json
:
Phew! That completes our example.
4. ddl command for Snowplow users
As existing Snowplow users have probably guessed, Schema Guru’s new ddl
command is purpose-built for the task of creating Snowplow event dictionaries!
Simply remove the --raw
option and the ddl
command will work great with a Snowplow-compatible collection of self-describing JSON Schemas; no longer will you have to write Redshift table definitions and JSON Paths mappings by hand. Note that we recommend bumping the default VARCHAR
size to 4096 to prevent issues with column truncation in Redshift.
Let’s work through an example, using the Snowplow website’s event dictionary. First let’s clone the event dictionary and move the existing sql
and jsonpaths
folders out of harm’s way:
Now let’s run Schema Guru against our event dictionary:
Great – that’s now generated all of our table definitions, ready for deploying into Redshift, and our JSON Paths mappings, ready for uploading to Iglu or Amazon S3.
We can take a sneak peak at one of our tables, com_snowplowanalytics_snowplow_website_trial_form_submitted_1.sql
:
Note how the table contains all of the “boilerplate” columns required by the Snowplow shredding process.
And here is the corresponding JSON Paths file:
Again, this file is now ready for use in the Snowplow shredding process. If you are interested, you can compare the machine-generated files above to the original human-generated files now in the -old
folders.
5. Advanced options for ddl command
–size for default VARCHAR size
In the absence of any other clues about size (e.g maxLength
or enum
), VARCHAR
s in generated Redshift tables default to a size of 255. You can override this with the --size
option:
–schema for specifying your tables’ schema
By default, tables are generated in the atomic
schema, while the --raw
option generates tables without a named schema.
If you want to specify your own schema for the CREATE TABLE
DDL, use the --schema
option:
–split-product-types for specifying your tables’ schema
A JSON Schema product type consists of two or more different types, like this: ["string", "integer"]
. If we encounter a product type when generating Redshift table definitions, currently we use a VARCHAR(4096)
. However, this loses some type safety.
An alternative is to split a product type into multiple columns using --split-product-types
. For example, this JSON property:
Will be represented as two different columns:
Notes:
- Constraints in the JSON Schema will be preserved on the element of the split type to which they correspond
- Avoid using this option with
--with-json-paths
– a JSON Paths file for split product types wi
ll likely break a RedshiftCOPY from JSON
load
–db for specifying database type
Currently only Redshift is supported by the ddl
command, but going forwards we plan on supporting other databases. You will be able to specify the database target with the --db
option:
There is no need to use this option yet, given that only Redshift is currently supported.
6. Upgrading
Simply download the latest Schema Guru from Bintray:
Assuming you have a recent JVM installed, running should be as simple as:
Note that the web UI is unchanged in this release.
7. Getting help
For more details on this release, please check out the Schema Guru 0.3.0 on GitHub.
In the meantime, if you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.
8. Plans for next release
We have plenty planned for the Schema Guru, Schema DDL and Iglu roadmaps! A few things that we are working on:
- Implementing Apache Spark support for Schema Guru so that we can derive JSON Schemas from much larger volumes of JSON instances
- Detecting known enum sets when generating Schemas (any suggestions highly appreciated!)
- Adding the ability to generate
CREATE TABLE
DDL for other databases