Schema Guru 0.4.0 with Apache Spark support released


We are pleased to announce the release of Schema Guru version 0.4.0 with Apache Spark support, new features in both schema and ddl subcommands, bug fixes and other enhancements.
In support of this, we have also released version 0.2.0 of the schema-ddl library, with Scala 2.11 support, Amazon Redshift COMMENT ON support, and a more precise schema-to-DDL transformation algorithm.
This release post will cover the following topics:
- Apache Spark support
- Predefined enumerations
- Redshift object comments
- Support for minLength and maxLength properties
- Edge cases in DDL generation
- Minor changes
- Bug fixes
- Upgrading
- Getting help
- Plans for the next release
1. Apache Spark support
This release lets you run Schema Guru’s JSON Schema derivation process as an Apache Spark job – letting you derive your schemas from much larger collections of JSON instances.
For users of Amazon Web Services we provide a tasks.py file to quickly deploy an EMR cluster and run your Schema Guru job against your JSON instances stored in Amazon S3.
To use this you will need to have the boto, pyinvoke and awscli packages installed. If you prefer not to install these manually, they are installed automatically via our Vagrant provisioning:
host> git clone https://github.com/snowplow/schema-guru
host> cd schema-guru
host> vagrant up && vagrant ssh
guest> cd /vagrant
Either way, you will also need to have:
- An AWS CLI profile, e.g. my-profile
- An EC2 keypair, e.g. my-ec2-keypair
- At least one Amazon S3 bucket, e.g. my-bucket
With all the prerequisites in place you can now run the job:
guest> cd sparkjob
guest> inv run_emr my-profile my-bucket/input/ my-bucket/output/ my-bucket/errors/ my-bucket/logs my-ec2-keypair
You can easily modify tasks.py to suit your own needs:
- To pass non-default options to the job, e.g. enum cardinality, just modify the args list in the run_emr task; all options passed after the path to the jar file are accepted as regular Schema Guru options (see the sketch after this list)
- You can adapt this script to run Schema Guru against your own non-AWS Spark cluster
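As a minimal sketch (not the actual tasks.py), assuming the task assembles the Spark job arguments as a plain Python list, such a customization might look like this; the jar location and option values below are illustrative:

from invoke import task

@task
def run_emr(ctx, profile, input_path, output_path, errors_path, logs_path, keypair):
    # Sketch only: everything after the jar path is handed to Schema Guru
    # unchanged, so extra options can simply be appended to this list
    args = [
        "s3://my-bucket/jars/schema-guru-sparkjob-0.4.0",  # jar location (illustrative)
        "--enum", "150",  # e.g. a non-default enum cardinality
        input_path,
    ]
    print("EMR step arguments:", args)  # the real task submits these via boto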
2. Predefined enumerations
While deriving schemas, we often encounter some repeating enumerations like ISO country codes, browser user agents or similar.
In the 0.2.0 release, we implemented enum derivation, allowing us to automatically recognize sets of values within a given cardinality limit.
However, if during derivation we only see, say, 100 of 165 possible currency codes, it's very unlikely that we don't need the other 65. And even if we did encounter all 165 currency codes, an enum detector with a cardinality limit of 100 would reject the set anyway.
To get around this, you can now specify known enumerations with the --enum-sets option. Built-in sets include iso_4217, iso_3166-1_aplha-2 and iso_3166-1_aplha-3 (written as they should appear in the CLI).
If you need two or more, pass them as multiple options:
$ ./schema-guru-0.4.0 schema --enum-sets iso_4217 --enum-sets iso_3166-1_aplha-3 /path/to/instances
Better still, you can pass the special value all to include every built-in enum set:
$ ./schema-guru-0.4.0 schema --enum-sets all /path/to/instances
Going further, and recognizing that many users have domain-specific enums, you can now also pass in your own predefined enum sets. Just pass in the path to a JSON file containing an array of values; if the encountered values intersect with that set, the enumeration will be enforced in the schema:
$ ./schema-guru-0.4.0 schema --enum-sets ../favourite_colors.json --enum-sets all /path/to/instances
Where favourite_colors.json might look like this:
["blue", "indigo", "purple", "violet", "white", "black"]
3. Redshift object comments
Amazon Redshift is based on PostgreSQL 8.0.2, and thus the two share many features. One powerful feature of PostgreSQL is the ability to COMMENT ON all sorts of internal objects, such as tables, databases and views.
Redshift also has the COMMENT ON syntax, although the documentation states that we cannot retrieve these comments with a SQL query. After some research we discovered that in fact table comments can be retrieved like so:
SELECT description FROM pg_description WHERE objoid = 'schema.table'::regclass
The ddl command of Schema Guru now generates a COMMENT ON statement for each Redshift table, containing the full Iglu URI used to generate the table. In the future we will use this metadata to drive automated table migrations.
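For example, for a hypothetical schema iglu:com.example/click/jsonschema/1-0-0, the generated statement would look something like this (the table name is illustrative):

COMMENT ON TABLE atomic.com_example_click_1 IS 'iglu:com.example/click/jsonschema/1-0-0';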
4. Support for minLength and maxLength properties
From the beginning, the ddl subcommand has used the minLength and maxLength properties of string schemas to determine whether a column should have type CHAR (fixed-length) or, otherwise, what size its VARCHAR should be.
Given this, it was an omission that the schema subcommand did not generate minLength and maxLength properties. In this version we have fixed this, and all strings in derived JSON Schemas now have these properties.
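For example, if every observed value of a string field is between 2 and 32 characters long, the derived property might now look like this (bounds illustrative):

{
  "type": "string",
  "minLength": 2,
  "maxLength": 32
}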
Be aware that this can produce excessively strict JSON Schemas if you process a very small set of instances. For this case we provide the --no-length option:
$ ./schema-guru-0.4.0 schema --no-length /path/to/few-instances
With this setting, neither minLength nor maxLength will appear in the resulting JSON Schema.
5. Edge cases in DDL generation
It can be challenging to precisely map the very powerful and dynamic set of JSON Schema rules to static database table DDL. With each release we aim to track down and solve the edge cases we have found.
With this release, Schema Guru can now process sub-objects in a JSON Schema which lack the properties property:
- If the sub-object lacks properties but contains patternProperties, it will result in a VARCHAR(4096) column
- If the sub-object lacks properties but additionalProperties is set to false, the object will be silently ignored
Schema Guru is also now aware of nullable parent objects: if a child key is listed in the required property but the containing object itself is not required, then that key will not have a NOT NULL constraint in the generated DDL.
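A hypothetical DDL fragment illustrating both behaviours, for a schema with an optional context object (whose id is required) and a tags sub-object defined only via patternProperties:

CREATE TABLE atomic.com_example_click_1 (
    "context.id" VARCHAR(255),  -- no NOT NULL: the parent object is optional
    "tags"       VARCHAR(4096)  -- patternProperties-only object stored as raw JSON
);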
6. Minor changes
There are some minor changes introduced in this release:
- Schema Guru now throws an exception if you try to use --with-json-paths and --split-product-types together, because there is no support for split product types in our JSON Path generation code yet
- The --size option for the ddl subcommand, used to declare the default VARCHAR size, has been renamed to --varchar-size (issue #98)
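For example, to generate DDL with a default VARCHAR size of 1024 (path illustrative):
$ ./schema-guru-0.4.0 ddl --varchar-size 1024 /path/to/schemas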
7. Bug fixes
Since implementing Base64 detection, we sometimes saw false positives where this format rule was unfairly applied to short human-readable strings that happened to also be valid Base64, as per issue #76. Application of this pattern now depends on the total number of JSON instances being processed and on the length of the string, so the chance of false detection has been reduced almost to zero.
While generating DDL, Schema Guru now correctly handles maxLength for complex types like ["object", "string"] (issue #35).
Also, a regression around schemas for array structures, introduced in the 0.2.0 release, has been fixed (issue #81).
8. Upgrading
Schema Guru CLI
Simply download the latest Schema Guru from Bintray:
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.4.0.zip
$ unzip schema_guru_0.4.0.zip
Assuming you have a recent JVM installed, running should be as simple as:
$ ./schema-guru-0.4.0 {schema|ddl} {input} {options}
Schema Guru web UI
The Web UI can also be downloaded from Bintray:
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.4.0.zip
$ unzip schema_guru_webui_0.4.0.zip
Note that the Web UI has been updated only to reflect the codebase refactoring; no new features have been added.
Schema Guru Spark job
For running Schema Guru on Spark, please see the relevant section above. For AWS Elastic MapReduce users, we host the Spark job on S3 as:
s3://snowplow-hosted-assets/schema-guru/spark/schema-guru-sparkjob-0.4.0
You can download this if you want to run this job on Spark elsewhere:
$ wget https://snowplow-hosted-assets.s3.amazonaws.com/schema-guru/spark/schema-guru-sparkjob-0.4.0
9. Getting help
For more details on this release, please check out the Schema Guru 0.4.0 release on GitHub.
More details about how the core of Schema Guru works can be found on the For Developers page of the Schema Guru wiki.
In the meantime, if you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.
10. Plans for the next release
We have plenty of features planned for Schema Guru! The roadmap includes: