Schema Guru 0.4.0 with Apache Spark support released
We are pleased to announce the release of Schema Guru version 0.4.0 with Apache Spark support, new features in both schema and ddl subcommands, bug fixes and other enhancements.
In support of this, we have also released version 0.2.0 of the schema-ddl library, with Scala 2.11 support, Amazon Redshift
COMMENT ON and a more precise schema-to-DDL transformation algorithm.
This release post will cover the following topics:
- Apache Spark support
- Predefined enumerations
- Comments on Redshift tables
- Support for minLength and maxLength properties
- Edge cases in DDL generation
- Minor changes
- Bug fixes
- Getting help
- Plans for the next release
1. Apache Spark support
This release lets you run Schema Guru’s JSON Schema derivation process as an Apache Spark job – letting you derive your schemas from much larger collections of JSON instances.
For users of Amazon Web Services we provide a tasks.py file to quickly deploy an EMR cluster and run your Schema Guru job against your JSON instances stored in Amazon S3.
Either way, you will also need to have:
- An AWS CLI profile, e.g. my-profile
- An EC2 keypair, e.g. my-ec2-keypair
- At least one Amazon S3 bucket, e.g. my-bucket
With all the prerequisites in place you can now run the job:
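For example, assuming tasks.py exposes the run_emr task described below as a Fabric task, the invocation might look like this (the profile, keypair, bucket and path names are all placeholders to substitute with your own, and the exact argument order may differ in your copy of the script):

```bash
# Hypothetical invocation; adjust arguments to match your tasks.py
fab run_emr:my-profile,my-ec2-keypair,my-bucket/input/,my-bucket/output/
```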
You can easily modify tasks.py to suit your own needs:
- To pass non-default options to the job, e.g. enum cardinality, just modify the run_emr task. All options passed after the path to the jar file will be accepted as regular Schema Guru options
- You can adapt this script to run Schema Guru against your own non-AWS Spark cluster
2. Predefined enumerations
While deriving schemas, we often encounter some repeating enumerations like ISO country codes, browser user agents or similar.
In the 0.2.0 release, we implemented enum derivation, allowing us to automatically recognize sets of values within some cardinality limit.
However, if during derivation we only see, say, 100 of 165 possible currency codes, it’s very unlikely that we don’t need the other 65. And even if we did encounter all 165 currency codes, an enum detector with a cardinality limit of 100 would reject the enum set anyway.
To get around this, you can now declare known enumerations up-front with the
--enum-sets option. Built-in sets include iso_4217, iso_3166-1_aplha-2 and iso_3166-1_aplha-3 (written as they should appear in CLI).
If you need two or more, pass them as multiple options:
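For example (the executable name and input path here are illustrative):

```bash
./schema-guru-0.4.0 schema --enum-sets iso_4217 --enum-sets iso_3166-1_aplha-2 /path/to/instances
```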
Or even better, you can pass special value
all to include all built-in enum sets.
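For instance (again with an illustrative executable name and input path):

```bash
./schema-guru-0.4.0 schema --enum-sets all /path/to/instances
```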
Going further, and taking into account that many users have their own domain-specific enums, you can now also pass in your own predefined enum sets. Just pass in the path to a JSON file containing an array of values, and if the encountered values intersect with that set, the enumeration will be enforced in the schema:
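For example (file path and executable name are illustrative):

```bash
./schema-guru-0.4.0 schema --enum-sets /path/to/favourite_colors.json /path/to/instances
```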
favourite_colors.json might look like this:
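A simple array of string values is enough; the values themselves are of course just an example:

```json
["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
```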
3. Redshift object comments
Amazon Redshift is based on PostgreSQL 8.0.2 and thus the two have many similarities and shared features. One powerful feature of PostgreSQL is the ability to
COMMENT ON all sorts of internal objects, such as tables, databases and views.
Redshift also has the COMMENT ON syntax, although the documentation states that we cannot retrieve these comments with a SQL query. After some research we discovered that in fact table comments can be retrieved like so:
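One way is to go through the PostgreSQL system catalogs, which Redshift inherits. A sketch (the table name is a placeholder):

```sql
-- Fetch the comment attached to a given table via the pg_description catalog
SELECT pd.description
FROM pg_catalog.pg_description pd
JOIN pg_catalog.pg_class pc ON pd.objoid = pc.oid
WHERE pc.relname = 'my_table';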
The ddl subcommand of Schema Guru now generates a
COMMENT ON statement for each Redshift table containing the full Iglu URI used to generate this table. In the future we will use this metadata to drive automated table migrations.
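A generated statement might look like this (the table name and Iglu URI below are hypothetical):

```sql
COMMENT ON TABLE atomic.com_acme_my_event_1
  IS 'iglu:com.acme/my_event/jsonschema/1-0-0';
```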
4. Support for minLength and maxLength properties
From the beginning the ddl subcommand has used the maxLength property of string schemas to determine whether a column has a fixed-length CHAR type, or otherwise what VARCHAR size it should have.
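The idea can be sketched as follows. Note this is a simplified illustration of the mapping, not Schema Guru's actual implementation, and the default size is an assumption of the sketch:

```python
def column_type(min_length, max_length, default_varchar=4096):
    """Pick a column type for a JSON Schema string (illustrative sketch).

    If minLength equals maxLength the value has a fixed width, so CHAR;
    otherwise fall back to a VARCHAR sized by maxLength, or by a default
    when no maxLength is available.
    """
    if max_length is not None and min_length == max_length:
        return f"CHAR({max_length})"
    if max_length is not None:
        return f"VARCHAR({max_length})"
    return f"VARCHAR({default_varchar})"
```

So a string schema with `"minLength": 2, "maxLength": 2` (an ISO country code, say) maps to a fixed-length `CHAR(2)`, while one with only `"maxLength": 255` maps to `VARCHAR(255)`.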
Given this, it was an omission that the schema subcommand did not generate minLength and maxLength properties. In this version we have fixed this, and all strings in derived JSON Schemas now have these properties.
Be aware that this can produce excessively strict JSON Schemas if you process a very small set of instances. For this case we provide a dedicated option:
With this setting, no maxLength property will appear in the resulting JSON Schema.
5. Edge cases in DDL generation
It can be challenging to precisely map the very powerful and dynamic set of JSON Schema rules to static database table DDL. With each release we aim to track down and solve the edge cases we have found.
With this release, Schema Guru can now process sub-objects in JSON Schema which lack the properties field:
- If the sub-object declares patternProperties, it will result in a generic VARCHAR column
- If the sub-object has additionalProperties set to false, the object will be silently ignored
Schema Guru is also now aware of nullable parent objects: if a child key is listed in the
required property, but the containing object is not required, then these keys will not have a
NOT NULL constraint in their DDL.
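For example, in a schema like the following (the field names are made up), nested_field is required within its parent object, but since that parent is not itself required, the corresponding column would not carry NOT NULL:

```json
{
  "type": "object",
  "properties": {
    "context": {
      "type": "object",
      "properties": { "nested_field": { "type": "string" } },
      "required": ["nested_field"]
    }
  }
}
```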
6. Minor changes
There are some minor changes introduced in this release:
- Schema Guru now throws an exception if you try to use
--split-product-types together with JSON Paths file generation, because there is no support for split product types in our JSON Path generation code yet
- The --size option for the
ddl subcommand, used to declare the default
VARCHAR size, has been renamed
7. Bug fixes
Since implementing Base64 detection, we sometimes saw false positives where this formatting rule was unfairly applied to short human-readable strings (that happened to also be valid Base64), as per issue #76. Now, application of this pattern depends on the total quantity of JSON instances being processed, and the length of the string, so the chance of false detection has been reduced almost to zero.
While generating DDL, Schema Guru now correctly handles
maxLength for complex types like
["object", "string"] (issue #35).
Schema Guru CLI
Simply download the latest Schema Guru from Bintray:
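Assuming the artifact follows the naming convention of earlier Schema Guru releases on the snowplow-generic Bintray repository (adjust the URL if the name differs):

```bash
wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.4.0.zip
unzip schema_guru_0.4.0.zip
```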
Assuming you have a recent JVM installed, running should be as simple as:
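Something along these lines, where the input path is a placeholder for your own directory of JSON instances:

```bash
./schema-guru-0.4.0 schema /path/to/instances
```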
Schema Guru web UI
The Web UI can also be downloaded from Bintray:
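Again assuming the naming convention of earlier releases holds for the Web UI artifact (adjust the URL if it differs):

```bash
wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.4.0.zip
unzip schema_guru_webui_0.4.0.zip
```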
Note that the Web UI has been updated only to reflect the codebase refactoring; no new features have been added.
Schema Guru Spark job
For running Schema Guru on Spark, please see the relevant section above. For AWS Elastic MapReduce users, we host the Spark job on S3 as:
You can download this if you want to run this job on Spark elsewhere:
For more details on this release, please check out the Schema Guru 0.4.0 release notes on GitHub.
More details about how the core of Schema Guru works can be found on the For Developers page of the Schema Guru wiki.
We have plenty of features planned for Schema Guru! The roadmap includes: