We are pleased to announce version 0.3.0 of Dataflow Runner, our cloud-agnostic tool to create clusters and run jobflows. This release is centered around new features and usability improvements.
In this post, we will cover:
- Preventing overlapping job runs through locks
- Tagging playbooks
- New template functions
- Other updates
1. Preventing overlapping job runs through locks
This release introduces a mechanism to prevent two jobs from running at the same time. This is useful if, for example, you have an ETL process that must run as a singleton, or several jobs that each need exclusive access to the same database.
With this feature, Dataflow Runner acquires a lock before starting the job. The lock is released when:
- the job has terminated (whether successfully or with failure) with the
- the job has succeeded with the --lock flag (“hard lock”)
As the above implies, if a job fails while the --lock flag is in use, the lock is not released and has to be cleaned up manually.
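The release rules above can be sketched in a few lines of Python. This is a minimal illustration, not Dataflow Runner's implementation: the in-memory set `held`, the `job_lock` and `run_job` helpers, and the `soft` parameter (standing for the mode that always releases the lock) are all names invented for this sketch.

```python
from contextlib import contextmanager


class LockHeldError(RuntimeError):
    """Raised when the requested lock is already held."""


@contextmanager
def job_lock(held, name, soft=False):
    """Hold `name` in the set `held` while the job body runs.

    soft=True : release the lock whenever the job ends.
    soft=False: release the lock only if the job succeeds ("hard lock").
    """
    if name in held:
        raise LockHeldError(f"lock {name!r} is already held")
    held.add(name)
    try:
        yield
    except Exception:
        if soft:
            held.discard(name)  # released even though the job failed
        raise  # hard lock: stays in place until cleaned up manually
    else:
        held.discard(name)  # success: released in both modes


def run_job(held, name, soft, fail):
    """Run a dummy job under the lock; return True if it succeeded."""
    try:
        with job_lock(held, name, soft=soft):
            if fail:
                raise ValueError("job failed")
        return True
    except ValueError:
        return False
```

After a failed run with `soft=False`, the name stays in `held`, so the next `run_job` call with the same name raises `LockHeldError` until the entry is removed by hand.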
Two strategies for storing the lock have been made available: local and distributed.
1.1 Local lock
You can leverage a local lock when launching your playbook with ./dataflow-runner run using:
This prevents anyone on this machine from running another playbook using path/to/lock as lock.
For example, launching the following while the steps above are running:
You can set the lock name as appropriate to set up locks across different playbooks, job names and/or cluster IDs.
In a local context, the lock is materialized as a file at the specified path, which can be relative or absolute. A relative path is resolved against your current working directory.
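To make the file-based lock concrete, here is a minimal Python sketch of one common way to implement it: creating the lock file atomically and treating "file already exists" as "lock already taken". The function names are hypothetical, and the sketch does not claim to mirror how Dataflow Runner creates or removes its lock file.

```python
import os


def acquire_file_lock(path):
    """Create the lock file atomically; return False if it already exists.

    The parent directory is assumed to exist already.
    """
    full_path = os.path.abspath(path)  # relative paths resolve against the cwd
    try:
        # O_CREAT | O_EXCL makes the call fail if the file is already there,
        # so two processes cannot both believe they created it.
        fd = os.open(full_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_file_lock(path):
    """Remove the lock file, which frees the lock for the next run."""
    os.remove(os.path.abspath(path))
```

With a scheme like this, cleaning up a lock manually amounts to deleting the file at the lock path.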
1.2 Distributed lock
Another strategy is to leverage Consul to enforce a distributed lock:
That way, anyone using path/to/lock as lock and this Consul server will have to respect the lock.
In a distributed context, the lock will be materialized by a key-value pair in Consul, the key being at the specified path.
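The same idea carries over to a shared key-value store: the lock is held by whoever managed to write the key first. The Python sketch below uses an in-memory class as a hypothetical stand-in for the store, so `InMemoryKV`, `put_if_absent` and the `owner` value are assumptions made for illustration and not part of Consul's or Dataflow Runner's API.

```python
class InMemoryKV:
    """Hypothetical stand-in for a shared key-value store."""

    def __init__(self):
        self._data = {}

    def put_if_absent(self, key, value):
        """Store value under key only if the key is unset; report success."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def delete(self, key):
        """Remove the key if present."""
        self._data.pop(key, None)


def acquire_distributed_lock(kv, key, owner):
    """Try to take the lock at `key`; return True if this caller now holds it."""
    return kv.put_if_absent(key, owner)


def release_distributed_lock(kv, key):
    """Free the lock by deleting its key."""
    kv.delete(key)
```

In a real deployment the "write only if absent" step has to be atomic across all clients. Consul's KV HTTP API offers a check-and-set parameter for this kind of write, though whether Dataflow Runner relies on it is not stated here.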
2. Tagging playbooks
Much like cluster configurations, which can be tagged, version 0.3.0 introduces the ability to tag playbooks.
As an example, we could have the following
However, unlike cluster configuration tags, which actually tag the EMR cluster, playbook tags have no effect in EMR.
Note that, compared with version 0.2.0 of Dataflow Runner, the playbook schema version has changed to 1-0-1. 1-0-1 is fully backward compatible, so if you do not wish to use the tags introduced in this release you do not have to change anything.
The up-to-date playbook schema can be found on GitHub.
3. New template functions
In addition to the already existing systemEnv, the 0.3.0 release brings three new template functions:
timeWithFormat [time] [format] formats the specified Unix time according to the format argument.
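The conversion itself is straightforward; a rough Python equivalent is sketched below. Two assumptions are worth flagging: the sketch interprets the timestamp in UTC, and it uses Python's strftime directives, whereas Dataflow Runner is written in Go and its format strings may well follow Go's time layout conventions instead. The name `time_with_format` is chosen for this illustration only.

```python
from datetime import datetime, timezone


def time_with_format(unix_time, fmt):
    """Format a Unix timestamp (seconds, int or numeric string) in UTC."""
    moment = datetime.fromtimestamp(int(unix_time), tz=timezone.utc)
    return moment.strftime(fmt)
```

For instance, `time_with_format(0, "%Y-%m-%d")` returns `"1970-01-01"`.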
As an example, if we have the following in our
it results in:
As its name implies, the base64 template function encodes its argument using Base64 encoding.
base64File will encode the contents of the file passed as argument.
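What these two functions compute can be reproduced with Python's standard library, as in the sketch below. The function names and the choice of UTF-8 for string arguments are assumptions of this illustration; the sketch only shows the encoding, not how the template functions are wired into a playbook.

```python
import base64


def base64_encode(text):
    """Base64-encode a string (UTF-8 bytes in, ASCII text out)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def base64_file(path):
    """Base64-encode the raw contents of the file at `path`."""
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("ascii")
```

Note that a file-based encoding covers every byte in the file, including a trailing newline if the file ends with one.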
Let’s say we have a playbook-name.txt file containing:
4. Other updates
Some changes have been made to improve usability regarding missing template variables:
4.1 Short-circuit execution on unset template variable
Prior to 0.3.0, if you forgot to specify a template variable, then the string <no value> would be filled into the template.
For example, launching an EMR cluster with the following
would have resulted in an EMR cluster named: snowplow - <no value>.
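The difference between lenient and strict substitution can be shown with a toy renderer in Python. Everything here is an assumption made for illustration: the `{{.name}}` placeholder syntax, the variable name `owner`, the `render` function and the error message are not taken from Dataflow Runner.

```python
import re

# Matches placeholders of the form {{.name}}, with optional inner spaces.
PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(template, variables, strict=True):
    """Replace each placeholder with its value from `variables`.

    strict=False: a missing variable becomes the literal "<no value>".
    strict=True : a missing variable aborts rendering with a KeyError.
    """
    def substitute(match):
        name = match.group(1)
        if name in variables:
            return str(variables[name])
        if strict:
            raise KeyError(f"template variable {name!r} is not set")
        return "<no value>"

    return PLACEHOLDER.sub(substitute, template)
```

Calling `render("snowplow - {{.owner}}", {}, strict=False)` returns `"snowplow - <no value>"`, while the same call with `strict=True` raises a `KeyError` before anything is launched.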
With 0.3.0, forgetting a template variable is not allowed, and the following error will be generated:
4.2 Short-circuit execution on unset environment variable
In the same vein, referring to an unset environment variable in a template through systemEnv will result in an error instead of an empty string.
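A strict lookup of this kind is easy to sketch in Python. The `system_env` function, its optional `environ` parameter (added so the behavior can be exercised without touching the real environment) and the error message are hypothetical, and the actual error raised by Dataflow Runner will look different.

```python
import os


def system_env(name, environ=None):
    """Return the value of an environment variable, failing if it is unset."""
    environ = os.environ if environ is None else environ
    if name not in environ:
        # Fail loudly instead of silently substituting an empty string.
        raise LookupError(f"environment variable {name} is not set")
    return environ[name]
```

Failing at lookup time means a misconfigured shell is caught while the template is being rendered, rather than showing up later as a resource with an empty name.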
Let’s say that we have the following
Before 0.3.0, launching the cluster with the CLUSTER_NAME environment variable unset would have resulted in an EMR cluster with no name. Now, it will result in the following error:
As we stated in the blog post for the previous release, we are committed to supporting other clouds such as Azure HDInsight (see issue #22) and Google Cloud Dataproc.
If you have other features in mind, feel free to log an issue in the GitHub repository.
You can check out the repository if you’d like to get involved! In particular, any preparatory work getting other cloud providers integrated would be much appreciated.