Dataflow Runner 0.3.0 released


We are pleased to announce version 0.3.0 of Dataflow Runner, our cloud-agnostic tool to create clusters and run jobflows. This release is centered around new features and usability improvements.
In this post, we will cover:
- Preventing overlapping job runs through locks
- Tagging playbooks
- New template functions
- Other updates
- Roadmap
- Contributing
1. Preventing overlapping job runs through locks
This release introduces a mechanism to prevent two jobs from running at the same time. This is great if you have for example an ETL process that needs to run as a singleton, or you have multiple jobs that each need exclusive access to the same database.
With this feature, Dataflow Runner will acquire a lock before starting the job. Its release will happen when:
- the job has terminated (whether successfully or with failure) with the
--softLock
flag - the job has succeeded with the
--lock
flag (“hard lock”)
As the above implies, if a job were to fail and the --lock
flag was used, manual cleaning of the lock will be required.
Two strategies for storing the lock have been made available: local and distributed.
1.1 Local lock
You can leverage a local lock when launching your playbook with ./dataflow-runner run
using:
./dataflow-runner run --emr-playbook playbook.json --emr-cluster j-21V4W2CSLYUCU --lock path/to/lock
This prevents anyone on this machine from running another playbook using path/to/lock
as lock.
For example, launching the following while the steps above are running:
./dataflow-runner run --emr-playbook playbook.json --emr-cluster j-KJC0LSX73BSF --lock path/to/lock
fails with:
WARN[0000] lock already held at path/to/lock
You can set the lock name as appropriate to setup locks across different playbooks, job names and/or cluster IDs.
In a local context, the lock will be materialized by a file at the specified path which can be relative or absolute. In case of a relative path, it will be relative to your current working directory.
1.2 Distributed lock
Anoter strategy is to leverage Consul to enforce a distributed lock:
./dataflow-runner run --emr-playbook playbook.json --emr-cluster j-21V4W2CSLYUCU --lock path/to/lock --consul 127.0.0.1:8500
That way, anyone using path/to/lock
as lock and this Consul server will have to respect the lock.
In a distributed context, the lock will be materialized by a key-value pair in Consul, the key being at the specified path.
2. Tagging playbooks
Much like cluster configurations which can be tagged, versions 0.3.0 introduces the ability to tag playbooks.
As an example, we could have the following playbook.json
file:
{ "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1", "data": { "region": "us-east-1", "credentials": { "accessKeyId": "env", "secretAccessKey": "env" }, "steps": [ { "type": "CUSTOM_JAR", "name": "Combine Months", "actionOnFailure": "CANCEL_AND_WAIT", "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar", "arguments": [ "--src", "s3n://my-output-bucket/enriched/bad/", "--dest", "hdfs:///local/monthly/" ] } ], "tags": [ { "key": "environment", "value": "production" } ] } }
However, unlike the cluster configuration tags which actually tag the EMR cluster, playbook tags don’t have any effect in EMR.
Note that, compared with version 0.2.0 of Dataflow Runner, the playbook schema version has changed to 1-0-1. 1-0-1 is fully backward compatible, so if you do not wish to use the tags introduced in this release you do not have to change anything.
The up-to-date playbook schema can be
found on GitHub.
3. New template functions
In addition to the already existing nowWithFormat
and systemEnv
, the 0.3.0 release brings three new template functions: timeWithFormat
, base64
, and base64File
.
3.1 timeWithFormat
Similarly to nowWithFormat
, timeWithFormat [time] [format]
will format the specified unix time thanks to the format argument.
As an example, if we have the following in our cluster.json
:
{ "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1", "data": { "name": "dataflow-runner {{timeWithFormat "1495622024" "Mon Jan _2 15:04:05 2006"}}", // omitted for brevity } }
it results in:
{ "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1", "data": { "name": "dataflow-runner Wed May 24 10:33:44 2017", // omitted for brevity } }
3.2 base64
As its name implies, the base64
template function will encode the argument using base 64 encoding.
For example:
{ "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1", "data": { "name": "{{base64 "dataflow-runner"}}", // omitted for brevity } }
results in:
{ "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1", "data": { "name": "ZGF0YWZsb3ctcnVubmVy", // omitted for brevity } }
3.3 base64File
Buidling on base64
, base64File
will encode the contents of the file passed as argument.
Let’s say we have a playbook-name.txt
file containing:
dataflow-runner
The following cluster.json
:
{ "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1", "data": { "name": "{{base64File "playbook-name.txt"}}", // omitted for brevity } }
results in:
{ "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1", "data": { "name": "ZGF0YWZsb3ctcnVubmVyCg==", // omitted for brevity }
}
4. Other updates
Some changes have been made to improve usability regarding missing template variables:
4.1 Short-circuit execution on unset template variable
Prior to 0.3.0, if you forgot to specify a template variable, then the string <no value>
would be filled into the template.
For example, launching an EMR cluster with the following cluster.json
configuration:
{ "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1", "data": { "name": "{{.name}} - {{.owner}}", // omitted for brevity } }
with:
./dataflow-runner up --emr-config config.json --vars name,snowplow
would have resulted in an EMR cluster named: snowplow - <no value>
.
With 0.3.0, forgetting a template variable is not allowed, and the following error will be generated:
FATA[0000] template: cluster.json: executing "cluster.json" at <.owner>: map has no entry for key "owner"
4.2 Short-circuit execution on unset environment variable
In the same vein, referring to an unset environment variable in a template through systemEnv
will result in an error instead of an empty string.
Let’s say that we have the following cluster.json
:
{ "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1", "data": { "name": "{{systemEnv "CLUSTER_NAME"}}", // omitted for brevity } }
Before 0.3.0, launching the cluster with the CLUSTER_NAME
enviroment variable unset would have resulted in an EMR cluster with no name. Now, it will result in the following error:
FATA[0000] template: cluster.json: executing "cluster.json" at <systemEnv "CLUSTER_NAME">: error calling systemEnv: environment variable CLUSTER_NAME not set
5. Roadmap
As we stated in the blog post for the previous release, we are committed to supporting other clouds such as Azure HDInsight (see issue #22) and Google Cloud Dataproc.
If you have other features in mind, feel free to log an issue in the GitHub repository.
6. Contributing
You can check out the repository if you’d like to get involved! In particular, any preparatory work getting other cloud providers integrated would be much appreciated.