Spark Example Project released for running Spark jobs on EMR
On Saturday I attended Hack the Tower, the monthly collaborative hackday for the London Java and Scala user groups hosted at the Salesforce offices in Liverpool Street.
It’s an opportunity to catch up with others in the Scala community, and to work collaboratively on non-core projects which may have longer-term value for us here at Snowplow. It also means I can code against the backdrop of some of the best views in London (see below)! Many thanks as always to John Stevenson of Salesforce for hosting us.
Over the last few months I have been teaming up with other Scala devs at Hack the Tower to try out Apache Spark, a cluster computing framework and potential challenger to Hadoop.
The particular challenge I set myself this month was to complete our Spark Example Project, which is a clone of our popular Scalding Example Project. Most howtos for data processing frameworks like Scalding or Spark assume that you are working with a local cluster in an interactive (e.g. REPL-based) fashion. At Snowplow, we are much more interested in creating self-contained jobs which can be run on Amazon’s Elastic MapReduce with a minimum of supervision, and this is what I wanted to template in the Spark Example Project.
In the rest of this blog post I’ll talk about:
- Challenges of running Spark on EMR
- How to use Spark Example Project
- Getting help
- Thoughts on Spark
- Spark and Snowplow
1. Challenges of running Spark on EMR

I actually started work on the Spark Example Project last year. I spent some time trying to get the project working on Elastic MapReduce: we wanted to be able to assemble a “fat jar” which we could deploy to S3 and then run on Elastic MapReduce via the API in a non-interactive way. This wasn’t possible at the time, despite the valiant efforts of Ian O’Connell (SparkEMRBootstrap) and Daithi O Crualaoich (spark-emr), and our own questioning. And so I paused the project, to revisit when EMR’s support for Spark improved.
Happily, on Saturday I noticed that Amazon’s tutorial, Run Spark and Shark on Amazon Elastic MapReduce, had been updated in early March, including with bootstrap scripts to deploy Spark 0.8.1 to an EMR cluster. Logging on to the master node, I found a script called ~/spark/run-example, designed to run any of Amazon’s example Spark jobs, each pre-assembled into a fat jar on the cluster.
It wasn’t a lot of work to adapt the ~/spark/run-example script so that it could be used to run any pre-assembled Spark fat jar available on S3 (or HDFS). That script is now available for anyone to invoke on Elastic MapReduce here:
s3://snowplow-hosted-assets/common/spark/run-spark-job-0.1.0.sh
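To give a sense of the mechanics: the script simply fetches the fat jar onto the cluster and launches the supplied main class with its arguments. Here is a rough sketch of the idea, not the actual script; the paths and the SPARK_CLASSPATH approach are assumptions based on how Spark 0.8.x is typically invoked:

```bash
#!/bin/bash
# Sketch only: fetch a pre-assembled fat jar and run its main class on Spark 0.8.x
set -e

JAR_PATH=$1      # e.g. s3://my-bucket/my-spark-job.jar (hypothetical)
MAIN_CLASS=$2    # fully-qualified main class inside the fat jar
shift 2          # remaining arguments are passed through to the job

# Copy the jar from S3 (or HDFS) onto the master node
hadoop fs -copyToLocal "$JAR_PATH" /tmp/job.jar

# Add the jar to Spark's classpath and run the job
SPARK_CLASSPATH=/tmp/job.jar ~/spark/spark-class "$MAIN_CLASS" "$@"
```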
Once this was working, it was just a matter of reverting our Spark Example Project to Spark 0.8.1, testing it thoroughly and updating the documentation!
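For context, the job in the Spark Example Project is a simple word count. Below is a minimal sketch of what such a self-contained Spark 0.8.x job looks like; the package, object name and argument handling are illustrative assumptions rather than the project’s exact code:

```scala
package com.example.spark

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // brings in reduceByKey and friends

// Illustrative sketch of a self-contained word count job
object WordCountJob {
  def main(args: Array[String]) {
    val Array(master, inputPath, outputPath) = args

    val sc = new SparkContext(master, "Word Count Job")

    sc.textFile(inputPath)
      .flatMap(_.toLowerCase.split("\\W+"))  // split lines into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)                     // count occurrences per word
      .saveAsTextFile(outputPath)
  }
}
```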
2. How to use Spark Example Project

Getting up-and-running with the Spark Example Project should be relatively straightforward:
2.1 Building
Assuming you already have SBT installed:
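Something like this should work (the repository URL is assumed from the project name, and the assembly task comes from the sbt-assembly plugin):

```bash
# Clone the project and build the fat jar
git clone https://github.com/snowplow/spark-example-project.git
cd spark-example-project
sbt assembly
```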
The ‘fat jar’ is now available in your target directory.
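With Spark 0.8.1 (which builds against Scala 2.9.3), the path should be something like the following, though the exact filename depends on the project version:

```
target/scala-2.9.3/spark-example-project-0.1.0.jar
```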
2.2 Deploying
Now upload the jar to an Amazon S3 bucket and make the file publicly accessible.
Next, upload the data file data/hello.txt to S3.
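For example, with the AWS CLI (bucket names are placeholders; the S3 console or s3cmd would work just as well):

```bash
# Upload the fat jar and make it publicly readable
aws s3 cp target/scala-2.9.3/spark-example-project-0.1.0.jar \
  s3://{JAR_BUCKET}/ --acl public-read

# Upload the sample input file
aws s3 cp data/hello.txt s3://{IN_BUCKET}/
```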
2.3 Running
Finally, you are ready to run this job using the Amazon Ruby EMR client:
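A job flow along these lines should work; the instance settings, the Spark bootstrap action path (taken from Amazon’s tutorial) and the main class name are assumptions, so check the project README for the exact invocation:

```bash
# Launch a transient EMR cluster, bootstrap Spark, then run the fat jar via the run script
elastic-mapreduce --create --name "spark-example-project" \
  --instance-type m1.xlarge --instance-count 3 \
  --bootstrap-action s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh \
  --bootstrap-name "Install Spark/Shark" \
  --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \
  --step-name "Run Spark Example Project" \
  --arg s3://snowplow-hosted-assets/common/spark/run-spark-job-0.1.0.sh \
  --arg s3://{JAR_BUCKET}/spark-example-project-0.1.0.jar \
  --arg com.snowplowanalytics.spark.WordCountJob \
  --arg s3n://{IN_BUCKET}/hello.txt \
  --arg s3n://{OUT_BUCKET}/results
```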
Replace {JAR_BUCKET}, {IN_BUCKET} and {OUT_BUCKET} with the appropriate paths.
2.4 Verifying
Once the output has completed, you should see a folder structure like this in your output bucket:
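For a two-partition word count, the structure should look something like this (the folder name depends on the output path you chose):

```
results/
  _SUCCESS
  part-00000
  part-00001
```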
Download the files and check that, between them, part-00000 and part-00001 contain the expected word counts.
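Assuming data/hello.txt is a simple two-word greeting, the combined contents should look something like this (how the pairs split across the two part files may vary):

```
(hello,1)
(world,1)
```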
And that’s it!
3. Getting help

We hope you find the Spark Example Project useful. As always with releases from the Snowplow team, if you run into any issues or don’t understand any of the above, please raise an issue or get in touch with us via the usual channels.
4. Thoughts on Spark

Although it is still early days, I was impressed with Spark, and very pleased to get it running on Elastic MapReduce in the same fashion as our existing Scalding jobs.
In particular, I like Spark’s pragmatic use of in-memory processing, where Hadoop jobs by contrast can be very disk-intensive. I also like Spark’s narrower focus: where Hadoop is an entire data ecosystem (file system, cluster management, job scheduling and so on), Spark is much more manageable in scope, designed to work well alongside other great technologies such as Apache Mesos, Typesafe Akka and HDFS.
Separately, at Snowplow we are also closely following the Spark Streaming project, and paying particular attention to Amazon’s work adding Kinesis support there.
5. Spark and Snowplow

We are excited at Snowplow about the long-term potential of Apache Spark as a data processing framework to use alongside, or potentially in places instead of, Hadoop.
As a first step, we plan to pilot writing bespoke data processing jobs for Professional Services clients in Spark, where previously we would have used Scalding. If this goes well, we may experiment with running the Snowplow Enrichment process (scala-common-enrich) from inside Spark.
Separately, we will look into integrating Spark Streaming into our Snowplow Kinesis flow; this could be a great way of implementing real-time decisioning flows and feedback loops for our users.
Stay tuned for more from Snowplow about Spark and Spark Streaming in the future!