Building robust data pipelines in Scala – Session at Scala eXchange, December 2014


It was great to have the opportunity to speak at Scala eXchange last week in London on the topic of “Building robust data pipelines in Scala: the Snowplow experience”.
It was my first time speaking at a conference dedicated to Scala – and it was fantastic to see such widespread adoption of Scala in the UK and Europe. It was also great meeting up with Snowplow users and contributors face-to-face for the first time!
Many thanks to the team at Skills Matter for organizing such a great conference.
Below the fold I will briefly cover:

- Building robust data pipelines in Scala
- My highlights from Scala eXchange
Building robust data pipelines in Scala
This session was an opportunity for me to “step back” a little and think about how and why we use Scala to enforce robust event processing at Snowplow. We have always been strong proponents of what we have called “high-fidelity analytics” – in this talk I explored how we use Scalding, the Scalaz toolkit and some simple design patterns to deliver this robustness.
It was a very experienced and technical audience, who asked some great questions. The pattern I presented that seemed to resonate most was “railway-oriented programming”, a term coined by functional programmer Scott Wlaschin in his Railway oriented programming blog post.
At Snowplow we came to Scott’s “railway-oriented” approach independently via Scalaz’s Validation type, which today underpins all of our event validation and processing. Scala and big data guru Dean Wampler was in the audience and summed up the railway approach in a single tweet:
"Railway-oriented programming": Data stream happy and a failure paths. Track from happy to failure, but not the reverse @alexcrdean #ScalaX
— Dean Wampler (@deanwampler) December 8, 2014
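For readers who missed the session, here is a minimal sketch of the idea using scalaz 7’s ValidationNel (the validator names and the EnrichedEvent shape are hypothetical illustrations, not Snowplow’s actual enrichment code). Each step returns a Validation: a Success keeps the event on the happy track, while a Failure switches it onto the failure track, carrying its error messages with it:

```scala
import scalaz._
import Scalaz._

object RailwayExample {

  // Hypothetical enriched-event shape, for illustration only
  case class EnrichedEvent(appId: String, timestamp: Long)

  // Each validator returns a ValidationNel: Success stays on the happy
  // track; Failure switches to the failure track, carrying a non-empty
  // list of error messages
  def validateAppId(raw: String): ValidationNel[String, String] =
    if (raw.nonEmpty) raw.successNel
    else "Empty app ID".failureNel

  def validateTimestamp(raw: String): ValidationNel[String, Long] =
    try raw.toLong.successNel
    catch { case _: NumberFormatException => s"Bad timestamp: $raw".failureNel }

  // The applicative |@| combines the two tracks: if both inputs are
  // Successes we build the event; if either is a Failure, every error
  // is accumulated in the NonEmptyList. Nothing ever moves from the
  // failure track back to the happy track.
  def enrich(appId: String, ts: String): ValidationNel[String, EnrichedEvent] =
    (validateAppId(appId) |@| validateTimestamp(ts))(EnrichedEvent.apply)
}

// RailwayExample.enrich("web", "1418032800000")
//   => Success(EnrichedEvent(web,1418032800000))
// RailwayExample.enrich("", "not-a-number")
//   => Failure(NonEmptyList(Empty app ID, Bad timestamp: not-a-number))
```

This captures the property Dean’s tweet describes: composition only ever moves an event from the happy track to the failure track, never the reverse.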
I really enjoyed giving the talk – it was a great opportunity to shine a technical light on the foundational work we do at Snowplow on event quality and pipeline robustness. You can watch a video of the session on the Skills Matter website. Expect a chapter on “railway-oriented programming” in Unified Log Processing in due course!
My highlights from Scala eXchange
The Skills Matter team succeeded in packing a huge number of great sessions into Scala eXchange’s two days. Here were some of my highlights:
- Martin Odersky, the creator of Scala, gave a much-needed introduction to the binary compatibility challenges faced by Scala, how these are typically handled (or avoided) in other programming languages, and a suggested solution for Scala called typed trees. A great talk on an important and poorly-understood topic
- Noel Markham, a Scala developer at ITV, gave a great introductory talk on Scalaz, the functional programming toolkit for Scala which we use heavily at Snowplow
- Brendan McAdams of Netflix gave a valuable talk on using Scala at scale there, including some great insights into their AMI-based packaging and deployment strategies
- Andreas Gies, atooni on GitHub, gave a deep technical talk on building a multi-container integration test suite using Akka, Docker and ScalaTest. Very clever stuff and available on GitHub as part of the blended project
- Pere Villega, a Scala developer at Gumtree, shared his experiences building a micro-services architecture around Apache Kafka. Our work at Snowplow with Kafka centers on its role as a unified log – so it was interesting to get the micro-services perspective on this same technology
Of course these were just my highlights – the two days were packed with great content and interactions across the four tracks. In particular, I was sorry to miss Dean Wampler’s second day keynote on why Scala is dominating the big data landscape – something we definitely concur with at Snowplow.
Many thanks to Skills Matter and all the organizers for an excellent conference!