As we have got to know the Snowplow community better, it has become clear that many members have very specific event processing requirements, including:
- Custom trackers and collector logging formats
- Custom event models
- Custom business logic that affects the way their event data is processed
To date, we have relied on three main techniques to help Snowplow users meet these requirements:
- Adding configuration options to the core Enrichment process (e.g. IP address anonymization, coming in 0.8.11)
- Working with users on bespoke re-writes of the Snowplow Enrichment process (mostly forks of the Scalding ETL job)
- Helping users to implement additional processing steps downstream of the current Enrichment/Storage processes (e.g. building reporting cubes in Hive or Redshift)
Each of these approaches has its strengths and weaknesses, and we will certainly continue to develop and improve all three. But we also want to explore if there is a “middle ground” between configuration options and fully bespoke code: can we somehow make the Snowplow Enrichment process user-scriptable?
If possible, the following approach would make an attractive middle ground:
- Pass one or more user-authored scripts into our Scalding ETL at runtime
- The user-authored script(s) are executed against each row of event data
- These scripts can be written in a popular and easy-to-learn scripting language
The first step in testing whether this approach is viable was to explore the interoperation between Scala and Rhino, Mozilla's JavaScript engine for the JVM, and see what was possible. In the rest of this blog post, we will reproduce that investigation as an interactive REPL (read-eval-print loop) session. To follow along, you will need to have SBT and Scala installed…
First we clone our Scalding Example Project, available on GitHub. This gives us a Scala environment which we know can successfully run Scalding on Hadoop (including Elastic MapReduce), giving us some confidence that whatever scripting works in this environment will ultimately work fine on EMR too.
So let’s get started:
```
$ git clone git@github.com:snowplow/scalding-example-project.git
$ cd scalding-example-project
$ sbt
scalding-example-project > console
scala>
```
Great, now we’re in the Scala console within SBT, and we have access to all of the libraries loaded as part of the scalding-example-project should we need them.
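The snippet evaluated at this point in the original session has not survived in this copy; a minimal reconstruction might look like the following (it assumes Rhino, i.e. the `org.mozilla:rhino` artifact, is on the classpath, and the script string is illustrative):

```scala
import org.mozilla.javascript.Context

// Enter a Rhino context and evaluate a first JavaScript expression
val cx = Context.enter()
val scope = cx.initStandardObjects()
// Note the loose return type: evaluateString returns java.lang.Object
val result = cx.evaluateString(scope, "1 < 2", "<repl>", 1, null)
println(result) // prints: true (a java.lang.Boolean underneath)
Context.exit()
```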
The return type definitely looks problematic, although the problem won’t manifest itself until we try to cast it into a Boolean. So let’s put together an example with some type safety:
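A type-safe wrapper along these lines (a sketch — the exact method from the original session may differ) uses Scala's `PartialFunction.condOpt` to pattern-match on the runtime type, returning `None` rather than throwing when the result is not a Boolean:

```scala
import org.mozilla.javascript.Context
import scala.PartialFunction.condOpt

// Evaluate a JavaScript expression, returning Some(boolean) only if
// the result really is a Boolean, and None otherwise
def evalAsBoolean(js: String): Option[Boolean] = {
  val cx = Context.enter()
  try {
    val scope = cx.initStandardObjects()
    condOpt(cx.evaluateString(scope, js, "evalAsBoolean", 1, null)) {
      case b: java.lang.Boolean => b.booleanValue
    }
  } finally {
    Context.exit()
  }
}
```

So `evalAsBoolean("1 < 2")` yields `Some(true)`, while a script returning a string yields `None` instead of a `ClassCastException`.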
condOpt “magic”, check out this Stack Overflow answer to “How to cast java.lang.Object to a specific type in Scala?”.
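The next step in the session passed a Scala case class into the JavaScript scope. A hedged sketch of that step — the `Employee` class, its fields, and the script are all invented for illustration:

```scala
import org.mozilla.javascript.{Context, ScriptableObject}

// Hypothetical case class for illustration -- not from the original post
case class Employee(name: String, age: Int)

val cx = Context.enter()
val scope = cx.initStandardObjects()
// Wrap the Scala instance and bind it into the JavaScript scope
ScriptableObject.putProperty(scope, "employee",
  Context.javaToJS(Employee("Jane", 34), scope))
// Case class fields are nullary methods, so JavaScript invokes them as calls
val result = cx.evaluateString(scope, "employee.age() > 21", "<repl>", 1, null)
println(result) // prints: true
Context.exit()
```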
Great! We can see inside Scala case classes without any particular fuss.
Before we go, let's try to generalize our `evalAsBoolean()` method above into something a little bit more reusable. How about a method with a signature like this:
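The signature itself is missing from this copy of the post; one plausible sketch (the name `eval`, the varargs bindings, and the partial-function cast are all our own guesses) would be:

```scala
def eval[A](js: String, bindings: (String, AnyRef)*)
           (cast: PartialFunction[AnyRef, A]): Option[A]
```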
Hopefully the function arguments and return value are fairly clear, so let’s proceed to the whole function definition:
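The definition has also dropped out of this copy; a plausible reconstruction (all identifiers are our own) binds each name into the Rhino scope, evaluates the script, then casts the result with the caller-supplied partial function:

```scala
import org.mozilla.javascript.{Context, ScriptableObject}
import scala.PartialFunction.condOpt

// Evaluate a JavaScript expression against optional named Scala bindings,
// using the caller-supplied partial function to cast the result safely
def eval[A](js: String, bindings: (String, AnyRef)*)
           (cast: PartialFunction[AnyRef, A]): Option[A] = {
  val cx = Context.enter()
  try {
    val scope = cx.initStandardObjects()
    // Wrap each Scala value and expose it to the script under its name
    for ((name, value) <- bindings)
      ScriptableObject.putProperty(scope, name, Context.javaToJS(value, scope))
    condOpt(cx.evaluateString(scope, js, "eval", 1, null))(cast)
  } finally {
    Context.exit()
  }
}
```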
Paste that into the Scala console in SBT and you should see:
Now let’s try this out – first with a script which should evaluate to true:
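The original call is lost here; assuming a generalized helper of the shape `eval(js, bindings*)(cast)` (our reconstruction), it might look like:

```scala
eval("2 + 2 == 4") { case b: java.lang.Boolean => b.booleanValue }
// -> Some(true)
```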
Now a false value, involving checking a property inside of a case class:
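Again a sketch, binding a hypothetical case class into scope (`eval` and `Employee` are our reconstructions, not the original code):

```scala
case class Employee(name: String, age: Int)

eval("employee.age() > 65", "employee" -> Employee("Jane", 34)) {
  case b: java.lang.Boolean => b.booleanValue
}
// -> Some(false)
```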
So there we have it: user-authored JavaScript can be successfully invoked at runtime from a Scala environment.
If you’re interested in adapting Snowplow’s technology to meet your custom event processing needs, and would like to discuss your requirements with the Snowplow team, then get in touch.