As part of my winternship here at Snowplow Analytics in London, I’ve been experimenting with using Scala to upload Snowplow’s enriched events to Google’s BigQuery database. The ultimate goal is to add BigQuery support to both Snowplow pipelines, including being able to stream data in near-realtime from an Amazon Kinesis stream to BigQuery. This blog post will cover:
- Getting started with BigQuery
- Downloading some enriched events
- Installing BigQuery Loader CLI
- Loading enriched events into BigQuery
- Analyzing the event stream in BigQuery
- Next steps
To follow along with this tutorial, you will need:
- Some Snowplow enriched events as typically archived in Amazon S3
- Java 7+ installed
- A Google BigQuery account
If you don’t already have a Google BigQuery account, please sign up to BigQuery, and enable billing. Don’t worry, this tutorial shouldn’t cost you anything – Google have reasonably generous free quotas for both uploading and querying.
Next, create a project, and make a note of the Project Number by clicking on the name of the project on the Google Developers Console.
We now need a local folder of Snowplow enriched events – these should be in your Archive Bucket in S3. If you use a GUI S3 client like Bucket Explorer or Cyberduck, use that now to download some enriched events from your archive. You want to end up with a single folder containing enriched event files.
If you use the AWS CLI tools, then the following shell commands should retrieve all of your enriched events for January (update the bucket path and profile accordingly):
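As a rough sketch of this step (not necessarily the exact commands used for this post), the download and the flattening into a single folder could look like the following. The bucket path, profile name, month and function names are placeholders, and the `run=...` folder layout of the archive is an assumption, so adjust them to match your own setup.

```shell
# Placeholder values (assumptions): use your own archive path, AWS profile and month
ARCHIVE="s3://my-archive-bucket/enriched/good"
PROFILE="my-aws-profile"
MONTH="2015-01"

# Download only the run folders for the chosen month into the folder given as $1.
# Assumes archived runs live under folders named run=YYYY-MM-DD-hh-mm-ss.
fetch_events() {
  aws s3 cp "$ARCHIVE" "$1" --recursive \
    --exclude "*" --include "run=$MONTH-*" \
    --profile "$PROFILE"
}

# Move every file under $1 into the single un-nested folder $2, prefixing each
# file with its run folder name so identically named part files do not collide.
flatten_events() {
  mkdir -p "$2"
  find "$1" -type f | while IFS= read -r file; do
    run_folder="$(basename "$(dirname "$file")")"
    mv "$file" "$2/${run_folder}-$(basename "$file")"
  done
}
```

With these defined, `fetch_events ./nested` followed by `flatten_events ./nested ./enriched-events` should leave you with one flat folder of enriched event files. Keep the target folder outside the download folder so that `find` does not pick up files that have already been moved.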
For the purposes of this tutorial I have written a simple command-line application in Scala, called BigQuery Loader CLI, to handle the loading of Snowplow enriched events into BigQuery.
The jarfile is hosted on Bintray as a compressed archive. You can download it by running the following shell commands:
We now need some Google credentials to access the BigQuery project. Head back to the Google Developers Console and:
- Click on the Consent screen link in the APIs and auth section of the Developers Console, add an Email address and hit Save
- Click on the Credentials link in the APIs and auth section
- Click on the Create new Client ID button, selecting Installed application as the application type and Other as the installed application type
- Click Create Client ID and then Download JSON to save the file
- Save the `client_secrets` file to the same directory that you unzipped the command-line app into
- Rename the file so that its name includes `<projectId>`, where `<projectId>` is the Project Number obtained earlier
Done? Now we are ready to run the application.
To upload your data, you simply type the command:
- `<projectId>` is the Project Number obtained from the Google Developers Console
- `<datasetId>` is the name of the dataset, which will be created if it doesn’t already exist
- `<tableId>` is the name of the table, which will be created if it doesn’t already exist
- `<dataLocation>` is the location of either a single file of Snowplow enriched events, or an un-nested folder of Snowplow enriched events
On your first use of this command you will be prompted to go through Google’s browser-based authentication process. The upload itself may take a little while, because each file found in the directory is loaded separately.
To append further data to the table, simply run the command again, omitting the `--create-table` flag and changing `<dataLocation>` as appropriate.
You can now view your loaded events in the Developers Console – navigate to the query UI by clicking on the BigQuery button under Big Data at the bottom-left.
Let’s take a simple query from Snowplow’s Analyst’s Cookbook: Number of unique visitors. Adapted to BigQuery’s slightly idiosyncratic SQL syntax, it looks like this:
If we run it against our January data in BigQuery, we will see something like this:
If you want to try your hand at adapting other Snowplow recipes to BigQuery, make sure to check out Google’s Query Reference documentation for BigQuery.
The next step in terms of my R&D with Google BigQuery is to write a Kinesis app that reads Snowplow enriched events from a Kinesis stream and writes them to BigQuery in near-realtime. After this, we will port this functionality over into Snowplow’s Hadoop-based batch flow. We also need to determine how best to support unstructured event and custom context JSONs in BigQuery.
Meanwhile, on the analytics side, others at Snowplow are looking at how they might best utilize the unique features of BigQuery to analyze a Snowplow event stream.
If you have run into any problems with this tutorial, or have any suggestions for our BigQuery roadmap, please do raise an issue or get in touch with us through the usual channels.