Snowplow 0.8.8 released with Postgres and Hive support
We are pleased to announce the immediate release of Snowplow 0.8.8. This is a big release for us: it adds the ability to store your Snowplow events in the popular PostgreSQL open-source database. This has been the most requested Snowplow feature all summer, so we are delighted to finally release it.
And if you are already happily using Snowplow with Redshift, there are two other new features to check out:
- We have added support for multiple storage targets to Snowplow’s StorageLoader. This means that you can configure StorageLoader to load into three different Redshift databases, one PostgreSQL database and one Redshift – whatever
- We have brought back the ability to query your Snowplow events using HiveQL. Regardless of which storage target(s) you are using, you can now also run HiveQL queries against your Snowplow events stored in Amazon S3
As well as these new features, we have made a large number of improvements across Snowplow:
- We have made some improvements to the Hadoop-based Enrichment process (bumped to version 0.3.3)
- We have simplified EmrEtlRunner and its configuration file format
- We have improved the performance of the Redshift loading code
- We have added a configuration option for setting
MAXERROR
when loading into Redshift (see the RedshiftCOPY
documentation for details) - We have moved the Snowplow JavaScript Tracker into its own repository
- We have removed the deprecated Hive ETL and Infobright folders from the repository
After the fold, we will cover the options for upgrading and using the new functionality:
1. Upgrading
There are three components to upgrade in this release:
- The Hadoop ETL, to version 0.3.3
- EmrEtlRunner, to version 0.4.0, and its configuration file
- StorageLoader, to version 0.1.0, and its configuration file
Let’s take these in turn:
Hadoop ETL
If you are using EmrEtlRunner, you need to update your configuration file, config.yml
, to use the latest version of the Hadoop ETL:
Read on for the rest of the changes you will need to make to config.yml
.
EmrEtlRunner
You need to upgrade your EmrEtlRunner installation to the latest code (0.8.8 release) on GitHub:
$ git clone git://github.com/snowplow/snowplow.git $ git checkout 0.8.8 $ cd snowplow/3-enrich/emr-etl-runner $ bundle install --deployment
Next, you need to update the format of your config.yml
– the format has been simplified significantly. The new format (as found on GitHub) looks like this:
Note that the :snowplow:
section has been mostly removed, with :hadoop_etl_version:
moving into the :etl:
section.
StorageLoader
You need to upgrade your StorageLoader installation to the latest code (0.8.8 release) on GitHub:
$ git clone git://github.com/snowplow/snowplow.git $ git checkout 0.8.8 $ cd snowplow/4-storage/storage-loader $ bundle install --deployment
Next, you need to update the format of your config.yml
– the format has been updated to support multiple storage targets. The new format (as found on GitHub) looks like this:
Note the new :maxerror:
setting – see the Redshift COPY
documentation for more on this.
To add another Redshift storage target, just add another configuration block under :targets:
, starting with - :name:
.
To add a new Postgres storage target, read on…
2. Loading events into Postgres
Loading events into Postgres is quite straightforward:
- Upgrade to the latest version of Snowplow as described above
- If you don’t have one already, setup a Postgres database server
- Create a
snowplow
database within Postgres - Deploy the Snowplow schema and table into
snowplow
- Configure StorageLoader to load into Postgres
For help on steps 2-4, please see our new guide, Setting up PostgreSQL. You can find the new PostgreSQL script on GitHub, here.
For step 5, you should create a StorageLoader configuration file which looks like this (as found on GitHub):
Make sure to set :folder:
to a local directory where you can download the Snowplow event files to, ready for loading into your local Postgres database server.
3. Querying events with HiveQL
The new release makes it possible again to query your Snowplow events directly on Amazon S3 using HiveQL.
Steps are as follows:
- Upgrade to the latest version of Snowplow as described above
- Wait for Snowplow to run at least once following the upgrade
- Follow the instructions in our updated guide, Running Hive using the command line tools
If you want to run HiveQL queries across your historical event data (i.e. from before the upgrade), this is possible too. You will need to rename the timestamped folders in your event archive from the old format to the new format, by prepending run=
. So for example, change:
s3://my-snowplow-archive/events/2013-07-01-04-00-03
To:
s3://my-snowplow-archive/events/run=2013-07-01-04-00-03
One this is done for all folders, all of your historic event files should be correctly partitioned ready for Hive to query.
4. Getting help
As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.
For more details on this release, please check out the 0.8.8 Release Notes on GitHub.