Announcing Snowplow OS Distribution 21.08 North Cascades
In this post we are pleased to announce the latest features of the Snowplow open source projects as part of the Snowplow OS Distribution 21.08 North Cascades. This is the second distribution in our new format, building on the updates we discussed in the 21.04 Pennine Alps announcement. As before, we clarify our recommended component versions and discuss the latest Snowplow features.
As a brief reminder, in April 2020 we made our last umbrella release, R119 Tycho Magnetic Anomaly Two. Following this release, we moved the Snowplow components into separate repositories and we announced why we’re changing the way we’re releasing in a blog post. We then started making periodic OS Distributions, starting with 21.04 Pennine Alps.
Snowplow 21.08 North Cascades
So what’s new in 21.08 North Cascades? Over the past few months, we have focused on:
- Improved mobile analytics options: Continuing to improve our mobile offering through a number of mobile tracker updates (including the ability to remotely update configurations), as well as the release of the mobile data models.
- Open Source quick start for AWS: Making it easier than ever to get started with Snowplow open source on AWS via a suite of Terraform modules (GCP coming soon).
- Optimizing the core pipeline applications: Optimizing the cost, performance and observability of the Snowplow pipeline through a rewrite of some of the core pipeline applications.
- Support for dbt: The first release of the official Snowplow dbt package for our web data model.
- Snowplow Micro experience improvements: Load your schemas directly into Snowplow Micro and various other developer experience improvements.
Improved mobile analytics options
There have been three significant releases of the Snowplow mobile app trackers and a brand new mobile data model for SQL Runner.
Native Mobile Trackers
First up, we have the two native mobile trackers for iOS and Android. With version 2 of these trackers, both now have feature parity and a consistent API across the two platforms. We’ve also made considerable improvements to the API to make the trackers easier to configure and understand. You can also now configure the trackers remotely: they can read a remote configuration file and send events using the configuration options specified there.
React Native Trackers
There is also a significantly updated version of the React Native Tracker. Version 1.0.0 of the React Native Tracker is built on the foundations of the native mobile trackers v2, so it includes the same features as well as various developer experience improvements, such as type definitions for those using the tracker with TypeScript.
Mobile Data Model
Following the release of our new and updated Web Data Model, we have built a new Mobile Data Model specifically for the types of events which are emitted in typical use cases from our Mobile trackers.
This new incremental model is designed to work with the Snowplow mobile trackers with screen view events and the mobile session context enabled. It runs on Redshift, BigQuery and Snowflake using SQL Runner.
Support for dbt
To ensure that we continue offering the best options when it comes to modelling your Snowplow data, we have now released the first official Snowplow dbt package. This initial release is for the web data model and currently supports Redshift and BigQuery. You can expect Snowflake support shortly.
packages:
  - package: snowplow/snowplow_web
    version: 0.2.0
Optimizing the core pipeline applications
Re-architected RDB Loader
We have significantly improved the Snowplow RDB loader to make it observable, more performant and cheaper to run. It has been re-written as a light-weight standalone application, and re-architected to remove EmrEtlRunner and reduce the number of steps required to load the enriched data to Redshift. This forms part of our overarching strategy to move the Snowplow pipeline applications away from big data frameworks.
The latency of the data landing in Redshift and a count of good events are now emitted by the loader. In addition, a new manifest table has been made available within Redshift itself, which protects against double loading, but also gives you granular and exhaustive information about each batch that has been processed.
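To illustrate the general idea of a manifest guarding against double loading, here is a minimal sketch. It is not the RDB Loader's implementation: the table names, columns and the `load_batch` helper are hypothetical, and SQLite stands in for Redshift purely so the example runs locally (Redshift does not enforce primary key constraints, so the real loader must use a different mechanism).

```python
import sqlite3


def load_batch(conn, batch_id, event_ids):
    """Load a batch once; return True if loaded, False if already in the manifest."""
    try:
        # One transaction: the manifest entry and the data commit or roll back together.
        with conn:
            conn.execute(
                "INSERT INTO manifest (batch_id, event_count) VALUES (?, ?)",
                (batch_id, len(event_ids)),
            )
            conn.executemany(
                "INSERT INTO events (batch_id, event_id) VALUES (?, ?)",
                [(batch_id, event_id) for event_id in event_ids],
            )
        return True
    except sqlite3.IntegrityError:
        # The batch_id is already recorded, so this batch was loaded before.
        return False


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE manifest (batch_id TEXT PRIMARY KEY, event_count INTEGER)")
conn.execute("CREATE TABLE events (batch_id TEXT, event_id TEXT)")
```

Calling `load_batch` twice with the same `batch_id` loads the rows only once, and the manifest row doubles as a per-batch record (here just an event count) that can be queried afterwards.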
New Enrich PubSub
The new enrich-pubsub asset has also been built as a standalone application, replacing Apache Beam and no longer running as a Dataflow job, with the benefit of delivering cost reductions. Our benchmarking results against Beam Enrich have shown up to a 50% cost saving for a similar throughput (note that this will vary depending on your event volumes).
In addition, we have added observability to this version of Enrich, so you can now monitor the latency of your data from the point of hitting the collector to the point that it gets written to the enriched stream on Pub/Sub.
The new metrics use StatsD and as such require a StatsD installation. You can find the Installation and Configuration guide in the StatsD documentation.
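For a quick look at what arrives before setting up a full StatsD installation, the sketch below parses the basic StatsD line format (`<name>:<value>|<type>`) and optionally listens on the conventional StatsD UDP port, 8125. The metric name used in the usage note is hypothetical; check the Enrich documentation for the names it actually emits.

```python
import socket


def parse_statsd(line):
    """Parse a basic StatsD line of the form '<name>:<value>|<type>'.

    Any further '|'-separated fields (sample rate, tags) are ignored.
    """
    name, _, rest = line.partition(":")
    fields = rest.split("|")
    if not name or len(fields) < 2:
        raise ValueError(f"not a StatsD line: {line!r}")
    return {"name": name, "value": float(fields[0]), "type": fields[1]}


def listen(host="127.0.0.1", port=8125):
    """Print every metric received over UDP until interrupted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            data, _ = sock.recvfrom(65535)
            # A single datagram may carry several newline-separated metrics.
            for line in data.decode("utf-8").splitlines():
                print(parse_statsd(line))
```

For example, `parse_statsd("snowplow.enrich.latency:1250|g")` yields a gauge named `snowplow.enrich.latency` with value `1250.0`. This is only a debugging aid, not a replacement for a StatsD server.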
We have also added the fields event_name and app_id as attributes to the messages that are written to Pub/Sub, making it easier for downstream subscribers of the enriched stream to filter or selectively subscribe to specific events.
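As a rough sketch of how a subscriber might use these attributes, the helper below builds a Pub/Sub subscription filter expression over `event_name` and `app_id`. The function name and the example values are assumptions for illustration; the resulting string would be supplied as the filter when creating a subscription.

```python
def enriched_filter(event_names=(), app_id=None):
    """Build a Pub/Sub filter expression on the event_name and app_id attributes."""
    clauses = []
    if event_names:
        names = " OR ".join(f'attributes.event_name = "{n}"' for n in event_names)
        # Parenthesise OR groups so they can be combined safely with AND.
        clauses.append(f"({names})" if len(event_names) > 1 else names)
    if app_id is not None:
        clauses.append(f'attributes.app_id = "{app_id}"')
    return " AND ".join(clauses)


# Hypothetical example values.
print(enriched_filter(["page_view", "page_ping"], app_id="web"))
```

With a filter like this, the subscription only receives matching messages, so the subscriber does not need to parse each enriched event to decide whether it is relevant.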
Open Source quick start for AWS
We’re very excited to release a new way of getting started with Snowplow: a new quick start guide for AWS, coupled with a collection of Terraform modules that can be used to set up and configure your Snowplow pipeline. GCP support will also be rolling out soon (sneak peek: some of the Terraform modules are already available!).
With the release of 21.08 North Cascades, the Terraform Modules have also been updated to use the recommended component versions, which we detail at the end of this post.
This new guide leverages Terraform to automate the deployment of your first Snowplow pipeline, giving you a load balancer, Snowplow collector, event validation & enrichment, Kinesis streams, S3 Loading and Postgres Loading. Even if this isn’t your first time setting up Snowplow, using the new Terraform modules is a great way to manage your Snowplow infrastructure.
You can find more information in the Introducing Snowplow’s Quick Start edition for Open Source blog post and the quick start documentation.
Surge Protection for AWS
We have continued to improve the surge protection feature on AWS, which we discussed in the 21.04 release notes. The sqs2kinesis application, which is responsible for reading the messages from SQS and writing them to Kinesis, is now at v1.0.0. It is now easier to configure, more reliable and battle tested, so it’s ready for you to use in your production environments.
Snowplow Micro experience improvements
Snowplow Micro is a great way to write integration tests for your Snowplow tracking. It is a Docker container which contains the fundamental parts of a Snowplow pipeline, along with RESTful endpoints which let you check that all your events have been validated (or not) as expected.
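A minimal sketch of such a check is shown below. It assumes Micro is reachable at `http://localhost:9090` and that its `/micro/all` endpoint returns a JSON object with `total`, `good` and `bad` counts; the helper names and the expected count are hypothetical.

```python
import json
from urllib.request import urlopen

MICRO_URL = "http://localhost:9090"  # assumption: where your Micro container is exposed


def fetch_counts(base_url=MICRO_URL):
    """Fetch the event counts from a running Snowplow Micro instance."""
    with urlopen(f"{base_url}/micro/all") as resp:
        return json.load(resp)


def check_counts(counts, expected_good):
    """Return a list of failure messages; an empty list means the run looks as expected."""
    problems = []
    if counts["bad"] != 0:
        problems.append(f"{counts['bad']} event(s) failed validation")
    if counts["good"] != expected_good:
        problems.append(f"expected {expected_good} good event(s), got {counts['good']}")
    return problems
```

In an integration test you might exercise your application, then assert that `check_counts(fetch_counts(), expected_good=3)` returns an empty list.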
We’ve made some developer experience improvements that are now available as part of Version 1.2 and are described in detail in the Snowplow Micro v1.2 release blog post.
You can now embed your Iglu Schemas folder directly into Snowplow Micro, without the need for a separate Iglu server. This makes it easier than ever to test that your application updates work with your new and existing schemas as part of your CI or CD processes. Not only this, you can now query Snowplow Micro’s new /iglu endpoint to retrieve your schemas directly.
On top of this, all the Snowplow Micro endpoints now support CORS headers, meaning you can send requests to Micro from a website. Snowplow Micro is also now available as ARM64 Docker images, so you can run it using Docker on new Apple Silicon MacBooks or even a Raspberry Pi.
Recommended Component Versions
This section links to the recommended components in 21.08 North Cascades. We’ve listed the major features above but many components have also seen smaller but significant updates. With 21.08 North Cascades, we’ve also introduced recommended Iglu component versions. Running the recommended 21.08 North Cascades components ensures you will be able to use all the features listed above and have the confidence they are battle tested and ready for production.
Recommended Component Versions are detailed on the 21.08 North Cascades Version Compatibility Matrix. Components which have been updated since the last release are highlighted.
Latest releases and the public roadmap
If you’re eager to play with the very latest Snowplow technology, you can find individual component releases listed in the snowplow/snowplow repository, on Discourse in the New Releases section or you can check the releases and product features sections of the Snowplow Analytics blog.
You can also head over to the Snowplow Public Roadmap which highlights the latest updates we’ve released and what will be coming soon. We’d also love to hear from you, so please add an emoji or a comment to the features you’re excited about or that you’d like to know more about. If you’d like to know more, you can read all about it in our Public Roadmap blog post.
For Snowplow Insights customers reading this, the majority of pipelines are already running 21.08 North Cascades components so you should be good to go ahead and explore the features above. If you’d like to find out exactly which versions you are running currently, please contact Snowplow Support.