Snowplow Postgres Loader 0.1.0 released
For several years we were getting requests from the OSS community to add PostgreSQL support to Snowplow. Today we’re thrilled to announce the initial release of Snowplow Postgres Loader.
In this post:
1. Why PostgreSQL
2. Snowplow Postgres Loader
3. Next steps
1. Why PostgreSQL
PostgreSQL is one of the top 5 database engines as of mid-2020. It is almost the de-facto standard among OLTP relational databases, with an enormous number of extensions and tools that can turn it into anything from a time-series database to a monstrous multi-node cluster handling analytical workloads. Vanilla PostgreSQL, though, was never meant to be used for OLAP, and it can still be hard to tune the performance of analytical queries over multi-GB scans. However, not everyone has datasets of hundreds of gigabytes, nor is everyone interested in analytical queries.
In fact, most of the requests we’ve seen came from users who would like to try out Snowplow but don’t want to pay for Redshift or Snowflake, or go through their setup. PostgreSQL is easy to run on-premise, it is supported by almost every known cloud provider, and it has an incredible ecosystem and community.
These are good reasons to extend the pool of supported databases with such a great candidate!
2. Snowplow Postgres Loader
RDB Loader (and its predecessor, StorageLoader) has had PostgreSQL support since its inception, but that support was extremely limited. Some major examples of this limited functionality:
- It supported only the atomic.events table, without any shredded types (not to mention the recently added support for automatic migrations)
- It used batch loading, mimicking Redshift and other OLAP targets, which is suboptimal for PostgreSQL and, for the end user, meant very infrequent and expensive loads
- It was tightly coupled with AWS
- It required EmrEtlRunner
We decided that PostgreSQL has inherently different requirements from RDB Loader, which is currently (watch this space) oriented towards batch loading, and therefore deserves its own dedicated Snowplow Postgres Loader repo.
We’re planning to remove PostgreSQL support from RDB Loader in one of the next releases, while Postgres Loader will leverage all the benefits of its only storage target.
With this initial release, Postgres Loader is already capable of:
- Working with “streaming inserts”, i.e. reading events directly from a subscription; each event is then turned into a transaction with a set of INSERT statements for every entity in the event. As a result, Postgres Loader is a single application responsible for both transformation and loading
- Supporting table creation and migrations out of the box, so there is no need to generate and apply DDL
- Supporting both GCP PubSub and Amazon Kinesis as sources; other implementations are very easy to add – and PRs are welcome!
- Loading both enriched data and plain self-describing JSONs, which can be used as a replacement for Elasticsearch when debugging Snowplow bad rows
- Using a rich set of PostgreSQL-specific types, such as UUID (for strings with the corresponding format) and JSONB (for arrays and union types)
- Being used as a library (published on Maven Central as snowplow-postgres) for building complex data pipelines involving Postgres and Snowplow
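To make the “streaming inserts” idea above concrete, here is a rough Scala sketch of the transformation step. The entity model, function names and type mapping below are purely illustrative assumptions, not the actual snowplow-postgres API: the point is only that one event fans out into a single transaction of INSERT statements, one per entity, with column types derived from JSON Schema hints.

```scala
// Illustrative sketch only -- not the real snowplow-postgres API.
// An "entity" is either the atomic event row or one attached
// self-describing JSON, each destined for its own table.
final case class Entity(table: String, columns: List[String], values: List[String])

object StreamingInserts {
  // Render a single entity as an INSERT statement
  // (values are assumed to be pre-escaped SQL literals).
  def toInsert(e: Entity): String =
    s"INSERT INTO ${e.table} (${e.columns.mkString(", ")}) VALUES (${e.values.mkString(", ")});"

  // One event = one transaction: the atomic.events row plus
  // one row per attached entity, committed atomically.
  def toTransaction(entities: List[Entity]): String =
    (("BEGIN;" :: entities.map(toInsert)) :+ "COMMIT;").mkString("\n")

  // Hypothetical mapping from JSON Schema hints to the
  // PostgreSQL types mentioned above (UUID, JSONB, ...).
  def sqlType(jsonType: String, format: Option[String]): String =
    (jsonType, format) match {
      case ("string", Some("uuid"))      => "UUID"
      case ("string", Some("date-time")) => "TIMESTAMP"
      case ("array", _)                  => "JSONB"
      case ("integer", _)                => "BIGINT"
      case _                             => "JSONB"
    }
}
```

Running toTransaction over an event with one attached context would produce a BEGIN / two INSERTs / COMMIT block, which mirrors the per-event unit of work described above.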
3. Next steps
Postgres Loader was born as a hackathon project, in response to a very frequent OSS community request. Despite standing on very solid ground, such as FS2 and KCL, it is not meant to be used in pipelines with strict scalability requirements, and we have not yet tested it in real-world scenarios. However, we do seek your feedback! We believe Postgres Loader has big potential in many areas: demos, QA, low-volume pipelines and eventually mid-to-high-volume pipelines.
Currently, Postgres Loader is missing the following features, which we would like to implement in upcoming releases:
- Postgres Loader doesn’t try to perform any optimisations on its tables: no indexes, foreign keys, constraints etc. Given Postgres’ rich functionality for altering tables, it is better to let users perform these alterations manually
- It is implemented on top of plain JDBC and SQL statements, whereas the Postgres wire protocol could make it significantly more performant
- Currently, all bad rows produced by the Loader (e.g. if a type cannot be converted to SQL, or the DB connection is lost) are just printed to stdout, whereas it would be better to sink them to the filesystem or send them back to a topic
- PubSub and Kinesis (KCL) sources have very different semantics
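As a sketch of what the bad-row improvement above could look like, the current “print to stdout” behaviour generalises naturally into a pluggable sink. None of these names exist in the loader today; this is just a hypothetical shape under the assumption that bad rows arrive as serialised strings:

```scala
import java.nio.file.{Files, Path, StandardOpenOption}

// Hypothetical sketch: a pluggable destination for bad rows, so they
// could go to the filesystem (or back to a topic) instead of stdout.
trait BadRowSink {
  def sink(row: String): Unit
}

// Roughly what 0.1.0 effectively does today.
object StdoutSink extends BadRowSink {
  def sink(row: String): Unit = println(row)
}

// One possible alternative: append bad rows to a local newline-delimited file.
final class FileSink(path: Path) extends BadRowSink {
  def sink(row: String): Unit =
    Files.write(path, (row + System.lineSeparator).getBytes("UTF-8"),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
}
```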
If you spot other opportunities, please do share your feedback on Discourse and GitHub! It will help us prioritize the work.
Setup is described on our docs website.