Introducing Git-Backed Management of Snowplow Data Structures

Costas Kotsokalis

Daniela Howard

October 31, 2024

Today Snowplow announces snowplow-cli, a command-line interface (CLI) tool that makes data structure management local-first and version-control-friendly. The tool integrates with Snowplow's data structure APIs, enabling teams to implement documented approval workflows via pull requests in platforms such as GitHub, while controlling the full lifecycle of their data structures via automations such as GitHub Actions.

Introduction

Data structures are at the heart of Snowplow’s Customer Data Infrastructure (CDI). These JSON schema specifications define the expected structure and format of your event and entity data, acting as a contract between your tracking implementation and data consumers. Each data structure contains:

The properties that make up an event (e.g., event_name, user_id, timestamp)
The type of each property (string, number, boolean, etc.)
Required vs. optional fields
Constraints on values (e.g., string length, number ranges)
Descriptions and metadata to document the purpose of each field

This strict schema validation ensures that only properly formatted data lands in your downstream destinations. Data structures allow you to define and capture data in exactly the way your business needs it, ensuring your tracking implementation aligns with your unique business requirements and use cases.
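As a rough illustration of the elements listed above, a schema for a hypothetical search event might look like the sketch below, written in YAML for readability. The vendor, name, fields, and limits are invented for this example and are not taken from any real tracking plan.

```yaml
# Hypothetical schema; every name and limit here is illustrative only.
$schema: http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#
description: Fired when a user submits a search query   # documents the event's purpose
self:
  vendor: com.example
  name: search
  format: jsonschema
  version: 1-0-0
type: object
properties:
  query:
    description: The text the user searched for
    type: string          # type of the property
    maxLength: 256        # constraint on string length
  result_count:
    description: Number of results returned
    type: integer
    minimum: 0            # constraint on number range
  personalized:
    description: Whether results were tailored to the user
    type: boolean
required: [query]         # query is required; the other fields are optional
additionalProperties: false
```

With a schema like this, an event with an empty payload or a negative result_count would fail validation instead of reaching downstream destinations.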

The Snowplow Console UI offers all the necessary facilities for customers to manage the full lifecycle of a data structure, including its creation (via a graphical builder or plain JSON editor), its deployment to development and production environments, and its updates.

This UI works well for most teams. Some teams, however, prefer to work within their existing workflows.

The Problem

Teams working with tracking design face several challenges in their workflows:

1. Collaboration Limitations: Internal teams often struggle to collaborate during the tracking design phase, with limited ability to review and discuss changes as a group.

2. Schema Version Management Complexity: Managing schema versions as separate files creates unnecessary overhead and makes it difficult to track and review changes effectively.

3. Risk of Breaking Changes: Without proper controls, accidental schema changes can break downstream applications and affect data quality.

4. Manual Overhead: Teams spend significant time on repetitive tasks that could be automated, from publishing schemas to maintaining documentation.

The Solution

snowplow-cli offers functionality to:

Synchronize existing data structures between a local filesystem and the Snowplow Console
Generate new data structures from templates
Validate data structures
Deploy to development or production environments

These commands can be used within automations such as GitHub Actions in a GitOps-like manner, allowing data engineering and development teams to keep their current workflows while gaining a clear way to request reviews and discuss changes. Most software development platforms offer solid configuration options and rules to define who can make changes, to which resources, and under what conditions. Furthermore, by using a git platform, teams benefit from the broad ecosystem of integrations with other critical tools such as project management (e.g. Jira), messaging (e.g. Slack), and more.

While we have focused on git and GitHub, Snowplow customers can use snowplow-cli together with their preferred version control system (e.g. Mercurial) or platform (e.g. GitLab, Bitbucket).

Benefits of combining 'snowplow-cli' with version control platforms

Our new command-line tool combined with a version control platform transforms how teams collaborate on tracking design and implementation:

1. Accelerated Development Cycles: Product and engineering teams can move quickly with tracking changes while maintaining data quality through automated validation and approvals.

2. Reduced Data Team Overhead: Automated validation and git-based workflows streamline schema updates and deployments, freeing data teams to focus on strategic work.

3. Error Prevention: Built-in validation and approval workflows catch issues before they reach production, reducing the need for retroactive fixes.

4. Cross-Team Visibility: All stakeholders can see proposed changes, comment on implementations, and participate in the review process through familiar workflows.

5. Flexibility and Control: Teams maintain the agility they need while ensuring all changes meet data quality standards through automated checks and balances.

Example Usage

The following is a condensed example. For the complete documentation, with additional information and useful tips, see the recipe Managing Data Structures in Git.

It all starts with a data structure in a local file. We are using YAML as a more human-friendly format:

apiVersion: v1
resourceType: data-structure
meta:
  hidden: false
  schemaType: event
  customData: {}
data:
  $schema: http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#
  self:
    vendor: com.example
    name: login
    format: jsonschema
    version: 1-0-0
  type: object
  properties:
    result:
      enum: [success, failure]
  additionalProperties: false

We can invoke snowplow-cli on that file to validate its content, just as we can in the Console UI:

$ snowplow-cli ds validate data-structures/com.example/login.yaml

This invocation yields some warnings, but no errors:

3:00PM INFO validating from paths=[data-structures/com.example/login.yaml]
3:00PM INFO will create file=data-structures/com.example/login.yaml vendor=com.example name=login version=1-0-0
3:00PM WARN validation file=data-structures/com.example/login.yaml
  messages=
  │ The schema is missing the "description" property (/properties/result)
  │ The schema is missing the "description" property (/)
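Both warnings point at missing description properties, one at the schema root (/) and one on the result property (/properties/result). A minimal sketch of the same file with descriptions added is shown below. The description wording is invented for illustration, and we assume that filling in these two fields is enough to address the warnings.

```yaml
apiVersion: v1
resourceType: data-structure
meta:
  hidden: false
  schemaType: event
  customData: {}
data:
  $schema: http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#
  # Illustrative text; addresses the warning reported at /
  description: Records the outcome of a login attempt
  self:
    vendor: com.example
    name: login
    format: jsonschema
    version: 1-0-0
  type: object
  properties:
    result:
      # Illustrative text; addresses the warning reported at /properties/result
      description: Whether the login attempt succeeded or failed
      enum: [success, failure]
  additionalProperties: false
```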

We can then publish to the dev environment and, later on if we wish, to prod.

$ snowplow-cli ds publish dev
3:00PM INFO publishing to dev from paths=[data-structures]
3:00PM INFO will create file=data-structures/com.example/login.yaml vendor=com.example name=login version=1-0-0
3:00PM WARN validation file=data-structures/com.example/login.yaml
  messages=
  │ The schema is missing the "description" property (/properties/result)
  │ The schema is missing the "description" property (/)
3:00PM INFO all done!

At this point, the data structure becomes locked in the Console UI to avoid conflicts if someone tries to edit it there. Users with permission to deploy to production can, however, unlock it and resume edits in the UI.

The existence of a development (dev) and a production (prod) environment fits well with a GitOps-like workflow, where GitHub Actions can ensure that merges into a develop branch deploy to dev, and merges into the main branch deploy to prod. Let’s see what a GitHub Actions workflow that deploys to dev on a merge to develop would look like:

on:
  push:
    branches: [develop]

jobs:
  publish:
    runs-on: ubuntu-latest
    env:
      SNOWPLOW_CONSOLE_ORG_ID: ${{ secrets.SNOWPLOW_CONSOLE_ORG_ID }}
      SNOWPLOW_CONSOLE_API_KEY_ID: ${{ secrets.SNOWPLOW_CONSOLE_API_KEY_ID }}
      SNOWPLOW_CONSOLE_API_KEY: ${{ secrets.SNOWPLOW_CONSOLE_API_KEY }}

    steps:
      - uses: actions/checkout@v4

      - uses: snowplow-product/setup-snowplow-cli@v1

      - run: snowplow-cli ds publish dev --managed-from $GITHUB_REPOSITORY

This action sets up environment variables for secrets, checks out the repository, installs snowplow-cli, and uses it to publish to dev. It is straightforward to create a similar action for publishing to prod upon merges to main, or another to validate data structures when a pull request to develop is opened. A validation action of this kind surfaces issues in the pull request itself, where the reviewer can spot them easily.
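A validation workflow for pull requests could be sketched along the following lines. It mirrors the dev workflow above; the file name is hypothetical, and we assume that ds validate accepts the data-structures directory as a path and needs the same Console credentials as publish.

```yaml
# Hypothetical file: .github/workflows/validate.yml
on:
  pull_request:
    branches: [develop]

jobs:
  validate:
    runs-on: ubuntu-latest
    env:
      # Same repository secrets as the publish workflow (assumed to be required here too)
      SNOWPLOW_CONSOLE_ORG_ID: ${{ secrets.SNOWPLOW_CONSOLE_ORG_ID }}
      SNOWPLOW_CONSOLE_API_KEY_ID: ${{ secrets.SNOWPLOW_CONSOLE_API_KEY_ID }}
      SNOWPLOW_CONSOLE_API_KEY: ${{ secrets.SNOWPLOW_CONSOLE_API_KEY }}

    steps:
      - uses: actions/checkout@v4

      - uses: snowplow-product/setup-snowplow-cli@v1

      # Assumption: passing the directory validates every file under it
      - run: snowplow-cli ds validate data-structures
```

A production variant would presumably follow the dev workflow instead, with the trigger changed to pushes on main and the final step changed to snowplow-cli ds publish prod.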

Getting Started

Snowplow’s latest release enables better collaboration, stronger controls, and seamless integration with modern development workflows. Data teams can now manage data structures as code, in a version-controlled repository that is replicated to the Snowplow Console via API integrations. Book a demo of Snowplow to learn more about this new functionality within our Data Product Studio.
