Introducing Git-Backed Management of Snowplow Data Structures
Today Snowplow announces snowplow-cli, a command-line interface (CLI) tool that makes data structure management local-first and version-control-friendly. This new tool integrates with Snowplow's data structure APIs, enabling teams to implement documented approval workflows through pull requests on platforms such as GitHub, while controlling the full lifecycle of their data structures through automations such as GitHub Actions.
Introduction
Data structures are at the heart of Snowplow’s Customer Data Infrastructure (CDI). These JSON schema specifications define the expected structure and format of your event and entity data, acting as a contract between your tracking implementation and data consumers. Each data structure contains:
- The properties that make up an event (e.g., event_name, user_id, timestamp)
- The type of each property (string, number, boolean, etc.)
- Required vs. optional fields
- Constraints on values (e.g., string length, number ranges)
- Descriptions and metadata to document the purpose of each field
This strict schema validation ensures that only properly formatted data lands in your downstream destinations. Data structures allow you to define and capture data in exactly the way your business needs it, ensuring your tracking implementation aligns with your unique business requirements and use cases.
The Snowplow Console UI offers all the necessary facilities for customers to manage the full lifecycle of a data structure, including its creation (via a graphical builder or plain JSON editor), its deployment to development and production environments, and its updates.
This UI works well for most teams. Some teams, however, prefer to work within their existing workflows.
The Problem
Teams working with tracking design face several challenges in their workflows:
1. Collaboration Limitations: Internal teams often struggle to collaborate during the tracking design phase, with limited ability to review and discuss changes as a group.
2. Schema Version Management Complexity: Managing schema versions as separate files creates unnecessary overhead and makes it difficult to track and review changes effectively.
3. Risk of Breaking Changes: Without proper controls, accidental schema changes can break downstream applications and affect data quality.
4. Manual Overhead: Teams spend significant time on repetitive tasks that could be automated, from publishing schemas to maintaining documentation.
The Solution
snowplow-cli offers functionality to:
- Synchronize existing data structures between a local filesystem and the Snowplow Console
- Generate new data structures from templates
- Validate data structures
- Deploy to development or production environments
These commands can be used within automations such as GitHub Actions in a GitOps-like manner, allowing data engineering and development teams to keep their current workflows while gaining a clear way to request reviews and discuss changes. Most software development platforms offer solid configuration options and rules to define who can make changes, to which resources, and under what conditions. Furthermore, by using a git platform, teams benefit from the broad ecosystem of integrations with other critical tools such as project management (e.g. Jira), messaging (e.g. Slack), and more.
While we have focused on git and GitHub, Snowplow customers can use snowplow-cli together with their preferred version control system (e.g. Mercurial) or platform (e.g. GitLab, Bitbucket).
Benefits of combining 'snowplow-cli' with version control platforms
Our new command-line tool combined with a version control platform transforms how teams collaborate on tracking design and implementation:
1. Accelerated Development Cycles: Product and engineering teams can move quickly with tracking changes while maintaining data quality through automated validation and approvals.
2. Reduced Data Team Overhead: Automated validation and git-based workflows streamline the process of schema updates and deployments, freeing data teams to focus on strategic work.
3. Error Prevention: Built-in validation and approval workflows catch issues before they reach production, reducing the need for retroactive fixes.
4. Cross-Team Visibility: All stakeholders can see proposed changes, comment on implementations, and participate in the review process through familiar workflows.
5. Flexibility and Control: Teams maintain the agility they need while ensuring all changes meet data quality standards through automated checks and balances.
Example Usage
* The following is a condensed example. For the complete documentation with additional information and useful tips, please see the recipe here: Managing Data Structures in Git.
It all starts with a data structure in a local file. We are using YAML as a more human-friendly format:
apiVersion: v1
resourceType: data-structure
meta:
  hidden: false
  schemaType: event
  customData: {}
data:
  $schema: http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#
  self:
    vendor: com.example
    name: login
    format: jsonschema
    version: 1-0-0
  type: object
  properties:
    result:
      enum: [success, failure]
  additionalProperties: false
We can invoke snowplow-cli on that file to validate its content, just as we can in the Console UI:
$ snowplow-cli ds validate data-structures/com.example/login.yaml
This invocation yields some warnings, but no errors:
3:00PM INFO validating from paths=[data-structures/com.example/login.yaml]
3:00PM INFO will create file=data-structures/com.example/login.yaml vendor=com.example name=login version=1-0-0
3:00PM WARN validation file=data-structures/com.example/login.yaml
messages=
│ The schema is missing the "description" property (/properties/result)
│ The schema is missing the "description" property (/)
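Both warnings point at a missing description keyword, one at the schema root and one on the result property. As a minimal sketch, the same data structure with descriptions added might look like the following. The description wording is purely illustrative, and we assume the validator is satisfied by the presence of the standard JSON Schema description keyword at those two paths.

```yaml
apiVersion: v1
resourceType: data-structure
meta:
  hidden: false
  schemaType: event
  customData: {}
data:
  $schema: http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#
  self:
    vendor: com.example
    name: login
    format: jsonschema
    version: 1-0-0
  # Addresses the warning reported at "/" (wording is illustrative)
  description: Fired when a user attempts to log in
  type: object
  properties:
    result:
      # Addresses the warning reported at "/properties/result" (wording is illustrative)
      description: Outcome of the login attempt
      enum: [success, failure]
  additionalProperties: false
```

The rest of this walkthrough continues with the original file, so the same warnings appear in the output that follows.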
We can then publish to the dev environment and, if we so wish, later on to the prod one.
$ snowplow-cli ds publish dev
3:00PM INFO publishing to dev from paths=[data-structures]
3:00PM INFO will create file=data-structures/com.example/login.yaml vendor=com.example name=login version=1-0-0
3:00PM WARN validation file=data-structures/com.example/login.yaml
messages=
│ The schema is missing the "description" property (/properties/result)
│ The schema is missing the "description" property (/)
3:00PM INFO all done!
At this point, the data structure becomes locked in Console to avoid conflicts if someone tries to edit it in the UI. It can still be unlocked, but only by users with permission to deploy to production, after which edits can resume in the UI.
The existence of a development (dev) and a production (prod) environment fits well with a GitOps-like workflow, where GitHub Actions can ensure that merges to a develop branch deploy to dev, and merges to the main branch deploy to prod. Let’s see what a GitHub Actions workflow that deploys to dev on a merge to develop would look like:
on:
  push:
    branches: [develop]
jobs:
  publish:
    runs-on: ubuntu-latest
    env:
      SNOWPLOW_CONSOLE_ORG_ID: ${{ secrets.SNOWPLOW_CONSOLE_ORG_ID }}
      SNOWPLOW_CONSOLE_API_KEY_ID: ${{ secrets.SNOWPLOW_CONSOLE_API_KEY_ID }}
      SNOWPLOW_CONSOLE_API_KEY: ${{ secrets.SNOWPLOW_CONSOLE_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: snowplow-product/setup-snowplow-cli@v1
      - run: snowplow-cli ds publish dev --managed-from $GITHUB_REPOSITORY
This action sets up environment variables for secrets, checks out the repository, installs snowplow-cli, and uses it to publish to dev. It is straightforward to create a similar action for publishing to prod upon merges to main, or another to validate data structures when a pull request to develop is opened. A validation action of this kind surfaces issues where the PR reviewer can spot them easily.
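As a rough sketch of the pull-request validation described above, a workflow along the following lines could run the validator whenever a pull request targets develop. It reuses the setup action and secrets from the publish workflow. Passing the data-structures directory to ds validate (rather than a single file) and relying on a non-zero exit code to fail the check are assumptions, as is the file name in the comment, so confirm them against the snowplow-cli documentation.

```yaml
# Hypothetical workflow file, e.g. .github/workflows/validate.yml
on:
  pull_request:
    branches: [develop]
jobs:
  validate:
    runs-on: ubuntu-latest
    env:
      SNOWPLOW_CONSOLE_ORG_ID: ${{ secrets.SNOWPLOW_CONSOLE_ORG_ID }}
      SNOWPLOW_CONSOLE_API_KEY_ID: ${{ secrets.SNOWPLOW_CONSOLE_API_KEY_ID }}
      SNOWPLOW_CONSOLE_API_KEY: ${{ secrets.SNOWPLOW_CONSOLE_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: snowplow-product/setup-snowplow-cli@v1
      # Assumption: validate accepts a directory, as publish does by default
      - run: snowplow-cli ds validate data-structures
```

A production counterpart would presumably mirror the dev publish workflow, with the trigger branch changed to main and the command changed to snowplow-cli ds publish prod.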
Getting Started
Snowplow’s latest release enables better collaboration, stronger controls, and seamless integration with modern development workflows. Data teams can now manage data structures as code, in a version-controlled repository that is replicated to Snowplow Console via API integrations. Book a demo of Snowplow to learn more about this new functionality within our Data Product Studio.