
What is a Data Contract: 6 Key Components You Need to Understand

By Yali Sassoon
November 27, 2023


A data contract serves as an agreement between data producers and consumers, ensuring functionality, manageability, and reliability akin to business contracts in product supply chains.

It’s been just over a year since data contracts were the center of a lot of controversy, and it’s great to see that debate is over: data contracts have been broadly accepted and are increasingly adopted by organizations looking to democratize data and improve data quality.

During this time Andrew Jones has published his excellent book Driving Data Quality with Data Contracts, Chad Sanderson has launched Gable.ai, and the concept of data contracts has been incorporated into a number of products including Starburst and dbt. (Stay tuned for upcoming developments at Snowplow here.)

But as data contracts become more widespread, we’re seeing the concept being watered down as more and more people jump on the bandwagon. (This might have reached its nadir with Hightouch’s launch of “data contracts”.) So now feels like a good moment to review what the critical components of a data contract are, and why.


What are the critical components of data contracts?

Ownership

In order to use data effectively in an organization, there must be clear ownership for each stage of the data value chain. This is difficult to achieve in practice, as many organizations overload their central data team with responsibility for the entire data value chain. Data contracts are designed to solve this bottleneck in data teams by providing a clean way of assigning ownership for the different parts of the value chain to data producers and consumers.

Ownership is therefore one of the fundamental pieces of metadata that has to be captured in any data contract. A data contract should not just describe what standards the data will meet at different points in the data value chain, but who is responsible for ensuring that the data meets that standard.

It is not uncommon for the data contract itself to identify not only the owner of the data set it describes, but also the different ways to contact the owner and get support.
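
As a rough illustration, ownership metadata could be captured as a small structured record that sits alongside the rest of the contract. The sketch below is in Python; the field names (team, contact_email, support_channel), the data set name and the example values are all hypothetical and not taken from any data contract standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    """Who is responsible for a data set meeting its contract (hypothetical fields)."""
    team: str
    contact_email: str
    support_channel: str  # e.g. a chat channel or ticket queue


@dataclass(frozen=True)
class DataContract:
    """Only the ownership part of a contract is modelled here."""
    dataset: str
    owner: Owner


def missing_owner_fields(contract: DataContract) -> list[str]:
    """Return the names of ownership fields that were left empty."""
    owner = contract.owner
    return [name for name, value in vars(owner).items() if not value.strip()]


contract = DataContract(
    dataset="runs_completed",
    owner=Owner(team="fitness-app", contact_email="", support_channel="#fitness-data"),
)
print(missing_owner_fields(contract))  # ['contact_email']
```

A check like this could run whenever a contract is created or changed, so that no data set is published without a reachable owner.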

Schema

To consume data on an ongoing basis, you need to know how that data is structured. (If you’re performing a one-off analysis, you can spend the time figuring out the structure. But if you want to keep a data application consuming a data feed on an ongoing basis, you need to know that the data that will arrive will conform to the same structure as the data you’ve consumed to date.) Schema is therefore a key element of any data contract.

This is the element of data contracts that everyone remembers, however lax their definition of a data contract is. At Snowplow, we’ve been supporting this element since 2014 when we first introduced self-describing JSONs and then Iglu, our schema registry. However, as we’ve seen above with Ownership and we’ll see below, a data contract needs to contain a lot more than just a schema.
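
To make the idea of a structural guarantee concrete, here is a minimal sketch in Python of checking a record against an agreed structure. The schema format (a plain mapping from field name to type), the field names and the function name are assumptions made for illustration; a production setup would more likely rely on a schema language such as JSON Schema together with a schema registry.

```python
# Hypothetical schema for a "run completed" event: field name -> required Python type.
RUN_SCHEMA = {
    "user_id": str,
    "distance_km": float,
    "duration_s": int,
}


def schema_violations(record: dict, schema: dict) -> list[str]:
    """List the ways a record departs from the agreed structure."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = {"user_id": "u1", "distance_km": 5.2, "duration_s": 1800}
bad = {"user_id": "u1", "distance_km": "5.2", "pace": 346}

print(schema_violations(good, RUN_SCHEMA))  # []
print(schema_violations(bad, RUN_SCHEMA))
```

The second record fails on three counts: a wrongly typed distance, a missing duration and a field the schema does not know about.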

Semantics

To gain insights and draw conclusions from data, it is not sufficient to know the structure of the data (i.e. the schema); you also need to know what the data means (i.e. the semantics). In many organizations, there are typically a handful of people who are known as “experts” on particular data sets: they have lots of experience working with those data sets, they understand exactly how the data was created, and as a result they are able to use those data sets to develop answers, insights and applications much faster than their colleagues. The difference between them and the rest of the company comes down to a deep understanding of the semantics of the data. Let’s bring this to life with a few examples:

  • If you have a data point that describes how far someone ran, you need to know not only that it is a numeric field, but also that the units are km rather than miles. If the data changes, so that at a certain point it starts reporting in miles rather than km, or some of the data is in km and some is in miles, that is a problem that needs to be addressed as part of the data processing.
  • If you have Salesforce opportunity data, it’s not enough to know that a particular field represents “stage 4 in the opportunity funnel” - you need to know the definition that sales people use to categorize that opportunity in stage 4, and how that definition might vary between different members of the team and the team over time. 
  • If you have a data point that describes the number of sessions that a user has engaged with your applications, you need to know how those sessions are defined, and any variation in that definition across different applications and platforms and over time. 

A data contract can’t just be a commitment that the structure of the data is not going to change within certain boundaries - a similar set of guarantees is required for the semantics of the data. This is a lot harder to monitor and enforce than a schema - I’m planning to dive into this in a dedicated post shortly, as it is an important topic that deserves its own treatment. 
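
Taking the running example, one simple way to make a semantic guarantee explicit is to record the unit in the contract and normalize anything that arrives in a different unit. The Python sketch below assumes, hypothetically, that each incoming value is reported together with its unit; the metadata layout and function name are illustrative only.

```python
MILES_TO_KM = 1.609344  # exact by definition of the international mile

# Hypothetical semantic metadata: the contract states the meaning and unit of the field.
DISTANCE_FIELD = {
    "name": "distance",
    "description": "Distance covered during a single run",
    "unit": "km",
}


def to_contract_unit(value: float, reported_unit: str, field: dict = DISTANCE_FIELD) -> float:
    """Convert a reported distance into the unit the contract promises."""
    if reported_unit == field["unit"]:
        return value
    if reported_unit == "miles" and field["unit"] == "km":
        return value * MILES_TO_KM
    # Anything else has no agreed meaning, so fail loudly rather than guess.
    raise ValueError(f"no agreed conversion from {reported_unit!r} to {field['unit']!r}")


print(to_contract_unit(10.0, "km"))               # 10.0
print(round(to_contract_unit(10.0, "miles"), 3))  # 16.093
```

Units are the easy case. Definitions such as “stage 4” or “session” are much harder to encode and check, which is why semantics is harder to enforce than schema.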

Delivery and access SLAs

If you are going to build a data application that consumes some data, you need to know where and how you’re going to access the data, and how readily the data will be made available. The data contract should include information like:

  • Where the data can be accessed (i.e. its physical location), e.g. a table in a warehouse or data lake, or a Kafka topic.
  • How the data can be accessed e.g. via an API.
  • How quickly the data is made available (latency). This is going to be essential for powering real-time use cases (e.g. fraud detection, or dynamic pricing).
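
The three items above could be written down as delivery terms and the latency promise checked mechanically. This Python sketch is one possible shape; the location string, access method and five-minute limit are made-up values, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical delivery terms; the location and limit are made up for illustration.
DELIVERY_SLA = {
    "location": "warehouse.analytics.runs_completed",
    "access": "sql",
    "max_latency": timedelta(minutes=5),
}


def meets_latency_sla(event_time: datetime, available_time: datetime, sla: dict = DELIVERY_SLA) -> bool:
    """True if the record became available within the promised latency."""
    return available_time - event_time <= sla["max_latency"]


occurred = datetime(2023, 11, 27, 12, 0, 0, tzinfo=timezone.utc)
print(meets_latency_sla(occurred, occurred + timedelta(minutes=3)))  # True
print(meets_latency_sla(occurred, occurred + timedelta(minutes=8)))  # False
```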

Data quality SLAs

If you are going to build an application on top of a data set, you need to have a certain level of assurance around the quality of the data.

In an ideal world, a data quality SLA would provide assurance around:

  • The accuracy of the data, i.e. when the data says that something has happened, how confident you can be that it really happened as described. For example, if the data says some individual completed a run, then you have a certain degree of confidence that they really completed a run, and that the length of the run, the time taken and the time completed are accurate to a certain, specified level.
  • The completeness of the data i.e. if the data says that an individual did not complete a run (because there is no indication in the data that she did), then how confident can we be that she actually did not?  

In practice these are very hard if not impossible things to specify categorically: we generally don’t have a way to check our data against the real world it is supposed to describe - the data is our best representation of that world. So most data contract implementations instead specify tests that can be conducted on the data - either on individual lines of data, or on the aggregate data set, to build some lower level of confidence in its accuracy. For example:

  • A test could be carried out on data describing a run completed by an individual to test that the speed is within certain bounds, and that there’s no data describing the individual performing a different activity (e.g. swimming or sleeping) at the same time
  • Tests could be conducted to ensure that the distribution of the data is as expected
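
The tests described above might look something like the following Python sketch. The speed bounds, record layout and function names are assumptions chosen for illustration, and the aggregate check is deliberately crude: it only compares the mean against an expected range rather than testing the full distribution.

```python
from statistics import mean

# Hypothetical bounds; real limits would be agreed between producer and consumer.
MIN_SPEED_KMH, MAX_SPEED_KMH = 3.0, 25.0


def speed_is_plausible(run: dict) -> bool:
    """Row-level test: average speed of a run falls within agreed bounds."""
    hours = (run["end"] - run["start"]) / 3600
    if hours <= 0:
        return False
    return MIN_SPEED_KMH <= run["distance_km"] / hours <= MAX_SPEED_KMH


def overlaps(a: dict, b: dict) -> bool:
    """Row-level test: two activities by the same user overlap in time."""
    return a["user_id"] == b["user_id"] and a["start"] < b["end"] and b["start"] < a["end"]


def mean_within(values: list[float], low: float, high: float) -> bool:
    """Aggregate test: a very crude check that the distribution has not shifted."""
    return low <= mean(values) <= high


# Times are seconds since some epoch, to keep the example short.
run = {"user_id": "u1", "activity": "run", "start": 0, "end": 1800, "distance_km": 5.0}
swim = {"user_id": "u1", "activity": "swim", "start": 900, "end": 2700, "distance_km": 1.0}
teleport = {"user_id": "u2", "activity": "run", "start": 0, "end": 60, "distance_km": 5.0}

print(speed_is_plausible(run))       # True  (10 km/h)
print(speed_is_plausible(teleport))  # False (300 km/h)
print(overlaps(run, swim))           # True
print(mean_within([4.8, 5.1, 5.3], 4.0, 6.0))  # True
```

None of these tests proves the data is accurate. They only rule out records that cannot be right, which is the lower level of confidence described above.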

Whilst there are significant technical challenges to specifying better data quality SLAs, as with semantics I am optimistic that there is significant scope for technology innovation here. I look forward to sharing more in a future post. 

Policy

It is very common in an organization to have multiple different data sets with different rules related to what can be done with the data, based on a combination of legislation around:

  • The use of personal data and the basis on which this has been collected
  • The use of AI to make decisions, especially as this relates to the provision of services e.g. credit to individuals

It is therefore essential that data contracts include metadata that:

  • Describes exactly what the associated data can and cannot be used for.
  • Identifies any fields that are personally identifiable.
  • Identifies any fields that describe decisions made by an AI that are subject to regulation, and if so what standards those decisions need to meet.
  • Specifies data retention policies (how long the data can be stored for).
  • Specifies data anonymization policies (whether any fields need to be anonymized or pseudonymized).
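
Policy metadata of this kind becomes most useful when it can be checked before data is used. The Python sketch below shows one hypothetical shape for it; the purposes, field names, retention period and function name are all invented for illustration, and a flagged use is a prompt for review rather than a legal determination.

```python
# Hypothetical policy metadata for one data set; all names and values are illustrative.
POLICY = {
    "allowed_purposes": {"product_analytics", "fraud_detection"},
    "pii_fields": {"email", "ip_address"},
    "retention_days": 365,
}


def policy_flags(purpose: str, requested_fields: set[str], policy: dict = POLICY) -> list[str]:
    """Return concerns that a proposed use of the data raises under the policy."""
    flags = []
    if purpose not in policy["allowed_purposes"]:
        flags.append(f"purpose not permitted: {purpose}")
    pii = sorted(requested_fields & policy["pii_fields"])
    if pii:
        flags.append(f"personally identifiable fields requested: {', '.join(pii)}")
    return flags


print(policy_flags("product_analytics", {"user_id", "distance_km"}))  # []
print(policy_flags("ad_targeting", {"email", "distance_km"}))
```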

Enforcing schema is a necessary first step, but by itself not sufficient, to enable data contracts

To really drive data quality and improved democratization of data in an organization, all of the above elements need to be incorporated in data contracts.

Towards an open standard for data contracts

There are a number of very promising initiatives to develop open standards for data contracts - the ones we are tracking are:

As an open standard for data contracts emerges, we are looking forward to supporting it at Snowplow.


This article is adapted from Yali Sassoon's Substack Data Creation: A blog about data products, data applications and open source tech.
