What is Data Creation?

Data Creation is the process of deliberately creating data to power AI and advanced data applications. It differs from data exhaust, which is the by-product data emitted from existing systems.

The limitations of data exhaust

Most data teams are dealing with data exhaust—the pre-existing byproducts of disparate systems such as analytics platforms or CRMs. These data sets have different field types, granularity, quality, and completeness, and need significant wrangling before they can be used. While this data is plentiful, it’s limited in that each source arrives in a structure that requires transformation and opinion to become potentially useful.

Not only is this technically challenging, it also introduces a high degree of variability to the data over time. For data teams, who are effectively trying to build a house with bricks of different sizes and densities, it’s time-consuming, unproductive, and difficult.

Figure 1. The impact of data exhaust from existing analytics and CDP solutions

Data exhaust isn’t designed for AI or intensive applications.

  • Difficult to understand – To build effective data apps, you need to know how data was created. Yet this often cannot be inferred from collections of black-box SaaS applications.
  • Isolated tables that require complex integration – Data arrives in separate table structures and grammar, requiring complex joins and transforms to bring them into a workable framework.
  • Irregular structure – Data scientists and engineers have to aggregate data exhaust to the appropriate altitude to train AI models, and then set up processes to transform the data to productionize features. For example, one tool may report summarized transaction performance, while another may report individual events.
  • Not conducive to building features – Often, data exhaust doesn’t include strong signals that benefit the AI model. These signals could be highly predictive or explanatory, so just ‘making do’ compromises the effectiveness of the model. The chances of exhaust data being ‘perfect’ are very low.
  • Requires significant validation when aggregated – Whether for a BI or ML use case, complex data exhaust pipelines require regular validation, because exactly how, where, and when data is aggregated can significantly change the meaning of the resulting metrics or features.
  • Subject to sovereignty and digital compliance laws – Without knowing the full lineage of data, it is impossible to be fully compliant and audit-ready. You cannot record the basis for capturing an event or choose your own storage location, which is important for GDPR compliance.
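
To make the "irregular structure" and validation points concrete, here is a minimal sketch of what reconciling two exhaust sources can involve. The field names, sample rows, and the `roll_up_to_daily` and `find_mismatches` helpers are all hypothetical, chosen only for illustration: one source reports individual transactions, the other a daily summary, so the event-level rows have to be rolled up before the two can even be compared.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event-level export from one tool (one row per transaction).
events = [
    {"ts": "2024-03-01T09:15:00", "order_id": "a1", "amount": 20.0},
    {"ts": "2024-03-01T17:40:00", "order_id": "a2", "amount": 35.5},
    {"ts": "2024-03-02T11:05:00", "order_id": "a3", "amount": 12.0},
]

# Hypothetical pre-summarized export from another tool (one row per day).
daily_summary = [
    {"date": "2024-03-01", "orders": 2, "revenue": 55.5},
    {"date": "2024-03-02", "orders": 1, "revenue": 15.0},
]


def roll_up_to_daily(rows):
    """Aggregate event-level rows to one record per calendar day."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for row in rows:
        day = datetime.fromisoformat(row["ts"]).date().isoformat()
        totals[day]["orders"] += 1
        totals[day]["revenue"] += row["amount"]
    return dict(totals)


def find_mismatches(rolled_up, summary, tolerance=0.01):
    """Return the days on which the two sources disagree."""
    mismatches = []
    for row in summary:
        ours = rolled_up.get(row["date"], {"orders": 0, "revenue": 0.0})
        orders_differ = ours["orders"] != row["orders"]
        revenue_differs = abs(ours["revenue"] - row["revenue"]) > tolerance
        if orders_differ or revenue_differs:
            mismatches.append(row["date"])
    return mismatches


rolled_up = roll_up_to_daily(events)
mismatched_days = find_mismatches(rolled_up, daily_summary)
```

In this toy data the second day disagrees between the two sources, and nothing in either export explains why. That kind of unexplained gap is what has to be investigated by hand when working with exhaust.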

All these factors lead to menial and painful data preparation work. Projects have a long time to value, and data applications are not as effective as they could be.

Figure 2. Data exhaust creates unusable behavioral events tables

Lack of accuracy

Two values with the same meaning need to be reconciled manually before they can be used in a model.

Not explainable

How is a session defined, and how is it calculated? Assumptions cannot be left implicit when building data models, as they can lead to bias.

Not reliable

The freshness of the data is not predictable. Latency can vary from minutes to hours to days, and it is unclear whether some events will ever arrive.

Compliance

PII exists in the data set but consent status is unclear. The result is often to discard the data.

Not predictive

Where did the event take place? What OS was used? What channel did they come from? There just aren’t enough entities.

Data Creation customizes data to suit specific needs

By investing time in the planning stages and adopting the right tooling, data teams can avoid data exhaust. Then, by customizing data to suit each data application, teams can develop a structured, rich, and high quality data asset that can easily be evolved over time.

With a Data Creation strategy in place, data engineers and analysts can shift the complexity of ETL left and guarantee that all data, from creation through enrichment and modeling to the data warehouse, conforms to and evolves with the business.

Data Creation offers technical and practical advantages over traditional data collection techniques.

  • Creates behavioral data applications uniquely descriptive of your business – Data is created in an immediately intelligible format, with standardized naming conventions tailored to the experience you are describing. For a digital wearables company, this might be ‘miles run per day’ or another niche metric.
  • Evolves schemas over time – Our schema delivers predictable data so data apps can be built and run in production. It is also inherently flexible and evolves with your business. This means you can iterate on existing apps and launch new ones without being held back by rigid assumptions about the data already created.
  • Defines security, removes bias, and ensures data sovereignty and first-party data compliance end-to-end – Full ownership is an integral part of Data Creation, along with a full understanding of the lineage of your data for transparency.
  • Enriches in real time, retroactively, or in batch using Data Creation pipelines – To truly create the data you need, it can’t just be delivered in batch every few hours. Many apps have real-time requirements, so created data should match the needs of the data app in question.
  • Maintains organizational trust in data – To strengthen your organization’s data culture, you need accurate and auditable data. When each department can openly scrutinize data apps from across the business, you will build a culture of transparency and trust.
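
One way to reason about schemas that evolve without breaking apps already in production is to check whether a new version only adds optional fields. The sketch below assumes a simplified, JSON-Schema-like dictionary. The `is_additive_change` helper and the wearables fields are hypothetical and are not taken from any Snowplow API.

```python
def is_additive_change(old_schema, new_schema):
    """Return True if new_schema keeps every old field and adds only optional ones."""
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})
    # Every existing field must survive with an unchanged definition.
    for name, definition in old_props.items():
        if new_props.get(name) != definition:
            return False
    # The new version must not require anything the old one did not.
    old_required = set(old_schema.get("required", []))
    new_required = set(new_schema.get("required", []))
    return new_required <= old_required


# Hypothetical schema versions for a wearables "daily run summary" record.
v1 = {
    "properties": {"miles_run": {"type": "number"}},
    "required": ["miles_run"],
}
v2_optional_field = {
    "properties": {
        "miles_run": {"type": "number"},
        "avg_heart_rate": {"type": "integer"},
    },
    "required": ["miles_run"],
}
v2_breaking = {
    "properties": {
        "miles_run": {"type": "number"},
        "avg_heart_rate": {"type": "integer"},
    },
    "required": ["miles_run", "avg_heart_rate"],
}

safe_to_roll_out = is_additive_change(v1, v2_optional_field)
needs_migration = not is_additive_change(v1, v2_breaking)
```

A check like this could run in CI, so that a breaking change is flagged before any data is created against the new definition.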

Snowplow’s approach to Data Creation

Figure 3. Snowplow’s Data Creation pipeline

Generate

Data is created at source from your digital estate (e.g. website, app, IoT).
  • Define your data parameters – Create your own data vocabulary and schema, and dynamically adjust it as your needs change.
  • Collect data from any source – Generate data from web, mobile, server-side applications, AI, and third-party systems using our array of trackers and webhooks.
  • Assure quality at every stage – Define events up-front, QA using CI/CD and sandbox environments, and monitor quality in real time. Data quality is designed into the platform at every stage.

Enhance

Data is monitored, assessed, and enhanced with other data sets before entering the warehouse.
  • Unify – Bring together data generated across all platforms and channels.
  • Validate and enrich data – Know that data is validated against agreed definitions and enriched against additional first- and third-party data sets.
  • Control privacy – Data is generated with rich metadata that describes which uses are acceptable and informs downstream data processing. You can also pseudonymize PII in real time, before the data is delivered to downstream destinations.
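
As a rough illustration of pseudonymizing PII before delivery, the sketch below replaces selected fields with a keyed hash (HMAC-SHA256) from Python's standard library. The field names, the key, and the `pseudonymize` helper are assumptions made for this example. It shows the general technique only, not how any particular pipeline implements it.

```python
import hashlib
import hmac

# Hypothetical list of fields treated as PII in this example.
PII_FIELDS = ("email", "user_ip")


def pseudonymize(event, secret_key, pii_fields=PII_FIELDS):
    """Return a copy of the event with PII fields replaced by keyed hashes."""
    cleaned = dict(event)  # leave the caller's event untouched
    for field in pii_fields:
        value = cleaned.get(field)
        if value is not None:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[field] = digest.hexdigest()
    return cleaned


event = {
    "event_name": "add_to_cart",
    "email": "jane@example.com",
    "user_ip": "203.0.113.7",
    "sku": "shoe-42",
}
key = b"example-key-kept-outside-the-warehouse"  # placeholder secret
safe_event = pseudonymize(event, key)
```

Because the hash is keyed and deterministic, the same user still maps to the same token, so downstream joins and counts keep working, while the raw value never reaches the destination.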

Model

Modeling the data allows you to look at it from different perspectives, including the user, session, and page-view levels, and to create custom tables.
  • Deliver data where it is needed – Deliver your data at low latency to real-time streams, to your data warehouse, and to your data lake.
  • Model behavioral data – Use out-of-the-box, extendable web and mobile models to performantly aggregate massive volumes of event-level data to the page, screen, session, and user levels. This shortens time to value and delivers AI and BI-ready data.
  • Monitor models in real time – Monitor your pipeline infrastructure, data quality, and data modeling. Set up alerts about potential issues.
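
To show what aggregating event-level data to the session level can mean in practice, here is a minimal sketch with an explicit session definition: a new session starts after 30 minutes of inactivity. The timeout, the field names, and the `sessionize` helper are assumptions for this example and do not describe the out-of-the-box models.

```python
from datetime import datetime, timedelta

# Assumed session definition: a gap longer than this starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group events into per-user sessions, splitting on inactivity gaps."""
    sessions = []
    # Sorting by user and then time means each user's events are contiguous.
    for event in sorted(events, key=lambda e: (e["user_id"], e["ts"])):
        current = sessions[-1] if sessions else None
        same_session = (
            current is not None
            and current["user_id"] == event["user_id"]
            and event["ts"] - current["end"] <= timeout
        )
        if same_session:
            current["end"] = event["ts"]
            current["event_count"] += 1
        else:
            sessions.append({
                "user_id": event["user_id"],
                "start": event["ts"],
                "end": event["ts"],
                "event_count": 1,
            })
    return sessions


events = [
    {"user_id": "u1", "ts": datetime(2024, 3, 1, 9, 0)},
    {"user_id": "u1", "ts": datetime(2024, 3, 1, 9, 10)},
    {"user_id": "u1", "ts": datetime(2024, 3, 1, 10, 30)},
    {"user_id": "u2", "ts": datetime(2024, 3, 1, 9, 5)},
]
sessions = sessionize(events)
```

Keeping the timeout as a named, visible parameter makes the session definition explicit and auditable, rather than an assumption buried inside a third-party tool.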

Consume

Get consumption-ready data quickly and use it for advanced analytics, AI, and reverse ETL (R-ETL).
  • Drive value from your data warehouse or lake – Measure valuable metrics with BI, train algorithms with AI, and automate your processes with R-ETL by sending data to SaaS products (known as a composable CDP).
  • Support more data applications from real-time streams – Create a data-mesh style culture, in which different departments self-serve from CRMs, ad platforms, and customer engagement tools. You can also trigger in-session actions in web and mobile apps, as well as wearables.

Govern

Maintain complete oversight of your data to ensure privacy and compliance.
  • Enforce a high standard – Deliver your data at low latency to real-time streams, to your data warehouse, and to your data lake.
  • Maintain data sovereignty – Choose the storage location which most benefits the needs of a given data application.
  • Ensure private SaaS – Keep full ownership of your data, with Snowplow completely deployed on your own cloud, ensuring full auditability.
  • Evolve your data – Systematically evolve your data subsets and their governance, as you develop and productionize new data apps and transform existing ones.

Data Creation produces richly contextual, connected data

With Data Creation, you can develop a data asset that surpasses the limitations of transactional and demographic data, and instead provides richly contextual and connected behavioral data optimized for AI and intensive data applications.

The data you create is:

  • Reliable and consistent – Business decisions require data consistency—whether consistently low latency for projects like real-time recommendation engines, a consistent level of granularity, or even a consistent field type. Without this, you don’t have a solid foundation upon which to build data applications.
  • Accurate and auditable – Unit tests assure that tracking is set up correctly, and quality checks are carried out at each stage in data processing, so you can trust the final data delivered. Auditability adds to this assurance—you can do retrospective due diligence at each stage in data creation and processing.
  • Explainable – It’s easy to draw meaning from the data. Because the process for creating the data is well understood, you can easily infer why the data is the way it is, and what has happened in the world the data describes.
  • Contextual and predictive – By capturing as much contextual information as possible, you can better identify—and in turn, predict—what drives behavior and decision-making.
  • Compliant – Finally, data must record the basis for capture, be obfuscated where necessary, and respect local data laws (especially regarding storage location). Building complex ML applications without compliant data can waste time and, more significantly, lead to enormous fines.

Behavioral data is a powerful data asset

There is no greater predictor of future behavior than past behavior, and yet many companies haven’t managed to harness the predictive power of behavioral event data.

The main reason for this is that behavioral data collection generates vast quantities of information, which is hard to organize and make use of. Consequently, data teams either get it wrong and end up in a quagmire of high-volume, poor-quality data, or decide it’s all just too difficult and go back to less predictive data types.

Yet much of this complexity is the result of data exhaust, or trying to stitch together many disparate data sets. With Data Creation, you can have the predictive benefits of behavioral data, without the nightmare of cleaning and wrangling it.

Here is a snapshot of the data ecosystem, with behavioral data being the most complex, but simultaneously the best type of data for anticipating user behavior.

Figure 4. An overview of different data types

  • Behavioral Data – Describes in granular detail the actions performed by users of a digital product as ‘events’.
  • Demographic Data – Describes certain background characteristics of an audience, such as age, gender, interests, income, education, and employment.
  • Transactional Data – Describes the time, place, prices, payment methods, discount values, and quantities related to a particular transaction.
  • Systems Meta/Data – Metadata produced by organizational systems, machines, sensors, applications, and more.

Behavioral data enables organizations to power some of today’s most valuable use cases:

Attribution

Assign credit to each marketing touchpoint that influences high value user behavior, bespoke to your product and user journeys.

Churn reduction

Identify trends in user interaction to isolate behaviors predictive of retention and churn for better forecasting and interventions.

Personalization

Understand what drives user engagement, and personalize the experience in real time to drive acquisition and retention.

Data products

Put great behavioral data at the heart of your products to deliver compelling and unique value propositions to your customers.

Product Analytics

Develop a strong understanding of user behavior to inform product strategy and optimize the product experience.

The most successful companies in the world use behavioral data to drive competitive advantage.

The role of Universal Data Language in Data Creation

To create behavioral data for AI and advanced applications, each part of an event needs a consistent set of definitions, and the relationships between events and entities must be clear. We call this a Universal Data Language.

The resulting data event structure looks a lot like a sentence:

Figure 5. How a Universal Data Language reflects a spoken language

The vocabulary of this language consists of the definitions of each event and entity. Snowplow helps you achieve this by creating and cataloging JSON schemas which are human readable and machine enforceable for up-front validation.
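
As an illustration of a definition that is both human readable and machine enforceable, the sketch below declares a hypothetical event schema as a JSON-Schema-style dictionary and checks payloads against it before they are accepted. The `run_completed` event, its fields, and the hand-rolled `validate` function are assumptions for this example. A real setup would typically rely on a full JSON Schema validator rather than this small subset of keywords.

```python
# Hypothetical definition of a "run_completed" event for a wearables product.
RUN_COMPLETED_SCHEMA = {
    "description": "A user finished a run",
    "type": "object",
    "properties": {
        "miles_run": {"type": "number", "minimum": 0},
        "device_model": {"type": "string"},
    },
    "required": ["miles_run"],
    "additionalProperties": False,
}

# Mapping from the schema's type names to Python types (subset only).
JSON_TYPES = {"number": (int, float), "string": str, "boolean": bool}


def validate(payload, schema):
    """Return a list of validation errors; an empty list means the payload is valid."""
    errors = []
    properties = schema["properties"]
    for name in schema.get("required", []):
        if name not in payload:
            errors.append(f"missing required field: {name}")
    for name, value in payload.items():
        if name not in properties:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected field: {name}")
            continue
        rule = properties[name]
        expected = JSON_TYPES[rule["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for numbers.
        is_bool_as_number = rule["type"] == "number" and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_number:
            errors.append(f"{name}: expected {rule['type']}")
        elif "minimum" in rule and value < rule["minimum"]:
            errors.append(f"{name}: below minimum {rule['minimum']}")
    return errors


valid_errors = validate({"miles_run": 3.2, "device_model": "X1"}, RUN_COMPLETED_SCHEMA)
invalid_errors = validate({"miles_run": "three"}, RUN_COMPLETED_SCHEMA)
```

Here the first payload passes with no errors, while the second is rejected with a readable reason, so a malformed event is caught up front instead of surfacing later as a modeling problem.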

The grammar is the relationship between events and entities. This is defined when generating the data and is enforceable through automated testing, such as using a Docker image of your data creation pipeline in your test suite.
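
The grammar can be thought of as a set of rules about which entities must accompany which events. The sketch below is one hypothetical way to express and check such rules in a test suite. The event names, entity types, and the `missing_entities` helper are invented for illustration, and running a pipeline image inside the test suite is outside the scope of this example.

```python
# Hypothetical grammar: the entities each event must carry.
GRAMMAR = {
    "add_to_cart": {"user", "product"},
    "page_view": {"user", "web_page"},
}


def missing_entities(event, grammar=GRAMMAR):
    """Return the sorted names of entities the event needs but does not carry."""
    required = grammar.get(event["name"])
    if required is None:
        raise ValueError(f"unknown event: {event['name']}")
    attached = {entity["type"] for entity in event.get("entities", [])}
    return sorted(required - attached)


complete_event = {
    "name": "add_to_cart",
    "entities": [
        {"type": "user", "data": {"id": "u1"}},
        {"type": "product", "data": {"sku": "shoe-42"}},
    ],
}
incomplete_event = {
    "name": "add_to_cart",
    "entities": [{"type": "user", "data": {"id": "u1"}}],
}

complete_gaps = missing_entities(complete_event)
incomplete_gaps = missing_entities(incomplete_event)
```

An assertion that every tracked event returns an empty list could then run alongside the rest of the automated tests, so a release that drops an entity fails before it ships.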

Tightly governing definitions is fundamental to creating data, and adopting a Universal Data Language means that data will have all the AI-ready attributes listed above: reliability, accuracy, explainability, predictiveness, and compliance.

Snowplow was built upon the principles of Data Creation

Over 10 years ago, our founders Yali and Alex became frustrated with the limitations of traditional analytics tools. Their solution was to build pioneering software that empowers users to create their own rich data.

It seems obvious to us that this is the right way to do data—building exactly what you need, and not having to ‘make do’ with what you find lying around.

To experience the richness and potential of created data without having to create it yourself, try Snowplow today. Or explore this sample Snowplow data table to see how rich our data can be.

Start creating data with Snowplow today

Snowplow Open Source

The world’s leading open source project for data creation.

Snowplow Behavioral Data Platform

The private SaaS deployment, with data management, governance and enterprise support.
