The modern data stack: a guide
The last few years have seen an explosion in the number of data tools an organization can use to drive better decision making, most of them built around data stored and queried in cloud data warehouses. The cloud data warehouse has given organizations the ability to store and query vast datasets quickly and cost-effectively. For the first time, organizations of all sizes can build a single, high-value data asset and use it to drive value across their business.
A scalable framework
Driven by demand from organizations trying to get as much value as possible out of the data in their warehouse, numerous data tools have emerged. Each tool has become highly specialized in its portion of the data lifecycle, and most can be swapped for one of many alternatives. For example, there are dozens of BI tools for visualizing the data in the warehouse, each designed to democratize data within the organization.
These highly specialized tools come together to form the modern data stack: a scalable group of technologies with a low barrier to entry that startups and enterprises alike can adopt to drive immense value from their data. The stack is made up of a few key categories:
- Behavioral data ingestion | Streaming behavioral event data from web, mobile and other connected devices (wearables, smart TVs, etc.) to describe the full customer journey. This can be used as a basis for deeply understanding users;
- Transactional data ingestion | Managed data pipelines that import transactional data, in batch or streaming, from SaaS tools and internal databases for Ops reporting and to be joined with behavioral data for richer insights and Customer360;
- Storage | Cloud data warehouses and cloud data lakes as low-cost, scalable persistent storage, plus data streams that provide high-throughput messaging so that more sophisticated teams get low-latency access to data for use cases such as real-time recommendations;
- Processing | Batch and streaming data transformation platforms that enable data teams to aggregate, filter and apply business logic to raw datasets in each of the storage media to perform analysis and power decision making;
- Operations | Reverse ETL makes it easy to act on the output of the computation performed in the data warehouse, for example by segmenting users by propensity to convert and expected LTV, then publishing these segments to marketing platforms to enable better targeting. More broadly, Reverse ETL allows rich user data from the warehouse to be acted upon in many SaaS solutions such as CRMs and Product Analytics platforms. Smart Hubs give marketing teams a greater degree of self-serve: they help build user segments and activate them in the same destinations as Reverse ETL, and they can also ingest from data lakes;
- Analysis | A long-established category of BI and Product Analytics tools provides the basis for a self-serve culture among consumer teams such as Marketing and Product, with visualisation and exploration functionality;
- Intelligence | AI & ML tooling for Data Science teams to build, test and deploy models that identify trends in historical data and predict future customer behavior;
- Management | Orchestration tooling and frameworks allow engineering teams to manage their data pipelines, while data quality monitoring tooling provides high observability. Data governance tools address organizational problems, making it easier to scale the number of data producers and consumers.
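To make the Operations category more concrete, here is a minimal sketch of the segmentation-and-publish step that Reverse ETL automates. Everything named here is an assumption for illustration: the `UserScore` fields, the thresholds, the segment names, and the `send` callable, which stands in for whatever client a real marketing platform provides.

```python
from dataclasses import dataclass


@dataclass
class UserScore:
    """Hypothetical per-user model output computed in the warehouse."""
    user_id: str
    propensity_to_convert: float  # assumed to be a probability in [0, 1]
    expected_ltv: float           # assumed to be in the business's currency


def build_segments(scores, propensity_threshold=0.7, ltv_threshold=100.0):
    """Split users into two illustrative segments using both scores."""
    segments = {"high_value_likely_converters": [], "nurture": []}
    for s in scores:
        is_target = (
            s.propensity_to_convert >= propensity_threshold
            and s.expected_ltv >= ltv_threshold
        )
        name = "high_value_likely_converters" if is_target else "nurture"
        segments[name].append(s.user_id)
    return segments


def publish(segments, send):
    """Push each segment to a destination via `send`, a placeholder callable."""
    for name, user_ids in segments.items():
        send({"segment": name, "user_ids": user_ids})


if __name__ == "__main__":
    scores = [
        UserScore("u1", 0.9, 250.0),
        UserScore("u2", 0.9, 20.0),
        UserScore("u3", 0.2, 500.0),
    ]
    # print stands in for a real destination client in this sketch
    publish(build_segments(scores), print)
```

In practice the scores would be read from a warehouse table and the destination would be a real API, but the shape of the work (score, segment, publish) stays the same.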
Given the number of categories and tools in this ecosystem, the data landscape has become extremely exciting but also increasingly complicated and confusing to map, making it hard for organizations to build and, more importantly, evolve their data platforms. To help organizations with this, we have put together our version of the modern data stack which you can find below.
How did we get here: The rise of the cloud data warehouse
In September 2020, Snowflake had what was then the largest software IPO on record. Yet at the start of the millennium, the idea that every company could have a single source of truth, with every customer interaction and company record accessible by the entire business, would have seemed far-fetched. Three technologies paved the way for the cloud data warehouse to become the de facto source of truth for organisations:
- Decades ago, only large companies could analyse large datasets, because doing so meant scaling compute resources vertically at considerable upfront expense. Hadoop kick-started the big data revolution in 2006 by making it easier for organisations to remove hard processing limits and scale compute horizontally rather than vertically.
- Most organizations were still limited by the need to invest upfront in compute resources. AWS ushered in the public cloud era, removing the need for companies to build and maintain capital-intensive data centres. AWS, GCP and Azure made it possible for organisations of all sizes to pay for as much storage and compute as they needed on a metered basis.
- The modern cloud data warehouse revolution began with the launch and widespread adoption of Redshift in 2012.
With Redshift, it suddenly became possible to cost-effectively store huge relational datasets and run parallelised queries in SQL, all without owning any of the computers needed to do this. Data teams could write SQL models, and analysts could plug in their favourite BI tools, like Tableau, to build their dashboards faster and with a richer dataset. Long gone were the days of hard processing limits, simple queries that took half a day, and large capital expenditure just to ask questions of your own rich datasets.
This meant that, for the first time, data collection and visualisation could be decoupled. Because storage was so cost-effective and scalable, organisations could store the data first without worrying too much about its exact structure, then transform it for use by the business. They could therefore create a source of truth upstream of all the business systems that need data, such as Tableau.
Today, most organisations build their data platforms following this ELT (extract, load, transform) approach: they create a centralised data asset in their cloud data warehouse or lake that combines all their source data, acts as the source of truth for all their business systems, and appreciates in value over time. Using the specialised tooling in the modern data stack is key to building, managing and evolving these data platforms.
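The load-first, transform-later pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for a cloud data warehouse, and the table names (`raw_events`, `user_summary`), event fields and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw events; field names are illustrative only.
RAW_EVENTS = [
    {"user_id": "u1", "event": "page_view", "value": None},
    {"user_id": "u1", "event": "purchase", "value": 30.0},
    {"user_id": "u2", "event": "page_view", "value": None},
]


def load_raw(conn, events):
    """Load: land the source rows as-is, with no business logic applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events "
        "(user_id TEXT, event TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO raw_events (user_id, event, value) VALUES (?, ?, ?)",
        [(e["user_id"], e["event"], e["value"]) for e in events],
    )


def transform(conn):
    """Transform: derive a per-user model from the raw table using SQL."""
    conn.execute("DROP TABLE IF EXISTS user_summary")
    conn.execute(
        """
        CREATE TABLE user_summary AS
        SELECT user_id,
               COUNT(*) AS event_count,
               COALESCE(SUM(value), 0.0) AS revenue
        FROM raw_events
        GROUP BY user_id
        """
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse
    load_raw(conn, RAW_EVENTS)
    transform(conn)
    for row in conn.execute("SELECT * FROM user_summary ORDER BY user_id"):
        print(row)
```

Because the raw table is kept untouched, the transform step can be rewritten and re-run later as business logic changes, without re-collecting the source data.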