A guide to the modern data warehouse
2020 was a big year for the adoption of cloud data warehouses. The three most popular cloud-based data warehouses saw a significant increase in companies adopting their systems. As of March 2021, in the past 12 months, Snowflake saw a 66.25% increase in companies using their warehouse, while Google BigQuery saw a 20.14% increase, and Amazon Redshift saw a 1.34% increase.
These numbers show that more and more companies are shifting away from a traditional on-prem data warehouse to a cloud-based data warehouse as they begin to rely heavily on data to inform key decision-making and power core use cases. In fact, there is a recent trend in the market, according to our CEO and co-founder Alexander Dean, that the data warehouse can be a company’s “brain.”
This guide will provide you with the information you need when choosing the right modern data warehouse for your business.
What is a data warehouse?
Organizations use data warehouses to gather insights from their data — acting as a single source of truth for reporting and analytics purposes.
While there are plenty of definitions out there for a data warehouse, we like this one from Gartner best:
“A data warehouse is a storage architecture designed to hold data extracted from transaction systems, operational data stores, and external sources. The warehouse then combines that data in an aggregated, summary form suitable for enterprise-wide data analysis and reporting for predefined business needs.”
While some companies still use traditional on-prem warehouses, a majority of businesses have shifted their warehouses into the cloud. This shift is often referred to as the modernization of data warehouses. This allows companies to rapidly scale their data warehouses to fit their specific needs (we’ll talk more about this later).
A traditional on-prem data warehouse is typically designed and maintained by your team and stored at a physical location. This approach often requires installing physical hardware and figuring out the nuts and bolts of setting up your hardware. Your team will also need to configure storage space, install the required software to run updates on your database, and then figure out what data sources will feed into your warehouse and how often you will update your data. While it might sound appealing to have your warehouse onsite, it often creates problems that wouldn’t exist if your warehouse was in the cloud.
What are the benefits of modernizing your data warehouse?
We mentioned earlier the rise of the “Big Three” of cloud data warehouses ( BigQuery, Redshift, and Snowflake, which Snowplow integrates with by allowing you to send data to your version of their warehouse). According to HG Insights, as of March 2021, over 30,000 companies now use these three cloud data warehouses — and the rise of big data has played a growing part in the rapid evolution of the data warehouse.
Note: While Microsoft Azure is a popular cloud-based data warehouse, Snowplow doesn’t support it yet.
It wasn’t that long ago that companies stored data in databases and application systems, not thinking much of the information they had on file. While they had a great deal of information, they weren’t quite sure of what to do with their data. Over time, companies began to analyze that data and learn more about their customers, business, etc.
When companies realized the actual value in data, they became data-informed companies as they used data to drive decisions. The data function has evolved from IT people sharing reports to sort of a new business revolution. The top-performing companies have found a way to integrate data within or on top of their products.
Data now serves a bigger purpose for these companies and has evolved from basic reporting to game-changing product features and use cases, like, personalized content, real-time recommendations, and machine learning. With this came an increase in the volume of data sources flowing into databases, which could not handle their growing needs. Companies now had to rethink how they collect, handle, and store data at scale. Data was pouring in from multiple sources in various formats, and companies were unable to store and process the data to utilize it across their organization.
1. Scalability
One of the best features of cloud-based data warehouses is their scalability — as they grow alongside your business. Gone are the days when your servers maxed out and you would have to purchase new hardware and, in some cases, have to find places to house the server and people to install it. Run out of storage? This isn’t a problem with cloud data warehouses — simply buy more storage when you need to, and pay for what you use. However, data warehouse users should still play it smart when anticipating how their data volumes will scale to optimize their costs. If unchecked, it can be very costly to their organization.
2. Integrations
Since your data warehouse already exists in the cloud, connecting with other cloud-based services (like Snowplow) is simpler than when warehouses used to live on premises. Data integration tools help your organization make data useful; that way, you can draw insights relatable to your business. A high-quality third-party tool will support all three warehouses. The three big cloud data warehouses seamlessly integrate with other data tools, like dbt — which allows you to transform data in your warehouse more effectively. Organizations also face more flexibility when it comes to building a modular data stack.
If you use:
- Snowflake: Snowplow enables data to be loaded in Snowflake with Snowplow Snowflake Loader.
- BigQuery: Snowplow supports streaming data into BigQuery with BigQuery Loader.
- RedShift: Snowplow allows you to load into Redshift with RDB Shredder and RDB Loader apps.
Google BigQuery vs. Amazon Redshift vs. Snowflake
Our partners over at Poplin data broke down all three warehouses in length in this post. Since all three data warehouses are architectured in different ways and can be used by different businesses for various reasons, here is a quick summary of each warehouse.
- Google BigQuery: The enterprise data warehouse enables SQL queries to perform faster, thanks to the processing power of Google infrastructure.
- Amazon Redshift: The cloud data warehouse allows you to work with data in various formats and easily integrates with and connects to the AWS ecosystem, such as the S3 data lake.
- Snowflake: The cloud data warehouse allows you to share data securely between organizations - through views and materialized views that allow for hashing and redaction.
Questions to ask when choosing a modern data warehouse for your organization
Before choosing the right cloud-based data warehouse for your organization, there are some questions you should consider when looking to implement a warehouse for your business. While we won’t break down the differences between all three warehouses in full detail, like our friends at Poplin Data did, all three warehouses have unique features that set them apart from each other based on your needs.
1. What data types does your business store?
When deciding on a data warehouse, it is crucial to know the type of data that the warehouse will store — either structured or unstructured. If your data is highly structured, a relational data warehouse would work nicely in storing data for your business.
If your data is unstructured, you might want to consider switching over to a data lake, as they are aimed toward dealing with that type of data.
Here are some of the most popular data lakes:
Note: Some data warehouses can now make use of semi-structured and, even in some cases, unstructured data. Examples of semi-structured data are XML files (both BigQuery and Snowflake support this type).
Further Reading: The team at Fivetran put together this benchmark report comparing the three big cloud data warehouses.
2. What factors should be considered when evaluating pricing?
One of the driving factors to moving toward cloud-based warehouses is price. Data warehouses should be cost-effective and scale computes capacity to meet demand. Each warehouse has different costs associated (depending on which option or plan you choose). Each warehouse also has unique capabilities that are specific to certain business needs, which should be considered when choosing the right platform.
Further Reading: The 2021 database showdown: BigQuery vs Redshift vs Snowflake
3. Will the data warehouse scale with your business?
We mentioned earlier the importance of cloud-based data warehouses scaling with your business as you grow, but it shouldn’t break the bank for your organization. Make sure you choose a data warehouse that can handle a significant increase in data volume without compromising speed, cost, and performance.
The future of the data warehouse
Recently, there have been talks of merging data lakes into the data warehouse to form the “lakehouse.” The whole idea behind this comes from the historical divide of data warehouses, which are meant for structured data sets, and data lakes meant for unstructured and semi-structured data. The phenomenon behind it is quite appealing because you get the best of both worlds and only have to worry about one storage layer (your lakehouse). One significant advantage of utilizing a data lakehouse is leveraging the power of data warehouse capabilities, schemas, and metadata within data lakes, meaning you don’t have to rely on one compared to the other.
Over time, it will be interesting to see if both the data warehouse and the data lake converge into a single category. George Fraser of Fivetran and Jamin Ball of Clouded Judgement wrote great articles on this topic if you’re interested in learning more.
TLDR; Databricks combines the best aspects of data warehouses and data lakes together forming a “lakehouse.”
Our CEO, Alex Dean, talked more about this during our What’s in store for Snowplow in 2021 talk — leaving attendees with this question to answer for themselves.
“Frame your current use of data lake and/or data warehouse against your business objectives. Do you need both now, or can you wait for further convergence?”
Snowplow is here to help deliver data to your warehouse
No matter which cloud-based data warehouse you choose, Snowplow is here to help you deliver and manage your behavioral data. With Snowplow, you can track, contextualize, and model users’ behavior across your digital estate. Data is available in real time and is delivered to your data warehouse of choice, where it is later transformed and modeled in order to drive data productivity.
One of our users, Holistics, was able to capitalize off Snowplow’s well-structured data to improve their functionalities across their organization.
“One of the advantages of Snowplow is that it gives us a lot of raw data, but organizing and structuring that data in a way that makes it easy for the rest of the company to consume is where the synergies between the two products come into their own.” – Evan Tan, Chief of Staff, Holistics
If you are interested in learning more about how Snowplow can help with delivering data to your warehouse, please get in touch.