How to build an open source data stack
There’s a problem with your analytics. They aren’t your analytics.
When you choose a proprietary, packaged option for your analytics stack, you might be choosing convenience over clarity without realizing it. Perhaps, when you start out, convenience is what you need. You don’t have the resources to manage anything but a plug-and-play solution. You aren’t thinking about the optimal data stack; you just need a stack.
But as you scale, using a black-box solution to power your analytics means you lose more and more insight into your customers and product, and more and more control over your core business asset: your data. If you don’t run the infrastructure, you don’t control the data. If you want a precise understanding of how your users use your product, not a sample of usage or a summary shaped by a vendor’s rules, you have to own the infrastructure.
An open source analytics stack allows you to do just that. It can give you more flexibility and more control. You have to put in more work initially to set it up, but the payoff is worth it in the long term as you learn more and iterate faster.
What are open source analytics tools?
Imagine if you could take your favorite analytics tool and look at all the code underneath to understand exactly how it works. Not only that, but you could change any of that code to better fit your use cases and then host it yourself without having to pay. That is what open source analytics tools allow you to do.
Open source analytics tools have underlying code that’s open and available to everyone to not just view but also copy, modify, redistribute, and use. There is no proprietary code and no “trade secret” way that things work. All functionality is out in the open. Though you can pay for hosted solutions, you can take the code and run open source analytics tools on your own infrastructure. They are free to explore and free to use.
Here’s an example. Redash is an open source data-visualization and query-editor tool, an open source alternative to tools like Looker. You can sign up for a hosted version of Redash, or you can follow this guide to set it up on your own servers.
Because Redash is open source, unlike Looker, you can get the entire codebase for Redash on GitHub. You can examine the code and see exactly how Redash works. For instance, you can immediately see that Redash is a Python back end with a front-end JavaScript client. If you want to dive deeper, you can check out how they make their Sankey diagrams. Turns out they use D3.
It’s fun to read through code and see how a tool works. But the real strength of open source shows when a tool doesn’t work. Open source makes a codebase more robust and more flexible. If your favorite SaaS analytics tool shows an error, all you can do is open a support ticket. If it doesn’t have the integration you need, all you can do is nudge the company on their forum.
With open source, you can go in and fix the issue or build the integration yourself, or talk with other like-minded people to find a solution, since open source tools are usually home to a community of users in the same boat as you.
Snowplow is another example of this. There have been over 6,000 commits to Snowplow repos to date. Most of these are from the core team, but a significant number are from Snowplow users scratching their own itch. Users add functionality to the codebase to satisfy a need of theirs that many other users share. For example, a data engineer at the Toronto Globe and Mail “added the optional endpoint to Dynamodb to make it work with Localstack.”
This is a small change (31 lines added/6 lines deleted), but one that aids the entire community.
Community is an important watchword in open source. The strength of an open source tool lies in the people who use it. When assessing open source analytics tools, you need to look at not just whether the tool services your needs but also whether there is a thriving community that will help keep the tool working and growing.
If a project has no open issues, no PRs, or commits only from the core team, it might not have the level of community engagement needed for long-term viability.
What you want from an open source analytics stack
Choosing open source analytics over a proprietary stack comes with more work up front. With a SaaS package, often all you need is the right JavaScript snippet in the right place and you can immediately ingest data. It’s also easier to get buy-in for a tool that is well known, requires minimal setup, and works immediately.
But the initial heavy lift of open source analytics is outweighed by the long-term benefits of having total control and an understanding of how the stack (and, by proxy, your raw data) works.
Here are the main benefits of open source analytics:
More control over your data and data infrastructure
This is the core reason to choose these tools. With open source analytics, you own your data and, just as importantly, you own how it is processed.
- You control the exact events and information collected from each user
- You control the rules governing how that data is processed
- You control where and how it is stored (and deleted, if need be)
- You can access 100% of your data 100% of the time
- You know your data is complete, clean, and ready for analysis whenever your teams need it
No vendor lock-in
Your data is an asset for you; it shouldn’t be an asset for your analytics company. Yet that is exactly what vendor lock-in entails: you can’t leave because they effectively own your data.
With open source analytics, you own the infrastructure and the data.
- This means you can do with it what you want and take your data where you want
- It also means you aren’t locked into someone else’s rules
- Additionally, your teams aren’t locked into different silos, where one team needs data in one place and format, and another team needs the data somewhere else
- With open source analytics, once you have the data in place, it can be used more easily across the business.
- You are also not limited by the vendor’s fate. If your packaged SaaS provider goes out of business or gets acquired, as happened with Wagon, you can lose access to the tool
Flexibility for specific use cases
Your tech should reflect your use cases. You shouldn’t have to crowbar your specific needs into a generic system.
When using proprietary stacks, you’re limited to the use cases they are built for. You are also limited to the integrations they have built.
Open source tools allow you to build around the specific use cases you have.
- You can assemble the tech stack that makes the most sense for you
- You can take advantage of their modularity, using only the parts of each open source tool that you need
- Or, if all else fails, build out the exact integrations required for your data
More cost-effective
You can feasibly build an entire analytics stack with open source products, for close to free. Check out the following examples:
- Open source product analytics tools give you pretty much everything you need to assess how the user experience performs across your product estate.
- Open source A/B testing tools let you test the effect of product changes on conversion or engagement, so you can iteratively implement the optimal solution.
- Open source CDPs (customer data platforms) manage prospect audiences and customer segments so you can personalize and optimize your customer experience.
- Open source data validation tools let you ensure that your data is complete, accurate, and clean.
- Open source analytics engineering tools let you transform, test, deploy, and document data.
- Open source anomaly detection tools identify outlier events, items, or observations in your data.
- Open source databases let you store all your data on your own infrastructure.
- Open source data visualization tools make it easier for us humans to spot patterns, trends, and outliers in complex datasets.
- Open source data management platforms act as a centralized place to acquire, validate, store, protect, and process all of your data.
Though the software itself is freely available, you will still have to find a data warehouse, a hosting solution, and engineering resources. Each of these will add to the cost of running an open source analytics stack.
However, at lower volumes these costs are still likely to be much less than, for example, a GA360 solution (where you’ll still have BigQuery and engineering team costs).
Greater control over security and privacy
When you control the information you capture about your users, it becomes easier to understand your data responsibilities.
- You can be sure that your data handling complies with the GDPR and the CCPA.
- You can also be confident about industry-specific standards for your data.
For example, when you control your data, you have more insight into how your data systems are designed (SOC 2 Type I), how they operate over time (SOC 2 Type II), and how they address the five SOC 2 trust principles of security, availability, processing integrity, confidentiality, and privacy.
How to open source your data stack
Open sourcing your stack can start with you finding a cool new tool on GitHub and forking it. But the better way to start is from first principles—your use case. Then, move on to the cool tools!
First, decide on your use cases. Your use case is the foundation of your stack: it determines what data sources you need, what schema works best, how you’ll enrich your data, and which analysis, visualization, and storage options fit best.
Building a content recommendation system is different from building marketing attribution. You need different data for each, and you are going to perform different analysis on that data. For a content recommendation system you’ll need granular data on how a user interacts with pages, such as time on page, whereas just logging the page visit might suffice for marketing attribution.
For marketing attribution, by contrast, you’ll likely want Redash or another data visualization tool with Sankey diagrams, so you can visualize the journeys users take through your content. You might not need any end visualization tool for a content recommender; instead, you can feed the data into an algorithm and use the output directly on your site.
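To make the tracking side of that difference concrete, here’s a minimal sketch of how the granularity decision might look with the Snowplow browser tracker (@snowplow/browser-tracker). The collector endpoint, app ID, and timing values are placeholders for whatever your own setup uses.

```javascript
// Minimal sketch using @snowplow/browser-tracker; the endpoint and IDs are placeholders.
import {
  newTracker,
  enableActivityTracking,
  trackPageView,
} from "@snowplow/browser-tracker";

// Point the tracker at a collector you run yourself.
newTracker("sp", "collector.example.com", { appId: "my-app" });

// Content recommender: activity tracking emits periodic "page ping" events,
// giving you granular time-on-page data. It must be enabled before the first
// page view is tracked.
enableActivityTracking({ minimumVisitLength: 10, heartbeatDelay: 10 });

// Marketing attribution: a plain page view per visit may be all you need.
trackPageView();
```

Either way, the raw events land in your own pipeline, so you can change your mind about granularity later without renegotiating with a vendor.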
Once you have the use case, you can think about the right tools for your pipeline. A modern data stack looks like this:
- Collect: You can use Snowplow Trackers to collect events from the web, mobile, desktop, server, and IoT (see the collection sketch after this list).
- Load: The major cloud warehouses, such as Redshift, Snowflake, and BigQuery, aren’t open source, but all open source analytics tools should be able to load data into and extract data from them.
- Transform: This is where you apply your business logic to your data and transform it into something that can be easily analyzed. This step is critical for generating insights from your data. dbt is an open source analytics engineering tool that allows you to write modular SQL to transform raw data, and it brings engineering principles, such as version control, automated testing, and easy collaboration, into the data analysis space.
- Analyze: Analysis can be as simple as command-line SQL or as sophisticated as full-on BI tools. Prominent open source tools include Metabase, which lets you easily perform SQL queries and set up dashboards for any internal team that needs the data; Redash, mentioned above; and Apache Superset, “a modern data exploration and visualization platform” that lets you explore, view, and investigate your data.
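As a concrete illustration of the collect step, here’s a hedged sketch of sending a custom, self-describing event to your own collector with the Snowplow browser tracker. The schema URI, event name, and fields are hypothetical; in practice you would define them yourself in your schema registry.

```javascript
// Sketch of the "Collect" step with @snowplow/browser-tracker.
// The collector endpoint, app ID, schema URI, and fields are all hypothetical.
import { newTracker, trackSelfDescribingEvent } from "@snowplow/browser-tracker";

newTracker("sp", "collector.example.com", { appId: "my-app" });

// Because you own the pipeline, you decide exactly which events and fields
// get captured, validated, and stored.
trackSelfDescribingEvent({
  event: {
    schema: "iglu:com.example/article_read/jsonschema/1-0-0", // hypothetical schema
    data: {
      articleId: "open-source-data-stack",
      percentRead: 75,
    },
  },
});
```

From there, the raw events flow into your warehouse for the load, transform, and analyze steps above.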
These are just some of the tools available. Open source analytics has been growing for the last decade, and the market is exploding with open source technologies. As data engineering becomes more sophisticated and nuanced, more and more tools are being built to serve the exact needs of data teams.
Opening your stack
Open sourcing your analytics isn’t without headaches. The code is free, but the infrastructure and the resources required to maintain and manage it are not. But even if you choose a hosted solution to take some of those headaches away, the important part of open source analytics remains: the control.
When you own and run the infrastructure you use to capture all your events, and when the data is collected, processed, stored, and analyzed according to your rules and business needs only, then you end up with an incredibly valuable asset: your data.
Of course, with Snowplow BDP, you have the option to host your infrastructure and manage it within your cloud account. You can think of it as getting the best of open source flexibility without the hard work of setting it up. To discover more, check out Snowplow BDP for yourself.