It’s fairly easy to find the top SaaS analytics tools. Just search Google for “analytics tools,” and you’ll be presented with some well-placed, high-cost ads pointing you in the right direction.
Finding open source analytics tools isn’t quite as easy. You can go to GitHub and look for repos tagged as “analytics,” but this can be a flawed strategy because:
- Though the source code is available, that doesn’t necessarily mean it is completely free to use; e.g., PostHog (below) has some code that has a commercial license within their repository.
- You’ll have to search through all 3,299 repos (at the time of writing) to find the specific one that fits your needs.
- You will miss out on all the open source analytics tools that haven’t tagged themselves as “analytics” (like, er, Snowplow).
To help, we’ve rounded up the top open source analytics tools. We will break down what they do well and also highlight any weak spots they might have.
The rankings here are partly due to overall popularity—the most starred and forked open source analytics tools on GitHub. But we’ve also included some that are of real interest but aren’t yet in widespread use. These are the tools we think will become ever more important in the open source analytics ecosystem in 2021.
Not covered here are the Google Analytics-specific competitors: Matomo, Plausible, Koko, and Offen. They are great for their specific use cases, but if you are just starting to use open source analytics tools, it makes more sense to do so completely, breaking free of all your packaged tools.
We’ve split the software into nine separate categories of open source analytics tools, each serving a different need:
- Open source product analytics tools provide pretty much all you need to assess the performance of the user experience across your product estate.
- Open source A/B testing tools help your organization test how product changes affect conversion or engagement, so you can iterate toward the optimal solution.
- Open source CDPs / Reverse ETL tools manage prospect audiences and customer segments so you can personalize and optimize your customer experience.
- Open source data validation tools let you ensure that your data is complete, accurate and clean.
- Open source analytics engineering tools transform, test, deploy and/or document data.
- Open source anomaly detection tools identify outlier events, items or observations in your data.
- Open source databases allow you to store all your data on your infrastructure.
- Open source data visualization tools make it easier for us humans to detect patterns, trends and outliers in complex datasets.
- Open source behavioral data platforms act as a centralized platform to acquire, validate, store, protect and process all of your data.
Let’s look at a few instances of each that will help you open source your analytics.
Open source product analytics tools
These are entire platforms that can supersede your packaged SaaS tools and give you end-to-end control and insight into your product data. The overall pluses of these types of tools are control and customization. You have complete access to your data and can decide exactly how the data is analyzed. The downside is that they can be resource-intensive to set up and run.
1. Countly for easy mobile analytics
Countly / Countly GitHub / AGPL v3 license / 4.6k stars
The strength of Countly is easy access to your data with read and write API access and analytics for mobile, web, and desktop. It features a number of open source plugins to help you collect and understand your data better.
The downside of the open source version is that it doesn’t include all the features of the Enterprise paid version. With open source, you miss out on real-time data, user profiles, and the ability to design funnels. The open source version also “stores data (only) in an aggregated format,” so you can’t export the data and perform more granular analysis elsewhere (though this does make reporting faster).
2. PostHog for quick setup of self-hosted analytics
PostHog / PostHog GitHub / MIT license / 3.4k stars
PostHog is a self-hosted, open source analytics platform that allows for extremely easy deployment. You can deploy the tool directly to Heroku in one click. This sets it apart from a lot of other open source analytics tools that have a more involved setup process and require more knowledge to get up and running. PostHog works well for teams new to the open source world.
A weakness of PostHog is that you might be limited if you are building out marketing attribution with open source analytics. PostHog doesn’t currently have email link tracking or ad campaign tracking, so you will be missing a subset of your data when trying to understand your marketing campaigns better.
A note on ‘enterprise scale’ web analytics tools:
If you are looking for open source analytics at enterprise scale, then you might want to consider a mesh of tools that deliver analytics data into a data warehouse. This is exactly what Snowplow, number 14 on this list, was made for.
This would then enable you to build competitive advantage based on how you amass high quality data at scale, and activate it within tools built especially for real-time marketing automation, customer engagement and business intelligence.
Open source A/B testing tools
3. Wasabi for real-time, enterprise-grade A/B testing
Wasabi / Wasabi GitHub / Apache-2.0 license / 973 stars
Wasabi is a real-time, open-source, 100% API-driven, A/B testing platform by Intuit. The open-source testing software allows users to own their data and experiment across the web, mobile, and desktop. Users utilize Wasabi because it’s fast, scalable, and easy to use for organizations of all sizes.
Developers lean toward Wasabi for A/B testing because it is 100% API-driven and can be called from any programming language and environment. The software has been tested for years with products like TurboTax and QuickBooks.
While Wasabi is a proven open-source platform that can run on your servers or in the cloud, it is no longer under active development or supported by Intuit, as of August 28, 2019.
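To make the idea of API-driven experimentation more concrete, here is a minimal sketch of the kind of deterministic bucket assignment an A/B testing service performs behind its API. This is not Wasabi's code or API: the function name `assign_bucket`, its parameters, and the hashing scheme are all assumptions chosen for illustration.

```python
import hashlib
from typing import Dict, Optional


def assign_bucket(experiment: str, user_id: str, buckets: Dict[str, float]) -> Optional[str]:
    """Deterministically map a user to a bucket (hypothetical helper).

    `buckets` maps a bucket label to its share of traffic; shares must sum
    to at most 1.0. Returns None when the user falls outside the experiment.
    """
    if sum(buckets.values()) > 1.0 + 1e-9:
        raise ValueError("bucket allocations must sum to at most 1.0")

    # Hash experiment + user so the same user always gets the same answer,
    # while different experiments shuffle users independently.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()

    # Keep the top 53 bits so the division is exact and stays in [0, 1).
    point = (int.from_bytes(digest[:8], "big") >> 11) / 2**53

    # Walk the cumulative allocation until we pass the user's point.
    upper = 0.0
    for label, share in buckets.items():
        upper += share
        if point < upper:
            return label
    return None  # unallocated traffic sees the default experience
```

Because the assignment depends only on the experiment name and the user ID, any client on web, mobile, or desktop would get the same answer for the same user without sharing state.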
Open source CDPs / Reverse ETL tools
4. Grouparoo for integrating customer data with cloud-based tools
Grouparoo / Grouparoo GitHub / Mozilla Public License 2.0 / 428 stars
Grouparoo is an open-source Reverse ETL solution that makes it easy to send data from your data warehouse to cloud-based marketing, sales and customer platforms like Mailchimp, Salesforce and Zendesk. Grouparoo integrates with any tech stack; you can configure your setup locally, commit changes, and deploy with git, just like how you’d deploy dbt projects. There’s also a web-based user interface to support complex configurations.
Grouparoo is a very new solution and therefore doesn’t feature as many integrations as its non-open-source counterparts in the reverse ETL category. That being said, it’s a hugely promising platform with advantages in its privacy and the fact you can fit it into your existing engineering workflow. Grouparoo also has great segmentation capabilities, including a group building tool that can be used by engineers as well as less technical teams like marketers. Groups determine which profiles get synced to certain tools, and they also create tags or lists in the destination systems.
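As a rough illustration of how group membership can drive a reverse ETL sync, the sketch below evaluates group rules against warehouse profiles and builds the tagged payloads a destination would receive. It does not use Grouparoo's configuration format or API: `plan_sync`, the rule functions, and the profile fields are hypothetical.

```python
from typing import Any, Callable, Dict, List

Profile = Dict[str, Any]
GroupRule = Callable[[Profile], bool]


def plan_sync(profiles: List[Profile], groups: Dict[str, GroupRule]) -> List[Dict[str, Any]]:
    """Build one destination payload per profile that belongs to a group.

    Each payload carries the group names as tags, mirroring the idea that
    groups both select who is synced and label them in the destination.
    """
    payloads = []
    for profile in profiles:
        tags = sorted(name for name, rule in groups.items() if rule(profile))
        if tags:  # profiles in no group are not sent anywhere
            payloads.append({"email": profile["email"], "tags": tags})
    return payloads


# Hypothetical warehouse rows and group definitions.
profiles = [
    {"email": "ada@example.com", "lifetime_value": 1200, "plan": "pro"},
    {"email": "bob@example.com", "lifetime_value": 40, "plan": "free"},
    {"email": "cy@example.com", "lifetime_value": 900, "plan": "free"},
]
groups = {
    "high_value": lambda p: p["lifetime_value"] >= 500,
    "paying": lambda p: p["plan"] != "free",
}
payloads = plan_sync(profiles, groups)
```

Here `bob@example.com` matches no group, so no payload is produced for that profile.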
5. Pimcore for managing digital data
Pimcore / Pimcore GitHub / GPLv3 license / 2k stars
Pimcore was introduced to the open-source world in 2010. The open-source platform assists organizations in managing digital data and customer experience. Pimcore is 100% API-driven, allowing integration into any tech stack. Eighty-two thousand customers across 56 countries utilize Pimcore to manage their data, including Sony and Pepsi.
Pimcore stores data independently and can provide the managed data to any channel, such as B2B websites, ecommerce systems, and mobile applications.
It is important to know that Pimcore is not an “out of the box” software product and, therefore, is meant for people with software development experience.
Open source data validation tools
These tools have a specific use within your data pipeline. You can add them in as a step within an open source data platform to perform a single function. The plus of these tools is that they perform important operations that you are unlikely to get in packaged SaaS tools. The downside is that they are built specifically for certain purposes—you need multiple tools like these to answer every use case you have.
6. Great Expectations for data validation
Great Expectations / Great Expectations GitHub / Apache-2.0 license / 3.2k stars
The strength of Great Expectations (apart from its amazing name!) is that it allows you to set and assert specific validation rules for your data and be alerted when your data is straying from those rules. You can also automatically create documentation directly from these assertions.
A caveat is that Great Expectations is very new. It has a lot of promise, but key features, such as autogenerated documentation from tests and data profiling, are still experimental.
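The idea of setting and asserting validation rules can be sketched in a few lines of plain Python. This is a conceptual illustration only and deliberately does not use the Great Expectations library: the helper names (`expect_not_null`, `expect_between`, `validate`) and the sample columns are assumptions made for this example.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Expectation = Callable[[Row], bool]


def expect_not_null(column: str) -> Expectation:
    """Rule: the column must be present and not None."""
    return lambda row: row.get(column) is not None


def expect_between(column: str, low: float, high: float) -> Expectation:
    """Rule: the column must be a number within [low, high]."""
    def check(row: Row) -> bool:
        value = row.get(column)
        return isinstance(value, (int, float)) and low <= value <= high
    return check


def validate(rows: List[Row], expectations: Dict[str, Expectation]) -> Dict[str, List[int]]:
    """Return, for each named rule, the indexes of the rows that break it."""
    failures: Dict[str, List[int]] = {name: [] for name in expectations}
    for index, row in enumerate(rows):
        for name, check in expectations.items():
            if not check(row):
                failures[name].append(index)
    return failures


# Hypothetical event rows and rules.
rows = [
    {"user_id": "u1", "session_seconds": 42},
    {"user_id": None, "session_seconds": 17},
    {"user_id": "u3", "session_seconds": -5},
]
rules = {
    "user_id is set": expect_not_null("user_id"),
    "session length is plausible": expect_between("session_seconds", 0, 86400),
}
report = validate(rows, rules)
```

Because each rule has a readable name, a report like this could double as simple documentation of what the data is expected to look like, which is the spirit of the feature described above.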
Open source analytics engineering tools
7. dbt for improved analytics workflow
dbt / dbt GitHub / Apache-2.0 license / 2.2k stars
dbt’s strength is that it allows you to bring general engineering principles, such as version control, testing, and sandboxing, into your data pipeline. You can perform data transformation and business logic without impacting users in separate, collaborative environments.
The limitation of dbt is that it is purely a transformation tool. It expects that extraction and loading will be done by another tool. This is fine, as there are plenty of other tools that can do these jobs in the pipeline, but it’s important to realize this is just one step in a larger process.
Open source anomaly detection tools
8. Hastic for data anomaly detection
Hastic / Hastic GitHub / Apache-2.0 license / 269 stars
The strength of Hastic is its ability to find anomalies in your data and alert you immediately. You set up predefined parameters for possible anomalies in your data, and Hastic will find them if they reoccur.
The limitation here is that Hastic only works with the open source analytics monitoring platform Grafana, so you can’t see these plots in Superset or Metabase. Hastic is also currently lightly documented, so setup and maintainability might be a challenge.
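For readers new to anomaly detection, the sketch below shows one common approach: flagging points that deviate sharply from a trailing window of recent values. This is a generic z-score technique swapped in for illustration, not Hastic's detection method; the function name and the `window` and `z_threshold` parameters are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def find_anomalies(values: Sequence[float], window: int = 5, z_threshold: float = 3.0) -> List[int]:
    """Return indexes whose value sits more than `z_threshold` standard
    deviations away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
        elif abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical requests-per-minute series with one obvious spike.
series = [10, 11, 10, 9, 10, 11, 10, 50, 10, 11]
spikes = find_anomalies(series)
```

The two parameters play the role of the predefined settings mentioned above: a larger `window` smooths out short-term noise, and a higher `z_threshold` makes the detector less sensitive.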
Open source databases
Open source databases allow you to store your data outside of the larger proprietary warehouses. A lot of databases, such as MySQL, PostgreSQL, CockroachDB, MongoDB, and SQLite, are open source, but the two highlighted here are different in that they are engineered to deal with specific types of data and analysis.
9. Apache Druid for real-time DB querying
Druid / Druid GitHub / Apache-2.0 license / 10.3k stars
The strength of Druid is in real-time analytics, where a user is performing multiple queries in rapid succession and needs sub-second answers. If you are working on a product that requires you to analyze data on the fly, then Druid is the right database to choose.
Druid’s lack of fault tolerance has been cited as a weakness, specifically if you are susceptible to network failures.
10. Timescale for time-series querying
Timescale / Timescale GitHub / Apache-2.0 license / 9.8k stars
Timescale’s strength is that it is optimized for time-series data. If you are working with time-series data, such as ongoing product usage, Timescale allows you to perform complex queries on the data.
A weakness of Timescale is that, though the relational database model is versatile, it can be more difficult to get started with. There is a steep learning curve for the tool.
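A typical time-series query groups events into fixed-width time buckets and aggregates each bucket. The sketch below shows that idea in plain Python for illustration; it is not Timescale code, and the helper names and sample timestamps are assumptions. Timestamps are assumed to be timezone-aware.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts: datetime, width: timedelta) -> datetime:
    """Floor a timezone-aware timestamp to the start of its bucket."""
    whole_buckets = (ts - EPOCH) // width  # integer number of buckets since the epoch
    return EPOCH + whole_buckets * width


def events_per_bucket(timestamps: Iterable[datetime], width: timedelta) -> Dict[datetime, int]:
    """Count events in each fixed-width bucket, ordered by bucket start."""
    counts = Counter(bucket_start(ts, width) for ts in timestamps)
    return dict(sorted(counts.items()))


# Hypothetical product usage events.
events = [
    datetime(2021, 3, 1, 10, 2, tzinfo=timezone.utc),
    datetime(2021, 3, 1, 10, 14, tzinfo=timezone.utc),
    datetime(2021, 3, 1, 10, 16, tzinfo=timezone.utc),
]
usage = events_per_bucket(events, timedelta(minutes=15))
```

In a time-series database, this kind of bucketing runs inside the database as part of the query (TimescaleDB offers a `time_bucket` SQL function for it), so the application only receives the aggregated rows.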
Open source data visualization tools
For any data analysis, you want the ability to query and visualize the data. Proprietary dashboards and business intelligence tools such as Looker, Tableau, or Chartio are extremely popular, but so are some of the open source visualization tools available. These are some of the most starred and forked open source analytics tools out there.
11. Superset for visualizing data in any DB
Superset / Superset GitHub / Apache-2.0 license / 31.4k stars
The main strength of Superset is that it integrates with dozens of modern databases, so wherever your data currently lives, Superset can interface, allowing you to visualize your data. You can also visualize and analyze data from different sources simultaneously.
Superset is not necessarily an “enterprise-ready” tool. There is a challenging setup process, and some cite potential security risks of giving a Docker image access to your data. But it is an extremely powerful tool if you take the time to learn all that Superset has to offer.
12. Metabase for quick visualization
Metabase / Metabase GitHub / AGPL license / 22.9k stars
The strength of Metabase is its simplicity, both in setup (boasting a five-minute setup process) and in the analysis, where anyone on your team can use Metabase to query your data and get answers.
Its strength is also its weakness, in that the simplicity can mean complex querying of your data is more difficult. There is an SQL mode, but this isn’t the main feature of the tool as in other business intelligence tools.
13. Redash for different dashboards for different teams
Redash / Redash GitHub / BSD-2-Clause license / 17.7k stars
Like Metabase, the strength of Redash is in its ease of use. Though you do need some SQL experience to get the most out of the tool, you can easily create visualizations based on your data, and you can create different dashboards for different teams.
The downside of Redash is that its visualizations and dashboards aren’t quite as pretty or sophisticated as those you can produce with Metabase, and it doesn’t have quite the power of Superset. It was also recently acquired by Databricks, meaning its future is unknown.
Open source behavioral data platforms
14. Snowplow for complete ownership of your data
Snowplow / Snowplow GitHub / Apache-2.0 license / 5.8k stars
The strength of Snowplow is complete ownership of your data and data infrastructure. You have direct access to your granular data and can collect, process, analyze, and store it exactly as you need. Snowplow has trackers and webhooks to pull in multiple data sources and integrates with the main data warehouses.
Snowplow’s extensive toolset means it can be daunting for engineers to set up and run. It takes time to build a tracking strategy and implement Snowplow in an effective way for your team. Thankfully, it’s possible to gain support via Snowplow BDP, a managed, private SaaS version of the product.
With over 600,000 mobile apps and websites using Snowplow, there is a vibrant community of users on hand to help you answer questions while setting up Snowplow for your organization.
Take control of your data with open source tools
There is a wealth of opportunities for data teams looking to leverage open source analytics tools. With some, you can get an entire pipeline, from collection to transformation and visualization, up and running in an hour. Others will take your entire data team weeks to configure.
Whatever your use case, it makes sense to explore the flexibility of open source tools. In particular, it’s worth taking advantage of thriving open source analytics communities, discourse forums, Slack environments, and Twitter chats to find the best tools for your chosen use case.
Snowplow users frequently integrate many of the above tools in order to open source a data stack. Snowplow’s modular technology can slot into your existing processes, giving the flexibility to leverage Snowplow for multiple use cases.
If you’d like to learn more about how Snowplow’s open core infrastructure can empower you on your data journey, why not try Snowplow yourself?