An advanced guide to A/B testing
Along with funnel-based analytics, A/B testing is used by the vast majority of companies to develop their products. There are some more advanced concepts and practices that can make these tests more effective and accurate.
We’ll run through the technical aspects of A/B testing, covering topics such as hypothesis testing, statistical significance, power analysis, and sample size calculations.
We'll also explore some of the common mistakes to avoid when designing and conducting A/B tests. If you have been struggling to get good results from your A/B tests, this practical advice can help take your optimization efforts to the next level.
Different approaches to running A/B tests
Broadly speaking, there are two approaches to tracking and measuring A/B tests:
Metric-first: this involves defining in advance the metrics that will be compared between the different test groups, and instrumenting dedicated tracking specifically for those metrics based on which segment a user belongs to (test, control, or neither).
Event-analytics approach: in this method, it is only necessary to record which group a user belongs to, at least once. The relevant metrics are then computed post-collection, once the data is in the data warehouse.
Hypothesis testing
Having a clear and well-documented hypothesis is crucial for A/B testing as it helps prevent biases from creeping in during testing.
Once you have your hypothesis, there are two main potential outcomes:
- Null hypothesis not rejected (no significant difference between two or more versions of a web page or application)
- Null hypothesis rejected, i.e., a significant difference (one that exceeds a pre-determined significance level).
Not documenting your hypothesis, or not articulating it clearly, are common errors that can have negative consequences - especially with advanced A/B testing. Our advice is to find a template and follow a pre-made set of documenting procedures. It's also advisable to get this signed off by the team before launching an experiment.
Statistical significance
Statistical significance is the measure of the extent to which your hypothesis can be relied upon.
Rather than going with gut feel, as we've previously said, it's worth having a number in mind before conducting the experiment.
It's pretty common to view a statistical significance level of 5% (or p-value threshold of 0.05) as a strong case against the null hypothesis – i.e., if there were truly no difference between the groups, you would see a result at least this extreme less than 5% of the time.
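As a sketch of how that p-value is computed for the most common case – comparing conversion rates between two groups – here is a two-proportion z-test using only the standard library (the function name and inputs are illustrative):

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates
    (two-proportion z-test with a pooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Convert |z| to a two-sided p-value via the standard normal tail.
    return math.erfc(abs(z) / math.sqrt(2))

# e.g. 120/1000 conversions in control vs 150/1000 in test
# gives a p-value just under 0.05.
p = two_proportion_p_value(120, 1000, 150, 1000)
```

If p falls below your pre-determined threshold, you reject the null hypothesis; otherwise the test is inconclusive.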
Power analysis and sample-size calculations
Sample size is often overlooked in A/B testing. To calculate the right size for an experiment to be meaningful, you need a power analysis.
With this analysis, you can determine the statistical power of your test: the probability of correctly rejecting the null hypothesis when it is false. It's strongly influenced by several factors, including sample size, effect size, and the significance level.
Power analysis can help ensure that an A/B test is designed well enough to detect meaningful differences between the test groups. It's worth remembering that companies like Google have literally colossal test groups and do not struggle to power tests, whereas niche B2B webpages may not yield meaningful statistical data. On the other hand, when the sample size goes beyond a certain point, you may not actually get any additional benefits as you get diminishing returns. As a result, it’s necessary to strike a balance between statistical power and practical considerations such as time and cost when making your experiment plan.
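To make this concrete, here is a minimal sketch of the standard sample-size formula for a two-proportion test, using the normal approximation (the function name and default values of 5% significance and 80% power are illustrative):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.8):
    """Approximate sample size needed per group to detect a lift
    from p_base to p_target with a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    effect = p_target - p_base
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Detecting a lift from a 10% to a 12% conversion rate needs
# a few thousand users per group.
n = sample_size_per_group(0.10, 0.12)
```

Note how quickly the required sample grows as the effect you want to detect shrinks – this is exactly why small, niche sites struggle to power tests that large consumer companies run routinely.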
The difference between events, segments and metrics for A/B testing
To understand the difference between the two approaches to measurement with A/B tests, we first need to distinguish between events, segments, and metrics.
Events are observations of things that happen, such as when a user loads a website, clicks a button, or navigates to a different page. You can think of each event as a “fact” that is recorded or an observation that is made.
Segments are groups of users that have something in common, like age group or geolocation. In an A/B test, both the “control” group and “test” group are segments.
Metrics are measurements that are computed on groups of events. For example, we might compute the “number of page views per session,” “number of sessions per user,” the “time taken” for a particular workflow to complete, or the “percent dropout” for a particular workflow.
All of these are calculations that can be run on the underlying event-level data. When we’re running an A/B test, we’re comparing one or more metrics between the test and control group.
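As a small sketch of what "computing a metric on event-level data" looks like in practice (the event shape here is hypothetical – in reality this would usually be a query against the warehouse):

```python
from collections import defaultdict

# Hypothetical event records: one row per observed "fact".
events = [
    {"user_id": "u1", "session_id": "s1", "event": "page_view"},
    {"user_id": "u1", "session_id": "s1", "event": "page_view"},
    {"user_id": "u1", "session_id": "s2", "event": "page_view"},
    {"user_id": "u2", "session_id": "s3", "event": "page_view"},
]

def page_views_per_session(events):
    """A metric computed post-collection from raw events:
    the average number of page_view events per session."""
    per_session = defaultdict(int)
    for e in events:
        if e["event"] == "page_view":
            per_session[e["session_id"]] += 1
    return sum(per_session.values()) / len(per_session)
```

Running this metric separately over the events of the test segment and the control segment is precisely what an A/B test comparison is.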
The structure of A/B tests and why it matters
When you’re running an A/B test, you’re fundamentally comparing two segments, the control group and the test group. You’re comparing a metric (or set of metrics) between the test group and control group to see if a new feature or other change has the desired effect.
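One common way to place users into those two segments is deterministic hash-based bucketing, so the same user always lands in the same group for a given experiment. A minimal sketch (the function name and group split are illustrative):

```python
import hashlib

def assign_group(user_id, experiment_id, test_fraction=0.5):
    """Deterministically bucket a user into 'test' or 'control'
    by hashing the (experiment, user) pair."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "test" if bucket < test_fraction else "control"
```

Because assignment is a pure function of the IDs, no lookup table is needed, and independent experiments get independent (uncorrelated) splits.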
For high-functioning product teams, this becomes challenging to manage due to the scale of testing. Across the team, members are coming up with great features and contributing to the overall evolution of the product. Each feature needs to be tested, to see if it has the desired effect, before being rolled out to the rest of the user base. The chances are high that each test will be run in a fundamentally different way than the others, with a different control group and a different set of variables to measure. The end result is that a product team might be running dozens or even hundreds of experiments simultaneously.
Because of this, it’s not uncommon for some companies to run multiple experiments every week. Meta, for example, is known for its rapid and rigorous product development cycles and runs hundreds of concurrent tests every day on its billions of users. This is no small task: not only do those tests need to be designed and deployed, they need to be tracked, monitored, and carefully executed so as not to interfere with one another.
Some ways to avoid confusion when A/B testing at scale are:
- Clear and well-socialized documentation
- A data product owner or data product manager to oversee experiments
- Repeating tests to verify results
- Phased implementation of changes
Event-based A/B testing in action
When you can configure your metrics after the fact - using your warehouse or lake data - you can run more interesting experiments.
Imagine a product team wants to experiment with real-time product recommendations on product pages. They could measure:
- “click-through rate” on those recommended products
- differences in basket and transaction value for users in the test group
- conversion rates on product pages with other recommended products shown alongside them to measure whether the recommended product makes it less likely that a user will buy the original product they’re looking at
This product team now wants to experiment with a new approach to internal search. They want to know if:
- more people are buying because fewer drop out of the process due to not finding the item they want
- improved search capability would shorten the buying journey because the user finds the item faster
To answer these questions, the product team can compare, between the control and test groups, the percentage of users who search and then go on to buy a product, as well as the time taken to buy after performing a search.
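The first of those metrics can be computed entirely after the fact from raw events. A minimal sketch over hypothetical event records, assuming events arrive in time order:

```python
def search_conversion_rate(events, group):
    """Post-hoc metric from event-level data: the fraction of users
    in `group` who performed a search and later made a purchase."""
    searched, bought = set(), set()
    for e in events:  # events assumed sorted by timestamp
        if e.get("group") != group:
            continue
        if e["event"] == "search":
            searched.add(e["user_id"])
        elif e["event"] == "purchase" and e["user_id"] in searched:
            bought.add(e["user_id"])
    return len(bought) / len(searched) if searched else 0.0

evts = [
    {"user_id": "a", "group": "test", "event": "search"},
    {"user_id": "a", "group": "test", "event": "purchase"},
    {"user_id": "b", "group": "test", "event": "search"},
    {"user_id": "c", "group": "control", "event": "search"},
]
```

Nothing about this metric had to be decided before the experiment ran; it only requires that the search and purchase events were being collected.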
In both of these cases, these metrics might not have been reported prior to the experiment being run. But, with event-level data, all of the relevant information is already present, in the data warehouse or lake, just waiting to be modeled in new and exciting ways. You should be careful in this situation to avoid HARKing (Hypothesizing After the Results are Known), but patterns you notice in how different metrics changed can be turned into new hypotheses and tested for repeatability.
Making advanced A/B testing easier
Because they make recording A/B tests so easy, event analytics systems like Snowplow are very often used alongside a dedicated A/B testing system, which assigns users to test groups and delivers the test. This was super interesting to us: our users are measuring their A/B tests in Snowplow even while running tools like Optimizely or home-built tools.
The combination of being able to easily track ongoing experiments and flexibly compute metrics on the fly means that, as your experiments become more sophisticated along with your product, your analytics system can evolve and scale just as rapidly.