- Published on 20/12/2021
Designing, Running and Analyzing A/B Testing Experiments
Experimentation is a broad topic that could easily fill a 500+ page textbook. I have had multiple opportunities to design, run, and monitor both online and offline experiments for new product features as well as scaling interventions. The goal of this article is not to cover every detail required to design and run a good experiment, but to share lessons from mistakes I made along with theoretical notes on some experimentation topics.
First, in the initial hypothesis-design phase, we have to define the specific goal of the experiment and its anticipated value to the business. This usually begins with an internal history audit, a documentation review, and a look at observational studies to ensure that we are not duplicating past efforts or testing questions that existing data already answers. Clearly articulating the "why, and how it benefits the business" grounds all the subsequent design, execution, and interpretation phases. Avoid the common temptation of running tests simply to "see what happens" with a feature or intervention without a clear end goal and business value: the work done before the first users are assigned determines whether your results will be actionable or noise. A common way to structure the experimentation process for actionable insights involves 5 critical steps:
Define the Goal and the Hypothesis
Define the Metric
User Eligibility, Randomization, and Logging
Duration, Sample Size, and Trade-offs
Analysis and the Ship Decision
The steps above require critical thinking about the users or entities we are experimenting on: we must understand the product flow deeply enough to anticipate how a change will affect user behavior and, most importantly, have the judgment to make sound decisions. Good experimenters don't just analyze; they act as product partners who connect statistical results to business outcomes, and they know when the evidence is enough to ship and when it isn't. That's the difference between reporting what happened and actually driving what happens next. Let's now look at each of these steps in more detail:
Define the Goal and the Hypothesis
In this phase, our goal is to answer what behavior or outcome we are trying to change. The goal needs to be specific and tied to a specific change, it should be measurable or trackable with a metric tied to a user action, and it should predict whether the outcome will increase or decrease.
As an example, let's imagine you built a feature to help lab technicians verify COVID results. You might hypothesize that including a confidence score alongside each result helps technicians make faster and more accurate decisions, because it reduces the cognitive load of interpreting the PCR machine readings without additional context. That is a strong hypothesis because it is grounded in a real user pain point (interpreting large volumes of raw machine readings), it predicts a behavioral shift, and it tells you exactly which direction the metric should move. To build one, start by identifying the user pain, think about what change would shift the experience, and ask yourself why this change would improve the experience and the business value.
Alongside defining the goal and the hypothesis, we also need to specify what our treatment and control groups are in this phase. The treatment is the changed experience we are testing, and the control is the baseline we are comparing it against. Once both are defined, we move to the next phase of metrics.
Define the Metrics
The purpose of metrics is to help you understand how the experiment is performing and how it performed. The primary metric, sometimes called the north star metric, is the success metric that reflects the main goal you are trying to influence. It is good practice to identify 2 to 3 candidate metrics first, then narrow down to one based on the goal. The primary metric must be available for both the control and treatment groups. In our COVID example, the primary metric might be the test verification rate, since the goal is to speed up results verification. Subjects in both the control and treatment groups should, for example, be able to verify a test for the two groups to be comparable.
Guardrail metrics are what we aim to protect at all costs. We are not optimizing for them, but we do not want them to get worse. Examples include application crash rate, page load times, etc. It is good practice to monitor 2 to 5 of these metrics throughout the experiment. They act as a safety net against shipping a "winning" experiment that causes chaos and degrades the user experience.
Alongside guardrail metrics, we also need to define sanity checks and tracking metrics. These help verify that the change is working as expected, for example, that users can actually see or interact with the new feature you are testing, and that the backend is correctly logging movement in the metrics. If your sanity checks are failing, your experiment results cannot be trusted regardless of what the numbers say.
Finally, while defining your metrics, avoid vanity metrics that are not tied to the hypothesis being tested, avoid choosing a primary metric that is too hard to shift within your experiment window, and most importantly, do not rely on a single metric to evaluate the whole experiment.
Randomization and User Selection
Once the hypothesis and metrics are defined, the next step is making sure the right users are included through the right unit of randomization, that variant exposure is fair, that we know when each unit enters the experiment from the user funnel, and that we avoid network effects or any other form of interference. There are three key pieces to get right: eligibility, randomization, and instrumentation.
Eligibility is about defining who qualifies for inclusion. In our COVID lab feature example, you would want to exclude users who would never encounter the treatment or reach that stage: lab technicians who have not yet been onboarded to the platform, or users accessing the system from unsupported devices. These users have zero chance of completing a test verification since they are unauthorized, so including them would dilute the experiment. It often makes sense to include only users who reach the stage of the funnel being tested, so that dilution is kept to a minimum.
Randomization ensures unbiased assignment between groups. The gold standard is usually user-level randomization, meaning, for example, that a given lab technician always sees the same experience throughout the experiment. This is what sticky assignment achieves: once a user is placed in control or treatment, they stay there for the duration. This rule is sometimes relaxed, for instance when running a multi-armed bandit experiment. Session-level randomization is occasionally used, but only when the nature of the task does not carry persistent effects across sessions, which is rarely the case in a lab diagnostic workflow like ours. At this stage we also need proper instrumentation and logging design to capture all the events required to analyze our outcomes later.
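As an illustrative sketch (the function name and salt scheme are hypothetical, not a specific library's API), sticky user-level assignment is often implemented by hashing a user id together with an experiment-specific salt:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to a variant ("sticky" assignment).

    Hashing the user id with an experiment-specific salt gives every user a
    stable bucket in [0, 1]: the same user always lands in the same variant,
    while different experiments get independent splits.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# The assignment is stable across calls: a lab technician sees the same
# experience for the whole experiment, with no assignment table to store.
variant = assign_variant("tech_42", "confidence_score_v1")
```

Because assignment is a pure function of (experiment, user id), no lookup table is needed and the split stays consistent across services.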
Duration and Sample Size Calculation
Now that the initial design choices are in place, how long should we run the experiment, and how many users or observations do we need to detect an effect? Before we can answer these questions, let's first ground ourselves in some foundations (there will be some maths, but you can skip the equations): statistical tests, p-values, confidence intervals, effect size, statistical power, and power analysis. Then we shall be ready to answer those questions.
1. Hypothesis Testing and Errors
Here, it is important to pause and distinguish between stating a hypothesis and testing it. While in Phase I we were stating a hypothesis, hypothesis testing is the process of evaluating that claim against observed evidence. We are now asking how much evidence our data provides to reject it. Simply put, hypothesis testing is a way of evaluating two opposing claims.
The null hypothesis ($H_0$) represents the status quo or the "no effect" default. We assume this statement is true unless the data proves otherwise. In contrast, the alternative hypothesis ($H_1$) is the actual claim we are trying to detect.
Our claim can be one-sided (predicting a specific direction, e.g., Feature A increases conversion rate, i.e., $H_0: \mu_B \le \mu_A$, $H_1: \mu_B > \mu_A$) or two-sided (predicting any difference, e.g., Feature A changes conversion rate in either direction, i.e., $H_0: \mu_B = \mu_A$, $H_1: \mu_B \ne \mu_A$). Notably, a hypothesis test is not designed to prove that $H_1$ is true. Instead, it measures whether the observed data is inconsistent enough with the null hypothesis to justify rejecting it in favor of $H_1$.
Each hypothesis test concludes with a decision of either rejecting or failing to reject $H_0$. Because we rely on sample data rather than the full population we can never observe, this decision can result in four outcomes, of which two are correct and two erroneous, as shown below:

| Decision | $H_0$ is true | $H_0$ is false |
|---|---|---|
| Reject $H_0$ | Type I error (false positive) | Correct (true positive) |
| Fail to reject $H_0$ | Correct (true negative) | Type II error (false negative) |
Our main goal is to control false positives, that is, the probability of a Type I error. The significance level ($\alpha$) represents our tolerance for false positives.
We define $\alpha$ formally as:

$$\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})$$
In most cases, the default is $\alpha = 0.05$, meaning we accept a 5% risk of being wrong when claiming a result is significant. However, for higher-stakes environments like clinical trials, this is often $\alpha = 0.01$.
With the significance level, the hypothesis testing process is summarized as follows:
Stating the Hypotheses: Clearly state the null ($H_0$) and alternative ($H_1$) hypotheses using a specific population parameter. The null must be a testable statement (e.g., $H_0: \mu = \mu_0$), and the alternative shows the direction of the change you aim to detect.
Calculate the test statistic: How to do this depends on your data's distribution and the parameter being tested (in the next subsections we cover z-statistics, t-statistics, and quasi-experimental methods).
Evaluate the evidence via the p-value: We compare the test statistic against the null distribution (the results we would expect if $H_0$ were actually true).
Make a Decision: We compare the p-value to our pre-set significance level ($\alpha$). If $p \le \alpha$, we reject $H_0$. If $p > \alpha$, we fail to reject $H_0$.
It is important to remember that failing to reject $H_0$ is not the same as proving it is true. Not having evidence against the null does not confirm the status quo; it may simply indicate that the test had low statistical power or that the data was insufficient to detect a real effect. Hypothesis testing with p-values alone is not enough.
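To make the four steps above concrete, here is a minimal sketch of a two-sample z-test decision; the function name and numbers are made up for illustration:

```python
import math

def two_sample_z_test(mean_a, mean_b, var_a, var_b, n_a, n_b, alpha=0.05):
    """Two-sided z-test for a difference in means with known variances."""
    se = math.sqrt(var_a / n_a + var_b / n_b)  # standard error of the difference
    z = (mean_b - mean_a) / se                 # test statistic
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided p-value under N(0, 1)
    return z, p, p <= alpha                    # decision: reject H0 or not

# Hypothetical conversion means: 10% in control vs 12% in treatment.
z, p, reject = two_sample_z_test(0.10, 0.12, 0.09, 0.09, 5000, 5000)
```

The final return value mirrors step four exactly: reject $H_0$ when $p \le \alpha$.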
2. Power
The power of a test is the probability of correctly rejecting the null hypothesis ($H_0$) when the alternative hypothesis ($H_1$) is actually true. Unlike the significance level ($\alpha$), the power relates to the true parameter value under $H_1$. Formally, power quantifies the likelihood of your test detecting an effect that truly exists as:

$$\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$$
Here $\beta$ represents the probability of a Type II error (failing to reject a false null). As the probability of missing a real effect decreases, the power of your test increases. In A/B testing, it is common to target a power of 0.8, i.e., 80%.
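A sketch of how power can be computed for a two-sided, two-sample z-test (the function name and input numbers are illustrative assumptions):

```python
from statistics import NormalDist

def power_two_sided(delta, sigma, n, alpha=0.05):
    """Power of a two-sided, two-sample z-test with per-group size n,
    true difference in means delta, and common outcome std dev sigma."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)  # rejection cutoff, 1.96 for alpha=0.05
    se = sigma * (2 / n) ** 0.5           # standard error of the difference
    shift = delta / se                    # standardized true effect
    # P(|Z| > z_crit) when Z ~ N(shift, 1): both tails contribute.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

# Hypothetical inputs: 0.02 true lift, outcome std dev 0.3, 5000 per group.
power = power_two_sided(delta=0.02, sigma=0.3, n=5000)
```

Larger samples shrink the standard error, which raises the standardized effect and hence the power.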
3. Statistical tests
The test statistic is a value calculated from the sample data that helps us decide whether or not to reject the null hypothesis. It condenses the observations into a format that can be compared against a reference distribution (or null distribution). This distribution represents the range of values the test statistic would typically take if the null hypothesis were actually true.
3.1 z-statistic and t-statistic
Given a hypothesized population mean $\mu_0$, we use a test statistic that measures how far the observed sample mean $\bar{X}$ deviates from the null hypothesis value $\mu_0$. When the population variance $\sigma^2$ is known, we standardize this difference by dividing it by the standard error, resulting in the Z-statistic:

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$
Under the null hypothesis $H_0$, this statistic follows a standard normal distribution $N(0, 1)$, which acts as our reference. In other words, the z-test determines whether two population means are different when the variance is known. In practice, suppose we measure an outcome (e.g., conversion rate or revenue per user). For two variants A and B, our population means would be

$$\mu_A = E[Y_A], \qquad \mu_B = E[Y_B]$$
These represent the true expected outcomes for each variant. We would then estimate the metric by computing the sample means $\bar{Y}_A$ and $\bar{Y}_B$ and the difference in means as:

$$\hat{\Delta} = \bar{Y}_B - \bar{Y}_A$$
But under the null hypothesis we assume no treatment effect, $H_0: \mu_A = \mu_B$. We would now calculate our test statistic with the population variances as the Z-statistic:

$$Z = \frac{\bar{Y}_B - \bar{Y}_A}{\sqrt{\sigma_A^2 / n_A + \sigma_B^2 / n_B}}$$
The most important point is that we compare our Z-statistic against the standard normal distribution because that is the distribution we expect it to follow when the null hypothesis is true.
However, when we no longer know the population variance $\sigma^2$, the t-statistic comes into play. We use the sample variance $s^2$, which is itself a random quantity and sometimes falls below the true $\sigma^2$, introducing additional variability into the test statistic. Our null distribution must therefore be more accommodating of extreme values than the standard normal, which the t-distribution provides.
The t-distribution, denoted $t_\nu$, closely resembles the standard normal in shape but is controlled by the degrees of freedom parameter $\nu$, which results in heavier tails and a sharper peak. So, substituting $s$ for $\sigma$ in the z-statistic results in the t-statistic:

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$
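A quick sketch of computing the one-sample t-statistic with the standard library (the sample values are made up):

```python
import math
import statistics

def t_statistic(sample, mu0):
    """One-sample t-statistic: how far the sample mean sits from mu0,
    in units of the estimated standard error (s / sqrt(n))."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return (xbar - mu0) / (s / math.sqrt(n))

# Five hypothetical measurements against a null mean of 30.
t = t_statistic([31, 29, 32, 30, 33], 30)
```

To reach a p-value, this statistic would be compared against $t_{\nu}$ with $\nu = n - 1 = 4$ degrees of freedom rather than $N(0, 1)$.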
Before running an experiment, we have to select the appropriate statistical test for analysing the results. In most A/B testing scenarios, a t-test is enough. However, there are settings where the experimental design is more complex and assignment to treatment cannot be fully randomised, or where basic assumptions of constant variance or independence are violated. In these cases, quasi-experimental methods such as regression discontinuity design, difference-in-differences, and instrumental variables become essential.
Furthermore, even within the t-test process, we need to know how to estimate the variability of the test statistic. The standard approach above relies on the standard error of the difference in means, defined as:

$$SE = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}$$
This assumes that the variance of the outcome is stable across observations. When that assumption holds, the standard error is enough. When it does not, this test statistic will be either too liberal or too conservative. Some alternatives to the standard error include:
- Heteroskedasticity-Robust Standard Errors
Classical standard errors assume homoskedasticity, meaning that the variance of the outcome is constant across all groups. In practice, a new feature might cause some users in the treatment group to act with much higher variability than the control group. A potential solution is heteroskedasticity-robust standard errors, defined (in regression form) as:

$$\widehat{\text{Var}}(\hat{\beta}) = (X^\top X)^{-1} \left( \sum_{i} \hat{u}_i^2 x_i x_i^\top \right) (X^\top X)^{-1}$$
This is useful when we suspect outliers or specific segments of users driving more variance than others. So, instead of assuming a shared average variance, we estimate the spread using each observation's own squared residual ($\hat{u}_i^2$). In some industries, it is standard practice to use robust standard errors by default.
- Clustered Standard Errors
Observations are sometimes not independent, i.e., they are clustered: for example, you randomize at the level of a city, device type, or time period rather than the individual user. All users in a "cluster" likely behave similarly for reasons unrelated to your treatment, which can artificially inflate your effective sample size. We use clustered standard errors whenever the unit of randomization (e.g., a school) is at a higher level than the unit of measurement (e.g., a student):

$$\widehat{\text{Var}}_{cl}(\hat{\beta}) = (X^\top X)^{-1} \left( \sum_{g=1}^{G} X_g^\top \hat{u}_g \hat{u}_g^\top X_g \right) (X^\top X)^{-1}$$
Other options that are less used include Bootstrap Standard Errors, Newey-West Standard Errors, among others.
3.2 Quasi-experiments
Difference-in-Differences (DiD)
DiD applies when randomization is impossible but we observe a treatment and a control group across two periods, pre-intervention and post-intervention. Instead of simply comparing the groups after the treatment, DiD isolates the treatment effect by subtracting the baseline difference that already existed between them. That is, we measure the change in the treatment group and subtract the change in the control group as:

$$\hat{\delta}_{DiD} = (\bar{Y}_{T,\,post} - \bar{Y}_{T,\,pre}) - (\bar{Y}_{C,\,post} - \bar{Y}_{C,\,pre})$$
However, real-world problems are rarely this simple. Metrics such as lab test orders (during COVID), revenue, or conversion rates are subject to seasonality and other macroeconomic fluctuations that a simple two-period model cannot account for. Commonly, a two-way fixed effects extension is used, where unit fixed effects and time fixed effects replace the group and period indicators as:

$$Y_{it} = \alpha_i + \lambda_t + \delta D_{it} + X_{it}^\top \beta + \varepsilon_{it}$$
Here $X_{it}$ is a vector of additional covariates that control for external factors like trends and user characteristics. The unit fixed effects $\alpha_i$ capture time-invariant heterogeneity across units, while the time fixed effects $\lambda_t$ account for shocks common to all units in a given period.
The validity of DiD rests on the parallel trends assumption, which states that, absent treatment, the treatment and control groups would have followed the same trajectory over time:

$$E[Y_{it}(0) - Y_{i,t-1}(0) \mid D_i = 1] = E[Y_{it}(0) - Y_{i,t-1}(0) \mid D_i = 0]$$
Since this assumption is untestable post-treatment, it is usually assessed empirically with a test of equivalence before finalising treatment and control assignments which is essentially an A/A test.
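The two-by-two DiD estimate can be sketched in a few lines (the group means below are hypothetical):

```python
def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Two-by-two difference-in-differences: the change in the treated
    group minus the change in the control group."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical mean daily lab-test orders, before and after a rollout.
# Both groups grew, but the treated group grew by 20 more.
effect = did_estimate(treat_pre=100, treat_post=130, ctrl_pre=100, ctrl_post=110)
```

A naive post-period comparison would have attributed the full 130 vs 110 gap to the treatment; DiD nets out the shared trend first.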
Other common quasi-experimental methods for A/B testing include Regression Discontinuity Design (RDD) and Instrumental Variables (IV).
4. P-values
A p-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one observed in the data. Formally, for a two-sided test:

$$p = P(|T| \ge |t_{obs}| \mid H_0)$$
It's worth pausing here, since p-values are frequently misunderstood. The p-value is not the probability that the null hypothesis is true. These two statements are not the same thing, and mixing them up leads to overconfident, misleading conclusions. For example, let's assume we are testing $H_0: \mu = 30$ at significance level $\alpha = 0.05$ and obtain a p-value of $0.05$. The correct interpretation is: if the true mean really is 30, there is a 5% chance we would see a result this far from it just by luck. It is not a statement that there is a 5% chance the mean is truly 30; a p-value says nothing about that probability directly. Summarized:

$$p = P(\text{data at least this extreme} \mid H_0) \;\neq\; P(H_0 \mid \text{data})$$
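One way to internalize this interpretation is to simulate data where the null is true and watch how often we get "significant" results anyway; a sketch with made-up parameters:

```python
import math
import random
import statistics

random.seed(7)
alpha, mu0, sims, n = 0.05, 30.0, 2000, 50
rejections = 0
for _ in range(sims):
    # Draw a sample where H0 is TRUE: the mean really is 30.
    sample = [random.gauss(mu0, 5.0) for _ in range(n)]
    t = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    if abs(t) > 1.96:  # approximate two-sided cutoff at alpha = 0.05
        rejections += 1

false_positive_rate = rejections / sims
# Close to alpha: "significant" results occur ~5% of the time under a true null.
```

That near-5% rate is exactly what the p-value controls; it says nothing about the probability that $H_0$ itself is true.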
5. Confidence intervals (CI)
CIs provide a range of plausible values for an unknown population parameter based on observed sample data. Instead of reporting only a single point estimate (such as a sample mean), a confidence interval quantifies the uncertainty around that estimate. For example, a 95% confidence interval represents a procedure that produces intervals containing the true parameter in 95% of repeated samples.
Given a known population variance, we use the Z-statistic with a rejection region for a two-sided hypothesis test as:

$$\text{reject } H_0 \text{ if } |Z| > z_{\alpha/2}$$
Since $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$, the rejection condition can be rearranged to find all values of $\mu_0$ that would not be rejected as:

$$|\bar{X} - \mu_0| \le z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
Solving for $\mu_0$ finally results in the $(1 - \alpha)$ confidence interval:

$$\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
However, if the population variance is unknown (which is the common case), we replace $\sigma$ with the sample standard deviation $s$ and use a t-distribution with $n - 1$ degrees of freedom. Our CI becomes:

$$\bar{X} \pm t_{\alpha/2,\, n-1} \frac{s}{\sqrt{n}}$$
Confidence intervals are also widely misunderstood. The most common misinterpretation is that "there is a 95% probability that the true mean lies inside the interval". This is a misinterpretation because, in frequentist thinking, the population parameter is fixed and does not vary randomly.
The correct interpretation is: if we repeated the same sampling process many times and constructed a confidence interval each time, 95% of those intervals would contain the true population mean. The focus is on the performance of the interval-generating procedure, not the probability that the parameter lies inside a particular interval.
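A minimal sketch of the known-variance interval above (the inputs are made up):

```python
import math

def z_confidence_interval(xbar, sigma, n, z=1.96):
    """95% CI for the mean with known sigma: xbar +/- z * sigma / sqrt(n)."""
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical sample mean 31.0, known sigma 5.0, n = 100 observations.
lo, hi = z_confidence_interval(xbar=31.0, sigma=5.0, n=100)
# half-width = 1.96 * 5 / 10 = 0.98
```

Quadrupling the sample size halves the interval width, since the standard error shrinks with $\sqrt{n}$.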
6. Effect size
As we have seen with p-values and CIs, statistical significance alone does not show whether an observed difference is meaningful. With large enough sample sizes, we can generally produce arbitrarily small p-values, so to understand the magnitude of an effect, we use the effect size. Effect size quantifies the difference between two groups relative to the variability in the data. Cohen's d is a commonly used standardized effect size for two means:

$$d = \frac{\bar{X}_T - \bar{X}_C}{s_p}$$
Here $\bar{X}_T$ represents the treatment mean, $\bar{X}_C$ the control mean, and $s_p$ the pooled standard deviation across groups. The common effect size classification is as shown below:

| Cohen's d | Effect size |
|---|---|
| 0.2 | Small |
| 0.5 | Medium |
| 0.8 | Large |
In addition to standardized measures, we can report the raw difference between group averages, which gives the actual magnitude of the difference in the metric's own units:

$$\Delta = \bar{X}_T - \bar{X}_C$$
Both standardized effect sizes and raw differences are typically reported together in modern experimentation analyses.
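Cohen's d can be computed directly from two samples; a sketch with made-up data:

```python
import math
import statistics

def cohens_d(treatment, control):
    """Standardized effect size: difference in means over the pooled SD."""
    n_t, n_c = len(treatment), len(control)
    var_t = statistics.variance(treatment)   # sample variance (n - 1 divisor)
    var_c = statistics.variance(control)
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

# Hypothetical metric values for each group.
d = cohens_d([12, 14, 13, 15, 16], [10, 11, 12, 13, 9])
raw_diff = statistics.mean([12, 14, 13, 15, 16]) - statistics.mean([10, 11, 12, 13, 9])
```

Reporting `d` alongside `raw_diff` gives both the standardized magnitude and the difference in the metric's own units.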
Power Analysis
Now that we have good theoretical context, let's revisit our two questions: how many observations are required, and how long should the experiment run? This is what power analysis helps us answer, with either a fixed-sample design or a sequential design.
- Fixed-Sample Design
In a classical A/B test, the sample size is calculated before the experiment begins, and we run the test until that sample size is reached. We start by setting the significance level $\alpha$, the target power $1 - \beta$, the variance $\sigma^2$, and the Minimum Detectable Effect (MDE) $\delta$, which is the smallest change we want our experiment to detect.
For a two-sample comparison of means with equal group sizes and known variance, the required sample size per group is computed as:

$$n = \frac{2\sigma^2 (z_{1 - \alpha/2} + z_{1 - \beta})^2}{\delta^2}$$
For example, using the default values $\alpha = 0.05$ and power $1 - \beta = 0.8$, so that $z_{0.975} \approx 1.96$ and $z_{0.8} \approx 0.84$, the required sample size per group becomes:

$$n \approx \frac{2\sigma^2 (1.96 + 0.84)^2}{\delta^2} \approx \frac{15.7\,\sigma^2}{\delta^2}$$
Given the required sample size per group $n$, a 50/50 split, and $N$ eligible users entering the experiment per day, each group accrues $N/2$ users per day, so the required duration in days would be:

$$\text{days} = \frac{2n}{N}$$
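A sketch of the fixed-sample calculation end to end (the MDE, variance, and traffic numbers are hypothetical):

```python
import math
from statistics import NormalDist

def required_n_per_group(mde, sigma, alpha=0.05, power=0.8):
    """Fixed-sample size per group for a two-sided test of a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.8
    return math.ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / mde ** 2)

def required_days(n_per_group, eligible_users_per_day, treatment_share=0.5):
    """Days needed to fill the slowest-filling group given daily eligible traffic."""
    slowest_rate = eligible_users_per_day * min(treatment_share, 1 - treatment_share)
    return math.ceil(n_per_group / slowest_rate)

# Hypothetical: detect a 0.02 lift with outcome std dev 0.3,
# given 1000 eligible users per day split 50/50.
n = required_n_per_group(mde=0.02, sigma=0.3)
days = required_days(n, eligible_users_per_day=1000)
```

The duration falls out directly: each group fills at 500 users/day, so roughly a week of traffic covers both groups here.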
- Sequential Experiment Design
The fixed-sample method is simple and effective; however, in practice we often want to monitor an experiment continuously and stop early if we have strong evidence either in favor of the alternative or against it. Naively peeking at the experiment and stopping as soon as $p \le \alpha$ inflates the Type I error rate well above $\alpha$, because each look gives an additional opportunity to incorrectly reject $H_0$. Sequential designs solve this by defining decision boundaries that account for continuous monitoring.
At time $t$, after observing $n_t$ samples per group, the test statistic is:

$$Z_t = \frac{\bar{Y}_{B,t} - \bar{Y}_{A,t}}{\sqrt{\sigma_A^2 / n_t + \sigma_B^2 / n_t}}$$
With Wald's Sequential Probability Ratio Test (SPRT), we monitor the likelihood ratio of the alternative to the null at each step:

$$\Lambda_n = \prod_{i=1}^{n} \frac{f_1(x_i)}{f_0(x_i)}$$
We reject $H_0$ if $\Lambda_n \ge B$, accept $H_0$ if $\Lambda_n \le A$, and continue sampling otherwise, using the two decision thresholds defined in terms of the error rates:

$$A = \frac{\beta}{1 - \alpha}, \qquad B = \frac{1 - \beta}{\alpha}$$
Therefore, the expected required sample size under $H_1$ is approximately:

$$E[N \mid H_1] \approx \frac{(1 - \beta)\ln B + \beta \ln A}{E[\ln(f_1(x)/f_0(x)) \mid H_1]}$$
This approach is highly efficient for large treatment effects, and often saves time compared to fixed-sample tests.
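Wald's thresholds and the resulting decision rule can be sketched as follows (the accumulated log-likelihood ratio is assumed to be computed elsewhere):

```python
import math

def sprt_boundaries(alpha=0.05, beta=0.2):
    """Wald's approximate SPRT thresholds on the likelihood ratio."""
    A = beta / (1 - alpha)   # accept H0 when the ratio falls to A or below
    B = (1 - beta) / alpha   # reject H0 when the ratio reaches B or above
    return A, B

def sprt_decision(log_likelihood_ratio, alpha=0.05, beta=0.2):
    """Map the running log-likelihood ratio to a sequential decision."""
    A, B = sprt_boundaries(alpha, beta)
    if log_likelihood_ratio >= math.log(B):
        return "reject H0"
    if log_likelihood_ratio <= math.log(A):
        return "accept H0"
    return "continue sampling"
```

With the defaults, $B = 16$ and $A \approx 0.21$, so the test keeps sampling until the evidence ratio moves by roughly a factor of 16 in either direction.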
Once we have a clear power analysis method to determine the required sample size and experiment duration, one of the final considerations in the experiment design phase is whether we are testing multiple metrics simultaneously. This matters because every hypothesis test carries an $\alpha$ probability of a false positive. When we test $m$ hypotheses independently at significance level $\alpha$, the probability of making at least one Type I error, the familywise error rate (FWER), inflates drastically as:

$$\text{FWER} = 1 - (1 - \alpha)^m$$
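For intuition, the inflation is easy to compute:

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

# With 10 independent tests at alpha = 0.05, the FWER is already ~40%.
fwer = familywise_error_rate(0.05, 10)
```

Even a modest dashboard of metrics, segments, and variants multiplies quickly into dozens of implicit tests.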
Some of the potential solutions are:
- Bonferroni Correction
This is the most conservative but straightforward approach; it controls the FWER by dividing the significance threshold equally across all tests:

$$\alpha_{adj} = \frac{\alpha}{m}$$
A test is declared significant only if its p-value satisfies $p_i \le \alpha / m$. The FWER then follows from the union bound:

$$\text{FWER} \le \sum_{i=1}^{m} \frac{\alpha}{m} = \alpha$$
- False Discovery Rate
Instead of controlling the probability of any false positive, this controls the expected proportion of rejected hypotheses that are false positives. That is, if $V$ is the number of false rejections and $R$ is the total number of rejections:

$$\text{FDR} = E\left[\frac{V}{\max(R, 1)}\right]$$
The Benjamini-Hochberg procedure is a common method to control the FDR. It sorts the p-values in ascending order, $p_{(1)} \le \dots \le p_{(m)}$, and finds the largest $k$ such that:

$$p_{(k)} \le \frac{k}{m}\, q$$
All hypotheses corresponding to $p_{(1)}, \dots, p_{(k)}$ are rejected. This method is less conservative than Bonferroni and more powerful when many tests are being conducted, but it allows a controlled proportion of false positives.
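A sketch of the Benjamini-Hochberg procedure (the p-values below are made up):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank  # remember the largest rank that passes the threshold
    return sorted(order[:k])  # reject everything up to that rank

# Six hypothetical tests: only the two smallest p-values survive at q = 0.05.
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74])
```

Note that 0.039 is rejected by a naive per-test 0.05 cutoff but not by BH here, since it fails its rank-adjusted threshold of $3/6 \times 0.05 = 0.025$.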
Finally, once the experiment concludes and the data is analyzed, the final decision depends on both statistical evidence and practical considerations from product partners (sometimes it is not worth making a change even with a detected effect). The treatment effect, confidence interval, and metric trade-offs are reviewed to determine whether the feature should be launched, modified, or discarded. It is good practice for the experiment process and results to be documented and become part of an organizational knowledge base that informs future product development and improvements.
For comments, please send me an email.