
A/B Testing Is Broken at Most Companies. Here Is How to Fix It.

April 2025 · 13 min read · MainState Labs

Most product teams run A/B tests incorrectly. They peek at results early, stop tests as soon as they see significance, and ignore multiple comparisons. The result is a false positive rate far higher than they think. Here is the statistical reasoning behind doing it right.

The Peeking Problem

The most common A/B testing mistake is checking results before the planned sample size is reached and stopping early when p < 0.05. This inflates the false positive rate dramatically. If you check results 5 times during a test and stop at the first significant result, your actual false positive rate is not 5%: for five equally spaced looks it is roughly 14%, nearly triple the nominal rate.

This is called the "optional stopping problem" and it is well-documented in the statistics literature. The p-value is only valid at the pre-specified sample size. Every additional peek is an additional test, and multiple testing inflates error rates.
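The inflation is easy to demonstrate with a short simulation (a pure-Python sketch, not the MainState Labs API; all function names here are illustrative): run many A/A tests where there is no real difference, peek several times, and count how often any peek crosses p < 0.05.

```python
import math
import random

def ztest_p_value(succ_a, n_a, succ_b, n_b):
    """Two-sided pooled z-test p-value for a difference in proportions."""
    p_pool = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (succ_a / n_a - succ_b / n_b) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeking_false_positive_rate(trials=1000, looks=5, per_look=400,
                                true_rate=0.05, seed=0):
    """Simulate A/A tests (no real difference) with early stopping at peeks."""
    rng = random.Random(seed)
    early_stops = 0
    for _ in range(trials):
        succ_a = succ_b = n = 0
        for _ in range(looks):
            succ_a += sum(rng.random() < true_rate for _ in range(per_look))
            succ_b += sum(rng.random() < true_rate for _ in range(per_look))
            n += per_look
            if ztest_p_value(succ_a, n, succ_b, n) < 0.05:
                early_stops += 1  # a "significant" result that is pure noise
                break
    return early_stops / trials
```

With these settings the simulated false positive rate lands well above the nominal 5%, in the neighborhood of the classic ~14% figure for five equally spaced looks.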

The fix

Use sequential testing methods such as the Sequential Probability Ratio Test (SPRT) or its mixture variant (mSPRT) if you need to monitor results continuously. These methods are designed for repeated looks and maintain the stated error rate regardless of when you stop. The MainState Labs statistics API includes SPRT alongside classical hypothesis tests.
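To make the idea concrete, here is a minimal one-sample SPRT for conversion data (an illustrative sketch of Wald's classic test, not the MainState Labs implementation; a real A/B comparison would use a two-sample variant such as mSPRT). It accumulates a log-likelihood ratio and stops as soon as either threshold is crossed.

```python
import math

def sprt_decision(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for Bernoulli data: H0: rate = p0 vs H1: rate = p1.

    Processes the stream of 0/1 observations in order and returns
    'accept_h1', 'accept_h0', or 'continue' (need more data).
    """
    upper = math.log((1 - beta) / alpha)   # crossing up -> accept H1
    lower = math.log(beta / (1 - alpha))   # crossing down -> accept H0
    llr = 0.0
    for x in observations:
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1"
        if llr <= lower:
            return "accept_h0"
    return "continue"
```

Because the stopping thresholds are built from alpha and beta, stopping the moment a boundary is crossed is valid by design, unlike peeking at a fixed-sample p-value.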

Choosing the Right Test

The choice of statistical test depends on the data type, the number of groups, and the distributional assumptions. Using the wrong test does not always give wrong answers, but it often gives less powerful answers — meaning you need more data to detect the same effect.

| Scenario | Correct Test | Common Mistake |
| --- | --- | --- |
| Conversion rate (binary) | Z-test for proportions | t-test on 0/1 values |
| Revenue per user (skewed) | Mann-Whitney U or log-transform + t-test | t-test on raw values |
| 3+ variants | ANOVA + post-hoc correction | Multiple pairwise t-tests |
| Continuous monitoring | SPRT or mSPRT | Standard t-test with peeking |
| Small samples (<30) | Welch's t-test or permutation test | Assuming normality |
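For the small-sample row, a permutation test is easy to implement from scratch (a pure-Python sketch; the function name is illustrative). It makes no normality assumption: shuffle the pooled data many times and ask how often a random split produces a mean difference at least as large as the observed one.

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # count random splits at least as extreme as the observed split
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            extreme += 1
    return extreme / n_perm
```

The returned p-value is exact up to Monte Carlo error, which is why permutation tests are a safe default when n is too small to trust normal approximations.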

Sample Size Calculation: Do It Before You Start

Sample size calculation is the most skipped step in A/B testing. Most teams run tests until they "feel" like they have enough data, or until the deadline forces a decision. This is backwards.

A proper power analysis requires four inputs: the baseline conversion rate, the minimum detectable effect (MDE) you care about, the desired statistical power (typically 80%), and the significance level (typically 5%). From these, you can calculate exactly how many users you need per variant before starting the test.

The most common error in MDE specification: teams set the MDE too small. If your baseline conversion is 3% and you want to detect a 0.1% absolute improvement at 80% power, you need roughly 460,000 users per variant. Most products do not have that traffic. Setting a realistic MDE of, say, 0.5% absolute reduces the required sample to around 20,000 per variant.
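The standard closed-form calculation for a two-proportion test fits in a few lines (pure Python using the textbook formula; this is a sketch, not the MainState Labs API):

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Required users per variant for a two-sided two-proportion z-test."""
    p1 = p_baseline
    p2 = p_baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

At a 3% baseline with 80% power and a 5% significance level, a 0.5% absolute MDE returns roughly 20,000 users per variant, while a 0.1% MDE pushes the requirement to roughly 460,000.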

The sample size calculator in the MainState Labs Statistics API handles all standard test types and returns the required sample size, expected test duration based on your daily traffic, and the minimum detectable effect for a given sample size.

Bayesian A/B Testing: A Better Mental Model

Frequentist hypothesis testing answers the question: "If the null hypothesis were true, how likely is this data?" This is not the question product managers actually want answered. They want: "Given this data, what is the probability that variant B is better than variant A?"

Bayesian A/B testing answers that question directly. Instead of a p-value, you get a probability: "There is a 94% chance that variant B has a higher conversion rate than variant A." This is more intuitive, handles early stopping naturally, and allows you to incorporate prior knowledge about your conversion rates.

The Bayesian A/B endpoint in the statistics suite uses a Beta-Binomial model for conversion rates and returns the posterior probability of each variant winning, the expected lift, and the credible interval for the true conversion rate of each variant.
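The posterior comparison itself can be sketched with Monte Carlo draws from the two Beta posteriors (pure Python with a uniform Beta(1, 1) prior; illustrative only, not the endpoint's actual implementation):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1, prior_beta=1, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under a Beta-Binomial model.

    Each variant's posterior is Beta(prior_alpha + conversions,
    prior_beta + non-conversions); we sample both and count wins for B.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        theta_a = rng.betavariate(prior_alpha + conv_a,
                                  prior_beta + n_a - conv_a)
        theta_b = rng.betavariate(prior_alpha + conv_b,
                                  prior_beta + n_b - conv_b)
        if theta_b > theta_a:
            wins += 1
    return wins / draws
```

The returned number reads directly as "the probability that B beats A", which is the statement stakeholders actually want from a test.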

Run statistically rigorous A/B tests via API.

Try the Statistics API →