Most product teams run A/B tests incorrectly. They peek at results early, stop tests when they see significance, and ignore multiple comparisons. The result is a false positive rate far higher than they think. Here is the statistics behind doing it right.
The most common A/B testing mistake is checking results before the planned sample size is reached and stopping early as soon as p < 0.05. This inflates the false positive rate dramatically: if you check results 5 times during a test and stop at the first significant result, your actual false positive rate is not 5% but roughly 14%, nearly triple the nominal rate.
This is called the "optional stopping problem" and it is well-documented in the statistics literature. The p-value is only valid at the pre-specified sample size. Every additional peek is an additional test, and multiple testing inflates error rates.
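The inflation is easy to demonstrate with a short simulation. The sketch below (stdlib only; function names are illustrative, not part of any API) runs A/A tests where both variants share the same true conversion rate, peeks five times, and stops at the first p < 0.05. Every "win" is by construction a false positive.

```python
import math
import random

def two_proportion_p(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test p-value for a difference in proportions."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeking_trial(rng, p_true=0.05, n_per_peek=400, n_peeks=5):
    """One A/A test: peek after every batch, stop at the first p < 0.05."""
    a = b = 0
    for peek in range(1, n_peeks + 1):
        for _ in range(n_per_peek):
            a += rng.random() < p_true
            b += rng.random() < p_true
        n = peek * n_per_peek
        if two_proportion_p(a, n, b, n) < 0.05:
            return True  # false positive: the variants are identical
    return False

rng = random.Random(42)
trials = 1000
fpr = sum(peeking_trial(rng) for _ in range(trials)) / trials
print(f"False positive rate with 5 peeks: {fpr:.1%}")
```

Across repeated runs the rate lands well above the nominal 5%, in the low-to-mid teens, matching the classical sequential-looks results.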
The fix
Use sequential testing methods such as the Sequential Probability Ratio Test (SPRT) if you need to monitor results continuously. They are designed for repeated looks at accumulating data and maintain the stated error rate no matter when you stop. The MainState Labs Statistics API includes SPRT alongside classical hypothesis tests.
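To make the idea concrete, here is a minimal Wald SPRT for Bernoulli conversions. This is a textbook sketch, not the MainState Labs implementation; the function names and the example stream are made up. You accumulate a log-likelihood ratio per observation and stop the moment it crosses either boundary.

```python
import math

def sprt_bernoulli(p0, p1, alpha=0.05, beta=0.2):
    """Wald SPRT for H0: p = p0 vs H1: p = p1; returns an observe() closure."""
    upper = math.log((1 - beta) / alpha)   # cross this -> accept H1
    lower = math.log(beta / (1 - alpha))   # cross this -> accept H0
    llr = 0.0

    def observe(x):
        nonlocal llr
        # log-likelihood ratio contribution of one Bernoulli observation
        llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
        if llr >= upper:
            return "accept H1"
        if llr <= lower:
            return "accept H0"
        return "continue"

    return observe

observe = sprt_bernoulli(p0=0.05, p1=0.10)
# Feed a stream of conversions; stop the moment a boundary is crossed.
stream = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # unusually high conversion
decision = "continue"
for x in stream:
    decision = observe(x)
    if decision != "continue":
        break
print(decision)
```

Because the boundaries are derived from alpha and beta up front, stopping at the first crossing does not inflate the error rate the way repeated p-value checks do.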
The choice of statistical test depends on the data type, the number of groups, and the distributional assumptions. Using the wrong test does not always give wrong answers, but it often gives less powerful answers — meaning you need more data to detect the same effect.
| Scenario | Correct Test | Common Mistake |
|---|---|---|
| Conversion rate (binary) | Z-test for proportions | t-test on 0/1 values |
| Revenue per user (skewed) | Mann-Whitney U or log-transform + t-test | t-test on raw values |
| 3+ variants | ANOVA + post-hoc correction | Multiple pairwise t-tests |
| Continuous monitoring | SPRT or mSPRT | Standard t-test with peeking |
| Small samples (<30) | Welch's t-test or permutation test | Assuming normality |
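The "3+ variants" row deserves emphasis, since running every pairwise t-test at alpha = 0.05 is exactly the multiple-comparisons trap. A minimal Holm-Bonferroni step-down correction (a stdlib sketch; the function name is illustrative) looks like this:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: returns which hypotheses are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        # compare the k-th smallest p-value against alpha / (m - k)
        if p_values[i] < alpha / (m - rank):
            rejected[i] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return rejected

# three pairwise comparisons among variants A, B, C
print(holm_bonferroni([0.012, 0.030, 0.210]))
```

Note that 0.030 would look "significant" on its own but does not survive the correction; that is the point.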
Sample size calculation is the most skipped step in A/B testing. Most teams run tests until they "feel" like they have enough data, or until the deadline forces a decision. This is backwards.
A proper power analysis requires four inputs: the baseline conversion rate, the minimum detectable effect (MDE) you care about, the desired statistical power (typically 80%), and the significance level (typically 5%). From these, you can calculate exactly how many users you need per variant before starting the test.
The most common error in MDE specification is setting it too small. If your baseline conversion is 3% and you want to detect a 0.1% absolute improvement, you need roughly 460,000 users per variant at 80% power. Most products do not have that traffic. Setting a realistic MDE, say 0.5% absolute, cuts the required sample to around 20,000 per variant.
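A back-of-the-envelope check, using the standard normal-approximation formula for a two-sided two-proportion test (a sketch; the function name is mine, not the API's):

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p_base, mde_abs, alpha=0.05, power=0.80):
    """Approximate per-variant n for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 5%
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p1, p2 = p_base, p_base + mde_abs
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde_abs ** 2
    return math.ceil(n)

print(sample_size_two_proportions(0.03, 0.005))  # ~20,000 per variant
print(sample_size_two_proportions(0.03, 0.001))  # ~460,000 per variant
```

Shrinking the MDE by 5x inflates the sample size by roughly 25x, since required n scales with the inverse square of the effect size.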
The sample size calculator in the MainState Labs Statistics API handles all standard test types and returns the required sample size, expected test duration based on your daily traffic, and the minimum detectable effect for a given sample size.
Frequentist hypothesis testing answers the question: "If the null hypothesis were true, how likely is data at least this extreme?" This is not the question product managers actually want answered. They want: "Given this data, what is the probability that variant B is better than variant A?"
Bayesian A/B testing answers that question directly. Instead of a p-value, you get a probability: "There is a 94% chance that variant B has a higher conversion rate than variant A." This is more intuitive, handles early stopping naturally, and allows you to incorporate prior knowledge about your conversion rates.
The Bayesian A/B endpoint in the statistics suite uses a Beta-Binomial model for conversion rates and returns the posterior probability of each variant winning, the expected lift, and the credible interval for the true conversion rate of each variant.
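The Beta-Binomial model is simple enough to sketch with the standard library. Under uniform Beta(1, 1) priors, each variant's posterior is Beta(successes + 1, failures + 1), and P(B > A) can be estimated by Monte Carlo (a sketch with made-up counts, not the endpoint's internals):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1,1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # posterior for each variant: Beta(successes + 1, failures + 1)
        theta_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        theta_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += theta_b > theta_a
    return wins / draws

p_win = prob_b_beats_a(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"P(B > A) = {p_win:.3f}")
```

The output reads exactly the way stakeholders want: a direct probability that B beats A, rather than a statement about hypothetical repeated experiments.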
Run statistically rigorous A/B tests via API.
Try the Statistics API →