A p-value in A/B testing is the probability of observing a result at least as extreme as the one measured, assuming there is no real difference between the control and variant (the null hypothesis). A p-value below 0.05 means that, if the null hypothesis were true, a difference this large or larger would appear in fewer than 5% of tests. It is not the probability that the observed difference is due to chance. The 0.05 cutoff is the conventional threshold for calling a result statistically significant.
In an A/B testing workflow, the p-value is part of the statistical layer that helps determine whether an observed conversion rate difference is a real signal or noise. It is calculated from the observed effect size, the sample size, and the variance of the metric being measured: for the same observed difference, a larger sample or a less noisy metric yields a lower p-value. A lower p-value means the data is less consistent with the null hypothesis, the assumption that control and variant perform identically. The p-value is most reliable when the primary metric and significance threshold are locked in before the test starts, so the decision rule is not distorted by peeking at early results.
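To make the calculation concrete, here is a minimal sketch of one common way to get a p-value for a conversion-rate test: a two-sided, pooled two-proportion z-test. The function name and the visitor counts are hypothetical, and your testing tool may use a different method (for example a Bayesian or sequential approach), so treat this as an illustration rather than the definitive formula.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test.

    conv_a / conv_b: conversions in control / variant
    n_a / n_b: visitors in control / variant
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both arms share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference: this is where variance and
    # sample size enter the calculation.
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Probability of a difference at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 500 vs 560 conversions out of 10,000 visitors each.
p = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"p-value: {p:.4f}")
```

The effect size appears as the difference in rates, while sample size and variance appear in the standard error, which is why the same observed gap can produce very different p-values.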
P-value matters because it gives teams a principled way to separate a real improvement from random fluctuation. Without it, a team might see a variant converting at 4.8% against a control at 4.2% and conclude the variant wins — even if that gap could easily arise by chance with a small sample. It should be interpreted alongside effect size, confidence interval, sample size, and test duration to get the full picture of whether the result is trustworthy.
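The 4.2% versus 4.8% scenario can be sketched in code to show how much the sample size matters. The visitor counts below are assumptions chosen for illustration (1,000 and 10,000 per arm), the `summarize` helper is hypothetical, and the confidence interval uses a simple normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def summarize(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return (difference, two-sided p-value, confidence interval)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a

    # p-value: pooled standard error, because the null assumes equal rates.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))

    # Confidence interval: unpooled standard error around the observed gap.
    se_unpooled = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return diff, p_value, ci


# Same 4.2% vs 4.8% rates, two assumed sample sizes per arm.
small = summarize(42, 1_000, 48, 1_000)
large = summarize(420, 10_000, 480, 10_000)

for label, (diff, p_value, ci) in (("1,000 per arm", small), ("10,000 per arm", large)):
    print(f"{label}: diff={diff:.3%}, p={p_value:.3f}, "
          f"95% CI=({ci[0]:.3%}, {ci[1]:.3%})")
```

With 1,000 visitors per arm the p-value is far above 0.05 and the interval spans zero. With 10,000 per arm the same gap falls below 0.05 and the interval excludes zero, though only narrowly.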
For example, a team runs an A/B test on a pricing-page headline for two weeks. The control converts at 3.2% and the variant at 3.8%, a 0.6 percentage point lift. The calculated p-value is 0.03. Because 0.03 falls below the 0.05 threshold, the team rejects the null hypothesis and concludes the lift is statistically significant. In other words, if the new headline had no real effect, a difference at least this large would show up in only about 3% of such tests.
Set your significance threshold — typically p < 0.05 — before the test starts, not after seeing the data. Calculate the required sample size upfront using a sample size calculator, then wait for the full planned duration before evaluating results. When reading the final p-value, pair it with effect size and practical business impact to decide whether the lift is worth shipping, not just statistically notable.
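As a rough sketch of what a sample size calculator does under the hood, the snippet below uses the standard normal-approximation formula for comparing two proportions. The 80% power default, the function name, and the example baseline and lift are assumptions for illustration. Real calculators may use slightly different formulas and return somewhat different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_sample_size(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided test.

    baseline: control conversion rate, e.g. 0.032 for 3.2%
    mde_abs:  smallest absolute lift worth detecting, e.g. 0.006
    power:    chance of detecting that lift if it is real (assumed 80%)
    """
    p1 = baseline
    p2 = baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / mde_abs ** 2)


# Illustrative inputs: 3.2% baseline, 0.6 percentage point minimum lift.
n_per_arm = required_sample_size(0.032, 0.006)
print(f"Visitors needed per arm: {n_per_arm:,}")
```

Under these assumptions the answer comes out at roughly 14,700 visitors per arm. Halving the minimum detectable lift roughly quadruples that requirement, which is why the threshold and the target effect are worth fixing before launch.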
The two most common mistakes are peeking and conflating statistical with practical significance. Peeking — stopping a test the moment p-value drops below 0.05 — inflates the false positive rate and can lead to shipping a change that has no real effect. The second mistake is treating a very low p-value as a reason to ship regardless of effect size: a p-value of 0.001 on a 0.1% lift may be statistically significant but not meaningful for the business. Always evaluate both.
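The effect of peeking can be illustrated with a small simulation of A/A tests, where both arms share the same true conversion rate and any "significant" result is a false positive. This is a sketch under assumed parameters (500 simulated tests, 10 interim looks, a 10% conversion rate), and all names are hypothetical. It compares stopping at the first look where p < 0.05 against evaluating only once at the planned end.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled two-proportion z-test p-value."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation in the data, nothing to test
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=500, looks=10, batch=200,
                      rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed_at_any_look = False
        for _ in range(looks):
            # Both arms draw from the same true rate: no real difference.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                crossed_at_any_look = True  # a peeker would stop and ship here
        peeking_hits += crossed_at_any_look
        final_hits += p < alpha  # p now holds the value at the planned end
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, final_rate = simulate_aa_tests()
print(f"False positives when peeking at every look: {peeking_rate:.1%}")
print(f"False positives with one final evaluation:  {final_rate:.1%}")
```

Because the final look is also one of the interim looks, the peeking rate can never be lower than the single-look rate. In runs like this it is typically several times higher than the nominal 5%.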