Statistical Methods
This page documents the statistical methods used by ASV and asv-spyglass to determine benchmark significance.
Mann-Whitney U Test
ASV uses the Mann-Whitney U test (also called the Wilcoxon rank-sum test) to determine if two sets of benchmark samples come from different distributions.
Properties:
Non-parametric: works for any distribution shape (not just normal)
Rank-based: compares the ordering of the samples (often read as a comparison of medians) rather than their means
Robust to outliers
Requires --record-samples to collect multiple measurements
The null hypothesis is that both sample sets are drawn from the same distribution. A low p-value rejects this hypothesis, indicating a statistically significant difference.
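The test can be illustrated with a small exact implementation. The following is a minimal sketch, not ASV's own code: `mann_whitney_exact_p` is a hypothetical helper that enumerates every way of splitting the pooled samples into two groups, so it is only practical for small sample counts.

```python
from itertools import combinations


def mann_whitney_exact_p(a, b):
    """Exact two-sided rank-sum p-value by enumerating all group labelings.

    Illustrative only: cost grows as C(len(a) + len(b), len(a)).
    """
    pooled = list(a) + list(b)
    n, total_n = len(a), len(a) + len(b)
    ordered = sorted(pooled)

    def midrank(x):
        # Average 1-based rank of x, so tied values share one rank.
        first = ordered.index(x) + 1
        last = first + ordered.count(x) - 1
        return (first + last) / 2

    ranks = [midrank(x) for x in pooled]
    expected = n * (total_n + 1) / 2  # mean rank sum of group a under the null
    observed = abs(sum(ranks[:n]) - expected)

    labelings = 0
    as_extreme = 0
    for idx in combinations(range(total_n), n):
        labelings += 1
        deviation = abs(sum(ranks[i] for i in idx) - expected)
        if deviation >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / labelings


# Made-up timings in ns: every candidate sample is slower than every baseline one.
baseline = [167, 169, 168, 170, 166, 171, 165]
candidate = [181, 183, 180, 184, 182, 185, 179]
print(mann_whitney_exact_p(baseline, candidate))  # 2 / 3432, about 0.00058
```

With seven fully separated samples per side, only two of the 3432 possible labelings are as extreme as the observed one, so the p-value falls below a 0.002 threshold.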
P-Value Threshold
ASV uses a p-value threshold of 0.002 (0.2%). This is much stricter than the common 0.05 (5%) threshold, reducing false positives at the cost of requiring larger or more consistent differences to register as significant.
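With a threshold of 0.002, the chance of at least one false positive across a benchmark suite stays small for modest suite sizes even without formal multiple-comparison correction. Assuming independent benchmarks, that chance is 1 - 0.998^N for N benchmarks: roughly 10% at N = 50, compared with over 90% at a 0.05 threshold.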
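To see how the threshold interacts with suite size, the chance of at least one false positive can be estimated directly. This is a rough sketch that assumes the benchmarks are independent and that no real change is present; `family_wise_error_rate` is a hypothetical helper name, not part of ASV or asv-spyglass.

```python
def family_wise_error_rate(alpha, n_benchmarks):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_benchmarks


# Compare the strict threshold with the common 0.05 threshold.
for n in (10, 50, 200):
    strict = family_wise_error_rate(0.002, n)
    loose = family_wise_error_rate(0.05, n)
    print(f"{n:4d} benchmarks: alpha=0.002 -> {strict:.3f}, alpha=0.05 -> {loose:.3f}")
```

Real benchmarks in one suite often share code paths and machine noise, so the independence assumption is only an approximation.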
99% Confidence Intervals
ASV computes 99% confidence intervals for the ratio between two measurements using two methods:
Binomial quantile method: when enough samples exist, uses the binomial distribution to find the sample quantiles that bound the 99% interval
Laplace posterior fallback: when sample counts are low, falls back to a Bayesian approach using a Laplace prior
The confidence interval is reported as the +/- value after measurements
(e.g., 167+/-3ns means the 99% CI spans 3ns around the median).
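The binomial quantile idea can be sketched as a distribution-free interval for the median of one sample set, built from order statistics. This is an illustration under that assumption, not ASV's exact procedure; `median_ci` is a hypothetical name, and it returns `None` when there are too few samples, which is the situation where a fallback would be needed.

```python
from math import comb


def median_ci(samples, confidence=0.99):
    """Order-statistic confidence interval for the median.

    Uses B ~ Binomial(n, 1/2): the interval between the j-th smallest and
    j-th largest sample covers the median with probability 1 - 2*P(B <= j-1).
    """
    xs = sorted(samples)
    n = len(xs)
    alpha = 1.0 - confidence
    j = 0        # number of samples counted in from each end
    tail = 0.0   # P(B <= j - 1)
    while j < n // 2:
        new_tail = tail + comb(n, j) / 2 ** n  # P(B <= j)
        if 2 * new_tail > alpha:
            break
        tail = new_tail
        j += 1
    if j == 0:
        return None  # even (min, max) does not reach the requested coverage
    return xs[j - 1], xs[n - j]


print(median_ci(range(1, 21)))  # (4, 17)
print(median_ci(range(1, 8)))   # None: 7 samples are too few for 99%
```

For a 99% interval this approach needs at least 8 samples, because the widest possible interval (minimum to maximum) only covers the median with probability 1 - 2/2^n.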
The Factor Parameter
The factor parameter (default: 1.1, meaning 10%) sets the minimum ratio
change that asv-spyglass considers noteworthy. A benchmark with ratio 1.05
and factor 1.1 is not flagged, even if statistically significant.
The interaction between factor and statistical significance:
| Ratio vs Factor | Significant? | Mark | Meaning |
|---|---|---|---|
| Above factor | Yes | + or - | Real, meaningful change |
| Above factor | No | ~ | Uncertain: might be noise |
| Below factor | Yes | (space) | Statistically real but too small to matter |
| Below factor | No | (space) | No meaningful change |
Why ~ Prefix Matters
A ratio like ~1.15 means the change exceeds the factor threshold (1.1) but
failed the Mann-Whitney U test. This is an important signal: the benchmark
might be regressing, but the evidence is not strong enough to be confident.
This is different from a simple ratio > threshold approach (which is what
inline bash regex patterns typically do). The statistical test accounts for
measurement noise and sample size.
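The interaction between the factor and the significance test can be written as a small decision function. This is a sketch of the logic described here, not the asv-spyglass source: `classify_change` is a hypothetical name, and it assumes the `asv compare` convention of `+` for a significant slowdown and `-` for a significant speedup, with a ratio below `1 / factor` treated as exceeding the factor in the improving direction.

```python
def classify_change(ratio, significant, factor=1.1):
    """Return the mark for a benchmark given its ratio and test outcome."""
    exceeds_factor = ratio > factor or ratio < 1.0 / factor
    if not exceeds_factor:
        return " "  # too small to matter, significant or not
    if not significant:
        return "~"  # large ratio, but the evidence is weak
    return "+" if ratio > 1.0 else "-"


print(repr(classify_change(1.15, significant=False)))  # '~'
print(repr(classify_change(1.05, significant=True)))   # ' '
print(repr(classify_change(1.30, significant=True)))   # '+'
```

Keeping the two checks separate is what lets a large but noisy ratio surface as `~` instead of being reported as a regression or silently dropped.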
Why This Beats Naive Ratio Comparison
Many CI setups compare ratios with a simple threshold:
if (( $(echo "$RATIO > 10.0" | bc -l) )); then
    REGRESSION="true"
fi
This has several problems:
Ignores measurement variance (a 2x ratio from noisy data may be meaningless)
Ignores sample size (single-sample ratios are unreliable)
Cannot distinguish real regressions from noise
No confidence intervals
ASV + asv-spyglass addresses all of these by requiring --record-samples,
applying Mann-Whitney U, and computing confidence intervals. The action
preserves this rigor in its PR comments.
Practical Recommendations
Always use --record-samples in ASV. Without it, no statistical testing is possible and asv-spyglass falls back to simple ratios.
Use at least 3-5 samples per benchmark (ASV's default repetition count). More samples improve statistical power.
Set --factor based on your project's noise floor. If benchmarks typically vary by +/-5%, a factor of 1.1 (10%) is appropriate. For lower-noise setups, 1.05 works.
The ~ prefix is not a false alarm; it is a signal to investigate further, not to ignore.
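Sample count also limits how small a p-value a rank test can produce at all. The sketch below computes the smallest two-sided p-value an exact rank-sum test can reach, assuming no ties and the 0.002 threshold quoted above; the helper names are hypothetical, and ASV's own handling of small sample counts may differ.

```python
from math import comb


def min_two_sided_p(n_a, n_b):
    """Smallest two-sided p-value an exact rank-sum test can produce.

    It occurs when the two groups are completely separated: only 2 of the
    C(n_a + n_b, n_a) possible labelings are that extreme.
    """
    return min(1.0, 2 / comb(n_a + n_b, n_a))


def min_samples_per_side(p_threshold=0.002):
    """Fewest samples per side (equal sizes) that can fall below the threshold."""
    n = 1
    while min_two_sided_p(n, n) >= p_threshold:
        n += 1
    return n


print(min_two_sided_p(5, 5))   # 2 / 252, about 0.0079
print(min_samples_per_side())  # 7
```

Under these assumptions, 5 samples per side cannot reach p < 0.002 even when the two runs are completely separated, so recording 7 or more samples per side gives the test room to register a difference.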