====================
Statistical Methods
====================

This page documents the statistical methods used by ASV and asv-spyglass to
determine benchmark significance.

Mann-Whitney U Test
-------------------

ASV uses the Mann-Whitney U test (also called the Wilcoxon rank-sum test) to
determine if two sets of benchmark samples come from different distributions.

Properties:

- Non-parametric: works for any distribution shape (not just normal)
- Rank-based: compares how the two samples order relative to each other
  (often summarized as comparing medians) rather than comparing means
- Robust to outliers
- Requires ``--record-samples`` to collect multiple measurements

The null hypothesis is that both sample sets are drawn from the same
distribution. A low p-value rejects this hypothesis, indicating a
statistically significant difference.

P-Value Threshold
-----------------

ASV uses a p-value threshold of 0.002 (0.2%). This is much stricter than the
common 0.05 (5%) threshold, reducing false positives at the cost of requiring
larger or more consistent differences to register as significant.

With a threshold of 0.002, the chance of at least one false positive across
many benchmarks stays far lower than it would at 0.05, even without formal
multiple-comparison correction, although it still grows with the number of
benchmarks compared.

99% Confidence Intervals
------------------------

ASV computes 99% confidence intervals for the ratio between two measurements
using two methods:

- **Binomial quantile method:** when enough samples exist, uses the binomial
  distribution to find the sample quantiles that bound the 99% interval
- **Laplace posterior fallback:** when sample counts are low, falls back to a
  Bayesian approach using a Laplace prior

The confidence interval is reported as the ``+/-`` value after measurements
(e.g., ``167+/-3ns`` means the 99% CI spans 3ns around the median).

The Factor Parameter
--------------------

The ``factor`` parameter (default: 1.1, meaning 10%) sets the minimum ratio
change that asv-spyglass considers noteworthy. A benchmark with ratio 1.05 and
factor 1.1 is not flagged, even if statistically significant.

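The factor check described above can be sketched in a few lines. This is an
illustration only: the function name ``exceeds_factor`` is hypothetical, and
treating speedups symmetrically via ``1 / factor`` is an assumption rather
than something this page states.

.. code:: python

   def exceeds_factor(ratio: float, factor: float = 1.1) -> bool:
       """Return True when the ratio change is larger than the factor.

       Assumption: slowdowns are compared against ``factor`` and speedups
       against ``1 / factor``; the real tool may draw the boundary differently.
       """
       if factor < 1.0:
           raise ValueError("factor must be >= 1.0")
       return ratio > factor or ratio < 1.0 / factor


   # The example from the text: ratio 1.05 with the default factor of 1.1.
   print(exceeds_factor(1.05))  # below the factor, so not noteworthy
   print(exceeds_factor(1.15))  # slowdown beyond the factor
   print(exceeds_factor(0.85))  # speedup beyond 1 / factor (assumed symmetric)
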
The interaction between factor and statistical significance:

.. table::

   +-----------------+--------------+----------------+--------------------------------------------+
   | Ratio vs Factor | Significant? | Mark           | Meaning                                    |
   +=================+==============+================+============================================+
   | Above factor    | Yes          | ``+`` or ``-`` | Real, meaningful change                    |
   +-----------------+--------------+----------------+--------------------------------------------+
   | Above factor    | No           | ``~`` prefix   | Uncertain -- might be noise                |
   +-----------------+--------------+----------------+--------------------------------------------+
   | Below factor    | Yes          | (space)        | Statistically real but too small to matter |
   +-----------------+--------------+----------------+--------------------------------------------+
   | Below factor    | No           | (space)        | No meaningful change                       |
   +-----------------+--------------+----------------+--------------------------------------------+

Why ``~`` Prefix Matters
------------------------

A ratio like ``~1.15`` means the change exceeds the factor threshold (1.1) but
failed the Mann-Whitney U test. This is an important signal: the benchmark
**might** be regressing, but the evidence is not strong enough to be
confident.

This is different from a simple ``ratio > threshold`` approach (which is what
inline bash regex patterns typically do). The statistical test accounts for
measurement noise and sample size.

Why This Beats Naive Ratio Comparison
-------------------------------------

Many CI setups compare ratios with a simple threshold:

.. code:: bash

   if (( $(echo "$RATIO > 10.0" | bc -l) )); then
       REGRESSION="true"
   fi

This has several problems:

- Ignores measurement variance (a 2x ratio from noisy data may be meaningless)
- Ignores sample size (single-sample ratios are unreliable)
- Cannot distinguish real regressions from noise
- No confidence intervals

ASV + asv-spyglass addresses all of these by requiring ``--record-samples``,
applying Mann-Whitney U, and computing confidence intervals. The action
preserves this rigor in its PR comments.

Practical Recommendations
-------------------------

- Always use ``--record-samples`` in ASV. Without it, no statistical testing
  is possible and asv-spyglass falls back to simple ratios.
- Use at least 3-5 samples per benchmark (ASV's default repetition count).
  More samples improve statistical power.
- Set ``--factor`` based on your project's noise floor. If benchmarks
  typically vary by +/-5%, a factor of 1.1 (10%) is appropriate. For
  lower-noise setups, 1.05 works.
- The ``~`` prefix is not a false alarm -- it is a signal to investigate
  further, not to ignore.

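The link between sample count and statistical power can be made concrete. For
an exact two-sided Mann-Whitney U test with no ties, the smallest p-value that
samples of sizes ``n`` and ``m`` can produce is ``2 / C(n + m, n)``, reached
when the two samples do not overlap at all. The sketch below tabulates that
bound against the 0.002 threshold. The helper name ``min_two_sided_p`` is
hypothetical, and whether a given tool uses the exact test, a normal
approximation, or a one-sided variant is an assumption to verify.

.. code:: python

   from math import comb

   P_THRESHOLD = 0.002  # threshold quoted on this page


   def min_two_sided_p(n: int, m: int) -> float:
       """Smallest two-sided p-value of an exact Mann-Whitney U test (no ties).

       Under the null hypothesis every assignment of ranks to the two samples
       is equally likely, and exactly one of the C(n + m, n) assignments puts
       all of one sample below the other.
       """
       if n < 1 or m < 1:
           raise ValueError("both samples need at least one value")
       return min(1.0, 2.0 / comb(n + m, n))


   for size in (3, 5, 7, 10):
       p_min = min_two_sided_p(size, size)
       reachable = p_min < P_THRESHOLD
       print(f"{size} vs {size} samples: min p = {p_min:.5f}, "
             f"can reach threshold: {reachable}")

Under these assumptions, equal-sized samples of 3 or 5 values cannot produce
``p < 0.002`` even when they are completely separated, while 7 or more per
side can.
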