====================
Statistical Methods
====================

This page documents the statistical methods used by ASV and asv-spyglass to
determine benchmark significance.

Mann-Whitney U Test
-------------------

ASV uses the Mann-Whitney U test (also called the Wilcoxon rank-sum test) to
determine if two sets of benchmark samples come from different distributions.

Properties:

- Non-parametric: works for any distribution shape (not just normal)

- Rank-based: responds to a shift in typical values (often summarized as a
  difference in medians) rather than to a difference in means

- Robust to outliers

- Requires ``--record-samples`` to collect multiple measurements

The null hypothesis is that both sample sets are drawn from the same
distribution. A low p-value rejects this hypothesis, indicating a statistically
significant difference.
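
To make the test concrete, the following is a minimal standard-library sketch
of an exact two-sided Mann-Whitney U test. It is an illustration only, not
ASV's implementation; the names ``midranks`` and ``mann_whitney_exact`` and
the sample timings are hypothetical.

.. code:: python

    from itertools import combinations


    def midranks(values):
        """Rank values from 1..N, giving tied values their average rank."""
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        return ranks


    def mann_whitney_exact(a, b):
        """Return (U, two-sided p) by enumerating every relabelling of the ranks."""
        ranks = midranks(list(a) + list(b))
        n, m = len(a), len(b)
        mean_u = n * m / 2          # expected U under the null hypothesis
        offset = n * (n + 1) / 2    # smallest possible rank sum for group a
        u_obs = sum(ranks[:n]) - offset
        extreme = total = 0
        for idx in combinations(range(n + m), n):
            u = sum(ranks[i] for i in idx) - offset
            total += 1
            # Count relabellings at least as far from the mean as the observed U.
            if abs(u - mean_u) >= abs(u_obs - mean_u) - 1e-9:
                extreme += 1
        return u_obs, extreme / total


    # Hypothetical timings in seconds: every "after" sample is slower.
    before = [1.00, 1.02, 0.99, 1.01, 1.03, 0.98, 1.04]
    after = [1.20, 1.22, 1.19, 1.21, 1.23, 1.18, 1.24]
    u, p = mann_whitney_exact(before, after)
    print(f"U={u}, p={p:.6f}, significant at 0.002: {p < 0.002}")

Full enumeration is only practical for small sample sets. In practice a library
routine such as SciPy's ``scipy.stats.mannwhitneyu`` would normally be used.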

P-Value Threshold
-----------------

ASV uses a p-value threshold of 0.002 (0.2%). This is much stricter than the
common 0.05 (5%) threshold, reducing false positives at the cost of requiring
larger or more consistent differences to register as significant.

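Because no formal multiple-comparison correction is applied, the chance of at
least one false positive still grows with the number of benchmarks, but far
more slowly at 0.002 than at 0.05: roughly 18% versus over 99% for 100
independent benchmarks with no real change.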
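
The effect of the threshold on a whole suite can be estimated with a short
calculation. The sketch below assumes the benchmarks are independent and that
none of them has really changed; ``familywise_error_rate`` is a hypothetical
helper name, not part of ASV or asv-spyglass.

.. code:: python

    def familywise_error_rate(alpha, n_benchmarks):
        """Chance of at least one false positive across independent null tests."""
        return 1 - (1 - alpha) ** n_benchmarks


    for n in (10, 100, 500):
        strict = familywise_error_rate(0.002, n)
        loose = familywise_error_rate(0.05, n)
        print(f"{n:4d} benchmarks: {strict:6.1%} at 0.002, {loose:6.1%} at 0.05")

Real benchmark suites are rarely fully independent, so these figures are a
rough guide rather than a guarantee.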

99% Confidence Intervals
------------------------

ASV computes a 99% confidence interval for the median of each benchmark's
samples, using one of two methods:

- **Binomial quantile method:** when enough samples exist, uses the binomial
  distribution to find the sample quantiles that bound the 99% interval

- **Laplace posterior fallback:** when sample counts are low, falls back to a
  Bayesian interval taken from a posterior that models the measurement noise
  as Laplace-distributed

The confidence interval is reported as the ``+/-`` value after measurements
(e.g., ``167+/-3ns`` means the 99% CI extends about 3ns on either side of the
167ns median).
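
The idea behind the binomial quantile method can be sketched with the textbook
order-statistic interval for a median. This is an illustration under that
assumption, not ASV's code; ``median_ci`` is a hypothetical name, and the
point at which ASV actually switches to its fallback may differ.

.. code:: python

    from math import comb


    def median_ci(samples, level=0.99):
        """Distribution-free CI for the median from order statistics.

        Returns (low, high), or None when there are too few samples for any
        pair of order statistics to reach the requested coverage.
        """
        xs = sorted(samples)
        n = len(xs)
        best = None
        tail = 0.0  # running P(Binomial(n, 1/2) <= k - 1)
        for k in range(1, n // 2 + 1):
            tail += comb(n, k - 1) / 2 ** n
            coverage = 1 - 2 * tail  # coverage of [x_(k), x_(n-k+1)]
            if coverage < level:
                break
            best = (xs[k - 1], xs[n - k])  # narrowest interval found so far
        return best


    print(median_ci(range(1, 21)))  # 20 samples: an interval exists
    print(median_ci(range(1, 8)))   # 7 samples: None, a fallback is needed

With 7 or fewer samples even the full range from minimum to maximum covers the
median with less than 99% probability, which is why a fallback method is needed
for small sample counts.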

The Factor Parameter
--------------------

The ``factor`` parameter (default: 1.1, meaning 10%) sets the minimum ratio
change that asv-spyglass considers noteworthy. A benchmark with ratio 1.05
and factor 1.1 is not flagged, even if statistically significant.

The interaction between factor and statistical significance:

.. table::

    +-----------------+--------------+----------------+--------------------------------------------+
    | Ratio vs Factor | Significant? | Mark           | Meaning                                    |
    +=================+==============+================+============================================+
    | Above factor    | Yes          | ``+`` or ``-`` | Real, meaningful change                    |
    +-----------------+--------------+----------------+--------------------------------------------+
    | Above factor    | No           | ``~`` prefix   | Uncertain -- might be noise                |
    +-----------------+--------------+----------------+--------------------------------------------+
    | Below factor    | Yes          | (space)        | Statistically real but too small to matter |
    +-----------------+--------------+----------------+--------------------------------------------+
    | Below factor    | No           | (space)        | No meaningful change                       |
    +-----------------+--------------+----------------+--------------------------------------------+
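
The table can be read as a small decision function. The sketch below is a
hypothetical helper, not asv-spyglass code. It assumes the ratio is
``after / before`` (so values above 1 are slowdowns) and that a ratio below
``1 / factor`` counts as an equally large improvement.

.. code:: python

    def mark_for(ratio, significant, factor=1.1):
        """Return the mark from the table for one benchmark comparison."""
        if ratio > factor:
            direction = "+"   # slower by more than the factor
        elif ratio < 1 / factor:
            direction = "-"   # faster by more than the factor (assumed symmetric)
        else:
            return " "        # below factor: not flagged, significant or not
        # Above factor: keep the direction only when the test agrees.
        return direction if significant else "~"


    for ratio, significant in [(1.15, True), (1.15, False), (1.05, True), (1.05, False)]:
        print(ratio, significant, repr(mark_for(ratio, significant)))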

Why ``~`` Prefix Matters
------------------------

A ratio like ``~1.15`` means the change exceeds the factor threshold (1.1) but
the Mann-Whitney U test did not find it significant. This is an important
signal: the benchmark **might** be regressing, but the evidence is not strong
enough to be confident.

A plain ``ratio > threshold`` check, which is what inline bash comparisons in
CI scripts typically do, cannot express this state. The statistical test
accounts for measurement noise and sample size; the plain check does not.

Why This Beats Naive Ratio Comparison
-------------------------------------

Many CI setups compare ratios with a simple threshold:

.. code:: bash

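    # Naive check: flags any ratio above a fixed cutoff, with no test of
    # whether the difference is distinguishable from noise.
    if (( $(echo "$RATIO > 10.0" | bc -l) )); then
      REGRESSION="true"
    fi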

This has several problems:

- Ignores measurement variance (a 2x ratio from noisy data may be meaningless)

- Ignores sample size (single-sample ratios are unreliable)

- Cannot distinguish real regressions from noise

- No confidence intervals

ASV and asv-spyglass address all of these by recording samples
(``--record-samples``), applying the Mann-Whitney U test, and computing
confidence intervals. The action preserves this rigor in its PR comments.

Practical Recommendations
-------------------------

- Always use ``--record-samples`` in ASV. Without it, no statistical testing
  is possible and asv-spyglass falls back to simple ratios.

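- Use at least 3-5 samples per benchmark (ASV's default repetition count).
  More samples improve statistical power, and very small sets cap how small a
  p-value can get: an exact two-sided rank test on 5 versus 5 samples cannot
  go below 2/252 (about 0.008), so passing the 0.002 threshold through the
  test alone takes at least 7 samples per side.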

- Set ``--factor`` based on your project's noise floor. If benchmarks typically
  vary by +/-5%, a factor of 1.1 (10%) is appropriate. For lower-noise
  setups, 1.05 works.

- Do not dismiss the ``~`` prefix as a false alarm; treat it as a signal to
  investigate further.
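
The link between the sample-count recommendation and the 0.002 threshold can
be checked with a little arithmetic. The sketch below assumes an exact
two-sided rank-sum test with no ties, where the most extreme outcome (complete
separation of the two sets) has probability ``2 / C(n + m, n)``. The function
names are hypothetical, and ASV's own handling of small sample sets may differ.

.. code:: python

    from math import comb


    def min_two_sided_p(n, m):
        """Smallest two-sided p-value an exact rank-sum test can return (no ties)."""
        return min(1.0, 2 / comb(n + m, n))


    def samples_needed(threshold=0.002):
        """Smallest equal per-side sample count whose minimum p is below threshold."""
        n = 1
        while min_two_sided_p(n, n) >= threshold:
            n += 1
        return n


    for n in (3, 5, 6, 7, 10):
        print(f"{n:2d} vs {n:2d} samples: minimum p = {min_two_sided_p(n, n):.6f}")
    print("samples per side needed for p < 0.002:", samples_needed())

Below that count the rank test cannot flag a change on its own, however large
the difference, so collecting more samples matters most for noisy benchmarks
near the factor threshold.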