Methodology
What a leaderboard row actually means, and the scoring rules behind every WER, RTFx, and parity verdict. Short by design — enough to trust the numbers and to reproduce them.
A row is a tuple, not a model
A leaderboard row is never “Whisper large-v3.” It is the full identity tuple that produced one scored run. Two of its axes — backend and hardware — move both speed and quality dramatically on local inference, so the site keeps every axis legible and comparable rather than collapsing it into a single number.
Highlighted axes move speed and quality; each is a column and a filter, never a footnote.
The tuple is hashed into a stable identity_key that de-duplicates rows, groups derived leaderboards, and permalinks each result detail page.
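As a rough illustration of how a tuple can be hashed into a stable key, here is a minimal sketch. It assumes canonical JSON plus SHA-256; the site's actual hash function, the field names (model, backend, hardware), and the example values are all hypothetical and not taken from the real schema.

```python
import hashlib
import json


def identity_key(identity: dict) -> str:
    """Hash an identity tuple into a stable hex key (illustrative only)."""
    # Sorted keys and fixed separators yield one canonical byte string per
    # tuple, so the order of fields in the input dict cannot change the key.
    canonical = json.dumps(
        identity, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical field names and values, for illustration only.
row_a = {"model": "whisper-large-v3", "backend": "mlx", "hardware": "M2 Max"}
row_b = {"hardware": "M2 Max", "backend": "mlx", "model": "whisper-large-v3"}
row_c = {**row_a, "backend": "coreml"}

same_tuple = identity_key(row_a) == identity_key(row_b)    # same axes, same key
other_backend = identity_key(row_a) != identity_key(row_c)  # one axis differs
```

Because any change to any axis produces a different key, two runs of the same model on different backends stay separate rows instead of being merged.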
Quality and speed are separate axes
There is no single composite “AI score.” Quality and speed are reported side by side and never blended, so a fast model with poor accuracy cannot quietly win.
Quality is headline word error rate, wer_norm (lower is better), with cer and cased wer_ortho alongside it. Speed is real-time factor: rtfx_native is comparable across implementations, while rtfx_wall is harness-scoped and only comparable within a single implementation — it is rendered visually distinct so the two are never confused.
Outside the hardware leaderboard, speed is never compared across different Apple chips. RTFx heat is scoped per hardware SoC, and cross-chip speed comparison is confined to that one view, where it is the explicit point.
WER normalization
Before scoring, both hypothesis and reference are put through one pinned, versioned pipeline so a WER means the same thing on every row:
- unicode_form — text is normalized to NFC so canonically-equivalent Unicode compares equal.
- normalizer — a pinned whisper-basic@0.1.12 normalizer (lower-casing, punctuation and whitespace handling). The id@version travels with every result, so a normalizer change is a visible, versioned event — never a silent shift in the numbers.
- remove_diacritics: false for Russian — diacritics are kept, so ё/й are preserved and not folded onto е/и. Removing them would understate WER on Russian.
wer_norm is computed on this normalized text; wer_ortho is the cased, un-normalized orthographic WER kept alongside it for reference. The reference set is content-hashed (references · sha256) so the exact targets are pinned too.
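The steps above can be sketched as a small function. This is not the pinned whisper-basic normalizer; it is a simplified stand-in, assuming that "punctuation and whitespace handling" means replacing punctuation with spaces and collapsing runs of whitespace. The function name and the remove_diacritics parameter are illustrative.

```python
import re
import unicodedata


def normalize_text(text: str, remove_diacritics: bool = False) -> str:
    """Simplified stand-in for a WER text normalizer (illustrative only)."""
    # NFC first, so decomposed and precomposed spellings compare equal.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    if remove_diacritics:
        # Decompose, drop combining marks (category Mn), then recompose.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(
            ch for ch in decomposed if unicodedata.category(ch) != "Mn"
        )
        text = unicodedata.normalize("NFC", text)
    # Replace anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


kept = normalize_text("Ёжик,  ПРИШЁЛ!")                             # ё is preserved
folded = normalize_text("Ёжик,  ПРИШЁЛ!", remove_diacritics=True)  # ё folds to е
```

With remove_diacritics left at False, "ёжик" and "ежик" remain different tokens, which is the behavior the Russian setting above is meant to protect.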
Micro-aggregation
Per-language metrics use micro-aggregation (aggregation: "micro"): errors and reference tokens are pooled across all utterances, then the rate is taken once, as total edits ÷ total reference length. This weights every word equally, so long utterances count for more than short ones. The alternative, averaging per-utterance rates (macro), would let one short sentence swing the score.
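To make the difference concrete, here is a minimal sketch comparing the two aggregations on a toy pair of utterances. It assumes whitespace tokenization, already-normalized text, and non-empty references; the function names are illustrative, not the site's scoring code.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def micro_wer(pairs: list[tuple[str, str]]) -> float:
    """Pool edits and reference words over all utterances, then divide once."""
    total_edits = 0
    total_ref = 0
    for ref, hyp in pairs:
        ref_words, hyp_words = ref.split(), hyp.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_ref += len(ref_words)
    return total_edits / total_ref


def macro_wer(pairs: list[tuple[str, str]]) -> float:
    """Average the per-utterance rates, shown here for contrast only."""
    rates = [
        edit_distance(ref.split(), hyp.split()) / len(ref.split())
        for ref, hyp in pairs
    ]
    return sum(rates) / len(rates)


sentence = "the quick brown fox jumps over the lazy dog"
pairs = [
    ("yes", "no"),         # 1 edit over 1 reference word
    (sentence, sentence),  # 0 edits over 9 reference words
]
micro = micro_wer(pairs)  # 1 edit / 10 words
macro = macro_wer(pairs)  # mean of 1.0 and 0.0
```

Here the single wrong one-word utterance pushes the macro rate to 0.5, while the micro rate is 0.1, reflecting one error in ten reference words.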
Runs with errors are flagged as lower-confidence via their n_error / n_missing counts and are never silently averaged in. The row is built to grow into a median ± CI display (for example, n=142 devices) once crowd aggregation lands.
Parity tolerances
Parity answers one question: did the port change the transcripts? Hold the model equal, vary the backend or chip, and check that the output stays equivalent within explicit tolerances. The verdict is PASS only when every per-language delta stays inside its bound; otherwise FAIL.
A report states what was held equal (shared) versus what changed (differs), shows each quality_delta against its tolerance, and reports identical_hypothesis_rate as a gauge. Mode quality is hardware-independent (transcripts only); mode full adds one-axis speed deltas. This is the check for validating a CUDA→MLX or PyTorch→Core ML port.
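The verdict rule and the gauge can be sketched as follows. This is a minimal sketch under stated assumptions: that a delta is "inside its bound" when its absolute value does not exceed the tolerance, that tolerances are given per language, and that the two hypothesis lists are aligned utterance by utterance. The function signatures and the numbers are hypothetical, not the report schema.

```python
def parity_verdict(quality_delta: dict[str, float],
                   tolerance: dict[str, float]) -> str:
    """Return PASS only if every per-language delta is within its tolerance."""
    for lang, delta in quality_delta.items():
        # Assumption: the bound applies to the magnitude of the delta.
        if abs(delta) > tolerance[lang]:
            return "FAIL"
    return "PASS"


def identical_hypothesis_rate(hyps_a: list[str], hyps_b: list[str]) -> float:
    """Fraction of aligned utterances whose two transcripts match exactly."""
    if len(hyps_a) != len(hyps_b):
        raise ValueError("hypothesis lists must be aligned and equal in length")
    same = sum(a == b for a, b in zip(hyps_a, hyps_b))
    return same / len(hyps_a)


# Hypothetical deltas and tolerances, for illustration only.
tolerance = {"en": 0.005, "ru": 0.005}
ok = parity_verdict({"en": 0.001, "ru": -0.002}, tolerance)
bad = parity_verdict({"en": 0.001, "ru": 0.010}, tolerance)
rate = identical_hypothesis_rate(["a", "b", "c", "d"], ["a", "b", "x", "d"])
```

One language out of bounds is enough to fail the whole report, which matches the all-or-nothing verdict described above; the identical-hypothesis rate is reported alongside and does not feed the verdict in this sketch.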
The schemas are the contract
Results and parity reports are validated against versioned JSON Schemas, and the TypeScript types this site reads are generated from them — so the UI cannot drift from the data. Reserved fields (streaming latency, diarization, power/thermal) already live in the schema, leaving room to grow without a redesign.
Not affiliated with Apple or the model vendors. Cloud baselines, when they land, are labeled separately and never mixed into a hardware leaderboard.