A leaderboard row is never “a model.” It’s a tuple.
Two of its axes — backend and hardware — move both speed and quality dramatically on local inference. So the whole site is built to make the full tuple legible and comparable, never collapsed into one number.
Quality and speed are separate axes.
There is no single composite “AI score.” A fast model with poor accuracy does not quietly win.
Headline accuracy is word error rate — lower is better — after NFC normalization and a pinned whisper-basic normalizer (Russian keeps ё/й). CER and cased wer_ortho sit alongside it.
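As a rough illustration of the order of operations (normalize first, then score), here is a minimal Python sketch of word error rate. The `normalize` function is a simplified stand-in and an assumption of this example, not the pinned whisper-basic normalizer; it only shows why NFC matters for letters like ё and й, which can arrive in decomposed form.

```python
import unicodedata


def normalize(text: str) -> list[str]:
    """Stand-in normalizer (assumption): NFC, lowercase, drop punctuation, split.

    NFC composes sequences such as 'и' + U+0306 into a single 'й', so the same
    word compares equal regardless of how the input was encoded.
    """
    text = unicodedata.normalize("NFC", text).lower()
    kept = []
    for ch in text:
        # Keep letters, digits, and any combining marks that survive NFC;
        # turn everything else (punctuation, symbols, whitespace) into a space.
        if ch.isalnum() or unicodedata.category(ch).startswith("M"):
            kept.append(ch)
        else:
            kept.append(" ")
    return "".join(kept).split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference length. Lower is better."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)
```

CER follows the same shape with characters in place of words; a cased, punctuation-preserving variant would skip the lowercasing and punctuation steps.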
rtfx_native is comparable across implementations; rtfx_wall only within one. Neither is ever compared across different Apple chips — the leaderboard blocks or flags it.
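The comparability rule above can be read as a small guard that runs before any two rows are ranked on speed. The sketch below is an assumption-laden illustration: the `Row` fields and the function name are hypothetical, and "implementation" is taken to mean the backend. It covers ordinary ranking only, not the dedicated chip-comparison view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical row shape; the real schema is not shown in the source.
    model: str
    backend: str  # treated here as "the implementation", e.g. "mlx"
    chip: str     # e.g. "M4 Max"
    rtfx_native: float
    rtfx_wall: float


def can_compare_speed(a: Row, b: Row, metric: str) -> bool:
    """Return True only when ranking a against b on this speed metric is meaningful."""
    if metric not in ("rtfx_native", "rtfx_wall"):
        raise ValueError(f"unknown speed metric: {metric}")
    if a.chip != b.chip:
        # Neither metric is compared across different chips.
        return False
    if metric == "rtfx_wall":
        # Wall-clock speed is only comparable within one implementation.
        return a.backend == b.backend
    # rtfx_native is comparable across implementations on the same chip.
    return True
```

A caller would either drop pairs for which this returns False or render them with a warning flag, matching the "blocks or flags" behavior described above.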
Hold the model equal, vary the backend or chip, and check that transcripts stay equivalent within explicit tolerances — a clear PASS / FAIL for validating a CUDA→MLX or PyTorch→Core ML port.
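A parity check of this kind might look like the following sketch. The two tolerances, their default values, and the returned field names are all hypothetical; the source only says that tolerances are explicit and that the verdict is PASS or FAIL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tolerances:
    # Both thresholds are illustrative assumptions, not the site's real values.
    max_wer_delta: float = 0.005      # allowed absolute difference in WER
    max_mismatch_rate: float = 0.02   # allowed share of samples whose transcripts differ


def parity(
    ref_transcripts: list[str],
    cand_transcripts: list[str],
    ref_wer: float,
    cand_wer: float,
    tol: Tolerances = Tolerances(),
) -> dict:
    """Compare a candidate run (new backend or chip) against a reference run."""
    if len(ref_transcripts) != len(cand_transcripts):
        raise ValueError("runs must cover the same samples in the same order")
    if not ref_transcripts:
        raise ValueError("no samples to compare")

    # Compare per-sample transcripts token by token, ignoring extra whitespace.
    mismatches = sum(
        r.split() != c.split() for r, c in zip(ref_transcripts, cand_transcripts)
    )
    mismatch_rate = mismatches / len(ref_transcripts)
    wer_delta = abs(cand_wer - ref_wer)

    passed = wer_delta <= tol.max_wer_delta and mismatch_rate <= tol.max_mismatch_rate
    return {
        "verdict": "PASS" if passed else "FAIL",
        "wer_delta": wer_delta,
        "mismatch_rate": mismatch_rate,
    }
```

Keeping the tolerances in one explicit object means a report can record exactly which thresholds produced its verdict.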
One table, several questions.
The same result set, sliced by preset filters — each answering a different local-inference decision.
The best local model on a given Mac — fix the chip and backend, rank by WER then speed.
Open on M4 Max →
Fix one artifact, compare MLX vs llama.cpp vs Core ML vs PyTorch-MPS on a single machine.
Open the table →
M1 Pro → M3 Ultra, model and backend held equal. The one view where cross-chip speed is the point.
Compare chips →
Provenance, on every row.
Dataset revision, normalizer id@version, the full toolchain, content hashes, and sample counts travel with each result — reachable from any row. Runs with errors are flagged as lower-confidence, never silently averaged in. That is what separates a credible benchmark from a table of numbers.
Built on committed JSON and a shared schema.
Every result is a result.json and every comparison a parity-report.json, validated against versioned JSON Schemas. The TypeScript types the site reads are generated from those schemas, so the UI can’t drift from the data.
Streaming latency, diarization, and power/thermal columns are reserved in the schema today.
The schema is designed to support a future GET /v1/leaderboards endpoint and crowd aggregation (median ± CI over n_devices) with no redesign.
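To make "median ± CI over n_devices" concrete, here is one way such an aggregate could be computed. The percentile bootstrap is this example's choice, not something the source specifies, and the function and field names other than n_devices are hypothetical.

```python
import random
import statistics


def aggregate(values: list[float], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0) -> dict:
    """Median of one metric across devices, with a percentile-bootstrap interval.

    `values` holds one measurement per contributing device (for example, WER
    for the same model, backend, and chip reported by different machines).
    """
    if not values:
        raise ValueError("need at least one device result")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)

    # Resample the devices with replacement and record each resample's median.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    low_index = int((alpha / 2) * n_boot)
    high_index = min(n_boot - 1, int((1 - alpha / 2) * n_boot))

    return {
        "median": statistics.median(values),
        "ci_low": medians[low_index],
        "ci_high": medians[high_index],
        "n_devices": n,
    }
```

With only a handful of devices the interval is coarse, which is a reason to show n_devices next to it rather than the median alone.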
Pick the right local model — quality, speed, and the chip it runs on.
Open the leaderboard and start filtering by model, backend, quantization, and Apple Silicon.