Local ASR / STT benchmark · Apple Silicon

Speech models, measured where they actually run.

omni-bench scores local ASR on quality and speed — pinned to the exact backend, quantization, and Apple chip it runs on. Then it verifies parity: same model, different implementation — do the transcripts still match?

Explore the leaderboard → · How it works

M1 → M4 · MLX · llama.cpp · Core ML · PyTorch-MPS

Result · one scored run · ✓ parity PASS

nvidia/parakeet-tdt-0.6b

parakeet-tdt-0.6b-mlx-q4

q4 · swift · MLX · M4 Max · 64GB · FLEURS · ru

WER ↓: 6.1%

CER: 2.4%

RTFx ↑: 42.1×

sha256:8bbc0ff543633… · whisper-basic@0.1.12 · micro

The north star

A leaderboard row is never “a model.” It’s a tuple.

Two of its axes — backend and hardware — move both speed and quality dramatically on local inference. So the whole site is built to make the full tuple legible and comparable, never collapsed into one number.

base_model · artifact · implementation · backend · quantization · hardware · OS · task · run_profile

moves speed and quality — a column and a filter, never a footnote
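One way to keep the tuple from collapsing is to treat it as a single immutable record and derive a row key from all nine fields. The sketch below is illustrative only: the `RunTuple` class, the `identity_key` function, and the hashing recipe (SHA-256 over sorted-key JSON) are assumptions, not the project's actual scheme. The mapping of the sample card's chips onto fields is a best reading of the page, and the OS value is a placeholder.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunTuple:
    """The nine fields that together identify one leaderboard row."""
    base_model: str
    artifact: str
    implementation: str
    backend: str
    quantization: str
    hardware: str
    os: str  # listed as "OS" on the page
    task: str
    run_profile: str


def identity_key(t: RunTuple) -> str:
    """Hash every field, so changing any single axis yields a different key."""
    canonical = json.dumps(asdict(t), sort_keys=True, ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


row = RunTuple(
    base_model="nvidia/parakeet-tdt-0.6b",
    artifact="parakeet-tdt-0.6b-mlx-q4",
    implementation="swift",
    backend="MLX",
    quantization="q4",
    hardware="M4 Max · 64GB",
    os="macOS (placeholder)",
    task="FLEURS · ru",
    run_profile="micro",
)

# Same model and artifact, different backend: a different row, not a footnote.
other = replace(row, backend="Core ML")
print(identity_key(row) != identity_key(other))
```

Because the record is frozen and the key covers every field, two results can only share a row if the whole tuple matches.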

What it measures

Quality and speed are separate axes.

There is no single composite “AI score.” A fast model with poor accuracy does not quietly win.

01 · Quality

WER, normalized

Headline accuracy is word error rate — lower is better — after NFC normalization and a pinned whisper-basic normalizer (Russian keeps ё/й). CER and cased wer_ortho sit alongside it.
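To make the role of normalization concrete, here is a minimal pure-Python sketch of WER and CER. The `normalize` function is a simplified stand-in (NFC, lowercase, punctuation stripped), not the pinned whisper-basic normalizer, and the token-level Levenshtein distance is the textbook definition rather than the project's scoring code.

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Stand-in normalizer: NFC, lowercase, drop punctuation, split on whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware for str patterns
    return text.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over tokens: substitutions, deletions, insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free if tokens match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    ref = list(" ".join(normalize(reference)))
    hyp = list(" ".join(normalize(hypothesis)))
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


# "й" typed as "и" + combining breve is composed by NFC, so it is not an error.
print(wer("Мой чай", "мои\u0306 чай"))
# "ё" is kept distinct from "е", so this counts as one substitution in two words.
print(wer("ёлка стоит", "елка стоит"))
```

Running NFC before the punctuation filter matters here: a decomposed й carries a combining mark that `\w` does not match, so skipping NFC would split the word and inflate the error count.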

02 · Speed

RTFx, hardware-scoped

rtfx_native is comparable across implementations; rtfx_wall is comparable only within a single implementation. Outside the dedicated hardware view, neither is compared across different Apple chips: the leaderboard blocks or flags the attempt.
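The scoping rules can be sketched as a small guard. RTFx is taken here in its usual sense, seconds of audio processed per second of compute. The page does not define what each timer includes, so the sketch treats `rtfx_native` and `rtfx_wall` only as two named metrics with different comparability scopes. `SpeedRow` and `comparable` are hypothetical names.

```python
from dataclasses import dataclass


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio seconds handled per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


@dataclass(frozen=True)
class SpeedRow:
    implementation: str
    hardware: str


def comparable(a: SpeedRow, b: SpeedRow, metric: str) -> bool:
    """Return True when two rows may be ranked against each other on `metric`."""
    if a.hardware != b.hardware:
        return False  # different chips: block or flag the comparison
    if metric == "rtfx_native":
        return True  # same chip: comparable across implementations
    if metric == "rtfx_wall":
        return a.implementation == b.implementation  # only within one implementation
    raise ValueError(f"unknown metric: {metric}")


swift_m4 = SpeedRow("swift", "M4 Max")
python_m4 = SpeedRow("python", "M4 Max")
swift_m1 = SpeedRow("swift", "M1 Pro")

print(comparable(swift_m4, python_m4, "rtfx_native"))  # same chip, native timer
print(comparable(swift_m4, python_m4, "rtfx_wall"))    # wall timer, two implementations
print(comparable(swift_m4, swift_m1, "rtfx_native"))   # two different chips
```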

03 · Parity

Did the port change results?

Hold the model equal, vary the backend or chip, and check that transcripts stay equivalent within explicit tolerances — a clear PASS / FAIL for validating a CUDA→MLX or PyTorch→Core ML port.
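A parity check of this kind could look like the sketch below. The two tolerances (an absolute WER delta and a minimum share of identical normalized transcripts) and their default values are assumptions chosen for illustration; the page states only that tolerances are explicit, not what they are.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tolerances:
    max_wer_delta: float = 0.005  # hypothetical: 0.5 points absolute
    min_match_rate: float = 0.98  # hypothetical: share of identical transcripts


def parity(ref_texts, cand_texts, ref_wer, cand_wer, tol=Tolerances()):
    """Compare a candidate port against a reference run on the same samples."""
    if len(ref_texts) != len(cand_texts) or not ref_texts:
        raise ValueError("parity needs the same non-empty sample set on both sides")
    matches = sum(r == c for r, c in zip(ref_texts, cand_texts))
    match_rate = matches / len(ref_texts)
    wer_delta = abs(cand_wer - ref_wer)
    passed = wer_delta <= tol.max_wer_delta and match_rate >= tol.min_match_rate
    return {
        "verdict": "PASS" if passed else "FAIL",
        "wer_delta": wer_delta,
        "match_rate": match_rate,
    }


reference = ["привет мир", "как дела", "спасибо", "до свидания"]
drifted = ["привет мир", "как дела", "спасибо", "до свиданья"]

print(parity(reference, list(reference), 0.061, 0.062)["verdict"])  # PASS
print(parity(reference, drifted, 0.061, 0.062)["verdict"])          # FAIL
```

Reporting the measured delta and match rate next to the verdict keeps a FAIL actionable: it shows whether a port drifted slightly on many samples or badly on a few.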

Leaderboards

One table, several questions.

The same result set, sliced by preset filters — each answering a different local-inference decision.

Model × Hardware

The best local model on a given Mac — fix the chip and backend, rank by WER then speed.

Open on M4 Max →

Backend shootout

Fix one artifact, compare MLX vs llama.cpp vs Core ML vs PyTorch-MPS on a single machine.

Open the table →

Hardware leaderboard

M1 Pro → M3 Ultra, model and backend held equal. The one view where cross-chip speed is the point.

Compare chips →
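The Model × Hardware preset described above amounts to a filter followed by a two-key sort. A minimal sketch, assuming each row is a plain dictionary; the field names and all four rows are made-up placeholders, not leaderboard results.

```python
def rank(rows, hardware, backend):
    """Fix the chip and backend, then order by WER (lower first), then RTFx (higher first)."""
    selected = [r for r in rows if r["hardware"] == hardware and r["backend"] == backend]
    return sorted(selected, key=lambda r: (r["wer"], -r["rtfx"]))


# Placeholder rows for illustration only.
rows = [
    {"artifact": "model-a-q4", "hardware": "M4 Max", "backend": "MLX", "wer": 0.070, "rtfx": 50.0},
    {"artifact": "model-b-q8", "hardware": "M4 Max", "backend": "MLX", "wer": 0.061, "rtfx": 30.0},
    {"artifact": "model-c-q4", "hardware": "M4 Max", "backend": "MLX", "wer": 0.061, "rtfx": 42.0},
    {"artifact": "model-a-q4", "hardware": "M1 Pro", "backend": "MLX", "wer": 0.070, "rtfx": 20.0},
]

for r in rank(rows, hardware="M4 Max", backend="MLX"):
    print(r["artifact"], r["wer"], r["rtfx"])
```

Speed only breaks ties in this ordering, so the fastest row (model-a-q4) still ranks last among the M4 Max entries because its WER is higher.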

Trust

Provenance, on every row.

Dataset revision, normalizer id@version, the full toolchain, content hashes, and sample counts travel with each result — reachable from any row. Runs with errors are flagged as lower-confidence, never silently averaged in. That is what separates a credible benchmark from a table of numbers.

scoring provenance
dataset.revision: 4da2b97412f…
normalizer: whisper-basic@0.1.12
unicode_form: NFC
toolchain.jiwer: 4.0.0
counts: 349 ok · 1 error
identity_key: sha256:8bbc0ff…
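The lower-confidence flag can be derived directly from the sample counts. A minimal sketch, assuming any run with at least one errored sample is flagged; the function name, the returned field names, and that threshold are illustrative.

```python
def confidence(ok: int, error: int) -> dict:
    """Summarize sample counts and flag runs that contain errored samples."""
    total = ok + error
    if total == 0:
        raise ValueError("run has no samples")
    return {
        "n_scored": ok,
        "n_total": total,
        "error_rate": error / total,
        "flag": "lower-confidence" if error > 0 else "ok",
    }


# Counts from the provenance card above: 349 ok, 1 error.
summary = confidence(ok=349, error=1)
print(summary["flag"], summary["n_total"])
```

Keeping `n_scored` and `n_total` side by side lets a reader see how many samples the headline WER rests on, rather than trusting an average that silently skipped failures.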

Open by design

Built on committed JSON and a shared schema.

Every result is a result.json and every comparison a parity-report.json, validated against versioned JSON Schemas.

Schemas are the contract

TypeScript types are generated from the JSON Schemas, so the UI can’t drift from the data.

Room to grow

Streaming latency, diarization, and power/thermal columns are reserved in the schema today.

Built for a read API

Designed for a future GET /v1/leaderboards and crowd aggregation — median ± CI over n_devices — with no redesign.
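For the crowd-aggregation case, a median with a confidence interval over per-device results might be computed as below. The percentile bootstrap is an assumed method, since the page does not say how the interval would be derived, and the input values are placeholders rather than measurements.

```python
import random
import statistics


def median_with_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Return (median, lower, upper) using a percentile bootstrap over devices."""
    if len(values) < 2:
        raise ValueError("need results from at least two devices")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(values), lower, upper


# Placeholder RTFx values from five hypothetical devices with the same chip.
per_device = [40.2, 41.0, 42.1, 43.5, 44.0]
med, lower, upper = median_with_ci(per_device)
print(f"median {med} (95% CI {lower} to {upper}, n_devices={len(per_device)})")
```

With only a handful of devices the interval is coarse, because every bootstrap median is one of the observed values; that is one reason to report n_devices next to it.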

schemas/result.schema.json · schemas/parity-report.schema.json · Read the methodology →

Pick the right local model — quality, speed, and the chip it runs on.

Open the leaderboard and start filtering by model, backend, quantization, and Apple Silicon.

Explore the leaderboard →