Six axes
Not 'how high can the MOS score go'. The question we ask is whether a working creator would accept the output, in the language they ship in, on the latency budget their workflow can absorb, at a unit cost their pricing supports.
Fidelity
Does the output sound like a human, on a microphone the listener trusts? Measured against human-recorded reference takes by trained listeners + an internal MOS-style rubric.
Latency
p50 and p95 time-to-first-audio and total render time per voice, per language. Latency budgets per use case (live agents vs batch audiobook).
Language coverage
85+ languages today. Tracked by intelligibility (native-speaker pass rate) and prosody (does the language sound *spoken*, not transliterated).
Consent integrity
Does the consent flow actually catch the cases it claims to — public-figure uploads, non-owner clones, training opt-outs? Continuously red-teamed.
Robustness
Does the output degrade gracefully under noise, accents, code-switching, low-quality reference audio? Failure modes documented, not hidden.
Cost-per-second
Output cost normalized to one minute of finished audio. We track this because creator-pricing requires it; we don't ship features that quietly break the unit economics.
Process
Before any new voice model or pipeline change lands, we capture the current numbers on a fixed evaluation set. No baselines, no shipping.
Trained listeners — native speakers per language where it matters — rate outputs against the baseline blind. Internal A/B before any external A/B.
Consent and safety classifiers face an internal red team that explicitly tries to clone public figures, evade provenance, and bypass the takedown SLA.
Findings — including regressions we caught and reverted — get a changelog entry. We're explicit when something improves on one axis and degrades on another.
What we won't do
We don't cite a single number on a single dataset as proof of model quality. If a change improves fidelity by 5% but degrades latency by 30%, we publish both. Marketing copy follows the worst honest number, not the best one.
We use components from the audio-AI ecosystem — and we'll change them when something better arrives. Customer copy never names the underlying components, because (a) it's a moving target and (b) the product is AudioPod, not whatever we ran the experiment against this week.
Adjacent
Where the research methodology touches the rest of the surface.