How does AudioPod evaluate voice quality?

Six axes: fidelity, latency, language coverage, consent integrity, robustness, and cost-per-second. Trained listeners — native speakers per language where it matters — score outputs blind against a baseline before any change ships.

Do you publish benchmarks?

Yes, through the public catalog (135 voices, 100+ languages) and the citable research endpoint at /api/research.json. We publish both wins and regressions — if a change improves fidelity but degrades latency, both numbers ship.
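For readers who want to pull the published numbers programmatically, here is a minimal sketch of reading the research endpoint. Only the path /api/research.json comes from the answer above. The site origin passed in is a placeholder, the helper names are hypothetical, and the shape of the returned JSON is not assumed.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

RESEARCH_PATH = "/api/research.json"  # path taken from the answer above


def research_url(base_url):
    """Build the full endpoint URL from a site origin (the origin is an assumption)."""
    return urljoin(base_url, RESEARCH_PATH)


def fetch_research(base_url, timeout=10.0):
    """Fetch and parse the endpoint. The JSON structure is returned as-is, not assumed."""
    with urlopen(research_url(base_url), timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_research with the real site origin would return whatever the endpoint currently publishes; inspect the keys before relying on any of them.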

How many languages do you support?

100+ languages for TTS, with native pronunciation and prosody. Voice cloning currently spans 30+ languages.

How do you test for misuse and deepfake risk?

An internal red team runs against every release, deliberately trying to clone public figures, evade provenance signals, and bypass the takedown SLA. Failures gate the release.

Will you name the third-party components you use?

No — never in customer copy. We use components from the audio-AI ecosystem and swap them as better options arrive. The product is AudioPod, not the underlying components; the experience stays constant even when the seam beneath changes.

Research

How we decide what's good enough to ship.

Six axes, a four-step evaluation process, and a refusal to ship features that look better on a benchmark but worse for a working creator. This is AudioPod's research methodology — the floor under everything in the changelog.


Six axes

The dimensions we benchmark on.

Not 'how high can the MOS go'. The question we ask is whether a working creator would accept the output, in the language they ship in, on the latency budget their workflow can absorb, at a unit cost their pricing supports.

Fidelity

Does the output sound like a human, on a microphone the listener trusts? Measured against human-recorded reference takes by trained listeners, using an internal MOS-style rubric.
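As an illustration of how listener ratings might be summarized, here is a rough sketch of a mean-opinion-style score with its standard error. The 1 to 5 scale and the function name are assumptions; the internal rubric itself is not described on this page.

```python
import math
import statistics


def mos_summary(ratings):
    """Summarize listener ratings as a mean score plus its standard error.

    Assumes a 1-5 rating scale (an assumption, not the documented rubric).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the assumed 1-5 scale")
    mean = statistics.mean(ratings)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return {"mos": mean, "sem": sem, "n": len(ratings)}
```

Reporting the standard error alongside the mean makes it harder to read a small listener panel as a decisive result.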

Latency

p50 and p95 time-to-first-audio and total render time, measured per voice and per language. Latency budgets are set per use case (live agents vs. batch audiobook).
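To make the percentile language concrete, here is a small sketch that computes p50 and p95 from time-to-first-audio samples and checks them against a budget. It uses the nearest-rank percentile definition, which is one common choice and not necessarily the one used internally. The function names and the budget value are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(ttfa_ms, budget_p95_ms):
    """Report p50 and p95 and whether p95 fits the (hypothetical) use-case budget."""
    p50 = percentile(ttfa_ms, 50)
    p95 = percentile(ttfa_ms, 95)
    return {"p50_ms": p50, "p95_ms": p95, "within_budget": p95 <= budget_p95_ms}
```

A single slow render barely moves p50 but dominates p95, which is why both are tracked.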

Language coverage

100+ languages today. Tracked by intelligibility (native-speaker pass rate) and prosody (does the language sound *spoken*, not transliterated).
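A native-speaker pass rate can be sketched as a simple per-language ratio. The input format below, a list of (language, passed) judgments, is an assumption made for illustration.

```python
from collections import defaultdict


def pass_rates(judgments):
    """Compute the fraction of passing judgments per language.

    judgments: iterable of (language_code, passed) pairs (hypothetical format).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for language, passed in judgments:
        totals[language] += 1
        if passed:
            passes[language] += 1
    return {language: passes[language] / totals[language] for language in totals}
```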

Consent integrity

Does the consent flow actually catch the cases it claims to — public-figure uploads, non-owner clones, training opt-outs? Continuously red-teamed.

Robustness

Does the output degrade gracefully under noise, accents, code-switching, low-quality reference audio? Failure modes documented, not hidden.

Cost-per-second

Output cost normalized to one minute of finished audio. We track this because creator-pricing requires it; we don't ship features that quietly break the unit economics.
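The normalization described above is plain arithmetic: divide total cost by the seconds of finished audio, then scale to 60 seconds. A minimal sketch, with a hypothetical function name and illustrative figures:

```python
def cost_per_finished_minute(total_cost, finished_audio_seconds):
    """Normalize a render's total cost to one minute of finished audio."""
    if finished_audio_seconds <= 0:
        raise ValueError("finished audio duration must be positive")
    return total_cost / finished_audio_seconds * 60
```

For example, a render that cost 0.30 in some currency and produced 90 seconds of finished audio works out to 0.20 per minute.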

Process

Baseline. Blind. Red-team. Document.

Baseline

Before any new voice model or pipeline change lands, we capture the current numbers on a fixed evaluation set. No baselines, no shipping.

Blind A/B

Trained listeners — native speakers per language where it matters — rate outputs against the baseline blind. Internal A/B before any external A/B.
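One way to keep such a comparison blind is to randomize which clip appears as "A" and which as "B", and to hold the answer key apart from the listeners. The sketch below assumes that setup; the function names and data shapes are hypothetical.

```python
import random


def make_blind_trials(pairs, seed):
    """Shuffle baseline/candidate into unlabeled A/B slots.

    pairs: list of (baseline_clip, candidate_clip).
    Returns (trials, key), where key[i] says which slot holds the candidate.
    """
    rng = random.Random(seed)
    trials, key = [], []
    for baseline_clip, candidate_clip in pairs:
        candidate_is_a = rng.random() < 0.5
        if candidate_is_a:
            trials.append({"A": candidate_clip, "B": baseline_clip})
        else:
            trials.append({"A": baseline_clip, "B": candidate_clip})
        key.append("A" if candidate_is_a else "B")
    return trials, key


def candidate_win_rate(key, choices):
    """Unblind after rating: fraction of trials where the listener preferred the candidate."""
    if len(key) != len(choices) or not choices:
        raise ValueError("key and choices must be the same non-zero length")
    wins = sum(1 for slot, choice in zip(key, choices) if slot == choice)
    return wins / len(choices)
```

Listeners see only the trials; the key is consulted once all choices are recorded.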

Red-team safety

Consent and safety classifiers face an internal red team that explicitly tries to clone public figures, evade provenance, and bypass the takedown SLA.
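A release gate of this kind can be sketched as a check that every red-team attempt was blocked. The result format and function name below are assumptions for illustration, not the internal tooling.

```python
def release_allowed(red_team_results):
    """Return (allowed, failures). Any attack that was not blocked gates the release.

    red_team_results: list of dicts with "attack" (str) and "blocked" (bool),
    a hypothetical format.
    """
    failures = [result["attack"] for result in red_team_results if not result["blocked"]]
    return (len(failures) == 0, failures)
```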

Document and ship

Findings — including regressions we caught and reverted — get a changelog entry. We're explicit when something improves on one axis and degrades on another.

What we won't do

Two practices we refuse, no matter how good the numbers.

Cherry-picking benchmarks

We don't cite a single number on a single dataset as proof of model quality. If a change improves fidelity by 5% but degrades latency by 30%, we publish both. Marketing copy follows the worst honest number, not the best one.
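Publishing both numbers means computing the change on every axis, not just the flattering one. Here is a sketch of such a per-axis report. The axis names, their directions, and the sample figures are hypothetical, chosen to mirror the 5% and 30% example above.

```python
# Hypothetical axis registry: True means a higher value is better.
HIGHER_IS_BETTER = {"fidelity_mos": True, "p95_latency_ms": False}


def axis_report(baseline, candidate):
    """Report the percent change and a verdict for every tracked axis."""
    report = {}
    for axis, higher_is_better in HIGHER_IS_BETTER.items():
        change_pct = (candidate[axis] - baseline[axis]) / baseline[axis] * 100
        if change_pct == 0:
            verdict = "unchanged"
        elif (change_pct > 0) == higher_is_better:
            verdict = "improved"
        else:
            verdict = "regressed"
        report[axis] = {"change_pct": round(change_pct, 1), "verdict": verdict}
    return report
```

With a baseline of 4.0 fidelity and 400 ms p95 latency, and a candidate at 4.2 and 520 ms, the report shows fidelity improved by 5.0% and latency regressed by 30.0%, and both rows are published.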

Naming third-party SKUs in customer copy

We use components from the audio-AI ecosystem — and we'll change them when something better arrives. Customer copy never names the underlying components, because (a) it's a moving target and (b) the product is AudioPod, not whatever we ran the experiment against this week.

Adjacent

If this page interested you.

Where the research methodology touches the rest of the surface.

Responsible AI

Where consent integrity gets measured in practice.

Trust & security

How research findings on data handling become operating policy.

Manifesto

Why methodology matters more than headline numbers.

Common questions

Research FAQ

Read the changelog.

The methodology is the floor. The changelog is what crossed it.

See the changelog