---
name: transcribe-audio
description: Transcribe audio with speaker diarization and word-level timestamps.
license: MIT
---

# Skill: transcribe-audio

Transcribe podcasts, meetings, interviews, or any speech audio with
speaker diarization and word-level timestamps.

## When to use

Invoke when the user wants:

- A text transcript of a podcast, meeting, lecture, or interview
- Speaker-separated transcripts ("Speaker 1: …, Speaker 2: …")
- Subtitles / captions for a video
- Translation of speech into another language

## API

`POST https://api.audiopod.ai/api/v1/transcription/jobs`

Headers:

- `Authorization: Bearer <access_token>`
- `Content-Type: multipart/form-data` (most HTTP clients set this header, including the required `boundary` parameter, automatically)

Body (form-data):

- `file`: audio or video file (up to 2 GB)
- `language`: ISO 639-1 code (or `auto`)
- `diarize`: `true` to enable speaker diarization (up to 10 speakers)
- `output_format`: `srt` | `vtt` | `txt` | `json`
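As a rough illustration of the request described above, the sketch below builds the multipart `POST` with only the Python standard library. The helper names `encode_multipart` and `build_transcription_request` are hypothetical. The sketch also assumes that omitting `diarize` leaves diarization off, which this document does not state. It builds the request without sending it.

```python
import urllib.request
import uuid

API_URL = "https://api.audiopod.ai/api/v1/transcription/jobs"


def encode_multipart(fields, file_name, file_bytes):
    """Encode text fields and one `file` part as multipart/form-data.

    Returns (body, content_type), where content_type carries the boundary.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        chunks.append(head.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
    file_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    chunks.append(file_head.encode("utf-8") + file_bytes + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(file_name, file_bytes, access_token,
                                language="auto", diarize=True,
                                output_format="json"):
    """Build (but do not send) the POST request for a transcription job."""
    if output_format not in ("srt", "vtt", "txt", "json"):
        raise ValueError(f"unsupported output_format: {output_format}")
    fields = {"language": language, "output_format": output_format}
    if diarize:
        # Assumption: leaving the field out means "no diarization".
        fields["diarize"] = "true"
    body, content_type = encode_multipart(fields, file_name, file_bytes)
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": content_type,
        },
    )
```

The request can be sent with `urllib.request.urlopen(req)`. The sketch stops there because the shape of the response body is not specified in this document. It also holds the whole file in memory, so uploads near the 2 GB limit would need a streaming client instead.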

## Response

Follow progress via server-sent events (SSE) at
`/api/v1/transcription/jobs/{job_id}/stream`, or poll the job endpoint
until the status is `COMPLETED`; the completed job includes a
downloadable transcript URL.
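Both ways of waiting for a result can be sketched without touching the network. In the minimal example below, `fetch_job` is a hypothetical callable that returns the job as a dict. The `status` field name and the `FAILED` value are assumptions. Only `COMPLETED` appears in this document. `iter_sse_data` reads `data:` lines as laid out in the SSE wire format and makes no assumption about what the payloads contain.

```python
import time


def wait_for_completion(fetch_job, interval_s=5.0, timeout_s=3600.0,
                        failed_statuses=("FAILED",), sleep=time.sleep):
    """Call fetch_job() until the job reports COMPLETED, then return it.

    fetch_job is supplied by the caller (for example, a GET on the job
    endpoint that returns parsed JSON). The "status" key and the FAILED
    value are assumptions, not documented fields.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job()
        status = job.get("status")
        if status == "COMPLETED":
            return job
        if status in failed_statuses:
            raise RuntimeError(f"transcription job ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("transcription job did not complete in time")
        sleep(interval_s)


def iter_sse_data(lines):
    """Yield the data payload of each server-sent event.

    `lines` is any iterable of decoded text lines from the stream.
    Multiple `data:` lines in one event are joined with newlines, and an
    event is emitted when a blank line ends it. Comment lines and other
    fields (event, id, retry) are ignored in this sketch.
    """
    buffered = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffered:
                yield "\n".join(buffered)
                buffered = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffered.append(value[1:] if value.startswith(" ") else value)
```

Passing `sleep` and `fetch_job` as parameters keeps the polling loop easy to exercise with fakes before pointing it at the real endpoint.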

## Accuracy

99.8% word accuracy on clean studio audio (English); 96–98% on noisy
field recordings that have first been processed with the `denoise-audio`
skill.
