MINARA

Synthesize speech

POST /v1/voice/tts

Streams synthesized speech for up to 4096 characters of text. The provider's response body is piped through unbuffered, so the first audio bytes arrive before synthesis completes. format is one of mp3 (default), opus, or wav; opus routes to the OpenAI provider (Telegram-style voice notes). Callers that synthesize sentence by sentence can pass previous_text and next_text (each truncated to 600 characters) so that ElevenLabs keeps prosody continuous across requests. They can also set first_chunk: true on the first sentence of a reply, which lets the gateway swap in the fastest model when the faster-start setting is on.
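The sentence-by-sentence pattern described above can be sketched as a small helper that turns one reply into a list of request bodies. This is only an illustration: split_sentences and sentence_payloads are hypothetical names, the regex splitter is deliberately naive, and trimming the context on the client (keeping the tail of the preceding text and the head of the following text) is an assumption, since the page only says the gateway truncates each field to 600 characters.

```python
from __future__ import annotations

import re

CONTEXT_LIMIT = 600  # documented truncation length for previous_text / next_text


def split_sentences(reply: str) -> list[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", reply.strip())
    return [p for p in parts if p]


def sentence_payloads(reply: str, fmt: str = "mp3") -> list[dict]:
    """Build one /v1/voice/tts request body per sentence of `reply`."""
    sentences = split_sentences(reply)
    payloads = []
    for i, sentence in enumerate(sentences):
        # Only the first sentence of the reply is flagged as first_chunk.
        body = {"text": sentence, "format": fmt, "first_chunk": i == 0}
        if i > 0:
            # Assumption: keep the text closest to this sentence (the tail).
            body["previous_text"] = " ".join(sentences[:i])[-CONTEXT_LIMIT:]
        if i + 1 < len(sentences):
            # Assumption: keep the text closest to this sentence (the head).
            body["next_text"] = " ".join(sentences[i + 1:])[:CONTEXT_LIMIT]
        payloads.append(body)
    return payloads
```

Each body can then be sent as its own POST, in order, so playback of the first sentence can begin while later sentences are still being synthesized.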

Method: POST
Path: /v1/voice/tts
Auth: Authorization: Bearer <token> (required when GATEWAY_AUTH_TOKEN is set)
Category: voice
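As a rough sketch of how the method, path, and auth rule fit together, the helper below builds (but does not send) a request with Python's standard urllib. The function name build_tts_request, the base URL, and the Content-Type: application/json header are assumptions for illustration; the page only documents the Authorization header, which is needed when the gateway has GATEWAY_AUTH_TOKEN set.

```python
import json
import urllib.request


def build_tts_request(base_url, body, token=None):
    """Return a POST request for /v1/voice/tts without sending it."""
    # Assumption: the gateway expects a JSON body with this content type.
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when the gateway was started with GATEWAY_AUTH_TOKEN.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/voice/tts",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen performs the call; omitting token leaves the Authorization header out entirely, which matches a gateway running without a token configured.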

Request body

{ "text": "...", "voice": "optional", "format": "mp3", "first_chunk": false, "previous_text": "optional", "next_text": "optional" }
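A client may want to check the documented limits before sending. The sketch below assembles a request body and rejects text over 4096 characters or an unknown format. make_tts_body is a hypothetical helper, and how the gateway itself responds to out-of-range input is not stated on this page, so the local checks are a convenience rather than a mirror of server behavior.

```python
MAX_TEXT_CHARS = 4096            # documented upper bound for `text`
FORMATS = ("mp3", "opus", "wav")  # documented values for `format`


def make_tts_body(text, voice=None, fmt="mp3", first_chunk=False,
                  previous_text=None, next_text=None):
    """Assemble a /v1/voice/tts body, leaving out unset optional fields."""
    # len() counts Unicode code points; the gateway may count differently.
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}")
    body = {"text": text, "format": fmt, "first_chunk": first_chunk}
    optional = (("voice", voice),
                ("previous_text", previous_text),
                ("next_text", next_text))
    for key, value in optional:
        if value is not None:
            body[key] = value
    return body
```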

Response body

audio bytes (audio/mpeg | audio/ogg | audio/wav)
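Because the audio is streamed, a client can write bytes out as they arrive instead of waiting for the full body. The following is a minimal sketch, assuming any file-like response object with a read method (such as the one returned by urllib.request.urlopen). copy_stream and extension_for are hypothetical helpers, and the mapping from content type to file extension is an illustrative choice, not something the gateway defines.

```python
import io

# Illustrative mapping from the documented content types to file extensions.
CONTENT_TYPE_EXT = {
    "audio/mpeg": ".mp3",
    "audio/ogg": ".ogg",
    "audio/wav": ".wav",
}


def extension_for(content_type):
    """Pick a file extension, ignoring parameters such as '; codecs=opus'."""
    base = content_type.split(";")[0].strip().lower()
    return CONTENT_TYPE_EXT.get(base, ".bin")


def copy_stream(response, sink, chunk_size=4096):
    """Copy chunks from `response` to `sink` as they arrive; return byte count."""
    total = 0
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # empty bytes means the stream has ended
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Usage with urllib (not executed here; `req` is a prepared POST request):
#   with urllib.request.urlopen(req) as resp:
#       ext = extension_for(resp.headers.get("Content-Type", ""))
#       with open("reply" + ext, "wb") as out:
#           copy_stream(resp, out)
```

A player that accepts a byte stream could take the place of the file sink, which is where the early first bytes matter most.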
