xAI announced standalone Grok Speech to Text and Grok Text to Speech APIs, built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support.

This is a quieter but more practical move than a full voice-agent launch. Instead of saying ‘here is the whole talking bot,’ xAI is breaking voice into pieces developers can actually plug into products: transcription, streaming, generated speech, and expressive controls. Less stage magic, more parts bin. Useful trade.

Source: xAI's announcement.

STT is priced like infrastructure

For Speech to Text, xAI says developers can process large audio files through REST or transcribe in real time through WebSocket. Features include word-level timestamps, speaker diarization, multichannel support, and inverse text normalization for numbers, dates, currencies, and similar structured output.
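Word-level timestamps and diarization tend to arrive as a flat list of words, and most products want speaker turns instead. Here is a minimal sketch of that regrouping. The `Word` shape (text, start, end, speaker) and the "S1"/"S2" labels are assumptions made for illustration, not xAI's documented response schema.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float   # seconds from the start of the audio
    end: float
    speaker: str   # diarization label, e.g. "S1" (hypothetical format)


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: list[Word]) -> list[Turn]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: list[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            # Same speaker is still talking: extend the current turn.
            last = turns[-1]
            last.end = w.end
            last.text += " " + w.text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


# Made-up sample data standing in for a transcription result.
words = [
    Word("Hello", 0.0, 0.4, "S1"),
    Word("there", 0.4, 0.8, "S1"),
    Word("Hi", 1.0, 1.2, "S2"),
    Word("again", 1.5, 1.9, "S1"),
]
for turn in group_into_turns(words):
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

The same loop works for any schema that tags each word with a speaker; only the field names would need to change.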

Pricing is direct: $0.10 per hour for batch transcription and $0.20 per hour for streaming, according to xAI. That matters because audio workloads get very large very fast: meetings, calls, podcasts, support recordings. The whole modern workplace as a searchable regret archive.

  • batch and streaming transcription
  • word-level timestamps and speaker diarization
  • multichannel support
  • 25+ language support
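To see what the quoted rates mean at scale, here is a back-of-the-envelope sketch. The two rates come from xAI's stated pricing. The linear billing model and the call-center workload are assumptions for illustration, and real invoices may round or meter differently.

```python
BATCH_USD_PER_HOUR = 0.10      # batch (REST) rate quoted by xAI
STREAMING_USD_PER_HOUR = 0.20  # streaming (WebSocket) rate quoted by xAI


def stt_cost_usd(audio_hours: float, streaming: bool = False) -> float:
    """Estimated transcription cost, assuming simple linear billing."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    rate = STREAMING_USD_PER_HOUR if streaming else BATCH_USD_PER_HOUR
    return audio_hours * rate


# Hypothetical workload: 50 agents, 6 hours of recorded calls each, 22 workdays.
monthly_hours = 50 * 6 * 22  # 6,600 hours of audio
print(f"batch:     ${stt_cost_usd(monthly_hours):,.2f}")
print(f"streaming: ${stt_cost_usd(monthly_hours, streaming=True):,.2f}")
```

Under those assumptions, a month of calls comes to about $660 in batch mode and about $1,320 when streamed.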

TTS gets expressive controls

For Text to Speech, xAI says developers can generate long-form speech through REST or real-time speech through WebSocket. The API also supports inline speech tags such as [laugh], [sigh], and [whisper], along with controls for emphasis, slower delivery, and pauses, for more natural prosody.

TTS is priced at $4.20 per million characters. The important thing is not that a model can read text aloud. We cleared that bar. The important thing is whether the voice can be controlled enough for products, support flows, learning apps, and media tools without sounding like a haunted IVR.
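Per-character pricing is easy to sanity-check. The sketch below uses the rate xAI quotes. It assumes every character counts toward the bill, including spaces and inline tags like [sigh], which is a guess about the counting rules rather than something the announcement confirms.

```python
TTS_USD_PER_MILLION_CHARS = 4.20  # rate quoted by xAI


def tts_cost_usd(text: str) -> float:
    """Estimated synthesis cost for one piece of text.

    Assumption: every character, including spaces and inline tags such as
    [sigh], counts toward the billed total. Actual counting rules may differ.
    """
    return len(text) / 1_000_000 * TTS_USD_PER_MILLION_CHARS


# A single support-flow line with an inline tag.
line = "Thanks for calling. [sigh] Let me check that order for you."
print(f"{len(line)} characters -> ${tts_cost_usd(line):.6f}")

# A hypothetical 500,000-character audiobook manuscript (placeholder text).
book = "x" * 500_000
print(f"{len(book):,} characters -> ${tts_cost_usd(book):.2f}")
```

A single sentence costs a fraction of a cent, so the bill only becomes interesting for long-form or high-volume output.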

xAI is clearly treating voice as a platform layer: full voice agents on one side, modular STT and TTS APIs on the other. That gives developers more ways in and gives xAI more surfaces to prove the stack outside the Grok app.

The spectacle will still be the talking assistant. The business may be the boring APIs underneath it. Funny how often that happens.

In short

xAI released standalone speech-to-text and text-to-speech APIs with pricing, diarization, timestamps, multilingual support, and expressive speech tags. Translation: voice is moving from flashy agent demos into reusable infrastructure.