
Fireworks Streaming Transcription: 300ms with Whisper-v3-large-quality
By Fireworks AI Team | 1/23/2025
Today, we’re launching a new streaming speech-to-text API for real-time use cases like voice agents and live captioning!
Audio is a critical way to connect with AI applications. Much of the world’s information exists as audio, and speaking is a natural way to interact with AI. We’ve heard numerous requests from customers for ways to transcribe audio, like lecture recordings and voice chats, so LLMs can reason over that data.
This led us to recently release the market’s fastest Whisper endpoint (as measured by Artificial Analysis), which transcribes 1 hour of audio in 4 seconds. Until now, this service was only available for async transcription, where you upload an audio file and get back a complete transcription. Async transcription is great for use cases like summarizing podcasts or meeting recordings.
However, some use cases need transcription as soon as words are spoken. Call center agents need immediate transcripts to guide ongoing calls. Live broadcasts benefit from up-to-the-second captions that keep viewers engaged. Delayed responses can limit automation and slow user interactions. We’ve built a faster, streaming solution to unlock these use cases.
How it works: Customers establish a WebSocket connection and stream audio in 50-500ms chunks. The API receives these chunks, transcribes them in near real time, and returns incremental text segments. Applications can display intermediate transcripts, drive automation, and integrate the data into downstream processes. This continuous feedback loop keeps conversations fluid and minimizes disruption for users.
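To make the incremental-segment model concrete, here is a minimal sketch of how a client might merge streaming updates into a running transcript. The message payloads shown are illustrative only (they mirror the segments list, with id and text fields, used in the full example later in this post), not a complete specification of the API’s response format.

# Minimal sketch: merging incremental transcript segments keyed by id.
# The payloads below are illustrative; see the full WebSocket example later in this post.
import json

transcript = {}  # segment id -> latest text for that segment

def merge(message: str) -> str:
    """Apply one streaming update and return the transcript so far."""
    msg = json.loads(message)
    for seg in msg.get("segments", []):
        transcript[seg["id"]] = seg["text"]  # later messages may revise earlier segments
    return " ".join(transcript.values())

print(merge('{"segments": [{"id": "0", "text": "Call center agents need"}]}'))
print(merge('{"segments": [{"id": "0", "text": "Call center agents need immediate"}, {"id": "1", "text": "transcripts"}]}'))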
Check out this live demo of our streaming UI to see it in action.
Why use Fireworks:
We offer upgraded serverless endpoints and dedicated deployments. Contact us if you need higher rate limits, SLAs, lower bulk pricing, or single-tenant serving.
Get started with the serverless streaming Fireworks audio endpoint today in code via our docs or our realtime transcription streaming notebook. Alternatively, get started with a button click in our UI playground, which lets you directly record or upload audio and run it through Fireworks’ audio APIs, and get the code to recreate your UI call in one click.
Using the endpoint in code is simple. Check out the example below: it streams short audio chunks (50-400ms) as binary frames of 16-bit little-endian PCM at a 16kHz sample rate with a single channel (mono), and in parallel receives transcriptions back over the WebSocket.
!pip3 install requests torch torchaudio websocket-client

import io
import time
import json
import torch
import requests
import torchaudio
import threading
import websocket
import urllib.parse

chunk_size_ms = 100        # Duration of each audio chunk in milliseconds
audio_chunk_bytes = <...>  # Construct audio chunks (PCM 16-bit little-endian, 16kHz, mono)

lock = threading.Lock()
segments = {}

def on_open(ws):
    # Stream audio chunks on a background thread, pacing them in real time
    def send_audio_chunks():
        for chunk in audio_chunk_bytes:
            ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
            time.sleep(chunk_size_ms / 1000.0)
        # Give the server a moment to return the final segments, then close
        time.sleep(2)
        ws.close()

    threading.Thread(target=send_audio_chunks).start()

def on_message(ws, message):
    # Merge new segments with existing segments (later messages may revise earlier text)
    msg = json.loads(message)
    new_segments = {seg["id"]: seg["text"] for seg in msg.get("segments", [])}
    with lock:
        segments.update(new_segments)
        print(json.dumps(segments, indent=2))

def on_error(ws, error):
    print(f"WebSocket error: {error}")

# Open a connection URL with query params
url = "ws://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming"
params = urllib.parse.urlencode({
    "model": "whisper-v3",
    "language": "en",
})

ws = websocket.WebSocketApp(
    f"{url}?{params}",
    header={"Authorization": "<YOUR_FIREWORKS_API_KEY>"},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()
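For reference, here is one way the audio_chunk_bytes placeholder above could be constructed. This is a sketch only: it assumes a hypothetical local file, sample.wav, and uses torchaudio (already imported above) to downmix to mono, resample to 16kHz, convert to 16-bit PCM, and slice the result into fixed-size chunks.

# Sketch: one possible way to build audio_chunk_bytes from a local file.
# "sample.wav" is a hypothetical path; any source of 16-bit, 16kHz, mono PCM works.
chunk_size_ms = 100  # Must match the pacing used when sending chunks

waveform, sample_rate = torchaudio.load("sample.wav")  # (channels, samples), float32 in [-1, 1]
waveform = waveform.mean(dim=0, keepdim=True)  # Downmix to mono
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)  # Resample to 16kHz
pcm16 = (waveform.clamp(-1, 1) * 32767).to(torch.int16)  # Convert to 16-bit PCM

# Slice into chunk_size_ms chunks of raw bytes (native byte order is little-endian on most platforms)
samples_per_chunk = int(16000 * chunk_size_ms / 1000)
audio_chunk_bytes = [
    pcm16[:, i:i + samples_per_chunk].numpy().tobytes()
    for i in range(0, pcm16.shape[1], samples_per_chunk)
]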
Our streaming audio service pairs naturally with our other services, like text inference, to power use cases such as live voice agents. Streaming audio can be one component of a broader compound AI approach that integrates speech, text, image, and specialized models.
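As a sketch of that pairing: once a merged transcript is available, it can be forwarded to a Fireworks-hosted LLM through the OpenAI-compatible chat completions endpoint. The model name and prompt below are illustrative placeholders rather than a prescribed setup.

# Sketch: forward the merged transcript to a Fireworks-hosted LLM for downstream reasoning.
# The model name is an illustrative placeholder; substitute any chat model you deploy.
import requests

def summarize_transcript(transcript: str, api_key: str) -> str:
    response = requests.post(
        "https://api.fireworks.ai/inference/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",  # placeholder
            "messages": [
                {"role": "system", "content": "Summarize the caller's request in one sentence."},
                {"role": "user", "content": transcript},
            ],
        },
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# For a live voice agent, this could be called whenever the streaming transcript stabilizes.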
Fireworks makes it easy to build compound AI systems by providing one place to run all of these models.
Keep in touch with us on Discord or Twitter. Stay tuned for more updates coming soon about Fireworks and audio!