Real-time Automatic Speech Recognition (ASR)
Welcome to Thai Real-time ASR - our WebSocket-based real-time Automatic Speech Recognition service. This service converts audio stream data into text stream data in real-time, supporting both file streaming and direct microphone capture.

Key Features
- Real-time streaming over WebSocket: Connect to `wss://api.iapp.co.th/v1/audio/stt/rt` and stream audio to get live transcription.
- Low-latency updates: Receive `SegmentResult` messages with `is_final` status as you speak.
- Clear timing and IDs: Every segment includes `segment_id`, `start_time`, and `end_time` for easy alignment.
- Language selection: Choose the spoken language via `transcribe_lang` (supported languages: `th`, `en`, `zh`).
- Smart sentence detection: Built-in VAD detects speech/silence to finalize segments accurately.
- Speaker hints: The `speaker_no` field carries speaker-change hints for basic diarization.
- Session summary: A `SessionResult` is sent once on close, containing all finalized segments.
- Graceful stop: End from the client side by sending `{ "action": "stop" }`; the server then closes cleanly.
Installing PyAudio
PyAudio provides Python bindings for PortAudio (required to access the microphone). Install PortAudio first, then PyAudio.
- Recommended: use a virtual environment.
- Upgrade pip/setuptools/wheel first to avoid build issues:

```shell
python -m pip install --upgrade pip setuptools wheel
```
Linux
Debian/Ubuntu:

```shell
sudo apt-get update
sudo apt-get install -y python3-dev portaudio19-dev
python -m pip install pyaudio
```

Fedora/RHEL/CentOS (dnf):

```shell
sudo dnf install -y portaudio-devel python3-devel
python -m pip install pyaudio
```

Arch/Manjaro:

```shell
sudo pacman -S --noconfirm portaudio
python -m pip install pyaudio
```
macOS
Using Homebrew:

```shell
brew install portaudio
# If build tools can’t find headers/libs, export flags (optional):
export LDFLAGS="-L$(brew --prefix portaudio)/lib"
export CPPFLAGS="-I$(brew --prefix portaudio)/include"
python -m pip install pyaudio
```

Notes:
- On Apple Silicon (M1/M2/M3), ensure you’re using a matching Python (arm64) and Homebrew under /opt/homebrew.
- If you still face build issues after installing `portaudio`, try forcing a source build:

```shell
python -m pip install --no-binary :all: pyaudio
```
Windows
Most systems:

```shell
python -m pip install pyaudio
```

If that fails (missing build tools), use prebuilt wheels via pipwin:

```shell
python -m pip install pipwin
python -m pipwin install pyaudio
```
Verify installation
```shell
python -c "import pyaudio, sys; print('PyAudio OK, version:', getattr(pyaudio, '__version__', 'n/a')); sys.exit(0)"
```
Connecting to the Server
Clients must connect to the server using the WebSocket protocol (WSS or WS).
Endpoint URL
`wss://api.iapp.co.th/v1/audio/stt/rt`
Query Parameters
Clients can configure the system by sending query parameters along with the URL during connection:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `transcribe_lang` | string | `"th"` | Language code of the spoken language (supported languages: `"th"`, `"en"`, `"zh"`) |
| `speaker_hints` | boolean | `false` | Enable speaker hints |
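As a sketch, the query string can be assembled with the standard library's `urllib.parse`; the parameter names are those from the table above:

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.iapp.co.th/v1/audio/stt/rt"

def build_url(transcribe_lang: str = "th", speaker_hints: bool = False) -> str:
    """Build the WebSocket URL with the query parameters from the table."""
    params = {
        "transcribe_lang": transcribe_lang,
        "speaker_hints": str(speaker_hints).lower(),  # send booleans as "true"/"false"
    }
    return f"{BASE_URL}?{urlencode(params)}"

print(build_url("en", True))
# wss://api.iapp.co.th/v1/audio/stt/rt?transcribe_lang=en&speaker_hints=true
```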
Sending Audio Data
After connecting successfully, the client must continuously send audio data as binary frames (bytes).
Audio Format
- Sample Rate: 16,000 Hz
- Channels: 1 (Mono)
- Bit Depth: 16-bit signed integer (PCM)
The system is designed to support chunked audio processing. The recommended chunk size is 1024 bytes per chunk for smooth streaming and reduced latency.
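Before streaming a pre-recorded file, it can help to verify it matches this format. A small helper using only the stdlib `wave` module (a sketch, not part of the API):

```python
import wave

def check_format(path: str) -> bool:
    """Return True if the WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as wf:
        return (wf.getframerate() == 16000
                and wf.getnchannels() == 1
                and wf.getsampwidth() == 2)  # 2 bytes = 16-bit samples
```

Files in other formats can be converted first (e.g. with ffmpeg or sox) before streaming.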
Example: stream microphone audio to the endpoint and print live results (pass your API key in the `apikey` header):

```python
import asyncio, json, signal

import pyaudio
import websockets
from websockets.asyncio.client import connect

CHUNK = 1024              # frames per buffer read from the microphone
RATE = 16000              # 16 kHz sample rate, as required by the service
CHANNELS = 1              # mono
FORMAT = pyaudio.paInt16  # 16-bit signed PCM

async def stream_mic(ws, stream, stop_evt: asyncio.Event):
    """Read audio from the microphone and send it as binary frames."""
    try:
        while not stop_evt.is_set():
            buff = stream.read(CHUNK, exception_on_overflow=False)
            await ws.send(buff)
            await asyncio.sleep(0)  # yield control to the event loop
    except websockets.exceptions.ConnectionClosed:
        pass
    finally:
        # Ask the server to finalize the session and close cleanly.
        try:
            await ws.send(json.dumps({"action": "stop"}))
            await asyncio.wait_for(ws.wait_closed(), timeout=5)
        except Exception:
            pass

async def recv_loop(ws):
    """Print every message; save the final SessionResult to a JSON file."""
    try:
        async for msg in ws:
            data = json.loads(msg)
            print(data)
            if data.get("type") == "SessionResult":
                with open(f"{data['session_id']}_result.json", "w", encoding="utf-8") as f:
                    json.dump(data, f, indent=2, ensure_ascii=False)
    except websockets.exceptions.ConnectionClosed:
        pass

async def main():
    url = "wss://api.iapp.co.th/v1/audio/stt/rt"
    stop_evt = asyncio.Event()
    try:
        asyncio.get_running_loop().add_signal_handler(signal.SIGINT, stop_evt.set)
    except NotImplementedError:
        pass  # some platforms (e.g. Windows) do not support loop signal handlers

    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    try:
        async with connect(url, ping_timeout=None,
                           additional_headers={"apikey": "YOUR_API_KEY"}) as ws:
            print("Recording... Press Ctrl+C to stop")
            sender = asyncio.create_task(stream_mic(ws, stream, stop_evt))
            receiver = asyncio.create_task(recv_loop(ws))
            await asyncio.wait({sender, receiver}, return_when=asyncio.FIRST_COMPLETED)
            stop_evt.set()
            await asyncio.gather(sender, receiver, return_exceptions=True)
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()

if __name__ == "__main__":
    asyncio.run(main())
```
Receiving Data from the Server
The server sends back messages as JSON strings. The client must parse the JSON to use the data. There are two main message types:
1. Segment Result
Interim results produced during speech; sent continuously to update in real-time.
`"type": "SegmentResult"`
- `is_final`: Result status
  - `false`: Interim result while the speaker is still speaking; the sentence is not yet complete.
  - `true`: Final result, sent when the system detects a period of silence and considers the sentence complete.
Schema:

```
{
  "type": "SegmentResult",
  "is_final": boolean,
  "segment_id": number,
  "start_time": number,
  "end_time": number,
  "transcript": string,
  "speaker_no": number,
  "speaker_name": string,
  "speaker_id": string
}
```

After the `is_final: true` result for `segment_id: 0` is sent, the next results start from `segment_id: 1`.
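One common client-side pattern (a sketch, not part of the API) is to append finalized segments to a transcript and show the current interim text only at the end:

```python
def apply_segment(final_segments: list[str], message: dict) -> str:
    """Fold a SegmentResult into a running display transcript.

    Finalized transcripts are committed to final_segments; an interim
    transcript only appears at the end of the returned string.
    """
    if message.get("type") != "SegmentResult":
        return " ".join(final_segments)
    if message["is_final"]:
        final_segments.append(message["transcript"])
        return " ".join(final_segments)
    return " ".join(final_segments + [message["transcript"]])

segments: list[str] = []
print(apply_segment(segments, {"type": "SegmentResult", "is_final": False,
                               "segment_id": 0, "transcript": "hel"}))    # hel
print(apply_segment(segments, {"type": "SegmentResult", "is_final": True,
                               "segment_id": 0, "transcript": "hello"}))  # hello
```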
2. Session Result
A summary of the entire session, sent only once when the connection ends (e.g., the client disconnects).
"type": "SessionResult"
Schema:

```
{
  "type": "SessionResult",
  "session_id": string,
  "timestamp_start": string,
  "timestamp_end": string,
  "results": [
    {
      "timestamp": string,
      "segment_id": number,
      "start_time": number,
      "end_time": number,
      "transcript": string,
      "speaker_no": number,
      "speaker_id": string,
      "speaker_name": string
    }
  ]
}
```
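For example, the full session transcript can be reassembled from the `results` array by joining segments in `segment_id` order (a sketch; the field names are those from the schema above):

```python
def session_transcript(session: dict) -> str:
    """Join all finalized segments of a SessionResult in segment order."""
    results = sorted(session.get("results", []), key=lambda r: r["segment_id"])
    return " ".join(r["transcript"] for r in results)

example = {
    "type": "SessionResult",
    "session_id": "abc123",
    "results": [
        {"segment_id": 1, "transcript": "how are you"},
        {"segment_id": 0, "transcript": "hello"},
    ],
}
print(session_transcript(example))  # hello how are you
```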
Closing the Connection
- Client-side: The client can close the connection by sending `{ "action": "stop" }` to the server.
- Server-side: The server will close the connection in the following cases:
  - Invalid Query Parameters: The client provided invalid parameters at connection start.
  - Initialization Error: An error occurred while initializing the transcriber on the server.
  - Connection Closed by Client: The client closed the connection first.
When the client closes the connection, a `SessionResult` will be sent back to the client before the session fully terminates.