Real-time Automatic Speech Recognition (ASR)
Welcome to Thai Real-time ASR - our WebSocket-based real-time Automatic Speech Recognition service. This service converts audio stream data into text stream data in real-time, supporting both file streaming and direct microphone capture.

Key Features
- Real-time streaming over WebSocket: Connect to `wss://api.iapp.co.th/v1/audio/stt/rt` and stream audio to get live transcription.
- Low-latency updates: Receive `SegmentResult` messages with `is_final` status as you speak.
- Clear timing and IDs: Every segment includes `segment_id`, `start_time`, and `end_time` for easy alignment.
- Language selection: Choose the spoken language via `transcribe_lang` (supported languages: `th`, `en`, `zh`).
- Smart sentence detection: Built-in VAD detects speech/silence to finalize segments accurately.
- Speaker hints: The `speaker_no` field carries speaker-change hints for basic diarization.
- Session summary: A `SessionResult` is sent once on close, containing all finalized segments.
- Graceful stop: End from the client side by sending `{ "action": "stop" }`; the server then closes cleanly.
Installing PyAudio
PyAudio provides Python bindings for PortAudio (required to access the microphone). Install PortAudio first, then PyAudio.
- Recommended: use a virtual environment.
- Upgrade pip/setuptools/wheel first to avoid build issues:

```shell
python -m pip install --upgrade pip setuptools wheel
```
Linux
Debian/Ubuntu:

```shell
sudo apt-get update
sudo apt-get install -y python3-dev portaudio19-dev
python -m pip install pyaudio
```

Fedora/RHEL/CentOS (dnf):

```shell
sudo dnf install -y portaudio-devel python3-devel
python -m pip install pyaudio
```

Arch/Manjaro:

```shell
sudo pacman -S --noconfirm portaudio
python -m pip install pyaudio
```
macOS
Using Homebrew:

```shell
brew install portaudio
# If build tools can’t find headers/libs, export flags (optional):
export LDFLAGS="-L$(brew --prefix portaudio)/lib"
export CPPFLAGS="-I$(brew --prefix portaudio)/include"
python -m pip install pyaudio
```

Notes:
- On Apple Silicon (M1/M2/M3), ensure you’re using a matching Python (arm64) and Homebrew under /opt/homebrew.
- If you still face build issues after installing `portaudio`, try forcing a source build:

```shell
python -m pip install --no-binary :all: pyaudio
```
Windows
Most systems:

```shell
python -m pip install pyaudio
```

If that fails (missing build tools), use prebuilt wheels via pipwin:

```shell
python -m pip install pipwin
python -m pipwin install pyaudio
```
Verify installation
```shell
python -c "import pyaudio, sys; print('PyAudio OK, version:', getattr(pyaudio, '__version__', 'n/a')); sys.exit(0)"
```
Connecting to the Server
Clients must connect to the server using the WebSocket protocol (WSS or WS).
Endpoint URL
`wss://api.iapp.co.th/v1/audio/stt/rt`
Query Parameters
Clients can configure the system by sending query parameters along with the URL during connection:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `transcribe_lang` | string | `"th"` | Language code of the spoken language (supported languages: `"th"`, `"en"`, `"zh"`) |
| `speaker_hints` | boolean | `false` | Enable speaker hints |
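As a sketch, the query string can be assembled with the standard library's `urllib.parse`; the parameter names are those from the table above:

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.iapp.co.th/v1/audio/stt/rt"

def build_url(transcribe_lang: str = "th", speaker_hints: bool = False) -> str:
    """Build the WebSocket URL with the query parameters from the table."""
    params = {
        "transcribe_lang": transcribe_lang,
        "speaker_hints": str(speaker_hints).lower(),  # send booleans as "true"/"false"
    }
    return f"{BASE_URL}?{urlencode(params)}"

print(build_url("en", True))
# wss://api.iapp.co.th/v1/audio/stt/rt?transcribe_lang=en&speaker_hints=true
```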
Sending Audio Data
After connecting successfully, the client must continuously send audio data as binary frames (bytes).
Audio Format
- Sample Rate: 16,000 Hz
- Channels: 1 (Mono)
- Bit Depth: 16-bit signed integer (PCM)
The system is designed to support chunked audio processing. The recommended chunk size is 1024 bytes per chunk for smooth streaming and reduced latency.
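Before streaming a pre-recorded file, it can help to verify it matches this format. A small helper using only the stdlib `wave` module (a sketch, not part of the API):

```python
import wave

def check_format(path: str) -> bool:
    """Return True if the WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as wf:
        return (wf.getframerate() == 16000
                and wf.getnchannels() == 1
                and wf.getsampwidth() == 2)  # 2 bytes = 16-bit samples
```

Files in other formats can be converted first (e.g. with ffmpeg or sox) before streaming.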
Example: stream microphone audio to the endpoint and print live results (pass your API key in the `apikey` header):

```python
import asyncio, json, signal

import pyaudio
import websockets
from websockets.asyncio.client import connect

CHUNK = 1024              # frames per buffer read from the microphone
RATE = 16000              # 16 kHz sample rate, as required by the service
CHANNELS = 1              # mono
FORMAT = pyaudio.paInt16  # 16-bit signed PCM

async def stream_mic(ws, stream, stop_evt: asyncio.Event):
    """Read audio from the microphone and send it as binary frames."""
    try:
        while not stop_evt.is_set():
            buff = stream.read(CHUNK, exception_on_overflow=False)
            await ws.send(buff)
            await asyncio.sleep(0)  # yield control to the event loop
    except websockets.exceptions.ConnectionClosed:
        pass
    finally:
        # Ask the server to finalize the session and close cleanly.
        try:
            await ws.send(json.dumps({"action": "stop"}))
            await asyncio.wait_for(ws.wait_closed(), timeout=5)
        except Exception:
            pass

async def recv_loop(ws):
    """Print every message; save the final SessionResult to a JSON file."""
    try:
        async for msg in ws:
            data = json.loads(msg)
            print(data)
            if data.get("type") == "SessionResult":
                with open(f"{data['session_id']}_result.json", "w", encoding="utf-8") as f:
                    json.dump(data, f, indent=2, ensure_ascii=False)
    except websockets.exceptions.ConnectionClosed:
        pass

async def main():
    url = "wss://api.iapp.co.th/v1/audio/stt/rt"
    stop_evt = asyncio.Event()
    try:
        asyncio.get_running_loop().add_signal_handler(signal.SIGINT, stop_evt.set)
    except NotImplementedError:
        pass  # some platforms (e.g. Windows) do not support loop signal handlers

    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    try:
        async with connect(url, ping_timeout=None,
                           additional_headers={"apikey": "YOUR_API_KEY"}) as ws:
            print("Recording... Press Ctrl+C to stop")
            sender = asyncio.create_task(stream_mic(ws, stream, stop_evt))
            receiver = asyncio.create_task(recv_loop(ws))
            await asyncio.wait({sender, receiver}, return_when=asyncio.FIRST_COMPLETED)
            stop_evt.set()
            await asyncio.gather(sender, receiver, return_exceptions=True)
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()

if __name__ == "__main__":
    asyncio.run(main())
```
Receiving Data from the Server
The server sends back messages as JSON strings. The client must parse the JSON to use the data. There are two main message types:
1. Segment Result
Interim results produced during speech; sent continuously to update in real-time.
`"type": "SegmentResult"`
- `is_final`: Result status
  - `false`: Interim result while the speaker is still speaking; the sentence is not yet complete.
  - `true`: Final result, sent when the system detects a period of silence and considers the sentence complete.
Schema:

```
{
  "type": "SegmentResult",
  "is_final": boolean,
  "segment_id": number,
  "start_time": number,
  "end_time": number,
  "transcript": string,
  "speaker_no": number,
  "speaker_name": string,
  "speaker_id": string
}
```

After the `is_final: true` result for `segment_id: 0` is sent, the next results start from `segment_id: 1`.
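One common client-side pattern (a sketch, not part of the API) is to append finalized segments to a transcript and show the current interim text only at the end:

```python
def apply_segment(final_segments: list[str], message: dict) -> str:
    """Fold a SegmentResult into a running display transcript.

    Finalized transcripts are committed to final_segments; an interim
    transcript only appears at the end of the returned string.
    """
    if message.get("type") != "SegmentResult":
        return " ".join(final_segments)
    if message["is_final"]:
        final_segments.append(message["transcript"])
        return " ".join(final_segments)
    return " ".join(final_segments + [message["transcript"]])

segments: list[str] = []
print(apply_segment(segments, {"type": "SegmentResult", "is_final": False,
                               "segment_id": 0, "transcript": "hel"}))    # hel
print(apply_segment(segments, {"type": "SegmentResult", "is_final": True,
                               "segment_id": 0, "transcript": "hello"}))  # hello
```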
2. Session Result
A summary of the entire session, sent only once when the connection ends (e.g., the client disconnects).
"type": "SessionResult"
Schema:

```
{
  "type": "SessionResult",
  "session_id": string,
  "timestamp_start": string,
  "timestamp_end": string,
  "results": [
    {
      "timestamp": string,
      "segment_id": number,
      "start_time": number,
      "end_time": number,
      "transcript": string,
      "speaker_no": number,
      "speaker_id": string,
      "speaker_name": string
    }
  ]
}
```
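For example, the full session transcript can be reassembled from the `results` array by joining segments in `segment_id` order (a sketch; the field names are those from the schema above):

```python
def session_transcript(session: dict) -> str:
    """Join all finalized segments of a SessionResult in segment order."""
    results = sorted(session.get("results", []), key=lambda r: r["segment_id"])
    return " ".join(r["transcript"] for r in results)

example = {
    "type": "SessionResult",
    "session_id": "abc123",
    "results": [
        {"segment_id": 1, "transcript": "how are you"},
        {"segment_id": 0, "transcript": "hello"},
    ],
}
print(session_transcript(example))  # hello how are you
```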
Closing the Connection
- Client-side: The client can close the connection by sending `{ "action": "stop" }` to the server.
- Server-side: The server will close the connection in the following cases:
  - Invalid Query Parameters: The client provided invalid parameters at connection start.
  - Initialization Error: An error occurred while initializing the transcriber on the server.
  - Connection Closed by Client: The client closed the connection first.
When the client closes the connection, a `SessionResult` will be sent back to the client before the session fully terminates.