Real-time Automatic Speech Recognition (ASR)

Welcome to Thai Real-time ASR, an automatic speech recognition service that runs in real time over WebSocket. The service converts streamed audio into text as it arrives and supports both streaming from a file and capturing directly from a microphone.

Key Features

Real-time streaming over WebSocket: connect to wss://api.iapp.co.th/v1/audio/stt/rt and send audio to receive transcripts immediately
Low-latency updates: receive SegmentResult messages with an is_final flag while the speaker is still talking
Clear timing and IDs: every segment carries segment_id, start_time, and end_time for easy reference
Language selection: set the spoken language with transcribe_lang (supported: th, en, zh)
Automatic sentence detection: voice activity detection (VAD) separates speech from silence so that segments are closed accurately
Speaker hints: the speaker_no field indicates speaker changes
Session summary: when the connection closes, you receive a SessionResult containing the full transcript
Graceful shutdown: the client can send { "action": "stop" } to stop; the server then closes the connection cleanly

Installing PyAudio

PyAudio is a library that binds Python to PortAudio, which is required for microphone access. Install PortAudio first, then install PyAudio.

Using a virtual environment is recommended.
Upgrade pip, setuptools, and wheel first to reduce installation problems:

python -m pip install --upgrade pip setuptools wheel

Linux

Debian/Ubuntu:

sudo apt-get update
sudo apt-get install -y python3-dev portaudio19-dev
python -m pip install pyaudio

Fedora/RHEL/CentOS (dnf):

sudo dnf install -y portaudio-devel python3-devel
python -m pip install pyaudio

Arch/Manjaro:

sudo pacman -S --noconfirm portaudio
python -m pip install pyaudio

macOS

Using Homebrew:

brew install portaudio
# If the build tools cannot find the headers/libs, export these flags (only if needed):
export LDFLAGS="-L$(brew --prefix portaudio)/lib"
export CPPFLAGS="-I$(brew --prefix portaudio)/include"
python -m pip install pyaudio

Notes:
- On Apple Silicon (M1/M2/M3), use a Python build that matches the architecture (arm64) and install Homebrew under /opt/homebrew.
- If problems persist, try building from source:

python -m pip install --no-binary :all: pyaudio

Windows

Standard installation:

python -m pip install pyaudio

If that fails (because build tools are missing), install a prebuilt wheel through pipwin:

python -m pip install pipwin
python -m pipwin install pyaudio

Verifying the Installation

python -c "import pyaudio, sys; print('PyAudio OK, version:', getattr(pyaudio, '__version__', 'n/a')); sys.exit(0)"

Connecting to the Server

The client must connect to the server over WebSocket (WSS or WS).

Endpoint URL

wss://api.iapp.co.th/v1/audio/stt/rt

Query Parameters

Behavior can be configured by passing query parameters in the URL when connecting:

Parameter        Type     Default  Description
transcribe_lang  string   "th"     Code of the spoken language (supported: "th", "en", "zh")
speaker_hints    boolean  false    Enables speaker hints
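
As an illustration, the query parameters above can be appended to the endpoint with the standard library. This is a minimal sketch: the helper name build_url is hypothetical, and sending the boolean as lowercase "true"/"false" is an assumption, since the encoding is not specified here.

```python
from urllib.parse import urlencode

ENDPOINT = "wss://api.iapp.co.th/v1/audio/stt/rt"
SUPPORTED_LANGS = {"th", "en", "zh"}


def build_url(transcribe_lang: str = "th", speaker_hints: bool = False) -> str:
    """Return the endpoint URL with query parameters attached (hypothetical helper)."""
    if transcribe_lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {transcribe_lang!r}")
    params = {
        "transcribe_lang": transcribe_lang,
        # Assumption: booleans are sent as lowercase "true"/"false".
        "speaker_hints": "true" if speaker_hints else "false",
    }
    return f"{ENDPOINT}?{urlencode(params)}"


print(build_url("en", speaker_hints=True))
# wss://api.iapp.co.th/v1/audio/stt/rt?transcribe_lang=en&speaker_hints=true
```

Validating the language code on the client side avoids a round trip that would otherwise end with the server closing the connection for invalid query parameters.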

Sending Audio Data

After the connection is established, the client must send audio data continuously as binary frames.

Audio Format

Sample Rate: 16,000 Hz
Channels: 1 (Mono)
Bit Depth: 16-bit signed integer (PCM)

The system processes audio in chunks. A chunk size of 1024 bytes is recommended to keep latency low.

import asyncio, json, signal
import pyaudio, websockets
from websockets.asyncio.client import connect

CHUNK = 1024               # frames per read; at 16-bit mono this is 2048 bytes
RATE = 16000               # sample rate in Hz
CHANNELS = 1               # mono
FORMAT = pyaudio.paInt16   # 16-bit signed integer PCM

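async def stream_mic(ws, stream, stop_evt: asyncio.Event):
    """Read microphone audio and send it as binary frames until stop_evt is set."""
    try:
        while not stop_evt.is_set():
            # stream.read blocks, so run it in a worker thread to keep the event loop free.
            buff = await asyncio.to_thread(stream.read, CHUNK, exception_on_overflow=False)
            await ws.send(buff)
    except websockets.exceptions.ConnectionClosed:
        pass
    finally:
        try:
            # Ask the server to stop, then give it up to 5 seconds to close the connection.
            await ws.send(json.dumps({"action": "stop"}))
            await asyncio.wait_for(ws.wait_closed(), timeout=5)
        except Exception:
            pass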

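async def recv_loop(ws):
    """Print every server message and save the final SessionResult to a JSON file."""
    try:
        async for msg in ws:
            data = json.loads(msg)
            print(data)
            if data.get("type") == "SessionResult":
                # The session summary arrives once, just before the connection closes.
                with open(f"{data['session_id']}_result.json", "w", encoding="utf-8") as f:
                    json.dump(data, f, indent=2, ensure_ascii=False)
    except websockets.exceptions.ConnectionClosed:
        pass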

async def main():
    url = "wss://api.iapp.co.th/v1/audio/stt/rt"
    stop_evt = asyncio.Event()
    try:
        # Ctrl+C sets the stop event so the sender can shut down gracefully.
        asyncio.get_running_loop().add_signal_handler(signal.SIGINT, stop_evt.set)
    except NotImplementedError:
        # add_signal_handler is not available on every platform (for example Windows).
        pass

    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK)

    try:
        # Put your API key in the "apikey" header.
        async with connect(url, ping_timeout=None, additional_headers={"apikey": ""}) as ws:
            print("Recording... press Ctrl+C to stop")
            sender = asyncio.create_task(stream_mic(ws, stream, stop_evt))
            receiver = asyncio.create_task(recv_loop(ws))
            # Stop both tasks as soon as either one finishes.
            await asyncio.wait({sender, receiver}, return_when=asyncio.FIRST_COMPLETED)
            stop_evt.set()
            await asyncio.gather(sender, receiver, return_exceptions=True)
    finally:
        # Always release the audio device.
        stream.stop_stream()
        stream.close()
        pa.terminate()

if __name__ == "__main__":
    asyncio.run(main())
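
For the file-streaming case, the audio format and chunk size above translate into a small reader built on the standard wave module. The following is a sketch under stated assumptions: the names iter_wav_chunks and chunk_seconds are hypothetical, and the input is assumed to be a WAV file already encoded as 16 kHz mono 16-bit PCM (no resampling is attempted).

```python
import wave
from typing import Iterator

RATE = 16000        # sample rate in Hz
CHANNELS = 1        # mono
SAMPLE_WIDTH = 2    # bytes per sample (16-bit PCM)
CHUNK_BYTES = 1024  # recommended chunk size in bytes


def chunk_seconds(chunk_bytes: int = CHUNK_BYTES) -> float:
    """Duration of the audio carried by one chunk."""
    return chunk_bytes / (RATE * CHANNELS * SAMPLE_WIDTH)


def iter_wav_chunks(path: str, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield raw PCM chunks from a WAV file that already matches the required format."""
    with wave.open(path, "rb") as wf:
        fmt = (wf.getframerate(), wf.getnchannels(), wf.getsampwidth())
        if fmt != (RATE, CHANNELS, SAMPLE_WIDTH):
            raise ValueError(f"expected 16 kHz mono 16-bit PCM, got {fmt}")
        frames_per_chunk = chunk_bytes // (CHANNELS * SAMPLE_WIDTH)
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield data
```

At 16 kHz mono 16-bit, a 1024-byte chunk holds 512 samples, or 32 ms of audio. A sender could loop over iter_wav_chunks, call ws.send on each chunk, and sleep for chunk_seconds() between sends to approximate real-time pacing; whether the server requires such pacing is an assumption, not something stated here.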

Receiving Data from the Server

The server sends messages back as JSON. The client must parse the JSON before using it.

1. Segment Result

Interim results produced during speech; the server sends them continuously.

"type": "SegmentResult"

is_final: status of the result

false: interim result; the speaker has not finished
true: final result, emitted when the system detects silence

Structure:

{
    "type": "SegmentResult",
    "is_final": boolean,
    "segment_id": number,
    "start_time": number,
    "end_time": number,
    "transcript": string,
    "speaker_no": number,
    "speaker_name": string,
    "speaker_id": string
}

Once is_final: true has been sent for segment_id: 0, subsequent results start from segment_id: 1.
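
The interim/final behavior described above suggests one way to keep a running transcript on the client: overwrite the interim text on each update and lock a segment in when is_final is true. The class below is a hypothetical helper, not part of the service, and the sample messages are invented for illustration. Joining segments with a space is also an assumption; Thai text may read better joined without one.

```python
import json


class TranscriptBuffer:
    """Collect SegmentResult messages into a running transcript (hypothetical helper)."""

    def __init__(self) -> None:
        self.final = {}    # segment_id -> final transcript
        self.interim = ""  # latest non-final text for the open segment

    def feed(self, raw: str) -> None:
        msg = json.loads(raw)
        if msg.get("type") != "SegmentResult":
            return  # ignore other message types such as SessionResult
        if msg["is_final"]:
            self.final[msg["segment_id"]] = msg["transcript"]
            self.interim = ""
        else:
            self.interim = msg["transcript"]

    def text(self) -> str:
        parts = [self.final[k] for k in sorted(self.final)]
        if self.interim:
            parts.append(self.interim)
        return " ".join(parts)


buf = TranscriptBuffer()
# Invented sample messages, reduced to the fields the helper reads.
buf.feed(json.dumps({"type": "SegmentResult", "is_final": False, "segment_id": 0, "transcript": "hello"}))
buf.feed(json.dumps({"type": "SegmentResult", "is_final": True, "segment_id": 0, "transcript": "hello world"}))
buf.feed(json.dumps({"type": "SegmentResult", "is_final": False, "segment_id": 1, "transcript": "how are"}))
print(buf.text())  # hello world how are
```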

2. Session Result

A summary of the whole session, sent only once when the connection is closed.

"type": "SessionResult"

Structure:

{
  "type": "SessionResult",
  "session_id": string,
  "timestamp_start": string,
  "timestamp_end": string,
  "results": [
    {
      "timestamp": string,
      "segment_id": number,
      "start_time": number,
      "end_time": number,
      "transcript": string,
      "speaker_no": number,
      "speaker_id": string,
      "speaker_name": string
    }
  ]
}
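
Given the structure above, a SessionResult can be flattened into a readable transcript with a few lines of code. This is a sketch only: session_transcript is a hypothetical helper, the sample payload is invented, and the time values are printed as plain numbers because their unit is not stated here.

```python
def session_transcript(session: dict) -> str:
    """Format the segments of a SessionResult as one line per segment (hypothetical helper)."""
    lines = []
    for seg in sorted(session["results"], key=lambda s: s["segment_id"]):
        lines.append(
            f"[{seg['start_time']:.2f}-{seg['end_time']:.2f}] "
            f"speaker {seg['speaker_no']}: {seg['transcript']}"
        )
    return "\n".join(lines)


# Invented sample payload, reduced to the fields the helper reads.
sample = {
    "type": "SessionResult",
    "session_id": "demo",
    "results": [
        {"segment_id": 1, "start_time": 2.5, "end_time": 4.0, "speaker_no": 1, "transcript": "fine, thanks"},
        {"segment_id": 0, "start_time": 0.0, "end_time": 2.0, "speaker_no": 0, "transcript": "how are you"},
    ],
}
print(session_transcript(sample))
```

Sorting by segment_id keeps the output in speaking order even if the results list is not already ordered.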

Closing the Connection

Client side: the client can close the connection by sending {"action": "stop"}
Server side: the server closes the connection in the following cases:
- The query parameters are invalid
- An error occurs while the transcription system is starting up
- The client closes the connection first

When the client closes the connection, it receives a SessionResult before the session is fully closed.
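
The shutdown sequence described above can be sketched as a small coroutine that sends the stop action and then keeps reading until the SessionResult arrives. The function stop_and_collect is hypothetical, and ws is assumed to be any object with async send and recv methods. The exception raised on a closed connection depends on the WebSocket library, so it is passed in as a parameter; with the websockets package the caller would pass websockets.exceptions.ConnectionClosed.

```python
import asyncio
import json


async def stop_and_collect(ws, timeout: float = 5.0, closed_exceptions=(ConnectionError,)):
    """Send the stop action, then read until SessionResult arrives.

    Returns the parsed SessionResult, or None if the connection closes or
    no message arrives within `timeout` seconds (the limit applies per message).
    """
    await ws.send(json.dumps({"action": "stop"}))
    try:
        while True:
            raw = await asyncio.wait_for(ws.recv(), timeout)
            msg = json.loads(raw)
            # Remaining SegmentResult messages may still arrive first; skip them.
            if msg.get("type") == "SessionResult":
                return msg
    except asyncio.TimeoutError:
        return None
    except closed_exceptions:
        return None
```

Returning None instead of raising keeps the caller's cleanup path simple: a missing summary can be handled as an ordinary case rather than as an error.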
