Skip to main content

Evaluating Thanoy The Thai Legal AI Assistant Performance

· 5 min read
Kobkrit Viriyayudhakorn
CEO @ iApp Technology

The following evaluation report assesses Thanoy, the Thai Legal AI Assistant powered by OpenThaiGPT, which is designed to provide accurate and reliable legal advice across various legal documents and queries. Trained on over 10,000 Thai legal articles and regulations, Thanoy offers an advanced solution for legal professionals and general users seeking legal guidance.

Thanoy AI Assistant

1. Introduction to Thanoy

Thanoy is an AI-powered assistant developed to enhance access to Thai legal information and advice. It leverages OpenThaiGPT to analyze and respond to user queries, offering insights into Thai laws and regulations. Key features include its availability through a LINE chatbot interface, ensuring users can access legal advice anytime. Thanoy is designed to ensure its responses are based on a comprehensive understanding of Thailand's legal landscape, making it an invaluable tool for both professionals and non-experts alike.

2. Evaluation Methodology

2.1 Evaluation Team and Approach

This comprehensive evaluation was conducted by iApp's LLM Team, led by @Por, using an automated assessment approach to ensure objectivity and scalability.

2.2 Technical Setup

  • Evaluation Model: OpenAI GPT-4o API
  • Temperature Setting: 0 (for maximum consistency and accuracy)
  • Sample Size: 1,000 samples from the first batch
  • Data Source: Over 100,000 chat logs in JSON-Lines format
  • Future Batches: Subsequent batches will randomly sample additional 1,000 samples

2.3 Evaluation Criteria

The evaluation assessed three key components for each interaction:

  1. Query: User's legal question
  2. Context: Retrieved legal documents and regulations
  3. Response: Thanoy's AI-generated legal advice

For each sample, GPT-4o evaluated:

  • Relevance: Whether Thanoy's response relates to the user's query and retrieved context
  • Quality Score: Rating from 0-10 for the overall response quality

3. Detailed Experimental Results

3.1 Overall Performance Metrics

  • Total Samples Evaluated: 1,000
  • Mean Relevance Score: 4.325/10
  • Standard Deviation: 3.29
  • Reference Documentation: Internal LarkSuite Wiki

3.2 Relevance Distribution

CategoryCountPercentage
Not Relevant659 requests65.9%
Relevant341 requests34.1%

3.3 Top Score Distributions

ScoreCountPercentage
2 points248 requests24.8%
3 points244 requests24.4%
8 points165 requests16.5%

4. In-Depth Analysis and Key Findings

4.1 Response-Query Alignment

Finding: The majority of Thanoy's responses demonstrate strong alignment with user queries, indicating effective natural language understanding and legal reasoning capabilities.

4.2 Context Retrieval Challenges

Critical Issue Identified: The primary performance bottleneck lies in the Retrieval-Augmented Generation (RAG) system:

  • Frequent Context Mismatches: The RAG system often retrieves irrelevant legal documents
  • Score Impact: Irrelevant context significantly reduces evaluation scores despite accurate responses

4.3 Context Dependency Analysis

Scenario 1 - Unnecessary Context with Retrieval:

  • Some queries don't require legal document context for accurate responses
  • When irrelevant context is provided, scores decrease even if responses are correct

Scenario 2 - Unnecessary Context without Retrieval:

  • Queries that don't need context and receive none often score lower
  • This occurs even when responses correctly address the user's question

4.4 Underlying Model Capabilities

Positive Finding: OpenThaiGPT demonstrates strong baseline legal knowledge:

  • Can provide accurate legal advice even with incorrect context retrieval
  • Shows robust understanding of Thai legal principles and concepts
  • Maintains response quality despite RAG system limitations

5. Technical Recommendations and Future Improvements

5.1 RAG System Enhancement Priority

Immediate Action Required: The current RAG system requires comprehensive redevelopment:

  • Current Challenge: Frequent retrieval of irrelevant legal documents
  • Proposed Solution: Implementation of GraphRAG technology
  • Team Status: Active research and review of GraphRAG methodologies

5.2 Evaluation Methodology Refinement

Future Batch Strategy:

  • Continue with 1,000-sample batches using random sampling
  • Implement A/B testing with improved RAG systems
  • Develop more granular evaluation metrics for legal accuracy

Thanoy is not just a tool for immediate legal advice but also a foundation for future advancements in AI-driven legal services. As AI technology evolves, the capabilities of assistants like Thanoy are expected to improve, particularly in terms of understanding complex legal language and providing even more precise insights. The feedback and performance from this evaluation are essential in driving future improvements and ensuring Thanoy can meet the growing demand for accessible legal assistance in Thailand.

6. Conclusion and Impact Assessment

This comprehensive evaluation of 1,000 samples reveals important insights about Thanoy's performance as a Thai Legal AI Assistant:

6.1 Key Strengths

  • Strong Language Understanding: OpenThaiGPT demonstrates robust comprehension of Thai legal queries
  • Baseline Legal Knowledge: Capable of providing accurate advice even with suboptimal context retrieval
  • Response Consistency: Maintains quality across diverse legal topics and question types

6.2 Critical Areas for Improvement

  • RAG System Overhaul: Primary focus on improving context retrieval accuracy (65.9% irrelevance rate)
  • GraphRAG Implementation: Active research toward next-generation retrieval technology
  • Evaluation Refinement: Enhanced metrics for legal-specific accuracy assessment

6.3 Strategic Significance

This evaluation, conducted by iApp's LLM Team, provides crucial data for Thanoy's evolution as a leading Thai legal AI assistant. The findings demonstrate both the potential and current limitations, establishing a clear roadmap for achieving higher performance standards in AI-driven legal services.

Reference: Complete evaluation methodology and detailed results are documented in iApp's internal research wiki for ongoing technical improvements.