
Offline open-source RAG-based AI chatbot for Finnish-language software documentation on low-end hardware : modular model-agnostic framework and AI judge for multilingual use

Luong, Huy Quang (2025)

 
File: Luong_Quang.pdf (5.642 MB)


The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-2025060520613

Abstract
This thesis addresses the usability limitations of static software documentation and the confidentiality concerns that restrict the use of public AI tools. It aims to develop and evaluate an offline, open-source AI chatbot on low-end hardware, in partnership with Triplan Oy, prioritizing confidentiality and sustainability.
The chosen solution is Retrieval-Augmented Generation (RAG), which enhances Large Language Model (LLM) responses by dynamically retrieving relevant documents to ground the generated answers. The project employed a rapid prototyping approach, following the Design Science Research Process, where iterative design and continuous client feedback guided development and refinement. Key open-source technologies selected for the prototype included Ollama for efficient local LLM inference, FAISS for high-performance vector similarity search, and llama-index as the orchestration framework for data indexing and retrieval. Data preparation was critical for the Finnish documentation, focusing on proper UTF-8 encoding to preserve special characters (ä, ö, å) and implementing a section-based chunking strategy tailored to the document structure for optimal information retrieval.
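The section-based chunking strategy described above can be sketched in plain Python. This is a minimal illustration, not the thesis's actual implementation: it assumes the documentation uses numbered headings (e.g. "1. Asennus", "2.3 Käyttöönotto"), whereas the thesis only states that chunking followed the document's section structure. The Finnish characters (ä, ö, å) are preserved as long as files are read with UTF-8 encoding, e.g. `open(path, encoding="utf-8")`.

```python
import re

def chunk_by_sections(text: str) -> list[dict]:
    """Split a document into section-based chunks.

    Assumes sections start with numbered headings such as
    '1. Asennus' or '2.3 Käyttöönotto' (an illustrative
    convention, not the thesis's exact heading format).
    """
    heading = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S", re.MULTILINE)
    starts = [m.start() for m in heading.finditer(text)]
    if not starts:
        # No headings found: treat the whole document as one chunk.
        return [{"title": "", "body": text.strip()}]
    chunks = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        section = text[start:end].strip()
        # First line is the heading; the rest is the section body.
        title, _, body = section.partition("\n")
        chunks.append({"title": title.strip(), "body": body.strip()})
    return chunks

# Example with Finnish special characters intact:
doc = "1. Asennus\nAsenna työkalu näin.\n2. Käyttö\nKäytä ääkkösiä: ä ö å"
for chunk in chunk_by_sections(doc):
    print(chunk["title"], "->", chunk["body"])
```

Each chunk would then be embedded and stored in the FAISS index, so a query retrieves coherent, self-contained sections rather than arbitrary fixed-size windows.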
The prototype was rigorously evaluated for both system latency and response quality. Median retrieval latencies stayed under one second even on a low-end CPU, enabling fast, interactive display of relevant document chunks. GPU acceleration significantly reduced both semantic retrieval and response generation times. LLM size was identified as the dominant factor affecting latency on the CPU, while the choice of embedding model had negligible impact. For response quality, the system achieved an 80% correctness rate overall, and 86% for aligned, in-scope queries, as assessed by an AI judge using the RAG Triad (Groundedness, Context Relevance, and Answer Relevance). The snowflake-arctic-embed2 embedding model also demonstrated numerically superior Recall@3 and statistically superior Recall@1 compared to bge-m3. However, the 1B LLM exhibited limitations when handling tricky or off-topic queries, leading to hallucinations and inconsistent abstention behavior, highlighting the quality trade-off of smaller models.
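The Recall@k metric used to compare the embedding models measures, per query, the fraction of relevant chunks that appear among the top-k retrieved results. A minimal sketch of the metric follows; it is not the thesis's evaluation harness, and the chunk identifiers are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunks found in the top-k retrieved results."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for chunk_id in relevant if chunk_id in top_k) / len(relevant)

# One query: the single relevant chunk is ranked third by the retriever.
retrieved = ["c2", "c7", "c1"]
relevant = {"c1"}
print(recall_at_k(retrieved, relevant, 1))  # 0.0 - miss at k=1
print(recall_at_k(retrieved, relevant, 3))  # 1.0 - hit within top 3
```

Averaging this score over the full query set gives the per-model Recall@1 and Recall@3 figures that the thesis compares between snowflake-arctic-embed2 and bge-m3.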
This work successfully demonstrates the feasibility of deploying secure, offline RAG-based AI chatbots in resource-constrained, language-specific environments. While the current prototype serves as a proof of concept with acknowledged limitations (e.g., restricted model diversity, single source document, basic error handling), it provides a flexible framework for future development. Subsequent efforts should focus on enhancing robustness, diversifying evaluation, integrating human-in-the-loop feedback, and refining prompt engineering for improved reliability and broader applicability.
Collections
  • Opinnäytetyöt (Theses)
Theses and publications of Finnish universities of applied sciences