Retrieval-Augmented Generation Utilizing SQL Database : Case: Web Sport Statistics Application
Syrjä, Saku-Matti (2024)
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2024111127813
Abstract
This thesis investigates the implementation of Retrieval-Augmented Generation (RAG) techniques in conjunction with SQL databases to enhance the numerical accuracy of Large Language Models (LLMs). While LLMs have demonstrated remarkable capabilities in natural language processing, they often struggle with numerical precision and statistical information retrieval. This research addresses this limitation through a case study implementing a sports statistics application, specifically focusing on NHL data from 2008 to 2024.
The study comprises a comprehensive literature review examining transformer architectures, SQL database systems, and RAG methodologies, followed by a detailed implementation of a full-stack web application. The application utilizes OpenAI's GPT-4o-mini model integrated with a PostgreSQL database containing comprehensive NHL statistics. The implementation demonstrates a novel approach to RAG by employing direct text-to-SQL query generation rather than traditional vector search methods.
Key findings reveal that while LLMs can effectively generate SQL queries for statistical retrieval, challenges persist in database design paradigms, where traditional normalization principles proved counterproductive for RAG applications. The study identifies specific limitations in handling season formatting and column selection, while also highlighting the potential for production-level applications. The research contributes to the field by presenting a practical framework for implementing SQL-based RAG systems and identifies areas for future improvement, including dataset optimization and model fine-tuning opportunities.
This work provides valuable insights into the integration of LLMs with structured databases and offers a foundation for developing more accurate and reliable statistical retrieval systems.
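As an illustration of the text-to-SQL retrieval flow described in the abstract, the following is a minimal sketch and not the thesis's actual implementation. It assumes a hypothetical denormalized table `player_season_stats`, uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and replaces the GPT-4o-mini call with a hard-coded stub (`generate_sql`). The rows are placeholders, not real NHL statistics.

```python
import sqlite3

# Hypothetical denormalized schema; the thesis's real PostgreSQL schema is not shown here.
SCHEMA = """
CREATE TABLE player_season_stats (
    player_name TEXT,
    season TEXT,      -- assumed season format, e.g. '2022-2023'
    goals INTEGER,
    assists INTEGER
)
"""

# Placeholder rows for illustration only, not real NHL statistics.
ROWS = [
    ("Player A", "2022-2023", 30, 40),
    ("Player B", "2022-2023", 45, 20),
    ("Player A", "2023-2024", 25, 35),
]


def generate_sql(question: str, schema: str) -> str:
    """Stand-in for the LLM call. A real system would send the schema and the
    question to a model and return the SQL text the model produces."""
    return (
        "SELECT player_name, goals FROM player_season_stats "
        "WHERE season = '2022-2023' ORDER BY goals DESC LIMIT 1"
    )


def run_readonly(conn, sql):
    """Execute generated SQL, rejecting anything that is not a single SELECT."""
    statement = sql.strip().rstrip(";")
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("only a single SELECT statement is allowed")
    return conn.execute(statement).fetchall()


def build_answer_prompt(question, rows):
    """The retrieved rows become the context for the final answer-generation call."""
    return f"Question: {question}\nSQL result rows: {rows}\nAnswer using only these rows."


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany("INSERT INTO player_season_stats VALUES (?, ?, ?, ?)", ROWS)

question = "Who scored the most goals in the 2022-2023 season?"
rows = run_readonly(conn, generate_sql(question, SCHEMA))
prompt = build_answer_prompt(question, rows)
print(prompt)
```

In this sketch the numbers in the final answer come from the database rather than from the model's memory, which is the point of using SQL retrieval for statistics. The single-table layout reflects the abstract's observation that strict normalization can work against RAG, since fewer joins leave the model with simpler queries to generate.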