AI-powered natural language to SQL generation for large-scale data analysis solutions
Wang, Chenxi; Deng, Tuwen (2026)
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-202605069826
Abstract
In enterprise-grade, large-scale data analysis scenarios, natural-language-to-SQL transformation is essential. Because the table structures are so vast, the general limitations of large models come into play: a large language model (LLM) cannot handle the contextual information adequately, which significantly reduces query precision.
This thesis proposes an integrated model combining Artificial Intelligence (AI) and natural language processing (NLP) that automatically generates SQL queries from natural language descriptions of massive, large-scale datasets. By using large language model orchestration, retrieval-augmented generation (RAG) and AI workflow technology, it lets users conduct data analysis without technical expertise.
The intelligent text-to-SQL agent system integrates retrieval-augmented generation (RAG) with a closed-loop self-correction mechanism. First, a lightweight vector database (Chroma DB) is used to retrieve the schema semantically, and only the data definition language (DDL) statements of the top-k most similar tables are selected. Next, the condensed schemas are dynamically assembled into a structured prompt, which is then used to generate an initial SQL statement. Finally, once all SQL commands have been executed in one go, any errors trigger the correction loop: the physical error log is automatically sent back to the large language model (LLM), which regenerates the query iteratively until a stable, ready-to-run query is produced.
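The retrieve, prompt, generate and self-correct steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: a bag-of-words cosine similarity stands in for the Chroma DB vector search, SQLite stands in for the target database, and the toy schemas, the `fake_llm` stub and all function names (`retrieve_ddls`, `build_prompt`, `answer`, `demo`) are hypothetical.

```python
import math
import re
import sqlite3
from collections import Counter
from typing import Callable, Optional

# Hypothetical toy schemas (table name -> DDL); not taken from the thesis.
SCHEMAS = {
    "singer": "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)",
    "concert": "CREATE TABLE concert (concert_id INTEGER PRIMARY KEY, singer_id INTEGER, year INTEGER)",
    "stadium": "CREATE TABLE stadium (stadium_id INTEGER PRIMARY KEY, location TEXT, capacity INTEGER)",
}


def _bag(text: str) -> Counter:
    """Lower-cased bag of words; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_ddls(question: str, top_k: int = 2) -> list:
    """Return the DDLs of the top_k tables most similar to the question."""
    query = _bag(question)
    ranked = sorted(SCHEMAS.values(), key=lambda ddl: _cosine(query, _bag(ddl)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, ddls: list, error: Optional[str] = None) -> str:
    """Assemble the condensed schema, the question and any prior error into one prompt."""
    parts = ["Schema:", *ddls, f"Question: {question}"]
    if error:
        parts.append(f"Previous attempt failed with: {error}")
    parts.append("SQL:")
    return "\n".join(parts)


def answer(question: str, llm: Callable[[str], str], conn: sqlite3.Connection,
           top_k: int = 2, max_attempts: int = 3):
    """Generate SQL, execute it, and feed database errors back to the LLM."""
    ddls = retrieve_ddls(question, top_k)
    error = None
    for _ in range(max_attempts):
        sql = llm(build_prompt(question, ddls, error))
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # the error text goes into the next prompt
    raise RuntimeError(f"no executable query after {max_attempts} attempts: {error}")


def demo():
    conn = sqlite3.connect(":memory:")
    for ddl in SCHEMAS.values():
        conn.execute(ddl)
    conn.execute("INSERT INTO singer (name, country) VALUES ('Ana', 'France')")
    prompts = []

    def fake_llm(prompt: str) -> str:
        # Hypothetical stub: misspells a column first, fixes it once it sees an error.
        prompts.append(prompt)
        if "Previous attempt failed" in prompt:
            return "SELECT name FROM singer WHERE country = 'France'"
        return "SELECT nme FROM singer WHERE country = 'France'"

    sql, rows = answer("List the name of every singer whose country is France", fake_llm, conn)
    return sql, rows, len(prompts)


if __name__ == "__main__":
    print(demo())
```

In this sketch the first generated query fails on a misspelled column, the SQLite error message is appended to the next prompt, and the second attempt executes. Capping the loop with `max_attempts` is a design assumption that keeps a persistently failing query from iterating forever.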
The results of the thesis were evaluated as follows. Zero-shot tests were run on the Spider cross-database benchmark (1,034 complex queries spanning different databases) and compared against an LLM-only control group. Relative to that control, the approach described here improved execution accuracy, reduced token usage by about 79 per cent and greatly improved generalisation. The tests show that the system bridges semantic understanding and automatic processing to create a practical analytical system for large volumes of enterprise data that does not require users to write SQL, contributing both theory and practical application.
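Execution accuracy, the metric mentioned above, can be illustrated with a simplified sketch: a predicted query counts as correct when it runs and returns the same rows as the reference query. The function names and the order-insensitive comparison are assumptions made for illustration; the official Spider evaluation and the thesis's own evaluation scripts may differ in detail.

```python
import sqlite3


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if the predicted query runs and returns the same rows as the gold query."""
    gold = conn.execute(gold_sql).fetchall()
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    # Order-insensitive comparison (a simplifying assumption).
    return sorted(predicted, key=repr) == sorted(gold, key=repr)


def execution_accuracy(conn: sqlite3.Connection, pairs: list) -> float:
    """Fraction of (predicted, gold) query pairs whose results match."""
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)
```

Running the same function over the outputs of the RAG-based system and of the LLM-only control would give two directly comparable scores.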



