Finetuning and Improving Prediction Results of LLMs Using Synthetic Data
Macías, Melany (2024)
All rights reserved. This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2024053018430
Abstract
Metropolia University of Applied Sciences initiated a project to enhance its Moodle platform with an AI-powered plugin designed to support educators. Central to this initiative is a chatbot that engages users in conversations about teaching material and sustainability, specifically the Sustainable Development Goals.
This thesis evaluates four open-source large language models, Llama 3 (8B), Gemma (2B and 7B), and Phi 2 (2.7B), using a methodology that covers training dataset generation, automated evaluation, comparative analysis, and error analysis. Training data was created by collecting sustainability-related documents and using Mistral (7B) to convert their plain text into question-answer pairs. These base models were then finetuned on the generated sustainability data as well as on general-purpose datasets for dialogue and summarization.
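As an illustration of the data-generation step, the sketch below prompts an instruction-tuned Mistral 7B checkpoint to rewrite a passage as a Q&A pair. The checkpoint name, prompt wording, and helper function are assumptions for illustration, not the thesis's exact pipeline.

```python
from transformers import pipeline

# Hedged sketch: prompt an instruction-tuned Mistral 7B to turn a
# sustainability passage into a question-answer pair. The checkpoint
# name and prompt wording are illustrative assumptions.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

def make_qa_pair(passage: str) -> str:
    prompt = (
        "[INST] Read the passage below and write one question about it, "
        "followed by its answer, formatted as 'Q: ...' and 'A: ...'.\n\n"
        f"{passage} [/INST]"
    )
    out = generator(prompt, max_new_tokens=256, do_sample=False)
    # The pipeline echoes the prompt; keep only the newly generated Q&A text.
    return out[0]["generated_text"][len(prompt):].strip()
```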
Model performance was measured with BLEU, ROUGE, and METEOR scores to assess the quality of the generated text, while the comparative analysis weighed each model's efficiency against the resources it consumed and its parameter count, and an error analysis classified the error types. The study shows that finetuning consistently improved performance; the best-performing finetuned model was Gemma (7B), with a METEOR score of 0.25, and the longest finetuning run took 8 hours and 30 minutes.
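A minimal sketch of the automated scoring step, assuming the Hugging Face evaluate library; the prediction and reference strings are illustrative only.

```python
import evaluate

# Load the three metrics named in the thesis via the Hugging Face
# `evaluate` library (an assumed tooling choice, not confirmed by the text).
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

# Illustrative model output and reference answer.
predictions = ["Finetuning adapts a pretrained model to a specific domain."]
references = ["Finetuning adapts a pretrained language model to a new domain."]

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```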