Developing an AI-powered multimodal chatbot
Chen, Hua; Le, Kiet (2025)
Chen, Hua
Le, Kiet
2025
All rights reserved. This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Julkaisun pysyvä osoite on
https://urn.fi/URN:NBN:fi:amk-2025052114086
https://urn.fi/URN:NBN:fi:amk-2025052114086
Tiivistelmä
This thesis investigates designing and implementing a chatbot system based on Google's Gemini AI model that supports multimodal interaction. To overcome the limitations of traditional text-based systems, the chatbot incorporates text, emojis, audio, and images. It utilizes the processing capabilities of Gemini 2.0 Flash within a custom web interface built with an HTML/CSS/JavaScript frontend and a Flask backend. This architecture supports integrated communication across multiple modalities in a structured and extensible manner, enabling more flexible user interactions.
Several technical challenges were addressed, including the definition of a standardized payload format for communication with the Gemini 2.0 Flash API and the implementation of security mechanisms. These included the use of bcrypt with a 12-round key derivation function and salting for password hashing, the prevention of SQL injection through ORM-level query parameterization, and the structured management of user session lifecycles to ensure data integrity and enforce access control.
The system was evaluated over multiple development iterations using predefined test scenarios that included diverse input types, such as text, emojis, audio, and images. The results indicated consistent system behavior across modalities and adherence to the intended design specifications. The thesis also provides future research directions to enhance applicability to practical use cases.
Several technical challenges were addressed, including the definition of a standardized payload format for communication with the Gemini 2.0 Flash API and the implementation of security mechanisms. These included the use of bcrypt with a 12-round key derivation function and salting for password hashing, the prevention of SQL injection through ORM-level query parameterization, and the structured management of user session lifecycles to ensure data integrity and enforce access control.
The system was evaluated over multiple development iterations using predefined test scenarios that included diverse input types, such as text, emojis, audio, and images. The results indicated consistent system behavior across modalities and adherence to the intended design specifications. The thesis also provides future research directions to enhance applicability to practical use cases.