Enhancing Human-Robot Interaction : Integrating Large Language Models and Advanced Speech Recognition into the Pepper Robot

Abdel Hafez, Raneem

Abdel Hafez, Raneem (2024)

The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-202402273525

Abstract

Conversing with robots was once confined to science fiction, but recent advances in integrating robots into everyday human interaction have made it a reality. Pepper, the humanoid robot from SoftBank Robotics, is designed for seamless communication and marks a significant advance in robotics. However, its natural language capabilities are currently limited. Integrating Large Language Models (LLMs) and speech recognition is essential to enhancing its communication abilities. Recent advances in speech recognition that build on the Transformer architecture have greatly improved the accuracy and efficiency of transcribing spoken language. Additionally, LLMs such as OpenAI's ChatGPT can now generate human-like responses based on contextual input, pushing the boundaries of natural language understanding and generation.

This thesis investigates integrating LLMs into Pepper to improve its linguistic abilities and facilitate natural communication. The integration enhances Pepper's language understanding and generation, leading to more engaging Human-Robot Interaction (HRI).

Key questions driving this thesis include: How can large language models be integrated into the Pepper robot? Can Pepper accurately transcribe spoken audio? Are Pepper's responses human-like enough to facilitate meaningful interaction? And does Pepper's response time allow for a natural conversational flow?

The methodology involves integrating LLMs into Pepper's architecture alongside an Automatic Speech Recognition (ASR) system for accurate transcription. With the ASR system in place and several LLMs evaluated, Pepper transcribes speech well and generates responses that users find sufficiently human-like to understand and engage with. Evaluation of the implemented models reveals notable differences in speed, with some models responding faster than others.
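As an illustration of how response times could be compared across candidate models, the following is a minimal sketch and not the thesis's actual evaluation code. The names `time_model`, `rank_by_latency`, `fast_model`, and `slow_model` are hypothetical, and the two model functions are stand-ins that only simulate latency with a sleep.

```python
import statistics
import time
from typing import Callable, Dict, List, Tuple


def time_model(reply_fn: Callable[[str], str], prompts: List[str]) -> Dict[str, float]:
    """Return mean and max wall-clock latency (seconds) of reply_fn over prompts."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        reply_fn(prompt)
        latencies.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(latencies), "max_s": max(latencies)}


# Hypothetical stand-ins for real LLM backends (assumption: a real
# evaluation would call an actual model here instead of sleeping).
def fast_model(prompt: str) -> str:
    time.sleep(0.01)
    return "Hello!"


def slow_model(prompt: str) -> str:
    time.sleep(0.05)
    return "Hello, how can I help you today?"


def rank_by_latency(
    models: Dict[str, Callable[[str], str]], prompts: List[str]
) -> List[Tuple[str, Dict[str, float]]]:
    """Time every model on the same prompts and sort by mean latency, fastest first."""
    results = {name: time_model(fn, prompts) for name, fn in models.items()}
    return sorted(results.items(), key=lambda item: item[1]["mean_s"])


if __name__ == "__main__":
    prompts = ["Hi Pepper", "What can you do?"]
    candidates = {"fast": fast_model, "slow": slow_model}
    for name, stats in rank_by_latency(candidates, prompts):
        print(f"{name}: mean {stats['mean_s']:.3f}s, max {stats['max_s']:.3f}s")
```

Using the same prompt list for every candidate keeps the comparison fair; in practice the measured time would also include network and speech-processing overhead, which this sketch ignores.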

The work proceeds in two primary phases: establishing a web server infrastructure and configuring Pepper to interact seamlessly with the server. Qualitative assessments and quantitative analyses of response time and performance metrics identify the most suitable LLM and ASR system for Pepper's communication needs.
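The two phases can be pictured with a small sketch: a web server that turns audio into a reply, and a robot-side client that calls it. This is an illustrative assumption about the general shape of such a setup, not the implementation described in the thesis. The `/chat` path, the `audio_hex` JSON field, and the functions `transcribe`, `generate_reply`, `start_server`, and `robot_turn` are all hypothetical; the "audio" here is just UTF-8 text bytes so the example runs without any ASR or LLM installed.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def transcribe(audio_hex: str) -> str:
    """Placeholder ASR (assumption): a real server would run a speech model here."""
    return bytes.fromhex(audio_hex).decode("utf-8")


def generate_reply(transcript: str) -> str:
    """Placeholder LLM (assumption): a real server would query a language model here."""
    return f"You said: {transcript}"


class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body sent by the robot-side client.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        transcript = transcribe(payload["audio_hex"])
        body = json.dumps(
            {"transcript": transcript, "reply": generate_reply(transcript)}
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def start_server(port: int = 0) -> HTTPServer:
    """Phase 1: start the server in a background thread (port 0 picks a free port)."""
    server = HTTPServer(("127.0.0.1", port), ChatHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def robot_turn(server_url: str, audio: bytes) -> dict:
    """Phase 2: robot-side step that sends recorded audio and gets back text to speak."""
    data = json.dumps({"audio_hex": audio.hex()}).encode("utf-8")
    request = urllib.request.Request(
        server_url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    server = start_server()
    url = f"http://127.0.0.1:{server.server_address[1]}/chat"
    print(robot_turn(url, b"hello pepper"))
    server.shutdown()
    server.server_close()
```

Keeping the heavy models on a server and the robot as a thin client is one common way to work around limited on-board compute; whether the thesis makes the same split in the same way is not stated in the abstract.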

Collections

Theses (Open collection)