Emotion recognition from text and voice using AI
Shchennikov, Daniil (2025)
Shchennikov, Daniil
2025
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2025120231545
Abstract
This thesis investigates emotion recognition across two modalities, text and speech, by building a privacy-first, reproducible prototype and comparing modality-appropriate baselines under both formal tests and informal mobile trials. The text pathway fine-tunes BERT-base on the GoEmotions dataset as a multi-label classifier with sigmoid/BCE and per-class thresholds calibrated on validation data. The speech pathway trains a compact CRDNN with SpeechBrain on an acted speech corpus as a single-label classifier with softmax/NLL. A React Native/Expo client and FastAPI backend deliver predictions to a phone and capture star-rating feedback; privacy is enforced by storing only privacy-preserving hashed fingerprints of the text and audio, never the raw data.
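To make the text-pathway decision rule concrete, the sketch below shows how per-class thresholds can be applied to the sigmoid outputs of a multi-label classifier. It is illustrative only, not the thesis code: the base checkpoint stands in for the fine-tuned GoEmotions model, and the uniform 0.30 threshold is a placeholder for the per-class values calibrated on validation data.

```python
# Minimal sketch: multi-label inference with sigmoid probabilities and per-class thresholds.
# The model name and threshold values are placeholders, not the thesis artifacts.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-base-uncased"  # a fine-tuned GoEmotions checkpoint would be loaded here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=28, problem_type="multi_label_classification"
)
model.eval()

# Per-class thresholds calibrated on validation data (placeholder: 0.30 for every class).
thresholds = torch.full((28,), 0.30)

def predict_labels(text: str) -> list[int]:
    """Return indices of labels whose sigmoid probability exceeds that label's threshold."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(0)  # shape: (28,)
    probs = torch.sigmoid(logits)                    # independent per-label probabilities
    return (probs > thresholds).nonzero(as_tuple=True)[0].tolist()

print(predict_labels("I can't believe how great this turned out!"))
```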
On test splits derived from the datasets, the audio model reaches 0.833 accuracy (0.829 macro-F1) across eight classes, while the text model achieves 0.566 micro-F1, 0.52 macro-F1, and 0.409 subset accuracy over 28 labels. In informal, in-the-wild trials, audio performance declines to the mid-60% range, whereas text remains comparatively stable but exhibits recurring confusions (e.g. neutral vs. low-arousal phrasing). A conceptual mapping aggregates fine-grained text labels into rough audio categories to support qualitative, analysis-only cross-modal comparison. Average prototype latency is practical for interactive use (≈458 ms text; ≈500 ms audio).
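As an illustration of such a conceptual mapping, the sketch below collapses fine-grained GoEmotions predictions into coarse audio-style categories. The specific label-to-category assignments are hypothetical and shown only to demonstrate the analysis-only mechanism; the thesis defines its own mapping in the main text.

```python
# Minimal sketch: aggregate fine-grained text labels into coarse audio categories
# for qualitative, analysis-only comparison. The mapping below is illustrative.
from collections import Counter

TEXT_TO_AUDIO = {
    "joy": "happy", "amusement": "happy", "excitement": "happy",
    "sadness": "sad", "grief": "sad", "disappointment": "sad",
    "anger": "angry", "annoyance": "angry",
    "fear": "fearful", "nervousness": "fearful",
    "surprise": "surprised",
    "disgust": "disgust",
    "neutral": "neutral",
}

def to_audio_categories(text_labels: list[str]) -> Counter:
    """Collapse predicted fine-grained labels into the coarse audio label space."""
    return Counter(TEXT_TO_AUDIO.get(label, "other") for label in text_labels)

print(to_audio_categories(["joy", "excitement", "surprise"]))
# Counter({'happy': 2, 'surprised': 1})
```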
Contributions include: (1) modality-aware baselines with calibrated decision rules; (2) a transparent, end-to-end pipeline and mobile prototype that unify formal and informal evaluation; (3) empirical evidence clarifying when each modality excels (prosody-driven vs. lexically explicit affect); and (4) actionable guidance for future work—speaker-disjoint, spontaneous speech data; augmentation; automated threshold refresh; late-fusion baselines; and on-device optimization. Overall, the results argue for modality-aware integration of affect models into larger systems where each stream contributes its strengths while respecting privacy.
