Emotion recognition from text and voice using AI
Shchennikov, Daniil (2025)
Shchennikov, Daniil
2025
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2025120231545
Abstract
This thesis investigates emotion recognition across two modalities, text and speech, by building a privacy-first, reproducible prototype and comparing modality-appropriate baselines under both formal tests and informal mobile trials. The text pathway fine-tunes BERT-base on the GoEmotions dataset as a multi-label classifier with sigmoid/BCE and per-class thresholds calibrated on validation data. The speech pathway trains a compact CRDNN with SpeechBrain on an acted speech corpus as a single-label classifier with softmax/NLL. A React Native/Expo client and FastAPI backend deliver predictions to a phone and capture star-rating feedback; privacy is enforced by storing only privacy-preserving hashed fingerprints of the text and audio, never the raw data.
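To make the text-pathway decision rule concrete, the sketch below shows how per-class thresholds can be applied to the sigmoid outputs of a multi-label classifier. It is illustrative only, not the thesis code: the base checkpoint stands in for the fine-tuned GoEmotions model, and the uniform 0.30 threshold is a placeholder for the per-class values calibrated on validation data.

```python
# Minimal sketch: multi-label inference with sigmoid probabilities and per-class thresholds.
# The model name and threshold values are placeholders, not the thesis artifacts.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-base-uncased"  # a fine-tuned GoEmotions checkpoint would be loaded here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=28, problem_type="multi_label_classification"
)
model.eval()

# Per-class thresholds calibrated on validation data (placeholder: 0.30 for every class).
thresholds = torch.full((28,), 0.30)

def predict_labels(text: str) -> list[int]:
    """Return indices of labels whose sigmoid probability exceeds that label's threshold."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(0)  # shape: (28,)
    probs = torch.sigmoid(logits)                    # independent per-label probabilities
    return (probs > thresholds).nonzero(as_tuple=True)[0].tolist()

print(predict_labels("I can't believe how great this turned out!"))
```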
On test splits derived from the datasets, the audio model reaches 0.833 accuracy (0.829 macro-F1) across eight classes, while the text model achieves 0.566 micro-F1, 0.52 macro-F1, and 0.409 subset accuracy over 28 labels. In informal, in-the-wild trials, audio performance declines to the mid-60% range, whereas text remains comparatively stable but exhibits recurring confusions (e.g. neutral vs. low-arousal phrasing). A conceptual mapping aggregates fine-grained text labels into rough audio categories to support qualitative, analysis-only cross-modal comparison. Average prototype latency is practical for interactive use (≈458 ms text; ≈500 ms audio).
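As an illustration of such a conceptual mapping, the sketch below collapses fine-grained GoEmotions predictions into coarse audio-style categories. The specific label-to-category assignments are hypothetical and shown only to demonstrate the analysis-only mechanism; the thesis defines its own mapping in the main text.

```python
# Minimal sketch: aggregate fine-grained text labels into coarse audio categories
# for qualitative, analysis-only comparison. The mapping below is illustrative.
from collections import Counter

TEXT_TO_AUDIO = {
    "joy": "happy", "amusement": "happy", "excitement": "happy",
    "sadness": "sad", "grief": "sad", "disappointment": "sad",
    "anger": "angry", "annoyance": "angry",
    "fear": "fearful", "nervousness": "fearful",
    "surprise": "surprised",
    "disgust": "disgust",
    "neutral": "neutral",
}

def to_audio_categories(text_labels: list[str]) -> Counter:
    """Collapse predicted fine-grained labels into the coarse audio label space."""
    return Counter(TEXT_TO_AUDIO.get(label, "other") for label in text_labels)

print(to_audio_categories(["joy", "excitement", "surprise"]))
# Counter({'happy': 2, 'surprised': 1})
```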
Contributions include: (1) modality-aware baselines with calibrated decision rules; (2) a transparent, end-to-end pipeline and mobile prototype that unify formal and informal evaluation; (3) empirical evidence clarifying when each modality excels (prosody-driven vs. lexically explicit affect); and (4) actionable guidance for future work—speaker-disjoint, spontaneous speech data; augmentation; automated threshold refresh; late-fusion baselines; and on-device optimization. Overall, the results argue for modality-aware integration of affect models into larger systems where each stream contributes its strengths while respecting privacy.
