Dialog Modelling Experiments with Finnish One-to-One Chat Data

Kauttonen, Janne; Aunimo, Lili

Editors

Filchenkov, Andrey

Kauttonen, Janne

Pivovarova, Lidia

Springer

2020

doi:10.1007/978-3-030-59082-6

The permanent address of this publication is
https://urn.fi/URN:NBN:fi-fe2020100878386

Abstract

We analyzed two conversational corpora in Finnish: public library question-answering (QA) data and private medical chat data. We developed response retrieval (ranking) models using TF-IDF, StarSpace, ESIM and BERT methods. These four represent techniques ranging from simple, classical ones to recent pretrained transformer neural networks. We evaluated the effect of different preprocessing strategies, including raw, casing, lemmatization and spell-checking, for the different methods. Using our medical chat data, we also developed a novel three-stage preprocessing pipeline with speaker role classification. We found the BERT model pretrained with Finnish (FinBERT) to be an unambiguous winner in ranking accuracy, reaching 92.2% for the medical chat and 98.7% for the library QA in the 1-out-of-10 response ranking task, where the chance level was 10%. The best accuracies were reached using uncased text with spell-checking (BERT models) or lemmatization (non-BERT models). Preprocessing had less impact on BERT models than on the classical and other neural network models. Furthermore, we found the TF-IDF method to be still a strong baseline for the vocabulary-rich library QA task, even surpassing the more advanced StarSpace method. Our results highlight the complex interplay between preprocessing strategies and model type when choosing the optimal approach in chat-data modelling. Our study is the first work on dialogue modelling using neural networks for the Finnish language. It is also the first of its kind to use real medical chat data. Our work contributes towards the development of automated chatbots in the professional domain.
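
To make the 1-out-of-10 response ranking task concrete, the following is a minimal sketch of a TF-IDF baseline and its top-1 accuracy, not the authors' implementation. Everything in it is an assumption for illustration: the function names, the whitespace tokenizer with lowercasing (standing in for the uncased variant), the smoothed IDF formula, and the invented English toy sentences. With ten candidates per context, a random ranker would be correct about 10% of the time.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercasing + whitespace split stands in for "uncased" text.
    return text.lower().split()


def fit_idf(documents):
    # Smoothed inverse document frequency over a list of token lists.
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict; tokens unseen during fitting are ignored.
    counts = Counter(tokens)
    return {tok: cnt * idf[tok] for tok, cnt in counts.items() if tok in idf}


def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(context, candidates, idf):
    # Candidate indices sorted from most to least similar to the context.
    ctx_vec = tfidf_vector(tokenize(context), idf)
    scores = [cosine(ctx_vec, tfidf_vector(tokenize(c), idf))
              for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_1(examples, idf):
    # Each example is (context, candidates, index_of_true_response).
    hits = sum(1 for ctx, cands, gold in examples
               if rank_candidates(ctx, cands, idf)[0] == gold)
    return hits / len(examples)


# Hypothetical toy data: one context, ten candidates, true response at index 3.
context = "when does the library open on saturday"
candidates = [
    "please take the medicine twice a day",
    "your appointment is next week",
    "you can renew loans online",
    "the library opens at ten on saturday",
    "a nurse will call you back",
    "try restarting your computer",
    "the book is available in another branch",
    "thank you for contacting us",
    "we do not have that information",
    "have a nice day",
]
idf = fit_idf([tokenize(t) for t in candidates + [context]])
ranking = rank_candidates(context, candidates, idf)
accuracy = recall_at_1([(context, candidates, 3)], idf)
print(ranking[0], accuracy)
```

In this sketch "open" and "opens" do not match, which hints at why lemmatization can help a bag-of-words model on a morphologically rich language such as Finnish.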
