Machine Learning Approach to Classifying Finnish News Articles

Taskinen, Anssi

Taskinen, Anssi (2020)

The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-2020120225537

Abstract

This study investigates whether Finnish news articles can be classified using machine learning methods. The work aims both to save the time otherwise spent performing the task manually and to improve the quality of the articles through automatic classification. The study focuses on determining a single keyword for each article based on its content.

The outcome of the study is a test application that provides an interface for requesting a keyword (class) for a submitted article (its textual content). The implementation combines machine learning techniques with traditional programming; the latter is used mainly for data preformatting tasks. The classification model itself is a neural network that combines convolutional layers with standard fully connected layers. Word embeddings and other advanced text preprocessing techniques were used to convert the article texts into numerical form.
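
The pipeline described above (text to numerical form, then convolutional and fully connected layers) can be illustrated with a minimal forward-pass sketch. This is not the thesis implementation: the vocabulary, the labels, all layer sizes, and the function names `encode` and `predict` are hypothetical, and the weights are random rather than trained, so the predicted label carries no meaning. The sketch uses only the Python standard library to show the data flow.

```python
import math
import random
import re

random.seed(0)  # fixed seed so the untrained sketch is reproducible

# Hypothetical toy vocabulary and class labels (assumptions, not from the thesis).
VOCAB = {"<pad>": 0, "<unk>": 1, "hallitus": 2, "budjetti": 3, "ottelu": 4, "maali": 5}
LABELS = ["politiikka", "urheilu"]
EMBED_DIM, NUM_FILTERS, KERNEL, MAX_LEN = 4, 3, 2, 6


def rand_matrix(rows, cols):
    """Create a rows x cols matrix of small random weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Randomly initialised parameters; a real model would learn these from data.
EMBEDDING = rand_matrix(len(VOCAB), EMBED_DIM)
FILTERS = [rand_matrix(KERNEL, EMBED_DIM) for _ in range(NUM_FILTERS)]
DENSE = rand_matrix(len(LABELS), NUM_FILTERS)


def encode(text):
    """Lowercase, tokenise, map tokens to vocabulary ids, pad or truncate to MAX_LEN."""
    tokens = re.findall(r"\w+", text.lower())
    ids = [VOCAB.get(t, VOCAB["<unk>"]) for t in tokens][:MAX_LEN]
    return ids + [VOCAB["<pad>"]] * (MAX_LEN - len(ids))


def conv_max_pool(embedded, kernel):
    """Slide one filter over the sequence, apply ReLU, and keep the maximum."""
    best = 0.0  # starting at 0.0 makes the running max double as ReLU
    for start in range(len(embedded) - KERNEL + 1):
        total = sum(
            kernel[k][d] * embedded[start + k][d]
            for k in range(KERNEL)
            for d in range(EMBED_DIM)
        )
        best = max(best, total)
    return best


def softmax(scores):
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(text):
    """Return (label, probabilities) for a single article text."""
    embedded = [EMBEDDING[i] for i in encode(text)]               # word embedding
    features = [conv_max_pool(embedded, f) for f in FILTERS]      # convolutional layer
    scores = [sum(w * x for w, x in zip(row, features)) for row in DENSE]  # dense layer
    probs = softmax(scores)
    return LABELS[probs.index(max(probs))], probs
```

Calling `predict("Hallitus hyväksyi budjetin")` returns one of the hypothetical labels together with a probability per class. In practice a deep learning framework would replace the hand-written layers, and the embedding, filter, and dense weights would be fitted to labelled articles.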

From the early stages of the work, article classification proved to be a difficult task from the machine learning perspective. Most importantly, only part of the available article data contained the keyword field, which is required for training the machine learning models. Furthermore, most of the articles that did contain this label had only a single keyword. This was taken into account by restricting the models to output a single keyword label for each input article. The results provide insight into the feasibility of classifying text articles automatically.
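
The data constraint described above can be sketched as a filtering step that builds single-label training examples. The record layout (the `text` and `keywords` fields) and the function name `select_single_label` are assumptions for illustration, and so is the choice to drop articles with several keywords; the abstract does not say how such articles were handled.

```python
def select_single_label(articles):
    """Keep articles with exactly one keyword and return (text, keyword) pairs.

    Articles without a keyword field, with an empty one, or with several
    keywords are skipped (an assumption made for this sketch).
    """
    examples = []
    for article in articles:
        keywords = article.get("keywords") or []
        if len(keywords) == 1:
            examples.append((article["text"], keywords[0]))
    return examples


# Hypothetical records showing the cases the filter distinguishes.
sample_articles = [
    {"text": "Hallitus esitteli budjetin.", "keywords": ["politiikka"]},
    {"text": "Ottelu päättyi tasapeliin."},                          # no keyword field
    {"text": "Sää viilenee viikonloppuna.", "keywords": []},         # empty keyword field
    {"text": "Yritys julkisti tuloksensa.", "keywords": ["talous", "pörssi"]},
]

training_examples = select_single_label(sample_articles)
```

With these sample records only the first article survives the filter, which mirrors how a limited share of labelled data reduces the usable training set.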

The test program was implemented successfully and has been used in a test environment to predict keyword class labels for real news articles. The highest observed success rates were nearly 60%. Finally, some proposals for further development are presented at the end of the thesis.
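
Assuming that "success rate" here means the share of articles whose predicted keyword matches the true keyword (plain classification accuracy, which the abstract does not state explicitly), it could be computed as in the sketch below. The function name `success_rate` and the example labels are hypothetical and are not results from the thesis.

```python
def success_rate(predicted, actual):
    """Return the fraction of predictions that equal the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one example is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels: three of the five predictions are correct.
predicted_labels = ["urheilu", "talous", "politiikka", "urheilu", "talous"]
true_labels = ["urheilu", "talous", "kulttuuri", "urheilu", "politiikka"]
rate = success_rate(predicted_labels, true_labels)
```

For these made-up labels the rate is 3 out of 5, or 60%.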

Collections

Theses