Comparing Natural Language Models for Software Category Classification

Turbin, Ivan

Turbin, Ivan (2023)

The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2023112732000

Abstract

The purpose of this thesis is to compare natural language machine learning models and identify differences in how they perform on software category classification. Software category classification is a text classification task that assigns a piece of software to the appropriate category based on its description. A further objective is to explain fundamental machine learning principles, such as data augmentation, normalization and performance analysis, and to describe common natural language models.

To achieve these goals, it is necessary to obtain training data, normalize the gathered data and build a model suitable for text classification. In the present study, the Microsoft and CNET software stores are used as data sources. The categories and descriptions are gathered with a Python scraper, built on the Beautiful Soup library, that is run against the software stores.
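As a rough illustration of the gathering step, the sketch below pulls (category, description) pairs out of a small HTML fragment. It is not the scraper from the thesis: to stay dependency-free it swaps Beautiful Soup for the standard library `html.parser`, and the markup, the class names `category` and `description`, and the helper `extract_records` are all hypothetical stand-ins for whatever the real store pages contain.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real store pages use different tags and class names.
SAMPLE_PAGE = """
<div class="app">
  <span class="category">Utilities</span>
  <p class="description">Compress and extract archive files.</p>
</div>
<div class="app">
  <span class="category">Games</span>
  <p class="description">A puzzle game with daily challenges.</p>
</div>
"""


class CategoryDescriptionParser(HTMLParser):
    """Collects (category, description) pairs from the sample markup."""

    def __init__(self):
        super().__init__()
        self.records = []   # finished (category, description) pairs
        self._field = None  # name of the field currently being read
        self._current = {}  # fields collected for the current entry

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("category", "description"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a tracked element.
        if self._field is not None:
            self._current[self._field] = self._current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if self._field is None:
            return
        self._field = None
        # Emit a record once both fields of an entry have been seen.
        if "category" in self._current and "description" in self._current:
            self.records.append(
                (self._current["category"].strip(),
                 self._current["description"].strip())
            )
            self._current = {}


def extract_records(html):
    """Return the (category, description) pairs found in an HTML string."""
    parser = CategoryDescriptionParser()
    parser.feed(html)
    parser.close()
    return parser.records


records = extract_records(SAMPLE_PAGE)
for category, description in records:
    print(category, "|", description)
```

A real scraper would also need to fetch the pages, respect the sites' terms of use, and handle nested or missing elements, none of which this sketch attempts.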

With the gathered data, CNN, RNN and BERT text classification models were constructed and compared with one another. The models were compared using machine learning performance metrics such as precision, recall, loss, accuracy and classification time, together with the confusion matrix.
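To make some of these metrics concrete, the following minimal sketch derives a confusion matrix, per-class precision and recall, and overall accuracy from lists of true and predicted labels. The category names and predictions are invented for illustration and are not results from the thesis; the function names are likewise assumptions, and the code does not reflect how the thesis computed its figures.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)} computed from the matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


def accuracy(matrix):
    """Share of samples on the diagonal of the matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Invented example data, not taken from the thesis.
LABELS = ["Games", "Utilities", "Security"]
y_true = ["Games", "Games", "Utilities", "Utilities", "Security", "Security"]
y_pred = ["Games", "Utilities", "Utilities", "Utilities", "Security", "Games"]

matrix = confusion_matrix(y_true, y_pred, LABELS)
metrics = per_class_metrics(matrix, LABELS)
overall_accuracy = accuracy(matrix)

for label in LABELS:
    precision, recall = metrics[label]
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={overall_accuracy:.2f}")
```

Reading the matrix row by row shows which categories a model confuses with one another, which a single accuracy figure cannot reveal.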

The findings showed that the CNN is the most suitable of the compared models for text classification on the gathered dataset. The BERT model showed promising results; however, because the model is very large, overfitting could be a problem. Performance can be further improved by fine-tuning the parameters and increasing the dataset size. Other methods of software classification, such as image recognition of the program's user interface, could be applied to increase classification accuracy.
