Occupational skills extraction with FinBERT
Chernova, Mariia (2020)
All rights reserved. This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-2020112524255
Abstract
The job search market is highly competitive, even in a small country such as Finland. Oikotie Työpaikat is a platform where recruiters post jobs and candidates search and apply for open positions. To stay among the leaders in this race, Oikotie Työpaikat wants to take its personalization features to the next level. It was therefore important for the platform to obtain a tool that extracts skills from job postings, so that applications offering a better user experience can be built on it in the future.
The main objective of this master's thesis was to develop a framework that extracts skills from unstructured text such as job descriptions. In the initial phase, the study explored the variety of skills extraction systems currently in use and compared the possible options for implementing the framework. The next step was to investigate NLP techniques that take the context of words into account. These techniques include the self-attention mechanism, RNN, LSTM, Transformer, and BERT. Since the extraction system must be able to process Finnish text, it was decided to use FinBERT, a Finnish-language model built on Google's open-source BERT and developed at the University of Turku. This version of BERT outperformed the earlier multilingual model in a wide range of tasks, especially in classification problems. One of these tasks is named entity recognition (NER), which can be readily applied to extract entities such as skills from unstructured text, and it is the approach used in this study.
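To illustrate how NER output can be turned into skill phrases, the following is a minimal sketch in Python. It assumes a BIO tagging scheme with a single hypothetical `SKILL` label; the abstract does not state which tagging scheme or label names the thesis used, and the function name `extract_skill_phrases` and the example sentence are likewise illustrative.

```python
from typing import List


def extract_skill_phrases(tokens: List[str], tags: List[str]) -> List[str]:
    """Group tokens tagged B-SKILL / I-SKILL into skill phrases.

    The BIO scheme and the SKILL label are assumptions made for this
    sketch, not details taken from the thesis.
    """
    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-SKILL":
            # A new phrase starts; close any phrase that is still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I-SKILL" and current:
            # Continue the phrase that is currently open.
            current.append(token)
        else:
            # Any other tag (or a stray I-SKILL) ends the open phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical tokenized sentence with hand-written tags.
tokens = ["Vaadimme", "hyvät", "vuorovaikutustaidot", "ja", "Python", "osaamista"]
tags = ["O", "B-SKILL", "I-SKILL", "O", "B-SKILL", "O"]
skills = extract_skill_phrases(tokens, tags)
```

In a real pipeline the tags would come from the model's per-token predictions rather than being written by hand.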
The implementation started with data cleaning and pre-processing of the input for the BERT model. The dataset provided by Oikotie Työpaikat contained about 300 000 job advertisements. From the pre-processed data, 100 job descriptions (JDs) were randomly selected and labeled with TagTog, a web-based tool for NLP text annotation. The architecture of the developed model consists of a main block based on FinBERT and an additional simple Dense layer with a softmax activation function. This Dense layer was fine-tuned for the NER task. The developed model was trained and validated. Model performance was evaluated using a confusion matrix and the metrics derived from it: accuracy, precision, recall, and F1-score.
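The relationship between the confusion matrix and the four metrics named above can be sketched as follows. This is a minimal illustration, assuming token-level binary labels with a hypothetical positive class `SKILL`; the thesis may have computed its metrics at a different granularity, and the function name `token_metrics` and the example labels are invented for this sketch.

```python
from typing import Dict, List


def token_metrics(y_true: List[str], y_pred: List[str],
                  positive: str = "SKILL") -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    # The four cells of the binary confusion matrix.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical gold and predicted labels for five tokens.
y_true = ["SKILL", "SKILL", "O", "O", "SKILL"]
y_pred = ["SKILL", "O", "O", "SKILL", "SKILL"]
metrics = token_metrics(y_true, y_pred)
```

Because most tokens in a job description are not skills, accuracy alone can look high for a weak model, which is one reason precision, recall, and F1-score are commonly reported alongside it.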
The developed skill extraction framework achieves noticeable results. Extracted skill phrases included soft skills, hard skills, and various qualification certificates. Moreover, the framework has the potential to become a basis for various user and business applications, examples of which are also presented in this master's thesis.