Experimenting the use of a Large Language Model to improve company name entity resolution
Asikainen, Santeri (2024)
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2024082524351
Abstract
This thesis was commissioned by a software company in the procurement analytics business that sells software-as-a-service products to its customers. One of the key services built into many of these products is the deduplication of company names, which occur both within and across client databases. The case company wanted a better understanding of the current system's performance, of which text patterns lead to manual corrections, and of which solutions could help improve the system's performance.
The study began by building an understanding of the current system and revealing potential issues in the process. Exploratory data analysis was performed on two datasets. The first dataset consisted of manually corrected entries at the most granular supplier level. The second dataset consisted of supplier groups that the current system had failed to group together. The patterns found in both datasets were grouped, and the potential impact of handling or removing the error-causing patterns was estimated. The analysis revealed that the most impactful pattern is the company name itself, while the impact of the other patterns is small. One of the main causes of the identified mistakes was the deterministic, rule-based approach to linking two text strings, which easily led to false non-matching decisions. Because Large Language Models have shown rapid improvement across various tasks in recent years, the study continued by experimenting with an LLM's ability to recognize and extract company names from text strings.
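The false non-matches described above can be illustrated with a minimal sketch of deterministic rule-based matching. The normalization rules below are assumptions for illustration, not the case company's actual logic: two strings link only if their normalized forms are identical, so any extra token in the company name breaks the match.

```python
# Illustrative sketch of deterministic rule-based name matching.
# The normalization rules are assumptions, not the case company's actual system.
import re

def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)
    return " ".join(name.split())

def is_match(a: str, b: str) -> bool:
    """Deterministic rule: link two strings only if normalized forms are equal."""
    return normalize(a) == normalize(b)

print(is_match("Acme Oy", "ACME, oy"))       # casing/punctuation handled
print(is_match("Acme Oy", "Acme Group Oy"))  # extra token -> false non-match
```

Because the rule is all-or-nothing, a single added word yields a non-match even when a human would link the names, which matches the failure mode the analysis identified.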
The experiment was conducted with OpenAI's gpt-3.5-turbo model. The model was given three sample datasets, and its output was compared against a skilled human performing the same task. Quantitative results were calculated as the percentage of correct responses. The model's performance was also analysed qualitatively, with particular interest in what kind of answers the model gives when it fails at the task and whether it 'hallucinates' responses.
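The experimental setup can be sketched as follows. The prompt wording and the exact-match scoring are assumptions for illustration (the abstract does not give them); the request calls OpenAI's chat completions endpoint with gpt-3.5-turbo, and accuracy is the percentage of model answers matching the human reference.

```python
# Illustrative sketch of the experiment, not the thesis's actual code.
# The prompt and exact-match scoring are assumptions.
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def extract_company_name(raw: str) -> str:
    """Ask gpt-3.5-turbo to extract the company name from a raw supplier string."""
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system",
             "content": "Extract the company name from the text. "
                        "Answer with the name only."},
            {"role": "user", "content": raw},
        ],
    }
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()

def accuracy_pct(model_answers, human_answers) -> float:
    """Percentage of model answers exactly matching the human reference."""
    correct = sum(m == h for m, h in zip(model_answers, human_answers))
    return 100 * correct / len(human_answers)
```

Failed responses would additionally be inspected by hand, as in the qualitative part of the study, to see whether the model hallucinates rather than merely mis-extracts.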
The model's performance was 93.67 % for the dataset that had not yet been processed by the case company's entity resolution process and 70-80 % for the datasets that had already been processed. The case company could improve the current system's performance by introducing an LLM into the entity resolution process. LLM identification of company names would help both in finding duplicates within existing groups and in proposing matching keywords for new data entering the system.