Recognizing names out of a string field : sanitation of invoicing data
Lindgren, Niklas (2018)
Lindgren, Niklas
Turun ammattikorkeakoulu
2018
All rights reserved
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2018061313691
Abstract
Finnish e-invoicing standards lack granular standardization, which leads to information being placed in arbitrary text fields. These fields must be parsed into their values and components before the data can be validated and used. The main goal of the work was to build one part of a data analysis system: a component that parses arbitrary name strings and outputs data good enough for use in machine learning and big data applications. The results were production ready and highly accurate.
This thesis briefly introduces big data to explain the need for the technology discussed and describes the cause of the issues in the fields of debt collection and invoicing. Programming languages and operating systems are compared for the practical work. Name recognition relies on many third-party data sources, which are also introduced, including dictionaries, name statistics, and Application Programming Interfaces (APIs).
The name recognition logic is explained with multiple examples, ranging from a field with multiple persons to a mixed field containing both a business and a person. Name recognition itself uses techniques such as tokenization, classification, blacklisting, and intelligent guessing to reach the highest possible accuracy with the data sets used.
The thesis concludes that the name recognition logic built is sufficiently accurate for the needs of this use case and that the results can be used to connect identities from different input strings. Even when recognition is only partial, the output is normalized enough that two nearly identical input strings mostly produce the same result despite minor differences.
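The steps named in the abstract (tokenization, classification, blacklisting, guessing, and normalization) can be illustrated with a minimal Python sketch. This is not the implementation from the thesis: the word lists, the function names (tokenize, classify, recognize), and the guessing rule are all hypothetical stand-ins for the dictionaries, name statistics, and APIs the thesis refers to, and the sketch does not split a field that contains several persons or a business together with a person.

```python
import re

# Hypothetical sample data. A real system would load dictionaries and
# name statistics from external sources instead of hard-coding them.
FIRST_NAMES = {"matti", "maija", "niklas"}
LAST_NAMES = {"virtanen", "korhonen", "lindgren"}
BUSINESS_MARKERS = {"oy", "oyj", "ab", "ky", "tmi"}
BLACKLIST = {"c/o", "ja", "and", "tai"}  # tokens that carry no name information


def tokenize(field):
    """Split a raw field on whitespace, commas, semicolons and ampersands."""
    return [t for t in re.split(r"[\s,;&]+", field.strip()) if t]


def classify(token):
    """Label one token by looking it up in the (assumed) word lists."""
    key = token.lower().rstrip(".")
    if key in BLACKLIST:
        return "ignored"
    if key in BUSINESS_MARKERS:
        return "business"
    if key in FIRST_NAMES:
        return "first"
    if key in LAST_NAMES:
        return "last"
    return "unknown"


def recognize(field):
    """Return a normalized {'kind', 'name'} record for a raw name field."""
    labelled = [(t.rstrip("."), classify(t)) for t in tokenize(field)]
    labelled = [(t, label) for t, label in labelled if label != "ignored"]

    # Any business marker makes the whole field a business name.
    if any(label == "business" for _, label in labelled):
        return {"kind": "business",
                "name": " ".join(t.title() for t, _ in labelled)}

    firsts = [t.title() for t, label in labelled if label == "first"]
    lasts = [t.title() for t, label in labelled if label == "last"]
    unknown = [t.title() for t, label in labelled if label == "unknown"]

    # Naive guess (an assumption of this sketch): unknown tokens fill
    # whichever role is missing, last name first.
    if not lasts:
        lasts, unknown = unknown, []
    elif not firsts:
        firsts, unknown = unknown, []

    # Normalized order: first names, leftover tokens, last names.
    return {"kind": "person", "name": " ".join(firsts + unknown + lasts)}
```

With these assumed lists, recognize("Virtanen, Matti") and recognize("matti virtanen") return the same record, which is the property that allows identities from differently written input strings to be connected.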