Recognizing names out of a string field : sanitation of invoicing data

Lindgren, Niklas (2018)

 
Lindgren, Niklas
Turun ammattikorkeakoulu
2018
All rights reserved
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2018061313691
Abstract
Finnish e-invoicing standards lack granular standardization, so information often ends up in arbitrary free-text fields. These fields must be parsed into their values and components before the data can be validated and used. The main goal of this work was to build one specific part of a data analysis system: a component that parses arbitrary name strings and outputs data good enough for use in machine learning and big data applications. The results were production-ready and highly accurate.

The thesis briefly introduces big data to explain the need for the technology discussed, and it describes the cause of the issues in the fields of debt collection and invoicing. Programming languages and operating systems are compared for the practical work. Name recognition relies on many third-party data sources, which are also introduced, including dictionaries, name statistics, and Application Programming Interfaces (APIs).

The name recognition logic is explained through several examples, ranging from a field containing multiple persons to a mixed field containing both a business and a person. Name recognition itself uses techniques such as tokenization, classification, blacklisting, and intelligent guessing to reach the highest possible accuracy with the data sets used.
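
As a rough illustration of how tokenization, classification, blacklisting, and guessing can fit together, the following minimal Python sketch classifies the tokens of a name field. It is not the implementation from the thesis: the word lists, the function names (`tokenize`, `classify`, `recognize`), and the capitalization-based guess are all assumptions made for this example, and a real system would load dictionaries and name statistics from external sources.

```python
import re

# Hypothetical, tiny word lists used only for illustration.
FIRST_NAMES = {"matti", "maija", "niklas"}
SURNAMES = {"virtanen", "lindgren", "korhonen"}
BUSINESS_MARKERS = {"oy", "oyj", "ab", "ky", "tmi"}
BLACKLIST = {"c/o", "ja", "tai", "laskutus"}  # tokens that are never part of a name


def tokenize(field):
    """Split a free-text field on whitespace, commas, semicolons and ampersands."""
    return [t for t in re.split(r"[\s,;&]+", field.strip()) if t]


def classify(token):
    """Assign a coarse label to one token using the word lists above."""
    key = token.lower().strip(".")
    if key in BLACKLIST:
        return "ignored"
    if key in BUSINESS_MARKERS:
        return "business_marker"
    if key in FIRST_NAMES:
        return "first_name"
    if key in SURNAMES:
        return "surname"
    # Assumed fallback guess: a capitalized alphabetic token is probably a name part.
    if token[:1].isupper() and token.isalpha():
        return "unknown_name"
    return "other"


def recognize(field):
    """Return (token, label) pairs for every token in the field."""
    return [(t, classify(t)) for t in tokenize(field)]


print(recognize("Virtanen Matti ja Maija"))
print(recognize("Rakennus Oy, Matti Virtanen"))
```

In this sketch the first call covers a field with two persons sharing a surname, and the second a mixed field with a business and a person. Grouping the labeled tokens into separate identities would be a further step that is not shown here.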

The thesis concludes that the name recognition logic built is sufficiently accurate for the needs of this use case and that the results can be used to connect identities across different input strings. Even when recognition is only partial, the output is normalized enough that two nearly identical strings mostly lead to the same result, even if one of them differs minimally from the other.
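
To make the idea of normalized output concrete, here is a small Python sketch in which two minimally different input strings map to the same key. The abstract does not describe the normalization steps the thesis actually uses. The function name `normalize_name` and the choices made here (Unicode normalization, case folding, dropping punctuation, sorting tokens) are assumptions for illustration only.

```python
import re
import unicodedata


def normalize_name(field):
    """Build an order-insensitive, case-insensitive key from a name string.

    Assumed steps: NFKC normalization, case folding, keeping only runs of
    letters, and sorting the tokens so that word order does not matter.
    """
    text = unicodedata.normalize("NFKC", field).casefold()
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    return " ".join(sorted(tokens))


a = normalize_name("Virtanen, Matti")
b = normalize_name("MATTI  VIRTANEN.")
print(a == b)
```

Sorting tokens is a deliberately blunt choice: it makes "Virtanen, Matti" and "Matti Virtanen" agree, but it also discards the distinction between first name and surname, which a fuller system would keep from the classification step.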