Multi-target regression with spatially distributed data: prediction of coordinates based on Finnish street addresses
Backlund, Noora (2024)
All rights reserved. This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2024091725306
Abstract
Automation is increasingly used in decision-making processes, reducing the need for manual work and shortening throughput times for business processes. Because not all components of the data are always available at the start of a process, machine learning may be used to estimate some of them, allowing a broader scope of data to enter automation routines. Finnish building location data and road geometry data were used to investigate whether new, previously unknown addresses could be reliably geolocated by predicting latitude and longitude coordinates from lemmatized and vectorized postal address information. Additional assessments examined whether the model could indicate the reliability of a prediction, and whether the predictions overall were accurate enough to be used as part of business processes. The Finnish building data was spatially sampled to cover the geographic extent of Finland relatively evenly and was enriched with addresses generated from road network endpoints. The street names were split into prefix and suffix portions, then lemmatized and vectorized using pre-built models by the TurkuNLP group. Random Forest Regressor, XGBoost with Random Forest Regression, and Support Vector Regression algorithms were explored for training models capable of reliably predicting the latitude and longitude coordinates of a given street address. Bootstrapping options, number of estimators and maximum tree depth were optimized for the Random Forest Regressor, and the best-performing model was coupled with the forestci module to provide prediction accuracy estimates that help rule out inaccurate predictions. XGBoost model tuning explored the number of estimators, subsampling rates and maximum depths. Support Vector Regression proved too resource-intensive on the complex dataset to produce a trained model within 20 hours of runtime.
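The Random Forest tuning step can be illustrated with a minimal sketch. It is not the thesis code: the feature matrix is random stand-in data rather than real address vectors, the grid values are hypothetical and kept small so the example runs quickly, and scikit-learn's `RandomForestRegressor` and `GridSearchCV` are assumed as the tooling.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Stand-in data (assumption): 400 "addresses" with 16-dimensional feature
# vectors, and targets that are (latitude, longitude) pairs inside a rough
# bounding box for Finland.
X = rng.normal(size=(400, 16))
y = np.column_stack([
    rng.uniform(59.8, 70.1, size=400),  # latitude in degrees
    rng.uniform(20.5, 31.6, size=400),  # longitude in degrees
])

# Hypothetical grid over the three hyperparameters named in the abstract.
param_grid = {
    "bootstrap": [True, False],
    "n_estimators": [20, 50],
    "max_depth": [8, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)  # a two-column y makes this a multi-target regression

best_model = search.best_estimator_
predictions = best_model.predict(X[:5])  # shape (5, 2): latitude, longitude
```

Passing both coordinates as a single two-column target lets one forest predict latitude and longitude jointly, so each tree splits on criteria averaged over both outputs.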
The most accurate model was a 300-estimator Random Forest Regressor with a maximum tree depth of 54 levels, coupled with the forestci module to discard the predictions whose variance across individual tree-level predictions fell in the top 5%. The model geolocated unknown addresses an average of 712 meters from their actual locations, with 80% of addresses geolocated within 1000 meters of the true location. It still produced a small number of wildly inaccurate locations, the furthest of which were over 300 kilometers from the true location. Random Forest Regression can be used to automate address geolocation for processes that do not require high precision, though an additional layer of sanity checks should be built on top of the model to validate the results.
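The filtering and evaluation idea can be sketched as follows, under stated assumptions. The data is synthetic, the forest is much smaller than the 300-estimator model, and the uncertainty score is the plain variance across the individual trees' predictions, swapped in here for the forestci estimate used in the thesis. The helper `haversine_m` is a hypothetical name for the standard great-circle distance formula.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between points given in degrees."""
    r = 6_371_000.0  # mean Earth radius in meters
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlmb = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))


rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): coordinates depend on two features.
X = rng.normal(size=(600, 8))
y = np.column_stack([
    64.0 + 0.5 * X[:, 0] + rng.normal(scale=0.05, size=600),  # latitude
    26.0 + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=600),  # longitude
])
X_train, X_test = X[:500], X[500:]
y_train, y_test = y[:500], y[500:]

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Predictions of every individual tree: shape (n_trees, n_samples, 2).
per_tree = np.stack([tree.predict(X_test) for tree in forest.estimators_])

# Variance across trees, summed over the latitude and longitude outputs.
spread = per_tree.var(axis=0).sum(axis=1)

# Discard the 5% of predictions on which the trees disagree the most.
keep = spread <= np.percentile(spread, 95)
pred = forest.predict(X_test)[keep]

errors_m = haversine_m(y_test[keep, 0], y_test[keep, 1], pred[:, 0], pred[:, 1])
mean_error_m = errors_m.mean()
share_within_1km = (errors_m <= 1000).mean()
```

A filter of this kind only removes predictions on which the trees disagree. Trees that agree on a wrong location still pass through, which is consistent with the recommendation to add a separate layer of sanity checks.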