Machine Learning for Small Data: A Comprehensive Study with the Case of Sovereign Debt Default Forecasting
Nguyen, Linh (2024)
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2024053119154
Abstract
This study explores the application of machine learning to forecasting the likelihood of sovereign debt default. The data pose several challenges: a limited sample of 8175 country-year observations covering 180 countries from 1970 to 2019, high dimensionality with over 70 predictors, a missingness ratio exceeding 60%, and an imbalanced binary class distribution (73% non-default, 27% default). To address these challenges and obtain models that generalize well, I investigate suitable data analytics methods, focusing on data imputation and ensemble learning. Single and multiple imputation techniques (including MICE, Amelia II, and MIDAS) are combined with ensemble models (XGBoost, LightGBM, AdaBoost, and Random Forest), and the resulting combinations are compared on prediction and generalization performance. LightGBM trained on MICE-imputed datasets achieves the highest test AUROC, 0.825, indicating robust generalization. Multiple imputation proves promising for small datasets with a high proportion of missing values, outperforming single imputation. However, the complexity of multiple imputation algorithms may adversely affect performance.
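The abstract reports results as test AUROC, a metric that remains informative under class imbalance such as the 73% / 27% split described above. As a minimal sketch (not the author's evaluation code), AUROC can be computed from the rank-sum identity: it is the fraction of (default, non-default) pairs in which the default case receives the higher score, with ties counted as one half. The function name `auroc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Equals the probability that a randomly chosen positive (default) case
    receives a higher score than a randomly chosen negative one, with ties
    counted as one half.
    """
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    # Sort indices by score, then give tied scores their average rank.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U statistic of the positive class, normalized by the number of pairs.
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Hypothetical toy data (assumption): 1 = default, 0 = non-default.
labels = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.05, 0.90]
toy_auroc = auroc(labels, scores)
```

In the toy data, 13 of the 15 default/non-default pairs are ordered correctly, so `toy_auroc` is 13/15. Because the metric depends only on how the two classes are ranked relative to each other, a model that always predicts the majority class gains nothing from the imbalance: constant scores give an AUROC of 0.5.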