Assessing Linkage Risk in Pseudonymized Datasets Under Modern Machine Learning Algorithms
Acosta Der Megerdichian, Juan Sebastian (2026)
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-202601302065
Abstract
Access to machine learning algorithms has expanded rapidly in recent years, lowering technical barriers and enabling a wider range of actors to perform advanced data analysis. At the same time, large datasets have become increasingly valuable for artificial intelligence training and scientific research, raising significant privacy concerns. In response, the European Union introduced the General Data Protection Regulation (GDPR) in 2016, which promotes pseudonymization as a safeguard to protect individual identities in shared datasets. However, record linkage — the process of identifying the same individuals across multiple datasets — can enable unintended re-identification, particularly when machine learning techniques are applied.
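To make the linkage risk concrete, the minimal sketch below shows one common pseudonymization step, replacing a direct identifier with a salted SHA-256 hash, and why it alone does not prevent linkage. The salt value and field names are invented for illustration and are not taken from the thesis's datasets or pipeline.

```python
# Minimal sketch (illustrative, not the thesis pipeline): pseudonymize a
# record by hashing its direct identifier with a salt, leaving the
# quasi-identifiers untouched.
import hashlib

SALT = b"example-salt"  # hypothetical salt; would be kept secret in practice

def pseudonymize(record: dict) -> dict:
    """Replace the direct identifier with a salted SHA-256 hash."""
    out = dict(record)
    out["id"] = hashlib.sha256(SALT + record["id"].encode()).hexdigest()
    return out

record = {"id": "anna.virtanen@example.com", "age": 34, "zip": "00100"}
print(pseudonymize(record))
# The hashed id is no longer readable, but "age" and "zip" survive as
# quasi-identifiers that a linkage model can still exploit across datasets.
```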
This study adopts an experimental approach to evaluate linkage risk in pseudonymized datasets under multiple conditions. It assesses the performance of several machine learning algorithms across three experimental scenarios: (1) evaluating linkage performance under different pseudonymization techniques, (2) measuring the effects of incremental pseudonymization applied step by step, and (3) testing whether models trained on pseudonymized data can successfully link records when a partially leaked, non-pseudonymized dataset becomes available.
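As a rough illustration of the kind of evaluation scenario (1) involves, the sketch below trains a deliberately simple classifier on pairwise string-similarity features between two toy record sets and predicts which candidate pairs refer to the same individual. The field names, toy data, and feature choices are assumptions made for illustration; they are not the thesis's actual datasets, features, or models.

```python
# Illustrative linkage-classifier sketch: score candidate record pairs by
# per-field string similarity and fit a simple model on match/non-match labels.
from difflib import SequenceMatcher

import numpy as np
from sklearn.linear_model import LogisticRegression

def sim(a: str, b: str) -> float:
    """Normalized string similarity between two field values."""
    return SequenceMatcher(None, a, b).ratio()

def pair_features(r1: dict, r2: dict) -> list[float]:
    """One similarity score per shared quasi-identifier field."""
    return [sim(str(r1[f]), str(r2[f])) for f in ("name", "zip", "birth_year")]

# Toy candidate pairs labeled as match (1) or non-match (0).
pairs = [
    ({"name": "Anna Virtanen", "zip": "00100", "birth_year": 1990},
     {"name": "A. Virtanen", "zip": "00100", "birth_year": 1990}, 1),
    ({"name": "Anna Virtanen", "zip": "00100", "birth_year": 1990},
     {"name": "Mikko Laine", "zip": "33100", "birth_year": 1975}, 0),
    ({"name": "Mikko Laine", "zip": "33100", "birth_year": 1975},
     {"name": "M. Laine", "zip": "33100", "birth_year": 1975}, 1),
    ({"name": "Mikko Laine", "zip": "33100", "birth_year": 1975},
     {"name": "Anna Virtanen", "zip": "00100", "birth_year": 1990}, 0),
]
X = np.array([pair_features(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])

model = LogisticRegression().fit(X, y)  # a "relatively simple" model
print(model.predict(X))  # linkage decision for each candidate pair
```

Even a linear model over a handful of similarity features like these can separate matches from non-matches, which is consistent with the abstract's point that strong linkage performance does not require sophisticated architectures.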
The results indicate that even relatively simple machine learning models can achieve strong linkage performance across datasets. Increased cryptographic strength in pseudonymization techniques does not consistently correspond to reduced linkage capability, and in some cases pseudonymization appears to simplify data in ways that facilitate linkage. These results raise concerns about the long-term robustness of current pseudonymization practices in an environment where machine learning tools are widely accessible, datasets are increasingly reused and shared, and data continues to grow in both research and economic value.
