Machine Learning-Based Ransomware Detection Through Static Analysis of PE File Features

Ali, Muhammad (2025)

The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-2025080623818

Abstract

This thesis introduces a machine learning framework for ransomware detection based on static analysis of executable file attributes. Its primary contribution is a practical, low-false-positive detection system that bridges the gap between the performance claims of academic research and the deployment needs of real-world cybersecurity applications. Using the EMBER 2018 dataset, a benchmark collection of real malware samples, the thesis tests a number of machine learning models on a balanced sample of 50,000 samples (25,000 benign and 25,000 malicious). The novelty of the work lies in its extensive feature engineering, which extracts 50 structural features from PE files, including entropy distribution, import features, header features, and histogram statistics, with a deliberate emphasis on minimizing false positives.
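As an illustration of the kind of static features named above, the following minimal sketch computes a byte histogram and Shannon entropy over raw file bytes. It is not the thesis's feature extractor: the function names are hypothetical, and the window and step sizes are illustrative assumptions rather than values taken from the abstract.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list[int]:
    """Count how often each of the 256 possible byte values occurs."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty or constant input)."""
    if not data:
        return 0.0
    total = len(data)
    entropy = 0.0
    for count in Counter(data).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def windowed_entropy(data: bytes, window: int = 2048, step: int = 1024) -> list[float]:
    """Entropy of sliding windows; window and step are assumed values."""
    last_start = max(len(data) - window, 0)
    return [
        shannon_entropy(data[start:start + window])
        for start in range(0, last_start + 1, step)
    ]
```

Packed or encrypted regions tend to show entropy close to 8 bits per byte, which is one reason entropy-based features are commonly used in static malware analysis.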

Cross-validation and evaluation on a large holdout set (7,500 samples) show that ensemble methods, XGBoost in particular, significantly outperform traditional methods. The optimized XGBoost model achieved 94.5% accuracy, 94.6% precision, and 94.4% recall, with a false positive rate of 5.4%, a significant improvement over previous methods, which were plagued by excessive false alarms.
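The reported metrics can be checked against each other with a short calculation. The sketch below assumes the 7,500-sample holdout set is balanced (3,750 benign, 3,750 malicious); the abstract does not state this, so treat it as an assumption. The helper name is hypothetical, and the counts are fractional because they are reconstructed from rounded percentages.

```python
def confusion_from_rates(n_pos: int, n_neg: int, recall: float, precision: float):
    """Reconstruct approximate confusion-matrix counts from recall and precision."""
    tp = recall * n_pos                    # true positives
    fp = tp * (1 - precision) / precision  # from precision = tp / (tp + fp)
    fn = n_pos - tp                        # false negatives
    tn = n_neg - fp                        # true negatives
    return tp, fp, tn, fn


# Assumption: balanced holdout set of 7,500 samples.
tp, fp, tn, fn = confusion_from_rates(3750, 3750, recall=0.944, precision=0.946)

accuracy = (tp + tn) / (tp + fp + tn + fn)
false_positive_rate = fp / (fp + tn)

print(f"accuracy: {accuracy:.1%}")                        # prints "accuracy: 94.5%"
print(f"false positive rate: {false_positive_rate:.1%}")  # prints "false positive rate: 5.4%"
```

Under that assumption, the stated precision and recall reproduce the reported 94.5% accuracy and 5.4% false positive rate, which corresponds to roughly 200 benign files flagged out of 3,750.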

According to the feature importance analysis, the three strongest predictors for ransomware detection were the GUI application flag, entropy in certain byte ranges, and import features. These findings call into question long-held beliefs about which file attributes are the strongest indicators of malicious code and present new challenges for security practitioners.
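The abstract does not say how the GUI application flag is derived. One plausible reading, assumed here, is that it reflects the Subsystem field of the PE optional header, where the value 2 marks a Windows GUI program. The sketch below reads that field directly with the standard library; the function name is hypothetical and this is not the thesis's implementation.

```python
import struct

IMAGE_SUBSYSTEM_WINDOWS_GUI = 2  # Subsystem value for Windows GUI programs


def is_gui_application(data: bytes) -> bool:
    """Return True if the PE optional header declares the Windows GUI subsystem."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The optional header follows the 4-byte signature and 20-byte COFF header.
    optional_header = pe_offset + 4 + 20
    # Subsystem is a 16-bit field at offset 68 in both PE32 and PE32+ headers.
    (subsystem,) = struct.unpack_from("<H", data, optional_header + 68)
    return subsystem == IMAGE_SUBSYSTEM_WINDOWS_GUI
```

A truncated file raises `struct.error` from `unpack_from`; a real feature extractor would likely treat that as a missing value instead of failing.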

This thesis contributes to cybersecurity research in three ways: it establishes realistic performance standards for static analysis-based malware detection, provides a systematic comparison of six machine learning algorithms under identical conditions, and shows that current ensemble methods can achieve practically deployable detection rates with manageable false positive rates. The proposed approach and findings connect theoretical research to operational security considerations and offer a pragmatic way to detect new ransomware variants that traditional signature-based methods do not recognize.

Collections

Theses