Android Malware Detection: Building Useful Representations
Sayfullina, Luiza; Eirola, Emil; Komashinsky, Dmitry; Palumbo, Paolo; Karhunen, Juha (2017)
Sayfullina L., Eirola E., Komashinsky D., Palumbo P., Karhunen J. (2017). Android Malware Detection: Building Useful Representations. 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 201-206, IEEE. doi: http://dx.doi.org/10.1109/ICMLA.2016.0041
The problem of proactively detecting Android malware has proven to be a challenging one. The challenges stem from a variety of issues, but recent literature has shown that the task is hard to solve with high accuracy when only a restricted, fixed set of features, such as permissions, is used. The opposite approach of including all available features is also problematic, as it causes the feature space to grow beyond a reasonable size. In this paper we focus on finding an efficient way to select a representative feature space while preserving its discriminative power on unseen data. We go beyond traditional approaches like Principal Component Analysis, which is too heavy for large-scale problems with millions of features. In particular, we show that the many feature groups that can be extracted from Android application packages, such as features extracted from the manifest file or strings extracted from the Dalvik Executable (DEX), should be filtered and used in classification separately. Our proposed dimensionality reduction scheme is applied to each group separately and consists of raw string preprocessing, feature selection via log-odds, and finally random projections. With the size of the feature space growing exponentially as a function of the training set's size, our approach decreases the size of the feature space by several orders of magnitude, which in turn makes accurate classification possible in a real-world scenario. After reducing the dimensionality, we use the feature groups in a lightweight ensemble of logistic classifiers. We evaluated the proposed classification scheme on real malware data provided by an antivirus vendor and achieved a state-of-the-art 88.24% true positive rate and a reasonably low 0.04% false positive rate with a significantly compressed feature space on a balanced test set of 10,000 samples.
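The two reduction steps named in the abstract, log-odds feature selection followed by random projection, can be illustrated with a minimal NumPy sketch for a single feature group. This is not the authors' implementation: the smoothed log-odds-ratio formula, the Gaussian projection matrix, the function names (`log_odds_scores`, `select_top_k`, `random_projection`), the parameters `alpha`, `k`, and `n_components`, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np


def log_odds_scores(X, y, alpha=1.0):
    """Smoothed log-odds ratio of each binary feature: malware (y=1) vs. clean (y=0).

    Assumed scoring rule; the paper's exact log-odds definition may differ.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    # Additive smoothing (alpha) keeps the probabilities strictly inside (0, 1).
    p_pos = (pos.sum(axis=0) + alpha) / (pos.shape[0] + 2 * alpha)
    p_neg = (neg.sum(axis=0) + alpha) / (neg.shape[0] + 2 * alpha)
    return np.log(p_pos / (1 - p_pos)) - np.log(p_neg / (1 - p_neg))


def select_top_k(scores, k):
    """Indices of the k features with the largest absolute score, in ascending order."""
    return np.sort(np.argsort(-np.abs(scores))[:k])


def random_projection(X, n_components, seed=0):
    """Project the columns of X onto n_components random Gaussian directions."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], n_components)) / np.sqrt(n_components)
    return X @ R


# Toy feature group (hypothetical): rows are apps, columns are binary string features.
X_toy = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 1, 0]])
y_toy = np.array([1, 1, 0, 0])  # 1 = malware, 0 = clean

scores = log_odds_scores(X_toy, y_toy)   # feature 2 is uninformative, so its score is 0
keep = select_top_k(scores, k=2)         # keeps the two discriminative features
Z = random_projection(X_toy[:, keep], n_components=2)
```

In a setting like the one the abstract describes, each feature group (manifest features, DEX strings, and so on) would go through these steps separately, and the resulting low-dimensional matrices would each feed one logistic classifier of the ensemble. For very large sparse inputs, a sparse projection matrix would likely be preferable to the dense Gaussian matrix used here.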