Quantum Chemistry Preprocessing for Industrial Data: A Case Study with SOAP and MNIST
Morooka, Eiaki Von Roeder (2025)
All rights reserved. This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-2025060921584
Abstract
This study explored the application of quantum chemistry-inspired feature extraction to structured point cloud data, using the Smooth Overlap of Atomic Positions (SOAP) descriptor for classification tasks. The work focused on adapting SOAP, traditionally used for atomic structures, to 2D grayscale MNIST handwritten digits transformed into 3D point clouds. The aim was to obtain representations that are invariant under rotation, translation, and mirror transformations, eliminating the need for data augmentation. Methodologically, the MNIST digits were converted into 3D point clouds, and SOAP was applied to generate rotationally and translationally invariant feature vectors. The resulting SOAP power spectra were used as inputs for classification models. Dimensionality reduction with autoencoders was investigated to analyze the trade-off between feature compression and information retention. Additionally, the robustness of the approach was tested by adding Gaussian noise to the data and evaluating classification performance under these perturbations. The findings demonstrate that SOAP-based feature extraction effectively captures local structural information, enabling accurate classification without reliance on augmented data. Dimensionality reduction preserved the essential features while significantly compressing the data. The method remained stable under noise, maintaining classification performance despite the perturbations. The study highlights the versatility of SOAP for point cloud data outside chemistry, offering a robust alternative to traditional feature extraction techniques. The results suggest potential applications in other domains that require invariant representations of structured data. Further research could explore scalability to larger datasets or to different data modalities.
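The pipeline described in the abstract can be illustrated with a small sketch. It is not the thesis implementation and does not compute SOAP: as a stand-in, it uses the sorted list of pairwise distances, a much simpler descriptor that shares the invariance under rotation, translation, and mirroring. The 5x5 toy image, the pixel-to-point mapping (column to x, row to y, intensity to z), the threshold, and all function names are assumptions made for illustration only.

```python
import math

# Hypothetical 5x5 grayscale "digit" with intensities in [0, 1]; a real MNIST
# image would be 28x28.
IMAGE = [
    [0.0, 0.0, 0.9, 0.0, 0.0],
    [0.0, 0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.7, 1.0, 0.6, 0.0],
]


def image_to_point_cloud(image, threshold=0.5, z_scale=1.0):
    """Turn each pixel brighter than `threshold` into a 3D point.

    Assumed mapping: x = column, y = row, z = intensity * z_scale.
    """
    points = []
    for row, line in enumerate(image):
        for col, value in enumerate(line):
            if value > threshold:
                points.append((float(col), float(row), value * z_scale))
    return points


def sorted_pair_distances(points):
    """Stand-in descriptor (NOT SOAP): all pairwise distances, sorted."""
    return sorted(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )


def rotate_about_z(points, angle):
    """Rotate every point by `angle` radians around the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def translate(points, shift):
    """Shift every point by the vector `shift` = (dx, dy, dz)."""
    dx, dy, dz = shift
    return [(x + dx, y + dy, z + dz) for x, y, z in points]


def mirror_x(points):
    """Reflect every point through the plane x = 0."""
    return [(-x, y, z) for x, y, z in points]


def same_descriptor(a, b, tol=1e-9):
    """Compare two descriptors element by element within a tolerance."""
    return len(a) == len(b) and all(
        math.isclose(u, v, abs_tol=tol) for u, v in zip(a, b)
    )


cloud = image_to_point_cloud(IMAGE)
reference = sorted_pair_distances(cloud)

# Apply a rotation, a translation, and a mirror operation in sequence.
moved = mirror_x(translate(rotate_about_z(cloud, 0.7), (3.0, -2.0, 1.5)))
print(same_descriptor(reference, sorted_pair_distances(moved)))
```

Because the descriptor depends only on distances between points, no rotated, shifted, or mirrored copies of the digit are needed at training time, which is the motivation the abstract gives for avoiding data augmentation. An actual SOAP power spectrum would instead be computed with a dedicated library and carries richer local structural information than this distance list.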