Version 1
Received: 16 February 2022 / Approved: 18 February 2022 / Online: 18 February 2022 (02:19:27 CET)
How to cite:
Hattab, G.; Neumann, N.; Anžel, A.; Heider, D. A Parametric Approach to Molecular Encodings of Carbon-based Multilevel Atomic Neighborhoods. Preprints 2022, 2022020220. https://doi.org/10.20944/preprints202202.0220.v1
APA Style
Hattab, G., Neumann, N., Anžel, A., & Heider, D. (2022). A Parametric Approach to Molecular Encodings of Carbon-based Multilevel Atomic Neighborhoods. Preprints. https://doi.org/10.20944/preprints202202.0220.v1
Chicago/Turabian Style
Hattab, G., N. Neumann, Aleksandar Anžel, and Dominik Heider. 2022. "A Parametric Approach to Molecular Encodings of Carbon-based Multilevel Atomic Neighborhoods." Preprints. https://doi.org/10.20944/preprints202202.0220.v1
Abstract
Exploring new ways to represent and discover organic molecules is critical for developing novel therapies. Recent advances in bioinformatics have made virtual screening of databases possible. However, biochemical data must first be encoded into a machine-readable form, and the encoding must account for distance and similarity measures to support tasks such as similarity searching. Motivated by the ubiquity of carbon and the structured patterns that emerge from it, we propose a parametric approach to molecular encodings of carbon-based multilevel atomic neighborhoods. It walks along the carbon chain of an organic molecule to compute different representations of its feature encoding as a binary or numerical array, which can later be exported as an image. The resulting encodings are reproducible and readily formatted for various domain tasks, including machine learning. The approach was evaluated using 10-fold stratified cross-validation for binary classification with eight data sets and six different encodings (384 models) in the domain of cell-penetrating peptides. The parametric approach is built on open-source software and is implemented as a Python package (cmangoes). Source code and documentation are available at https://github.com/ghattab/cmangoes.
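To make the idea of a multilevel atomic neighborhood concrete, the following is a minimal sketch and not the cmangoes implementation or its API. It assumes a hand-written toy molecule (ethanol), visits each carbon atom in index order as a stand-in for the walk along the carbon chain, and counts the elements found at each bond distance from that carbon. The names `ATOMS`, `BONDS`, `ELEMENTS`, `neighborhood_levels`, and `encode` are all hypothetical.

```python
# Hypothetical toy molecule: ethanol (CH3-CH2-OH), hydrogens included.
ATOMS = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
BONDS = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]
ELEMENTS = ["C", "H", "O"]  # assumed column order within each level


def build_adjacency(n_atoms, bonds):
    """Return an undirected adjacency list for the molecular graph."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def neighborhood_levels(adjacency, start, max_level):
    """Group atoms by bond distance 1..max_level from `start` (breadth-first)."""
    seen = {start}
    frontier = [start]
    levels = []
    for _ in range(max_level):
        next_frontier = []
        for atom in frontier:
            for neighbor in adjacency[atom]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        levels.append(next_frontier)
        frontier = next_frontier
    return levels


def encode(atoms, bonds, max_level=2, binary=False):
    """One row per carbon; each level contributes one count per element."""
    adjacency = build_adjacency(len(atoms), bonds)
    rows = []
    for index, element in enumerate(atoms):
        if element != "C":
            continue  # only carbon atoms anchor a neighborhood
        row = []
        for level in neighborhood_levels(adjacency, index, max_level):
            counts = [sum(1 for a in level if atoms[a] == e) for e in ELEMENTS]
            # Binary variant records presence or absence instead of counts.
            row.extend(int(c > 0) if binary else c for c in counts)
        rows.append(row)
    return rows


numerical = encode(ATOMS, BONDS)             # numerical array
presence = encode(ATOMS, BONDS, binary=True)  # binary array
```

Here `max_level` and `binary` play the role of the encoding parameters: the first sets how many neighborhood levels are recorded per carbon, the second switches between a numerical and a binary array. Either array could then be written out as an image, one pixel per cell.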
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.