Version 1
: Received: 30 November 2023 / Approved: 1 December 2023 / Online: 1 December 2023 (05:39:10 CET)
Version 2
: Received: 25 June 2024 / Approved: 26 June 2024 / Online: 26 June 2024 (10:02:50 CEST)
How to cite:
Ahmad, W.; Iqbal, M.; Amin, M. A.; Bangyal, W. H.; Shahzad, A. R. Machine Learning Driven Dashboard for Chronic Myeloid Leukemia Prediction using Protein Sequences. Preprints 2023, 2023120053. https://doi.org/10.20944/preprints202312.0053.v1
APA Style
Ahmad, W., Iqbal, M., Amin, M. A., Bangyal, W. H., & Shahzad, A. R. (2023). Machine Learning Driven Dashboard for Chronic Myeloid Leukemia Prediction using Protein Sequences. Preprints. https://doi.org/10.20944/preprints202312.0053.v1
Chicago/Turabian Style
Ahmad, W., Waqas Haider Bangyal and Abdul Raheem Shahzad. 2023. "Machine Learning Driven Dashboard for Chronic Myeloid Leukemia Prediction using Protein Sequences." Preprints. https://doi.org/10.20944/preprints202312.0053.v1
Abstract
In Southeast Asia, the incidence of leukemia, a malignant blood cancer originating from hematopoietic progenitor cells, is rising, with a mortality rate of 54%. This study aims to improve early-stage prediction, which can substantially improve patients' prospects of recovery. Using machine learning, we predict Chronic Myeloid Leukemia (CML) from protein sequence data of frequently mutated genes such as BCL2, HSP90, PARP, and RB. Features are extracted with Dipeptide Composition (DPC), Amino Acid Composition (AAC), and Pseudo Amino Acid Composition (Pse-AAC), after outliers are addressed, and feature selection is validated with the Pearson correlation coefficient. Data augmentation is used to give a well-rounded dataset for analysis. Across a range of machine learning models, including Support Vector Machine (SVM), XGBoost, Random Forest (RF), K-Nearest Neighbor (KNN), Decision Tree (DT), and Logistic Regression (LR), accuracy ranges from 66% to 94%. The classifiers are assessed using accuracy, sensitivity, specificity, F1-score, and the confusion matrix. The proposed solution includes a user-friendly web application dashboard, intended as a tool for early CML diagnosis that practitioners can deploy within healthcare institutions and hospitals.
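As a rough illustration of two of the feature extraction techniques named in the abstract, the sketch below computes AAC (20 features) and DPC (400 features) for a single protein sequence. The function names `aac` and `dpc`, the residue ordering, and the decision to skip non-standard residues are assumptions made for this example; the abstract does not describe the authors' implementation, and Pse-AAC is omitted because its parameters are not given.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids; this ordering is an assumption for the example.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def _clean(sequence: str) -> str:
    """Uppercase the sequence and drop anything that is not a standard residue.

    Assumption: non-standard symbols (X, B, Z, gaps) are simply skipped,
    so residues on either side of a skipped symbol become adjacent.
    """
    return "".join(r for r in sequence.upper() if r in AMINO_ACIDS)


def aac(sequence: str) -> list[float]:
    """Amino Acid Composition: relative frequency of each residue (20 values)."""
    seq = _clean(sequence)
    counts = Counter(seq)
    total = len(seq)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def dpc(sequence: str) -> list[float]:
    """Dipeptide Composition: relative frequency of each adjacent pair (400 values)."""
    seq = _clean(sequence)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 0)  # number of overlapping dipeptides
    return [
        pairs[a + b] / total if total else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]


if __name__ == "__main__":
    example = "MAHAGRTGYDNREIVMKYIHYKLSQRGYEW"  # arbitrary illustrative sequence
    features = aac(example) + dpc(example)
    print(len(features))  # 20 AAC values followed by 400 DPC values
```

Concatenating the two vectors yields a fixed-length representation regardless of sequence length, which is what allows classifiers such as SVM or Random Forest to be trained on proteins of different sizes.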
Keywords
Protein Sequences; Pseudo-AAC; AAC; Dipeptide-C; Machine Learning Classifiers; Chronic Myeloid Leukemia; Blood Cancer
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.