
The performance of the machine learning approach in genome-wide association studies of disease
by Gennady Khvorykh | Mikhail Belousov | Svetlana Limborska | Andrey Khrunin | National Research Centre "Kurchatov Institute" | National Research University Higher School of Economics | National Research Centre "Kurchatov Institute" | National Research Centre "Kurchatov Institute"
Abstract ID: 166
Event: BGRS-abstracts
Sections: [Sym 4] Section “Genome-wide association studies”

Machine learning (ML) approaches are gaining ground over classical statistics in genome-wide association studies (GWAS). However, their effectiveness has yet to be defined and demonstrated. In this study, we used artificial genotype-phenotype data to evaluate the accuracy, specificity, and sensitivity with which disease-associated single nucleotide polymorphisms (SNPs) are identified by Random Forest (RF) and eXtreme Gradient Boosted trees (XGBoost), with SNPs ranked by impurity-based feature importance and by SHapley Additive exPlanations (SHAP) values. We examined datasets both with and without accounting for linkage disequilibrium (LD). Our findings suggest that ML algorithms and measures of feature contribution to the outcome variable are effective when the case and control groups are equal in size and together comprise several thousand individuals.
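
An evaluation of this kind can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the simulation model (independent SNPs without LD, additive logistic effects), every parameter value, and the function names `simulate_case_control`, `rank_snps`, and `detection_metrics` are assumptions made for illustration. Only RF with impurity-based importance from scikit-learn is shown; ranking by XGBoost or SHAP values would follow the same pattern but is omitted here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def simulate_case_control(n_individuals=4000, n_snps=50, causal=(0, 1, 2),
                          effect=1.0, seed=0):
    """Simulate independent SNPs (no LD) and a balanced case-control sample.

    All defaults are illustrative assumptions, not values from the study.
    """
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.1, 0.5, size=n_snps)            # minor allele frequencies
    genotypes = rng.binomial(2, maf, size=(n_individuals, n_snps))  # 0/1/2 coding
    causal_idx = np.array(causal)
    # Additive logistic model on the centered causal genotypes.
    centered = genotypes[:, causal_idx] - 2 * maf[causal_idx]
    prob = 1.0 / (1.0 + np.exp(-effect * centered.sum(axis=1)))
    status = rng.binomial(1, prob)                      # 1 = case, 0 = control
    # Subsample so that the case and control groups have equal size.
    cases = np.flatnonzero(status == 1)
    controls = np.flatnonzero(status == 0)
    n = min(len(cases), len(controls))
    keep = np.concatenate([rng.choice(cases, n, replace=False),
                           rng.choice(controls, n, replace=False)])
    return genotypes[keep], status[keep]


def rank_snps(X, y, seed=0):
    """Rank SNP indices by RF impurity-based importance, highest first."""
    forest = RandomForestClassifier(n_estimators=200, min_samples_leaf=50,
                                    random_state=seed)  # hyperparameters assumed
    forest.fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1]


def detection_metrics(ranking, causal, top_k):
    """Treat the top_k ranked SNPs as 'associated' and score them against truth."""
    called = set(ranking[:top_k].tolist())
    truth = set(causal)
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(ranking) - tp - fp - fn
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / len(ranking)}


causal = (0, 1, 2)
X, y = simulate_case_control(causal=causal)
ranking = rank_snps(X, y)
metrics = detection_metrics(ranking, causal, top_k=len(causal))
print(metrics)
```

Calling the top `k` SNPs "associated" is only one possible decision rule. A threshold on the importance values, or a comparison against importances obtained from permuted phenotypes, could be substituted without changing the rest of the sketch.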