
The performance of machine learning approach in genome-wide association study of disease

Authors:
Gennady Khvorykh, National Research Centre "Kurchatov Institute"
Mikhail Belousov, National Research University Higher School of Economics
Svetlana Limborska, National Research Centre "Kurchatov Institute"
Andrey Khrunin, National Research Centre "Kurchatov Institute"

Abstract ID: 166

Event: BGRS-abstracts

Sections: [Sym 4] Section “Genome-wide association studies”

Machine learning (ML) approaches are gaining ground over classical statistics in genome-wide association studies (GWAS). However, their effectiveness still needs to be defined and proven. In this study, we used artificial genotype-phenotype data to evaluate the accuracy, specificity, and sensitivity with which disease-associated single nucleotide polymorphisms (SNPs) are identified by Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) models, followed by SNP ranking based on impurity-based feature importance and SHapley Additive exPlanations (SHAP) values. We examined datasets both with and without consideration of linkage disequilibrium (LD). Our findings suggest that ML algorithms and measures of feature contribution to the outcome variable are effective when the case and control groups are equal in size and together comprise several thousand individuals.
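
The kind of evaluation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the simulation model (independent SNPs, a logistic disease model, no LD), all parameter values, and the function names simulate_genotypes, rank_snps, and selection_metrics are assumptions made for illustration only. It covers only the RF and impurity-based importance branch; XGBoost and SHAP ranking would slot in where rank_snps is called.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def simulate_genotypes(n_samples=4000, n_snps=200, n_causal=5, effect=1.0, seed=0):
    """Simulate independent SNPs (no LD) and a binary phenotype.

    Hypothetical model: genotypes are 0/1/2 minor-allele counts, and disease
    status follows a logistic model on the causal SNPs.
    """
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.2, 0.5, size=n_snps)            # minor allele frequencies
    X = rng.binomial(2, maf, size=(n_samples, n_snps))  # genotype matrix
    causal = rng.choice(n_snps, size=n_causal, replace=False)
    # Centre the genetic score so that cases and controls are roughly balanced.
    score = effect * (X[:, causal] - 2 * maf[causal]).sum(axis=1)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-score)))
    # Enforce equal case and control group sizes by truncating the larger group.
    cases, controls = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    n = min(len(cases), len(controls))
    keep = np.concatenate([cases[:n], controls[:n]])
    return X[keep], y[keep], causal


def rank_snps(X, y, top_k, seed=0):
    """Fit a Random Forest and return the indices of the top_k SNPs
    ranked by impurity-based feature importance."""
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]
    return order[:top_k]


def selection_metrics(selected, causal, n_snps):
    """Accuracy, sensitivity, and specificity of SNP selection,
    treating each SNP as one classification decision."""
    selected, causal = set(map(int, selected)), set(map(int, causal))
    tp = len(selected & causal)
    fp = len(selected - causal)
    fn = len(causal - selected)
    tn = n_snps - tp - fp - fn
    return {
        "accuracy": (tp + tn) / n_snps,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


if __name__ == "__main__":
    X, y, causal = simulate_genotypes()
    selected = rank_snps(X, y, top_k=len(causal))
    print(selection_metrics(selected, causal, n_snps=X.shape[1]))
```

In this sketch the number of selected SNPs is fixed to the number of causal SNPs, which is one of several possible cut-off rules; the effect size and sample size are arbitrary and chosen only so the example runs quickly.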