Machine learning (ML) is gaining ground over classical statistics in genome-wide association studies (GWAS), but its effectiveness has yet to be clearly defined and demonstrated. In this study, we used artificial genotype-phenotype data to evaluate the accuracy, specificity, and sensitivity with which Random Forest (RF) and eXtreme Gradient Boosted trees (XGBoost) identify disease-associated single nucleotide polymorphisms (SNPs) when the SNPs are ranked by impurity-based feature importance and by SHapley Additive exPlanations (SHAP) values. We examined datasets both with and without linkage disequilibrium (LD) taken into account. Our findings suggest that ML algorithms and measures of feature contribution to the outcome variable are effective when the case and control groups are equal in size and together comprise several thousand individuals.
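As a rough illustration of the kind of ranking evaluated here, the following minimal sketch simulates unlinked SNP genotypes for balanced case and control groups, fits a Random Forest, and ranks SNPs by impurity-based importance. It is not the study's pipeline: the function names, the logistic phenotype model, the allele-frequency range, the effect size, and all default parameters are assumptions chosen for illustration, LD is ignored, and XGBoost and SHAP are left out to keep the example dependent only on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def simulate_case_control(n_per_group=1000, n_snps=200, n_causal=5,
                          effect=0.8, seed=0):
    """Simulate independent SNPs coded 0/1/2 with equal-sized case and control groups.

    Hypothetical generative model: the first `n_causal` SNPs add `effect`
    per minor allele to the log-odds of disease; all other SNPs are noise.
    """
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.1, 0.5, size=n_snps)      # minor allele frequencies
    causal = np.arange(n_causal)                  # indices of causal SNPs

    # Draw a larger population, then subsample balanced groups from it.
    n_pop = 4 * n_per_group
    genotypes = rng.binomial(2, maf, size=(n_pop, n_snps))
    centered = genotypes[:, causal] - 2 * maf[causal]
    log_odds = effect * centered.sum(axis=1)
    is_case = rng.random(n_pop) < 1.0 / (1.0 + np.exp(-log_odds))

    cases = rng.choice(np.flatnonzero(is_case), n_per_group, replace=False)
    controls = rng.choice(np.flatnonzero(~is_case), n_per_group, replace=False)
    idx = np.concatenate([cases, controls])
    return genotypes[idx], is_case[idx].astype(int), causal


def rank_snps_by_impurity(X, y, n_trees=200, seed=0):
    """Return SNP indices sorted from most to least important (impurity-based)."""
    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    model.fit(X, y)
    return np.argsort(model.feature_importances_)[::-1]


def sensitivity_at_k(ranking, causal, k):
    """Fraction of causal SNPs that appear among the top-k ranked SNPs."""
    top_k = set(ranking[:k].tolist())
    return len(top_k & set(causal.tolist())) / len(causal)


if __name__ == "__main__":
    X, y, causal = simulate_case_control()
    ranking = rank_snps_by_impurity(X, y)
    print("Top 10 SNPs:", ranking[:10])
    print("Sensitivity at k=10:", sensitivity_at_k(ranking, causal, k=10))
```

Varying `n_per_group`, or drawing unequal numbers of cases and controls, is one way to probe how group size and balance affect the ranking in a toy setting like this one.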