Searching for DNA-Binding Proteins (DBPs) Using Deep Learning Methods
by Alexander Gavrilenko | Institute for Artificial Intelligence, Moscow State University
Abstract ID: 380
Event: BGRS-abstracts
Sections: [Sym 12] Section “Systems theory, big biological data analysis, ontologies and artificial intelligence”

DNA-binding proteins (DBPs) are pivotal in many biological processes, including replication, DNA repair, transcription regulation, and translation. Despite their significance, traditional experimental methods for studying protein-DNA interactions are labor-intensive and prone to bias. Advances in high-throughput sequencing have produced large volumes of genomic data, yet fewer than 1% of proteins in UniProtKB have experimentally verified annotations, which highlights the need for automated methods of identifying DBPs. Pre-trained protein language models, which are analogous to the language models used in natural language processing, offer a promising solution: they encode protein sequences as numerical vectors that capture the proteins' properties and functions.

This study aims to develop an algorithm that identifies DBPs using a pre-trained protein language model, Ankh, and a gradient boosting classifier built with LightAutoML. The dataset was sourced from UniProtKB/Swiss-Prot and comprises 31,803 positive (DNA-binding) and 31,803 negative (non-DNA-binding) proteins.
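In a pipeline of this kind, each variable-length sequence has to become a fixed-length feature vector before it reaches the classifier. The following is a minimal sketch of that step only, under stated assumptions: the abstract does not say how per-residue embeddings are pooled, so mean pooling is assumed here, and a toy one-hot encoder stands in for Ankh. The names `toy_residue_embeddings` and `mean_pool` are hypothetical and are not part of Ankh or LightAutoML.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_residue_embeddings(sequence: str) -> List[List[float]]:
    """Hypothetical stand-in for a protein language model.

    Returns one one-hot vector per residue. A real model such as Ankh
    would return learned, much higher-dimensional vectors instead.
    """
    rows = []
    for residue in sequence.upper():
        row = [0.0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # non-standard residues are left as all-zero rows
            row[idx] = 1.0
        rows.append(row)
    return rows


def mean_pool(residue_vectors: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length vector (assumed pooling)."""
    if not residue_vectors:
        raise ValueError("cannot pool an empty sequence")
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(vec[i] for vec in residue_vectors) / n for i in range(dim)]


# Sequences of any length map to a vector of the same dimensionality.
features = mean_pool(toy_residue_embeddings("MKKA"))
```

The resulting `features` vector has the same length for every protein, which is what lets a tabular learner such as a gradient boosting classifier consume it together with the 0/1 DNA-binding label.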

The classification algorithm achieved the following metrics: Precision (0.816), Sensitivity (0.859), Specificity (0.800), and Matthews Correlation Coefficient (MCC, 0.661). On the PDB2272 test dataset, our model achieved an MCC up to 7.84% higher than that of BiCaps-DBP. These results indicate that the algorithm performs well relative to established methods such as PHMMER, BiCaps-DBP, and Local-DPP. The developed algorithm can significantly accelerate and reduce the cost of identifying DNA-binding proteins, which has important implications for medical and biotechnological research and supports a faster, more accurate understanding of DBP functions in living organisms.
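For reference, the four reported metrics are all derived from the confusion matrix of a binary classifier. The sketch below uses the standard definitions; the function name `binary_metrics` and the counts in the usage example are illustrative assumptions and are not taken from the study.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute precision, sensitivity, specificity and MCC from confusion counts."""
    precision = tp / (tp + fp)      # share of predicted DBPs that are true DBPs
    sensitivity = tp / (tp + fn)    # share of true DBPs that were found (recall)
    specificity = tn / (tn + fp)    # share of non-DBPs correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Illustrative counts only (not the study's data): 100 DBPs and 100 non-DBPs.
example = binary_metrics(tp=86, fp=20, tn=80, fn=14)
```

Unlike precision or sensitivity alone, MCC uses all four cells of the confusion matrix, which makes it a common single-number summary when comparing DBP predictors.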