A unified data preprocessing framework to address inconsistencies in comparative plastid genome studies
by Asan Emirsaliev (ICG SB RAS) | Irina Mitrofanova (N.V. Tsitsin MBG RAS) | Elena Salina (ICG SB RAS) | Dmitry Afonnikov (ICG SB RAS)
Abstract ID: 763
Event: BGRS-abstracts
Sections: [Sym 6] Section “Genomics, genetics and systems biology of plants”

Plastid genomes (plastomes) offer valuable data for phylogenetic analysis and plant genotyping. However, the diverse approaches, tools, and standards used over time have led to inconsistent feature names and annotations, differing plastome sequence representations, and duplicated database records, all of which complicate comparative genomic studies.
This work presents a unified data preprocessing framework that standardizes plastid genome data to enable accurate comparative analyses. The framework addresses common discrepancies in successive stages: data retrieval, deduplication, annotation assessment, sequence validation, record selection, genome region ordering and orientation, annotation completion, and data extraction tailored to the purpose of the analysis.
This approach could be extended to other small, complex circular genomes, such as mitochondrial genomes, or integrated into existing plastome analysis pipelines.
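
As an illustration of why the deduplication stage is not trivial for circular genomes, the following minimal Python sketch groups records whose sequences differ only in start coordinate or strand. It is not the framework's implementation: the function names (`canonical_circular`, `group_duplicates`) and the record IDs are hypothetical, and the sketch assumes that two records are duplicates when one sequence is a rotation of the other or of its reverse complement. It does not handle ambiguity codes beyond leaving them unchanged.

```python
# Illustrative sketch only; names and the duplicate criterion are assumptions.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA string (other symbols are kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def least_rotation_start(s: str) -> int:
    """Index at which the lexicographically smallest rotation of s starts.

    Standard two-pointer minimal-rotation algorithm, linear in len(s).
    """
    n = len(s)
    i, j, k = 0, 1, 0
    while i < n and j < n and k < n:
        a, b = s[(i + k) % n], s[(j + k) % n]
        if a == b:
            k += 1
            continue
        if a > b:
            i += k + 1  # no rotation starting in i..i+k can be minimal
        else:
            j += k + 1
        if i == j:
            j += 1
        k = 0
    return min(i, j)


def canonical_circular(seq: str) -> str:
    """Canonical form of a circular sequence, independent of start and strand."""
    forward = seq.upper()
    candidates = []
    for strand in (forward, reverse_complement(forward)):
        start = least_rotation_start(strand)
        candidates.append(strand[start:] + strand[:start])
    return min(candidates)


def group_duplicates(records: dict[str, str]) -> list[list[str]]:
    """Group record IDs that share the same canonical circular sequence."""
    groups: dict[str, list[str]] = {}
    for record_id, seq in records.items():
        groups.setdefault(canonical_circular(seq), []).append(record_id)
    return list(groups.values())


if __name__ == "__main__":
    # Toy sequences with hypothetical record IDs, not real accessions.
    toy = {
        "rec1": "GATTACA",
        "rec2": "TACAGAT",  # same circle, different start position
        "rec3": "aatctgt",  # reverse strand, lowercase
        "rec4": "GATTACC",  # a different sequence
    }
    print(group_duplicates(toy))
```

In practice, keying the groups by a hash of the canonical string rather than the string itself would reduce memory use for full-length plastomes; near-identical records that differ by a few bases would still need a separate similarity criterion.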