Accepted_test

A unified data preprocessing framework to address inconsistencies in comparative plastid genome studies

Authors:
Asan Emirsaliev, ICG SB RAS
Irina Mitrofanova, N.V.Tsitsin MBG RAS
Elena Salina, ICG SB RAS
Dmitry Afonnikov, ICG SB RAS

Abstract ID: 763

Event: BGRS-abstracts

Sections: [Sym 6] Section “Genomics, genetics and systems biology of plants”

Plastid genomes (plastomes) offer valuable data for phylogenetic analysis and plant genotyping. However, the diverse approaches, tools, and standards used over time have led to inconsistent feature names and annotations, differing plastome sequence representations, and duplicated database records, all of which complicate comparative genomic studies.
This work presents a unified data preprocessing framework that standardizes plastid genome data to enable accurate comparative analyses. It addresses common discrepancies through the following stages: data retrieval, deduplication, annotation assessment, sequence validation, record selection, genome region ordering and orientation, annotation completion, and data extraction based on the purpose of the analysis.
This approach could be extended to other small, complex circular genomes, such as mitochondrial genomes, or integrated into existing plastome analysis pipelines.
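
To illustrate why differing sequence representations complicate deduplication, the following minimal Python sketch groups records whose circular sequences are identical up to the choice of start position and strand. It is an assumption-laden illustration, not the framework's implementation: the function names (`reverse_complement`, `canonical_circular`, `deduplicate`) and the record identifiers are hypothetical, only the symbols A, C, G, T and N are handled, and the quadratic search over rotations is suitable for toy sequences only, not for full-length plastomes.

```python
import hashlib

# Complement table for the symbols handled in this sketch (assumption:
# sequences contain only A, C, G, T and N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_circular(seq: str) -> str:
    """Return a representation of a circular sequence that does not depend
    on the start position or the strand.

    The lexicographically smallest rotation over both strands is used.
    This brute-force search is quadratic in sequence length.
    """
    seq = seq.upper()
    if not seq:
        return ""
    best_per_strand = []
    for strand in (seq, reverse_complement(seq)):
        rotations = (strand[i:] + strand[:i] for i in range(len(strand)))
        best_per_strand.append(min(rotations))
    return min(best_per_strand)


def deduplicate(records: dict[str, str]) -> dict[str, list[str]]:
    """Group record identifiers by a hash of their canonical sequence."""
    groups: dict[str, list[str]] = {}
    for record_id, seq in records.items():
        digest = hashlib.sha256(canonical_circular(seq).encode()).hexdigest()
        groups.setdefault(digest, []).append(record_id)
    return groups


if __name__ == "__main__":
    base = "ATGCCGTAGG"
    rc = reverse_complement(base)
    # Hypothetical toy records: rec_B is a rotation of rec_A, rec_C is a
    # rotation of its reverse complement, rec_D differs by one base.
    records = {
        "rec_A": base,
        "rec_B": base[3:] + base[:3],
        "rec_C": rc[5:] + rc[:5],
        "rec_D": "ATGCCGTAGA",
    }
    for members in deduplicate(records).values():
        print(members)
```

In this toy example the first three records fall into one group and the fourth stays separate. A real plastome comparison would also have to account for inverted-repeat placement and the orientation of the single-copy regions, which this sketch does not attempt.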