Plastid genomes (plastomes) offer valuable data for phylogenetic analysis and plant genotyping. However, the diverse approaches, tools, and standards used over time have led to inconsistent feature names and annotations, divergent plastome sequence representations, and duplicated database records, all of which complicate comparative genomic studies.
This work presents a unified data preprocessing framework that standardizes plastid genome data to enable accurate comparative analyses. The framework addresses common discrepancies through the following stages: data retrieval, deduplication, annotation assessment, sequence validation, record selection, genome region ordering and orientation, annotation completion, and data extraction according to the purpose of the analysis.
This approach could be extended to other small, complex circular genomes, such as mitochondrial genomes, or integrated into plastome analysis pipelines.
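As an illustration of why sequence representation matters for the deduplication stage, the following minimal sketch groups records whose circular sequences are identical up to the choice of start position and strand. It is not the framework's implementation: the function names, the record identifiers, and the decision to compare exact sequences via a canonical rotation are all assumptions made for this example. It handles only the symbols A, C, G, T, and N, and it does not detect records that differ by the orientation of an internal region.

```python
import hashlib
from collections import defaultdict

# Assumption: sequences contain only A, C, G, T, N (other IUPAC codes are not handled).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def least_rotation(seq):
    """Return the start index of the lexicographically smallest rotation.

    Linear-time two-pointer scan: i and j are candidate start positions,
    k is the length of the prefix on which the two candidates agree.
    """
    n = len(seq)
    i, j, k = 0, 1, 0
    while i < n and j < n and k < n:
        a = seq[(i + k) % n]
        b = seq[(j + k) % n]
        if a == b:
            k += 1
            continue
        if a > b:
            i += k + 1  # rotation starting at i cannot be minimal
        else:
            j += k + 1  # rotation starting at j cannot be minimal
        if i == j:
            j += 1
        k = 0
    return min(i, j)


def canonical_circular(seq):
    """Return one representative string per circular molecule.

    The representative is the smallest rotation taken over both strands,
    so it does not depend on where or on which strand the record starts.
    """
    seq = seq.upper()
    candidates = []
    for strand in (seq, reverse_complement(seq)):
        start = least_rotation(strand)
        candidates.append(strand[start:] + strand[:start])
    return min(candidates)


def group_duplicates(records):
    """Group record identifiers that share the same canonical sequence.

    `records` maps a hypothetical record identifier to its sequence.
    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for record_id, seq in records.items():
        canonical = canonical_circular(seq)
        digest = hashlib.sha256(canonical.encode("ascii")).hexdigest()
        groups[digest].append(record_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Toy records (hypothetical identifiers and sequences, not real plastomes).
records = {
    "rec_A": "GATTACA",
    "rec_B": "TTACAGA",  # rec_A rotated by two positions
    "rec_C": "TGTAATC",  # reverse complement of rec_A
    "rec_D": "GATTACC",  # a genuinely different sequence
}
duplicate_groups = group_duplicates(records)
```

Hashing the canonical form keeps the comparison cheap when many records are involved, since only fixed-length digests are stored per record rather than full sequences.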