Pygenomics: Python package for processing genomic intervals and bioinformatic data formats

by Gaik Tamazian, Nikolay Cherkasov, Alexander Kanapin, and Anastasia Samsonova
(all Saint Petersburg State University)

Motivation: Computational analysis of genome sequencing data and its derivatives, such as
assembled genome sequences, annotated genes, repeats, and genomic variants, plays an
important role in modern bioinformatic studies. Such studies are usually implemented as
computational pipelines that combine calls to bioinformatic programs (e.g., genome
assemblers or gene prediction programs) with extra routines that convert the programs'
input or output files between bioinformatic data formats and query the produced files.
Many bioinformatic data formats are based on genomic intervals, so querying files in such
formats requires operating on those intervals.
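To make the notion of operating on genomic intervals concrete, the following is a minimal sketch of an interval intersection query. It is not the pygenomics API: the `Interval` type, the `intersect` function, and the example coordinates are hypothetical, and 0-based half-open coordinates (as in BED) are assumed.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    """Hypothetical genomic interval; coordinates are 0-based and half-open."""
    seq: str    # sequence (chromosome or scaffold) name
    start: int  # inclusive start
    end: int    # exclusive end


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the part shared by two intervals, or None if they do not overlap."""
    if a.seq != b.seq:
        return None
    start, end = max(a.start, b.start), min(a.end, b.end)
    # Half-open intervals that merely touch (start == end) share no bases.
    return Interval(a.seq, start, end) if start < end else None


# Made-up example: which variant intervals fall within a gene?
gene = Interval("chr1", 100, 500)
variants = [
    Interval("chr1", 90, 101),   # overlaps the first base of the gene
    Interval("chr1", 500, 501),  # adjacent to the gene end, no overlap
    Interval("chr2", 200, 201),  # different sequence
]
hits = [v for v in variants if intersect(gene, v) is not None]
```

Under these assumptions only the first variant is reported, because the half-open convention treats an interval starting exactly at the gene end as non-overlapping.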
Methods: We present pygenomics, an open-source Python package that provides routines
for reading and writing bioinformatic data in various formats and for operating on genomic
intervals. Pygenomics is implemented in pure Python and depends only on the Python
standard library. The package follows the functional programming paradigm: its entities
are immutable, its functions have no side effects other than those related to input-output
(I/O), and its stream-based I/O is extensible.
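The design principles named above (immutable entities, functions without side effects outside I/O, and stream-based I/O) can be illustrated with a short standard-library sketch. It does not reproduce the pygenomics API: `BedRecord`, `read_bed`, `extend`, and `write_bed` are hypothetical names, and only the first three BED columns are handled.

```python
import io
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, TextIO


@dataclass(frozen=True)
class BedRecord:
    """Hypothetical immutable record holding the first three BED columns."""
    seq: str
    start: int
    end: int


def read_bed(stream: TextIO) -> Iterator[BedRecord]:
    """Lazily yield records from any text stream (file, pipe, StringIO)."""
    for line in stream:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank, comment, and header lines
        seq, start, end = line.rstrip("\n").split("\t")[:3]
        yield BedRecord(seq, int(start), int(end))


def extend(record: BedRecord, flank: int) -> BedRecord:
    """Pure function: return a new, widened record; the input is unchanged."""
    return replace(record, start=max(0, record.start - flank),
                   end=record.end + flank)


def write_bed(records: Iterable[BedRecord], stream: TextIO) -> None:
    """The only side effect is writing to the given stream."""
    for r in records:
        stream.write(f"{r.seq}\t{r.start}\t{r.end}\n")


# In-memory streams stand in for files; the pipeline processes one record at a time.
source = io.StringIO("chr1\t10\t20\nchr2\t5\t8\n")
sink = io.StringIO()
write_bed((extend(r, 10) for r in read_bed(source)), sink)
result = sink.getvalue()
```

Because the reader and writer accept any text stream and the transformation is a generator expression, the same functions would work on files or standard input without loading a whole file into memory.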
Results: Pygenomics implements reading and writing of a number of bioinformatic data
formats, including BAM, BED, GFF3, and VCF. The package provides an application
programming interface (API) and a command-line interface (CLI) for calling its routines
from source code or as stand-alone programs, respectively. Because pygenomics is
implemented in pure Python, its routines can be incorporated seamlessly into Snakemake
pipelines and run with both the CPython and PyPy interpreters. The absence of external
dependencies, the pure-Python implementation, and the property-based testing framework
facilitate deployment of pygenomics on various computational platforms.