Less is more: filtering and trimming ONT sequencing data

by Natalia Nenasheva | MIPT, VIGG, Genotek

Since many tasks in bioinformatics require high-quality reads that map uniquely to the genome,
we tested whether the overall quality of ONT data can be improved by filtering out low-quality
reads or read fragments. To this end, we developed a set of criteria applied to the raw (FASTQ)
base-calling data. To further improve the potential of ONT data, we used a reference built from
combined Illumina and ONT sequencing rather than the standard database reference genome.
When filtering the data, we took into account the alignment score normalized to the length of
the read, the degree of read fragmentation, and the base quality, keeping only data with a
sufficiently high quality score (Phred score > 10). The read fragmentation score served as the
third filter: it has been shown that the most fragmented reads most often have a low quality
score.
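
The quality threshold above can be illustrated with a minimal Python sketch. It covers only the
Phred score > 10 criterion and is not the pipeline used in this work. Several details are
assumptions made for illustration: the threshold is applied to a per-read mean, the mean is
computed over error probabilities rather than over raw Phred values, the FASTQ file uses four
lines per record with Phred+33 encoding, and the names read_fastq, mean_phred and filter_reads
are hypothetical. The length-normalized score and the fragmentation score are not defined here
and are therefore left out.

```python
import io
import math
from typing import Iterable, Iterator, TextIO, Tuple

# One FASTQ record: (header line, sequence, quality string).
FastqRecord = Tuple[str, str, str]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream (assumed layout)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Per-read mean quality, averaged over error probabilities (assumption)."""
    if not qual:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(records: Iterable[FastqRecord],
                 min_q: float = 10.0) -> Iterator[FastqRecord]:
    """Keep only reads whose mean Phred score is strictly above min_q."""
    for header, seq, qual in records:
        if mean_phred(qual) > min_q:
            yield header, seq, qual


if __name__ == "__main__":
    # Two toy reads: '5' encodes Q20 and '#' encodes Q2 under Phred+33.
    demo = io.StringIO("@good\nACGT\n+\n5555\n@bad\nACGT\n+\n####\n")
    for header, _, _ in filter_reads(read_fastq(demo)):
        print(header)
```

Averaging error probabilities instead of raw Phred values keeps a few very poor bases from being
masked by many good ones; a plain arithmetic mean would be a simpler alternative.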
Since detecting genetic variants requires alignment to the genome, we tried to improve
this step as well. For one of the samples, in addition to the ONT data, Illumina
sequencing data were also obtained. We hypothesized that single-nucleotide polymorphisms
could be taken into account to compile a reference sequence for these samples