Bacterial Genomics

PhiX read removal using BBDuk

Introduction

The bacteriophage PhiX174 is a commonly added spike-in sequence for Illumina sequencing. When added, it serves as a positive control for the whole sequencing run. Many Illumina instruments align reads against this well-known short sequence (thanks to Fred Sanger) to identify SNPs as a proxy for how reliable the sequences of the unknown samples are.

At one point, as reported here, forgetting to remove it resulted in >1,000 PhiX-contaminated genomes on NCBI, and in contamination of 10% of the genomes published in the literature. When retained, this artificially added sequence can overlap with sample reads during assembly and join stretches of sample DNA that are not actually contiguous, even though the raw data make them appear to be. PhiX fragments must therefore be removed to produce a higher-quality assembly.

PhiX removal

Sequencing errors can (and do) occur occasionally, so I allow 1 SNP within a 31 bp stretch aligned to PhiX.
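To make the matching rule concrete, here is a minimal Python sketch of what "1 SNP in a 31 bp stretch" means: a read is flagged when any 31-mer in it is within Hamming distance 1 of a 31-mer from the reference (or its reverse complement). In BBDuk this rule most likely corresponds to the settings k=31 hdist=1, but that mapping is my assumption, as are all function names and the toy reference below. The toy reference is a made-up 40 bp string, not the real PhiX174 genome, and the brute-force comparison is for illustration only, not how BBDuk is implemented.

```python
# Illustrative sketch only: all names here are hypothetical, and the
# reference is a made-up toy sequence, NOT the real PhiX174 genome.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def reference_kmers(ref: str, k: int = 31) -> set[str]:
    """Collect every k-mer of the reference and of its reverse complement."""
    kmers = set()
    for strand in (ref, reverse_complement(ref)):
        for i in range(len(strand) - k + 1):
            kmers.add(strand[i:i + k])
    return kmers


def looks_like_phix(read: str, ref_kmers: set[str],
                    k: int = 31, max_mismatches: int = 1) -> bool:
    """Return True if any k-mer window of the read is within
    max_mismatches substitutions of a reference k-mer."""
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        if any(hamming(window, kmer) <= max_mismatches for kmer in ref_kmers):
            return True
    return False


# Toy stand-in for the PhiX reference (assumption, for demonstration).
TOY_REF = "ACGTTGCAAGGCTTAACCGGTTAGCATCGATCGGATCCAA"
TOY_KMERS = reference_kmers(TOY_REF)
```

A read that carries a single substitution in an otherwise perfect 31 bp match is still flagged for removal, while an unrelated read is kept. Reads shorter than 31 bp have no full window and are never flagged in this sketch.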

Log information