Bacterial Genomics

wf-paired-end-illumina-assembly: Output

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and is used to perform de novo assembly on raw Illumina paired-end reads from bacterial isolates.

[!NOTE]

[sample] is a unique identifier that is parsed from input FastQ filenames and excludes everything after [R1/R2].

[assembler] is the name of the assembler used to assemble contigs. [Default: SPAdes].
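As a rough illustration of how a [sample] identifier could be derived from a FastQ filename, here is a minimal sketch. The regular expression and the `sample_id` helper are hypothetical; the pipeline's actual parsing rule may accept other naming patterns.

```python
import re

# Assumed naming convention: "<sample>_R1..." / "<sample>_R2..." (also ".R1", "-R1").
# This pattern is illustrative only and is not taken from the pipeline source.
_READ_TAG = re.compile(r"[._-]R[12](?=[._-]|$)")


def sample_id(fastq_name: str) -> str:
    """Return the part of a FastQ filename that precedes the R1/R2 tag."""
    match = _READ_TAG.search(fastq_name)
    if match is None:
        raise ValueError(f"no R1/R2 tag found in {fastq_name!r}")
    return fastq_name[: match.start()]


# Both files of a pair should map to the same identifier.
print(sample_id("isolate01_R1_001.fastq.gz"))
print(sample_id("isolate01_R2_001.fastq.gz"))
```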

[!TIP] All tab-separated value (TSV) files can be converted to Excel spreadsheets (XLSX) by using the parameter --create_excel_outputs when running the pipeline.

When using this parameter, a summary workbook is created to allow for all summary files to be added to separate worksheets within the workbook.

Input quality control

Input files are checked for corruption and must meet a minimum file size to be processed within this pipeline. If this check passes, the input files will go through several read cleaning steps before other analyses are performed.
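The kind of check described above can be sketched as follows, assuming gzip-compressed FastQ input. The `fastq_gz_ok` helper and the `MIN_BYTES` threshold are hypothetical and do not reflect the pipeline's actual defaults or implementation.

```python
import gzip
import os
import zlib

MIN_BYTES = 1_000  # hypothetical threshold, not the pipeline's default


def fastq_gz_ok(path: str, min_bytes: int = MIN_BYTES) -> bool:
    """True if the file is large enough and decompresses to the end without error."""
    if os.path.getsize(path) < min_bytes:
        return False
    try:
        with gzip.open(path, "rb") as handle:
            # Stream in 1 MiB chunks so large files are never held in memory.
            while handle.read(1 << 20):
                pass
    except (OSError, EOFError, zlib.error):
        # Bad header, truncated stream, or corrupt compressed data.
        return False
    return True
```

Reading the whole stream is slower than checking only the header, but it also catches files that were truncated during transfer.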

Initial FastQ file check

QC Steps

Host read removal

Host read removal can be skipped or performed by Hostile and/or NCBI SRA-Human-Scrubber by specifying --host_remove {both,hostile,sra-human-scrubber,skip}. For SRA-Human-Scrubber, reads are repaired using BBTools to discard broken sister reads. Information about the number of reads discarded and retained is saved in the output directory. Please see the host removal using Hostile documentation and host removal using SRA-Human-Scrubber documentation for more information.

QC Steps


Output files

PhiX read removal

PhiX reads are commonly used as a positive control for Illumina sequencing. During assembly, PhiX reads are considered contaminants and, if retained, can produce a misassembled genome. A PhiX reference file is therefore required to remove them, and a default PhiX reference file is included with this pipeline. Please see the PhiX removal using BBDuk documentation for more information.

QC Steps


Output files

Adapter clipping and quality trimming

Illumina instruments can detect and remove adapter sequences, but sometimes adapters can end up in the FastQ output due to sequencing errors. Trimmomatic clips these adapters using a default adapters reference file and also performs quality trimming; broken sister reads are retained for downstream processes. Please see the adapter clipping and quality trimming using Trimmomatic documentation for more information.

QC Steps


Output files

Merge overlapping sister reads

Overlapping content between sister reads that are at least 80% similar is collapsed into a singleton read. Please see the overlapping of paired-end reads documentation for more information.
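To make the idea of collapsing overlapping sister reads concrete, here is a simplified sketch. It is not the merging tool used by the pipeline: it ignores base qualities, keeps the read 1 bases inside the overlap, and the `min_overlap` value is a hypothetical parameter. Only the 80% identity threshold comes from the description above.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str,
               min_overlap: int = 10,
               min_identity: float = 0.8) -> Optional[str]:
    """Merge a pair if the 3' end of read1 overlaps the reverse complement
    of read2 with at least `min_identity` matching bases.
    Returns the merged sequence, or None if no acceptable overlap exists."""
    mate = reverse_complement(read2)
    # Try the longest candidate overlap first and shrink until one passes.
    for size in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-size:], mate[:size]
        matches = sum(a == b for a, b in zip(tail, head))
        if matches / size >= min_identity:
            # Keep read1 in full and append the non-overlapping part of the mate.
            return read1 + mate[size:]
    return None
```

Real merging tools also weigh base qualities and guard against spurious overlaps in repetitive sequence, which this sketch does not attempt.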

QC Steps


Output files

Taxonomic classification of trimmed reads

These classifiers perform classifications on a read-by-read basis or through the use of k-mers on the cleaned and trimmed FastQ files. The results that are obtained are heavily dependent on the quality and diversity of the database used. Therefore, the results produced from these classifiers should be used as a quality check to evaluate the possibility of sample contamination.

[!WARNING] Taxonomic classification tools will be skipped if the accompanying database is not specified.

Kraken

Kraken is a k-mer based classification tool that assigns taxonomic labels using the Lowest Common Ancestor (LCA) algorithm.

Output files

Kraken2

Kraken2 is a k-mer based classification tool that assigns taxonomic labels using the Lowest Common Ancestor (LCA) algorithm.
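The lowest common ancestor idea shared by Kraken and Kraken2 can be illustrated with a toy taxonomy: a k-mer found in several taxa is labelled with the deepest node that all of them share. The taxonomy below and the helper names are made up for illustration; this is not the classifiers' actual implementation.

```python
# Toy taxonomy as child -> parent; "root" is its own parent.
PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Enterobacteriaceae": "Bacteria",
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella enterica": "Enterobacteriaceae",
}


def lineage(taxon: str) -> list[str]:
    """Path from a taxon up to the root, taxon first."""
    path = [taxon]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa: list[str]) -> str:
    """Deepest taxon shared by the lineages of all given taxa."""
    shared = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        shared &= set(lineage(taxon))
    # Walking up from the first taxon, the first shared node is the deepest one.
    return next(node for node in lineage(taxa[0]) if node in shared)


# A k-mer seen in both species can only be resolved to their shared family.
print(lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"]))
```

This is why a database with few or closely related genomes tends to push assignments toward higher taxonomic ranks.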

Output files

Assembly

The cleaned and trimmed reads are used to assemble contigs using SPAdes or SKESA [Default: SPAdes]. Contigs that have low compositional complexity are discarded. Please see the contig filtering documentation for more information. Contigs from SPAdes require polishing to create the final genome assembly, which is done by using BWA, Pilon, and Samtools. Contigs from SKESA do not require this step.

[!IMPORTANT] Outputs generated by SPAdes and SKESA cannot be compared even when using the same FastQ inputs.

[!TIP] For many input FastQ files, SKESA may be useful in decreasing runtime. For input FastQ files that may be heavily contaminated, SPAdes may help maintain contiguity.

QC Steps


Output files

SPAdes

SPAdes is k-mer based software that assembles a genome from read sequences. Contigs from SPAdes require polishing to create the final genome assembly, which is done by using BWA, Pilon, and Samtools.

Output files

SKESA

Strategic K-mer Extension for Scrupulous Assemblies (SKESA) is software based on de Bruijn graphs that assembles a genome from read sequences.

Output files

Assembly metrics and classification

Quality assessment and coverage

QUAST is used to perform quality assessment on the assembly file to report metrics such as N50, cumulative length, longest contig length, and GC composition. Using bedtools, the BAM file created with the assembly is used to calculate genome size and the coverage of paired-end and single-end reads.
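For readers unfamiliar with these metrics, the sketch below shows how N50, cumulative length, longest contig length, and GC composition are commonly defined. It is an illustration only; QUAST applies its own thresholds and conventions, so its reported values may differ. The `assembly_metrics` helper is hypothetical.

```python
def assembly_metrics(contigs: list[str]) -> dict:
    """Basic assembly statistics from a list of contig sequences."""
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the running sum of lengths,
    # taken from longest to shortest, first reaches half the total.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    return {
        "cumulative_length": total,
        "longest_contig": lengths[0] if lengths else 0,
        "N50": n50,
        "GC_percent": 100 * gc / total if total else 0.0,
    }


# Toy contigs of lengths 5, 4, 3, 2, and 1.
print(assembly_metrics(["GGGGG", "CCAA", "ATA", "TT", "A"]))
```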

Output files

Multilocus sequence typing (MLST)

The final assembly file is scanned against PubMLST typing schemes to determine the MLST for each sample. Unless a specific typing scheme is specified, the best typing scheme for each sample is automatically selected.

Output files


MLST output interpretation
| Symbol | Meaning | Length | Identity |
| --- | --- | --- | --- |
| n | exact intact allele | 100% | 100% |
| ~n | novel full length allele similar to n | 100% | ≥ --minid |
| n? | partial match to known allele | ≥ --mincov | ≥ --minid |
| - | allele missing | < --mincov | < --minid |
| n,m | multiple alleles | | |
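The symbols in the table above can be decoded programmatically. The sketch below assumes each allele call has already been extracted as a bare string such as "5", "~5", or "5?"; the `describe_allele` helper is hypothetical and is not part of the MLST tool.

```python
def describe_allele(call: str) -> str:
    """Map one allele call to its meaning from the interpretation table."""
    if call == "-":
        return "allele missing"
    if "," in call:
        return "multiple alleles"
    if call.startswith("~"):
        return f"novel full length allele similar to {call[1:]}"
    if call.endswith("?"):
        return f"partial match to known allele {call[:-1]}"
    if call.isdigit():
        return "exact intact allele"
    raise ValueError(f"unrecognised allele call: {call!r}")


for call in ["12", "~12", "12?", "-", "12,15"]:
    print(call, "->", describe_allele(call))
```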

Genome annotation

The final assembly file is annotated to identify and label features using Prokka. Please see genome annotation using Prokka documentation for more information.

[!IMPORTANT] Gene symbols may not be as up to date as the product description when using Prokka. It is recommended to use NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) to further investigate putatively present genes.

QC Steps


Output files

16S ribosomal RNA (rRNA) classification

The GenBank file is parsed for 16S rRNA gene records. If there are no 16S rRNA gene records, Barrnap is used to predict 16S rRNA genes using the assembly file. BLAST is then used to align these gene records against its database, and the best alignment is selected based on bit score.
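Selecting a top alignment by bit score can be sketched as below. The column layout is the standard 12-column BLAST+ tabular format; whether the pipeline writes exactly this layout is an assumption, and the example rows are made up for illustration.

```python
import csv
import io

# Standard BLAST+ tabular columns; assumed here, not confirmed for this pipeline.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hit(tsv_text: str) -> dict:
    """Return the alignment row with the highest bit score.
    Raises ValueError if the input contains no rows."""
    rows = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t")
    return max(rows, key=lambda row: float(row["bitscore"]))


# Made-up rows for illustration only.
example = (
    "contig1\tsubject_A\t99.8\t1500\t3\t0\t1\t1500\t1\t1500\t0.0\t2750\n"
    "contig1\tsubject_B\t99.5\t1500\t7\t0\t1\t1500\t1\t1500\t0.0\t2730\n"
)
print(best_hit(example)["sseqid"])
```

When two subjects score almost identically, as in the made-up rows here, the top hit alone says little about species identity, which is why the result should be read alongside percent identity and alignment length.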

[!NOTE] Some assembled genomes do not contain identifiable 16S rRNA sequences, so the 16S rRNA gene cannot be classified. If the classification of 16S rRNA sequences is required, the sample must be re-sequenced.

[!IMPORTANT] The 16S rRNA classification produced should not be used as a definitive classification as some taxa have 16S sequences that are extremely similar between different species.

For an exact species match, 100% identity and 100% alignment are needed.

QC Steps


Output files

Assembly taxonomic classification

GTDB-Tk is a taxonomic classification tool that uses the Genome Taxonomy Database (GTDB). GTDB-Tk can be used as a quality check for the assembly file.

Output files

Summaries

Concatenation of output metrics for all samples.

[!NOTE] The Summary-Report Excel file is only created when the parameter --create_excel_outputs is used.

The Summary-Report Excel file has the date and time appended to the filename using the following shorthand notation: year (yyyy), month (MM), day (dd), hour (HH), minute (mm), second (ss).
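For anyone who needs to parse or reproduce such timestamps in a script, the shorthand fields map onto Python `strftime` directives as sketched below. The separator layout used in the example ("yyyy-MM-dd_HH-mm-ss") is an assumption; check an actual output filename for the real arrangement.

```python
import re
from datetime import datetime

# Shorthand field -> strftime directive.
FIELDS = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H", "mm": "%M", "ss": "%S"}
_TOKEN = re.compile("|".join(FIELDS))


def to_strftime(pattern: str) -> str:
    """Translate a shorthand pattern into strftime form in a single pass."""
    return _TOKEN.sub(lambda match: FIELDS[match.group(0)], pattern)


# Assumed layout, for illustration only.
layout = to_strftime("yyyy-MM-dd_HH-mm-ss")
print(datetime(2024, 1, 31, 9, 5, 7).strftime(layout))
```

Note that the notation is case-sensitive: MM is the month and mm is the minute.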

Output files

Pipeline information

Information about the pipeline execution, output logs, error logs, and QC file checks for each sample is stored here.

[!NOTE] Pipeline execution files have the date and time appended to the filename using the following shorthand notation: year (yyyy), month (MM), day (dd), hour (HH), minute (mm), second (ss).

Pipeline information
Process log information
QC file checks