This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The pipeline is built using Nextflow and is used to perform de novo assembly on raw Illumina paired-end reads from bacterial isolates.
[!NOTE]
[sample] is a unique identifier that is parsed from input FastQ filenames and excludes everything after [R1/R2].
[assembler] is the name of the assembler used to assemble contigs [Default: SPAdes].
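
As a rough illustration of how a [sample] identifier could be derived from a FastQ filename, here is a minimal sketch. The function name `sample_id` and the delimiter rules in the regular expression are assumptions made for this example; the pipeline's actual parsing rule may differ.

```python
import re
from typing import Optional

# Assumed pattern: the read tag is "R1" or "R2", preceded by "_" or "." and
# followed by "_", "." or the end of the name. The pipeline's real rule may differ.
_READ_TAG = re.compile(r"^(?P<sample>.+?)[_.]R[12](?:[_.]|$)")


def sample_id(filename: str) -> Optional[str]:
    """Return the text before the R1/R2 tag, or None if no tag is found."""
    match = _READ_TAG.match(filename)
    return match.group("sample") if match else None
```

Under these assumptions, both `IsolateA_R1_001.fastq.gz` and `IsolateA_R2.fq.gz` map to the same identifier, so the two mates of a pair are grouped together.
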
[!TIP] All tab-separated value (TSV) files can be converted to Excel spreadsheets (XLSX) by using the parameter --create_excel_outputs when running the pipeline. When using this parameter, a summary workbook is created so that each summary file is added to a separate worksheet within the workbook.
Input files are checked for corruption and must meet a minimum file size to be processed within this pipeline. If this check passes, the input files will go through several read cleaning steps before other analyses are performed.
Input FastQ files must meet a minimum file size [Default: 25M]. This is to prevent unusually small input sets from wasting compute time on data that will not yield a usable bacterial genome assembly.

Host read removal can be skipped or performed by Hostile and/or NCBI SRA-Human-Scrubber by specifying --host_remove {both,hostile,sra-human-scrubber,skip}. For SRA-Human-Scrubber, reads are repaired using BBTools to discard broken sister reads. Information about the number of reads discarded and retained is saved in the output directory.
Please see the host removal using Hostile documentation and host removal using SRA-Human-Scrubber documentation for more information.
FastQ files after host removal are checked to ensure that they meet a minimum file size [Default: 25M].
CleanedReads/Hostile/
[sample].Summary.Hostile-Removal.[tsv,xlsx]: Summary of the number of reads discarded and retained from Hostile.
CleanedReads/SRA-Human-Scrubber/
[sample].Summary.BBTools-Repair-Removal.[tsv,xlsx]: Summary of the number of reads discarded and retained after repairing broken sister reads produced from SRA-Human-Scrubber.
[sample].Summary.SRA-Human-Scrubber-Removal.[tsv,xlsx]: Summary of the number of reads discarded and retained from SRA-Human-Scrubber.

PhiX reads are commonly used as a positive control for Illumina sequencing. During assembly, PhiX reads are considered contaminants and, if retained, will result in a misassembled genome. Therefore, a PhiX reference file is required, and a default PhiX reference file is included with this pipeline. Please see the PhiX removal using BBDuk documentation for more information.
The PhiX reference genome that is used with BBDuk to remove sequence reads must be accessible and meet a minimum file size [Default: 5k].
FastQ files after PhiX removal are checked to ensure that they meet a minimum file size before continuing to downstream processes [Default: 25M]. This is to halt analysis of a sample mostly containing PhiX reads rather than the sample DNA itself.
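
The minimum file size checks described in this document use short thresholds such as 25M, 5k, and 500b. The sketch below shows one way such a check could work. The helper names `parse_size` and `meets_min_size` are hypothetical, and treating the units as binary multiples (1k = 1024 bytes) is an assumption; the pipeline may interpret them differently.

```python
import os

# Assumption: "b", "k", "M", "G" are binary multiples (1k = 1024 bytes).
_UNITS = {"b": 1, "k": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> int:
    """Convert a string such as '25M', '5k' or '500b' into a byte count."""
    number, unit = text[:-1], text[-1]
    if unit not in _UNITS:
        raise ValueError(f"unknown size unit in {text!r}")
    return int(number) * _UNITS[unit]


def meets_min_size(path: str, minimum: str) -> bool:
    """True if the file at `path` is at least `minimum` in size."""
    return os.path.getsize(path) >= parse_size(minimum)
```

A call such as `meets_min_size("sample_R1.fq.gz", "25M")` (with a hypothetical path) would then decide whether a sample continues to the next step.
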
CleanedReads/BBDUK/
[sample].Summary.PhiX.[tsv,xlsx]: Number of reads discarded and retained from BBDuk.

Illumina instruments can detect and remove adapter sequences, but sometimes adapters can end up in the FastQ output due to sequencing errors. A default adapters reference file is used within this pipeline. Trimmomatic also performs quality trimming, where broken sister reads are retained for downstream processes. Please see the adapter clipping and quality trimming using Trimmomatic documentation for more information.
The adapters reference file that is used with Trimmomatic to remove sequence reads must be accessible and meet a minimum file size [Default: 10k].
FastQ files after removing adapter sequences are checked to ensure that they meet a minimum file size before continuing to downstream processes [Default: 25M]. This is to halt analysis of a sample that is primarily contaminated with artificial Illumina sequences.
CleanedReads/Trimmomatic/
[sample].trimmomatic.[tsv,xlsx]: Summary of the number of reads discarded and retained from Trimmomatic.

Overlapping content between sister reads that is at least 80% similar is collapsed into a singleton read. Please see the overlapping of paired-end reads documentation for more information.
FastQ files after removing overlapping reads must meet a minimum file size [Default: 20M]. This is to prevent the analysis of paired-end read files that are heavily misconstructed into small fragment sizes.
CleanedReads/
[sample]_single.fq.gz: Final cleaned singleton reads.
[sample]_R[1/2].paired.fq.gz: Final cleaned paired reads.
CleanedReads/FLASH/
[sample].overlap.[tsv,xlsx]: Number of reads that were overlapped into singleton reads.
[sample].clean-reads.[tsv,xlsx]: Number of non-overlapping reads.

The taxonomic classifiers classify the cleaned and trimmed FastQ files either read by read or through the use of k-mers. The results depend heavily on the quality and diversity of the database used. Therefore, the results produced by these classifiers should be used as a quality check to evaluate the possibility of sample contamination.
[!WARNING] Taxonomic classification tools will be skipped if the accompanying database is not specified.
Kraken is a k-mer based classification tool that assigns taxonomic labels using the Lowest Common Ancestor (LCA) algorithm.
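
To make the Lowest Common Ancestor idea concrete, here is a minimal sketch on a toy taxonomy. It is not Kraken's implementation: the `PARENT` map, the taxon names in it, and the function names are illustrative assumptions. It only shows how a sequence matching several taxa can be assigned to the deepest node they all share.

```python
from typing import Dict, List, Optional

# Toy taxonomy (child -> parent). Illustrative only, not a real database.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Enterobacteriaceae": "Bacteria",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella": "Enterobacteriaceae",
    "Escherichia coli": "Escherichia",
    "Salmonella enterica": "Salmonella",
}


def lineage(taxon: str) -> List[str]:
    """Path from `taxon` up to the root, starting with the taxon itself."""
    path: List[str] = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lowest_common_ancestor(taxa: List[str]) -> str:
    """Deepest taxon that is an ancestor of (or equal to) every input taxon."""
    shared = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        shared &= set(lineage(taxon))
    # Walking upward from the first taxon, the first shared node is the deepest.
    return next(node for node in lineage(taxa[0]) if node in shared)
```

In this toy tree, a k-mer found in both Escherichia coli and Salmonella enterica would be labeled at the family level rather than assigned to either species.
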
Taxonomy/kraken/[sample]/
[sample].kraken_summary.[tsv,xlsx]: Summary of the unclassified, top 3 genus and top 3 species classifications from the Kraken report.
[sample].kraken_output.[tsv,xlsx].gz: Taxonomic classification in the Kraken report format.

Kraken2 is a k-mer based classification tool that assigns taxonomic labels using the Lowest Common Ancestor (LCA) algorithm.
Taxonomy/kraken2/[sample]/
[sample].kraken2_summary.[tsv,xlsx]: Summary of the unclassified, top 3 genus and top 3 species classifications from the Kraken report.
[sample].kraken2_output.[tsv,xlsx].gz: Taxonomic classification in the Kraken report format.

The cleaned and trimmed reads are used to assemble contigs using SPAdes or SKESA [Default: SPAdes]. Contigs that have low compositional complexity are discarded. Please see the contig filtering documentation for more information. Contigs from SPAdes require polishing to create the final genome assembly, which is done by using BWA, Pilon, and Samtools. Contigs from SKESA do not require this step.
[!IMPORTANT] Outputs generated by SPAdes and SKESA cannot be compared even when using the same FastQ inputs.
[!TIP] For many input FastQ files, SKESA may be useful in decreasing runtime. For input FastQ files that may be heavily contaminated, SPAdes may help maintain contiguity.
The contigs produced by an assembler software package must meet a minimum file size criterion [Default: 1M]. This is to prevent the analysis of a highly incomplete bacterial genome.
The resulting contig file after low compositional complexity contigs are discarded must meet a minimum file size [Default: 1M]. This is to prevent the analysis of a highly incomplete bacterial genome.
The cleaned paired-end reads are mapped onto the filtered assembly file in sequential steps [Default: 3], and the resulting binary paired-end alignment file must meet a minimum file size criterion [Default: 25M]. This is to prevent the analysis of an assembly file that has an unusually small amount of read sequence.
The assembly file goes through SNP and InDel corrections in sequential steps [Default: 3], and the resulting assembly file must meet a minimum file size criterion [Default: 1M]. This is to prevent further analysis of an unusually incomplete genome.
The final error-corrected assembly file must meet a minimum file size criterion [Default: 1M]. This is to ensure that the final assembly file is not unexpectedly small or incomplete.
If singletons (single-end reads) exist after read cleaning, they are mapped onto the assembly file and the resulting binary single-end alignment file must meet a minimum file size criterion [Default: 1k]. This is to ensure that read depth calculations can be performed on the single-end reads.
Assembly/
[sample]-[assembler]_contigs.fna: Final genome assembly.

SPAdes is k-mer based software that forms a genome assembly from read sequences. Contigs from SPAdes require polishing to create the final genome assembly, which is done by using BWA, Pilon, and Samtools.
Assembly/SPAdes/[sample]/
[sample]-SPAdes.log.gz: SPAdes log file.
[sample]-SPAdes_graph.gfa: Assembly graph in GFA format.
[sample]-SPAdes_warnings.log: Log file that lists warnings when forming a genome assembly.
[sample]-SPAdes_params.txt.gz: Command used to perform the SPAdes analysis.
[sample]-SPAdes_contigs.fasta: Assembled contigs in FastA format.
[sample]-SPAdes.SNPs-corrected.cnt.txt: Number of SNPs corrected in each round of corrections.
[sample]-SPAdes.InDels-corrected.cnt.txt: Number of InDels corrected in each round of corrections.

Strategic K-mer Extension for Scrupulous Assemblies (SKESA) is software based on de Bruijn graphs that forms a genome assembly from read sequences.
Assembly/SKESA/[sample]/
[sample]-SKESA_contigs.fasta: Assembled contigs in FastA format.

QUAST is used to perform quality assessment on the assembly file to report metrics such as N50, cumulative length, longest contig length, and GC composition. Using bedtools, the BAM file created with the assembly is used to calculate genome size and the coverage of paired-end and single-end reads.
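
For readers unfamiliar with N50, the sketch below computes it from a list of contig lengths using the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. This is an independent illustration, not QUAST's code, and the function name `n50` is chosen for this example.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs supplied")
```

A higher N50 generally indicates a more contiguous assembly, although it says nothing about correctness on its own.
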
Assembly/QA/[sample]/
[sample]-[assembler].QuastSummary.[tsv,xlsx]: Assembly metrics such as N50, cumulative length, longest contig length, and GC composition.
[sample]-[assembler].GenomeCoverage.[tsv,xlsx]: Genome coverage information.
[sample]-[assembler].CleanedReads-Bases.[tsv,xlsx]: Number of cleaned bases.

The final assembly file is scanned against PubMLST typing schemes to determine the MLST for each sample. Unless a specific typing scheme is specified, the best typing scheme for each sample is automatically selected.
Summaries/
Summary.MLST.[tsv,xlsx]: Summary of the MLST results for all samples.

| Symbol | Meaning | Length | Identity |
|---|---|---|---|
| n | exact intact allele | 100% | 100% |
| ~n | novel full length allele similar to n | 100% | ≥ --minid |
| n? | partial match to known allele | ≥ --mincov | ≥ --minid |
| - | allele missing | < --mincov | < --minid |
| n,m | multiple alleles | | |

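
The symbols in the table above can be interpreted programmatically. The following is a minimal sketch that assumes the bare allele call (for example `7`, `~7`, `7?`, `-`, or `1,5`) has already been extracted from the summary file; how the calls are laid out in that file is not covered here, and the function name is hypothetical.

```python
def describe_allele_call(call: str) -> str:
    """Map one allele call to its meaning in the symbol table."""
    if call == "-":
        return "allele missing"
    if "," in call:
        return "multiple alleles"
    if call.startswith("~"):
        return "novel full length allele"
    if call.endswith("?"):
        return "partial match to known allele"
    if call.isdigit():
        return "exact intact allele"
    raise ValueError(f"unrecognised allele call: {call!r}")
```

Only calls that resolve to "exact intact allele" for every locus correspond to a fully matched sequence type; the other categories usually warrant a closer look at the assembly.
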
The final assembly file is annotated to identify and label features using Prokka. Please see the genome annotation using Prokka documentation for more information.
[!IMPORTANT] Gene symbols may not be as up to date as the product description when using Prokka. It is recommended to use NCBI’s Prokaryotic Genome Annotation Pipeline (PGAP) to further investigate putatively present genes.
The annotated GenBank file must meet a minimum file size [Default: 3M]. This is to prevent further analysis of a highly incomplete annotation set.
Annotation/Prokka/
[sample]-[assembler].gbk: Annotated genome in GenBank file format.

The GenBank file is parsed for 16S rRNA gene records. If there are no 16S rRNA gene records, Barrnap is used to predict 16S rRNA genes using the assembly file. BLAST is then used to align these gene records against its database, and the best alignment is selected based on bit score.
[!NOTE] Some assembled genomes do not contain identifiable 16S rRNA sequences, and therefore the 16S rRNA gene cannot be classified. If the classification of 16S rRNA sequences is required, the sample must be re-sequenced.
[!IMPORTANT] The 16S rRNA classification produced should not be used as a definitive classification as some taxa have 16S sequences that are extremely similar between different species.
For an exact species match, 100% identity and 100% alignment are needed.
The extracted 16S rRNA gene sequence file from the genome assembly must meet a minimum file size criterion [Default: 500b]. This is to prevent the classification of a highly incomplete 16S rRNA gene sequence.
The sample identifiers in the 16S rRNA gene sequence file are renamed, and the resulting file must meet a minimum file size criterion [Default: 500b]. This is to prevent the classification of a highly incomplete 16S rRNA gene sequence.
The 16S rRNA gene sequence file is classified using BLASTn and the resulting output file must meet a minimum file size criterion [Default: 10b]. This is to ensure that the BLASTn output file contains at least one alignment with taxonomic information to be filtered and reported.
The best 16S BLASTn alignment, selected by bit score, is saved to a file in FastA format and must meet a minimum file size criterion [Default: 10b]. This is to ensure that the resulting file contains alignment information that can be parsed and reported.
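
Selecting the best alignment by bit score can be sketched as follows. The example assumes BLAST's default 12-column tabular layout, in which the bit score is the last column; the column layout that this pipeline actually writes is not stated here, so `BITSCORE_COLUMN` is an assumption, as is the function name.

```python
from typing import Iterable, List

# Assumed layout: BLAST default tabular columns, bit score in the 12th column.
BITSCORE_COLUMN = 11


def best_hit(rows: Iterable[str]) -> List[str]:
    """Return the fields of the alignment row with the highest bit score."""
    parsed = [line.rstrip("\n").split("\t") for line in rows if line.strip()]
    if not parsed:
        raise ValueError("no alignments to choose from")
    return max(parsed, key=lambda fields: float(fields[BITSCORE_COLUMN]))
```

If several rows share the top bit score, `max` returns the first one encountered, so the input order acts as the tie-breaker in this sketch.
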
SSU/
16S-top-species.[tsv,xlsx]: Summary of the best BLAST alignment for each sample.
16S.[sample]-[assembler].fa: 16S rRNA gene sequence of the best BLAST alignment in FastA format.
SSU/BLAST/
[sample]-[assembler].blast.[tsv,xlsx].gz: BLAST output of the 16S rRNA gene records in tab-separated value (TSV) format.

GTDB-Tk is a taxonomic classification tool that uses the Genome Taxonomy Database (GTDB). GTDB-Tk can be used as a quality check for the assembly file.
Summaries/
Summary.GTDB-Tk.[tsv,xlsx]: Summary of the GTDB-Tk taxonomic classification for each sample.

The Summaries directory contains the concatenated output metrics for all samples.
[!NOTE] The Summary-Report Excel file is only created when the parameter --create_excel_outputs is used.
The Summary-Report Excel file has the date and time appended to the filename using the following shorthand notation: year (yyyy), month (MM), day (dd), hour (HH), minute (mm), second (ss).
Summaries/
Summary.16S.[tsv,xlsx]: Summary of the best BLAST alignment for each sample.
Summary.MLST.[tsv,xlsx]: Summary of the MLST results for all samples.
Summary.PhiX.[tsv,xlsx]: Number of reads discarded and retained for each sample.
Summary.Assemblies.[tsv,xlsx]: Assembly metrics such as N50, cumulative length, longest contig length, and GC composition for each sample.
Summary.GenomeCoverage.[tsv,xlsx]: Summary of the overall genome coverage for each sample.
Summary.QC_File_Checks.[tsv,xlsx]: Summary of all QC file checks detailing if a sample passes or fails each process.
Summary.CleanedReads-Bases.[tsv,xlsx]: Summary of the number of cleaned bases for each sample.
Summary.CleanedReads-AlignmentStats.[tsv,xlsx]: Summary of the genome size and coverages of the paired-end and single-end reads for each sample.
Summary-Report_yyyy-MM-dd_HH-mm-ss.xlsx: Excel workbook where each file in the Summaries directory is added to a separate worksheet within the workbook.

Information about the pipeline execution, output logs, error logs, and QC file checks for each sample is stored in pipeline_info/.
[!NOTE] Pipeline execution files have the date and time appended to the filename using the following shorthand notation: year (yyyy), month (MM), day (dd), hour (HH), minute (mm), second (ss).
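
The yyyy-MM-dd_HH-mm-ss shorthand corresponds to the `strftime` pattern `%Y-%m-%d_%H-%M-%S`. The sketch below builds a filename in that style, which can help when locating or sorting files from a particular run. The helper name `stamped_name` is hypothetical and is not part of the pipeline.

```python
from datetime import datetime

# strftime equivalent of the yyyy-MM-dd_HH-mm-ss shorthand.
STAMP_FORMAT = "%Y-%m-%d_%H-%M-%S"


def stamped_name(prefix: str, extension: str, when: datetime) -> str:
    """Build a name such as 'execution_trace_2024-03-09_14-05-07.txt'."""
    return f"{prefix}_{when.strftime(STAMP_FORMAT)}.{extension}"
```

Because every field is zero-padded and ordered from year to second, sorting these filenames alphabetically also sorts them chronologically.
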
pipeline_info/
software_versions.yml: Summary of the software packages used in each process and their version information.
nextflow_log.[job_id].txt: Execution log file produced by Nextflow.
ASM_[num_of_samples].o[job_id]: Output log file produced by the job scheduler.
ASM_[num_of_samples].e[job_id]: Error log file produced by the job scheduler.
pipeline_dag_yyyy-MM-dd_HH-mm-ss.html: Directed acyclic graph (DAG) image of the workflow that gives a visual representation of how the processes connect to each other.
execution_trace_yyyy-MM-dd_HH-mm-ss.txt: Text-based summary report detailing the work directory hash, runtime, CPU usage, memory usage, etc. for each process.
execution_report_yyyy-MM-dd_HH-mm-ss.html: Summary report of all processes, including processes that passed/failed, resource usage, etc.
execution_timeline_yyyy-MM-dd_HH-mm-ss.html: Summary report detailing the runtime and memory usage of each process.
pipeline_info/process_logs/
[sample].[process].command.out: Output log file for each sample in each process.
[sample].[process].command.err: Error log file for each sample in each process.
pipeline_info/qc_file_checks/
[sample].Raw_Initial_FastQ_Files.[tsv,xlsx]: Details if both reads (R1,R2) meet the minimum file size criteria for the pipeline [Default: 25M].
[sample].Summary.Hostile-Removal.[tsv,xlsx]: Details if both reads (R1,R2) meet the minimum file size criteria after host removal using Hostile [Default: 25M].
[sample].SRA_Human_Scrubber_FastQ_File.[tsv,xlsx]: Details if both reads (R1,R2) meet the minimum file size criteria after host removal using SRA-Human-Scrubber [Default: 25M].
[sample].BBTools-Repair-removed_FastQ_Files.[tsv,xlsx]: Details if both reads (R1,R2) meet the minimum file size criteria after repairing broken sister reads from SRA-Human-Scrubber [Default: 25M].
[sample].PhiX_Genome.[tsv,xlsx]: Details if the input PhiX reference genome meets the minimum file size criteria [Default: 5k].
[sample].PhiX-removed_FastQ_Files.[tsv,xlsx]: Details if both reads (R1,R2) meet the minimum file size criteria after PhiX reads have been removed [Default: 25M].
[sample].Adapters_FastA.[tsv,xlsx]: Details if the input adapters reference file meets the minimum file size criteria [Default: 10k].
[sample].Adapter-removed_FastQ_Files.[tsv,xlsx]: Details if both reads (R1,R2) meet the minimum file size criteria after adapter sequences have been removed [Default: 25M].
[sample].Non-overlapping_FastQ_Files.[tsv,xlsx]: Details if both reads (R1,R2) meet the minimum file size criteria after removing overlapping reads [Default: 20M].
[sample].Raw_Assembly_File.[tsv,xlsx]: Details if the genome assembly file produced by an assembler software package meets the minimum file size criteria [Default: 1M].
[sample].Filtered_Assembly_File.[tsv,xlsx]: Details if the genome assembly file meets the minimum file size criteria after low compositional complexity contigs are discarded [Default: 1M].
[sample].Binary_PE_Alignment_Map_File.[tsv,xlsx]: Details if the binary paired-end (PE) alignment file meets the minimum file size criteria after the cleaned paired-end reads are mapped onto the filtered genome assembly [Default: 25M].
[sample].Polished_Assembly_File.[tsv,xlsx]: Details if the genome assembly file meets the minimum file size criteria after SNP and InDel corrections are performed [Default: 1M].
[sample].Final_Corrected_Assembly_FastA_File.[tsv,xlsx]: Details if the final error-corrected genome assembly file meets the minimum file size criteria [Default: 1M].
[sample].Binary_SE_Alignment_Map_File.[tsv,xlsx]: Details if the single-end (SE) alignment file meets the minimum file size criteria after the cleaned singleton reads are mapped onto the final genome assembly file [Default: 1k].
[sample].Annotated_GenBank_File.[tsv,xlsx]: Details if the annotated GenBank file meets the minimum file size criteria [Default: 3M].
[sample].SSU_Extracted_File.[tsv,xlsx]: Details if the extracted 16S rRNA gene sequence file meets the minimum file size criteria [Default: 500b].
[sample]-[assembler].SSU_Renamed_File.[tsv,xlsx]: Details if the 16S rRNA gene sequence file meets the minimum file size criteria after sample identifiers are added to each sequence [Default: 500b].
[sample].16S_BLASTn_Output_File.[tsv,xlsx]: Details if the BLASTn output file meets the minimum file size criteria [Default: 10b].
[sample].Filtered_16S_BLASTn_File.[tsv,xlsx]: Details if the best BLASTn alignment sequence meets the minimum file size criteria [Default: 10b].