This process uses the Hostile package to align FastQ reads to a reference genome and keeps only the reads that do not map. By default it removes human reads, and its primary output is the cleaned FastQ files.
One terrific feature of this package is that it bundles human genome assembly files. The default is the latest telomere-to-telomere (T2T) human assembly, published by Adam Phillippy and colleagues in 2022, combined with human leukocyte antigen (HLA) sequences from here.
If a user needs to remove a different background genome, a non-default reference genome can be supplied. For example, the bacterial pathogen might have been collected from a common brown sewer rat in New York City (Rattus norvegicus) or from a Holstein dairy cow in the USA (Bos taurus). For Illumina read removal with a non-default reference genome, the user must specify the path prefix shared by all 6 of the bowtie2 index files, not a FastA file.
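Because the path-prefix requirement is easy to get wrong, a small pre-flight check can help. The sketch below is illustrative only and is not part of Hostile or this pipeline: the function names `missing_bowtie2_files` and `hostile_clean_command` are hypothetical, the six suffixes are the ones `bowtie2-build` writes for a standard index (large indexes end in `.bt2l`), and the `hostile clean` flag names are assumed from the Hostile README, so confirm them with `hostile clean --help` for the installed version.

```python
from pathlib import Path

# Suffixes bowtie2-build writes for a standard (small) index.
# Large indexes use ".bt2l" instead of ".bt2".
BT2_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def missing_bowtie2_files(prefix, large=False):
    """Return the index files expected for `prefix` that are not on disk."""
    ext = "l" if large else ""
    expected = [Path(f"{prefix}{suffix}{ext}") for suffix in BT2_SUFFIXES]
    return [path for path in expected if not path.is_file()]


def hostile_clean_command(fastq1, fastq2, index_prefix):
    """Build (but do not run) an argument list for paired-end Illumina reads.

    Assumption: the flag names below follow the Hostile README and may
    differ between versions.
    """
    missing = missing_bowtie2_files(index_prefix)
    if missing:
        names = ", ".join(path.name for path in missing)
        raise FileNotFoundError(f"incomplete bowtie2 index, missing: {names}")
    return [
        "hostile", "clean",
        "--fastq1", str(fastq1),
        "--fastq2", str(fastq2),
        "--index", str(index_prefix),
        "--aligner", "bowtie2",
    ]
```

Passing a FastA path such as `rat.fasta` as the prefix reports all six files as missing, which catches the most common mistake before any alignment starts. The returned list can be handed to `subprocess.run` once the flags have been confirmed.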
To avoid removing reads that match both the target pathogen and the host, and so maximize retention of target pathogen reads, Hostile supplies utilities to compare the two genomes and build a custom reference file here.
A clever approach in Bede Constantinides' preprint was to gather 985 reference-grade bacterial genomes from the FDA-ARGOS database into one large FastA and use it to mask the combined T2T-CHM13 and IPD-IMGT/HLA human genome reference. The masked reference is available here.
A peer-reviewed manuscript is likely in progress, but for now the bioRxiv preprint describes the Hostile package.