Skip to content

Influenza

Omnifluss can perform genome reconstruction of Influenza viruses using Illumina short-read NGS data.

Usage

Here, we begin the explanations of how to run omnifluss with basic examples and progress to more advanced setups. Omnifluss can be finetuned in many aspects such as runtime parameters, internal algorithms, deactivation of subroutines, optimisation to certain compute environments and usage of different databases.

Basic run

To begin with, a command for a basic run of omnifluss is

nextflow run rki-mf1/omnifluss \
    -profile singularity \
    --input samplesheet.csv \
    --reference my_virus_reference.fasta \
    --kraken2_db /path/to/my/kraken2db/ \
    --outdir results

Note: Mind the usage of different hyphen here! We use a single hyphen for Nextflow options and use a double hyphen for omnifluss specific parameters.

This command launches a basic omnifluss run with tasks executed within singularity containers, sequence data defined in the samplesheet, a FASTA file containing a reference sequence, a Kraken2 database, and results stored in an output folder called results. See the Output chapter for the documentation of omnifluss' outputs.

Reproducibility

To ensures that a specific version of omnifluss is used when running the pipeline, you can specify a release tag. If you keep using the same release tag, you'll be running the same version of omnifluss, even if there have been changes to the code since that version. You can visit the releases page and find the latest pipeline version at the top of the website. Then, by running omnifluss with the Nextflow option -r (using one hyphen, eg. -r v0.2.1), you can switch to a particular version of omnifluss:

nextflow run rki-mf1/omnifluss \
    -r v0.2.1 \
    -profile singularity \
    --input samplesheet.csv \
    --reference my_virus_reference.fasta \
    --kraken2_db /path/to/my/kraken2db/ \
    --outdir results

The version of omnifluss used in a particular run is written to the Nextlow log file for reproducibility.

Updating the pipeline

When you run omnifluss as in the basic run example, Nextflow automatically pulls the pipeline code from the GitHub repository and stores a local copy (called cached version). When running the pipeline again, Nextflow uses this cached version by default if available - even if the pipeline code has been updated since the initial copying. You can manually update the cached version of the pipeline to the latest available version via

nextflow pull rki-mf1/omnifluss

Again, you can also add -r to update the cached version to a specific release via

nextflow pull -r v0.2.0 rki-mf1/omnifluss

Configs and profiles

Omnifluss provides a plethora of parameters (use --help to inspect the manual page) to configure the workflow. To process Illumina paired-end short-read sequencing data of Influenza virus samples, we have prepared a configuration file with best-practise settings. This predefined configuration file can be provided to an omnifluss run via the -profile option:

nextflow run rki-mf1/omnifluss \
    -profile singularity,INV_illumina \
    --input samplesheet.csv \
    --reference my_virus_reference.fasta \
    --kraken2_db /path/to/my/kraken2db/ \
    --outdir results

Using the INV_illumina profile will overwrite multiple default parameters of omnifluss. Please inspect the configuration file for more details about the Influenza virus-specific parameters. Note that you can still provide omnifluss parameters on the command line in addition to profiles. The pipeline parameter precedence in Nextflow prioritizes command line parameters over parameters specified in a configuration file. See -params-file for more details on custom configuration files.

Parameters

--input

The samples to be analysed are provided to omnifluss via a sample sheet (.csv) using the --input parameter. It specifies the raw read sequence data files (.fastq) used by omnifluss. For instance:

sample,fastq_1,fastq_2
INV_ILL_NB1,/path/to/experiment_NB1_R1.fastq.gz,/path/to/experiment_NB1_R2.fastq.gz
INV_ILL_NB2,/path/to/experiment_NB2_R1.fastq.gz,/path/to/experiment_NB2_R2.fastq.gz
INV_ILL_NB3,/path/to/experiment_NB3_R1.fastq.gz,/path/to/experiment_NB3_R2.fastq.gz

which refers to the structured information

sample fastq_1 fastq_2
INV_ILL_NB1 /path/to/experiment_NB1_R1.fastq.gz /path/to/experiment_NB1_R2.fastq.gz
INV_ILL_NB2 /path/to/experiment_NB2_R1.fastq.gz /path/to/experiment_NB2_R2.fastq.gz
INV_ILL_NB3 /path/to/experiment_NB3_R1.fastq.gz /path/to/experiment_NB3_R2.fastq.gz

The argument parser will automatically detect the sample and paired-end information provided by the sample sheet. The sample sheet requires a three-column entry per sample which has to match the definition below.

Column Description
sample Custom sample name. This entry might be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (_).
fastq_1 Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".
fastq_2 Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".

--reference

File containing one or multiple reference sequence. Parameter value is a FASTA file. For the influenza virus, the FASTA file typically contains eight reference sequences, i.e. one per segment. However, another collection might suit your research question. All reads from the --input FASTQ files are mapped against the sequences in --reference and these mappings will ultimately be used for a reference-guided assembly.

Exactly one of the two parameters --reference and --reference_selection_db has to be provided to omnifluss when process Influenza virus data. These two parameters provide the references for the consensus sequence reconstruction. Depending on the choice of the first reference parameter, the --reference_selection parameter has to be set accordingly. The parameter combination --reference_selection kma --reference_selection_db <path> (default when using the INV_illumina profile) activates a selection process for the best fitting reference from the given reference database at <path>. Please see --reference_selection_db for more details on the reference database. The parameter combination --reference_selection static --reference <fasta> takes strictly the sequences in <fasta> as reference sequences for the genome reconstruction.

--reference_selection

Reference selection mode. Choice of "kma" and "static". Please see --reference for more details.

--reference_selection_db

Database for automatic reference selection. Parameter value is a path to a reference database. The reference database has to comply with the following format: for each of the genome segments of the Influenza virus (HA, NA, MP, NP, NS, PA, PB1, PB2), the reference database can contain up to one FASTA file containing one or multiple reference sequences. Further, the FASTA file names have to begin with <segment_name>., i.e. a valid database could looks like

/path/
├── HA.segment.fasta
├── MP.segment.fasta
├── NA.segment.fasta
├── NP.segment.fasta
├── NS.segment.fasta
├── PA.segment.fasta
├── PB1.segment.fasta
└── PB2.segment.fasta

Please see --reference for more details.

--kraken2_db

\<WIP>

--fastp_adapter_fasta

You can especify a plain FASTA file for adapter clipping. E.g. for Illumina Nextera Transposase adapter

>Illumina Nextera Transposase adapter fwd
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
>Illumina Nextera Transposase adapter rev
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG

--help

Various parameters can be finetuned throughout the workflow. You can find the full list of parameters via nextflow run rki-mf1/omnifluss -r <release-tag> --help.

Note: The documentation of pipeline parameters is generated automatically from the pipeline schema. Options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).

-params-file

Each parameter of omnifluss can also be written into a configuration file (in JSON or YAML format) and provided to omnifluss via -params-file <file>. For instance, the basic run case of omnifluss can be shortened and specified with a configuration file:

nextflow run rki-mf1/omnifluss \
    -profile singularity \
    -params-file params.yaml

with the corresponding YAML file params.yaml:

input: 'samplesheet.csv'
reference: 'my_virus_reference.fasta'
kraken2_db: '/path/to/my/kraken2db/'
outdir: 'results'

Warning: Do not use the -c <file> to specify pipeline parameters as this will result in errors! Custom config files specified in -c must only be used for tuning process resource specifications or module arguments (args).

-resume

Specify -resume when restarting a pipeline. Nextflow reuses all cached intermediate results from pipeline steps start are not affected by changes between the runs. For more info about this parameter, see this blog post.

You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.

--outdir

\<WIP>

Output

After a successful run, the pipeline creates the following files and folders in your working directory:

work                # Directory containing the Nextflow working files
<outdir>            # Results in specified location (defined with --outdir)
.nextflow_log       # Log file from Nextflow

\<WIP>