3. Input and usage

Håkon Kaspersen edited this page Feb 7, 2024 · 37 revisions

Usage

The pipeline can be executed locally from your computer after cloning the repository, or you can execute it directly from GitHub. To execute locally:

nextflow run /path/to/main.nf <parameters>

To execute from GitHub:

nextflow run NorwegianVeterinaryInstitute/ALPPACA <parameters>

Detailed examples of the five tracks are listed below.

ANI analysis

nextflow run NorwegianVeterinaryInstitute/ALPPACA -c <your_config> --track ani --input "/path/to/input.csv" --out_dir <dirname> -profile <docker/singularity/conda> -work-dir <dirname>

cgMLST analysis

nextflow run NorwegianVeterinaryInstitute/ALPPACA -c <your_config> --track cgmlst --input "/path/to/input.csv" --schema "/path/to/schema" --ptf "/path/to/prodigal_training_file.ptf" --mlst_schema "campylobacter"

Core gene analysis

nextflow run NorwegianVeterinaryInstitute/ALPPACA -c <your_config> --track core_gene --input "/path/to/input.csv" --clean_mode moderate --bakta_db "/path/to/bakta_db" --out_dir <dirname> -profile <docker/singularity/conda> -work-dir <dirname> 

Core genome analysis

nextflow run NorwegianVeterinaryInstitute/ALPPACA -c <your_config> --track core_genome --input "/path/to/input.csv" --out_dir <dirname> -profile <docker/singularity/conda> -work-dir <dirname> 

Mapping analysis

nextflow run NorwegianVeterinaryInstitute/ALPPACA -c <your_config> --track mapping --input "/path/to/input.csv" --R1 "*1.fastq.gz" --R2 "*2.fastq.gz" --suffix "_1.fastq.gz" --snippyref "ref.fasta" --out_dir <dirname> -profile <docker/singularity/conda> -work-dir <dirname>

Input

The input is a comma-separated text file listing the full path to every assembly or read file used as input, with the column headers "sample" and "path" (see the example input below). Please note that reads use a long table format: each sample has one row per read file!

Assemblies:

sample,path
sample1,/path/to/sample1.fasta
sample2,/path/to/sample2.fasta

Reads:

sample,path
sample1,/path/to/sample1_R1.fastq.gz
sample1,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz
sample2,/path/to/sample2_R2.fastq.gz
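
Before launching a run, it can help to sanity-check the input file against the format described above. The following is a minimal sketch, not part of ALPPACA: check_samplesheet is a hypothetical helper that only verifies the header line and that each listed path is absolute and points to an existing file.

```shell
# Hypothetical helper (not shipped with the pipeline): sanity-check a samplesheet.
# Returns 0 if the header is "sample,path" and every path is absolute and exists.
check_samplesheet() {
    sheet="$1"
    status=0
    {
        # The first line must be exactly the expected header.
        IFS= read -r header
        if [ "$header" != "sample,path" ]; then
            echo "ERROR: header must be 'sample,path'" >&2
            return 1
        fi
        # Check every remaining row (also handles a last line without newline).
        while IFS=, read -r sample path || [ -n "$sample" ]; do
            [ -z "$sample$path" ] && continue   # skip blank lines
            case "$path" in
                /*) ;;                           # absolute path, as required
                *) echo "ERROR: $sample: path is not absolute: $path" >&2
                   status=1 ;;
            esac
            [ -f "$path" ] || {
                echo "ERROR: $sample: file not found: $path" >&2
                status=1
            }
        done
    } < "$sheet"
    return "$status"
}
```

It could be used as `check_samplesheet /path/to/input.csv && echo "samplesheet looks fine"`.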

The script bin/generate_input.R may be used to generate the input file from a directory path (R must be available in your PATH):

Rscript bin/generate_input.R <full/path/to/files> <pattern>

Here, pattern is the file suffix that, when removed from the file name, leaves the sample ID. For assemblies, the script may be used as such:

Rscript bin/generate_input.R <full/path/to/files> ".fasta"

For reads, the script may be used as such:

Rscript bin/generate_input.R <full/path/to/files> "_L001_R..fastq.gz"

Here, the R. in the last argument matches both the R1 and R2 files (the . stands for any single character). The script uses the pattern argument to derive the sample ID common to both R1 and R2, so make sure that it matches your file names! The script writes a file called samplesheet.csv to the current directory.
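
To illustrate how the pattern determines the sample ID, here is a rough shell approximation of the idea. It is not the actual logic of bin/generate_input.R: make_samplesheet is a hypothetical function, and it assumes that the pattern is an extended regular expression matched at the end of each file name.

```shell
# Hypothetical sketch, not the R script itself: print a samplesheet for all
# files in a directory whose names end with the given pattern.
make_samplesheet() {
    dir="$1"        # should be a full path, so the listed paths are full paths
    pattern="$2"    # assumed to be an extended regular expression
    echo "sample,path"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        # Keep only files whose name ends with the pattern.
        if printf '%s\n' "$name" | grep -Eq -- "${pattern}\$"; then
            # Removing the matched suffix leaves the sample ID.
            sample=$(printf '%s\n' "$name" | sed -E "s#${pattern}\$##")
            echo "${sample},${f}"
        fi
    done
}
```

For example, `make_samplesheet /full/path/to/files "_L001_R..fastq.gz" > samplesheet.csv` would list each R1 and R2 file on its own row under the same sample ID, provided the file names follow that pattern.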

Parameter descriptions

Input and output
--input:              Input csv file
--out_dir:            Output directory name

Workflow-specific parameters
ANI
--kmer_size:          The kmer size used in fastANI, default: 16
--fragment_length:    The fragment length used in fastANI, default: 3000
--min_fraction:       The minimum fraction of genome that must be shared for trusting ANI, default: 0.2

cgMLST
--output_schema:      Output downloaded/prepped cgMLST schema (not available with --prepped_schema), default: false
--schema:             Path to schema location, either prepped or unprepped
--ptf:                Path to prodigal training file
--prepped_schema:     Skip schema prep step, default: false
--download_external:  Download schema from Chewie-NS, default: false
--species_value:      The species ID listed at Chewie-NS (https://chewbbaca.online/stats)
--id_value:           The schema number listed at Chewie-NS (https://chewbbaca.online/stats)
--skip_schema_eval:   Skip schema evaluation step, default: false
--skip_mlst:          Skip seven-gene MLST step, default: false
--mlst_schema:        Name of the scheme, as defined in mlst (https://github.com/tseemann/mlst)
--bsr:                Blast score ratio, default: 0.6
--min_len:            Minimum sequence length accepted for coding sequence to be included in the schema, default: 0
--translation_table:  Genetic code used to predict genes, default: 11
--size_threshold:     CDS size variation threshold, default: 0.2
--mode:               Execution mode of chewBBACA, default: 4
--max_missing:        Max missing alleles allowed for filtering genomes, default: 10
--clustering_method:  Method used for clustering genomes, either "single" or "nj", default: single

Core gene
--bakta_db:           Full path to the bakta database.
--output_gffs:        If set, will output the gff3 files from Bakta to results/gffs.
--qc:                 (Optional) Run the Panaroo QC module. Requires the --refdb parameter.
--refdb:              Path to the reference database used by panaroo qc
--clean_mode:         The clean_mode setting in Panaroo, either sensitive, moderate, or strict
--identity_threshold: The sequence identity threshold used in Panaroo (default: 0.98)
--len_dif_percent:    Length difference cutoff used in Panaroo (default: 0.98)

Core genome
--parsnp_ref:         Specify the reference used in ParSNP. Use "!" to randomly select one of the input genomes. You can also supply the full path to a specific genome, or simply the name of one of the input files.

Mapping
--snippyref:          Full path to the fasta reference used for mapping
--R1:                 Suffix of the R1 files (e.g. `*1.fastq.gz`)
--R2:                 Suffix of the R2 files (e.g. `*2.fastq.gz`)
--suffix:             Suffix that will be removed from the R1 file names to get the common sample ID for both R1 and R2 (e.g. `_1.fastq.gz`)
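
As an illustration of how --suffix relates to the R1 file names, the sketch below previews the sample IDs that result from stripping a suffix. It is only an approximation under the assumption that the suffix is removed literally from the end of the file name; preview_sample_ids is a hypothetical helper and not how the pipeline itself is implemented.

```shell
# Hypothetical helper: show the sample ID obtained by removing a literal
# suffix from each given R1 file name.
preview_sample_ids() {
    suffix="$1"
    shift
    for f in "$@"; do
        name=$(basename "$f")
        case "$name" in
            *"$suffix")
                # Strip the suffix from the end of the file name.
                id=${name%"$suffix"}
                echo "$id"
                ;;
            *)
                echo "WARNING: $name does not end with $suffix" >&2
                ;;
        esac
    done
}
```

For example, `preview_sample_ids "_1.fastq.gz" /path/to/reads/*_1.fastq.gz` prints one sample ID per R1 file, which makes it easy to spot a suffix that does not match your file names.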

General parameters
--deduplicate:        Should duplicate samples be removed? If set to `true`, sequences that are 100% identical in the alignment are removed and listed in an output file. This makes the analyses run slightly faster and the final tree more readable.
--treebuilder:        The treebuilder software used by Gubbins
--gubbinsmodel:       The evolutionary model used by Gubbins
--iqtree_model:       The evolutionary model used by IQ-TREE. Set as "MFP" to use ModelFinder Plus. NOTE: MFP may cause a significant increase in runtime!
--mset:               When using MFP, restrict the test to the subset of models specified here, to reduce the total number of models tested.
--cmax:               The maximum number of rate categories for the FreeRate (+R) models tested when using MFP, to reduce the total number of models tested.
--bootstrap:          Number of UFBoot bootstrap replicates
--seed:               IQ-TREE seed, by default set to "12345". This is to make runs reproducible with the same dataset.
--outgroup:           Name of the outgroup, leave blank if not used. Note: this must be specified together with the -o option from IQ-TREE: --outgroup "-o name"
--fast_mode:          Run IQ-TREE in fast mode with SH-aLRT. This can be used to make large datasets run faster. Note that no consensus tree is generated in this mode, only the ML tree, and the bootstrapping is interpreted differently. Please refer to the IQ-TREE documentation for more information.

Time-related parameters
--time_multiplier:    Default value 1. If increased to 2, doubles the time requested from the system for each process