1. Pipeline and program descriptions

Håkon Kaspersen edited this page Jan 4, 2024 · 7 revisions

Statement of need

Unraveling the evolutionary relationships between organisms is a crucial part of many comparative genomics projects. However, the complexity of preparing, running, and interpreting such analyses is a hindrance for many researchers who do not specialize in evolutionary biology. Additionally, choosing a combination of compatible software for various analysis scenarios may be difficult. This repository contains the tool ALPPACA (A tooL for Prokaryotic Phylogeny And Clustering Analysis), a Nextflow pipeline for phylogenetic analysis of prokaryotic genomes. The pipeline has been developed to make it easier to run phylogenetic analyses and provides reports to simplify result interpretation.

Our pipeline is composed of two clustering and three phylogeny tracks, outlined below. These tracks are designed to allow analysis of datasets with different levels of genetic diversity. For the phylogeny tracks, the level of similarity influences which assumptions are used to consider sequences orthologous when reconstructing the multiple alignment required for phylogenetic inference. One major advantage of this pipeline is that users can select the track that suits their data, which makes it possible to generate results rapidly.

Track descriptions

Figure 1: Flowchart of the five workflows in ALPPACA. Purple: ANI track. Yellow: cgMLST track (under development). Green: Core gene track. Blue: Core genome track. Red: Mapping track. Created with biorender.com

In all three phylogeny tracks, IQTree is used to reconstruct the phylogeny from the multiple alignment. What differs between the tracks is how the multiple alignment is generated, which is described in detail below. SNP distances are calculated with snp-dists. Each track also has the option of deduplicating the alignment with seqkit and removing constant sites from the alignment with snp-sites.
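The three alignment operations mentioned above can be illustrated on a toy alignment. This is a minimal sketch and not the code of snp-dists, seqkit, or snp-sites: the function names are hypothetical, distances are assumed to count only positions where both sequences carry A, C, G, or T, and a site is treated as constant when every sequence has the same character there.

```python
from itertools import combinations

VALID = set("ACGT")


def snp_distance(a: str, b: str) -> int:
    """Count aligned positions where both bases are A/C/G/T and differ."""
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in VALID and y in VALID and x != y
    )


def pairwise_snp_distances(alignment: dict) -> dict:
    """Return {(name1, name2): distance} for every pair of sequences."""
    return {
        (p, q): snp_distance(alignment[p], alignment[q])
        for p, q in combinations(alignment, 2)
    }


def deduplicate(alignment: dict) -> dict:
    """Keep the first sequence name seen for each distinct sequence."""
    first_name = {}
    for name, seq in alignment.items():
        first_name.setdefault(seq, name)
    return {name: seq for seq, name in first_name.items()}


def drop_constant_sites(alignment: dict) -> dict:
    """Remove columns in which all sequences share the same character."""
    names = list(alignment)
    columns = zip(*(alignment[n] for n in names))
    kept = [col for col in columns if len(set(col)) > 1]
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}


# Toy alignment (made-up data, for illustration only)
toy = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACGTACGT",  # identical to s1
    "s4": "ACCTACGN",
}
distances = pairwise_snp_distances(toy)
unique = deduplicate(toy)
variable = drop_constant_sites(toy)
```

Here `s3` is dropped by deduplication because it is identical to `s1`, and only the two variable columns survive the constant-site filter.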

ANI (Purple)

This track is useful if you are unsure about the level of diversity of your dataset. It runs FastANI, a tool that calculates the average nucleotide identity between genomes.
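To get a quick impression of the diversity in a dataset, the pairwise ANI values can be summarized. The sketch below assumes FastANI's tab-separated output, where the first three columns are query genome, reference genome, and ANI value; the function names and the example file names are hypothetical. It only summarizes the values; which range suits which track is left to the reader's judgment.

```python
def parse_fastani(lines):
    """Parse tab-separated FastANI lines into {(query, reference): ani}."""
    ani = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        query, reference, value = line.split("\t")[:3]
        ani[(query, reference)] = float(value)
    return ani


def ani_range(ani):
    """Return (lowest, highest) ANI among pairs of different genomes."""
    values = [v for (q, r), v in ani.items() if q != r]
    return min(values), max(values)


# Made-up example lines, for illustration only
example = [
    "a.fasta\ta.fasta\t100\t1500\t1500",
    "a.fasta\tb.fasta\t99.2\t1480\t1500",
    "a.fasta\tc.fasta\t95.1\t1300\t1500",
    "b.fasta\tc.fasta\t95.3\t1310\t1520",
]
lowest, highest = ani_range(parse_fastani(example))
```

A narrow range close to 100 suggests closely related genomes, while a wide range points to a more diverse dataset.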

cgMLST (Yellow)

The cgMLST track runs core genome multilocus sequence typing (cgMLST) with ChewBBACA. This track also contains an option to run the classic seven-gene MLST using mlst. The user can either supply their own cgMLST schema or download one from Chewie-NS. This track clusters the isolates based on the allele typing results.
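Clustering on allele typing results usually starts from a pairwise allelic distance. The following is a minimal sketch of that idea, not the pipeline's implementation: the function names are hypothetical, and profiles are assumed to hold integer allele IDs as strings, with any other token (for example `LNF`) treated as a missing call and ignored.

```python
from itertools import combinations


def is_called(allele: str) -> bool:
    """Assumption: a called allele is a plain integer ID."""
    return allele.isdigit()


def allelic_distance(p, q) -> int:
    """Count loci where both profiles have a called allele and they differ."""
    return sum(
        1 for a, b in zip(p, q) if is_called(a) and is_called(b) and a != b
    )


def allelic_distance_matrix(profiles: dict) -> dict:
    """Return {(isolate1, isolate2): distance} for every pair of isolates."""
    return {
        (i, j): allelic_distance(profiles[i], profiles[j])
        for i, j in combinations(profiles, 2)
    }


# Made-up profiles over four loci, for illustration only
profiles = {
    "iso_a": ["1", "2", "3", "LNF"],
    "iso_b": ["1", "5", "3", "4"],
    "iso_c": ["2", "5", "3", "4"],
}
allele_dists = allelic_distance_matrix(profiles)
```

Ignoring loci with a missing call in either isolate avoids counting missing data as a difference, at the cost of comparing each pair over a slightly different set of loci.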

Core gene (Green)

This track is useful for datasets with a relatively high level of diversity, such as across sequence types or closely related species. It starts by annotating the genomes with Bakta. A pangenome analysis is then run with Panaroo, which identifies the core orthologous genes. The core gene method is particularly useful if you have a diverse dataset and want to identify potential clades of interest to investigate further.
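The notion of a core gene can be made concrete with a presence/absence table. This is a minimal sketch under stated assumptions, not Panaroo's algorithm: the function name, the `threshold` parameter, and the gene and genome names are hypothetical, and a gene is called core when it is present in at least the given fraction of genomes.

```python
def core_genes(presence: dict, genomes: list, threshold: float = 1.0) -> list:
    """Return genes present in at least `threshold` (fraction) of the genomes.

    presence maps each gene name to the set of genomes that carry it.
    """
    genome_set = set(genomes)
    total = len(genome_set)
    return sorted(
        gene
        for gene, carriers in presence.items()
        if len(carriers & genome_set) / total >= threshold
    )


# Made-up presence/absence data, for illustration only
genomes = ["g1", "g2", "g3", "g4"]
presence = {
    "gyrA": {"g1", "g2", "g3", "g4"},
    "recA": {"g1", "g2", "g3"},
    "blaTEM": {"g1"},
}
strict_core = core_genes(presence, genomes)
relaxed_core = core_genes(presence, genomes, threshold=0.75)
```

A strict threshold keeps only genes found in every genome; relaxing it tolerates genes missed in a few assemblies, which matters more as the dataset becomes more diverse.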

Mapping (Red)

This track is useful for datasets that are expected to have a medium to low level of diversity, and takes reads as input. In this track, the reads are mapped to a reference with Snippy. The reference serves as a common frame of reference for SNP detection, which allows reconstruction of all the loci that are shared with the reference. Only genetic information derived from vertical inheritance is included in this analysis, as Gubbins is used to detect recombinant areas, and Maskrc-svg masks these areas in the multiple alignment before phylogenetic inference. The major advantage of this track is the use of reads instead of assemblies, since generating assemblies is a time-consuming process.
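Masking recombinant areas amounts to overwriting the affected alignment columns for the affected samples. The sketch below illustrates the idea only and is not how Maskrc-svg is implemented: the function names are hypothetical, regions are assumed to be 1-based inclusive `(start, end)` coordinates, and `N` is assumed as the masking character.

```python
def mask_regions(seq: str, regions, mask_char: str = "N") -> str:
    """Replace every position inside the given 1-based inclusive regions."""
    chars = list(seq)
    for start, end in regions:
        # Clamp to the sequence bounds so out-of-range regions are harmless
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def mask_alignment(alignment: dict, regions_by_sample: dict) -> dict:
    """Mask each sample with its own list of recombinant regions."""
    return {
        name: mask_regions(seq, regions_by_sample.get(name, []))
        for name, seq in alignment.items()
    }


# Made-up alignment and regions, for illustration only
aln = {"s1": "ACGTACGTAC", "s2": "ACGTTCGTAC"}
masked = mask_alignment(aln, {"s1": [(3, 5)], "s2": [(9, 20)]})
```

Because masking is per sample, the alignment keeps its length and column positions, so downstream tools still see a valid multiple alignment.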

Core genome (Blue)

This track is suitable only for datasets where the samples are expected to be very closely related. Here, a multiple sequence alignment is generated with ParSNP, which takes the whole genome into account, as opposed to the core gene track. This increases the resolution of the analysis when the dataset consists of very closely related samples. Similar to the mapping track, only genetic information derived from vertical descent is included in the phylogenetic reconstruction.
