1. Pipeline and program descriptions
Unraveling the evolutionary relationships between organisms is a crucial part of many comparative genomics projects. However, the complexity of preparing, running, and interpreting such analyses is a hindrance for many researchers who are not specialized in evolutionary biology. Choosing a combination of compatible software for different analysis scenarios can also be difficult. This repository contains the tool ALPPACA (A tooL for Prokaryotic Phylogeny And Clustering Analysis), a Nextflow pipeline for phylogenetic analysis of prokaryotic genomes. The pipeline has been developed to make phylogenetic analysis easier to run, and it provides reports to simplify interpretation of the results.
Our pipeline is composed of three main tracks, outlined below. The tracks are designed for datasets at three different levels of genetic diversity. The level of similarity influences which assumptions are used to consider sequences orthologous when reconstructing the multiple alignment required for phylogenetic inference. For instance, the pipeline can be run on datasets with an expected low (e.g. an outbreak situation), medium (e.g. within a single MLST cluster), or high (e.g. across sequence types (STs) or closely related species) level of diversity. One major advantage of this pipeline is that users can select the track that fits the data at hand, which makes it possible to generate analysis results rapidly.
Figure 1: Flowchart of the three workflows in ALPPACA. Yellow boxes: Input. Blue boxes: Processes. Green background: Core gene workflow. Blue background: Core genome workflow. Red background: Mapping workflow. Grey background: output. Created with biorender.com
The core gene track identifies the core genes in the input assemblies by running Prokka, followed by Panaroo QC and Panaroo pan-genome. Duplicated genomes are then removed with seqkit (optional), and SNPs are filtered with snp-sites (optional). Finally, a maximum likelihood (ML) tree is generated with IQTree, and SNP distances are calculated with snp-dists.
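To illustrate the idea behind the optional SNP filtering step, the following minimal Python sketch keeps only the alignment columns in which at least two different unambiguous bases occur. It is not the snp-sites implementation: the function name `variable_sites`, the dictionary input format, and the rule that only A, C, G, and T count towards variation are assumptions made for this example, and the real tool may treat gaps and ambiguous bases differently.

```python
def variable_sites(alignment):
    """Keep only columns where at least two different unambiguous bases occur.

    `alignment` maps a genome name to its aligned sequence (hypothetical
    input format chosen for this sketch).
    """
    if not alignment:
        return {}
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same aligned length")

    unambiguous = set("ACGT")
    # A column is kept when it contains more than one distinct A/C/G/T base.
    keep = [
        i for i in range(length)
        if len({seq[i] for seq in seqs} & unambiguous) > 1
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in zip(names, seqs)}


# Toy alignment: only the fourth column varies among unambiguous bases.
toy = {"genome_a": "ATGCCA", "genome_b": "ATGTCA", "genome_c": "ATGTCN"}
print(variable_sites(toy))
```

Reducing the alignment to variable sites shrinks the input for the downstream tree and distance steps without changing which positions differ between genomes.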
In the core genome track, the core genome is identified using ParSNP. Duplicated genomes are then removed with seqkit (optional). Next, recombinant areas are identified with Gubbins and masked with maskrc-svg, before filtering with snp-sites (optional). Finally, the ML tree is generated with IQTree, and SNP distances are calculated with snp-dists.
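The masking of recombinant areas can be pictured as overwriting the affected alignment positions with a placeholder character so that they no longer contribute SNPs. The sketch below shows that idea only. It does not reproduce maskrc-svg or parse Gubbins output: the function `mask_regions`, the 1-based inclusive `(start, end)` interval format, and the default mask character `N` are all assumptions for illustration.

```python
def mask_regions(sequence, regions, mask_char="N"):
    """Replace each 1-based inclusive (start, end) interval with mask_char.

    The interval format and mask character are hypothetical choices for
    this sketch, not the behaviour of any specific tool.
    """
    chars = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(chars) or start > end:
            raise ValueError(f"invalid region: {start}-{end}")
        # Convert to 0-based slicing and overwrite the whole interval.
        chars[start - 1:end] = mask_char * (end - start + 1)
    return "".join(chars)


# Mask positions 3-5 and 9-10 of a toy aligned sequence.
print(mask_regions("ATGCCATTGA", [(3, 5), (9, 10)]))
```

Masked positions stay in the alignment, so coordinates are preserved for the later steps.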
In the mapping track, the core genome is generated through Snippy, which uses reads as input. Duplicated genomes are then removed with seqkit (optional). Recombinant areas are identified with Gubbins and subsequently masked with maskrc-svg, before filtering SNPs with snp-sites (optional). Finally, the phylogenetic tree is generated with IQTree, and SNP distances are calculated with snp-dists.
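As a rough illustration of what a pairwise SNP distance represents, the sketch below counts the positions at which two aligned sequences differ and collects the counts in a symmetric matrix. It is a simplified stand-in, not the snp-dists implementation: the function names, the dictionary input format, and the choice to count only positions where both sequences carry an unambiguous A, C, G, or T are assumptions for this example.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions where both sequences have A/C/G/T and the bases differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same aligned length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid and a != b
    )


def distance_matrix(alignment):
    """Build a symmetric nested-dict matrix of pairwise SNP distances.

    `alignment` maps a genome name to its aligned sequence (hypothetical
    input format chosen for this sketch).
    """
    names = list(alignment)
    matrix = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        dist = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = matrix[b][a] = dist
    return matrix


# Toy alignment; the trailing N in genome_c is ignored in comparisons.
toy = {"genome_a": "ATGCCA", "genome_b": "ATGTCA", "genome_c": "ATGTCN"}
for name, row in distance_matrix(toy).items():
    print(name, row)
```

In this toy example, genome_b and genome_c end up at distance 0 because their only differing position involves an ambiguous base, which this sketch skips.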