This README describes how to launch the Bacannot
pipeline on the MAF AWS Infrastructure.
For information about the original pipeline and all the tools used by the analysis, please refer to the Bacannot README file.
- A Contig ID is defined as the sequence header before the first space; please make sure that each ID is unique within the fasta file. For example, in the header `>contig_1 length=45231`, the Contig ID is `contig_1`.
- Make sure that each Contig ID is less than 37 characters (before the first space). This is a hard limit set by the `prokka` pipeline. You may use the very basic helper script `renameFastaHeaders.py` for this (see USAGE below).
- For simple use cases of this pipeline, where you only have a genome that needs annotation, there is a helper script `createSubmissionYaml.py` that accepts a local folder of fasta files, an S3 path, and an output YAML file name (see USAGE below).
- The `createSubmissionYaml.py` script will also print a suggested pipeline submission command that you may use to launch the pipeline with the submission files you've just created.
Example submission commands:
aws batch submit-job \
--job-name nf-bacannot-mrsa \
--job-queue priority-maf-pipelines \
--job-definition nextflow-production \
--container-overrides command=FischbachLab/nf-bacannot,\
"-profile","maf",\
"--input","s3://genomics-workflow-core/Results/Bacannot/MRSA/20221102/MRSA.yaml",\
"--output","s3://genomics-workflow-core/Results/Bacannot/00_TEST/MRSA/20230407"
aws batch submit-job \
--job-name nf-bacannot-hCom2 \
--job-queue priority-maf-pipelines \
--job-definition nextflow-production \
--container-overrides command=FischbachLab/nf-bacannot,\
"-profile","maf",\
"--input","s3://genomics-workflow-core/Results/Bacannot/hCom2/20221102/inputs/hCom2.yaml"
"--output","s3://genomics-workflow-core/Results/Bacannot/hCom2/20221102"
USAGE for renameFastaHeaders.py:
python renameFastaHeaders.py <ORIGINAL_FASTA_FILE> <RENAMED_FASTA_FILE>
python renameFastaHeaders.py fasta_folder/genome.fasta renamed_fasta_folder/genome.fasta
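If you just want to verify that your headers already satisfy the constraints above, a quick check with standard command-line tools is sketched below (the fasta path is a placeholder):

```bash
# Flag contig IDs (header text before the first space) that are 37+ characters
# long or that appear more than once in the file.
awk '/^>/ {
  id = substr($1, 2)
  if (length(id) >= 37) print "too long:  " id
  if (seen[id]++)       print "duplicate: " id
}' genome.fasta
```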
Create a submission YAML file for the bacannot pipeline.
Install dependencies:
conda create -n bacannot python=3.11
conda activate bacannot
pip install -U ruamel.yaml 'cloudpathlib[s3]'
Run the script:
python createSubmissionYaml.py \
-g <Local or S3 Path to Genome(s) directory> \
-project <Name of the project that this data belongs to> \
-prefix <Subset of the data in this Project; or date in YYYYMMDD format> \
-s <Output YAML file name> \
--extension fna (Optional: use a different extension for the fasta files; default is fasta) \
--copy-genomes (Optional: copy the input genomes to the output directory; default is False) \
--use-bakta (Optional: use Bakta instead of the standard Prokka; most people SHOULD NOT use this flag, default is False)
python createSubmissionYaml.py \
-g s3://genomics-workflow-core/Results/BinQC/MITI-MCB/20230324/fasta/ \
-project MITI-MCB \
-prefix 20230411 \
-s test.yaml
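For reference, the generated file follows the Bacannot samplesheet format; for assembly-only annotation it is roughly of this shape (the sample name and S3 path below are illustrative, and the field names should be checked against the Bacannot README):

```yaml
samplesheet:
  - id: Example-Genome-1   # one entry per fasta file found under -g
    assembly: s3://genomics-workflow-core/Results/Bacannot/MITI-MCB/20230411/inputs/Example-Genome-1.fasta
```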
The aggregateGFFs.py script copies GFF files from each sample folder to the aggregate folder.
python aggregateGFFs.py \
-p s3://genomics-workflow-core/Results/Bacannot/MITI-MCB/20230515 \
-s s3://genomics-workflow-core/Results/Bacannot/MITI-MCB/20230515/inputs/DELETE_ME.yaml
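To confirm the copies afterwards, you can list the GFF objects under the results prefix (standard AWS CLI; the exact name of the aggregate folder depends on the script):

```bash
aws s3 ls s3://genomics-workflow-core/Results/Bacannot/MITI-MCB/20230515/ --recursive | grep '\.gff'
```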
This pipeline generates A LOT of data per genome. Each genome's output follows the directory structure described here. The easiest way to explore this data interactively is by using `docker`. Make sure you have `docker` installed (see instructions here). Once `docker` is installed and running, sync the genome directory that is of interest to you by using the `aws s3 sync` command. The following commands explain the process using the annotation outputs of the Slackia-exigua-ATCC-700122-MAF-2 genome, present on S3 at s3://genomics-workflow-core/Results/Bacannot/00_TEST/20221031/.
aws s3 sync s3://genomics-workflow-core/Results/Bacannot/00_TEST/20221031/Slackia-exigua-ATCC-700122-MAF-2/ Slackia-exigua-ATCC-700122-MAF-2
This command will download all the data into a local folder called Slackia-exigua-ATCC-700122-MAF-2.
cd Slackia-exigua-ATCC-700122-MAF-2
docker run -v $(pwd):/work -d --rm --platform linux/amd64 -p 3838:3838 -p 4567:4567 --name ServerBacannot fmalmeida/bacannot:server
If this is your first time running this viewer, you might see docker downloading a lot of data. This is normal and can take some time depending on your internet speed. Once complete, you're ready to interact with your data. Simply open your favorite web browser and go to http://localhost:3838/. Note the use of http and not https; some browsers may automatically make this change. If you are unable to see the webpage, copy and paste the URL into your web browser rather than clicking on the link.
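If the page does not load right away, you can check on the container with standard docker commands:

```bash
docker ps --filter name=ServerBacannot   # confirm the container is running
docker logs -f ServerBacannot            # follow the server's startup logs
```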
If you're using an EC2 instance, go to the AWS EC2 console by logging into your AWS account, identify your instance, and note its public IP address. Open your favorite web browser and go to http://Public.IP.Address:3838/. Note the use of http and not https; some browsers may automatically make this change. If you are unable to see the webpage, copy and paste the URL into your web browser rather than clicking on the link.
Et voila! You can now explore your data!
All great things must come to an end. Use the following command to stop and remove the docker container, which will in turn shut down the data explorer webpage.
docker rm -f ServerBacannot