-
Notifications
You must be signed in to change notification settings - Fork 23
Add support for presupplied annotation files (FAA + GFF or FAA + GBK) #340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
…verison (as reported on antiSMASH github)
Co-authored-by: James A. Fellows Yates <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to change feature
-> gbk
everywhere if we want to go with that route, including also the schema_input.tsv
etc, and would also have to update the test-data samplesheet files 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just few comments to consider :) but good job 💯
workflows/funcscan.nf
Outdated
meta, files -> | ||
def fasta_found = files.find{it.toString().tokenize('.').last().matches('fasta|fas|fna|fa')} | ||
def faa_found = files.find{it.toString().endsWith('.faa')} | ||
def gbk_found = files.find{it.toString().tokenize('.').last().matches('gbk')} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also add 'gbk|gbff' also as that is the gbk extension output for bakta
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ill add it here as i couldnt comment on it for some reason :D
So in line 188
// TODO: Only NT at the moment. AA tax. classification will be added only when its PR is merged.
remove this because now that the user supllies always both fasta and gbk files for the preannotated track, we dont need to update the taxonomy workflow ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FFFFF bakta!? WHY!?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh that's a good question.. currently only FASTAs go to taxonomy... I can't remember if that's preanno ones too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, the user doesn't always supply GBK, I don't understand now...?
workflows/funcscan.nf
Outdated
ch_versions = ch_versions.mix( SEQKIT_SEQ_LONG.out.versions ) | ||
ch_versions = ch_versions.mix( SEQKIT_SEQ_SHORT.out.versions ) | ||
|
||
ch_prepped_input_long = SEQKIT_SEQ_LONG.out.fastx | ||
ch_intermediate_input_long = SEQKIT_SEQ_LONG.out.fastx | ||
.map{ meta, file -> [ meta + [id: meta.id + '_long', length: "long" ], file ] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check the actual final output files (.tsv) with the test_taxonomy.config
? Do please correct me if im wrong with understanding the workflow now, so we are now renaming the meta.ids with suffixes _long
and _short
, will that not interfere with adding the taxonomy to the right files because taxonomy is added according to not only contig but sample_id too which comes from the meta_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, which is why I asked you to test 😆
If I understand your question correctly:
meta.id
indeeds comes from sample
, so updating meta is replacing the sample_id
with the suffix
But the taxonomy workflow takes input from after this renaming, so it should be OK i hope? Please run the branch and check 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With regards to the summary and taxonomy :
-
ampcombi: it doesnt output all samples (but that will be fixed when i put in the new ampcombi submodules). Taxonomy is merged
-
combgc: it doesnt consider the samples_1 and 2 that are preannotated, only takes those from the annotatation step in the final report, which i dont understand why. Taxonomy is merged
-
hamronization: Works perfectly.
So either therees a problem with this collectfile() function fro both ampcombi/combgc or somethings up with the publishdir path, that every time the pipeline stops and resumes it rewrites the final report.
workflows/funcscan.nf
Outdated
ch_versions = ch_versions.mix( SEQKIT_SEQ_LONG.out.versions ) | ||
ch_versions = ch_versions.mix( SEQKIT_SEQ_SHORT.out.versions ) | ||
|
||
ch_prepped_input_long = SEQKIT_SEQ_LONG.out.fastx | ||
ch_intermediate_input_long = SEQKIT_SEQ_LONG.out.fastx | ||
.map{ meta, file -> [ meta + [id: meta.id + '_long', length: "long" ], file ] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, which is why I asked you to test 😆
If I understand your question correctly:
meta.id
indeeds comes from sample
, so updating meta is replacing the sample_id
with the suffix
But the taxonomy workflow takes input from after this renaming, so it should be OK i hope? Please run the branch and check 😬
workflows/funcscan.nf
Outdated
meta, files -> | ||
def fasta_found = files.find{it.toString().tokenize('.').last().matches('fasta|fas|fna|fa')} | ||
def faa_found = files.find{it.toString().endsWith('.faa')} | ||
def gbk_found = files.find{it.toString().tokenize('.').last().matches('gbk')} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FFFFF bakta!? WHY!?
workflows/funcscan.nf
Outdated
meta, files -> | ||
def fasta_found = files.find{it.toString().tokenize('.').last().matches('fasta|fas|fna|fa')} | ||
def faa_found = files.find{it.toString().endsWith('.faa')} | ||
def gbk_found = files.find{it.toString().tokenize('.').last().matches('gbk')} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh that's a good question.. currently only FASTAs go to taxonomy... I can't remember if that's preanno ones too
workflows/funcscan.nf
Outdated
if ( params.run_taxa_classification ) { | ||
TAXA_CLASS ( ch_prepped_input ) | ||
TAXA_CLASS ( ch_prepped_input.fastas ) | ||
ch_versions = ch_versions.mix( TAXA_CLASS.out.versions ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ( params.run_taxa_classification ) { | |
TAXA_CLASS ( ch_prepped_input ) | |
TAXA_CLASS ( ch_prepped_input.fastas ) | |
ch_versions = ch_versions.mix( TAXA_CLASS.out.versions ) | |
ch_intermediate_fasta_for_taxa = ch_intermediate_input.fastas.map{ meta, fasta, faa, gbk -> [ meta, fasta ] } | |
.mix(ch_intermediate_input.preannotated.map{ meta, fasta, faa, gbk -> [ meta, fasta ] }) | |
if ( params.run_taxa_classification ) { | |
TAXA_CLASS ( ch_intermediate_fasta_for_taxa ) | |
ch_versions = ch_versions.mix( TAXA_CLASS.out.versions ) |
workflows/funcscan.nf
Outdated
ch_versions = ch_versions.mix( SEQKIT_SEQ_LONG.out.versions ) | ||
ch_versions = ch_versions.mix( SEQKIT_SEQ_SHORT.out.versions ) | ||
|
||
ch_prepped_input_long = SEQKIT_SEQ_LONG.out.fastx | ||
ch_intermediate_input_long = SEQKIT_SEQ_LONG.out.fastx | ||
.map{ meta, file -> [ meta + [id: meta.id + '_long', length: "long" ], file ] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With regards to the summary and taxonomy :
-
ampcombi: it doesnt output all samples (but that will be fixed when i put in the new ampcombi submodules). Taxonomy is merged
-
combgc: it doesnt consider the samples_1 and 2 that are preannotated, only takes those from the annotatation step in the final report, which i dont understand why. Taxonomy is merged
-
hamronization: Works perfectly.
So either therees a problem with this collectfile() function fro both ampcombi/combgc or somethings up with the publishdir path, that every time the pipeline stops and resumes it rewrites the final report.
OK tests to run:
|
…hannels for CREATETSV
…g fastas taxonom results to BGC
Superseded by #381 🙃 |
PR checklist
nf-core lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).