The Salmon gene count table (salmon.merged.gene_counts.tsv) is missing all entries from the gff that have only 'gene' and 'CDS' lines (most genes).
It includes only the entries with 'gene', 'ncRNA' (or other) AND 'exon' lines, e.g.
CP009273.1 Genbank gene 186199 186334 . + . ID=gene-BW25113_4414;Name=tff;gbkey=Gene;gene=tff;gene_biotype=ncRNA;gene_synonym=ECK0167,JWR0225,t44;locus_tag=BW25113_4414
CP009273.1 Genbank ncRNA 186199 186334 . + . ID=rna-BW25113_4414;Parent=gene-BW25113_4414;Note=identified in a large scale screen;gbkey=ncRNA;gene=tff;locus_tag=BW25113_4414;product=novel sRNA%2C function unknown
CP009273.1 Genbank exon 186199 186334 . + . ID=exon-BW25113_4414-1;Parent=rna-BW25113_4414;Note=identified in a large scale screen;gbkey=ncRNA;gene=tff;locus_tag=BW25113_4414;product=novel sRNA%2C function unknown
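To check how many genes are affected, one option is to split the genes in the gff into those with at least one 'exon' descendant and those without. The following is only a minimal sketch, assuming a tab-separated GFF3 where each feature has a single `Parent`; the function names and the toy records (`gene-A`, `gene-B`, etc.) are hypothetical and not taken from the real annotation.

```python
def parse_attributes(field):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def genes_by_exon_support(gff_lines):
    """Return (genes_with_exon, genes_without_exon) as sets of gene IDs.

    Assumption: every feature has at most one Parent (no comma-separated lists).
    """
    parent_of = {}     # feature ID -> parent feature ID
    gene_ids = set()   # IDs of all 'gene' features
    exon_parents = []  # Parent IDs of all 'exon' features
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        ftype, attrs = cols[2], parse_attributes(cols[8])
        fid, parent = attrs.get("ID"), attrs.get("Parent")
        if ftype == "gene" and fid:
            gene_ids.add(fid)
        if fid and parent:
            parent_of[fid] = parent
        if ftype == "exon" and parent:
            exon_parents.append(parent)

    with_exon = set()
    for node in exon_parents:
        seen = set()
        # Climb the Parent links until a gene is reached (or the chain ends).
        while node is not None and node not in seen:
            seen.add(node)
            if node in gene_ids:
                with_exon.add(node)
                break
            node = parent_of.get(node)
    return with_exon, gene_ids - with_exon


def _row(ftype, attrs):
    """Build one toy GFF3 line (hypothetical coordinates)."""
    return "\t".join(["chr", "src", ftype, "1", "100", ".", "+", ".", attrs])


# Toy input: gene-A has only a CDS child, gene-B has ncRNA + exon children.
EXAMPLE = [
    _row("gene", "ID=gene-A"),
    _row("CDS", "ID=cds-A;Parent=gene-A"),
    _row("gene", "ID=gene-B"),
    _row("ncRNA", "ID=rna-B;Parent=gene-B"),
    _row("exon", "ID=exon-B-1;Parent=rna-B"),
]

with_exon, without_exon = genes_by_exon_support(EXAMPLE)
print(len(with_exon), "genes with exon lines;", len(without_exon), "without")
```

On the real file, `genes_by_exon_support(open("E_coli_BW25113_annotation.gff"))` could be used the same way; if the guess about 'exon' lines is right, the first set should roughly match the genes that show up in the Salmon table.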
E_coli_BW25113_genome.filtered.gtf
The gtf produced by the pipeline looks like this:
Note all of these entries are missing in the Salmon table (possibly because of the missing 'exon' line?)
I also tried adding 'exon' lines to the gff manually, which resulted in an RSEM error for not finding 'gene_id', possibly due to a broken gffread conversion:
Command error:
INFO: Converting SIF file to temporary sandbox...
rsem-extract-reference-transcripts rsem/genome 0 E_coli_BW25113_genome.filtered.gtf None 0 rsem/E_coli_BW25113_genome.fasta
The GTF file might be corrupted!
Stop at line : CP009273.1 Genbank exon 190 255 . + . transcript_id "cds-AIN30539.1"; gene_name "thrL"; Dbxref "NCBI_GP:AIN30539.1"; Name "AIN30539.1"; gbkey "CDS"; gene "thrL"; locus_tag "BW25113_0001"; product "thr operon leader peptide"; protein_id "AIN30539.1"; transl_table "11";
Error Message: Cannot find gene_id!
"rsem-extract-reference-transcripts rsem/genome 0 E_coli_BW25113_genome.filtered.gtf None 0 rsem/E_coli_BW25113_genome.fasta" failed! Plase check if you provide correct parameters/options for the pipeline!
INFO: Cleaning up image...
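Since RSEM stops at the first record without a `gene_id`, it may help to scan the converted gtf up front and list every such record. This is a rough sketch, assuming the gtf is tab-separated with quoted attribute values; the helper name `lines_missing_gene_id` is made up for illustration, the first toy record reuses the attributes from the error message above, and the second record is hypothetical.

```python
import re

# Matches a gene_id attribute at the start of the column or after a ';'.
# Deliberately does not match gene_name, gene, etc.
GENE_ID = re.compile(r'(?:^|;\s*)gene_id\s+"[^"]+"')


def lines_missing_gene_id(gtf_lines):
    """Return (line_number, feature_type) for each record lacking gene_id."""
    missing = []
    for number, line in enumerate(gtf_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        if not GENE_ID.search(cols[8].strip()):
            missing.append((number, cols[2]))
    return missing


EXAMPLE = [
    # Record reported by RSEM: has transcript_id and gene_name, but no gene_id.
    "\t".join(["CP009273.1", "Genbank", "exon", "190", "255", ".", "+", ".",
               'transcript_id "cds-AIN30539.1"; gene_name "thrL"; gene "thrL";']),
    # Hypothetical well-formed record for comparison.
    "\t".join(["chr", "src", "exon", "1", "100", ".", "+", ".",
               'transcript_id "tx-1"; gene_id "gene-1";']),
]

problems = lines_missing_gene_id(EXAMPLE)
for number, ftype in problems:
    print(f"line {number}: {ftype} record has no gene_id")
```

Running this over E_coli_BW25113_genome.filtered.gtf would show whether only the manually added 'exon' records are affected or the conversion drops `gene_id` more broadly.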
Using a gff is highly unstable in the pipeline; maybe another gtf conversion tool is needed.
I also converted the gff to gtf with AGAT: agat_convert_sp_gff2gtf.pl --gff E_coli_BW25113_annotation.gff3 -o E_coli_BW25113_annotation_AGAT.gtf
and the pipeline ran successfully, generating counts for all genes.
It does unfortunately replace the gene names with 'agat-gene-1' etc (though maybe that can be fixed with --gtf_extra_attributes)
EDIT: adding --gtf_extra_attributes gene did indeed rescue correct gene names in the output
Any ideas?
System information
N E X T F L O W ~ version 24.10.4
Launching `https://github.com/nf-core/rnaseq` [suspicious_nobel] DSL2 - revision: b96a75361a [3.18.0]
------------------------------------------------------
nf-core/rnaseq 3.18.0
------------------------------------------------------
Input/output options
input : samplesheet.csv
outdir : results_rnaseq
Reference genome options
genome : null
fasta : E_coli_BW25113_genome.fasta
gff : E_coli_BW25113_annotation.gff
igenomes_ignore: true
Process skipping options
skip_rseqc : true
skip_biotype_qc: true
Core Nextflow options
revision : 3.18.0
runName : suspicious_nobel
containerEngine: singularity
...
profile : singularity
configFiles :
Description of the bug
When using a gff3 annotation with the pipeline, the output table is missing most of the genes.
I suspect the cause is the gff to gtf conversion with gffread, which possibly uses only entries with 'exon' lines.
Command used and terminal output
Relevant files
E_coli_BW25113_annotation.gff
contains 4k+ genes
tx2gene.tsv
has 4491 lines
salmon.merged.gene_counts.tsv
has only 178 lines
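The gap between the two tables can be made concrete by listing the gene IDs that appear in tx2gene.tsv but not in the merged counts. The sketch below is an illustration only. It assumes tx2gene.tsv is headerless with the gene ID in the second column and that the counts table has a header row with the gene ID in the first column; these layouts should be checked against the actual files. The function name and toy data are hypothetical.

```python
import csv
import io


def missing_genes(tx2gene_handle, counts_handle):
    """Return sorted gene IDs present in tx2gene but absent from the counts."""
    # Assumed layout: transcript_id <TAB> gene_id <TAB> gene_name, no header.
    annotated = {
        row[1]
        for row in csv.reader(tx2gene_handle, delimiter="\t")
        if len(row) >= 2
    }
    # Assumed layout: header row, then gene_id in the first column.
    reader = csv.reader(counts_handle, delimiter="\t")
    next(reader, None)  # skip the header row
    quantified = {row[0] for row in reader if row}
    return sorted(annotated - quantified)


# Toy data standing in for the real files (hypothetical IDs and counts).
tx2gene = io.StringIO("tx-A\tgene-A\tgeneA\ntx-B\tgene-B\tgeneB\n")
counts = io.StringIO("gene_id\tgene_name\tsample1\ngene-B\tgeneB\t12\n")

dropped = missing_genes(tx2gene, counts)
print(dropped)

# With the real files:
# with open("tx2gene.tsv") as t, open("salmon.merged.gene_counts.tsv") as c:
#     print(len(missing_genes(t, c)), "genes missing from the Salmon table")
```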