Address RefSeq transcript misalignments #447

@holtgrewe

RefSeq transcripts can align to the reference sequence with indels and mismatches. Mismatches could be argued to be non-critical (assuming the GenBank entries that the RefSeq transcript is based on are from healthy individuals), but indels cannot.

For hg19, 884 transcripts in 501 genes are affected.

The following solution will be implemented:

  • The default_sources.ini file gets two new settings, "fixIndels" and "fixIndelsUcsc".
  • When parsing the RefSeq transcript database, the Note attribute is analyzed.
    If it contains the substring "indel" or "substitution", this is recorded in the built TranscriptModel.
  • When fixIndels=true is given, the user also has to provide the path to the reference sequence.
  • The file at fixIndelsUcsc is used for providing the UCSC transcript alignments.
    This will be used for the exon and CDS information.
    The sequence will be taken from the reference.
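
The Note check in the second bullet can be sketched as follows. This is only an illustration in Python, not Jannovar's Java code. It assumes the RefSeq annotation comes as GFF3-style `key=value;...` attributes with percent-encoded values. The function names and the example attribute string are hypothetical.

```python
from urllib.parse import unquote


def parse_attributes(column):
    """Split a GFF3-style attribute column into a dict (assumed input format)."""
    attrs = {}
    for field in column.strip().split(";"):
        key, sep, value = field.partition("=")
        if sep:
            # GFF3 percent-encodes reserved characters such as "," (%2C).
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def note_flags(column):
    """Report whether the Note attribute mentions an indel or a substitution."""
    note = parse_attributes(column).get("Note", "")
    return {
        "indel": "indel" in note,
        "substitution": "substitution" in note,
    }


# Illustrative attribute string, not copied from a real RefSeq record.
example = (
    "ID=rna1;Note=The RefSeq transcript has 1 substitution%2C "
    "1 non-frameshifting indel compared to this genomic sequence"
)
flags = note_flags(example)
```

The two flags would then be stored on the transcript record so that later steps can decide whether the UCSC alignment should replace the RefSeq exon and CDS information.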

NB: This will create an incompatibility between the databases built before and after Jannovar v0.29.

For each hg*/refseq* entry, a _fixindel variant is added that contains the fixed transcripts. This way, the fixed transcripts are strictly opt-in and only supplement the original transcripts, in which the indel is not fixed. Variants for both can be reported.
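
As a rough sketch of how such an opt-in entry might be read and validated, the Python snippet below parses an INI fragment and enforces the rule that fixIndels=true requires a reference sequence path. Only the keys `fixIndels` and `fixIndelsUcsc` come from this issue. The section name `hg19/refseq_fixindel`, the file paths, and the function name are assumptions made for illustration, and Jannovar itself is written in Java.

```python
import configparser

# Hypothetical fragment; the section name and path are placeholders.
INI_TEXT = """
[hg19/refseq_fixindel]
fixIndels = true
fixIndelsUcsc = /path/to/ucsc_alignments.txt.gz
"""


def read_fix_indel_settings(ini_text, section, reference_fasta=None):
    """Read the two settings and check that a reference path was supplied."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the camelCase keys as written
    parser.read_string(ini_text)
    fix_indels = parser.getboolean(section, "fixIndels", fallback=False)
    ucsc_path = parser.get(section, "fixIndelsUcsc", fallback=None)
    if fix_indels and not reference_fasta:
        raise ValueError("fixIndels=true requires the path to the reference sequence")
    return {
        "fix_indels": fix_indels,
        "ucsc_path": ucsc_path,
        "reference_fasta": reference_fasta,
    }


settings = read_fix_indel_settings(
    INI_TEXT, "hg19/refseq_fixindel", reference_fasta="/path/to/hg19.fa"
)
```

Calling the function without `reference_fasta` for this section raises `ValueError`, which mirrors the requirement stated in the list above.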