understanding cause of poor model fit of heterozygote peak #144

Open
@andrbern8000

Good afternoon,

I am assembling fish genomes de novo from HiFi data and have run into issues with a few of my target species (all diploid).
First, to better understand the size and heterozygosity of the genome and to confirm our estimates of sequence coverage, I ran meryl (default settings for 'count' and 'histogram', k = 21) and GenomeScope2 (default settings, k = 21).

The model fit reported in the GenomeScope2 summary was not too bad (~73-89%; see below), but in the plot the observed k-mer frequencies (blue line) for the 'heterozygote' peak do not match the distribution estimated by the full model (black line). Specifically, the observed peak spans a much wider coverage range than the full-model peak.

I am wondering what may be driving this difference between the observed profile and the full model (e.g., sequencing errors?) and whether it is a cause for concern (e.g., a data issue that needs to be addressed prior to assembly). Should I adjust some of the GenomeScope2 parameters?

I am very new to genome assembly and would appreciate any advice you (or anyone else) might have.

Thanks,
Andrea

GenomeScope version 2
p = 2
k = 21

property; min; max
Homozygous (aa); 98.04%; 98.10%
Heterozygous (ab); 1.90%; 1.96%
Genome Haploid Length; 377413934 bp; 379528391 bp
Genome Repeat Length; 61537310 bp; 61882072 bp
Genome Unique Length; 315876624 bp; 317646318 bp
Model Fit; 73.1021%; 88.551%
Read Error Rate; 0.460545%; 0.460545%
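
As a quick sanity check on the summary above, here is a minimal Python sketch that parses the semicolon-delimited lines and confirms that the repeat and unique lengths add up to the haploid length, then reports the repeat fraction. The `parse_summary` helper and the variable names are illustrative assumptions and are not part of GenomeScope2; the numbers are copied verbatim from the output above.

```python
# Summary lines copied from the GenomeScope2 output above (semicolon-delimited).
SUMMARY = """\
Homozygous (aa); 98.04%; 98.10%
Heterozygous (ab); 1.90%; 1.96%
Genome Haploid Length; 377413934 bp; 379528391 bp
Genome Repeat Length; 61537310 bp; 61882072 bp
Genome Unique Length; 315876624 bp; 317646318 bp
Model Fit; 73.1021%; 88.551%
Read Error Rate; 0.460545%; 0.460545%
"""


def parse_summary(text):
    """Return {property: (min, max)} as floats, with '%' and ' bp' stripped.

    Hypothetical helper written for this post; not a GenomeScope2 function.
    """
    rows = {}
    for line in text.strip().splitlines():
        name, lo, hi = (field.strip() for field in line.split(";"))
        rows[name] = tuple(
            float(value.rstrip("%").replace(" bp", "")) for value in (lo, hi)
        )
    return rows


rows = parse_summary(SUMMARY)
haploid = rows["Genome Haploid Length"]
repeat = rows["Genome Repeat Length"]
unique = rows["Genome Unique Length"]

# Compare the min and max columns separately.
for label, h, r, u in zip(("min", "max"), haploid, repeat, unique):
    gap = h - (r + u)
    print(
        f"{label}: haploid - (repeat + unique) = {gap:.0f} bp; "
        f"repeat fraction = {r / h:.1%}"
    )
```

The two lengths sum to the haploid length to within 1 bp in both columns, and roughly 16% of the estimated genome is classed as repeat, so the summary table is at least internally consistent; the mismatch is only visible in the plot.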

[Figure: GenomeScope2 k-mer profile plot, cc_meryl_genomescope2_k21]
