Description
Good afternoon,
I am assembling fish genomes de novo from HiFi data and have run into an issue with a few of my target species (all diploid).
First, to better understand the size and heterozygosity of each genome and to confirm our estimates of sequence coverage, I ran meryl (default settings for 'count' and 'histogram', k = 21) and GenomeScope 2 (default settings, k = 21).
The model fit reported in the GenomeScope 2 summary was not too bad (~73-89%, see below), but in the plot the observed k-mer frequencies (blue line) for the 'heterozygote' peak do not match the distribution estimated by the full model (black line). Basically, the observed peak spans a much wider coverage range than the full-model peak.
I am wondering what may be driving this difference between the observed profile and the full model (e.g., sequencing errors?) and whether it is a cause for concern (e.g., a data issue that needs to be addressed prior to assembly). Should I adjust some of the GenomeScope 2 parameters?
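One way to look at the peak mismatch independently of the plot is to locate the peaks directly in the k-mer histogram. The following is only a rough sketch: it assumes the histogram is a plain text file with two whitespace-separated columns (k-mer multiplicity, number of distinct k-mers), and the function names and the `min_multiplicity` cutoff are hypothetical choices for illustration, not part of meryl or GenomeScope. A real histogram is noisy and would likely need smoothing before this kind of peak search is reliable.

```python
def parse_histogram(text):
    """Parse 'multiplicity count' lines into a list of (int, int) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            rows.append((int(parts[0]), int(parts[1])))
    return rows


def find_peaks(rows, min_multiplicity=5):
    """Return multiplicities that are local maxima of the histogram.

    Bins below min_multiplicity are skipped (assumption: they are
    dominated by k-mers that come from sequencing errors).
    """
    peaks = []
    for i in range(1, len(rows) - 1):
        multiplicity, count = rows[i]
        if multiplicity < min_multiplicity:
            continue
        if count > rows[i - 1][1] and count >= rows[i + 1][1]:
            peaks.append(multiplicity)
    return peaks


def toy_histogram():
    """Build a synthetic histogram: an error tail plus two triangular peaks.

    This is made-up data for illustration only, not output from any run.
    """
    def bump(m, center, height, width):
        return max(0, height - abs(m - center) * height // width)

    rows = []
    for m in range(1, 61):
        errors = 5000 // (m * m)
        rows.append((m, errors + bump(m, 20, 300, 8) + bump(m, 40, 600, 8)))
    return rows


if __name__ == "__main__":
    print(find_peaks(toy_histogram()))
```

For a diploid genome, the homozygous peak is expected at roughly twice the multiplicity of the heterozygous peak, so comparing the two positions found this way against the coverage GenomeScope reports may show whether the model locked onto the right peaks.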
I am very new to genome assembly and would appreciate any advice you (or anyone else) might have.
Thanks,
Andrea
GenomeScope version 2
p = 2
k = 21
property; min; max
Homozygous (aa); 98.04%; 98.10%
Heterozygous (ab); 1.90%; 1.96%
Genome Haploid Length; 377413934 bp; 379528391 bp
Genome Repeat Length; 61537310 bp; 61882072 bp
Genome Unique Length; 315876624 bp; 317646318 bp
Model Fit; 73.1021%; 88.551%
Read Error Rate; 0.460545%; 0.460545%
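As a quick sanity check on the summary above, the reported lengths can be cross-checked with a few lines of arithmetic. This is just a sketch using the "min" column values copied from the table; the variable names are my own.

```python
# "min" column values copied from the GenomeScope summary above (in bp)
haploid_length = 377_413_934
repeat_length = 61_537_310
unique_length = 315_876_624

# The repeat and unique lengths should add up to the haploid length
lengths_consistent = (repeat_length + unique_length == haploid_length)

# Fraction of the estimated haploid genome that is modelled as repetitive
repeat_fraction = repeat_length / haploid_length

if __name__ == "__main__":
    print(f"lengths consistent: {lengths_consistent}")
    print(f"repeat fraction: {repeat_fraction:.3f}")
```

By this estimate, roughly 16% of the haploid genome is modelled as repetitive.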