Skip to content

The SNV statistics does not match with bcftools stats #1087

Open
@Rohit-Satyam

Description

@Rohit-Satyam

I see that maftools is inaccurately estimating the number of T>C and C>T and other statistics in SNV Class plot. I am also attaching the VCF file and MAF file for your reference.

Command
Please post your commands and the output (errors or any unexpected output)
BCFtools output:

bcftools stats temp-snpeff.vcf 
[W::bcf_hrec_check] Invalid tag name: "B)QB"
# This file was produced by bcftools stats (1.17+htslib-1.17) and can be plotted using plot-vcfstats.
# The command line was:	bcftools stats  temp-snpeff.vcf
#
# Definition of sets:
# ID	[2]id	[3]tab-separated file names
ID	0	temp-snpeff.vcf
# SN, Summary numbers:
#   number of records   .. number of data rows in the VCF
#   number of no-ALTs   .. reference-only sites, ALT is either "." or identical to REF
#   number of SNPs      .. number of rows with a SNP
#   number of MNPs      .. number of rows with a MNP, such as CC>TT
#   number of indels    .. number of rows with an indel
#   number of others    .. number of rows with other type, for example a symbolic allele or
#                          a complex substitution, such as ACT>TCGA
#   number of multiallelic sites     .. number of rows with multiple alternate alleles
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
# 
#   Note that rows containing multiple types will be counted multiple times, in each
#   counter. For example, a row with a SNP and an indel increments both the SNP and
#   the indel counter.
# 
# SN	[2]id	[3]key	[4]value
SN	0	number of samples:	1
SN	0	number of records:	736
SN	0	number of no-ALTs:	0
SN	0	number of SNPs:	690
SN	0	number of MNPs:	46
SN	0	number of indels:	0
SN	0	number of others:	0
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
# TSTV, transitions/transversions:
# TSTV	[2]id	[3]ts	[4]tv	[5]ts/tv	[6]ts (1st ALT)	[7]tv (1st ALT)	[8]ts/tv (1st ALT)
TSTV	0	610	80	7.62	610	80	7.62
# SiS, Singleton stats:
# SiS	[2]id	[3]allele count	[4]number of SNPs	[5]number of transitions	[6]number of transversions	[7]number of indels	[8]repeat-consistent	[9]repeat-inconsistent	[10]not applicable
SiS	0	1	690	610	80	0	0	0	0
# AF, Stats by non-reference allele frequency:
# AF	[2]id	[3]allele frequency	[4]number of SNPs	[5]number of transitions	[6]number of transversions	[7]number of indels	[8]repeat-consistent	[9]repeat-inconsistent	[10]not applicable
AF	0	0.000000	690	610	80	0	0	0	0
# QUAL, Stats by quality
# QUAL	[2]id	[3]Quality	[4]number of SNPs	[5]number of transitions (1st ALT)	[6]number of transversions (1st ALT)	[7]number of indels
QUAL	0	0.0	690	610	80	0
# IDD, InDel distribution:
# IDD	[2]id	[3]length (deletions negative)	[4]number of sites	[5]number of genotypes	[6]mean VAF
# ST, Substitution types:
# ST	[2]id	[3]type	[4]count
ST	0	A>C	12
ST	0	A>G	132
ST	0	A>T	14
ST	0	C>A	14
ST	0	C>G	7
ST	0	C>T	170
ST	0	G>A	135
ST	0	G>C	4
ST	0	G>T	8
ST	0	T>A	17
ST	0	T>C	173
ST	0	T>G	4
# DP, Depth distribution
# DP	[2]id	[3]bin	[4]number of genotypes	[5]fraction of genotypes (%)	[6]number of sites	[7]fraction of sites (%)

But the maftools plotmafSummary function says that T>C and C>T mutations are 305 in number but they are not (I also subset MAF file manually and check the number of rows). Now of variants in the samples are also reported less.

library(maftools)
mafs = list.files(path = ".", pattern = "*.maf$", full.names = TRUE)
maf_dat = lapply(mafs, data.table::fread, sep = "\t", skip = "Hugo_Symbol")
names(maf_dat) = unlist(data.table::tstrsplit(basename(path = mafs), spli = "\\.", keep = 1))
maf = data.table::rbindlist(l = maf_dat, use.names = TRUE, fill = TRUE, idcol = "maf_file")
maf = maftools::read.maf(maf, removeDuplicatedVariants = FALSE)
plotmafSummary(maf = maf, rmOutlier = FALSE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

Session info
Run sessionInfo() and post the output below

> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Asia/Riyadh
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] maftools_2.24.0 dplyr_1.1.4     readr_2.1.5    

loaded via a namespace (and not attached):
 [1] tidyr_1.3.1         generics_0.1.3      lattice_0.22-5      stringi_1.8.7       hms_1.1.3           digest_0.6.37      
 [7] magrittr_2.0.3      grid_4.5.0          RColorBrewer_1.1-3  pkgload_1.4.0       fastmap_1.2.0       Matrix_1.7-3       
[13] plyr_1.8.9          cellranger_1.1.0    processx_3.8.6      pkgbuild_1.4.7      sessioninfo_1.2.3   survival_3.8-3     
[19] urlchecker_1.0.1    ps_1.9.1            promises_1.3.2      BiocManager_1.30.25 purrr_1.0.4         cli_3.6.4          
[25] shiny_1.10.0        rlang_1.1.6         crayon_1.5.3        splines_4.5.0       ellipsis_0.3.2      bit64_4.6.0-1      
[31] remotes_2.5.0       withr_3.0.2         cachem_1.1.0        DNAcopy_1.82.0      devtools_2.4.5      tools_4.5.0        
[37] parallel_4.5.0      tzdb_0.5.0          memoise_2.0.1       httpuv_1.6.16       BiocGenerics_0.54.0 curl_6.2.2         
[43] vctrs_0.6.5         R6_2.6.1            mime_0.13           stats4_4.5.0        lifecycle_1.0.4     stringr_1.5.1      
[49] S4Vectors_0.46.0    fs_1.6.6            htmlwidgets_1.6.4   bit_4.6.0           usethis_3.1.0       vroom_1.6.5        
[55] miniUI_0.1.2        pkgconfig_2.0.3     desc_1.4.3          callr_3.7.6         pillar_1.10.2       later_1.4.2        
[61] data.table_1.17.0   glue_1.8.0          profvis_0.4.0       Rcpp_1.0.14         tibble_3.2.1        tidyselect_1.2.1   
[67] rstudioapi_0.17.1   xtable_1.8-4        htmltools_0.5.8.1   mgsub_1.7.3         compiler_4.5.0      readxl_1.4.5 

temp.zip

Image

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions