-
Notifications
You must be signed in to change notification settings - Fork 12
/
Copy pathCODEX2.Rmd
216 lines (171 loc) · 12.2 KB
/
CODEX2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
title: "CODEX2: Full-spectrum copy number variation detection by high-throughput DNA sequencing"
author: "Yuchao Jiang"
date: "`r format(Sys.Date())`"
abstract: >
High-throughput DNA sequencing enables detection of copy number variations (CNVs) on the genome-wide scale with finer resolution compared to array-based methods, but suffers from biases and artifacts that lead to false discoveries and low sensitivity. We describe CODEX2, a statistical framework for full-spectrum CNV profiling that is sensitive for variants with both common and rare population frequencies and that is applicable to study designs with and without negative control samples. We demonstrate and evaluate CODEX2 on whole-exome and targeted sequencing data, where biases are the most prominent. CODEX2 outperforms existing methods and, in particular, significantly improves sensitivity for common CNVs.
output:
rmarkdown::html_document:
highlight: pygments
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Analysis overview
The figure below illustrates the two experimental designs for which CODEX2 can be applied: (i) case-control design with a group of negative control samples, where the goal is to detect CNVs disproportionately present in the "cases" versus the "controls"; and (ii) detection of all CNVs present in all samples design, such as in the Exome Aggregation Consortium. The key innovation in CODEX2 is the usage of negative control genome regions in a genome-wide latent factor model for sample- and position-specific background correction, and the utilization of negative control samples, under a case-control design, to further improve background bias estimation under this model. The negative control genome regions defined by CODEX2 are regions that do not harbor common CNVs, but that are still allowed to harbor rare CNVs, and can be constructed from existing studies or learned from data.
```{r, out.width = "600px", fig.align = "center", echo=FALSE}
knitr::include_graphics("https://raw.githubusercontent.com/yuchaojiang/CODEX2/master/demo/Figure1.png")
```
# Installation
```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19", version = "3.8")
BiocManager::install("BSgenome.Hsapiens.UCSC.hg38", version = "3.8")
BiocManager::install("CODEX", version = "3.8")
BiocManager::install("WES.1KG.WUGSC", version = "3.8")
devtools::install_github("yuchaojiang/CODEX2/package")
```
# Pre-computation and quality control
## Pre-processing
This step is to get directories of .bam files, read in exon target positions from .bed files, and get sample names. The direct input of CODEX2 include: *bamdir*, which is a vector indicating the directories of all .bam files; *sampname*, which is a column vector with row entries of sample names; *bedFile*, which indicates the directory of the .bed file (WES target file, no header, sorted by start and end positions with non-overlapping targets); and *chr*, which specifies the chromosome. CODEX2 processes the entire genome chromosome by chromosome; make sure the chromosome formats are consistent between the .bed and the .bam files.
```{r, message=FALSE}
library(CODEX2)
library(WES.1KG.WUGSC) # Load Toy data from the 1000 Genomes Project.
dirPath <- system.file("extdata", package = "WES.1KG.WUGSC")
bamFile <- list.files(dirPath, pattern = '*.bam$')
bamdir <- file.path(dirPath, bamFile)
sampname <- substr(bamFile,1,7)
bedFile <- file.path(dirPath, "chr22_400_to_500.bed")
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile,
sampname = sampname, projectname = "CODEX2_demo")
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname
```
## Getting GC content and mappability
Obtain GC content and mappability for each exon/target/window. Mappability is calculated from the hg19 mappability from ENCODE ([link](http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability)). Mappability for hg38 is lifted over from hg19 ([link](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/)).
```{r, message=FALSE}
genome = BSgenome.Hsapiens.UCSC.hg19 # hg19
# library(BSgenome.Hsapiens.UCSC.hg38); genome = BSgenome.Hsapiens.UCSC.hg38 # hg38
gc <- getgc(ref, genome = genome)
mapp <- getmapp(ref, genome = genome)
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))
```
## Getting raw read depth
Read depth matrix, as well as read lengths across all samples, will be returned. This will need to be generated for each chromosome.
```{r, message=FALSE}
coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y
write.csv(Y, file = paste(projectname, '_coverage.csv', sep=''), quote = FALSE)
head(Y[,1:5])
```
## Quality control
Take a sample-wise and exon-wise quality control procedure on the depth of coverage matrix.
```{r, message = FALSE}
qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
length_thresh = c(20, 2000), mapp_thresh = 0.9,
gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
sep = '\t', quote = FALSE, row.names = FALSE)
```
# Running CODEX2
For demonstration purpose, we in silico spiked in CNVs spanning exon 1580 - 1620 with a population frequency 40%. There are altogether 90 samples, 36 of which have the heterozygous deletion. The toy dataset is stored as part of the CODEX2 R-package.
```{r}
# Load pre-stored data. This would not be needed if you are processing your own data and start from the above sections.
Y_qc <- Y_qc_demo
colnames(Y_qc) <- paste('sample_', 1:ncol(Y_qc), sep='')
ref_qc <- ref_qc_demo
gc_qc <- ref_qc$gc
```
Estimate library size factor based on genome-wide read depth after QC. It is important, especially in cancer genomics, to calculate library size factor using genome-wide data, as chromosomal read depth can be attentuated by large copy number aberrations.
```{r, message=FALSE}
Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)
```
## Running CODEX2 without specifying negative control samples/regions
Y_qc and gc_qc can be obtained from the sequencing bam files using the code in the previous section.
```{r, message=FALSE}
chr <- 20 # This can be run for one chromosome or multiple chromosomes
chr.index <- which(seqnames(ref_qc)==chr)
normObj.null <- normalize_null(Y_qc = Y_qc[chr.index,],
gc_qc = gc_qc[chr.index],
K = 1:5, N = N)
Yhat.null <- normObj.null$Yhat
AIC.null <- normObj.null$AIC; BIC.null <- normObj.null$BIC
RSS.null <- normObj.null$RSS
```
Choose the number of latent Poisson factors. BIC is used as the model selection metric by default.
```{r, eval=FALSE}
choiceofK(AIC.null, BIC.null, RSS.null, K = 1:5 , filename = "codex2_null_choiceofK.pdf")
```
```{r, echo=FALSE, fig1, fig.height = 2.5, fig.width = 6, fig.align = "center"}
par(mfrow = c(1, 3))
plot(1:5, RSS.null, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, AIC.null, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, BIC.null, type = "b", xlab = "Number of latent variables", pch=20)
par(mfrow = c(1,1))
```
## Running CODEX2 with negative control samples
For the case-control scenario, the normal sample index is known (samples without spike-in signals).
```{r, message=FALSE}
# Below are pre-computed demo dataset, stored as part of the CODEX2 R-package.
normObj <- normalize_codex2_ns(Y_qc = Y_qc[chr.index,],
gc_qc = gc_qc[chr.index],
K = 1:5, norm_index = norm_index_demo,
N = N)
Yhat.ns <- normObj$Yhat; fGC.hat.ns <- normObj$fGC.hat;
beta.hat.ns <- normObj$beta.hat; g.hat.ns <- normObj$g.hat; h.hat.ns <- normObj$h.hat
AIC.ns <- normObj$AIC; BIC.ns <- normObj$BIC; RSS.ns <- normObj$RSS
```
Choose the number of latent Poisson factors. BIC is used as the model selection metric by default.
```{r, eval=FALSE}
choiceofK(AIC.ns, BIC.ns, RSS.ns, K = 1:5 , filename = "codex2_ns_choiceofK.pdf")
```
```{r, echo=FALSE, fig2, fig.height = 2.5, fig.width = 6, fig.align = "center"}
par(mfrow = c(1, 3))
plot(1:5, RSS.ns, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, AIC.ns, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, BIC.ns, type = "b", xlab = "Number of latent variables", pch=20)
par(mfrow = c(1,1))
```
## Running CODEX2 with negative control regions
We can empirically identify common CNV regions by a first-pass CODEX run: For exons residing in common CNV regions, the s.d. of normalized z-scores (using normalize_null) across all samples will be large. This can also be provided by the user as known, e.g., from existing database (DGV or dbVar) or knowledge (tumor supressors or oncogenes with recurrent CNA changes).
```{r, message= FALSE, eval=FALSE}
cnv_index <- 1580:1620
normObj <- normalize_codex2_nr(Y_qc = Y_qc[chr.index,], gc_qc = gc_qc[chr.index],
K = 1:5, cnv_index = cnv_index,
N = N)
Yhat.nr <- normObj$Yhat; fGC.hat.nr <- normObj$fGC.hat;
beta.hat.nr <- normObj$beta.hat; g.hat.nr <- normObj$g.hat; h.hat.nr <- normObj$h.hat
AIC.nr <- normObj$AIC; BIC.nr <- normObj$BIC; RSS.nr <- normObj$RSS
```
# Running segmentation by CODEX2
We offer two versions of segmentation procedures: poisson-likelihood based recursive segmentation (recommended) and hidden Markov model (not recommended). For CODEX2 with negative control regions, simply change Yhat.ns to Yhat.nr in the code below. For germline CNV detection, use 'integer' mode; for CNV detection in heterogeneous sample/tissue (e.g., somatic copy number changes in bulk cancer samples), use 'fraction' mode.
The output file is tab delimited and has 13 columns with rows corresponding to CNV events. The columns include sample_name (sample names), chr (chromosome), cnv (deletion or duplication), st_bp (cnv start position in base pair, the start position of the first exon in the cnv), ed_bp (cnv end position in base pair, the end position of the last exon in the cnv), length_kb (CNV length in kb), st_exon (the first exon after QC in the cnv, integer value numbered in qcObj\$ref_qc), ed_exon (the last exon after QC in the cnv, integer value numbered in qcObj\$ref_qc), raw_cov (raw coverage), norm_cov (normalized coverage), copy_no (copy number estimate), lratio (likelihood ratio of CNV event versus copy neutral event), mBIC (modified BIC value, used to determine the stop point of segmentation).
```{r, message=FALSE}
finalcall.CBS <- segmentCBS(Y_qc[chr.index,], # recommended
Yhat.ns, optK = which.max(BIC.ns),
K = 1:5,
sampname_qc = colnames(Y_qc),
ref_qc = ranges(ref_qc)[chr.index],
chr = chr, lmax = 400, mode = "integer")
finalcall.HMM <- segmentHMM(Y_qc[chr.index,], # not recommended
Yhat.ns, optK = which.max(BIC.ns),
K = 1:5,
sampname_qc = colnames(Y_qc),
ref_qc = ranges(ref_qc)[chr.index],
chr = chr, mode = "integer")
```
Post-segmentation pruning and filtering are recommended based on CNV length (filter1), length per exon (filter2), likelihood ratio (filter3), and number of exons (filter4).
```{r, message = FALSE}
filter1 <- finalcall.CBS$length_kb<=200
filter2 <- finalcall.CBS$length_kb/(finalcall.CBS$ed_exon-finalcall.CBS$st_exon+1)<50
finalcall.CBS.filter <- finalcall.CBS[filter1 & filter2, ]
filter3 <- finalcall.CBS.filter$lratio>40
filter4 <- (finalcall.CBS.filter$ed_exon-finalcall.CBS.filter$st_exon)>1
finalcall.CBS.filter=finalcall.CBS.filter[filter3|filter4,]
```