Skip to content

Commit 140334c

Browse files
committed
Spec change: never quote-encode cells whose genotypes aren't reference-identical or non-called
This ensures all non-reference genotypes on a row can be read without reference to any previous row, a significant convenience in exchange for a negligible size increase.
1 parent 4b9e325 commit 140334c

File tree

3 files changed

+44
-12
lines changed

3 files changed

+44
-12
lines changed

doc/SPEC.md

+5-3
Original file line numberDiff line numberDiff line change
@@ -10,15 +10,17 @@ Project VCF (pVCF; aka multi-sample VCF) is the prevailing file format for small
1010

1111
spVCF adopts from pVCF the tab-delimited text format with header, and the first nine columns providing all variant-level details. The sparse encoding concerns the genotype matrix `V[i,j]`, *i* indexing variant sites and *j* indexing the *N* samples, written across tab-delimited columns ten through 9+*N* of the pVCF text file. Each cell `V[i,j]` is a colon-delimited text string including the genotype and various QC measures (DP, AD, PL, etc.).
1212

13-
In the spVCF encoding, cells are first replaced with a double-quotation mark `"` if they're identical to the cell *above*:
13+
In the spVCF encoding, cells are first replaced with a double-quotation mark `"` if they're (i) identical to the cell *above*, and (ii) their GT field is reference-identical or non-called:
1414

1515
```
16-
S[i,j] := " if i>0 and V[i,j] == V[i-1,j],
16+
S[i,j] := " if i>0 and V[i,j] == V[i-1,j] and V[i,j]["GT"] in ["0/0","0|0","./.",".|."],
1717
V[i,j] otherwise.
1818
```
1919

2020
Here 'identical' covers all QC measures exactly. Such exact repetition is common in pVCF produced using tools like [GATK GenotypeGVCFs](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php) and [GLnexus](https://github.com/dnanexus-rnd/GLnexus), which merge gVCF or similar files summarizing reference coverage in lengthy bands.
2121

22+
For clarity, the list of "quotable" GTs enumerated above shows diploid genotypes only. In general, quotable GTs are those whose constituent allele calls are either all reference (0), or all non-called (.).
23+
2224
Second, within each row of `S`, consecutive runs of quotation marks are abbreviated with a text integer, so for example a horizontal run of 42 quotes is written `"42` and tab-delimited from adjacent cells. The result is a ragged, tab-delimited matrix.
2325

2426
**Worked example**
@@ -59,7 +61,7 @@ With checkpoints, it's possible to reuse the familiar `bgzip` and `tabix` utilit
5961

6062
Lastly, spVCF suggests the following convention to remove typically-unneeded detail from the matrix, and increase the compressibility of what remains, prior to the sparse encoding. In any cell with QC measures indicating zero non-reference reads (typically `AD=d,0` for some *d*, but this depends on how the pVCF-generating pipeline expresses non-reference read depth), report only `GT` and `DP` and omit any other fields. Also, round `DP` down to a power of two (0, 1, 2, 4, 8, 16, ...).
6163

62-
This "squeezing" requires the encoder to reorder the colon-delimited fields in each cell so that `GT` and `DP` precede any other fields. Then it's valid for a subset of cells to omit remaining fields completely, as permitted by VCF. The FORMAT specification in column 9 of each line must reflect this reordering.
64+
This "squeezing" requires the encoder to reorder the colon-delimited fields in each cell so that `GT` and `DP` precede any other fields. Then it's valid for a subset of cells to omit remaining fields completely, as permitted by VCF. The FORMAT specification in column 9 of each line must reflect this reordering. Notice that not all reference-identical genotype calls are necessarily squeezed, namely if the QC data indicate even one non-reference read.
6365

6466
The optional squeezing transformation can be applied to any pVCF, usually to great benefit, whether or not the spVCF sparse encoding is also used.
6567

src/spVCF.cc

+35-5
Original file line numberDiff line numberDiff line change
@@ -158,6 +158,7 @@ class EncoderImpl : public TranscoderBase {
158158
const char* ProcessLine(char* input_line) override;
159159

160160
private:
161+
bool unquotableGT(const char* entry);
161162
void Squeeze(const vector<char*>& line);
162163

163164
uint64_t checkpoint_period_ = 0;
@@ -172,6 +173,7 @@ class EncoderImpl : public TranscoderBase {
172173
vector<string> roundDP_table_;
173174
};
174175

176+
#include <iostream>
175177
const char* EncoderImpl::ProcessLine(char* input_line) {
176178
++line_number_;
177179
// Pass through header lines
@@ -187,8 +189,11 @@ const char* EncoderImpl::ProcessLine(char* input_line) {
187189
if (tokens.size() < 10) {
188190
fail("Invalid: fewer than 10 columns");
189191
}
190-
uint64_t N = tokens.size() - 9;
192+
if (strncmp(tokens[8], "GT:", 3) && strcmp(tokens[8], "GT")) {
193+
fail("cells don't start with genotype (GT)");
194+
}
191195

196+
uint64_t N = tokens.size() - 9;
192197
if (dense_entries_.empty()){ // First line: allocate the dense entries
193198
dense_entries_.resize(N);
194199
stats_.N = N;
@@ -248,7 +253,7 @@ const char* EncoderImpl::ProcessLine(char* input_line) {
248253
if (*t == '"') {
249254
fail("Input seems to be sparse-encoded already");
250255
}
251-
if (m.empty() || strcmp(m.c_str(), t) != 0) {
256+
if (m.empty() || strcmp(m.c_str(), t) != 0 || unquotableGT(t)) {
252257
// Entry doesn't match the last one recorded densely for this
253258
// column. Output any accumulated run of quotes in the current row,
254259
// then this new entry, and update the state appropriately.
@@ -323,6 +328,33 @@ const char* EncoderImpl::ProcessLine(char* input_line) {
323328
return buffer_.Get();
324329
}
325330

331+
// Determine if the entry's GT makes it "unquotable", meaning the called
332+
// allele(s) don't consist of all 0 or all .
333+
// A half-call like ./0 is considered unquotable.
334+
// ASSUMES GT is the first FORMAT field (as required by VCF)
335+
bool EncoderImpl::unquotableGT(const char* entry) {
336+
bool zero = false, dot = false;
337+
if (*entry == 0 || *entry == ':') {
338+
fail("missing GT entry");
339+
}
340+
for (; *entry && *entry != ':'; entry++) {
341+
switch (*entry) {
342+
case '0':
343+
zero = true;
344+
break;
345+
case '.':
346+
dot = true;
347+
break;
348+
case '/':
349+
case '|':
350+
break;
351+
default:
352+
return true;
353+
}
354+
}
355+
return (zero == dot);
356+
}
357+
326358
// Truncate cells to GT:DP, and round DP down to a power of two, if
327359
// - AD is present and indicates zero read depth for alternate alleles;
328360
// - VR is present and zero
@@ -347,9 +379,7 @@ void EncoderImpl::Squeeze(const vector<char*>& line) {
347379
size_t formatsz = split(line[8], ':', back_inserter(format));
348380

349381
// locate fields of interest
350-
if (format[0] != "GT") {
351-
fail("cells don't start with genotype (GT)");
352-
}
382+
assert(format[0] == "GT");
353383
int iDP = -1;
354384
auto pDP = find(format.begin(), format.end(), "DP");
355385
if (pDP != format.end()) {

test/spVCF.t

+4-4
Original file line numberDiff line numberDiff line change
@@ -15,11 +15,11 @@ plan tests 25
1515
pigz -dc "$HERE/data/small.vcf.gz" > $D/small.vcf
1616
"$EXE" encode -o $D/small.spvcf $D/small.vcf
1717
is "$?" "0" "filename I/O"
18-
is "$(cat $D/small.spvcf | wc -c)" "36976928" "filename I/O output size"
18+
is "$(cat $D/small.spvcf | wc -c)" "37095284" "filename I/O output size"
1919

2020
pigz -dc "$HERE/data/small.vcf.gz" | "$EXE" encode -q > $D/small.spvcf
2121
is "$?" "0" "piped I/O"
22-
is "$(cat $D/small.spvcf | wc -c)" "36976928" "piped I/O output size"
22+
is "$(cat $D/small.spvcf | wc -c)" "37095284" "piped I/O output size"
2323

2424
"$EXE" decode -o $D/small.roundtrip.vcf $D/small.spvcf
2525
is "$?" "0" "decode"
@@ -35,7 +35,7 @@ is "$(egrep -o "spVCF_checkpointPOS=[0-9]+" $D/small.spvcf | uniq | cut -f2 -d =
3535
3636
"$EXE" encode -S -p 500 -o $D/small.squeezed.spvcf $D/small.vcf
3737
is "$?" "0" "squeeze"
38-
is "$(cat $D/small.squeezed.spvcf | wc -c)" "17447427" "squeezed output size"
38+
is "$(cat $D/small.squeezed.spvcf | wc -c)" "17553478" "squeezed output size"
3939
4040
"$EXE" decode -q -o $D/small.squeezed.roundtrip.vcf $D/small.squeezed.spvcf
4141
is "$?" "0" "squeezed roundtrip decode"
@@ -79,7 +79,7 @@ is "$(cat $D/small.squeezed.slice_chr21.spvcf | sha256sum)" \
7979
8080
pigz -dc "$HERE/data/small.vcf.gz" | "$EXE" encode -t $(nproc) - > $D/small.mt.spvcf
8181
is "$?" "0" "multithreaded encode"
82-
is "$(cat $D/small.mt.spvcf | wc -c)" "36973335" "multithreaded output size"
82+
is "$(cat $D/small.mt.spvcf | wc -c)" "37091691" "multithreaded output size"
8383
8484
time "$EXE" decode -o $D/small.mt.roundtrip.vcf $D/small.mt.spvcf
8585
is "$?" "0" "decode from multithreaded"

0 commit comments

Comments
 (0)