Skip to content

Commit 7a54cc0

Browse files
committed
Fix a backup worker assertion failure
The pop version obtained from GRV proxy replies may go backwards, which can cause assertion failure when pulling mutations, i.e., missing mutations. Fix this bug and add an assertion that popVersion can go lower. -f ./tests/slow/BackupAndRestore.toml -s 2467687006 -b off 100k 20250312-202457-jzhou-34e803fa47d1c946 backup tests
1 parent f1c744d commit 7a54cc0

File tree

3 files changed

+17
-3
lines changed

3 files changed

+17
-3
lines changed

design/backup_v2_partitioned_logs.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -168,7 +168,7 @@ From an end-to-end perspective, the new backup system works in the following ste
168168
4. Periodically, backup workers upload mutations to the requested blob storage, and save the progress into the database;
169169
5. The backup is restorable when backup workers have saved versions that are larger than the complete snapshot’s end version, and the backup is stopped if a stop on restorable flag is set in the request.
170170

171-
The new backup has four major components: 1) backup workers; 2) recruitment of backup workers; 3) extension of tag partitioned log system to support pseudo tags; 4) integration with existing `TaskBucket` based backup command interface; and 5) integration with the Performant Restore System.
171+
The new backup has the following major components: 1) backup workers; 2) recruitment of backup workers; 3) extension of tag partitioned log system to support pseudo tags; 4) integration with existing `TaskBucket` based backup command interface; and 5) integration with the existing `TaskBucket` based restore system to read partitioned mutation logs.
172172

173173
### Backup workers
174174

fdbserver/BackupWorker.actor.cpp

+15-1
Original file line numberDiff line numberDiff line change
@@ -348,6 +348,10 @@ struct BackupData {
348348
}
349349
ASSERT_WE_THINK(backupEpoch == oldestBackupEpoch);
350350
const Tag popTag = logSystem.get()->getPseudoPopTag(tag, ProcessClass::BackupClass);
351+
DisabledTraceEvent("BackupWorkerPop", myId)
352+
.detail("Tag", popTag)
353+
.detail("SavedVersion", savedVersion)
354+
.detail("PopVersion", popVersion);
351355
logSystem.get()->pop(std::max(popVersion, savedVersion), popTag);
352356
}
353357

@@ -481,6 +485,7 @@ struct BackupData {
481485
// Save the noop pop version, which sets min version for
482486
// the next backup job. Note this version may change after the wait.
483487
state Version popVersion = self->popTrigger.get();
488+
ASSERT(self->popVersion < popVersion);
484489
wait(_saveNoopVersion(self, popVersion));
485490
self->popVersion = popVersion;
486491
TraceEvent("BackupWorkerNoopPop", self->myId)
@@ -1036,6 +1041,8 @@ ACTOR Future<Void> pullAsyncData(BackupData* self) {
10361041
.detail("Tag", self->tag)
10371042
.detail("Version", tagAt)
10381043
.detail("PopVersion", self->popVersion)
1044+
.detail("TriggerVersion", self->popTrigger.get())
1045+
.detail("StartVersion", self->startVersion)
10391046
.detail("SavedVersion", self->savedVersion);
10401047
loop {
10411048
while (self->paused.get()) {
@@ -1149,7 +1156,14 @@ ACTOR Future<Void> monitorBackupKeyOrPullData(BackupData* self, bool keyPresent)
11491156
std::max({ self->popVersion, self->savedVersion, committedVersion.get() });
11501157
self->minKnownCommittedVersion =
11511158
std::max(committedVersion.get(), self->minKnownCommittedVersion);
1152-
self->popTrigger.set(newPopVersion);
1159+
if (newPopVersion < self->popTrigger.get()) {
1160+
// this can happen if a different GRV proxy replies
1161+
DisabledTraceEvent("BackupWorkerNoopPop", self->myId)
1162+
.detail("Version", newPopVersion)
1163+
.detail("OldPop", self->popTrigger.get());
1164+
} else {
1165+
self->popTrigger.set(newPopVersion);
1166+
}
11531167
committedVersion = Never();
11541168
} else {
11551169
committedVersion = self->getMinKnownCommittedVersion();

fdbserver/TagPartitionedLogSystem.actor.cpp

+1-1
Original file line numberDiff line numberDiff line change
@@ -253,7 +253,7 @@ Version TagPartitionedLogSystem::popPseudoLocalityTag(Tag tag, Version upTo) {
253253
for (const int8_t locality : pseudoLocalities) {
254254
minVersion = std::min(minVersion, pseudoLocalityPopVersion[Tag(locality, tag.id)]);
255255
}
256-
// TraceEvent("TLogPopPseudoTag", dbgid).detail("Tag", tag.toString()).detail("Version", upTo).detail("PopVersion", minVersion);
256+
// TraceEvent("TLogPopPseudoTag", dbgid).detail("Tag", tag).detail("Version", upTo).detail("PopVersion", minVersion);
257257
return minVersion;
258258
}
259259

0 commit comments

Comments
 (0)