borg check errors after upgrade to 1.2.0 #6687
I should also mention that I don't use encryption with any of these repos.
Thanks for the detailed bug report.
BTW, do you use break-lock or bypass-lock?
I have never used break-lock or bypass-lock. I updated the top post to show the correct directories.
I have made a copy of one of the repos. Can you tell me exactly how to run the borg debug get-obj command on it?
thus, for the first repo:
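The exact command was lost from this comment; a hedged sketch of what it presumably looked like, with the repo path, the chunk ID (the 64-hex-digit ID from the borg check output) and the output file as placeholders:

```sh
# Fetch a single object from the repo by its chunk ID and write it to a file.
# All three arguments below are placeholders.
borg debug get-obj /path/to/repo <64-hex-digit-chunk-id> /tmp/chunk.bin
```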
But likely it will only work after you have repaired the repo index.
The directory listings did not reveal anything special. But considering the described circumstances this looks like a bug.
In case we do not find the issue otherwise, maybe keep an unmodified copy of at least one of the repos until this is resolved.
I made a full copy of the third repository above (on the Debian system) and made hard link copies of the first two repositories. I'm now doing repairs on two of the repositories. When they complete, I will include the output, and do the get-obj check.
On the Debian system:
On the Ubuntu system:
On the Debian system:
On the Ubuntu system:
The repair said that orphans were being deleted, so maybe that is why they don't show up? And maybe the segment files are still there because I haven't run compact?
I get the same "object ... not found" message if I try borg debug get-obj.
OK, so the 1 missing index entry was an orphan chunk. That takes quite a bit of severity off from this ticket. :-) It can't be found in the index before repair, because it is not in the index (maybe it never was added to the index). It can't be found in the index after repair, because it was determined orphaned (no archive uses this object) and thus removed. So what remains to be determined is what borg operation creates this orphan chunk.
What could be done to determine the contents of the orphan chunk is to seek to the given offset into the given segment file and look what's there.
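A minimal sketch of that inspection, assuming direct file access to the repo; the segment file path (segments live under data/<dir>/<segment> inside the repo) and OFFSET are placeholders taken from the check/index output:

```sh
# Dump a few hundred bytes starting at the reported offset of the segment file.
OFFSET=1234567   # placeholder
dd if=/path/to/repo/data/0/123 bs=1 skip="$OFFSET" count=256 2>/dev/null | xxd | head -n 16
```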
I tried looking at that offset in the file, but the chunks are compressed, so it wasn't obvious. If there's some way to extract the chunk and decompress it, let me know.

About the cause: would network errors cause orphaned chunks? I do occasionally see backups that don't successfully complete due to a network error. But I thought that further repo operations and/or checks would have cleaned those up.

What is particularly odd is that none of those segment files has a recent date. One of them is from 2017! Maybe a chunk in there was somehow left after a prune operation? Maybe a prune operation had a network issue?
I'm also puzzled that the segment file is still there after the borg repair. Maybe the chunk will get removed after a borg compact?
If a small not-used-anymore (== "deleted") chunk sits in a segment file with other still-used chunks, it might well stay there. Borg computes a ratio of deleted/total size, and if the ratio is below the threshold, borg compact won't compact that segment. In recent borg, you can also give borg compact a threshold of 0, but that might move a lot of data around for not much space saving.
BTW, as long as the chunk is logically deleted, it won't show as orphan.
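For reference, a hedged example of the threshold override mentioned above; the repo path is a placeholder, and --threshold is a percentage, so 0 compacts even segments with very little deleted data:

```sh
# Force compaction of (nearly) all segments, not just those above the default 10% ratio.
borg compact --threshold 0 /path/to/repo
```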
You could try this (it should decrypt and decompress as needed). It dumps ALL objects, so needs quite some space / time.
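The command itself was lost here; presumably it was borg debug dump-repo-objs. A hedged sketch, with placeholder paths:

```sh
# Dumps every object of the repo into files in the current directory,
# so run it from a scratch directory with plenty of free space.
mkdir -p /scratch/objdump && cd /scratch/objdump
borg debug dump-repo-objs /path/to/repo
```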
From the way borg works, an interrupted backup can leave unreferenced chunks behind. E.g. if it starts backing up a file, stores some of its chunks into the repo, and then gets interrupted (e.g. by a network error) before the archive is committed, those chunks stay in the repo without any archive referencing them. Maybe borg could do some cleanup in an exception handler, but IIRC it does not do that yet.
None of this explains why there would be orphan chunks from years ago. I run borg check every month, and it never reported any errors until yesterday, so I don't think these orphan chunks could have arisen from I/O errors at the time of backup. Somehow they recently became orphans. Could there be a bug in prune? Or maybe several prune operations got interrupted, all in the same month, even though that never caused an error in the past five years?
Yeah, guess that's right. Would be interesting to look at the plaintext of that chunk.
I don't think it's worth the time/space to extract all chunks. I extracted 1MB starting where the chunk starts (this is on the Debian system) and ran it through a hex dump tool. Here is the start of the hex dump:
It's a bit hard to decipher due to the 16-bit formatting with bytes swapped, but:
Hmm, does not decompress, strange.
It seems it is uncompressed data:
So a null byte, followed by the ascii string "-0.328\n". Strange. I confirmed this by adding 43 to the offset and extracting 8 bytes from the segment file. I don't know what that file could have been. I should have mentioned that I use

Edit: for the record, the length above, 0x33, is 51 in decimal, and includes the 43-byte header, which is why the data portion is 8 bytes long.
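A hedged sketch of that confirmation step; the segment path and OFFSET are placeholders, 43 skips the entry header, and 8 is the remaining payload length:

```sh
# Extract the 8 data bytes that follow the 43-byte header of this entry.
OFFSET=1234567   # placeholder
dd if=/path/to/repo/data/0/123 bs=1 skip=$((OFFSET + 43)) count=8 2>/dev/null | xxd
```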
I repaired the third repository, the other one on the Ubuntu system. Similar results.
A theory how this could happen (see #6687 (comment)):
Did some reviews, but did not find anything suspicious:
But one thing came to my mind:
Have to investigate more whether that could be an issue (guess that shouldn't be an issue, the hints file could also just get lost at some time).
get-obj fails to retrieve it, and I don't have enough scratch space for dump-repo-objs.
Deleting the hints file before running compact seems to avoid the issue. That indicates to me that the shadow index stored in the hints file contains bad data.
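As I understand it, the workaround described here amounts to something like the following; only try this on a copy of the repo, with direct file access, and treat it as an experiment rather than a recommended procedure (paths are placeholders):

```sh
# Remove the stored hints (which contain the shadow index), then compact as usual.
rm /path/to/repo-copy/hints.*
borg compact /path/to/repo-copy
```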
@horazont if there is no entry for some chunkid in the shadow index, that triggers a special case in the compact code (like "we do not know about shadowed PUTs, thus we better keep the DEL"), so this is what you achieve by deleting the hints file. The problem in your hints file seems to be that the list of segment numbers with shadowed PUTs is empty for that chunkid, so the compact code decides to drop the seemingly not needed DEL. But the list should not be empty as there actually still is a PUT for that chunkid in the segment files. Maybe the shadow_index being wrong is related to these changes: #5636 #5670
#5670 was wrong, I reopened #5661, see there. Could it be that the issues seen here were seen after borg did "double-puts" (writing a chunk x to a repo which already has a chunk x), like when using borg recreate? The manifest also always gets (over)written to the same id 0, but might behave differently because it is always written to its own segment file (causing the segment with the superseded manifest to always be compacted) - otherwise we would have seen this issue all the time, not just rarely. Not 100% sure if this also fixes #6687 (this issue), but it could be. If somebody is seeing this frequently, locally applying the fix and observing the behaviour in practice would be welcome! The fix is for the repo code, so it needs to be applied on the repo side when running borg in client/server mode.
See #5661 for some conclusion. More conclusions about not persisting the shadow index (si) in borg < 1.2 and then persisting it in borg 1.2, which might cause an issue:

If there was an old PUT x ... DEL x in the repo (made with borg < 1.2), we do not have a si entry about that. If borg >= 1.2 does a PUT x ... DEL x, we will have a si entry about x, but it will only list the 2nd PUT (and borg will assume that an existing entry for x lists all the PUT segment numbers). If borg then compacts the segment file with the first DEL, it will drop the DEL as it does not know about the 1st PUT. It will still be correct, because there is that last DEL, keeping x deleted. If borg then compacts the segment files with the last PUT and also the one with the last DEL, then both will be removed. That will resurrect the first PUT when borg check --repair rebuilds the index (repository part) and delete it again after the archives part, because it is an orphan. After that, it should be stable, as there is a si entry now for x listing the segment number of the first PUT.

TODO: think about whether
before this fix, borg check --repair just created an empty shadow index, which can lead to incomplete entries if entries are added later. and such incomplete (but present) entries can lead to compact_segment() resurrecting old PUTs by accidentally dropping related DELs.
before this fix, borg check --repair just created an empty shadow index, which can lead to incomplete entries if entries are added later. and such incomplete (but present) entries can lead to compact_segments() resurrecting old PUTs by accidentally dropping related DELs.
Testing the PRs #7896 and #7897: practical testing of these is very welcome (be careful!). For @horazont:
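The concrete testing steps were lost from this comment; a rough sketch of how such a test could look, assuming borg's build requirements are installed and working only on a copy of an affected repo (paths are placeholders):

```sh
# Build the 1.2-maint branch (containing the merged fixes) into a venv
# and run a repair on a COPY of an affected repository.
git clone https://github.com/borgbackup/borg.git
cd borg && git checkout 1.2-maint
python3 -m venv ~/borg-test
~/borg-test/bin/pip install -e .
~/borg-test/bin/borg check --repair /path/to/repo-copy
```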
I suspect that users who either have old repos from borg < 1.2 or use borg recreate to change the compression are more likely to experience this issue than users whose repos are fresh since borg 1.2 and never see borg recreate changing the compression.
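For clarity, the kind of operation meant by "borg recreate to change the compression" is something like the following; the compression spec is just an example, and the point is that recreate can re-put chunk IDs that already exist in the repo:

```sh
# Recompress existing archives in place (a double-put-heavy operation).
borg recreate --recompress=if-different --compression zstd,9 /path/to/repo
```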
After another self-review of these, I just merged the 2 PRs into 1.2-maint. Guess if some more people besides @horazont could practically test the code (see previous comment), we could have a 1.2.7 release rather soon.
I deleted/changed some ancient data on that host. Ideally, that will trigger relevant ancient (borg 1.1.x-created) chunks to be expired from the repository next week, which would then trigger the issue.
The trap triggered. With the pre-prune copy of the repo, I could reproduce the check errors.
Then I checked out 1.2-maint and installed it into a venv. Using the borg from the venv, I did:
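The exact commands were lost here; my guess at the sequence, with the venv path, repo path and prune options as placeholders:

```sh
# Repair first (this also rebuilds the compaction/shadow-index info),
# then run the usual prune/compact and verify with a final check.
~/borg-test/bin/borg check --repair /path/to/repo
~/borg-test/bin/borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 12 /path/to/repo
~/borg-test/bin/borg compact /path/to/repo
~/borg-test/bin/borg check /path/to/repo
```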
🥳
For the record: the --repair is necessary, just using 1.2-maint before pruning is not sufficient. I think that is expected?
@horazont yes, I added rebuilding of the compaction infos when using borg check --repair.
Did anyone other than @horazont do practical testing with the current 1.2-maint code? Guess it would be good to get more practical testing before the 1.2.7 release, especially considering this issue here.
OK, guess I'll just close this as fixed (see #6687 (comment)) until otherwise proven. If somebody is seeing something that looks like this issue, please first do a borg check --repair with a fixed borg (current 1.2-maint or, once released, 1.2.7). Above mentioned fixes will be in borg 1.2.7 soon... (master branch is also fixed).
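A hedged example of that first step; the repo path is a placeholder, borg asks for interactive confirmation before repairing, and for client/server setups the repo side also needs the fixed version:

```sh
# Run with a borg that contains the fixes (1.2.7+ or current 1.2-maint).
borg check --repair -v /path/to/repo
```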
I think the Changelog should mention that users who have had problems with this issue should run borg check --repair after upgrading.

Oh, I also just realized that the Changelog seems to mention an incorrect issue/PR:
Should the "5661" be "7896"? And maybe the second "6687" should be "7897", to point to the PR instead of this issue?
@jdchristensen thanks for the feedback! some links for easier checking of these issues:
Updated the change log: #7961
Usually I rather give the issue number than the PR number, but here I just gave both now. There was quite some back and forth here...
change log: more infos about #6687 fix
Thanks a lot. The effort you put into BorgBackup is amazing!

How I was affected by this issue

I have been using BorgBackup 1.1.15 for many years and love its functionality and reliability! I am using Debian Linux and my Borg repository is on an external HDD, which is attached via USB while backups are created. On 3rd November 2023, I decided to migrate to a newer version of BorgBackup. I followed the Notes (excellent documentation, BTW!):
I did the steps described there and happily created 65 further archives during the next days. On 30th November 2023 it was time for my monthly maintenance run. Afterwards I did a borg check:

This was the first time I ever encountered problems via borg check. Research on the internet led me to this ticket #6687. I am sorry that I couldn't contribute to solving this issue, but I am very grateful that you could fix it by now. I am looking forward to testing BorgBackup 1.2.7 in the next days. Do you think this issue and its fix (install borg version >= 1.2.7 and run borg check --repair) should be mentioned in the changelog?
@palbr Guess if someone upgrades from 1.1.x to 1.2.7+, they might never be affected, so there is nothing to repair in that case. But I already added it to the 1.2.7 changelog entry (after the release though), see above. Also, guess running borg check --repair once with 1.2.7+ would not hurt.
Have you checked borgbackup docs, FAQ, and open Github issues?
Yes.
Is this a BUG / ISSUE report or a QUESTION?
Possible bug.
System information. For client/server mode post info for both machines.
Your borg version (borg -V).
1.2.0 on all clients and servers. Previous version was 1.1.17 (not 1.1.7, as I wrote on the mailing list).
Operating system (distribution) and version.
Debian buster and Ubuntu 20.04 on servers. Debian buster, Ubuntu 20.04 and Ubuntu 21.10 on clients.
Hardware / network configuration, and filesystems used.
Multiple local and remote clients accessing each repository. Repositories are on ext4, on RAID 1 mdadm devices, with spinning disks underlying them. The Debian server also uses lvm.
How much data is handled by borg?
The repos are all around 100GB in size, with up to 400 archives each. The repositories have been in use for many years.
Full borg commandline that lead to the problem (leave away excludes and passwords)
borg check /path/to/repo [more details below]
Describe the problem you're observing.
borg check shows errors on three different repositories on two different machines. See below for details.
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
Yes, borg check shows the same errors when run again.
Include any warning/errors/backtraces from the system logs
I upgraded from borg 1.1.17 to 1.2.0 on several different systems around April 9. On May 9, my monthly "borg check" runs gave errors on three repositories on two systems. Note that I use the setup where several clients do their backups into the same repositories. I don't have any non-shared repositories for comparison.
At the time of the upgrade from 1.1.17 to 1.2.0, I ran borg compact --cleanup-commits ... followed by borg check ... on all repos. There were no errors then. After that, I run borg compact without --cleanup-commits followed by borg check once per month. The errors occurred at the one month mark.

System 1 runs Ubuntu 20.04. Two of the three repos on this machine now have errors:
System 2 runs Debian buster. One of the three repos on this machine now has errors:
I have used borg on these systems for years, and no hardware has changed recently. System 1 has the repos on a RAID 1 mdadm device with two SATA spinning disks. System 2 also has the repos on RAID 1 mdadm devices with two SATA disks, with lvm as a middle layer. In both cases, smartctl shows no issues for any of the drives, and memtester also shows no errors.
Since the errors have happened on different machines within a month of upgrading to 1.2.0, I am concerned that this is a borg issue rather than a hardware issue. It is also suspicious to me that the error is the same in all cases, with a committed index not found. Hardware errors tend to produce garbage.
I have not run repair yet. Is there anything I should do before running repair to try to figure out the issue?
Update: there is a bounty for finding/fixing this bug: https://app.bountysource.com/issues/108445140-borg-check-errors-after-upgrade-to-1-2-0