Replies: 4 comments
- Interesting rabbit hole: same here, when I discovered attic (borg's predecessor) back then. :-) But it works a bit differently than what you described above - maybe some docs could be improved if you really got that from the borg docs:
- About your use case 1:
- About use case 2:
- About borg2:
Let me start off by saying that the search for a better backup solution has led me down some of the most interesting rabbit holes in a long time, especially Borgbackup. I spent the last 5 days reading about its internal workings and followed several discussions of hash collisions, crypto optimizations and all kinds of great things. Even though all of this info feels very chaotic and spread out, it's all there if one is willing to look for it. Unlike some other backup software, cough Duplicacy cough...
As far as I understand it, Borgbackup works as follows (leaving out compression and encryption for simplicity's sake):
The source folders get scanned, and all files that need to be backed up are lined up and more or less treated as one giant block of data (the docs use the analogy of a tarball). This giant mother block is then chopped into chunks, either at a fixed size or by using a buzhash that tries to find "smart" places to cut by searching for blocks of zeros within a given window of permitted chunk sizes.
This means that one chunk can actually contain parts of more than one file (not entirely sure about that, though).
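To make sure I am describing the right idea, here is a tiny Python sketch of what I mean by content-defined chunking. This is purely my own illustration with made-up parameters, not borg's actual buzhash code or its default chunker settings:

```python
# Toy content-defined chunker: cut wherever a cheap hash of the bytes seen
# since the chunk start hits a boundary pattern, within min/max size limits.
# Purely illustrative -- borg's real chunker uses a windowed rolling hash
# (buzhash) and different parameters.

MASK = (1 << 12) - 1      # on average one boundary every ~4 KiB
MIN_SIZE = 1 << 10        # never cut before 1 KiB
MAX_SIZE = 1 << 16        # always cut at 64 KiB at the latest

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h * 31) + byte) & 0xFFFFFFFF   # cheap stand-in hash
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)
```

The point, as I understand it, is that the cut positions depend on the data itself rather than on fixed offsets, which is what makes de-duplication across shifted or partially changed files possible.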
After being chunked, those chunks get SHA-256 hashed. This hash is used to identify the chunk. If the hash already exists inside the manifest (I think that's the term used in the docs) AND the chunk size is identical, the chunk is assumed to be a duplicate and thus NOT put into the repository (the "destination").
A list is created that tells Borg which chunks need to be put back together to recreate the original file. This allows a single chunk to be used in several files, which is how de-duplication is achieved. To keep things a bit more manageable, Borg puts several chunks into one file (think Haystack or a zip archive). So what you see on your repository file system is a bunch of 500 MB (default) files with strange names.
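Conceptually (and only conceptually -- this is a toy sketch of my mental model, not borg's actual index, item metadata or segment format), I picture it roughly like this:

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size split just to keep this short;
                   # imagine the content-defined chunker from above here

chunk_store = {}   # chunk id -> chunk bytes, standing in for the repository

def store_file(data: bytes) -> list[bytes]:
    """Chunk a file, store only unseen chunks, return its chunk id list."""
    ids = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        cid = hashlib.sha256(chunk).digest()
        if cid not in chunk_store:      # dedup: identical chunks stored once
            chunk_store[cid] = chunk
        ids.append(cid)
    return ids

def restore_file(ids: list[bytes]) -> bytes:
    """Put the chunks back together to recreate the original file."""
    return b"".join(chunk_store[cid] for cid in ids)
```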
To my understanding, no checksum of the original file is stored. Duplicacy, for example, does that, which lets you double-check on restore that your files are actually OK. However, it too does not do much in terms of hash collision mitigation.
Adding checksums yourself is easy to do, depending on the nature of your data. In my case I just keep a separate DB with those checksums as well as last-modified dates and file sizes, just to be sure :)
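For what it's worth, my side bookkeeping is nothing fancier than something like this (my own script, completely independent of borg):

```python
import hashlib, os, sqlite3

# Tiny side database of file checksums, sizes and mtimes, so restores can
# be verified independently of the backup tool.

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def record_tree(root: str, db_path: str = "checksums.db") -> None:
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS files
                  (path TEXT PRIMARY KEY, sha256 TEXT, size INTEGER, mtime REAL)""")
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                       (path, sha256_of(path), st.st_size, st.st_mtime))
    db.commit()
    db.close()
```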
Assuming my understanding of Borg's inner workings is correct, I was hoping someone could help me optimize things a little for my use cases.
Use Case 1:
Large media files that will not see much change, if any. I know a lot of people will scream rclone now, but running several backup schemes in parallel is a pain in my experience. Also, Borg has some advantages, mostly that accidental deletions don't mess up your backup even if you don't catch them in time.
Given that media files usually are already compressed, is there anything to gain by disabling compression in borg? Does variable chunking make any sense if 99% of the files are 1 GB+ in size? In this use case only a local copy to an external hard drive will be made.
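To put a rough number on the compression question, I did a quick test with incompressible data as a stand-in for already-compressed media. Note that zlib is only an example codec here, not necessarily what borg would be configured to use:

```python
import os, time, zlib

# Compressing data that is already compressed (random bytes standing in
# for media files) gains essentially nothing; the remaining question is
# only the CPU time spent trying.

data = os.urandom(32 * 1024 * 1024)          # 32 MiB of incompressible data

start = time.perf_counter()
compressed = zlib.compress(data, level=6)
elapsed = time.perf_counter() - start

print(f"ratio: {len(compressed) / len(data):.3f}  time: {elapsed:.2f}s")
```

So my guess is that disabling compression is about saving CPU time rather than space, but I would like to hear from people who actually measured it with borg.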
Use Case 2:
Here borg makes more sense: all types of files, from just a couple of bytes to several GB, both compressed media and databases, logs, pretty much everything. Changes are also more likely; a good third of those files are expected to change between pretty much every backup. For that I feel like borg's defaults are probably the best choice.
Another option would be to separate the media files out into their own backup job and again disable compression. I don't think a larger chunk size would make sense here, since most of it is JPG. Unfortunately, the people who come up with metadata standards in the photo world seem to live in a very strange parallel universe and thought it was a good idea to embed them into the files... This means that changing a single tag results in a new file getting written, with a new checksum and modified date. I was hoping that borg could catch at least most of that, especially for larger .mp4 files.
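To see whether that hope is even plausible, I played with a small simulation: rewrite a few KiB near the start of a "file" (the way a tag edit might) and check how many content-defined chunks survive unchanged. Again, this uses my own toy rolling-hash chunker with made-up parameters, not borg's buzhash, so take it only as an illustration of the principle:

```python
import hashlib, os

# Does content-defined chunking re-use most chunks when only a small block
# near the start of a file changes? Toy rolling-hash chunker, made-up
# parameters, pure Python (hence the modest 8 MiB test size).

WINDOW = 64                         # rolling-hash window in bytes
MASK = (1 << 16) - 1                # cut where the low 16 bits are zero (~64 KiB avg)
MIN_SIZE, MAX_SIZE = 2048, 1 << 20
B = 257                             # polynomial base of the rolling hash
B_W = pow(B, WINDOW, 1 << 32)       # factor for dropping the byte leaving the window

def chunk_ids(data: bytes) -> list[bytes]:
    """Return the SHA-256 ids of the content-defined chunks of data."""
    ids, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * B + byte) & 0xFFFFFFFF
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * B_W) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            ids.append(hashlib.sha256(data[start:i + 1]).digest())
            start = i + 1
    if start < len(data):
        ids.append(hashlib.sha256(data[start:]).digest())
    return ids

original = os.urandom(8 * 1024 * 1024)       # stand-in for a media file
edited = os.urandom(4096) + original[4096:]  # "tag edit": first 4 KiB rewritten

a, b = set(chunk_ids(original)), set(chunk_ids(edited))
print(f"chunks unchanged: {len(a & b)} of {len(a)}")
```

If borg's chunker behaves anything like this, most of a rewritten JPG or MP4 should still de-duplicate against the previous backup, with only the chunks around the changed metadata being stored again -- but I would love confirmation from someone who knows the real implementation.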
And one last thing (feel free to just ignore this, I know how annoying users can be):
Given that I am about to go all in on borg, would it be worth waiting for 2.0? Will backups created with the current version of borg be compatible with 2.0?
Big thank you to everyone who made it this far!