
Introduce zfs rewrite subcommand #17246


Merged
merged 1 commit into from May 12, 2025

Conversation

amotin
Member

@amotin amotin commented Apr 15, 2025

Motivation and Context

For years, users have been asking for the ability to re-balance a pool after vdev addition, defragment randomly written files, change properties of already-written files, etc. The closest options are to either copy and rename a file or send/receive and rename the dataset. Unfortunately, all of those options have downsides.

Description

This change introduces a new zfs rewrite subcommand that rewrites the content of specified file(s) as-is, without modification, but at a different location and with different compression, checksum, dedup, copies and other parameter values. It is faster than a read plus a write, since it does not require copying data to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writes. Since it is protected by normal range locks, it can be run under any other load. It also does not affect the file's modification time or other properties.
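
For illustration, a minimal session (hypothetical paths; -r, exercised later in this thread, recurses into directories):

# Rewrite a single file in place under the dataset's current property values
$ sudo zfs rewrite /tank/data/file.bin

# Recursively rewrite everything below a directory
$ sudo zfs rewrite -r /tank/data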

How Has This Been Tested?

Manually tested it on FreeBSD. Linux-specific code is not yet tested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@github-actions github-actions bot added the Status: Work in Progress (Not yet ready for general review) label Apr 15, 2025
@amotin
Member Author

amotin commented Apr 15, 2025

I've tried to find an existing kernel API to wire this to, but found that plenty of Linux file systems each implement their own IOCTLs for similar purposes. I did the same, except that I chose the IOCTL number almost arbitrarily, since ZFS seems quite rough in this area. I am open to any better ideas before this is committed.

@HPPinata

This looks amazing! Not having to sift through half a dozen shell scripts every time this comes up to see what currently handles the most edge cases correctly is very much appreciated. Especially with RaidZ expansion, being able to direct users to run a built-in command instead of debating what script to send them to would be very nice.

Also being able to reliably rewrite a live dataset while it's in use without having to worry about skipped files or mtime conflicts would make the whole process much less of a hassle. With the only thing to really worry about being snapshots/space usage this seems as close to perfect as reasonably possible (without diving deep into internals and messing with snapshot immutability). Bravo!

@amotin amotin added the Status: Design Review Needed (Architecture or design is under discussion) label Apr 16, 2025
@clhedrick

Thank you. This fixes one of the biggest problems with ZFS.

Is there a way to suspend the process? It might be nice to have it run only during off hours.

@amotin
Member Author

amotin commented Apr 16, 2025

Is there a way to suspend the process? It might be nice to have it run only during off hours.

It does one file at a time and should be killable in between. Signal handling within one huge file can probably be added, though restarting the process is up to the user. I didn't plan to go that deep into this area within this PR.

@clhedrick

I couldn't find documentation in the files changed, so I have to guess how it actually works. Is it a file at a time? I guess you could feed it with a "find" command. For a system with a billion files, do you have a sense of how long this is going to take? We can do scrubs in a day or two, but rsync is impractically slow. If this is happening at the file-system level, that might be the case here as well.

@stuartthebruce

I guess you could feed it with a "find" command.

This will likely be a good use case for GNU Parallel.
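
For example, a sketch along these lines (hypothetical paths; assumes zfs rewrite accepts individual file arguments, as the PR description says):

# Rewrite all files under /tank/data, eight at a time
$ find /tank/data -type f -print0 | parallel -0 -j 8 sudo zfs rewrite {}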

@HPPinata

I couldn't find documentation in the files changed, so I have to guess how it actually works. Is it a file at a time? I guess you could feed it with a "find" command. For a system with a billion files, do you have a sense of how long this is going to take? We can do scrubs in a day or two, but rsync is impractically slow. If this is happening at the file-system level, that might be the case here as well.

It can take a directory as an argument, and there are some recursive functions and iterators in the code, so piping find into it should not be necessary. That avoids some userspace file-handling overhead, but it still has to go through the contents of each directory one file at a time. I also don't see any parallel execution or threading (though I'm not too familiar with ZFS internals; maybe some of the primitives used here run asynchronously?).

Whether or not you add parallelism in userspace by calling it for many files/directories at once, it has the required locking to just run in the background, which is significantly more elegant than the cp + mtime (or potentially userspace hash) check needed to make sure files didn't change during the copy, avoiding one of the potential pitfalls of existing solutions.

@amotin
Member Author

amotin commented Apr 16, 2025

I haven't benchmarked it deeply yet, but unless the files are tiny, I don't expect a major need for parallelism. The code in the kernel should handle up to 16MB at a time, plus it allows ZFS to do read-ahead and write-back on top of that, so there will be quite a lot in the pipeline to saturate the disks and/or the system, especially if there is some compression/checksumming/encryption. And with no need to copy data to/from user-space, the single thread will not be doing too much, I think mostly decompression from ARC. A bunch of small files on a wide HDD pool may indeed suffer from read latency, I suspect, but that we can optimize/parallelize in user-space all day long.

@tonyhutter
Contributor

tonyhutter commented Apr 16, 2025

I gave this a quick test. It's very fast and does exactly what it says 👍

# Copy ZFS source workspace to pool with compression=off
$ time cp -a ~/zfs /tank2

real	0m0.600s
user	0m0.032s
sys	0m0.519s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  893M  8.4G  10% /tank2


# Set compression to 'gzip' and rewrite
$ sudo ./zfs set compression=gzip tank2
$ time sudo ./zfs rewrite -r /tank2

real	0m2.272s
user	0m0.005s
sys	0m0.005s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  402M  8.9G   5% /tank2


# Set compression to 'lz4' and rewrite
$ sudo ./zfs set compression=lz4 tank2
$ time sudo ./zfs rewrite -r /tank2
real	0m1.947s
user	0m0.002s
sys	0m0.010s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  456M  8.8G   5% /tank2


# Set compression to 'zstd' and rewrite
$ sudo ./zfs set compression=zstd tank2
$ time sudo ./zfs rewrite -r /tank2

real	0m0.616s
user	0m0.003s
sys	0m0.006s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  366M  8.9G   4% /tank2

I can already see people writing scripts that go through every dataset, setting the optimal compression, recordsize, etc., and zfs rewrite-ing them.
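
Something like this sketch, say (hypothetical pool name; assumes every mounted dataset should get the same settings):

$ zfs list -H -o name,mountpoint -r tank | while IFS=$'\t' read -r ds mnt; do
>   case "$mnt" in /*) ;; *) continue ;; esac  # skip legacy/unmounted datasets
>   sudo zfs set compression=zstd "$ds"
>   sudo zfs rewrite -r "$mnt"
> done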

@amotin
Member Author

amotin commented Apr 16, 2025

Cool! Though recordsize is one of the things it can't change, since that would require a real byte-level copy, not just marking existing blocks dirty. I am not sure it can be done under load in general; at least it would be much more complicated.

@snajpa
Contributor

snajpa commented Apr 17, 2025

Umm, this is basically the same as doing send | recv, isn't it? I mean, in a way, this is already possible without any changes, isn't it? Recv will even respect a lower recordsize, if I'm not mistaken; at least when receiving into a pool without large-blocks support, it has to do that.

I'm thinking about whether we can do better, in the original ZFS sense of "better", meaning "automagic": what do you think of using snapshots and send|recv in a loop with ever-decreasing delta size, and then, when the delta isn't decreasing anymore, swapping the datasets and using a (perhaps slightly modified) zfs_resume_fs transparently to userspace? That way we would get transparent migration into a dataset with different options; that would scratch some itches for people, wouldn't it?
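
Roughly, something like this (hypothetical dataset names; the final swap is the part that would need new code):

$ zfs snapshot tank/data@m1
$ zfs send tank/data@m1 | zfs recv tank/data.new
$ zfs snapshot tank/data@m2
$ zfs send -i @m1 tank/data@m2 | zfs recv tank/data.new
# ...repeat with @m3, @m4, ... until the incremental delta stops shrinking,
# then swap tank/data.new in place of tank/data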

It'd be even cooler if it could coalesce smaller blocks into larger ones, but that potentially implies performance problems from write amplification. I would say that if the app writes in smaller chunks and the data lands on disk in such smaller chunks, it's probably best to leave them that way. For any practical use case I can think of, though, I would definitely appreciate the ability to split the blocks of a dataset using a smaller recordsize.

If there's a way how to make zfs rewrite more automagical, I think it's at least worth considering.

@HPPinata

HPPinata commented Apr 17, 2025

send | recv has the huge downside of requiring 2x the space, even if you do the delta-size thing, since it has to send the entire dataset at least once and the old data can't be deleted until the new dataset is complete.
Also, recv doesn't increase block sizes; it only splits them if they are larger than what the other pool supports (and iirc there have even been some issues with that).
Also, that idea sounds a lot more complex than simply walking the directory tree and iterating through the files to mark their records as dirty, causing a rewrite.

we would get transparent migration into a dataset with different options, that would scratch some itches for people, wouldn't it?

Isn't this exactly what rewrite does? Change the options, run it, and all the blocks are changed in the background, without an application even seeing a change to the file. And unlike send | recv it only needs a few MB of extra space.

Edit: with the only real exception being recordsize, but recv also solves that only partially at best, and it doesn't look like there's a reasonable way to work around it in a wholly transparent fashion.

@amotin
Member Author

amotin commented Apr 19, 2025

  • Added -x flag to not cross mount points (see the example below).
  • Added signal handling in kernel.
  • Added man page.
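
A quick illustration of the new -x flag (hypothetical layout, with child datasets mounted under /tank):

# Rewrite everything in the tank root file system, without descending
# into other file systems mounted below it
$ sudo zfs rewrite -r -x /tank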

@amotin amotin force-pushed the rewrite branch 4 times, most recently from d23a371 to c5f4413, on April 19, 2025 22:49
@stuartthebruce

Which release is this game-changing enhancement likely to land in?

@amotin
Member Author

amotin commented Apr 20, 2025

@stuartthebruce So far it hasn't landed even in master, so anybody who wants to speed it up is welcome to test and comment. In general, though, once completed, there is no reason why, aside from 2.4.0, it can't be ported back to some 2.3.x of the time.

@stuartthebruce

@stuartthebruce So far it hasn't landed even in master, so anybody who wants to speed it up is welcome to test and comment. In general, though, once completed, there is no reason why, aside from 2.4.0, it can't be ported back to some 2.3.x of the time.

Good to know there are no obvious blockers to including this in a future 2.3.x. Once this hits master I will help by setting up a test system with 1/2PB of 10^9 small files to see if I can break it. Is there any reason to think the code will be sensitive to Linux vs FreeBSD?

@amotin
Member Author

amotin commented Apr 20, 2025

Is there any reason to think the code will be sensitive to Linux vs FreeBSD?

The IOCTL interfaces of the kernels are obviously slightly different, requiring OS-specific shims, as with most other VFS-related code. But it seems like not a big problem, as Tony confirmed it works on Linux too on the first try.

@amotin
Member Author

amotin commented Apr 20, 2025

Once this hits master

Since this introduces a new IOCTL API, I'd appreciate some feedback before it hits master, in case some desired functionality might require API changes beyond the flags field I already reserved for later extensions. I was thinking about some options to not rewrite in certain cases, but didn't want to pollute the code until I am convinced they are required.

@stuartthebruce

Since this introduces a new IOCTL API, I'd appreciate some feedback before it hits master, in case some desired functionality might require API changes beyond the flags field I already reserved for later extensions. I was thinking about some options to not rewrite in certain cases, but didn't want to pollute the code until I am convinced they are required.

OK, I will see if I can find some time this next week to stress test.

@amotin amotin marked this pull request as ready for review April 20, 2025 20:39
This allows rewriting the content of specified file(s) as-is without
modifications, but at a different location, compression, checksum,
dedup, copies and other parameter values.  It is faster than read
plus write, since it does not require data copying to user-space.
It is also faster for sync=always datasets, since without data
modification it does not require ZIL writing.  Also since it is
protected by normal range locks, it can be done under any other
load.  Also it does not affect the file's modification time or
other properties.

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
@tonyhutter tonyhutter merged commit 49fbdd4 into openzfs:master May 12, 2025
22 of 24 checks passed
@amotin amotin deleted the rewrite branch May 12, 2025 17:59
ixhamza pushed a commit to truenas/zfs that referenced this pull request May 14, 2025
ixhamza pushed a commit to truenas/zfs that referenced this pull request May 14, 2025
@satmandu satmandu mentioned this pull request May 26, 2025
@owlshrimp

@robn I had some of the same thinking myself. But even more than adding a file-oriented sub-command to zfs, I hate adding new commands. Patching third-party tools to use the IOCTL does not sound realistic for a cross-platform feature. I found some solace in the fact that we already have the zfs project sub-command, which is file-oriented too. I am not particularly locked to the zfs rewrite name, haven't used it anywhere yet, but zfs project we can't rename already.

re: naming, I guess "zfs anneal" could be a possibility if "rewrite" is to be used for something else. That's what I've been calling this kind of functionality myself. It seems to me a fitting description of the task this tool is meant to accomplish.

@vedranmiletic

zfs anneal

Neat idea! As an added bonus, it's a metaphor from metallurgy, just like scrub and resilver.

@stuartthebruce

FYI, I found another use case for this feature: healing a remote snapshot backup. For example, I have a large (146TB) snapshot with two (out of 55.7M) files that return EIO CKSUM errors on a remote backup but are OK on the primary production instance (long story). I will use /bin/dd to force a re-write on the production instance so that an incremental snapshot update will heal the remote backup, but this would be better done with zfs rewrite.
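
The dd trick looks roughly like this (hypothetical path; conv=notrunc makes dd overwrite the file in place instead of truncating it):

$ dd if=/prod/data/file.bin of=/prod/data/file.bin conv=notrunc bs=1M

whereas with this PR the same re-write would simply be:

$ sudo zfs rewrite /prod/data/file.bin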

@MagicRB

MagicRB commented Jul 20, 2025

Could this be the basis of a somewhat naive defrag? Similar to zpool scrub, running zpool defrag would start a background job that walks over all files in some order and issues rewrites for each. It would break up snapshots, but honestly I'm not sure we'd care; I seem to recall BTRFS breaks up snapshots too.

@gamanakis
Contributor

I just came across this; it seems we can use it to rewrite files to special vdevs, right?

@MagicRB

MagicRB commented Jul 20, 2025

From what I read in this thread, it's as if you did cp file file2; rm file; mv file2 file, but way safer and faster. So yes.

gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Jul 20, 2025
@fiveangle

fiveangle commented Jul 21, 2025

zfs anneal

Neat idea! As an added bonus, it's a metaphor from metallurgy, just like scrub and resilver.

I believe you are all thinking of the metaphor zfs temper. Annealing would metaphorically normalize an object's structure to be as amorphous as possible, while tempering would perform a specialized sequence of discrete operations on an object so that its structure ends up as optimized and robust as possible.

@owlshrimp

FYI, I found another use case for this feature: healing a remote snapshot backup. For example, I have a large (146TB) snapshot with two (out of 55.7M) files that return EIO CKSUM errors on a remote backup but are OK on the primary production instance (long story). I will use /bin/dd to force a re-write on the production instance so that an incremental snapshot update will heal the remote backup, but this would be better done with zfs rewrite.

This does raise a significant semantics question for anyone implementing this feature, I think: how do you handle corrupted files? Do you overwrite the file and mark the whole thing correct with valid checksums, with the corrupted bits set to whatever undefined value you read back (or just zeros)? Or do you leave the checksum errors on that file in place afterwards? I suppose this is highly implementation-dependent at the moment.

This particular use case would suggest clearing the errors and setting the file contents to "something, anything", which seems like an "ok" thing to do, so long as the user understands and accepts that this is going to happen. Maybe warn before executing zfs rewrite on a filesystem with checksum errors? (Should it stop if it hits one at runtime?)

@owlshrimp

zfs anneal

Neat idea! As an added bonus, it's a metaphor from metallurgy, just like scrub and resilver.

I believe you are all thinking of the metaphor zfs temper. Annealing would metaphorically normalize an object's structure to be as amorphous as possible, while tempering would perform a specialized sequence of discrete operations on an object so that its structure ends up as optimized and robust as possible.

Not exactly. Generally, when people think of tempering they think of softening material, usually in the case of steel that has been made glass-hard and now needs to be pulled back to something more reasonable to use.

I suggested annealing because annealing removes all sorts of discontinuities in the metal in the process of that normalization, in the same way that ZFS rewrite could remove inconsistencies in compression algorithm, raidz stripe length, etc. One of the things that "annealing" covers but "tempering" does not is the softening of copper back to a workable state (after it's been work-hardened from lots of hammering) so that further hammering and shaping can be done. The latter is the analogy I had in mind when I suggested the term.

@HPPinata

FYI, I found another use case for this feature: healing a remote snapshot backup. For example, I have a large (146TB) snapshot with two (out of 55.7M) files that return EIO CKSUM errors on a remote backup but are OK on the primary production instance (long story). I will use /bin/dd to force a re-write on the production instance so that an incremental snapshot update will heal the remote backup, but this would be better done with zfs rewrite.

This does raise a significant semantics question for anyone implementing this feature, I think: how do you handle corrupted files? Do you overwrite the file and mark the whole thing correct with valid checksums, with the corrupted bits set to whatever undefined value you read back (or just zeros)? Or do you leave the checksum errors on that file in place afterwards? I suppose this is highly implementation-dependent at the moment.

This particular use case would suggest clearing the errors and setting the file contents to "something, anything", which seems like an "ok" thing to do, so long as the user understands and accepts that this is going to happen. Maybe warn before executing zfs rewrite on a filesystem with checksum errors? (Should it stop if it hits one at runtime?)

I believe you misunderstood. The files you rewrite aren't the corrupted ones; the corrupted ones are in the offsite snapshot. So anything that's read and written out again is valid data with passing checksums. This is just about forcing zfs send to retransmit that data without having to change it from userspace.

Usually this would be handled by doing a replication from scratch, or with the standard cp, rm, mv cycle on the live version to force replication to transmit the blocks again and produce a new, uncorrupted snapshot in the offsite location.

@fiveangle

Not exactly.

Yes, not exactly. As all metaphors...

Tempering is the final operation to leave an alloy in its toughest, most optimized, and inexorable form. Annealing leaves metal in its softest, most malleable form... the exact opposite goal for user data.

Anyway, the merge was performed with "rewrite", which is far and away the least obfuscated name, however boring it may be, so it is honestly the best choice for the masses.

one of the things that "annealing" covers [...] is the softening of [metal] back to a workable state [...] so that further hammering and shaping can be done. [That] is the analogy I had in mind when I suggested the term.

👀 I feel like Mugatu in Zoolander, but nothing a pint at the pub won't solve... 🤣

@owlshrimp

Tempering is the final operation to leave an alloy in its toughest, most optimized, and inexorable form. Annealing leaves metal in its softest, most malleable form... the exact opposite goal for user data.

Annealing allows one to do further work to a piece of copper, removing the internal dislocations accumulated from previous work. I'm sure that users would like to continue to work with and shape their arrays over time and this gives them a way to remove accumulated inconsistencies at various points along the way.

Anyway, the merge was performed with "rewrite", which is far and away the least obfuscated name, however boring it may be, so it is honestly the best choice for the masses.

Really, my biggest issue with "rewrite" is that it could be confused with block pointer rewrite, the once-promised feature that was never delivered, which this certainly isn't, except in the most roundabout way imaginable. Yes, technically rewriting the data rewrites the block pointers, but it's a very different kind of operation.

@clhedrick

How does this interact with snapshots? If I rewrite everything and have a snapshot, am I now using twice the space?

@robn
Member

robn commented Jul 23, 2025

How does this interact with snapshots? If I rewrite everything and have a snapshot, am I now using twice the space?

Yes: if you rewrite a block that's in a snapshot, the snapshot keeps the old copy and you get a new one, so it will use more space on the pool. This is mentioned in zfs-rewrite(8).

It may not be twice the space; it depends on the properties at time of rewrite.
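
One way to watch the effect, before and after a rewrite (hypothetical pool name; usedbysnapshots shows the space pinned by snapshots):

$ zfs list -o name,used,refer,usedbysnapshots tank2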

@robn
Member

robn commented Jul 23, 2025

Really, my biggest issue with "rewrite" is that it could be confused with block pointer rewrite, the once-promised feature that was never delivered, which this certainly isn't, except in the most roundabout way imaginable. Yes, technically rewriting the data rewrites the block pointers, but it's a very different kind of operation.

I'm not really sure the distinction between "rewrite the data" and "rewrite the block pointer" means much here, since the BP describes how to interpret the data, so changing the data layout or transforms necessarily requires the block pointer to change.

In any case, I doubt there's going to be much confusion. "Block pointer rewrite" is pretty inside-baseball at this point; most people who just use OpenZFS have likely never heard of BPR, or if they have, not with any particular idea of what it is or should be. Hell, I've been working on OpenZFS for three years now and I only know it in wish form ("gee, it'd be nice to just upgrade all the block pointers"). Which is what zfs-rewrite is, just with a narrower scope.

@owlshrimp

How does zfs rewrite interact with an array that has data errors?

@maxximino
Contributor

Which is what zfs-rewrite is, just with a narrower scope.

AFAIU, the biggest difference between "zfs rewrite" and the mythical block-pointer rewrite is that data rewritten by "zfs rewrite" counts as newly-written data, with some unpleasant consequences:

  • Until you delete old snapshots, the old data is still needed and will still take space. So it can't be used to implement, e.g., arbitrary layout changes in the zpool unless the user agrees to forfeit all snapshots.
  • (I haven't tried it myself, but I'm fairly convinced that) if you use zfs send | recv to send backups offsite, you'll be sending everything again, with a big waste of bandwidth and space on the receiving side.

At the VFS layer, zfs rewrite is practically invisible (and awesome). At other layers, not so much. AFAIU, a real BP rewrite ideally should be invisible at the other layers as well.

@amotin
Member Author

amotin commented Jul 23, 2025

How does zfs rewrite interact with an array that has data errors?

The kernel should return an error to user-space, same as a read would, but the user-space code there is made to log errors and continue with the next file.

@amotin
Member Author

amotin commented Jul 24, 2025

  • (I haven't tried it myself, but I'm fairly convinced that) if you use zfs send | recv to send backups offsite, you'll be sending everything again, with a big waste of bandwidth and space on the receiving side.

@maxximino Look here: #17565. :)

@amotin amotin mentioned this pull request Jul 25, 2025