chunked: allow conversion without zstd compression #2343
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: giuseppe. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Force-pushed from f593118 to f6a2815
Just an extremely brief skim…
This might well make sense as an intermediate step. For full performance, wouldn't we want to entirely avoid copyAllBlobToFile + convertTarToZstdChunked?
“Just” stream the incoming tar file as it comes; for every regular file, store it somewhere inside the staging destination, checksum it, and either hardlink/reflink it with a pre-existing one, or move it into place if it is entirely new. (Yes, that would be a third implementation of setting file metadata, along with pkg/archive (maybe there are even more??), and we’d really want to consolidate them.)
Alternatively, after #2325, wouldn’t it be fairly easy to add chunked-like pre-staging of layer contents, without holding locks, to the regular pull path? And if we did that, could we teach pkg/archive to build the composefs structures ~directly? I think that (“just”, again) requires computing MeasureVerity, ordinary digests for RegularFilePathForValidatedDigest, and the tar header metadata.
Unlike truly-chunked pulls, on the “convert” path (and also for containers/image#2792), there’s not that much benefit in trying not to read files, so the hole / roll-sum data computation can, I think (?), be entirely avoided; we only want to match entire files, by digest, for reflinks/hardlinks.
I know, “just”…
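A rough sketch (not code from this PR) of the streaming idea above, assuming a hypothetical stageLayer helper and a content-addressed casDir next to the staging directory; metadata restoration, reflinks (e.g. FICLONE), and path sanitization of hdr.Name are deliberately omitted:

```go
package stagingsketch

import (
	"archive/tar"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// stageLayer streams a tar layer: every regular file is spooled into the staging
// directory while being digested, then either replaced by a hard link to an
// already-known copy or kept (and registered) as new content.
func stageLayer(tr *tar.Reader, stagingDir, casDir string) error {
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // directories, symlinks, and their metadata are omitted in this sketch
		}
		dest := filepath.Join(stagingDir, hdr.Name) // real code must sanitize hdr.Name
		if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
			return err
		}
		tmp, err := os.CreateTemp(stagingDir, ".staging-*")
		if err != nil {
			return err
		}
		h := sha256.New()
		_, copyErr := io.Copy(io.MultiWriter(tmp, h), tr)
		tmp.Close()
		if copyErr != nil {
			return copyErr
		}
		canonical := filepath.Join(casDir, hex.EncodeToString(h.Sum(nil)))
		if _, err := os.Stat(canonical); err == nil {
			// Identical content already staged: link it and drop the temporary copy.
			os.Remove(tmp.Name())
			if err := os.Link(canonical, dest); err != nil {
				return fmt.Errorf("hardlink %s: %w", canonical, err)
			}
			continue
		}
		// Entirely new content: move it into place and remember it by digest.
		if err := os.Rename(tmp.Name(), dest); err != nil {
			return err
		}
		if err := os.Link(dest, canonical); err != nil && !os.IsExist(err) {
			return err
		}
	}
}
```

A real implementation would also have to restore ownership, modes, xattrs, and timestamps, which is exactly the metadata-setting duplication the comment worries about.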
I've done some benchmarking and the cost of writing the tarball is minimal compared to computing the checksum of each file/chunk. It is still useful to compute the chunks because we fill the cache with this information and future downloads can use these chunks. I agree it would look cleaner to work directly on the tar stream, but the implementation cost is much higher because we will need to duplicate a lot of code paths.
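One way to sanity-check that claim is a micro-benchmark along these lines (illustrative only, not the benchmark referenced above; the 64 MiB payload and the function names are made up):

```go
package chunkedbench

import (
	"bytes"
	"crypto/sha256"
	"io"
	"os"
	"testing"
)

// payload is a 64 MiB stand-in for a layer's file contents.
var payload = make([]byte, 64<<20)

// BenchmarkChecksum measures the per-file digest cost that the comment says dominates.
func BenchmarkChecksum(b *testing.B) {
	b.SetBytes(int64(len(payload)))
	for i := 0; i < b.N; i++ {
		h := sha256.New()
		h.Write(payload)
		_ = h.Sum(nil)
	}
}

// BenchmarkSpoolToFile measures writing the same bytes to a temporary file,
// roughly the extra cost of keeping the intermediate tarball around.
func BenchmarkSpoolToFile(b *testing.B) {
	b.SetBytes(int64(len(payload)))
	for i := 0; i < b.N; i++ {
		f, err := os.CreateTemp(b.TempDir(), "layer-*.tar")
		if err != nil {
			b.Fatal(err)
		}
		if _, err := io.Copy(f, bytes.NewReader(payload)); err != nil {
			b.Fatal(err)
		}
		if err := f.Close(); err != nil {
			b.Fatal(err)
		}
	}
}
```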
Force-pushed from f6a2815 to a7455c2
pkg/chunked/storage_linux.go (outdated diff)
```go
if part != nil {
	limit := mf.CompressedSize
	// If we are reading from a source file, use the uncompressed size to limit the reader, because
	// the compressed size refers to the original layer stream.
	if missingPart.OriginFile != nil && partCompression == fileTypeNoCompression {
		limit = mf.UncompressedSize
	}
	c.rawReader = io.LimitReader(part, limit)
}
```
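For context, io.LimitReader simply stops returning data after the given number of bytes, so the limit has to match the byte count of the source actually being read; a standard-library-only illustration (not project code):

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

func main() {
	// 26 bytes of "uncompressed" data, limited by a smaller, wrong size
	// (analogous to using the compressed size for a local source file).
	src := strings.NewReader("uncompressed-file-contents")
	limited := io.LimitReader(src, 10)
	data, _ := io.ReadAll(limited)
	fmt.Printf("read %d bytes: %q\n", len(data), data) // read 10 bytes: "uncompress"
}
```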
I’m generally unhappy with the complexity. I feel that there must be some way to express this more simply.
I think that conflating compressedFileType to mean both “format of the file” and “origin+format of this chunk” is confusing, although it seems somewhat convenient for this PR.
We have the if c.rawReader != nil code to discard remaining data even for OriginFile chunks, where we don’t actually need to do that.
This part that duplicates the OriginFile condition could certainly be done without duplicating the condition, e.g. by setting an extra readingFromALocalFile boolean in the case missingPart.OriginFile != nil section.
As a possible starting point, is there any need to split the prepareCompressedStreamToFile / appendCompressedStreamToFile code into two parts, and to interleave the openDestinationFile call in the middle? I am quite possibly missing something, but I don’t see why that is necessary. And if it is not, combining the two …StreamToFile functions into a single function with a single switch would be simpler.
I’m afraid I can’t predict the full scope of how the refactoring could be done; I can only see one or two steps ahead.
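A purely hypothetical sketch of the suggested consolidation, with stand-in types rather than the real pkg/chunked definitions: one combined …StreamToFile helper with a single switch, and a readingFromALocalFile flag set once where the OriginFile case is recognized instead of re-checking the condition later:

```go
package chunkedsketch

import "io"

// Stand-in declarations; the real types live in pkg/chunked.
type compressedFileType int

const (
	fileTypeNoCompression compressedFileType = iota
	fileTypeZstdChunked
)

type missingPart struct {
	OriginFile io.Reader // non-nil when the chunk is served from a local file
}

// storePartToFile folds the prepare/append halves into one function with one switch.
func storePartToFile(part io.Reader, mp *missingPart, partCompression compressedFileType, dst io.Writer) error {
	readingFromALocalFile := mp.OriginFile != nil && partCompression == fileTypeNoCompression

	switch {
	case readingFromALocalFile:
		// Local, already-uncompressed data: copy it through unchanged;
		// no "discard remaining data" step is needed afterwards.
		_, err := io.Copy(dst, part)
		return err
	default:
		// Data from the layer stream: decompression and validation would happen
		// here (omitted in this sketch).
		_, err := io.Copy(dst, part)
		return err
	}
}
```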
Thanks, that’s interesting.
Wait … we are writing this not-really-zstd data to durable storage as …
Flush() is only called before Close(), so it has no effect. Signed-off-by: Giuseppe Scrivano <[email protected]>
Force-pushed from a7455c2 to 92ce27a
I’d really like storeMissingFiles to be simplified further, but I guess that with the c.rawReader = io.LimitReader consolidation this PR is, on net, not making things worse.
Force-pushed from f5b395e to b3ada72
The reader for the part is now limited before calling prepareCompressedStreamToFile(). Signed-off-by: Giuseppe Scrivano <[email protected]>
A new function NoCompression() is added to provide a way to create uncompressed zstd:chunked files. Signed-off-by: Giuseppe Scrivano <[email protected]>
This commit introduces the capability to convert tar layers to the zstd:chunked format, without performing any zstd compression. This allows using zstd:chunked without the cost of compression and decompression. Closes: https://issues.redhat.com/browse/RUN-3056 Signed-off-by: Giuseppe Scrivano <[email protected]>
Force-pushed from b3ada72 to 87c6994
LGTM, with a bit of a heavy heart about the complexity of storeMissingFiles.
/lgtm
This commit introduces the capability to convert tar layers to the zstd:chunked format, without performing any zstd compression.
This allows using zstd:chunked without the cost of compression and decompression.
Closes: https://issues.redhat.com/browse/RUN-3056