Skip to content

Commit 54df3c7

Browse files
authored
perf(cargo-package): match certain path prefix with pathspec (#14997)
### What does this PR try to resolve? This revives #14962. See benchmark chart in <#14962 (comment)>. #14962 was closed because we found more bugs in `cargo package`, and #14962 could potentially make them even harder to fix. Two of them have been fixed so this is good to ship IMO with its own good. --- An improvement #14955. `check_repo_state` checks the entire git repo status. This is usually fine if you have only a few packages in a workspace. For huge monorepos, it may hit performance issues. For example, on awslabs/aws-sdk-rust@2cbd34d the workspace has roughly 434 members to publish. `git ls-files` reported us 204379 files in this Git repository. That means git may need to check status of all files 434 times. That would be `204379 * 434 = 88,700,486` checks! Moreover, the current algorithm is finding the intersection of `PathSource::list_files` and `git status`. It is an `O(n^2)` check. Let's assume files are evenly distributed into each package, so roughly 470 files per package. If we're unlucky to have some dirty files, say 100 files. We will have to do `470 * 100 = 47,000` times of path comparisons. Even worse, because we `git status` everything in the repo, we'll have to it for all members, even when those dirty files are not part of the current package in question. So it becomes `470 * 100 * 434 = 20,398,000`! #### Solution Instead of comparing with the status of the entire repository, this patch use the magic pathspec[1] to tell git only reports paths that match a certain path prefix. This wouldn't help the `O(n^2)` algorithm, but at least it won't check dirty files outside the current package. Also, we don't `git status` against entire git worktree/index anymore. [1]: https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpathspecapathspec ### How should we test and review this PR? Run this command against awslabs/aws-sdk-rust@2cbd34d, and see if it is getting better. ``` CARGO_LOG_PROFILE=1 cargor package --no-verify --offline --allow-dirty -p aws-sdk-accessanalyzer -p aws-sdk-apigateway ``` I've verified checksums of `.crate` files generated from master (d85d761) and this commit (3dabdcd). They are the same. ### Additional information There are some other alternatives, like making `PathSource::list_files` additionally reports dirty files. While we already have rooms to do it, this approach should be the most straightforward one at this moment. Some other approaches like * Switch to gitoxide (I tried and it didn't as good as expected. Maybe I did something wrong). * A flag `--no-vcs` to skip vcs at all * Improve the `O(n^2)` algorithm
2 parents f15df8f + 3dabdcd commit 54df3c7

File tree

1 file changed

+21
-4
lines changed
  • src/cargo/ops/cargo_package

1 file changed

+21
-4
lines changed

src/cargo/ops/cargo_package/vcs.rs

+21-4
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
//! Helpers to gather the VCS information for `cargo package`.
22
33
use std::collections::HashSet;
4+
use std::path::Path;
45
use std::path::PathBuf;
56

67
use anyhow::Context as _;
@@ -173,7 +174,9 @@ fn git(
173174
// - ignored (in case the user has an `include` directive that
174175
// conflicts with .gitignore).
175176
let mut dirty_files = Vec::new();
176-
collect_statuses(repo, &mut dirty_files)?;
177+
let pathspec = relative_pathspec(repo, pkg.root());
178+
collect_statuses(repo, &[pathspec.as_str()], &mut dirty_files)?;
179+
177180
// Include each submodule so that the error message can provide
178181
// specifically *which* files in a submodule are modified.
179182
status_submodules(repo, &mut dirty_files)?;
@@ -263,12 +266,18 @@ fn dirty_files_outside_pkg_root(
263266
}
264267

265268
/// Helper to collect dirty statuses for a single repo.
266-
fn collect_statuses(repo: &git2::Repository, dirty_files: &mut Vec<PathBuf>) -> CargoResult<()> {
269+
fn collect_statuses(
270+
repo: &git2::Repository,
271+
pathspecs: &[&str],
272+
dirty_files: &mut Vec<PathBuf>,
273+
) -> CargoResult<()> {
267274
let mut status_opts = git2::StatusOptions::new();
268275
// Exclude submodules, as they are being handled manually by recursing
269276
// into each one so that details about specific files can be
270277
// retrieved.
271-
status_opts
278+
pathspecs
279+
.iter()
280+
.fold(&mut status_opts, git2::StatusOptions::pathspec)
272281
.exclude_submodules(true)
273282
.include_ignored(true)
274283
.include_untracked(true);
@@ -300,8 +309,16 @@ fn status_submodules(repo: &git2::Repository, dirty_files: &mut Vec<PathBuf>) ->
300309
// If its files are required, then the verification step should fail.
301310
if let Ok(sub_repo) = submodule.open() {
302311
status_submodules(&sub_repo, dirty_files)?;
303-
collect_statuses(&sub_repo, dirty_files)?;
312+
collect_statuses(&sub_repo, &[], dirty_files)?;
304313
}
305314
}
306315
Ok(())
307316
}
317+
318+
/// Use pathspec so git only matches a certain path prefix
319+
fn relative_pathspec(repo: &git2::Repository, pkg_root: &Path) -> String {
320+
let workdir = repo.workdir().unwrap();
321+
let relpath = pkg_root.strip_prefix(workdir).unwrap_or(Path::new(""));
322+
// to unix separators
323+
relpath.to_str().unwrap().replace('\\', "/")
324+
}

0 commit comments

Comments
 (0)