Skip to content

Add cuda-compat-mode flag to configure command #307

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 13, 2025

Conversation

elezar
Copy link
Member

@elezar elezar commented Apr 29, 2025

This changes adds a --cuda-compat-mode flag to the configure CLI. This allows more flexibility than the existing --no-cntlibs flag.

Possible values of the flag are:

  • mount (default) - CUDA compat libraries are mounted from /usr/local/cuda/compat to the standard library path in the container.
  • ldconfig - The folder containing the CUDA compat libraries is added as a command line argument to the ldconfig command executed in the container.
  • disabled - This is equivalent ot specifying the --no-cntlibs flag.

Note that the default behaviour is not changed.

This is levaraged by NVIDIA/nvidia-container-toolkit#1055 to provide more flexibility in how CUDA Forward Compat is handled.

@klueska
Copy link
Contributor

klueska commented Apr 30, 2025

First comment before looking at any code changes -- I would say that passing --no-cntlibs causes the --cuda-compat-mode flag to be ignored in favor of not injecting the libs at all. Maybe this is already the case, but the way the description of this PR reads, it seems that it only matches to the disabled mode of the new flag.

Copy link
Contributor

@klueska klueska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass

Comment on lines 181 to 185
// We add cuda-compat-mode=$arg to the container_flags.
if (str_join(&err, &ctx->container_flags, "cuda-compat-mode", " ") < 0)
goto fatal;
if (str_join(&err, &ctx->container_flags, arg, "=") < 0)
goto fatal;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any precedent for a flag of this type that can take on multiple values in this CLI?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean apart from the device selection flags that are comma-separated strings?

There are no other flags that select from a distinct set of options. This is the first such flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The --require flag could also be considered similar.

Comment on lines 243 to 244
// If CUDA Compat is enabled and neither cuda-compat-mode=mount nor cuda-compat-mode=ldconfig is specified
// default to cuda-compat-mode=mount to maintain backward compatibility.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how consistent we've been, but I believe this code base prefers the /* ... */ style comments over //

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about consistency. Will update though.

Comment on lines 239 to 247
if ((flags & OPT_CUDA_COMPAT_MODE_MOUNT) & (flags & OPT_CUDA_COMPAT_MODE_LDCONFIG)) {
error_setx(&ctx->err, "only one cuda-compat-mode can be specified at a time");
return (NULL);
}
// If CUDA Compat is enabled and neither cuda-compat-mode=mount nor cuda-compat-mode=ldconfig is specified
// default to cuda-compat-mode=mount to maintain backward compatibility.
if (!(flags & OPT_NO_CNTLIBS) & !(flags & OPT_CUDA_COMPAT_MODE_MOUNT) & !(flags & OPT_CUDA_COMPAT_MODE_LDCONFIG)) {
flags &= OPT_CUDA_COMPAT_MODE_MOUNT;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need an additional check here to say that if OPT_NO_CNTLIBS is set then we undo OPT_CUDA_COMPAT_MODE_MOUNT and OPT_CUDA_COMPAT_MODE_LDCONFIG if they happen to also be set.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As implemented, the code that checks OPT_CUDA_COMPAT_MODE* later is never triggered if OPT_NO_CNTLIBS is set, but being explicit here is better.

src/nvc_mount.c Outdated
Comment on lines 773 to 751
size_t nlibs = cnt->nlibs;
char **libs = array_copy(&ctx->err, (const char * const *)cnt->libs, cnt->nlibs);
if (libs == NULL)
goto fail;

filter_libraries(info, libs, &nlibs);
if ((tmp = (const char **)mount_files(&ctx->err, cnt->cfg.rootfs, cnt, cnt->cfg.libs_dir, libs, nlibs)) == NULL) {
free(libs);
if ((tmp = (const char **)mount_files(&ctx->err, cnt->cfg.rootfs, cnt, cnt->cfg.libs_dir, cnt->libs, cnt->nlibs)) == NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naively looking at this, its not clear why the removed code here got removed. The body of mount_files() hasn't changed, so why is this "prep work" no longer necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that this is not longer required is that the filtering has been moved to be run as SOON as we have access to the container and driver information. This happens in configure.c.

This means that this function is called with the correct set of libs depending on the selected mode.

Comment on lines 23 to 28
enum {
OPT_CUDA_COMPAT_MODE_MOUNT = 1 << 14,
OPT_CUDA_COMPAT_MODE_LDCONFIG = 1 << 15,
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have an explicit OPT_CUDA_COMPAT_MODE_DISABLED option here

static int get_compat_library_path(struct error *, const char * [], size_t, char **);

int
update_compat_libraries(struct nvc_context *ctx, struct nvc_container *cnt, const struct nvc_driver_info *info) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really an update or more of a "maybe add" for the compat libraries?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compat libraries are originally discovered in nvc_container_new and this function updates them based on the options (although OPT_CUDA_COMPAT_MODE_DISABLED is strictly speaking already handled) and the driver version. Ideally we would make the following changes:

  1. Switch the order of initialising the driver and the container.
  2. Pass the driver info to nvc_container_new so that we can perform the filtering there.

The issue with this (especially 2.) is that it breaks ABI compatibility for users that rely on this library. I elected to update this information here instead so as to have later code function consistently.

Note that we need the compat library information in two possible places:

  1. When mounting the driver files in nvc_driver_mount
  2. When running ldconfig in nvc_ldcache_update

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised that I didn't really answer your question:

The function does the following:

  1. If the the set of libraries is empty it does nothing (this is the case if cuda-compat-mode=disabled).
  2. It filters (and updates) the set of libraries based on the major version and cuda-compat-mode.
  3. Updates cuda_compat_dir if required.

Comment on lines 33 to 49
size_t nlibs = cnt->nlibs;
char **libs = array_copy(&ctx->err, (const char * const *)cnt->libs, cnt->nlibs);
if (libs == NULL) {
return -1;
}

// For cuda-compat-mode=mount, we also allow compat libraries with a LOWER major versions.
bool allow_lower_major_versions = (cnt-> flags & OPT_CUDA_COMPAT_MODE_MOUNT);
filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs);

// We free the previously allocated libs to allow these to be reassigned if required.
free(cnt->libs);
cnt->libs = NULL;
cnt->nlibs = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels messy but I don't have any concrete suggestions to improve it off the top of my head

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not 100% happy with it. I think I may be struggling because I'm not properly "thinking in C" at the moment.

@elezar
Copy link
Member Author

elezar commented May 2, 2025

First comment before looking at any code changes -- I would say that passing --no-cntlibs causes the --cuda-compat-mode flag to be ignored in favor of not injecting the libs at all. Maybe this is already the case, but the way the description of this PR reads, it seems that it only matches to the disabled mode of the new flag.

Yes, the --no-cntlibs flag takes precedence over the --cuda-compute-mode options specified. If specified, no libraries are ever discovered. --cuda-compute-mode=disabled is equivalent to --no-cntlibs in terms of behaviour.

@elezar elezar force-pushed the forward-compat-by-folder branch 2 times, most recently from 5c14039 to 62c339d Compare May 2, 2025 14:01
@elezar elezar marked this pull request as ready for review May 6, 2025 11:36
@elezar elezar requested review from klueska and cdesiniotis May 6, 2025 11:36
Comment on lines 44 to 59
filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs);

// We free the previously allocated libs to allow these to be reassigned if required.
free(cnt->libs);
cnt->libs = NULL;
cnt->nlibs = 0;

if (cnt->flags & OPT_CUDA_COMPAT_MODE_LDCONFIG) {
if (get_compat_library_path(&ctx->err, (const char **)libs, nlibs, &cnt->cuda_compat_dir) < 0) {
goto fail;
}
// For cuda-compat-mode=ldconfig we don't require the library
// paths since we already have the compat dir stored.
free(libs);
}
if (cnt-> flags & OPT_CUDA_COMPAT_MODE_MOUNT) {
// Munting is handled later. We only need to update the required
// container information.
cnt->libs = libs;
cnt->nlibs = nlibs;
}
return (0);
fail:
free(libs);
return (-1);
}
Copy link
Contributor

@cdesiniotis cdesiniotis May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any harm in not freeing the cnt->libs array in this function when cuda-compat-mode=ldconfig? If it is not required, then what about the below suggestion for readability sake?

Suggested change
filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs);
// We free the previously allocated libs to allow these to be reassigned if required.
free(cnt->libs);
cnt->libs = NULL;
cnt->nlibs = 0;
if (cnt->flags & OPT_CUDA_COMPAT_MODE_LDCONFIG) {
if (get_compat_library_path(&ctx->err, (const char **)libs, nlibs, &cnt->cuda_compat_dir) < 0) {
goto fail;
}
// For cuda-compat-mode=ldconfig we don't require the library
// paths since we already have the compat dir stored.
free(libs);
}
if (cnt-> flags & OPT_CUDA_COMPAT_MODE_MOUNT) {
// Munting is handled later. We only need to update the required
// container information.
cnt->libs = libs;
cnt->nlibs = nlibs;
}
return (0);
fail:
free(libs);
return (-1);
}
filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs);
// Use the filtered library list
free(cnt->libs);
cnt->libs = libs;
cnt->nlibs = nlibs;
if (cnt->flags & OPT_CUDA_COMPAT_MODE_LDCONFIG) {
if (get_compat_library_path(&ctx->err, (const char **)libs, nlibs, &cnt->cuda_compat_dir) < 0) {
return (1);
}
}
return (0);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any harm in doing it the way you propose, but I wanted to ensure that cnt->libs and cnt->nlibs are consistently set to NULL and 0 to prevent inadvertent usage. Since we now also explicilty check the options where these are used that should not be as much of an issue and your suggestion is definitely simpler.

@elezar elezar force-pushed the forward-compat-by-folder branch 3 times, most recently from 2282d50 to 0498c01 Compare May 12, 2025 13:19
Comment on lines 239 to 242
if (flags & OPT_CUDA_COMPAT_MODE_DISABLED) {
/*
* If the OPT_CUDA_COMPAT_MODE_DISABLED flag is specified, we
* explicitly ignore other OP_CUDA_COMPAT_MODE_* flags.
*/
flags &= ~(OPT_CUDA_COMPAT_MODE_MOUNT | OPT_CUDA_COMPAT_MODE_LDCONFIG);
} else {
if (!(flags & (OPT_CUDA_COMPAT_MODE_LDCONFIG | OPT_CUDA_COMPAT_MODE_MOUNT))) {
/*
* If no OPT_CUDA_COMPAT_MODE_* flags are specified,
* default to OPT_CUDA_COMPAT_MODE_MOUNT to maintain
* backward compatibility.
*/
flags &= OPT_CUDA_COMPAT_MODE_MOUNT;
}
}
if ((flags & OPT_CUDA_COMPAT_MODE_MOUNT) && (flags & OPT_CUDA_COMPAT_MODE_LDCONFIG)) {
error_setx(&ctx->err, "only one cuda-compat-mode can be specified at a time");
return (NULL);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extract this into a function / simplify it to modify the flags inline / return an error as necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the new version where I break this logic into a function. I return an error code and set the flags value by reference.

@@ -25,7 +25,6 @@ const struct argp list_usage = {
{"no-persistenced", 0x84, NULL, 0, "Don't include the NVIDIA persistenced socket", -1},
{"no-fabricmanager", 0x85, NULL, 0, "Don't include the NVIDIA fabricmanager socket", -1},
{"no-gsp-firmware", 0x86, NULL, 0, "Don't include GSP Firmware", -1},
{"no-cntlibs", 0x87, NULL, 0, "Don't overwrite host mounts with CUDA compat libs from the container", -1},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can we just remove this without deprecating it? Do we also want to add a cuda-compat-mode flag here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was erroneously added to mirror the --no-cntlibs option in the configure command. Note that the list command does NOT operate on a container and as such this option never had any effect since there are no container libraries.

elezar added 2 commits May 12, 2025 16:59
This changes adds a --cuda-compat-mode flag to the configure
CLI. This allows more flexibility than the existing --no-cntlibs flag.

Possible values of the flag are:
* mount (default) - CUDA compat libraries are mounted from /usr/local/cuda/compat to
  the standard library path in the container.
* ldconfig - The folder containing the CUDA compat libraries is added as a command
  line argument to the ldconfig command executed in the container.
* disabled - This is equivalent ot specifying the --no-cntlibs flag and skips
  the detection and injection of compat libraries from the container to the
  container entirely.

Signed-off-by: Evan Lezar <[email protected]>
Since the compat libraries are discovered as part of container
creation, the --no-cntlibs option that disables their discovery
is not relevant to the nvidia-container-cli list command.

This change removes this flag.

Signed-off-by: Evan Lezar <[email protected]>
@elezar elezar force-pushed the forward-compat-by-folder branch from 0498c01 to e03b6a8 Compare May 12, 2025 14:59
@elezar elezar requested review from klueska and cdesiniotis May 13, 2025 10:55
@elezar elezar merged commit d26524a into NVIDIA:main May 13, 2025
3 checks passed
@elezar elezar deleted the forward-compat-by-folder branch May 13, 2025 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants