Add cuda-compat-mode flag to configure command #307

elezar · 2025-04-29T12:46:20Z

This changes adds a --cuda-compat-mode flag to the configure CLI. This allows more flexibility than the existing --no-cntlibs flag.

Possible values of the flag are:

mount (default) - CUDA compat libraries are mounted from /usr/local/cuda/compat to the standard library path in the container.
ldconfig - The folder containing the CUDA compat libraries is added as a command line argument to the ldconfig command executed in the container.
disabled - This is equivalent ot specifying the --no-cntlibs flag.

Note that the default behaviour is not changed.

This is levaraged by NVIDIA/nvidia-container-toolkit#1055 to provide more flexibility in how CUDA Forward Compat is handled.

klueska · 2025-04-30T14:21:25Z

First comment before looking at any code changes -- I would say that passing --no-cntlibs causes the --cuda-compat-mode flag to be ignored in favor of not injecting the libs at all. Maybe this is already the case, but the way the description of this PR reads, it seems that it only matches to the disabled mode of the new flag.

klueska

First pass

Makefile

klueska · 2025-04-30T14:23:03Z

src/cli/configure.c

+                        // We add cuda-compat-mode=$arg to the container_flags.
+                        if (str_join(&err, &ctx->container_flags, "cuda-compat-mode", " ") < 0)
+                                goto fatal;
+                        if (str_join(&err, &ctx->container_flags, arg, "=") < 0)
+                                goto fatal;


Is there any precedent for a flag of this type that can take on multiple values in this CLI?

Do you mean apart from the device selection flags that are comma-separated strings?

There are no other flags that select from a distinct set of options. This is the first such flag.

The --require flag could also be considered similar.

src/cli/configure.c

klueska · 2025-04-30T14:25:08Z

src/nvc_container.c

+        // If CUDA Compat is enabled and neither cuda-compat-mode=mount nor cuda-compat-mode=ldconfig is specified
+        // default to cuda-compat-mode=mount to maintain backward compatibility.


I don't know how consistent we've been, but I believe this code base prefers the /* ... */ style comments over //

Not sure about consistency. Will update though.

klueska · 2025-04-30T14:27:42Z

src/nvc_container.c

+        if ((flags & OPT_CUDA_COMPAT_MODE_MOUNT) & (flags & OPT_CUDA_COMPAT_MODE_LDCONFIG)) {
+                error_setx(&ctx->err, "only one cuda-compat-mode can be specified at a time");
+                return (NULL);
+        }
+        // If CUDA Compat is enabled and neither cuda-compat-mode=mount nor cuda-compat-mode=ldconfig is specified
+        // default to cuda-compat-mode=mount to maintain backward compatibility.
+        if (!(flags & OPT_NO_CNTLIBS) & !(flags & OPT_CUDA_COMPAT_MODE_MOUNT) & !(flags & OPT_CUDA_COMPAT_MODE_LDCONFIG)) {
+                flags &= OPT_CUDA_COMPAT_MODE_MOUNT;
+        }


I think we need an additional check here to say that if OPT_NO_CNTLIBS is set then we undo OPT_CUDA_COMPAT_MODE_MOUNT and OPT_CUDA_COMPAT_MODE_LDCONFIG if they happen to also be set.

As implemented, the code that checks OPT_CUDA_COMPAT_MODE* later is never triggered if OPT_NO_CNTLIBS is set, but being explicit here is better.

klueska · 2025-04-30T14:36:43Z

src/nvc_mount.c

-                size_t nlibs = cnt->nlibs;
-                char **libs = array_copy(&ctx->err, (const char * const *)cnt->libs, cnt->nlibs);
-                if (libs == NULL)
-                        goto fail;
-
-                filter_libraries(info, libs, &nlibs);
-                if ((tmp = (const char **)mount_files(&ctx->err, cnt->cfg.rootfs, cnt, cnt->cfg.libs_dir, libs, nlibs)) == NULL) {
-                        free(libs);
+                if ((tmp = (const char **)mount_files(&ctx->err, cnt->cfg.rootfs, cnt, cnt->cfg.libs_dir, cnt->libs, cnt->nlibs)) == NULL) {


naively looking at this, its not clear why the removed code here got removed. The body of mount_files() hasn't changed, so why is this "prep work" no longer necessary.

The reason that this is not longer required is that the filtering has been moved to be run as SOON as we have access to the container and driver information. This happens in configure.c.

This means that this function is called with the correct set of libs depending on the selected mode.

src/options.h

klueska · 2025-04-30T14:38:23Z

src/cli/compat_mode.h

+enum {
+    OPT_CUDA_COMPAT_MODE_MOUNT    = 1 << 14,
+    OPT_CUDA_COMPAT_MODE_LDCONFIG = 1 << 15,
+};


I think we should have an explicit OPT_CUDA_COMPAT_MODE_DISABLED option here

klueska · 2025-04-30T14:39:34Z

src/cli/compat_mode.c

+static int get_compat_library_path(struct error *, const char * [], size_t, char **);
+
+int
+update_compat_libraries(struct nvc_context *ctx, struct nvc_container *cnt, const struct nvc_driver_info *info) {


Is this really an update or more of a "maybe add" for the compat libraries?

The compat libraries are originally discovered in nvc_container_new and this function updates them based on the options (although OPT_CUDA_COMPAT_MODE_DISABLED is strictly speaking already handled) and the driver version. Ideally we would make the following changes:

Switch the order of initialising the driver and the container.

Pass the driver info to nvc_container_new so that we can perform the filtering there.

The issue with this (especially 2.) is that it breaks ABI compatibility for users that rely on this library. I elected to update this information here instead so as to have later code function consistently.

Note that we need the compat library information in two possible places:

When mounting the driver files in nvc_driver_mount

When running ldconfig in nvc_ldcache_update

I just realised that I didn't really answer your question:

The function does the following:

If the the set of libraries is empty it does nothing (this is the case if cuda-compat-mode=disabled).

It filters (and updates) the set of libraries based on the major version and cuda-compat-mode.

Updates cuda_compat_dir if required.

klueska · 2025-04-30T14:40:18Z

src/cli/compat_mode.c

+        size_t nlibs = cnt->nlibs;
+        char **libs = array_copy(&ctx->err, (const char * const *)cnt->libs, cnt->nlibs);
+        if (libs == NULL) {
+                return -1;
+        }
+
+        // For cuda-compat-mode=mount, we also allow compat libraries with a LOWER major versions.
+        bool allow_lower_major_versions = (cnt-> flags & OPT_CUDA_COMPAT_MODE_MOUNT);
+        filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs);
+
+        // We free the previously allocated libs to allow these to be reassigned if required.
+        free(cnt->libs);
+        cnt->libs = NULL;
+        cnt->nlibs = 0;


This feels messy but I don't have any concrete suggestions to improve it off the top of my head

I'm also not 100% happy with it. I think I may be struggling because I'm not properly "thinking in C" at the moment.

elezar · 2025-05-02T11:25:40Z

First comment before looking at any code changes -- I would say that passing --no-cntlibs causes the --cuda-compat-mode flag to be ignored in favor of not injecting the libs at all. Maybe this is already the case, but the way the description of this PR reads, it seems that it only matches to the disabled mode of the new flag.

Yes, the --no-cntlibs flag takes precedence over the --cuda-compute-mode options specified. If specified, no libraries are ever discovered. --cuda-compute-mode=disabled is equivalent to --no-cntlibs in terms of behaviour.

cdesiniotis · 2025-05-07T16:52:46Z

src/cli/compat_mode.c

+        filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs);
+
+        // We free the previously allocated libs to allow these to be reassigned if required.
+        free(cnt->libs);
+        cnt->libs = NULL;
+        cnt->nlibs = 0;
+
+        if (cnt->flags & OPT_CUDA_COMPAT_MODE_LDCONFIG) {
+                if (get_compat_library_path(&ctx->err, (const char **)libs, nlibs, &cnt->cuda_compat_dir) < 0) {
+                        goto fail;
+                }
+                // For cuda-compat-mode=ldconfig we don't require the library
+                // paths since we already have the compat dir stored.
+                free(libs);
+        }
+        if (cnt-> flags & OPT_CUDA_COMPAT_MODE_MOUNT) {
+                // Munting is handled later. We only need to update the required
+                // container information.
+                cnt->libs = libs;
+                cnt->nlibs = nlibs;
+        }
+        return (0);
+fail:
+        free(libs);
+        return (-1);
+}


Is there any harm in not freeing the cnt->libs array in this function when cuda-compat-mode=ldconfig? If it is not required, then what about the below suggestion for readability sake?

Suggested change

filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs);

// We free the previously allocated libs to allow these to be reassigned if required.

free(cnt->libs);

cnt->libs = NULL;

cnt->nlibs = 0;

if (cnt->flags & OPT_CUDA_COMPAT_MODE_LDCONFIG) {

if (get_compat_library_path(&ctx->err, (const char **)libs, nlibs, &cnt->cuda_compat_dir) < 0) {

goto fail;

}

// For cuda-compat-mode=ldconfig we don't require the library

// paths since we already have the compat dir stored.

free(libs);

}

if (cnt-> flags & OPT_CUDA_COMPAT_MODE_MOUNT) {

// Munting is handled later. We only need to update the required

// container information.

cnt->libs = libs;

cnt->nlibs = nlibs;

}

return (0);

fail:

free(libs);

return (-1);

}

filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs);

// Use the filtered library list

free(cnt->libs);

cnt->libs = libs;

cnt->nlibs = nlibs;

if (cnt->flags & OPT_CUDA_COMPAT_MODE_LDCONFIG) {

if (get_compat_library_path(&ctx->err, (const char **)libs, nlibs, &cnt->cuda_compat_dir) < 0) {

return (1);

}

}

return (0);

}

I don't think there's any harm in doing it the way you propose, but I wanted to ensure that cnt->libs and cnt->nlibs are consistently set to NULL and 0 to prevent inadvertent usage. Since we now also explicilty check the options where these are used that should not be as much of an issue and your suggestion is definitely simpler.

klueska · 2025-05-12T14:13:27Z

src/nvc_container.c

+        if (flags & OPT_CUDA_COMPAT_MODE_DISABLED) {
+                /*
+                 * If the OPT_CUDA_COMPAT_MODE_DISABLED flag is specified, we
+                 * explicitly ignore other OP_CUDA_COMPAT_MODE_* flags.
+                 */
+                flags &= ~(OPT_CUDA_COMPAT_MODE_MOUNT | OPT_CUDA_COMPAT_MODE_LDCONFIG);
+        } else {
+                if (!(flags & (OPT_CUDA_COMPAT_MODE_LDCONFIG | OPT_CUDA_COMPAT_MODE_MOUNT))) {
+                        /*
+                         * If no OPT_CUDA_COMPAT_MODE_* flags are specified,
+                         * default to OPT_CUDA_COMPAT_MODE_MOUNT to maintain
+                         * backward compatibility.
+                         */
+                        flags &= OPT_CUDA_COMPAT_MODE_MOUNT;
+                }
+        }
+        if ((flags & OPT_CUDA_COMPAT_MODE_MOUNT) && (flags & OPT_CUDA_COMPAT_MODE_LDCONFIG)) {
+                error_setx(&ctx->err, "only one cuda-compat-mode can be specified at a time");
+                return (NULL);
+        }


Can we extract this into a function / simplify it to modify the flags inline / return an error as necessary.

Please see the new version where I break this logic into a function. I return an error code and set the flags value by reference.

klueska · 2025-05-12T14:27:08Z

src/cli/list.c

@@ -25,7 +25,6 @@ const struct argp list_usage = {
                {"no-persistenced", 0x84, NULL, 0, "Don't include the NVIDIA persistenced socket", -1},
                {"no-fabricmanager", 0x85, NULL, 0, "Don't include the NVIDIA fabricmanager socket", -1},
                {"no-gsp-firmware", 0x86, NULL, 0, "Don't include GSP Firmware", -1},
-                {"no-cntlibs", 0x87, NULL, 0, "Don't overwrite host mounts with CUDA compat libs from the container", -1},


Why can we just remove this without deprecating it? Do we also want to add a cuda-compat-mode flag here?

It was erroneously added to mirror the --no-cntlibs option in the configure command. Note that the list command does NOT operate on a container and as such this option never had any effect since there are no container libraries.

This changes adds a --cuda-compat-mode flag to the configure CLI. This allows more flexibility than the existing --no-cntlibs flag. Possible values of the flag are: * mount (default) - CUDA compat libraries are mounted from /usr/local/cuda/compat to the standard library path in the container. * ldconfig - The folder containing the CUDA compat libraries is added as a command line argument to the ldconfig command executed in the container. * disabled - This is equivalent ot specifying the --no-cntlibs flag and skips the detection and injection of compat libraries from the container to the container entirely. Signed-off-by: Evan Lezar <[email protected]>

Since the compat libraries are discovered as part of container creation, the --no-cntlibs option that disables their discovery is not relevant to the nvidia-container-cli list command. This change removes this flag. Signed-off-by: Evan Lezar <[email protected]>

elezar self-assigned this Apr 29, 2025

elezar requested review from klueska and ArangoGutierrez April 29, 2025 12:46

elezar mentioned this pull request Apr 29, 2025

Add nvidia-container-cli.compat-mode config option NVIDIA/nvidia-container-toolkit#1055

Merged

elezar force-pushed the forward-compat-by-folder branch from ec22785 to 96ec4e1 Compare April 30, 2025 12:51

klueska reviewed Apr 30, 2025

View reviewed changes

elezar force-pushed the forward-compat-by-folder branch 2 times, most recently from 5c14039 to 62c339d Compare May 2, 2025 14:01

elezar marked this pull request as ready for review May 6, 2025 11:36

elezar requested review from klueska and cdesiniotis May 6, 2025 11:36

cdesiniotis reviewed May 7, 2025

View reviewed changes

elezar force-pushed the forward-compat-by-folder branch 3 times, most recently from 2282d50 to 0498c01 Compare May 12, 2025 13:19

klueska reviewed May 12, 2025

View reviewed changes

elezar added 2 commits May 12, 2025 16:59

elezar force-pushed the forward-compat-by-folder branch from 0498c01 to e03b6a8 Compare May 12, 2025 14:59

elezar requested review from klueska and cdesiniotis May 13, 2025 10:55

elezar merged commit d26524a into NVIDIA:main May 13, 2025
3 checks passed

elezar deleted the forward-compat-by-folder branch May 13, 2025 19:46

elezar mentioned this pull request May 13, 2025

Bump third_party/libnvidia-container from a198166 to d26524a NVIDIA/nvidia-container-toolkit#1081

Merged

		// If CUDA Compat is enabled and neither cuda-compat-mode=mount nor cuda-compat-mode=ldconfig is specified
		// default to cuda-compat-mode=mount to maintain backward compatibility.

Add cuda-compat-mode flag to configure command #307

Add cuda-compat-mode flag to configure command #307

Uh oh!

Conversation

elezar commented Apr 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

klueska commented Apr 30, 2025

Uh oh!

klueska left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

elezar commented May 2, 2025

Uh oh!

cdesiniotis May 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

elezar commented Apr 29, 2025 •

edited

Loading

cdesiniotis May 7, 2025 •

edited

Loading