-
Notifications
You must be signed in to change notification settings - Fork 231
Add cuda-compat-mode flag to configure command #307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ec22785
to
96ec4e1
Compare
First comment before looking at any code changes -- I would say that passing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass
src/cli/configure.c
Outdated
// We add cuda-compat-mode=$arg to the container_flags. | ||
if (str_join(&err, &ctx->container_flags, "cuda-compat-mode", " ") < 0) | ||
goto fatal; | ||
if (str_join(&err, &ctx->container_flags, arg, "=") < 0) | ||
goto fatal; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any precedent for a flag of this type that can take on multiple values in this CLI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean apart from the device selection flags that are comma-separated strings?
There are no other flags that select from a distinct set of options. This is the first such flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The --require
flag could also be considered similar.
src/nvc_container.c
Outdated
// If CUDA Compat is enabled and neither cuda-compat-mode=mount nor cuda-compat-mode=ldconfig is specified | ||
// default to cuda-compat-mode=mount to maintain backward compatibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how consistent we've been, but I believe this code base prefers the /* ... */ style comments over //
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about consistency. Will update though.
src/nvc_container.c
Outdated
if ((flags & OPT_CUDA_COMPAT_MODE_MOUNT) & (flags & OPT_CUDA_COMPAT_MODE_LDCONFIG)) { | ||
error_setx(&ctx->err, "only one cuda-compat-mode can be specified at a time"); | ||
return (NULL); | ||
} | ||
// If CUDA Compat is enabled and neither cuda-compat-mode=mount nor cuda-compat-mode=ldconfig is specified | ||
// default to cuda-compat-mode=mount to maintain backward compatibility. | ||
if (!(flags & OPT_NO_CNTLIBS) & !(flags & OPT_CUDA_COMPAT_MODE_MOUNT) & !(flags & OPT_CUDA_COMPAT_MODE_LDCONFIG)) { | ||
flags &= OPT_CUDA_COMPAT_MODE_MOUNT; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need an additional check here to say that if OPT_NO_CNTLIBS
is set then we undo OPT_CUDA_COMPAT_MODE_MOUNT
and OPT_CUDA_COMPAT_MODE_LDCONFIG
if they happen to also be set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As implemented, the code that checks OPT_CUDA_COMPAT_MODE*
later is never triggered if OPT_NO_CNTLIBS
is set, but being explicit here is better.
src/nvc_mount.c
Outdated
size_t nlibs = cnt->nlibs; | ||
char **libs = array_copy(&ctx->err, (const char * const *)cnt->libs, cnt->nlibs); | ||
if (libs == NULL) | ||
goto fail; | ||
|
||
filter_libraries(info, libs, &nlibs); | ||
if ((tmp = (const char **)mount_files(&ctx->err, cnt->cfg.rootfs, cnt, cnt->cfg.libs_dir, libs, nlibs)) == NULL) { | ||
free(libs); | ||
if ((tmp = (const char **)mount_files(&ctx->err, cnt->cfg.rootfs, cnt, cnt->cfg.libs_dir, cnt->libs, cnt->nlibs)) == NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naively looking at this, its not clear why the removed code here got removed. The body of mount_files()
hasn't changed, so why is this "prep work" no longer necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that this is not longer required is that the filtering has been moved to be run as SOON as we have access to the container and driver information. This happens in configure.c
.
This means that this function is called with the correct set of libs depending on the selected mode.
src/cli/compat_mode.h
Outdated
enum { | ||
OPT_CUDA_COMPAT_MODE_MOUNT = 1 << 14, | ||
OPT_CUDA_COMPAT_MODE_LDCONFIG = 1 << 15, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have an explicit OPT_CUDA_COMPAT_MODE_DISABLED option here
static int get_compat_library_path(struct error *, const char * [], size_t, char **); | ||
|
||
int | ||
update_compat_libraries(struct nvc_context *ctx, struct nvc_container *cnt, const struct nvc_driver_info *info) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really an update or more of a "maybe add" for the compat libraries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compat libraries are originally discovered in nvc_container_new
and this function updates them based on the options (although OPT_CUDA_COMPAT_MODE_DISABLED
is strictly speaking already handled) and the driver version. Ideally we would make the following changes:
- Switch the order of initialising the driver and the container.
- Pass the driver info to
nvc_container_new
so that we can perform the filtering there.
The issue with this (especially 2.) is that it breaks ABI compatibility for users that rely on this library. I elected to update this information here instead so as to have later code function consistently.
Note that we need the compat library information in two possible places:
- When mounting the driver files in
nvc_driver_mount
- When running ldconfig in
nvc_ldcache_update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realised that I didn't really answer your question:
The function does the following:
- If the the set of libraries is empty it does nothing (this is the case if
cuda-compat-mode=disabled
). - It filters (and updates) the set of libraries based on the major version and
cuda-compat-mode
. - Updates
cuda_compat_dir
if required.
src/cli/compat_mode.c
Outdated
size_t nlibs = cnt->nlibs; | ||
char **libs = array_copy(&ctx->err, (const char * const *)cnt->libs, cnt->nlibs); | ||
if (libs == NULL) { | ||
return -1; | ||
} | ||
|
||
// For cuda-compat-mode=mount, we also allow compat libraries with a LOWER major versions. | ||
bool allow_lower_major_versions = (cnt-> flags & OPT_CUDA_COMPAT_MODE_MOUNT); | ||
filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs); | ||
|
||
// We free the previously allocated libs to allow these to be reassigned if required. | ||
free(cnt->libs); | ||
cnt->libs = NULL; | ||
cnt->nlibs = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels messy but I don't have any concrete suggestions to improve it off the top of my head
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not 100% happy with it. I think I may be struggling because I'm not properly "thinking in C" at the moment.
Yes, the |
5c14039
to
62c339d
Compare
src/cli/compat_mode.c
Outdated
filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs); | ||
|
||
// We free the previously allocated libs to allow these to be reassigned if required. | ||
free(cnt->libs); | ||
cnt->libs = NULL; | ||
cnt->nlibs = 0; | ||
|
||
if (cnt->flags & OPT_CUDA_COMPAT_MODE_LDCONFIG) { | ||
if (get_compat_library_path(&ctx->err, (const char **)libs, nlibs, &cnt->cuda_compat_dir) < 0) { | ||
goto fail; | ||
} | ||
// For cuda-compat-mode=ldconfig we don't require the library | ||
// paths since we already have the compat dir stored. | ||
free(libs); | ||
} | ||
if (cnt-> flags & OPT_CUDA_COMPAT_MODE_MOUNT) { | ||
// Munting is handled later. We only need to update the required | ||
// container information. | ||
cnt->libs = libs; | ||
cnt->nlibs = nlibs; | ||
} | ||
return (0); | ||
fail: | ||
free(libs); | ||
return (-1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any harm in not freeing the cnt->libs
array in this function when cuda-compat-mode=ldconfig
? If it is not required, then what about the below suggestion for readability sake?
filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs); | |
// We free the previously allocated libs to allow these to be reassigned if required. | |
free(cnt->libs); | |
cnt->libs = NULL; | |
cnt->nlibs = 0; | |
if (cnt->flags & OPT_CUDA_COMPAT_MODE_LDCONFIG) { | |
if (get_compat_library_path(&ctx->err, (const char **)libs, nlibs, &cnt->cuda_compat_dir) < 0) { | |
goto fail; | |
} | |
// For cuda-compat-mode=ldconfig we don't require the library | |
// paths since we already have the compat dir stored. | |
free(libs); | |
} | |
if (cnt-> flags & OPT_CUDA_COMPAT_MODE_MOUNT) { | |
// Munting is handled later. We only need to update the required | |
// container information. | |
cnt->libs = libs; | |
cnt->nlibs = nlibs; | |
} | |
return (0); | |
fail: | |
free(libs); | |
return (-1); | |
} | |
filter_by_major_version(allow_lower_major_versions, info, libs, &nlibs); | |
// Use the filtered library list | |
free(cnt->libs); | |
cnt->libs = libs; | |
cnt->nlibs = nlibs; | |
if (cnt->flags & OPT_CUDA_COMPAT_MODE_LDCONFIG) { | |
if (get_compat_library_path(&ctx->err, (const char **)libs, nlibs, &cnt->cuda_compat_dir) < 0) { | |
return (1); | |
} | |
} | |
return (0); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's any harm in doing it the way you propose, but I wanted to ensure that cnt->libs
and cnt->nlibs
are consistently set to NULL
and 0
to prevent inadvertent usage. Since we now also explicilty check the options where these are used that should not be as much of an issue and your suggestion is definitely simpler.
2282d50
to
0498c01
Compare
src/nvc_container.c
Outdated
if (flags & OPT_CUDA_COMPAT_MODE_DISABLED) { | ||
/* | ||
* If the OPT_CUDA_COMPAT_MODE_DISABLED flag is specified, we | ||
* explicitly ignore other OP_CUDA_COMPAT_MODE_* flags. | ||
*/ | ||
flags &= ~(OPT_CUDA_COMPAT_MODE_MOUNT | OPT_CUDA_COMPAT_MODE_LDCONFIG); | ||
} else { | ||
if (!(flags & (OPT_CUDA_COMPAT_MODE_LDCONFIG | OPT_CUDA_COMPAT_MODE_MOUNT))) { | ||
/* | ||
* If no OPT_CUDA_COMPAT_MODE_* flags are specified, | ||
* default to OPT_CUDA_COMPAT_MODE_MOUNT to maintain | ||
* backward compatibility. | ||
*/ | ||
flags &= OPT_CUDA_COMPAT_MODE_MOUNT; | ||
} | ||
} | ||
if ((flags & OPT_CUDA_COMPAT_MODE_MOUNT) && (flags & OPT_CUDA_COMPAT_MODE_LDCONFIG)) { | ||
error_setx(&ctx->err, "only one cuda-compat-mode can be specified at a time"); | ||
return (NULL); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we extract this into a function / simplify it to modify the flags inline / return an error as necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the new version where I break this logic into a function. I return an error code and set the flags
value by reference.
@@ -25,7 +25,6 @@ const struct argp list_usage = { | |||
{"no-persistenced", 0x84, NULL, 0, "Don't include the NVIDIA persistenced socket", -1}, | |||
{"no-fabricmanager", 0x85, NULL, 0, "Don't include the NVIDIA fabricmanager socket", -1}, | |||
{"no-gsp-firmware", 0x86, NULL, 0, "Don't include GSP Firmware", -1}, | |||
{"no-cntlibs", 0x87, NULL, 0, "Don't overwrite host mounts with CUDA compat libs from the container", -1}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can we just remove this without deprecating it? Do we also want to add a cuda-compat-mode flag here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was erroneously added to mirror the --no-cntlibs
option in the configure
command. Note that the list
command does NOT operate on a container and as such this option never had any effect since there are no container libraries.
This changes adds a --cuda-compat-mode flag to the configure CLI. This allows more flexibility than the existing --no-cntlibs flag. Possible values of the flag are: * mount (default) - CUDA compat libraries are mounted from /usr/local/cuda/compat to the standard library path in the container. * ldconfig - The folder containing the CUDA compat libraries is added as a command line argument to the ldconfig command executed in the container. * disabled - This is equivalent ot specifying the --no-cntlibs flag and skips the detection and injection of compat libraries from the container to the container entirely. Signed-off-by: Evan Lezar <[email protected]>
Since the compat libraries are discovered as part of container creation, the --no-cntlibs option that disables their discovery is not relevant to the nvidia-container-cli list command. This change removes this flag. Signed-off-by: Evan Lezar <[email protected]>
0498c01
to
e03b6a8
Compare
This changes adds a
--cuda-compat-mode
flag to the configure CLI. This allows more flexibility than the existing--no-cntlibs
flag.Possible values of the flag are:
mount
(default) - CUDA compat libraries are mounted from/usr/local/cuda/compat
to the standard library path in the container.ldconfig
- The folder containing the CUDA compat libraries is added as a command line argument to the ldconfig command executed in the container.disabled
- This is equivalent ot specifying the--no-cntlibs
flag.Note that the default behaviour is not changed.
This is levaraged by NVIDIA/nvidia-container-toolkit#1055 to provide more flexibility in how CUDA Forward Compat is handled.