Summary
When trying to enable the NVIDIA addon in MicroK8s, I encountered an error message. After attempting to disable and re-enable the addon, MicroK8s incorrectly states that the addon is already enabled, despite no related pods running in any namespace.
What Should Happen Instead?
Enabling the operator should work without errors. Disabling the operator should actually disable it, and MicroK8s should not report the operator as enabled when it isn't (nor should it refuse to re-enable an addon based on an incorrect check).
Detailed Story
When I first tried to enable the addon, it failed with an odd message:
$ microk8s enable nvidia --no-network-operator --gpu-operator --gpu-operator-driver=host --gpu-operator-version=v24.6.0
Infer repository core for addon nvidia
Addon core/dns is already enabled
Addon core/helm3 is already enabled
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Deploy NVIDIA GPU operator
Using host GPU driver
Error: INSTALLATION FAILED: Post "https://127.0.0.1:16443/apis/rbac.authorization.k8s.io/v1/namespaces/gpu-operator/roles?fieldManager=helm": unexpected EOF
Deployed NVIDIA GPU operator
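The "unexpected EOF" from https://127.0.0.1:16443 suggests the API server dropped the connection mid-install. A quick sanity check would be something like this (a sketch; snap.microk8s.daemon-kubelite is the API-server service name in recent MicroK8s snaps, which I assume applies to this revision):
$ microk8s status --wait-ready
$ sudo journalctl -u snap.microk8s.daemon-kubelite --since "10 minutes ago"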
I checked whether any pod related to the NVIDIA GPU operator was running in any namespace, but there was nothing: no new namespace had been created, and the only pods were in the kube-system namespace.
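Roughly the commands used for that check (a sketch; the grep pattern just assumes the operator's resources contain "gpu-operator" in their names):
$ microk8s kubectl get namespaces
$ microk8s kubectl get pods --all-namespaces | grep -i gpu-operator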
Then I decided to disable and enable it again:
$ microk8s disable nvidia
Traceback (most recent call last):
File "/snap/microk8s/7040/scripts/wrappers/disable.py", line 44, in<module>
disable(prog_name="microk8s disable")
File "/snap/microk8s/7040/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/snap/microk8s/7040/usr/lib/python3/dist-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/snap/microk8s/7040/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/snap/microk8s/7040/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/snap/microk8s/7040/scripts/wrappers/disable.py", line 40, in disable
xable("disable", addons)
File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 470, in xable
protected_xable(action, addon_args)
File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 498, in protected_xable
unprotected_xable(action, addon_args)
File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 514, in unprotected_xable
enabled_addons_info, disabled_addons_info = get_status(available_addons_info, True)
File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 566, in get_status
kube_output = kubectl_get("all,ingress")
File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 248, in kubectl_get
return run(KUBECTL, "get", cmd, "--all-namespaces", die=False)
File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 69, in run
result.check_returncode()
File "/snap/microk8s/7040/usr/lib/python3.8/subprocess.py", line 448, in check_returncode
raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '('/snap/microk8s/7040/microk8s-kubectl.wrapper', 'get', 'all,ingress', '--all-namespaces')' returned non-zero exit status 1.
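For anyone debugging the same traceback: the failing call can be reproduced outside the Python wrapper to see the actual kubectl error (a sketch using the same arguments the traceback shows):
$ microk8s kubectl get all,ingress --all-namespaces
$ /snap/microk8s/7040/microk8s-kubectl.wrapper get all,ingress --all-namespaces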
That traceback was a bit unsettling, but I tried the disable again:
$ microk8s disable nvidia
Infer repository core for addon nvidia
Disabling NVIDIA support
NVIDIA support disabled
However, when I tried to enable the addon again, I got a message that it is already enabled:
$ microk8s enable nvidia --no-network-operator --gpu-operator --gpu-operator-driver=host --gpu-operator-version=v24.6.0 --force
Infer repository core for addon nvidia
Addon core/nvidia is already enabled
This seems to be a bug, and it leaves me in limbo: disabling the addon does not appear to do anything, and enabling it again does not work because MicroK8s believes it is already enabled.
Can anyone help me solve this problem and get the nvidia addon working?
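To double-check what MicroK8s itself reports, and what the status code in the traceback above actually queries, something like this can be run (a sketch; output omitted):
$ microk8s status | grep -i nvidia
$ microk8s kubectl get all,ingress --all-namespaces | grep -i -E 'nvidia|gpu-operator'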
Reproduction Steps
Hardware and OS used:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
$ nvidia-smi
Fri Aug 2 15:09:43 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A2 Off | 00000000:98:00.0 Off | 0 |
| 0% 39C P0 20W / 60W | 0MiB / 15356MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
1. Run microk8s enable nvidia --no-network-operator --gpu-operator --gpu-operator-driver=host --gpu-operator-version=v24.6.0
2. Check for running pods related to the NVIDIA GPU operator in any namespace.
3. Run microk8s disable nvidia
4. Run microk8s enable nvidia --no-network-operator --gpu-operator --gpu-operator-driver=host --gpu-operator-version=v24.6.0
The false positive seems to originate from this check (which may be a bit fragile), but perhaps also from the fact that microk8s disable nvidia does not delete any of the cluster roles created during microk8s enable nvidia.
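A possible manual cleanup before re-enabling, which I have not verified (a sketch; the gpu-operator release name is an assumption, and the gpu-operator namespace comes from the helm error above):
$ microk8s helm3 list --all-namespaces                   # see whether a gpu-operator release was left behind
$ microk8s helm3 uninstall gpu-operator -n gpu-operator  # release name is an assumption
$ microk8s kubectl delete namespace gpu-operator         # if it exists
$ microk8s kubectl get clusterroles,clusterrolebindings | grep -i gpu-operator   # then delete whatever this finds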
Introspection Report
inspection-report-20240802_151308.tar.gz
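For completeness, the report above was generated with the standard inspection command:
$ microk8s inspect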
Can you suggest a fix?
I don't have any ideas for solving this problem; I'm really looking for help.
Are you interested in contributing with a fix?
No, I am not able to contribute a fix at this time.