
microk8s v1.26 daemon-kubelite crashloop with cgroup error #3895

Closed

jpalpant opened this issue Apr 2, 2023 · 4 comments

jpalpant commented Apr 2, 2023

Summary

I am running microk8s v1.26 inside WSL2 (Ubuntu 22.04) via the snap. At some point, after an unknown amount of time and some system updates, I noticed that microk8s wasn't behaving correctly. microk8s inspect said services were starting, but I eventually noticed that daemon-kubelite was crashlooping with this message:

Apr 01 22:49:00 windows-node-01 microk8s.daemon-kubelite[12820]: E0401 22:49:00.771376 12820 cgroup_manager_linux.go:472] cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods/cpu.weight: no such file or directory
Apr 01 22:49:00 windows-node-01 microk8s.daemon-kubelite[12820]: E0401 22:49:00.771456 12820 kubelet.go:1466] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
Apr 01 22:49:01 windows-node-01 systemd[1]: snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=1/FAILURE
Apr 01 22:49:01 windows-node-01 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
Apr 01 22:49:01 windows-node-01 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 2.087s CPU time.
Apr 01 22:49:01 windows-node-01 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 31.
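
These lines come from the systemd journal for the kubelite unit, and can be followed live with:

$ journalctl -u snap.microk8s.daemon-kubelite -f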

Any advice on what I could look into to track this down? I expect my machine is misconfigured, but it could be something others run into. I don't have any experience with cgroups but am happy to pull more information or logs if it's helpful.

$ uname -a
Linux windows-node-01 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

What Should Happen Instead?

daemon-kubelite should start and run normally.

Reproduction Steps

Unfortunately, I'm not sure how I got into this situation, except possibly through standard Windows updates.

Introspection Report

inspection-report-20230401_225828.tar.gz

Can you suggest a fix?

Are you interested in contributing with a fix?

@ktsakalozos (Member)

Hi @jpalpant, I wonder if a kernel update disabled cgroups. Can you have a look at https://stackoverflow.com/questions/73021599/how-to-enable-cgroup-v2-in-wsl2?

Looking at the logs you attached I see

[    0.000000] Linux version 5.15.90.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Jan 27 02:56:13 UTC 2023
[    0.000000] Command line: initrd=\initrd.img WSL_ROOT_INIT=1 panic=-1 nr_cpus=20 cgroup_no_v1=all swiotlb=force console=hvc0 debug pty.legacy_count=0 cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory

... which seems right.

jpalpant (Author) commented Apr 3, 2023

Looking on the machine, I see

$ mount -l | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)

I think that's the right mount for cgroup v2 to be enabled; is that right?
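
As an extra sanity check, stat can report the filesystem type directly; on cgroup v2 it should print cgroup2fs:

$ stat -fc %T /sys/fs/cgroup
cgroup2fs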

I also do see

$ cat /proc/cmdline
initrd=\initrd.img WSL_ROOT_INIT=1 panic=-1 nr_cpus=20 cgroup_no_v1=all swiotlb=force console=hvc0 debug pty.legacy_count=0 cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory

Is that a valid set of kernel arguments? It seems like cgroups should be enabled, but I'm not sure where else I ought to check. I do see that /sys/fs/cgroup/kubepods exists, but it doesn't have a cpu.weight file:

$ ls /sys/fs/cgroup/kubepods/
cgroup.controllers      cgroup.stat             cpuset.cpus.partition     hugetlb.1GB.rsvd.current  hugetlb.2MB.rsvd.max  memory.max           memory.swap.max  rdma.max
cgroup.events           cgroup.subtree_control  cpuset.mems               hugetlb.1GB.rsvd.max      io.stat               memory.min           misc.current
cgroup.freeze           cgroup.threads          cpuset.mems.effective     hugetlb.2MB.current       memory.current        memory.oom.group     misc.max
cgroup.kill             cgroup.type             hugetlb.1GB.current       hugetlb.2MB.events        memory.events         memory.stat          pids.current
cgroup.max.depth        cpu.stat                hugetlb.1GB.events        hugetlb.2MB.events.local  memory.events.local   memory.swap.current  pids.events
cgroup.max.descendants  cpuset.cpus             hugetlb.1GB.events.local  hugetlb.2MB.max           memory.high           memory.swap.events   pids.max
cgroup.procs            cpuset.cpus.effective   hugetlb.1GB.max           hugetlb.2MB.rsvd.current  memory.low            memory.swap.high     rdma.current
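
The cgroup.controllers file lists the controllers available to a cgroup, so comparing the root cgroup against kubepods should show whether cpu was actually delegated down; given the listing above (no cpu.weight or cpu.max), I'd expect cpu to be missing from the second one:

$ cat /sys/fs/cgroup/cgroup.controllers
$ cat /sys/fs/cgroup/kubepods/cgroup.controllers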

jpalpant (Author) commented Apr 10, 2023

I think I may have resolved this. I discovered that it is usually possible to enable a cgroup controller by writing to a cgroup's cgroup.subtree_control file (cgroup.controllers lists the controllers available to enable), so I wanted to manually enable the "cpu" controller. I couldn't do so:

echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
bash: echo: write error: Invalid argument

I found an article suggesting that you may not enable the cpu controller while any process is using real-time scheduling. I checked for realtime processes using the ps command it suggested, ps -T axo pid,ppid,user,group,lwp,nlwp,start_time,comm,cgroup,cls | grep RR, and discovered that I had installed rtkit and that rtkit-daemon was running. I don't think I care about rtkit, so I removed it with apt remove rtkit.
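
After removing rtkit, the same check should come back empty, meaning no realtime-scheduled threads are left:

$ ps -T axo pid,ppid,user,group,lwp,nlwp,start_time,comm,cgroup,cls | grep RR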

After that, I was able to enable the cpu cgroup v2 controller. I did so manually for both the top-level cgroup and the "kubepods" cgroup:

$ echo '+cpu' >> /sys/fs/cgroup/cgroup.subtree_control
$ echo '+cpu' >> /sys/fs/cgroup/kubepods/cgroup.subtree_control
$ ls  /sys/fs/cgroup/kubepods/
besteffort              cgroup.procs            cpu.stat               hugetlb.1GB.current       hugetlb.2MB.events.local  memory.high          memory.swap.high  rdma.max       
burstable               cgroup.stat             cpu.weight             hugetlb.1GB.events        hugetlb.2MB.max           memory.low           memory.swap.max
cgroup.controllers      cgroup.subtree_control  cpu.weight.nice        hugetlb.1GB.events.local  hugetlb.2MB.rsvd.current  memory.max           misc.current
cgroup.events           cgroup.threads          cpuset.cpus            hugetlb.1GB.max           hugetlb.2MB.rsvd.max      memory.min           misc.max
cgroup.freeze           cgroup.type             cpuset.cpus.effective  hugetlb.1GB.rsvd.current  io.stat                   memory.oom.group     pids.current
cgroup.kill             cpu.idle                cpuset.cpus.partition  hugetlb.1GB.rsvd.max      memory.current            memory.stat          pids.events
cgroup.max.depth        cpu.max                 cpuset.mems            hugetlb.2MB.current       memory.events             memory.swap.current  pids.max
cgroup.max.descendants  cpu.max.burst           cpuset.mems.effective  hugetlb.2MB.events        memory.events.local       memory.swap.events   rdma.current

And that worked. At this point, I'm not sure whether any kernel parameters were actually needed, but at the moment I have the following ~/.wslconfig:

[wsl2]
localhostForwarding = true
swap = 0
kernelCommandLine = cgroup_no_v1=all cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory cgroup_enable=cpu systemd.unified_cgroup_hierarchy=1

It may not be necessary to enable the cgroup controllers manually either: after rtkit was removed, I killed and restarted the WSL2 machine with wsl --shutdown, and /sys/fs/cgroup/kubepods had the cpu controller enabled without any intervention. So I think it was just rtkit that is incompatible with microk8s and the cpu cgroup controller.
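
For anyone verifying after a restart, something like this should confirm the controller is delegated and that microk8s is healthy again:

$ cat /sys/fs/cgroup/kubepods/cgroup.controllers
$ microk8s status --wait-ready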

@fzhougithub

@jpalpant Thanks a lot! I had been struggling with this error for a long time, until I followed your solution:

systemctl stop rtkit-daemon
systemctl disable rtkit-daemon
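
If it ever comes back after a package update, masking the unit should keep it from starting again:

$ sudo systemctl mask rtkit-daemon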

kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=0.0.0.0 --cri-socket=unix:///run/containerd/containerd.sock --ignore-preflight-errors=all --v=5

Your Kubernetes control-plane has initialized successfully!

Thanks!
