fix: enforce the minimum cgroup cpu shares value to 2 #10221

Merged
merged 1 commit into from
Feb 25, 2025

Conversation

phsm
Contributor

@phsm phsm commented Jan 20, 2025

Description

This PR ensures that the cpu shares value is never less than 2, for compatibility with Libvirt versions before 9.1.0.

Older libvirt versions, such as the libvirt 8.0.0 shipped with Ubuntu 22.04, enforce a hardcoded range of allowed cpu shares values, 2-262144, for both cgroup v1 and cgroup v2.
This range enforcement was removed in Libvirt 9.1.0, see: libvirt/libvirt@38af649

If a host has many cores and a very high CPU overprovisioning factor is set, the computed shares value can become 1.
In that case, the following exception is raised on the CloudStack Agent during provisioning:
org.libvirt.LibvirtException: unsupported configuration: Value of cputune 'shares' must be in range [2, 262144]

We noticed it when we tried to restart a Shared network with cleanup.
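
For illustration, the lower-bound guard described above could look roughly like the sketch below. This is not the code from this PR: the class, method, and constant names are hypothetical, and the version helper only assumes libvirt's usual numeric version encoding (major * 1,000,000 + minor * 1,000 + release).

```java
// Sketch only: all names here are hypothetical and are not taken from
// LibvirtComputingResource.
final class CpuSharesClamp {
    // Lower bound quoted in the libvirt error message: [2, 262144].
    static final int MIN_SHARES = 2;

    // 9.1.0 in libvirt's numeric encoding:
    // major * 1,000,000 + minor * 1,000 + release.
    static final long RANGE_CHECK_REMOVED_IN = 9_001_000L;

    // True for libvirt versions that still reject shares outside the range.
    static boolean enforcesRange(long libvirtVersion) {
        return libvirtVersion < RANGE_CHECK_REMOVED_IN;
    }

    // Never return a value below the minimum accepted by older libvirt.
    static int atLeastMinimum(int computedShares) {
        return Math.max(MIN_SHARES, computedShares);
    }

    public static void main(String[] args) {
        System.out.println(atLeastMinimum(1));         // raised to the minimum
        System.out.println(atLeastMinimum(512));       // left unchanged
        System.out.println(enforcesRange(8_000_000L)); // libvirt 8.0.0
        System.out.println(enforcesRange(9_001_000L)); // libvirt 9.1.0
    }
}
```

Applying the minimum unconditionally, rather than only when `enforcesRange` is true, keeps the behaviour identical across libvirt versions; the helper is included only to show which versions are affected.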

Steps to reproduce:

  1. Get a KVM hypervisor host with many cores that runs an affected Libvirt version. Ubuntu 22.04 works.
  2. Make sure that cgroup v2 is enabled on the hypervisor: mount | grep -q cgroup2 && echo "yes, enabled" should print "yes, enabled".
  3. Set the CPU overprovisioning ratio to a very high value, e.g. 1000.
  4. Try to restart any network with cleanup. Since the virtual routers have tiny CPU specs (1 core, 500 MHz by default), this should trigger the bug.
  5. The management server will report com.cloud.exception.InsufficientServerCapacityException: No destination found for a deployment for VM instance.
  6. On the agent, you will see the following message in the log: org.libvirt.LibvirtException: unsupported configuration: Value of cputune 'shares' must be in range [2, 262144]
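
The cgroup v2 check in step 2 can also be sketched in Java, for anyone who prefers to script it. The class and method names are hypothetical; the sketch assumes the standard /proc/mounts layout, where the third whitespace-separated field is the filesystem type. Like the grep, it only detects that a cgroup2 filesystem is mounted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch only: hypothetical helper, not part of CloudStack.
final class CgroupV2Check {
    // Each line of /proc/mounts is "device mountpoint fstype options dump pass".
    static boolean hasCgroup2Mount(List<String> mountLines) {
        for (String line : mountLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length >= 3 && "cgroup2".equals(fields[2])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Reads the live mount table; this file only exists on Linux.
        List<String> lines = Files.readAllLines(Paths.get("/proc/mounts"));
        System.out.println(hasCgroup2Mount(lines) ? "yes, enabled" : "no cgroup2 mount found");
    }
}
```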

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Tested on Ubuntu 22.04 with Libvirt 8.0.0.
After the patch was applied, the error was gone, and the virtual router appeared after the restart with the cpushares value 2.

How did you try to break this feature and the system with this change?

The only effective change is that the return value "1" is now excluded. It is highly unlikely to break anything.

@phsm phsm force-pushed the 4.20-libvirt-cgroup-cpushares-bug branch from 85f0879 to c9ce12a Compare January 20, 2025 16:54

codecov bot commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 55.55556% with 4 lines in your changes missing coverage. Please review.

Project coverage is 16.14%. Comparing base (a163831) to head (73ffed4).
Report is 95 commits behind head on 4.20.

Files with missing lines Patch % Lines
...ervisor/kvm/resource/LibvirtComputingResource.java 55.55% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               4.20   #10221   +/-   ##
=========================================
  Coverage     16.13%   16.14%           
- Complexity    12967    12972    +5     
=========================================
  Files          5639     5639           
  Lines        494264   494303   +39     
  Branches      59899    59913   +14     
=========================================
+ Hits          79760    79790   +30     
  Misses       405684   405684           
- Partials       8820     8829    +9     
Flag Coverage Δ
uitests 4.02% <ø> (ø)
unittests 16.99% <55.55%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@phsm phsm force-pushed the 4.20-libvirt-cgroup-cpushares-bug branch from 9e981b4 to f51aec9 Compare January 20, 2025 18:55
@DaanHoogland DaanHoogland added this to the 4.20.1 milestone Jan 21, 2025
@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12141

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-12144)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 51870 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10221-t12144-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_11_isolated_network_with_dynamic_routed_mode Error 2.30 test_ipv4_routing.py
test_12_vpc_and_tier_with_dynamic_routed_mode Error 3.41 test_ipv4_routing.py

@phsm
Contributor Author

phsm commented Jan 23, 2025

[SF] Trillian test result (tid-12144) Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8 Total time taken: 51870 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10221-t12144-kvm-ol8.zip Smoke tests completed. 140 look OK, 1 have errors, 0 did not run Only failed and skipped tests results shown below:
Test Result Time (s) Test File
test_11_isolated_network_with_dynamic_routed_mode Error 2.30 test_ipv4_routing.py
test_12_vpc_and_tier_with_dynamic_routed_mode Error 3.41 test_ipv4_routing.py

It looks like this test failure is not related to my changes. I checked the management server logs from the link above:

2025-01-21 12:52:45,288 DEBUG [c.c.a.ApiServlet] (qtp253011924-20:[ctx-fbcd068c]) (logid:0973a8ad) ===START===  10.0.112.130 -- GET  account=testD1-TestIpv4Routing-ITEMQ5&bgppeerids=d692cabe-00e8-4b00-ab7b-476deab31da9&cidr=172.31.8.0%2F22&displaytext=TestVPC-TZU9IM&domainid=74a01d7c-09ba-4737-aac7-b5f3966100ba&name=TestVPC-AJ4H2C&start=False&vpcofferingid=446cdc68-4cc7-4822-ace9-e54bfded5bac&zoneid=d3b43424-16f5-4120-8cfa-9d08e7b367ac&command=createVPC&response=json&apiKey=LIN6rqXuaJwMPfGYFh13qDwYz5VNNz1J2J6qIOWcd3oLQOq0WtD4CwRundBL6rzXToa3lQOC_vKjI3nkHtiD8Q&signature=l7vSIHJB7VOufpqhCaiu0YyrK9A%3D
.......
2025-01-21 12:52:45,368 ERROR [c.c.a.ApiServer] (qtp253011924-20:[ctx-fbcd068c, ctx-81d91df8, ctx-2e46861c]) (logid:0973a8ad) unhandled exception executing api command: [Ljava.lang.String;@7e035717 java.lang.NullPointerException
        at java.base/java.util.Objects.requireNonNull(Objects.java:209)
        at com.cloud.bgp.BGPServiceImpl.allocateASNumber(BGPServiceImpl.java:258)

@phsm phsm force-pushed the 4.20-libvirt-cgroup-cpushares-bug branch 2 times, most recently from 08dfd7e to dc47b35 Compare January 23, 2025 10:06
@phsm phsm force-pushed the 4.20-libvirt-cgroup-cpushares-bug branch 2 times, most recently from bcf7433 to c4db395 Compare January 23, 2025 10:51
To be compatible with older libvirt versions

Co-authored-by: dahn <[email protected]>
@phsm phsm force-pushed the 4.20-libvirt-cgroup-cpushares-bug branch from c4db395 to 73ffed4 Compare January 23, 2025 13:56
Member

@weizhouapache weizhouapache left a comment

code lgtm

cc @phsm

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12192

@Pearl1594
Contributor

@blueorangutan package

@blueorangutan

@Pearl1594 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Contributor

@Pearl1594 Pearl1594 left a comment

CLGTM

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12462

@Pearl1594
Contributor

@blueorangutan test

@blueorangutan

@Pearl1594 a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian Build Failed (tid-12413)

@Pearl1594
Contributor

@blueorangutan test

@blueorangutan

@Pearl1594 a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian Build Failed (tid-12415)

@Pearl1594
Contributor

@blueorangutan test

@blueorangutan

@Pearl1594 a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-12462)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 51811 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10221-t12462-kvm-ol8.zip
Smoke tests completed. 139 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_11_isolated_network_with_dynamic_routed_mode Error 2.31 test_ipv4_routing.py
test_12_vpc_and_tier_with_dynamic_routed_mode Error 3.38 test_ipv4_routing.py
test_06_purge_expunged_vm_background_task Failure 381.60 test_purge_expunged_vms.py

@Pearl1594 Pearl1594 merged commit 37c4df9 into apache:4.20 Feb 25, 2025
25 of 26 checks passed
@Pearl1594 Pearl1594 moved this to Done in ACS 4.20.1 Mar 17, 2025
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jun 19, 2025
To be compatible with older libvirt versions

Co-authored-by: dahn <[email protected]>
Projects
Status: Done
5 participants