
Refactor: Replace sleep() with wait() #10504



Open: wants to merge 5 commits into main


sroopsai

sroopsai commented Mar 5, 2025

Description

This PR fixes #10486.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial


boring-cyborg bot commented Mar 5, 2025

Congratulations on your first Pull Request and welcome to the Apache CloudStack community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/cloudstack/blob/main/CONTRIBUTING.md)
Here are some useful points:

@DaanHoogland
Contributor

thanks @sroopsai , let's test this.

DaanHoogland added this to the 4.21.0 milestone Mar 6, 2025
@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.


codecov bot commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.

Project coverage is 16.60%. Comparing base (28e2411) to head (2658f94).
Report is 7 commits behind head on main.

Files with missing lines | Patch % | Lines
...tils/src/main/java/com/cloud/utils/ThreadUtil.java | 0.00% | 12 Missing ⚠️
...va/com/cloud/agent/manager/DirectAgentAttache.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #10504      +/-   ##
============================================
- Coverage     16.60%   16.60%   -0.01%     
  Complexity    13925    13925              
============================================
  Files          5730     5731       +1     
  Lines        508082   508236     +154     
  Branches      61770    61789      +19     
============================================
+ Hits          84386    84388       +2     
- Misses       414260   414413     +153     
+ Partials       9436     9435       -1     
Flag | Coverage Δ
uitests | 3.93% <ø> (ø)
unittests | 17.49% <0.00%> (-0.01%) ⬇️


@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 12680

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-12599)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 56751 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10504-t12599-kvm-ol8.zip
Smoke tests completed. 135 look OK, 1 have errors, 5 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup | Error | 1518.92 | test_network.py
all_test_vpc_vpn | Skipped | --- | test_vpc_vpn.py
all_test_webhook_delivery | Skipped | --- | test_webhook_delivery.py
all_test_webhook_lifecycle | Skipped | --- | test_webhook_lifecycle.py
all_test_host_maintenance | Skipped | --- | test_host_maintenance.py
all_test_hostha_kvm | Skipped | --- | test_hostha_kvm.py

@blueorangutan

[SF] Trillian Build Failed (tid-12615)

@blueorangutan

[SF] Trillian test result (tid-12623)
Environment: kvm-ol9 (x2), Advanced Networking with Mgmt server ol9
Total time taken: 60312 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10504-t12623-kvm-ol9.zip
Smoke tests completed. 139 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
ContextSuite context=TestClusterDRS>:setup | Error | 0.00 | test_cluster_drs.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup | Error | 1520.31 | test_network.py

@weizhouapache
Member

@blueorangutan test ol8 vmware-70u3

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + vmware-70u3) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-12635)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 64123 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10504-t12635-vmware-70u3.zip
Smoke tests completed. 134 look OK, 7 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
test_01_events_resource | Error | 319.55 | test_events_resource.py
test_01_events_resource | Error | 319.56 | test_events_resource.py
test_deploy_more_vms_than_limit_allows | Error | 153.47 | test_deploy_vms_in_parallel.py
test_01_prepare_and_cancel_maintenance | Error | 0.09 | test_ms_maintenance_and_safe_shutdown.py
test_04_deploy_vm_for_other_user_and_test_vm_operations | Error | 129.84 | test_network_permissions.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup | Error | 1524.67 | test_network.py
test_02_restore_vm_with_disk_offering | Error | 53.05 | test_restore_vm.py
test_03_restore_vm_with_disk_offering_custom_size | Error | 57.28 | test_restore_vm.py
test_02_restore_vm_strict_tags_failure | Error | 57.52 | test_vm_strict_host_tags.py

Contributor

sureshanaparti left a comment


clgtm

@sureshanaparti
Contributor

@blueorangutan package


Copilot AI left a comment


Pull Request Overview

This PR refactors the retry loop in runInContext to use wait() instead of Thread.sleep(), allowing the lock on the current instance to be released during each retry interval.

  • Replaced Thread.sleep(...) with wait(...) inside the synchronized retry loop.
  • Kept the retry logic and timeout calculation identical.
Comments suppressed due to low confidence (1)

engine/orchestration/src/main/java/com/cloud/agent/manager/DirectAgentAttache.java:168

  • Switching from Thread.sleep() to wait() releases the monitor lock, altering concurrency semantics. Verify that other threads should be allowed to enter synchronized methods on this instance during the wait.
wait(1000 * _HostPingRetryTimer.value());
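To make the monitor-release point concrete, here is a small standalone sketch. The class `MonitorReleaseDemo` and its methods are hypothetical, not CloudStack code. It checks whether a second thread can enter another `synchronized` method on the same object while the first thread is pausing inside its own `synchronized` method: `wait()` releases the monitor, while `Thread.sleep()` keeps it for the whole pause. The 1-second pause and the half-pause threshold are arbitrary demo values.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical demo (not CloudStack code): can another thread enter a
 * synchronized method on this object while pause() is in progress?
 */
class MonitorReleaseDemo {
    private final CountDownLatch pausing = new CountDownLatch(1);

    // Pauses while holding this object's monitor, using either wait() or sleep().
    synchronized void pause(boolean useWait, long millis) throws InterruptedException {
        pausing.countDown();
        if (useWait) {
            wait(millis);          // releases the monitor until timeout, notify or spurious wakeup
        } else {
            Thread.sleep(millis);  // keeps the monitor for the whole pause
        }
    }

    // Trivial synchronized method that a second thread tries to enter.
    synchronized void touch() { }

    /** Returns true if touch() got the monitor well before the pause ended. */
    static boolean otherThreadGetsIn(boolean useWait) {
        MonitorReleaseDemo demo = new MonitorReleaseDemo();
        long pauseMillis = 1000;
        Thread pauser = new Thread(() -> {
            try {
                demo.pause(useWait, pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        pauser.start();
        try {
            demo.pausing.await();  // pauser is inside pause(), holding the monitor or already waiting
            long start = System.nanoTime();
            demo.touch();          // blocks until the monitor is available
            long blockedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            pauser.join();
            return blockedMillis < pauseMillis / 2;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the demo", e);
        }
    }
}
```

`otherThreadGetsIn(true)` is expected to return true and `otherThreadGetsIn(false)` false; timing-based checks like this can misbehave on a heavily loaded machine.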

@@ -165,7 +165,7 @@ protected synchronized void runInContext() {
     PingCommand cmd = resource.getCurrentStatus(_id);
     int retried = 0;
     while (cmd == null && ++retried <= _HostPingRetryCount.value()) {
-        Thread.sleep(1000*_HostPingRetryTimer.value());
+        wait(1000 * _HostPingRetryTimer.value());

Copilot AI Jun 5, 2025


Missing InterruptedException handling: wait(...) throws InterruptedException. Surround this call with a try/catch or declare the exception to prevent compilation errors.

Suggested change
-        wait(1000 * _HostPingRetryTimer.value());
+        try {
+            wait(1000 * _HostPingRetryTimer.value());
+        } catch (InterruptedException e) {
+            logger.warn("PingTask interrupted while waiting to retry ping [id: {}, uuid: {}, name: {}]", _id, _uuid, _name, e);
+            Thread.currentThread().interrupt(); // Restore the interrupted status
+            break;
+        }
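Below is a self-contained sketch of the retry loop with the suggested interrupt handling folded in. `RetryingProbe`, `probeWithRetry` and the `Supplier`-based probe are hypothetical stand-ins for `resource.getCurrentStatus(_id)` and the `_HostPingRetryCount`/`_HostPingRetryTimer` settings. The interval here is in milliseconds, whereas the original multiplies a value in seconds by 1000.

```java
import java.util.function.Supplier;

/**
 * Hypothetical sketch (not CloudStack code) of the ping retry loop with the
 * suggested InterruptedException handling.
 */
class RetryingProbe {
    /**
     * Calls probe until it returns non-null or retryCount retries are used up,
     * waiting intervalMillis between attempts. Returns the last result, possibly null.
     */
    synchronized <T> T probeWithRetry(Supplier<T> probe, int retryCount, long intervalMillis) {
        T result = probe.get();
        int retried = 0;
        while (result == null && ++retried <= retryCount) {
            try {
                // Legal because the synchronized modifier means this thread owns the monitor;
                // other threads may enter synchronized methods on this object during the wait.
                wait(intervalMillis);
            } catch (InterruptedException e) {
                // wait() clears the interrupt flag when it throws; restore it, then stop retrying.
                Thread.currentThread().interrupt();
                break;
            }
            result = probe.get();
        }
        return result;
    }
}
```

Restoring the flag before `break` keeps the interruption visible to code that runs after the loop. This matches the intent of the `Thread.currentThread().interrupt()` line in the suggestion.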


sureshanaparti moved this to In Progress in Apache CloudStack 4.21.0 Jun 5, 2025
Contributor

DaanHoogland left a comment


Your code looks good, @sroopsai. Two remarks:

  1. Copilot's suggestion seems sensible (even if it looks excessive).
  2. I think we could do with a utility that embeds the try/rethrow, so that it can also be used in the other locations that currently use sleep().

Note: this is not a -1, just a question/suggestion for improvement.
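One possible shape for such a utility at the existing `sleep()` call sites, as a sketch only. `SleepUtil.sleepRestoringInterrupt` is a hypothetical name. This variant restores the interrupt flag and reports whether the pause completed instead of rethrowing; a rethrowing variant would wrap the `InterruptedException` in an unchecked exception after restoring the flag.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper (not an existing CloudStack class) for call sites that
 * currently use Thread.sleep() outside synchronized code.
 */
final class SleepUtil {
    private SleepUtil() { }

    /**
     * Sleeps for the given duration. Returns true if the full pause elapsed,
     * false if the thread was interrupted; in that case the interrupt flag is
     * restored so the caller can react to it.
     */
    static boolean sleepRestoringInterrupt(long duration, TimeUnit unit) {
        try {
            unit.sleep(duration);
            return true;
        } catch (InterruptedException e) {
            // sleep() clears the flag when it throws; set it again for the caller.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller in a retry loop could then write `if (!SleepUtil.sleepRestoringInterrupt(5, TimeUnit.SECONDS)) break;` instead of repeating the try/catch.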

@sroopsai
Author

sroopsai commented Jun 11, 2025

Thank you for the suggestions, @DaanHoogland.

  1. The other locations where sleep() is called are in non-synchronised blocks, so I don't think it is a good idea to replace those calls with wait() (calling wait() without holding the object's monitor throws IllegalMonitorStateException). In this case, however, sleep() is inside a synchronised block, so there is no problem. If you want, I can write a utility method that wraps the wait() call in a try/catch block, use it here, and possibly reuse it later for sleep() calls that appear in synchronised blocks.
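The distinction above follows from the `Object.wait()` contract: the calling thread must own the object's monitor. A small hypothetical check (`WaitOwnershipDemo` is illustrative only) shows the same call failing outside a `synchronized` block and succeeding inside one:

```java
/**
 * Hypothetical illustration: Object.wait() requires the caller to own the
 * object's monitor, so it cannot simply replace Thread.sleep() in
 * non-synchronized code.
 */
class WaitOwnershipDemo {
    private final Object lock = new Object();

    /** Calls wait() without holding the monitor; returns true if the JVM rejects it. */
    boolean waitOutsideSynchronizedFails() {
        try {
            lock.wait(10);
            return false;
        } catch (IllegalMonitorStateException e) {
            return true;  // the current thread does not own lock's monitor
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Calls wait() while holding the monitor; returns true once it times out normally. */
    boolean waitInsideSynchronizedSucceeds() {
        synchronized (lock) {
            try {
                lock.wait(10);
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```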

@DaanHoogland
Contributor


That makes sense, @sroopsai. A utility method that takes a timeout value and a set of log parameters would be great, either in this PR or in a separate one.
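A minimal sketch of such a utility, assuming the caller already holds the monitor (for example, from inside a `synchronized` method such as `runInContext()`). `WaitUtil.waitOrLog` is a hypothetical name. It is not the `ThreadUtil` class this PR adds, whose contents are not shown here. It uses `java.util.logging` with `{0}`-style placeholders so that it compiles on its own; the `logger.warn(...)` call in Copilot's suggestion uses `{}` placeholders instead.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Hypothetical helper (not the ThreadUtil added by this PR): waits on a
 * monitor the caller already holds and logs a caller-supplied message if the
 * wait is interrupted.
 */
final class WaitUtil {
    private WaitUtil() { }

    /**
     * Waits up to timeoutMillis on monitor.
     *
     * @return true if the wait ended normally (timeout, notify or spurious wakeup),
     *         false if the thread was interrupted; the interrupt flag is then restored
     * @throws IllegalStateException if the caller does not hold monitor's lock
     */
    static boolean waitOrLog(Object monitor, long timeoutMillis,
                             Logger logger, String message, Object... params) {
        // Fail with a clear message instead of IllegalMonitorStateException.
        if (!Thread.holdsLock(monitor)) {
            throw new IllegalStateException("caller must hold the monitor before waiting");
        }
        try {
            monitor.wait(timeoutMillis);
            return true;
        } catch (InterruptedException e) {
            // java.util.logging substitutes {0}, {1}, ... from params when formatting.
            logger.log(Level.WARNING, message, params);
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Inside the retry loop this could read `if (!WaitUtil.waitOrLog(this, 1000L * _HostPingRetryTimer.value(), LOG, "Ping retry interrupted for host {0}", _id)) break;`, where `LOG` is an assumed `java.util.logging.Logger`.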

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 13734

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13753


This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13769

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13779

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13787

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13538)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 58247 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10504-t13538-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File

@DaanHoogland
Contributor

@blueorangutan test ol8 vmware-70u3

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + vmware-70u3) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13539)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 62198 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10504-t13539-vmware-70u3.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
test_01_prepare_and_cancel_maintenance | Error | 0.21 | test_ms_maintenance_and_safe_shutdown.py

Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Use of Thread.sleep()
5 participants