Skip to content

Externalize kvm agent storage reboot configuration #4586

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

GutoVeronezi
Copy link
Contributor

Description

On KVMHAMonitor, when there is an inconsistency on the heartbeat's file or heartbeat timeout is extrapolated several times, the host is restarted. The host restarting amid several operations might bring inconsistency problems; therefore, it is interesting to let operators choose if they want the host restart or not on the hearbeat's timeout or storage errors.

This PR intends to externalize the heartbeat reboot configuration on KVMHAMonitor. The default value is true; therefore, the current behavior is maintained.

This PR depends on PR#4585.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functioanlity)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

How Has This Been Tested?

It has been tested locally on a test lab.

  1. On agent, I changed my log4j to debug mode.
  2. I added hearbeat.update.timeout=10 setting to agent.properties to force a timeout on hearbeat.
  3. I added reboot.host.and.alert.management.on.hearbeat.timeout setting to agent.properties and changed it some times;
  4. I restarted agent and watched the log.

Copy link
Contributor

@RodrigoDLopez RodrigoDLopez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @GutoVeronezi nice feature ...

I was thinking and if we change the name of these files, put something more accurate or self explainable?

  1. AgentProperty.java => AgentProperties.java
    since we will work to bring all available properties to this file, right?

  2. AgentPropertyFile.java => AgentPropertiesFileHandler.java
    this class sounds like a handler or facility to me, am I right?

Bearing in mind that we are looking to standardize and normalize things here. What do you think of normalizing the KVMHAMonitor.java file, at least on the lines where some changes were needed.

String _hostIP => String hostIP

My eyes bleed when I see things like that in the code.

private static final Logger s_logger = Logger.getLogger(KVMHAMonitor.class);
private final Map<String, NfsStoragePool> _storagePool = new ConcurrentHashMap<String, NfsStoragePool>();
private final Map<String, NfsStoragePool> _storagePool = new ConcurrentHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if we start to normalize these variable names in this file?
Removing these underscore at the start of any variable. at least where some changes were needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean

Suggested change
private final Map<String, NfsStoragePool> _storagePool = new ConcurrentHashMap<>();
private final Map<String, NfsStoragePool> storagePool = new ConcurrentHashMap<>();

@RodrigoDLopez ;)
(will need work in the rest of the file!!!)

@GutoVeronezi
Copy link
Contributor Author

@RodrigoDLopez sure, the new files names will be more intuitive.
I'll do it.

About the normalizations of the names, I understand what you mean, I'll do it too.

@DaanHoogland
Copy link
Contributor

good work @GutoVeronezi looking forward to see this merged, but I am going to put my foot down for very extensive testing. 👍

@RodrigoDLopez
Copy link
Contributor

nice job @GutoVeronezi
Code LGTM.

@GutoVeronezi
Copy link
Contributor Author

Can anyone review this PR?

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks good (in spite of the unguarded formats in debug statements. but this needs (as said) extensive 3rd party testing.

private static final Logger s_logger = Logger.getLogger(KVMHAMonitor.class);
private final Map<String, NfsStoragePool> _storagePool = new ConcurrentHashMap<String, NfsStoragePool>();
private final Map<String, NfsStoragePool> _storagePool = new ConcurrentHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean

Suggested change
private final Map<String, NfsStoragePool> _storagePool = new ConcurrentHashMap<>();
private final Map<String, NfsStoragePool> storagePool = new ConcurrentHashMap<>();

@RodrigoDLopez ;)
(will need work in the rest of the file!!!)


private final String _hostIP; /* private ip address */
/* private ip address */
private final String _hostIP;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Suggested change
private final String _hostIP;
private final String hostIP;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the contributor decides to proceed with renaming this variable, I would also suggest removing the commented line /* private IP address */, and leave simply hostIP, or then something as hostPrivateIP or privateIpAddress, as this is expected to be the HostVO.privateIpAddress.

@DaanHoogland
Copy link
Contributor

Can anyone review this PR?

I just did, but this is not what you need most, I'll look in the company to find test effort for us, to get this merged. As a general rule we need two lgtm, on of which should have at least tested extensively and the other at least looked at the code.

@blueorangutan
Copy link

Packaging result: ✔️ centos7 ✖️ centos8 ✔️ debian. SL-JID 547

@blueorangutan
Copy link

Packaging result: ✖️ centos7 ✖️ centos8 ✖️ debian. SL-JID 561

@blueorangutan
Copy link

Packaging result: ✔️ centos7 ✔️ centos8 ✔️ debian. SL-JID 578

@GutoVeronezi
Copy link
Contributor Author

@blueorangutan test

Copy link
Member

@GabrielBrascher GabrielBrascher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM.

I added a suggestion regarding the variable naming change; nothing that would block this PR anyway.


private final String _hostIP; /* private ip address */
/* private ip address */
private final String _hostIP;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the contributor decides to proceed with renaming this variable, I would also suggest removing the commented line /* private IP address */, and leave simply hostIP, or then something as hostPrivateIP or privateIpAddress, as this is expected to be the HostVO.privateIpAddress.

@GutoVeronezi
Copy link
Contributor Author

@DaanHoogland @GabrielBrascher rename done.

@@ -263,3 +263,6 @@ iscsi.session.cleanup.enabled=false
# Automatically clean up iscsi sessions not attached to any VM.
# Should be enabled for users using managed storage for example solidfire.
# Should be disabled for users with unmanaged iscsi connections on their hosts
#
#This parameter specifies if the host must be rebooted when something go wrong with heartbeat.
# reboot.host.and.alert.management.on.hearbeat.timeout=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo hearbeat => heartbeat

* Default value: 60000 (ms).
*/
public static final AgentProperties<Integer> HEARTBEAT_UPDATE_TIMEOUT = new AgentProperties<Integer>("hearbeat.update.timeout", 60000);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the typo error 'hearbeat', and correct it wherever applicable.

@GutoVeronezi GutoVeronezi marked this pull request as draft May 29, 2021 15:25
@sureshanaparti
Copy link
Contributor

@GutoVeronezi is this PR ready for review?

@GutoVeronezi
Copy link
Contributor Author

@sureshanaparti as I mentioned in #4585 (comment), this PR will be rebased if #4585 be merged, therefore it still is a WiP.

@GutoVeronezi GutoVeronezi force-pushed the externalize-kvm-agent-storage-reboot-configuration branch from 1022328 to 53fb485 Compare July 19, 2021 13:36
@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian. SL-JID 701

Copy link
Contributor

@RodrigoDLopez RodrigoDLopez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code LGTM.
i'll be able to test this one on this weekend. perhaps i'll post my results in the next monday.

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cltgm

@@ -270,3 +270,6 @@ iscsi.session.cleanup.enabled=false
# This parameter specifies the heartbeat update timeout in ms; The default value is 60000ms (1 min).
# Depending on the use case, this timeout might need increasing/decreasing.
# heartbeat.update.timeout=60000

# This parameter specifies if the host must be rebooted when something goes wrong with the heartbeat.
# reboot.host.and.alert.management.on.heartbeat.timeout=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hope the newline is there ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaanHoogland It's there =)

GitHub warns when there is no newline at the end of the file, e.g.:
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, you're right. forgot about that.

@blueorangutan
Copy link

Trillian test result (tid-1434)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 49901 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4586-t1434-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_diagnostics.py
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_loadbalance.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dnsservice.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Smoke tests completed. 86 look OK, 3 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_05_ping_in_cpvm_success Failure 15.22 test_diagnostics.py
test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failure 352.50 test_routers_network_ops.py
test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false Failure 345.59 test_routers_network_ops.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers Failure 532.46 test_vpc_redundant.py

@blueorangutan
Copy link

Trillian test result (tid-1437)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 50106 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4586-t1437-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_reset_vm_on_reboot.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 88 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers Failure 474.21 test_vpc_redundant.py

@DaanHoogland
Copy link
Contributor

test_vpc_redundant is failing all over the open PRs, but running one last time anyway to be as sure as possible

@blueorangutan
Copy link

Trillian test result (tid-1446)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33833 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4586-t1446-kvm-centos7.zip
Smoke tests completed. 89 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@DaanHoogland
Copy link
Contributor

@RodrigoDLopez did you test this?

@RodrigoDLopez
Copy link
Contributor

@RodrigoDLopez did you test this?

Sorry @DaanHoogland I couldn't run the tests yet. I'm having problems with my test environment, but I will run the tests as soon as possible.

@GutoVeronezi
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@GutoVeronezi a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian. SL-JID 815

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-1588)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 37995 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4586-t1588-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_network.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Smoke tests completed. 89 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@DaanHoogland
Copy link
Contributor

@RodrigoDLopez did you test this?

Sorry @DaanHoogland I couldn't run the tests yet. I'm having problems with my test environment, but I will run the tests as soon as possible.

@RodrigoDLopez anything to share yet?

@RodrigoDLopez
Copy link
Contributor

I ran some tests with this propose. But I'm not sure about my environment. I couldn't see my kvm agent stop at any point.
anyway, I managed to verify that the behavior of the externalized parameter is in accordance with what was proposed.

forcing my heartbeat timeout on two kvm hosts it's possible to see that the ones without the parameter reboot.host.and.alert.management.on.heartbeat.timeout=false will log the mensage failed: timeout; stopping cloudstack-agent.

Kvm-nd01

heartbeat.update.timeout=10
reboot.host.and.alert.management.on.heartbeat.timeout=true
Thread-1:null) (logid:) The command (/usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 201.200.102.10 -p /export/primary -m /mnt/0727b905-35cb-36cb-961c-ea27b1a83333 -h 201.200.101.3 ), to the pool [0727b905-35cb-36cb-961c-ea27b1a83333], has the result [timeout].
(Thread-1:null) (logid:) Write heartbeat for pool [0727b905-35cb-36cb-961c-ea27b1a83333] failed: timeout; try: 5 of 5.
(Thread-1:null) (logid:) [IGNORED] Interrupted between heartbeat retries.
java.lang.InterruptedException: sleep interrupted
	at java.base/java.lang.Thread.sleep(Native Method)
	at com.cloud.hypervisor.kvm.resource.KVMHAMonitor.runHeartBeat(KVMHAMonitor.java:129)
	at com.cloud.hypervisor.kvm.resource.KVMHAMonitor.run(KVMHAMonitor.java:176)
	at java.base/java.lang.Thread.run(Thread.java:829)
(Thread-1:null) (logid:) Write heartbeat for pool [0727b905-35cb-36cb-961c-ea27b1a83333] failed: timeout; stopping cloudstack-agent. 

Kvm-nd02

heartbeat.update.timeout=10
reboot.host.and.alert.management.on.heartbeat.timeout=false
(Thread-1:null) (logid:) The command (/usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 201.200.102.10 -p /export/primary -m /mnt/0727b905-35cb-36cb-961c-ea27b1a83333 -h 201.200.101.4 ), to the pool [0727b905-35cb-36cb-961c-ea27b1a83333], has the result [timeout].
(Thread-1:null) (logid:) Write heartbeat for pool [0727b905-35cb-36cb-961c-ea27b1a83333] failed: timeout; try: 5 of 5.
(Thread-1:null) (logid:) [IGNORED] Interrupted between heartbeat retries.
java.lang.InterruptedException: sleep interrupted
	at java.base/java.lang.Thread.sleep(Native Method)
	at com.cloud.hypervisor.kvm.resource.KVMHAMonitor.runHeartBeat(KVMHAMonitor.java:129)
	at com.cloud.hypervisor.kvm.resource.KVMHAMonitor.run(KVMHAMonitor.java:176)
	at java.base/java.lang.Thread.run(Thread.java:829)

Copy link
Member

@GabrielBrascher GabrielBrascher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @GutoVeronezi.

LGTM based on code review and manual tests done by @RodrigoDLopez.

The test looks good, probably the host was not rebooted due to some test env details. The logic behind loading the agent configuration, validating it and changing the execution flow according to it seems OK.

@GutoVeronezi
Copy link
Contributor Author

@DaanHoogland @GabrielBrascher @RodrigoDLopez is this good to go? Is there anything else to do?

@nvazquez
Copy link
Contributor

Thanks for testing @RodrigoDLopez
@GutoVeronezi I think we can kick a new round of tests then should be good to go
@blueorangutan package

@blueorangutan
Copy link

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@GutoVeronezi
Copy link
Contributor Author

@nvazquez ok, thanks!

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 981

@nvazquez
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-1754)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33182 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4586-t1754-kvm-centos7.zip
Smoke tests completed. 89 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants