Skip to content

Externalize KVM Agent's option to change migration thread timeout #4570

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

GutoVeronezi
Copy link
Contributor

Description

We noticed that sometimes migration in KVM fails due to a latency to move the VM process (domain) from the source host to the target. This has been noticed in Rhel7 and CentOS7. An exception can be observed due to the default (hard-coded) timeout that is extrapolated and ACS can't find the migrated VM's domain in the target host.

This PR intends to externalize that option in the KVM agent configuration. Default timeout will remain 10 seconds.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

How Has This Been Tested?

It has been tested locally.

Copy link
Contributor

@RodrigoDLopez RodrigoDLopez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @GutoVeronezi
Can you please explain to me the context? and how it has been tested?
Reading your description I got an idea, but I couldn't reproduce it.

@GutoVeronezi
Copy link
Contributor Author

@RodrigoDLopez

I had two hosts with CentOS7.

I tried to migrate a VM from a host to another, but one of my hosts was with overloaded RAM, so, it would take more than 10 seconds (the hardcoded timeout) to migrate the VM.

With the externalized configuration, I could change the timeout to 20 seconds and my migration ran just fine.

@DaanHoogland
Copy link
Contributor

@GutoVeronezi makes sense functionally and the code looks good except for the naming, why do you use dest in the names of the method and the setting? isn't this a local setting of the agent we are on at this moment?

@GutoVeronezi
Copy link
Contributor Author

GutoVeronezi commented Jan 19, 2021

@GutoVeronezi makes sense functionally and the code looks good except for the naming, why do you use dest in the names of the method and the setting? isn't this a local setting of the agent we are on at this moment?

@DaanHoogland

Yes, it is local, but it refers to a migration from one host to the other, so it is verified if the VMs were correctly migrated from the source host to the destination host. The naming corresponds to the verification of the VM's DOM on the destination host.

@GutoVeronezi
Copy link
Contributor Author

Can anyone review this PR?

@DaanHoogland
Copy link
Contributor

@wido @weizhouapache @GabrielBrascher @ravening can one of you guys look at this, please?

@wido
Copy link
Contributor

wido commented Mar 25, 2021

I honestly don't get it yet. I need to think about this for a moment.

@GabrielBrascher
Copy link
Member

@wido from my understanding, KVM with Rhel/CentOS does some kind of processing which results in Libvirt taking longer to get the domain ID when migrating the VM. Ubuntu, on the other hand, retrieves domain ID in less than 10 seconds.

Am I right, @GutoVeronezi?

Copy link
Member

@GabrielBrascher GabrielBrascher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable dest.domain.migrate.retrieve.timeout is going to be placed at the agent.properties, right?

If so, I would recommend adding it on agent/conf/agent.properties. Something similar to what was done at PR #4585 where it was added commented lines as documentation as well as the variable with the default value.

@wido
Copy link
Contributor

wido commented Mar 26, 2021

The variable dest.domain.migrate.retrieve.timeout is going to be placed at the agent.properties, right?

If so, I would recommend adding it on agent/conf/agent.properties. Something similar to what was done at PR #4585 where it was added commented lines as documentation as well as the variable with the default value.

In addition, shouldn't this be prefixed with 'vm' then as other migration options have the same prefix.

@GutoVeronezi
Copy link
Contributor Author

@wido from my understanding, KVM with Rhel/CentOS does some kind of processing which results in Libvirt taking longer to get the domain ID when migrating the VM. Ubuntu, on the other hand, retrieves domain ID in less than 10 seconds.

Am I right, @GutoVeronezi?

Yes, that is the case.

@GabrielBrascher
Copy link
Member

@GutoVeronezi what do you think of @wido's suggestion?

At the agent.properties there are already some parameters regarding VM migrations, such as vm.migrate.speed:

# set the vm migrate speed, by default, it will try to guess the speed of the guest network
# In MegaBytes per second
#vm.migrate.speed=0

# set target downtime at end of livemigration, the 'hiccup' for final copy. Higher numbers
# make livemigration easier, lower numbers may cause migration to never complete. Less than 1
# means hypervisor default (20ms).
#vm.migrate.downtime=0

# Busy VMs may never finish migrating, depending on environment. When its available, we will
# want to add support for autoconvergence migration flag which should fix this. Set an upper
# limit in milliseconds for how long live migration should wait, at which point VM is paused and
# migration will finish quickly.  Less than 1 means disabled.
#vm.migrate.pauseafter=0

Maybe adding this parameter nearby those (e.g. after vm.migrate.pauseafter) would help users to find it among other VM migration parameters. One option would be to change from dest.domain.migrate.retrieve.timeout to vm.migrate.domain.retrieve.timeout.

@GutoVeronezi
Copy link
Contributor Author

@GabrielBrascher @wido I have changed the config name, is there something else to do?

Copy link
Member

@GabrielBrascher GabrielBrascher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM

@GabrielBrascher
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔️ centos7 ✔️ centos8 ✔️ debian. SL-JID 302

@@ -288,6 +295,17 @@ Use VIR_DOMAIN_XML_SECURE (value = 1) prior to v1.0.0.
return new MigrateAnswer(command, result == null, result, null);
}

private int parseAgentPropertiesVmMigrateDomainRetrieveTimeout() throws IOException {
final File agentPropertiesFile = PropertiesUtil.findConfigFile(KeyStoreUtils.AGENT_PROPSFILE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GutoVeronezi
you can use PropertiesStorage instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weizhouapache I took a look into this class and I am a bit confused. It seems like a facade of PropertiesUtils mixed with an another purpouse.

It seems to me that this class is badly structured, without a clear flow and with nonsense logs. I think it will add a meaningless complexity to the code instead of facility it.

In other hand, in PR #4586 there is a proposal of a handler that will simplify this code. If it be approved, I will change this PR to use the handler.

@GutoVeronezi
Copy link
Contributor Author

@weizhouapache can you look at this?

@GutoVeronezi GutoVeronezi marked this pull request as draft May 17, 2021 14:15
@GutoVeronezi
Copy link
Contributor Author

Waiting for PR 4586 to be merged to use the AgentPropertiesFileHandler.

@rohityadavcloud
Copy link
Member

@GutoVeronezi just another example where related PRs should be clubbed together, commits can be separate to allow reviewing of them individually.

@GabrielBrascher
Copy link
Member

@rhtyd @GutoVeronezi my 2 cents here is: I think that PR #4586 has nothing to do with #4570.

The first one externalizes the KVM agent storage reboot configuration (#4586), while the second one (#4570) allows configuring migration thread timeout. They are completely different variables, that impact distinct components and change different behaviors.

The only thing that links both PRs is the AgentPropertiesFileHandler, which is basically using the same means to do a completely different thing.

I see it as a personal way of seeing. One could keep two PRs as they affect different parts and could easily be debugged and done separately. On the other hand, one could have them on a generic PR such as Externalize some KVM variables.

@GutoVeronezi
Copy link
Contributor Author

GutoVeronezi commented Jul 29, 2021

@DaanHoogland @GabrielBrascher @RodrigoDLopez @sureshanaparti I'm pinging you here because you all reviewed #4585 and #4586.

CC @weizhouapache

#4585 introduced a handler that facilitate agent properties' file reading, which can be used here. However, #4585 was merged and reverted, therefore I had to submit a new work (#5239), which was merged a few days ago. Now I rebased this one and applied only the real changes.

@GutoVeronezi
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@GutoVeronezi a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian. SL-JID 700

@weizhouapache
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@weizhouapache a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Copy link
Contributor

@RodrigoDLopez RodrigoDLopez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM.
I'll test it and then post my results as soon as possible.

@blueorangutan
Copy link

Trillian test result (tid-1425)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 66185 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4570-t1425-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_async_job.py
Intermittent failure detected: /marvin/tests/smoke/test_diagnostics.py
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dns.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_routers.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 84 look OK, 5 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80 Failure 430.95 test_internal_lb.py
test_01_invalid_upgrade_kubernetes_cluster Failure 3610.36 test_kubernetes_clusters.py
test_07_deploy_kubernetes_ha_cluster Failure 3613.87 test_kubernetes_clusters.py
test_08_deploy_and_upgrade_kubernetes_ha_cluster Failure 0.06 test_kubernetes_clusters.py
test_09_delete_kubernetes_ha_cluster Failure 0.05 test_kubernetes_clusters.py
ContextSuite context=TestKubernetesCluster>:teardown Error 90.92 test_kubernetes_clusters.py
test_router_dns_guestipquery Failure 600.37 test_router_dns.py
test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failure 358.53 test_routers_network_ops.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers Failure 497.64 test_vpc_redundant.py
test_05_rvpc_multi_tiers Failure 535.94 test_vpc_redundant.py
test_05_rvpc_multi_tiers Error 535.96 test_vpc_redundant.py

Copy link
Member

@GabrielBrascher GabrielBrascher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposal and Code LGTM

@weizhouapache
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian. SL-JID 894

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@GutoVeronezi
Copy link
Contributor Author

Could we re-run the tests in this one?

@nvazquez
Copy link
Contributor

@GutoVeronezi can you please fix the conflicts? We can start a new round of tests after that

@GutoVeronezi
Copy link
Contributor Author

@GutoVeronezi can you please fix the conflicts? We can start a new round of tests after that

@nvazquez done, thanks!

@nvazquez
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1009

@nvazquez
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Copy link
Contributor

@RodrigoDLopez RodrigoDLopez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal is okay and CLGTM.

@blueorangutan
Copy link

Trillian test result (tid-1774)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38667 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4570-t1774-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 89 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@nvazquez nvazquez merged commit 159c72f into apache:main Aug 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants