-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Externalize KVM Agent's option to change migration thread timeout #4570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Externalize KVM Agent's option to change migration thread timeout #4570
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @GutoVeronezi
Can you please explain to me the context? and how it has been tested?
Reading your description I got an idea, but I couldn't reproduce it.
I had two hosts with CentOS7. I tried to migrate a VM from a host to another, but one of my hosts was with overloaded RAM, so, it would take more than 10 seconds (the hardcoded timeout) to migrate the VM. With the externalized configuration, I could change the timeout to 20 seconds and my migration ran just fine. |
@GutoVeronezi makes sense functionally and the code looks good except for the naming, why do you use |
Yes, it is local, but it refers to a migration from one host to the other, so it is verified if the VMs were correctly migrated from the source host to the |
Can anyone review this PR? |
@wido @weizhouapache @GabrielBrascher @ravening can one of you guys look at this, please? |
I honestly don't get it yet. I need to think about this for a moment. |
@wido from my understanding, KVM with Rhel/CentOS does some kind of processing which results in Libvirt taking longer to get the domain ID when migrating the VM. Ubuntu, on the other hand, retrieves domain ID in less than 10 seconds. Am I right, @GutoVeronezi? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable dest.domain.migrate.retrieve.timeout
is going to be placed at the agent.properties
, right?
If so, I would recommend adding it on agent/conf/agent.properties
. Something similar to what was done at PR #4585 where it was added commented lines as documentation as well as the variable with the default value.
In addition, shouldn't this be prefixed with 'vm' then as other migration options have the same prefix. |
Yes, that is the case. |
@GutoVeronezi what do you think of @wido's suggestion? At the agent.properties there are already some parameters regarding VM migrations, such as
Maybe adding this parameter nearby those (e.g. after |
@GabrielBrascher @wido I have changed the config name, is there something else to do? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM
@blueorangutan package |
@GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
Packaging result: ✔️ centos7 ✔️ centos8 ✔️ debian. SL-JID 302 |
@@ -288,6 +295,17 @@ Use VIR_DOMAIN_XML_SECURE (value = 1) prior to v1.0.0. | |||
return new MigrateAnswer(command, result == null, result, null); | |||
} | |||
|
|||
private int parseAgentPropertiesVmMigrateDomainRetrieveTimeout() throws IOException { | |||
final File agentPropertiesFile = PropertiesUtil.findConfigFile(KeyStoreUtils.AGENT_PROPSFILE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GutoVeronezi
you can use PropertiesStorage
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@weizhouapache I took a look into this class and I am a bit confused. It seems like a facade of PropertiesUtils
mixed with an another purpouse.
It seems to me that this class is badly structured, without a clear flow and with nonsense logs. I think it will add a meaningless complexity to the code instead of facility it.
In other hand, in PR #4586 there is a proposal of a handler that will simplify this code. If it be approved, I will change this PR to use the handler.
@weizhouapache can you look at this? |
Waiting for PR 4586 to be merged to use the |
@GutoVeronezi just another example where related PRs should be clubbed together, commits can be separate to allow reviewing of them individually. |
@rhtyd @GutoVeronezi my 2 cents here is: I think that PR #4586 has nothing to do with #4570. The first one externalizes the KVM agent storage reboot configuration (#4586), while the second one (#4570) allows configuring migration thread timeout. They are completely different variables, that impact distinct components and change different behaviors. The only thing that links both PRs is the I see it as a personal way of seeing. One could keep two PRs as they affect different parts and could easily be debugged and done separately. On the other hand, one could have them on a generic PR such as |
@DaanHoogland @GabrielBrascher @RodrigoDLopez @sureshanaparti I'm pinging you here because you all reviewed #4585 and #4586. #4585 introduced a handler that facilitate agent properties' file reading, which can be used here. However, #4585 was merged and reverted, therefore I had to submit a new work (#5239), which was merged a few days ago. Now I rebased this one and applied only the real changes. |
@blueorangutan package |
@GutoVeronezi a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian. SL-JID 700 |
@blueorangutan test |
@weizhouapache a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM.
I'll test it and then post my results as soon as possible.
Trillian test result (tid-1425)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposal and Code LGTM
@blueorangutan package |
@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian. SL-JID 894 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clgtm
Could we re-run the tests in this one? |
@GutoVeronezi can you please fix the conflicts? We can start a new round of tests after that |
…rate-thread-timeout
@nvazquez done, thanks! |
@blueorangutan package |
@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1009 |
@blueorangutan test |
@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal is okay and CLGTM.
Trillian test result (tid-1774)
|
Description
We noticed that sometimes migration in KVM fails due to a latency to move the VM process (domain) from the source host to the target. This has been noticed in Rhel7 and CentOS7. An exception can be observed due to the default (hard-coded) timeout that is extrapolated and ACS can't find the migrated VM's domain in the target host.
This PR intends to externalize that option in the KVM agent configuration. Default timeout will remain 10 seconds.
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
How Has This Been Tested?
It has been tested locally.