DiskPool stuck Creating #1823

Open
Mefinst opened this issue Feb 23, 2025 · 21 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug
triage/needs-information: Indicates an issue needs more information in order to work on it

Comments

@Mefinst

Mefinst commented Feb 23, 2025

Describe the bug
DiskPool resources hang in the Creating state because the IO-engine is unable to send an Ok response after the pool has in fact been created.

IO-engine pod logs

From those I conclude that the IO-engine successfully creates or destroys a pool when requested to do so.
The create_pool method times out.
Because the method times out, the IO-engine fails to send a success response.
The operator then sends commands one after another to destroy and import the pool: it destroys the previously created pool, tries to import it, then tries to create a new one, and hits the timeout again.

I believe the DiskPool operator should not send DestroyPoolRequest commands during the creation process.

DiskPool operator logs

Not very informative. They contain the same messages that are shown by kubectl describe diskpool.

To Reproduce
Steps to reproduce the behavior:

  1. Install the OpenEBS Helm chart. Tested 4.1.1, 4.1.2, 4.2.0.
  2. Create a DiskPool (see the example manifest below).
  3. The DiskPool is stuck in Creating.
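
For reference, a minimal sketch of the kind of DiskPool manifest involved, applied via kubectl. The apiVersion is an assumption and may differ between chart releases; the pool name, node, and disk path are the ones that appear later in this thread:

kubectl apply -f - <<'EOF'
apiVersion: openebs.io/v1beta2   # assumption: check the CRD version shipped with your chart
kind: DiskPool
metadata:
  name: chimera-hdd-pool-3       # pool name as used later in this thread
  namespace: openebs
spec:
  node: chimera                  # node name as referenced later in this thread
  disks: ["aio:///dev/disk/by-path/pci-0000:00:17.0-ata-3"]
EOF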

Expected behavior
DiskPool created.


**OS info (please complete the following information):**
I use the OpenEBS chart, versions 4.1.1, 4.1.2, and 4.2.0, with default values.
I cannot disclose infrastructure details due to an NDA.


@tiagolobocastro
Contributor

Hi @Mefinst, could you please try the latest version of openebs (4.3.0)?

@Mefinst
Author

Mefinst commented Feb 24, 2025

Hi @Mefinst, could you please try the latest version of openebs (4.3.0)?

How could I obtain 4.3.0?

@tiagolobocastro
Contributor

tiagolobocastro commented Feb 24, 2025

Sorry, it's actually 4.2.0!
I got confused because your io-engine logs show v2.7.0.

The io-engine upgrade is not performed automatically (the daemonset has an OnDelete upgrade strategy).
If you have running workloads, you can use kubectl-mayastor upgrade -n openebs to upgrade the io-engine gracefully. If you don't, you can just delete the existing io-engine pods.
(Note: kubectl-openebs is being improved to contain mayastor subcommands such as upgrade mayastor.)
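
As a sketch of both options (the io-engine pod label used below is an assumption, so verify it first):

# Option 1: graceful upgrade via the plugin mentioned above
kubectl-mayastor upgrade -n openebs
# Option 2: trigger the OnDelete daemonset rollout by deleting the io-engine pods
kubectl -n openebs get daemonset --show-labels     # confirm the io-engine label
kubectl -n openebs delete pods -l app=io-engine    # assumed label selector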

@Mefinst
Author

Mefinst commented Feb 24, 2025

Those logs are from version 4.1.1, to which I tried to downgrade after seeing that changes to the pool creation procedure (issue 3820) were made in version 4.1.2.
I get the same problem and the same logs with any of the three versions listed in the initial message, each clean installed (remove the DiskPool resources, uninstall the Helm chart, wait for the namespace to clean up, install the Helm chart).
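
For reference, the clean-install cycle was roughly the following (the release and namespace names match the helm ls output later in this thread; the chart repo alias is an assumption):

kubectl -n openebs delete diskpools --all     # remove the DiskPool resources
helm uninstall openebs -n openebs             # uninstall the Helm chart
kubectl get all -n openebs                    # wait until the namespace has drained
helm install openebs openebs/openebs -n openebs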

@tiagolobocastro
Contributor

Could you please share a support bundle from the clean install?

@Mefinst
Author

Mefinst commented Feb 24, 2025

@tiagolobocastro
Contributor

Hey, this is still mayastor v2.7.1, are you sure you're on openebs 4.2?
helm ls -n openebs

@Mefinst
Author

Mefinst commented Feb 24, 2025

Yep. That was 4.1.1.

Here is a new one from 4.2.0.
mayastor-2025-02-24--14-45-22-UTC.tar.gz

❯ helm ls -n openebs
NAME   	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART        	APP VERSION
openebs	openebs  	1       	2025-02-24 14:41:16.536721252 +0000 UTC	deployed	openebs-4.2.0	4.2.0      

@tiagolobocastro
Contributor

Hmm, indeed something is quite wrong.
I suspect the delete path may also be having issues with timeouts, and perhaps not handling them.
Would you be able to delete the io-engine pod on the chimera node?
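
Something along these lines should do it (the pod name is a placeholder for the io-engine pod scheduled on chimera):

kubectl -n openebs get pods -o wide --field-selector spec.nodeName=chimera
kubectl -n openebs delete pod <io-engine-pod-on-chimera>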

@Mefinst
Author

Mefinst commented Feb 26, 2025

Deleting the io-engine pod did not help. I collected the support info after deleting it, though.

mayastor-2025-02-26--20-07-25-UTC.tar.gz

@tiagolobocastro
Contributor

Could you please scale down the agent-core deployment?
Then run this command on the chimera io-engine pod, in the io-engine container: io-engine-client pool create chimera-hdd-pool-3 aio:///dev/disk/by-path/pci-0000:00:17.0-ata-3
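
For the scale-down part, a sketch, assuming the deployment is named openebs-agent-core (check kubectl get deploy -n openebs):

kubectl -n openebs scale deployment openebs-agent-core --replicas=0
# ... run the io-engine-client command above inside the io-engine container ...
kubectl -n openebs scale deployment openebs-agent-core --replicas=1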

@Mefinst
Author

Mefinst commented Mar 2, 2025

❯ kubectl exec -n openebs openebs-io-engine-7ccj7 -c io-engine -- io-engine-client pool create chimera-hdd-pool-3 aio:///dev/disk/by-path/pci-0000:00:17.0-ata-3
gRPC status: status: AlreadyExists, message: ": volume already exists, failed to create pool chimera-hdd-pool-3", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Sun, 02 Mar 2025 03:29:42 GMT", "content-length": "0"} }
Backtrace [
    { fn: "<core::option::Option<std::backtrace::Backtrace> as snafu::GenerateImplicitData>::generate_with_source" },
    { fn: "io_engine_client::v1::pool_cli::handler::{{closure}}" },
    { fn: "io_engine_client::v1::main_::{{closure}}" },
    { fn: "io_engine_client::main::{{closure}}" },
    { fn: "io_engine_client::main" },
    { fn: "std::sys::backtrace::__rust_begin_short_backtrace" },
    { fn: "std::rt::lang_start::{{closure}}" },
    { fn: "std::rt::lang_start_internal" },
    { fn: "main" },
    { fn: "__libc_start_call_main" },
    { fn: "__libc_start_main@GLIBC_2.2.5" },
    { fn: "_start" },
]
command terminated with exit code 1
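
The AlreadyExists error suggests the pool was in fact created on the device earlier. One way to confirm, assuming io-engine-client exposes pool list and pool destroy subcommands alongside pool create, might be:

# list the pools this io-engine instance knows about (assumed subcommand)
kubectl -n openebs exec openebs-io-engine-7ccj7 -c io-engine -- io-engine-client pool list
# if the stale pool is listed, it could be destroyed before retrying the create (assumed subcommand)
kubectl -n openebs exec openebs-io-engine-7ccj7 -c io-engine -- io-engine-client pool destroy chimera-hdd-pool-3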

@Mefinst
Author

Mefinst commented Mar 13, 2025

Is there anything else to try?

@tiagolobocastro
Contributor

Hmm, I think there might be some problem with multiple slow pools being created at the same time.
Could you please delete the DiskPool CRs?
Then, after a few minutes, create the DiskPools one by one, waiting until each is Online.
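
As a sketch of the one-by-one flow (the manifest file names are placeholders):

kubectl apply -f diskpool-1.yaml
kubectl -n openebs get diskpools -w     # wait until this pool reports Online
kubectl apply -f diskpool-2.yaml
kubectl -n openebs get diskpools -w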

@Mefinst
Author

Mefinst commented Mar 13, 2025

Same result. Also, I had tried that before filing the issue.
At some point during initial testing I was able to create a DiskPool using /dev/sda-style links for some reason, though I wouldn't repeat that in production for obvious reasons.

I also posted a link to openebs/openebs#3820, another issue where timeouts were a problem for large/slow disks. There was some kind of solution that increases timeouts via CLI flags, but whether or not that would help, I don't even know where I should pass these:

     --request-timeout=360s
      --no-min-timeouts
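
For what it's worth, one plausible place for these flags is the agent-core container arguments (the deployment and container names below are assumptions; there may also be corresponding Helm values):

kubectl -n openebs edit deployment openebs-agent-core
# in the agent-core container spec, add to args:
#   - --request-timeout=360s
#   - --no-min-timeouts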

@Mefinst
Author

Mefinst commented Mar 20, 2025

I recreated the agent-core deployment using the config parameters from my previous message. Both pools became Online.

Could you provide some insight into the potential side effects of such a configuration?

@tiagolobocastro
Contributor

The downside is that some operations may take a very long time to fail or may get stuck for a long time.
Could you try removing those parameters now, then restarting the io-engine pods, and check whether at least the pools are imported correctly?

@Mefinst
Author

Mefinst commented Mar 21, 2025

Yep. According to kubectl get diskpool the pools are Online.
I also tested a node restart; after several minutes the pools became Online again.

Do "some operations" include volume attachment/detachment during pod creation/restarts, and IO operations on the volume?

@tiagolobocastro
Contributor

Not IO operations, but control-plane operations such as volume create and publish.
I suggest removing those parameters now, since the import is working.

We still need to track down why the latest build didn't work with default parameters, but we haven't had time to delve into this.

@avishnu
Member

avishnu commented Mar 27, 2025

@Mefinst could you test the scenario once again in the 4.2 version, by creating a pool on a similar block device? Share the logs if you are able to see the same issue.

@avishnu avishnu added triage/needs-information Indicates an issue needs more information in order to work on it kind/bug Categorizes issue or PR as related to a bug labels Mar 27, 2025
@Mefinst
Author

Mefinst commented Apr 8, 2025

@avishnu

Deleted the DiskPools.
Created the DiskPools.
They are stuck in Creating.

mayastor-2025-04-08--11-15-22-UTC.tar.gz

It's for the same two drives; those are the only ones I have to test on now.
