DiskPool stuck Creating #1823

Open
Mefinst opened this issue Feb 23, 2025 · 21 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug
triage/needs-information: Indicates an issue needs more information in order to work on it

Comments

@Mefinst

Mefinst commented Feb 23, 2025

Describe the bug
DiskPool resources hang in the Creating state because the IO-engine is unable to send an Ok response after the pool has in fact been created.

IO-engine pod logs

From those I conclude that the IO-engine successfully creates or destroys a pool when requested to do so.
The create_pool method times out.
Because the method times out, the IO-engine fails to send a success response.
The operator then sends commands one after another to destroy and import the pool: it destroys the previously created pool, tries to import it, then tries to create a new one, and hits the timeout again.

I believe the DiskPool operator should not send DestroyPoolRequest commands during the creation process.

DiskPool operator logs

Not very informative. They contain the same messages that are shown by kubectl describe diskpool.

To Reproduce
Steps to reproduce the behavior:

  1. Install the OpenEBS Helm chart. Tested 4.1.1, 4.1.2, 4.2.0.
  2. Create a DiskPool (see the example manifest below).
  3. The DiskPool is stuck in Creating.
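
For reference, a minimal sketch of the kind of DiskPool manifest involved, applied via kubectl. The apiVersion is an assumption and may differ between chart releases; the pool name, node, and disk path are the ones that appear later in this thread:

kubectl apply -f - <<'EOF'
apiVersion: openebs.io/v1beta2   # assumption: check the CRD version shipped with your chart
kind: DiskPool
metadata:
  name: chimera-hdd-pool-3       # pool name as used later in this thread
  namespace: openebs
spec:
  node: chimera                  # node name as referenced later in this thread
  disks: ["aio:///dev/disk/by-path/pci-0000:00:17.0-ata-3"]
EOF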

Expected behavior
DiskPool created.


**OS info (please complete the following information):**
I use the OpenEBS chart, versions 4.1.1, 4.1.2, and 4.2.0, with default values.
I cannot disclose infrastructure details due to an NDA.


@tiagolobocastro
Contributor

Hi @Mefinst, could you please try the latest version of openebs (4.3.0)?

@Mefinst
Author

Mefinst commented Feb 24, 2025

Hi @Mefinst, could you please try the latest version of openebs (4.3.0)?

How could I obtain 4.3.0?

@tiagolobocastro
Contributor

tiagolobocastro commented Feb 24, 2025

Sorry, it's actually 4.2.0!
I got confused because your io-engine logs show v2.7.0.

The io-engine upgrade is not performed automatically (the daemonset has an OnDelete upgrade strategy).
If you have running workloads, you can use kubectl-mayastor upgrade -n openebs to upgrade the io-engine gracefully. If you don't, you can just delete the existing io-engine pods.
(Note: kubectl-openebs is being improved to contain mayastor subcommands such as upgrade mayastor.)
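
As a sketch of both options (the io-engine pod label used below is an assumption, so verify it first):

# Option 1: graceful upgrade via the plugin mentioned above
kubectl-mayastor upgrade -n openebs
# Option 2: trigger the OnDelete daemonset rollout by deleting the io-engine pods
kubectl -n openebs get daemonset --show-labels     # confirm the io-engine label
kubectl -n openebs delete pods -l app=io-engine    # assumed label selector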

@Mefinst
Author

Mefinst commented Feb 24, 2025

Those logs are from version 4.1.1, to which I tried to downgrade after seeing that changes to the pool creation procedure (issue 3820) were made in version 4.1.2.
I get the same problem and the same logs with any of the three versions listed in the initial message, each clean installed (remove the DiskPool resources, uninstall the Helm chart, wait for the namespace to clean up, install the Helm chart).
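
For reference, the clean-install cycle was roughly the following (the release and namespace names match the helm ls output later in this thread; the chart repo alias is an assumption):

kubectl -n openebs delete diskpools --all     # remove the DiskPool resources
helm uninstall openebs -n openebs             # uninstall the Helm chart
kubectl get all -n openebs                    # wait until the namespace has drained
helm install openebs openebs/openebs -n openebs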

@tiagolobocastro
Contributor

Could you please share a support bundle from the clean install?

@Mefinst
Author

Mefinst commented Feb 24, 2025

@tiagolobocastro
Contributor

Hey, this is still mayastor v2.7.1, are you sure you're on openebs 4.2?
helm ls -n openebs

@Mefinst
Author

Mefinst commented Feb 24, 2025

Yep. That was 4.1.1.

Here is a new one from 4.2.0.
mayastor-2025-02-24--14-45-22-UTC.tar.gz

❯ helm ls -n openebs
NAME   	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART        	APP VERSION
openebs	openebs  	1       	2025-02-24 14:41:16.536721252 +0000 UTC	deployed	openebs-4.2.0	4.2.0      

@tiagolobocastro
Contributor

Hmm, indeed something is quite wrong.
I suspect the delete path may also be having issues with timeouts, and perhaps not handling them.
Would you be able to delete the io-engine pod on the chimera node?
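
Something along these lines should do it (the pod name is a placeholder for the io-engine pod scheduled on chimera):

kubectl -n openebs get pods -o wide --field-selector spec.nodeName=chimera
kubectl -n openebs delete pod <io-engine-pod-on-chimera>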

@Mefinst
Author

Mefinst commented Feb 26, 2025

Deleting the io-engine pod did not help. I collected the support info after deleting it, though.

mayastor-2025-02-26--20-07-25-UTC.tar.gz

@tiagolobocastro
Contributor

Could you please scale down the agent-core deployment?
Then run this command on the chimera io-engine pod, in the io-engine container: io-engine-client pool create chimera-hdd-pool-3 aio:///dev/disk/by-path/pci-0000:00:17.0-ata-3
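
For the scale-down part, a sketch, assuming the deployment is named openebs-agent-core (check kubectl get deploy -n openebs):

kubectl -n openebs scale deployment openebs-agent-core --replicas=0
# ... run the io-engine-client command above inside the io-engine container ...
kubectl -n openebs scale deployment openebs-agent-core --replicas=1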

@Mefinst
Author

Mefinst commented Mar 2, 2025

❯ kubectl exec -n openebs openebs-io-engine-7ccj7 -c io-engine -- io-engine-client pool create chimera-hdd-pool-3 aio:///dev/disk/by-path/pci-0000:00:17.0-ata-3
gRPC status: status: AlreadyExists, message: ": volume already exists, failed to create pool chimera-hdd-pool-3", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Sun, 02 Mar 2025 03:29:42 GMT", "content-length": "0"} }
Backtrace [
    { fn: "<core::option::Option<std::backtrace::Backtrace> as snafu::GenerateImplicitData>::generate_with_source" },
    { fn: "io_engine_client::v1::pool_cli::handler::{{closure}}" },
    { fn: "io_engine_client::v1::main_::{{closure}}" },
    { fn: "io_engine_client::main::{{closure}}" },
    { fn: "io_engine_client::main" },
    { fn: "std::sys::backtrace::__rust_begin_short_backtrace" },
    { fn: "std::rt::lang_start::{{closure}}" },
    { fn: "std::rt::lang_start_internal" },
    { fn: "main" },
    { fn: "__libc_start_call_main" },
    { fn: "__libc_start_main@GLIBC_2.2.5" },
    { fn: "_start" },
]
command terminated with exit code 1
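
The AlreadyExists error suggests the pool was in fact created on the device earlier. One way to confirm, assuming io-engine-client exposes pool list and pool destroy subcommands alongside pool create, might be:

# list the pools this io-engine instance knows about (assumed subcommand)
kubectl -n openebs exec openebs-io-engine-7ccj7 -c io-engine -- io-engine-client pool list
# if the stale pool is listed, it could be destroyed before retrying the create (assumed subcommand)
kubectl -n openebs exec openebs-io-engine-7ccj7 -c io-engine -- io-engine-client pool destroy chimera-hdd-pool-3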

@Mefinst
Author

Mefinst commented Mar 13, 2025

Is there anything else to try?

@tiagolobocastro
Contributor

Hmm, I think there might be some problem with multiple slow pools being created at the same time.
Could you please delete the DiskPool CRs?
Then, after a few minutes, create the DiskPools one by one, waiting until each is Online.
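
As a sketch of the one-by-one flow (the manifest file names are placeholders):

kubectl apply -f diskpool-1.yaml
kubectl -n openebs get diskpools -w     # wait until this pool reports Online
kubectl apply -f diskpool-2.yaml
kubectl -n openebs get diskpools -w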

@Mefinst
Author

Mefinst commented Mar 13, 2025

Same result. Also, I had tried that before filing the issue.
At some point during initial testing I was able to create a DiskPool using /dev/sda-style links for some reason, though I wouldn't repeat that in production for obvious reasons.

I also posted a link to openebs/openebs#3820, another issue where timeouts were a problem for large/slow disks. There was some kind of solution that increases timeouts via CLI flags, but whether or not that would help, I don't even know where I should pass these:

     --request-timeout=360s
      --no-min-timeouts
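
For what it's worth, one plausible place for these flags is the agent-core container arguments (the deployment and container names below are assumptions; there may also be corresponding Helm values):

kubectl -n openebs edit deployment openebs-agent-core
# in the agent-core container spec, add to args:
#   - --request-timeout=360s
#   - --no-min-timeouts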

@Mefinst
Author

Mefinst commented Mar 20, 2025

I recreated the agent-core deployment using the config parameters from my previous message. Both pools became Online.

Could you provide some insight into the potential side effects of such a configuration?

@tiagolobocastro
Contributor

The downside is that some operations may take a very long time to fail or may get stuck for a long time.
Could you try removing those parameters now, then restarting the io-engine pods, and check whether at least the pools are imported correctly?

@Mefinst
Author

Mefinst commented Mar 21, 2025

Yep. According to kubectl get diskpool the pools are Online.
I also tested a node restart; after several minutes the pools became Online again.

Do "some operations" include volume attachment/detachment during pod creation/restarts, and IO operations on the volume?

@tiagolobocastro
Contributor

Not IO operations, but control-plane operations such as volume create and publish.
I suggest removing those parameters now, since the import is working.

We still need to track down why the latest build didn't work with default parameters, but we haven't had time to delve into this.

@avishnu
Member

avishnu commented Mar 27, 2025

@Mefinst could you test the scenario once again in the 4.2 version, by creating a pool on a similar block device? Share the logs if you are able to see the same issue.

@avishnu avishnu added triage/needs-information Indicates an issue needs more information in order to work on it kind/bug Categorizes issue or PR as related to a bug labels Mar 27, 2025
@Mefinst
Author

Mefinst commented Apr 8, 2025

@avishnu

Deleted the DiskPools.
Created the DiskPools.
They are stuck in Creating.

mayastor-2025-04-08--11-15-22-UTC.tar.gz

It's for the same two drives; those are the only ones I have to test on now.
