Skip to content

Group snapshot not taken on conflict #1280

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jsafrane opened this issue Mar 11, 2025 · 1 comment
Open

Group snapshot not taken on conflict #1280

jsafrane opened this issue Mar 11, 2025 · 1 comment
Assignees

Comments

@jsafrane
Copy link
Contributor

What happened:

I do not know how exactly it happened, but I got into a situation when the snapshot controller was processing a VolumeGroupSnapshot and corresponding VolumeSnapshot already existed. The controller reacts to 409 Already Exists here:

createdVolumeSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(volumeSnapshotNamespace).Create(ctx, volumeSnapshot, metav1.CreateOptions{})
if err != nil && !apierrs.IsAlreadyExists(err) {
return groupSnapshotContent, fmt.Errorf(
"createSnapshotsForGroupSnapshotContent: creating volumesnapshot %w", err)
}

You can see the code continues processing when the snapshot already exists. However, the createdVolumeSnapshot variable has undefined content (it points to an object with empty namespace / name in my case) and later on, when the controller tries to update it, Patch() fails here:

_, err = utils.PatchVolumeSnapshot(createdVolumeSnapshot, []utils.PatchOp{

Error:
E0310 21:18:31.250917 1 groupsnapshot_controller_helper.go:257] could not sync group snapshot "e2e-volumegroupsnapshottable-8755/group-snapshot-z6w49": createSnapshotsForGroupSnapshotContent: binding volumesnapshot to volumesnapshotcontent resource name may not be empty

What you expected to happen:
The controller should perhaps Get() the VolumeSnapshot from its informer before creating it. And if it already exists, then fail + expect the VolumeSnapshot appears in the informer in the next retry.

All IsAlreadyExists should be then handled in a similar way, not just VolumeSnapshot.

How to reproduce it:
I don't know. It happened only once in OpenShift e2e tests. I am stress-testing the test, 1024 runs so far, 0 failures.

Logs from OpenShift e2e tests:

As you can see, it would be helpful to log names of created VolumeSnapshots, VolumeSnapshotContents, and VolumeGroupContents - it's quite hard to map which VolumeSnapshot failed to be patched and what's the corresponding VGSC.

Anything else we need to know?:

Environment:

  • Driver version: csi-driver-hostpath + e2e tests as in Kubernetes 1.32
  • Kubernetes version (use kubectl version): 1.32-ish
@jsafrane
Copy link
Contributor Author

/assign @leonardoce

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants