Skip to content

Commit 51ac059

Browse files
committed
Address comments from last meeting
1 parent e23b658 commit 51ac059

File tree

1 file changed

+164
-75
lines changed
  • keps/sig-storage/5381-mutable-pv-affinity

1 file changed

+164
-75
lines changed

keps/sig-storage/5381-mutable-pv-affinity/README.md

Lines changed: 164 additions & 75 deletions
Original file line numberDiff line numberDiff line change
@@ -93,6 +93,8 @@ SIG Architecture for cross-cutting KEPs).
9393
- [Alternatives](#alternatives)
9494
- [New GRPC](#new-grpc)
9595
- [User Specified Topology Requirement](#user-specified-topology-requirement)
96+
- [Support for SPs that don't Know Attached Nodes](#support-for-sps-that-dont-know-attached-nodes)
97+
- [Try to Eliminate Race Condition](#try-to-eliminate-race-condition)
9698
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
9799
<!-- /toc -->
98100

@@ -224,25 +226,40 @@ required:
224226
values:
225227
- cn-beijing-g
226228
```
227-
The updated affinity would be:
228-
```yaml
229-
required:
230-
nodeSelectorTerms:
231-
- matchExpressions:
232-
- key: topology.kubernetes.io/region
233-
operator: In
234-
values:
235-
- cn-beijing
236-
```
237-
238-
Or in the view of CSI accessibility requirement, from:
229+
Or in the view of CSI accessibility requirement:
239230
```json
240231
[{"topology.kubernetes.io/zone": "cn-beijing-g"}]
241232
```
242-
to:
243-
```json
244-
[{"topology.kubernetes.io/region": "cn-beijing"}]
245-
```
233+
234+
The workflow:
235+
1. User create a `VolumeAttributeClass`:
236+
```yaml
237+
apiVersion: storage.k8s.io/v1beta1
238+
kind: VolumeAttributesClass
239+
metadata:
240+
name: regional
241+
driverName: csi.provider.com
242+
parameters:
243+
type: regional
244+
```
245+
2. User modify the `volumeAttributesClassName` in the PVC to `regional`
246+
3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "regional"}`
247+
4. CSI driver blocks until the modification finished, then return with `accessible_topology` set to
248+
```json
249+
[{"topology.kubernetes.io/region": "cn-beijing"}]
250+
```
251+
5. external-resizer sets `PersistentVolume.spec.nodeAffinity` to
252+
```yaml
253+
required:
254+
nodeSelectorTerms:
255+
- matchExpressions:
256+
- key: topology.kubernetes.io/region
257+
operator: In
258+
values:
259+
- cn-beijing
260+
```
261+
then update the PV status to indicate the modification is successful.
262+
246263

247264
#### Story 2
248265

@@ -261,30 +278,51 @@ required:
261278
values:
262279
- available
263280
```
264-
The updated affinity would be:
265-
```yaml
266-
required:
267-
nodeSelectorTerms:
268-
- matchExpressions:
269-
- key: provider.com/disktype.cloud_essd
270-
operator: In
271-
values:
272-
- available
273-
```
274-
275-
Or in the view of CSI accessibility requirement, from:
281+
Or in the view of CSI accessibility requirement:
276282
```json
277283
[{"provider.com/disktype.cloud_ssd": "available"}]
278284
```
279-
to:
280-
```json
281-
[{"provider.com/disktype.cloud_essd": "available"}]
282-
```
283285

284286
Type A node only supports cloud_ssd, while Type B node supports both cloud_ssd and cloud_essd.
285287
We will only allow the modification if the volume is attached to type B nodes.
286288
And I want to make sure the Pods using new cloud_essd volume not to be scheduled to type A nodes.
287289

290+
In this case, it takes long to modify the volume, the new topology is not strictly less restrictive,
291+
and SP wants to minimize the time window of the race condition:
292+
293+
The workflow:
294+
1. User create a `VolumeAttributeClass`:
295+
```yaml
296+
apiVersion: storage.k8s.io/v1beta1
297+
kind: VolumeAttributesClass
298+
metadata:
299+
name: essd
300+
driverName: csi.provider.com
301+
parameters:
302+
type: cloud_essd
303+
```
304+
2. User modify the `volumeAttributesClassName` in the PVC to `essd`
305+
3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "cloud_essd"}`
306+
4. CSI driver returns with `in_progress` set to true, and `accessible_topology` set to
307+
```json
308+
[{"provider.com/disktype.cloud_essd": "available"}]
309+
```
310+
5. external-resizer sets `PersistentVolume.spec.nodeAffinity` to
311+
```yaml
312+
required:
313+
nodeSelectorTerms:
314+
- matchExpressions:
315+
- key: provider.com/disktype.cloud_essd
316+
operator: In
317+
values:
318+
- available
319+
```
320+
but the PV status is not updated yet.
321+
From now on, the new Pod will be scheduled to nodes with `provider.com/disktype.cloud_essd: available`,
322+
maybe they will stuck in `ContainerCreating` state until the modification finishes.
323+
6. external-resizer go back to step 3, retries until `in_progress` is set to false.
324+
7. external-resizer update the PV status to indicate the modification is successful.
325+
288326

289327
### Notes/Constraints/Caveats (Optional)
290328

@@ -390,11 +428,14 @@ message ControllerModifyVolumeResponse {
390428
// If it is not specified, the CO MAY assume
391429
// the volume is equally accessible from all nodes in the cluster and
392430
// MAY schedule workloads referencing the volume on any available
393-
// node.
431+
// node.
394432
//
395433
// SP MUST only set this field if allow_topology_updates is set
396434
// in the request. SP SHOULD fail the request if it needs to update
397435
// topology but is not allowed by CO.
436+
//
437+
// SP SHOULD only return topology that is a super-set of the
438+
// original topology to avoid race conditions when scheduling.
398439
repeated Topology accessible_topology = 1;
399440
400441
// Indicates whether the modification is still in progress.
@@ -422,42 +463,6 @@ But this KEP does not cover the automatic correction. Kubernetes should only ret
422463

423464
Scheduler Enhancements: make sure the Pod is re-queued when the PV is updated.
424465

425-
A typical workflow is (taking the user story 1 as an example):
426-
1. User create a `VolumeAttributeClass`:
427-
```yaml
428-
apiVersion: storage.k8s.io/v1beta1
429-
kind: VolumeAttributesClass
430-
metadata:
431-
name: regional
432-
driverName: csi.provider.com
433-
parameters:
434-
type: regional
435-
```
436-
2. User modify the `volumeAttributesClassName` in the PVC to `regional`
437-
3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "regional"}`
438-
4. CSI driver blocks until the modification finished, then return with `accessible_topology` set to `[{"topology.kubernetes.io/region": "cn-beijing"}]`
439-
5. external-resizer sets `PersistentVolume.spec.nodeAffinity` accordingly, then update the PV status to indicate the modification is successful.
440-
441-
If it takes long to modify the volume, the new topology is not strictly less restrictive,
442-
and SP wants to minimize the time window of the race condition (taking the user story 2 as an example):
443-
1. User create a `VolumeAttributeClass`:
444-
```yaml
445-
apiVersion: storage.k8s.io/v1beta1
446-
kind: VolumeAttributesClass
447-
metadata:
448-
name: essd
449-
driverName: csi.provider.com
450-
parameters:
451-
type: cloud_essd
452-
```
453-
2. User modify the `volumeAttributesClassName` in the PVC to `essd`
454-
3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "cloud_essd"}`
455-
4. CSI driver returns with `in_progress` set to true, and `accessible_topology` set to `[{"provider.com/disktype.cloud_essd": "available"}]`
456-
5. external-resizer sets `PersistentVolume.spec.nodeAffinity` accordingly, but the PV status is not updated yet.
457-
From now on, the new Pod will be scheduled to nodes with `provider.com/disktype.cloud_essd: available`,
458-
maybe they will stuck in `ContainerCreating` state until the modification finishes.
459-
6. external-resizer go back to step 3, retries until `in_progress` is set to false.
460-
7. external-resizer update the PV status to indicate the modification is successful.
461466

462467

463468
### Test Plan
@@ -1035,7 +1040,7 @@ rpc ControllerModifyVolumeTopology (ControllerModifyVolumeTopologyRequest)
10351040
message ControllerModifyVolumeTopologyRequest {
10361041
option (alpha_message) = true;
10371042
string volume_id = 1;
1038-
map<string, string> secrest = 2 [(csi_secret) = true];
1043+
map<string, string> secret = 2 [(csi_secret) = true];
10391044
map<string, string> mutable_parameters = 3;
10401045
}
10411046
@@ -1067,16 +1072,100 @@ We've considered a design:
10671072
* Add `accessibility_requirements` in `ModifyVolumeRequest`, like that in `CreateVolumeRequest`
10681073
* Add `allowedTopologies` in `VolumeAttributeClass`, like that in `StorageClass`
10691074

1070-
But facing a lot of unresolved questions:
1075+
We determine this lacks vaild use cases.
1076+
1077+
In most cases, SP can determine the desired topology of a volume from the `mutable_parameters`, or from the currently attached nodes.
1078+
An exception could be: modifying a volume from regional to zonal, and it is not attached to any node.
1079+
In this case, SP may need more information from the CO to determine the desired zone.
1080+
But we don't want user to create a separate VAC for each zone.
1081+
Instead, it maybe easier for user to just attach it to a node, so that SP can determine the desired zone.
1082+
1083+
For the other case (User Story 2), the topology (provider.com/disktype.cloud_essd) is actually not intended for user as an API.
1084+
User just want to modify the disk type, and we implement the underlying limitations as topology.
1085+
So we don't want to let user to specify this topology key directly.
1086+
1087+
Besides, this facing a lot of unresolved questions:
10711088
* How to merge `allowedTopologies` from `VolumeAttributeClass`, `StorageClass`?
10721089
* Should we use `allowedTopologies` from `StorageClass` if it is not specified in `VolumeAttributeClass`?
10731090
* Should we consider the topology of the currently attached nodes?
10741091
* Should we consider the topology of all the nodes in the cluster?
10751092

1076-
In most cases, SP can determine the desired topology of a volume from the `mutable_parameters`, or from the currently attached nodes.
1077-
An exception could be: modifying a volume from regional to zonal, and it is not attached to any node.
1078-
In this case, SP will need more information from the CO to determine the desired zone.
1079-
But we don't have such use case now, we decided leave it as a future work.
1093+
We may consider this again with vaild use cases.
1094+
1095+
### Support for SPs that don't Know Attached Nodes
1096+
1097+
Maybe there are some SPs that don't know the currently attached nodes,
1098+
so they cannot determine whether the topology change will break any workload.
1099+
1100+
Some kind of storage does not have persistent connection between client and server, such as object storage like S3.
1101+
They may fall into this category of SP.
1102+
But as network attached storage, they can be accessed wherever the network can reach.
1103+
So these SPs typically do not use the topology feature at all.
1104+
1105+
So, we decide that for an SP to support this feature,
1106+
they are required to properly detect potential breaking for existing workloads.
1107+
1108+
That said, the candidate design looks:
1109+
Add a new `dry_run` parameter to the `ControllerModifyVolumeRequest`.
1110+
CO first call `ControllerModifyVolume` with `dry_run=true` to fetch the new topology,
1111+
determine if the new topology is compatible with the existing workloads,
1112+
then decide whether to proceed the modification with `dry_run=false`.
1113+
1114+
Another way to get the new topology is further extending the "User Specified Topology Requirement" section,
1115+
Making it required for user to explicitly specify the new topology in the VAC and
1116+
remove `accessible_topology` from `ControllerModifyVolumeResponse`.
1117+
That is to say, SP must accept the new topology specified by user or it should reject the request.
1118+
The workflow will become:
1119+
1. User create a `VolumeAttributeClass`:
1120+
```yaml
1121+
apiVersion: storage.k8s.io/v1beta1
1122+
kind: VolumeAttributesClass
1123+
metadata:
1124+
name: regional
1125+
driverName: csi.provider.com
1126+
parameters:
1127+
type: regional
1128+
allowedTopologies:
1129+
- matchLabelExpressions:
1130+
- key: topology.kubernetes.io/region
1131+
values:
1132+
- cn-beijing
1133+
```
1134+
2. User modify the `volumeAttributesClassName` in the PVC to `regional`
1135+
3. external-resizer verifies all the nodes that all the nodes with this volume attached satisfy the `allowedTopologies`
1136+
4. external-resizer sets `PersistentVolume.spec.nodeAffinity` to
1137+
```yaml
1138+
required:
1139+
nodeSelectorTerms:
1140+
- matchExpressions:
1141+
- key: topology.kubernetes.io/region
1142+
operator: In
1143+
values:
1144+
- cn-beijing
1145+
```
1146+
5. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "regional"}`
1147+
6. CSI driver blocks until the modification finished
1148+
7. external-resizer then update the PV status to indicate the modification is successful.
1149+
1150+
Besides the reasons mentioned above, we also facing a critical drawback for this design:
1151+
Topology can have many orthogonal aspects, such as above mentioned zone/region and disk type.
1152+
If SP cannot return the topology, user will need to be aware of all aspects of topology used by SP.
1153+
And SP will not able to extend the topology in the future, since VAC is immutable.
1154+
1155+
Note that the above designs are also racy.
1156+
We may still break some workloads that just started after the check but before the modification.
1157+
1158+
### Try to Eliminate Race Condition
1159+
1160+
Haven't figured out a reasonable way to do this.
1161+
Basically, we need to
1162+
1. pause scheduling of the pods that are using the volume;
1163+
2. get ack from scheduler that pause is effective;
1164+
3. conduct ModifyVolume;
1165+
4. retrive the new topology then resume scheduling.
1166+
1167+
I think it is too complex and not worth it.
1168+
Similar failure is already possible with KEP-4876 when CSINode allocatable is out-of-date.
10801169

10811170
<!--
10821171
What other approaches did you consider, and why did you rule them out? These do

0 commit comments

Comments
 (0)