@@ -93,6 +93,8 @@ SIG Architecture for cross-cutting KEPs).
- [Alternatives](#alternatives)
  - [New GRPC](#new-grpc)
  - [User Specified Topology Requirement](#user-specified-topology-requirement)
+  - [Support for SPs that don't Know Attached Nodes](#support-for-sps-that-dont-know-attached-nodes)
+  - [Try to Eliminate Race Condition](#try-to-eliminate-race-condition)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

@@ -224,25 +226,40 @@ required:
      values:
      - cn-beijing-g
```
- The updated affinity would be:
- ```yaml
- required:
-   nodeSelectorTerms:
-   - matchExpressions:
-     - key: topology.kubernetes.io/region
-       operator: In
-       values:
-       - cn-beijing
- ```
-
- Or in the view of CSI accessibility requirement, from:
+ Or in the view of CSI accessibility requirement:
```json
[{"topology.kubernetes.io/zone": "cn-beijing-g"}]
```
- to:
- ```json
- [{"topology.kubernetes.io/region": "cn-beijing"}]
- ```
+
+ The workflow:
+ 1. User creates a `VolumeAttributesClass`:
+    ```yaml
+    apiVersion: storage.k8s.io/v1beta1
+    kind: VolumeAttributesClass
+    metadata:
+      name: regional
+    driverName: csi.provider.com
+    parameters:
+      type: regional
+    ```
+ 2. User modifies the `volumeAttributesClassName` in the PVC to `regional`.
+ 3. external-resizer initiates ControllerModifyVolume with `allow_topology_updates` set to true and `mutable_parameters` set to `{"type": "regional"}`.
+ 4. The CSI driver blocks until the modification finishes, then returns with `accessible_topology` set to
+    ```json
+    [{"topology.kubernetes.io/region": "cn-beijing"}]
+    ```
+ 5. external-resizer sets `PersistentVolume.spec.nodeAffinity` to
+    ```yaml
+    required:
+      nodeSelectorTerms:
+      - matchExpressions:
+        - key: topology.kubernetes.io/region
+          operator: In
+          values:
+          - cn-beijing
+    ```
+    then updates the PV status to indicate that the modification succeeded (see the PVC status sketch below).
+
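+ For illustration, after step 5 the PVC status would reflect the completed
+ modification. A hypothetical sketch, using the status fields from the existing
+ VolumeAttributesClass API:
+ ```yaml
+ # PersistentVolumeClaim status after a successful modification
+ status:
+   currentVolumeAttributesClassName: regional
+   # While the modification is still running, the PVC would instead report:
+   # modifyVolumeStatus:
+   #   targetVolumeAttributesClassName: regional
+   #   status: InProgress
+ ```
+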
#### Story 2

@@ -261,30 +278,51 @@ required:
      values:
      - available
```
- The updated affinity would be:
- ```yaml
- required:
-   nodeSelectorTerms:
-   - matchExpressions:
-     - key: provider.com/disktype.cloud_essd
-       operator: In
-       values:
-       - available
- ```
-
- Or in the view of CSI accessibility requirement, from:
+ Or in the view of CSI accessibility requirement:
```json
[{"provider.com/disktype.cloud_ssd": "available"}]
```
- to:
- ```json
- [{"provider.com/disktype.cloud_essd": "available"}]
- ```

Type A nodes only support cloud_ssd, while Type B nodes support both cloud_ssd and cloud_essd.
We will only allow the modification if the volume is attached to type B nodes.
And I want to make sure that Pods using the new cloud_essd volume are not scheduled to type A nodes.

+ In this case, the modification takes a long time, the new topology is not strictly less restrictive,
+ and the SP wants to minimize the time window of the race condition.
+
+ The workflow:
+ 1. User creates a `VolumeAttributesClass`:
+    ```yaml
+    apiVersion: storage.k8s.io/v1beta1
+    kind: VolumeAttributesClass
+    metadata:
+      name: essd
+    driverName: csi.provider.com
+    parameters:
+      type: cloud_essd
+    ```
+ 2. User modifies the `volumeAttributesClassName` in the PVC to `essd`.
+ 3. external-resizer initiates ControllerModifyVolume with `allow_topology_updates` set to true and `mutable_parameters` set to `{"type": "cloud_essd"}`.
+ 4. The CSI driver returns with `in_progress` set to true and `accessible_topology` set to
+    ```json
+    [{"provider.com/disktype.cloud_essd": "available"}]
+    ```
+ 5. external-resizer sets `PersistentVolume.spec.nodeAffinity` to
+    ```yaml
+    required:
+      nodeSelectorTerms:
+      - matchExpressions:
+        - key: provider.com/disktype.cloud_essd
+          operator: In
+          values:
+          - available
+    ```
+    but the PV status is not updated yet.
+    From now on, new Pods will be scheduled to nodes with `provider.com/disktype.cloud_essd: available`;
+    they may be stuck in the `ContainerCreating` state until the modification finishes.
+ 6. external-resizer goes back to step 3 and retries until `in_progress` is set to false.
+ 7. external-resizer updates the PV status to indicate that the modification succeeded (see the sketch below).
+
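+ For illustration, during steps 5-6 the PVC status would still report the
+ modification as in progress. A hypothetical sketch, using the status fields
+ from the existing VolumeAttributesClass API:
+ ```yaml
+ # PersistentVolumeClaim status while ControllerModifyVolume is being retried
+ status:
+   currentVolumeAttributesClassName: ssd   # previous class (hypothetical name)
+   modifyVolumeStatus:
+     targetVolumeAttributesClassName: essd
+     status: InProgress
+ ```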

### Notes/Constraints/Caveats (Optional)

@@ -390,11 +428,14 @@ message ControllerModifyVolumeResponse {
  // If it is not specified, the CO MAY assume
  // the volume is equally accessible from all nodes in the cluster and
  // MAY schedule workloads referencing the volume on any available
- // node.
+ // node.
  //
  // SP MUST only set this field if allow_topology_updates is set
  // in the request. SP SHOULD fail the request if it needs to update
  // topology but is not allowed by CO.
+ //
+ // SP SHOULD only return topology that is a super-set of the
+ // original topology to avoid race conditions when scheduling.
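+ //
+ // For example (reusing the zones from Story 1 purely as an
+ // illustration): if the original topology was
+ //   [{"topology.kubernetes.io/zone": "cn-beijing-g"}]
+ // then returning
+ //   [{"topology.kubernetes.io/region": "cn-beijing"}]
+ // is a super-set, since every node in zone cn-beijing-g is also in
+ // region cn-beijing, so workloads already scheduled against the old
+ // topology remain within the new one.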
  repeated Topology accessible_topology = 1;

  // Indicates whether the modification is still in progress.
@@ -422,42 +463,6 @@ But this KEP does not cover the automatic correction. Kubernetes should only ret

Scheduler Enhancements: make sure the Pod is re-queued when the PV is updated.

- A typical workflow is (taking the user story 1 as an example):
- 1. User create a `VolumeAttributeClass`:
-    ```yaml
-    apiVersion: storage.k8s.io/v1beta1
-    kind: VolumeAttributesClass
-    metadata:
-      name: regional
-    driverName: csi.provider.com
-    parameters:
-      type: regional
-    ```
- 2. User modify the `volumeAttributesClassName` in the PVC to `regional`
- 3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "regional"}`
- 4. CSI driver blocks until the modification finished, then return with `accessible_topology` set to `[{"topology.kubernetes.io/region": "cn-beijing"}]`
- 5. external-resizer sets `PersistentVolume.spec.nodeAffinity` accordingly, then update the PV status to indicate the modification is successful.
-
- If it takes long to modify the volume, the new topology is not strictly less restrictive,
- and SP wants to minimize the time window of the race condition (taking the user story 2 as an example):
- 1. User create a `VolumeAttributeClass`:
-    ```yaml
-    apiVersion: storage.k8s.io/v1beta1
-    kind: VolumeAttributesClass
-    metadata:
-      name: essd
-    driverName: csi.provider.com
-    parameters:
-      type: cloud_essd
-    ```
- 2. User modify the `volumeAttributesClassName` in the PVC to `essd`
- 3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "cloud_essd"}`
- 4. CSI driver returns with `in_progress` set to true, and `accessible_topology` set to `[{"provider.com/disktype.cloud_essd": "available"}]`
- 5. external-resizer sets `PersistentVolume.spec.nodeAffinity` accordingly, but the PV status is not updated yet.
-    From now on, the new Pod will be scheduled to nodes with `provider.com/disktype.cloud_essd: available`,
-    maybe they will stuck in `ContainerCreating` state until the modification finishes.
- 6. external-resizer go back to step 3, retries until `in_progress` is set to false.
- 7. external-resizer update the PV status to indicate the modification is successful.


### Test Plan

@@ -1035,7 +1040,7 @@ rpc ControllerModifyVolumeTopology (ControllerModifyVolumeTopologyRequest)
message ControllerModifyVolumeTopologyRequest {
  option (alpha_message) = true;
  string volume_id = 1;
- map<string, string> secrest = 2 [(csi_secret) = true];
+ map<string, string> secrets = 2 [(csi_secret) = true];
  map<string, string> mutable_parameters = 3;
}

@@ -1067,16 +1072,100 @@ We've considered a design:
* Add `accessibility_requirements` in `ModifyVolumeRequest`, like that in `CreateVolumeRequest`
* Add `allowedTopologies` in `VolumeAttributesClass`, like that in `StorageClass`

- But facing a lot of unresolved questions:
+ We determined this design lacks valid use cases.
+
+ In most cases, the SP can determine the desired topology of a volume from the `mutable_parameters`, or from the currently attached nodes.
+ An exception could be modifying a volume from regional to zonal while it is not attached to any node.
+ In this case, the SP may need more information from the CO to determine the desired zone.
+ But we don't want the user to create a separate VAC for each zone.
+ Instead, it may be easier for the user to just attach the volume to a node, so that the SP can determine the desired zone.
+
+ For the other case (User Story 2), the topology (provider.com/disktype.cloud_essd) is not actually intended for the user as an API.
+ The user just wants to modify the disk type; we implement the underlying limitations as topology.
+ So we don't want to let the user specify this topology key directly.
+
+ Besides, this design faces a lot of unresolved questions:
* How to merge `allowedTopologies` from `VolumeAttributesClass` and `StorageClass`?
* Should we use `allowedTopologies` from `StorageClass` if it is not specified in `VolumeAttributesClass`?
* Should we consider the topology of the currently attached nodes?
* Should we consider the topology of all the nodes in the cluster?

- In most cases, SP can determine the desired topology of a volume from the `mutable_parameters`, or from the currently attached nodes.
- An exception could be: modifying a volume from regional to zonal, and it is not attached to any node.
- In this case, SP will need more information from the CO to determine the desired zone.
- But we don't have such use case now, we decided leave it as a future work.
+ We may consider this again when valid use cases appear.
+
+ ### Support for SPs that don't Know Attached Nodes
+
+ There may be some SPs that don't know the currently attached nodes,
+ so they cannot determine whether a topology change will break any workloads.
+
+ Some kinds of storage have no persistent connection between client and server, such as object storage like S3.
+ Their SPs may fall into this category.
+ But as network-attached storage, they can be accessed from wherever the network reaches,
+ so these SPs typically do not use the topology feature at all.
+
+ So we decided that for an SP to support this feature,
+ it is required to properly detect potential breakage of existing workloads.
+
+ That said, the candidate design looked like this:
+ add a new `dry_run` parameter to `ControllerModifyVolumeRequest`.
+ The CO first calls `ControllerModifyVolume` with `dry_run=true` to fetch the new topology,
+ determines whether the new topology is compatible with the existing workloads,
+ and then decides whether to proceed with the modification with `dry_run=false`.
+
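+ A minimal sketch of how that could look (illustrative only; the field number
+ and comment wording are assumptions, not part of the CSI spec):
+ ```
+ message ControllerModifyVolumeRequest {
+   // ... existing fields: volume_id, secrets, mutable_parameters, ...
+
+   // Hypothetical: when true, the SP validates the requested modification
+   // and returns the resulting accessible_topology in the response,
+   // without actually modifying the volume.
+   bool dry_run = 5;
+ }
+ ```
+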
+ Another way to get the new topology is to further extend the "User Specified Topology Requirement" design:
+ make it required for the user to explicitly specify the new topology in the VAC, and
+ remove `accessible_topology` from `ControllerModifyVolumeResponse`.
+ That is to say, the SP must accept the new topology specified by the user, or reject the request.
+ The workflow would become:
+ 1. User creates a `VolumeAttributesClass`:
+    ```yaml
+    apiVersion: storage.k8s.io/v1beta1
+    kind: VolumeAttributesClass
+    metadata:
+      name: regional
+    driverName: csi.provider.com
+    parameters:
+      type: regional
+    allowedTopologies:
+    - matchLabelExpressions:
+      - key: topology.kubernetes.io/region
+        values:
+        - cn-beijing
+    ```
+ 2. User modifies the `volumeAttributesClassName` in the PVC to `regional`.
+ 3. external-resizer verifies that all the nodes with this volume attached satisfy the `allowedTopologies`.
+ 4. external-resizer sets `PersistentVolume.spec.nodeAffinity` to
+    ```yaml
+    required:
+      nodeSelectorTerms:
+      - matchExpressions:
+        - key: topology.kubernetes.io/region
+          operator: In
+          values:
+          - cn-beijing
+    ```
+ 5. external-resizer initiates ControllerModifyVolume with `allow_topology_updates` set to true and `mutable_parameters` set to `{"type": "regional"}`.
+ 6. The CSI driver blocks until the modification finishes.
+ 7. external-resizer then updates the PV status to indicate that the modification succeeded.
+
+ Besides the reasons mentioned above, this design also has a critical drawback:
+ topology can have many orthogonal aspects, such as the zone/region and disk type mentioned above.
+ If the SP cannot return the topology, the user needs to be aware of all aspects of topology used by the SP.
+ And the SP will not be able to extend the topology in the future, since VAC is immutable.
+
+ Note that the above designs are also racy:
+ we may still break workloads that start after the check but before the modification.
+
+ ### Try to Eliminate Race Condition
+
+ We haven't figured out a reasonable way to do this.
+ Basically, we would need to:
+ 1. pause scheduling of the Pods that are using the volume;
+ 2. get an ack from the scheduler that the pause is effective;
+ 3. conduct ModifyVolume;
+ 4. retrieve the new topology, then resume scheduling.
+
+ We think this is too complex and not worth it.
+ A similar failure is already possible with KEP-4876 when the CSINode allocatable is out of date.

<!--
What other approaches did you consider, and why did you rule them out? These do