Commit 0c770bc

KEP-4639: Update CRI API and workflow
Updating the KEP after the merge of: kubernetes/kubernetes#125659

This reflects the current state of the enhancement.

Signed-off-by: Sascha Grunert <[email protected]>
1 parent a009ed3 commit 0c770bc

1 file changed, 40 additions and 95 deletions
  • keps/sig-node/4639-oci-volume-source

keps/sig-node/4639-oci-volume-source/README.md

Lines changed: 40 additions & 95 deletions
@@ -370,7 +370,7 @@ And add the corresponding `OCIVolumeSource` type:
 // OCIVolumeSource represents a OCI volume resource.
 type OCIVolumeSource struct {
 // Required: Image or artifact reference to be used
-Reference string `json:"reference,omitempty" protobuf:"bytes,1,opt,name=reference"`
+Reference string `json:"reference" protobuf:"bytes,1,opt,name=reference"`
 
 // Policy for pulling OCI objects
 // Defaults to IfNotPresent
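
Review note: the practical effect of dropping `omitempty` from the `reference` JSON tag can be illustrated with a small standalone Go sketch. The two struct types and the `encode` helper below are hypothetical stand-ins, not the real Kubernetes API types; only the standard `encoding/json` behavior is shown. Without `omitempty`, an empty reference is serialized explicitly instead of being silently dropped.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// withOmitEmpty mirrors the old struct tag (hypothetical stand-in type).
type withOmitEmpty struct {
	Reference string `json:"reference,omitempty"`
}

// withoutOmitEmpty mirrors the new struct tag (hypothetical stand-in type).
type withoutOmitEmpty struct {
	Reference string `json:"reference"`
}

// encode marshals v to a JSON string. Marshalling cannot fail for these
// simple types, so an error is treated as a programming mistake.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(withOmitEmpty{}))    // {}
	fmt.Println(encode(withoutOmitEmpty{})) // {"reference":""}
}
```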
@@ -482,8 +482,8 @@ While the `imagePullPolicy` is working on container level, the introduced
 values `IfNotPresent`, `Always` and `Never`, but will only pull once per pod.
 
 Technically it means that we need to pull in [`SyncPod`](https://github.com/kubernetes/kubernetes/blob/b498eb9/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L1049)
-for OCI objects on a pod level and not during [`EnsureImageExists`](https://github.com/kubernetes/kubernetes/blob/b498eb9/pkg/kubelet/images/image_manager.go#L102)
-before the container gets started.
+for OCI objects on a pod level and not for each container during [`EnsureImageExists`](https://github.com/kubernetes/kubernetes/blob/b498eb9/pkg/kubelet/images/image_manager.go#L102)
+before they get started.
 
 If users want to re-pull artifacts when referencing moving tags like `latest`,
 then they need to restart / evict the pod.
@@ -500,50 +500,44 @@ container image.
 #### CRI
 
 The CRI API is already capable of managing container images [via the `ImageService`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L146-L161).
-Those RPCs will be re-used for managing OCI artifacts, while the [`ImageSpec`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L798-L813)
-as well as [`PullImageResponse`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L1530-L1534)
-will be extended to mount the OCI object to a local path:
+Those RPCs will be re-used for managing OCI artifacts, while the [`Mount`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L220-L247)
+message will be extended to mount an OCI object using the existing [`ImageSpec`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L798-L813)
+on container creation:
 
 ```protobuf
-
-// ImageSpec is an internal representation of an image.
-message ImageSpec {
-// …
-
-// Indicate that the OCI object should be mounted.
-bool mount = 20;
-
-// SELinux label to be used.
-string mount_label = 21;
-}
-
-message PullImageResponse {
+// Mount specifies a host volume to mount into a container.
+message Mount {
 // …
 
-// Absolute local path where the OCI object got mounted.
-string mountpoint = 2;
+// Mount an image reference (image ID, with or without digest), which is a
+// special use case for OCI volume mounts. If this field is set, then
+// host_path should be unset. All OCI mounts are per feature definition
+// readonly. The kubelet does an PullImage RPC and evaluates the returned
+// PullImageResponse.image_ref value, which is then set to the
+// ImageSpec.image field. Runtimes are expected to mount the image as
+// required.
+// Introduced in the OCI Volume Source KEP: https://kep.k8s.io/4639
+ImageSpec image = 9;
 }
 ```
 
 This allows to re-use the existing kubelet logic for managing the OCI objects,
 with the caveat that the new `VolumeSource` won't be isolated in a dedicated
 plugin as part of the existing [volume manager](https://github.com/kubernetes/kubernetes/tree/6d0aab2/pkg/kubelet/volumemanager).
 
-The added `mount_label` allow the kubelet to support SELinux contexts.
+Runtimes are already aware of the correct SELinux parameters during container
+creation and will re-use them for the OCI object mounts.
 
-The kubelet will use the `mountpoint` on container creation
-(by calling the `CreateContainer` RPC) to indicate the additional required volume mount ([`ContainerConfig.Mount`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L1102))
-from the runtime. The runtime needs to ensure that mount and also manages its
-lifecycle, for example to remove the bind mount on container removal.
+The kubelet will use the returned `PullImageResponse.image_ref` on pull and sets
+it to `Mount.image.image` together with the other fields for `Mount.image`. The
+runtime will then mount the OCI object directly on container creation assuming
+it's already present on disk. The runtime also manages the lifecycle of the
+mount, for example to remove the OCI bind mount on container removal as well as
+the object mount on the `RemoveImage` RPC.
 
 The kubelet tracks the information about which OCI object is used by which
-sandbox and therefore manages the lifecycle of them.
-
-The proposal also considers smaller CRI changes, for example to add a list of
-mounted volume paths to the `ImageStatusResponse.Image` message returned by the
-`ImageStatus` RPC. This allows providing the right amount of information between
-the kubelet and the runtime to ensure that no context gets lost in restart
-scenarios.
+sandbox and therefore manages the lifecycle of them for garbage collection
+purposes.
 
 The overall flow for container creation will look like this:
 
@@ -554,32 +548,30 @@ sequenceDiagram
 Note left of K: During pod sync
 Note over K,C: CRI
 K->>+C: RPC: PullImage
-Note right of C: Pull and mount<br/>OCI object
-C-->>-K: PullImageResponse.Mountpoint
+Note right of C: Pull OCI object
+C-->>-K: PullImageResponse.image_ref
 Note left of K: Add mount points<br/> to container<br/>creation request
 K->>+C: RPC: CreateContainer
-Note right of C: Add bind mounts<br/>from object mount<br/>point to container
+Note right of C: Mount OCI object
+Note right of C: Add OCI bind mounts<br/>from OCI object<br/>to container
 C-->>-K: CreateContainerResponse
 ```
 
 1. **Kubelet Initiates Image Pull**:
 - During pod setup, the kubelet initiates the pull for the OCI object based on the volume source.
-- The kubelet passes the necessary indicator to mount the object to the container runtime.
 
 2. **Runtime Handles Mounting**:
-- The container runtime mounts the OCI object as a filesystem using the metadata provided by the kubelet.
-- The runtime returns the mount point information to the kubelet.
+- The runtime returns the image reference information to the kubelet.
 
 3. **Redirecting of the Mountpoint**:
-- The kubelet uses the returned mount point to build the container creation request for each container using that mount.
-- The kubelet initiates the container creation and the runtime creates the required bind mounts to the target location.
+- The kubelet uses the returned image reference to build the container creation request for each container using that mount.
+- The kubelet initiates the container creation and the runtime creates the required OCI object mount as well as bind mounts to the target location.
 This is the current implemented behavior for all other mounts and should require no actual container runtime code change.
 
 4. **Lifecycle Management**:
 - The container runtime manages the lifecycle of the mounts, ensuring they are created during pod setup and cleaned up upon sandbox removal.
 
 5. **Tracking and Coordination**:
-- The kubelet and runtime coordinate to track pods requesting mounts to avoid removing containers with volumes in use.
 - During image garbage collection, the runtime provides the kubelet with the necessary mount information to ensure proper cleanup.
 
 6. **SELinux Context Handling**:
@@ -597,19 +589,17 @@ sequenceDiagram
 
 #### Container Runtimes
 
-Container runtimes need to support the new `mount` field, otherwise the
-feature cannot be used. The kubelet will verify if the returned `mountpoint`
-actually exists on disk to check the feature availability, because Protobuf will
-strip the field in a backwards compatible way for older runtimes. Pods using the
-new `VolumeSource` combined with a not supported container runtime version will
-fail to run on the node.
+Container runtimes need to support the new `Mount.image` field, otherwise the
+feature cannot be used. Pods using the new `VolumeSource` combined with a not
+supported container runtime version will fail to run on the node, because the
+`Mount.host_path` field is not set for those mounts.
 
 For security reasons, volume mounts should set the [`noexec`] and `ro`
 (read-only) options by default.
 
 ##### Filesystem representation
 
-Container Runtimes are expected to return a `mountpoint`, which is a single
+Container Runtimes are expected to manage a `mountpoint`, which is a single
 directory containing the unpacked (in case of tarballs) and merged layer files
 from the image or artifact. If an OCI artifact has multiple layers (in the same
 way as for container images), then the runtime is expected to merge them
@@ -716,41 +706,6 @@ oras manifest fetch localhost:5000/image:v1 | jq .
 }
 ```
 
-The container runtime can now pull the artifact with the `mount = true` CRI
-field set, for example using an experimental [`crictl pull --mount` flag](https://github.com/kubernetes-sigs/cri-tools/compare/master...saschagrunert:oci-volumesource-poc):
-
-```bash
-sudo crictl pull --mount localhost:5000/image:v1
-```
-
-```console
-Image is up to date for localhost:5000/image@sha256:7728cb2fa5dc31ad8a1d05d4e4259d37c3fc72e1fbdc0e1555901687e34324e9
-Image mounted to: /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged
-```
-
-And the returned `mountpoint` contains the unpacked layers as directory tree:
-
-```bash
-sudo tree /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged
-```
-
-```console
-/var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged
-├── dir
-│   └── file
-└── file
-
-2 directories, 2 files
-```
-
-```console
-$ sudo cat /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged/dir/file
-layer0
-
-$ sudo cat /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged/file
-layer1
-```
-
 ORAS (and other tools) are also able to push multiple files or directories
 within a single layer. This should be supported by container runtimes in the
 same way.
@@ -759,17 +714,7 @@ same way.
 
 Traditionally, the container runtime is responsible of applying SELinux labels
 to volume mounts, which are inherited from the `securityContext` of the pod or
-container. Relabeling volume mounts can be time-consuming, especially when there
-are many files on the volume.
-
-If the following criteria are met, then the kubelet will use the `mount_label`
-field in the CRI to apply the right SELinux label to the mount.
-
-- The operating system must support SELinux
-- The Pod must have at least `seLinuxOptions.level` assigned in the
-`PodSecurityContext` or all volume using containers must have it set in their
-`SecurityContexts`. Kubernetes will read the default user, role and type from
-the operating system defaults (typically `system_u`, `system_r` and `container_t`).
+container on container creation. The same will apply to OCI volume mounts.
 
 ### Test Plan
 