
Commit 79654fc

Add documentation for transformer collocation with runtime
Enhance transformer documentation

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
1 parent d818489 commit 79654fc

File tree

  • docs/modelserving/v1beta1/transformer

2 files changed: +263 -10 lines changed


docs/modelserving/v1beta1/transformer/collocation/README.md

+256 -10
@@ -13,13 +13,16 @@ KServe by default deploys the Transformer and Predictor as separate services, al
2. Your cluster's Istio Ingress gateway must be [network accessible](https://istio.io/latest/docs/tasks/traffic-management/ingress/ingress-control/).
3. You can find the [code samples](https://github.com/kserve/kserve/tree/master/docs/samples/v1beta1/transformer/collocation) in the kserve repository.

## Collocation with custom container
### Deploy the InferenceService

Since the predictor and the transformer are in the same pod, they need to listen on different ports to avoid conflicts. The `Transformer` is configured to listen on port 8080 (REST) and 8081 (gRPC),
while the `Predictor` listens on port 8085 (REST). The `Transformer` calls the `Predictor` on port 8085 via a local socket.
Deploy the `InferenceService` using the command below.

Note that a readiness probe is specified in the transformer container; this is due to a limitation of Knative. You can provide the `--enable_predictor_health_check` argument to let the transformer container check the predictor's health as well, which ensures that both containers are healthy before the isvc is marked as ready.

```yaml
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -52,13 +55,20 @@ spec:
        image: kserve/image-transformer:latest
        args:
          - --model_name=mnist
          - --predictor_protocol=v1
          - --http_port=8080
          - --grpc_port=8081
          - --predictor_host=localhost:8085 # predictor listening port
          - --enable_predictor_health_check
        ports:
          - containerPort: 8080
            protocol: TCP
        readinessProbe:
          httpGet:
            path: /v1/models/mnist
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
@@ -82,15 +92,15 @@ EOF
predictor. The storage uri should only be present in this container. If it is specified in the transformer
container, the isvc creation will fail.

!!! Note
    In Serverless mode, specifying ports for the predictor will result in isvc creation failure, as specifying multiple ports
    is not supported by Knative. Due to this limitation, the predictor cannot be exposed outside the cluster.
    For more info, see the [knative discussion on multiple ports](https://github.com/knative/serving/issues/8471).

!!! Tip
    Check the [Transformer documentation](../torchserve_image_transformer/#transformer-specific-commandline-arguments) for the list of arguments that can be passed to the transformer container.

### Check InferenceService status
```bash
kubectl get isvc custom-transformer-collocation
```
@@ -101,14 +111,13 @@ kubectl get isvc custom-transformer-collocation
    ```

!!! Note
    If your DNS contains `svc.cluster.local`, then the `InferenceService` is not exposed through Ingress. You need to [configure DNS](https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/#configure-dns)
    or [use a custom domain](https://knative.dev/docs/serving/using-a-custom-domain/) in order to expose the `isvc`.

## Run a prediction
Prepare the [inputs](https://github.com/kserve/kserve/blob/master/docs/samples/v1beta1/transformer/collocation/input.json) for the inference request. Copy the following JSON into a file named `input.json`.

Now, [determine the ingress IP and ports](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
SERVICE_NAME=custom-transformer-collocation
@@ -143,3 +152,240 @@ curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" -d $I
    * Connection #0 to host localhost left intact
    {"predictions":[2]}
    ```
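
Since both containers run in the same pod, you can verify the collocation by listing the containers of the pod backing the `InferenceService`. A minimal sketch, assuming the standard `serving.kserve.io/inferenceservice` label that KServe adds to predictor pods (in Serverless mode, Knative's `queue-proxy` sidecar will also appear in the list):

```bash
# List the container names of the first pod backing the InferenceService;
# both kserve-container and transformer-container should appear.
kubectl get pod -l serving.kserve.io/inferenceservice=custom-transformer-collocation \
  -o jsonpath='{range .items[0].spec.containers[*]}{.name}{"\n"}{end}'
```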

## Collocation with Runtime
### Deploy the InferenceService

Since the predictor and the transformer are in the same pod, they need to listen on different ports to avoid conflicts. The `Transformer` is configured to listen on port 8080 (REST) and 8081 (gRPC),
while the `Predictor` listens on port 8085 (REST). The `Transformer` calls the `Predictor` on port 8085 via a local socket.
Deploy the `InferenceService` using the command below.

Note that a readiness probe is specified in the transformer container; this is due to a limitation of Knative. You can provide the `--enable_predictor_health_check` argument to let the transformer container check the predictor's health as well, which ensures that both containers are healthy before the isvc is marked as ready.

```yaml
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: transformer-collocation
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 1
          memory: 1Gi
    containers:
      - name: transformer-container # Do not change the container name
        image: kserve/image-transformer:latest
        args:
          - --model_name=mnist
          - --predictor_protocol=v1
          - --http_port=8080
          - --grpc_port=8081
          - --predictor_host=localhost:8085 # predictor listening port
          - --enable_predictor_health_check # transformer checks for predictor health before marking itself as ready
        ports:
          - containerPort: 8080
            protocol: TCP
        readinessProbe:
          httpGet:
            path: /v1/models/mnist
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 1
            memory: 1Gi
EOF
```

!!! success "Expected output"
    ```{ .bash .no-copy }
    $ inferenceservice.serving.kserve.io/transformer-collocation created
    ```

### Check InferenceService status
```bash
kubectl get isvc transformer-collocation
```
!!! success "Expected output"
    ```{ .bash .no-copy }
    NAME                      URL                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                        AGE
    transformer-collocation   http://transformer-collocation.default.example.com   True           100                              transformer-collocation-predictor-00001   133m
    ```

!!! Note
    If your DNS contains `svc.cluster.local`, then the `InferenceService` is not exposed through Ingress. You need to [configure DNS](https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/#configure-dns)
    or [use a custom domain](https://knative.dev/docs/serving/using-a-custom-domain/) in order to expose the `isvc`.
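
If the `InferenceService` does not become ready, you can inspect each collocated container separately. A sketch, assuming the standard `serving.kserve.io/inferenceservice` pod label and the conventional `kserve-container` name for the model server container:

```bash
# Transformer logs (the container name is fixed to transformer-container)
kubectl logs -l serving.kserve.io/inferenceservice=transformer-collocation -c transformer-container
# Predictor logs (the model server container is typically named kserve-container)
kubectl logs -l serving.kserve.io/inferenceservice=transformer-collocation -c kserve-container
```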

### Run a prediction
Prepare the [inputs](https://github.com/kserve/kserve/blob/master/docs/samples/v1beta1/transformer/collocation/input.json) for the inference request. Copy the following JSON into a file named `input.json`.

Now, [determine the ingress IP and ports](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
SERVICE_NAME=transformer-collocation
MODEL_NAME=mnist
INPUT_PATH=@./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $SERVICE_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```
You can use `curl` to send the inference request as:
```bash
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
```

!!! success "Expected output"
    ```{ .bash .no-copy }
    * Trying 127.0.0.1:8080...
    * Connected to localhost (127.0.0.1) port 8080 (#0)
    > POST /v1/models/mnist:predict HTTP/1.1
    > Host: transformer-collocation.default.example.com
    > User-Agent: curl/7.85.0
    > Accept: */*
    > Content-Type: application/json
    > Content-Length: 427
    >
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < content-length: 19
    < content-type: application/json
    < date: Sat, 02 Dec 2023 09:13:16 GMT
    < server: istio-envoy
    < x-envoy-upstream-service-time: 315
    <
    * Connection #0 to host localhost left intact
    {"predictions":[2]}
    ```

## Defining Collocation in ServingRuntime

You can also define the collocation in the `ServingRuntime` and use it in the `InferenceService`. This is useful when you want to reuse the same transformer for multiple models; an example of reusing the runtime for a second model is sketched at the end of this section.

### Create ServingRuntime

```yaml
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: pytorch-collocation
spec:
  annotations:
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: "/metrics"
  supportedModelFormats:
    - name: pytorch
      version: "1"
      autoSelect: true
      priority: 1
  protocolVersions:
    - v1
  containers:
    - name: kserve-container
      image: pytorch/torchserve:0.9.0-cpu
      args:
        - torchserve
        - --start
        - --model-store=/mnt/models/model-store
        - --ts-config=/mnt/models/config/config.properties
      env:
        - name: "TS_SERVICE_ENVELOPE"
          value: "{% raw %}{{.Labels.serviceEnvelope}}{% endraw %}"
      securityContext:
        runAsUser: 1000 # User ID is not defined in the Dockerfile, so we need to set it here to run as non-root
        allowPrivilegeEscalation: false
        privileged: false
        runAsNonRoot: true
        capabilities:
          drop:
            - ALL
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi

    - name: transformer-container # Do not change the container name
      image: kserve/image-transformer:latest
      args:
        - --model_name={% raw %}{{.Labels.modelName}}{% endraw %}
        - --predictor_protocol=v1
        - --http_port=8080
        - --grpc_port=8081
        - --predictor_host=localhost:8085 # predictor listening port
        - --enable_predictor_health_check # transformer checks for predictor health before marking itself as ready
      ports:
        - containerPort: 8080
          protocol: TCP
      readinessProbe:
        httpGet:
          path: /v1/models/{% raw %}{{.Labels.modelName}}{% endraw %}
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 1
          memory: 1Gi
EOF
```

!!! note
    Do not specify ports for the predictor in the serving runtime for Serverless deployment; this is not supported by Knative.
    For more info, see the [knative discussion on multiple ports](https://github.com/knative/serving/issues/8471).

!!! success "Expected output"
    ```{ .bash .no-copy }
    $ servingruntime.serving.kserve.io/pytorch-collocation created
    ```
### Create InferenceService

```yaml
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: transformer-collocation-runtime
  labels:
    modelName: mnist
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
      runtime: pytorch-collocation
    containers:
      - name: transformer-container # Do not change the container name
        image: kserve/image-transformer:latest
        resources: # You can override the serving runtime values
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 1
            memory: 1Gi
EOF
```
387+
388+
!!! success "Expected output"
389+
```{ .bash .no-copy }
390+
$ inferenceservice.serving.kserve.io/transformer-collocation-runtime created
391+
```
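
Because the transformer arguments in the runtime are templated on the InferenceService labels, the same `pytorch-collocation` runtime can be reused for other models. A minimal sketch; the name, label value, and storage URI are placeholders for another TorchServe model archive that you would supply yourself:

```yaml
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: another-model-collocation   # hypothetical second InferenceService reusing the runtime
  labels:
    modelName: another-model        # substituted for the modelName label in the runtime's args template
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://<your-bucket>/<your-torchserve-model>  # placeholder, replace with your model archive
      runtime: pytorch-collocation
EOF
```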

docs/modelserving/v1beta1/transformer/torchserve_image_transformer/README.md

+7
@@ -363,3 +363,10 @@ time serializing and deserializing `3*32*32` shape tensor and with gRPC it is tr
# from gRPC v2 predictor log
2023-01-09 07:27:52.171 79711 root INFO [__call__():128] requestId: , preprocess_ms: 0.067949295, explain_ms: 0, predict_ms: 51.237106323, postprocess_ms: 0.049114227
```

## Transformer Specific Commandline Arguments
- `--predictor_protocol`: The protocol used to communicate with the predictor. The available values are "v1", "v2", and "grpc-v2". The default value is "v1".
- `--predictor_use_ssl`: Whether to use SSL when communicating with the predictor. The default value is "false".
- `--predictor_request_timeout_seconds`: The timeout in seconds for requests sent to the predictor. The default value is 600 seconds.
- `--predictor_request_retries`: The number of retries for requests sent to the predictor. The default value is 0.
- `--enable_predictor_health_check`: The transformer will perform a readiness check for the predictor in addition to its own health check. It is disabled by default.
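
For example, these flags are passed as container arguments to the transformer, as in the collocation examples above; a minimal sketch (the flag values here are illustrative, not defaults):

```yaml
containers:
  - name: transformer-container
    image: kserve/image-transformer:latest
    args:
      - --model_name=mnist
      - --predictor_protocol=v2                  # talk to the predictor over the v2 (Open Inference) REST protocol
      - --predictor_request_timeout_seconds=300  # fail predictor requests that take longer than 5 minutes
      - --predictor_request_retries=2            # retry failed predictor requests twice
      - --enable_predictor_health_check          # mark the transformer ready only when the predictor is healthy
```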
