
Upgrade Kserve to v0.11.1 for 1.8 release #193

Closed
misohu opened this issue Nov 20, 2023 · 9 comments
Labels: enhancement (New feature or request), Kubeflow 1.8 (This issue affects the Charmed Kubeflow 1.8 release)

Comments

misohu (Member) commented Nov 20, 2023

What needs to get done

To correctly update KServe to v0.11.1 for the 1.8 release we need to:

  • Update the images based on the information from this PR.
  • Generate the upstream kustomize manifests (instructions) and compare them with ours (see the sketch below).

At the end, make sure the tests are passing.
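
A minimal sketch of how generating and diffing the upstream manifests could look; the kustomize path and the charm-side manifest path below are assumptions, so follow the linked upstream instructions for the authoritative steps:

    # Check out the upstream KServe sources at the target tag
    git clone --branch v0.11.1 https://github.com/kserve/kserve.git
    cd kserve
    # Render the default kustomization (path assumed from the upstream repo layout)
    kustomize build config/default > /tmp/kserve-v0.11.1.yaml
    # Compare against the charm's copy of the manifests (hypothetical path)
    diff /tmp/kserve-v0.11.1.yaml ../kserve-operators/src/manifests.yaml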

Why it needs to get done

Required for 1.8

misohu added the enhancement (New feature or request) label Nov 20, 2023
misohu (Member, Author) commented Nov 20, 2023

While checking for changes in the manifests, I noticed that upstream bumped the kserve/pmmlserver version to v0.11.1. For now we will stick with upstream's v0.11.1 tag until we release a corresponding rock version.

misohu added the Kubeflow 1.8 (This issue affects the Charmed Kubeflow 1.8 release) label Nov 20, 2023
misohu (Member, Author) commented Nov 20, 2023

In order to switch the pmmlserver rock to the upstream Docker image, I had to change the ServingRuntime for pmmlserver in serving_runtimes_manifests.yaml.j2

from:

- args:
  - --args
  - pmml-server
  - --model_name={{ '{{.Name}}' }}
  - --model_dir=/mnt/models

to:

- args:
  - --model_name={{ '{{.Name}}' }}
  - --model_dir=/mnt/models
  - --http_port=8080
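
For context, a rough sketch of how that args block sits inside the pmmlserver ClusterServingRuntime entry; the surrounding fields are approximated from the upstream v0.11.1 runtime spec, not copied from the charm template:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ClusterServingRuntime
    metadata:
      name: kserve-pmmlserver
    spec:
      supportedModelFormats:
        - name: pmml
          version: "3"
          autoSelect: true
        - name: pmml
          version: "4"
          autoSelect: true
      protocolVersions:
        - v1
        - v2
      containers:
        - name: kserve-container
          image: kserve/pmmlserver:v0.11.1  # assumption: the upstream tag we pin after this change
          args:
            - --model_name={{ '{{.Name}}' }}
            - --model_dir=/mnt/models
            - --http_port=8080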

misohu (Member, Author) commented Nov 20, 2023

After upgrading the CRDs I tried to deploy the following InferenceService:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "pmml-demo"
spec:
  predictor:
    model:
      modelFormat:
        name: pmml
      storageUri: "gs://kfserving-examples/models/pmml"
      resources:
        limits:
          cpu: 1
          memory: 500Mi
        requests:
          cpu: 100m
          memory: 250Mi
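
The apply step isn't shown above, but based on the filename and namespace in the error output it was presumably something like:

    kubectl apply -f pmml-server.yaml -n test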

This failed with:

Error from server (InternalError): error when retrieving current configuration of:
Resource: "serving.kserve.io/v1beta1, Resource=inferenceservices", GroupVersionKind: "serving.kserve.io/v1beta1, Kind=InferenceService"
Name: "pmml-demo", Namespace: "test"
from server for: "pmml-server.yaml": Internal error occurred: error resolving resource

misohu (Member, Author) commented Nov 20, 2023

I checked the controller logs

kubectl logs -f kserve-controller-0 -c kserve-controller

and found the problem:

2023-11-20T14:05:42.611Z [kserve-controller] E1120 14:05:42.611790    1403 reflector.go:140] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.InferenceService: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:05:44.913Z [kserve-controller] W1120 14:05:44.913191    1403 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:05:44.913Z [kserve-controller] E1120 14:05:44.913214    1403 reflector.go:140] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.InferenceService: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:05:49.211Z [kserve-controller] W1120 14:05:49.211779    1403 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:05:49.211Z [kserve-controller] E1120 14:05:49.211802    1403 reflector.go:140] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.InferenceService: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:05:58.845Z [kserve-controller] W1120 14:05:58.845113    1403 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:05:58.845Z [kserve-controller] E1120 14:05:58.845129    1403 reflector.go:140] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.InferenceService: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:06:22.649Z [kserve-controller] W1120 14:06:22.649276    1403 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:06:22.649Z [kserve-controller] E1120 14:06:22.649292    1403 reflector.go:140] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.InferenceService: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:07:00.573Z [kserve-controller] W1120 14:07:00.573314    1403 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:07:00.573Z [kserve-controller] E1120 14:07:00.573329    1403 reflector.go:140] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.InferenceService: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:07:30.714Z [kserve-controller] W1120 14:07:30.714029    1403 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:07:30.714Z [kserve-controller] E1120 14:07:30.714051    1403 reflector.go:140] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.InferenceService: failed to list *v1beta1.InferenceService: Internal error occurred: error resolving resource
2023-11-20T14:07:41.032Z [kserve-controller] {"level":"error","ts":"2023-11-20T14:07:41Z","msg":"Could not wait for Cache to sync","controller":"inferenceservice","controllerGroup":"serving.kserve.io","controllerKind":"InferenceService","error":"failed to wait for inferenceservice caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:242\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219"}

My suspicion was that a ClusterRole/ClusterRoleBinding allowing the service account to list InferenceServices was missing. I checked both, but they appear to be in place.
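
One quick way to double-check that kind of RBAC suspicion (the service account name and namespace below are assumptions; adjust them to whatever the charm actually deploys):

    kubectl auth can-i list inferenceservices.serving.kserve.io \
      --as=system:serviceaccount:kubeflow:kserve-controller-manager --all-namespaces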

ca-scribner (Contributor) commented:

I think there's something wrong with the CRD or something to do with administering it. "Internal error occurred: error resolving resource" is coming from the k8s API. But maybe it has to do with a webhook that manipulates the CR before applying?

I see similar-ish conversations like this one, but I don't get a great idea from them on what to look at. In their case, there was a conversion webhook specified even though there was nothing to convert - maybe we have something similar?
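
A generic way to check whether the CRD declares a conversion webhook (standard kubectl inspection, not something from this thread):

    kubectl get crd inferenceservices.serving.kserve.io -o jsonpath='{.spec.conversion}'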

ca-scribner (Contributor) commented:

There are some 409 (conflict) errors in the CI runs too - I wonder if that is a cause or a symptom?

ca-scribner (Contributor) commented:

Oh nice, I might have figured it out. When updating the CRDs, a caBundle: {{ cert }} line in the inferenceservice CRD got changed to caBundle: Cg==. I've fixed that here and it seemed to be working locally - hopefully the CI sorts itself out now!
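
For reference, a rough sketch of where that value lives in the InferenceService CRD's conversion settings; the service name and namespace are assumptions based on the stock KServe manifests, and Cg== is just a base64-encoded newline, so the API server effectively saw an empty CA bundle:

    spec:
      conversion:
        strategy: Webhook
        webhook:
          conversionReviewVersions:
            - v1beta1
          clientConfig:
            caBundle: {{ cert }}  # templated by the charm; this is what got replaced with Cg==
            service:
              name: kserve-webhook-server-service  # assumption: stock KServe webhook service name
              namespace: kserve                    # assumption: adjust to the deployed namespace
              path: /convert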

kimwnasptd (Contributor) commented:

@misohu can we close this issue?

misohu (Member, Author) commented Nov 22, 2023

Closed with #191
