KF 1.7 testing on dev #1752

Closed
9 tasks done
Jose-Matsuda opened this issue Jun 28, 2023 · 7 comments

@Jose-Matsuda
Contributor

Jose-Matsuda commented Jun 28, 2023

Task list from #1729

  • Pipelines: Check that the Cluster Roles are sufficient for Argo Workflow (Archive, Delete, Run, Experiments) (is this still necessary?). Using this resource I was able to submit a workflow, but it errored out; I'm not sure I had the correct SA.
    • Passing --serviceaccount argo-workflows-default-sa along with the submission let me submit a job without it erroring out.
  • Profiles: Check the new Access Management KFAM works
  • Check notebook submission (with and without volumes)
  • Check notebook start and pause
  • Verify persistence
  • Add a Contributor
  • Verify roles / contributors can still access as normal
  • Delete a contributor
  • Verify Minio still works
    • I can verify that there is no discernible difference between using minio in prod and in dev.
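
For the Pipelines item above, the submission that worked can be sketched roughly as follows. This is only an illustration: the helper name and the workflow file name are hypothetical, while the --serviceaccount flag and the argo-workflows-default-sa name come from the task note.

```python
import shlex


def argo_submit_argv(workflow_file, service_account):
    """Build the argv for `argo submit` with an explicit service account."""
    return ["argo", "submit", "--serviceaccount", service_account, workflow_file]


# "workflow.yaml" is a placeholder file name, not one taken from this issue.
argv = argo_submit_argv("workflow.yaml", "argo-workflows-default-sa")
print(shlex.join(argv))
```

The printed line is what would be typed in a shell; building it as a list first keeps the arguments safe to hand to subprocess.run if the check is ever scripted.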
@Jose-Matsuda
Contributor Author

Jose-Matsuda commented Jul 4, 2023

List of outstanding issues (not extraordinary, just issues that are still open)

kubeflow-aaw-notebook-controller (fixed in PR)

It seems Kubeflow now expects these environment variables to come from a ConfigMap rather than being set directly in our deployment, so we need to change the values as shown in params.env. Note that an env entry cannot have both value and valueFrom.
image
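
As a rough way to spot the value/valueFrom conflict before syncing, a container spec can be scanned for env entries that set both fields. This is a minimal sketch: the function name, container name, and variable names below are made up for illustration and are not the actual notebook-controller settings.

```python
def conflicting_env_vars(container):
    """Return names of env entries that set both `value` and `valueFrom`."""
    return [
        entry.get("name", "<unnamed>")
        for entry in container.get("env", [])
        if "value" in entry and "valueFrom" in entry
    ]


# Hypothetical container spec, shaped like a Deployment's container entry.
container = {
    "name": "manager",
    "env": [
        {"name": "EXAMPLE_FLAG", "value": "true"},
        {
            "name": "EXAMPLE_FROM_CONFIG",
            "value": "true",
            "valueFrom": {
                "configMapKeyRef": {"name": "example-config", "key": "EXAMPLE_FROM_CONFIG"}
            },
        },
    ],
}
print(conflicting_env_vars(container))
```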

kubeflow-aaw-profiles

Somebody may have been configuring the Profile CRD directly with kubectl, as we got an error on dev. To fix it, we had to remove the kubectl.kubernetes.io/last-applied-configuration annotation, which was preventing Argo CD from doing its work.
This will need to be done for prod as well (that is, removing the last-applied-configuration annotation so that it syncs properly; this is best done by editing the live manifest in Argo CD).
image
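
If editing the live manifest is not convenient, the same removal could in principle be expressed as a JSON Patch. The sketch below only builds the patch document; the helper names are hypothetical, and the one non-obvious detail is that the slash inside the annotation key has to be escaped as ~1 in the JSON Pointer path (RFC 6901).

```python
import json

ANNOTATION = "kubectl.kubernetes.io/last-applied-configuration"


def json_pointer_escape(token):
    """Escape a key for a JSON Pointer (RFC 6901): '~' becomes '~0', '/' becomes '~1'."""
    return token.replace("~", "~0").replace("/", "~1")


def remove_annotation_patch(annotation):
    """Build a JSON Patch (RFC 6902) that removes one metadata annotation."""
    path = "/metadata/annotations/" + json_pointer_escape(annotation)
    return [{"op": "remove", "path": path}]


print(json.dumps(remove_annotation_patch(ANNOTATION)))
```

A patch like this is the kind of document kubectl patch accepts with --type=json; whether to apply it that way or through the Argo CD UI is a judgment call, and the thread above used the UI.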

RBAC access denied.

This appears when navigating to the Notebooks tab, and it seems to get caught on the en call as well as the notebooks call.
image
Attempting this with the upstream 1.7 kfam image, or even reverting both container images to 1.6, changes nothing.
I also updated the jupyter-web-app deployment image (in Argo CD; it hasn't been persisted to the repo) to 0c452f74047420567c806628cdce07949089253d, as it is currently on an older version from before Mathis's changes, to no avail.

Resolution in this comment

kserve stuck in 'Progressing'

Getting ImagePullBackOff, probably because Gatekeeper is blocking image: 'kserve/kserve-controller:v0.10.0'. Having said that, the error does say "Back-off restarting failed container".
The manager container logs: {"level":"error","ts":1688561752.6363223,"logger":"entrypoint","msg":"unable to get ingress config.","error":"unable to parse ingress config json: invalid character '}' looking for beginning of object key string","stacktrace":"runtime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

kubeflow-aaw-spark

Seems like the CRDs are in a funk. On dev, I also deleted the scheduledsparkapplications.sparkoperator.k8s.io CRD to see if Argo CD would recreate it; it did not.

Testing Minio (courtesy of Salwa from a little more than a year ago)

Once the notebook is running:
  • Try connecting to a minio instance using mc alias (gateway-standard is done automatically by mc-wrapper.sh)
  • List whatever is in a bucket: mc ls
  • Try uploading to a bucket and downloading from a bucket: mc cp
  • Try moving files to a bucket and back to the filesystem: mc mv
  • Try some of the commands in this guide: https://docs.min.io/docs/minio-client-complete-guide.html (don't worry about mb, rm, admin, and update)
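
The checks above can be written down as a fixed list of mc commands so the same sequence is run each time. This is a sketch under assumptions: the bucket and file names are placeholders, the helper is hypothetical, and gateway-standard is assumed to be the alias that mc-wrapper.sh sets up.

```python
import shlex


def minio_smoke_test_commands(alias, bucket, local_file):
    """Return mc command lines covering ls, cp (up and down) and mv (there and back)."""
    remote = f"{alias}/{bucket}"
    return [
        ["mc", "ls", remote],
        ["mc", "cp", local_file, f"{remote}/{local_file}"],
        ["mc", "cp", f"{remote}/{local_file}", f"downloaded-{local_file}"],
        ["mc", "mv", local_file, f"{remote}/moved-{local_file}"],
        ["mc", "mv", f"{remote}/moved-{local_file}", local_file],
    ]


# "example-bucket" and "test.txt" are placeholders, not names from this issue.
for argv in minio_smoke_test_commands("gateway-standard", "example-bucket", "test.txt"):
    print(shlex.join(argv))
```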

@wg102
Contributor

wg102 commented Jul 5, 2023

KServe stuck in 'Progressing'
We are currently using kserve/kserve-controller:v0.8.0 in prod, so I don't think Gatekeeper is the reason it doesn't work.

Looking at the pod's logs we see

{"level":"error","ts":1688561752.6363223,"logger":"entrypoint","msg":"unable to get ingress config.","error":"unable to parse ingress config json: invalid character '}' looking for beginning of object key string","stacktrace":"runtime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

According to an article on Stack Overflow, this most likely indicates a malformed JSON file. Now we need to figure out how to retrieve said file.
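
Once the JSON text has been retrieved, a quick validity check can narrow down where it breaks. The sketch below is illustrative only: the helper name and the sample content are hypothetical. A trailing comma before the closing brace is one common way to get Go's "invalid character '}' looking for beginning of object key string"; Python's parser rejects the same input, though with different wording.

```python
import json


def check_json(text):
    """Return None if `text` is valid JSON, else a short description of the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Hypothetical config content; the key and value are placeholders.
good = '{"exampleKey": "example-value"}'
bad = '{"exampleKey": "example-value",}'  # trailing comma before the closing brace

print(check_json(good))
print(check_json(bad))
```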

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented Jul 6, 2023

RBAC Update

The AuthorizationPolicy for jupyter-web-app in the kubeflow namespace was causing it. After deleting it, we are able to see the Notebooks page again.

We now need to determine why it's here and what it's trying to do.
It was added in this PR, which creates a DR and AuthorizationPolicy for the various apps (they are duplicated, so if you get one pair you get them all).

So they block off everything unless it comes from the Istio ingress gateway, and they also enabled the Istio sidecar (on prod the Istio containers are not there, but they are present in dev).
The policies list some peer identities as the "Source", and going by that naming convention I can at least say that we don't currently have an SA named istio-ingressgateway-service-account in dev.

We will point to the same service account that was defined for the centraldashboard AuthorizationPolicy, kubeflow-service-account. This is the PR for the change.
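
To make the shape of the change concrete, here is a sketch of an AuthorizationPolicy that allows traffic only from a single peer identity. It is not the manifest from the PR: the helper names, the app label, and the service account's namespace are assumptions, and cluster.local is simply Istio's default trust domain.

```python
def istio_principal(namespace, service_account, trust_domain="cluster.local"):
    """Format a peer identity the way Istio expects it in source.principals."""
    return f"{trust_domain}/ns/{namespace}/sa/{service_account}"


def allow_from_principal(policy_name, policy_namespace, app_label, principal):
    """Build an AuthorizationPolicy manifest that allows traffic from one principal only."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": policy_name, "namespace": policy_namespace},
        "spec": {
            "action": "ALLOW",
            "selector": {"matchLabels": {"app": app_label}},
            "rules": [{"from": [{"source": {"principals": [principal]}}]}],
        },
    }


# "example-ingress-namespace" is a placeholder; the real namespace is not stated in this thread.
principal = istio_principal("example-ingress-namespace", "kubeflow-service-account")
policy = allow_from_principal("jupyter-web-app", "kubeflow", "jupyter-web-app", principal)
print(principal)
```

If the principal names a service account that does not exist in the cluster, as with istio-ingressgateway-service-account on dev, no request can match the rule and everything is denied, which is consistent with the RBAC error seen here.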

@wg102
Contributor

wg102 commented Jul 6, 2023

  • Weird UI for image selection
    This is not technically an issue, but it feels wrong to have two accordions nested one inside the other
    image

@wg102
Contributor

wg102 commented Jul 10, 2023

Other UI that seems weird.
When attaching an existing volume, there's a mandatory open section
image
Since this is shown, you need to click on it and then select which volume you want, which doesn't seem very friendly.
image

@wg102
Contributor

wg102 commented Jul 10, 2023

Issue with the times shown for notebooks
Within the span of a few minutes, this is what was shown for a single notebook:
image
image
image

@Jose-Matsuda
Contributor Author

Just going to close this as we have completed the testing points laid out and Wendy has created a follow-up issue for any remaining issues.
