Skip to content

Add VolcanoWorker to support Volcano Job submission via Kubernetes work pool and test #17909

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

zyn71
Copy link

@zyn71 zyn71 commented Apr 25, 2025

Apologies for the noise on the earlier PR #17880.
I am sry While iterating on the Volcano worker I accidentally pushed several unrelated commits that polluted the diff.
I’ve started fresh from the latest upstream/main, cherry-picked only the relevant changes, and added a focused unit-test suite.

The previous PR has been closed in favor of this clean one.
Changes included here (3 files):

volcanoworker.py – new VolcanoWorker implementation

init.py – registers the new worker type

tests/test_volcano_worker.py – unit & smoke tests covering job creation, pod discovery, and state polling

Please let me know if you’d like additional work

@zyn71 zyn71 force-pushed the feat/volcano_clean branch from fb34ccc to 5c10a28 Compare April 28, 2025 08:19
@github-actions github-actions bot added the ui-replatform Related to the React UI rewrite label Apr 28, 2025
Copy link

codspeed-hq bot commented Apr 28, 2025

CodSpeed Performance Report

Merging #17909 will not alter performance

Comparing zyn71:feat/volcano_clean (5c10a28) with main (df10951)

Summary

✅ 2 untouched benchmarks

@zyn71 zyn71 force-pushed the feat/volcano_clean branch from 5c10a28 to 0066f99 Compare April 28, 2025 08:29
@zyn71
Copy link
Author

zyn71 commented Apr 28, 2025

hi,@reneleonhardt @zzstoatzz (we have argued before)
Sorry to nudge, but I wanted to gently follow-up on this PR. It’s been open for three days, the checks (CI, pre-commit, dev env) have all passed, and the diff is limited to three files.And it has already been applied into out porduction.
If there’s anything I can clarify or adjust, just let me know—I’m happy to turn it around quickly.Thanks a lot for your time and for keeping the review queue moving!

@reneleonhardt
Copy link

I'm no maintainer sorry. I suggest using a free AI to save contributors and reviewers time:
https://www.coderabbit.ai/blog/how-linux-foundation-used-ai-code-reviews-to-reduce-manual-bottlenecks-in-oss

@desertaxle
Copy link
Member

Hey @zyn71, neither @zzstoatzz nor I are very familiar with Volcano, and we'll want to test this functionality before merging. Could you provide instructions on how we can test these changes locally?

@zyn71
Copy link
Author

zyn71 commented Apr 30, 2025

Hi @desertaxle — thank you for the quick feedback! Below is a full, single-file guide my team follows to test the new VolcanoWorker locally. Everything has been validated on real clusters;


1 Get server and workpool

Create a conventional prefect server and prefect-kubernetes workpool.So far nothing is Volcano-specific.


2 Convert K8s work pool to “Volcano mode” manully

Open K8s Work Pool → Advanced → Job Manifest and paste the manifest below.
It’s the normal template with apiVersion: batch.volcano.sh/v1alpha1 and Volcano-only fields added:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  labels: "{{ labels }}"
  namespace: "{{ namespace }}"
  generateName: "{{ name }}-"
spec:
  schedulerName: volcano
  queue: your-volcano-queue
  maxRetry: 3
  minAvailable: 1
  tasks:
    - name: pod1
      replicas: 1
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        metadata:
          labels:
            job-name: "{{ name }}"
            volcano.sh/job-name: "{{ name }}"
        spec:
          restartPolicy: Never
          containers:
            - name: c1
              image: "{{ image }}"
              command: "{{ command }}"
              env: "{{ env }}"
              imagePullPolicy: "{{ image_pull_policy }}"

3 Start a VolcanoWorker

Before merging into main branch, we must build a new image including above there changes files.(BASE is the newest version prefect)

docker build -t your_image .

We can uses helm chart to create a volcanoworker

helm pull prefect/prefect-worker          # fetch chart
tar -xzf prefect-worker-*.tgz && cd prefect-worker

# edit values.yaml (connect to prefect server by your owns)➜
config:
    type: volcano
    workPool:"your work pool"
image:
    repository: your image
    prefectTag: your tag

helm install prefect-worker . -f values.yaml -n your-namespace

No extra image build is needed after merging


4 Deploy & run a flow

Any deployment assigned to my-k8s-pool now spawns a real Volcano Job.
Logs, exit-codes, and state reporting behave exactly like the stock K8s worker.nothing is Volcano-specific


Extra notes

  • Zero intrusion: VolcanoWorker subclasses KubernetesWorker; normal paths are untouched unless apiVersion is Volcano.
  • Tests: added tests/test_volcanoworker.py covering job creation, pod discovery, and state-watching.
  • Happy to iterate further — just let me know what else you need!

Thanks again for your time!

@desertaxle desertaxle removed the ui-replatform Related to the React UI rewrite label May 1, 2025
@desertaxle
Copy link
Member

Thanks for that guide @zyn71! Do I need to install Volcano in my cluster before using the Volcano worker? Also, I think it'd be best if this worker didn't require manual changes to the base job template in the default case. Would you be able to add _get_default_job_manifest_template similar to the existing KubernetesWorker?

@zyn71
Copy link
Author

zyn71 commented May 2, 2025

hi again @desertaxle !Yes, Volcano must be installed in the cluster before the VolcanoWorker can be used.
Because this PR has not been merged yet, you also need to build a custom worker image (step 3 in my guide): simply copy volcanoworker.py and the updated __init__.py into the current Prefect package, build the image, and then start the worker with --type volcano.

Regarding your concern about avoiding manual edits to the base job template: Prefect validates the job manifest in several layers, not only inside the worker. Making the worker auto‑switch templates would therefore require extensive changes—perhaps even a brand‑new integration. Considering that our goal is just to extend prefect‑kubernetes with Volcano support, and since VolcanoWorker already inherits from KubernetesWorker, the current solution remains flexible and easy to use.

If there is anything else you need, please let me know. Thank you!

@desertaxle
Copy link
Member

@zyn71 I disagree that it would take extensive updates to support a different default manifest for the Volcano worker. The primary validation of the manifest happens in KubernetesWorkerJobConfiguration._validate_job_manifest, which you are already overriding in this PR. I think it makes more sense to have that validation be Volcano-specific and have a default manifest that allows the Volcano worker to be used without manual configuration.

I think one thing that would help is that the Volcano worker doesn't need to be compatible with vanilla Kubernetes jobs. I think the overall design of the new worker would be greatly improved if you dropped that compatibility.

@zyn71
Copy link
Author

zyn71 commented May 5, 2025

hi @desertaxle !I totally understand your point—the first thing I tried was exactly what you’re suggesting.
However, the default manifest is injected at the work‑pool level, not at the worker level.
Right now Prefect only ships a single work‑pool type (kubernetes). When you create a
Kubernetes work pool, the UI automatically seeds a Kubernetes Job template; later, the
worker validates whatever manifest lives in that pool.

If we want to eliminate all manual edits, we would first need to introduce a brand‑new
Volcano work‑pool type (or extend the existing pool so it can return a Volcano‑flavored
template). That’s only part of the story, though:

  • During local testing I noticed that validation happens in several places.
    In addition to the worker’s JobConfiguration._validate_job_manifest, the flow‑run
    machinery also validates the manifest that comes from the work pool, and other internal
    modules touch it as well.
  • So even with a Volcano work pool, we’d still have to propagate changes across multiple
    code paths, which feels like a fairly invasive refactor.

Because Volcano can sit on top of standard Kubernetes primitives, my goal was to reuse
the existing kubernetes work‑pool infrastructure and just add a lightweight
VolcanoWorker on top. That avoids a large amount of churn in the core codebase while
still giving users an extra scheduling option.

@desertaxle
Copy link
Member

@zyn71 we have processes in place to add new supported work pools to the UI, so it will be possible to create a Volcano worker from the UI once the Volcano worker is added to prefect-kubernetes. In the meantime, workers can create work pools when they start up, so running prefect worker start --type volcano WORK_POOL_NAME can create a Volcano work pool. That's why I'm insistent that the Volcano worker should have its own default manifest.

I appreciate your desire not to introduce new churn to the codebase, but I don't think you'll introduce much churn since you're adding something new. If the Volcano worker is going to share code with the Kubernetes worker, I'd like them to be as loosely coupled as possible. One way you can do that is by ensuring the Volcano worker is only responsible for running Volcano jobs.

@zyn71
Copy link
Author

zyn71 commented May 7, 2025

Hi @desertaxle! Thanks a lot for the suggestions and guidance. I’ve now added a _get_default_volcano_job_manifest_template() to VolcanoWorker, so we have anative Volcano base manifest.

Following your advice, once the UI can create a Volcano work‑pool (with itsown default job manifest), this worker will be much more self‑contained andloosely coupled with the existing Kubernetes worker.

Could you take a look and let me know if this approach works for you? I’m happy to make further adjustments if needed. Thanks!

@desertaxle
Copy link
Member

Hey @zyn71, we made some pretty significant changes to the Kubernetes worker in #18004 that will require changes to the worker you're working on here.

The big change is splitting event emission and crash detection responsibilities into a separate observer that starts alongside the worker. You might be able to reuse the observer, but I'm not sure if the jobs event handler will work with Volanco jobs. Let me know if you have any questions on the new structure!

@zyn71
Copy link
Author

zyn71 commented May 28, 2025

hi! @desertaxle,Sorry for the delayed response - I've been quite busy lately and this took longer than expected.
First of all, I want to say that the refactoring work you did in #18004 is excellent! It addresses many issues and significantly improves the reliability and architecture of the Kubernetes worker.
I've successfully adapted my VolcanoWorker to work with the new structure. The good news is that I was able to reuse your observer implementation without any modifications - it works perfectly with Volcano jobs! Here's what I did to make it compatible:

  1. Removed old monitoring logic: Eliminated the custom _watch_job(), _get_job(), and event replication code since the observer now handles all of this.
  2. Simplified the run() method: Now it only creates the job and returns immediately, letting the observer handle monitoring.
  3. Fixed Pod label propagation: This was the key part - I updated the _propagate_labels_to_pod() method to correctly handle Volcano job's pod template structure (spec.tasks[0].template.metadata.labels instead of spec.template.metadata.labels), ensuring that prefect.io/flow-run-id labels are properly propagated to pods so the observer can identify them.
  4. Updated job manifest template: Made sure the Volcano job template includes {{ labels }} placeholders in the pod template.
    I've tested the implementation and confirmed that:
  • The observer correctly identifies and monitors Volcano job pods
  • Event replication works as expected
  • Crash detection functions properly
  • Both standard Kubernetes jobs and Volcano jobs work seamlessly
    Thanks again for the great work on the refactoring - it makes the codebase much cleaner and more robust!

@zyn71
Copy link
Author

zyn71 commented Jun 4, 2025

Hi! Our MR hasn’t made any progress in a week. Do you have any suggestions, or is there anything else I can do to move it forward or improve it?

@zzstoatzz
Copy link
Collaborator

hi @zyn71 - sorry for the delay here. desertaxle is out on leave right now. one of us will review your updates as soon as possible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants