
Fix llm hp optimization error #2576


Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
This PR fixes errors encountered when using the Katib LLM hyperparameter optimization API (which depends on the Trainer SDK v1.9.0) to run the example in the user guide.
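For context, the user-guide example exercised here looks roughly like the condensed sketch below. Argument names follow the Katib 1.9 docs; the model, dataset, and search ranges are illustrative, not taken from this PR:

import kubeflow.katib as katib
import transformers
from kubeflow.katib import KatibClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
from peft import LoraConfig

cl = KatibClient(namespace="kubeflow")

# Tune LoRA hyperparameters of a small BERT model; Katib replaces the
# katib.search placeholders with concrete values for each trial.
cl.tune(
    name="llm-hp-optimization-example",
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(repo_id="yelp_review_full"),
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="results",
            learning_rate=katib.search.double(min=1e-5, max=5e-5),
        ),
        # The lora_config below is what failed to serialize before this fix.
        lora_config=LoraConfig(r=katib.search.int(min=8, max=32)),
    ),
    objective_metric_name="train_loss",
    objective_type="minimize",
    max_trial_count=10,
    parallel_trial_count=2,
)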

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2575

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit
Contributor Author

Please review when you have time @andreyvelich @mahdikhashan. Thank you!

@helenxie-bit
Contributor Author

helenxie-bit commented Mar 29, 2025

The E2E test for the train API failed with the following error: TypeError: Object of type LoraRuntimeConfig is not JSON serializable. I'm working on fixing it.

Updated 2025-03-31:
I fixed the issue by updating the following line of code:

json.dumps(
    trainer_parameters.lora_config.__dict__, cls=utils.SetEncoder
),

to:

json.dumps(trainer_parameters.lora_config.to_dict(), cls=utils.SetEncoder),

This change follows the official documentation, which recommends using LoraConfig.to_dict() for serialization.
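To illustrate the failure mode, here is a self-contained sketch; the two dataclasses are hypothetical stand-ins for peft's LoraConfig and LoraRuntimeConfig, which behave the same way as far as JSON serialization is concerned:

import json
from dataclasses import asdict, dataclass, field

@dataclass
class LoraRuntimeConfig:  # stand-in for peft's LoraRuntimeConfig
    ephemeral_gpu_offload: bool = False

@dataclass
class LoraConfig:  # stand-in for peft's LoraConfig
    r: int = 8
    lora_alpha: int = 16
    runtime_config: LoraRuntimeConfig = field(default_factory=LoraRuntimeConfig)

    def to_dict(self):
        # Like peft's to_dict(): recursively converts nested dataclasses
        # into plain dicts that json.dumps can handle.
        return asdict(self)

config = LoraConfig()

# __dict__ keeps runtime_config as a dataclass instance, so this raises:
# TypeError: Object of type LoraRuntimeConfig is not JSON serializable
try:
    json.dumps(config.__dict__)
except TypeError as e:
    print(e)

# to_dict() flattens everything to JSON-friendly types, so this succeeds.
print(json.dumps(config.to_dict()))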

@mahdikhashan Can you help test whether this fixes your issue? I remember you ran into the same problem.

@helenxie-bit helenxie-bit changed the title fix llm hp optimization error [WIP] fix llm hp optimization error Mar 29, 2025
Signed-off-by: helenxie-bit <[email protected]>
…46 of 🤗 Transformers. Use instead'

Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit helenxie-bit changed the title [WIP] fix llm hp optimization error Fix llm hp optimization error Mar 31, 2025
@mahdikhashan
Member

/assign

@andreyvelich
Member

/rerun-all

datasets==3.5.0
transformers==4.50.2
accelerate==1.5.2
tensorboard==2.19.0
Member

What is the use case for tensorboard in these changes?

Contributor Author

I believe I ran into an error saying the tensorboard version was incorrect, so I explicitly pinned its version here.
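For anyone double-checking a local environment against these pins, a minimal sketch (the expected versions simply mirror the requirements change above):

from importlib.metadata import version

# Pins from this PR's requirements update.
pins = {
    "datasets": "3.5.0",
    "transformers": "4.50.2",
    "accelerate": "1.5.2",
    "tensorboard": "2.19.0",
}

for package, expected in pins.items():
    installed = version(package)
    status = "OK" if installed == expected else f"expected {expected}"
    print(f"{package}=={installed} ({status})")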

Member

thank you

@mahdikhashan
Member

mahdikhashan commented Apr 23, 2025

@helenxie-bit Thank you for this PR, and sorry for the delay. The code changes look good to me; let me check whether I can run an example notebook with these changes.

@helenxie-bit
Contributor Author

/rerun-all

@mahdikhashan
Member

/lgtm

thank you for your patience

@google-oss-prow google-oss-prow bot added the lgtm label Apr 25, 2025
@andreyvelich
Member

Thank you @helenxie-bit!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit f58e893 into kubeflow:release-1.9 Apr 29, 2025
53 checks passed