
[Feature] Support dynamic loading and unloading of Lora adapters #2891


Closed
wants to merge 3 commits

Conversation

Fridge003
Collaborator

Motivation

This PR aims to implement the dynamic LoRA feature mentioned in #2686.
This PR is still under development; please comment if the code can be improved.

Modifications

Current implementation of LoRA modules

Current LoRA features are implemented under the folder python/sglang/srt/lora, which contains three files: lora.py, lora_manager.py, and lora_config.py. For the initial support, see #1307.

In the __init__ function of ModelRunner, a LoraManager is created if a valid lora_path is passed in server_args. The initialization of LoraManager has two parts: first, init_loras is called to load the Hugging Face LoRA weights to CPU and replace the targeted layers with BaseLayerWithLoRA instances; then, init_lora_memory_pool is called to preallocate the memory pool for S-LoRA. The LoRA modules in lora.py are implemented on the basis of the vLLM implementation.
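For orientation, here is a minimal sketch of that two-phase initialization. The method names init_loras and init_lora_memory_pool come from the description above; the constructor arguments and bodies are simplified assumptions, not the actual sglang code:

```python
# Simplified sketch of the two-phase LoraManager initialization.
class LoraManager:
    def __init__(self, base_model, lora_paths, max_loras_per_batch):
        self.base_model = base_model
        self.lora_paths = lora_paths              # adapter name -> HF path
        self.max_loras_per_batch = max_loras_per_batch

        # Phase 1: load HF LoRA weights to CPU and wrap the targeted
        # layers in BaseLayerWithLoRA instances.
        self.init_loras()
        # Phase 2: preallocate the GPU memory pool used for S-LoRA-style
        # batched adapter weights.
        self.init_lora_memory_pool()

    def init_loras(self):
        ...

    def init_lora_memory_pool(self):
        ...
```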

Before forwarding a batch, LoraManager calls the prepare_lora_batch method to load the active LoRA adapters from the memory pool. During loading, LoRA weights not used in the current batch can be evicted from the buffer if necessary.
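As a rough illustration of this per-batch loading, here is a sketch under assumed data structures (the buffer dict, max_loras_per_batch, and the eviction choice below are illustrative, not the PR's actual policy):

```python
# Hypothetical sketch of prepare_lora_batch: make every adapter used by
# the batch resident in the GPU buffer, evicting unused adapters if full.
def prepare_lora_batch(self, batch):
    active = {req.lora_path for req in batch.reqs if req.lora_path}
    for adapter in active:
        if adapter in self.buffer:                  # already on GPU
            continue
        if len(self.buffer) >= self.max_loras_per_batch:
            # Evict any resident adapter the current batch does not need.
            victim = next(a for a in self.buffer if a not in active)
            del self.buffer[victim]
        self.buffer[adapter] = self.load_weights_to_gpu(adapter)
```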

Unit tests are located at test/srt/models/test_lora.py. The inference test passes, but the serving test is skipped, so the LoRA serving feature might require further checking. The benchmark code can be found in benchmark/lora/lora_bench.py.

Implementation of dynamic LoRA serving

Dynamic LoRA serving means that LoRA adapters can be loaded and unloaded at the user's command during server runtime. This feature is already supported in vLLM (vllm lora doc). As mentioned in #1433, the current implementation supports multi-LoRA serving, but LoRA modules can only be loaded and unloaded at server initialization.

The API design for loading and unloading LoRA can be similar to that of the update_weights_from_disk API, since both change the weights the server is currently running on. In this design, the two APIs are named load_lora_adapter and unload_lora_adapter, as in vLLM.
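For a sense of how the endpoints would be called, here is a hypothetical client example assuming a vLLM-style request schema (the port, endpoint paths, and field names are assumptions; the final sglang API may differ):

```python
import requests

base_url = "http://localhost:30000"

# Load an adapter by name from a local path.
requests.post(f"{base_url}/load_lora_adapter",
              json={"lora_name": "sql_adapter",
                    "lora_path": "/path/to/sql_adapter"})

# Unload it again by name.
requests.post(f"{base_url}/unload_lora_adapter",
              json={"lora_name": "sql_adapter"})
```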

After the user sends a LoadLoraAdapterReq/UnloadLoraAdapterReq request to the server, the server grabs a write lock and waits for the requests in progress to finish. Then the request is passed down to the ModelRunner through several hops and handled by the LoraManager owned by the ModelRunner.
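A simplified sketch of that locking flow, with a plain asyncio.Lock standing in for the write lock described above (the class and attribute names are illustrative):

```python
import asyncio

class ServerFrontend:
    def __init__(self, model_runner):
        self.model_runner = model_runner
        # Serializes weight updates against in-flight requests; the real
        # design would use a reader/writer lock so normal generation
        # requests can share the read side.
        self.model_update_lock = asyncio.Lock()

    async def load_lora_adapter(self, req):
        # Wait for the requests in progress to finish, then apply the
        # update exclusively before new requests proceed.
        async with self.model_update_lock:
            return self.model_runner.lora_manager.load_lora_adapter(req)
```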

On the LoraManager side, loading a new LoRA adapter follows the same process as initialization: collecting the new target modules, initializing the new LoRA weights on CPU, and opening new space in the memory buffer if needed.
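The same three steps, as an illustrative sketch on the LoraManager side (all helper names here are hypothetical):

```python
def load_lora_adapter(self, lora_name, lora_path):
    config = self.read_lora_config(lora_path)          # adapter_config.json
    self.collect_target_modules(config)                # 1) new target modules
    cpu_weights = self.load_weights_to_cpu(lora_path)  # 2) weights on CPU
    self.loras[lora_name] = cpu_weights
    if not self.memory_pool_fits(config):
        self.expand_lora_memory_pool(config)           # 3) grow GPU buffer
```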

The implementation of unloading and the testing scripts are still to be done...

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@mitchklusty

Hi. I'm looking forward to this feature getting added. Any updates on progress? Thanks.

@Fridge003
Collaborator Author

Hi. I'm looking forward to this feature getting added. Any updates on progress? Thanks.

@mitchklusty thanks for checking in. We are currently adding other LoRA features such as unified paging and tensor parallelism, which will cause large changes to the LoRA code. I'm afraid the dynamic loading/unloading feature has to wait for those features, so that massive conflicts can be avoided. Really sorry for that.

@mitchklusty

@Fridge003 That's ok, I completely understand. Any idea on a rough timeline for when it might get implemented or is it still too early to say?

@Fridge003
Collaborator Author

Fridge003 commented Mar 5, 2025

@Fridge003 That's ok, I completely understand. Any idea on a rough timeline for when it might get implemented or is it still too early to say?

We have added this feature to the half-year plan, so it should be implemented before the end of June.

If everything goes smoothly, it could ideally be done before the end of April.

@mitchklusty

We have added this feature to the half-year plan, so it should be implemented before the end of June.

If everything goes smoothly, it could ideally be done before the end of April.

Ok, great! Thanks for adding this feature.

@binarycrayon
Contributor

excited to see this


This pull request has been automatically closed due to inactivity. Please feel free to reopen it if needed.

@kdduha

kdduha commented Jun 5, 2025

hi! so what's the progress? we're really waiting for this feature

@Fridge003
Collaborator Author

hi! so what's the progress? we're really waiting for this feature

Sorry for keeping you waiting... We are short of developers, and I'm also really busy with other tasks.
But @lifuhuang will take over this feature; he will work on it during June.

@kdduha

kdduha commented Jun 8, 2025

Sorry for keeping you waiting... We are short of developers, and I'm also really busy with other tasks.
But @lifuhuang will take over this feature; he will work on it during June.

thx! Just wanted to clarify: this merge request will be closed and we need to wait for another one from @lifuhuang, right? Maybe we can help somehow to speed up the process? It seems like the main changes in the code have already been done.

@lifuhuang
Collaborator

lifuhuang commented Jun 8, 2025

Hi @kdduha, I discussed with @Fridge003 offline. From what I learned, the change in this PR was branched off main in January, so it has become somewhat outdated due to the changes introduced over the past months; we will indeed need a separate PR.

I plan to start working on this feature in roughly a week, after wrapping up a small task I have, and I should be able to finish in June. But if you are interested in collaborating or taking a stab yourself, let me know; you can find me on Slack (Lifu).
