[Feature] Support dynamic loading and unloading of Lora adapters #2891
Conversation
Hi. I'm looking forward to this feature getting added. Any updates on progress? Thanks.
@mitchklusty thanks for noticing. We are currently working on other LoRA features such as unified paging and tensor parallelism, which will cause large changes to the LoRA code. I'm afraid the dynamic loading/unloading feature has to wait for those features so that mass conflicts can be avoided. Really sorry for that.
@Fridge003 That's ok, I completely understand. Any idea on a rough timeline for when it might get implemented, or is it still too early to say?
We have added this feature to the half-year plan, so it should be implemented before the end of June. If everything goes smoothly, it can ideally be done before the end of April.
Ok, great! Thanks for adding this feature.
excited to see this
This pull request has been automatically closed due to inactivity. Please feel free to reopen it if needed.
hi! so what's the progress? we're really waiting for this feature
Sorry for keeping you waiting... We are short of developers, and I'm also really busy with other tasks.
Thanks! Just wanted to clarify: this merge request will be closed and we need to wait for another one from @lifuhuang, right? Maybe we can help somehow to speed up the process? It seems like the main changes in the code have already been done.
Hi @kdduha, I discussed with @Fridge003 offline. From what I learned, the change in this PR was branched off main in January, so it has become somewhat outdated due to the changes introduced over the past months, and we would indeed need a separate PR. I plan to start working on this feature roughly in a week, after wrapping up a small task I have, and should be able to finish in June. But if you are interested in collaborating or taking a stab at it yourself, let me know; you can find me in Slack (Lifu).
Motivation
This PR aims to implement the dynamic LoRA feature mentioned in #2686.
This PR is still under development; please comment if the code can be improved.
Modifications
Current implementation of LoRA modules
Current LoRA features are implemented under the folder `python/sglang/srt/lora`, which contains three files: `lora.py`, `lora_manager.py`, and `lora_config.py`. For the initial support, refer to #1307.

In the `__init__` function of `ModelRunner`, a `LoraManager` is created if a valid `lora_path` is passed in `server_args`. The initialization of `LoraManager` has two parts: first, `init_loras` is called to load the Hugging Face LoRA weights to CPU and replace the targeted layers with `BaseLayerWithLoRA` instances; then `init_lora_memory_pool` is called to preallocate the memory pool for S-LoRA. The definition of the LoRA modules in `lora.py` is based on the vLLM implementation.
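A minimal sketch of this two-step initialization is shown below. The signatures and attribute names here are simplified and hypothetical; the actual classes in `python/sglang/srt/lora` take more arguments.

```python
# Simplified, hypothetical sketch of the two-step initialization described
# above; the real classes in python/sglang/srt/lora take more arguments.
class LoraManager:
    def __init__(self, base_model, lora_paths, max_loras_per_batch, device):
        self.base_model = base_model
        self.lora_paths = lora_paths              # {lora_name: hf_adapter_path}
        self.max_loras_per_batch = max_loras_per_batch
        self.device = device

        self.init_loras()              # step 1: CPU weights + layer wrapping
        self.init_lora_memory_pool()   # step 2: preallocate the GPU buffer

    def init_loras(self):
        # Load Hugging Face LoRA weights onto CPU and replace every targeted
        # layer (e.g. qkv_proj, o_proj) with a BaseLayerWithLoRA wrapper.
        ...

    def init_lora_memory_pool(self):
        # Preallocate the unified (S-LoRA style) memory pool that active
        # adapters are copied into before each forward pass.
        ...
```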
Before forwarding a batch, `LoraManager` calls the `prepare_lora_batch` method to load the active LoRA adapters into the memory pool. During loading, LoRA weights not used by the current batch can be evicted from the buffer if necessary.
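The per-batch behavior can be sketched roughly as follows. Names such as `active_slots` and `load_to_buffer` are placeholders for illustration, not SGLang internals.

```python
# Illustrative sketch of the eviction-on-demand behavior of prepare_lora_batch;
# "active_slots" and "load_to_buffer" are placeholders, not SGLang internals.
def prepare_lora_batch(active_slots, batch_lora_names, max_slots, load_to_buffer):
    """Ensure every adapter used by the batch is resident in the GPU buffer."""
    for name in set(batch_lora_names):
        if name in active_slots:
            continue  # already resident in the memory pool
        if len(active_slots) >= max_slots:
            # Evict an adapter that the current batch does not need
            # (this sketch assumes such a victim always exists).
            victim = next(s for s in active_slots if s not in batch_lora_names)
            active_slots.remove(victim)
        load_to_buffer(name)   # copy the adapter's CPU weights into a free slot
        active_slots.add(name)
```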
Unit tests are located at `test/srt/models/test_lora.py`. The inference test passes, but the serving test is skipped, so the LoRA serving feature might require further checking. The benchmark code can be found in `benchmark/lora/lora_bench.py`.

Implementation of dynamic LoRA serving
Dynamic LoRA serving means that LoRA adapters can be loaded and unloaded on the user's command during server runtime. This feature is already supported in vLLM (vllm lora doc). As mentioned in #1433, the current implementation supports multi-LoRA serving, but loading and unloading of LoRA modules can only happen at server initialization.
The design of loading and unloading LoRA at the API side can be similar to the `update_weights_from_disk` API, since both change the weights that the running server uses. In this design, the two APIs are named `load_lora_adapter` and `unload_lora_adapter`, as in vLLM.
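From the client side, usage could look roughly like the snippet below. The endpoint paths and JSON fields are assumptions for illustration, not a finalized interface.

```python
import requests

# Hypothetical client-side usage of the proposed APIs; the endpoint paths and
# JSON fields below are illustrative, not a finalized interface.
base_url = "http://localhost:30000"

# Load a new adapter while the server is running.
requests.post(
    f"{base_url}/load_lora_adapter",
    json={"lora_name": "sql_adapter", "lora_path": "/path/to/sql_adapter"},
)

# Later, unload it to free space in the LoRA memory pool.
requests.post(
    f"{base_url}/unload_lora_adapter",
    json={"lora_name": "sql_adapter"},
)
```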
After the user sends a `LoadLoraAdapterReq` / `UnloadLoraAdapterReq` request to the server, the server grabs a write lock and waits for the requests currently in progress to finish. Then the request is transmitted to `ModelRunner` through several passes and handled by the `LoraManager` owned by `ModelRunner`.
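A minimal sketch of that server-side flow is shown below. The class and method names are assumptions, and a plain `asyncio.Lock` stands in for the write lock; draining of in-flight generation requests is not modeled here.

```python
import asyncio

# Illustrative sketch of the server-side handling described above; names and
# structure are assumptions, and asyncio.Lock stands in for the write lock.
class LoraUpdateHandler:
    def __init__(self, model_runner):
        self.model_runner = model_runner
        self.update_lock = asyncio.Lock()

    async def load_lora_adapter(self, req: dict):
        # Serialize weight updates; draining in-flight generation requests
        # before the update is omitted in this sketch.
        async with self.update_lock:
            return self.model_runner.lora_manager.load_lora_adapter(
                req["lora_name"], req["lora_path"]
            )

    async def unload_lora_adapter(self, req: dict):
        async with self.update_lock:
            return self.model_runner.lora_manager.unload_lora_adapter(
                req["lora_name"]
            )
```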
On the `LoraManager` side, loading a new LoRA adapter follows the same steps as initialization: collecting the new target modules, initializing the new LoRA weights on CPU, and opening new space in the memory buffer if needed.

The implementation of unloading and the testing scripts are still to be done...
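For illustration only, a rough sketch of what that `LoraManager`-side handling could look like is given below, continuing the earlier sketch. The helper methods are hypothetical placeholders for the steps described in the text.

```python
# Hypothetical continuation of the LoraManager sketch above; the helper
# methods are placeholders for the steps described in the text.
class DynamicLoraManager(LoraManager):
    def load_lora_adapter(self, lora_name, lora_path):
        if lora_name in self.lora_paths:
            return False, f"adapter '{lora_name}' is already loaded"
        self._collect_target_modules(lora_path)          # wrap new target layers
        self._load_weights_to_cpu(lora_name, lora_path)  # HF weights -> CPU
        self._maybe_grow_memory_pool(lora_name)          # expand buffer if needed
        self.lora_paths[lora_name] = lora_path
        return True, "ok"

    def unload_lora_adapter(self, lora_name):
        if lora_name not in self.lora_paths:
            return False, f"adapter '{lora_name}' is not loaded"
        self._free_buffer_slot(lora_name)    # release its slot in the pool
        self._drop_cpu_weights(lora_name)    # discard its CPU copy
        del self.lora_paths[lora_name]
        return True, "ok"

    # Placeholder hooks; a real implementation would fill these in.
    def _collect_target_modules(self, lora_path): ...
    def _load_weights_to_cpu(self, lora_name, lora_path): ...
    def _maybe_grow_memory_pool(self, lora_name): ...
    def _free_buffer_slot(self, lora_name): ...
    def _drop_cpu_weights(self, lora_name): ...
```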
Checklist