[Perf] Speed up LoRA Batch Initialization #6961

lifuhuang · 2025-06-08T01:04:12Z

Motivation

prepare_lora_batch is triggered once per forward pass and is one of the main sources of LoRA performance overhead. Based on a suggestion from @Fridge003, there is some low-hanging fruit for optimization, such as eliminating unnecessary CUDA device syncs.

Eliminate unnecessary H2D transfers in set_lora_info (Speed up set_lora_info by eliminating unnecessary H2D transfers #6960)
Eliminate CUDA stream syncs and redundant computation in LoRAManager ([Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations #6994)
Experiment with torch.compile / CUDA graph for prepare_lora_batch to reduce gaps between kernels (idea from @hebiao064, to be verified)
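
To illustrate the kind of sync the first two items target, here is a minimal sketch of one common pattern: derive scalar batch metadata from host-side Python ints instead of reading it back from a GPU tensor, then push the arrays to the device with asynchronous copies. All names below (`build_lora_batch_metadata`, `copy_metadata_to_device`, `slot_of_adapter`, the dictionary keys) are hypothetical and do not mirror the actual LoRAManager code or the changes in #6960 / #6994.

```python
from itertools import accumulate


def build_lora_batch_metadata(seq_lens, adapter_ids, slot_of_adapter):
    """Compute per-batch LoRA metadata from plain Python ints on the host.

    Hypothetical helper for illustration; not the real LoRAManager API.
    """
    seg_lens = list(seq_lens)
    return {
        "seg_lens": seg_lens,
        # Prefix sums mark where each request's tokens start and end.
        "seg_indptr": [0, *accumulate(seg_lens)],
        # Map each request's adapter to its slot in a (hypothetical) weight pool.
        "weight_indices": [slot_of_adapter[a] for a in adapter_ids],
        # Taken from host ints, so no device-to-host read (and no sync) is needed.
        "max_len": max(seg_lens, default=0),
    }


def copy_metadata_to_device(meta, device="cuda"):
    """Issue one asynchronous H2D copy per metadata array."""
    import torch  # imported lazily so the host-side helper has no dependencies

    pin = torch.cuda.is_available()  # pinned memory requires a CUDA runtime
    out = {}
    for key in ("seg_lens", "seg_indptr", "weight_indices"):
        host = torch.tensor(meta[key], dtype=torch.int32, pin_memory=pin)
        # non_blocking=True lets the copy overlap with other host work
        # when the source tensor lives in pinned memory.
        out[key] = host.to(device, non_blocking=True)
    out["max_len"] = meta["max_len"]  # stays a Python int
    return out
```

The point of the sketch is that a call such as `int(torch.max(seg_lens_gpu))` would force the host to wait for the device, whereas `max()` over the host list does not. Whether this matches the actual hot spots in prepare_lora_batch is an assumption that would need profiling to confirm.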

Related resources

No response

lifuhuang self-assigned this Jun 8, 2025

lifuhuang mentioned this issue Jun 8, 2025

Speed up set_lora_info by eliminating unnecessary H2D transfers #6960

Merged

6 tasks

Fridge003 mentioned this issue Jun 8, 2025

[Feature] Lora Development Roadmap #2929

Open

16 tasks

lifuhuang mentioned this issue Jun 11, 2025

[Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations #6994

Merged

6 tasks
