Speed up set_lora_info by eliminating unnecessary H2D transfers #6960
Conversation
Hello @lifuhuang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team,
Gemini here, providing a summary of this pull request to help everyone quickly understand the changes and context.
This PR focuses on optimizing the performance of the `set_lora_info` function within the LoRA (Low-Rank Adaptation) layers, specifically targeting the QKV (Query, Key, Value) processing path. The core idea is to eliminate unnecessary CUDA synchronizations that were implicitly occurring, which can introduce latency.
The author has provided benchmark results showing noticeable speedups, particularly under higher request rates (a 13.0% speedup in P95 latency at 8 requests/sec). The changes involve refining how certain tensors and offsets are initialized and checked within the `set_lora_info` method.
Highlights
- Performance Optimization: The primary goal is to speed up the `set_lora_info` function by removing implicit CUDA synchronizations, leading to lower latency as demonstrated by the provided benchmarks.
- Tensor Initialization Refinement: The initialization of the `self.output_offset` tensor was changed from a two-step process (empty allocation followed by assignment) to a single-step direct initialization using `torch.tensor` (see the sketch after this list). This change is likely the main contributor to avoiding the implicit synchronization.
- Code Conciseness: Checks for the existence and non-None value of attributes such as `self.B_buffer_gate_up` and `self.B_buffer_qkv` were updated from `not hasattr(...) or ... is None` to the more concise `getattr(..., None) is None`.
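A minimal sketch of the two initialization styles, assuming illustrative dimensions; the real offsets in `python/sglang/srt/lora/layers.py` are derived from `output_dim_q` and `output_dim_kv`, and the exact element count and values are not reproduced here.

```python
import torch

# Illustrative dimensions only; the actual values come from the layer's
# output_dim_q / output_dim_kv inside set_lora_info.
output_dim_q, output_dim_kv = 4096, 1024
device = "cuda" if torch.cuda.is_available() else "cpu"

# Before (sketch): allocate an empty device tensor, then fill it element by
# element. Each scalar assignment into a CUDA tensor issues its own small
# host-to-device copy.
output_offset = torch.empty(4, dtype=torch.int32, device=device)
output_offset[0] = 0
output_offset[1] = output_dim_q
output_offset[2] = output_dim_q + output_dim_kv
output_offset[3] = output_dim_q + 2 * output_dim_kv

# After (sketch): build the values on the host and create the device tensor in
# one step, so at most a single host-to-device transfer is needed.
output_offset = torch.tensor(
    [0, output_dim_q, output_dim_q + output_dim_kv, output_dim_q + 2 * output_dim_kv],
    dtype=torch.int32,
    device=device,
)
```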
Changelog
- python/sglang/srt/lora/layers.py
  - Replaced `not hasattr(self, "B_buffer_gate_up") or self.B_buffer_gate_up is None` with `getattr(self, "B_buffer_gate_up", None) is None` for checking the `B_buffer_gate_up` attribute (L140).
  - Replaced `not hasattr(self, "B_buffer_qkv") or self.B_buffer_qkv is None` with `getattr(self, "B_buffer_qkv", None) is None` for checking the `B_buffer_qkv` attribute (L205); a sketch of this check pattern follows the list.
  - Changed the initialization of `self.output_offset` from allocating an empty tensor and then assigning values to directly creating the tensor with the correct values using `torch.tensor` (L224-L234). This change likely eliminates an implicit CUDA synchronization.
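To make the check rewrite concrete, here is a small, self-contained comparison of the two forms; `B_buffer_qkv` matches the attribute name in the PR, while the `Layer` class is purely an illustrative stand-in.

```python
# Stand-in object; in the PR the attribute lives on the LoRA layer instance.
class Layer:
    pass

layer = Layer()

# Before: explicit existence check plus None check.
needs_init_old = not hasattr(layer, "B_buffer_qkv") or layer.B_buffer_qkv is None

# After: getattr with a None default collapses both conditions into one; it is
# True when the attribute is either missing or explicitly set to None.
needs_init_new = getattr(layer, "B_buffer_qkv", None) is None

assert needs_init_old == needs_init_new
```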
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request aims to speed up `QKVParallelLinearWithLoRA.set_lora_info` by eliminating unnecessary CUDA stream synchronizations, and the provided benchmarks show a significant performance improvement. The code changes in the diff focus on refining the initialization logic for some internal buffers.
Overall, the changes are positive, leading to cleaner and potentially more efficient code. The primary change involving `self.output_offset` initialization is a good simplification. The other changes using `getattr` are stylistic improvements.
Summary of Findings
- Improved `output_offset` Initialization: The initialization of `self.output_offset` in `QKVParallelLinearWithLoRA.set_lora_info` was refactored to be more direct and to occur only once. This is cleaner and more efficient, particularly by avoiding redundant updates when the underlying dimensions (`output_dim_q`, `output_dim_kv`) are constant for the layer instance; a sketch of this once-only pattern follows this list. (Commented)
- Pythonic Attribute Checking: The conditions for checking attribute existence and `None` status (e.g., for `B_buffer_gate_up` and `B_buffer_qkv`) were updated from `not hasattr(self, ...) or self.<attr> is None` to `getattr(self, ..., None) is None`. This is a more concise and Pythonic way to achieve the same check. (Not commented due to review settings: low severity)
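A hedged sketch of the once-only initialization described in the first finding; using a `getattr` guard on `output_offset` is an assumption about how the PR achieves this, and the class below is a simplified stand-in rather than the real `QKVParallelLinearWithLoRA`.

```python
import torch


class QKVLoRAStandIn:
    """Simplified stand-in that only illustrates the once-only offset setup."""

    def __init__(self, output_dim_q: int, output_dim_kv: int, device: str = "cpu"):
        self.output_dim_q = output_dim_q
        self.output_dim_kv = output_dim_kv
        self.device = device

    def set_lora_info(self) -> None:
        # output_dim_q / output_dim_kv are constant for the layer, so the
        # offsets tensor (and its host-to-device copy on a CUDA device) is
        # built only on the first call; later calls skip it entirely.
        if getattr(self, "output_offset", None) is None:
            self.output_offset = torch.tensor(
                [
                    0,
                    self.output_dim_q,
                    self.output_dim_q + self.output_dim_kv,
                    self.output_dim_q + 2 * self.output_dim_kv,
                ],
                dtype=torch.int32,
                device=self.device,
            )


layer = QKVLoRAStandIn(output_dim_q=4096, output_dim_kv=1024)
layer.set_lora_info()
layer.set_lora_info()  # second call leaves output_offset untouched
```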
Merge Readiness
The changes in this pull request appear to be beneficial, improving code clarity and efficiency in `set_lora_info`. The benchmark results are compelling. Assuming the constancy of `output_dim_q` and `output_dim_kv` for a layer instance (which seems to be the case), the main refactoring of `output_offset` is correct and a good improvement.
While I cannot approve the PR myself, the changes look good to merge after confirmation on the point raised in the review comment. Please ensure other reviewers also take a look before merging.
LGTM
Motivation
This is the first PR for #6961. The second one (caching LoRA params) is WIP.
Profile of `set_lora_info`:
Before: [profiler screenshot]
After: [profiler screenshot]
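For reference, a hypothetical way to capture a profile like the screenshots above and check for host-to-device copies or implicit synchronizations; the profiled snippet is only the single-step offset construction, not the full `set_lora_info` call, and the sizes are illustrative.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Warn whenever an operation implicitly synchronizes with the GPU
# (this script assumes a CUDA device is available).
torch.cuda.set_sync_debug_mode("warn")

output_dim_q, output_dim_kv = 4096, 1024  # illustrative sizes

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    offsets = torch.tensor(
        [0, output_dim_q, output_dim_q + output_dim_kv],
        dtype=torch.int32,
        device="cuda",
    )

# Compare the number of "Memcpy HtoD" and synchronization entries between the
# old and new initialization paths.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```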
Checklist