Speed up set_lora_info by eliminating unnecessary H2D transfers #6960

Merged
merged 2 commits into main from lifuhuang/lora-sync on Jun 9, 2025

Conversation

@lifuhuang (Collaborator) commented on Jun 8, 2025

Motivation

This is the first PR for #6961. The second one (caching LoRA params) is WIP.

| Metric | Before | After | Speedup |
| --- | --- | --- | --- |
| ITL@P95 (request rate = 8) | 78.42 ms | 68.24 ms | 10.18 ms (13.0%) |
| ITL@P95 (request rate = 1) | 14.16 ms | 13.29 ms | 0.87 ms (6.1%) |
[Before and After benchmark screenshots]

Profile set_lora_info

[Before and After profiler screenshots of set_lora_info]


@gemini-code-assist (bot) left a comment

Hello @lifuhuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team,

Gemini here, providing a summary of this pull request to help everyone quickly understand the changes and context.

This PR focuses on optimizing the performance of the set_lora_info function within the LoRA (Low-Rank Adaptation) layers, specifically targeting the QKV (Query, Key, Value) processing path. The core idea is to eliminate unnecessary CUDA synchronizations that were implicitly occurring, which can introduce latency.

The author has provided benchmark results showing noticeable speedups, particularly under higher request rates (13.0% speedup at 8 requests/sec P95 latency). The changes involve refining how certain tensors and offsets are initialized and checked within the set_lora_info method.

Highlights

  • Performance Optimization: The primary goal is to speed up the set_lora_info function by removing implicit CUDA synchronizations, leading to improved latency as demonstrated by the provided benchmarks.
  • Tensor Initialization Refinement: The initialization of the self.output_offset tensor was changed from a two-step process (empty allocation followed by assignment) to a single-step direct initialization using torch.tensor. This change is likely the main contributor to avoiding the implicit synchronization.
  • Code Conciseness: Checks for the existence and non-None value of attributes like self.B_buffer_gate_up and self.B_buffer_qkv were updated from not hasattr(...) or ... is None to the more concise getattr(..., None) is None (illustrated in the sketch after this list).
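
For readers less familiar with the getattr idiom mentioned above, here is a minimal, standalone sketch of the pattern change (not the actual sglang code; _allocate_buffer is a hypothetical placeholder):

```python
class Example:
    def ensure_buffer_verbose(self):
        # Before: check attribute existence and None-ness separately.
        if not hasattr(self, "B_buffer_qkv") or self.B_buffer_qkv is None:
            self.B_buffer_qkv = self._allocate_buffer()

    def ensure_buffer_concise(self):
        # After: getattr with a default collapses both checks into one expression.
        if getattr(self, "B_buffer_qkv", None) is None:
            self.B_buffer_qkv = self._allocate_buffer()

    def _allocate_buffer(self):
        # Hypothetical stand-in for the real LoRA buffer allocation.
        return object()
```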

Changelog

  • python/sglang/srt/lora/layers.py
    • Replaced not hasattr(self, "B_buffer_gate_up") or self.B_buffer_gate_up is None with getattr(self, "B_buffer_gate_up", None) is None for checking the B_buffer_gate_up attribute (L140).
    • Replaced not hasattr(self, "B_buffer_qkv") or self.B_buffer_qkv is None with getattr(self, "B_buffer_qkv", None) is None for checking the B_buffer_qkv attribute (L205).
    • Changed the initialization of self.output_offset from allocating an empty tensor and then assigning values to directly creating the tensor with the correct values using torch.tensor (L224-L234). This change likely eliminates an implicit CUDA synchronization (see the sketch below).
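
A hedged sketch of the kind of change described in the bullet above; the names mirror the PR description, but the surrounding code, the dtype, and the offset values are simplified assumptions rather than the real layers.py:

```python
import torch

output_dim_q, output_dim_kv = 4096, 1024  # illustrative dimensions, not from the PR
device = "cuda" if torch.cuda.is_available() else "cpu"

# Before (two-step): allocate an empty device tensor, then write Python scalars
# into it. Each scalar assignment is a tiny host-to-device copy that can block
# the host and show up as implicit synchronization in a profile.
output_offset = torch.empty(4, dtype=torch.int32, device=device)
output_offset[0] = 0
output_offset[1] = output_dim_q
output_offset[2] = output_dim_q + output_dim_kv
output_offset[3] = output_dim_q + 2 * output_dim_kv

# After (one-step): build the tensor from host values in a single call, so the
# values reach the device in one transfer.
output_offset = torch.tensor(
    [
        0,
        output_dim_q,
        output_dim_q + output_dim_kv,
        output_dim_q + 2 * output_dim_kv,
    ],
    dtype=torch.int32,
    device=device,
)
```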

@gemini-code-assist (bot) left a comment

Code Review

This pull request aims to speed up QKVParallelLinearWithLoRA.set_lora_info by eliminating unnecessary CUDA stream synchronizations, and the provided benchmarks show a significant performance improvement. The code changes in the diff focus on refining the initialization logic for some internal buffers.

Overall, the changes are positive, leading to cleaner and potentially more efficient code. The primary change involving self.output_offset initialization is a good simplification. The other changes using getattr are stylistic improvements.
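
As a side note (not part of this PR), implicit synchronizations of this kind can be surfaced during development with PyTorch's sync debug mode. A minimal sketch, assuming a CUDA-capable machine:

```python
import torch

if torch.cuda.is_available():
    # "warn" makes PyTorch emit a warning whenever an operation forces the host
    # to wait for the GPU; "error" raises an exception instead.
    torch.cuda.set_sync_debug_mode("warn")

    x = torch.ones(4, device="cuda")
    _ = x.sum().item()  # .item() copies the result to the host and synchronizes

    torch.cuda.set_sync_debug_mode("default")
```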

Summary of Findings

  • Improved output_offset Initialization: The initialization of self.output_offset in QKVParallelLinearWithLoRA.set_lora_info was refactored to be more direct and to occur only once. This is cleaner and more efficient, particularly by avoiding redundant updates if the underlying dimensions (output_dim_q, output_dim_kv) are constant for the layer instance. (Commented)
  • Pythonic Attribute Checking: The conditions for checking attribute existence and None status (e.g., for B_buffer_gate_up and B_buffer_qkv) were updated from not hasattr(self, ...) or self... is None to getattr(self, ..., None) is None. This is a more concise and Pythonic way to achieve the same check. (Not commented due to review settings - low severity)

Merge Readiness

The changes in this pull request appear to be beneficial, improving code clarity and efficiency in set_lora_info. The benchmark results are compelling. Assuming the constancy of output_dim_q and output_dim_kv for a layer instance (which seems to be the case), the main refactoring of output_offset is correct and a good improvement.

While I cannot approve the PR myself, the changes look good to merge after confirmation on the point raised in the review comment. Please ensure other reviewers also take a look before merging.

@Fridge003 (Collaborator) left a comment

LGTM

@lifuhuang changed the title from "Speed QKV set_lora_info by eliminating unnecessary cudaStreamSynchronize" to "Eliminate stream sync to speed up LoRA batch init" on Jun 8, 2025
@lifuhuang added the ready-to-merge label (The PR is ready to merge after the CI is green.) on Jun 9, 2025
@zhyncs merged commit b1e5a33 into main on Jun 9, 2025
114 of 131 checks passed
@zhyncs deleted the lifuhuang/lora-sync branch on Jun 9, 2025 at 07:22
@lifuhuang changed the title from "Eliminate stream sync to speed up LoRA batch init" to "Speed up set_lora_info by eliminating unnecessary H2D transfers" on Jun 11, 2025
jianan-gu pushed a commit to jianan-gu/sglang that referenced this pull request Jun 12, 2025
Edwardf0t1 pushed a commit to Edwardf0t1/sglang that referenced this pull request Jun 13, 2025