
[Feature] Support Tensor Parallelism and Weight Slicing for Lora #4274

Merged
merged 28 commits into sgl-project:main on Mar 19, 2025

Conversation

@aoshen524 (Contributor) commented Mar 10, 2025

Motivation

#3414 reports limited model support for LoRA compared to the models covered by test_generation_models.py. This PR introduces tensor parallelism and weight slicing for LoRA, along with additional improvements to testing and functionality.

Modifications

  • Implemented tensor parallelism support in LoRA, allowing computations to be distributed efficiently across multiple devices.
  • Introduced LoRA weight slicing and refactored the memory pool to support distributed inference, optimizing memory usage and performance (see the sketch after this list).
  • Removed unused code for CPU-GPU weight transmission.
  • Added error handling for the case where the number of available GPUs is smaller than the number required.
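
As a reference for the slicing scheme, here is a minimal sketch of the rule that tensor-parallel LoRA follows; the function name, shapes, and the `row_parallel` flag are illustrative assumptions, not this PR's exact code. The idea is to shard each LoRA weight along the same dimension that the base weight is sharded, so each rank only holds the slice it needs.

```python
# Minimal sketch (assumed names/shapes, not the PR's exact code) of the
# slicing rule for a LoRA pair A: [r, in_features], B: [out_features, r]
# applied to a base linear layer sharded across tp_size ranks.
import torch


def slice_lora_weights(lora_a: torch.Tensor, lora_b: torch.Tensor,
                       tp_rank: int, tp_size: int, row_parallel: bool):
    if row_parallel:
        # The base weight is split along the input dimension, so A is split
        # the same way; B stays whole and its partial outputs are summed by
        # the all-reduce the base layer already performs.
        shard = lora_a.shape[1] // tp_size
        lora_a = lora_a[:, tp_rank * shard:(tp_rank + 1) * shard]
    else:
        # Column-parallel: the base weight is split along the output
        # dimension, so B is split along its output dimension; A stays whole.
        shard = lora_b.shape[0] // tp_size
        lora_b = lora_b[tp_rank * shard:(tp_rank + 1) * shard, :]
    return lora_a, lora_b
```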

Checklist:

  • Remove tensor.contiguous() calls used on the GPU

Throughput for LLaMA 7B with the Triton backend on 4 x 3090-24GB

LoRA Config: wissdw/4r2a_llama_hf

|          | TP Size = 1 | TP Size = 2 | TP Size = 4 |
|----------|-------------|-------------|-------------|
| Use LoRA | 800 tok/s   | 725 tok/s   | 717 tok/s   |
| No LoRA  | 793 tok/s   | 783 tok/s   | 765 tok/s   |

Local CI test result for test/srt/models/lora/test_lora_tp.py

- Remove load_to_gpu and offload_from_gpu methods from LoRALayer and LoRAAdapter classes
- Simplify weight initialization and management for LoRA layers
- This change reduces code complexity and removes unnecessary functionality
- Implement tensor parallelism for LoRA weights in column-major format
- Add logic to slice LoRA weights for row-parallel modules
- Update memory pool initialization to handle row-parallel modules
- Modify weight loading to accommodate row-parallelism
- Implement slice_lora_a_weights and slice_lora_b_weights methods for various layers (see the sketch after this commit list)
- Add support for splitting LoRA weights across multiple GPUs
- Improve weight handling for VocabParallelEmbeddingWithLoRA
- Enhance ColumnParallelLinearWithLoRA and related classes for LoRA integration
- Update QKVParallelLinearWithLoRA for better weight management
- Modify RowParallelLinearWithLoRA for efficient weight slicing
…hardware without peer to peer communication.
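
The commit list above introduces per-layer slicing hooks (slice_lora_a_weights / slice_lora_b_weights). The sketch below shows, under assumed names and signatures rather than this PR's exact code, how a row-parallel wrapper could implement them; it also illustrates why no extra communication kernel is needed, which comes up in the discussion below.

```python
# Illustrative sketch (assumed class/method shapes, not the PR's exact code)
# of per-rank LoRA weight slicing for a row-parallel linear layer.
import torch


class RowParallelLinearWithLoRASketch:
    def __init__(self, tp_rank: int, tp_size: int):
        self.tp_rank = tp_rank
        self.tp_size = tp_size

    def slice_lora_a_weights(self, lora_a: torch.Tensor) -> torch.Tensor:
        # The base weight is sharded along the input dimension, so A
        # (shape [r, in_features]) is sliced along dim 1 to match.
        shard = lora_a.shape[1] // self.tp_size
        return lora_a[:, self.tp_rank * shard:(self.tp_rank + 1) * shard]

    def slice_lora_b_weights(self, lora_b: torch.Tensor) -> torch.Tensor:
        # B (shape [out_features, r]) stays whole: each rank computes a
        # partial B @ (A_shard @ x_shard), and those partial sums ride on
        # the all-reduce the base row-parallel layer already performs, so
        # LoRA adds no extra communication kernel.
        return lora_b
```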
@Fridge003 (Collaborator) commented Mar 11, 2025

Great work! Also, to ensure the performance of LoRA after TP, please paste some benchmark results from before and after enabling TP. You can refer to the benchmark in #3161 as an example.

@aoshen524 (Contributor, Author) replied:

> Great work! Also, to ensure the performance of LoRA after TP, please paste some benchmark results from before and after enabling TP. You can refer to the benchmark in #3161 as an example.

Good advice. But no extra communication kernel is launched or used when supporting LoRA TP. Do you still recommend it?

@Fridge003 (Collaborator) replied:

> Good advice. But no extra communication kernel is launched or used when supporting LoRA TP. Do you still recommend it?

Yes, just make sure LoRA with TP is not too slow.

@aoshen524 (Contributor, Author) replied:

> Yes, just make sure LoRA with TP is not too slow.

Sure.

- Add checks for available GPUs before setting the device (see the sketch after this commit list)
- Raise informative errors for invalid GPU IDs or lack of CUDA support
- Refactor CUDA device count retrieval into a separate function
- Update GPU memory retrieval to use the new device count function
- Rename test file from test_lora_backend_tensor_parallel.py to test_lora_tp.py
- Remove 'backend' parameter from test functions, focusing on Triton backend
- Introduce 'tp_size' parameter to test different tensor parallel configurations
- Update test suite to reflect the new file name
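
As context for the GPU-availability checks described above, here is a minimal sketch of that kind of guard; the function names and error messages are assumptions for illustration, not the exact code added in this PR.

```python
# Minimal sketch (illustrative names/messages, not the PR's exact code) of
# checking GPU availability before selecting a device.
import torch


def get_cuda_device_count() -> int:
    # Centralized so other callers (e.g. GPU memory queries) share one path.
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available; a CUDA-capable GPU is required.")
    return torch.cuda.device_count()


def set_cuda_device(gpu_id: int) -> None:
    num_gpus = get_cuda_device_count()
    if gpu_id < 0 or gpu_id >= num_gpus:
        raise ValueError(
            f"Invalid GPU id {gpu_id}: only {num_gpus} CUDA device(s) are visible."
        )
    torch.cuda.set_device(gpu_id)
```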
@aoshen524 requested a review from ByronHsu as a code owner on March 15, 2025 at 14:37

@Fridge003 (Collaborator) left a comment

LGTM

@Fridge003 merged commit 588865f into sgl-project:main on Mar 19, 2025
34 of 36 checks passed