
sampling: dual pivot rejection sampling algorithm to improve top-p/top-k sampling efficiency #912


Merged: 9 commits from dual-pivot-sampling into main on Mar 8, 2025

Conversation

@yzh119 (Collaborator) commented Mar 5, 2025

In our previous sampling algorithms, rejection sampling for top-p/top-k was not guaranteed to stop within a given number of rounds, so the API returned a success array indicating whether sampling succeeded; if not, serving engines had to fall back to a naive sorting-based sampling algorithm.

This PR improves the rejection sampling algorithm: instead of relying on a single pivot, we use dual pivots, so the bound [low, high] is guaranteed to shrink by half each round. After n rounds, the gap between low and high is within 2^-n.

Design doc: https://docs.google.com/document/d/1rhdgOM5VawSMAK6jjapFS02-1neGYd8dNazhtmZg7fA/edit?usp=sharing
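For intuition, here is a minimal CPU-side sketch, not the kernel implementation (whose pivot placement and fused rejection step follow the design doc above), of how a [low, high] bracket on the top-p cutoff can be shrunk by at least half per round using two pivots and then used to sample without sorting. The function name and pivot choices are illustrative assumptions only:

```
import torch

def top_p_bracket_sample(probs: torch.Tensor, top_p: float, rounds: int = 32,
                         generator: torch.Generator | None = None) -> int:
    """Illustrative sketch only: bracket the top-p cutoff with [low, high],
    shrink the bracket by at least half per round, then sample the survivors.
    `probs` is a 1-D probability vector for a single request."""
    low, high = 0.0, probs.max().item()
    for _ in range(rounds):
        # Two pivots per round; this particular placement is an assumption.
        pivot_0 = low + 0.50 * (high - low)
        pivot_1 = low + 0.75 * (high - low)
        mass_0 = probs[probs >= pivot_0].sum().item()
        mass_1 = probs[probs >= pivot_1].sum().item()
        if mass_1 >= top_p:            # cutoff lies at or above pivot_1
            low = pivot_1
        elif mass_0 >= top_p:          # cutoff lies in [pivot_0, pivot_1)
            low, high = pivot_0, pivot_1
        else:                          # cutoff lies below pivot_0
            high = pivot_0
        # In every branch the new bracket is at most half the old one,
        # so after n rounds its width is within 2^-n of the initial width.
    # Tokens whose probability clears the lower bound form (a superset of) the nucleus.
    kept = torch.where(probs >= low, probs, torch.zeros_like(probs))
    return int(torch.multinomial(kept / kept.sum(), 1, generator=generator).item())
```

Because the bracket provably halves each round, a fixed number of rounds yields a tight enough cutoff, which is what lets the new API drop the success fallback.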

Breaking Changes

  • This PR removes the success return value of all sampling APIs, which is not compatible with the earlier design.
  • Instead of passing a uniform tensor, we changed the interface to accept an optional torch.Generator (https://pytorch.org/docs/stable/generated/torch.Generator.html), to align with the behavior of torch; a hypothetical call shape is sketched after this list.
  • The C++ API and TVM interface will break in this PR; let's fix the behavior later.
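A hedged before/after sketch of the interface change, assuming the `flashinfer.sampling.top_p_sampling_from_probs` entry point keeps its name and gains a `generator` keyword (exact parameter names may differ):

```
import torch
import flashinfer

probs = torch.softmax(torch.randn(4, 32000, device="cuda"), dim=-1)

# Old style (removed): a pre-drawn uniform tensor was passed in and a
# `success` array was returned alongside the sampled ids.

# New style: pass an optional torch.Generator; only the sampled ids are returned.
gen = torch.Generator(device="cuda").manual_seed(0)
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9, generator=gen)
```

Seeding the generator also keeps runs reproducible without materializing uniform tensors on the caller side.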

Co-authored-by: Shanli Xing [email protected]

@yzh119 mentioned this pull request Mar 4, 2025
@yzh119 merged commit d4dc3f9 into main Mar 8, 2025
@yzh119 (Collaborator, Author) commented Mar 8, 2025

@MasterJH5574 we need to figure out how to get a torch.Generator equivalent in tvm.

We obtain the Philox seed/offset from the generator:

  // Resolve the optional generator argument, falling back to PyTorch's default CUDA generator.
  uint64_t philox_seed, philox_offset;
  auto gen = at::get_generator_or_default<at::CUDAGeneratorImpl>(
      gen_, at::cuda::detail::getDefaultCUDAGenerator());
  // Hold the generator's mutex while reserving `increment_size` Philox states.
  std::lock_guard<std::mutex> lock(gen->mutex_);
  at::PhiloxCudaState rng_engine_inputs = gen->philox_cuda_state(increment_size);
  philox_seed = rng_engine_inputs.seed_.val;
  philox_offset = rng_engine_inputs.offset_.val;

Definition of PhiloxCudaState:

https://github.com/pytorch/pytorch/blob/85467ed063d284fa21a2f1d2adfec8fda544923d/aten/src/ATen/cuda/detail/PhiloxCudaStateRaw.cuh
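(Aside: the reason a (seed, offset) pair is all that needs to cross the TVM boundary is that Philox is counter-based, so those two values fully determine the stream. A quick torch-level illustration of that determinism, using a CPU generator purely for brevity; nothing here is part of this PR:)

```
import torch

# Two generators with the same seed (and hence the same starting offset)
# produce identical uniform streams, so shipping the seed/offset pair is
# enough to reproduce the samples on the other side.
g0 = torch.Generator(device="cpu").manual_seed(42)
g1 = torch.Generator(device="cpu").manual_seed(42)
assert torch.equal(torch.rand(8, generator=g0), torch.rand(8, generator=g1))
```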

yzh119 pushed a commit that referenced this pull request Mar 12, 2025
Looks like a typo in #912

```
-    uniform_samples: torch.Tensor
-        The uniform samples used as needle for sampling, shape ``(batch_size, num_speculate_tokens + 1)``.
+    target_probs: torch.Tensor
         Expected to be uniformly distributed in ``[0, 1)``.
      target_probs: torch.Tensor
         The probability over vocabulary generated by target model.
```
yzh119 pushed a commit that referenced this pull request Mar 12, 2025
Looks like another typo in #912 - sorry for taking 3 PRs to fix one
docstring! 🙄

```
     >>> # uniform samples for rejection sampling
-    >>> uniform_samples = torch.rand(batch_size, num_speculate_tokens + 1).to(0)
-    tensor([[0.8823, 0.9150, 0.3829], device='cuda:0')
     >>> target_probs = torch.tensor([[[0.0, 0.1, 0.6, 0.3], [1.0, 0.0, 0.0, 0.0], [0.7, 0.1, 0.1, 0.1]]]).to(0)
```
@MasterJH5574 deleted the dual-pivot-sampling branch March 13, 2025 14:09