
sampling: dual pivot rejection sampling algorithm to improve top-p/top-k sampling efficiency #912


Merged: 9 commits from dual-pivot-sampling into main on Mar 8, 2025

Conversation

@yzh119 (Collaborator) commented Mar 5, 2025

In our previous sampling algorithms, rejection sampling for top-p/top-k was not guaranteed to stop within a given number of rounds, so the API returned a success array indicating whether sampling succeeded; if not, serving engines had to fall back to a naive sorting-based sampling algorithm.

This PR improves the rejection sampling algorithm: instead of relying on a single pivot, we use dual pivots, so the bound [low, high] is guaranteed to shrink by half each round. After n rounds, the gap between low and high is within 2^-n.

Design doc: https://docs.google.com/document/d/1rhdgOM5VawSMAK6jjapFS02-1neGYd8dNazhtmZg7fA/edit?usp=sharing
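For intuition, here is a minimal CPU-side sketch, not the kernel implementation (whose pivot placement and fused rejection step follow the design doc above), of how a [low, high] bracket on the top-p cutoff can be shrunk by at least half per round using two pivots and then used to sample without sorting. The function name and pivot choices are illustrative assumptions only:

```
import torch

def top_p_bracket_sample(probs: torch.Tensor, top_p: float, rounds: int = 32,
                         generator: torch.Generator | None = None) -> int:
    """Illustrative sketch only: bracket the top-p cutoff with [low, high],
    shrink the bracket by at least half per round, then sample the survivors.
    `probs` is a 1-D probability vector for a single request."""
    low, high = 0.0, probs.max().item()
    for _ in range(rounds):
        # Two pivots per round; this particular placement is an assumption.
        pivot_0 = low + 0.50 * (high - low)
        pivot_1 = low + 0.75 * (high - low)
        mass_0 = probs[probs >= pivot_0].sum().item()
        mass_1 = probs[probs >= pivot_1].sum().item()
        if mass_1 >= top_p:            # cutoff lies at or above pivot_1
            low = pivot_1
        elif mass_0 >= top_p:          # cutoff lies in [pivot_0, pivot_1)
            low, high = pivot_0, pivot_1
        else:                          # cutoff lies below pivot_0
            high = pivot_0
        # In every branch the new bracket is at most half the old one,
        # so after n rounds its width is within 2^-n of the initial width.
    # Tokens whose probability clears the lower bound form (a superset of) the nucleus.
    kept = torch.where(probs >= low, probs, torch.zeros_like(probs))
    return int(torch.multinomial(kept / kept.sum(), 1, generator=generator).item())
```

Because the bracket provably halves each round, a fixed number of rounds yields a tight enough cutoff, which is what lets the new API drop the success fallback.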

Breaking Changes

  • This PR removes the success return value of all sampling APIs, which is not compatible with the earlier design.
  • Instead of passing a uniform tensor, we changed the interface to accept an optional torch.Generator (https://pytorch.org/docs/stable/generated/torch.Generator.html), to align with the behavior of torch; a hypothetical call shape is sketched after this list.
  • The C++ API and TVM interface will break in this PR; let's fix the behavior later.
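A hedged before/after sketch of the interface change, assuming the `flashinfer.sampling.top_p_sampling_from_probs` entry point keeps its name and gains a `generator` keyword (exact parameter names may differ):

```
import torch
import flashinfer

probs = torch.softmax(torch.randn(4, 32000, device="cuda"), dim=-1)

# Old style (removed): a pre-drawn uniform tensor was passed in and a
# `success` array was returned alongside the sampled ids.

# New style: pass an optional torch.Generator; only the sampled ids are returned.
gen = torch.Generator(device="cuda").manual_seed(0)
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9, generator=gen)
```

Seeding the generator also keeps runs reproducible without materializing uniform tensors on the caller side.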

Co-authored-by: Shanli Xing [email protected]

@yzh119 mentioned this pull request Mar 4, 2025
@yzh119 merged commit d4dc3f9 into main Mar 8, 2025
@yzh119 (Collaborator, Author) commented Mar 8, 2025

@MasterJH5574 we need to figure out how to get a torch.Generator equivalent in tvm.

We obtain the Philox seed/offset from the generator:

  // Resolve the optional generator argument, falling back to PyTorch's default CUDA generator.
  uint64_t philox_seed, philox_offset;
  auto gen = at::get_generator_or_default<at::CUDAGeneratorImpl>(
      gen_, at::cuda::detail::getDefaultCUDAGenerator());
  // Hold the generator's mutex while reserving `increment_size` Philox states.
  std::lock_guard<std::mutex> lock(gen->mutex_);
  at::PhiloxCudaState rng_engine_inputs = gen->philox_cuda_state(increment_size);
  philox_seed = rng_engine_inputs.seed_.val;
  philox_offset = rng_engine_inputs.offset_.val;

Definition of PhiloxCudaState:

https://github.com/pytorch/pytorch/blob/85467ed063d284fa21a2f1d2adfec8fda544923d/aten/src/ATen/cuda/detail/PhiloxCudaStateRaw.cuh
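(Aside: the reason a (seed, offset) pair is all that needs to cross the TVM boundary is that Philox is counter-based, so those two values fully determine the stream. A quick torch-level illustration of that determinism, using a CPU generator purely for brevity; nothing here is part of this PR:)

```
import torch

# Two generators with the same seed (and hence the same starting offset)
# produce identical uniform streams, so shipping the seed/offset pair is
# enough to reproduce the samples on the other side.
g0 = torch.Generator(device="cpu").manual_seed(42)
g1 = torch.Generator(device="cpu").manual_seed(42)
assert torch.equal(torch.rand(8, generator=g0), torch.rand(8, generator=g1))
```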

yzh119 pushed a commit that referenced this pull request Mar 12, 2025
Looks like a typo in #912

```
-    uniform_samples: torch.Tensor
-        The uniform samples used as needle for sampling, shape ``(batch_size, num_speculate_tokens + 1)``.
+    target_probs: torch.Tensor
         Expected to be uniformly distributed in ``[0, 1)``.
      target_probs: torch.Tensor
         The probability over vocabulary generated by target model.
```
yzh119 pushed a commit that referenced this pull request Mar 12, 2025
Looks like another typo in #912 - sorry for taking 3 PRs to fix one
docstring! 🙄

```
     >>> # uniform samples for rejection sampling
-    >>> uniform_samples = torch.rand(batch_size, num_speculate_tokens + 1).to(0)
-    tensor([[0.8823, 0.9150, 0.3829], device='cuda:0')
     >>> target_probs = torch.tensor([[[0.0, 0.1, 0.6, 0.3], [1.0, 0.0, 0.0, 0.0], [0.7, 0.1, 0.1, 0.1]]]).to(0)
```
@MasterJH5574 deleted the dual-pivot-sampling branch March 13, 2025 14:09