sampling: dual pivot rejection sampling algorithm to improve top-p/top-k sampling efficiency #912
In our previous sampling algorithms, the rejection sampling procedure for top-p/top-k sampling was not guaranteed to terminate within a given number of rounds, so the API returned a `success` array indicating whether sampling succeeded; when it did not, serving engines had to fall back to a naive sorting-based sampling algorithm.

This PR improves the rejection sampling algorithm: instead of relying on a single pivot, we use dual pivots, and the bound `[low, high]` is guaranteed to shrink by half each round. After n rounds, the gap between low and high is within 2^-n.

Design doc: https://docs.google.com/document/d/1rhdgOM5VawSMAK6jjapFS02-1neGYd8dNazhtmZg7fA/edit?usp=sharing
Breaking Changes
- We removed the `success` return value from all sampling APIs, which is not compatible with the earlier design.
- We removed the `uniform` tensor argument; the interface now accepts an optional `torch.Generator` (https://pytorch.org/docs/stable/generated/torch.Generator.html), to align with the behavior of torch.

Co-authored-by: Shanli Xing [email protected]
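The shape of the generator change can be illustrated with a hypothetical sketch, using Python's `random.Random` as a stand-in for `torch.Generator` (the function name is ours, not FlashInfer's): the sampler draws its own randomness from an optional generator instead of consuming a caller-supplied `uniform` tensor, so results are reproducible by reseeding.

```python
import random

def sample_from_probs(probs, generator=None):
    """Draw one index from a categorical distribution.

    `generator` plays the role of the optional torch.Generator: if omitted,
    the global RNG is used; if seeded, sampling is reproducible.
    """
    rng = generator if generator is not None else random
    u = rng.random()  # the draw now happens inside the sampler
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1
```

Calling this twice with `random.Random(42)` yields identical samples, mirroring how a seeded `torch.Generator` makes torch's own sampling ops reproducible.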