rejection sampling for `top_p` etc

Currently, `sample_sequence()` first does rejection sampling (i.e. it samples a token and then checks whether that token is allowed), and computes the full mask only if that check fails. llama.cpp does the same.
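For reference, here is a minimal sketch of that flow for plain temperature sampling. It is not the actual `sample_sequence()` code: `is_allowed` and `compute_mask` are hypothetical callables standing in for the cheap per-token check and the full mask computation, and the fallback is assumed to resample from the masked logits.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax; -inf logits get probability 0.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_with_rejection(logits, is_allowed, compute_mask, rng=random):
    """Sample once from the unmasked logits; build the mask only on rejection.

    is_allowed(token) -> bool and compute_mask() -> list[bool] are
    hypothetical stand-ins, not real guidance / llguidance APIs.
    Assumes at least one token is allowed.
    """
    tokens = range(len(logits))
    candidate = rng.choices(tokens, weights=softmax(logits))[0]
    if is_allowed(candidate):
        return candidate  # fast path: the full mask is never computed
    # Slow path: compute the full mask and resample from the masked logits.
    mask = compute_mask()
    masked = [x if ok else -math.inf for x, ok in zip(logits, mask)]
    return rng.choices(tokens, weights=softmax(masked))[0]

# Usage: only token 2 is allowed, so every call must return 2.
allowed = {2}
logits = [5.0, 1.0, 0.0]
results = [
    sample_with_rejection(
        logits,
        is_allowed=lambda t: t in allowed,
        compute_mask=lambda: [t in allowed for t in range(len(logits))],
        rng=random.Random(0),
    )
    for _ in range(20)
]
```

The fast path is what makes this attractive: when the model's preferred token is allowed, which is the common case, the mask is never built.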

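For arg-max and temperature sampling, this is equivalent to computing the full mask and then sampling from the masked logits. It is **not** equivalent for `top_p` and `top_k` (and possibly other sampling methods): those truncate the candidate set based on the unmasked probabilities, so the tokens that survive truncation can differ from the ones that would survive after masking.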
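A small exact calculation illustrates the difference. This is a sketch under assumptions: a made-up three-token distribution, and a fallback that (after a rejection) masks first and then applies the same sampler. All function names are hypothetical.

```python
from fractions import Fraction

def normalize(weights):
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def top_k(probs, k):
    # Keep the k most probable tokens and renormalize.
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    return normalize({t: probs[t] for t in kept})

def mask_then_sample(probs, allowed, k=None):
    # Reference behaviour: apply the mask first, then the sampler.
    masked = normalize({t: p for t, p in probs.items() if t in allowed})
    return top_k(masked, k) if k else masked

def reject_then_mask(probs, allowed, k=None):
    # Sample from the unmasked distribution; on rejection, fall back
    # to mask_then_sample (assumed fallback behaviour).
    first = top_k(probs, k) if k else probs
    p_reject = sum(p for t, p in first.items() if t not in allowed)
    fallback = mask_then_sample(probs, allowed, k)
    out = {}
    for t in allowed:
        p = first.get(t, 0) + p_reject * fallback.get(t, 0)
        if p:
            out[t] = p
    return out

probs = {"A": Fraction(1, 2), "B": Fraction(3, 10), "C": Fraction(1, 5)}
allowed = {"A", "C"}

# Temperature sampling: both procedures give A = 5/7, C = 2/7.
temp_same = reject_then_mask(probs, allowed) == mask_then_sample(probs, allowed)

# top_k with k=2: the unmasked top-2 is {A, B}, the masked top-2 is {A, C}.
rej = reject_then_mask(probs, allowed, k=2)   # A = 25/28, C = 3/28
ref = mask_then_sample(probs, allowed, k=2)   # A = 5/7,   C = 2/7
```

With `k=2`, rejection sampling over-weights `A`, because `C` can only be reached through the fallback path; the same effect arises for `top_p` whenever disallowed tokens take up part of the nucleus.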

ChatGPT [pointed this out](https://chatgpt.com/share/67401ca4-635c-8010-8aad-4c1444335564) initially; after thinking about it a bit, I'm convinced it's right.

Possible courses of action:
- ignore it (the result is probably close enough)
- only do rejection sampling for temperature and arg-max sampling
- always compute the full mask

Note that llguidance now has an interface for checking whether a token is allowed that is cheaper than what I did in #899. I can try to get that in at some point, unless we go with the last option above.

rejection sampling for `top_p` etc #963
