
rejection sampling for top_p etc #963

Open
@mmoskal


Currently, sample_sequence() first does rejection sampling (i.e., it samples a token and then checks whether that token is allowed) and computes the full mask only if that check fails. This is the same approach as llama.cpp.

For arg-max and temperature sampling, this is equivalent to computing the full mask and then sampling from the masked logits. For top_p and top_k (and possibly other sampling methods), it is not.
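
A sketch of why, assuming that a rejected draw falls back to sampling from the masked, renormalized distribution with the same sampler (the symbols below are introduced here for illustration only). Let $p$ be the token distribution after temperature scaling, $A$ the set of allowed tokens, and $Z = \sum_{s \in A} p(s)$. With plain temperature sampling, the rejection scheme returns an allowed token $t$ with probability

$$
q(t) = p(t) + (1 - Z)\,\frac{p(t)}{Z} = \frac{p(t)}{Z}, \qquad t \in A,
$$

which is exactly the masked distribution. With a truncation step $T$ (top_k or top_p), writing $p_A$ for $p$ restricted to $A$ and renormalized, the first draw comes from $T(p)$ but the fallback comes from $T(p_A)$:

$$
q_T(t) = T(p)(t) + \Bigl(1 - \sum_{s \in A} T(p)(s)\Bigr)\, T(p_A)(t), \qquad t \in A.
$$

In general this differs from $T(p_A)(t)$, because truncation and masking do not commute: the tokens that survive $T(p)$ need not be the ones that survive $T(p_A)$. Arg-max is the degenerate case where both agree: if the top token is allowed it is also the top allowed token, and otherwise the fallback picks the top allowed token.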

ChatGPT initially pointed this out to me, and after thinking about it a bit, I'm convinced it's right.
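
A small numeric check of the claim, as a sketch rather than a reproduction of the actual code path. It assumes that a rejected draw falls back to applying the same sampler to the masked, renormalized distribution; that is a reading of the description above, not something verified against sample_sequence(). All function names and the four-token distribution are made up for illustration. The distributions are computed exactly, so no random sampling is involved.

```python
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def apply_mask(probs, allowed):
    # Zero out disallowed tokens and renormalize.
    return normalize([p if i in allowed else 0.0 for i, p in enumerate(probs)])

def temperature(probs):
    # Plain sampling from the distribution (temperature already applied).
    return list(probs)

def top_k(k):
    # Keep the k most likely tokens, zero the rest, renormalize.
    def sampler(probs):
        keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        return normalize([p if i in keep else 0.0 for i, p in enumerate(probs)])
    return sampler

def mask_then_sample(probs, allowed, sampler):
    # Reference behaviour: compute the full mask, then sample.
    return sampler(apply_mask(probs, allowed))

def rejection_then_mask(probs, allowed, sampler):
    # Assumed behaviour: sample first, accept if allowed,
    # otherwise fall back to mask_then_sample.
    first = sampler(probs)
    rejected = sum(p for i, p in enumerate(first) if i not in allowed)
    fallback = mask_then_sample(probs, allowed, sampler)
    return [
        (first[i] if i in allowed else 0.0) + rejected * fallback[i]
        for i in range(len(probs))
    ]

probs = [0.5, 0.3, 0.15, 0.05]   # hypothetical model distribution
allowed = {1, 2, 3}              # the grammar forbids token 0

exact_temp = mask_then_sample(probs, allowed, temperature)
rej_temp = rejection_then_mask(probs, allowed, temperature)
exact_topk = mask_then_sample(probs, allowed, top_k(2))
rej_topk = rejection_then_mask(probs, allowed, top_k(2))

print("temperature:", exact_temp, rej_temp)
print("top_k=2:    ", exact_topk, rej_topk)
```

Under these assumptions the two temperature distributions agree, while for top_k=2 token 1 gets probability 2/3 with mask-then-sample but about 0.79 with rejection sampling, since token 1 can be accepted on the first draw and also be chosen again in the fallback.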

Possible courses of action:

  • ignore it (it's probably close enough)
  • only do it for temp and arg-max
  • always compute mask
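
For concreteness, the second option could look roughly like the following. This is a hypothetical sketch, not the existing sample_sequence() code: `sample_constrained`, the `rejection_safe` flag, and the sampler helpers are names invented here, and `is_allowed` stands in for whatever per-token check is available.

```python
import random

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def argmax_sampler(probs, rng):
    return max(range(len(probs)), key=lambda i: probs[i])

def temperature_sampler(probs, rng):
    return rng.choices(range(len(probs)), weights=probs)[0]

def make_top_k_sampler(k):
    def sampler(probs, rng):
        keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        return rng.choices(keep, weights=[probs[i] for i in keep])[0]
    return sampler

def sample_constrained(probs, is_allowed, sampler, rejection_safe, rng):
    # Fast path: only for samplers where rejection sampling is exact
    # (arg-max and plain temperature sampling).
    if rejection_safe:
        token = sampler(probs, rng)
        if is_allowed(token):
            return token
    # Slow path: compute the full mask, renormalize, then sample.
    masked = normalize([p if is_allowed(i) else 0.0 for i, p in enumerate(probs)])
    return sampler(masked, rng)

rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]          # hypothetical distribution
is_allowed = lambda token: token != 0   # hypothetical grammar check

greedy = sample_constrained(probs, is_allowed, argmax_sampler, True, rng)
top_k_draws = {
    sample_constrained(probs, is_allowed, make_top_k_sampler(2), False, rng)
    for _ in range(200)
}
print(greedy, sorted(top_k_draws))
```

The third option corresponds to always passing `rejection_safe=False`, and the first to always passing `True`.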

Note that llguidance now has an interface for checking whether a single token is allowed that is cheaper than what I did in #899. I can try to get that in at some point, unless we go with the last option above.

Labels: bug
