When using beam search, we currently run the decoders sequentially: each beam's decoder is evaluated with its own forward pass.
This is several times slower than a batched evaluation. The inefficiency is the main factor preventing efficient use of beam search in whisper.cpp, which often results in poor transcription quality.
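For illustration, here is a minimal C++ sketch of the sequential pattern, assuming one decoder state per beam; `decoder_state` and `decode_one` are hypothetical stand-ins, not whisper.cpp API:

```cpp
// Hypothetical sketch of the current sequential pattern: each beam owns a
// decoder state and is evaluated with its own forward pass.
#include <cstdio>
#include <vector>

struct decoder_state {
    int last_token; // most recent token sampled for this beam
    int n_past;     // tokens already in this beam's KV cache
};

// Stand-in for a full transformer forward pass over a single token.
// The model weights are re-read from memory on every call, which is
// what makes the per-beam loop slow.
void decode_one(decoder_state & dec) {
    dec.n_past++;
}

int main() {
    std::vector<decoder_state> beams = {{0, 0}, {1, 0}, {2, 0}, {3, 0}};

    // One full forward pass per beam: roughly n_beams times the memory
    // traffic of a single batched pass.
    for (auto & dec : beams) {
        decode_one(dec);
    }

    std::printf("evaluated %zu beams sequentially\n", beams.size());
}
```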
Batched inference has already been demonstrated in llama.cpp.
This can serve as a starting point for doing the same in whisper.cpp and achieving an efficient beam search implementation.
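As a rough idea of what the batched alternative could look like, here is a minimal C++ sketch loosely following the llama.cpp approach of packing one pending token per sequence into a single batch; the `batch` struct and `decode_batch` are illustrative assumptions, not an existing API in either project:

```cpp
// Hypothetical sketch of batched decoding: all beams are evaluated in a
// single forward pass instead of one pass per beam.
#include <cstdio>
#include <vector>

struct batch {
    std::vector<int> tokens;  // one pending token per beam
    std::vector<int> seq_ids; // which beam / KV-cache sequence each token belongs to
};

// Stand-in for a single forward pass over all tokens in the batch.
// The model weights are read once and shared across all beams,
// amortizing the memory bandwidth cost.
void decode_batch(const batch & b) {
    std::printf("decoded %zu tokens in one pass\n", b.tokens.size());
}

int main() {
    const int n_beams = 4;

    batch b;
    for (int i = 0; i < n_beams; ++i) {
        b.tokens.push_back(i); // placeholder for beam i's next token
        b.seq_ids.push_back(i);
    }

    decode_batch(b); // single evaluation instead of n_beams evaluations
}
```

With this layout the weights are read once per decoding step regardless of the beam count, so the cost of beam search should approach that of greedy decoding.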