Just a practical question: np.random.choice
is very slow to return a sample when the distribution is large, say, a 52K-token vocabulary.
How do HuggingFace’s implementations of sampling methods actually sample? Nucleus sampling makes sense to me, because we only sample from the high-probability (low-entropy) head of the distribution, but beam search tries to find the maximum-likelihood sequence over several candidates, so mightn’t we need to sample from close to the full distribution at any given step? How is this done in practice??
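For concreteness, here is a minimal sketch of the slow call I mean, plus one common workaround I've seen: the Gumbel-max trick, which samples from softmax(logits) via an argmax and skips the explicit normalization. (The vocabulary size and logits here are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 52_000
logits = rng.standard_normal(vocab_size)  # stand-in for model output

# Straightforward approach: normalize to probabilities, then np.random.choice.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
token = int(rng.choice(vocab_size, p=probs))

# Gumbel-max trick: argmax(logits + Gumbel noise) is an exact sample from
# softmax(logits), with no normalization step and no cumulative-sum search.
gumbel = rng.gumbel(size=vocab_size)
token_fast = int(np.argmax(logits + gumbel))
```

Both calls draw from the same distribution; the second reduces per-draw work to an elementwise add and a single argmax, which tends to matter at this vocabulary size.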