Sampling: what's the secret sauce?

Just a practical question: np.random.choice is very slow to return a sample when sampling from a large distribution, say a 52K-token vocabulary.
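
For concreteness, this is roughly the pattern I mean (the vocabulary size and the dummy distribution are just placeholders):

```python
import numpy as np

vocab_size = 52_000

# Dummy next-token distribution over the vocabulary (placeholder for model output)
probs = np.random.dirichlet(np.ones(vocab_size))

# Drawing a single token id like this, once per generated token, is the slow part
token_id = np.random.choice(vocab_size, p=probs)
```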

How do HuggingFace’s implementations of sampling methods actually sample? Nucleus sampling makes sense to me because we only sample from a truncated, high-probability subset of the distribution, but for beam search, which tries to find the maximum-likelihood sequence over several candidates, mightn’t we need to sample from close to the full distribution at any given step? How is this done in practice?

Hey @chrisdoyle :wave:

We use PyTorch/TensorFlow/JAX sampling operations, which are optimized for GPU usage. See here, for example: transformers/generation_utils.py at e54a1b49aa6268c484625c6374f952f318914743 · huggingface/transformers · GitHub
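
For illustration, the multinomial sampling step boils down to something like this in PyTorch (a simplified sketch, not the exact transformers code; the random logits just stand in for a model forward pass):

```python
import torch

def sample_next_token(logits: torch.Tensor) -> torch.Tensor:
    """Draw one token id per batch row from the softmax of the last-step logits."""
    probs = torch.softmax(logits, dim=-1)
    # torch.multinomial runs on whatever device the tensor lives on (CPU or GPU)
    return torch.multinomial(probs, num_samples=1)  # shape: (batch_size, 1)

logits = torch.randn(2, 52_000)  # stand-in for model output of shape (batch, vocab)
next_tokens = sample_next_token(logits)
```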

Beam search does no sampling – it takes the beams/tokens with the highest score at each iteration, which is deterministic after running the model forward pass :slight_smile:
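
To make that concrete, here is a heavily simplified sketch of the selection step (the random log-probs stand in for a model forward pass; the real beam search code also handles EOS, length penalties, etc.):

```python
import torch

num_beams, vocab_size = 4, 52_000

# Running log-probabilities of the current beams
beam_scores = torch.zeros(num_beams)

# Next-token log-probs for each beam, standing in for the model forward pass
next_token_logprobs = torch.log_softmax(torch.randn(num_beams, vocab_size), dim=-1)

# Score of every possible (beam, token) continuation, flattened
candidate_scores = (beam_scores[:, None] + next_token_logprobs).view(-1)

# Deterministically keep the highest-scoring continuations: no sampling step here
top_scores, top_indices = torch.topk(candidate_scores, k=num_beams)
beam_indices = top_indices // vocab_size   # which beam each winner extends
token_indices = top_indices % vocab_size   # which token it appends
```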

Thanks @joaogante, I think I’ll find the answer I’m looking for inside the torch.multinomial implementation!