Decoding strategies in the training phase

Almost every topic about decoding strategies describes how to add randomness in the inference phase. It is well known that decoding strategies are meant to make the output of a language model more diverse.

However, would it be possible to use a random sampling method (e.g. top-p, top-k, or others) in the training phase? The reason I ask is: after the LM learns from a certain dataset, its weights are “fixed”, and inference is then done with those fixed weights. Those fixed weights are obtained via maximum likelihood in the training phase. Interestingly, instead of choosing the predicted token from the original model, which has implicitly learned the distribution of the original dataset, we then “modify” the decoding of the output, which probably no longer follows the same distribution as the one learned before.
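For concreteness, here is a minimal PyTorch sketch of top-k truncation (the function name and the toy vocabulary size are just for illustration). It shows how the distribution we actually sample from at inference time differs from the full softmax the model was trained on:

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int = 50) -> int:
    """Keep only the k highest-probability tokens, renormalize, then sample.

    This truncation is exactly the "modification" described above: the
    distribution we sample from is no longer the full softmax that the
    model learned under maximum likelihood.
    """
    top_logits, top_indices = torch.topk(logits, k)   # best k logits
    probs = torch.softmax(top_logits, dim=-1)         # renormalize over them
    choice = torch.multinomial(probs, num_samples=1)  # sample within the top k
    return top_indices[choice].item()

# Toy next-token logits over a 10-token vocabulary
logits = torch.randn(10)
print(top_k_sample(logits, k=3))
```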

If we want to adopt a decoding algorithm in the inference phase, should we use it during training as well? Thanks.

Try using greedy decoding (“choosing the predicted token from the original model, which has implicitly learned the distribution of the original dataset”) and see the result.
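For example, with the 🤗 Transformers library (using GPT-2 as a stand-in; any causal LM works), greedy decoding is what `generate` does when sampling is disabled:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The meaning of life is", return_tensors="pt")

# Greedy decoding: always pick the argmax token at each step
output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```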

You’ll notice the generated output does not feel natural, contains a lot of repetition, etc. Alternative algorithms (beam search, random sampling methods, etc.) are there to make the model’s output more human-like.
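Switching strategies is just a matter of `generate` arguments (reusing the `model`, `tokenizer`, and `inputs` from the snippet above; the specific parameter values here are arbitrary):

```python
# Beam search: keep the 5 most likely partial sequences at each step
beam_ids = model.generate(**inputs, max_new_tokens=50,
                          num_beams=5, early_stopping=True)

# Top-k / top-p (nucleus) sampling: draw from a truncated distribution
sample_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True,
                            top_k=50, top_p=0.95, temperature=0.8)

print(tokenizer.decode(sample_ids[0], skip_special_tokens=True))
```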

I recommend reading this very interesting article: How to generate text: using different decoding methods for language generation with Transformers


And to answer your question: no, we can’t integrate the decoding algorithm during training, because sampling is non-differentiable.
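A quick way to see this: the token index returned by sampling (or by argmax) is a discrete integer tensor with no `grad_fn`, so gradients cannot flow back through the selection step. A minimal PyTorch illustration:

```python
import torch

logits = torch.randn(10, requires_grad=True)
probs = torch.softmax(logits, dim=-1)  # differentiable so far

# Sampling a discrete index cuts the computation graph: the result is a
# LongTensor with no grad_fn, so nothing can backpropagate through it.
token = torch.multinomial(probs, num_samples=1)
print(token.requires_grad)  # False
```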


Thanks bro! I think non-differentiability is the main problem. :hugs: