Decoding strategies in the training phase

Almost every topic about decoding strategies describes how to add randomness in the inference phase. It is well known that decoding strategies are meant to make the output of a language model more diverse.

However, would it be possible to use a random sampling method (e.g. top-p, top-k, or others) in the training phase? The reason I ask is: after the LM learns from a certain dataset, its weights are “fixed”, and inference is then done with those fixed weights. Those fixed weights are obtained via maximum likelihood in the training phase. Interestingly, instead of choosing the predicted token from the original model, which has implicitly learned the distribution of the original dataset, we then “modify” the decoding of the output, which probably no longer follows the same distribution as the one learned before.
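For concreteness, here is a minimal PyTorch sketch of top-k truncation (the function name and the toy vocabulary size are just for illustration). It shows how the distribution we actually sample from at inference time differs from the full softmax the model was trained on:

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int = 50) -> int:
    """Keep only the k highest-probability tokens, renormalize, then sample.

    This truncation is exactly the "modification" described above: the
    distribution we sample from is no longer the full softmax that the
    model learned under maximum likelihood.
    """
    top_logits, top_indices = torch.topk(logits, k)   # best k logits
    probs = torch.softmax(top_logits, dim=-1)         # renormalize over them
    choice = torch.multinomial(probs, num_samples=1)  # sample within the top k
    return top_indices[choice].item()

# Toy next-token logits over a 10-token vocabulary
logits = torch.randn(10)
print(top_k_sample(logits, k=3))
```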

If we want to adopt a decoding algorithm in the inference phase, should we use it during training as well? Thanks.

Try using greedy decoding (“choosing the predicted token from the original model, which has implicitly learned the distribution of the original dataset”) and see the result.
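For example, with the 🤗 Transformers library (using GPT-2 as a stand-in; any causal LM works), greedy decoding is what `generate` does when sampling is disabled:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The meaning of life is", return_tensors="pt")

# Greedy decoding: always pick the argmax token at each step
output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```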

You’ll notice the generated output does not feel natural, contains a lot of repetition, etc. Alternative algorithms (beam search, random sampling methods, etc.) are there to make the model’s output more human-like.
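Switching strategies is just a matter of `generate` arguments (reusing the `model`, `tokenizer`, and `inputs` from the snippet above; the specific parameter values here are arbitrary):

```python
# Beam search: keep the 5 most likely partial sequences at each step
beam_ids = model.generate(**inputs, max_new_tokens=50,
                          num_beams=5, early_stopping=True)

# Top-k / top-p (nucleus) sampling: draw from a truncated distribution
sample_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True,
                            top_k=50, top_p=0.95, temperature=0.8)

print(tokenizer.decode(sample_ids[0], skip_special_tokens=True))
```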

I recommend reading this very interesting article: How to generate text: using different decoding methods for language generation with Transformers


And to answer your question: no, we can’t integrate the decoding algorithm during training, because sampling is non-differentiable.
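A quick way to see this: the token index returned by sampling (or by argmax) is a discrete integer tensor with no `grad_fn`, so gradients cannot flow back through the selection step. A minimal PyTorch illustration:

```python
import torch

logits = torch.randn(10, requires_grad=True)
probs = torch.softmax(logits, dim=-1)  # differentiable so far

# Sampling a discrete index cuts the computation graph: the result is a
# LongTensor with no grad_fn, so nothing can backpropagate through it.
token = torch.multinomial(probs, num_samples=1)
print(token.requires_grad)  # False
```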


Thanks bro! I think non-differentiability is the main problem. :hugs: