Almost every discussion of decoding strategies describes how to add randomness at inference time. The usual stated goal of a decoding strategy is to make the LM's output more diverse.

However, would it be possible to use a random sampling method (e.g. top-p, top-k, or others) during training? The reason I ask is this: after an LM has learned from a dataset, its weights are “fixed”, and inference runs on those fixed weights. Those weights are obtained by maximum likelihood in the training phase. Interestingly, at inference we then do not pick the predicted token from the original model's distribution, which presumably approximates the distribution of the original dataset. Instead, we “modify” the decoding step, so the output probably no longer follows the distribution the model learned.
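
To make the concern concrete, here is a minimal sketch in plain Python of what I mean by "modifying" the distribution. The logits are made-up numbers for a four-token vocabulary, and `top_k_filter` / `top_p_filter` are hypothetical helper names of my own, not any library's API. The sketch only assumes the common description of top-k and top-p: truncate the model's next-token distribution, then renormalize what is left.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


# Hypothetical next-token logits for a 4-token vocabulary (illustration only).
logits = [2.0, 1.0, 0.5, -1.0]

model_probs = softmax(logits)                    # what maximum likelihood training fitted
top_k_probs = top_k_filter(model_probs, k=2)     # what top-k actually samples from
top_p_probs = top_p_filter(model_probs, p=0.9)   # what top-p actually samples from

print("model:", [round(x, 3) for x in model_probs])
print("top-k:", [round(x, 3) for x in top_k_probs])
print("top-p:", [round(x, 3) for x in top_p_probs])
```

In both truncated versions the low-probability tokens get exactly zero mass and the remaining tokens get more than the model assigned them, so the distribution we sample from at inference is not the one the training loss was computed against.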

If we want to use a particular decoding algorithm at inference time, should we apply it during training as well? Thanks.