What to use as the target input to the decoder for autoregressive use

I want to use a transformer (encoder and decoder) for seq2seq modeling, and run it as an autoregressive model at inference time.
I know that I should feed the targets to the decoder during training, but the issue is that when I do this my model overfits and performs well only when it is given good targets as the decoder input.
I was wondering if there is a trick or rule for how to select the target input for the decoder?

My second question is whether the target inputs to the decoder should be exactly the same as the target outputs of the decoder, or whether it is something of an art to choose the right target input for the decoder.

Encoder-decoder (seq2seq) models like T5, BART and PEGASUS are trained using what is called “teacher forcing”. Here this just means supervised learning, i.e. the model needs to produce the target sentence given the source text.

Normally, if your dataset is diverse enough, the model will also perform well at inference time, when using model.generate(). With decoding techniques such as beam search and top-k sampling, you can get good results, even on unseen inputs.
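For example, here is a minimal sketch of both decoding strategies (the checkpoint name `facebook/bart-large-cnn` and the input text are just placeholders; use your own fine-tuned model):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint; any fine-tuned encoder-decoder model works the same way.
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Some long source text to summarize ...", return_tensors="pt")

# Beam search: keep the `num_beams` most likely partial sequences at every step.
beam_ids = model.generate(**inputs, num_beams=4, max_length=60, early_stopping=True)

# Top-k sampling: at every step, sample the next token from the k most probable candidates.
sample_ids = model.generate(**inputs, do_sample=True, top_k=50, max_length=60)

print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
print(tokenizer.decode(sample_ids[0], skip_special_tokens=True))
```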


I feel the need to clarify this. Teacher forcing is something very specific in such models, not simply the task of “producing the target sentence given the source text”. I think what you mean is correct, but it is written a bit ambiguously.

With teacher forcing, you give an autoregressive decoder the correct previous output tokens. So, for instance, during decoding in machine translation you “guide” the model by giving it the correct previous tokens when predicting each next token.
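Concretely, the teacher-forced decoder input is just the target sequence shifted one position to the right, starting from a special start token. A minimal sketch (the token IDs and the start token ID below are made up for illustration):

```python
import torch

def shift_tokens_right(labels: torch.Tensor, decoder_start_token_id: int) -> torch.Tensor:
    """Build teacher-forced decoder inputs: prepend the start token, drop the last label."""
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 0] = decoder_start_token_id
    decoder_input_ids[:, 1:] = labels[:, :-1]
    return decoder_input_ids

# Toy target "A B C </s>" as IDs (purely illustrative).
labels = torch.tensor([[15, 27, 33, 2]])
decoder_input_ids = shift_tokens_right(labels, decoder_start_token_id=0)
print(decoder_input_ids)  # tensor([[ 0, 15, 27, 33]])
# At step t the decoder is fed the gold token from step t-1 and is trained to predict labels[:, t].
```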

I am not sure about the models that you list, but in MT implementations it is common to randomly apply teacher forcing for different batches to ensure that the model generalizes better and is less prone to exposure bias.
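As a rough illustration of that idea (not the exact recipe from any particular library): a per-batch coin flip between feeding the gold shifted targets and feeding the model's own predictions, assuming a Hugging Face-style encoder-decoder whose forward pass accepts `input_ids` and `decoder_input_ids` and returns `.logits`:

```python
import random
import torch

def build_decoder_inputs(model, src_ids, labels, start_id, teacher_forcing_ratio=0.5):
    """With probability `teacher_forcing_ratio`, teacher-force (gold previous tokens);
    otherwise feed the model's own greedy predictions, shifted right."""
    prefix = torch.full((labels.size(0), 1), start_id, dtype=labels.dtype)
    gold_inputs = torch.cat([prefix, labels[:, :-1]], dim=1)
    if random.random() < teacher_forcing_ratio:
        return gold_inputs  # teacher forcing for this batch
    # Otherwise: one extra forward pass to get the model's own predictions,
    # then use those (shifted right) as the decoder inputs for the training step.
    with torch.no_grad():
        logits = model(input_ids=src_ids, decoder_input_ids=gold_inputs).logits
        preds = logits.argmax(dim=-1)
    return torch.cat([prefix, preds[:, :-1]], dim=1)
```

This two-pass trick is only a simplified, transformer-friendly stand-in for token-level scheduled sampling; real implementations vary.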


@BramVanroy oh thanks, I actually did not know this. So usually, one applies teacher forcing to all training examples when fine-tuning models like T5 and BART, right?

Sorry, I am not sure about those specific models and I do not have the time to go into it in detail. I do see that the fairseq MT datasets by default include a shifted version of the gold targets for teacher forcing.

I am not sure about the implementation/training/finetuning of all these models.

But, considering that one of the advantages of teacher forcing is faster convergence, fine-tuning (on limited data) with teacher forcing seems like a good idea.
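For what it's worth, in the Hugging Face transformers implementations of T5 and BART, passing `labels` to the forward call is enough: the model builds the shifted `decoder_input_ids` internally, so every training example is teacher-forced by default. A minimal sketch of one fine-tuning step (checkpoint, learning rate, and the toy sentence pair are just placeholders):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

src = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
tgt = tokenizer("Das Haus ist wunderbar.", return_tensors="pt")

# Passing `labels` makes the model create the teacher-forced decoder inputs
# (labels shifted right, starting with the decoder start token) and return the loss.
outputs = model(input_ids=src.input_ids, attention_mask=src.attention_mask, labels=tgt.input_ids)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```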

Thank you for the information. I will look into teacher forcing methods then to get more information on that 🙂