What to use as the target input to the decoder for autoregressive use

I want to use a transformer (encoder and decoder) for seq2seq modeling, and run it as an autoregressive model at inference time.
I know that I should feed the targets to the decoder during training, but the issue is that when I do this my model overfits and performs well only when it is given good targets as the decoder input.
I was wondering if there is a trick or rule for how to select the target input for the decoder?

My second question is whether the target inputs to the decoder should be exactly the same as the target outputs of the decoder, or whether it is something of an art to choose the right target input for the decoder.

Encoder-decoder (seq2seq) models like T5, BART and PEGASUS are trained using what is called “teacher forcing”. Here this just means supervised learning, i.e. the model needs to produce the target sentence given the source text.

Normally, if your dataset is diverse enough, the model will also perform well at inference time, when using model.generate(). With decoding techniques such as beam search and top-k sampling, you can get good results, even on unseen inputs.
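For example, here is a minimal sketch of both decoding strategies (the checkpoint name `facebook/bart-large-cnn` and the input text are just placeholders; use your own fine-tuned model):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint; any fine-tuned encoder-decoder model works the same way.
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Some long source text to summarize ...", return_tensors="pt")

# Beam search: keep the `num_beams` most likely partial sequences at every step.
beam_ids = model.generate(**inputs, num_beams=4, max_length=60, early_stopping=True)

# Top-k sampling: at every step, sample the next token from the k most probable candidates.
sample_ids = model.generate(**inputs, do_sample=True, top_k=50, max_length=60)

print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
print(tokenizer.decode(sample_ids[0], skip_special_tokens=True))
```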


I feel the need to clarify this. Teacher forcing is something very specific in such models, not simply the task of “producing the target sentence given the source text”. I think what you mean is correct, but it is written a bit ambiguously.

With teacher forcing, you give an autoregressive decoder the correct previous output tokens. So, for instance, during decoding in machine translation you “guide” the model by giving it the correct previous tokens when predicting each next token.
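Concretely, the teacher-forced decoder input is just the target sequence shifted one position to the right, starting from a special start token. A minimal sketch (the token IDs and the start token ID below are made up for illustration):

```python
import torch

def shift_tokens_right(labels: torch.Tensor, decoder_start_token_id: int) -> torch.Tensor:
    """Build teacher-forced decoder inputs: prepend the start token, drop the last label."""
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 0] = decoder_start_token_id
    decoder_input_ids[:, 1:] = labels[:, :-1]
    return decoder_input_ids

# Toy target "A B C </s>" as IDs (purely illustrative).
labels = torch.tensor([[15, 27, 33, 2]])
decoder_input_ids = shift_tokens_right(labels, decoder_start_token_id=0)
print(decoder_input_ids)  # tensor([[ 0, 15, 27, 33]])
# At step t the decoder is fed the gold token from step t-1 and is trained to predict labels[:, t].
```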

I am not sure about the models that you list, but in MT implementations it is common to randomly apply teacher forcing for different batches to ensure that the model generalizes better and is less prone to exposure bias.
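As a rough illustration of that idea (not the exact recipe from any particular library): a per-batch coin flip between feeding the gold shifted targets and feeding the model's own predictions, assuming a Hugging Face-style encoder-decoder whose forward pass accepts `input_ids` and `decoder_input_ids` and returns `.logits`:

```python
import random
import torch

def build_decoder_inputs(model, src_ids, labels, start_id, teacher_forcing_ratio=0.5):
    """With probability `teacher_forcing_ratio`, teacher-force (gold previous tokens);
    otherwise feed the model's own greedy predictions, shifted right."""
    prefix = torch.full((labels.size(0), 1), start_id, dtype=labels.dtype)
    gold_inputs = torch.cat([prefix, labels[:, :-1]], dim=1)
    if random.random() < teacher_forcing_ratio:
        return gold_inputs  # teacher forcing for this batch
    # Otherwise: one extra forward pass to get the model's own predictions,
    # then use those (shifted right) as the decoder inputs for the training step.
    with torch.no_grad():
        logits = model(input_ids=src_ids, decoder_input_ids=gold_inputs).logits
        preds = logits.argmax(dim=-1)
    return torch.cat([prefix, preds[:, :-1]], dim=1)
```

This two-pass trick is only a simplified, transformer-friendly stand-in for token-level scheduled sampling; real implementations vary.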


@BramVanroy oh thanks, I actually did not know this. So usually, one applies teacher forcing to all training examples when fine-tuning models like T5 and BART, right?

Sorry, I am not sure about those specific models and I do not have the time to go into it in detail. I do see that the fairseq MT datasets by default include a shifted version of the gold targets for teacher forcing.

I am not sure about the implementation/training/finetuning of all these models.

But, considering that one of the advantages of teacher forcing is faster convergence, fine-tuning (on limited data) with teacher forcing seems like a good idea.
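For what it's worth, in the Hugging Face transformers implementations of T5 and BART, passing `labels` to the forward call is enough: the model builds the shifted `decoder_input_ids` internally, so every training example is teacher-forced by default. A minimal sketch of one fine-tuning step (checkpoint, learning rate, and the toy sentence pair are just placeholders):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

src = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
tgt = tokenizer("Das Haus ist wunderbar.", return_tensors="pt")

# Passing `labels` makes the model create the teacher-forced decoder inputs
# (labels shifted right, starting with the decoder start token) and return the loss.
outputs = model(input_ids=src.input_ids, attention_mask=src.attention_mask, labels=tgt.input_ids)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```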

Thank you for the information. I will look into teacher forcing methods then to get more information on that 🙂