T5forConditionalGeneration + classification

I would like to do sequence classification over the encoder in parallel with conditional generation using an auxiliary loss. However, I am confused about which hidden state I should take for the classification.
Supposing the hidden state of the last layer has dimensions [batch size, seq length, hidden size], should I take the last position, i.e. [:, -1, :]?


It depends on the model. BERT uses the first one (where the [CLS] token is), some models use a pooling of all the hidden states, and others use the hidden state of the last token (which is not necessarily at index -1, since you could have padding). I’d look at what is done in one of the existing ...ForSequenceClassification classes and copy the code.
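For concreteness, here is a minimal sketch of those three conventions in PyTorch. The function name and the strategy argument are my own, not anything from Transformers, and it assumes right padding with a 0/1 attention_mask:

```python
import torch

def pool_hidden_states(hidden, attention_mask, strategy):
    """Reduce hidden states of shape [batch, seq_len, hidden] to
    [batch, hidden] for classification."""
    if strategy == "first":
        # BERT-style: the hidden state at position 0 (the [CLS] token).
        return hidden[:, 0, :]
    if strategy == "mean":
        # Mean over non-padded positions only.
        mask = attention_mask.unsqueeze(-1).type_as(hidden)   # [batch, seq_len, 1]
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    if strategy == "last":
        # The last *non-padding* position, not simply index -1
        # (assumes right padding).
        lengths = attention_mask.sum(dim=1) - 1               # [batch]
        rows = torch.arange(hidden.size(0), device=hidden.device)
        return hidden[rows, lengths, :]
    raise ValueError(f"unknown strategy: {strategy}")
```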


You are absolutely right. That’s what I tried to do at first. However, the T5 model has no T5ForSequenceClassification class or anything similar. I think the most suitable option is to use the hidden state of the last non-padding token, so the padding is taken into account. Do you have in mind any function that could help with this?
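As far as I know there is no ready-made helper in the library, but GPT2ForSequenceClassification does essentially this (it derives the last non-padding index from the pad token). Here is a sketch of how the same idea could fit around T5ForConditionalGeneration with an auxiliary loss; the wrapper class, num_labels, and aux_weight are my own hypothetical names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import T5ForConditionalGeneration

class T5WithAuxClassification(nn.Module):
    """Hypothetical wrapper: conditional generation plus an auxiliary
    classification loss computed from the encoder's hidden states."""

    def __init__(self, model_name="t5-small", num_labels=2, aux_weight=0.5):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(model_name)
        self.classifier = nn.Linear(self.t5.config.d_model, num_labels)
        self.aux_weight = aux_weight

    def forward(self, input_ids, attention_mask, labels, class_labels):
        out = self.t5(input_ids=input_ids, attention_mask=attention_mask,
                      labels=labels)
        # Encoder hidden states: [batch, seq_len, d_model].
        enc = out.encoder_last_hidden_state
        # Index of the last non-padding token per sequence (assumes right padding).
        lengths = attention_mask.sum(dim=1) - 1
        rows = torch.arange(enc.size(0), device=enc.device)
        pooled = enc[rows, lengths, :]
        cls_loss = F.cross_entropy(self.classifier(pooled), class_labels)
        # Combine the seq2seq LM loss with the auxiliary classification loss.
        return out.loss + self.aux_weight * cls_loss
```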

I would also like to ask if there is any way to tie the weights of the encoder and the decoder.
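On the second question: the config has a tie_encoder_decoder flag that PreTrainedModel.tie_weights() checks for encoder-decoder models, so something like the sketch below should work. I’d still verify that T5’s decoder (which has cross-attention layers the encoder lacks) ends up tied the way you expect:

```python
from transformers import T5ForConditionalGeneration

# Setting `tie_encoder_decoder` on the config makes tie_weights() (called at
# the end of from_pretrained) point matching encoder parameters at the
# corresponding decoder parameters.
model = T5ForConditionalGeneration.from_pretrained(
    "t5-small", tie_encoder_decoder=True
)

# Sanity check: a tied pair should now be the same tensor object.
enc_q = model.encoder.block[0].layer[0].SelfAttention.q.weight
dec_q = model.decoder.block[0].layer[0].SelfAttention.q.weight
print(enc_q is dec_q)  # expect True if the tying took effect
```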