I would like to do sequence classification over the encoder in parallel with conditional generation using an auxiliary loss. However, I am confused about which hidden state I should take for the classification.
Supposing that the hidden state of the last layer has the dimensions [batch size, seq length, hidden size], should I take the last position, i.e. [:, -1, :]?
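For context, here is a rough sketch of the setup I have in mind (the wrapper class, the `aux_weight` knob and the placeholder pooling are just illustrative):

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration


class T5WithAuxClassifier(nn.Module):
    """T5 generation plus an auxiliary classification head on the encoder output."""

    def __init__(self, model_name: str = "t5-small", num_labels: int = 2, aux_weight: float = 0.5):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(model_name)
        self.classifier = nn.Linear(self.t5.config.d_model, num_labels)
        self.aux_weight = aux_weight

    def forward(self, input_ids, attention_mask, labels, class_labels):
        # Seq2seq forward pass; the output also exposes the encoder's last hidden state.
        outputs = self.t5(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        # encoder_last_hidden_state: [batch size, seq length, hidden size]
        enc = outputs.encoder_last_hidden_state
        # Which position to pool for classification is exactly my question;
        # taking the last index is just a placeholder here.
        pooled = enc[:, -1, :]
        logits = self.classifier(pooled)
        cls_loss = nn.functional.cross_entropy(logits, class_labels)
        # Combine the generation loss with the auxiliary classification loss.
        loss = outputs.loss + self.aux_weight * cls_loss
        return loss, logits
```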
It depends on the model. BERT uses the first one (where the [CLS] token is), some models use a pooling of all the hidden states, and others use the one for the last token (which is not necessarily at index -1, since you could have padding). I'd look at what is done in T5ForSequenceClassification and copy the code.
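For illustration, here is a rough, untested sketch of those three options, assuming an attention_mask is at hand:

```python
import torch


def pool_hidden_states(hidden_states: torch.Tensor,
                       attention_mask: torch.Tensor,
                       strategy: str = "last") -> torch.Tensor:
    """Pool [batch, seq, hidden] down to [batch, hidden]."""
    if strategy == "first":
        # BERT-style: take the first position (where the [CLS] token sits).
        return hidden_states[:, 0, :]
    if strategy == "mean":
        # Mean over the non-padded positions only.
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    if strategy == "last":
        # Last *real* token of each sequence, not index -1, because of padding.
        last_idx = (attention_mask.sum(dim=1) - 1).long()
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        return hidden_states[batch_idx, last_idx, :]
    raise ValueError(f"Unknown strategy: {strategy}")
```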
You are absolutely right. That's what I tried to do at first. However, the T5 model has no T5ForSequenceClassification class or anything similar. I think the most suitable option is to use the last logits, taking padding into account. Do you have any function in mind that could be helpful?
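One idea I am considering is to key off the `</s>` token that T5's tokenizer appends to every input, since its position is the last non-padded one. A rough sketch, assuming `</s>` has id 1 (T5's default) and appears exactly once per sequence:

```python
import torch


def last_token_representation(hidden_states: torch.Tensor,
                              input_ids: torch.Tensor,
                              eos_token_id: int = 1) -> torch.Tensor:
    """Select the hidden state at the final </s> token of each sequence."""
    eos_mask = input_ids.eq(eos_token_id)                    # [batch, seq]
    if not torch.all(eos_mask.sum(dim=1) == 1):
        raise ValueError("Expected exactly one </s> per sequence.")
    batch, _, hidden = hidden_states.shape
    # Boolean indexing keeps only the eos positions, one per example.
    return hidden_states[eos_mask, :].view(batch, hidden)
```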
I would also like to ask if there is any way to tie the weights of the encoder and the decoder.
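One candidate I am looking at is the generic `tie_encoder_decoder` flag from `PretrainedConfig`. A minimal sketch, with a check at the end, since T5's decoder blocks have an extra cross-attention layer and I am not sure which parameters actually end up shared:

```python
from transformers import T5ForConditionalGeneration

# Assumption: tie_encoder_decoder ties parameters whose module paths match
# between the encoder and decoder stacks; cross-attention layers exist only
# in the decoder, so they cannot be shared.
model = T5ForConditionalGeneration.from_pretrained(
    "t5-small", tie_encoder_decoder=True
)

# Verify what actually got shared, e.g. the self-attention query projection
# of the first block: the same storage pointer means the weights are tied.
enc_q = model.encoder.block[0].layer[0].SelfAttention.q.weight
dec_q = model.decoder.block[0].layer[0].SelfAttention.q.weight
print(enc_q.data_ptr() == dec_q.data_ptr())
```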