Encoder-decoder transformers

I want to use RoBERTa as the encoder and GPT as the decoder for a generation task. Besides the decoder (generation) loss, I also want to add a classification task on the encoder side and sum the two losses to train the model. However, looking at the original code, I found that the encoder includes a pooling layer, yet it doesn't seem to do anything: the computation never passes through the pooling layer. Is that correct?
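To make my setup concrete, this is roughly what I have so far (just a sketch: the checkpoints are only examples, and the config fields for label shifting may differ depending on the transformers version):

```python
# Rough sketch of the basic encoder-decoder setup (checkpoints are only examples)
from transformers import EncoderDecoderModel, RobertaTokenizer, GPT2Tokenizer

model = EncoderDecoderModel.from_encoder_decoder_pretrained("roberta-base", "gpt2")
enc_tok = RobertaTokenizer.from_pretrained("roberta-base")
dec_tok = GPT2Tokenizer.from_pretrained("gpt2")
dec_tok.pad_token = dec_tok.eos_token  # GPT-2 has no pad token by default

# as far as I understand, these are needed so the labels can be shifted
# into decoder_input_ids internally
model.config.decoder_start_token_id = dec_tok.bos_token_id
model.config.pad_token_id = dec_tok.pad_token_id

src = enc_tok("some source text", return_tensors="pt")
tgt = dec_tok("some target text", return_tensors="pt").input_ids

outputs = model(input_ids=src.input_ids,
                attention_mask=src.attention_mask,
                labels=tgt)
print(outputs.loss)  # the decoder (generation) loss
```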

If I first build the encoder-decoder architecture and then pass the encoder's first output ([CLS]) through a pooling and softmax layer for the classification task, is that correct? Something like the sketch below:
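(Again only a rough sketch of what I'm imagining: the wrapper class, the linear head, and the hidden size are made up by me, and I skip the pooler entirely and just take the first token's hidden state.)

```python
import torch
import torch.nn as nn

class EncoderDecoderWithClassifier(nn.Module):
    """Hypothetical wrapper: sum of generation loss and a classification
    loss computed from the encoder's [CLS] hidden state."""

    def __init__(self, enc_dec_model, num_labels, hidden_size=768):
        super().__init__()
        self.enc_dec = enc_dec_model            # e.g. the RoBERTa->GPT-2 model above
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels, class_labels):
        # seq2seq forward pass: gives the decoder (generation) loss
        outputs = self.enc_dec(input_ids=input_ids,
                               attention_mask=attention_mask,
                               labels=labels)
        gen_loss = outputs.loss

        # first token of the encoder's last hidden state ([CLS] / <s>),
        # bypassing the pooling layer that never seems to be used
        cls_state = outputs.encoder_last_hidden_state[:, 0, :]
        logits = self.classifier(cls_state)
        cls_loss = nn.functional.cross_entropy(logits, class_labels)

        # train on the sum of the two losses
        return gen_loss + cls_loss, logits
```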

Does anyone have suggestions? Thanks a lot!