Replying to myself:
I think this is a reasonable approach, since the loss and hidden states are exactly the same as in the standard training process. I will test the full training process later.
The separate process:
Have you found anything on this?
I also want to use the encoder and decoder separately.
My task involves passing the tokenized input IDs to the encoder to get the last_hidden_layer output, then passing those embeddings to the decoder to generate tokens, and finally decoding those tokens.
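In case it helps, here is a minimal sketch of that pipeline. It assumes the model is a Hugging Face transformers encoder-decoder (the thread does not say which one, so T5 is my assumption), and it uses a tiny randomly initialised model with made-up token IDs so that it runs without downloading anything. It also checks that the loss from the separate path matches the loss from the normal joint forward pass.

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

torch.manual_seed(0)

# Assumption: a tiny random T5 stands in for the real model. With a real
# checkpoint you would use T5ForConditionalGeneration.from_pretrained(...)
# and a matching tokenizer instead of hand-written IDs.
config = T5Config(
    vocab_size=100, d_model=32, d_kv=8, d_ff=64,
    num_layers=2, num_heads=4, decoder_start_token_id=0,
)
model = T5ForConditionalGeneration(config).eval()  # eval() disables dropout

# Hypothetical token IDs standing in for tokenizer output (1 is T5's EOS id).
input_ids = torch.tensor([[5, 17, 42, 8, 1]])
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([[7, 23, 9, 1]])

with torch.no_grad():
    # Step 1: run the encoder on its own and keep its last hidden state.
    encoder_outputs = model.get_encoder()(
        input_ids=input_ids, attention_mask=attention_mask
    )
    hidden = encoder_outputs.last_hidden_state  # (batch, src_len, d_model)

    # Step 2a (training-style): feed the precomputed encoder states to the
    # decoder and compare the loss with the ordinary joint forward pass.
    separate = model(
        encoder_outputs=encoder_outputs,
        attention_mask=attention_mask,
        labels=labels,
    )
    joint = model(
        input_ids=input_ids, attention_mask=attention_mask, labels=labels
    )

    # Step 2b (inference-style): generate tokens from the encoder states.
    generated = model.generate(
        encoder_outputs=encoder_outputs,
        attention_mask=attention_mask,
        max_new_tokens=5,
        do_sample=False,
    )

print("losses match:", torch.allclose(separate.loss, joint.loss))
print("generated ids:", generated.tolist())
```

With a real checkpoint, the last step would be `tokenizer.batch_decode(generated, skip_special_tokens=True)` to turn the generated IDs back into text. The generated IDs here are meaningless because the weights are random; the point is only the data flow.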