I plan to use BERT’s embedding weights (as discussed here) for the embedding layers of my transformer model. But here is my question: BERT’s embeddings have already passed through the whole encoder stack to produce that matrix, so why shouldn’t I just remove (or freeze) the encoder and feed BERT’s embedding vectors directly into the decoder as input? I would also use BERT embeddings at the input of the decoder, so why shouldn’t I freeze the attention layers in the decoder too? Is it because the embeddings of the output text already carry attention information? Thanks in advance.
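To make the setup concrete, here is a minimal sketch (PyTorch assumed) of what I mean by copying BERT’s embedding weights into my model and freezing them. The `bert_embeddings` layer below is a stand-in for BERT’s learned token-embedding matrix; in practice it would come from a pretrained checkpoint, e.g. `BertModel.from_pretrained(...).get_input_embeddings()` in the Hugging Face `transformers` library:

```python
import torch
import torch.nn as nn

# Placeholder for BERT's token-embedding matrix (bert-base sizes).
# In practice: transformers.BertModel.from_pretrained("bert-base-uncased")
#              .get_input_embeddings()
vocab_size, hidden_size = 30522, 768
bert_embeddings = nn.Embedding(vocab_size, hidden_size)

# Embedding layer of my own transformer model.
my_embeddings = nn.Embedding(vocab_size, hidden_size)

# Copy BERT's weights into my model's embedding layer ...
with torch.no_grad():
    my_embeddings.weight.copy_(bert_embeddings.weight)

# ... and freeze them so they are not updated during training.
my_embeddings.weight.requires_grad = False

token_ids = torch.tensor([[101, 2023, 2003, 102]])  # example input IDs
vectors = my_embeddings(token_ids)                  # shape (1, 4, 768)
```

Note that this only transfers the (non-contextual) input embedding table, which is part of what prompted my question above.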