I have a few questions about the BART paper.
In what ways can this paper be said to have “generalized” BERT and GPT?
Exactly what role does the encoder play in the BART pretraining structure?
In Section 2.2 (Pre-training), the paper says BART computes the cross-entropy between the decoder output and the original document.
Since the decoder input is provided via teacher forcing, doesn't this objective only train the decoder?
Or is there a separate loss for the encoder, computed between the decoder output and the masked input?
I wonder how the encoder learns anything about the masking from this setup.
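My current understanding is that a single cross-entropy loss on the decoder output is enough to train the encoder too, because the decoder conditions on the encoder's representations, so the gradient flows back through that connection. Here is a toy numerical check of that idea (this is my own illustrative sketch, not the paper's architecture: the "encoder" is just a mean of embeddings, the "decoder" a single linear layer, and all weight names are made up):

```python
import numpy as np

# Toy seq2seq: the encoder maps corrupted input tokens to a context vector;
# the teacher-forced decoder predicts each next token from
# [context, previous gold token embedding]. We check numerically that the
# single cross-entropy loss on the decoder output has a nonzero gradient
# w.r.t. the ENCODER weights, i.e. the encoder is trained by the same
# objective with no separate encoder-side loss.

rng = np.random.default_rng(0)
V, d = 5, 4                           # toy vocab size, hidden size
W_enc = rng.normal(size=(V, d))       # encoder embedding weights (hypothetical)
W_emb = rng.normal(size=(V, d))       # decoder input embeddings (hypothetical)
W_dec = rng.normal(size=(2 * d, V))   # decoder output projection (hypothetical)

src = [1, 3]                          # corrupted/masked input tokens
tgt_in, tgt_out = [0, 2], [2, 4]      # teacher-forced shift of the original doc

def loss(W_enc):
    ctx = W_enc[src].mean(axis=0)     # "encoder": mean of input embeddings
    total = 0.0
    for t_in, t_out in zip(tgt_in, tgt_out):
        h = np.concatenate([ctx, W_emb[t_in]])   # decoder sees encoder context
        logits = h @ W_dec
        p = np.exp(logits - logits.max())
        p /= p.sum()
        total += -np.log(p[t_out])    # cross-entropy vs. the original document
    return total

# Numerical gradient of the decoder's CE loss w.r.t. one encoder weight:
eps = 1e-5
W_perturbed = W_enc.copy()
W_perturbed[1, 0] += eps
g = (loss(W_perturbed) - loss(W_enc)) / eps
print(abs(g) > 1e-8)   # the encoder weight receives gradient from the decoder loss
```

If this picture is right, the encoder never needs its own loss against the masked input; it learns to encode the corrupted document usefully purely because the decoder's reconstruction loss depends on its output. Please correct me if I'm misreading Section 2.2.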
Section 3.3 contains the statement, "In both of these tasks, information is copied from the input but manipulated, …"
What does “manipulated” mean here?