Hello everyone,
I’m currently working with the XGLM models, and I was wondering why the forward function returns `CausalLMOutputWithCrossAttentions` instead of `CausalLMOutputWithPast` (the class used by the causal LM heads of other decoder-only models). The name confused me, because decoder-only models don’t have cross-attention the way encoder-decoder models do.
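For concreteness, here is a small snippet (assuming a recent `transformers` install) that compares the fields of the two output dataclasses; as far as I can tell, the only extra field is `cross_attentions`:

```python
from dataclasses import fields

from transformers.modeling_outputs import (
    CausalLMOutputWithCrossAttentions,
    CausalLMOutputWithPast,
)

# Both output classes are dataclasses, so we can diff their field names
cross_fields = {f.name for f in fields(CausalLMOutputWithCrossAttentions)}
past_fields = {f.name for f in fields(CausalLMOutputWithPast)}

print(cross_fields - past_fields)  # → {'cross_attentions'}
```

So functionally the cross-attention class seems to be a superset of `CausalLMOutputWithPast`, which makes the choice even more puzzling for a decoder-only model.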
Could someone help me understand the differences and the design choice behind this? Thank you all!