I have recently been studying GPT-2. Can someone tell me whether decoder-only models use teacher forcing the way encoder-decoder models do?
I have looked at the Hugging Face GPT-2 implementation, and it seems the labels are only used to calculate the loss. How does the model then do teacher forcing during training?
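For context, this is roughly the usage I am looking at (a minimal sketch with `GPT2LMHeadModel`; the note about shifting is based on my reading of the library source):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "The quick brown fox"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# The labels are just the input ids again; as far as I can tell the model
# shifts them internally (predict token t+1 from tokens up to t) and uses
# them only to compute the cross-entropy loss.
outputs = model(input_ids, labels=input_ids)
print(outputs.loss)          # scalar language-modeling loss
print(outputs.logits.shape)  # [batch, seq_len, vocab_size]
```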