As we all know, GPT2 is widely used as an autoregressive model for text generation tasks. May I ask: could I further train GPT2 with bidirectional attention, like BERT as an auto-encoding model? If yes, how could I flexibly control uni- vs. bidirectional attention when training GPT2? Thanks a lot!!
Thanks for your info. Really appreciate it!
That said, I know there is already some work that controls bidirectional vs. unidirectional attention using a mask matrix, like this: unilm/unilm-v1 at master · microsoft/unilm · GitHub
So I am wondering whether we could also realise flexible attention in GPT2 using this approach.
The reason autoregressive models like GPT2 are trained with a causal attention mask is that otherwise you would "leak" information from the future. These models are trained to predict the next token; if you let the model attend to that next token (the very token it needs to predict), then that defeats the purpose of training the model.
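For illustration, a causal mask of the kind GPT2 applies is just a lower-triangular matrix. This is a minimal NumPy sketch of the idea, not the actual `transformers` implementation:

```python
import numpy as np

def causal_mask(seq_len):
    # Entry (i, j) is 1 if position i may attend to position j, else 0.
    # Lower-triangular => position i sees only positions 0..i, never the future.
    return np.tril(np.ones((seq_len, seq_len), dtype=np.int64))

print(causal_mask(4))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

Row `i` has zeros to the right of column `i`, which is exactly the "no peeking at the future" constraint described above.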
Yes, that makes sense.
However, in my case I want to process the context part bidirectionally, while the generation part stays unidirectional (so no future information is leaked). Does that make sense?
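What is described here is the prefix-LM / UniLM-style seq-to-seq mask: full (bidirectional) attention within the context prefix, causal attention for the generated part. A hypothetical sketch of building such a mask, assuming `prefix_len` context tokens at the start of the sequence:

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    # Start from the ordinary causal (lower-triangular) mask ...
    mask = np.tril(np.ones((seq_len, seq_len), dtype=np.int64))
    # ... then open up the prefix block so context tokens attend to each
    # other bidirectionally. Rows >= prefix_len stay causal, so generated
    # tokens still never see the future.
    mask[:prefix_len, :prefix_len] = 1
    return mask

print(prefix_lm_mask(5, 2))
# [[1 1 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

Note this only constructs the mask matrix; actually feeding a 2-D mask like this into a pre-trained GPT2 would require modifying the model's attention code, since the stock causal mask is built into the model rather than taken from the `attention_mask` argument (which only marks padding).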
Thanks for sharing. But there is no detailed info on the "prefix-lm" model.
And actually I want to do this with a pre-trained GPT or DialoGPT. Could you tell me more about how to do that? Thanks a lot!!