Fine-tuning GPT2 for bidirectional training

Hello,

We all know that GPT-2 is widely used for text generation as an autoregressive model. May I ask: could I further train GPT-2 with bidirectional attention, like BERT as an auto-encoding model? If yes, how could I flexibly control uni- vs. bidirectional training in GPT-2? Thanks a lot!!

I don’t think this is possible at all due to structural differences in the model’s design, but I may be wrong.
Reference this if you want more clarity on what I am saying - Comparison between BERT, GPT-2 and ELMo | by Gaurav Ghati | Medium

" Drawbacks: GPT is its uni-directional nature — the model is only trained to predict the future left-to-right context ." - from the article

Hope this answers your question

Thanks for your info. Really appreciate it!
That said, I know there is already some work that controls bidirectional vs. unidirectional attention with a mask matrix, like this: unilm/unilm-v1 at master · microsoft/unilm · GitHub
So I am wondering whether we could realise this kind of flexible attention in GPT-2 as well.

Hi,

The reason autoregressive models like GPT-2 are trained with a causal attention mask is that otherwise you would “leak” information from the future. These models are trained to predict the next token, but if you let the model attend to that next token (the very token it needs to predict), that defeats the purpose of training.
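
To make the leakage concrete, here is a minimal sketch in plain PyTorch (not the actual GPT-2 code) of how a causal mask blocks future positions: disallowed query/key pairs get a score of -inf before the softmax, so each token receives attention weight only from itself and earlier tokens.

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores (query x key)

# Lower-triangular mask: position i may only attend to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Masked-out positions get -inf, so the softmax gives them zero weight.
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn_weights = scores.softmax(dim=-1)  # row i is non-zero only for j <= i
```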

Yes, that makes sense.
However, in my case I want to process the context part bidirectionally, while the generation part stays unidirectional (so no future information is leaked). Does that make sense?

Yes, that makes sense, and it can indeed be done (this is actually a nice research topic).

Note that BigScience is investigating this; they call it “prefix-lm”: Models - Hugging Face

I assume that bidirectional attention is applied to the “prefix” (i.e. the context), and unidirectional attention to the generation part.
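
As a rough sketch of the kind of mask this implies (plain PyTorch, with `prefix_len` a hypothetical fixed context length): tokens inside the prefix attend to the whole prefix, and everything after it stays causal.

```python
import torch

def build_prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """True = attention allowed. Causal everywhere, bidirectional inside the prefix."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal baseline
    mask[:prefix_len, :prefix_len] = True  # prefix tokens see the entire prefix
    return mask

print(build_prefix_lm_mask(seq_len=6, prefix_len=3).long())
# tensor([[1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1]])
```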

Thanks for sharing. But there is no detailed info about the “prefix-lm” models.
And actually I want to do this with a pre-trained GPT or DialoGPT. Could you tell me more about how to do that? Thanks a lot!!

Hi,

I just found BigScience’s implementation here: bigscience/inference at f5f883866af9d58871d9dc498646dcc133b01b3c · bigscience-workshop/bigscience · GitHub

They adapted GPT-2.
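
If you want to experiment with a pre-trained GPT-2 or DialoGPT checkpoint directly, one hacky sketch (not BigScience's approach, and relying on the implementation detail that Hugging Face's `GPT2Attention` keeps its causal mask in a registered buffer named `bias`; the buffer name, the fixed prefix length, and the `"gpt2"` checkpoint here are assumptions to verify against your `transformers` version) is to overwrite that buffer with the prefix-LM mask from above:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # or e.g. "microsoft/DialoGPT-medium"

prefix_len = 32                    # hypothetical fixed context length for every example
n_pos = model.config.n_positions   # 1024 for GPT-2

# Prefix-LM mask at full model size: causal everywhere, bidirectional over the prefix.
mask = torch.tril(torch.ones(n_pos, n_pos, dtype=torch.bool))
mask[:prefix_len, :prefix_len] = True
mask = mask.view(1, 1, n_pos, n_pos)

for block in model.transformer.h:
    # Overwrite the causal-mask buffer of every attention layer in place.
    block.attn.bias.copy_(mask.to(block.attn.bias.dtype))
```

Note this applies the same prefix length to every example in a batch; handling a per-example prefix length means changing the attention forward pass itself, which is exactly what adapting the model code (as the BigScience repo does) gives you.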

Thanks a lot! I will take a look at that. Really appreciate it!