Fine-tuning GPT2 for bidirectional training

Hello,

We all know that GPT-2 is widely used for text generation as an autoregressive model. May I ask: could I further train GPT-2 with bidirectional attention, like BERT as an auto-encoding model? If yes, how could I flexibly control uni- vs. bidirectional training in GPT-2? Thanks a lot!!

I don’t think this is possible at all due to structural differences in the model’s design, but I may be wrong.
Reference this if you want more clarity on what I am saying - Comparison between BERT, GPT-2 and ELMo | by Gaurav Ghati | Medium

" Drawbacks: GPT is its uni-directional nature — the model is only trained to predict the future left-to-right context ." - from the article

Hope this answers your question

Thanks for your info. Really appreciate it!
That said, I know there is already some work that controls bidirectional vs. unidirectional attention with a mask matrix, like this: unilm/unilm-v1 at master · microsoft/unilm · GitHub
So I am wondering whether we could realise this kind of flexible attention in GPT-2 as well.

Hi,

The reason autoregressive models like GPT-2 are trained with a causal attention mask is that otherwise you would “leak” information from the future. These models are trained to predict the next token, but if you let the model attend to that next token (the very token it needs to predict), that defeats the purpose of training.
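
To make the leakage concrete, here is a minimal sketch in plain PyTorch (not the actual GPT-2 code) of how a causal mask blocks future positions: disallowed query/key pairs get a score of -inf before the softmax, so each token receives attention weight only from itself and earlier tokens.

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores (query x key)

# Lower-triangular mask: position i may only attend to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Masked-out positions get -inf, so the softmax gives them zero weight.
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn_weights = scores.softmax(dim=-1)  # row i is non-zero only for j <= i
```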

Yes, that makes sense.
However, in my case I want to process the context part bidirectionally, while the generation part stays unidirectional (so no future information is leaked). Does that make sense?

Yes, that makes sense, and it can indeed be done (this is actually a nice research topic).

Note that BigScience is investigating this; they call it “prefix-lm”: Models - Hugging Face

I assume that bidirectional attention is applied to the “prefix” (i.e. the context), and unidirectional attention to the generation part.
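
As a rough sketch of the kind of mask this implies (plain PyTorch, with `prefix_len` a hypothetical fixed context length): tokens inside the prefix attend to the whole prefix, and everything after it stays causal.

```python
import torch

def build_prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """True = attention allowed. Causal everywhere, bidirectional inside the prefix."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal baseline
    mask[:prefix_len, :prefix_len] = True  # prefix tokens see the entire prefix
    return mask

print(build_prefix_lm_mask(seq_len=6, prefix_len=3).long())
# tensor([[1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1]])
```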

Thanks for sharing. But there is no detailed info about the “prefix-lm” models.
And actually I want to do this with a pre-trained GPT or DialoGPT. Could you tell me more about how to do that? Thanks a lot!!

Hi,

I just found BigScience’s implementation here: bigscience/inference at f5f883866af9d58871d9dc498646dcc133b01b3c · bigscience-workshop/bigscience · GitHub

They adapted GPT-2.
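
If you want to experiment with a pre-trained GPT-2 or DialoGPT checkpoint directly, one hacky sketch (not BigScience's approach, and relying on the implementation detail that Hugging Face's `GPT2Attention` keeps its causal mask in a registered buffer named `bias`; the buffer name, the fixed prefix length, and the `"gpt2"` checkpoint here are assumptions to verify against your `transformers` version) is to overwrite that buffer with the prefix-LM mask from above:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # or e.g. "microsoft/DialoGPT-medium"

prefix_len = 32                    # hypothetical fixed context length for every example
n_pos = model.config.n_positions   # 1024 for GPT-2

# Prefix-LM mask at full model size: causal everywhere, bidirectional over the prefix.
mask = torch.tril(torch.ones(n_pos, n_pos, dtype=torch.bool))
mask[:prefix_len, :prefix_len] = True
mask = mask.view(1, 1, n_pos, n_pos)

for block in model.transformer.h:
    # Overwrite the causal-mask buffer of every attention layer in place.
    block.attn.bias.copy_(mask.to(block.attn.bias.dtype))
```

Note this applies the same prefix length to every example in a batch; handling a per-example prefix length means changing the attention forward pass itself, which is exactly what adapting the model code (as the BigScience repo does) gives you.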

Thanks a lot! I will take a look at that. Really appreciate it!