In the BERT paper, I learnt that BERT is an encoder-only model, i.e. it consists only of Transformer encoder blocks.
In the GPT paper, I learnt that GPT is a decoder-only model, i.e. it consists only of Transformer decoder blocks.
I was wondering what the difference is. I know the following difference between encoder and decoder blocks: the GPT decoder attends only to previously generated tokens (those to its left) and learns from them, not from tokens on the right side, while the BERT encoder attends to tokens on both sides.
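To illustrate what I mean, here is a minimal sketch of the two attention patterns (assuming PyTorch; a value of 1 means that position may be attended to):

```python
import torch

seq_len = 5

# GPT-style causal mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# BERT-style bidirectional mask: every position attends to every position.
bidirectional_mask = torch.ones(seq_len, seq_len)

print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
```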
But I have the following doubts:
Q1. GPT-2 and GPT-3 focus on few/one/zero-shot learning. Can't we build a few/one/zero-shot learning model with an encoder-only architecture like BERT?
Q2. Hugging Face model classes like BertModel have a forward() method. I guess feeding a single data instance to this method is like doing one-shot learning?
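For example, this is what I mean by feeding a single instance (a minimal sketch, assuming the standard bert-base-uncased checkpoint):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# A "single data instance": a batch of size 1.
inputs = tokenizer("A single data instance.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # equivalent to calling model.forward(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```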
Q3. I have implemented a neural network model which utilizes the output from Hugging Face's BertModel. Can I simply swap the BertModel class with the GPT2Model class, and will it work? The return value of GPT2Model.forward() does indeed contain last_hidden_state, similar to BertModel.forward(). So I guess swapping in GPT2Model will indeed work, right?
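Here is roughly the swap I have in mind (a minimal sketch, assuming the standard gpt2 checkpoint):

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default; needed for batching
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("The same downstream input.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state  # shape (1, seq_len, 768), as with BertModel
```

One difference I already noticed: GPT2Model returns no pooler_output, so if my downstream code relies on that, I would have to derive a sentence representation from last_hidden_state instead.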
Q4. Apart from being decoder-only vs encoder-only, auto-regressive vs non-auto-regressive, and whether or not they accept the tokens generated so far as input, what high-level architectural/conceptual differences do GPT and BERT have?