BERT vs GPT: architectural, conceptual and implementational differences

In the BERT paper, I learnt that BERT is an encoder-only model, that is, it involves only transformer encoder blocks.

In the GPT paper, I learnt that GPT is a decoder-only model, that is, it involves only transformer decoder blocks.

I was wondering what the difference really is. I know the following difference between encoder and decoder blocks: the GPT decoder attends only to previously generated tokens (the ones to its left) and learns from them, not from tokens to its right, whereas the BERT encoder attends to tokens on both sides.
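
To make that left-only vs both-sides distinction concrete, here is a minimal sketch of the attention masks I have in mind (my own illustration, not code from either paper):

```python
import torch

seq_len = 5

# Encoder-style (BERT): every token may attend to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len)

# Decoder-style (GPT): token i may attend only to tokens 0..i (lower triangle).
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
```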

But I have the following doubts:

Q1. GPT-2/3 focuses on few/one/zero-shot learning. Can't we build a few/one/zero-shot learning model with an encoder-only architecture like BERT?
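
By few/one-shot learning I mean what the GPT-3 paper describes: putting the demonstrations in the prompt and letting the autoregressive model complete it, with no gradient updates. A rough sketch of what I have in mind (the prompt is the GPT-3 paper's translation example; assumes a recent transformers version for max_new_tokens):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One-shot prompt: a single demonstration followed by the query.
prompt = "Translate English to French:\nsea otter => loutre de mer\ncheese =>"
inputs = tokenizer(prompt, return_tensors="pt")

# The model simply continues the text; the "learning" is all in the prompt.
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```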

Q2. The Hugging Face GPT2Model contains a forward() method. I guess feeding a single data instance to this method is like doing one-shot learning?
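
For reference, this is the kind of single-instance forward() call I mean (a sketch; it only returns hidden states, there is no gradient update and no prompt, which is why I am unsure it counts as one-shot learning):

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("A single data instance.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)            # calls forward() under the hood

print(outputs.last_hidden_state.shape)   # (batch=1, seq_len, hidden_size=768)
```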

Q3. I have implemented a neural network model which utilizes the output from Hugging Face's BertModel. Can I simply swap the BertModel class with the GPT2Model class and have it work? The return value of GPT2Model.forward does indeed contain last_hidden_state, similar to BertModel.forward. So I guess swapping out BertModel with GPT2Model will indeed work, right?
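
Here is a minimal sketch of the kind of swap I mean (the class, flag and pooling choice are purely illustrative, not my actual model):

```python
import torch.nn as nn
from transformers import BertModel, GPT2Model

class MyClassifier(nn.Module):
    """Illustrative classifier head on top of a BERT or GPT-2 backbone."""

    def __init__(self, use_gpt2=False, num_labels=2):
        super().__init__()
        if use_gpt2:
            self.backbone = GPT2Model.from_pretrained("gpt2")
        else:
            self.backbone = BertModel.from_pretrained("bert-base-uncased")
        # Both configs expose hidden_size (768 for these two checkpoints).
        self.head = nn.Linear(self.backbone.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Both outputs carry last_hidden_state, but only BertModel also
        # returns pooler_output.
        hidden = out.last_hidden_state            # (batch, seq_len, hidden_size)
        # Naive first-token pooling: natural for BERT's [CLS], questionable for
        # GPT-2, where only the last non-padded position has seen the whole input.
        return self.head(hidden[:, 0, :])
```

Presumably the tokenizer would also have to be swapped (GPT-2 uses a different vocabulary and has no [CLS]/[SEP]/pad tokens by default), which is part of what makes me unsure the swap is as simple as it looks.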

Q4. Apart from being decoder-only vs encoder-only, auto-regressive vs non-auto-regressive, and whether or not they accept the tokens generated so far as input, what high-level architectural / conceptual differences do GPT and BERT have?