GPT-2 - Training data vs. model size comparison for GPT-2 Small/Medium/XL

I’m trying to fine-tune GPT-2 to create a very basic chatbot, and I’ve been trying to decide which GPT-2 model to use.
After trying out the pretrained small/medium/large/XL variants, GPT-2 XL is already very good at creating believable dialogue while GPT-2 Small is not (the others are somewhere in between).

I wanted to know how much of this is due to the training data (I assumed XL was also trained on several GB more data) and how much is due to the model size. Obviously a bigger model means better generation quality, but for a simple conversational AI, will fine-tuning GPT-2 Small on a sufficiently large dataset lead to better conversation generation, or is it limited by the model size?
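For reference, this is roughly how I’m sampling from each pretrained checkpoint to get the comparisons below (a minimal sketch using the Hugging Face transformers pipeline; the prompt and sampling settings are just what I happened to use):

```python
from transformers import pipeline, set_seed

# Sketch of how I'm sampling from the pretrained checkpoints
# (swap model_name for "gpt2-medium", "gpt2-large", "gpt2-xl", etc.).
model_name = "gpt2"  # the 117M "small" variant
prompt = "Lisa: Hi adam, how are you?\nAdam:"

set_seed(42)
generator = pipeline("text-generation", model=model_name)
out = generator(prompt, max_new_tokens=60, do_sample=True, top_p=0.9)
print(out[0]["generated_text"])
```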

From GPT-2 XL (1.5B):
Lisa: Hi adam, how are you?
Adam: Hi Lisa, I’m good. How is your day going?
Lisa: It’s going great. I’m just about to go to work.
Adam: Oh, I’m sorry. I didn’t mean to interrupt you.
Lisa: No, it’s fine.
Adam: I’m just trying to figure out what I’m going to do with my life.

From GPT-2 Small (117M):
Lisa: Hi adam, how are you?
Adam: Hi Lisa, I’m good. How is your day going?
Lisa: I’m fine. I’m just going to go to bed.
Adam: I’m going to sleep.
Lisa: I’m going to sleep.
Adam: I’m going

(I can run GPT-2 Small locally but not XL, so I’m okay with having to fine-tune GPT-2 Small for longer.)
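If more fine-tuning data is the answer, this is roughly the training setup I have in mind for GPT-2 Small (a sketch with the Hugging Face Trainer; `dialogues.txt` and the hyperparameters are placeholders, not anything I’ve settled on):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# "dialogues.txt": one dialogue turn per line (placeholder file name)
dataset = load_dataset("text", data_files={"train": "dialogues.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Causal LM objective (mlm=False), labels are copied from input_ids
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-small-chatbot",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```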