The output of bart-large is stranger compared to bart-base

Hi all, I am using a BART model to generate story endings. When I run the same code but initialize the model with bart-base versus bart-large, the bart-large output is stranger: it generates extra bos_token <s> at the beginning of the result.
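
Roughly, the generation code looks like this (a simplified sketch; the checkpoint names stand in for my fine-tuned checkpoints and the generation arguments are placeholders, not my exact settings):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Same code for both runs; only the checkpoint name changes.
model_name = "facebook/bart-large"  # or "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

story = "Sarah had always wanted a pony."  # placeholder story context
inputs = tokenizer(story, return_tensors="pt")

# Beam size and max length are placeholders, not my exact settings.
output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```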

bart-base:

<s> She was so happy that she finally had a real horse.

bart-large:

<s> <s> <s> She was so happy that she finally had a real horse.

This problem appears in every bart-large checkpoint; some checkpoints generate more extra <s> tokens and some fewer. The bart-base outputs don't have this problem: every sample generates exactly one <s>. I wonder what causes this difference.

What's more, when I use bart-large, the learning rate has to be tuned down to 1e-5, while bart-base works fine with 5e-5. If I use the same learning rate for both, the loss increases sharply at some point during training. Is this normal?
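
For reference, the only thing I change between the two runs is the learning rate (sketch; the checkpoint and everything besides the learning rate are placeholders):

```python
import torch
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# 5e-5 is stable for bart-base, but bart-large only trains stably at ~1e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```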

Thanks all!