Hi all, I used the BART model to generate story endings. However, when I ran the same code but initialized the model with bart-base versus bart-large, the bart-large output was stranger: it generates extra bos tokens (<s>) at the beginning of the result.
bart-base:
<s> She was so happy that she finally had a real horse.
bart-large:
<s> <s> <s> She was so happy that she finally had a real horse.
This problem appears in every checkpoint of bart-large; some checkpoints produce more <s> tokens and some fewer. The bart-base results don't have this problem: every sample generates exactly one <s>. I wonder what causes this difference.
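As a workaround while I debug this, I strip the duplicated bos tokens after decoding (passing skip_special_tokens=True to the tokenizer's decode would also drop them entirely, but I want to keep one). A minimal sketch with a hypothetical helper, pure Python:

```python
def strip_extra_bos(text: str, bos: str = "<s>") -> str:
    """Collapse repeated leading bos tokens down to a single one."""
    stripped = text.lstrip()
    # Peel off every leading bos token (and surrounding whitespace).
    while stripped.startswith(bos):
        stripped = stripped[len(bos):].lstrip()
    # Re-attach exactly one bos token at the front.
    return bos + " " + stripped

print(strip_extra_bos("<s> <s> <s> She was so happy that she finally had a real horse."))
```

This only papers over the symptom, though; I'd still like to understand why bart-large emits the extra tokens in the first place.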
What's more, when I use bart-large, the learning rate has to be tuned down to 1e-5, while bart-base works fine with 5e-5. If I use the same learning rate for both, the loss spikes sharply at some point during training. Is this normal?
Thanks all!