The output of bart-large is stranger compared to bart-base

Hi all, I am using a BART model to generate story endings. When I run the same code but initialize the model with bart-base versus bart-large, the bart-large output is stranger: it generates extra bos_token <s> at the beginning of the result.
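
Roughly, the generation code looks like this (a simplified sketch; the checkpoint names stand in for my fine-tuned checkpoints and the generation arguments are placeholders, not my exact settings):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Same code for both runs; only the checkpoint name changes.
model_name = "facebook/bart-large"  # or "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

story = "Sarah had always wanted a pony."  # placeholder story context
inputs = tokenizer(story, return_tensors="pt")

# Beam size and max length are placeholders, not my exact settings.
output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```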

bart-base:

<s> She was so happy that she finally had a real horse.

bart-large:

<s> <s> <s> She was so happy that she finally had a real horse.

This problem appears in every bart-large checkpoint; some checkpoints generate more extra <s> tokens and some fewer. The bart-base outputs don't have this problem: every sample generates exactly one <s>. I wonder what causes this difference.

What's more, when I use bart-large, the learning rate has to be tuned down to 1e-5, while bart-base works fine with 5e-5. If I use the same learning rate for both, the loss increases sharply at some point during training. Is this normal?
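
For reference, the only thing I change between the two runs is the learning rate (sketch; the checkpoint and everything besides the learning rate are placeholders):

```python
import torch
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# 5e-5 is stable for bart-base, but bart-large only trains stably at ~1e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```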

Thanks all!