BART question: does pretraining not work for a small model?

My task is to generate keywords from sentences.

I pretrain a text-generation model: I mask tokens in the sentences and train the model to predict all of the tokens of the whole sentences.
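To make the objective concrete, here is a minimal sketch of that kind of masking in plain Python. It is not the code I actually use: the `<mask>` token, the `mask_tokens` helper and the `mask_prob` value are assumptions for illustration only.

```python
import random

MASK = "<mask>"  # hypothetical mask token, not taken from any specific library


def mask_tokens(tokens, mask_prob=0.3, seed=None):
    """Return (source, target) for a denoising-style pretraining example.

    source: tokens with each position independently replaced by MASK
            with probability mask_prob (an assumed value).
    target: the original, unmasked tokens (the whole sentence).
    """
    rng = random.Random(seed)
    source = [MASK if rng.random() < mask_prob else tok for tok in tokens]
    target = list(tokens)
    return source, target


tokens = "the quick brown fox jumps over the lazy dog".split()
source, target = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(source)  # same length as the input, some positions replaced by <mask>
print(target)  # the original sentence, unchanged
```

The source and target have the same length here, so every output position lines up with an input position.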

Pretraining uses batch_size = 8 and step = 1000000.

I haven’t observed any improvement from pretraining: the BLEU score is 10.5 without pretraining and 9.5 with pretraining.


I take the python code from

hidden_size = 512
num_encoder_layers = 3
num_decoder_layers = 3


Am I right? Is the small model size the reason that pretraining does not improve the BLEU score?

With all due respect, you are asking a question on a forum dedicated to a specific library, transformers by HuggingFace, but the question does not involve that library. In fact, you are using a completely different library.

I have changed the tag.

On the research part of the forum, we welcome any general questions, though of course we would prefer you to use our models :wink:
@sshleifer might have some answer as he is the Bart person on the team.

Definitely possible. There could also be a bug in your code. I don’t have enough familiarity with your task to know what results to expect.

Thank you. I am also using your models.

1. I pad the input tokens with zeros when there are multiple sentences. The positions of the output tokens should match the positions of the input tokens exactly, which means I should keep the padding zeros in the output tokens as well.

2. The pretraining should run for longer.
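For the padding in point 1, a minimal sketch of what I mean, in plain Python. The helper name `pad_sentences`, the fixed per-sentence length of 4, and the token ids (including 99 as a mask id) are all made up for illustration; only the padding value 0 comes from my setup.

```python
PAD_ID = 0  # padding value, as described above


def pad_sentences(sentences, sent_len, pad_id=PAD_ID):
    """Pad (or truncate) each sentence to sent_len, then concatenate.

    Each sentence occupies a fixed-size slot, so the same helper applied
    to inputs and targets puts padding at the same positions in both.
    """
    flat = []
    for ids in sentences:
        ids = ids[:sent_len]
        flat.extend(ids + [pad_id] * (sent_len - len(ids)))
    return flat


# Hypothetical token ids; 99 stands in for a mask token id.
masked_sentences = [[5, 99, 7], [8, 99]]
original_sentences = [[5, 6, 7], [8, 9]]

input_ids = pad_sentences(masked_sentences, sent_len=4)
target_ids = pad_sentences(original_sentences, sent_len=4)

# Padding must sit at exactly the same positions in input and target.
aligned = all((i == PAD_ID) == (t == PAD_ID) for i, t in zip(input_ids, target_ids))
print(input_ids)
print(target_ids)
print(aligned)
```

This only works if no real token uses id 0, which is an assumption about the vocabulary.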