Data collator for training BART from scratch

I would like to train BART from scratch.
It seems the official example script is not available yet (if there is one, please tell me!).
So I am trying to write one by modifying the example scripts, but I am not sure how to set up the data collator part for BART.

In , `DataCollatorForLanguageModeling` is used:

In , `default_data_collator` is used:

Can someone give some advice on how to set up the data collator for BART?
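For what it's worth, neither of those collators implements BART's pretraining objective, which is denoising: the encoder sees corrupted text (the paper's main corruption is "text infilling", where spans with Poisson-distributed lengths are replaced by a single mask token) and the decoder reconstructs the original. Below is a simplified, framework-free sketch of such a collator, just to illustrate the idea; the token ids (`PAD_ID`, `MASK_ID`), the `mask_ratio`, and the span-merging details are placeholder assumptions, and a real implementation would use your tokenizer's special tokens and return tensors:

```python
import math
import random

# Hypothetical special-token ids; in practice, take these from your tokenizer
# (e.g. tokenizer.pad_token_id and tokenizer.mask_token_id).
PAD_ID, MASK_ID = 1, 4

def _poisson(lam, rng):
    # Knuth's algorithm: sample span lengths ~ Poisson(lam), as in the
    # BART paper's text-infilling scheme (lambda = 3).
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def text_infill(ids, mask_ratio=0.3, lam=3.0, rng=None):
    """Replace random spans with a single MASK_ID until ~mask_ratio of
    the tokens have been corrupted (simplified: spans may overlap masks)."""
    rng = rng or random.Random()
    ids = list(ids)
    budget = int(len(ids) * mask_ratio)
    while budget > 0 and len(ids) > 1:
        span = min(max(_poisson(lam, rng), 1), budget, len(ids) - 1)
        start = rng.randrange(len(ids) - span + 1)
        ids[start:start + span] = [MASK_ID]  # whole span -> one mask token
        budget -= span
    return ids

def denoising_collate(batch, rng=None):
    """Batch of token-id lists -> corrupted, padded encoder inputs plus
    the uncorrupted originals (padded) as labels."""
    rng = rng or random.Random()
    inputs = [text_infill(ids, rng=rng) for ids in batch]
    in_len = max(len(x) for x in inputs)
    lab_len = max(len(x) for x in batch)
    return {
        "input_ids": [x + [PAD_ID] * (in_len - len(x)) for x in inputs],
        "labels": [x + [PAD_ID] * (lab_len - len(x)) for x in batch],
    }
```

A function like this can be passed as `collate_fn` to a PyTorch `DataLoader` (or as `data_collator` to a `Trainer`, once it returns tensors and masks the pad positions in `labels` with -100 so they are ignored by the loss).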


@zuujhyt Maybe try creating different Dataset classes using different DataCollators. You could then use PyTorch Lightning to create a DataLoader with multiple Dataset classes like this, and train BART on that. I’ll try doing this.

Let me know if you have any updates on this.