Data collator for training BART from scratch

Hello,
I would like to train BART from scratch.
It seems an official example script is not available yet (if there is one, please tell me!).
So I am trying to put one together by modifying the example scripts run_mlm.py and run_clm.py,
but I am not sure how to set up the data collator part for BART.

In run_mlm.py, DataCollatorForLanguageModeling is used:
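For reference, this is roughly how run_mlm.py builds that collator (a sketch from memory; the exact argument names and defaults may differ between library versions, and `tokenizer` is loaded earlier in the script):

```python
from transformers import DataCollatorForLanguageModeling

# Sketch of the collator setup in run_mlm.py; the masking probability
# normally comes from the script's command-line arguments.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=0.15,  # fraction of tokens to mask for the MLM objective
)
```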

In run_clm.py, default_data_collator is used:
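And this is roughly how run_clm.py wires the default collator into the Trainer (again a sketch; `model`, `training_args`, and `train_dataset` are defined earlier in the script):

```python
from transformers import Trainer, default_data_collator

# Sketch of the Trainer setup in run_clm.py; causal LM batches need no
# special masking, so the ready-made default collator just stacks tensors.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=default_data_collator,
)
```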

Can someone give me some advice on how to set up the data collator for BART?
Thanks.


@zuujhyt Maybe try creating different Dataset classes, each with its own DataCollator. You could then use PyTorch Lightning to build a DataLoader setup with multiple Dataset classes like this, and train BART on that. I’ll try doing this; a rough sketch of the idea is below.
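Something along these lines, as a minimal sketch (the dataset and collator objects here are placeholders, and how Lightning combines multiple training dataloaders depends on your version):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class BartPretrainingData(pl.LightningDataModule):
    """Wraps two datasets that each need their own collator."""

    def __init__(self, dataset_a, dataset_b, collator_a, collator_b, batch_size=8):
        super().__init__()
        self.dataset_a = dataset_a    # placeholder: examples prepared for one objective
        self.dataset_b = dataset_b    # placeholder: examples prepared for another objective
        self.collator_a = collator_a  # collator matching dataset_a
        self.collator_b = collator_b  # collator matching dataset_b
        self.batch_size = batch_size

    def train_dataloader(self):
        # Lightning can iterate over a list of dataloaders in one training loop,
        # and each loader keeps its own collate_fn.
        return [
            DataLoader(self.dataset_a, batch_size=self.batch_size,
                       shuffle=True, collate_fn=self.collator_a),
            DataLoader(self.dataset_b, batch_size=self.batch_size,
                       shuffle=True, collate_fn=self.collator_b),
        ]
```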

Let me know if you have any updates on this.