PreTrain BART on The Pile

Description: We will pretrain a large BART model on The Pile and measure a performance increase downstream. Potentially we could also add rotary embeddings?

Model: BART (1b+)

Dataset: The Pile

Training scripts: Training scripts will be written as part of the project. Data processing scripts can be taken from GPT-J 6B.

Expected result: An adaptable JAX pipeline for training seq2seq models like BART on The Pile.


I’m interested in this project. Any reason why you’d aim for BART and not a model with some architectural improvements?

Great project! I’d also be interested in why one should use BART. Maybe T5 is also an option? We’ll have an official pretraining script merged for T5 very soon → see: [Flax] Add T5 pretraining script by patrickvonplaten · Pull Request #12355 · huggingface/transformers · GitHub

To be honest, it’s just because I’m familiar with BART :man_shrugging: I haven’t really used T5 in practice yet. A lot of the stack I built last year still uses BART.

Cool idea!

IMO one important consideration would be the output sequence length. AFAIU, for BART’s denoising objective the output sequence length is the same as the input length, whereas for T5 it’s quite short, which would lead to faster training.
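To make the length difference concrete, here is a hand-worked toy example (the token sequence, span choices, and sentinel names are illustrative, not from either paper): BART’s denoising decoder reconstructs the entire original sequence, while T5’s span-corruption decoder only emits the masked spans delimited by sentinel tokens.

```python
tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# BART-style denoising: the decoder target is the full original sequence,
# regardless of how the input was corrupted.
bart_target = tokens

# T5-style span corruption: mask a couple of spans by hand for illustration;
# the decoder target contains only the sentinels plus the masked spans.
t5_input = ["the", "<extra_id_0>", "fox", "jumps", "<extra_id_1>", "the", "lazy", "dog"]
t5_target = ["<extra_id_0>", "quick", "brown", "<extra_id_1>", "over", "<extra_id_2>"]

print(len(bart_target))  # 9
print(len(t5_target))    # 6
```

Since decoder compute per step scales with target length, the shorter T5-style targets translate directly into cheaper training steps.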

Adding rotary embeddings also seems like a good idea!
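For anyone unfamiliar with rotary embeddings (RoPE, Su et al. 2021): each pair of feature dimensions in the query/key vectors is rotated by a position-dependent angle. A minimal NumPy sketch below, using the "half-split" pairing variant (the original paper interleaves adjacent dimensions instead); this is not the project’s implementation, just an illustration of the idea.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Pairs dimension i with dimension i + dim//2 and rotates each pair
    by angle pos * base**(-i / (dim//2)).
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)               # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation of each (x1, x2) pair by its position-dependent angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.ones((4, 8))
print(rotary_embed(q).shape)  # (4, 8)
```

A nice property to sanity-check: position 0 has angle 0 everywhere, so the first row comes through unchanged, and rotations preserve vector norms.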

Also, what do you think about a deep encoder with a shallow decoder? Looking at the ByT5 paper, it seems worth exploring.
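A rough back-of-the-envelope on the deep-encoder/shallow-decoder idea (layer counts and `d_model` here are hypothetical, not proposed project settings): moving layers from the decoder to the encoder keeps total depth fixed, slightly shrinks the parameter count (fewer cross-attention blocks), and speeds up autoregressive decoding since fewer layers run per generated token.

```python
def layer_params(d_model, ffn_mult=4, cross_attn=False):
    """Approximate parameter count of one transformer layer,
    ignoring layer norms and biases."""
    attn = 4 * d_model * d_model              # Q, K, V, and output projections
    ffn = 2 * d_model * (ffn_mult * d_model)  # FFN up- and down-projection
    return attn + (attn if cross_attn else 0) + ffn

d = 1024
# Balanced 12+12 vs. deep-encoder 18+6, same 24 layers total.
balanced = 12 * layer_params(d) + 12 * layer_params(d, cross_attn=True)
deep_enc = 18 * layer_params(d) + 6 * layer_params(d, cross_attn=True)
print(balanced, deep_enc)  # the deep-encoder variant is slightly smaller
```

The bigger practical win is at inference: encoder layers run once per input, while every decoder layer runs once per generated token.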

BART’s objective seems to help more than T5’s, though, based on the results for text generation.


Let’s officially define this project :slight_smile:

Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.

@THEODOROS was interested in doing cross project collaboration, he mentioned wanting to use the BART+rotary implementation we end up with.

This is a cool idea, would love to help out if I can! Adding myself to the Google sheet (if that’s cool).


Join the Discord, @morgan


@valhalla @patrickvonplaten We wanted to make some architectural improvements to the BART model, partly using DeBERTa’s tokenizer and also adding rotary embeddings. How would we use this model in HF after that?

Hi @paws

Feel free to make any improvements you want. You could look at how FlaxBart is implemented and try to keep the same API (i.e. `__call__`, `save_pretrained`, and `from_pretrained`); that way the model will stay compatible with the HF API. For now, you could create a new repo, and we can always modify the code later to make it compatible.
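To illustrate the API surface being described, here is a tiny stand-in class (hypothetical names, no JAX) that keeps the three HF-style entry points. A real variant would subclass `FlaxBartPreTrainedModel` rather than implement these by hand; this sketch only shows the round-trip contract that makes a custom model interchangeable with HF tooling.

```python
import json
import os
import tempfile

class TinySeq2Seq:
    """Hypothetical stand-in showing the HF-style API to preserve:
    __call__, save_pretrained, from_pretrained."""

    def __init__(self, config):
        self.config = config

    def __call__(self, input_ids):
        # A real model would return logits; echo the input for illustration.
        return input_ids

    def save_pretrained(self, path):
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "config.json"), "w") as f:
            json.dump(self.config, f)

    @classmethod
    def from_pretrained(cls, path):
        with open(os.path.join(path, "config.json")) as f:
            return cls(json.load(f))

with tempfile.TemporaryDirectory() as d:
    model = TinySeq2Seq({"d_model": 1024, "rotary": True})
    model.save_pretrained(d)
    reloaded = TinySeq2Seq.from_pretrained(d)

print(reloaded.config["rotary"])  # True
```

As long as the custom BART+rotary model honors this contract, downstream users can swap it in wherever they currently load FlaxBart.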

Thanks @LouisCastricato , where is the discord link? Or do you mean the HF slack? Haven’t received an invite for that yet…


You should talk to someone working at HF to get the invite. I do not think I am allowed to send or post the link. @valhalla do you mind helping morgan?


@LouisCastricato Hi, I have experience using BART and T5 for summarization, so I’m interested in this project. Hope I can join!


Of course! Join the discord as well!

In, thanks!

Added you as well @mattbui

I have some experience with using BART for summarization on financial text. Very interested in learning more about pre-training seq2seq models + using TPU vms. I would love to join the project if there is still room on the team.


added you :slight_smile:
