Description: We will pretrain a large BART model on The Pile and measure the resulting downstream performance gains. Potentially, we could also add rotary embeddings.
Model: BART (1b+)
Dataset: The Pile
Training scripts: Training scripts will be written as part of the project. Data processing scripts can be adapted from GPT-J 6B.
Expected result: An adaptable JAX pipeline for training seq2seq models like BART on The Pile (rough sketch of the entry point below).
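A minimal sketch of what that entry point could look like, assuming the `transformers` Flax BART classes and a streamed copy of The Pile; the dataset id `"the_pile"`, the `facebook/bart-large` starting point, and the sequence length are placeholders, not final choices:

```python
# Rough pipeline sketch: stream The Pile and run a Flax BART forward pass.
from datasets import load_dataset
from transformers import BartTokenizerFast, FlaxBartForConditionalGeneration

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
# from_pt=True converts the PyTorch checkpoint if no Flax weights are published.
model = FlaxBartForConditionalGeneration.from_pretrained("facebook/bart-large", from_pt=True)

# Streaming avoids materialising the ~800 GB of raw text on disk at once.
pile = load_dataset("the_pile", split="train", streaming=True)

for example in pile.take(2):
    inputs = tokenizer(example["text"], truncation=True, max_length=512,
                       padding="max_length", return_tensors="np")
    # decoder_input_ids default to a right-shifted copy of input_ids inside the model.
    outputs = model(input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"])
    print(outputs.logits.shape)  # (1, 512, vocab_size)
```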
To be honest, it's just because I am familiar with BART; I haven't really used T5 in practice yet. A lot of the stack I built last year still uses BART.
IMO one important consideration would be the output sequence length. AFAIU, for BART's denoising objective the output sequence length is the same as the input length, whereas for T5 it's much shorter (only the corrupted spans), which would lead to faster training.
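To make the length difference concrete, here is a toy illustration (plain Python, not the actual preprocessing code) of what the two objectives ask the decoder to produce:

```python
# Toy illustration of target lengths: BART text infilling vs. T5 span corruption.
tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# BART text infilling: corrupted spans are replaced by a single <mask>,
# but the decoder must reconstruct the *entire* original sequence.
bart_input = ["The", "<mask>", "fox", "jumps", "over", "the", "<mask>", "dog"]
bart_target = tokens                                    # length 9 == original length

# T5 span corruption: the decoder only predicts the dropped spans,
# delimited by sentinel tokens.
t5_input = ["The", "<X>", "fox", "jumps", "over", "the", "<Y>", "dog"]
t5_target = ["<X>", "quick", "brown", "<Y>", "lazy", "<Z>"]   # length 6 < 9

print(len(bart_target), len(t5_target))  # 9 6
```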
Adding rotary embeddings also seems like a good idea!
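For reference, a minimal JAX sketch of rotary position embeddings as they could be applied to the query/key tensors inside BART's attention; the function names and shapes here are made up for illustration and not tied to any existing HF module:

```python
# Minimal rotary position embedding sketch in JAX (rotate-half formulation).
# Shapes assumed: q, k are (batch, seq_len, num_heads, head_dim); head_dim must be even.
import jax.numpy as jnp

def rotary_angles(seq_len: int, head_dim: int, base: float = 10000.0):
    # One frequency per pair of dimensions, one angle per (position, frequency).
    inv_freq = 1.0 / (base ** (jnp.arange(0, head_dim, 2) / head_dim))
    angles = jnp.einsum("i,j->ij", jnp.arange(seq_len), inv_freq)  # (seq_len, head_dim/2)
    return jnp.sin(angles), jnp.cos(angles)

def apply_rotary(x, sin, cos):
    # Rotate-half: split the head dim into two halves and rotate them jointly.
    x1, x2 = jnp.split(x, 2, axis=-1)       # each (..., head_dim/2)
    sin = sin[None, :, None, :]             # broadcast over batch and heads
    cos = cos[None, :, None, :]
    return jnp.concatenate([x1 * cos - x2 * sin,
                            x2 * cos + x1 * sin], axis=-1)

# Example: rotate queries and keys before computing attention scores.
q = jnp.ones((2, 128, 16, 64))
k = jnp.ones((2, 128, 16, 64))
sin, cos = rotary_angles(seq_len=128, head_dim=64)
q_rot, k_rot = apply_rotary(q, sin, cos), apply_rotary(k, sin, cos)
```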
Also, what do you think about a deep encoder / shallow decoder setup? Looking at the ByT5 paper, it seems worth exploring.
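If we go that route, the asymmetry is just a config change in the existing HF API; the layer counts and dimensions below are only an example of the idea, not a proposal for the final model:

```python
# Deep encoder / shallow decoder expressed as a BartConfig (illustrative sizes only).
from transformers import BartConfig, FlaxBartForConditionalGeneration

config = BartConfig(
    d_model=768,
    encoder_layers=12,           # most of the capacity goes to the encoder
    decoder_layers=2,            # shallow decoder for cheaper generation
    encoder_attention_heads=12,
    decoder_attention_heads=12,
    vocab_size=50265,
)
model = FlaxBartForConditionalGeneration(config, seed=0)
print(model.config.encoder_layers, model.config.decoder_layers)  # 12 2
```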
@valhalla @patrickvonplaten We want to make some architectural improvements to the BART model, such as using DeBERTa's tokenizer and adding rotary embeddings. How would we use this model in HF after that?
Feel free to make any improvements you want. You could look at how FlaxBart is implemented and try to keep the same API (i.e. __call__, save_pretrained and from_pretrained); that way the model will stay compatible with the HF API. For now, you could create a new repo; we can always modify the code later to make it fully compatible.
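A minimal sketch of what "keeping the same API" could look like, assuming the custom model simply subclasses the existing Flax BART classes; all the `Rotary*` names and the vocab size are hypothetical:

```python
# Sketch: a custom BART variant that stays HF-compatible by inheriting
# __call__, save_pretrained and from_pretrained from FlaxBartForConditionalGeneration.
# The actual rotary / tokenizer changes would live in a custom Flax module
# plugged in via `module_class`; names here are placeholders.
from transformers import BartConfig, FlaxBartForConditionalGeneration

class RotaryBartConfig(BartConfig):
    def __init__(self, rotary_dim=64, **kwargs):
        super().__init__(**kwargs)
        self.rotary_dim = rotary_dim  # extra field survives save/load via the config JSON

class FlaxRotaryBartForConditionalGeneration(FlaxBartForConditionalGeneration):
    config_class = RotaryBartConfig
    # module_class = FlaxRotaryBartModule  # modified attention would go here

config = RotaryBartConfig(vocab_size=128100, encoder_layers=6, decoder_layers=2)
model = FlaxRotaryBartForConditionalGeneration(config, seed=0)

model.save_pretrained("rotary-bart-test")
reloaded = FlaxRotaryBartForConditionalGeneration.from_pretrained("rotary-bart-test")
print(reloaded.config.rotary_dim)  # 64
```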
You should talk to someone working at HF to get the invite. I do not think I am allowed to send or post the link. @valhalla, do you mind helping morgan?
I have some experience using BART for summarization on financial text. Very interested in learning more about pre-training seq2seq models and using TPU VMs. I would love to join the project if there is still room on the team.