That’s one of the issues. The Spanish portion of mC4 seems to be 1TB of uncompressed text. For sure not all of it is needed, but it’d be great to be able to train on at least half or a third of it.
That should work! We are also working on dataset streaming for very large datasets, see the PR here: https://github.com/huggingface/datasets/pull/2375. And RoBERTa large can fit a batch size of up to 512 or 1024 on a TPUv3-8 at a sequence length of 128 (most of the time one actually starts with a sequence length of just 128).
So this is definitely a doable project!
That’s really cool! mC4 can be painful depending on what languages you are dealing with. Streaming will be great for this.
Would it be viable to pre-train ALBERT or DeBERTa-v2 models in this event? Hard to make a decision on which one!
We haven’t added ALBERT and DeBERTa-v2 in Flax yet. ALBERT should be easy to add though.
ELECTRA (which is already available in Flax) is also a good option, since it’s much more sample-efficient. But we don’t have a pre-training script for ELECTRA yet.
This is the opportunity we were waiting for! Count me in @mrm8488
Great! Created some of my project ideas here.
- PreTrain GPT2 from scratch in Bengali
- PreTrain T5 from scratch in Bengali
- PreTrain RoBERTa (MLM model) from scratch for Programming Languages
Not sure what the status of the T5 pre-training script is. Would love to contribute and adapt the given MLM and CLM scripts to T5 if it’s not done yet.
Awesome, can’t wait for JAX/Flax Hugging Face + TPUs :partying_face: I’m already working on Japanese text classification using pre-trained BERT transformers.
T5 pretraining script should be merged by next week
Super excited for this!!!
Training GPT2 in Bengali would be pretty huge for the Bengali NLP research community.
Here’s the topic link: PreTrain GPT2 from scratch in Bengali
Unfortunately I will not be able to attend the talks from 30/06 - 02/07. Will they be recorded and made available?
I think so! (@Suzana might know better)
Would it be possible to train an mBART model from scratch in JAX/Flax?
Maybe only for a couple of languages, to fit the time frame.
Sure, why not!
mBART will be merged soon in JAX/Flax, but if you want to train from scratch you could also use BART or T5.
And yeah, starting with a few languages makes sense to fit the time frame.
Can I train Wav2Vec2 in JAX?
Is there code for BART pre-training using huggingface?
No, we haven’t added a BART pre-training script yet. The T5 pre-training script should be available within a week.
But if someone wants to, feel free to take a shot at it. The most important part is the BART denoising function. Then one could just leverage the `run_summarization` script with the denoising dataset to pre-train BART.
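For anyone who wants to try, here is a rough toy sketch of that denoising step on whitespace tokens. BART’s text infilling samples span lengths from Poisson(λ=3); this sketch uses a uniform 1–3 span length to stay dependency-free, so treat it as an illustration of the idea, not the real implementation:

```python
import random

def text_infill(tokens, mask_token="<mask>", mask_ratio=0.3, seed=0):
    """Toy BART-style text infilling: replace random spans of tokens
    with a single mask token. The model is then trained to reconstruct
    the original sequence from the corrupted one (the denoising objective)."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    budget = int(len(tokens) * mask_ratio)  # total tokens allowed to be corrupted
    noised, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_ratio:
            span = min(rng.randint(1, 3), budget)  # BART: Poisson(3) span length
            noised.append(mask_token)              # whole span -> one mask token
            i += span
            budget -= span
        else:
            noised.append(tokens[i])
            i += 1
    return noised

src = "the quick brown fox jumps over the lazy dog".split()
print(text_infill(src))
```

The (corrupted, original) pairs from a function like this would form the dataset fed to the seq2seq training loop.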
Patrick is working on FlaxWav2Vec2, but it will take some time, since it’s a complex model and pre-training is also a bit complex.
Super excited to get some formal training on JAX. I’ve been trying to get started with JAX for many weeks now, but lack of motivation and a busy work schedule prevented it. Looks like at least one problem is solved! Btw, if I only want to learn JAX and not work on a project, would that be okay? Curious because it might not fit into the schedule.
Sure, try to take in as much as possible during the event!
Hi @Suzana, please share here if you have any updates on this. Thank you!