Is there any codebase in Hugging Face that could be used to pretrain a T5 model? Looking through the examples directory in the repo, I can't find anything about T5. Thanks!
Still need help on this…
Hi @mralexis, there’s a GitHub issue that might help you: How do I pre-train the T5 model in HuggingFace library using my own text corpus? · Issue #5079 · huggingface/transformers · GitHub
T5ForConditionalGeneration is probably what you are looking for to do pretraining: T5 — transformers 4.3.0 documentation
@lewtun Thanks for the quick reply! I did check it out, but it only has a code block showing how to calculate the pretraining loss. The other implementation details, which are just as critical, are missing. Do you know whether there is code for that?
Unfortunately I do not know where one can find a detailed example of T5 pretraining, so pinging @valhalla in case he does.
Hey guys, sorry about the super late response.
T5 pre-training is not implemented with Transformers, AFAIK it’s only available in the original T5 repo.
What we need to implement this with Transformers is the T5-style denoising dataset. It’s on my to-do list to implement this, hopefully early next month.
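For readers wondering what a "T5-style denoising dataset" looks like, here is a minimal sketch of the input/target format. It is not the code from the original T5 repo: the function name `span_corrupt` is hypothetical, the spans to mask are passed in by hand rather than sampled randomly, and it works on whitespace-split words instead of SentencePiece tokens. The `<extra_id_N>` sentinel names are the ones used by the T5 tokenizer in Transformers.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span in `tokens` with a sentinel token.

    `spans` must be sorted, non-overlapping, half-open index pairs.
    Returns (input_tokens, target_tokens).
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])   # keep the uncorrupted text
        inputs.append(sentinel)               # one sentinel per dropped span
        targets.append(sentinel)              # target repeats the sentinel...
        targets.extend(tokens[start:end])     # ...followed by the dropped text
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "The cute dog walks in the park".split()
inputs, targets = span_corrupt(tokens, [(1, 3), (5, 6)])
print(" ".join(inputs))   # The <extra_id_0> walks in <extra_id_1> park
print(" ".join(targets))  # <extra_id_0> cute dog <extra_id_1> the <extra_id_2>
```

The model is then trained to generate the target sequence from the corrupted input. A real pipeline would additionally sample the spans at random and operate on token IDs.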
Hey! Just checking in on this to see if anyone has any updates.
I am also interested in this and actually have a semi-working version (needs more testing) based on the original T5 repo. I’d be happy to work together on this to bring it to the transformers library if it is still on the roadmap. Here is the colab with the current implementation: Google Colaboratory (scroll down/CTRL-F for DataCollatorForSeq2SeqMaskLanguageModeling)
I can also open a PR to start this process if interested.
Any more developments here? My understanding is that we’d have to pre-train using the standard Trainer class with a custom Data Collator as described by @ncoop57. @valhalla would you be able to help/comment?
Hi @valhalla, any update on this?
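To make the "custom Data Collator" idea concrete, here is a rough sketch of what such a collator could look like. It is not @ncoop57's implementation and not the original T5 algorithm. Everything here is an assumption for illustration: the class name `SpanCorruptionCollator`, the simplified span sampling, the default `pad_id=0` and `eos_id=1`, and the convention that sentinel IDs count down from `vocab_size - 1` (which matches the stock T5 tokenizer, where `<extra_id_0>` is the last ID). It returns plain Python lists; for use with Trainer the values would still need converting to tensors.

```python
import random


class SpanCorruptionCollator:
    """Hypothetical collator producing T5-style denoising batches."""

    def __init__(self, vocab_size, pad_id=0, eos_id=1,
                 noise_density=0.15, span_length=3, seed=None):
        self.vocab_size = vocab_size
        self.pad_id = pad_id
        self.eos_id = eos_id
        self.noise_density = noise_density
        self.span_length = span_length
        self.rng = random.Random(seed)

    def _pick_spans(self, n):
        # Simplified sampling (an assumption, not the original T5 scheme):
        # at each position, start a fixed-length span with a small probability.
        spans, i = [], 0
        start_prob = self.noise_density / self.span_length
        while i < n:
            if self.rng.random() < start_prob:
                end = min(i + self.span_length, n)
                spans.append((i, end))
                i = end + 1  # leave a gap so two spans never touch
            else:
                i += 1
        if not spans and n > 0:  # make sure every example masks something
            start = self.rng.randrange(n)
            spans.append((start, min(start + self.span_length, n)))
        return spans

    def _corrupt(self, ids):
        spans = self._pick_spans(len(ids))
        inputs, labels, cursor = [], [], 0
        for k, (start, end) in enumerate(spans):
            sentinel = self.vocab_size - 1 - k  # assumed ID of <extra_id_k>
            inputs += ids[cursor:start] + [sentinel]
            labels += [sentinel] + ids[start:end]
            cursor = end
        inputs += ids[cursor:] + [self.eos_id]
        labels += [self.vocab_size - 1 - len(spans), self.eos_id]
        return inputs, labels

    def __call__(self, features):
        pairs = [self._corrupt(list(f["input_ids"])) for f in features]
        in_len = max(len(inp) for inp, _ in pairs)
        lab_len = max(len(lab) for _, lab in pairs)
        batch = {"input_ids": [], "attention_mask": [], "labels": []}
        for inputs, labels in pairs:
            pad = in_len - len(inputs)
            batch["input_ids"].append(inputs + [self.pad_id] * pad)
            batch["attention_mask"].append([1] * len(inputs) + [0] * pad)
            # -100 marks label positions that the loss should ignore
            batch["labels"].append(labels + [-100] * (lab_len - len(labels)))
        return batch


collator = SpanCorruptionCollator(vocab_size=32100, seed=0)
features = [{"input_ids": list(range(100, 140))},
            {"input_ids": list(range(200, 215))}]
batch = collator(features)
```

The sketch assumes the examples contain raw token IDs without an EOS token, and it does not guard against using more than the 100 sentinel IDs the stock tokenizer provides, which could matter for very long sequences.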
T5 pre-training is now supported in JAX/FLAX. You can check out the example script here: transformers/examples/flax/language-modeling at master · huggingface/transformers · GitHub. It actually includes 2 scripts:
- t5_tokenizer_model.py, to train a T5 tokenizer (i.e. SentencePiece) from scratch.
- run_t5_mlm_flax.py, to pre-train T5. It’s suited to run on TPUs (for which you can obtain access for free by applying to Google’s TFRC program).
This script was developed for the JAX/FLAX community event. It would be really cool if someone contributed a PyTorch version of it. That would mean translating the script from FLAX to PyTorch, which is probably straightforward.