Prepare data to fine-tune T5 model on unsupervised objective

Hi, I couldn’t find a way to fine-tune the T5 model on a domain-specific dataset (say, the medical domain) using the unsupervised objective. Does the current version of Hugging Face support this? Basically, all I need is to prepare the dataset for training T5 on the unsupervised objective, which could itself be very tricky. Any pointers are highly appreciated. P.S. I am looking for something in PyTorch, not TensorFlow. @valhalla @clem


Hi! Have you found a solution? I also can’t figure out how to fine-tune the pretrained model (mT5) on unlabeled domain-specific data using the Transformers library. Thank you.

Hello @Adi-0-0-Gupta and @naumov-al.

There is an overview of the language-modeling training task for T5 on the T5 page of the Hugging Face documentation, under "Unsupervised denoising training".
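
To make the data preparation concrete, here is a minimal sketch of how input/target pairs for the denoising objective could be built in plain Python. It assumes whitespace-split words stand in for tokens, and the function names `apply_span_corruption` and `sample_spans` are hypothetical helpers, not part of any library. The span sampling is a deliberate simplification and not the exact sampling used in the original T5 pre-training.

```python
import random


def apply_span_corruption(words, spans):
    """Build a T5-style (input, target) pair.

    `spans` is a sorted list of non-overlapping (start, end) index pairs,
    end exclusive. Each span is replaced by a sentinel in the input, and
    the target lists each sentinel followed by the words it replaced.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(words[cursor:start])  # keep the unmasked words
        inputs.append(sentinel)             # one sentinel per masked span
        targets.append(sentinel)
        targets.extend(words[start:end])    # the words the model must predict
        cursor = end
    inputs.extend(words[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


def sample_spans(num_words, noise_density=0.15, mean_span_length=3, seed=None):
    """Pick random spans covering roughly `noise_density` of the words.

    Simplified scheme (an assumption of this sketch): split the sequence
    into equal chunks and place one fixed-length span inside each chunk,
    always leaving at least one unmasked word before the next chunk.
    """
    rng = random.Random(seed)
    num_noise = max(1, round(num_words * noise_density))
    num_spans = max(1, round(num_noise / mean_span_length))
    spans = []
    for i in range(num_spans):
        lo = i * num_words // num_spans
        hi = (i + 1) * num_words // num_spans
        if hi - lo < 2:
            continue  # chunk too small to mask anything safely
        length = min(mean_span_length, hi - lo - 1)
        start = rng.randint(lo, hi - 1 - length)
        spans.append((start, start + length))
    return spans


# Fixed spans, so the result is deterministic
words = "The cute dog walks in the park".split()
source, target = apply_span_corruption(words, [(1, 3), (5, 6)])
print(source)  # The <extra_id_0> walks in <extra_id_1> park
print(target)  # <extra_id_0> cute dog <extra_id_1> the <extra_id_2>

# Random spans for a longer text
long_words = [f"w{i}" for i in range(40)]
print(apply_span_corruption(long_words, sample_spans(len(long_words), seed=0)))
```

The resulting strings can then be tokenized with the T5 tokenizer and passed to `T5ForConditionalGeneration` in PyTorch as `input_ids` and `labels`. Keep in mind that the pretrained T5 vocabulary has a limited number of sentinel tokens, so long documents should be chunked before corruption.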

Training scripts are also available, as the Hugging Face documentation says:

If you’re interested in pre-training T5 on a new corpus, check out the script in the Examples directory.

To train the T5 tokenizer vocab on your specific domain, this script should help:

Source: Example scripts on the T5 page

pre-training: the script allows you to further pre-train T5 or pre-train T5 from scratch on your own data. The script allows you to further train a T5 tokenizer or train a T5 Tokenizer from scratch on your own data. Note that Flax (a neural network library on top of JAX) is particularly useful to train on TPU hardware.