Preprocessing for T5 Denoising

Hi all!

I’m trying to do continued pre-training on T5. Basically, I’m adapting the model to new domain data before fine-tuning, and I want to preprocess that data as closely as possible to how T5 does it during pre-training (i.e., randomly corrupt 15% of tokens, replace each contiguous corrupted span with a single sentinel token, reconstruct the corrupted spans in the target sequence, and pack sequences together into training examples of length 512).
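To make sure I understand the target format, here’s a toy example of what I think a corrupted input/target pair should look like (a made-up sentence of mine, not from the paper or the codebase):

```python
# Toy illustration of the span-corruption format (made-up example).
# Original sentence: "the quick brown fox jumps over the lazy dog"
# Each contiguous corrupted span is replaced by one sentinel token in the
# input, and the target lists each sentinel followed by the span it
# replaced, ending with a final unused sentinel.
corrupted_input = "the quick <extra_id_0> jumps over <extra_id_1> dog"
target = "<extra_id_0> brown fox <extra_id_1> the lazy <extra_id_2>"
```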

I know that there’s a TensorFlow implementation of this from Google (text-to-text-transfer-transformer/preprocessors.py at d72bd861de901d3269f45ec33c6ca6acd18b10b8 · google-research/text-to-text-transfer-transformer · GitHub), but has anyone implemented this in PyTorch for use with Hugging Face models? I’m hoping to make something like the linked span_corruption function work on torch tensors of token ids rather than on a tf.data.Dataset object.
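For what it’s worth, here’s the rough kind of thing I’m imagining, operating on a 1-D tensor of token ids. It’s only a sketch: the span placement is much more naive than the real span_corruption (spans can touch or overlap), there’s no sequence packing, and I’m assuming the Hugging Face convention that <extra_id_0> has the highest sentinel id and later sentinels count downward:

```python
import torch


def random_spans_noise_mask(length, noise_density=0.15, mean_span_length=3.0):
    # Boolean mask over positions to corrupt. Simplified: drop spans at
    # random start positions until roughly noise_density of tokens are
    # covered; the real T5 preprocessor partitions the sequence much more
    # carefully, so treat this as an approximation.
    num_noise = max(1, round(length * noise_density))
    num_spans = max(1, round(num_noise / mean_span_length))
    span_length = max(1, num_noise // num_spans)
    mask = torch.zeros(length, dtype=torch.bool)
    for start in torch.randperm(length)[:num_spans].tolist():
        mask[start : start + span_length] = True
    return mask


def corrupt(input_ids, noise_mask, first_sentinel_id):
    # Build (corrupted_input, target) from token ids and a noise mask.
    # first_sentinel_id is the id of <extra_id_0>; later sentinels are
    # first_sentinel_id - 1, - 2, ... (HF T5 tokenizer convention).
    inputs, targets = [], []
    sentinel = first_sentinel_id
    prev_noise = False
    for tok, is_noise in zip(input_ids.tolist(), noise_mask.tolist()):
        if is_noise:
            if not prev_noise:          # start of a new corrupted span
                inputs.append(sentinel)
                targets.append(sentinel)
                sentinel -= 1
            targets.append(tok)
        else:
            inputs.append(tok)
        prev_noise = is_noise
    targets.append(sentinel)            # closing sentinel, as in T5 targets
    return torch.tensor(inputs), torch.tensor(targets)


# Hypothetical usage with an HF tokenizer:
# from transformers import T5Tokenizer
# tok = T5Tokenizer.from_pretrained("t5-base")
# ids = torch.tensor(tok.encode(text, add_special_tokens=False))
# inp, tgt = corrupt(ids, random_spans_noise_mask(len(ids)),
#                    tok.convert_tokens_to_ids("<extra_id_0>"))
```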

Thanks!

Hey @amueller, I haven’t tried this myself, but it seems that pre-training T5 can be done by using the sentinel tokens in the tokenizer, as described here: T5 — transformers 4.5.0.dev0 documentation
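For example, adapting the snippet from that docs page, the corrupted input and target are just strings containing sentinel tokens, and the denoising loss comes straight out of the model:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# corrupted input: sentinel tokens stand in for the masked spans
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park",
                      return_tensors="pt").input_ids
# target: each sentinel followed by the span it replaced
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>",
                   return_tensors="pt").input_ids

loss = model(input_ids=input_ids, labels=labels).loss
```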