I’m trying to perform continued pre-training on T5. Basically, I’m doing domain adaptation to new data before fine-tuning, and I want to make sure that I’m preprocessing data as similarly as possible to how T5 does it during pre-training (i.e., randomly corrupt 15% of tokens, pack sequences together for training examples of length 512, mask contiguous corrupted spans, reconstruct corrupted spans in target sequence).
I know that there’s a TensorFlow implementation of this from Google (text-to-text-transfer-transformer/preprocessors.py at d72bd861de901d3269f45ec33c6ca6acd18b10b8 · google-research/text-to-text-transfer-transformer · GitHub), but has anyone implemented this in PyTorch for use with Hugging Face models? I’m hoping to make something like the linked span_corruption function work with torch tensors of tokenized text, rather than a tf.Dataset object.
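To frame what I mean, here's a rough PyTorch sketch of the behavior I'm after. This is a simplified version of the TF logic, not a faithful port: the function names and hard-coded defaults (15% noise density, mean span length 3, as in the T5 paper) are my own, and the sentinel handling assumes Hugging Face's `<extra_id_*>` convention, where sentinel IDs count downward from `<extra_id_0>`:

```python
import torch

def random_spans_noise_mask(length, noise_density=0.15, mean_span_length=3.0):
    """Boolean mask marking noise tokens: simplified take on T5's
    random_spans_noise_mask. Alternates non-noise / noise spans,
    starting with a non-noise span."""
    num_noise = max(1, round(length * noise_density))
    num_spans = max(1, round(num_noise / mean_span_length))
    num_nonnoise = length - num_noise

    def random_segment(total, n):
        # Split `total` items into `n` non-empty segments of random lengths.
        cuts = torch.randperm(total - 1)[: n - 1].sort().values + 1
        bounds = torch.cat([torch.tensor([0]), cuts, torch.tensor([total])])
        return bounds[1:] - bounds[:-1]

    noise_lens = random_segment(num_noise, num_spans)
    nonnoise_lens = random_segment(num_nonnoise, num_spans)
    mask = torch.zeros(length, dtype=torch.bool)
    pos = 0
    for nn, n in zip(nonnoise_lens, noise_lens):
        pos += nn.item()
        mask[pos : pos + n.item()] = True
        pos += n.item()
    return mask

def corrupt(input_ids, noise_mask, first_sentinel_id):
    """Replace each contiguous masked span in the input with one sentinel
    token, and build the target as sentinel + original span tokens,
    ending with a final sentinel (as in Figure 2 of the T5 paper)."""
    inputs, targets = [], []
    sentinel = first_sentinel_id
    prev_noise = False
    for tok, m in zip(input_ids.tolist(), noise_mask.tolist()):
        if m:
            if not prev_noise:  # start of a new noise span
                inputs.append(sentinel)
                targets.append(sentinel)
                sentinel -= 1  # HF sentinel IDs count down: <extra_id_0>, <extra_id_1>, ...
            targets.append(tok)
        else:
            inputs.append(tok)
        prev_noise = m
    targets.append(sentinel)  # closing sentinel
    return torch.tensor(inputs), torch.tensor(targets)
```

For the first sentinel I'd pass something like `tokenizer.convert_tokens_to_ids("<extra_id_0>")`. Sequence packing to length 512 would still need to happen before this step, and this version loops in Python rather than staying fully vectorized, so it's a starting point at best.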