Preprocessing for T5 Denoising

amueller · May 20, 2021, 2:19am

Hi all!

I’m trying to perform continued pre-training on T5. Basically, I’m doing domain adaptation to new data before fine-tuning, and I want to make sure that I’m preprocessing data as similarly as possible to how T5 does it during pre-training (i.e., randomly corrupt 15% of tokens, pack sequences together for training examples of length 512, mask contiguous corrupted spans, reconstruct corrupted spans in target sequence).

I know that Google has a TensorFlow implementation of this (preprocessors.py in google-research/text-to-text-transfer-transformer on GitHub, at commit d72bd861de901d3269f45ec33c6ca6acd18b10b8), but has anyone implemented it in PyTorch for use with Hugging Face models? I’m hoping to make something like the linked span_corruption function work with torch tensors composed of tokenized text, rather than a tf.data.Dataset object.

Thanks!

lewtun · May 20, 2021, 12:34pm

Hey @amueller, I haven’t tried this myself, but it seems that pretraining T5 can be done using the sentinel tokens in the tokenizer, as described on the T5 page of the transformers 4.5.0.dev0 documentation.
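For anyone landing here, this is a rough, framework-free sketch of the sentinel-token idea applied to a list of token IDs. It is not the Google span_corruption code, only a simplified approximation of the behaviour described in the question (corrupt about 15% of tokens in contiguous spans, replace each span with a sentinel in the input, and reconstruct the spans in the target). The function names, the mean span length of 3, and the defaults first_sentinel_id=32099 and eos_id=1 are assumptions for illustration; check them against your own tokenizer before relying on them. Packing sequences to length 512 is not covered.

```python
import random


def random_segmentation(num_items, num_segments, rng):
    """Split num_items into num_segments non-empty, randomly sized segments."""
    cuts = sorted(rng.sample(range(1, num_items), num_segments - 1))
    bounds = [0] + cuts + [num_items]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def random_spans_noise_mask(length, noise_density=0.15,
                            mean_span_length=3.0, rng=random):
    """Return a list of booleans; True marks tokens inside a corrupted span.

    Simplified: the mask always starts with a kept span and ends with a
    corrupted one. Requires length >= 2.
    """
    num_noise = min(max(round(length * noise_density), 1), length - 1)
    num_keep = length - num_noise
    num_spans = max(round(num_noise / mean_span_length), 1)
    num_spans = min(num_spans, num_noise, num_keep)
    noise_lengths = random_segmentation(num_noise, num_spans, rng)
    keep_lengths = random_segmentation(num_keep, num_spans, rng)
    mask = []
    for keep, noise in zip(keep_lengths, noise_lengths):
        mask += [False] * keep + [True] * noise
    return mask


def span_corrupt(token_ids, noise_mask, first_sentinel_id=32099, eos_id=1):
    """Build (inputs, targets) from token IDs and a noise mask.

    Each corrupted span becomes one sentinel in the inputs; the targets
    hold the same sentinel followed by the tokens that were removed.
    Sentinel IDs count downwards from first_sentinel_id (an assumption
    about the tokenizer's vocabulary layout).
    """
    inputs, targets = [], []
    sentinel = first_sentinel_id
    prev_noise = False
    for tok, is_noise in zip(token_ids, noise_mask):
        if is_noise:
            if not prev_noise:  # a new corrupted span starts here
                inputs.append(sentinel)
                targets.append(sentinel)
                sentinel -= 1
            targets.append(tok)
        else:
            inputs.append(tok)
        prev_noise = is_noise
    inputs.append(eos_id)
    targets.append(eos_id)
    return inputs, targets


# Usage with a hand-written mask so the result is easy to follow.
tokens = [10, 11, 12, 13, 14, 15, 16, 17]
mask = [False, False, True, True, False, False, False, True]
inputs, targets = span_corrupt(tokens, mask)
print(inputs)   # [10, 11, 32099, 14, 15, 16, 32098, 1]
print(targets)  # [32099, 12, 13, 32098, 17, 1]
```

The returned lists can be wrapped with torch.tensor(...) and passed to a Hugging Face T5 model as input_ids and labels. Note that the example in the transformers docs also ends the labels with one extra trailing sentinel, which this sketch leaves out.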
