T5 masking - spans of text tokens or encoded tokens?

Hi - I am trying to fine-tune T5 on a domain-specific corpus of text using unsupervised denoising training, as shown in the Training section of the T5 model documentation. I am assuming this is not yet a standard training/fine-tuning method within the HuggingFace library (please tell me if it is! :-))

Probably a stupid question (but I wanted to check!): am I correct in assuming that the “token spans” to be replaced with <extra_id_0>, <extra_id_1>, etc. refer to the text tokens and not the encoded tokens? For example, in a medical context, if I have the text ‘For hayfever, I take loratadine’ and I mask loratadine, am I masking one token (‘loratadine’) or five tokens (the t5-small tokenizer encodes ‘loratadine’ into 5 sub-tokens, not 1, since the whole word is not in its vocab)?
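
For reference, here is the sub-word split I am referring to (a minimal sketch, assuming the t5-small checkpoint and the transformers library):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# 'loratadine' is not in the vocab, so it is split into several sub-word pieces
print(tokenizer.tokenize("For hayfever, I take loratadine"))

# Masking the whole word with a single sentinel token would look like this;
# the sentinel <extra_id_0> is itself a single token in the T5 vocab
masked = "For hayfever, I take <extra_id_0>"
print(tokenizer(masked).input_ids)
```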

So token spans refer to text token spans, and not to encoded token spans?
