Whole-word masking for T5

I want to write a custom data collator using the T5 tokenizer that masks whole words.
The thing is, it is difficult to know which tokens make up a word. Some words are a single token, while others are split into more than one token.

My initial idea is to find, for every word, the start and end indices of the tokens in the list that make up that word, but I don’t know whether that is a good approach.

How should I go about this task?


When you’re dealing with tokenization and need to mask whole words using the T5 tokenizer, one approach is to use the tokenized output and identify the boundaries of each word in the tokenized sequence. The T5 tokenizer in the transformers library provides a method called tokenize_plus that can be helpful for this task.
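One way to find those boundaries is sketched below. It assumes the pieces come from a SentencePiece-based tokenizer such as T5's (for example the output of tokenizer.tokenize(text)), where the first piece of every word starts with the "▁" marker (U+2581). group_pieces_into_words is a hypothetical helper name, and the piece list is illustrative rather than real tokenizer output.

```python
WORD_START = "\u2581"  # SentencePiece word-boundary marker, rendered as "▁"


def group_pieces_into_words(pieces):
    """Return one (start, end) index pair per word, with end exclusive."""
    spans = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not spans:
            # A piece carrying the marker opens a new word
            spans.append((i, i + 1))
        else:
            # A piece without the marker continues the current word
            start, _ = spans[-1]
            spans[-1] = (start, i + 1)
    return spans


# Illustrative pieces (an assumption, not real tokenizer output)
pieces = ["▁Token", "ization", "▁is", "▁fun", "."]
spans = group_pieces_into_words(pieces)
print(spans)  # [(0, 2), (2, 3), (3, 5)]
```

Each pair can then be masked as a unit via pieces[start:end]. Under this rule, punctuation that has no marker of its own is attached to the preceding word.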

Here’s a step-by-step guide on how you can create a custom data collator to mask whole words using the T5 tokenizer:
import random

from transformers import T5Tokenizer


class CustomDataCollator:
    """Mask whole words T5-style: each masked run of words becomes one sentinel token."""

    def __init__(self, tokenizer, mlm_probability=0.15):
        self.tokenizer = tokenizer
        self.mlm_probability = mlm_probability

    def mask_words(self, text):
        # Decide per whitespace-separated word, before tokenizing, so every
        # sub-word piece of a chosen word is masked together.
        words = text.split()
        masked = [random.random() < self.mlm_probability for _ in words]
        if words and not any(masked):
            masked[random.randrange(len(words))] = True  # mask at least one word

        source, target, sentinel = [], [], 0
        for i, (word, is_masked) in enumerate(zip(words, masked)):
            if not is_masked:
                source.append(word)
                continue
            if i == 0 or not masked[i - 1]:
                # A new masked run starts. The T5 tokenizer has no mask token,
                # so use the next sentinel (<extra_id_0>, <extra_id_1>, ...).
                # Only 100 sentinels exist by default, so keep texts short enough.
                source.append(f"<extra_id_{sentinel}>")
                target.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            target.append(word)
        target.append(f"<extra_id_{sentinel}>")  # closing sentinel ends the target
        return " ".join(source), " ".join(target)

    def __call__(self, examples):
        sources, targets = zip(*(self.mask_words(ex["text"]) for ex in examples))
        batch = self.tokenizer(list(sources), return_tensors="pt", padding=True)
        labels = self.tokenizer(list(targets), return_tensors="pt", padding=True)["input_ids"]
        # Label positions set to -100 are ignored by the loss, so hide the padding
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": batch["input_ids"],
            "attention_mask": batch["attention_mask"],
            "labels": labels,
        }


# Example usage
tokenizer = T5Tokenizer.from_pretrained("t5-small")
data_collator = CustomDataCollator(tokenizer)

# Dummy data for demonstration
dummy_data = [{"text": "This is an example sentence."}, {"text": "Another example for testing."}]

# Apply the custom data collator
masked_batch = data_collator(dummy_data)

print("Input IDs:", masked_batch["input_ids"])
print("Labels:", masked_batch["labels"])

Hi, thanks for the reply!
I might be missing something, but where is the use of tokenize_plus?
Also, since we are dealing with a causal model, should mlm be set to False?
In the case where I want to pre-train a T5 model and simply want to mask words, what are the labels supposed to be?