I keep receiving the following warning when training the deberta-v3 model.
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
I want to pad the inputs dynamically to the longest sequence in each batch, so I encode the text first and then use a DataCollatorWithPadding for the padding. How can I combine these two steps into a single __call__ while keeping the dynamic padding? I sketched what I have in mind after my current code below.
My code is as follows:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import DataCollatorWithPadding


def prepare_input(cfg, text):
    # Tokenize a single text; no padding here, padding is done later by the collator
    inputs = cfg.tokenizer(
        text,
        return_tensors=None,
        return_token_type_ids=False,
        add_special_tokens=True,
    )
    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)
    return inputs
class TrainDataset(Dataset):
    def __init__(self, cfg, df):
        self.cfg = cfg
        self.texts = df["full_text"].values
        self.labels = df[cfg.target_cols].values

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, i):
        item = prepare_input(self.cfg, self.texts[i])
        item["labels"] = torch.tensor(self.labels[i], dtype=torch.float)
        return item
collate_fn = DataCollatorWithPadding(tokenizer=CFG.tokenizer, padding="longest")

trn_ds = TrainDataset(CFG, trn_folds)
trn_loader = DataLoader(
    trn_ds,
    batch_size=CFG.batch_size,
    shuffle=True,
    num_workers=CFG.num_workers,
    pin_memory=True,
    drop_last=True,
    collate_fn=collate_fn,
)
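Is something like the following what the warning is pointing at? This is an untested sketch of what I have in mind: the Dataset returns the raw text and labels, and a custom collate function tokenizes the whole batch, so a single __call__ with padding="longest" does both the encoding and the dynamic padding. RawTextDataset and tokenize_collate are just placeholder names I made up; CFG and trn_folds are the same objects as above.

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class RawTextDataset(Dataset):
    def __init__(self, cfg, df):
        self.texts = df["full_text"].values
        self.labels = df[cfg.target_cols].values

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, i):
        # Return raw text; tokenization happens later, once per batch
        return self.texts[i], self.labels[i]


def tokenize_collate(batch):
    texts, labels = zip(*batch)
    # One tokenizer __call__ encodes the whole batch and pads it
    # dynamically to the longest sequence in that batch
    inputs = CFG.tokenizer(
        list(texts),
        add_special_tokens=True,
        return_token_type_ids=False,
        padding="longest",
        return_tensors="pt",
    )
    inputs["labels"] = torch.tensor(np.stack(labels), dtype=torch.float)
    return inputs


trn_loader = DataLoader(
    RawTextDataset(CFG, trn_folds),
    batch_size=CFG.batch_size,
    shuffle=True,
    num_workers=CFG.num_workers,
    pin_memory=True,
    drop_last=True,
    collate_fn=tokenize_collate,
)

I'm mainly unsure whether moving the tokenization into the collate function like this is the intended pattern, or whether it causes problems with num_workers > 0.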