DataCollatorWithPadding does not take the block_sparse threshold of a model into account. If I use it to pad my batches to the longest sequence in the batch, but all sequences in the batch are sufficiently short, block_sparse attention may be deactivated, causing out-of-memory issues. Why does something like the collator below not exist, which would let a user set a minimum padded length? Is it because of the interplay between pad_to_multiple_of and a minimum length? Or did I miss it in the documentation somewhere?
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

import torch
from transformers import PreTrainedTokenizerBase
from transformers.utils import PaddingStrategy


@dataclass
class DataCollatorWithMinimumPadding:
    tokenizer: PreTrainedTokenizerBase
    min_length: int
    pad_to_multiple_of: Optional[int] = 8
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        longest_sequence = max(len(feature["input_ids"]) for feature in features)
        if longest_sequence <= self.min_length:
            # Batch is too short: force padding past the minimum so block_sparse stays active.
            self.max_length = self.min_length + self.pad_to_multiple_of
            self.padding = "max_length"
        else:
            # Batch is long enough: fall back to ordinary dynamic padding.
            self.max_length = None
            self.padding = "longest"
        batch = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # Rename label columns the same way DataCollatorWithPadding does.
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch
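
For what it's worth, this is roughly how I imagine using it with a BigBird checkpoint and the Trainer. The min_length of 1024 is only an illustrative guess, since the exact sequence length at which block_sparse falls back to full attention depends on the model's block_size and num_random_blocks, and train_dataset stands in for an already tokenized dataset:

from transformers import (
    AutoTokenizer,
    BigBirdForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base",
    attention_type="block_sparse",
)

# 1024 is a placeholder; pick a value above the model's block_sparse fallback threshold.
data_collator = DataCollatorWithMinimumPadding(tokenizer=tokenizer, min_length=1024)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2),
    train_dataset=train_dataset,  # assumed: a tokenized dataset with "input_ids" and "label" columns
    data_collator=data_collator,
)

Other than the minimum-length branch, it calls tokenizer.pad exactly the way DataCollatorWithPadding does, which is why I was surprised not to find an option like this built in.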