DataCollatorWithPadding does not take the block_sparse threshold of a model into account. If I use it to pad my batches to the longest sequence in the batch, but all sequences in the batch are sufficiently short, block_sparse attention may be deactivated, causing out-of-memory issues. Why does something like the collator below not exist, which would let a user set a minimum padded length? Is it because of the interplay between pad_to_multiple_of and a minimum length? Or did I miss it in the documentation somewhere?
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

import torch
from transformers import PreTrainedTokenizerBase
from transformers.utils import PaddingStrategy


@dataclass
class DataCollatorWithMinimumPadding:
    tokenizer: PreTrainedTokenizerBase
    min_length: int
    pad_to_multiple_of: Optional[int] = 8
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        longest_sequence = max(len(feature["input_ids"]) for feature in features)
        if longest_sequence <= self.min_length:
            # Batch is too short: force padding past the minimum so block_sparse stays active.
            self.max_length = self.min_length + self.pad_to_multiple_of
            self.padding = "max_length"
        else:
            # Batch is long enough: fall back to ordinary dynamic padding.
            self.max_length = None
            self.padding = "longest"
        batch = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # Rename label columns the same way DataCollatorWithPadding does.
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch
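
For what it's worth, this is roughly how I imagine using it with a BigBird checkpoint and the Trainer. The min_length of 1024 is only an illustrative guess, since the exact sequence length at which block_sparse falls back to full attention depends on the model's block_size and num_random_blocks, and train_dataset stands in for an already tokenized dataset:

from transformers import (
    AutoTokenizer,
    BigBirdForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base",
    attention_type="block_sparse",
)

# 1024 is a placeholder; pick a value above the model's block_sparse fallback threshold.
data_collator = DataCollatorWithMinimumPadding(tokenizer=tokenizer, min_length=1024)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2),
    train_dataset=train_dataset,  # assumed: a tokenized dataset with "input_ids" and "label" columns
    data_collator=data_collator,
)

Other than the minimum-length branch, it calls tokenizer.pad exactly the way DataCollatorWithPadding does, which is why I was surprised not to find an option like this built in.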