On-the-fly splitting for datasets with long texts

I’ve recently been looking at document analysis use cases (e.g. LayoutLM, LayoutLMv2, LayoutXLM and others) involving input representations that are a bit more unusual than plain 1D text sequences, and I’m struggling a bit to wrap my head around where in the API flow the different data prep steps are intended to happen.

In particular, if I understand correctly, a DataCollator is expected to receive the examples of a sampled batch and return prepared tensors for that batch (typically handling the tokenization), right? It doesn’t seem designed or intended for a collator to return a different batch dimension, with the extra samples automatically re-queued to standardize the batch size afterwards.
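For context, this is the contract I’m assuming a collator has to satisfy (just a minimal sketch; the fixed-length, already-tokenized features are my assumption):

```python
from typing import Any, Dict, List
import torch


def simple_collator(features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
    # Receives exactly the examples the sampler drew for this batch and is
    # expected to return tensors whose first dimension == len(features).
    # (Assumes every feature list in the batch already has the same length.)
    return {key: torch.tensor([f[key] for f in features]) for key in features[0]}
```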

But for some use cases (for example token classification or MLM) on long documents, it may be much more useful to split annotated texts to fit the model’s max_seq_len than to truncate them.
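By splitting I mean something like the following (a sketch only; the tokenizer, `max_length` and `stride` values are arbitrary placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = " ".join(["token"] * 5000)  # stand-in for a long document

# Instead of truncating, split into overlapping max_length windows.
encoding = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    stride=128,                      # overlap between consecutive chunks
    return_overflowing_tokens=True,  # return every chunk, not just the first
)

print(len(encoding["input_ids"]))  # one document -> several sequences
```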

I find that splitting accurately/optimally already requires tokenization, and making sure each generated sample carries the labels for the content it actually contains requires doing the label-token alignment up front… So I end up doing all of this in the dataset and using a dummy/pass-through collator instead of the pre-built ones.
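Concretely, what I’m doing now looks roughly like this (a sketch only; the toy dataset, column names and label scheme are just illustrative assumptions):

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Toy stand-in for a real annotated dataset: one long, pre-split document.
raw_dataset = Dataset.from_dict({
    "words": [["hello", "world"] * 600],
    "ner_tags": [[0, 1] * 600],
})


def split_and_align(examples):
    # Tokenize with overflow so each long document yields several chunks.
    tokenized = tokenizer(
        examples["words"],
        is_split_into_words=True,
        truncation=True,
        max_length=512,
        stride=128,
        return_overflowing_tokens=True,
        padding="max_length",
    )
    labels = []
    for i in range(len(tokenized["input_ids"])):
        # Map each chunk back to its source example to fetch the word labels.
        src = tokenized["overflow_to_sample_mapping"][i]
        word_labels = examples["ner_tags"][src]
        word_ids = tokenized.word_ids(batch_index=i)
        labels.append([-100 if w is None else word_labels[w] for w in word_ids])
    tokenized["labels"] = labels
    return tokenized


# batched=True allows map() to return more rows than it received,
# so the splitting happens in the dataset rather than in the collator.
dataset = raw_dataset.map(
    split_and_align,
    batched=True,
    remove_columns=raw_dataset.column_names,
)
print(len(dataset))  # several chunks produced from one original document
```

With everything pre-tokenized and padded like this, the collator is reduced to just stacking tensors, which is why the pre-built collators end up with nothing to do.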

This seems like an anti-pattern though (especially since some tokenizers don’t like running in multi-process DataLoader workers by default). Is there a more standard/correct way to approach this?
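(The multiprocessing issue I mean is the fast-tokenizer fork warning; I’m currently working around it like this, though I’m not sure it’s the intended fix:)

```python
import os

# Fast (Rust) tokenizers warn about being forked after their parallelism has
# already been used; disabling it avoids the warning in DataLoader workers.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```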
