On-the-fly splitting for datasets with long texts

I’ve recently been looking at document analysis use cases (e.g. LayoutLM, LayoutLMv2, LayoutXLM and others) involving input representations that are a bit more unusual than plain 1D text sequences, and I’m struggling a bit to wrap my head around where in the API flow the different data prep steps are intended to happen.

In particular, if I understand correctly, a DataCollator is expected to receive the examples of a sampled batch and return prepared tensors for that batch (typically handling the tokenization), right? It doesn’t seem designed or intended for a collator to return a different batch dimension, with the extra samples automatically re-queued to standardize the batch size afterwards.
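For context, this is the contract I’m assuming a collator has to satisfy (just a minimal sketch; the fixed-length, already-tokenized features are my assumption):

```python
from typing import Any, Dict, List
import torch


def simple_collator(features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
    # Receives exactly the examples the sampler drew for this batch and is
    # expected to return tensors whose first dimension == len(features).
    # (Assumes every feature list in the batch already has the same length.)
    return {key: torch.tensor([f[key] for f in features]) for key in features[0]}
```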

But for some use cases (for example token classification or MLM) on long documents, it may be much more useful to split annotated texts to fit the model’s max_seq_len than to truncate them.
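By splitting I mean something like the following (a sketch only; the tokenizer, `max_length` and `stride` values are arbitrary placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = " ".join(["token"] * 5000)  # stand-in for a long document

# Instead of truncating, split into overlapping max_length windows.
encoding = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    stride=128,                      # overlap between consecutive chunks
    return_overflowing_tokens=True,  # return every chunk, not just the first
)

print(len(encoding["input_ids"]))  # one document -> several sequences
```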

I find that splitting accurately/optimally already requires tokenization, and making sure each generated sample carries the labels for the content it actually contains requires doing the label-token alignment up front… So I end up doing all of this in the dataset and using a dummy/pass-through collator instead of the pre-built ones.
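Concretely, what I’m doing now looks roughly like this (a sketch only; the toy dataset, column names and label scheme are just illustrative assumptions):

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Toy stand-in for a real annotated dataset: one long, pre-split document.
raw_dataset = Dataset.from_dict({
    "words": [["hello", "world"] * 600],
    "ner_tags": [[0, 1] * 600],
})


def split_and_align(examples):
    # Tokenize with overflow so each long document yields several chunks.
    tokenized = tokenizer(
        examples["words"],
        is_split_into_words=True,
        truncation=True,
        max_length=512,
        stride=128,
        return_overflowing_tokens=True,
        padding="max_length",
    )
    labels = []
    for i in range(len(tokenized["input_ids"])):
        # Map each chunk back to its source example to fetch the word labels.
        src = tokenized["overflow_to_sample_mapping"][i]
        word_labels = examples["ner_tags"][src]
        word_ids = tokenized.word_ids(batch_index=i)
        labels.append([-100 if w is None else word_labels[w] for w in word_ids])
    tokenized["labels"] = labels
    return tokenized


# batched=True allows map() to return more rows than it received,
# so the splitting happens in the dataset rather than in the collator.
dataset = raw_dataset.map(
    split_and_align,
    batched=True,
    remove_columns=raw_dataset.column_names,
)
print(len(dataset))  # several chunks produced from one original document
```

With everything pre-tokenized and padded like this, the collator is reduced to just stacking tensors, which is why the pre-built collators end up with nothing to do.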

This seems like an anti-pattern though (especially since some tokenizers don’t like running in multi-process DataLoader workers by default). Is there a more standard/correct way to approach this?
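(The multiprocessing issue I mean is the fast-tokenizer fork warning; I’m currently working around it like this, though I’m not sure it’s the intended fix:)

```python
import os

# Fast (Rust) tokenizers warn about being forked after their parallelism has
# already been used; disabling it avoids the warning in DataLoader workers.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```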
