I’ve recently been working on a training script for multi-modal models (LayoutLMv2/LayoutXLM), similar to this AWS Sample for LayoutLMv1 (which I wrote a while ago), while trying to avoid keeping the entire dataset’s image pixel values in memory.
Previously, it seemed to be fine to do tokenization within the custom DataCollator.__call__ method (i.e. in the data loader worker processes, as per here), so long as the TOKENIZERS_PARALLELISM=false environment variable was set.
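For concreteness, this is how I set it (the key point being that the variable has to be set before the fast tokenizer is first exercised, since the Rust tokenizer's thread pool state is what interacts badly with forked DataLoader workers):

```python
import os

# Must be set before the fast (Rust-backed) tokenizer is first used in the
# parent process, otherwise its parallelism state is inherited by the forked
# data loader workers and can hang them.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```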
The general pattern of this training script today is:
- The initialization of the dataset scans through the entire corpus to count how many samples are present (since some documents/pages might have to be split into multiple sequences, depending on model params)
- The dataset stores some relatively compact in-memory representation of where to fetch from for each “sample”
- By doing this, the objective was to keep the convenience of a random-access Dataset (no need to worry about manual shuffling or duplicated loading as with IterableDataset), while keeping the memory footprint small to balance performance vs scalability.
- In the custom DataCollator, the tokenizer is called (plus whatever other manipulations are needed) to actually construct each batch from the in-memory representation.
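To make the pattern concrete, here is a minimal stdlib-only sketch of the indexing idea. All names (CompactDataset, SampleRef, the toy corpus, the character-based splitting) are illustrative stand-ins, not my actual code or any transformers API; the real version calls the tokenizer in the collator instead:

```python
from dataclasses import dataclass
from typing import List

# Toy stand-in corpus: doc id -> list of page texts.
CORPUS = {"doc0": ["page text " * 50], "doc1": ["short page"]}
MAX_SEQ_LEN = 100  # pretend model limit, measured in characters for this sketch


def n_sequences(text: str) -> int:
    # A long page may have to be split into several model sequences.
    return max(1, -(-len(text) // MAX_SEQ_LEN))  # ceil division


@dataclass
class SampleRef:
    """Compact pointer to one training sample: which doc/page/split."""
    doc_id: str
    page: int
    seq: int


class CompactDataset:
    """Random-access dataset that stores only compact refs, not pixel data."""

    def __init__(self, corpus):
        self.corpus = corpus
        # Init-time scan of the whole corpus to count samples up front.
        self.index: List[SampleRef] = []
        for doc_id, pages in corpus.items():
            for p, text in enumerate(pages):
                for s in range(n_sequences(text)):
                    self.index.append(SampleRef(doc_id, p, s))

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i) -> SampleRef:
        return self.index[i]  # heavy loading is deferred to the collator


class Collator:
    """Resolves compact refs into an actual batch at load time (in workers)."""

    def __init__(self, corpus):
        self.corpus = corpus

    def __call__(self, refs):
        batch = []
        for r in refs:
            text = self.corpus[r.doc_id][r.page]
            start = r.seq * MAX_SEQ_LEN
            # Real version: tokenize (and load images) here instead.
            batch.append(text[start:start + MAX_SEQ_LEN])
        return batch


ds = CompactDataset(CORPUS)
batch = Collator(CORPUS)([ds[0], ds[len(ds) - 1]])
```

The point of the split is that __getitem__ stays cheap and picklable, and all the expensive work happens per-batch inside the collator.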
But when trying to extend this approach to image loading in the DataCollator, I’m seeing deadlocks: the first few batches get loaded, but then CPU and GPU utilization drop to near zero and nothing really happens until the job is terminated.
So I’m wondering: is it possible to use e.g. LayoutLMv2FeatureExtractor inside a data loader worker? Is there some trick I’m missing to avoid the deadlock? Or is this whole pattern misguided, and is there a better method for customizable data loading that doesn’t require holding the entire, fully-pre-processed dataset in memory at once?