I’ve recently been working on a training script for multi-modal models (LayoutLMv2/LayoutXLM), similar to this AWS Sample for LayoutLMv1 (which I wrote a while ago), while trying to avoid keeping the entire dataset’s image pixel values in memory.
Previously, it seemed to be fine to do tokenization within the custom DataCollator.__call__ method (i.e. in the data loader worker processes, as per here), so long as the TOKENIZERS_PARALLELISM=false environment variable was set.
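For concreteness, this is how I set it (the key point being that the variable has to be set before the fast tokenizer is first exercised, since the Rust tokenizer's thread pool state is what interacts badly with forked DataLoader workers):

```python
import os

# Must be set before the fast (Rust-backed) tokenizer is first used in the
# parent process, otherwise its parallelism state is inherited by the forked
# data loader workers and can hang them.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```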
The general pattern of this training script today is:
- The initialization of the dataset scans through the entire corpus to count how many samples are present (since some documents/pages might have to be split into multiple sequences, depending on model params)
- The dataset stores some relatively compact in-memory representation of where to fetch from for each “sample”
- By doing this, the objective was to keep the convenience of a random-access Dataset (no need to worry about manual shuffling or duplicated loading as with IterableDataset), while keeping the memory footprint small to balance performance vs scalability.
- In the custom DataCollator, the tokenizer is called (plus whatever other manipulations are needed) to actually construct each batch from the in-memory representation.
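To make the pattern concrete, here is a minimal stdlib-only sketch of the indexing idea. All names (CompactDataset, SampleRef, the toy corpus, the character-based splitting) are illustrative stand-ins, not my actual code or any transformers API; the real version calls the tokenizer in the collator instead:

```python
from dataclasses import dataclass
from typing import List

# Toy stand-in corpus: doc id -> list of page texts.
CORPUS = {"doc0": ["page text " * 50], "doc1": ["short page"]}
MAX_SEQ_LEN = 100  # pretend model limit, measured in characters for this sketch


def n_sequences(text: str) -> int:
    # A long page may have to be split into several model sequences.
    return max(1, -(-len(text) // MAX_SEQ_LEN))  # ceil division


@dataclass
class SampleRef:
    """Compact pointer to one training sample: which doc/page/split."""
    doc_id: str
    page: int
    seq: int


class CompactDataset:
    """Random-access dataset that stores only compact refs, not pixel data."""

    def __init__(self, corpus):
        self.corpus = corpus
        # Init-time scan of the whole corpus to count samples up front.
        self.index: List[SampleRef] = []
        for doc_id, pages in corpus.items():
            for p, text in enumerate(pages):
                for s in range(n_sequences(text)):
                    self.index.append(SampleRef(doc_id, p, s))

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i) -> SampleRef:
        return self.index[i]  # heavy loading is deferred to the collator


class Collator:
    """Resolves compact refs into an actual batch at load time (in workers)."""

    def __init__(self, corpus):
        self.corpus = corpus

    def __call__(self, refs):
        batch = []
        for r in refs:
            text = self.corpus[r.doc_id][r.page]
            start = r.seq * MAX_SEQ_LEN
            # Real version: tokenize (and load images) here instead.
            batch.append(text[start:start + MAX_SEQ_LEN])
        return batch


ds = CompactDataset(CORPUS)
batch = Collator(CORPUS)([ds[0], ds[len(ds) - 1]])
```

The point of the split is that __getitem__ stays cheap and picklable, and all the expensive work happens per-batch inside the collator.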
But when trying to extend this approach to image loading in the DataCollator, I’m seeing deadlocks: the first few batches get loaded, but then CPU and GPU utilization drop to near zero and nothing really happens until the job is terminated.
So I’m wondering: is it possible to use e.g. LayoutLMv2FeatureExtractor inside a data loader worker? Is there some trick I’m missing to avoid the deadlock? Or is this whole pattern misguided, and is there a better method for customizable data loading that doesn’t require holding the entire, fully-pre-processed dataset in memory at once?