When using

```python
dataset.map(
    myfunc,
    num_proc=16,
    keep_in_memory=False,
    cache_file_name='parts.arrow',
    batch_size=16,
    writer_batch_size=16,
)
```
due to the size of my dataset, it results in:

```
/site-packages/datasets/table.py:1421:
table = cls._concat_blocks(blocks, axis=0)
Killed
```
Looking carefully at the code in `.map`, I see that the shards are being created and the `.arrow` files exist after the progress bar reaches 100%; then it goes OOM in `src/datasets/arrow_dataset.py` (datasets 2.21.0) at:
logger.info(f"Concatenating {num_proc} shards")
result = _concatenate_map_style_datasets(transformed_shards)
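If I'm reading `table.py` right, `_concat_blocks` with `axis=0` ends up handing all the shard tables to pyarrow in a single call, roughly like this (a simplified sketch of my understanding, not the actual library code):

```python
import pyarrow as pa

# Rough picture of the failing step: once all workers finish, every
# shard's Arrow table is concatenated in one pyarrow call.
def concat_shard_tables(shard_tables: list) -> pa.Table:
    return pa.concat_tables(shard_tables)
```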
Not sure if it's a bug that writing the individual Arrow files works but combining the shards fails.
Is there any way to resolve this and avoid the OOM in the `_concat_blocks` function? (One idea I had is sketched below, but I'm not sure it actually avoids the same code path.)
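For what it's worth, the workaround I was considering is to load the per-process cache files myself and concatenate them memory-mapped instead of letting `.map` do the merge. The shard file names here are an assumption based on the default rank suffix, so they may not match what actually ends up on disk:

```python
from datasets import Dataset, concatenate_datasets

num_proc = 16

# Assumed naming: the rank suffix is inserted before ".arrow"
# (e.g. parts_00000_of_00016.arrow); adjust to whatever is actually on disk.
shard_files = [f"parts_{rank:05d}_of_{num_proc:05d}.arrow" for rank in range(num_proc)]

# Dataset.from_file memory-maps each cache file rather than loading it into RAM.
shards = [Dataset.from_file(path) for path in shard_files]

combined = concatenate_datasets(shards)
```

Would that behave any differently, or does `concatenate_datasets` go through the same `_concat_blocks` path and hit the same limit?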