[Bug?] Datasets .map goes OOM when concatenating the shards

When using

dataset.map(myfunc, num_proc=16,
    batch_size=16, writer_batch_size=16)

Due to the size of my dataset, it runs out of memory (OOM) at:

  table = cls._concat_blocks(blocks, axis=0)

Looking carefully at the code in .map, I see that the shards are being created and the .arrow files exist after the progress bar reaches 100%. It then goes OOM at this point in datasets/src/datasets/arrow_dataset.py (version 2.21.0, huggingface/datasets on GitHub):

            logger.info(f"Concatenating {num_proc} shards")
            result = _concatenate_map_style_datasets(transformed_shards)

Not sure if it's a bug that writing to the individual arrow files works but combining the shards fails.

Is there any way to resolve this and avoid the OOM in the _concat_blocks function?
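
One possible workaround, sketched here under assumptions and not verified against this dataset: split the dataset manually, map each shard on its own, and save each result to disk, so that .map never has to combine all shards at the end. The helper names (map_shard_by_shard, load_mapped_shards), out_dir, and map_kwargs are hypothetical. Only Dataset.shard, Dataset.map, Dataset.save_to_disk, load_from_disk, and concatenate_datasets come from the datasets library.

```python
import os


def map_shard_by_shard(dataset, func, num_shards, out_dir, **map_kwargs):
    """Map one shard at a time and save each result to its own directory.

    Hypothetical helper: `dataset` is expected to be a datasets.Dataset.
    No num_proc is used here, so .map has no per-process shards to
    concatenate when it finishes.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index in range(num_shards):
        # contiguous=True keeps the original row order within each shard
        shard = dataset.shard(num_shards=num_shards, index=index, contiguous=True)
        mapped = shard.map(func, **map_kwargs)
        path = os.path.join(out_dir, f"shard_{index:05d}")
        mapped.save_to_disk(path)
        paths.append(path)
    return paths


def load_mapped_shards(paths):
    """Reload the saved shards and combine them.

    Assumption: this final concatenate_datasets call may hit the same
    memory limit, in which case the shards can be used one by one instead.
    """
    # Imported lazily so the sharding helper above has no hard dependency.
    from datasets import concatenate_datasets, load_from_disk

    return concatenate_datasets([load_from_disk(path) for path in paths])


# Example call, with dataset and myfunc as in the post above:
# paths = map_shard_by_shard(dataset, myfunc, num_shards=16,
#                            out_dir="mapped_shards",
#                            batch_size=16, writer_batch_size=16)
```

This trades the parallelism of num_proc for a smaller peak memory footprint per step. Whether it actually avoids the OOM depends on where the memory is going, which the traceback above does not show.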

Also asked on Stack Overflow: "How to resolve OOM when .map concatenate the sharded parts?"