When using

```python
dataset.map(
    myfunc,
    num_proc=16,
    keep_in_memory=False,
    cache_file_name='parts.arrow',
    batch_size=16,
    writer_batch_size=16,
)
```
due to the size of my dataset, it results in:

```
/site-packages/datasets/table.py:1421:
table = cls._concat_blocks(blocks, axis=0)
Killed
```
Looking carefully at the code in `.map`, I see that the shards are being created and the `.arrow` files exist after the progress bar reaches 100%; then it goes OOM in `src/datasets/arrow_dataset.py` (datasets 2.21.0) at:
logger.info(f"Concatenating {num_proc} shards")
result = _concatenate_map_style_datasets(transformed_shards)
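If I'm reading `table.py` right, `_concat_blocks` with `axis=0` ends up handing all the shard tables to pyarrow in a single call, roughly like this (a simplified sketch of my understanding, not the actual library code):

```python
import pyarrow as pa

# Rough picture of the failing step: once all workers finish, every
# shard's Arrow table is concatenated in one pyarrow call.
def concat_shard_tables(shard_tables: list) -> pa.Table:
    return pa.concat_tables(shard_tables)
```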
Not sure if it's a bug that writing the individual Arrow files works but combining the shards fails.
Is there any way to resolve this and avoid the OOM in the `_concat_blocks` function? (One idea I had is sketched below, but I'm not sure it actually avoids the same code path.)
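For what it's worth, the workaround I was considering is to load the per-process cache files myself and concatenate them memory-mapped instead of letting `.map` do the merge. The shard file names here are an assumption based on the default rank suffix, so they may not match what actually ends up on disk:

```python
from datasets import Dataset, concatenate_datasets

num_proc = 16

# Assumed naming: the rank suffix is inserted before ".arrow"
# (e.g. parts_00000_of_00016.arrow); adjust to whatever is actually on disk.
shard_files = [f"parts_{rank:05d}_of_{num_proc:05d}.arrow" for rank in range(num_proc)]

# Dataset.from_file memory-maps each cache file rather than loading it into RAM.
shards = [Dataset.from_file(path) for path in shard_files]

combined = concatenate_datasets(shards)
```

Would that behave any differently, or does `concatenate_datasets` go through the same `_concat_blocks` path and hit the same limit?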