I am trying to encode an image dataset of 20,000 images. When the encoding reaches 69%, regardless of the batch size, it fails with an error saying a realloc of size 32 GB failed. I have also tried increasing the batch size, but the error persists. The installed PyArrow version is 10.0.1 and the datasets version is 2.7.1.
Hi! How do you create your dataset?
Right now every dataset is loaded from disk using memory mapping, so it doesn't fill your RAM. However, datasets created from in-memory data currently stay in memory.
So if you used Dataset.from_dict for example, you may want to write your dataset to disk to avoid filling up your RAM. You can use ds.save_to_disk() and reload it with load_from_disk() before calling your map function.
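For reference, a minimal sketch of that workflow. The path, column name, and `encode` function here are placeholders for illustration, not taken from the original thread:

```python
from datasets import Dataset, load_from_disk

# Hypothetical in-memory dataset (placeholder data).
ds = Dataset.from_dict({"image_path": [f"img_{i}.jpg" for i in range(20000)]})
print(ds.cache_files)  # [] -> the dataset lives entirely in RAM

# Write the Arrow data to disk, then reload it memory-mapped.
ds.save_to_disk("my_dataset")
ds = load_from_disk("my_dataset")
print(ds.cache_files)  # non-empty -> now backed by on-disk Arrow files

def encode(batch):
    # Placeholder for the actual image encoding logic.
    return batch

# map() now reads batches via memory mapping instead of holding
# the whole dataset in RAM.
ds = ds.map(encode, batched=True, batch_size=100)
```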
Thank you @lhoestq for the solution. Will definitely try it and see if it works.