I’m creating an HF dataset on an Azure VM by first reading a file with from_generator, doing some preprocessing, and then calling save_to_disk. I’m saving to disk on a blobfuse mount.
In the middle of the save_to_disk operation I was running out of disk space, and I think it is related to HF cache files being stored in the default “local” location instead of on the mount.
Now I’m setting the HF_HOME environment variable to a directory on the mount, right before importing datasets. However, I’m now getting an OSError: no space left on device, even though there is plenty of space and I am pointing the cache to the mount.
The top of my code looks like this:
import os
os.environ["HF_HOME"] = "/mnt/outputs/hf_cache/"
from datasets import …
I am not creating the hf_cache directory myself, though it seems HF creates it automatically.
Any idea what might be happening?
The code fails right at the beginning, just when I call from_generator and before I do anything else.
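For context, here is a minimal sketch of the call that fails (the generator and the output path below are placeholders for my actual file-reading and preprocessing logic, not the real code):

from datasets import Dataset

def gen():
    # placeholder: the real generator reads a file and does the preprocessing
    yield {"text": "example"}

ds = Dataset.from_generator(gen)            # raises the OSError shown below
ds.save_to_disk("/mnt/outputs/my_dataset")  # never reached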
I’ve dug into HF code based on the error message I receive below, and I think I have an idea of what’s going on:
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/datasets/io/generator.py", line 49, in read
    self.builder.download_and_prepare(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/datasets/builder.py", line 875, in download_and_prepare
    raise OSError(
OSError: Not enough disk space. Needed: Unknown size (download: Unknown size, generated: Unknown size, post-processed: Unknown size)
When I use disk_usage from shutil to print the disk space of the cache dir located on the mount, it reports 0 GB. The download_and_prepare function inside the builder then reads this value and concludes there is no space left on the device.
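Concretely, this is the kind of check I mean (using the cache path on the mount):

import shutil

usage = shutil.disk_usage("/mnt/outputs/hf_cache/")
print(usage.total / 1024**3, usage.free / 1024**3)
# on the blobfuse mount this reports 0 GB, even though there is plenty of space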
Is it possible that this is a characteristic of the blobfuse mount?
BTW, in the Hugging Face-related libraries, loading and saving behave differently depending on the library version. One workaround is pinning the libraries to a much older release (for example, one from October last year).
datasets uses shutil.disk_usage() to check whether there is enough space before writing a (potentially huge) dataset.
(Maybe if it reports zero, the check could let the writing begin anyway - it should fail regardless if the free space really is zero under the hood? That part of datasets is open to contributions, btw, if you want to improve it.)
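As an illustration only, here is a sketch of what such a change could look like; this is not the actual datasets internals, and the function name and behavior are assumptions:

import shutil

def has_sufficient_disk_space(needed_bytes: int, directory: str = ".") -> bool:
    # Sketch of a pre-write check that tolerates mounts which report 0 free
    # space (as blobfuse appears to do): treat 0 or an unreadable value as
    # "unknown" and let the write proceed, so a genuine lack of space simply
    # surfaces later as a normal write error.
    try:
        free = shutil.disk_usage(directory).free
    except OSError:
        return True  # cannot determine usage; don't block the write
    if free == 0:
        return True  # likely an unreliable report (e.g. a fuse mount); don't block
    return needed_bytes <= free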