Hi all, I’ve been using the huggingface_hub Python library, specifically the upload_large_folder() function. I was able to upload many TB of data just fine (thank you huggingface!), but one of my jobs got interrupted and now I keep getting this error after the upload for each file completes: “Failed to preupload LFS: Data processing error: MerkleDB Shard error: File I/O error”, and then nothing ever gets committed.
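For reference, this is roughly how I’m calling it (the repo name and paths below are placeholders, not my actual ones):

```python
from huggingface_hub import HfApi

api = HfApi()  # token picked up from huggingface-cli login / HF_TOKEN
api.upload_large_folder(
    repo_id="username/my-dataset",               # placeholder repo
    folder_path="/lustre/path/to/large_folder",  # placeholder path
    repo_type="dataset",
)
```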
I think it’s an issue isolated to my computer. I tried:
deleting my huggingface cache
deleting the .cache folder within the large folder
uploading to a new repo
I get the same error every time.
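For what it’s worth, this is roughly the cleanup I ran (the paths are the defaults on my machine; the second one is the per-folder state that upload_large_folder() keeps inside the folder being uploaded):

```python
import shutil
from pathlib import Path

shutil.rmtree(Path.home() / ".cache" / "huggingface", ignore_errors=True)  # main HF cache
shutil.rmtree("/lustre/path/to/large_folder/.cache", ignore_errors=True)   # upload tracking state
```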
Any ideas? I looked through the source code and I’m not sure where the error is coming from.
I think I figured out the issue. My HF_XET_CACHE was stored in /dev/shm, and it had ballooned to 200 GB and filled up the entire tmpfs. I keep it in memory because we use Lustre as the parallel filesystem, although I don’t really see any major slowdown with either the Lustre-backed or the RAM-backed location.
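A minimal sketch of the setup, with example paths (in practice I export HF_XET_CACHE in the job script before the upload starts), plus the quick size check that showed the cache had filled /dev/shm:

```python
import os
from pathlib import Path

os.environ.setdefault("HF_XET_CACHE", "/dev/shm/xet")    # tmpfs-backed cache (what I had)
# os.environ["HF_XET_CACHE"] = "/lustre/scratch/me/xet"  # Lustre-backed alternative

cache = Path(os.environ["HF_XET_CACHE"])
size_gb = sum(f.stat().st_size for f in cache.rglob("*") if f.is_file()) / 1e9
print(f"xet cache size: {size_gb:.1f} GB")  # this had grown to ~200 GB
```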
Ok, I spoke too soon. I’m still getting this error, but now that the tmpfs is cleared, new commits are being generated. I validated the hashes locally against the ones on the Hub and they are correct too. Maybe I need to restart my compute node in case a stray process is causing the error?
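This is roughly how I checked the hashes (file names and paths are placeholders; for LFS-tracked files the Hub exposes a sha256 via get_paths_info, though the exact attribute layout may vary across huggingface_hub versions):

```python
import hashlib
from huggingface_hub import HfApi

def local_sha256(path, chunk_size=1 << 20):
    """Stream a local file and return its sha256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

api = HfApi()
infos = api.get_paths_info("username/my-dataset", ["shard_0000.tar"], repo_type="dataset")
remote_sha = infos[0].lfs.sha256  # lfs metadata is only present for LFS-tracked files
print(remote_sha == local_sha256("/lustre/path/to/large_folder/shard_0000.tar"))
```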
@johntzwei we’ve seen that error crop up when information in the xet cache (stored underneath the huggingface cache) is removed, or was never properly downloaded in the first place, and then the download is resumed.
However, this should be addressed in recent releases of hf-xet. When you run pip freeze | grep hf-xet on the machine where this is running, what do you see? If it’s below 1.1.4, I’d suggest upgrading with pip install -U hf-xet and seeing if that helps.
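If it’s easier, here’s a quick way to check the installed versions from inside Python, which also guards against pip inspecting a different environment than the one your job actually uses:

```python
from importlib.metadata import version

print("huggingface_hub:", version("huggingface_hub"))
print("hf-xet:", version("hf-xet"))
```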
Also of note: more recent versions of hf-xet provide stronger protection against the xet cache expanding the way you saw.
If upgrading doesn’t help and you’re still experiencing issues, please open an issue at this link so we can gather more information and dig in further. Thank you!