Hi all, I’ve been using the huggingface_hub Python library, specifically the upload_large_folder() function. I was able to upload many TB of data just fine (thank you huggingface!), but one of my jobs got interrupted and now I keep getting this error after the upload for each file completes: “Failed to preupload LFS: Data processing error: MerkleDB Shard error: File I/O error”, and then nothing ever gets committed.
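For reference, this is roughly how I’m calling it (the repo name and paths below are placeholders, not my actual ones):

```python
from huggingface_hub import HfApi

api = HfApi()  # token picked up from huggingface-cli login / HF_TOKEN
api.upload_large_folder(
    repo_id="username/my-dataset",               # placeholder repo
    folder_path="/lustre/path/to/large_folder",  # placeholder path
    repo_type="dataset",
)
```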
I think it’s an issue isolated to my computer. I tried:
deleting my huggingface cache
deleting the .cache folder within the large folder
uploading to a new repo
I get the same error every time.
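For what it’s worth, this is roughly the cleanup I ran (the paths are the defaults on my machine; the second one is the per-folder state that upload_large_folder() keeps inside the folder being uploaded):

```python
import shutil
from pathlib import Path

shutil.rmtree(Path.home() / ".cache" / "huggingface", ignore_errors=True)  # main HF cache
shutil.rmtree("/lustre/path/to/large_folder/.cache", ignore_errors=True)   # upload tracking state
```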
Any ideas? I looked through the source code and I’m not sure where the error is coming from.
I think I figured out the issue. My HF_XET_CACHE was stored in /dev/shm, and it had ballooned to 200 GB and filled up the entire tmpfs. I keep it in memory because we use Lustre as the parallel filesystem, although I don’t really see any major slowdown with either the Lustre-backed or the RAM-backed location.
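A minimal sketch of the setup, with example paths (in practice I export HF_XET_CACHE in the job script before the upload starts), plus the quick size check that showed the cache had filled /dev/shm:

```python
import os
from pathlib import Path

os.environ.setdefault("HF_XET_CACHE", "/dev/shm/xet")    # tmpfs-backed cache (what I had)
# os.environ["HF_XET_CACHE"] = "/lustre/scratch/me/xet"  # Lustre-backed alternative

cache = Path(os.environ["HF_XET_CACHE"])
size_gb = sum(f.stat().st_size for f in cache.rglob("*") if f.is_file()) / 1e9
print(f"xet cache size: {size_gb:.1f} GB")  # this had grown to ~200 GB
```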
Ok, I spoke too soon. I’m still getting this error, but now that the tmpfs is cleared, new commits are being generated. I validated the hashes locally against the ones on the Hub and they are correct too. Maybe I need to restart my compute node in case a stray process is causing the error?
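This is roughly how I checked the hashes (file names and paths are placeholders; for LFS-tracked files the Hub exposes a sha256 via get_paths_info, though the exact attribute layout may vary across huggingface_hub versions):

```python
import hashlib
from huggingface_hub import HfApi

def local_sha256(path, chunk_size=1 << 20):
    """Stream a local file and return its sha256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

api = HfApi()
infos = api.get_paths_info("username/my-dataset", ["shard_0000.tar"], repo_type="dataset")
remote_sha = infos[0].lfs.sha256  # lfs metadata is only present for LFS-tracked files
print(remote_sha == local_sha256("/lustre/path/to/large_folder/shard_0000.tar"))
```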
@johntzwei we’ve seen that error crop up when information in the xet cache (stored underneath the huggingface cache) is removed, or was never properly downloaded in the first place, and then the download is resumed.
However, this should be addressed in recent releases of hf-xet. When you run pip freeze | grep hf-xet on the machine where this is running, what do you see? If it’s below 1.1.4, I’d suggest upgrading with pip install -U hf-xet and seeing if that helps.
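If it’s easier, here’s a quick way to check the installed versions from inside Python, which also guards against pip inspecting a different environment than the one your job actually uses:

```python
from importlib.metadata import version

print("huggingface_hub:", version("huggingface_hub"))
print("hf-xet:", version("hf-xet"))
```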
Also of note: more recent versions of hf-xet provide stronger protection against the xet cache expanding the way you saw.
If upgrading doesn’t help and you’re still experiencing issues, please open an issue at this link so we can gather more information and dig in further. Thank you!