The first workaround seems straightforward…
Root cause: upload_large_folder expects a local filesystem path (str/PathLike), but you passed an s3fs.S3File, hence the TypeError. The helper does not traverse remote file objects; it only walks a local directory tree. (Hugging Face)
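To make the mismatch concrete, a minimal sketch (the bucket, repo name, and paths are placeholders, not your actual values):
import s3fs
from huggingface_hub import upload_large_folder

fs = s3fs.S3FileSystem()

# Failing pattern: an s3fs file handle is not a local path
fh = fs.open("s3://YOUR_BUCKET/some.zarr/.zgroup", "rb")
# upload_large_folder(repo_id="user/dataset", repo_type="dataset",
#                     folder_path=fh)   # -> TypeError: expects str/PathLike

# Expected pattern: a directory that exists on the local filesystem
upload_large_folder(
    repo_id="user/dataset",
    repo_type="dataset",
    folder_path="/path/to/local/folder",
)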
Workable paths, from least change to most change:
1) Mount S3 so it looks local, then call upload_large_folder
Effect: No code changes beyond the path. The Hub sees a normal folder.
- Mount options
- Example
# s3fs-fuse (https://github.com/s3fs-fuse/s3fs-fuse)
s3fs ${BUCKET} /mnt/s3 -o iam_role=auto,use_path_request_style
# rclone (https://rclone.org/commands/rclone_mount/)
rclone mount s3:${BUCKET} /mnt/s3 --vfs-cache-mode full
# docs: https://huggingface.co/docs/huggingface_hub/guides/upload
from huggingface_hub import upload_large_folder
upload_large_folder(
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
    folder_path="/mnt/s3/bia-integrator-data/S-BIAD845/.../....zarr",
)
Notes: upload_large_folder is chunked and resumable by design, so it does not take a multi_commits flag (that flag belongs to the older upload_folder path and is deprecated in its favor). Mounts with VFS write caching avoid odd POSIX edge cases. (Hugging Face)
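Before launching the upload, a quick sanity check that the mount actually exposes the Zarr metadata files (the path below is the placeholder path from the example above):
import os

root = "/mnt/s3/bia-integrator-data/S-BIAD845/.../....zarr"  # placeholder mount path

# Zarr v2 stores expose .zgroup/.zattrs/.zarray entries; Zarr v3 stores use zarr.json.
# If none of these are visible, the mount is not presenting the tree correctly.
entries = os.listdir(root)
assert any(name in entries for name in (".zgroup", ".zattrs", "zarr.json")), entries
print(f"{len(entries)} top-level entries visible under the mount")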
2) Stream from S3 in batches using the commit API (no local copy)
Effect: Push 50–100 files per commit. Avoids the 128 files/min symptom and reduces 429s.
- Why it works: CommitOperationAdd accepts a path or a file-like object. You can hand it s3fs file handles directly and commit in batches. (Hugging Face)
- Minimal script
# refs:
# - HF upload guide: https://huggingface.co/docs/huggingface_hub/guides/upload
# - HfApi.create_commit: https://huggingface.co/docs/huggingface_hub/package_reference/hf_api
import posixpath
import s3fs
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
fs = s3fs.S3FileSystem()

prefix = "s3://YOUR_BUCKET/bia-integrator-data/S-BIAD845/009bd3.../009bd3....zarr"
repo_id = "stefanches/idr0012-fuchs-cellmorph-S-BIAD845"
root_in_repo = "009bd3....zarr"

# s3fs returns keys without the "s3://" scheme, so strip it before slicing
prefix_key = prefix.replace("s3://", "", 1).rstrip("/")

ops, open_fhs = [], []
for key in fs.find(prefix):          # find() lists files recursively
    if key.endswith("/"):            # skip pseudo-dirs, just in case
        continue
    rel = key[len(prefix_key):].lstrip("/")
    fh = fs.open(key, "rb")          # S3 file-like handle
    open_fhs.append(fh)
    ops.append(CommitOperationAdd(
        path_in_repo=posixpath.join(root_in_repo, rel),
        path_or_fileobj=fh,
    ))
    if len(ops) >= 80:               # 50–100 per commit helps avoid 429s
        api.create_commit(repo_id=repo_id, repo_type="dataset",
                          operations=ops, commit_message="batch")
        for h in open_fhs:
            h.close()
        ops, open_fhs = [], []

if ops:
    api.create_commit(repo_id=repo_id, repo_type="dataset",
                      operations=ops, commit_message="final")
    for h in open_fhs:
        h.close()
Tip: If you still hit HTTP 429 with many small files, reduce the batch size or sleep between commits. Batching exists precisely because large, flat trees can trigger rate limiting; issues and forum reports from 2024–2025 confirm this. (GitHub)
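To make the sleeps automatic, here is a sketch of a backoff wrapper around create_commit; the retry counts and delays are arbitrary illustrative choices, not values recommended by Hugging Face:
import time
from huggingface_hub.utils import HfHubHTTPError

def commit_with_backoff(api, *, max_retries=5, **commit_kwargs):
    """Retry create_commit on 429/503 with exponential backoff (illustrative defaults)."""
    for attempt in range(max_retries):
        try:
            return api.create_commit(**commit_kwargs)
        except HfHubHTTPError as err:
            status = err.response.status_code if err.response is not None else None
            if status not in (429, 503) or attempt == max_retries - 1:
                raise
            wait = 10 * (2 ** attempt)  # 10 s, 20 s, 40 s, ...
            print(f"HTTP {status}, retrying in {wait}s")
            time.sleep(wait)
Swap it in wherever the script above calls api.create_commit directly.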
3) Collapse the Zarr into a single .zarr.zip, then upload one file
Effect: Replace thousands of tiny files with one LFS object. Best for read-heavy, write-rare assets.
- Why it works: Zarr’s ZipStore stores an entire hierarchy in a single ZIP. Zarr v2 and v3 document ZipStore. Clients can open via ZipStore directly. (zarr.readthedocs.io)
- Make the zip and upload
# refs:
# - ZipStore v3 guide: https://zarr.readthedocs.io/en/v3.1.0/user-guide/storage.html
# - ZipStore v2 API: https://zarr.readthedocs.io/en/v2.15.0/api/storage.html
# - HF upload_file: https://huggingface.co/docs/huggingface_hub/guides/upload
import s3fs
import zarr  # this sketch assumes the zarr-python 2.x API (FSStore, copy_store)
from huggingface_hub import HfApi

fs = s3fs.S3FileSystem()
api = HfApi()

# Read from the directory-style Zarr on S3
src = zarr.storage.FSStore(
    "s3://YOUR_BUCKET/bia-integrator-data/S-BIAD845/.../.zarr",
    fs=fs, mode="r",
)

# Write a single zip locally: ZipStore needs a real filesystem path,
# so budget local scratch space roughly the size of the dataset
dst = zarr.storage.ZipStore("....zarr.zip", mode="w")
zarr.copy_store(src, dst)
dst.close()

# Upload the one file to the Hub
api.upload_file(
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
    path_in_repo="....zarr.zip",
    path_or_fileobj="....zarr.zip",
)
Trade-offs: ZIP is immutable. Random writes require re-zipping. For read-only public datasets this is fine and common. (zarr.readthedocs.io)
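For consumers, a sketch of reading the published archive back; it assumes the zarr-python 2.x API and the same (elided) file name used above:
import zarr
from huggingface_hub import hf_hub_download

# Fetch the single archive from the Hub (cached locally by huggingface_hub)
path = hf_hub_download(
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
    filename="....zarr.zip",
)

# Open it read-only straight from the ZIP, no extraction needed
store = zarr.storage.ZipStore(path, mode="r")
root = zarr.open(store, mode="r")
print(root.tree())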
Background and constraints, stated explicitly:
- upload_large_folder is designed for huge trees and is resumable, but it walks a local path, not a file-like object. Use a mount or switch to commit-level APIs. Docs updated Jul 22 2024. (Hugging Face)
- Rate limiting appears with many small files. Community issues mention 429/503 during big uploads. Batching and retries are the practical mitigations. Dates: 2024-10-01 and 2025-05-12. (GitHub)
Pitfalls and checks:
- Preserve Zarr layout. When mounting, confirm that entries like .zgroup, .zarray, and zarr.json are visible under the mount.
- rclone's VFS cache can use disk. Cap it with --vfs-cache-max-size if needed. (Reddit)
- If you need pure read-only publishing and minimal file count, the .zarr.zip route is simplest for the Hub. Zarr docs show direct ZipStore reads. (zarr.readthedocs.io)
Suggested choice matrix:
- Need minimal code change and you trust a FUSE mount → Option 1.
- Need full control, want to stay pure-Python without mounts → Option 2.
- Want to avoid small-file problems entirely and the dataset is static → Option 3.
Curated references (updated dates shown):
- Hugging Face upload guide, commit API and file-like support. Updated Jul 22 2024. (Hugging Face)
- multi_commits for large trees (an upload_folder option). Docs pages across versions. 2023–2024. (Hugging Face)
- 429 symptoms with many files. GitHub issue 2024-10-01 and forum 2025-04-23. (GitHub)
- Zarr ZipStore docs (v3.1.0 and v2.15.0). 2024–2025. (zarr.readthedocs.io)
- s3fs-fuse README and rclone mount docs. Current. (GitHub)