The first workaround seems straightforward…
Root cause: upload_large_folder expects a local filesystem path (str or os.PathLike). You passed an s3fs.S3File, hence the TypeError. The helper does not traverse remote file objects; it only walks a local directory tree. (Hugging Face)
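A minimal contrast of the failing call and the expected call, with placeholder bucket and repo names (a sketch; the failing line is commented out so the rest is valid):

import s3fs
from huggingface_hub import upload_large_folder

fs = s3fs.S3FileSystem()

# What was attempted: a remote file-like object.
remote_obj = fs.open("s3://YOUR_BUCKET/path/to/store.zarr/.zgroup", "rb")
# upload_large_folder(repo_id="user/dataset", repo_type="dataset",
#                     folder_path=remote_obj)  # raises the TypeError above

# What the helper expects: a plain local directory path (str or os.PathLike).
upload_large_folder(
    repo_id="user/dataset",
    repo_type="dataset",
    folder_path="/path/to/local/folder",
)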
Workable paths, from least change to most change:
1) Mount S3 so it looks local, then call upload_large_folder
Effect: No code changes beyond the path. The Hub sees a normal folder.
- Mount options
  - s3fs-fuse (Linux/macOS/BSD): FUSE mount of an S3 bucket. (GitHub)
  - rclone mount: stable, with good VFS cache controls. Use --vfs-cache-mode=full for POSIX-like behavior. (Rclone)
- Example
# Mount first (shell); pick one:
# s3fs-fuse (https://github.com/s3fs-fuse/s3fs-fuse)
s3fs ${BUCKET} /mnt/s3 -o iam_role=auto,use_path_request_style
# rclone (https://rclone.org/commands/rclone_mount/)
rclone mount s3:${BUCKET} /mnt/s3 --vfs-cache-mode full

# Then upload from Python
# docs: https://huggingface.co/docs/huggingface_hub/guides/upload
from huggingface_hub import upload_large_folder

upload_large_folder(
    folder_path="/mnt/s3/bia-integrator-data/S-BIAD845/.../....zarr",
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",  # required; chunked, resumable uploads are built in
)
Notes: upload_large_folder splits large trees into several commits on its own and can resume an interrupted run, so no extra flags are needed. Mounts with VFS write caching avoid odd POSIX edge cases. (Hugging Face)
2) Stream from S3 in batches using the commit API (no local copy)
Effect: Push 50–100 files per commit. Avoids the 128 files/min symptom and reduces 429s.
# refs:
# - HF upload guide: https://huggingface.co/docs/huggingface_hub/guides/upload
# - HfApi.create_commit: https://huggingface.co/docs/huggingface_hub/package_reference/hf_api
import posixpath

import s3fs
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
fs = s3fs.S3FileSystem()

prefix = "s3://YOUR_BUCKET/bia-integrator-data/S-BIAD845/009bd3.../009bd3....zarr"
repo_id = "stefanches/idr0012-fuchs-cellmorph-S-BIAD845"
root_in_repo = "009bd3....zarr"

# s3fs returns keys without the "s3://" scheme, so strip it before slicing.
prefix_key = prefix.replace("s3://", "", 1)

ops, open_fhs = [], []
for key in fs.find(prefix):
    if key.endswith("/"):  # skip pseudo-dirs
        continue
    rel = key[len(prefix_key):].lstrip("/")
    fh = fs.open(key, "rb")  # S3 file-like object
    open_fhs.append(fh)
    ops.append(CommitOperationAdd(
        path_in_repo=posixpath.join(root_in_repo, rel),
        path_or_fileobj=fh,
    ))
    if len(ops) >= 80:  # 50–100 per commit helps avoid 429s
        api.create_commit(repo_id=repo_id, repo_type="dataset",
                          operations=ops, commit_message="batch")
        for h in open_fhs:
            h.close()
        ops, open_fhs = [], []

if ops:
    api.create_commit(repo_id=repo_id, repo_type="dataset",
                      operations=ops, commit_message="final")
    for h in open_fhs:
        h.close()
Tip: If you still hit HTTP 429 with many small files, reduce the batch size or sleep between commits; a retry sketch follows. Batching is the practical pattern because large, flat trees can trigger rate limiting, and issues and forum reports in 2024–2025 confirm this. (GitHub)
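To make the batching loop above tolerant of transient rate limiting, a small retry wrapper around create_commit is enough. A sketch, assuming huggingface_hub raises HfHubHTTPError (a requests HTTPError subclass) for 429/503 responses; the retry count and backoff base are arbitrary choices:

import time

from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi()

def commit_with_retry(repo_id, operations, message, max_tries=5, base_sleep=30):
    # Retry create_commit on 429/503 with linear backoff; re-raise anything else.
    for attempt in range(1, max_tries + 1):
        try:
            return api.create_commit(repo_id=repo_id, repo_type="dataset",
                                     operations=operations, commit_message=message)
        except HfHubHTTPError as e:
            status = e.response.status_code if e.response is not None else None
            if status in (429, 503) and attempt < max_tries:
                time.sleep(base_sleep * attempt)  # back off, then try again
                continue
            raise

Swap the bare api.create_commit(...) calls in the loop for commit_with_retry(...); if 429s persist, also add a fixed time.sleep between successful commits.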
3) Collapse the Zarr into a single .zarr.zip, then upload one file
Effect: Replace thousands of tiny files with one LFS object. Best for read-heavy, write-rare assets.
# refs:
# - ZipStore v3 guide: https://zarr.readthedocs.io/en/v3.1.0/user-guide/storage.html
# - ZipStore v2 API: https://zarr.readthedocs.io/en/v2.15.0/api/storage.html
# - HF upload_file: https://huggingface.co/docs/huggingface_hub/guides/upload
import zarr
from huggingface_hub import HfApi

api = HfApi()

# Read the directory-style Zarr straight from S3.
# Note: this follows the Zarr 2.x storage API (FSStore / copy_store);
# the Zarr 3.x storage API differs.
src = zarr.storage.FSStore(
    "s3://YOUR_BUCKET/bia-integrator-data/S-BIAD845/.../.zarr",
    mode="r",
)

# Collapse it into one local .zarr.zip. ZipStore needs a seekable target,
# so write locally instead of streaming the zip back into S3.
dst = zarr.storage.ZipStore("local.zarr.zip", mode="w")
zarr.copy_store(src, dst)
dst.close()

# Upload the single file to the Hub
api.upload_file(
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
    path_in_repo="....zarr.zip",
    path_or_fileobj="local.zarr.zip",
)
Trade-offs: ZIP is immutable. Random writes require re-zipping. For read-only public datasets this is fine and common. (zarr.readthedocs.io)
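Consumers can read the zipped store directly after download. A sketch, again assuming the Zarr 2.x API; the filename is a placeholder for the real ....zarr.zip name used above:

import zarr
from huggingface_hub import hf_hub_download

# Download the single archive from the Hub (cached locally).
local_zip = hf_hub_download(
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
    filename="image.zarr.zip",  # placeholder; use the actual ....zarr.zip name
)

store = zarr.storage.ZipStore(local_zip, mode="r")  # read-only ZIP-backed store
root = zarr.open(store, mode="r")                   # open the group/array inside
print(root.info)                                    # inspect what was stored
store.close()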
Background and constraints, stated explicitly:
- upload_large_folder is designed for huge trees and is resumable, but it walks a local path, not a file-like object. Use a mount or switch to commit-level APIs. Docs updated Jul 22 2024. (Hugging Face)
- Rate limiting appears with many small files. Community issues mention 429/503 during big uploads. Batching and retries are the practical mitigations. Dates: 2024-10-01 and 2025-05-12. (GitHub)
Pitfalls and checks:
- Preserve the Zarr layout. When mounting, confirm that metadata entries like .zgroup, .zarray, and zarr.json are visible under the mount (a quick check is sketched after this list).
- rclone VFS cache can use disk. Cap it with --vfs-cache-max-size if needed. (Reddit)
- If you need pure read-only publishing and a minimal file count, the .zarr.zip route is simplest for the Hub. Zarr docs show direct ZipStore reads. (zarr.readthedocs.io)
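A quick sanity check for the first pitfall; a minimal sketch that only assumes the mount point and store path below, and simply walks the mount looking for the usual Zarr metadata files:

import os

mount_root = "/mnt/s3/bia-integrator-data/S-BIAD845"  # adjust to your mount + store path
metadata_names = {".zgroup", ".zarray", "zarr.json"}  # Zarr v2 and v3 metadata entries

found = []
for dirpath, _dirnames, filenames in os.walk(mount_root):
    for name in filenames:
        if name in metadata_names:
            found.append(os.path.join(dirpath, name))

print(f"{len(found)} Zarr metadata files visible under the mount")
for path in found[:10]:  # show a small sample
    print(" ", path)

If this prints zero files, the mount is likely hiding the store layout, and any upload from it would be incomplete.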
Suggested choice matrix:
- Need minimal code change and you trust a FUSE mount → Option 1.
- Need full control, want to stay pure-Python without mounts → Option 2.
- Want to avoid small-file problems entirely and the dataset is static → Option 3.
Curated references (updated dates shown):
- Hugging Face upload guide, commit API and file-like support. Updated Jul 22 2024. (Hugging Face)
- multi_commits for large trees. Docs pages across versions, 2023–2024. (Hugging Face)
- 429 symptoms with many files. GitHub issue 2024-10-01 and forum 2025-04-23. (GitHub)
- Zarr ZipStore docs (v3.1.0 and v2.15.0). 2024–2025. (zarr.readthedocs.io)
- s3fs-fuse README and rclone mount docs. Current. (GitHub)