Upload a large folder from S3 to a dataset

Hi everyone, I'm trying to upload a Zarr image from S3 to Hugging Face. I have read the Cloud storage docs, but I cannot upload the files one by one because the Zarr structure contains many files for a single Zarr screen, and the Hugging Face upload gets cut off at the 128 files/minute rate limit.

I’d like to upload the whole folder from s3 to HF:

from huggingface_hub import upload_large_folder

destination_dataset = "stefanches/idr0012-fuchs-cellmorph-S-BIAD845"
data_file = "bia-integrator-data/S-BIAD845/009bd3ab-eb79-4cf4-8a11-ad028b827c03/009bd3ab-eb79-4cf4-8a11-ad028b827c03.zarr"  # really a directory in Zarr format

with s3.open(data_file) as zarr_path:
    path_in_repo = data_file[len(data_dir)-5:]
    upload_large_folder(
        folder_path=zarr_path,
        repo_id=destination_dataset,
        repo_type="dataset",
    )
    print(f"Uploaded {data_file} to {path_in_repo}")

However, I get the following error: TypeError: expected str, bytes or os.PathLike object, not S3File

What could be a possible workaround?


The first workaround seems straightforward…


Root cause: upload_large_folder expects a local filesystem path (str/PathLike). You passed an s3fs.S3File. Hence the TypeError. The helper does not traverse remote file objects. It only walks a local directory tree. (Hugging Face)
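To make the constraint concrete: if the tree sits on local disk, the exact same call works. A minimal sketch, assuming enough free local disk space and that the bucket is readable with your configured credentials (paths taken from the question, not verified):

# Hedged sketch: stage the S3 prefix on local disk, then upload the local copy.
# Assumes sufficient disk space; s3fs's get() copies a prefix recursively.
import s3fs
from huggingface_hub import upload_large_folder

fs = s3fs.S3FileSystem()

remote = "bia-integrator-data/S-BIAD845/009bd3ab-eb79-4cf4-8a11-ad028b827c03/009bd3ab-eb79-4cf4-8a11-ad028b827c03.zarr"
local = "/tmp/009bd3ab-eb79-4cf4-8a11-ad028b827c03.zarr"

fs.get(remote, local, recursive=True)   # copy the whole Zarr tree locally

upload_large_folder(                    # now the helper sees a normal directory
    folder_path=local,
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
)

That is the brute-force version of what the options below do more cleanly.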

Workable paths, from least change to most change:

1) Mount S3 so it looks local, then call upload_large_folder

Effect: No code changes beyond the path. The Hub sees a normal folder.

  • Mount options

    • s3fs-fuse (Linux/macOS/BSD): FUSE mount of an S3 bucket. (GitHub)
    • rclone mount: stable, good VFS cache controls. Use --vfs-cache-mode=full for POSIX-like behavior. (Rclone)
  • Example

# s3fs-fuse (https://github.com/s3fs-fuse/s3fs-fuse)
s3fs ${BUCKET} /mnt/s3 -o iam_role=auto,use_path_request_style

# rclone (https://rclone.org/commands/rclone_mount/)
rclone mount s3:${BUCKET} /mnt/s3 --vfs-cache-mode full

# Then, from Python, treat the mount like any local folder.
# docs: https://huggingface.co/docs/huggingface_hub/guides/upload
from huggingface_hub import upload_large_folder

upload_large_folder(
    folder_path="/mnt/s3/bia-integrator-data/S-BIAD845/.../....zarr",
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
)

Notes: upload_large_folder splits large trees into many small commits on its own and can resume an interrupted upload, so no extra flags are needed. Mounts with VFS write caching avoid odd POSIX edge cases. (Hugging Face)

2) Stream from S3 in batches using the commit API (no local copy)

Effect: Push 50–100 files per commit. Avoids the 128 files/min symptom and reduces 429s.

  • Why it works: CommitOperationAdd accepts a path or a file-like object. You can hand it s3fs file handles directly and commit in batches. (Hugging Face)

  • Minimal script

# refs:
# - HF upload guide: https://huggingface.co/docs/huggingface_hub/guides/upload
# - HfApi.create_commit: https://huggingface.co/docs/huggingface_hub/package_reference/hf_api
import posixpath, s3fs
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
fs = s3fs.S3FileSystem()

prefix = "s3://YOUR_BUCKET/bia-integrator-data/S-BIAD845/009bd3.../009bd3....zarr"
repo_id = "stefanches/idr0012-fuchs-cellmorph-S-BIAD845"
root_in_repo = "009bd3....zarr"

ops, open_fhs = [], []
prefix_key = prefix.replace("s3://", "", 1)  # fs.find() returns keys without the protocol
for key in fs.find(prefix):
    if key.endswith("/"):  # skip pseudo-dirs
        continue
    rel = key[len(prefix_key):].lstrip("/")
    fh = fs.open(key, "rb")          # S3 file-like
    open_fhs.append(fh)
    ops.append(CommitOperationAdd(
        path_in_repo=posixpath.join(root_in_repo, rel),
        path_or_fileobj=fh
    ))
    if len(ops) >= 80:               # 50–100 per commit helps avoid 429s
        api.create_commit(repo_id=repo_id, repo_type="dataset",
                          operations=ops, commit_message="batch")
        for h in open_fhs: h.close()
        ops, open_fhs = [], []

if ops:
    api.create_commit(repo_id=repo_id, repo_type="dataset",
                      operations=ops, commit_message="final")
    for h in open_fhs: h.close()

Tip: If you still hit HTTP 429 with many small files, reduce the batch size or sleep between commits; large, flat trees are known to trigger rate limiting, and GitHub issues and forum reports from 2024–2025 confirm it. (GitHub)
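If you want the loop above to back off automatically instead of failing on a 429, a small wrapper around create_commit is enough. A minimal sketch, assuming the api and repo_id names from the script above; the retry count and delay are arbitrary choices:

# Hedged sketch: retry create_commit with a pause when the Hub answers HTTP 429.
# `api` (HfApi) and `repo_id` come from the script above; delays are arbitrary.
import time
from huggingface_hub.utils import HfHubHTTPError

def commit_with_backoff(api, repo_id, ops, message, max_tries=5, pause_s=60):
    for attempt in range(1, max_tries + 1):
        try:
            return api.create_commit(repo_id=repo_id, repo_type="dataset",
                                      operations=ops, commit_message=message)
        except HfHubHTTPError as err:
            status = err.response.status_code if err.response is not None else None
            if status != 429 or attempt == max_tries:
                raise
            time.sleep(pause_s * attempt)  # back off a bit longer each retry

Swap the two direct api.create_commit(...) calls in the batching loop for commit_with_backoff(api, repo_id, ops, "batch") and the script keeps streaming from S3 while respecting the rate limit.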

3) Collapse the Zarr into a single .zarr.zip, then upload one file

Effect: Replace thousands of tiny files with one LFS object. Best for read-heavy, write-rare assets.

  • Why it works: Zarr’s ZipStore stores an entire hierarchy in a single ZIP. Zarr v2 and v3 document ZipStore. Clients can open via ZipStore directly. (zarr.readthedocs.io)

  • Make the zip and upload

# refs:
# - ZipStore v3 guide: https://zarr.readthedocs.io/en/v3.1.0/user-guide/storage.html
# - ZipStore v2 API:   https://zarr.readthedocs.io/en/v2.15.0/api/storage.html
# - HF upload_file:    https://huggingface.co/docs/huggingface_hub/guides/upload
import s3fs
import zarr  # zarr-python 2.x API (ZipStore, copy_store)
from huggingface_hub import HfApi

fs = s3fs.S3FileSystem()
api = HfApi()

# Source: the directory-like Zarr on S3, exposed as a key/value store
src = fs.get_mapper("s3://YOUR_BUCKET/bia-integrator-data/S-BIAD845/.../.zarr")

# Destination: a single local .zarr.zip (ZipStore needs a seekable local file)
dst = zarr.ZipStore("....zarr.zip", mode="w")
zarr.copy_store(src, dst)
dst.close()

# Upload the one file to the Hub
api.upload_file(
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
    path_in_repo="....zarr.zip",
    path_or_fileobj="....zarr.zip",
)

Trade-offs: ZIP is immutable. Random writes require re-zipping. For read-only public datasets this is fine and common. (zarr.readthedocs.io)
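For completeness, consumers read the archive back with one download plus a read-only ZipStore. A minimal sketch, assuming the zarr-python 2.x API and the (elided) filename used above:

# Hedged sketch: download the single zip from the Hub and open it read-only.
# Filename is the placeholder used above; zarr-python 2.x API assumed.
import zarr
from huggingface_hub import hf_hub_download

local_zip = hf_hub_download(
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
    filename="....zarr.zip",
)
store = zarr.ZipStore(local_zip, mode="r")
root = zarr.open_group(store, mode="r")  # browse the hierarchy as usual
print(root.tree())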

Background and constraints, stated explicitly:

  • upload_large_folder is designed for huge trees and is resumable, but it walks a local path, not a file-like object. Use a mount or switch to commit-level APIs. Docs updated Jul 22 2024. (Hugging Face)
  • Rate limiting appears with many small files. Community issues mention 429/503 during big uploads. Batching and retries are the practical mitigations. Dates: 2024-10-01 and 2025-05-12. (GitHub)

Pitfalls and checks:

  • Preserve Zarr layout. When mounting, confirm that directory entries like .zgroup, .zarray, zarr.json are visible under the mount (a quick check is sketched after this list).
  • rclone VFS cache can use disk. Cap it with --vfs-cache-max-size if needed. (Reddit)
  • If you need pure read-only publishing and minimal file count, the .zarr.zip route is simplest for the Hub. Zarr docs show direct ZipStore reads. (zarr.readthedocs.io)
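The layout check from the first pitfall can be scripted in a few lines. A minimal sketch, assuming the /mnt/s3 mount point from Option 1; the subpath is a placeholder and it only counts Zarr metadata files:

# Hedged sketch: confirm Zarr metadata files are visible through the FUSE mount.
# /mnt/s3 is the mount point from Option 1; the subpath below is a placeholder.
from pathlib import Path

mount_root = Path("/mnt/s3/bia-integrator-data/S-BIAD845")  # point at the real .zarr directory
metadata_names = {".zgroup", ".zarray", ".zattrs", "zarr.json"}

found = [p for p in mount_root.rglob("*") if p.name in metadata_names]
print(f"{len(found)} Zarr metadata files visible under {mount_root}")
for p in found[:5]:
    print("  e.g.", p)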

Suggested choice matrix:

  • Need minimal code change and you trust a FUSE mount → Option 1.
  • Need full control, want to stay pure-Python without mounts → Option 2.
  • Want to avoid small-file problems entirely and the dataset is static → Option 3.

Curated references (updated dates shown):

  • Hugging Face upload guide, commit API and file-like support. Updated Jul 22 2024. (Hugging Face)
  • Multi-commit handling for large trees. Docs pages across versions. 2023–2024. (Hugging Face)
  • 429 symptoms with many files. GitHub issue 2024-10-01 and forum 2025-04-23. (GitHub)
  • Zarr ZipStore docs (v3.1.0 and v2.15.0). 2024–2025. (zarr.readthedocs.io)
  • s3fs-fuse README and rclone mount docs. Current. (GitHub)

Thanks @John6666, super informative. Do you actually use some LM for these answers, or collect the points somewhere? (I just adore the answer speed!)


Do you actually use some LM for these answers,

Just GPT-5 Thinking in a web browser. :sweat_smile:
