Mirroring Hugging Face S3 to download models/tokenizers

Some companies need to keep their compute on premises without an internet connection. Is there a way to mirror the Hugging Face S3 buckets so that a subset of models and datasets can be downloaded from the mirror?

The `datasets` library already supports a `storage_options` argument in `load_dataset`; it would be good if `AutoModel*` and `AutoTokenizer` supported that too.

The use-case would ideally be something like:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import aiobotocore.session
import s3fs

s3_session = aiobotocore.session.AioSession(profile="my_profile_name")
storage_options = {"session": s3_session}

fs = s3fs.S3FileSystem(**storage_options)  # This storage contains a subset mirror of HF's S3.


tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    file_system=fs,  # proposed argument, not in the current API
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    file_system=fs,  # proposed argument, not in the current API
)

What I recommend is just git-cloning the repos you need and then loading them from the local filesystem.


Thanks for the suggestion to clone the repos we need. I guess we will do something like:

import os
import tempfile
from pathlib import Path

import s3fs
from huggingface_hub import snapshot_download


def huggingface_to_s3mirror(repo_id, s3_path):
    s3 = s3fs.S3FileSystem(anon=False)

    # Download the repo into a temporary HF-style cache, then copy it to the mirror.
    with tempfile.TemporaryDirectory() as tmpdirname:
        temp_dir = Path(tmpdirname)
        snapshot_download(repo_id=repo_id, cache_dir=temp_dir)
        s3.put(str(temp_dir) + "/*", s3_path, recursive=True)


def huggingmirror_to_local(repo_id, s3_path, local_path=os.getcwd()):
    s3 = s3fs.S3FileSystem(anon=False)
    # Downloads from the mirror to a local directory.
    s3_url = s3_path.rstrip("/") + "/models--" + repo_id.replace("/", "--")
    s3.get(s3_url, lpath=local_path, recursive=True)

huggingface_to_s3mirror("facebook/nllb-200-distilled-600M", "s3://mybucket")
huggingmirror_to_local("facebook/nllb-200-distilled-600M", "s3://mybucket")
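The `models--org--name` folder name used in `huggingmirror_to_local` mirrors the hub cache layout that `snapshot_download` writes when given a `cache_dir`. As a small sketch, that mapping can be factored into a helper (the helper name is mine):

```python
def repo_cache_dirname(repo_id: str) -> str:
    """Map a hub repo id to its cache folder name,
    e.g. 'facebook/nllb-200-distilled-600M' -> 'models--facebook--nllb-200-distilled-600M'.
    """
    return "models--" + repo_id.replace("/", "--")


print(repo_cache_dirname("facebook/nllb-200-distilled-600M"))
# models--facebook--nllb-200-distilled-600M
```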

Then this should kinda work, since the downloaded files keep the hub cache layout; pointing `cache_dir` at the download directory lets `from_pretrained` find them:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", cache_dir=".", local_files_only=True)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", cache_dir=".", local_files_only=True)