Some companies need to keep their compute on-premises, without an internet connection. Is there a way to mirror Hugging Face's S3 buckets so that a subset of models and datasets can be downloaded from the mirror?
Hugging Face datasets supports storage_options in load_dataset; it would be good if AutoModel* and AutoTokenizer supported that too.
The use-case would ideally be something like:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import aiobotocore.session
import s3fs

s3_session = aiobotocore.session.AioSession(profile="my_profile_name")
storage_options = {"session": s3_session}
fs = s3fs.S3FileSystem(**storage_options)  # This storage contains a subset mirror of HF's S3.

# file_system is the argument being requested here; it is a proposal, not an existing option.
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    file_system=fs,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    file_system=fs,
)
What I recommend is simply git-cloning the repos you need and then loading them from the local filesystem.
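A minimal sketch of that approach, assuming git and git-lfs are installed on a machine that can reach the Hub, and that the cloned folder is then copied to the offline machine. The helper names clone_command and clone_and_load are made up for illustration; passing a local folder to from_pretrained is the standard way to load without contacting the Hub.

```python
import subprocess


def clone_command(repo_id, dest):
    """Build the git command that clones a Hub model repo into dest."""
    # Model repos are ordinary git repositories served under huggingface.co.
    return ["git", "clone", f"https://huggingface.co/{repo_id}", dest]


def clone_and_load(repo_id, dest):
    """Clone the repo, then load the tokenizer and model from the local folder."""
    # git-lfs must be installed, otherwise the weight files are only pointers.
    subprocess.run(clone_command(repo_id, dest), check=True)
    # Imported lazily so clone_command works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(dest)
    model = AutoModelForSeq2SeqLM.from_pretrained(dest)
    return tokenizer, model
```

On the offline machine only the last two from_pretrained calls are needed, pointed at wherever the cloned folder was copied.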
Thanks for the suggestion to clone the repos we need. I guess we will do something like:
import os
import tempfile

import s3fs
from huggingface_hub import snapshot_download


def huggingface_to_s3mirror(repo_id, s3_path):
    """Download a repo snapshot from the Hub and upload it to the S3 mirror."""
    s3 = s3fs.S3FileSystem(anon=False)
    with tempfile.TemporaryDirectory() as tmpdirname:
        snapshot_download(repo_id=repo_id, cache_dir=tmpdirname)
        s3.put(tmpdirname + "/*", s3_path, recursive=True)


def huggingmirror_to_local(repo_id, s3_path, local_path=None):
    """Download a mirrored repo from S3 into a local directory (default: cwd)."""
    s3 = s3fs.S3FileSystem(anon=False)
    local_path = local_path or os.getcwd()
    # The Hub cache stores each model under "models--<org>--<name>".
    s3_url = s3_path.rstrip("/") + "/models--" + repo_id.replace("/", "--")
    s3.get(s3_url, lpath=local_path, recursive=True)


huggingface_to_s3mirror("facebook/nllb-200-distilled-600M", "s3://mybucket")
huggingmirror_to_local("facebook/nllb-200-distilled-600M", "s3://mybucket")
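Since the question also covers datasets, here is a small sketch of how the mirror paths could be derived for both repo types. It assumes the huggingface_hub cache convention of naming folders "<type>s--<org>--<name>" and that the dataset was fetched with snapshot_download(..., repo_type="dataset"). The helpers cache_folder_name and mirror_url are hypothetical names, and this covers only the raw repo files, not the processed cache that the datasets library builds.

```python
def cache_folder_name(repo_id, repo_type="model"):
    """Return the Hub cache folder name, e.g. models--org--name."""
    if repo_type not in ("model", "dataset", "space"):
        raise ValueError(f"unknown repo_type: {repo_type!r}")
    return f"{repo_type}s--" + repo_id.replace("/", "--")


def mirror_url(s3_prefix, repo_id, repo_type="model"):
    """Join the S3 prefix and the cache folder name with exactly one slash."""
    return s3_prefix.rstrip("/") + "/" + cache_folder_name(repo_id, repo_type)
```

For example, mirror_url("s3://mybucket/", "facebook/nllb-200-distilled-600M") and the same call without the trailing slash resolve to the same location.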
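Then this should more or less work, as long as the downloaded models--facebook--nllb-200-distilled-600M folder ends up directly under the directory that from_pretrained treats as its cache (here the current directory, passed as cache_dir):
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", cache_dir=".", local_files_only=True)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", cache_dir=".", local_files_only=True)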
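If the lookup by repo id turns out to be brittle, another option is to resolve the snapshot folder yourself and pass that path to from_pretrained. The sketch below assumes the mirrored folder keeps the usual Hub cache layout (refs/<revision> containing a commit hash, and snapshots/<hash>/ containing the files) and that the files were uploaded as real files, not dangling symlinks. find_snapshot_dir is a hypothetical helper, not part of any library.

```python
import os


def find_snapshot_dir(cache_dir, repo_id, revision="main"):
    """Locate the snapshot folder of a model inside a Hub-style cache directory."""
    repo_dir = os.path.join(cache_dir, "models--" + repo_id.replace("/", "--"))
    # refs/<revision> holds the commit hash that names the snapshot folder.
    with open(os.path.join(repo_dir, "refs", revision)) as f:
        commit = f.read().strip()
    snapshot = os.path.join(repo_dir, "snapshots", commit)
    if not os.path.isdir(snapshot):
        raise FileNotFoundError(snapshot)
    return snapshot
```

The returned path can then be passed directly, for example AutoTokenizer.from_pretrained(find_snapshot_dir(".", "facebook/nllb-200-distilled-600M")), which avoids any cache resolution by repo id.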