Mirroring Hugging Face S3 to download models/tokenizers

Some companies need to keep their compute on premises without an internet connection. Is there a way to mirror the Hugging Face S3 buckets so that a subset of models and datasets can be downloaded from the mirror?

The `datasets` library already supports a `storage_options` argument in `load_dataset`; it would be good if `AutoModel*` and `AutoTokenizer` supported that too.

The use-case would ideally be something like:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import aiobotocore.session
import s3fs

s3_session = aiobotocore.session.AioSession(profile="my_profile_name")
storage_options = {"session": s3_session}

fs = s3fs.S3FileSystem(**storage_options)  # This storage contains a subset mirror of HF's S3.


tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    file_system=fs,  # proposed argument, not in the current API
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    file_system=fs,  # proposed argument, not in the current API
)

What I recommend is just git-cloning the repos you need and then loading them from the local filesystem.


Thanks for the suggestion to clone the repos we need. I guess we will do something like:

import os
import tempfile
from pathlib import Path

import s3fs
from huggingface_hub import snapshot_download


def huggingface_to_s3mirror(repo_id, s3_path):
    s3 = s3fs.S3FileSystem(anon=False)

    # Download the repo into a temporary HF-style cache, then copy it to the mirror.
    with tempfile.TemporaryDirectory() as tmpdirname:
        temp_dir = Path(tmpdirname)
        snapshot_download(repo_id=repo_id, cache_dir=temp_dir)
        s3.put(str(temp_dir) + "/*", s3_path, recursive=True)


def huggingmirror_to_local(repo_id, s3_path, local_path=os.getcwd()):
    s3 = s3fs.S3FileSystem(anon=False)
    # Downloads from the mirror to a local directory.
    s3_url = s3_path.rstrip("/") + "/models--" + repo_id.replace("/", "--")
    s3.get(s3_url, lpath=local_path, recursive=True)

huggingface_to_s3mirror("facebook/nllb-200-distilled-600M", "s3://mybucket")
huggingmirror_to_local("facebook/nllb-200-distilled-600M", "s3://mybucket")
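The `models--org--name` folder name used in `huggingmirror_to_local` mirrors the hub cache layout that `snapshot_download` writes when given a `cache_dir`. As a small sketch, that mapping can be factored into a helper (the helper name is mine):

```python
def repo_cache_dirname(repo_id: str) -> str:
    """Map a hub repo id to its cache folder name,
    e.g. 'facebook/nllb-200-distilled-600M' -> 'models--facebook--nllb-200-distilled-600M'.
    """
    return "models--" + repo_id.replace("/", "--")


print(repo_cache_dirname("facebook/nllb-200-distilled-600M"))
# models--facebook--nllb-200-distilled-600M
```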

Then this should kinda work, since the downloaded files keep the hub cache layout; pointing `cache_dir` at the download directory lets `from_pretrained` find them:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", cache_dir=".", local_files_only=True)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", cache_dir=".", local_files_only=True)