Manually Downloading Models in docker build with snapshot_download

Hi,

To avoid re-downloading the models every time my docker container starts, I want to download them manually while building the docker image. I wrote a small script that runs the following during the build:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from sentence_transformers import SentenceTransformer

    AutoTokenizer.from_pretrained(configs.get("models_names.tockenizer"))
    SentenceTransformer(configs.get("models_names.sentence_embedding"))
    AutoModelForSeq2SeqLM.from_pretrained(configs.get("models_names.paraphraser"))

While this works in theory, it breaks in practice: the models are not just downloaded, they are also loaded into memory, which raises an out-of-memory error on the build machine. I then read about snapshot_download (ref) and replaced my script with

    from huggingface_hub import snapshot_download

    snapshot_download(configs.get("models_names.tockenizer"))
    snapshot_download(configs.get("models_names.sentence_embedding"))

While these two lines do download the same files, transformers is not able to load the models from that cache and attempts to download them again whenever a from_pretrained call is made. This remains the case even if I explicitly set TRANSFORMERS_CACHE to point to the HuggingFace hub cache directory.

I noticed that the structure of the caches generated by snapshot_download is different from the cache structure of transformers using from_pretrained. Is that why it’s not working?
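For what it's worth, here is what I think the difference looks like. The hub-style cache that snapshot_download writes appears to name each repo folder after the repo id, while the older transformers cache used flat, hash-named files; this is a sketch of my understanding, not authoritative:

```python
# Sketch, assuming current huggingface_hub conventions: each repo gets a
# directory named models--{org}--{name} (containing snapshots/<revision>/),
# whereas the legacy transformers cache stored flat, hash-named files that
# only older from_pretrained code knew how to resolve.

def hub_cache_folder(repo_id: str) -> str:
    # "org/name" -> "models--org--name"; no org segment for root-level repos
    return "models--" + repo_id.replace("/", "--")

print(hub_cache_folder("t5-small"))
# models--t5-small
print(hub_cache_folder("sentence-transformers/all-MiniLM-L6-v2"))
# models--sentence-transformers--all-MiniLM-L6-v2
```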

In general, what is the best way to pre-download the models during the build phase?
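For reference, my current plan is a small build-time script along these lines. The cache path and model ids below are placeholders, not my real config values, and I am assuming the same directory can be pointed at from runtime via an environment variable:

```python
"""Sketch of a build-time pre-download step. snapshot_download only fetches
the repo files; nothing is deserialized into memory."""
import os

# Assumed location; the same directory must be visible at runtime,
# e.g. by exporting HF_HOME (or TRANSFORMERS_CACHE) in the image.
CACHE_DIR = os.environ.get("HF_HOME", "/opt/hf-cache")

MODEL_IDS = [
    "t5-small",  # stand-in for configs.get("models_names.paraphraser")
    "sentence-transformers/all-MiniLM-L6-v2",
]

def predownload(model_ids, cache_dir, download_fn=None):
    if download_fn is None:
        # Imported lazily so the function is easy to exercise with a stub.
        from huggingface_hub import snapshot_download
        download_fn = snapshot_download
    for repo_id in model_ids:
        # Files only -- no model weights are loaded into RAM here.
        download_fn(repo_id, cache_dir=cache_dir)

if __name__ == "__main__":
    predownload(MODEL_IDS, CACHE_DIR)
```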

Thanks.

@mostafa-samir I have worked on caching models in docker images, and what I think is missing in your approach is committing the cached models to a new image. See How to Commit Changes to a Docker Image (With Example) for an example.

HTH,
Vladimir