Manually Downloading Models in docker build with snapshot_download

Hi,

To avoid re-downloading the models every time my Docker container starts, I want to download them once while building the Docker image. I wrote a small script that runs the following during the build:

    from sentence_transformers import SentenceTransformer
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    AutoTokenizer.from_pretrained(configs.get("models_names.tockenizer"))
    SentenceTransformer(configs.get("models_names.sentence_embedding"))
    AutoModelForSeq2SeqLM.from_pretrained(configs.get("models_names.paraphraser"))

While this works in theory, it breaks in practice: the models are not just downloaded, they are also loaded into memory, which raises an out-of-memory error on the build machine. I then read about snapshot_download (ref) and replaced my script with:

    from huggingface_hub import snapshot_download

    snapshot_download(configs.get("models_names.tockenizer"))
    snapshot_download(configs.get("models_names.sentence_embedding"))

While these two lines do download the same files, transformers is not able to load the models from them and tries to download them again whenever a from_pretrained call is made. This remains the case even if I explicitly set TRANSFORMERS_CACHE to point to the Hugging Face Hub cache directory.

I noticed that the cache structure generated by snapshot_download is different from the one transformers creates via from_pretrained. Is that why it’s not working?
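If the cache layout mismatch is really the culprit, the kind of workaround I can picture is bypassing the shared cache entirely: download each repo into an explicit directory at build time and load it by path at runtime. A rough sketch, assuming a huggingface_hub version that supports the local_dir parameter (the /opt/models paths are just placeholders):

    from huggingface_hub import snapshot_download
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Build time: materialize each repo into a plain directory baked into the image
    snapshot_download(configs.get("models_names.tockenizer"), local_dir="/opt/models/tokenizer")
    snapshot_download(configs.get("models_names.paraphraser"), local_dir="/opt/models/paraphraser")

    # Runtime: load by local path, so transformers never has to touch the hub cache
    tokenizer = AutoTokenizer.from_pretrained("/opt/models/tokenizer")
    model = AutoModelForSeq2SeqLM.from_pretrained("/opt/models/paraphraser")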

In general, what is the best way to pre-download the models during the build phase?

Thanks.


@mostafa-samir I worked on caching models in Docker images, and what I think you are missing in your approach is committing the cached models to a new image. See “How to Commit Changes to a Docker Image (With Example)” for a walkthrough. HTH, Vladimir

@mostafa-samir git-lfs clone the model in a dedicated build stage of the Dockerfile. The COPY --from in the next stage only pulls in the model files, so the intermediate layers from the prior stage are discarded. Here’s mine for an AWS Lambda function:

    # Stage 1: fetch the model with git-lfs; this stage is thrown away afterwards
    FROM public.ecr.aws/lambda/python:3.9 AS model
    RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | bash
    RUN yum install git-lfs -y
    RUN git lfs install
    RUN git clone https://huggingface.co/ccdv/lsg-bart-base-4096-wcep /tmp/model
    RUN rm -rf /tmp/model/.git

    # Stage 2: runtime image; only the model files are copied over from the first stage
    FROM public.ecr.aws/lambda/python:3.9
    ARG FUNCTION_DIR="/var/task"
    RUN mkdir -p ${FUNCTION_DIR}
    COPY summarize.py ${FUNCTION_DIR}
    COPY --from=model /tmp/model ${FUNCTION_DIR}/model
    RUN pip install --no-cache-dir transformers[torch]==4.21.2
    CMD [ "summarize.main" ]
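summarize.py then loads everything from the directory that was copied into the image, so nothing is fetched from the hub at runtime. A minimal sketch of what that can look like (the handler body is an illustration, not my exact code; the LSG checkpoint ships custom modeling code, hence trust_remote_code=True):

    # summarize.py -- sketch of loading the model baked into the image
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    MODEL_DIR = "/var/task/model"  # matches ${FUNCTION_DIR}/model in the Dockerfile

    # Loading from a local directory means no network access at runtime.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR, trust_remote_code=True)

    def main(event, context):
        # Lambda handler referenced by CMD [ "summarize.main" ]
        inputs = tokenizer(event["text"], return_tensors="pt", truncation=True, max_length=4096)
        summary_ids = model.generate(**inputs, max_new_tokens=128)
        return {"summary": tokenizer.decode(summary_ids[0], skip_special_tokens=True)}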