Transformer model works locally but not in Docker container

Hi there! Fairly new to this space so bear with me…

I’m trying to containerize a model called CLAP so that I can use a FastAPI endpoint to return embeddings for a text query. CLAP uses the Hugging Face Transformers RoBERTa model (RobertaModel.from_pretrained("roberta-base")) under the hood for its text embeddings.

I was able to get CLAP running in my FastAPI application locally without many problems. However, when I containerize the application, it seems to fail when passing the input_ids/attention_mask (the tokenization output) to the RoBERTa model, which should return my embeddings.

There are no internal errors, but my API returns an empty 500 response instead of the embeddings it returns locally.

I’m guessing this is somehow related to memory usage in the container? But I am not presented with any OOM notifications. When plotting memory usage over time, I see it spike to 8 GB and then drop immediately on failure, which is suspicious.
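For what it's worth, here's the kind of quick check I can run inside the container to see whether a cgroup memory limit is actually being applied (just a sketch; the paths are the standard cgroup v2/v1 locations, nothing specific to my setup):

# sketch: read the container's cgroup memory limit from inside the container
from pathlib import Path

def container_memory_limit():
    # cgroup v2 first (newer Docker), then cgroup v1
    for p in ("/sys/fs/cgroup/memory.max",
              "/sys/fs/cgroup/memory/memory.limit_in_bytes"):
        path = Path(p)
        if path.exists():
            return path.read_text().strip()  # "max" means no limit under cgroup v2
    return "no cgroup memory file found"

print("memory limit:", container_memory_limit())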

Are there any example implementations of running RoBERTa in a Docker container?

Here is my Dockerfile

FROM python:3.11.6

WORKDIR /workspace

COPY ./services/clap/requirements.txt ./requirements.txt

RUN pip install --no-cache-dir --upgrade -r ./requirements.txt

COPY ./services/clap/app ./app

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--reload"]

I’ve also tried mounting my CLAP checkpoint and the Transformers cache as volumes instead of keeping them inside the container:

clap_api:
    build:
        context: .
        dockerfile: ./services/clap/Dockerfile
    ports:
        - 8002:8000
    restart: always
    volumes:
        - ./services/clap/app:/workspace/app
        - ./services/clap/clap-data:/clap-data # where my CLAP checkpoint is stored
        - ./services/clap/cache:/root/.cache/huggingface/hub # default cache location
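To make sure the mounted cache is actually the one Transformers uses, I can list it from inside the container with something like this (a sketch; it just assumes the default ~/.cache/huggingface/hub location, since I haven't set HF_HOME or TRANSFORMERS_CACHE explicitly):

# sketch: verify the mounted Hugging Face cache is visible inside the container
import os
from pathlib import Path

cache_dir = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface")) / "hub"
print("cache dir:", cache_dir, "exists:", cache_dir.exists())
for entry in sorted(cache_dir.glob("*")):
    print(" ", entry.name)  # should include models--roberta-base after the first download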

My FastAPI route is simple; it just does the following (using the CLAP library):

import laion_clap

@app.post("/text-embedding")
def post_text_embedding(body: TextEmbeddingModel):
    # loads the CLAP checkpoint (and its RoBERTa text branch) on every request
    model = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base')
    model.load_ckpt(get_laion_clap_ckpt())

    embeddings = model.get_text_embedding(body.queries)

    return {"embeddings": embeddings.tolist()}

The line within the CLAP repo that uses RoBERTa is the following, where self.text_branch is RobertaModel.from_pretrained('roberta-base'). You can see the definition of that variable a few lines higher.

If I comment out the contents of that elif statement, the API returns a response (without the text embeddings, of course), so I’ve narrowed it down to that line, where RoBERTa actually needs to run. That’s why I figured I could ask here, since it is Transformers/RoBERTa related.
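To isolate it further, this is roughly the minimal script I'd expect to exercise the same code path (tokenize, then pass input_ids/attention_mask to RoBERTa) without any of the CLAP wrapping; just a sketch, but I'm happy to run it inside the container:

# sketch: run just the RoBERTa text branch that CLAP uses under the hood
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text_branch = RobertaModel.from_pretrained("roberta-base")

batch = tokenizer(["a dog barking"], padding=True, return_tensors="pt")
with torch.no_grad():
    out = text_branch(input_ids=batch["input_ids"],
                      attention_mask=batch["attention_mask"])
print(out.pooler_output.shape)  # expect torch.Size([1, 768])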

I am so puzzled as to why this stops working in Docker but works locally… usually it’s the other way around! haha.

What other information can I provide to help debug this? I’m more than happy to send it over promptly.

To make matters more confusing: on application load, I have an iterator that prints out all the parameters that have been loaded. When I load the app in the Docker environment, it only prints out about two-thirds of the parameters. I then make a change in my code, which triggers a reload of my uvicorn application, and only then does it print the remaining parameters as the server restarts.

…it’s like I’m hitting some sort of memory limit or something within Docker preventing the iterator from continuing? I cannot replicate this locally, only when it’s in a container.

Any ideas? Here’s a video replicating this problem

Some updates below. I’d still really appreciate some help here if anyone has time! :crossed_fingers:

I fixed one issue with the output not being fully printed by adding ENV PYTHONUNBUFFERED=1 to my Dockerfile.

However, the original issue still persists. I’ve discovered that I can’t run this in Docker on my M2 MacBook, but my Intel MacBook can. Here’s the matrix (and, below it, a quick architecture check I want to try):

On my M2 MacBook:

Local: works
Docker: FAILS

On my Intel MacBook:

Local: works
Docker: works
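One thing I still need to rule out is whether the container on the M2 is running natively or under x86 emulation; a quick check inside both containers should tell me (sketch):

# sketch: print the architecture the container actually sees
import platform
print("machine:", platform.machine())  # e.g. "aarch64" vs "x86_64"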

Anyone have any idea why? Here’s a GitHub repository where you can reproduce the problem:

GitHub: uncvrd/clap-mre

Again, thanks to anyone who can provide guidance here!