Forcing offline mode for Hugging Face embeddings in a Docker container is slow

Docker LlamaIndex Flask App: Embedding Model Performance Issues in Offline Mode

Hi,

I am trying to run my LlamaIndex Flask application in Docker using the BAAI/bge-small embedding model. On every container restart it pulls the model from Hugging Face again. The app has to run in a no-internet environment, so if the container ever stops I cannot bring it back up, because the model download fails without internet access.

One solution I found was to download the embedding model during the Docker image build and bake it into the image, then run that image in the no-internet environment. This works, but it comes with terrible latency.
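
A minimal sketch of the kind of build-time pre-download I mean (simplified and illustrative; the script name and the use of sentence-transformers for the download are not my exact Dockerfile step, but the cache paths match the ones configured below):

```python
# download_model.py -- illustrative build-time script, e.g. run from the Dockerfile
# with `RUN python download_model.py` so the weights end up baked into an image layer.
import os

# Point the caches at paths inside the image (same paths the app reads at runtime).
os.environ["HF_HOME"] = "/app/.cache/huggingface"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "/app/.cache/sentence_transformers"

from sentence_transformers import SentenceTransformer

# Downloading the model once here populates the cache the offline app uses later.
SentenceTransformer("BAAI/bge-small-en-v1.5", cache_folder="/app/.cache/sentence_transformers")
```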

For a similar query with a file uploaded:

  • The online (internet-connected) version takes 2-5 s
  • The offline version takes 18-20 s

Why does this happen?

Below is a code snippet for offline mode:

```python
import os
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'
os.environ['HF_HUB_DISABLE_TELEMETRY'] = '1'
os.environ['HF_HUB_DISABLE_PROGRESS_BARS'] = '1'
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
# Set cache directories
os.environ['HF_HOME'] = '/app/.cache/huggingface'
os.environ['TRANSFORMERS_CACHE'] = '/app/.cache/huggingface'
os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/app/.cache/sentence_transformers'
os.environ['XDG_CACHE_HOME'] = '/app/.cache'
....
        embedding_model = HuggingFaceEmbedding(
            model_name="BAAI/bge-small-en-v1.5",
            cache_folder="/app/.cache/sentence_transformers",
            device="cpu",
            trust_remote_code=False,
            embed_batch_size=1
        )
        # Configure LlamaIndex global settings
        Settings.embed_model = embedding_model
        Settings.chunk_size = 1024
        Settings.chunk_overlap = 64
```
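
As a side note, this is the kind of quick check I can run inside the container to see whether the embedding step itself is what is slow (a sketch, assuming the `embedding_model` object from the snippet above is in scope):

```python
import time

# Time the first and second embedding calls separately, to distinguish
# one-off warm-up cost from the steady per-query cost.
t0 = time.perf_counter()
vec = embedding_model.get_text_embedding("warm-up query")
print(f"first embedding call: {time.perf_counter() - t0:.2f}s (dim={len(vec)})")

t0 = time.perf_counter()
embedding_model.get_text_embedding("a second, already warmed-up query")
print(f"second embedding call: {time.perf_counter() - t0:.2f}s")
```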

This is how I run it in the internet-connected environment, where the model is downloaded at runtime:

```python
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Global LLM instance
llm = Ollama(model="llama3.2-vision", request_timeout=3600, temperature=0.0)
Settings.llm = llm
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"
```
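
(My understanding, which may be wrong for my llama-index version, is that the `local:` shorthand resolves to roughly the explicit form below, so both setups should be loading the same weights:)

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Assumed rough equivalent of Settings.embed_model = "local:BAAI/bge-small-en-v1.5"
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```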

How can I correct this performance issue?


In the former, a 32-bit Transformers model is used on the CPU, while in the latter, a (presumably) 4-bit GGUF-quantized model is used via Ollama on either the CPU or GPU.

If VRAM is not an issue, would this approach be worth trying?

```python
import torch

embedding_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    cache_folder="/app/.cache/sentence_transformers",
    # device="cpu",
    device="cuda",
    torch_dtype=torch.bfloat16,  # Faster with recent GeForce
)
```