Forcing offline mode for Hugging Face embeddings in a Docker container is slow

Docker LlamaIndex Flask App: Embedding Model Performance Issues in Offline Mode

Hi,

I am trying to run my LlamaIndex Flask application in Docker using the BAAI/bge-small embedding model. On every container restart it pulls the model from Hugging Face again. The app has to run in a no-internet environment, so if the container ever stops I cannot bring it back up, because the model download fails without internet access.

One solution I found was to download the embedding model during the Docker image build and bake it into the image, then run that image in the no-internet environment. This works, but it comes with terrible latency.
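
A minimal sketch of the kind of build-time pre-download I mean (simplified and illustrative; the script name and the use of sentence-transformers for the download are not my exact Dockerfile step, but the cache paths match the ones configured below):

```python
# download_model.py -- illustrative build-time script, e.g. run from the Dockerfile
# with `RUN python download_model.py` so the weights end up baked into an image layer.
import os

# Point the caches at paths inside the image (same paths the app reads at runtime).
os.environ["HF_HOME"] = "/app/.cache/huggingface"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "/app/.cache/sentence_transformers"

from sentence_transformers import SentenceTransformer

# Downloading the model once here populates the cache the offline app uses later.
SentenceTransformer("BAAI/bge-small-en-v1.5", cache_folder="/app/.cache/sentence_transformers")
```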

For a similar query with a file uploaded:

  • The online (internet-connected) version takes 2-5 s
  • The offline version takes 18-20 s

Why does this happen?

Below is a code snippet for offline mode:

```python
import os
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'
os.environ['HF_HUB_DISABLE_TELEMETRY'] = '1'
os.environ['HF_HUB_DISABLE_PROGRESS_BARS'] = '1'
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
# Set cache directories
os.environ['HF_HOME'] = '/app/.cache/huggingface'
os.environ['TRANSFORMERS_CACHE'] = '/app/.cache/huggingface'
os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/app/.cache/sentence_transformers'
os.environ['XDG_CACHE_HOME'] = '/app/.cache'
....
        embedding_model = HuggingFaceEmbedding(
            model_name="BAAI/bge-small-en-v1.5",
            cache_folder="/app/.cache/sentence_transformers",
            device="cpu",
            trust_remote_code=False,
            embed_batch_size=1
        )
        # Configure LlamaIndex global settings
        Settings.embed_model = embedding_model
        Settings.chunk_size = 1024
        Settings.chunk_overlap = 64
```
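
As a side note, this is the kind of quick check I can run inside the container to see whether the embedding step itself is what is slow (a sketch, assuming the `embedding_model` object from the snippet above is in scope):

```python
import time

# Time the first and second embedding calls separately, to distinguish
# one-off warm-up cost from the steady per-query cost.
t0 = time.perf_counter()
vec = embedding_model.get_text_embedding("warm-up query")
print(f"first embedding call: {time.perf_counter() - t0:.2f}s (dim={len(vec)})")

t0 = time.perf_counter()
embedding_model.get_text_embedding("a second, already warmed-up query")
print(f"second embedding call: {time.perf_counter() - t0:.2f}s")
```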

This is how I run it in the internet-connected environment, where the model is downloaded at runtime:

```python
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Global LLM instance
llm = Ollama(model="llama3.2-vision", request_timeout=3600, temperature=0.0)
Settings.llm = llm
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"
```
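
(My understanding, which may be wrong for my llama-index version, is that the `local:` shorthand resolves to roughly the explicit form below, so both setups should be loading the same weights:)

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Assumed rough equivalent of Settings.embed_model = "local:BAAI/bge-small-en-v1.5"
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```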

How can I correct this performance issue?


In the former, a 32-bit Transformers model is used on the CPU, while in the latter, a (presumably) 4-bit GGUF-quantized model is used via Ollama on either the CPU or GPU.

If VRAM is not an issue, would this approach be worth trying?

```python
import torch

embedding_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    cache_folder="/app/.cache/sentence_transformers",
    # device="cpu",
    device="cuda",
    torch_dtype=torch.bfloat16,  # Faster with recent GeForce
)
```