Docker LlamaIndex Flask App: Embedding Model Performance Issues in Offline Mode
Hi,
I am trying to run my LlamaIndex Flask application in Docker with the BAAI/bge-small-en-v1.5 embedding model. On every container start, the app tries to download the model from Hugging Face again. The container has to run in a no-internet environment, so if it stops I cannot bring it back up, because the download fails without internet access.
One solution I found was to download the embedding model during the Docker image build, bake it into the image, and then run the container in the no-internet environment. That works, but the latency is terrible.
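For context, the build-time download looks roughly like this (a sketch; the script name and cache path are illustrative, and the Dockerfile runs it with something like `RUN python download_model.py`):

```python
# download_model.py -- run once during `docker build` to bake the model
# into the image. The cache path is illustrative and must match the one
# used at runtime.
from sentence_transformers import SentenceTransformer

SentenceTransformer("BAAI/bge-small-en-v1.5",
                    cache_folder="/app/.cache/sentence_transformers")
```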
For a similar query with a file uploaded:
- The internet version takes 2-5s
- The offline mode version takes 18-20s
Why does this happen?
Below is a code snippet for offline mode:
```python
import os

# Offline flags; these must be set before transformers / huggingface_hub
# are imported so they take effect.
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'
os.environ['HF_HUB_DISABLE_TELEMETRY'] = '1'
os.environ['HF_HUB_DISABLE_PROGRESS_BARS'] = '1'
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
# Set cache directories
os.environ['HF_HOME'] = '/app/.cache/huggingface'
os.environ['TRANSFORMERS_CACHE'] = '/app/.cache/huggingface'
os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/app/.cache/sentence_transformers'
os.environ['XDG_CACHE_HOME'] = '/app/.cache'
# ...

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embedding_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    cache_folder="/app/.cache/sentence_transformers",  # matches the cache baked into the image
    device="cpu",
    trust_remote_code=False,
    embed_batch_size=1,  # batch size of 1; the "local:" setup below uses the library default
)
# Configure LlamaIndex global settings
Settings.embed_model = embedding_model
Settings.chunk_size = 1024
Settings.chunk_overlap = 64
```
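To check whether the time goes into the embedding step itself, I can time a single call like this (a rough sketch; the sample text is arbitrary):

```python
import time

# Rough timing of one embedding call; not a rigorous benchmark.
start = time.perf_counter()
vector = embedding_model.get_text_embedding("hello world")
print(f"embedding took {time.perf_counter() - start:.2f} s, dim={len(vector)}")
```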
For comparison, this is how I ran the app in the internet-connected environment, where the model is downloaded at runtime:

```python
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Global LLM instance
llm = Ollama(model="llama3.2-vision", request_timeout=3600, temperature=0.0)
Settings.llm = llm
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"
```
How can I correct this performance issue?