Unable to push embeddings onto GPU

Hmm well I tried a few things.

First, I tried a few different embedding models: two variants of nomic-embed-text-v1.5, MiniLM-L6-v2, and one more that was too big to fit in my VRAM. No change.

Next, I tried loading the embedding model on the CPU while keeping the chat model on the GPU, i.e.

    --nn-preload default:GGML:GPU:Llama-3.2-3B-Instruct-Q5_K_M.gguf \
    --nn-preload embedding:GGML:CPU:nomic-embed-text-v1.5.Q5_K_M.gguf \

to see whether I could speed things up just by not repeatedly transferring data between GPU and CPU. Chunking the text file is still extremely slow, and nvidia-smi reports roughly the same VRAM consumption as before.
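For context, the full invocation looks roughly like this. I'm sketching it from the LlamaEdge RAG setup I followed, so treat the server-side flags (--model-name, --ctx-size, --prompt-template) and their values as approximate rather than exact:

```
# Chat model stays on the GPU, embedding model is pinned to the CPU;
# the comma-separated server flags pair up (chat model first, embedding model second).
wasmedge --dir .:. \
  --nn-preload default:GGML:GPU:Llama-3.2-3B-Instruct-Q5_K_M.gguf \
  --nn-preload embedding:GGML:CPU:nomic-embed-text-v1.5.Q5_K_M.gguf \
  rag-api-server.wasm \
  --model-name Llama-3.2-3B-Instruct,nomic-embed-text-v1.5 \
  --ctx-size 4096,8192 \
  --prompt-template llama-3-chat,embedding
```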

I tried to see whether rag-api-server.wasm could be loaded with only the embedding model (for testing) and then asked to chunk a file, but it requires that the chat model be loaded as well.
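What I was really after there was a way to time the embedding path in isolation. The closest I can get is hitting the OpenAI-style embeddings endpoint directly once both models are loaded (assuming the default port 8080, and the embedding model name being whatever was passed to --model-name):

```
# One isolated embedding request, timed; the payload follows the
# OpenAI embeddings format
time curl -s http://localhost:8080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text-v1.5", "input": ["a short test sentence"]}'
```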

Lastly, I looked at whether I could adjust bfloat16 vs. float32 usage anywhere, but my understanding is that those tensor types are baked into the GGUF (or determined during training?) and the llama-core libraries, so I don't think I can change them. I don't see any reference to them in the script WasmEdge supplies for building WasmEdge and the WASI-NN GGML plugin, either.
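As a sanity check on that point, the tensor types actually stored in the GGUF can be listed with the gguf-dump tool from llama.cpp's gguf Python package; this only confirms they're fixed at conversion time rather than chosen by the runtime:

```
# Assumes `pip install gguf` (llama.cpp's GGUF tooling); dumps the file's
# metadata plus the dtype/shape of every tensor it contains
gguf-dump nomic-embed-text-v1.5.Q5_K_M.gguf | head -n 60
```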

What would be an obvious next step for debugging? Or a workaround to get reasonably fast RAG going on mid-range consumer hardware?
