Unable to push embeddings onto GPU

Hmm well I tried a few things.

First, I tried a few different embedding models: two variants of nomic-embed-text-v1.5, MiniLM-L6-v2, and one more that was too big to fit in my VRAM. No change.

Next, I tried loading the embedding model on the CPU while keeping the chat model on the GPU, i.e.

    --nn-preload default:GGML:GPU:Llama-3.2-3B-Instruct-Q5_K_M.gguf \
    --nn-preload embedding:GGML:CPU:nomic-embed-text-v1.5.Q5_K_M.gguf \

to see whether I could speed things up just by not repeatedly transferring data between GPU and CPU. Chunking the text file is still extremely slow, and nvidia-smi reports roughly the same VRAM consumption as before.
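For context, the full invocation looks roughly like this. I'm sketching it from the LlamaEdge RAG setup I followed, so treat the server-side flags (--model-name, --ctx-size, --prompt-template) and their values as approximate rather than exact:

```
# Chat model stays on the GPU, embedding model is pinned to the CPU;
# the comma-separated server flags pair up (chat model first, embedding model second).
wasmedge --dir .:. \
  --nn-preload default:GGML:GPU:Llama-3.2-3B-Instruct-Q5_K_M.gguf \
  --nn-preload embedding:GGML:CPU:nomic-embed-text-v1.5.Q5_K_M.gguf \
  rag-api-server.wasm \
  --model-name Llama-3.2-3B-Instruct,nomic-embed-text-v1.5 \
  --ctx-size 4096,8192 \
  --prompt-template llama-3-chat,embedding
```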

I tried to see whether rag-api-server.wasm could be loaded with only the embedding model (for testing) and then asked to chunk a file, but it requires that the chat model be loaded as well.
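What I was really after there was a way to time the embedding path in isolation. The closest I can get is hitting the OpenAI-style embeddings endpoint directly once both models are loaded (assuming the default port 8080, and the embedding model name being whatever was passed to --model-name):

```
# One isolated embedding request, timed; the payload follows the
# OpenAI embeddings format
time curl -s http://localhost:8080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text-v1.5", "input": ["a short test sentence"]}'
```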

Lastly, I looked at whether I could adjust bfloat16 vs. float32 usage anywhere, but my understanding is that those tensor types are baked into the GGUF (or determined during training?) and the llama-core libraries, so I don't think I can change them. I don't see any reference to them in the script WasmEdge supplies for building WasmEdge and the WASI-NN GGML plugin, either.
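As a sanity check on that point, the tensor types actually stored in the GGUF can be listed with the gguf-dump tool from llama.cpp's gguf Python package; this only confirms they're fixed at conversion time rather than chosen by the runtime:

```
# Assumes `pip install gguf` (llama.cpp's GGUF tooling); dumps the file's
# metadata plus the dtype/shape of every tensor it contains
gguf-dump nomic-embed-text-v1.5.Q5_K_M.gguf | head -n 60
```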

What would be an obvious next step for debugging? Or a workaround to get reasonably fast RAG going on mid-range consumer hardware?
