Unable to push embeddings onto GPU

The behavior of the embedding model on the GPU in WasmEdge seems strange.
I'm not sure whether the issue is with WasmEdge, llama.cpp, or something else.
Since the calculation does produce results, the model appears to be loaded into VRAM, but the GPU doesn't seem to be utilized properly during the computation, which leads to the unusual behavior.

Alternatively, is there a workaround to get a reasonably fast RAG pipeline running on mid-range consumer hardware?

If you're not tied to WasmEdge, Ollama could be a lighter option for simple tasks, and vLLM could handle longer contexts (processing text beyond 8,000 tokens). In either case, neither should run into hardware-specification issues.
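If it helps, here is a rough sketch of what the Ollama route could look like for the embedding side of a RAG pipeline. The endpoint and response shape come from Ollama's HTTP API; the model name `nomic-embed-text` is just a placeholder for whichever embedding model you pull.

```python
import json
import urllib.request

# Ollama serves an HTTP API on port 11434 by default; the /api/embeddings
# endpoint takes a model name and a prompt and returns {"embedding": [...]}.
OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_embedding_request(model: str, prompt: str) -> bytes:
    """Build the JSON body Ollama expects for an embedding request."""
    return json.dumps({"model": model, "prompt": prompt}).encode("utf-8")

def get_embedding(model: str, prompt: str) -> list:
    """Fetch an embedding from a locally running Ollama server.

    Requires `ollama serve` to be running and the model pulled beforehand
    (e.g. `ollama pull nomic-embed-text` -- placeholder model name).
    """
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_embedding_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

if __name__ == "__main__":
    # Only the request-building step runs without a live server.
    print(build_embedding_request("nomic-embed-text", "hello world").decode())
```

Ollama offloads layers to the GPU automatically when one is detected, so this sidesteps the manual GPU configuration entirely; whether it is fast enough depends on your corpus size.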