Unable to push embeddings onto GPU

The behavior of the embedding model on the GPU in WasmEdge seems strange.
I'm not sure whether the issue is with WasmEdge, llama.cpp, or something else.
Since the calculation does produce results, the model appears to be loaded into VRAM, but the GPU doesn't seem to be utilized properly during the computation, which leads to the unusual behavior.

Alternatively, is there a workaround to get a reasonably fast RAG pipeline running on mid-range consumer hardware?

If you're not tied to WasmEdge, Ollama could be a lighter option for simple tasks, and vLLM could handle longer contexts (processing text beyond 8,000 tokens). In either case, neither should run into hardware-specification issues.
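If it helps, here is a rough sketch of what the Ollama route could look like for the embedding side of a RAG pipeline. The endpoint and response shape come from Ollama's HTTP API; the model name `nomic-embed-text` is just a placeholder for whichever embedding model you pull.

```python
import json
import urllib.request

# Ollama serves an HTTP API on port 11434 by default; the /api/embeddings
# endpoint takes a model name and a prompt and returns {"embedding": [...]}.
OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_embedding_request(model: str, prompt: str) -> bytes:
    """Build the JSON body Ollama expects for an embedding request."""
    return json.dumps({"model": model, "prompt": prompt}).encode("utf-8")

def get_embedding(model: str, prompt: str) -> list:
    """Fetch an embedding from a locally running Ollama server.

    Requires `ollama serve` to be running and the model pulled beforehand
    (e.g. `ollama pull nomic-embed-text` -- placeholder model name).
    """
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_embedding_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

if __name__ == "__main__":
    # Only the request-building step runs without a live server.
    print(build_embedding_request("nomic-embed-text", "hello world").decode())
```

Ollama offloads layers to the GPU automatically when one is detected, so this sidesteps the manual GPU configuration entirely; whether it is fast enough depends on your corpus size.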