Unable to push embeddings onto GPU

Ubuntu 24.04, RTX 4060, i5-13400F, 16 GB DDR5 RAM

I’m new to WASM and Llama, but I’m trying to set up a local RAG API server. What I’d eventually like to do is repeatedly re-index a text file (one that keeps growing) and be able to ask the server questions about it.

My issue right now is that the Instruct model will use the GPU when I set it up as a chat server alone, but the embedding model doesn’t use the GPU when I instantiate a server with both the Instruct and embedding models. I set up the server using the following command…

wasmedge --dir .:. --nn-preload default:GGML:GPU:Llama-3.2-3B-Instruct-Q5_K_M.gguf \
    --nn-preload embedding:GGML:GPU:nomic-embed-text-v1.5.Q5_K_M.gguf \
    rag-api-server.wasm \
    --model-name Llama-3.2-3B-Instruct-Q5_K_M,nomic-embed-text-v1.5.Q5_K_M \
    --ctx-size 4096,348 \
    --batch-size 512,512 \
    --prompt-template llama-3-chat,embedding \
    --rag-policy system-message \
    --qdrant-collection-name default \
    --qdrant-limit 3 \
    --qdrant-score-threshold 0.5 \
    --rag-prompt "Use the following pieces of context to answer the user's question.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n" \
    --port 8080

I then issue…

curl -X POST http://127.0.0.1:8080/v1/create/rag -F "file=@filename.txt"

The request then takes something like 30 minutes to return. When I watch htop and nvidia-smi, I see individual CPU cores pegged at 100%, cycling from core to core, while the GPU is barely utilized, if at all.
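(For reference, this is roughly how I’m watching utilization in a second terminal; nothing project-specific, just stock NVIDIA tooling:)

watch -n 1 nvidia-smi      # VRAM usage and overall GPU utilization, refreshed every second
nvidia-smi dmon -s u       # compact per-second SM / memory utilization columns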

Below are server logs, in case they’re helpful.


[2025-06-01 12:01:59.701] [info] rag_api_server in src/main.rs:189: log_level: info
[2025-06-01 12:01:59.701] [info] rag_api_server in src/main.rs:192: server_version: 0.13.15
[2025-06-01 12:01:59.701] [info] rag_api_server in src/main.rs:200: model_name: Llama-3.2-3B-Instruct-Q5_K_M,nomic-embed-text-v1.5.Q5_K_M
[2025-06-01 12:01:59.701] [info] rag_api_server in src/main.rs:208: model_alias: default,embedding
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:222: ctx_size: 4096,348
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:236: batch_size: 512,512
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:250: ubatch_size: 512,512
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:264: prompt_template: llama-3-chat,embedding
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:272: n_predict: -1
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:275: n_gpu_layers: 100
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:278: split_mode: layer
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:291: threads: 2
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:305: rag_prompt: Use the following pieces of context to answer the user's question.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:326: qdrant_url: http://127.0.0.1:6333
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:353: qdrant_collection_name: default
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:362: qdrant_limit: 3
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:371: qdrant_score_threshold: 0.5
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:399: chunk_capacity: 100
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:402: context_window: 1
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:408: rag_policy: system-message
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:427: include_usage: false
[2025-06-01 12:01:59.702] [info] llama_core in /home/laurentius/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/llama-core-0.30.0/src/lib.rs:189: Initializing the core context for RAG scenarios
[2025-06-01 12:01:59.702] [info] [WASI-NN] GGML backend: LLAMA_COMMIT cf0a43bb
[2025-06-01 12:01:59.702] [info] [WASI-NN] GGML backend: LLAMA_BUILD_NUMBER 5361
[2025-06-01 12:01:59.706] [info] [WASI-NN] llama.cpp: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
[2025-06-01 12:01:59.706] [info] [WASI-NN] llama.cpp: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[2025-06-01 12:01:59.706] [info] [WASI-NN] llama.cpp: ggml_cuda_init: found 1 CUDA devices:
[2025-06-01 12:01:59.706] [info] [WASI-NN] llama.cpp:   Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes
[2025-06-01 12:01:59.749] [info] [WASI-NN] llama.cpp: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060) - 5561 MiB free
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: loaded meta data with 31 key-value pairs and 255 tensors from Llama-3.2-3B-Instruct-Q5_K_M.gguf (version GGUF V3 (latest))
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   0:                       general.architecture str              = llama
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   1:                               general.type str              = model
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   5:                         general.size_label str              = 3B
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   6:                            general.license str              = llama3.2
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   9:                          llama.block_count u32              = 28
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
[2025-06-01 12:01:59.788] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-06-01 12:01:59.793] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  30:                          general.file_type u32              = 17
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - type  f32:   58 tensors
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - type q5_K:  168 tensors
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - type q6_K:   29 tensors
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: print_info: file format = GGUF V3 (latest)
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: print_info: file type   = Q5_K - Medium
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: print_info: file size   = 2.16 GiB (5.76 BPW) 
[2025-06-01 12:01:59.913] [info] [WASI-NN] llama.cpp: load: special tokens cache size = 256
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: load: token to piece cache size = 0.7999 MB
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: arch             = llama
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: vocab_only       = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_ctx_train      = 131072
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_embd           = 3072
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_layer          = 28
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_head           = 24
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_head_kv        = 8
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_rot            = 128
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_swa            = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_swa_pattern    = 1
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_embd_head_k    = 128
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_embd_head_v    = 128
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_gqa            = 3
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_embd_k_gqa     = 1024
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_embd_v_gqa     = 1024
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_norm_eps       = 0.0e+00
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_norm_rms_eps   = 1.0e-05
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_clamp_kqv      = 0.0e+00
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_max_alibi_bias = 0.0e+00
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_logit_scale    = 0.0e+00
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_attn_scale     = 0.0e+00
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_ff             = 8192
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_expert         = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_expert_used    = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: causal attn      = 1
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: pooling type     = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: rope type        = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: rope scaling     = linear
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: freq_base_train  = 500000.0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: freq_scale_train = 1
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_ctx_orig_yarn  = 131072
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: rope_finetuned   = unknown
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: ssm_d_conv       = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: ssm_d_inner      = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: ssm_d_state      = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: ssm_dt_rank      = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: ssm_dt_b_c_rms   = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: model type       = 3B
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: model params     = 3.21 B
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: general.name     = Llama 3.2 3B Instruct
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: vocab type       = BPE
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_vocab          = 128256
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_merges         = 280147
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: BOS token        = 128000 '<|begin_of_text|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: EOS token        = 128009 '<|eot_id|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: EOT token        = 128009 '<|eot_id|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: EOM token        = 128008 '<|eom_id|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: LF token         = 198 'Ċ'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: EOG token        = 128008 '<|eom_id|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: EOG token        = 128009 '<|eot_id|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: max token length = 256
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-06-01 12:01:59.987] [info] [WASI-NN] llama.cpp: load_tensors: offloading 28 repeating layers to GPU
[2025-06-01 12:01:59.987] [info] [WASI-NN] llama.cpp: load_tensors: offloading output layer to GPU
[2025-06-01 12:01:59.987] [info] [WASI-NN] llama.cpp: load_tensors: offloaded 29/29 layers to GPU
[2025-06-01 12:01:59.987] [info] [WASI-NN] llama.cpp: load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
[2025-06-01 12:01:59.987] [info] [WASI-NN] llama.cpp: load_tensors:        CUDA0 model buffer size =  2207.10 MiB
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: constructing llama_context
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: n_seq_max     = 1
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: n_ctx         = 4096
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: n_ctx_per_seq = 4096
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: n_batch       = 512
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: n_ubatch      = 512
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: causal_attn   = 1
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: flash_attn    = 0
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: freq_base     = 500000.0
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: freq_scale    = 1
[2025-06-01 12:02:00.263] [warning] [WASI-NN] llama.cpp: llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[2025-06-01 12:02:00.264] [info] [WASI-NN] llama.cpp: llama_context:  CUDA_Host  output buffer size =     0.49 MiB
[2025-06-01 12:02:00.264] [info] [WASI-NN] llama.cpp: llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1, padding = 32
[2025-06-01 12:02:00.266] [info] [WASI-NN] llama.cpp: llama_kv_cache_unified:      CUDA0 KV buffer size =   448.00 MiB
[2025-06-01 12:02:00.266] [info] [WASI-NN] llama.cpp: llama_kv_cache_unified: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
[2025-06-01 12:02:00.278] [info] [WASI-NN] llama.cpp: llama_context:      CUDA0 compute buffer size =   256.50 MiB
[2025-06-01 12:02:00.278] [info] [WASI-NN] llama.cpp: llama_context:  CUDA_Host compute buffer size =    14.01 MiB
[2025-06-01 12:02:00.278] [info] [WASI-NN] llama.cpp: llama_context: graph nodes  = 958
[2025-06-01 12:02:00.278] [info] [WASI-NN] llama.cpp: llama_context: graph splits = 2
[2025-06-01 12:02:00.278] [info] [WASI-NN] GGML backend: llama_system_info: CUDA : ARCHS = 600,610,700,750,800,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | BMI2 = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
[2025-06-01 12:02:00.278] [info] [WASI-NN] GGML backend: LLAMA_COMMIT cf0a43bb
[2025-06-01 12:02:00.278] [info] [WASI-NN] GGML backend: LLAMA_BUILD_NUMBER 5361
[2025-06-01 12:02:00.279] [info] [WASI-NN] llama.cpp: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060) - 2647 MiB free
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from nomic-embed-text-v1.5.Q5_K_M.gguf (version GGUF V3 (latest))
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   8:                          general.file_type u32              = 17
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
[2025-06-01 12:02:00.284] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-06-01 12:02:00.289] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  22:               general.quantization_version u32              = 2
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - type  f32:   51 tensors
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - type q5_K:   43 tensors
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - type q6_K:   18 tensors
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: print_info: file format = GGUF V3 (latest)
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: print_info: file type   = Q5_K - Medium
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: print_info: file size   = 94.25 MiB (5.78 BPW) 
[2025-06-01 12:02:00.301] [warning] [WASI-NN] llama.cpp: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
[2025-06-01 12:02:00.301] [info] [WASI-NN] llama.cpp: load: special tokens cache size = 5
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: load: token to piece cache size = 0.2032 MB
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: arch             = nomic-bert
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: vocab_only       = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_ctx_train      = 2048
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_embd           = 768
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_layer          = 12
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_head           = 12
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_head_kv        = 12
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_rot            = 64
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_swa            = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_swa_pattern    = 1
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_embd_head_k    = 64
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_embd_head_v    = 64
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_gqa            = 1
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_embd_k_gqa     = 768
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_embd_v_gqa     = 768
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_norm_eps       = 1.0e-12
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_norm_rms_eps   = 0.0e+00
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_clamp_kqv      = 0.0e+00
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_max_alibi_bias = 0.0e+00
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_logit_scale    = 0.0e+00
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_attn_scale     = 0.0e+00
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_ff             = 3072
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_expert         = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_expert_used    = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: causal attn      = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: pooling type     = 1
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: rope type        = 2
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: rope scaling     = linear
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: freq_base_train  = 1000.0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: freq_scale_train = 1
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_ctx_orig_yarn  = 2048
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: rope_finetuned   = unknown
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: ssm_d_conv       = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: ssm_d_inner      = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: ssm_d_state      = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: ssm_dt_rank      = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: ssm_dt_b_c_rms   = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: model type       = 137M
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: model params     = 136.73 M
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: general.name     = nomic-embed-text-v1.5
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: vocab type       = WPM
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: n_vocab          = 30522
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: n_merges         = 0
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: BOS token        = 101 '[CLS]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: EOS token        = 102 '[SEP]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: UNK token        = 100 '[UNK]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: SEP token        = 102 '[SEP]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: PAD token        = 0 '[PAD]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: MASK token       = 103 '[MASK]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: LF token         = 0 '[PAD]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: EOG token        = 102 '[SEP]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: max token length = 21
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloading 12 repeating layers to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloading output layer to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloaded 13/13 layers to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors:   CPU_Mapped model buffer size =    15.38 MiB
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors:        CUDA0 model buffer size =    78.88 MiB
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: constructing llama_context
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: n_seq_max     = 1
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: n_ctx         = 348
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: n_ctx_per_seq = 348
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: n_batch       = 512
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: n_ubatch      = 512
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: causal_attn   = 0
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: flash_attn    = 0
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: freq_base     = 1000.0
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: freq_scale    = 1
[2025-06-01 12:02:00.321] [warning] [WASI-NN] llama.cpp: llama_context: n_ctx_per_seq (348) < n_ctx_train (2048) -- the full capacity of the model will not be utilized
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context:  CUDA_Host  output buffer size =     0.12 MiB
[2025-06-01 12:02:00.329] [info] [WASI-NN] GGML backend: llama_system_info: CUDA : ARCHS = 600,610,700,750,800,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | BMI2 = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 348
[2025-06-01 12:02:00.329] [info] rag_api_server in src/main.rs:542: plugin_ggml_version: b5361 (commit cf0a43bb)
[2025-06-01 12:02:00.329] [info] rag_api_server in src/main.rs:591: Listening on 0.0.0.0:8080

It seems that the GPU can’t be used for embeddings if the version of llama.cpp is old, but I’m not sure whether that’s the problem here.

Silly question, but I’m pretty new to this: if I’m using a GGUF that was built previously, and I’m running on WasmEdge / WASI-NN / LlamaEdge, at what point would I have compiled llama.cpp directly?

Since I’m just trying to quickly demo a tool, I’ve been running the install shell scripts provided on the respective repos (e.g. WasmEdge/utils/install_v2.sh at master · WasmEdge/WasmEdge · GitHub), and haven’t looked too deeply into the core libraries.


If the GGUF was converted correctly, even old files should work fine. (Very rarely the supported format range changes, but that usually isn’t something to worry about.)
As long as llama.cpp is reasonably up to date and not hitting a known bug, the file itself shouldn’t cause any issues at runtime…

Overall, since the GPU is recognized and nothing errors out, this is likely a configuration mistake or a bug rather than a fundamental limitation. VRAM should also be sufficient…

OK, I think I understand now: llama.cpp is used to quantize the GGUFs before I download them, so I’m not directly interacting with that project in the way I’ve set up my tool.

According to the GGUF’s page (second-state/nomic-embed-text-v1.5-GGUF · Hugging Face), it was quantized with llama.cpp b4120, a release from Nov. 2024, so it seems relatively recent.

Where should I look for a configuration mistake? Is there something I can test with the wasmedge call?


I assume the problem lies in the settings of the program running on WasmEdge rather than in WasmEdge itself, but I don’t know which program you are using… :sweat_smile:

Assuming that is the case, increasing the value of this option may help (see the example invocation after the help text below).

-g, --n-gpu-layers <N_GPU_LAYERS>
Number of layers to run on the GPU

   [default: 100]
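For example, something like this (a sketch only: I’m assuming the flag is passed to rag-api-server.wasm the same way as your other options and that the single value applies to both loaded models; I’ve trimmed the qdrant/RAG flags here, so keep yours as they were):

wasmedge rag-api-server.wasm --help    # prints the full option list and defaults

wasmedge --dir .:. --nn-preload default:GGML:GPU:Llama-3.2-3B-Instruct-Q5_K_M.gguf \
    --nn-preload embedding:GGML:GPU:nomic-embed-text-v1.5.Q5_K_M.gguf \
    rag-api-server.wasm \
    --n-gpu-layers 100 \
    --model-name Llama-3.2-3B-Instruct-Q5_K_M,nomic-embed-text-v1.5.Q5_K_M \
    --ctx-size 4096,348 \
    --prompt-template llama-3-chat,embedding \
    --port 8080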

Edit:
I thought so, but looking at the log above, 100 seems to be enough…
The cause must be something else…

The core is fixed to an old version…

I’m not sure if this is the cause, but the version of Llama.cpp that is actually used may be old.

llama-core     = { version = "=0.30.0", features = ["logging", "rag", "index"] }
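If you want to experiment with that, a rough sketch of bumping the pin and rebuilding (this assumes you have the rag-api-server source checked out, and a newer llama-core may well not be API-compatible with the rest of the server):

sed -i 's/"=0.30.0"/"=0.31.0"/' Cargo.toml      # relax the exact llama-core pin to whatever newer version you want to test
cargo build --target wasm32-wasip1 --release    # rebuild the wasm module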

Hmm, I tried bumping the llama-core crate to newer versions (even the very next release, 0.31.0) and running cargo build --target wasm32-wasip1 --release again, but the build fails with a lot of mismatched types (i.e. the method calls have changed).

A few examples…

2438 | ... chat_prompt.build_with_tools(&mut chat_request.messages, Some(&[])) {
     |                 ---------------- ^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `ChatCompletionRequestMessage`, found a different `ChatCompletionRequestMessage`
     |                 |
     |                 arguments to this method are incorrect
     |

and

error[E0308]: mismatched types
    --> /home/laurentius/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/llama-core-0.31.0/src/chat.rs:2452:76
     |
2452 | ...chat_request.messages, Some(tools.as_slice()))
     |                           ---- ^^^^^^^^^^^^^^^^ expected `endpoints::chat::Tool`, found a different `endpoints::chat::Tool`
     |                           |
     |                           arguments to this enum variant are incorrect
     |

Changing the llama-core crate may be a heavy lift because I’d need to go in and understand many of the method calls and possibly rewrite some of them.
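(Noting for later: errors of the form “expected X, found a different X” in Rust usually mean that two different versions of the same crate, here endpoints, ended up in the dependency graph, so that’s what I’d check before rewriting anything. Plain cargo, nothing project-specific:)

cargo tree --duplicates    # list crates present at more than one version, and what pulls in each copy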


According to this (All releases of llama-core // Lib.rs), 0.30.0 is from April of this year, so its CUDA support should be up to date. I’ll try a few different embedding models.


When I checked the log, most of the embedding model seems to be loaded onto the GPU…

Could it be that part of it is still on the CPU, or that the computation itself is slow for some other reason, while the loading is fine?

[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
...
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloading 12 repeating layers to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloading output layer to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloaded 13/13 layers to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors:   CPU_Mapped model buffer size =    15.38 MiB
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors:        CUDA0 model buffer size =    78.88 MiB

Edit:
Your GPU supports bfloat16, but a similar situation can arise with float32 and so on in embedding models, for example. Unnecessary data movement between the CPU and GPU carries a speed penalty.

A bug around the cache, maybe?

Hmm well I tried a few things.

First, I tried a few different embedding models: two different nomic-embed-text-v1.5 builds, MiniLM-L6-v2, and one more that was too big to fit in my VRAM. No change.

Next, I tried loading the embedding model on the CPU instead, i.e.

--nn-preload default:GGML:GPU:Llama-3.2-3B-Instruct-Q5_K_M.gguf \
--nn-preload embedding:GGML:CPU:nomic-embed-text-v1.5.Q5_K_M.gguf \

to see if I could speed things up just by avoiding repeated transfers between GPU and CPU. It is still extremely slow while chunking the text file, and nvidia-smi reports very similar VRAM consumption to before.

I also tried to see whether rag-api-server.wasm could be loaded with only an embedding model (for testing) and then asked to chunk a file, but it requires that a chat model be loaded as well.

Lastly, I looked into whether I could adjust bfloat16 vs. float32 usage anywhere, but my understanding is that these tensor types are baked into the GGUF (or chosen during training/quantization?) and into the llama-core libraries, so I don’t think I can change them. I don’t see any reference to them in the script that WasmEdge supplies to build WasmEdge and the WASI-NN GGML plugin, either.
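(For what it’s worth, the tensor types baked into the GGUF can at least be inspected with llama.cpp’s gguf Python utilities; this is purely an inspection step, assuming the gguf package from PyPI:)

pip install gguf                                # llama.cpp's gguf-py helper scripts
gguf-dump nomic-embed-text-v1.5.Q5_K_M.gguf     # prints metadata plus per-tensor names and dtypes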

What would be an obvious next step for debugging? Or a workaround to get a reasonably quick RAG going on middle-range consumer hardware?


The behavior of the embedding model on the GPU in WasmEdge does seem strange…
I’m not sure whether the issue is with WasmEdge, llama.cpp, or something else…
Given that results are being produced, the model does appear to be loaded into VRAM, but the GPU isn’t being used properly during the computation, which would explain the unusual behavior.

Or a workaround to get a reasonably quick RAG going on middle-range consumer hardware?

If you’re not tied to WasmEdge, Ollama could be a lighter option for simple tasks, and vLLM could handle longer token lengths (processing text longer than about 8,000 tokens). In either case, your hardware shouldn’t be the limiting factor.
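For example, a quick way to sanity-check embedding speed outside of WasmEdge (assuming Ollama is installed and serving on its default port, and using the nomic-embed-text model from the Ollama library):

ollama pull nomic-embed-text
curl http://localhost:11434/api/embeddings \
    -d '{"model": "nomic-embed-text", "prompt": "a short test sentence"}'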

I may have found a similar unresolved issue.