Unable to push embeddings onto GPU

Ubuntu 24.04, RTX 4060, i5-13400F, 16 GB DDR5 RAM

I’m new to WASM and Llama, but I’m trying to set up a local RAG API server. What I’d eventually like to do is repeatedly re-index a text file (one that keeps growing) and be able to ask the server questions about it.

My issue right now is that the Instruct model will use the GPU when I set it up as a chat server alone, but the embedding model doesn’t use the GPU when I instantiate a server with both the Instruct and embedding models. I set up the server using the following command…

wasmedge --dir .:. --nn-preload default:GGML:GPU:Llama-3.2-3B-Instruct-Q5_K_M.gguf \
    --nn-preload embedding:GGML:GPU:nomic-embed-text-v1.5.Q5_K_M.gguf \
    rag-api-server.wasm \
    --model-name Llama-3.2-3B-Instruct-Q5_K_M,nomic-embed-text-v1.5.Q5_K_M \
    --ctx-size 4096,348 \
    --batch-size 512,512 \
    --prompt-template llama-3-chat,embedding \
    --rag-policy system-message \
    --qdrant-collection-name default \
    --qdrant-limit 3 \
    --qdrant-score-threshold 0.5 \
    --rag-prompt "Use the following pieces of context to answer the user's question.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n" \
    --port 8080

I then issue…

curl -X POST http://127.0.0.1:8080/v1/create/rag -F "file=@filename.txt"

The request then takes something like 30 minutes to return. When I watch htop and nvidia-smi, I see individual CPU cores pegged at 100%, cycling from core to core, while the GPU is barely utilized, if at all.
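(For reference, this is roughly how I’m watching utilization in a second terminal; nothing project-specific, just stock NVIDIA tooling:)

watch -n 1 nvidia-smi      # VRAM usage and overall GPU utilization, refreshed every second
nvidia-smi dmon -s u       # compact per-second SM / memory utilization columns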

Below are server logs, in case they’re helpful.


[2025-06-01 12:01:59.701] [info] rag_api_server in src/main.rs:189: log_level: info
[2025-06-01 12:01:59.701] [info] rag_api_server in src/main.rs:192: server_version: 0.13.15
[2025-06-01 12:01:59.701] [info] rag_api_server in src/main.rs:200: model_name: Llama-3.2-3B-Instruct-Q5_K_M,nomic-embed-text-v1.5.Q5_K_M
[2025-06-01 12:01:59.701] [info] rag_api_server in src/main.rs:208: model_alias: default,embedding
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:222: ctx_size: 4096,348
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:236: batch_size: 512,512
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:250: ubatch_size: 512,512
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:264: prompt_template: llama-3-chat,embedding
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:272: n_predict: -1
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:275: n_gpu_layers: 100
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:278: split_mode: layer
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:291: threads: 2
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:305: rag_prompt: Use the following pieces of context to answer the user's question.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:326: qdrant_url: http://127.0.0.1:6333
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:353: qdrant_collection_name: default
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:362: qdrant_limit: 3
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:371: qdrant_score_threshold: 0.5
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:399: chunk_capacity: 100
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:402: context_window: 1
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:408: rag_policy: system-message
[2025-06-01 12:01:59.702] [info] rag_api_server in src/main.rs:427: include_usage: false
[2025-06-01 12:01:59.702] [info] llama_core in /home/laurentius/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/llama-core-0.30.0/src/lib.rs:189: Initializing the core context for RAG scenarios
[2025-06-01 12:01:59.702] [info] [WASI-NN] GGML backend: LLAMA_COMMIT cf0a43bb
[2025-06-01 12:01:59.702] [info] [WASI-NN] GGML backend: LLAMA_BUILD_NUMBER 5361
[2025-06-01 12:01:59.706] [info] [WASI-NN] llama.cpp: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
[2025-06-01 12:01:59.706] [info] [WASI-NN] llama.cpp: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[2025-06-01 12:01:59.706] [info] [WASI-NN] llama.cpp: ggml_cuda_init: found 1 CUDA devices:
[2025-06-01 12:01:59.706] [info] [WASI-NN] llama.cpp:   Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes
[2025-06-01 12:01:59.749] [info] [WASI-NN] llama.cpp: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060) - 5561 MiB free
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: loaded meta data with 31 key-value pairs and 255 tensors from Llama-3.2-3B-Instruct-Q5_K_M.gguf (version GGUF V3 (latest))
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   0:                       general.architecture str              = llama
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   1:                               general.type str              = model
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   5:                         general.size_label str              = 3B
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   6:                            general.license str              = llama3.2
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   9:                          llama.block_count u32              = 28
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
[2025-06-01 12:01:59.776] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
[2025-06-01 12:01:59.788] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-06-01 12:01:59.793] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  30:                          general.file_type u32              = 17
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - type  f32:   58 tensors
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - type q5_K:  168 tensors
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: llama_model_loader: - type q6_K:   29 tensors
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: print_info: file format = GGUF V3 (latest)
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: print_info: file type   = Q5_K - Medium
[2025-06-01 12:01:59.817] [info] [WASI-NN] llama.cpp: print_info: file size   = 2.16 GiB (5.76 BPW) 
[2025-06-01 12:01:59.913] [info] [WASI-NN] llama.cpp: load: special tokens cache size = 256
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: load: token to piece cache size = 0.7999 MB
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: arch             = llama
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: vocab_only       = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_ctx_train      = 131072
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_embd           = 3072
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_layer          = 28
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_head           = 24
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_head_kv        = 8
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_rot            = 128
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_swa            = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_swa_pattern    = 1
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_embd_head_k    = 128
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_embd_head_v    = 128
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_gqa            = 3
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_embd_k_gqa     = 1024
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_embd_v_gqa     = 1024
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_norm_eps       = 0.0e+00
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_norm_rms_eps   = 1.0e-05
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_clamp_kqv      = 0.0e+00
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_max_alibi_bias = 0.0e+00
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_logit_scale    = 0.0e+00
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: f_attn_scale     = 0.0e+00
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_ff             = 8192
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_expert         = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_expert_used    = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: causal attn      = 1
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: pooling type     = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: rope type        = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: rope scaling     = linear
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: freq_base_train  = 500000.0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: freq_scale_train = 1
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_ctx_orig_yarn  = 131072
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: rope_finetuned   = unknown
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: ssm_d_conv       = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: ssm_d_inner      = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: ssm_d_state      = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: ssm_dt_rank      = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: ssm_dt_b_c_rms   = 0
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: model type       = 3B
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: model params     = 3.21 B
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: general.name     = Llama 3.2 3B Instruct
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: vocab type       = BPE
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_vocab          = 128256
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: n_merges         = 280147
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: BOS token        = 128000 '<|begin_of_text|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: EOS token        = 128009 '<|eot_id|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: EOT token        = 128009 '<|eot_id|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: EOM token        = 128008 '<|eom_id|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: LF token         = 198 'Ċ'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: EOG token        = 128008 '<|eom_id|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: EOG token        = 128009 '<|eot_id|>'
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: print_info: max token length = 256
[2025-06-01 12:01:59.937] [info] [WASI-NN] llama.cpp: load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-06-01 12:01:59.987] [info] [WASI-NN] llama.cpp: load_tensors: offloading 28 repeating layers to GPU
[2025-06-01 12:01:59.987] [info] [WASI-NN] llama.cpp: load_tensors: offloading output layer to GPU
[2025-06-01 12:01:59.987] [info] [WASI-NN] llama.cpp: load_tensors: offloaded 29/29 layers to GPU
[2025-06-01 12:01:59.987] [info] [WASI-NN] llama.cpp: load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
[2025-06-01 12:01:59.987] [info] [WASI-NN] llama.cpp: load_tensors:        CUDA0 model buffer size =  2207.10 MiB
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: constructing llama_context
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: n_seq_max     = 1
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: n_ctx         = 4096
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: n_ctx_per_seq = 4096
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: n_batch       = 512
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: n_ubatch      = 512
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: causal_attn   = 1
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: flash_attn    = 0
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: freq_base     = 500000.0
[2025-06-01 12:02:00.263] [info] [WASI-NN] llama.cpp: llama_context: freq_scale    = 1
[2025-06-01 12:02:00.263] [warning] [WASI-NN] llama.cpp: llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[2025-06-01 12:02:00.264] [info] [WASI-NN] llama.cpp: llama_context:  CUDA_Host  output buffer size =     0.49 MiB
[2025-06-01 12:02:00.264] [info] [WASI-NN] llama.cpp: llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1, padding = 32
[2025-06-01 12:02:00.266] [info] [WASI-NN] llama.cpp: llama_kv_cache_unified:      CUDA0 KV buffer size =   448.00 MiB
[2025-06-01 12:02:00.266] [info] [WASI-NN] llama.cpp: llama_kv_cache_unified: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
[2025-06-01 12:02:00.278] [info] [WASI-NN] llama.cpp: llama_context:      CUDA0 compute buffer size =   256.50 MiB
[2025-06-01 12:02:00.278] [info] [WASI-NN] llama.cpp: llama_context:  CUDA_Host compute buffer size =    14.01 MiB
[2025-06-01 12:02:00.278] [info] [WASI-NN] llama.cpp: llama_context: graph nodes  = 958
[2025-06-01 12:02:00.278] [info] [WASI-NN] llama.cpp: llama_context: graph splits = 2
[2025-06-01 12:02:00.278] [info] [WASI-NN] GGML backend: llama_system_info: CUDA : ARCHS = 600,610,700,750,800,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | BMI2 = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
[2025-06-01 12:02:00.278] [info] [WASI-NN] GGML backend: LLAMA_COMMIT cf0a43bb
[2025-06-01 12:02:00.278] [info] [WASI-NN] GGML backend: LLAMA_BUILD_NUMBER 5361
[2025-06-01 12:02:00.279] [info] [WASI-NN] llama.cpp: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060) - 2647 MiB free
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from nomic-embed-text-v1.5.Q5_K_M.gguf (version GGUF V3 (latest))
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   8:                          general.file_type u32              = 17
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
[2025-06-01 12:02:00.282] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
[2025-06-01 12:02:00.284] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-06-01 12:02:00.289] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv  22:               general.quantization_version u32              = 2
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - type  f32:   51 tensors
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - type q5_K:   43 tensors
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: llama_model_loader: - type q6_K:   18 tensors
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: print_info: file format = GGUF V3 (latest)
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: print_info: file type   = Q5_K - Medium
[2025-06-01 12:02:00.291] [info] [WASI-NN] llama.cpp: print_info: file size   = 94.25 MiB (5.78 BPW) 
[2025-06-01 12:02:00.301] [warning] [WASI-NN] llama.cpp: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
[2025-06-01 12:02:00.301] [info] [WASI-NN] llama.cpp: load: special tokens cache size = 5
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: load: token to piece cache size = 0.2032 MB
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: arch             = nomic-bert
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: vocab_only       = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_ctx_train      = 2048
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_embd           = 768
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_layer          = 12
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_head           = 12
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_head_kv        = 12
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_rot            = 64
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_swa            = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_swa_pattern    = 1
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_embd_head_k    = 64
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_embd_head_v    = 64
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_gqa            = 1
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_embd_k_gqa     = 768
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_embd_v_gqa     = 768
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_norm_eps       = 1.0e-12
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_norm_rms_eps   = 0.0e+00
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_clamp_kqv      = 0.0e+00
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_max_alibi_bias = 0.0e+00
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_logit_scale    = 0.0e+00
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: f_attn_scale     = 0.0e+00
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_ff             = 3072
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_expert         = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_expert_used    = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: causal attn      = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: pooling type     = 1
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: rope type        = 2
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: rope scaling     = linear
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: freq_base_train  = 1000.0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: freq_scale_train = 1
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: n_ctx_orig_yarn  = 2048
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: rope_finetuned   = unknown
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: ssm_d_conv       = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: ssm_d_inner      = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: ssm_d_state      = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: ssm_dt_rank      = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: ssm_dt_b_c_rms   = 0
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: model type       = 137M
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: model params     = 136.73 M
[2025-06-01 12:02:00.303] [info] [WASI-NN] llama.cpp: print_info: general.name     = nomic-embed-text-v1.5
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: vocab type       = WPM
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: n_vocab          = 30522
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: n_merges         = 0
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: BOS token        = 101 '[CLS]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: EOS token        = 102 '[SEP]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: UNK token        = 100 '[UNK]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: SEP token        = 102 '[SEP]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: PAD token        = 0 '[PAD]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: MASK token       = 103 '[MASK]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: LF token         = 0 '[PAD]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: EOG token        = 102 '[SEP]'
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: print_info: max token length = 21
[2025-06-01 12:02:00.304] [info] [WASI-NN] llama.cpp: load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloading 12 repeating layers to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloading output layer to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloaded 13/13 layers to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors:   CPU_Mapped model buffer size =    15.38 MiB
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors:        CUDA0 model buffer size =    78.88 MiB
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: constructing llama_context
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: n_seq_max     = 1
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: n_ctx         = 348
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: n_ctx_per_seq = 348
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: n_batch       = 512
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: n_ubatch      = 512
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: causal_attn   = 0
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: flash_attn    = 0
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: freq_base     = 1000.0
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context: freq_scale    = 1
[2025-06-01 12:02:00.321] [warning] [WASI-NN] llama.cpp: llama_context: n_ctx_per_seq (348) < n_ctx_train (2048) -- the full capacity of the model will not be utilized
[2025-06-01 12:02:00.321] [info] [WASI-NN] llama.cpp: llama_context:  CUDA_Host  output buffer size =     0.12 MiB
[2025-06-01 12:02:00.329] [info] [WASI-NN] GGML backend: llama_system_info: CUDA : ARCHS = 600,610,700,750,800,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | BMI2 = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 348
[2025-06-01 12:02:00.329] [info] rag_api_server in src/main.rs:542: plugin_ggml_version: b5361 (commit cf0a43bb)
[2025-06-01 12:02:00.329] [info] rag_api_server in src/main.rs:591: Listening on 0.0.0.0:8080

It seems that the GPU can’t be used for embeddings if the version of llama.cpp is old, but I’m not sure whether that’s the problem here.

Silly question, but I’m pretty new to this: if I’m using a GGUF that was built previously, and I’m running on WasmEdge / WASI-NN / LlamaEdge, at what point would I have compiled llama.cpp directly?

Since I’m just trying to quickly demo a tool, I’ve been running the install shell scripts provided on the respective repos (e.g. WasmEdge/utils/install_v2.sh at master · WasmEdge/WasmEdge · GitHub), and haven’t looked too deeply into the core libraries.


If the GGUF was converted correctly, even old files should work fine. (Very rarely the supported format range changes, but that usually isn’t something to worry about.)
As long as llama.cpp is reasonably up to date and not hitting a known bug, the file itself shouldn’t cause any issues at runtime…

Overall, since the GPU is recognized and nothing errors out, this is likely a configuration mistake or a bug rather than a fundamental limitation. VRAM should also be sufficient…

OK, I think I understand now: llama.cpp is used to quantize the GGUFs before I download them, so I’m not directly interacting with that project in the way I’ve set up my tool.

According to the GGUF’s page (second-state/nomic-embed-text-v1.5-GGUF · Hugging Face), it was quantized with llama.cpp b4120, a release from Nov. 2024, so it seems relatively recent.

Where should I look for a configuration mistake? Is there something I can test with the wasmedge call?


I assume the problem lies in the settings of the program running on WasmEdge rather than in WasmEdge itself, but I don’t know which program you are using… :sweat_smile:

Assuming that is the case, increasing the value of this option may help (see the example invocation after the help text below).

-g, --n-gpu-layers <N_GPU_LAYERS>
Number of layers to run on the GPU

   [default: 100]
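For example, something like this (a sketch only: I’m assuming the flag is passed to rag-api-server.wasm the same way as your other options and that the single value applies to both loaded models; I’ve trimmed the qdrant/RAG flags here, so keep yours as they were):

wasmedge rag-api-server.wasm --help    # prints the full option list and defaults

wasmedge --dir .:. --nn-preload default:GGML:GPU:Llama-3.2-3B-Instruct-Q5_K_M.gguf \
    --nn-preload embedding:GGML:GPU:nomic-embed-text-v1.5.Q5_K_M.gguf \
    rag-api-server.wasm \
    --n-gpu-layers 100 \
    --model-name Llama-3.2-3B-Instruct-Q5_K_M,nomic-embed-text-v1.5.Q5_K_M \
    --ctx-size 4096,348 \
    --prompt-template llama-3-chat,embedding \
    --port 8080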

Edit:
I thought so, but looking at the log above, 100 seems to be enough…
The cause must be something else…

The core is fixed to an old version…

I’m not sure if this is the cause, but the version of Llama.cpp that is actually used may be old.

llama-core     = { version = "=0.30.0", features = ["logging", "rag", "index"] }
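If you want to experiment with that, a rough sketch of bumping the pin and rebuilding (this assumes you have the rag-api-server source checked out, and a newer llama-core may well not be API-compatible with the rest of the server):

sed -i 's/"=0.30.0"/"=0.31.0"/' Cargo.toml      # relax the exact llama-core pin to whatever newer version you want to test
cargo build --target wasm32-wasip1 --release    # rebuild the wasm module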

Hmm, I tried bumping the llama-core crate to newer versions (even the very next release, 0.31.0) and running cargo build --target wasm32-wasip1 --release again, but the build fails with a lot of mismatched types (i.e. the method calls have changed).

A few examples…

2438 | ... chat_prompt.build_with_tools(&mut chat_request.messages, Some(&[])) {
     |                 ---------------- ^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `ChatCompletionRequestMessage`, found a different `ChatCompletionRequestMessage`
     |                 |
     |                 arguments to this method are incorrect
     |

and

error[E0308]: mismatched types
    --> /home/laurentius/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/llama-core-0.31.0/src/chat.rs:2452:76
     |
2452 | ...chat_request.messages, Some(tools.as_slice()))
     |                           ---- ^^^^^^^^^^^^^^^^ expected `endpoints::chat::Tool`, found a different `endpoints::chat::Tool`
     |                           |
     |                           arguments to this enum variant are incorrect
     |

Changing the llama-core crate may be a heavy lift because I’d need to go in and understand many of the method calls and possibly rewrite some of them.
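(Noting for later: errors of the form “expected X, found a different X” in Rust usually mean that two different versions of the same crate, here endpoints, ended up in the dependency graph, so that’s what I’d check before rewriting anything. Plain cargo, nothing project-specific:)

cargo tree --duplicates    # list crates present at more than one version, and what pulls in each copy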


According to this (All releases of llama-core // Lib.rs), 0.30.0 is from April of this year, so its CUDA support should be up to date. I’ll try a few different embedding models.


When I checked the log, most of the embedding model seems to be loaded onto the GPU…

Could it be that part of it is still on the CPU, or that the computation itself is slow for some other reason, while the loading is fine?

[2025-06-01 12:02:00.281] [info] [WASI-NN] llama.cpp: llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
...
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloading 12 repeating layers to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloading output layer to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors: offloaded 13/13 layers to GPU
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors:   CPU_Mapped model buffer size =    15.38 MiB
[2025-06-01 12:02:00.307] [info] [WASI-NN] llama.cpp: load_tensors:        CUDA0 model buffer size =    78.88 MiB

Edit:
Your GPU supports bfloat16, but a similar situation can arise with float32 and so on in embedding models, for example. Unnecessary data movement between the CPU and GPU carries a speed penalty.

A bug around the cache, maybe?

Hmm well I tried a few things.

First, I tried a few different embedding models: two different nomic-embed-text-v1.5 builds, MiniLM-L6-v2, and one more that was too big to fit in my VRAM. No change.

Next, I tried loading the embedding model on the CPU instead, i.e.

--nn-preload default:GGML:GPU:Llama-3.2-3B-Instruct-Q5_K_M.gguf \
--nn-preload embedding:GGML:CPU:nomic-embed-text-v1.5.Q5_K_M.gguf \

to see if I could speed things up just by avoiding repeated transfers between GPU and CPU. It is still extremely slow while chunking the text file, and nvidia-smi reports very similar VRAM consumption to before.

I also tried to see whether rag-api-server.wasm could be loaded with only an embedding model (for testing) and then asked to chunk a file, but it requires that a chat model be loaded as well.

Lastly, I looked into whether I could adjust bfloat16 vs. float32 usage anywhere, but my understanding is that these tensor types are baked into the GGUF (or chosen during training/quantization?) and into the llama-core libraries, so I don’t think I can change them. I don’t see any reference to them in the script that WasmEdge supplies to build WasmEdge and the WASI-NN GGML plugin, either.
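(For what it’s worth, the tensor types baked into the GGUF can at least be inspected with llama.cpp’s gguf Python utilities; this is purely an inspection step, assuming the gguf package from PyPI:)

pip install gguf                                # llama.cpp's gguf-py helper scripts
gguf-dump nomic-embed-text-v1.5.Q5_K_M.gguf     # prints metadata plus per-tensor names and dtypes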

What would be an obvious next step for debugging? Or a workaround to get a reasonably quick RAG going on middle-range consumer hardware?


The behavior of the embedding model on the GPU in WasmEdge does seem strange…
I’m not sure whether the issue is with WasmEdge, llama.cpp, or something else…
Given that results are being produced, the model does appear to be loaded into VRAM, but the GPU isn’t being used properly during the computation, which would explain the unusual behavior.

Or a workaround to get a reasonably quick RAG going on middle-range consumer hardware?

If you’re not tied to WasmEdge, Ollama could be a lighter option for simple tasks, and vLLM could handle longer token lengths (processing text longer than about 8,000 tokens). In either case, your hardware shouldn’t be the limiting factor.
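For example, a quick way to sanity-check embedding speed outside of WasmEdge (assuming Ollama is installed and serving on its default port, and using the nomic-embed-text model from the Ollama library):

ollama pull nomic-embed-text
curl http://localhost:11434/api/embeddings \
    -d '{"model": "nomic-embed-text", "prompt": "a short test sentence"}'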

I may have found a similar unresolved issue.