Text-generation-inference: "You are using a model of type llama to instantiate a model of type ."

I’m trying to run the Docker container, but during the “Warming up model” step, after about 11 GB of GPU memory has been allocated, it fails with the error “You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.”
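For context, that warning is emitted by transformers when the model_type recorded in the checkpoint’s config.json doesn’t match the model_type of the config class being instantiated (empty in this case, hence “a model of type .”). The checkpoint side of that comparison can be checked directly; a small diagnostic sketch, not part of TGI:

import json
from huggingface_hub import hf_hub_download

# Fetch only the config for the exact revision being served and print the
# model_type it declares; for this repo it should be "llama".
path = hf_hub_download(
    "TheBloke/MythoMax-L2-13B-GPTQ",
    "config.json",
    revision="gptq-4bit-128g-actorder_True",
)
with open(path) as f:
    print(json.load(f).get("model_type"))

On its own the warning is usually harmless; the crash during warmup below is the real failure.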

I tried adding -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128, but that made no difference. I also read almost every issue on the repo, and I don’t think this problem has been reported before.

I also tried adding DISABLE_EXLLAMA=True and --disable-custom-kernels.

Docker Command

sudo docker run --gpus all --shm-size 16g -p 8183:80 \
        --runtime=nvidia \
        -v /storage/text-generation-inference/docker/data:/data ghcr.io/huggingface/text-generation-inference:latest \
        --model-id TheBloke/MythoMax-L2-13B-GPTQ --revision gptq-4bit-128g-actorder_True \
        --quantize gptq    

Version: ghcr.io/huggingface/text-generation-inference:latest
Model: TheBloke/MythoMax-L2-13B-GPTQ:gptq-4bit-128g-actorder_True
GPU: Nvidia A4000
Ubuntu 22.04

Logs

Command Output
2023-08-26T23:55:35.854791Z  INFO text_generation_launcher: Args { model_id: "TheBloke/MythoMax-L2-13B-GPTQ", revision: Some("gptq-4bit-128g-actorder_True"), validation_workers: 2, sharded: None, num_shard: Some(1), quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "9f8054981ef0", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-08-26T23:55:35.855120Z  INFO download: text_generation_launcher: Starting download process.
2023-08-26T23:55:42.115911Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-08-26T23:55:42.966478Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-08-26T23:55:42.967019Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-08-26T23:55:50.341447Z  INFO text_generation_launcher: Using exllama kernels

[... the “Using exllama kernels” line repeats ~80 times in total; identical intermediate lines trimmed ...]

2023-08-26T23:55:52.991872Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0

[... more “Using exllama kernels” lines trimmed ...]

2023-08-26T23:55:57.879732Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2023-08-26T23:55:57.900250Z  INFO shard-manager: text_generation_launcher: Shard ready in 14.931675025s rank=0
2023-08-26T23:55:57.987600Z  INFO text_generation_launcher: Starting Webserver
2023-08-26T23:55:59.443055Z  INFO text_generation_router: router/src/main.rs:367: Serving revision 48caaecfc0ca83f44d056a3f46d4c1661bbea1e7 of model TheBloke/MythoMax-L2-13B-GPTQ
2023-08-26T23:55:59.452758Z  INFO text_generation_router: router/src/main.rs:210: Warming up model
2023-08-26T23:56:11.299165Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
Error: Warmup(Generation("transport error"))
2023-08-26T23:56:11.411063Z ERROR text_generation_launcher: Webserver Crashed
2023-08-26T23:56:11.411132Z  INFO text_generation_launcher: Shutting down shards
2023-08-26T23:56:11.424793Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors. rank=0
2023-08-26T23:56:11.424860Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 4 rank=0

Sometimes I don’t get the Shard complete standard error output message.
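Note: signal 4 in the shutdown line above is SIGILL (illegal instruction), which for GPU workloads often points at kernels compiled for a different compute architecture than the card provides. A quick sanity check from inside the container, sketched here assuming PyTorch is importable:

import torch

# The A4000 is an Ampere card and should report compute capability (8, 6);
# the exllama kernels generally target Ampere-class GPUs (cc >= 8.0), so a
# lower value here could explain an illegal-instruction crash.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))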


Did you find any solution to this? I’m trying to deploy models on SageMaker with the code provided on Hugging Face, and it always fails; for example, with Llama-2-7b-german-assistant-v3-4bit-autogptq.
I’ve noticed it happens with the GPTQ models.

I have the same issue:

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight model.layers.0.self_attn.rotary_emb.inv_freq does not exist

You are using a model of type llama to instantiate a model of type .

I am trying to deploy Phind/Phind-CodeLlama-34B-v2
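That RuntimeError means the server asked for a rotary_emb.inv_freq tensor that simply isn’t in the checkpoint; recent transformers releases stopped serializing those buffers, while older TGI versions still try to load them. Whether the tensor is really absent can be verified against a downloaded shard; a diagnostic sketch, where model.safetensors stands in for whatever shard file the checkpoint actually ships:

from safetensors import safe_open

# List every tensor name stored in the shard and keep any inv_freq entries;
# an empty list confirms the checkpoint no longer carries these buffers.
with safe_open("model.safetensors", framework="pt") as f:
    matches = [name for name in f.keys() if "rotary_emb.inv_freq" in name]

print(matches)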

I’m getting this same error about the inv_freq weight not existing.

For reference, I am attempting to deploy codellama/CodeLlama-7b-hf on SageMaker using the DLC inference container via HuggingFaceModel().

I found this issue in the TGI repo: Cannot load LLaMA models saved with latest transformers · Issue #790 · huggingface/text-generation-inference · GitHub. I’m not sure how we can apply that change to our situation (deploying from SageMaker); see the sketch below.

Any ideas are appreciated!
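Since the fix for that issue landed in TGI itself, the closest equivalent on SageMaker is probably pinning a container version that already includes it rather than patching code. A minimal sketch: the version string "1.0.3" and the instance type are assumptions, and any image new enough to contain the #790 fix should behave the same:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Pin a TGI image that postdates the inv_freq fix instead of an older
# 0.9.x build ("1.0.3" is an assumed example tag).
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.0.3")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "codellama/CodeLlama-7b-hf",
        "SM_NUM_GPUS": "1",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # assumed instance type
)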

Up! Facing the same issue!

Same issue here, trying to deploy a fine-tuned meta-llama/Llama-2-7b-chat-hf on SageMaker with image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3").