TGI doesn't like quantization option on AWS inf2

arisin · August 8, 2024, 9:15pm

Hi All,

I'm facing this error when running TGI on inf2:
2024-08-08T21:14:29.441172Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 32768
2024-08-08T21:14:29.441176Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-08-08T21:14:29.441256Z INFO download: text_generation_launcher: Starting download process.
2024-08-08T21:14:29.514298Z WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.

2024-08-08T21:14:32.644926Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-08-08T21:14:32.645248Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-08T21:14:32.746522Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Usage: text-generation-server serve [OPTIONS] MODEL_ID
Try 'text-generation-server serve --help' for help.

Error: No such option: --quantize rank=0
Error: ShardCannotStart
2024-08-08T21:14:32.845269Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-08T21:14:32.845293Z INFO text_generation_launcher: Shutting down shards

Command to start:
docker run -p 8000:80 -v $(pwd)/data:/data --privileged ghcr.io/huggingface/neuronx-tgi:latest --model-id /data/meta/ --quantize awq

I'd appreciate any guidance on getting this to work.
