Hi All,
I'm facing this error when running TGI on inf2.
Error:
2024-08-08T21:14:29.441172Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 32768
2024-08-08T21:14:29.441176Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-08-08T21:14:29.441256Z INFO download: text_generation_launcher: Starting download process.
2024-08-08T21:14:29.514298Z WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-08-08T21:14:32.644926Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-08-08T21:14:32.645248Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-08T21:14:32.746522Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Usage: text-generation-server serve [OPTIONS] MODEL_ID
Try 'text-generation-server serve --help' for help.
Error: No such option: --quantize rank=0
Error: ShardCannotStart
2024-08-08T21:14:32.845269Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-08T21:14:32.845293Z INFO text_generation_launcher: Shutting down shards
Command to start:
docker run -p 8000:80 -v $(pwd)/data:/data --privileged ghcr.io/huggingface/neuronx-tgi:latest --model-id /data/meta/ --quantize awq
I'd appreciate any guidance on getting this to work.