Is there a memory-efficient way to run the Hugging Face CodeLlama-70b model in the Hugging Face TGI (text-generation-inference) Docker image?
$ docker run --user 111:12 --env CUDA_VISIBLE_DEVICES=7 --env HUGGING_FACE_HUB_TOKEN=hf_... --gpus all --shm-size 1g -p 8082:8082 -v /SCRATCH/:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id codellama/CodeLlama-70b-Instruct-hf --num-shard 1 --port 8082
The 70B model also fails with the flag below. Why does it not fit even with a 4-bit data type? The quantized weights should total only ~35GB.
--quantize bitsandbytes-fp4
codellama-34b also fails with the same error.
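For context on the ~35GB figure, here is a back-of-the-envelope sketch of weight-only memory. It assumes round parameter counts (70e9 and 34e9) and ignores the KV cache, activations, and any quantization overhead, so real usage will be higher. The function name is just illustrative.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumed round parameter counts; the exact counts differ slightly.
MODELS = {"CodeLlama-70b": 70e9, "CodeLlama-34b": 34e9}
PRECISIONS = {"fp16": 16, "fp4": 4}

for model_name, num_params in MODELS.items():
    for precision_name, bits in PRECISIONS.items():
        gib = weight_memory_gib(num_params, bits)
        print(f"{model_name} {precision_name}: ~{gib:.1f} GiB for weights")
```

By this estimate, 70B at 4 bits is about 35e9 bytes (roughly 32.6 GiB), which should fit on an 80GB card only if most of that card is actually free.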
- Is there a way to run the server on multiple GPUs?
You are using a model of type llama to instantiate a model of type. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 269, in get_model
return FlashLlama(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in __init__
model = FlashLlamaForCausalLM(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 412, in __init__
self.model = FlashLlamaModel(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 350, in __init__
[
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 351, in <listcomp>
FlashLlamaLayer(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 290, in __init__
self.mlp = LlamaMLP(prefix=f"{prefix}.mlp", config=config, weights=weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 260, in __init__
self.gate_up_proj = TensorParallelColumnLinear.load_multi(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 460, in load_multi
weight = weights.get_multi_weights_col(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 224, in get_multi_weights_col
weight = torch.cat(w, dim=dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacty of 79.11 GiB of which 525.56 MiB is free. Process 124405 has 610.00 MiB memory in use. Process 87654 has 53.28 GiB memory in use. Process 101667 has 24.68 GiB memory in use. Of the allocated memory 23.96 GiB is allocated by PyTorch, and 217.28 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
rank=0
2024-03-18T20:27:29.935470Z ERROR text_generation_launcher: Shard 0 failed to start
2024-03-18T20:27:29.935499Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
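One thing that can be read out of the OOM message above is who is holding the GPU memory. Below is a small sketch that tallies the per-process figures from the message text copied from the log. The helper name `summarize_oom` and the regexes are illustrative assumptions, not part of TGI or PyTorch.

```python
import re

# Relevant part of the OOM message, copied verbatim from the log above
# (including the "capacty" spelling that PyTorch emits).
OOM_MESSAGE = (
    "GPU 0 has a total capacty of 79.11 GiB of which 525.56 MiB is free. "
    "Process 124405 has 610.00 MiB memory in use. "
    "Process 87654 has 53.28 GiB memory in use. "
    "Process 101667 has 24.68 GiB memory in use."
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def to_gib(value: str, unit: str) -> float:
    """Convert a '<number> <unit>' pair from the message to GiB."""
    return float(value) * UNIT_TO_GIB[unit]


def summarize_oom(message: str) -> dict:
    """Extract total, free, and per-process memory (in GiB) from the message."""
    total = re.search(r"total capac\w* of ([\d.]+) (MiB|GiB)", message)
    free = re.search(r"of which ([\d.]+) (MiB|GiB) is free", message)
    procs = re.findall(
        r"Process (\d+) has ([\d.]+) (MiB|GiB) memory in use", message
    )
    return {
        "total_gib": to_gib(*total.groups()),
        "free_gib": to_gib(*free.groups()),
        "processes_gib": {int(pid): to_gib(v, u) for pid, v, u in procs},
    }


summary = summarize_oom(OOM_MESSAGE)
for pid, gib in sorted(summary["processes_gib"].items(), key=lambda kv: -kv[1]):
    print(f"PID {pid}: {gib:.2f} GiB")
print(f"free: {summary['free_gib']:.2f} GiB of {summary['total_gib']:.2f} GiB")
```

On this log the largest holder is process 87654 at 53.28 GiB. The log does not say which PID is the TGI shard, so mapping PIDs to workloads is a guess, but it does suggest the GPU was already mostly occupied when the shard started loading weights.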