codellama/CodeLlama-70b-Instruct-hf TGI server out-of-memory error on H100

Is there a memory-efficient way to run the codellama/CodeLlama-70b-Instruct-hf model with the Hugging Face TGI Docker image?

$ docker run --user 111:12 --env CUDA_VISIBLE_DEVICES=7 --env HUGGING_FACE_HUB_TOKEN=hf_... \
    --gpus all --shm-size 1g -p 8082:8082 -v /SCRATCH/:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 \
    --model-id codellama/CodeLlama-70b-Instruct-hf --num-shard 1 --port 8082

The 70B model also fails with the flag below (how is it not working for 4-bit data types? The 4-bit weights should only total ~35 GB, which fits within the 80 GB of an H100):

--quantize bitsandbytes-fp4 

codellama-34b also fails with the same error.

  1. Is there a way to run the server on multiple GPUs?

Here is the full error output:
You are using a model of type llama to instantiate a model of type. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 269, in get_model
    return FlashLlama(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in __init__
    model = FlashLlamaForCausalLM(config, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 412, in __init__
    self.model = FlashLlamaModel(config, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 350, in __init__
    [

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 351, in <listcomp>
    FlashLlamaLayer(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 290, in __init__
    self.mlp = LlamaMLP(prefix=f"{prefix}.mlp", config=config, weights=weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 260, in __init__
    self.gate_up_proj = TensorParallelColumnLinear.load_multi(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 460, in load_multi
    weight = weights.get_multi_weights_col(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 224, in get_multi_weights_col
    weight = torch.cat(w, dim=dim)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacty of 79.11 GiB of which 525.56 MiB is free. Process 124405 has 610.00 MiB memory in use. Process 87654 has 53.28 GiB memory in use. Process 101667 has 24.68 GiB memory in use. Of the allocated memory 23.96 GiB is allocated by PyTorch, and 217.28 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
 rank=0
2024-03-18T20:27:29.935470Z ERROR text_generation_launcher: Shard 0 failed to start
2024-03-18T20:27:29.935499Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

I can’t answer why it isn’t working with quantization, but here are a few suggestions.


If you have a server with n GPUs, you need to run the docker run command once per GPU (pinning each container to a different GPU via CUDA_VISIBLE_DEVICES and publishing it on its own host port) and put a load balancer in front. Unfortunately, there is no convenient way to do this with TGI out of the box.

Regarding the load balancer, nginx with round-robin balancing is probably the simplest way to implement this; a sketch follows.
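A minimal sketch for a box with two free GPUs. The GPU indices (6 and 7), host ports 8082/8083, the nginx front-end port 8080, and the conf.d config path are illustrative assumptions, not tested values:

# One TGI replica per GPU, each pinned via CUDA_VISIBLE_DEVICES and published on its own host port
docker run -d --gpus all --env CUDA_VISIBLE_DEVICES=6 --env HUGGING_FACE_HUB_TOKEN=hf_... \
    --shm-size 1g -p 8082:8082 -v /SCRATCH/:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 \
    --model-id codellama/CodeLlama-70b-Instruct-hf --num-shard 1 --port 8082
docker run -d --gpus all --env CUDA_VISIBLE_DEVICES=7 --env HUGGING_FACE_HUB_TOKEN=hf_... \
    --shm-size 1g -p 8083:8083 -v /SCRATCH/:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 \
    --model-id codellama/CodeLlama-70b-Instruct-hf --num-shard 1 --port 8083

# nginx upstream blocks are round-robin by default, so listing both replicas is enough
cat <<'EOF' | sudo tee /etc/nginx/conf.d/tgi.conf
upstream tgi_replicas {
    server 127.0.0.1:8082;
    server 127.0.0.1:8083;
}
server {
    listen 8080;
    location / {
        proxy_pass http://tgi_replicas;
    }
}
EOF
sudo nginx -s reload

Clients then send requests to port 8080 and nginx alternates them between the two TGI replicas.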

I was unable to reproduce the TGI error on an A100. Are you sure there aren’t other processes consuming GPU memory? Your traceback already lists two other processes holding 53.28 GiB and 24.68 GiB on that GPU, which by itself would explain the OOM. Can you use nvidia-smi or gpustat to check before and after launching TGI, for example as below?
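A couple of standard checks (nothing TGI-specific; the query fields are just examples):

# Per-GPU memory summary
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
# List every compute process and how much memory it holds
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# Compact one-line-per-GPU view (pip install gpustat)
gpustat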

I would also recommend using bitsandbytes-nf4 instead of bitsandbytes-fp4, as the quality should be higher for the same 4-bit memory footprint.
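For example, your original command with only the quantization flag changed (everything else kept as in your post); whether the 70B model then fits still depends on the GPU actually being free:

docker run --user 111:12 --env CUDA_VISIBLE_DEVICES=7 --env HUGGING_FACE_HUB_TOKEN=hf_... \
    --gpus all --shm-size 1g -p 8082:8082 -v /SCRATCH/:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 \
    --model-id codellama/CodeLlama-70b-Instruct-hf --num-shard 1 --port 8082 \
    --quantize bitsandbytes-nf4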