Multi-node, multi-GPU inference for long inputs with Llama-3

I have a very long input (~62k tokens), so I am using gradientai/Llama-3-70B-Instruct-Gradient-262k. I have access to multiple GPU nodes, each with four 80 GB A100s.

I loaded the model with Accelerate's device_map="auto", which shards it across the GPUs of one node. This works for short inputs, but with my full-length input I hit a CUDA out-of-memory error. As far as I can tell, device_map-based sharding cannot span multiple nodes. I am also not sure whether the input (and its activations/KV cache) gets distributed across the GPUs the way the model weights are; if it did, I suspect I would not be running into this error.

What do you suggest I try? I have read about FSDP and DeepSpeed, but I could not find anything covering inference in a scenario like mine.