Multi-node, multi-GPU inference for long inputs with Llama-3

I have a very long input (~62k tokens), so I am using gradientai/Llama-3-70B-Instruct-Gradient-262k. I have access to multiple GPU nodes, each with four 80 GB A100s.

I loaded the model with Accelerate's device_map="auto", which shards it across the GPUs of one node. This works for short inputs, but with my full-length input I hit a CUDA out-of-memory error. As far as I can tell, device_map-based sharding cannot span multiple nodes. I am also not sure whether the input (and its activations/KV cache) gets distributed across the GPUs the way the model weights are; if it did, I suspect I would not be running into this error.

What do you suggest I try? I have read about FSDP and DeepSpeed, but I could not find anything covering inference in a scenario like mine.