(Trainer) Training one batch with multiple GPUs

Hi there.
I'm using the Hugging Face Trainer to train a GPT-based large language model with more than 8B parameters.
With an input sequence length of 2048 tokens and per_device_train_batch_size=1, the model doesn't fit on a single A100 (40 GB) GPU. How can I split one batch (or the model itself) across multiple GPUs? It seems that I must load at least one full batch onto each GPU.
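For context, here is a back-of-envelope memory estimate showing why an 8B-parameter model won't fit on one 40 GB GPU under plain data parallelism. It assumes standard mixed-precision Adam training (fp16 weights + fp16 gradients + fp32 master weights + fp32 Adam momentum + fp32 variance, roughly 16 bytes per parameter, activations not counted); the exact breakdown may differ for your setup.

```python
# Rough memory estimate for mixed-precision Adam training.
# Assumption: ~16 bytes per parameter for weights, gradients,
# and optimizer states combined (activations excluded).
def training_memory_gb(num_params: float, bytes_per_param: int = 16) -> float:
    """Return approximate training-state memory in GB."""
    return num_params * bytes_per_param / 1e9

params = 8e9  # "more than 8B" parameters
needed = training_memory_gb(params)
print(f"~{needed:.0f} GB of model + optimizer state")  # ~128 GB
print("A100 capacity: 40 GB per GPU -> the state must be sharded across GPUs")
```

Since ~128 GB of state far exceeds 40 GB, each GPU can only hold a shard of the model, which is why simply lowering the per-device batch size can't help here.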