I’m fine-tuning T5 (11B) with very long sequence lengths (2048 input, 256 output) and am running out of memory on an 8x A100-80GB cluster, even with ZeRO-3, bf16, and a per-device batch size of 1. The problem doesn’t seem to be optimizer or model memory, but rather activation memory. I’m trying to get activation checkpointing to work with my existing setup, which uses the automatic HF Trainer/DeepSpeed integration.
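For context, my setup looks roughly like this (model/dataset loading trimmed, paths are placeholders, not my exact script):

```python
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments

model = T5ForConditionalGeneration.from_pretrained("t5-11b")

args = TrainingArguments(
    output_dir="t5-11b-finetune",      # placeholder path
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed="ds_config_zero3.json",  # my ZeRO-3 config, picked up by the HF integration
)

# trainer = Trainer(model=model, args=args, train_dataset=..., data_collator=...)
# trainer.train()
```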
Indeed, enabling activation checkpointing should make a very noticeable difference.
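With the HF Trainer integration it should be a small change, something along these lines (as far as I know the `TrainingArguments` flag and the model-level method do the same thing):

```python
from transformers import T5ForConditionalGeneration, TrainingArguments

args = TrainingArguments(
    output_dir="t5-11b-finetune",
    per_device_train_batch_size=1,
    bf16=True,
    gradient_checkpointing=True,       # recompute activations in the backward pass instead of storing them all
    deepspeed="ds_config_zero3.json",
)

# Equivalent alternative: turn it on at the model level before handing the model to the Trainer
model = T5ForConditionalGeneration.from_pretrained("t5-11b")
model.gradient_checkpointing_enable()
```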
If that is not enough, you can look into memory-centric tiling, which should shave off some more memory, and tuning the buffer sizes in the DeepSpeed config (lowering them trades a bit of speed for memory) may help a bit more.
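For the buffer-size part, these are the ZeRO-3 knobs I would experiment with. A rough sketch with made-up starting values (the "auto" defaults in the HF integration scale with the model's hidden size, which is large for t5-11b):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": 10_000_000,              # allreduce bucket for gradients
        "stage3_prefetch_bucket_size": 10_000_000,     # parameter prefetch buffer
        "stage3_param_persistence_threshold": 100_000, # small params kept unpartitioned below this size
        "stage3_max_live_parameters": 100_000_000,
        "stage3_max_reuse_distance": 100_000_000,
    },
    "train_micro_batch_size_per_gpu": 1,
}
# The HF integration also accepts the dict directly: TrainingArguments(..., deepspeed=ds_config)
```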
Specifically for your situation, Sequence Parallelism should be very helpful, but if I’m not mistaken it is not yet supported by DeepSpeed; you may want to submit a feature request for it.
Ah I see, so activation checkpointing and gradient checkpointing are the same thing? The DeepSpeed activation checkpointing reference seems to suggest that their implementation partitions the activations across the GPUs (similar to the gradients + model weights in ZeRO-3).
Does the gradient_checkpointing=True flag in the HF Trainer enable that partitioning as well? That is an optimization I’m interested in, since most of my GPU memory is in fact being eaten up by activations.
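For reference, this is the config block I’m looking at in the DeepSpeed docs, sketched here as a dict (I’m not sure whether the HF Trainer’s gradient_checkpointing path hooks into it at all):

```python
deepspeed_activation_ckpt = {
    "activation_checkpointing": {
        "partition_activations": True,        # shard the checkpointed activations across GPUs
        "cpu_checkpointing": False,           # optionally offload checkpointed activations to CPU memory
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
        "profile": False,
    }
}
```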