Fine-tuning T5 with long sequence length, using activation checkpointing with Deepspeed

I’m fine-tuning T5 (11B) with very long sequence lengths (2048 input, 256 output) and am running out of memory on an 8x A100-80GB cluster even with ZeRO-3 enabled, bf16 enabled, and per-device batch size=1. The issue seems to be not with optimizer or model memory, but rather activation memory. I’m trying to get activation checkpointing to work with my existing setup (which uses the automatic HF Trainer/Deepspeed integration).

Would really appreciate your advice @stas

Indeed, enabling activation checkpointing should make a very noticeable difference.

If that is not enough, you can look into memory-centric tiling, which should shave off some more memory, and tuning the buffer sizes in the DeepSpeed config may help a bit more.
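As a rough sketch of where those buffer settings live: the ZeRO-3 section of the DeepSpeed config exposes several size knobs. The key names below are DeepSpeed config options, but every value is an illustrative assumption rather than a tuned recommendation; smaller buffers generally trade some throughput for lower peak memory.

```python
import json

# ZeRO-3 buffer-related settings. All numbers are hypothetical starting
# points chosen only for illustration; tune them for your own setup.
zero3_section = {
    "stage": 3,
    "overlap_comm": True,
    "contiguous_gradients": True,
    "reduce_bucket_size": 50_000_000,
    "stage3_prefetch_bucket_size": 50_000_000,
    "stage3_param_persistence_threshold": 100_000,
    "stage3_max_live_parameters": 100_000_000,
    "stage3_max_reuse_distance": 100_000_000,
}

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": zero3_section,
    "train_micro_batch_size_per_gpu": 1,
}

# Serialize to the JSON form that would be saved as the DeepSpeed config file.
ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```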

For your situation specifically, Sequence Parallelism should be very helpful, but if I’m not mistaken DeepSpeed doesn’t support it yet. You may want to submit a feature request for this to happen.

The frameworks that support SP that I know of are Megatron-LM, Colossal-AI, and Transformer Engine. There might be others.

Thank you, @stas.

My issue is that enabling activation checkpointing in the ds_config makes no difference in terms of used memory (as measured by nvidia-smi).

Specifically, I’ve added

"activation_checkpointing": {
    "partition_activations": True,

But it seems to be useless. Would appreciate your input.

Ah, yes, that feature is part of the modeling code. Please see:

The example is for the HF Trainer, but you can, of course, do the same w/o the Trainer. See

It’s confusing that there are two very different names for the same feature in the ML world.
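For reference, a minimal sketch of switching the feature on from the modeling side. `gradient_checkpointing` is a `TrainingArguments` flag and `gradient_checkpointing_enable()` is a method on Transformers models; the output directory, the config path, and the helper function names are placeholder assumptions.

```python
# Keyword arguments intended for transformers.TrainingArguments.
# "out" and "ds_config.json" are placeholder paths (assumptions).
training_kwargs = dict(
    output_dir="out",
    per_device_train_batch_size=1,
    bf16=True,
    gradient_checkpointing=True,  # recompute activations during backward
    deepspeed="ds_config.json",
)


def build_trainer(model, train_dataset):
    """With the HF Trainer: setting the flag in TrainingArguments is enough."""
    # Imported lazily so the snippet can be read and loaded without the library.
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(**training_kwargs)
    return Trainer(model=model, args=args, train_dataset=train_dataset)


def enable_without_trainer(model):
    """Without the Trainer: call the model method directly."""
    model.gradient_checkpointing_enable()
    return model
```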

Ah I see, so activation and gradient checkpointing are the same thing? The DeepSpeed activation checkpointing reference seems to suggest that their implementation partitions the activations across the GPUs (similar to gradients + model weights in ZeRO-3).

Does the gradient_checkpointing=True flag in the HF Trainer enable partitioning as well? That is an optimization I’m interested in, as most of my GPU memory is in fact being eaten up by activations.

Additionally, when setting gradient_checkpointing=True with distributed multi-node Deepspeed (4 x 8xA100), I get constant warnings @stas :

use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False

Thanks for your help!

It tells you that it is setting it to False. If you don’t want the warning, set use_cache=False explicitly.
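One way to do that, sketched under the assumption that `model` is an already-loaded Transformers model whose config has a `use_cache` attribute (as T5's does); the helper name is hypothetical:

```python
def disable_kv_cache(model):
    """Turn off key/value caching up front so it no longer has to be overridden."""
    model.config.use_cache = False
    return model
```

Calling this once before constructing the Trainer should be enough.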

Incidentally, one can also turn caching off during generate to save memory, but at the cost of recomputing the past key/value states at every decoding step.
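A small sketch of that trade-off, assuming `model` and `inputs` come from the usual tokenizer/model setup. `use_cache` is a standard `generate()` argument; the helper name and the default of 256 new tokens (matching the output length mentioned at the top of the thread) are illustrative choices.

```python
def generate_without_cache(model, inputs, max_new_tokens=256):
    """Generate with the key/value cache off: lower memory, slower decoding."""
    return model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        use_cache=False,
    )
```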