I’m fine-tuning T5 (11B) with very long sequence lengths (2048 input, 256 output) and am running out of memory on an 8x A100-80GB cluster, even with ZeRO-3, bf16, and a per-device batch size of 1. The problem doesn’t seem to be optimizer or model memory, but rather activation memory. I’m trying to get activation checkpointing to work with my existing setup, which uses the automatic HF Trainer/DeepSpeed integration.
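For context, my setup looks roughly like this (model/dataset loading trimmed, paths are placeholders, not my exact script):

```python
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments

model = T5ForConditionalGeneration.from_pretrained("t5-11b")

args = TrainingArguments(
    output_dir="t5-11b-finetune",      # placeholder path
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed="ds_config_zero3.json",  # my ZeRO-3 config, picked up by the HF integration
)

# trainer = Trainer(model=model, args=args, train_dataset=..., data_collator=...)
# trainer.train()
```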
Indeed, enabling activation checkpointing should make a very noticeable difference.
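With the HF Trainer integration it should be a small change, something along these lines (as far as I know the `TrainingArguments` flag and the model-level method do the same thing):

```python
from transformers import T5ForConditionalGeneration, TrainingArguments

args = TrainingArguments(
    output_dir="t5-11b-finetune",
    per_device_train_batch_size=1,
    bf16=True,
    gradient_checkpointing=True,       # recompute activations in the backward pass instead of storing them all
    deepspeed="ds_config_zero3.json",
)

# Equivalent alternative: turn it on at the model level before handing the model to the Trainer
model = T5ForConditionalGeneration.from_pretrained("t5-11b")
model.gradient_checkpointing_enable()
```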
If that is not enough, you can look into memory-centric tiling, which should shave off some more memory, and tuning the buffer sizes in the DeepSpeed config (lowering them trades a bit of speed for memory) may help a bit more.
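For the buffer-size part, these are the ZeRO-3 knobs I would experiment with. A rough sketch with made-up starting values (the "auto" defaults in the HF integration scale with the model's hidden size, which is large for t5-11b):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": 10_000_000,              # allreduce bucket for gradients
        "stage3_prefetch_bucket_size": 10_000_000,     # parameter prefetch buffer
        "stage3_param_persistence_threshold": 100_000, # small params kept unpartitioned below this size
        "stage3_max_live_parameters": 100_000_000,
        "stage3_max_reuse_distance": 100_000_000,
    },
    "train_micro_batch_size_per_gpu": 1,
}
# The HF integration also accepts the dict directly: TrainingArguments(..., deepspeed=ds_config)
```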
Specifically for your situation, Sequence Parallelism should be very helpful, but if I’m not mistaken it is not yet supported by DeepSpeed; you may want to submit a feature request for it.
Ah I see, so activation checkpointing and gradient checkpointing are the same thing? The DeepSpeed activation checkpointing reference seems to suggest that their implementation partitions the activations across the GPUs (similar to the gradients + model weights in ZeRO-3).
Does the gradient_checkpointing=True flag in the HF Trainer enable that partitioning as well? That is an optimization I’m interested in, since most of my GPU memory is in fact being eaten up by activations.
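For reference, this is the config block I’m looking at in the DeepSpeed docs, sketched here as a dict (I’m not sure whether the HF Trainer’s gradient_checkpointing path hooks into it at all):

```python
deepspeed_activation_ckpt = {
    "activation_checkpointing": {
        "partition_activations": True,        # shard the checkpointed activations across GPUs
        "cpu_checkpointing": False,           # optionally offload checkpointed activations to CPU memory
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
        "profile": False,
    }
}
```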