Not seeing memory benefit from accelerate/FSDP2

TL;DR: Why doesn’t Accelerate/FSDP2 seem to be doing much of anything to reduce memory in the following?

I’m trying to get some hands-on experience and learn how to run large models across multiple nodes and/or GPUs. I’m starting with Trainer/Accelerate/FSDP2 and planning to work up from there, but I think I’m missing something.

python 3.12.9
torch 2.7.0
transformers 4.52.4
accelerate 1.7.0

My “toy” program to train an “empty” model:

from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

from transformers import DefaultDataCollator, DataCollatorForLanguageModeling
from transformers import TrainingArguments, Trainer
import os

model_dir = 'NousResearch/Llama-3.2-1B'
TRACE = False
N = 2048
context_length = 64
batch_size = 64

def load_datasets() :
    train_data_list = [
        {"text" : "The quick brown fox jumped over the lazy dog's back t{:06d}".format(i)} for i in range(4*N)
        ]
    eval_data_list = [
        {"text" : "The quick brown fox jumped over the lazy dog's back e{:06d}".format(i)} for i in range(N)
        ]
    datasets = DatasetDict (                       # create datasets dict train and eval
            { 'train': Dataset.from_list(train_data_list),
              'eval' : Dataset.from_list(eval_data_list)}
        )
    return datasets

def load_tokenizer(model_dir) :
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return tokenizer

def load_model(model_dir) :
    # get just the config from the pretrained directory
    config = AutoConfig.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_config(config)
    return model

def mytrain(model_dir) :

    def tokenize(dataset) :
        return tokenizer(dataset['text'], padding='max_length', max_length=context_length, return_length=True)

    ##
    raw_datasets = load_datasets()
    if TRACE : print("dataset\n", raw_datasets)
    ##
    tokenizer = load_tokenizer(model_dir)
    if TRACE : print("tokenizer\n", tokenizer)
    ##
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_datasets = raw_datasets.map(
        tokenize, batched=True, remove_columns=raw_datasets["train"].column_names)
    if TRACE : print("tokenized_datasets\n", tokenized_datasets)
    ##
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    if TRACE :
        example_collated = data_collator([tokenized_datasets["train"][i] for i in range(3)])
        print("example_collated\n", example_collated)
    ##
    training_args = TrainingArguments(     # do this before model load for FSDP?
        output_dir="outputs/",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=10,
        logging_strategy="epoch",
        eval_strategy="epoch",
        save_strategy="no",
        push_to_hub=False,
        disable_tqdm=True,
        deepspeed=None,
    )
    ##
    model = load_model(model_dir)          # do this after TrainingArguments, which sets up some distributed state?
    if TRACE : print("model\n", model)
    ##
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["eval"],
        processing_class=tokenizer,
        data_collator=data_collator,
    )
    trainer.train()

from datasets.utils.logging import disable_progress_bar
import torch
if __name__ == "__main__" :
  disable_progress_bar()
  mytrain(
     model_dir=model_dir
     )
  # only tear down the process group if one was initialized (i.e. when launched
  # via accelerate); a plain single-GPU python run never creates one
  if torch.distributed.is_initialized() :
    torch.distributed.destroy_process_group()

I first run my test program as plain Python/PyTorch on a single GPU, without accelerate.

[gpu2:training] CUDA_VISIBLE_DEVICES=0 python 05_acctest.py 
{'loss': 0.8924, 'grad_norm': 0.8125, 'learning_rate': 4.50390625e-05, 'epoch': 1.0}
{'eval_loss': 2.5442957878112793, 'eval_runtime': 2.4496, 'eval_samples_per_second': 836.064, 'eval_steps_per_second': 13.063, 'epoch': 1.0}
{'loss': 0.6293, 'grad_norm': 0.65234375, 'learning_rate': 4.00390625e-05, 'epoch': 2.0}
{'eval_loss': 2.6600184440612793, 'eval_runtime': 2.4495, 'eval_samples_per_second': 836.094, 'eval_steps_per_second': 13.064, 'epoch': 2.0}
  .
  .
  .
{'loss': 0.6061, 'grad_norm': 0.4921875, 'learning_rate': 3.90625e-08, 'epoch': 10.0}
{'eval_loss': 2.8240463733673096, 'eval_runtime': 2.4496, 'eval_samples_per_second': 836.055, 'eval_steps_per_second': 13.063, 'epoch': 10.0}
{'train_runtime': 333.183, 'train_samples_per_second': 245.871, 'train_steps_per_second': 3.842, 'train_loss': 0.6405227959156037, 'epoch': 10.0}

While it’s running, I use nvidia-smi to look at the memory used:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           21181      C   python                                21372MiB |
+-----------------------------------------------------------------------------------------+

That’s at least in the ballpark of what accelerate estimates:

[gpu2:training] accelerate estimate-memory NousResearch/Llama-3.2-1B
Loading pretrained config for `NousResearch/Llama-3.2-1B` from `transformers`...
┌────────────────────────────────────────────────────────┐
│  Memory Usage for loading `NousResearch/Llama-3.2-1B`  │
├───────┬─────────────┬──────────┬───────────────────────┤
│ dtype │Largest Layer│Total Size│  Training using Adam  │
├───────┼─────────────┼──────────┼───────────────────────┤
│float32│  1002.0 MB  │  4.6 GB  │        18.42 GB       │
│float16│   501.0 MB  │  2.3 GB  │        9.21 GB        │
│  int8 │   250.5 MB  │ 1.15 GB  │          N/A          │
│  int4 │  125.25 MB  │589.28 MB │          N/A          │
└───────┴─────────────┴──────────┴───────────────────────┘
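As a rough sanity check on that “ballpark” claim, here is my own back-of-the-envelope arithmetic (assuming fp32 Adam, an approximate parameter count, and ignoring activations, the CUDA context, and the allocator cache):

    params = 1.24e9                  # approximate parameter count of Llama-3.2-1B
    weights_gb = params * 4 / 2**30  # fp32 weights: ~4.6 GB, matches the table
    adam_gb = weights_gb * 4         # weights + grads + Adam m and v: ~18.5 GB
    print(weights_gb, adam_gb)       # the remaining few GB seen in nvidia-smi would be
                                     # activations, CUDA context and allocator cache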

Next I use “accelerate config” to generate a config file for 2 GPUs using FSDP2 (mostly with default values).

[gpu2:training] cat 1n2gfsdp_defaults.yaml 
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Using that file and running with accelerate…

[gpu2:training] CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file 1n2gfsdp_defaults.yaml 05_acctest.py 
{'loss': 1.0797, 'grad_norm': 0.6328125, 'learning_rate': 4.5078125000000006e-05, 'epoch': 1.0}
{'eval_loss': 2.5193161964416504, 'eval_runtime': 1.376, 'eval_samples_per_second': 1488.383, 'eval_steps_per_second': 11.628, 'epoch': 1.0}
{'loss': 0.6584, 'grad_norm': 0.4609375, 'learning_rate': 4.0078125e-05, 'epoch': 2.0}
{'eval_loss': 2.5891079902648926, 'eval_runtime': 1.3771, 'eval_samples_per_second': 1487.218, 'eval_steps_per_second': 11.619, 'epoch': 2.0}
  .
  .
  .
{'loss': 0.6096, 'grad_norm': 0.462890625, 'learning_rate': 7.8125e-08, 'epoch': 10.0}
{'eval_loss': 2.754133462905884, 'eval_runtime': 1.3776, 'eval_samples_per_second': 1486.605, 'eval_steps_per_second': 11.614, 'epoch': 10.0}
{'train_runtime': 178.9799, 'train_samples_per_second': 457.705, 'train_steps_per_second': 3.576, 'train_loss': 0.6661747217178344, 'epoch': 10.0}

… nvidia-smi memory during the computation…

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           24421      C   ...AI/training-4.52.4/bin/python      21384MiB |
|    1   N/A  N/A           24422      C   ...AI/training-4.52.4/bin/python      21388MiB |
+-----------------------------------------------------------------------------------------+

Next a config file with 4 GPUs…

[gpu2:training] cat 1n4gfsdp_defaults.yaml 
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

… execute using accelerate…

[gpu2:training] CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --config_file 1n4gfsdp_defaults.yaml 05_acctest.py 
{'loss': 1.373, 'grad_norm': 0.458984375, 'learning_rate': 4.515625e-05, 'epoch': 1.0}
{'eval_loss': 2.402463912963867, 'eval_runtime': 0.6972, 'eval_samples_per_second': 2937.372, 'eval_steps_per_second': 11.474, 'epoch': 1.0}
{'loss': 0.7474, 'grad_norm': 0.435546875, 'learning_rate': 4.0156250000000004e-05, 'epoch': 2.0}
{'eval_loss': 2.3128156661987305, 'eval_runtime': 0.6946, 'eval_samples_per_second': 2948.607, 'eval_steps_per_second': 11.518, 'epoch': 2.0}
   .
   .
   .
{'loss': 0.6214, 'grad_norm': 0.30078125, 'learning_rate': 1.5625e-07, 'epoch': 10.0}
{'eval_loss': 2.432434320449829, 'eval_runtime': 0.694, 'eval_samples_per_second': 2950.801, 'eval_steps_per_second': 11.527, 'epoch': 10.0}
{'train_runtime': 89.6101, 'train_samples_per_second': 914.182, 'train_steps_per_second': 3.571, 'train_loss': 0.718875628709793, 'epoch': 10.0}

… nvidia-smi while executing…

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           25570      C   ...AI/training-4.52.4/bin/python      20526MiB |
|    1   N/A  N/A           25571      C   ...AI/training-4.52.4/bin/python      20146MiB |
|    2   N/A  N/A           25572      C   ...AI/training-4.52.4/bin/python      20146MiB |
|    3   N/A  N/A           25573      C   ...AI/training-4.52.4/bin/python      20146MiB |
+-----------------------------------------------------------------------------------------+

Clearly something is happening; I’m getting a performance benefit from using more GPUs (almost linear!). But I’m not seeing a substantial reduction in per-GPU memory usage.
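For reference, the speedup implied by the reported train_runtime values above (just dividing the numbers):

    runtimes = {"1 GPU": 333.183, "2 GPUs FSDP2": 178.9799, "4 GPUs FSDP2": 89.6101}
    for name, t in runtimes.items():
        print(f"{name}: {runtimes['1 GPU'] / t:.2f}x")   # 1.00x, 1.86x, 3.72x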

  1. Is my config file missing something? Are there better parameters that facilitate memory savings?
  2. Can I somehow get accelerate to dump what it thinks it’s doing (vs. what I specified in the config file)?
  3. Can I somehow dump the wrapped model to see what FSDP has done? (A rough sketch of the kind of inspection I mean for 2 and 3 follows this list.)
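For 2 and 3, this is roughly the inspection I have in mind — a sketch that pokes at Trainer internals such as trainer.accelerator.state and trainer.model_wrapped; I don’t know whether this is the intended way:

    # after trainer.train(), or from a TrainerCallback:
    print(trainer.accelerator.state)               # what accelerate thinks it is doing
                                                   # (distributed type, num processes, ...)
    print(trainer.accelerator.state.fsdp_plugin)   # FSDP settings actually in effect
    print(type(trainer.model_wrapped))             # class of the model as prepared for training
    print(trainer.model_wrapped)                   # module tree after FSDP wrapping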

===============================================================

I did a similar experiment with bigscience/bloom-3b just to see if it made any difference, and things still seem strange: first nvidia-smi for a single GPU without accelerate, then the accelerate estimate-memory output, then nvidia-smi for the four-GPU FSDP2 run.

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           37058      C   python                                74748MiB |
+-----------------------------------------------------------------------------------------+

┌────────────────────────────────────────────────────┐
│   Memory Usage for loading `bigscience/bloom-3b`   │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│   2.39 GB   │ 11.19 GB │      44.74 GB     │
│float16│    1.2 GB   │ 5.59 GB  │      22.37 GB     │
│  int8 │   612.5 MB  │  2.8 GB  │        N/A        │
│  int4 │  306.25 MB  │  1.4 GB  │        N/A        │
└───────┴─────────────┴──────────┴───────────────────┘

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          251138      C   ...AI/training-4.52.4/bin/python      53922MiB |
|    1   N/A  N/A          251139      C   ...AI/training-4.52.4/bin/python      53538MiB |
|    2   N/A  N/A          251140      C   ...AI/training-4.52.4/bin/python      53538MiB |
|    3   N/A  N/A          251141      C   ...AI/training-4.52.4/bin/python      53538MiB |
+-----------------------------------------------------------------------------------------+

I don’t really understand how multi-GPU environments work…

So, after much futzing around and doing FSDP directly from PyTorch, I discovered that the answer to this question is that the memory usage reported by nvidia-smi is not an accurate reflection of the memory PyTorch actually requires or uses. PyTorch’s CUDA caching allocator keeps a pool of reserved memory that is larger than what is currently allocated, and that reserved pool (plus the CUDA context) is primarily what the nvidia-smi number reflects.

torch.cuda has a number of ways to get memory information that seem more relevant (though the implications aren’t always clear).
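For example, a minimal sketch of the kind of numbers I ended up looking at (the GiB conversion and the print placement are my own choices):

    import torch

    def report_cuda_memory(tag=""):
        gib = 1024 ** 3
        # memory currently held by tensors vs. memory reserved by the caching allocator
        allocated = torch.cuda.memory_allocated() / gib
        reserved = torch.cuda.memory_reserved() / gib
        peak = torch.cuda.max_memory_allocated() / gib
        print(f"{tag}: allocated={allocated:.2f} GiB "
              f"reserved={reserved:.2f} GiB peak_allocated={peak:.2f} GiB")

    # e.g. call report_cuda_memory("after train") right after trainer.train();
    # nvidia-smi roughly tracks reserved memory plus the CUDA context, not allocated memory.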

