Accelerate FSDP config prompts

Hello. I’m using accelerate to leverage FSDP as per the tutorial here. I would like to finetune CodeBERT on a custom dataset. I’ve run accelerate config, but I’m not sure what the prompt for transformer_layer_cls_to_wrap entails, either for my model or for the examples provided.

cc @smangrul

Hello @ablam, the blog post is outdated, as the FSDP features were upgraded in PyTorch 1.12.0 and these new features have since been integrated into HF Accelerate. For transformer-based models, the PyTorch team suggests using the transformer_auto_wrap policy. With this policy, the user has to specify the case-sensitive class name of the repeated encoder/decoder block, i.e. the block comprising the multi-head attention layer followed by the feed-forward layer. For example, in the T5 model, T5Block is the name of this block, and the encoder and decoder each stack N such blocks. Similarly, for the BERT model it is BertLayer, and for GPT-2 it is GPT2Block. Below is an example of the accelerate config for the bert-base-cased model:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: BertLayer
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false

You will have to print the model, check the class name of the repeated attention block, and pass it as the value for fsdp_transformer_layer_cls_to_wrap. I hope this helps.
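For instance, here is a minimal sketch of that inspection step. It builds a tiny, randomly initialized BERT from a config (so no weights are downloaded; the layer sizes are arbitrary and chosen only to keep it small) and lists the class names of all submodules:

```python
from transformers import BertConfig, BertModel

# Tiny, randomly initialized BERT purely for inspection;
# no pretrained weights are downloaded.
config = BertConfig(hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=4, intermediate_size=128)
model = BertModel(config)

# Collect the class names of every submodule. The repeated encoder block
# (BertLayer here) is the value for fsdp_transformer_layer_cls_to_wrap.
layer_classes = sorted({type(m).__name__ for m in model.modules()})
print(layer_classes)
```

Equivalently, `print(model)` shows the full module tree, where the repeated block appears under `encoder.layer`.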


Yes, it does. Found the corresponding reference in my model. Thank you!



I have a question about fsdp_transformer_layer_cls_to_wrap: how can I find the transformer layer class for different models? If I want to accelerate LLaMA, can I directly set this parameter to LlamaLayer?

Hello @Colorful, the docs have been updated recently; see the Fully Sharded Data Parallel section, which answers your question. Specifically, the highlighted part below.
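As a quick check independent of the docs, the same print-the-model approach works for LLaMA; in transformers the repeated decoder block is named LlamaDecoderLayer, not LlamaLayer. A small sketch using a tiny randomly initialized config (sizes are arbitrary; nothing is downloaded):

```python
from transformers import LlamaConfig, LlamaModel

# Tiny, randomly initialized LLaMA purely for inspection;
# no pretrained weights are downloaded.
config = LlamaConfig(hidden_size=64, intermediate_size=128,
                     num_hidden_layers=2, num_attention_heads=4,
                     vocab_size=1000)
model = LlamaModel(config)

# The repeated decoder block shows up as LlamaDecoderLayer, which is
# the value to use for fsdp_transformer_layer_cls_to_wrap.
print(sorted({type(m).__name__ for m in model.modules()}))
```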