Accelerate FSDP shows "Removed shared tensor {'model.norm.weight'} while saving."

I used 🤗 Accelerate FSDP to fine-tune the Llama-2-13b model on two A100 80GB GPUs. Training finished without issues; however, when uploading the trained model to the Hub, it always prints the message "Removed shared tensor {'model.norm.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading".

When I actually looked at the uploaded model, I could see that there was no model.norm.weight entry in the model's model.safetensors.index.json file (see the inspection sketch after the config below). I want to upload the full model without any tensors being removed! How do I resolve this problem? The following is the fsdp_config.yaml configuration that I used.

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
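
To see exactly which tensors ended up in the checkpoint, I inspected the safetensors index. Here is a minimal sketch of that check; the checkpoint directory name is only a placeholder for my actual output path:

import json
from pathlib import Path

# Placeholder for the actual output directory of the fine-tuned model.
ckpt_dir = Path("llama-2-13b-finetuned")

# The index maps every saved tensor name to the shard file that contains it.
with open(ckpt_dir / "model.safetensors.index.json") as f:
    index = json.load(f)

saved_keys = set(index["weight_map"])
print(f"{len(saved_keys)} tensors listed in the index")
print("model.norm.weight present:", "model.norm.weight" in saved_keys)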

In addition, I used these hyperparameters for fine-tuning.

  • use_flash_attention_2 = True
  • num_train_epochs = 3
  • per_device_batch_size = 2
  • gradient_accumulation_steps = 2
  • gradient_checkpointing = True
  • bf16 = True
  • warmup_ratio = 0.03
  • learning_rate = 1e-5
  • lr_scheduler = "cosine"
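
For completeness, here is roughly how those hyperparameters translate into the Trainer setup; this is only a sketch, with the model id and output directory as placeholders:

import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Placeholder model id; flash attention is enabled at load time
# (the equivalent of the use_flash_attention_2=True flag on older transformers versions).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

training_args = TrainingArguments(
    output_dir="llama-2-13b-finetuned",  # placeholder output directory
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    bf16=True,
    warmup_ratio=0.03,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
)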

Hello, I have the same issue. When I save my model with Accelerate, I get the following message: "Removed shared tensor {'encoder.final_layer_norm.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading".
And then I can't load my model again. Does anyone know how to prevent this?

Hi @tanglade, with safetensors we don't save shared tensors; you can go through this doc for a better understanding. It is indeed strange that you can't load your model. Can you share a reproducer? One way to fix this is to save with the PyTorch saving method, but it is less secure than safetensors.
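
To check what actually fails at load time, and to fall back to the PyTorch format if needed, something along these lines should work. This is only a sketch; the model class and checkpoint paths are placeholders:

from transformers import AutoModelForCausalLM

# Placeholder path to the saved checkpoint.
ckpt_dir = "path/to/checkpoint"

# Reload and print which keys are reported missing or unexpected; this is
# what the warning emitted at save time asks you to verify.
model, loading_info = AutoModelForCausalLM.from_pretrained(
    ckpt_dir, output_loading_info=True
)
print("missing keys:", loading_info["missing_keys"])
print("unexpected keys:", loading_info["unexpected_keys"])

# Fallback: save a PyTorch .bin checkpoint instead of safetensors. This keeps
# shared tensors in the file but is less secure than safetensors.
model.save_pretrained("path/to/checkpoint-bin", safe_serialization=False)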