I used Accelerate FSDP to fine-tune Llama-2-13b with 2 A100 80GB GPUs. Training finished without problems; however, when uploading the trained model to the Hub, it always prints the message "Removed shared tensor {'model.norm.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading". When I actually inspected the uploaded model, I could see that model.norm.weight was indeed missing from the model's model.safetensors.index.json file. I want to upload the full model without any tensors being removed! How do I resolve this problem? The following is the fsdp_config.yaml configuration that I used.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
In addition, I used these hyperparameters for fine-tuning.
- use_flash_attention_2 = True
- num_train_epochs = 3
- per_device_batch_size = 2
- gradient_accumulation_steps = 2
- gradient_checkpointing = True
- bf16 = True
- warmup_ratio = 0.03
- learning_rate = 1e-5
- lr_scheduler = "cosine"
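For context on what is being removed: safetensors refuses to serialize two entries that alias the same underlying storage, so save_pretrained drops one of them, which seems to be what the warning describes. Here is a small self-contained sketch I used to understand the issue (a toy module, not Llama; `find_shared_params` is my own helper, not a library function) that detects parameters sharing the same storage:

```python
import torch.nn as nn

def find_shared_params(model: nn.Module):
    """Return (first_name, alias_name) pairs of parameters that share storage."""
    seen = {}
    shared = []
    # remove_duplicate=False so tied parameters are reported under both names
    for name, p in model.named_parameters(remove_duplicate=False):
        ptr = p.data_ptr()
        if ptr in seen:
            shared.append((seen[ptr], name))
        else:
            seen[ptr] = name
    return shared

# Two linear layers with a tied weight, mimicking the shared-tensor case.
a = nn.Linear(8, 8, bias=False)
b = nn.Linear(8, 8, bias=False)
b.weight = a.weight  # tie the weights

model = nn.Sequential(a, b)
print(find_shared_params(model))  # the aliased pair is reported
```

Running a check like this on my checkpoint is how I would confirm whether model.norm.weight genuinely shares storage with another tensor, or whether the removal is a false positive.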