I used Accelerate FSDP to fine-tune Llama-2-13b with 2 A100 80GB GPUs. Training finished without problems; however, when uploading the trained model to the Hub, it always prints the message "Removed shared tensor {'model.norm.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading". When I actually inspected the uploaded model, I could see that model.norm.weight was indeed missing from the model's model.safetensors.index.json file. I want to upload the full model without any tensors being removed! How do I resolve this problem? The following is the fsdp_config.yaml configuration that I used.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
In addition, I used these hyperparameters for fine-tuning.
- use_flash_attention_2 = True
- num_train_epochs = 3
- per_device_batch_size = 2
- gradient_accumulation_steps = 2
- gradient_checkpointing = True
- bf16 = True
- warmup_ratio = 0.03
- learning_rate = 1e-5
- lr_scheduler = "cosine"
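For context on what is being removed: safetensors refuses to serialize two entries that alias the same underlying storage, so save_pretrained drops one of them, which seems to be what the warning describes. Here is a small self-contained sketch I used to understand the issue (a toy module, not Llama; `find_shared_params` is my own helper, not a library function) that detects parameters sharing the same storage:

```python
import torch.nn as nn

def find_shared_params(model: nn.Module):
    """Return (first_name, alias_name) pairs of parameters that share storage."""
    seen = {}
    shared = []
    # remove_duplicate=False so tied parameters are reported under both names
    for name, p in model.named_parameters(remove_duplicate=False):
        ptr = p.data_ptr()
        if ptr in seen:
            shared.append((seen[ptr], name))
        else:
            seen[ptr] = name
    return shared

# Two linear layers with a tied weight, mimicking the shared-tensor case.
a = nn.Linear(8, 8, bias=False)
b = nn.Linear(8, 8, bias=False)
b.weight = a.weight  # tie the weights

model = nn.Sequential(a, b)
print(find_shared_params(model))  # the aliased pair is reported
```

Running a check like this on my checkpoint is how I would confirm whether model.norm.weight genuinely shares storage with another tensor, or whether the removal is a false positive.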