OOM error on GPT-J fine-tuning using multi-GPU

I am trying to run a fine-tuning job using the Accelerate library and I am getting an out-of-memory error in a multi-GPU setup.


Code run:
This was run on 6 A100 GPUs (40 GB each):

accelerate launch run_clm_no_trainer.py \
    --dataset_name wikitext \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path EleutherAI/gpt-j-6b \
    --output_dir /tmp/test-clm
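
For context, the memory-relevant part of the run is the standard Accelerate training loop in run_clm_no_trainer.py. The following is a hypothetical, stripped-down sketch of that loop, not the actual script: the sequence length, learning rate, padding choice and pad-token assignment are illustrative assumptions.

import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, default_data_collator

# Hypothetical reduction of run_clm_no_trainer.py; values below are illustrative.
accelerator = Accelerator(gradient_accumulation_steps=8)  # matches the flag above

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
tokenizer.pad_token = tokenizer.eos_token  # GPT-J ships without a pad token
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(examples):
    out = tokenizer(examples["text"], truncation=True, max_length=1024,
                    padding="max_length")
    out["labels"] = out["input_ids"].copy()  # causal LM: the model shifts labels internally
    return out

train_dataset = raw.map(tokenize, batched=True, remove_columns=["text"])
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=1,
                              collate_fn=default_data_collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# With FSDP, accelerator.prepare() wraps the model so parameters, gradients and
# optimizer states are sharded across the 6 processes.
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    with accelerator.accumulate(model):  # gradient accumulation context
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()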

Error file:

error_output.txt


Updated output of accelerate env:


- Accelerate version: 0.18.0
- Platform: Linux-5.4.0-136-generic-x86_64-with-glibc2.10
- Python version: 3.8.12
- Numpy version: 1.22.2
- PyTorch version (GPU?): 1.13.1+cu116 (True)
- Accelerate default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: bf16
        - use_cpu: False
        - num_processes: 6
        - machine_rank: 0
        - num_machines: 0
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_transformer_layer_cls_to_wrap': 'GPTJBlock'}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
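
For reference, here is a rough back-of-envelope estimate of the sharded state memory implied by this config (fsdp_sharding_strategy: 1 is FULL_SHARD). The parameter count and the precision assumed for each state are approximations, and activations, buffers and fragmentation are ignored.

# Rough, assumption-heavy estimate of per-GPU state memory under FULL_SHARD.
# Assumes bf16 weights/gradients and fp32 AdamW moments; the actual precisions
# depend on how the script loads the model and on mixed-precision handling.
params = 6.05e9                  # approximate GPT-J-6B parameter count
n_gpus = 6

weights = params * 2             # bf16 parameters, 2 bytes each
grads   = params * 2             # bf16 gradients
adam    = params * 4 * 2         # fp32 exp_avg + exp_avg_sq, 4 bytes each

total_gib = (weights + grads + adam) / 2**30
print(f"sharded state: {total_gib:.1f} GiB total, "
      f"~{total_gib / n_gpus:.1f} GiB per GPU before activations")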