Having trouble with Accelerate on my 2-GPU machine

Hello, I am trying to fine-tune a model on my personal 2-GPU machine with the Accelerate framework; my Hugging Face Accelerate configuration is given below. I am using Hugging Face PEFT to fine-tune an AutoModelForCausalLM. However, the model is loaded onto only one GPU and the other GPU sits idle. Training fails with the following error (the same message is printed twice, once per process):

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 15.73 GiB total capacity; 11.44 GiB already allocated; 54.19 MiB free; 11.44 GiB reserved in total by PyTorch)
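For reference, here is roughly how the model is set up (a minimal sketch; the model id and LoRA hyperparameters are placeholders, not my exact values):

```python
# Minimal sketch of the PEFT setup described above.
# The model id and LoRA hyperparameters below are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")  # placeholder model id
lora_config = LoraConfig(
    task_type="CAUSAL_LM",  # causal LM head, matching AutoModelForCausalLM
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```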

  • Accelerate version: 0.19.0
  • Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Numpy version: 1.24.3
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • System RAM: 251.41 GB
  • GPU type: NVIDIA RTX A4000
  • Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: fp8
    - use_cpu: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'nvme', 'zero3_init_flag': False, 'zero_stage': 2}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env:
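The training loop follows the standard Accelerate pattern, continuing from the snippet above (a sketch with illustrative names; train_dataloader is an assumed PyTorch DataLoader whose batches include labels):

```python
# Rough shape of the training loop; `model` comes from the snippet above
# and `train_dataloader` is an assumed PyTorch DataLoader.
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the `accelerate config` settings shown above
print(f"process {accelerator.process_index} -> {accelerator.device}")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    outputs = model(**batch)          # batches include labels, so outputs.loss is set
    accelerator.backward(outputs.loss)  # DeepSpeed needs backward to go through the accelerator
    optimizer.step()
    optimizer.zero_grad()
```

I launch the script with accelerate launch (the script name is illustrative), so I expected each of the two processes to take its own GPU.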