RuntimeError during multi-card operation

  • Accelerate version: 0.24.0
  • Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.27
  • Python version: 3.10.13
  • Numpy version: 1.26.1
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 1007.80 GB
  • GPU type: NVIDIA A100 80GB PCIe
  • Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: no
    - use_cpu: False
    - debug: False
    - num_processes: 4
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: 1,2,3,4
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env:
    - dynamo_config: {'dynamo_backend': 'INDUCTOR', 'dynamo_mode': 'default', 'dynamo_use_dynamic': True, 'dynamo_use_fullgraph': False}

Run command: NCCL_P2P_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 accelerate launch 01.py --max_memory_per_gpu 20GB
The error is raised at this line (rotary-embedding lookup):
cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

…/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [23,0,0], thread: [32,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
…/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [23,0,0], thread: [33,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
…/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [23,0,0], thread: [34,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
…/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [23,0,0], thread: [35,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
…/aten/src/ATen/native/cuda/IndexKernel.c
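The assertion at IndexKernel.cu:92 enforces the standard indexing rule: for a dimension of length `size`, every index `i` must satisfy `-size <= i < size`. Here that means some entry of `position_ids` is outside the length of the rotary `cos` cache. A minimal CPU-side sketch (plain Python, no GPU required; the function name and the cache length 4096 are hypothetical) of the same rule can help locate the offending entries before launching on the GPU:

```python
def find_out_of_bounds(indices, size):
    """Return the indices that would trip the CUDA assertion
    `-sizes[i] <= index && index < sizes[i]` for a dimension of
    length `size` (negative indices wrap, as in PyTorch)."""
    return [i for i in indices if not (-size <= i < size)]

# Hypothetical position_ids against a rotary cache of length 4096:
# 4096 violates the `index < sizes[i]` half of the assertion.
bad = find_out_of_bounds([0, 1, 4095, 4096], size=4096)
print(bad)
```

In practice, the same check on the real tensors (e.g. comparing `position_ids.max()` against `cos.shape[0]` on CPU) usually shows that the sequence length exceeds the model's configured maximum position embeddings.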