- Accelerate version: 0.24.0
- Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.27
- Python version: 3.10.13
- Numpy version: 1.26.1
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 1007.80 GB
- GPU type: NVIDIA A100 80GB PCIe
- Accelerate default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: 1,2,3,4
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env:
- dynamo_config: {'dynamo_backend': 'INDUCTOR', 'dynamo_mode': 'default', 'dynamo_use_dynamic': True, 'dynamo_use_fullgraph': False}
Run command as: CCL_P2P_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 accelerate launch 01.py --max_memory_per_gpu 20GB
ERROR at: cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
…/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [23,0,0], thread: [32,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
…/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [23,0,0], thread: [33,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
…/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [23,0,0], thread: [34,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
…/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [23,0,0], thread: [35,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
…/aten/src/ATen/native/cuda/IndexKernel.c
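The assert fires inside the fancy-indexing kernel, so the root cause is a `position_ids` value that falls outside the first dimension of the `cos` cache. One way to localize this before the kernel launches is to validate the indices on the Python side. Below is a minimal sketch (the function name `check_index_bounds` and the tensor sizes are made up for illustration; they are not from the original report):

```python
import torch

def check_index_bounds(table: torch.Tensor, position_ids: torch.Tensor) -> None:
    """Raise a readable Python-side IndexError instead of letting a
    device-side assert kill the CUDA context."""
    max_len = table.size(0)
    bad = (position_ids < 0) | (position_ids >= max_len)
    if bad.any():
        raise IndexError(
            f"position_ids contains values outside [0, {max_len}); "
            f"offending values: {sorted(position_ids[bad].unique().tolist())}"
        )

# Hypothetical rotary-embedding cos cache sized for a 4096-token context.
cos = torch.randn(4096, 128)

ok_ids = torch.tensor([[0, 1, 2, 3]])
check_index_bounds(cos, ok_ids)  # passes silently

# Positions past the cache length trigger the same out-of-bounds condition
# the CUDA assert is reporting, but as a catchable CPU-side error.
bad_ids = torch.tensor([[4094, 4095, 4096, 4097]])
try:
    check_index_bounds(cos, bad_ids)
except IndexError as e:
    print(e)
```

Running the same indexing on CPU (or keeping `CUDA_LAUNCH_BLOCKING=1`, as in the command above) also surfaces the offending value synchronously, which usually points at a sequence longer than the model's `max_position_embeddings`.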