Accelerate multi-GPU error: Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds"

I am trying to load an LLM (for example mistralai/Mistral-7B-Instruct-v0.2). When I run the code below on a single GPU, it works fine. However, when I run it across multiple devices (RTX 6000 Ada, CUDA version 12.1), I get the following error:

Reproduction

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

max_length = 2048
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id,
                                          model_max_length=max_length)
tokenizer.pad_token = tokenizer.eos_token

# device_map='auto' lets accelerate shard the model across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
                                             device_map='auto').eval()
model.config.pad_token_id = tokenizer.pad_token_id

text = "how much is 1+1?"

inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
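
For context, this is how the device placement can be inspected right after loading (hf_device_map is filled in by transformers when device_map='auto' is used; these print calls are just a diagnostic sketch, not part of the run above):

# Show how the model was sharded: module name -> device index (or "cpu"/"disk").
print(model.hf_device_map)

# The tokenizer returns CPU tensors; accelerate's dispatch hooks are supposed to
# move them to the correct GPU during the forward pass.
print(inputs["input_ids"].device)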

error:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [1,0,0], thread: [11,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
RuntimeError: CUDA error: device-side assert triggered
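
To pin down which op actually fails, the script can be rerun with synchronous kernel launches (CUDA_LAUNCH_BLOCKING is the standard PyTorch/CUDA debugging switch; the repro.py name below is just a placeholder):

import os

# Force synchronous CUDA launches so the Python traceback points at the op that
# raised the device-side assert instead of some later, unrelated call.
# Equivalent to running: CUDA_LAUNCH_BLOCKING=1 python repro.py
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Import torch (and run the snippet above) only after this variable is set.
import torch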

Output of accelerate env:

  • Accelerate version: 0.26.1
  • Platform: Linux-3.10.0-1160.90.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.9
  • Numpy version: 1.26.3
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 503.65 GB
  • GPU type: NVIDIA RTX 6000 Ada Generation
  • Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 3
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero_stage': 2}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env:

When I remove device_map='auto', it works fine.
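
Roughly, the single-GPU variant that works looks like this (the explicit .to("cuda:0") calls are only there to make the device placement obvious; cuda:0 stands in for whichever single GPU is visible):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=2048)
tokenizer.pad_token = tokenizer.eos_token

# No device_map: the whole model sits on a single GPU.
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             trust_remote_code=True).to("cuda:0").eval()
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("how much is 1+1?", return_tensors="pt",
                   padding=True, truncation=True).to("cuda:0")
outputs = model(**inputs)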
This looks like an issue that only appears when the model is spread across multiple GPUs. How can I solve it?
