I am trying to load an LLM (for example mistralai/Mistral-7B-Instruct-v0.2). When I run the code below on a single GPU it works well, but when I run it across multiple GPUs (RTX 6000 Ada, CUDA Version 12.1) with device_map='auto' I get the error shown below.
Reproduction
from transformers import AutoTokenizer, AutoModelForCausalLM
max_length = 2048
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=max_length)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
device_map='auto').eval()
model.config.pad_token_id = tokenizer.pad_token_id
text = "how much is 1+1?"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
error:
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [1,0,0], thread: [11,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
RuntimeError: CUDA error: device-side assert triggered
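The assert is raised asynchronously, so the traceback above does not point at the failing op. If a synchronous traceback would help, I can re-run with the standard CUDA_LAUNCH_BLOCKING switch set before anything touches the GPU:

import os
# Must be set before the first CUDA call; it makes kernel launches synchronous
# so the Python traceback points at the op that actually triggered the assert.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"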
Output of accelerate env:
- Accelerate version: 0.26.1
- Platform: Linux-3.10.0-1160.90.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.10.9
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 503.65 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- Accelerate default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 3
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env:
When I remove device_map='auto' and keep the model on a single GPU, the same code works fine (roughly the variant below), so this looks like a multi-GPU issue.
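For reference, the single-GPU variant that runs without the assert looks roughly like this (the device string cuda:0 is just for illustration, any single GPU works):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
max_length = 2048

tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=max_length)
tokenizer.pad_token = tokenizer.eos_token

# No device_map: the whole model lives on one GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to("cuda:0").eval()
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("how much is 1+1?", return_tensors="pt", padding=True, truncation=True).to("cuda:0")
outputs = model(**inputs)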
How can I solve this?
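One thing I am unsure about (a guess on my side, not something from the docs): with device_map='auto' the tokenized inputs stay on the CPU while the embedding layer sits on one of the GPUs, so maybe the indices end up on the wrong device. Is moving the inputs onto the first shard, along these lines, the expected way to call a sharded model?

# Continuing the reproduction script above (tokenizer, model and text already defined).
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
# model.device reports the device of the model's first parameter, i.e. the first shard.
inputs = inputs.to(model.device)
outputs = model(**inputs)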