Hi, I’m fine-tuning a Llama-3.2-3B-Instruct model with a custom dataset. The training script works on a single GPU (apart from running out of memory, which is plausible), but on multiple GPUs it fails with RuntimeError: chunk expects at least a 1-dimensional tensor.
I used the debugger to see what is going on. The problem seems to originate in the scatter_map call in scatter_gather.py: the inputs dict contains ‘input_ids’, ‘attention_mask’ and ‘labels’, each shaped (8, 1024) (presumably per-device batch size 2 × 4 GPUs, with a 1024-token max sequence length), plus ‘num_items_in_batch’, but after scattering only ‘num_items_in_batch’ actually stays in the data.
I believe I’m doing something wrong and this is not a bug. Any ideas where I should start looking for the problem?
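For what it’s worth, the error message itself seems easy to reproduce in isolation: as far as I can tell, scatter splits every tensor in the inputs with torch.chunk, and a 0-dimensional tensor (which is what num_items_in_batch looks like) cannot be chunked. A minimal sketch of my understanding (the value 16 is made up):

import torch

# A scalar (0-d) tensor like num_items_in_batch cannot be split across GPUs
num_items_in_batch = torch.tensor(16)  # made-up value
try:
    num_items_in_batch.chunk(4)  # split into one piece per GPU
except RuntimeError as e:
    print(e)  # "chunk expects at least a 1-dimensional tensor"

# A regular 2-D batch tensor splits fine
input_ids = torch.zeros(8, 1024, dtype=torch.long)
print([t.shape for t in input_ids.chunk(4)])  # four chunks of shape (2, 1024)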
Loading the dataset:
from datasets import load_dataset

dataset_path = "../data/s2s/si/chatml_si_listed_services_and_measurements_t30.json"
dataset = load_dataset("json", data_files=[dataset_path], split="train")
evalset_path = "../data/s2s/si/chatml_si_listed_services_and_measurements_t100.json"
evalset = load_dataset("json", data_files=[evalset_path], split="train")

dataset = dataset.map(preprocess_instances, batched=True)
evalset = evalset.map(preprocess_instances, batched=True)

print("Data set length", len(dataset))  # 64 instances
print("Eval set length", len(evalset))  # 64 instances
Preprocess function:
def preprocess_instances(batch):
    lines_to_tokenize = [
        tokenizer.apply_chat_template(chat, tokenize=False) for chat in batch["text"]
    ]
    print(lines_to_tokenize)  # this is a list of strings
    tokenized_data = tokenizer(
        lines_to_tokenize,
        truncation=True,
        max_length=1024,
        padding="max_length",
        return_tensors="pt",
    )
    # this is a dict of input_ids: 2-D tensor, attention_mask: 2-D tensor
    # clone input_ids to labels
    tokenized_data["labels"] = tokenized_data["input_ids"].clone()
    return tokenized_data
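For completeness, a quick way to inspect one mapped example (just a sketch, not part of the training script; as far as I understand, datasets stores the mapped columns as plain Python lists, so the "pt" tensors returned by the tokenizer do not survive inside the Dataset and are turned back into tensors by the collator later):

sample = dataset[0]
print(sample.keys())              # expecting 'text' plus 'input_ids', 'attention_mask', 'labels'
print(len(sample["input_ids"]))   # 1024, padded to max_length
print(type(sample["input_ids"]))  # a list, not a tensor, after .map()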
And my model-related code:
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
base_model_name = "meta-llama/Llama-3.2-3B-Instruct"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_storage=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
)
base_model.config.use_cache = False
peft_config = LoraConfig(
lora_alpha=16,
lora_dropout=0.1,
r=8,
bias="none",
task_type="CAUSAL_LM",
)
tokenizer = AutoTokenizer.from_pretrained(
base_model_name,
trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
training_args = TrainingArguments(
output_dir=output_dir,
per_device_train_batch_size=2,
gradient_accumulation_steps=1,
learning_rate=2e-4,
logging_steps=2,
max_steps=100,
eval_strategy="steps",
ddp_find_unused_parameters=False,
#dataset_text_field="text",
)
trainer = SFTTrainer(
model=base_model,
train_dataset=dataset,
eval_dataset=evalset,
peft_config=peft_config,
tokenizer=tokenizer,
args=training_args,
max_seq_length=1024,
)
trainer.train()
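One thing I have not ruled out yet is which parallelism mode the Trainer actually ends up in, since scatter_gather.py is, as far as I know, the nn.DataParallel code path rather than DDP. A quick check before training (sketch; n_gpu and parallel_mode are properties of TrainingArguments):

print("n_gpu:", training_args.n_gpu)                  # expecting 4 with CUDA_VISIBLE_DEVICES=1,2,3,4
print("parallel_mode:", training_args.parallel_mode)  # NOT_DISTRIBUTED + n_gpu > 1 means DataParallel, DISTRIBUTED means DDP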