Fine-Tuning Qwen/Qwen2.5-Coder-0.5B: Mismatched Input and Target Batch Sizes

Hi everyone,

I’m trying to fine-tune Qwen/Qwen2.5-Coder-0.5B (or any other Qwen2.5 family model), but I keep running into the following error when using different max_length values for input and labels:

ValueError: Expected input batch_size (2047) to match target batch_size (1023).

However, this issue does not occur when training other models like google-t5/t5-small (which is loaded with AutoModelForSeq2SeqLM).

I suspect the root cause is that Qwen/Qwen2.5-Coder-0.5B is loaded with AutoModelForCausalLM, i.e. it is a decoder-only causal language model, but I’m struggling to fully understand why this behavior occurs and how to resolve it.

It’s worth noting that I followed this tutorial for fine-tuning Qwen:
Qwen Fine-Tuning Tutorial – DataCamp

Here’s a snippet of my fine-tuning code:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_checkpoint = "Qwen/Qwen2.5-Coder-0.5B"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_checkpoint, trust_remote_code=True, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)

def preprocess_function(examples):
    inputs = [f"{prompt}\nSQL Query:\n" for prompt in examples["prompt"]]
    targets = [f"{completion}\n" for completion in examples["completion"]]
    model_inputs = tokenizer(inputs, max_length=2048, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=1024, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset = load_dataset("json", data_files="fake_dataset.json")
split_dataset = dataset["train"].train_test_split(test_size=0.1)
tokenized_dataset = split_dataset.map(preprocess_function, batched=True, remove_columns=split_dataset["train"].column_names)

training_args = TrainingArguments(
    output_dir="./qwen_ft_try",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    fp16=True,
    push_to_hub=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    processing_class=tokenizer,
)
trainer.train()

Library versions:

- `transformers` version: 4.48.3
- Platform: Linux-6.1.85+-x86_64-with-glibc2.35
- Python version: 3.11.11
- Huggingface_hub version: 0.28.1
- Safetensors version: 0.5.2
- Accelerate version: 1.3.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): 2.18.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.10.3 (gpu)
- Jax version: 0.4.33
- JaxLib version: 0.4.33
- Using distributed or parallel set-up in script?: False
- Using GPU in script?: True
- GPU type: Tesla T4

Has anyone encountered this issue before?
What would be the correct way to handle different input and output sequence lengths for causal language models like Qwen?

Any insights or workarounds would be greatly appreciated.

Thanks in advance :slight_smile:


I wonder if that’s the case with CausalLM…

Hey John, thanks for your reply!

After doing some reading, I realized my error was due to a misunderstanding of the training mechanisms of Causal Language Models (CLMs) versus Seq2Seq models.

Causal Language Models (such as Qwen) are decoder-only models: their objective is to predict the next token given all previous tokens in a sequence.
As a result, the input is one continuous sequence, and the target (labels) is that same sequence shifted one position to the right. The shift happens inside the model when it computes the loss, so during preprocessing the labels are simply a copy of the input_ids and must have the same length as the input.
That is also exactly where the numbers in the error come from: with a per-device batch size of 1, the 2048-token inputs and 1024-token labels are each shifted by one token and flattened before the cross-entropy loss, leaving 2047 logit positions versus 1023 label positions.
This means the preprocessing should be done as follows:

def preprocess_function(examples):
    inputs =  tokenizer(examples['text'], truncation=True, padding='max_length', max_length=1024)
    inputs['labels'] = inputs['input_ids'].copy()
    # Mask padding tokens so they don’t contribute to the loss
    inputs['labels'] = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in inputs['labels']]
    return inputs
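
For completeness, since my dataset has separate `prompt` and `completion` columns rather than a single `text` field, I first join them into one string per example before tokenizing. Here is a minimal sketch of that step, assuming the same column names and the same `tokenizer` / `split_dataset` variables as in my original snippet, and appending the tokenizer’s EOS token so the model learns where a completion ends:

def build_text(examples):
    # Join prompt and completion into the single continuous sequence a causal LM expects;
    # the EOS token marks the end of each completion.
    return {
        "text": [
            f"{prompt}\nSQL Query:\n{completion}{tokenizer.eos_token}"
            for prompt, completion in zip(examples["prompt"], examples["completion"])
        ]
    }

split_dataset = split_dataset.map(build_text, batched=True)
tokenized_dataset = split_dataset.map(
    preprocess_function,  # the causal-LM version above, with a single max_length
    batched=True,
    remove_columns=split_dataset["train"].column_names,
)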

On the other hand, Seq2Seq models (like T5) use an encoder-decoder architecture, where the objective is to transform one sequence into another.
Therefore, the input and output are two separate sequences.
The preprocessing can be done like this:

def preprocess_function(examples):
    inputs = [f"{prompt}\nSQL Query:\n" for prompt in examples["prompt"]]
    targets = [f"{completion}\n" for completion in examples["completion"]]
    model_inputs = tokenizer(inputs, max_length=2048, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=1024, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
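
One refinement worth adding to this Seq2Seq version (a sketch under the same assumptions as my original snippet, i.e. the same `tokenizer` and column names): padded positions in the labels still count toward the loss unless they are replaced with -100, the ignore index of the cross-entropy loss. The same masking trick used in the causal-LM snippet applies here:

def preprocess_function(examples):
    inputs = [f"{prompt}\nSQL Query:\n" for prompt in examples["prompt"]]
    targets = [f"{completion}\n" for completion in examples["completion"]]
    model_inputs = tokenizer(inputs, max_length=2048, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=1024, truncation=True, padding="max_length")

    # Mask label padding so it is ignored by the loss (-100 is the ignore index)
    model_inputs["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in label]
        for label in labels["input_ids"]
    ]
    return model_inputs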

For reference, I found this forum discussion helpful: valueerror-expected-input-batch-size-8-to-match-target-batch-size-280, which also points to this blog post: fine-tuning-a-pre-trained-gpt-2-model-and-performing-inference-a-hands-on-guide.

Hopefully, this can help others who run into the same error in the future :slight_smile:
