Fine-Tuning Qwen/Qwen2.5-Coder-0.5B: Mismatched Input and Target Batch Sizes

Hi everyone,

I’m trying to fine-tune Qwen/Qwen2.5-Coder-0.5B (or any other Qwen2.5 family model), but I keep running into the following error when using different max_length values for input and labels:

ValueError: Expected input batch_size (2047) to match target batch_size (1023).

However, this issue does not occur when training other models like google-t5/t5-small (which is loaded with AutoModelForSeq2SeqLM).

I suspect the root cause is that Qwen/Qwen2.5-Coder-0.5B is loaded with AutoModelForCausalLM, i.e. it is a decoder-only causal language model, but I’m struggling to fully understand why this behavior occurs and how to resolve it.

It’s worth noting that I followed this tutorial for fine-tuning Qwen:
Qwen Fine-Tuning Tutorial – DataCamp

Here’s a snippet of my fine-tuning code:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_checkpoint = "Qwen/Qwen2.5-Coder-0.5B"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_checkpoint, trust_remote_code=True, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)

def preprocess_function(examples):
    inputs = [f"{prompt}\nSQL Query:\n" for prompt in examples["prompt"]]
    targets = [f"{completion}\n" for completion in examples["completion"]]
    model_inputs = tokenizer(inputs, max_length=2048, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=1024, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset = load_dataset("json", data_files="fake_dataset.json")
split_dataset = dataset["train"].train_test_split(test_size=0.1)
tokenized_dataset = split_dataset.map(preprocess_function, batched=True, remove_columns=split_dataset["train"].column_names)

training_args = TrainingArguments(
    output_dir="./qwen_ft_try",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    fp16=True,
    push_to_hub=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    processing_class=tokenizer,
)
trainer.train()

Library versions:

- `transformers` version: 4.48.3
- Platform: Linux-6.1.85+-x86_64-with-glibc2.35
- Python version: 3.11.11
- Huggingface_hub version: 0.28.1
- Safetensors version: 0.5.2
- Accelerate version: 1.3.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): 2.18.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.10.3 (gpu)
- Jax version: 0.4.33
- JaxLib version: 0.4.33
- Using distributed or parallel set-up in script?: False
- Using GPU in script?: True
- GPU type: Tesla T4

Has anyone encountered this issue before?
What would be the correct way to handle different input and output sequence lengths for causal language models like Qwen?

Any insights or workarounds would be greatly appreciated.

Thanks in advance :slight_smile:


I wonder if that’s the case with CausalLM…

Hey John, thanks for your reply!

After doing some reading, I realized my error was due to a misunderstanding of the training mechanisms of Causal Language Models (CLMs) versus Seq2Seq models.

Causal Language Models (such as Qwen) are decoder-only models: their objective is to predict the next token given all previous tokens in a sequence.
As a result, the input is one continuous sequence, and the target (labels) is that same sequence shifted one position to the right. The shift happens inside the model when it computes the loss, so during preprocessing the labels are simply a copy of the input_ids and must have the same length as the input.
That is also exactly where the numbers in the error come from: with a per-device batch size of 1, the 2048-token inputs and 1024-token labels are each shifted by one token and flattened before the cross-entropy loss, leaving 2047 logit positions versus 1023 label positions.
This means the preprocessing should be done as follows:

def preprocess_function(examples):
    inputs =  tokenizer(examples['text'], truncation=True, padding='max_length', max_length=1024)
    inputs['labels'] = inputs['input_ids'].copy()
    # Mask padding tokens so they don’t contribute to the loss
    inputs['labels'] = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in inputs['labels']]
    return inputs
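
For completeness, since my dataset has separate `prompt` and `completion` columns rather than a single `text` field, I first join them into one string per example before tokenizing. Here is a minimal sketch of that step, assuming the same column names and the same `tokenizer` / `split_dataset` variables as in my original snippet, and appending the tokenizer’s EOS token so the model learns where a completion ends:

def build_text(examples):
    # Join prompt and completion into the single continuous sequence a causal LM expects;
    # the EOS token marks the end of each completion.
    return {
        "text": [
            f"{prompt}\nSQL Query:\n{completion}{tokenizer.eos_token}"
            for prompt, completion in zip(examples["prompt"], examples["completion"])
        ]
    }

split_dataset = split_dataset.map(build_text, batched=True)
tokenized_dataset = split_dataset.map(
    preprocess_function,  # the causal-LM version above, with a single max_length
    batched=True,
    remove_columns=split_dataset["train"].column_names,
)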

On the other hand, Seq2Seq models (like T5) use an encoder-decoder architecture, where the objective is to transform one sequence into another.
Therefore, the input and output are two separate sequences.
The preprocessing can be done like this:

def preprocess_function(examples):
    inputs = [f"{prompt}\nSQL Query:\n" for prompt in examples["prompt"]]
    targets = [f"{completion}\n" for completion in examples["completion"]]
    model_inputs = tokenizer(inputs, max_length=2048, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=1024, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
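
One refinement worth adding to this Seq2Seq version (a sketch under the same assumptions as my original snippet, i.e. the same `tokenizer` and column names): padded positions in the labels still count toward the loss unless they are replaced with -100, the ignore index of the cross-entropy loss. The same masking trick used in the causal-LM snippet applies here:

def preprocess_function(examples):
    inputs = [f"{prompt}\nSQL Query:\n" for prompt in examples["prompt"]]
    targets = [f"{completion}\n" for completion in examples["completion"]]
    model_inputs = tokenizer(inputs, max_length=2048, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=1024, truncation=True, padding="max_length")

    # Mask label padding so it is ignored by the loss (-100 is the ignore index)
    model_inputs["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in label]
        for label in labels["input_ids"]
    ]
    return model_inputs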

For reference, I found this forum discussion helpful: valueerror-expected-input-batch-size-8-to-match-target-batch-size-280, which also points to this blog post: fine-tuning-a-pre-trained-gpt-2-model-and-performing-inference-a-hands-on-guide.

Hopefully, this can help others who run into the same error in the future :slight_smile:
