Hi everyone,
I’m trying to fine-tune `Qwen/Qwen2.5-Coder-0.5B` (or any other Qwen2.5-family model), but I keep hitting the following error whenever I use different `max_length` values for the inputs and the labels:
```
ValueError: Expected input batch_size (2047) to match target batch_size (1023).
```
However, this issue does not occur when training encoder-decoder models like `google-t5/t5-small` (loaded via `AutoModelForSeq2SeqLM`).
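For comparison, a minimal check along these lines runs fine for me with `t5-small` (the prompt text and lengths here are just illustrative; in real training the padded label positions should be replaced with `-100`):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Encoder-decoder case: the labels feed the decoder, so their length is
# independent of the encoder input length, and no shape error occurs.
tok = AutoTokenizer.from_pretrained("google-t5/t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

enc = tok(["translate English to SQL: list all users"],
          max_length=64, truncation=True, padding="max_length", return_tensors="pt")
labels = tok(["SELECT * FROM users;"],
             max_length=16, truncation=True, padding="max_length",
             return_tensors="pt")["input_ids"]

with torch.no_grad():
    out = model(**enc, labels=labels)  # 64-token inputs, 16-token labels: works
print(out.loss)
```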
I suspect the root cause is that `Qwen/Qwen2.5-Coder-0.5B` is a decoder-only model loaded via `AutoModelForCausalLM`, which (as far as I can tell) shifts the labels against the logits internally and therefore expects `input_ids` and `labels` to have the same sequence length. Still, I'm struggling to fully understand this behavior and how to resolve it.
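If that's right, the numbers in the error line up with a one-position shift: 2048 − 1 = 2047 logit positions versus 1024 − 1 = 1023 label positions. Here is a toy reproduction of just the loss computation, with random tensors standing in for the model (the vocabulary size is arbitrary):

```python
import torch
import torch.nn.functional as F

# My understanding: a causal LM predicts the NEXT token, so it shifts the
# logits and labels one position against each other before cross-entropy.
# Both tensors must therefore start out with the same sequence length.
batch, input_len, label_len, vocab = 1, 2048, 1024, 100
logits = torch.randn(batch, input_len, vocab)         # what the model outputs
labels = torch.randint(0, vocab, (batch, label_len))  # my shorter labels

shift_logits = logits[:, :-1, :].reshape(-1, vocab)   # 2047 rows
shift_labels = labels[:, 1:].reshape(-1)              # 1023 rows
F.cross_entropy(shift_logits, shift_labels)
# ValueError: Expected input batch_size (2047) to match target batch_size (1023).
```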
It’s worth noting that I followed this tutorial for fine-tuning Qwen:
Qwen Fine-Tuning Tutorial – DataCamp
Here’s a snippet of my fine-tuning code:
```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_checkpoint = "Qwen/Qwen2.5-Coder-0.5B"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    model_checkpoint, trust_remote_code=True, device_map=device
)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)

def preprocess_function(examples):
    inputs = [f"{prompt}\nSQL Query:\n" for prompt in examples["prompt"]]
    targets = [f"{completion}\n" for completion in examples["completion"]]
    # NOTE: inputs and labels are tokenized to DIFFERENT lengths (2048 vs 1024);
    # this is what triggers the batch-size mismatch.
    model_inputs = tokenizer(inputs, max_length=2048, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=1024, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset = load_dataset("json", data_files="fake_dataset.json")
split_dataset = dataset["train"].train_test_split(test_size=0.1)
tokenized_dataset = split_dataset.map(
    preprocess_function, batched=True, remove_columns=split_dataset["train"].column_names
)

training_args = TrainingArguments(
    output_dir="./qwen_ft_try",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    eval_strategy="epoch",  # renamed from `evaluation_strategy` in newer transformers
    save_strategy="epoch",
    learning_rate=1e-4,
    fp16=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    processing_class=tokenizer,
)

trainer.train()
```
Environment:
- `transformers` version: 4.48.3
- Platform: Linux-6.1.85+-x86_64-with-glibc2.35
- Python version: 3.11.11
- Huggingface_hub version: 0.28.1
- Safetensors version: 0.5.2
- Accelerate version: 1.3.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): 2.18.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.10.3 (gpu)
- Jax version: 0.4.33
- JaxLib version: 0.4.33
- Using distributed or parallel set-up in script?: False
- Using GPU in script?: True
- GPU type: Tesla T4
Has anyone encountered this issue before?
What would be the correct way to handle different input and output sequence lengths for causal language models like Qwen?
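In case it helps to pinpoint where I go wrong, this is the preprocessing I'm considering instead, based on causal-LM examples I've seen elsewhere (an untested sketch: prompt and completion are tokenized as one sequence, and `-100` masks the prompt and padding out of the loss):

```python
def preprocess_function(examples):
    max_len = 2048
    input_ids, attention_mask, labels = [], [], []
    for prompt, completion in zip(examples["prompt"], examples["completion"]):
        prompt_ids = tokenizer(f"{prompt}\nSQL Query:\n", add_special_tokens=False)["input_ids"]
        target_ids = tokenizer(f"{completion}\n", add_special_tokens=False)["input_ids"]
        ids = (prompt_ids + target_ids)[:max_len]
        # the loss is computed only where labels != -100, i.e. on the completion
        lab = ([-100] * len(prompt_ids) + target_ids)[:max_len]
        pad = max_len - len(ids)
        input_ids.append(ids + [tokenizer.pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        labels.append(lab + [-100] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

Is this the idiomatic way to do it, or is there a built-in helper I'm missing?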
Any insights or workarounds would be greatly appreciated.
Thanks in advance!