Hi everyone,

I’m trying to fine-tune Qwen/Qwen2.5-Coder-0.5B (or any other model from the Qwen2.5 family), but I keep running into the following error when using different `max_length` values for the inputs and the labels:

```
ValueError: Expected input batch_size (2047) to match target batch_size (1023).
```
However, this issue does not occur when I train other models such as google-t5/t5-small (which is loaded with `AutoModelForSeq2SeqLM`).

I suspect the root cause is that Qwen/Qwen2.5-Coder-0.5B is loaded with `AutoModelForCausalLM`, but I’m struggling to fully understand why this behavior occurs and how to resolve it.
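If it helps explain my suspicion: as far as I can tell, a causal LM shifts the labels by one position before computing the cross-entropy loss, which would explain why the error reports 2047 (= 2048 − 1) and 1023 (= 1024 − 1). Here is a simplified, self-contained sketch of what I think is happening (not the actual Transformers code; tiny dummy vocab just for illustration):

```python
import torch
import torch.nn.functional as F

batch, vocab_size = 1, 100                            # dummy vocab size, just for illustration
logits = torch.randn(batch, 2048, vocab_size)         # model output for inputs padded to max_length=2048
labels = torch.randint(0, vocab_size, (batch, 1024))  # labels padded to max_length=1024

# Shift so that token i predicts token i+1, then flatten for cross-entropy:
shift_logits = logits[..., :-1, :].reshape(-1, vocab_size)  # (2047, vocab_size)
shift_labels = labels[..., 1:].reshape(-1)                  # (1023,)

# Raises: ValueError: Expected input batch_size (2047) to match target batch_size (1023).
loss = F.cross_entropy(shift_logits, shift_labels)
```

So my guess is that, for a causal LM, the labels need to have the same sequence length as the input_ids, but please correct me if I’m misreading this.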
It’s worth noting that I followed this tutorial for fine-tuning Qwen: Qwen Fine-Tuning Tutorial – DataCamp
Here’s a snippet of my fine-tuning code:
```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_checkpoint = "Qwen/Qwen2.5-Coder-0.5B"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_checkpoint, trust_remote_code=True, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)


def preprocess_function(examples):
    # Tokenize prompts and completions separately, with different max_length values.
    inputs = [f"{prompt}\nSQL Query:\n" for prompt in examples["prompt"]]
    targets = [f"{completion}\n" for completion in examples["completion"]]
    model_inputs = tokenizer(inputs, max_length=2048, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=1024, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]  # inputs are 2048 tokens long, labels only 1024
    return model_inputs


dataset = load_dataset("json", data_files="fake_dataset.json")
split_dataset = dataset["train"].train_test_split(test_size=0.1)
tokenized_dataset = split_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=split_dataset["train"].column_names,
)

training_args = TrainingArguments(
    output_dir="./qwen_ft_try",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    fp16=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    processing_class=tokenizer,
)

trainer.train()
```
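For what it’s worth, the workaround I’m currently considering is to concatenate each prompt and completion into a single sequence and mask the prompt (and padding) positions in the labels with -100, so that `input_ids` and `labels` end up with the same length. This is only a rough, untested sketch of what I think causal-LM preprocessing should look like (it assumes my dataset’s `prompt`/`completion` fields and that the Qwen tokenizer defines a pad token); please correct me if this is the wrong approach:

```python
MAX_LENGTH = 2048

if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token  # just in case the checkpoint defines no pad token


def preprocess_function_causal(examples):
    input_ids_batch, attention_mask_batch, labels_batch = [], [], []
    for prompt, completion in zip(examples["prompt"], examples["completion"]):
        prompt_ids = tokenizer(f"{prompt}\nSQL Query:\n", add_special_tokens=False)["input_ids"]
        completion_ids = tokenizer(f"{completion}\n", add_special_tokens=False)["input_ids"]
        completion_ids = completion_ids + [tokenizer.eos_token_id]

        # One concatenated sequence; labels mask the prompt part with -100
        # so only completion tokens contribute to the loss.
        input_ids = (prompt_ids + completion_ids)[:MAX_LENGTH]
        labels = ([-100] * len(prompt_ids) + completion_ids)[:MAX_LENGTH]

        # Pad everything to the same fixed length.
        pad_len = MAX_LENGTH - len(input_ids)
        attention_mask = [1] * len(input_ids) + [0] * pad_len
        input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
        labels = labels + [-100] * pad_len

        input_ids_batch.append(input_ids)
        attention_mask_batch.append(attention_mask)
        labels_batch.append(labels)

    return {
        "input_ids": input_ids_batch,
        "attention_mask": attention_mask_batch,
        "labels": labels_batch,
    }
```

I would use this in place of `preprocess_function` in the `map` call above. Is that the expected way to handle different prompt/completion lengths with `AutoModelForCausalLM`, or is there a cleaner way (e.g. a data collator) that I’m missing?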
My environment:
- `transformers` version: 4.48.3
- Platform: Linux-6.1.85+-x86_64-with-glibc2.35
- Python version: 3.11.11
- Huggingface_hub version: 0.28.1
- Safetensors version: 0.5.2
- Accelerate version: 1.3.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): 2.18.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.10.3 (gpu)
- Jax version: 0.4.33
- JaxLib version: 0.4.33
- Using distributed or parallel set-up in script?: False
- Using GPU in script?: True
- GPU type: Tesla T4
Has anyone encountered this issue before?
What would be the correct way to handle different input and output sequence lengths for causal language models like Qwen?
Any insights or workarounds would be greatly appreciated.
Thanks in advance