Fine-Tuning Qwen/Qwen2.5-Coder-0.5B: Mismatched Input and Target Batch Sizes

Hey John, thanks for your reply!

After doing some reading, I realized my error was due to a misunderstanding of the training mechanisms of Causal Language Models (CLMs) versus Seq2Seq models.

Causal Language Models (such as Qwen) are decoder-only models, meaning their objective is to predict the next token given all previous tokens in a sequence.
As a result, the input is a single continuous sequence, and the target at each position is simply the next token of that same sequence. In practice, Hugging Face causal LM models perform this one-position shift internally, so the labels you pass in are just a copy of input_ids.
This means the preprocessing should be done as follows:

def preprocess_function(examples):
    inputs = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=1024)
    # The model shifts the labels internally, so labels start out as a copy of input_ids
    inputs['labels'] = inputs['input_ids'].copy()
    # Mask padding tokens so they don’t contribute to the loss
    inputs['labels'] = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in inputs['labels']]
    return inputs
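
For intuition, here is a small self-contained sketch (toy tensors, not the actual Qwen code) of what the model then does with those labels internally: the logits and labels are shifted against each other so that position t is scored against token t+1, and -100 entries are ignored. That is why a plain copy of input_ids is the right thing to pass.

import torch
import torch.nn.functional as F

# Toy example: batch of 1, sequence length 5, vocabulary of 10 tokens
logits = torch.randn(1, 5, 10)            # one distribution per position
labels = torch.tensor([[3, 7, 2, 9, 5]])  # a copy of the input_ids

# Drop the last logit and the first label so position t predicts token t + 1
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()

loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                       shift_labels.view(-1),
                       ignore_index=-100)  # padding positions marked -100 are skipped
print(loss)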

On the other hand, Seq2Seq models (like T5) use an encoder-decoder architecture, where the objective is to transform one sequence into another.
Therefore, the input and output are two separate sequences.
The preprocessing can be done like this:

def preprocess_function(examples):
    inputs = [f"{prompt}\nSQL Query:\n" for prompt in examples["prompt"]]
    targets = [f"{completion}\n" for completion in examples["completion"]]
    model_inputs = tokenizer(inputs, max_length=2048, truncation=True, padding="max_length")
    # text_target tells the tokenizer these are decoder targets
    labels = tokenizer(text_target=targets, max_length=1024, truncation=True, padding="max_length")

    # As in the causal LM case, mask padding tokens in the labels so they don't contribute to the loss
    labels["input_ids"] = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in labels["input_ids"]]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
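
For completeness, here is a rough sketch of how that seq2seq preprocessing plugs into Seq2SeqTrainer, reusing the preprocess_function above. The checkpoint name, dataset file, column names, and hyperparameters are just placeholders for illustration, not what I actually ran:

from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "t5-small"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Placeholder dataset with "prompt" and "completion" columns
dataset = load_dataset("json", data_files="train.jsonl", split="train")
tokenized = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(output_dir="t5-sql", per_device_train_batch_size=8, num_train_epochs=1)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # pads inputs and labels consistently
)
trainer.train()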

For reference, I found this forum discussion helpful: valueerror-expected-input-batch-size-8-to-match-target-batch-size-280, which also points to this blog post: fine-tuning-a-pre-trained-gpt-2-model-and-performing-inference-a-hands-on-guide.

Hopefully, this can help others who run into the same error in the future :slight_smile: