Hey John, thanks for your reply!
After doing some reading, I realized my error was due to a misunderstanding of the training mechanisms of Causal Language Models (CLMs) versus Seq2Seq models.
Causal Language Models (such as Qwen) are decoder-only models, meaning their objective is to predict the next token given all previous tokens in a sequence.
As a result, the input is a single continuous sequence, and the target for each position is simply the next token in that same sequence. In the Hugging Face implementation you pass labels that are identical to input_ids, and the model shifts them internally when computing the loss.
This means the preprocessing should be done as follows:
def preprocess_function(examples):
    inputs = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=1024)
    inputs['labels'] = inputs['input_ids'].copy()
    # Mask padding tokens so they don't contribute to the loss
    inputs['labels'] = [
        [-100 if token == tokenizer.pad_token_id else token for token in label]
        for label in inputs['labels']
    ]
    return inputs
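In case it helps, here is a minimal end-to-end sketch of how this preprocessing plugs into the standard Trainer. The model name, the "train.txt" data file, and the "text" column are just placeholders for illustration, not part of my actual setup:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "gpt2"  # placeholder; swap in your causal LM checkpoint (e.g. a Qwen model)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# "train.txt" and the "text" column are assumptions about the dataset layout
dataset = load_dataset("text", data_files={"train": "train.txt"})
tokenized = dataset.map(preprocess_function, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clm-out", per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
)
trainer.train()

Since every example is already padded to a fixed length and carries its own labels, the default data collator is enough here.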
On the other hand, Seq2Seq models (like T5) use an encoder-decoder architecture, where the objective is to transform one sequence into another.
Therefore, the input and output are two separate sequences.
The preprocessing can be done like this:
def preprocess_function(examples):
    inputs = [f"{prompt}\nSQL Query:\n" for prompt in examples["prompt"]]
    targets = [f"{completion}\n" for completion in examples["completion"]]
    model_inputs = tokenizer(inputs, max_length=2048, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=1024, truncation=True, padding="max_length")
    # Mask padding tokens in the labels here too, so they don't contribute to the loss
    model_inputs["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in label]
        for label in labels["input_ids"]
    ]
    return model_inputs
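And the equivalent sketch for the Seq2Seq case, this time with Seq2SeqTrainer. Again, the model name, the "train.jsonl" file, and the "prompt"/"completion" column names are only assumptions for illustration:

from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "t5-small"  # placeholder; swap in your seq2seq checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# "train.jsonl" with "prompt"/"completion" fields is an assumption about the data layout
dataset = load_dataset("json", data_files={"train": "train.jsonl"})
tokenized = dataset.map(preprocess_function, batched=True,
                        remove_columns=["prompt", "completion"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="seq2seq-out", per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
)
trainer.train()

As a side note, recent transformers versions also let you tokenize the targets by passing text_target=targets to the same tokenizer call instead of tokenizing them in a separate step.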
For reference, I found this forum discussion helpful: valueerror-expected-input-batch-size-8-to-match-target-batch-size-280, which in turn points to this blog post: fine-tuning-a-pre-trained-gpt-2-model-and-performing-inference-a-hands-on-guide.
Hopefully, this helps others who run into the same error in the future.