Using Trainer with an empty training dataset

I have a single script that performs fine-tuning (when required) and then testing. For the test I call predict with predict_with_generate=True.

I use a Trainer instance for this; specifically, the test is run with trainer.predict().

The problem arises when I want to generate text with zero-shot or k-shot prompting and no fine-tuning. In that case the training set is empty, and the Trainer raises an error even though I set do_train=False in its training arguments.

Using Trainer.predict() is very convenient because it takes care of all the start/end tokens the specific model expects. If I had to modify my code for generating text, I would need to take care of too many little details.

Is there a way to tell the Trainer to simply ignore the training set if the .train() method is never called?

When I asked Hugging Chat, I got the following response. I am not sure whether it really works, though.


To resolve the issue where the Hugging Face Trainer appears to require a training or evaluation dataset when you only call trainer.predict(), follow these steps:

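  1. Set Training Flags: Set do_train=False and do_eval=False in your TrainingArguments. Note that the Trainer does not read these flags itself; they are intended for the calling script, so they only take effect if your script checks them before calling train() or evaluate().

  2. Pass No Datasets: Pass None (rather than an empty dataset object) for both train_dataset and eval_dataset when initializing the Trainer. Both arguments are optional, so the Trainer never tries to use a dataset that has no rows.

  3. Adjust Evaluation Strategy: Optionally, set eval_strategy='no' to disable periodic evaluation. This is already the default, and older transformers releases name the argument evaluation_strategy.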

By implementing these steps, you can use trainer.predict() without encountering errors related to missing datasets.
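Step 2 can be sketched as a small helper that turns a zero-length split into None before it reaches the Trainer. This assumes the error is triggered by passing an empty dataset object rather than None; none_if_empty and the example splits are hypothetical names used only for illustration.

```python
from collections.abc import Sized
from typing import Optional


def none_if_empty(dataset: Optional[Sized]) -> Optional[Sized]:
    """Return None for a missing or zero-length dataset, else the dataset itself."""
    if dataset is None or len(dataset) == 0:
        return None
    return dataset


# Zero-shot run: the training and evaluation splits exist but have no rows.
train_split = []
eval_split = []

# Keyword arguments that could be forwarded to Trainer(...) in the real script.
trainer_kwargs = {
    "train_dataset": none_if_empty(train_split),
    "eval_dataset": none_if_empty(eval_split),
}
print(trainer_kwargs)
```

In the actual script the result would be unpacked into the constructor, for example Trainer(model=model, args=training_args, **trainer_kwargs), so the same code path serves both the fine-tuning and the prompting-only runs.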

Answer:
You can configure the Trainer to ignore the training and evaluation datasets by setting do_train=False, do_eval=False, and passing None for both datasets. This allows you to use trainer.predict() without any issues [1][2].

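from transformers import Trainer, TrainingArguments

# `model` and `test_dataset` are assumed to be defined earlier in the script.
training_args = TrainingArguments(
    output_dir="./results",
    do_train=False,
    do_eval=False,
    do_predict=True,
    eval_strategy="no",  # Optional: no periodic evaluation (already the default)
)

# No training or evaluation data is needed when only predict() is called.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
)

# Returns a PredictionOutput with predictions, label_ids, and metrics.
predictions = trainer.predict(test_dataset)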

This setup ensures the Trainer doesn’t require any datasets for training or evaluation, allowing your prediction task to proceed smoothly [1][2].
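Since the question relies on predict_with_generate=True, which is a field of Seq2SeqTrainingArguments and is honored by Seq2SeqTrainer, a prediction-only setup for that case might look like the sketch below. It assumes a sequence-to-sequence model and an installed transformers package; predict_only_settings and build_generation_trainer are hypothetical helper names, and the batch size is an arbitrary placeholder.

```python
def predict_only_settings(output_dir="./results", batch_size=8):
    """Keyword arguments for a prediction-only Seq2SeqTrainingArguments."""
    return {
        "output_dir": output_dir,
        "do_train": False,  # informational flags; the calling script must honor them
        "do_eval": False,
        "do_predict": True,
        "predict_with_generate": True,  # predict() calls model.generate()
        "per_device_eval_batch_size": batch_size,
        "report_to": "none",  # no experiment-tracking integrations
    }


def build_generation_trainer(model, **overrides):
    """Build a Seq2SeqTrainer with no training or evaluation dataset attached."""
    # Imported here so the settings helper can be used without transformers installed.
    from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

    args = Seq2SeqTrainingArguments(**{**predict_only_settings(), **overrides})
    # train_dataset and eval_dataset are simply left at their default of None.
    return Seq2SeqTrainer(model=model, args=args)


# Intended usage, with `model` and `test_dataset` defined elsewhere:
#   trainer = build_generation_trainer(model)
#   output = trainer.predict(test_dataset)
#   generated_ids = output.predictions
```

The generated token ids in output.predictions would still need to be decoded with the model's tokenizer, as in any other predict_with_generate run.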