How to sample from the validation set when using Trainer?

When using the Trainer, e.g.

# set training arguments - these params are not really tuned, feel free to change
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=50,
    per_device_eval_batch_size=10,
    predict_with_generate=True,
    logging_steps=2,  # set to 1000 for full training
    save_steps=16,    # set to 500 for full training
    eval_steps=4,     # set to 8000 for full training
    warmup_steps=1,   # set to 2000 for full training
    max_steps=16,     # delete for full training
    # overwrite_output_dir=True,
    save_total_limit=1,
    #fp16=True, 
)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=eval_data.with_format("torch"),
)

Is there some way to randomly select/sample from eval_data at every n eval_steps?

E.g. I have tried

eval_data = eval_data.select(range(3000))
...
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=eval_data.with_format("torch"),
)

But that statically defines the eval_data subset before training starts. Is it possible to do the selection during training, so that a different subset is sampled at every evaluation point?

No, that’s not supported. The evaluation dataset has to be fixed.
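That said, if you are willing to drive evaluation manually instead of relying on eval_steps, you can call trainer.evaluate yourself with a freshly sampled subset each time, since Trainer.evaluate accepts an eval_dataset argument. A minimal sketch of the index sampling (the subset size of 3000 and the loop are illustrative, not tuned values):

```python
import random

def sample_eval_subset(dataset_size, subset_size, seed=None):
    """Return a fresh random list of unique indices, suitable for Dataset.select()."""
    rng = random.Random(seed)
    return rng.sample(range(dataset_size), subset_size)

# Hypothetical usage: evaluate on a different random subset each time.
# for _ in range(num_manual_evals):
#     indices = sample_eval_subset(len(eval_data), 3000)
#     metrics = trainer.evaluate(eval_dataset=eval_data.select(indices))
```

Note this bypasses the Trainer's built-in periodic evaluation, so metrics across evaluations are computed on different subsets and are noisier to compare.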


Thanks @sgugger for the prompt reply!