When using the Trainer, e.g.
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# set training arguments - these params are not really tuned, feel free to change
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=50,
    per_device_eval_batch_size=10,
    predict_with_generate=True,
    logging_steps=2,   # set to 1000 for full training
    save_steps=16,     # set to 500 for full training
    eval_steps=4,      # set to 8000 for full training
    warmup_steps=1,    # set to 2000 for full training
    max_steps=16,      # delete for full training
    # overwrite_output_dir=True,
    save_total_limit=1,
    # fp16=True,
)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=eval_data.with_format("torch"),
)
Is there some way to randomly select/sample from the eval_data at every n eval_steps?
E.g., I have tried:
eval_data = eval_data.select(range(3000))

...

trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=eval_data.with_format("torch"),
)
But that statically defines the eval_data subset before training even starts. Is it possible to do the selection during training, so that a different random subset is evaluated at every evaluation point? Something like the sketch below is roughly what I have in mind.
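This is only a guess at how it might be done, not something I know to be supported: subclass Seq2SeqTrainer and override get_eval_dataloader so that each evaluation draws a fresh random subset (I'm assuming get_eval_dataloader is the hook that gets called on every evaluation; RandomSubsetSeq2SeqTrainer and subset_size are made-up names).

import random
from transformers import Seq2SeqTrainer

class RandomSubsetSeq2SeqTrainer(Seq2SeqTrainer):
    """Hypothetical trainer that evaluates on a fresh random subset each time."""

    def __init__(self, *args, subset_size=3000, **kwargs):
        super().__init__(*args, **kwargs)
        self.subset_size = subset_size

    def get_eval_dataloader(self, eval_dataset=None):
        # Default to the full eval set the trainer was constructed with.
        dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        # Draw a new random subset of indices on every evaluation call.
        indices = random.sample(range(len(dataset)), k=min(self.subset_size, len(dataset)))
        return super().get_eval_dataloader(dataset.select(indices))

# instantiated the same way as before, plus the subset size:
trainer = RandomSubsetSeq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=eval_data.with_format("torch"),
    subset_size=3000,
)

Is something along these lines reasonable, or is there a built-in option or callback in the Trainer that resamples the eval set at each evaluation?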