I’m using an IterableDataset from the datasets library and passing it to the Hugging Face Trainer, something like this:
```python
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

ds = load_dataset("my-dataset", streaming=True)

training_args = TrainingArguments(
    output_dir="my_model",
    per_device_train_batch_size=8,
    max_steps=1000,  # large enough to cover multiple passes over the data
)

trainer = Trainer(
    model=my_model,
    args=training_args,  # was missing; without it the arguments above are never used
    train_dataset=ds["train"].shuffle(seed=42),
    eval_dataset=ds["test"],
)

trainer.train()
```
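For context, my understanding of what `.shuffle(seed=...)` does on a streaming dataset: it shuffles through a fixed-size buffer, and the effective ordering depends on both the seed and the current epoch (which `datasets` exposes via `set_epoch`). Here is a rough pure-Python sketch of that idea; `epoch_shuffled` and the exact `seed + epoch` seeding are my own illustration, not the library’s code:

```python
import random

def epoch_shuffled(stream, buffer_size, seed, epoch):
    # Approximation of buffer-based shuffling: the effective seed folds in
    # the epoch number, so each epoch yields a different (but reproducible)
    # order over the same underlying stream.
    rng = random.Random(seed + epoch)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) == buffer_size:
            # Emit a random element from the buffer, keeping it full
            yield buf.pop(rng.randrange(buffer_size))
    # Drain whatever is left at the end of the stream
    rng.shuffle(buf)
    yield from buf
```

Under this scheme, re-iterating with the same epoch reproduces the same order, while bumping the epoch changes it, which is what “reshuffle after each epoch” would mean for an iterable dataset.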
So, will the Trainer automatically reshuffle my iterable dataset after each epoch?
I can’t find this behavior documented, but I did find code that forces the Trainer to reshuffle via callbacks (so I guess the Trainer does not reshuffle the dataset on its own?):