Streaming dataset into Trainer: does not implement __len__, max_steps has to be specified

I’m using an IterableDataset to read large datasets (larger than 100 GB). I don’t know how many rows they have, and counting them could itself take quite a while.

In my opinion, Hugging Face should simply have a notion of end-of-dataset for IterableDataset. I believe this exists in PyTorch and TensorFlow, and it could be used to mark when an epoch is finished without requiring the number of rows.
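To illustrate what I mean by an end-of-dataset signal, here is a minimal pure-Python sketch. It is not how Trainer works internally. `StreamingRows` and `run_epochs` are hypothetical names I made up for this example, and the `range` stands in for a large source that can be re-read from the start. The point is only that an epoch can end when the iterator is exhausted, with no `__len__` involved:

```python
from typing import Iterable, Iterator, List


class StreamingRows:
    """Hypothetical stand-in for a streaming dataset: iterable, but no __len__."""

    def __init__(self, source: Iterable[int]) -> None:
        # Assumption: the source can be iterated again from the start each epoch.
        self._source = source

    def __iter__(self) -> Iterator[int]:
        # A fresh iterator per epoch; its exhaustion marks the end of the dataset.
        return iter(self._source)


def run_epochs(dataset: Iterable[int], num_epochs: int) -> List[int]:
    """Iterate for num_epochs and return how many rows were seen in each epoch."""
    rows_per_epoch = []
    for _ in range(num_epochs):
        seen = 0
        for _row in dataset:  # the loop ends when the iterator is exhausted
            seen += 1
        rows_per_epoch.append(seen)
    return rows_per_epoch


dataset = StreamingRows(range(5))
counts = run_epochs(dataset, num_epochs=2)
print(counts)
```

The row count is only discovered as a side effect of finishing an epoch, so nothing has to be counted up front.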
