Hi, I’m trying to train a tokenizer on a streamed dataset. I followed the instructions provided here, with the addition of
streaming=True during the dataset loading step. However, it quickly failed because the
IterableDataset class does not have a length property (unlike the regular Dataset class).
How can I work around this issue without downloading the dataset entirely, i.e., relying purely on dataset streaming? Here’s the Colab link to reproduce the error.
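For context, here is a sketch of the kind of length-free batch iterator I think might work as a replacement for the `range(0, len(dataset), 1000)` loop from the tutorial. The column name `"text"` and the batch size are my assumptions; the stand-in generator below mimics a streamed `IterableDataset` so the snippet runs without network access:

```python
from itertools import islice

# In practice this would be a streamed dataset, e.g.:
#   from datasets import load_dataset
#   dataset = load_dataset("wikitext", "wikitext-2-raw-v1",
#                          split="train", streaming=True)
# A plain generator of dicts stands in for the IterableDataset here.
dataset = ({"text": f"example sentence {i}"} for i in range(2500))

def batch_iterator(dataset, batch_size=1000):
    """Yield lists of texts without ever calling len(dataset)."""
    texts = (example["text"] for example in dataset)  # assumes a "text" column
    while True:
        batch = list(islice(texts, batch_size))
        if not batch:
            return
        yield batch

batches = list(batch_iterator(dataset))
print([len(b) for b in batches])  # → [1000, 1000, 500]
```

If this is sound, I’d expect it could be passed straight to `tokenizer.train_new_from_iterator(batch_iterator(dataset), vocab_size=...)`, since that method only needs an iterator of text batches, not a sized dataset. Please correct me if I’m missing something.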
Any help would be greatly appreciated. Thanks!
Note: this issue seems to be related to mine.