Hi, I’m trying to train a tokenizer on a dataset loaded with streaming. I followed the instructions provided here, adding streaming=True during the dataset loading step (sketched below). However, it quickly failed because the IterableDataset class does not have a length property (unlike the regular Dataset class).
How can I work around this issue without downloading the dataset entirely, i.e. relying purely on dataset streaming? Here’s the Colab link to reproduce the error.
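For reference, the loading step looks roughly like this (the dataset name and config here are placeholders, not necessarily the ones from my notebook):

from datasets import load_dataset

# streaming=True returns an IterableDataset, which has no __len__
dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)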
Hi!
You can follow the instructions here and use this batch iterator instead:
def batch_iterator(batch_size=1000):
    # Accumulate examples into fixed-size batches; this works on an
    # IterableDataset because it never needs len(dataset)
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # yield the last, possibly smaller, batch
        yield batch
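For completeness, here is roughly how you would feed that iterator to the tokenizer (the model name and vocab size below are placeholders):

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
# train_new_from_iterator consumes any iterator of text batches,
# so it does not need the dataset's length
tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=52000)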
I’m having problems training the tokenizer on a very large dataset. I think the problem is the yield in the iterator. Any chance I could train the tokenizer without the yield?