Training a Tokenizer on a Streamed Dataset

Hi, I’m trying to train a tokenizer on a dataset that uses streaming. I followed the instructions provided here, with the addition of streaming=True during the dataset loading step. However, it quickly failed because the IterableDataset class does not have a length property (unlike the regular Dataset class).
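
For context, a minimal sketch of the failing setup (the dataset name here is only an illustrative stand-in for the one in the Colab):

from datasets import load_dataset

# streaming=True returns an IterableDataset rather than a regular Dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

# The suggested batch iterator indexes by position using len(dataset),
# which fails because IterableDataset defines no __len__:
len(dataset)  # TypeError: object of type 'IterableDataset' has no len()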

How can I work around this issue without downloading the dataset entirely, i.e., relying purely on dataset streaming? Here’s the Colab link to reproduce the error.

Any help would be greatly appreciated. Thanks!

Note: This issue seems to be related to mine.

Hi!
You can follow the instructions here and use this batch iterator instead:

def batch_iterator(batch_size=1000):
    # `dataset` is the IterableDataset loaded with streaming=True
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # yield the last, possibly smaller, batch
        yield batch
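
For completeness, one way to plug that iterator into tokenizer training is transformers’ train_new_from_iterator, which consumes the batches lazily and never needs the dataset’s length (a sketch, assuming a fast tokenizer and the batch_iterator defined above; the base model and vocab_size are just example values):

from transformers import AutoTokenizer

# Start from an existing fast tokenizer; gpt2 is only an illustrative base.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# train_new_from_iterator iterates over the yielded batches, so it works
# with a streamed IterableDataset that has no __len__.
tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32000)
tokenizer.save_pretrained("my-new-tokenizer")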

Thanks for the reply and the solution, @lhoestq.
Appreciate the work you and your team are doing!
Cheers!
