Training a Tokenizer on a Streamed Dataset

Hi, I’m trying to train a tokenizer on a dataset that uses streaming. I followed the instructions provided here, with the addition of streaming=True during the dataset loading step. However, it quickly failed because the IterableDataset class does not have a length (__len__) property, unlike the regular Dataset class.

How can I work around this issue without having to download the dataset file entirely, i.e. purely relying on dataset streaming? Here’s the Colab link to reproduce the error.
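
For reference, here’s a minimal sketch of the failing setup (the dataset name is just a placeholder; the iterator mirrors the pattern from the linked instructions, which indexes by position and therefore calls len()):

from datasets import load_dataset

# Placeholder dataset; any text dataset loaded with streaming=True behaves the same.
dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    # Pattern from the linked instructions: it needs len(dataset),
    # which IterableDataset does not provide, so iterating raises a TypeError.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]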

Any help would be greatly appreciated. Thanks!

Note: This issue seems to be related to mine.


Hi!
You can follow the instructions here and use this batch iterator instead:

def batch_iterator(batch_size=1000):
    # Accumulate streamed examples into fixed-size batches,
    # so no len() call on the dataset is ever needed.
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # yield the last, possibly smaller batch
        yield batch
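
Then you can pass it to train_new_from_iterator as usual; here’s a quick sketch (the base tokenizer and vocab_size below are just examples):

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example base tokenizer
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=52000)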

Thanks for the reply and solution @lhoestq.
Appreciate the work you and your team are doing!
Cheers!


I’m having problems training the tokenizer on a very large dataset. I think the problem is the yield done in the iterator. Any chance I could train the tokenizer without the yield?

@w11wo You can do this to get the __len__ property (note that with_format returns a new dataset rather than modifying it in place):

dataset = dataset.with_format("torch")
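
For example, with a streamed dataset as in the original question (the dataset name is again a placeholder):

from datasets import load_dataset

dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
dataset = dataset.with_format("torch")  # returns a new, torch-formatted IterableDataset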
