Training a Tokenizer on a Streamed Dataset

Hi, I’m trying to train a tokenizer on a dataset that uses streaming. I followed the instructions provided here, with the addition of streaming=True during the dataset loading step. However, it quickly failed because the IterableDataset class does not have a length (__len__) property, unlike the regular Dataset class.

How can I work around this issue without having to download the dataset file entirely, i.e. purely relying on dataset streaming? Here’s the Colab link to reproduce the error.
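
For reference, here’s a minimal sketch of the failing setup (the dataset name is just a placeholder; the iterator mirrors the pattern from the linked instructions, which indexes by position and therefore calls len()):

from datasets import load_dataset

# Placeholder dataset; any text dataset loaded with streaming=True behaves the same.
dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    # Pattern from the linked instructions: it needs len(dataset),
    # which IterableDataset does not provide, so iterating raises a TypeError.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]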

Any help would be greatly appreciated. Thanks!

Note: This issue seems to be related to mine.


Hi!
You can follow the instructions here and use this batch iterator instead:

def batch_iterator(batch_size=1000):
    # Accumulate streamed examples into fixed-size batches,
    # so no len() call on the dataset is ever needed.
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # yield the last, possibly smaller batch
        yield batch
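
Then you can pass it to train_new_from_iterator as usual; here’s a quick sketch (the base tokenizer and vocab_size below are just examples):

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example base tokenizer
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=52000)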

Thanks for the reply and solution @lhoestq.
Appreciate the work you and your team are doing!
Cheers!


I’m having problems training the tokenizer on a very large dataset. I think the problem is the yield done in the iterator. Any chance I could train the tokenizer without the yield?

@w11wo You can do this to get the __len__ property (note that with_format returns a new dataset rather than modifying it in place):

dataset = dataset.with_format("torch")
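
For example, with a streamed dataset as in the original question (the dataset name is again a placeholder):

from datasets import load_dataset

dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
dataset = dataset.with_format("torch")  # returns a new, torch-formatted IterableDataset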
