Improve performance of an IterableDataset (with tokenization)

Hi friends :wave:

my problem
I'm using an IterableDataset to load a large dataset and tokenizing it on the fly with the IterableDataset.map function. For most samples the loading time is reasonable, but sometimes retrieving the next sample takes very long (around 45 seconds). This typically happens for very long samples. Any ideas how I can reduce the waiting time?
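For context, here's a minimal sketch of how I create the dataset ("my/dataset" is a placeholder for the actual dataset name):

from datasets import load_dataset

# streaming=True gives an IterableDataset: samples are downloaded
# and processed lazily during iteration instead of up front
raw_dataset = load_dataset("my/dataset", split="train", streaming=True)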

possible solutions
I'm currently testing whether it's possible to preload samples from the dataset using the iterlib package. Any other suggestions?
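The kind of prefetching I have in mind is something like this generic background-thread buffer (a sketch, not tied to any particular package):

import queue
import threading

def prefetch(iterable, buffer_size=64):
    # Pull items from `iterable` in a background thread and buffer them,
    # so an occasional slow sample is fetched while the consumer is
    # still busy with the previous ones.
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterable:
            q.put(item)
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# usage: for sample in prefetch(dataset): ...

Note that this only hides latency when iteration overlaps with other work (e.g. a training step); it can't make the tokenization itself any faster.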

my tokenization code
(taken from the Data processing for Causal Language Modeling video by Hugging Face)

def tokenize(batch, tokenizer, context_length):
    # Tokenize with overflow so long texts are split into
    # multiple chunks of at most `context_length` tokens
    outputs = tokenizer(
        batch["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # Keep only chunks that fill the full context window;
    # shorter leftover chunks are dropped
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}
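And this is roughly how I apply it (a sketch: I bind the extra arguments with functools.partial, and the model name and context_length=128 are just example values):

from functools import partial

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# map is lazy on an IterableDataset: tokenize only runs when
# the next batch is actually requested during iteration
tokenized_dataset = raw_dataset.map(
    partial(tokenize, tokenizer=tokenizer, context_length=128),
    batched=True,
    remove_columns=["text"],
)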

Maybe you can add another map operation before tokenization to split long texts?
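For example, something like this (a rough sketch; max_chars is arbitrary, and a proper sentence splitter would be cleaner than slicing on character counts):

def split_long_texts(batch, max_chars=10_000):
    # Break each document into chunks of at most max_chars characters,
    # so no single tokenizer call has to process a huge document at once
    chunks = []
    for text in batch["text"]:
        chunks.extend(text[i:i + max_chars] for i in range(0, len(text), max_chars))
    return {"text": chunks}

# applied before the tokenization map:
# raw_dataset = raw_dataset.map(split_long_texts, batched=True)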


Thanks for the suggestion. I tried that, but it still doesn't really help. Maybe the bottleneck is actually loading the data rather than tokenizing it :thinking:
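To check that, I'll try timing raw iteration against tokenized iteration, roughly like this (a sketch; n=100 is arbitrary):

import time
from itertools import islice

def time_iteration(ds, n=100):
    # Measure total time for n samples and the worst gap
    # between two consecutive samples
    worst = 0.0
    start = last = time.perf_counter()
    for _ in islice(iter(ds), n):
        now = time.perf_counter()
        worst = max(worst, now - last)
        last = now
    print(f"{n} samples: {last - start:.1f}s total, worst gap {worst:.1f}s")

time_iteration(raw_dataset)        # data loading only
time_iteration(tokenized_dataset)  # loading + tokenization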