Improve performance of an IterableDataset (with tokenization)

Hi friends :wave:

my problem
I'm using an IterableDataset to load a large dataset and tokenizing it on the fly with the IterableDataset.map function. For most samples the loading time is reasonable, but sometimes retrieving the next sample takes very long (around 45 seconds). This typically happens for very long samples. Any ideas how I can reduce the waiting time?
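For context, here's a minimal sketch of how I create the dataset ("my/dataset" is a placeholder for the actual dataset name):

from datasets import load_dataset

# streaming=True gives an IterableDataset: samples are downloaded
# and processed lazily during iteration instead of up front
raw_dataset = load_dataset("my/dataset", split="train", streaming=True)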

possible solutions
I'm currently testing whether it's possible to preload samples from the dataset using the iterlib package. Any other suggestions?
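The kind of prefetching I have in mind is something like this generic background-thread buffer (a sketch, not tied to any particular package):

import queue
import threading

def prefetch(iterable, buffer_size=64):
    # Pull items from `iterable` in a background thread and buffer them,
    # so an occasional slow sample is fetched while the consumer is
    # still busy with the previous ones.
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterable:
            q.put(item)
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# usage: for sample in prefetch(dataset): ...

Note that this only hides latency when iteration overlaps with other work (e.g. a training step); it can't make the tokenization itself any faster.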

my tokenization code
(taken from the Data processing for Causal Language Modeling video by Hugging Face)

def tokenize(batch, tokenizer, context_length):
    # Tokenize with overflow so long texts are split into
    # multiple chunks of at most `context_length` tokens
    outputs = tokenizer(
        batch["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # Keep only chunks that fill the full context window;
    # shorter leftover chunks are dropped
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}
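And this is roughly how I apply it (a sketch: I bind the extra arguments with functools.partial, and the model name and context_length=128 are just example values):

from functools import partial

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# map is lazy on an IterableDataset: tokenize only runs when
# the next batch is actually requested during iteration
tokenized_dataset = raw_dataset.map(
    partial(tokenize, tokenizer=tokenizer, context_length=128),
    batched=True,
    remove_columns=["text"],
)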

Maybe you can add another map operation before tokenization to split long texts?
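For example, something like this (a rough sketch; max_chars is arbitrary, and a proper sentence splitter would be cleaner than slicing on character counts):

def split_long_texts(batch, max_chars=10_000):
    # Break each document into chunks of at most max_chars characters,
    # so no single tokenizer call has to process a huge document at once
    chunks = []
    for text in batch["text"]:
        chunks.extend(text[i:i + max_chars] for i in range(0, len(text), max_chars))
    return {"text": chunks}

# applied before the tokenization map:
# raw_dataset = raw_dataset.map(split_long_texts, batched=True)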


Thanks for the suggestion. I tried that, but it still doesn't really help. Maybe the bottleneck is actually loading the data rather than tokenizing it :thinking:
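To check that, I'll try timing raw iteration against tokenized iteration, roughly like this (a sketch; n=100 is arbitrary):

import time
from itertools import islice

def time_iteration(ds, n=100):
    # Measure total time for n samples and the worst gap
    # between two consecutive samples
    worst = 0.0
    start = last = time.perf_counter()
    for _ in islice(iter(ds), n):
        now = time.perf_counter()
        worst = max(worst, now - last)
        last = now
    print(f"{n} samples: {last - start:.1f}s total, worst gap {worst:.1f}s")

time_iteration(raw_dataset)        # data loading only
time_iteration(tokenized_dataset)  # loading + tokenization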