Hi friends
my problem
I’m using an IterableDataset to load a big dataset and tokenizing it on demand with the IterableDataset.map function. For most samples the loading time is reasonable, but sometimes retrieving the next sample takes a really long time (around 45 seconds). This typically happens for very long samples. Any ideas on how I can reduce the waiting time?
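For reference, this is roughly how the pieces fit together (the dataset name and checkpoint are placeholders, and `tokenize` is the function at the bottom of this post):

```python
from functools import partial

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint

# streaming=True yields an IterableDataset: nothing is tokenized up front,
# each batch is processed lazily while the stream is consumed
raw = load_dataset("my/big-dataset", split="train", streaming=True)
tokenized = raw.map(
    partial(tokenize, tokenizer=tokenizer, context_length=128),
    batched=True,
    remove_columns=["text"],
)

for sample in tokenized:
    ...  # usually fast, but occasionally the next sample takes ~45 s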
possible solutions
I’m currently testing whether it’s possible to preload samples from the dataset using the iterlib package. Any other suggestions?
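In case it helps the discussion, this is the kind of prefetching I have in mind, sketched with just the standard library (I’m not committing to iterlib’s API here): a background thread keeps a bounded queue filled, so a slow sample can be produced while earlier, fast ones are still being consumed.

```python
import threading
from queue import Queue

def prefetch(iterable, buffer_size=64):
    """Iterate `iterable` in a background thread, buffering up to
    `buffer_size` items so slow samples overlap with consumption."""
    queue = Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterable:
            queue.put(item)
        queue.put(sentinel)  # signal that the stream is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while (item := queue.get()) is not sentinel:
        yield item

# usage: iterate the prefetched stream instead of the raw one
# for sample in prefetch(tokenized): ...
```

Of course this only hides the latency when the buffer has had time to fill: a 45-second sample still takes 45 seconds to produce, but the consumer can keep draining the queue in the meantime.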
my tokenization code
(taken from the Data processing for Causal Language Modeling video by Hugging Face)
```python
def tokenize(batch, tokenizer, context_length):
    # tokenize a batch of raw texts; texts longer than context_length are
    # split into multiple overflow chunks instead of being discarded
    outputs = tokenizer(
        batch["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # keep only the chunks that are exactly context_length tokens long,
    # so every training sample has the same length
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}
```