Greetings,
I am using the datasets API to load a CSV file in streaming mode with this code:
train_dataset = load_dataset("csv", data_files='train.csv', streaming=True)
Then I convert it to PyTorch format:
train_dataset = train_dataset.with_format("torch")
I use a map function to tokenize and batch the inputs:
def encode(batch):
    inputs = tokenizer(list(map(join_commit_codes, batch['code'])), truncation=True, padding='max_length', max_length=max_commit_code_length)
    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["token_type_ids"] = inputs.token_type_ids
    return batch

train_dataset = train_dataset.map(encode, batch_size=batch_size, batched=True, remove_columns=["code"])
Then I pass it to a Trainer instance:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset.with_format("torch")['train'],
    eval_dataset=valid_dataset.with_format("torch")['train'],
)
But here I get this error:
ValueError: train_dataset does not implement __len__, max_steps has to be specified
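From the error message, it seems I could set max_steps in my TrainingArguments instead, something like the sketch below (the step count and output_dir are placeholders, not my real configuration):

from transformers import TrainingArguments

# Sketch only: max_steps value is a placeholder
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=batch_size,
    max_steps=10000,  # total optimizer steps; apparently required when the dataset has no __len__
)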
I have also tried other solutions, like wrapping the dataset in a class that provides __len__:
class FromIterableDataset:
    def __init__(self, iterable_dataset, length):
        self.dataset = iterable_dataset
        self.length = length

    def __len__(self):
        return self.length
I wrap the dataset with this class, but then I get an error that indexing is not supported.
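If I have to go the max_steps route, I assume I would compute it from the row count of train.csv, roughly like this (num_examples and num_epochs are illustrative placeholders, not values from my setup):

# Assumption: the number of rows in train.csv is known ahead of time
# (e.g. counted once offline); these values are placeholders.
num_examples = 1000000
num_epochs = 3
steps_per_epoch = num_examples // batch_size
max_steps = steps_per_epoch * num_epochs

But I would prefer not to hard-code this if there is a cleaner way.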
Our dataset is large, so streaming it into the model during training is the best option for us.
I would greatly appreciate your help in fixing this error.
Thank you.