Streaming dataset into Trainer: does not implement __len__, max_steps has to be specified

Greetings,

I am using the datasets API to load a CSV file in streaming mode, with this code:

from datasets import load_dataset

train_dataset = load_dataset("csv", data_files="train.csv", streaming=True)

Then I convert it to PyTorch format:

train_dataset = train_dataset.with_format("torch")

I use a map function to tokenize the inputs in batches:

def encode(batch):
    # join_commit_codes is a user-defined helper that joins the code snippets of a commit
    inputs = tokenizer(
        list(map(join_commit_codes, batch["code"])),
        truncation=True,
        padding="max_length",
        max_length=max_commit_code_length,
    )
    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["token_type_ids"] = inputs.token_type_ids
    return batch

train_dataset = train_dataset.map(encode, batched=True, batch_size=batch_size, remove_columns=["code"])

And then I pass it to a Trainer instance:

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset.with_format("torch")["train"],
    eval_dataset=valid_dataset.with_format("torch")["train"],
)

But then I get this error:

ValueError: train_dataset does not implement __len__, max_steps has to be specified

I have tried other solutions, such as wrapping the dataset in a class that provides __len__:

class FromIterableDataset:
    # wrap a streaming (iterable) dataset and report a known length
    def __init__(self, iterable_dataset, length):
        self.dataset = iterable_dataset
        self.length = length

    def __iter__(self):
        # delegate iteration to the underlying streaming dataset
        return iter(self.dataset)

    def __len__(self):
        return self.length

Then I wrapped the dataset with it, but I get an error that indexing is not supported: once __len__ is present, the dataset is treated as map-style and the data loader tries to index into it, which the underlying streaming dataset does not support.

Our dataset is large, so streaming it during training is the best option for us.
I would greatly appreciate your help in fixing this error.
Thank you.

Have you tried passing max_steps to the Trainer (via TrainingArguments)?
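
Something along these lines (a minimal sketch; the output directory, step count, and batch size below are placeholders):

from transformers import TrainingArguments

# max_steps takes the place of num_train_epochs when the dataset has no __len__;
# the values here are placeholders, not recommendations
training_args = TrainingArguments(
    output_dir="out",
    max_steps=10_000,
    per_device_train_batch_size=32,
)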

Hello,
Thanks for the reply.
The problem is that I do not know the required max_steps in advance.
Is it dependent on the batch size and the number of epochs?


I'm using IterableDataset to read large datasets (over 100 GB). I do not know how many rows they have, and counting them would itself take quite a while.

In my opinion, Hugging Face should just have a notion of end-of-dataset in IterableDataset. I believe this exists in torch/TF, and it could be used to mark when an epoch is finished without requiring the number of rows, as in the sketch below.
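
In plain PyTorch, for instance, an epoch over an iterable dataset simply ends when its iterator is exhausted (a minimal sketch; LineDataset and the file path are made up for illustration):

from torch.utils.data import DataLoader, IterableDataset

class LineDataset(IterableDataset):
    # hypothetical file-backed dataset: one example per line, no known length
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.rstrip("\n")

loader = DataLoader(LineDataset("train.csv"), batch_size=8)
for epoch in range(2):
    for batch in loader:  # the inner loop stops when the file is exhausted
        pass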


I agree it would be nice if the Trainer could just detect when the dataset ends and count that as an epoch. There is probably an issue on GitHub about this, though.

There was, but AFAICT it was resolved with a workaround (specifying max_steps), which is less than ideal.


The Trainer can't just wait for the end of an epoch, as the number of steps needs to be known in advance for the learning rate scheduler. This is why we have no choice but to ask for max_steps to be set in this case.
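
If you can estimate the number of rows, max_steps follows from the effective batch size and the number of epochs. A rough back-of-the-envelope sketch (all counts below are placeholders):

import math

num_examples = 1_000_000    # estimated number of rows (placeholder)
per_device_batch_size = 32  # matches per_device_train_batch_size
num_devices = 1
grad_accum_steps = 1        # matches gradient_accumulation_steps
num_epochs = 3

effective_batch_size = per_device_batch_size * num_devices * grad_accum_steps
max_steps = math.ceil(num_examples / effective_batch_size) * num_epochs
print(max_steps)  # pass this value to TrainingArguments(max_steps=...)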
