Greetings,
I am using the datasets API to load a CSV file in streaming mode with this code:
train_dataset = load_dataset("csv", data_files='train.csv', streaming=True)
Then I convert it to PyTorch format:
train_dataset = train_dataset.with_format("torch")
I use a map function to tokenize and batch the inputs:
def encode(batch):
    inputs = tokenizer(list(map(join_commit_codes, batch['code'])), truncation=True, padding='max_length', max_length=max_commit_code_length)
    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["token_type_ids"] = inputs.token_type_ids
    return batch

train_dataset = train_dataset.map(encode, batch_size=batch_size, batched=True, remove_columns=["code"])
Then I pass it to a Trainer instance:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset.with_format("torch")['train'],
    eval_dataset=valid_dataset.with_format("torch")['train'],
)
But here I get this error:
ValueError: train_dataset does not implement __len__, max_steps has to be specified
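From the error message, it seems I could set max_steps in my TrainingArguments instead, something like the sketch below (the step count and output_dir are placeholders, not my real configuration):

from transformers import TrainingArguments

# Sketch only: max_steps value is a placeholder
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=batch_size,
    max_steps=10000,  # total optimizer steps; apparently required when the dataset has no __len__
)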
I have also tried other solutions, like wrapping the dataset in a class that provides __len__:
class FromIterableDataset:
    def __init__(self, iterable_dataset, length):
        self.dataset = iterable_dataset
        self.length = length

    def __len__(self):
        return self.length
I wrap the dataset with this class, but then I get an error that indexing is not supported.
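If I have to go the max_steps route, I assume I would compute it from the row count of train.csv, roughly like this (num_examples and num_epochs are illustrative placeholders, not values from my setup):

# Assumption: the number of rows in train.csv is known ahead of time
# (e.g. counted once offline); these values are placeholders.
num_examples = 1000000
num_epochs = 3
steps_per_epoch = num_examples // batch_size
max_steps = steps_per_epoch * num_epochs

But I would prefer not to hard-code this if there is a cleaner way.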
Our dataset is large, so streaming it into the model during training is the best option for us.
I would greatly appreciate your help in fixing this error.
Thank you.