I'm having some issues defining a custom dataset. I found numerous tutorials online, and functionally they do work, however my model seems to be expecting a different dimension than the data returned by my custom __getitem__ function.
I define my dataset like this:
class ExampleDataset(Dataset):
    def __init__(self, large_file_path, offset_dict):
        self.large_file_path = large_file_path
        self.offset_dict = offset_dict

    def __len__(self):
        return len(self.offset_dict)

    def __getitem__(self, line):
        offset = self.offset_dict[line]
        with open(self.large_file_path, 'r', encoding='utf-8') as f:
            f.seek(offset)
            line = f.readline()
        inputs = tokenizer(line, return_tensors="pt", add_special_tokens=True, max_length=256).to(device)
        return inputs
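In case it matters, offset_dict maps each line index to the offset where that line starts in the file, so __getitem__ can seek straight to it. Roughly it is built like this (a sketch; build_offset_dict is just an illustrative name):

# Sketch: map line index -> offset of that line, using text-mode tell()
# so it matches the text-mode seek() in __getitem__.
def build_offset_dict(path):
    offsets = {}
    with open(path, 'r', encoding='utf-8') as f:
        index = 0
        pos = f.tell()
        line = f.readline()
        while line:
            offsets[index] = pos
            index += 1
            pos = f.tell()
            line = f.readline()
    return offsets

offset_dict = build_offset_dict(large_file_path)
dataset = ExampleDataset(large_file_path, offset_dict)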
The issue I'm having is that during training I receive this error:
RuntimeError: output with shape [256, 1, 18] doesn't match the broadcast shape [256, 256, 18]
It's not clear where that middle dimension is coming from, as I don't define, for example, a batch size of 256 anywhere in the training parameters. The input is a single large text file split by line. I previously used the LineByLineTextDataset, which worked, but there isn't enough memory for pretraining to work with that, hence my new dataset.
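A quick way to see what a single item looks like (a sketch, assuming dataset is built as above):

# Print the shape of each tensor returned by __getitem__ for one item.
sample = dataset[0]
for key, value in sample.items():
    print(key, tuple(value.shape))
# e.g. input_ids (1, 18), attention_mask (1, 18) -- the leading 1 comes from
# return_tensors="pt" tokenizing a single line.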
My trainer is defined like so:
training_args = TrainingArguments(
    output_dir='./mlmresult',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10000,
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
Could someone please explain why the output of my __getitem__ is not correct for training RobertaForMaskedLM/BertForMaskedLM?
Thank you in advance.