Custom dataset output dimensions

I'm having some issues defining a custom dataset. I found numerous tutorials online, and functionally they do work; however, my model seems to expect a different dimension than the data returned by my custom __getitem__ function.

I define my dataset like this:

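from torch.utils.data import Dataset

# Note: tokenizer and device are defined elsewhere in my script.
class ExampleDataset(Dataset):
    def __init__(self, large_file_path, offset_dict):
        self.large_file_path = large_file_path
        self.offset_dict = offset_dict

    def __len__(self):
        # One sample per indexed line of the large file
        return len(self.offset_dict)

    def __getitem__(self, line):
        # Look up where the requested line starts in the large file
        offset = self.offset_dict[line]
        with open(self.large_file_path, 'r', encoding='utf-8') as f:
            f.seek(offset)  # jump to the start of that line
            line = f.readline()
            inputs = tokenizer(line, return_tensors="pt", add_special_tokens=True, max_length=256).to(device)
            return inputs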
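
For context, offset_dict maps a line index to the position in the file where that line starts, so a single line can be read without loading the whole file. A minimal sketch of how such a dict could be built is below. The helper names build_offset_dict and read_line_at are hypothetical, the demo file is a stand-in for the real corpus, and the file is opened in binary mode on the assumption that the offsets are plain byte positions.

```python
import os
import tempfile


def build_offset_dict(path):
    """Map line index -> byte offset of the start of that line (hypothetical helper)."""
    offsets = {}
    # Binary mode so that tell() returns plain byte offsets
    with open(path, "rb") as f:
        index = 0
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets[index] = pos
            index += 1
    return offsets


def read_line_at(path, offset):
    """Seek to a byte offset and return the decoded line without its newline."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8").rstrip("\n")


# Tiny demo file standing in for the real large text file
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first line\nsecond line\nthird line\n")
    demo_path = tmp.name

offset_dict = build_offset_dict(demo_path)
second = read_line_at(demo_path, offset_dict[1])
os.remove(demo_path)

print(offset_dict)
print(second)
```

Only the offsets stay in memory, which is the reason for this approach over loading every line up front.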

The issue I'm having is that during training I receive this error:

RuntimeError: output with shape [256, 1, 18] doesn't match the broadcast shape [256, 256, 18]

It's not clear where that central dimension is coming from, as I don't define, for example, a batch size in the training parameters. The input is a single large text file split by line. I previously used LineByLineTextDataset, which worked; however, there isn't enough memory for pretraining to work with my new dataset.
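
To illustrate what I mean by an extra dimension, here is a minimal sketch of the shapes involved. It uses plain nested lists as stand-ins for tensors, and it assumes that each item from my dataset carries a leading dimension of size 1 (which I believe return_tensors="pt" adds for a single string) and that items are then stacked into a batch. The values of seq_len and batch_size are hypothetical, and I am not certain this is the actual cause of the error.

```python
def shape(nested):
    """Return the shape of a regular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(dims)


seq_len = 18    # hypothetical padded sequence length
batch_size = 4  # hypothetical number of items stacked together

# One item with a leading dimension of 1: shape (1, seq_len)
item_with_leading_dim = [[0] * seq_len]
# The same item without the leading dimension: shape (seq_len,)
item_flat = [0] * seq_len

# Stacking items the way a batch would be assembled
stacked_with_leading_dim = [item_with_leading_dim for _ in range(batch_size)]
stacked_flat = [item_flat for _ in range(batch_size)]

print(shape(stacked_with_leading_dim))  # (4, 1, 18)
print(shape(stacked_flat))              # (4, 18)
```

If that assumption holds, the size-1 dimension would sit between the batch and sequence dimensions, which looks similar to the shape in the error message.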

My trainer is defined like so:

training_args = TrainingArguments(

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(

Could someone please explain why the output of my __getitem__ is not correct for training RobertaForMaskedLM/BertForMaskedLM?

Thank you in advance.