I'm having some issues defining a custom dataset. I found numerous tutorials online, and functionally they do work, however my model seems to be expecting a different dimension than the data returned by my custom __getitem__ function.
I define my dataset like this:
class ExampleDataset(Dataset):
    def __init__(self, large_file_path, offset_dict):
        self.large_file_path = large_file_path
        self.offset_dict = offset_dict

    def __len__(self):
        return len(self.offset_dict)

    def __getitem__(self, line):
        offset = self.offset_dict[line]
        with open(self.large_file_path, 'r', encoding='utf-8') as f:
            f.seek(offset)
            line = f.readline()
        inputs = tokenizer(line, return_tensors="pt", add_special_tokens=True, max_length=256).to(device)
        return inputs
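In case it matters, offset_dict maps each line index to the offset where that line starts in the file, so __getitem__ can seek straight to it. Roughly it is built like this (a sketch; build_offset_dict is just an illustrative name):

# Sketch: map line index -> offset of that line, using text-mode tell()
# so it matches the text-mode seek() in __getitem__.
def build_offset_dict(path):
    offsets = {}
    with open(path, 'r', encoding='utf-8') as f:
        index = 0
        pos = f.tell()
        line = f.readline()
        while line:
            offsets[index] = pos
            index += 1
            pos = f.tell()
            line = f.readline()
    return offsets

offset_dict = build_offset_dict(large_file_path)
dataset = ExampleDataset(large_file_path, offset_dict)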
The issue I'm having is that during training I receive this error:
RuntimeError: output with shape [256, 1, 18] doesn't match the broadcast shape [256, 256, 18]
It's not clear where that middle dimension is coming from, as I don't define, for example, a batch size of 256 anywhere in the training parameters. The input is a single large text file split by line. I previously used the LineByLineTextDataset, which worked, but there isn't enough memory for pretraining to work with that, hence my new dataset.
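A quick way to see what a single item looks like (a sketch, assuming dataset is built as above):

# Print the shape of each tensor returned by __getitem__ for one item.
sample = dataset[0]
for key, value in sample.items():
    print(key, tuple(value.shape))
# e.g. input_ids (1, 18), attention_mask (1, 18) -- the leading 1 comes from
# return_tensors="pt" tokenizing a single line.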
My trainer is defined like so:
training_args = TrainingArguments(
    output_dir='./mlmresult',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10000,
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
Could someone please explain why the output of my __getitem__ is not correct for training RobertaForMaskedLM/BertForMaskedLM?
Thank you in advance.