Extra Dimension with DataCollatorForLanguageModeling into BertForMaskedLM?

Hi all,

EDIT: I forgot to state that I am on transformers 4.6.1 and Python 3.7.

On Colab, I am trying to pre-train a BertForMaskedLM model on a random half of WikiText-103. I am using a simple custom dataset class and the DataCollatorForLanguageModeling, as follows:

import torch
import torchtext
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
import re
import random

from transformers import BertForMaskedLM, BertModel, BertConfig, BertTokenizer
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from transformers import PreTrainedTokenizer

wiki_train, wiki_valid, wiki_test = torchtext.datasets.WikiText103(root='data', 
def scrub_titles_get_lines(dataset):
    pattern = " =+.+ =+"
    pattern = re.compile(pattern)
    title_scrubbed = []
    for example in dataset:
        if not example.isspace() and not bool(pattern.match(example)):
    return title_scrubbed

class LineByLineBertDataset(torch.utils.data.Dataset):
    def __init__(self, data, tokenizer: PreTrainedTokenizer,  max_len=512):
        self.examples = data
        self.tokenizer = tokenizer
        self.max_length = max_len

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        result = self.tokenizer(self.examples[i], 
        return result

configuration = BertConfig()
model = BertForMaskedLM(configuration)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
wiki_train = random.sample(wiki_train, len(wiki_train)//2)  # list of strings
train_set = LineByLineBertDataset(wiki_train, tokenizer)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15

training_args = TrainingArguments(

trainer = Trainer(


However, I get an error in the forward() method of the model:

/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    923         elif input_ids is not None:
    924             input_shape = input_ids.size()
--> 925             batch_size, seq_length = input_shape
    926         elif inputs_embeds is not None:
    927             input_shape = inputs_embeds.size()[:-1]

ValueError: too many values to unpack (expected 2)

Each of the tensors in the batch encoding is of shape (8, 512).

I know that at some point in the DataCollatorForLanguageModeling another dimension gets added. If I do:

res = tokenizer(wiki_train[:8], 

collated = data_collator([res])

Output: torch.Size([1, 8, 512])

So it seems that maybe this first dimension needs to be squeezed out. However, I am not sure what parameter I can tweak to ensure that the correct tensor is being seen by the model after collation.

Any thoughts?

No, the dimension was added by you when you passed

collated = data_collator([res])

res was already a batch of 8 elements here. By wrapping it in a new list, you add the leading dimension of 1.


Thanks for your reply! I passed “res” in a list in my little bit of test code there because I thought that was expected in line 343 of data_collator.py based on the stack trace for when I just passed the BatchEncoding object “res”:

collated = data_collator(res)

/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py in __call__(self, examples)
    341     ) -> Dict[str, torch.Tensor]:
    342         # Handle dict or lists with proper padding and conversion to tensor.
--> 343         if isinstance(examples[0], (dict, BatchEncoding)):
    344             batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
    345         else:

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __getitem__(self, item)
    234         else:
    235             raise KeyError(
--> 236                 "Indexing with integers (to access backend Encoding for a given batch index) "
    237                 "is not available when using Python based tokenizers"
    238             )

KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'

This surprised me a little, because I thought the __call__ method of the data collator accepted a BatchEncoding directly.
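Judging from line 343 in the traceback above, the collator looks at examples[0] and expects a dict or BatchEncoding there, which suggests it wants a list of per-example encodings rather than one batched encoding. Below is a minimal sketch of turning one into the other. It uses a plain dict with made-up token ids as a stand-in for the batched tokenizer output, and split_batch is a hypothetical helper, not part of transformers.

```python
def split_batch(batch):
    """Turn a dict of per-key lists into a list of per-example dicts."""
    keys = list(batch)
    n_examples = len(batch[keys[0]])
    return [{key: batch[key][i] for key in keys} for i in range(n_examples)]


# Plain dict standing in for a batched tokenizer result (token ids are made up).
res = {
    "input_ids": [[101, 7592, 102], [101, 2088, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# One dict per example, so examples[0] is an ordinary dict.
examples = split_batch(res)
print(len(examples), examples[0])
```

Under that assumption, a list like examples is the shape of input the collator's isinstance check is written for. I have not run it against the real collator here.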

But maybe these issues with what to pass to the data collator are not the heart of my problem.

I think the main question is whether or not the tokenization that I am returning in __getitem__ for my dataset class is going to return batches in a format that the data collator and the rest of the trainer pipeline can use. I had thought that what was best for the DataCollatorForLanguageModeling was a BatchEncoding that has the [‘special_tokens_mask’] key based on the note in the docs: Data Collator — transformers 4.5.0.dev0 documentation

I have tried returning it with and without the special tokens mask, and with and without padding, and I get the same “too many values to unpack” error in the forward() method as in my original post.

I’m trying to reproduce your problem but your code sample fails at

wiki_train = random.sample(wiki_train, len(wiki_train)//2)  # list of strings

for me with

TypeError                                 Traceback (most recent call last)
<ipython-input-3-6b563a558a41> in <module>
     51 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
---> 52 wiki_train = random.sample(wiki_train, len(wiki_train)//2)  # list of strings
     53 train_set = LineByLineBertDataset(wiki_train, tokenizer)

~/.pyenv/versions/3.7.9/lib/python3.7/random.py in sample(self, population, k)
    315             population = tuple(population)
    316         if not isinstance(population, _Sequence):
--> 317             raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
    318         randbelow = self._randbelow
    319         n = len(population)

TypeError: Population must be a sequence or set.  For dicts, use list(d).
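This TypeError is what random.sample raises when it is handed an iterable that is not a sequence, which would be the case if the installed torchtext returns the dataset as an iterator rather than a list (an assumption about the setup, not something confirmed in this thread). A minimal sketch of the usual workaround, with line_stream as a hypothetical stand-in for such a dataset:

```python
import random


def line_stream():
    # Hypothetical stand-in for a dataset that yields its lines lazily.
    for i in range(10):
        yield f"line {i}"


# Sampling directly from the iterator fails, because it is not a sequence.
try:
    random.sample(line_stream(), 5)
    sample_failed = False
except TypeError:
    sample_failed = True

# Materializing the lines into a list first makes sampling work.
lines = list(line_stream())
half = random.sample(lines, len(lines) // 2)
print(sample_failed, len(half))
```

For a corpus the size of WikiText-103, the list holds every line in memory at once, which is the cost of sampling this way.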

Ah, got it! Remove the return_tensors='pt' argument from the tokenizer call in your __getitem__, as it adds a batch dimension to every single example. The data collator will do the conversion to tensors anyway.
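To see why the extra dimension breaks the model, here is a minimal pure-Python sketch of the mechanism. It uses nested lists in place of tensors, and fake_encode and shape are hypothetical helpers, not transformers APIs. It assumes each dataset item carries a leading batch dimension of 1 when tensors are requested per example, so that collating 8 items gives a 3-D batch.

```python
SEQ_LEN = 512
BATCH_SIZE = 8


def shape(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def fake_encode(with_batch_dim):
    # Stand-in for tokenizing one example; the ids are all zeros.
    ids = [0] * SEQ_LEN
    # With a per-example batch dimension the result is (1, SEQ_LEN).
    return [ids] if with_batch_dim else ids


# Collating is modelled here as simply stacking the examples.
batch_with_extra_dim = [fake_encode(True) for _ in range(BATCH_SIZE)]
batch_flat = [fake_encode(False) for _ in range(BATCH_SIZE)]

print(shape(batch_with_extra_dim))  # three dimensions
print(shape(batch_flat))            # two dimensions

# The model unpacks the input shape into exactly two names, as in the traceback.
try:
    batch_size, seq_length = shape(batch_with_extra_dim)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

The flat batch unpacks cleanly into a batch size and a sequence length, which is why returning plain lists from __getitem__ and leaving tensor conversion to the collator avoids the error.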


That was it, thank you!
