Key error: 0 in DataCollatorForSeq2Seq for BERT

Hello everyone,

I am trying to fine-tune a German BERT2BERT model for text summarization using bert-base-german-cased and want to use dynamic padding. However, when calling Trainer.train() I receive an error that tensors cannot be created and that I should use padding. I was able to trace this error back to my DataCollator. The code I used is the following:

First, I define the function to tokenize my data and do so using the map function.

tokenizer = BertTokenizerFast.from_pretrained("bert-base-german-cased")
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token

max_input_length = 512
max_target_length = 128

def prepro_bert2bert(samples):
    model_inputs = tokenizer(samples["text"], max_length = max_input_length, truncation = True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(samples["description"], max_length = max_target_length, truncation = True)

    samples["input_ids"] = model_inputs.input_ids
    samples["attention_mask"] = model_inputs.attention_mask
    samples["decoder_input_ids"] = labels.input_ids
    samples["decoder_attention_mask"] = labels.attention_mask
    samples["labels"] = labels.input_ids.copy()

    return samples

traindata = Dataset.from_pandas(traindata)

tokenized_traindata = traindata.map(prepro_bert2bert, batched = True, remove_columns = ["text", "description", "__index_level_0__"])
tokenized_traindata.set_format(columns = ["labels", "input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask"])

My tokenized_traindata looks like the following:

Dataset({
    features: ['attention_mask', 'decoder_attention_mask', 'decoder_input_ids', 'input_ids', 'labels'],
    num_rows: 7986
})

Then I instantiate my bert2bert model and my DataCollator:

bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-german-cased", "bert-base-german-cased")
data_collator = DataCollatorForSeq2Seq(tokenizer, model = bert2bert, padding = "longest")

Lastly, I form batches from my training data and apply the data_collator:

samples = tokenized_traindata[:8]
batch = data_collator(samples)

This returns the following error message:


KeyError                                  Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 batch = data_collator(samples)
      2 {k: v.shape for k, v in batch.items()}

~\miniconda3\envs\BERTnew\lib\site-packages\transformers\data\data_collator.py in __call__(self, features)
    271
    272     def __call__(self, features):
--> 273         labels = [feature["labels"] for feature in features] if "labels" in features[0].keys() else None
    274         # We have to pad the labels before calling tokenizer.pad as this method won't pad them and needs them of the
    275         # same length to return tensors.

KeyError: 0

Unfortunately, I do not know where to look further for a solution. I hope someone has a suggestion where to look or how to solve this. Thank you very much in advance!


This is because the datasets library returns a slice of the dataset as a dictionary with a list for each key. The data collator, however, expects a list of dataset elements, i.e. a list of dictionaries. Practically, I think you need to do:

samples = [tokenized_traindata[i] for i in range(8)]
batch = data_collator(samples)
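To make the difference concrete, here is a rough illustration (the values are just placeholders, using the tokenized_traindata from above):

# Slicing returns one dictionary whose values are lists, one list per column:
tokenized_traindata[:8]   # {"input_ids": [[...], [...], ...], "attention_mask": [[...], ...], ...}

# Integer indexing returns one dictionary per example, which is what the collator expects:
tokenized_traindata[0]    # {"input_ids": [...], "attention_mask": [...], ...}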

Thank you for your reply. I tried your suggestion and now get the first error message again:

samples = [tokenized_traindata[i] for i in range(8)]
batch = data_collator(samples)


ValueError                                Traceback (most recent call last)
~\miniconda3\envs\BERTnew\lib\site-packages\transformers\tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    698                 if not is_tensor(value):
--> 699                     tensor = as_tensor(value)
    700

ValueError: expected sequence of length 40 at dim 1 (got 47)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 batch = data_collator(samples)

~\miniconda3\envs\BERTnew\lib\site-packages\transformers\data\data_collator.py in __call__(self, features)
    283             )
    284
--> 285         features = self.tokenizer.pad(
    286             features,
    287             padding=self.padding,

~\miniconda3\envs\BERTnew\lib\site-packages\transformers\tokenization_utils_base.py in pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2678                     batch_outputs[key].append(value)
   2679
--> 2680         return BatchEncoding(batch_outputs, tensor_type=return_tensors)
   2681
   2682     def create_token_type_ids_from_sequences(

~\miniconda3\envs\BERTnew\lib\site-packages\transformers\tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    202         self._n_sequences = n_sequences
    203
--> 204         self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
    205
    206     @property

~\miniconda3\envs\BERTnew\lib\site-packages\transformers\tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    713                     "Please see if a fast version of this tokenizer is available to have this feature available."
    714                 )
--> 715                 raise ValueError(
    716                     "Unable to create tensor, you should probably activate truncation and/or padding "
    717                     "with 'padding=True' 'truncation=True' to have batched tensors with the same length."

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Do you have any suggestions as to why this happens again? Thank you very much in advance for your response and time!

I think it’s complaining because the decoder_attention_mask and decoder_input_ids aren’t all of the same length. You should let the DataCollatorForSeq2Seq create them for you.


Thank you again! So I should not tokenize the labels and original text at the same time in my prepro_bert2bert function? How would I approach that?

You definitely should tokenize your labels, but remove the lines

samples["decoder_input_ids"] = labels.input_ids
samples["decoder_attention_mask"] = labels.attention_mask

which the data collator will set properly from the labels and the model (there is a shift right to apply in some models).
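With that change, the preprocessing function from the first post would look roughly like this (same tokenizer and length settings as before; just a sketch of the suggestion above):

def prepro_bert2bert(samples):
    # Tokenize the source texts
    model_inputs = tokenizer(samples["text"], max_length = max_input_length, truncation = True)

    # Tokenize the summaries and keep them only as labels; the collator/model
    # derive decoder_input_ids and decoder_attention_mask from them, as described above
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(samples["description"], max_length = max_target_length, truncation = True)

    samples["input_ids"] = model_inputs.input_ids
    samples["attention_mask"] = model_inputs.attention_mask
    samples["labels"] = labels.input_ids
    return samples

The set_format call from the first post would then also need to drop decoder_input_ids and decoder_attention_mask from its column list, since those columns no longer exist.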

Thank you very much! This solved my problem. 🙂


Sorry to bother you again. 🙁 I now wanted to implement what I learned in my training loop, but I get a ValueError that I must specify either input_ids or inputs_embeds. Do I need to specify anything else so that the decoder_input_ids and decoder_attention_mask from the collator get passed into the model? My training arguments look like this:

batch_size = 2
learning_rate = 2e-5
epochs = 1

training_args = Seq2SeqTrainingArguments(output_dir = "model_checkpoints/",
                                         evaluation_strategy = "epoch",
                                         learning_rate = learning_rate,
                                         per_device_train_batch_size = batch_size,
                                         per_device_eval_batch_size = batch_size,
                                         save_total_limit = 3,
                                         num_train_epochs = epochs,
                                         predict_with_generate = True,
                                         fp16 = True)

trainer = Seq2SeqTrainer(bert2bert,
                         training_args,
                         train_dataset = tokenized_traindata,
                         eval_dataset = tokenized_valdata,
                         data_collator = data_collator,
                         tokenizer = tokenizer,
                         compute_metrics = compute_metrics)

Thank you again for your time and patience!

Pinging @patrickvonplaten since I’m not super familiar with this BERT2BERT architecture.


Hi @Arcticweasel ,

I ran into the same problem, being required to specify input_ids or inputs_embeds. I thought input_ids was already in tokenized_traindata after preprocessing. Have you found a solution?

Cheers!


I was having a similar issue and landed on this thread.

I’m trying to understand why DataCollatorForSeq2Seq is not padding the batch inputs the way DataCollatorWithPadding does.

Do I need to specify padding=True in the tokenizer? (Only then does it seem to work.)
When I add padding=True as an argument to the DataCollatorForSeq2Seq, it doesn’t work.

MWE:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer
from datasets import load_dataset

model_checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

english_dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_All_Beauty")

def preprocess_function(examples):
    model_inputs = tokenizer(examples["text"],)
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["title"])

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = english_dataset.map(preprocess_function, batched=True)

tokenized_datasets = tokenized_datasets.remove_columns(['rating', 'title', 'text', 'images', 'asin', 'parent_asin', 'user_id', 'timestamp', 'helpful_vote', 'verified_purchase'])

tokenized_datasets['full'][:2].keys()
# dict_keys(['input_ids', 'attention_mask', 'labels'])

features = [tokenized_datasets["full"][i] for i in range(2)]

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

features_padded = data_collator(features)

len(features_padded[0]['input_ids'])
# 37

len(features_padded[1]['input_ids'])
# 69

len(features_padded[0]['labels'])
# 53

len(features_padded[1]['labels'])
# 82

Any idea what’s going on here?