Hello everyone,
I am trying to fine-tune a German BERT2BERT model for text summarization using bert-base-german-cased,
and I want to use dynamic padding. However, when calling Trainer.train()
I receive an error saying that tensors cannot be created and that I should use padding. I was able to trace this error back to my DataCollator. The code I used is the following:
First, I define the function to tokenize my data and apply it using the map function.
from transformers import BertTokenizerFast
from datasets import Dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-german-cased")
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token

max_input_length = 512
max_target_length = 128

def prepro_bert2bert(samples):
    model_inputs = tokenizer(samples["text"], max_length=max_input_length, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(samples["description"], max_length=max_target_length, truncation=True)
    samples["input_ids"] = model_inputs.input_ids
    samples["attention_mask"] = model_inputs.attention_mask
    samples["decoder_input_ids"] = labels.input_ids
    samples["decoder_attention_mask"] = labels.attention_mask
    samples["labels"] = labels.input_ids.copy()
    return samples

traindata = Dataset.from_pandas(traindata)
tokenized_traindata = traindata.map(prepro_bert2bert, batched=True, remove_columns=["text", "description", "__index_level_0__"])
tokenized_traindata.set_format(columns=["labels", "input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask"])
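For context: since I truncate but do not pad during tokenization, the examples end up with different lengths, which is why I want the collator to pad each batch dynamically. A quick (purely illustrative) check confirms the varying lengths:

# Illustrative sanity check: no padding has been applied yet, so lengths vary per example
for i in range(3):
    print(len(tokenized_traindata[i]["input_ids"]), len(tokenized_traindata[i]["decoder_input_ids"]))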
My tokenized_traindata looks like the following:
Dataset({
    features: ['attention_mask', 'decoder_attention_mask', 'decoder_input_ids', 'input_ids', 'labels'],
    num_rows: 7986
})
Then I instantiate my bert2bert model and my DataCollator:
from transformers import EncoderDecoderModel, DataCollatorForSeq2Seq

bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-german-cased", "bert-base-german-cased")
data_collator = DataCollatorForSeq2Seq(tokenizer, model=bert2bert, padding="longest")
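For completeness, this is roughly how I set up the Trainer whose train() call raises the error (the output_dir and hyperparameter values here are just illustrative placeholders):

from transformers import Trainer, TrainingArguments

# Rough sketch of my setup; output_dir and the hyperparameters are placeholders
training_args = TrainingArguments(
    output_dir="bert2bert-summarization",
    per_device_train_batch_size=8,
    num_train_epochs=3,
)
trainer = Trainer(
    model=bert2bert,
    args=training_args,
    train_dataset=tokenized_traindata,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
# trainer.train()  # this call produces the "cannot create tensor, use padding" error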
Lastly, I form a batch from my training data to test the data_collator directly:
samples = tokenized_traindata[:8]
batch = data_collator(samples)
This returns the following error message:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>
----> 1 batch = data_collator(samples)
      2 {k: v.shape for k, v in batch.items()}

~\miniconda3\envs\BERTnew\lib\site-packages\transformers\data\data_collator.py in __call__(self, features)
    271
    272     def __call__(self, features):
--> 273         labels = [feature["labels"] for feature in features] if "labels" in features[0].keys() else None
    274         # We have to pad the labels before calling `tokenizer.pad` as this method won't pad them and needs them of the
    275         # same length to return tensors.

KeyError: 0
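One thing I noticed while debugging (I am not sure whether it is related): slicing the Dataset with tokenized_traindata[:8] gives me a dict of column lists rather than a list of per-example dicts:

print(type(samples))  # <class 'dict'>: column names mapped to lists of values
# a list of per-example dicts would instead be: [tokenized_traindata[i] for i in range(8)]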
Unfortunately, I do not know where else to look for a solution. I hope someone has a suggestion on where to look or how to solve this. Thank you very much in advance!