Error in model.prepare_tf_dataset

I used the following model for a summarization task:

from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

model_checkpoint = "google/mt5-small"
# model_checkpoint = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_pt=True)

Tokenization is done, and my data has these columns:

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

The data collator function returns something like this:

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
features = [tokenized_data["train"][i] for i in range(2)]
data_collator(features)


{'input_ids': <tf.Tensor: shape=(2, 900), dtype=int32, numpy=
array([[ 5139,   259, 15241, ...,     0,     0,     0],
       [  259, 23129,   259, ...,  2454,  2841,     1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 900), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 1, 1, 1]], dtype=int32)>, 'labels': <tf.Tensor: shape=(2, 124), dtype=int64, numpy=
array([[   322,    486,    335,  59128,  12733,    331, 102270,    288,

prepare_tf_dataset gives the error:

tf_train_dataset = model.prepare_tf_dataset(
    tokenized_data["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=8,
)

TypeError: Cannot convert [array([3.22000e+02, 2.98000e+02, 3.82260e+04, 1.86600e+03, 5.69000e+02,])] to EagerTensor of dtype int64
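Note that the float values in the error (e.g. 3.22e+02 = 322.0) match the first label ids, which suggests the raw "labels" column is reaching the collator as float64 rather than int. A minimal sketch of one possible workaround, casting the labels back to ints before collation (the function name is illustrative, and this is an assumption, not a confirmed fix):

```python
import numpy as np

def cast_labels_to_int(example):
    # Labels arrive as floats like 322.0; cast each back to a plain int
    # so TensorFlow can pack them into an int64 tensor.
    example["labels"] = [int(x) for x in example["labels"]]
    return example

# A tiny stand-in for one tokenized example with float labels:
example = {
    "input_ids": [5139, 259, 15241],
    "attention_mask": [1, 1, 1],
    "labels": np.array([322.0, 486.0, 335.0]),
}

print(np.asarray(example["labels"]).dtype)  # float64 -- the suspect dtype
example = cast_labels_to_int(example)
print(example["labels"])  # [322, 486, 335]
```

With the real dataset this could be applied as tokenized_data.map(cast_labels_to_int) before calling prepare_tf_dataset.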

Please help, I am stuck here.

The same is happening to me. Did you find a fix?

Yep, I fixed it and my PR got merged.

Can you please share the PR link?

https://github.com/huggingface/transformers/pull/31109
