I am using the following model for a summarization task:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM
model_checkpoint = "google/mt5-small"
# model_checkpoint = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
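# from_pt=True converts the PyTorch checkpoint weights to TensorFlow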
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_pt=True)
Tokenization is already done (my preprocessing is sketched after the dataset structure below), and my tokenized data has these columns:
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})
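For context, my preprocessing looks roughly like the sketch below; the raw column names ("text", "summary"), the max lengths, and the raw_data variable are placeholders for my actual setup:

def preprocess(examples):
    # Tokenize the source documents ("text" is a placeholder column name)
    model_inputs = tokenizer(examples["text"], max_length=900, truncation=True)
    # Tokenize the target summaries into token ids for the "labels" column
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_data = raw_data.map(
    preprocess, batched=True, remove_columns=raw_data["train"].column_names
)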
The data collator returns something like this:
from transformers import DataCollatorForSeq2Seq
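# DataCollatorForSeq2Seq pads input_ids and labels per batch and returns TF tensors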
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
features = [tokenized_data["train"][i] for i in range(2)]
data_collator(features)
{'input_ids': <tf.Tensor: shape=(2, 900), dtype=int32, numpy=
array([[ 5139, 259, 15241, ..., 0, 0, 0],
[ 259, 23129, 259, ..., 2454, 2841, 1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 900), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 1, 1, 1]], dtype=int32)>, 'labels': <tf.Tensor: shape=(2, 124), dtype=int64, numpy=
array([[ 322, 486, 335, 59128, 12733, 331, 102270, 288, ... (output truncated)
However, prepare_tf_dataset raises an error:
tf_train_dataset = model.prepare_tf_dataset(
tokenized_data["train"],
collate_fn=data_collator,
shuffle=True,
batch_size=8,
)
TypeError: Cannot convert [array([3.22000e+02, 2.98000e+02, 3.82260e+04, 1.86600e+03, 5.69000e+02,])] to EagerTensor of dtype int64
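Judging by the float values (3.22000e+02, ...) in the message, it looks like some label rows reach TensorFlow as float arrays instead of ints, even though the manual collator call above returned int64 labels. This is a minimal diagnostic sketch of how I inspect the raw label dtypes, assuming tokenized_data is the DatasetDict shown above:

import numpy as np

# Declared feature types of the tokenized split
print(tokenized_data["train"].features)

# Actual NumPy dtype of the first few label rows
for i in range(3):
    labels = np.asarray(tokenized_data["train"][i]["labels"])
    print(i, labels.dtype, labels[:5])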
Please help, I am stuck here. Why does prepare_tf_dataset fail when calling the data collator manually works fine?