TypeError: Couldn't cast array of type int64 while mapping the dataset

I know this is a very old question, but I'm answering it anyway. I hit the same error and came here, but nothing in this thread resolved it. After reviewing a few notebooks, I got it fixed.

Try this:

from PIL import Image
from datasets import Features, Sequence, ClassLabel, Value, Array2D, Array3D

# we need to define custom features
features = Features({
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
#     'labels': ClassLabel(num_classes=len(labels), names=labels),
    'labels':Sequence(ClassLabel(names=label_list)),
})

# processor, label2id, text_column_name and boxes_column_name are assumed to be defined earlier
def prepare_examples(examples):
    images = [Image.open(path).convert("RGB").resize(size=(224,224)) for path in examples['image_path']]
    words = examples[text_column_name]
    boxes = examples[boxes_column_name]
    
    word_labels = [[label2id[label]] for label in examples["label"]]
    encoding = processor(images, words, boxes=boxes, word_labels=word_labels,
                         truncation=True, padding='max_length')

    # return the encoding so that map() can write the new columns
    return encoding

It worked for me. My labels were strings from the start, so I used a dict, label2id, to convert each string to a number and stored the numbers in a list. Because I map the labels through that dict, I declare labels as ClassLabel in the features.
