Hi guys,
I’m trying to concatenate two datasets that share some features,
but the features are in a different order in each one.
They look like this:
DatasetDict({
    train: Dataset({
        features: ['__index_level_0__', 'answers', 'context', 'document_id', 'id', 'question', 'title'],
        num_rows: 3952
    })
    validation: Dataset({
        features: ['__index_level_0__', 'answers', 'context', 'document_id', 'id', 'question', 'title'],
        num_rows: 240
    })
})
and
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 60407
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5774
    })
})
So I removed the extra features (__index_level_0__, document_id, etc.) using .remove_columns.
Now the two DatasetDicts have the same features. However the order is different.
1st DatasetDict: features: ['answers', 'context', 'question', 'title']
2nd DatasetDict: features: ['title', 'context', 'question', 'answers']
So when I try to concatenate them using datasets.concatenate_datasets([1stDatasetDict, 2ndDatasetDict]),
I get an error that says:
ValueError: Features must match for all datasets