I can't use concatenate_datasets because features are not sorted. How do I sort them?

Hi guys,

I’m trying to concatenate two datasets that share some common features.
But these two datasets have features in a different order.
It’s like:

DatasetDict({
    train: Dataset({
        features: ['__index_level_0__', 'answers', 'context', 'document_id', 'id', 'question', 'title'],
        num_rows: 3952
    })
    validation: Dataset({
        features: ['__index_level_0__', 'answers', 'context', 'document_id', 'id', 'question', 'title'],
        num_rows: 240
    })
})

and

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 60407
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5774
    })
})

so I removed the uncommon features (__index_level_0__, document_id, etc.) using .remove_columns

Now the two DatasetDicts have the same features. However the order is different.

1st DatasetDict: features: ['answers', 'context', 'question', 'title'],
2nd DatasetDict: features: ['title', 'context', 'question', 'answers'],

So when I try to concatenate them by using datasets.concatenate_datasets([1stDatasetDict, 2ndDatasetDict]),

I get an error that says:

ValueError: Features must match for all datasets

hey @jeffnlp, i don’t think you can concatenate DatasetDict objects with concatenate_datasets - as described in the docs this function expects a list of Dataset objects.

what happens if you try iterating over both DatasetDict objects and building up a new one that concatenates the Dataset objects as follows:

from datasets import DatasetDict, concatenate_datasets

ds1 = DatasetDict(...)  # first DatasetDict
ds2 = DatasetDict(...)  # second DatasetDict

# Create an empty DatasetDict and fill it split by split
ds3 = DatasetDict()

for (split1, x), (split2, y) in zip(ds1.items(), ds2.items()):
    ds3[split1] = concatenate_datasets([x, y])

this should work since concatenate_datasets can handle out-of-order columns. if not, you might have a problem with the answers column if it's nested and the sub-columns don't match. in that case you might be better off flattening the columns of each dataset and casting the features to the same types (see e.g. here for casting details)

Order does matter, apparently! Find the arrow_dataset.py file (it can be found in Lib\site-packages\datasets-1.8.1.dev0-py3.9.egg\datasets) and look at line 3181 (my version of the package is transformers==4.9.0dev0):

if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
    raise ValueError("Features must match for all datasets")
elif axis == 1 and not all([dset.num_rows == dsets[0].num_rows for dset in dsets]):
    raise ValueError("Number of rows must match for all datasets")

dset.features.type is sensitive to the order of the columns; it gives a StructType, as below, which is clearly order-sensitive:

>>> pawsqqp['train'].column_names
['idx', 'question1', 'question2', 'label']
>>> pawsqqp['train'].features.type
StructType(struct<idx: int32, question1: string, question2: string, label: int64>)

So I simply changed that first line in the arrow_dataset.py file to

if axis == 0 and not all([dset.features == dsets[0].features for dset in dsets]):

and now it no longer cares about the order of the columns, and I can confirm that it's working well in the examples I had. But I don't recommend applying this change without a double-check! Make sure the modification also works for your datasets.

Given two datasets d1 and d2 with the same features but differently ordered StructTypes, a less intrusive way is to cast the features of one into the other's type (as mentioned by @lewtun):

d1 = d1.cast(d2.features)
d3 = datasets.concatenate_datasets([d1, d2])