Issue concatenating datasets

I am trying to concatenate two datasets

from datasets import load_dataset, concatenate_datasets

movie = load_dataset("movie_rationales")
imdb = load_dataset("imdb")
imdb = imdb['train']

Then I adapt the movie dataset

movie_imdb_format = movie['train'].map(
    lambda x: {"text": x["review"]}
)
movie_imdb_format = movie_imdb_format.remove_columns(["review", "evidences"])

and aim to concatenate them

dataset_cc = concatenate_datasets([imdb, movie_imdb_format])

Printing the two datasets gives

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
Dataset({
    features: ['label', 'text'],
    num_rows: 1600
})

However, I get an error

ValueError: The features can't be aligned because the key label of features {'label': ClassLabel(names=['NEG', 'POS'], id=None), 'text': Value(dtype='string', id=None)} has unexpected type - ClassLabel(names=['NEG', 'POS'], id=None) (expected either ClassLabel(names=['neg', 'pos'], id=None) or Value("null").

Any suggestions as to why this is happening and how to solve it?

In the end I have opted for

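import pandas as pd
from datasets import Dataset

# Workaround: round-trip both datasets through pandas and rebuild a Dataset
imdb_df = pd.DataFrame(imdb)
movie_df = pd.DataFrame(movie_imdb_format)
overall_df = pd.concat([imdb_df, movie_df])
dataset = Dataset.from_pandas(overall_df)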

which solves the above problem, though I am not sure why the original approach was failing.

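The problem is that your two datasets have different features. The columns are in a different order:

  • imdb: ['text', 'label']
  • movie_imdb_format: ['label', 'text']

and, as the error message shows, the class names of the label column differ too: ['neg', 'pos'] in imdb versus ['NEG', 'POS'] in movie_imdb_format.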

You can fix this with one simple line of code:

movie_imdb_format = movie_imdb_format.cast(imdb.features)

Now that both datasets have the exact same format, the command dataset_cc = concatenate_datasets([imdb, movie_imdb_format]) should work!

Note that in recent versions of datasets it’s possible to concatenate two datasets with columns in different orders - in this case the columns are aligned automatically.