Issue concatenating datasets

PabloAMC · January 1, 2023, 6:16pm

I am trying to concatenate two datasets

from datasets import load_dataset, concatenate_datasets

movie = load_dataset("movie_rationales")
imdb = load_dataset("imdb")
imdb = imdb['train']

Then I adapt the movie dataset to the imdb format

movie_imdb_format = movie['train'].map(
    lambda x: {"text": x["review"]}
)
movie_imdb_format = movie_imdb_format.remove_columns(["review", "evidences"])

and aim to concatenate them

dataset_cc = concatenate_datasets([imdb, movie_imdb_format])

Printing the two datasets gives

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
Dataset({
    features: ['label', 'text'],
    num_rows: 1600
})

However, I get an error

ValueError: The features can't be aligned because the key label of features {'label': ClassLabel(names=['NEG', 'POS'], id=None), 'text': Value(dtype='string', id=None)} has unexpected type - ClassLabel(names=['NEG', 'POS'], id=None) (expected either ClassLabel(names=['neg', 'pos'], id=None) or Value("null").

Any suggestions as to why this may be happening and how to solve it?

PabloAMC · January 1, 2023, 11:18pm

In the end I have opted for

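import pandas as pd
from datasets import Dataset

imdb_df = pd.DataFrame(imdb)
movie_df = pd.DataFrame(movie_imdb_format)
# ignore_index=True avoids duplicate index values in the combined frame
overall_df = pd.concat([imdb_df, movie_df], ignore_index=True)
dataset = Dataset.from_pandas(overall_df)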

which solves the above problem, though I am not sure why it was failing.

mapama247 · January 2, 2023, 2:12pm

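The problem was that your datasets had different features. As the error message shows, the label ClassLabel names differ (['NEG', 'POS'] vs ['neg', 'pos']), and the column order differs too: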

imdb: ['text', 'label']
movie_imdb_format: ['label', 'text']

You can fix this with one simple line of code:

movie_imdb_format = movie_imdb_format.cast(imdb.features)

Now that both datasets have the exact same format, the command dataset_cc = concatenate_datasets([imdb, movie_imdb_format]) should work!

lhoestq · January 3, 2023, 10:41am

Note that in recent versions of datasets it’s possible to concatenate two datasets with columns in different orders - in this case the columns are aligned automatically.
