How to merge two dataset objects?

Hi everyone!

I have two datasets, loaded from CSV files, which have the same features/columns. I would like to know if there is a way to merge both datasets into a larger one (like I would do with pd.concat((df_1, df_2)) using pandas).

If no such method exists, would it be worth implementing this functionality?

Thanks in advance :hugs:


I would rather combine the CSVs :grin:


Are you using :hugs:nlp?

If so, you could try nlp.concatenate_datasets :slight_smile:


Thank you. That is exactly what I was looking for, but I couldn’t find it in the documentation (now that I know the method I can find it in the API when I autocomplete the code, but it doesn’t appear anywhere in the documentation).

Quick update, since I see that this thread still gets views:
concatenate_datasets is now available through the datasets library, since the nlp library was renamed to datasets.


Hi @lhoestq, thanks for the solution. I followed that approach, but I am getting errors when merging two datasets:

dataset_ar = load_dataset('wikipedia',language='ar', date='20210320', beam_runner='DirectRunner')
dataset_bn = load_dataset('wikipedia',language='bn', date='20210320', beam_runner='DirectRunner')

I tried two ways to concatenate them, but both approaches give errors. Could you please help me find out what I am missing? Thanks

First Approach

dataset_cc = concatenate_datasets(dataset_ar, dataset_bn)

Traceback (most recent call last):
  File "", line 1, in
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in concatenate_datasets
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
AttributeError: 'str' object has no attribute 'features'

Second Approach

dataset_cc = concatenate_datasets(dataset_ar['train'], dataset_bn['train'])

Traceback (most recent call last):
  File "", line 1, in
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in concatenate_datasets
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
AttributeError: 'dict' object has no attribute 'features'

I solved this issue; it was a silly mistake :stuck_out_tongue_winking_eye:
Solution: pass the datasets as a single list.

dataset_cc = concatenate_datasets([dataset_ar['train'], dataset_bn['train']])
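
Both tracebacks are consistent with the function iterating over its first argument and reading .features on each element. The sketch below mimics that with plain Python stand-ins; FakeDataset, check_features and try_call are hypothetical names, not part of datasets, so this only illustrates the iteration behaviour and is not the library's actual code.

```python
class FakeDataset:
    """Hypothetical stand-in for a dataset: has .features, iterates as dict rows."""

    def __init__(self, rows):
        self.rows = rows
        self.features = sorted(rows[0].keys())

    def __iter__(self):
        return iter(self.rows)


def check_features(dsets):
    """Simplified version of the failing line in the traceback:
    iterate over `dsets` and read `.features` on every element."""
    return all(dset.features == dsets[0].features for dset in dsets)


def try_call(arg):
    """Return the check result, or the error message if the check fails."""
    try:
        return check_features(arg)
    except AttributeError as err:
        return str(err)


# A plain dict with a "train" key plays the role of the loaded dataset dict.
ar = {"train": FakeDataset([{"text": "a"}])}
bn = {"train": FakeDataset([{"text": "b"}])}

print(try_call(ar))                          # iterating a dict yields its string keys
print(try_call(ar["train"]))                 # iterating a dataset yields dict rows
print(try_call([ar["train"], bn["train"]]))  # a list of datasets passes the check
```

The first call reports a 'str' object without 'features', the second a 'dict' object without 'features', and only the list version succeeds, which matches the two errors and the fix above.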

The same works for more than two datasets; just pass them all in one list, like this: concatenate_datasets([part1, part2, part3])