How to merge two dataset objects?

putopavel · August 24, 2020, 4:48pm

Hi everyone!

I have two datasets, loaded from CSV files, which have the same features/columns. I would like to know if there is a way to merge both datasets into a larger one (like I would do with pd.concat((df_1, df_2)) using pandas).

If no such method exists, would it be interesting to implement this functionality?

Thanks in advance

valhalla · August 24, 2020, 7:18pm

I would rather combine the CSVs.

lhoestq · August 24, 2020, 7:30pm

Are you using nlp?

If so, you could try nlp.concatenate_datasets

putopavel · August 24, 2020, 9:06pm

Thank you. That is exactly what I was looking for, but I couldn’t find it in the documentation (now that I know the method, I can find it in the API when I autocomplete the code, but it doesn’t appear anywhere in the documentation).

lhoestq · May 3, 2021, 5:53pm

Quick update since I see that this thread still has views:
concatenate_datasets is available through the datasets library here, since the library was renamed.

mmiakashs · June 25, 2021, 8:58am

Hi @lhoestq, thanks for the solution. I followed that approach, but I am getting errors when merging two datasets:

dataset_ar = load_dataset('wikipedia',language='ar', date='20210320', beam_runner='DirectRunner')
dataset_bn = load_dataset('wikipedia',language='bn', date='20210320', beam_runner='DirectRunner')

I tried two ways to concatenate, but both approaches give errors. Could you please help me find out what I am missing? Thanks

First Approach

dataset_cc = concatenate_datasets(dataset_ar, dataset_bn)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in concatenate_datasets
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in <listcomp>
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
AttributeError: 'str' object has no attribute 'features'

Second Approach

dataset_cc = concatenate_datasets(dataset_ar['train'], dataset_bn['train'])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in concatenate_datasets
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in <listcomp>
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
AttributeError: 'dict' object has no attribute 'features'

mmiakashs · June 25, 2021, 9:03am

I solved this issue; it was a silly mistake. The datasets have to be passed as a single list.
Solution:

dataset_cc = concatenate_datasets([dataset_ar['train'], dataset_bn['train']])

AletheiaChengWon · February 28, 2024, 10:04am

Pass all the parts in one list, like this: concatenate_datasets([part1, part2, part3])
