Hi everyone!
I have two datasets, loaded from CSV files, which have the same features/columns. I would like to know if there is a way to merge both datasets into a larger one (like I would do with pd.concat((df_1, df_2)) using pandas).
If no such method exists, would it be worth implementing this functionality?
Thanks in advance
I would rather combine the CSV files themselves.
Are you using nlp?
If so, you could try nlp.concatenate_datasets
Thank you. That is exactly what I was looking for, but I couldn't find it in the documentation (now that I know the method, I can find it in the API when I autocomplete the code, but it doesn't appear anywhere in the documentation).
Quick update, since I see that this thread still has views: concatenate_datasets is now available through the datasets library, since the library was renamed.
Hi @lhoestq, thanks for the solution. I followed that approach, but I am getting errors when merging two datasets:
dataset_ar = load_dataset('wikipedia',language='ar', date='20210320', beam_runner='DirectRunner')
dataset_bn = load_dataset('wikipedia',language='bn', date='20210320', beam_runner='DirectRunner')
I tried two ways to concatenate, but both approaches give errors. Could you please help me find out what I am missing? Thanks
First Approach
dataset_cc = concatenate_datasets(dataset_ar, dataset_bn)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in concatenate_datasets
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in <listcomp>
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
AttributeError: 'str' object has no attribute 'features'
Second Approach
dataset_cc = concatenate_datasets(dataset_ar['train'], dataset_bn['train'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in concatenate_datasets
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
  File "/home/anaconda3/envs/nlp/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3135, in <listcomp>
    if axis == 0 and not all([dset.features.type == dsets[0].features.type for dset in dsets]):
AttributeError: 'dict' object has no attribute 'features'
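I solved this issue; it was a silly mistake. concatenate_datasets expects a single list of datasets as its first argument, not the datasets passed as separate arguments.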
Solution:
dataset_cc = concatenate_datasets([dataset_ar['train'], dataset_bn['train']])
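To see why the two failing calls produce those particular errors, here is a rough sketch using plain Python containers as stand-ins (not the real datasets classes). It assumes, as the tracebacks suggest, that the function iterates over its first argument as if it were the list of datasets; element_types is a hypothetical helper written only for this illustration:

```python
# Stand-ins only: a dict of splits plays the role of a DatasetDict,
# and a list of row dicts plays the role of a Dataset.
splits_ar = {"train": [{"text": "ar row"}]}
splits_bn = {"train": [{"text": "bn row"}]}

def element_types(dsets, *rest):
    """Hypothetical helper: iterate the first argument the way the
    traceback's `for dset in dsets` loop does, and report what comes out."""
    return [type(d).__name__ for d in dsets]

# First approach: iterating a dict of splits yields the split names (strings).
print(element_types(splits_ar, splits_bn))                      # ['str']

# Second approach: iterating a single split yields its rows (dicts).
print(element_types(splits_ar["train"], splits_bn["train"]))    # ['dict']

# Solution: wrap the splits in one list, so each element is a whole split.
print(element_types([splits_ar["train"], splits_bn["train"]]))  # ['list', 'list']
```

With the real library, the elements in the last case are the Dataset objects themselves, which do have a features attribute.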
Pass them as a single list, like this: concatenate_datasets([part1, part2, part3])