elsaEU
December 28, 2023, 10:58am
Hi,
I want to add more data to my dataset ELSA_D3:
Note that the filenames are:
train-0-05239
...
train-05239-05239
Can I add more parquet files without re-uploading all the existing files, and have the README metadata corrected automatically?
My current upload procedure is:
1. Convert the images to Arrow files and store them on disk in N splits
2. Combine the N splits in memory using datasets.concatenate_datasets()
3. Push using datasets.push_to_hub()
Now I would like to concatenate another split and upload it without losing the previous data and without messing up the filenames.
Thanks
You can push_to_hub to a different split, and then manually modify the YAML in the README.md header to group the data_files together in the same split.
For example:
After pushing a new split train_part2, you will get:
configs:
- config_name: default
  data_files:
  - split: train
    path: default/train-*
  - split: train_part2
    path: default/train_part2-*
and you can group the splits together this way:
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - default/train-*
    - default/train_part2-*
You’d also have to update the dataset_info section in the YAML to account for the new split size and number of examples (or just delete it).
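For reference, that section has roughly the following shape (the feature names and all numbers below are placeholders; after merging the splits, num_examples, num_bytes, and dataset_size would need to cover the combined data):

```
dataset_info:
  features:
  - name: image
    dtype: image
  splits:
  - name: train
    num_bytes: 123456789
    num_examples: 10000
  download_size: 123456789
  dataset_size: 123456789
```

Since these numbers are only used for display on the Hub, deleting the block is the simpler option if you don't want to recompute them by hand.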
elsaEU
January 15, 2024, 1:40pm
Thank you, it works fine.
system
Closed
January 16, 2024, 1:40am
This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.