Adding more data to a dataset already uploaded to HF

Hi,

I want to add more data to my dataset ELSA_D3. Note that the existing filenames are:

train-0-05239
...
train-05239-05239

Can I add more Parquet files without re-uploading all the existing files, and have the README metadata updated automatically?

My current upload procedure is (sketched in code after this list):

  • Convert the images to Arrow files and store them on disk in N splits
  • Load the N splits into memory and combine them with datasets.concatenate_datasets()
  • Push the result with Dataset.push_to_hub()
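
A minimal sketch of that procedure, assuming the Arrow shards were saved with Dataset.save_to_disk() under hypothetical paths shards/part_0 … shards/part_{N-1}, and that user/ELSA_D3 is the target repo:

from datasets import Dataset, concatenate_datasets

N = 3  # number of shards on disk (placeholder)
parts = [Dataset.load_from_disk(f"shards/part_{i}") for i in range(N)]
full = concatenate_datasets(parts)
# Uploads the data as Parquet shards and generates the README metadata
full.push_to_hub("user/ELSA_D3")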

Now I would like to concatenate another split and upload it without losing the previous data and without messing up the filenames.

Thanks

cc @lhoestq @mariosasko

You can push_to_hub to a different split, and then manually modify the YAML in the README.md header to group the data_files together in the same split.
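
Something along these lines (the repo id and local path are placeholders; recent datasets versions support the split= argument of Dataset.push_to_hub()):

from datasets import Dataset

new_part = Dataset.load_from_disk("shards/part_new")  # placeholder path
new_part.push_to_hub("user/ELSA_D3", split="train_part2")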

For example, after pushing a new split train_part2 you will get:

configs:
- config_name: default
  data_files:
  - split: train
    path: default/train-*
  - split: train_part2
    path: default/train_part2-*

and you can group the splits together this way:

configs:
- config_name: default
  data_files:
  - split: train
    path:
    - default/train-*
    - default/train_part2-*

You'd also have to update the dataset_info section in the YAML to account for the new split size and number of examples (or just delete it).
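
If you'd rather not edit the README by hand, one possible alternative is huggingface_hub.metadata_update, which updates the card's YAML header programmatically (the repo id is a placeholder, and overwrite=True is needed to replace the existing configs key):

from huggingface_hub import metadata_update

metadata_update(
    "user/ELSA_D3",  # placeholder repo id
    {
        "configs": [
            {
                "config_name": "default",
                "data_files": [
                    {
                        "split": "train",
                        "path": ["default/train-*", "default/train_part2-*"],
                    }
                ],
            }
        ]
    },
    repo_type="dataset",
    overwrite=True,
)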

Thank you, it works fine.
