How to overwrite dataset with dataset.push_to_hub() or alternative?

I have a dataset that I want to update once in a while. When I call dataset.push_to_hub(repo_id = f"{COMPANY_NAME}/{dataset_name}", private=True, token=os.environ['HUGGINGFACE_TOKEN'], split=split), it

  • either silently does not update the dataset, even though I called datasets.disable_caching()
  • or raises ValueError: Split train already present in SplitInfo.

Is there a simple way to force an update of an already present dataset split? Ideally with push_to_hub(), but any simple Python code will do, as long as I can update private datasets in COMPANY_NAME.

I worked around this by omitting the split argument, but of course this is very restrictive: it only allows the single default split (‘train’).
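One possible way to keep several splits while still not passing the split argument is to wrap them in a DatasetDict and push that in a single call. This is only a sketch: the helper names repo_id_for and push_all_splits are made up for illustration, and whether existing data on the Hub is replaced cleanly may depend on the installed datasets version.

```python
import os


def repo_id_for(company: str, dataset_name: str) -> str:
    """Build the Hub repo id in the "<org>/<name>" form used in the question."""
    return f"{company}/{dataset_name}"


def push_all_splits(splits: dict, company: str, dataset_name: str) -> None:
    """Push every split in one call by wrapping them in a DatasetDict.

    `splits` maps split names (e.g. "train", "test") to datasets.Dataset objects.
    """
    # Imported lazily so repo_id_for stays usable without `datasets` installed.
    from datasets import DatasetDict

    DatasetDict(splits).push_to_hub(
        repo_id_for(company, dataset_name),
        private=True,
        # Same environment variable as in the question.
        token=os.environ["HUGGINGFACE_TOKEN"],
    )
```

It would be called as push_all_splits({"train": train_ds, "test": test_ds}, COMPANY_NAME, dataset_name), where train_ds and test_ds stand for your own Dataset objects.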

To avoid caching issues during downloading, I use download_mode=DownloadMode.FORCE_REDOWNLOAD for load_dataset.
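As a rough sketch, the forced re-download could look like the following. The helper names fresh_load_kwargs and reload_fresh are hypothetical, and the token keyword assumes a reasonably recent datasets release (older versions used use_auth_token instead).

```python
import os


def fresh_load_kwargs(token: str) -> dict:
    """Keyword arguments that make load_dataset ignore any cached copy."""
    return {
        # String form of DownloadMode.FORCE_REDOWNLOAD.
        "download_mode": "force_redownload",
        "token": token,
    }


def reload_fresh(repo_id: str):
    """Re-download a (private) dataset from the Hub, bypassing the local cache."""
    # Imported lazily so fresh_load_kwargs works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        repo_id, **fresh_load_kwargs(os.environ["HUGGINGFACE_TOKEN"])
    )
```

Calling reload_fresh(f"{COMPANY_NAME}/{dataset_name}") right after a push is one way to check that the Hub copy really changed.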

I have the same question. I want to be able to overwrite my split.

I would actually like to remove my split…
What I did was overwrite my train split and just ignore my test data :(