New dataset raises 'UnexpectedSplits:' error

I've just uploaded a new dataset with machine translations for 13 languages for 5 NLI datasets, see here: MoritzLaurer/mnli_fever_anli_ling_wanli_translated · Datasets at Hugging Face

I've uploaded the dataset with:

dataset_sample_trans_dic.push_to_hub("MoritzLaurer/mnli_fever_anli_ling_wanli_translated", private=True, token="XXX")

When I now want to load the dataset (after making it public manually), I get the following error:

dataset_sample_trans_dic = load_dataset("MoritzLaurer/mnli_fever_anli_ling_wanli_translated", use_auth_token=False)

/usr/local/lib/python3.7/dist-packages/datasets/utils/ in verify_splits(expected_splits, recorded_splits)
     65         raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
     66     if len(set(recorded_splits) - set(expected_splits)) > 0:
---> 67         raise UnexpectedSplits(str(set(recorded_splits) - set(expected_splits)))
     68     bad_splits = [
     69         {"expected": expected_splits[name], "recorded": recorded_splits[name]}

UnexpectedSplits: {'ling_de', 'mnli_es', 'anli_tr', 'mnli_ru', 'anli_ar', 'ling_hi', 'anli_ru', 'mnli_id', 'mnli_ar', 'ling_ru', 'wanli_ar', 'fever_fr', 'fever_de', 'ling_id', 'wanli_de', 'fever_id', 'anli_pt', 'mnli_tr', 'wanli_hi', 'fever_it', 'fever_hi', 'fever_ru', 'wanli_es', 'mnli_de', 'ling_pt', 'mnli_hi', 'mnli_it', 'wanli_ru', 'ling_it', 'fever_zh', 'fever_es', 'ling_fr', 'anli_fr', 'mnli_fr', 'anli_es', 'wanli_zh', 'wanli_id', 'ling_tr', 'anli_zh', 'wanli_pt', 'fever_pt', 'anli_hi', 'mnli_zh', 'wanli_fr', 'anli_de', 'ling_es', 'wanli_it', 'anli_it', 'wanli_tr', 'fever_tr', 'ling_ar', 'mnli_pt', 'ling_zh', 'anli_id', 'fever_ar'}

The data for these splits is in the repository, but the error still gets raised for some reason.

One possible reason for the error: I first pushed 45 splits to the Hub, then a few hours later 10 more, and a bit later another 10. (This was necessary because the data is machine translated and translation takes very long even on an A100 GPU, so I had to split the translation pipeline across several runs.) I have the impression that load_dataset() now only recognises the last 10 splits I uploaded: there are 65 splits overall and the list of unexpected splits is 55 long.
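To make that arithmetic concrete, here is a minimal sketch of the check visible in the traceback, written with plain Python sets. It is not the library's code: the function name and the split lists are hypothetical, and the idea that the "expected" names come from the metadata file while the "recorded" names come from the data files is an assumption.

```python
# Illustrative sketch only: mirrors the set difference shown in the traceback.
def unexpected_splits(expected, recorded):
    """Return the recorded split names that are missing from the expected ones."""
    return sorted(set(recorded) - set(expected))


# Hypothetical example: the metadata lists only the most recent push,
# while the repository actually holds data for every split.
expected = ["wanli_fr", "wanli_it"]  # assumed: names listed in the metadata file
recorded = ["mnli_de", "mnli_es", "wanli_fr", "wanli_it"]  # assumed: names found in the data

print(unexpected_splits(expected, recorded))  # ['mnli_de', 'mnli_es']
```

With 10 expected names and 65 recorded ones, the same difference has 55 entries, which matches the length of the list in the error.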

=> How can I fix this error?

(Re-uploading is not really an option, because I don't have a local copy of the data and creating the data took several hours of GPU time.)

I solved the issue for now by deleting the dataset_infos.json file in the dataset repo. That's probably not an ideal option, but it's the only solution I found.
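For anyone who prefers to do this from code rather than through the web interface, a minimal sketch follows. It assumes the huggingface_hub package is installed and that the token has write access; the helper name delete_dataset_infos is hypothetical, and the thread does not say how the file was actually deleted.

```python
def delete_dataset_infos(api, repo_id, token=None):
    """Delete dataset_infos.json from a dataset repo on the Hub.

    `api` is expected to be a huggingface_hub.HfApi instance.
    """
    api.delete_file(
        path_in_repo="dataset_infos.json",
        repo_id=repo_id,
        repo_type="dataset",  # the file lives in a dataset repo, not a model repo
        token=token,
    )


# Usage (needs network access and a token with write permission):
# from huggingface_hub import HfApi
# delete_dataset_infos(
#     HfApi(), "MoritzLaurer/mnli_fever_anli_ling_wanli_translated", token="XXX"
# )
```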

Hi! We've only recently added support for updating the dataset_infos.json file with new split info when pushing splits in separate push_to_hub calls, so can you make sure you are running the latest release of datasets?

Hey, thanks for your response. My code was running with the requirement datasets==2.4; that's the latest version, right? Or do you mean the main branch?

It should work in 2.4. I think I see what the issue is. What is the type of the dataset_sample_trans_dic object? I assume it's DatasetDict, but it should be Dataset if you want to add new splits (one dataset for each split). DatasetDict.push_to_hub overwrites everything to make sure load_dataset("<repo>") returns the same dataset as the one which was used to create the repo with push_to_hub.
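A minimal sketch of the "one Dataset per split" approach described above. The helper name push_splits_separately is hypothetical; it only relies on Dataset.push_to_hub accepting a split argument, and it assumes a datasets release that updates the split info across separate pushes (2.4, according to the reply above).

```python
def push_splits_separately(splits, repo_id, token=None, private=False):
    """Push each split with its own Dataset.push_to_hub call.

    `splits` maps a split name to a datasets.Dataset. A DatasetDict also
    works here, since it is a dict of Dataset objects.
    """
    for split_name, dataset in splits.items():
        # One call per split, so each push adds a split instead of
        # replacing the whole repository content.
        dataset.push_to_hub(repo_id, split=split_name, private=private, token=token)


# Usage (needs network access and a token with write permission):
# push_splits_separately(
#     dataset_sample_trans_dic,
#     "MoritzLaurer/mnli_fever_anli_ling_wanli_translated",
#     token="XXX",
#     private=True,
# )
```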

Yeah, you are right, it's a DatasetDict. OK, so I understand that the library prefers pushing only Dataset objects to the Hub when adding to an existing dataset. Interestingly enough, DatasetDict.push_to_hub didn't overwrite everything; it just added each new dict element as an additional split alongside the existing splits, which is actually what I wanted. The problem is that loading the dataset then no longer works and throws the error above, but this is fixable by deleting dataset_infos.json. It could be good to fix this for the case where DatasetDict.push_to_hub is called with new splits for an existing dataset, but I'm not sure whether that would cause issues for other use cases.

I didn't notice negative consequences from deleting dataset_infos.json. Is there an important reason not to do that?

And when I plan to gradually upload different new splits, I suppose the advice is to first push an empty DatasetDict with DatasetDict.push_to_hub and then gradually add each new split with Dataset.push_to_hub when it's ready?