I've just uploaded a new dataset with machine translations of 5 NLI datasets into 13 languages, see here: MoritzLaurer/mnli_fever_anli_ling_wanli_translated · Datasets at Hugging Face
I've uploaded the dataset with:
dataset_sample_trans_dic.push_to_hub("MoritzLaurer/mnli_fever_anli_ling_wanli_translated", private=True, token="XXX")
When I now want to load the dataset (after making it public manually), I get the following error:
dataset_sample_trans_dic = load_dataset("MoritzLaurer/mnli_fever_anli_ling_wanli_translated", use_auth_token=False)
/usr/local/lib/python3.7/dist-packages/datasets/utils/info_utils.py in verify_splits(expected_splits, recorded_splits)
65 raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
66 if len(set(recorded_splits) - set(expected_splits)) > 0:
---> 67 raise UnexpectedSplits(str(set(recorded_splits) - set(expected_splits)))
68 bad_splits = [
69 {"expected": expected_splits[name], "recorded": recorded_splits[name]}
UnexpectedSplits: {'ling_de', 'mnli_es', 'anli_tr', 'mnli_ru', 'anli_ar', 'ling_hi', 'anli_ru', 'mnli_id', 'mnli_ar', 'ling_ru', 'wanli_ar', 'fever_fr', 'fever_de', 'ling_id', 'wanli_de', 'fever_id', 'anli_pt', 'mnli_tr', 'wanli_hi', 'fever_it', 'fever_hi', 'fever_ru', 'wanli_es', 'mnli_de', 'ling_pt', 'mnli_hi', 'mnli_it', 'wanli_ru', 'ling_it', 'fever_zh', 'fever_es', 'ling_fr', 'anli_fr', 'mnli_fr', 'anli_es', 'wanli_zh', 'wanli_id', 'ling_tr', 'anli_zh', 'wanli_pt', 'fever_pt', 'anli_hi', 'mnli_zh', 'wanli_fr', 'anli_de', 'ling_es', 'wanli_it', 'anli_it', 'wanli_tr', 'fever_tr', 'ling_ar', 'mnli_pt', 'ling_zh', 'anli_id', 'fever_ar'}
The data for these splits is present in the repository, but the error is raised anyway.
One possible reason for the error: I first pushed 45 splits to the Hub, then a few hours later I pushed 10 more splits, and a bit later another 10 splits. (This was necessary because the data is machine translated and translation takes a very long time even on an A100 GPU, so I had to split the translation pipeline across several runs.) My impression is that datasets.load_dataset() now only recognises the last 10 splits I uploaded: there are 65 splits overall, and the list of unexpected splits is 55 entries long.
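To check this hypothesis, one could inspect which splits the repo's metadata actually records. A sketch, assuming push_to_hub wrote a dataset_infos.json file to the repo (newer datasets releases store this metadata in the README instead):

import json
from huggingface_hub import hf_hub_download

# Download the metadata file written by push_to_hub and list the splits
# it records; if only the last 10 appear, the hypothesis holds.
path = hf_hub_download(
    repo_id="MoritzLaurer/mnli_fever_anli_ling_wanli_translated",
    filename="dataset_infos.json",
    repo_type="dataset",
)
with open(path) as f:
    infos = json.load(f)
for config_name, info in infos.items():
    print(config_name, sorted(info["splits"].keys()))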
=> How can I fix this error?
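One workaround I've been considering is to simply skip the verification step when loading. A minimal sketch, assuming ignore_verifications still disables the split check in this datasets version (newer releases use verification_mode="no_checks" instead):

from datasets import load_dataset

# Skip the split/checksum verification that raises UnexpectedSplits.
dataset_sample_trans_dic = load_dataset(
    "MoritzLaurer/mnli_fever_anli_ling_wanli_translated",
    ignore_verifications=True,
)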
(Re-uploading is not really an option, because I don't have a local copy of the data and creating the data took several hours of GPU time.)
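That said, if skipping the verification works, the Hub copy itself could serve as the missing local copy: load everything once with the checks disabled and push it back in a single call, so that the recorded split metadata covers all 65 splits at once. Again only a sketch, under the same ignore_verifications assumption:

from datasets import load_dataset

# Recover the full DatasetDict from the Hub (verification skipped) and
# re-push it in one go so the recorded metadata matches all 65 splits.
full_dict = load_dataset(
    "MoritzLaurer/mnli_fever_anli_ling_wanli_translated",
    ignore_verifications=True,
)
full_dict.push_to_hub("MoritzLaurer/mnli_fever_anli_ling_wanli_translated")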