Windows-specific issue: Audio classification example throws a NonMatchingSplitsSizesError, only on Windows

Note: The following issue has been reproduced on three different Windows machines, all running Windows 11. It does not seem to happen on Linux or in WSL.


Hi all,

I normally only use transformers on Linux, but I’m now trying to help someone who only has Windows 11. They started with the single-GPU audio classification example (run_audio_classification.py) and are running into an unexpected error:

datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=8467781, num_examples=51094, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=20814589, num_examples=64727, shard_lengths=None, dataset_name='superb')}]

After some testing, we’ve confirmed that the same error also happens on two other Windows 11 machines (my home PC and a colleague’s computer), while Linux and WSL work fine.

Example of the output:

[...]
Downloading data files: 100%|███████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 399.99it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 658.91it/s]
Generating train split: 64727 examples [00:05, 10918.06 examples/s]
Generating validation split: 100%|████████████████████████████████████████| 6798/6798 [00:00<00:00, 6823.88 examples/s]
Generating test split: 100%|█████████████████████████████████████████████| 3081/3081 [00:00<00:00, 11419.73 examples/s]
Traceback (most recent call last):
  File "C:\Users\MK\Documents\temp\run_audio_classification.py", line 418, in <module>
    main()
  File "C:\Users\MK\Documents\temp\run_audio_classification.py", line 249, in main
    raw_datasets["train"] = load_dataset(
  File "C:\Users\MK\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\MK\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\MK\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "C:\Users\MK\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py", line 1067, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "C:\Users\MK\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\utils\info_utils.py", line 100, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=8467781, num_examples=51094, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=20814589, num_examples=64727, shard_lengths=None, dataset_name='superb')}]

Note: I’m not on Windows right now, but if anyone wants additional system details, I can post them later.


Some observations and theories:

As far as I can tell, the superb/ks dataset (which is used in the audio classification example) is supposed to have 51094 training examples. But when running the script on Windows, the training split is for some reason generated with 64727 examples instead (the progress bar even counts up to 51094 and then keeps going).

This could have something to do with the contents of the two extracted folders. At least on the Linux machine I’m writing this post on, the folder .cache/huggingface/datasets/downloads/extracted/b2e94f56705583e592357a3eb36da67974bf54c810dc51863106d661c2cf54b5 contains exactly 64727 WAV files, while the other folder (33a70a050684383799a5f81690bf02980ca7c51159708d1560118175d4e956fc) contains exactly 3081 files, matching the size of the test split.

The “b2e94[…]” folder also contains files called testing_list.txt and validation_list.txt, whose contents seem to account for the difference in training set size: testing_list.txt has 6835 entries, validation_list.txt has 6798, and 51094 + 6835 + 6798 = 64727. There is no training_list.txt.
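The arithmetic matches exactly, which seems unlikely to be a coincidence:

```python
expected_train = 51094     # num_examples in the "expected" SplitInfo
recorded_train = 64727     # num_examples in the "recorded" SplitInfo
testing_entries = 6835     # lines in testing_list.txt
validation_entries = 6798  # lines in validation_list.txt

# The surplus in the generated train split is exactly the number of
# files that should have gone to the test and validation splits:
surplus = recorded_train - expected_train
print(surplus)                                          # 13633
print(surplus == testing_entries + validation_entries)  # True
```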


(By the way, even when I tried to bypass the NonMatchingSplitsSizesError by disabling dataset verification, I just ran into two other, probably unrelated errors.

At that point I gave up and installed WSL, which seems to work fine.)


So where is the problem? Is the superb/ks dataset somehow misconfigured? (But then why does it work fine on Linux and in WSL?) Is there a bug in how datasets (or some other package) processes files on Windows? Or is there just something that we’re all missing?
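One hypothesis, which we haven’t been able to verify against the actual loading script: if the train split is built by excluding every file whose relative path appears in testing_list.txt or validation_list.txt (that is how the original Speech Commands data is partitioned), then a naive string comparison would silently fail on Windows, where relative paths use backslashes, and every file would end up in the train split. A minimal sketch of the mismatch (the file name is made up):

```python
from pathlib import PureWindowsPath, PurePosixPath

# Hypothetical entry as it would appear in validation_list.txt
# (those lists use forward slashes).
list_entry = "bed/0a7c2a8d_nohash_0.wav"

# Relative path of the same file as produced by path handling on
# Windows (backslash separator) vs. Linux (forward slash):
win_path = str(PureWindowsPath("bed", "0a7c2a8d_nohash_0.wav"))
posix_path = str(PurePosixPath("bed", "0a7c2a8d_nohash_0.wav"))

print(win_path == list_entry)    # False: "bed\..." != "bed/..."
print(posix_path == list_entry)  # True

# Normalizing the separator makes the comparison work on both platforms:
print(win_path.replace("\\", "/") == list_entry)  # True
```

If this is what is happening, every membership test against the two lists would return False on Windows, which would explain why exactly the 6835 + 6798 excluded files reappear in the train split.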