Custom dataset, wrong number of examples for *one* config

Hi!

I have tried a few things but keep running into this issue, and I don’t really understand how it’s possible.

In a nutshell:

  • I have a custom loading script for a custom dataset
  • the script has 28 different configs – all identical except that each points to a different text.gz file (see the sketch after this list)
  • the paths are correct
  • running datasets-cli test DATASETNAME --save_info --all_configs completes without errors and writes the dataset info to README.md
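
For context, the loading script defines its configs roughly like this – a minimal sketch, with hypothetical config names and paths; the real script has 28 entries that differ only in the archive they point to:

import gzip

import datasets

# Hypothetical names/paths standing in for the 28 real entries
_PATHS = {
    "config_a": "data/config_a.txt.gz",
    "config_b": "data/config_b.txt.gz",
}

class Kubhist2(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name=name, version=datasets.Version("1.0.0"))
        for name in _PATHS
    ]

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        path = dl_manager.download(_PATHS[self.config.name])
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"path": path}
            )
        ]

    def _generate_examples(self, path):
        # One example per line, so num_examples should equal the line count
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield i, {"text": line.rstrip("\n")}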

The issue: for one of the configs, the number of examples is wrong. The datasets-cli command above runs fine and everything looks great, but a/ the recorded number of examples for that one config is wrong, and b/ once the dataset is pushed to the Hub and retrieved on another machine, loading that specific config fails with the error pasted below. I believe this happens because README.md says there are 124 880 138 examples when in truth there are more than twice as many (285 384 149).

This does not happen when I load any of the other configs – there, the number of examples is correct.

I do not understand how that’s possible. For 27 of my configs, the number of examples matches the number of lines in the source text files. For the 28th, the largest, it doesn’t.
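
This is easy to double-check by counting the lines in the source file of the failing config directly – a minimal sketch, with a placeholder path:

import gzip

# Count the lines in the failing config’s source archive and compare
# against the num_examples recorded in the generated README.md.
with gzip.open("data/failing_config.txt.gz", "rt", encoding="utf-8") as f:
    n_lines = sum(1 for _ in f)

print(n_lines)  # actual: 285 384 149; README.md records 124 880 138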

Any idea?

Below is the error when loading the full set:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simon/temp/env/lib/python3.10/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/simon/temp/env/lib/python3.10/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/home/simon/temp/env/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/simon/temp/env/lib/python3.10/site-packages/datasets/builder.py", line 985, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/home/simon/temp/env/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 100, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=7999426267, num_examples=124880138, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=18002980013, num_examples=285384149, shard_lengths=[7473000, 7828000, 8013000, 8015000, 7468000, 8172000, 7722000, 7942000, 7799000, 7500000, 7607000, 7674000, 8381000, 7871000, 7706000, 7726000, 7750000, 7566000, 8127000, 8190000, 8041000, 8297000, 7729000, 7890000, 7831000, 7937000, 8055000, 8270000, 7887000, 8417000, 8256000, 8044000, 7970000, 8023000, 8489000, 7694000, 24149], dataset_name='kubhist2')}]
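
In the meantime, the data itself still loads if I tell load_dataset to skip the split-size verification. A sketch – the repo id is a placeholder, and the verification_mode argument requires datasets >= 2.9.1 (older versions use ignore_verifications=True instead):

from datasets import load_dataset

# Skip the size check that raises NonMatchingSplitsSizesError (datasets >= 2.9.1)
ds = load_dataset("USER/kubhist2", "failing_config", verification_mode="no_checks")

# Equivalent on older versions of datasets:
# ds = load_dataset("USER/kubhist2", "failing_config", ignore_verifications=True)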

I’m not sure what went wrong, but manually editing the datasets-cli-generated README.md to correct the number of examples for that one failing config seems to have fixed it.
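
A less manual alternative, assuming the stale numbers came from an old cached build: force a full rebuild of the failing config and check the freshly recorded sizes before saving the info again. A sketch, with placeholder script path and config name:

from datasets import DownloadMode, load_dataset_builder

# Rebuild the failing config from scratch, bypassing any cached Arrow files
builder = load_dataset_builder("path/to/loading_script.py", "failing_config")
builder.download_and_prepare(download_mode=DownloadMode.FORCE_REDOWNLOAD)

# These are the split sizes that datasets-cli --save_info writes to README.md
print(builder.info.splits["train"].num_examples)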