Custom dataset, wrong number of examples for *one* config

Hi!

I have tried a few things but keep running into this issue, and I don’t really understand how it’s possible.

In a nutshell:

  • I have a custom loading script for a custom dataset
  • the script has 28 different configs – all identical except that each points to a different text.gz file (see the sketch after this list)
  • the paths are correct
  • running datasets-cli test DATASETNAME --save_info --all_configs completes without errors and writes the dataset info to README.md
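
For context, the loading script defines its configs roughly like this – a minimal sketch, with hypothetical config names and paths; the real script has 28 entries that differ only in the archive they point to:

import gzip

import datasets

# Hypothetical names/paths standing in for the 28 real entries
_PATHS = {
    "config_a": "data/config_a.txt.gz",
    "config_b": "data/config_b.txt.gz",
}

class Kubhist2(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name=name, version=datasets.Version("1.0.0"))
        for name in _PATHS
    ]

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        path = dl_manager.download(_PATHS[self.config.name])
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"path": path}
            )
        ]

    def _generate_examples(self, path):
        # One example per line, so num_examples should equal the line count
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield i, {"text": line.rstrip("\n")}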

The issue: for one of the configs, the number of examples is wrong. The datasets-cli command above runs fine and everything looks great, but a/ the recorded number of examples for that one config is wrong, and b/ once the dataset is pushed to the Hub and retrieved on another machine, loading that specific config fails with the error pasted below. I believe this happens because README.md says there are 124 880 138 examples when in truth there are more than twice as many (285 384 149).

This does not happen when I load any of the other configs – there, the number of examples is correct.

I do not understand how that’s possible. For 27 of my configs, the number of examples matches the number of lines in the source text files. For the 28th, the largest, it doesn’t.
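
This is easy to double-check by counting the lines in the source file of the failing config directly – a minimal sketch, with a placeholder path:

import gzip

# Count the lines in the failing config’s source archive and compare
# against the num_examples recorded in the generated README.md.
with gzip.open("data/failing_config.txt.gz", "rt", encoding="utf-8") as f:
    n_lines = sum(1 for _ in f)

print(n_lines)  # actual: 285 384 149; README.md records 124 880 138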

Any idea?

Below is the error when loading the full set:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simon/temp/env/lib/python3.10/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/simon/temp/env/lib/python3.10/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/home/simon/temp/env/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/simon/temp/env/lib/python3.10/site-packages/datasets/builder.py", line 985, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/home/simon/temp/env/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 100, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=7999426267, num_examples=124880138, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=18002980013, num_examples=285384149, shard_lengths=[7473000, 7828000, 8013000, 8015000, 7468000, 8172000, 7722000, 7942000, 7799000, 7500000, 7607000, 7674000, 8381000, 7871000, 7706000, 7726000, 7750000, 7566000, 8127000, 8190000, 8041000, 8297000, 7729000, 7890000, 7831000, 7937000, 8055000, 8270000, 7887000, 8417000, 8256000, 8044000, 7970000, 8023000, 8489000, 7694000, 24149], dataset_name='kubhist2')}]
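
In the meantime, the data itself still loads if I tell load_dataset to skip the split-size verification. A sketch – the repo id is a placeholder, and the verification_mode argument requires datasets >= 2.9.1 (older versions use ignore_verifications=True instead):

from datasets import load_dataset

# Skip the size check that raises NonMatchingSplitsSizesError (datasets >= 2.9.1)
ds = load_dataset("USER/kubhist2", "failing_config", verification_mode="no_checks")

# Equivalent on older versions of datasets:
# ds = load_dataset("USER/kubhist2", "failing_config", ignore_verifications=True)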

I’m not sure what went wrong, but manually editing the datasets-cli-generated README.md to correct the number of examples for that one failing config seems to have fixed it.
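
A less manual alternative, assuming the stale numbers came from an old cached build: force a full rebuild of the failing config and check the freshly recorded sizes before saving the info again. A sketch, with placeholder script path and config name:

from datasets import DownloadMode, load_dataset_builder

# Rebuild the failing config from scratch, bypassing any cached Arrow files
builder = load_dataset_builder("path/to/loading_script.py", "failing_config")
builder.download_and_prepare(download_mode=DownloadMode.FORCE_REDOWNLOAD)

# These are the split sizes that datasets-cli --save_info writes to README.md
print(builder.info.splits["train"].num_examples)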