I am trying to load a subset of a dataset from the Hub using load_dataset_builder,
but split-size verification keeps throwing an error. I edited the README
to reflect the number of examples I want, which is 2201939,
but each time I run the download script, it overwrites the README and restores the original number.
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=14752554924, num_examples=40836715, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=724931127, num_examples=2201939, shard_lengths=[1516000, 685939], dataset_name='wmt_utils')}]
Code that produces the error (the snippet was truncated; imports and closing brackets added for completeness):

import datasets
from datasets import inspect_dataset, load_dataset_builder

inspect_dataset("wmt14", "data/path2/to/scripts2")
builder = load_dataset_builder(
    "data/path2/to/scripts2/wmt_utils.py",
    language_pair=("fr", "en"),
    subsets={
        datasets.Split.TRAIN: ["europarl_v7", "newscommentary_v10"],
        datasets.Split.VALIDATION: ["newstest2013"],
        datasets.Split.TEST: ["newstest2014"],
    },
)