load_dataset_builder verification error

I am trying to load a subset of a dataset from the Hub using load_dataset_builder, but the verification keeps throwing an error. I have edited the README to reflect the number of examples I want (2201939), but each time I run the download script, it overwrites the README file and changes the number back.

    datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=14752554924, num_examples=40836715, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=724931127, num_examples=2201939, shard_lengths=[1516000, 685939], dataset_name='wmt_utils')}]

Code that produces the error:

inspect_dataset("wmt14", "data/path2/to/scripts2")
    builder = load_dataset_builder(
        "data/path2/to/scripts2/wmt_utils.py",
        language_pair=("fr", "en"),
        subsets={
            datasets.Split.TRAIN: ["europarl_v7", "newscommentary_v10"],
            datasets.Split.VALIDATION: ["newstest2013"],
            datasets.Split.TEST: ["newstest2014"],

Hi! This looks like a bug :confused:
Maybe you can disable verification by passing verification_mode="no_checks"?
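
For reference, a minimal sketch of how that suggestion could be applied to the builder from the snippet above, assuming a datasets release where download_and_prepare accepts a verification_mode argument (older releases used ignore_verifications=True instead):

    import datasets
    from datasets import load_dataset_builder

    builder = load_dataset_builder(
        "data/path2/to/scripts2/wmt_utils.py",
        language_pair=("fr", "en"),
        subsets={
            datasets.Split.TRAIN: ["europarl_v7", "newscommentary_v10"],
            datasets.Split.VALIDATION: ["newstest2013"],
            datasets.Split.TEST: ["newstest2014"],
        },
    )
    # Skip the recorded-vs-expected split size checks that raise NonMatchingSplitsSizesError
    builder.download_and_prepare(verification_mode="no_checks")

If you load the dataset directly with load_dataset, the same verification_mode argument should work there as well (again assuming a recent release).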

I was the one making a mistake; commenting out the first line (the inspect_dataset call) solves the problem.
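
To spell out the fix (a sketch, assuming the cause is that inspect_dataset re-copies the original wmt14 script and metadata into the local directory on every run, which would overwrite the edited README): run inspect_dataset once to get a local, editable copy, then keep it commented out and leave the rest of the snippet unchanged.

    from datasets import inspect_dataset

    # Run once to copy the "wmt14" loading script into an editable local directory,
    # then comment it out so later runs don't overwrite the edited files there:
    # inspect_dataset("wmt14", "data/path2/to/scripts2")

    # ... the load_dataset_builder(...) call from the snippet above stays the same.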
