I am trying to load a subset of a dataset from the Hub using load_dataset_builder,
but split-size verification keeps throwing an error. I edited the README
to reflect the number of examples I want, which is 2201939,
but each time I run the download script, it overwrites the README and restores the original number.
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=14752554924, num_examples=40836715, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=724931127, num_examples=2201939, shard_lengths=[1516000, 685939], dataset_name='wmt_utils')}]
Code that produces the error (the snippet was truncated; imports and closing brackets added for completeness):

import datasets
from datasets import inspect_dataset, load_dataset_builder

inspect_dataset("wmt14", "data/path2/to/scripts2")
builder = load_dataset_builder(
    "data/path2/to/scripts2/wmt_utils.py",
    language_pair=("fr", "en"),
    subsets={
        datasets.Split.TRAIN: ["europarl_v7", "newscommentary_v10"],
        datasets.Split.VALIDATION: ["newstest2013"],
        datasets.Split.TEST: ["newstest2014"],
    },
)