`load_dataset`: how to extract only the validation split?

When I run something like `dataset = load_dataset('trivia_qa', 'rc', split='validation')`, I see a tqdm progress bar extracting the train split. Is there a way to skip this?

Hi! No, but we plan to address this soon - this will likely require introducing a new script structure, which is why we haven’t implemented it yet.

In the meantime, you can download the `trivia_qa.py` script from the `trivia_qa` dataset repository on the Hub (`main` branch), replace

    return [
        datasets.SplitGenerator(
            name=name,
            gen_kwargs={
                "files": _qa_files(file_paths, cfg.sources, name, cfg.unfiltered),
                "web_dir": web_evidence_dir,
                "wiki_dir": wiki_evidence_dir,
            },
        )
        for name in [datasets.Split.TRAIN, datasets.Split.VALIDATION, datasets.Split.TEST]
    ]

with

    return [
        datasets.SplitGenerator(
            name=name,
            gen_kwargs={
                "files": _qa_files(file_paths, cfg.sources, name, cfg.unfiltered),
                "web_dir": web_evidence_dir,
                "wiki_dir": wiki_evidence_dir,
            },
        )
        for name in [datasets.Split.VALIDATION]
    ]

and then run `load_dataset("path/to/script")`.
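If you'd rather not edit the file by hand, the replacement can be scripted. This is only a rough sketch: `keep_validation_only` and `patch_script` are hypothetical helper names, not part of `datasets`, and it assumes the split list appears in your downloaded copy of the script exactly as quoted in this thread.

```python
from pathlib import Path

# The split list as it is quoted from the original script in this thread.
ALL_SPLITS = "[datasets.Split.TRAIN, datasets.Split.VALIDATION, datasets.Split.TEST]"
VALIDATION_ONLY = "[datasets.Split.VALIDATION]"


def keep_validation_only(script_source: str) -> str:
    """Return the script source with the split list reduced to validation only."""
    if ALL_SPLITS not in script_source:
        # Fail loudly instead of silently writing an unpatched copy.
        raise ValueError("split list not found; the script may have changed")
    return script_source.replace(ALL_SPLITS, VALIDATION_ONLY)


def patch_script(src: Path, dst: Path) -> Path:
    """Hypothetical helper: read the downloaded script at src, write a patched copy to dst."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    patched = keep_validation_only(src.read_text(encoding="utf-8"))
    dst.write_text(patched, encoding="utf-8")
    return dst
```

You would then point `load_dataset` at the patched copy, as described above. Depending on your `datasets` version, loading a local script may need `trust_remote_code=True` or may not be supported at all, so check the docs for the version you have installed.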


Amazing, thanks!