When I run something like dataset = load_dataset('trivia_qa', 'rc', split='validation'), I see a tqdm progress bar extracting the train split. Is there a way to skip this?
Hi! No, but we plan to address this soon - this will likely require introducing a new script structure, which is why we haven’t implemented it yet.
In the meantime, you can download the trivia_qa.py script from the trivia_qa repository on the Hub (main branch), replace
    return [
        datasets.SplitGenerator(
            name=name,
            gen_kwargs={
                "files": _qa_files(file_paths, cfg.sources, name, cfg.unfiltered),
                "web_dir": web_evidence_dir,
                "wiki_dir": wiki_evidence_dir,
            },
        )
        for name in [datasets.Split.TRAIN, datasets.Split.VALIDATION, datasets.Split.TEST]
    ]
with
    return [
        datasets.SplitGenerator(
            name=name,
            gen_kwargs={
                "files": _qa_files(file_paths, cfg.sources, name, cfg.unfiltered),
                "web_dir": web_evidence_dir,
                "wiki_dir": wiki_evidence_dir,
            },
        )
        for name in [datasets.Split.VALIDATION]
    ]
and then run load_dataset("path/to/script").
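If you'd rather not edit the file by hand, the same change could be applied to the script text programmatically. This is only a rough sketch: keep_only_split is a hypothetical helper (not part of datasets), and it assumes the downloaded script still contains the three-split loop exactly as quoted above. It refuses to patch anything if that loop is not found exactly once.

```python
import re

# Matches the loop over all three splits as it appears in the snippet quoted above.
# Assumption: the downloaded script still contains this loop in the same form.
_ALL_SPLITS_LOOP = re.compile(
    r"for name in \[\s*datasets\.Split\.TRAIN,\s*datasets\.Split\.VALIDATION,"
    r"\s*datasets\.Split\.TEST\s*\]"
)


def keep_only_split(script_text: str, split: str = "VALIDATION") -> str:
    """Narrow the three-split loop in the script source to a single split.

    `split` is the attribute name on datasets.Split: "TRAIN", "VALIDATION" or "TEST".
    This helper is hypothetical; it only rewrites text and does not import datasets.
    """
    if split not in {"TRAIN", "VALIDATION", "TEST"}:
        raise ValueError(f"unknown split: {split!r}")
    patched, count = _ALL_SPLITS_LOOP.subn(
        f"for name in [datasets.Split.{split}]", script_text
    )
    if count != 1:
        # Refuse to guess if the script no longer looks like the quoted version.
        raise ValueError(f"expected exactly one split loop, found {count}")
    return patched


if __name__ == "__main__":
    original = (
        "for name in [datasets.Split.TRAIN, datasets.Split.VALIDATION, "
        "datasets.Split.TEST]"
    )
    print(keep_only_split(original))
```

To use it, read your local copy of trivia_qa.py into a string, pass it through keep_only_split, write the result back to a file, and point load_dataset at that path (plus whatever config name and split you were already passing).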
Amazing, thanks!