Customized split strategies (in my case, using leave_out="cat" for example to treat cats separately).
Splits train, test and leftout.
Lazy loading of the splits, meaning that if a user requests leave_out="cat", split="leftout", then HF only downloads the cat samples.
I have trouble with the last part honestly…
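For the lazy-loading point, one possible direction that avoids a loading script is to let the caller pass `data_files` to `load_dataset`, so that only files matching the given patterns should be resolved and downloaded. The sketch below only builds those patterns. The repository layout `data/<split>/<class>/*`, the class list and the helper names `build_data_files` and `files_for` are all hypothetical assumptions, not something taken from the HF docs.

```python
from typing import Dict, List

# Hypothetical class list; replace with the real classes of the dataset.
CLASSES = ["cat", "dog", "bird"]


def build_data_files(leave_out: str, classes: List[str] = CLASSES) -> Dict[str, List[str]]:
    """Map each split name to glob patterns, assuming a data/<split>/<class>/* layout."""
    if leave_out not in classes:
        raise ValueError(f"unknown class: {leave_out!r}")
    kept = [c for c in classes if c != leave_out]
    return {
        # train and test contain every class except the left-out one
        "train": [f"data/train/{c}/*" for c in kept],
        "test": [f"data/test/{c}/*" for c in kept],
        # leftout gathers all samples of the excluded class
        "leftout": [f"data/train/{leave_out}/*", f"data/test/{leave_out}/*"],
    }


def files_for(leave_out: str, split: str) -> Dict[str, List[str]]:
    """Keep only the patterns of the requested split, so nothing else is listed."""
    return {split: build_data_files(leave_out)[split]}


# Possible usage (needs the `datasets` package and a real repository id):
# from datasets import load_dataset
# ds = load_dataset("user/repo", data_files=files_for("cat", "leftout"), split="leftout")
```

The design choice here is to move the leave_out logic to the caller side, which keeps the repository free of custom code at the cost of a less convenient API.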
What I am currently trying
I think from what I understood here that I need to create a custom dataset.py file with the BuilderConfig and DatasetBuilder. But I have many questions:
1. Their example
class Squad(datasets.GeneratorBasedBuilder):
    """SQUAD: The Stanford Question Answering Dataset. Version 1.1."""

    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        downloaded_files = dl_manager.download_and_extract(_URLS)
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_files["train"]}),
            datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"filepath": downloaded_files["dev"]}),
        ]
seems to eagerly download every split??
2. I don’t really understand whether the script defining the DatasetBuilder will be used locally by me to upload to the HF hub, or whether it will be executed remotely by users, in which case I should simply upload the raw files as I currently have them locally?
3. I think I could maybe group files by test/train and class into zipballs to provide more efficient downloading? But at this point it seems like I’m doing all the optimizing stuff HuggingFace should do for me?
Thanks in advance, it’s really hard to get into this from a beginner POV.
it might be more convenient to publish the built data set if you want to make it public.
Could you explain what you mean by “built” please? Because when I browse other datasets, they never upload files like I did (it seems stupid to, so I expected that), they often use parquet (I don’t think it’s very appropriate for images? Maybe zip better?). Is that what you mean?
Or do you mean “built” as in “publish it 11 times with 11 strategies in 11 folders (entire dataset + 10 times minus one class)”?
Could you explain what you mean by “built” please? Because when I browse other datasets, they never upload files like I did (it seems stupid to, so I expected that), they often use parquet (I don’t think it’s very appropriate for images? Maybe zip better?). Is that what you mean?
I think I’ll eventually settle for this, and use the filters option to leave out specific classes on-the-fly. I cannot find the proper documentation for the filters format though. If you have a pointer, that’d be lovely!
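On the filters format: as far as I can tell, the Parquet loader in recent `datasets` versions accepts the same convention as PyArrow, that is a list of `(column, operator, value)` tuples that are ANDed together. The sketch below only builds such lists. The column name `label`, the assumption that it stores the class name as a string, and the helper names are hypothetical.

```python
from typing import Any, List, Tuple

# One PyArrow-style predicate: (column name, operator, value).
Predicate = Tuple[str, str, Any]


def leave_out_filters(class_name: str, column: str = "label") -> List[Predicate]:
    """Predicates keeping every row whose class differs from class_name."""
    return [(column, "!=", class_name)]


def leftout_filters(class_name: str, column: str = "label") -> List[Predicate]:
    """Predicates keeping only the rows of the left-out class."""
    return [(column, "==", class_name)]


# Possible usage (needs the `datasets` package and a Parquet-backed repository):
# from datasets import load_dataset
# train = load_dataset("user/repo", split="train", filters=leave_out_filters("cat"))
# leftout = load_dataset("user/repo", split="train", filters=leftout_filters("cat"))
```

If the label column is stored as an integer class index rather than a string, the value in each tuple would have to be that index instead.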
Again, thank you very much for your help!
All the best.
I edited the original message as I made a typo in the manual config paths previously.
Second edit, I still had a typo, now it seems to work!