I want to share multiple datasets in the same repo <my_username>/<my_repo_name>, each in its own folder. Each folder's dataset is already in sharded Arrow format (for best performance) and contains the usual splits. To read any of these datasets with load_dataset, I would need a loading script to tell HF how to read from the folders, right? If so, should I use ArrowBasedBuilder, and how? I can only find tutorials for GeneratorBasedBuilder!
@John6666 No, because I don't want to concatenate the datasets! Each folder is a different dataset with different features. So do I need the Arrow builder to tell HF how to load the different datasets from the subfolders?
Hmm…
In that case, I'd have thought it would work better to keep datasets with different structures in separate repos, since Hugging Face is built around one model (or dataset) per repo.
However, I think there was a way to merge datasets with different structures. Let's wait for lhonestq.
Yeah, maybe. I'm hesitant to split them into separate repos because the datasets are related; they're not completely separate projects. Think of GLUE, which is a set of multiple datasets that are all tied to one objective or project, as shown here: Create a dataset loading script
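The GLUE-style idea, as I understand it, is that a single loading script exposes one configuration per subfolder, and the user picks one by name. The dispatch can be sketched in plain Python (config and folder names below are made up; in a real loading script these would be datasets.BuilderConfig entries in the builder's BUILDER_CONFIGS list):

```python
from pathlib import PurePosixPath

# Hypothetical mapping from GLUE-style config names to repo subfolders.
# Each subfolder holds an independent dataset with its own features
# and splits; the config name selects which one to load.
CONFIGS = {
    "task_a": "task_a",
    "task_b": "task_b",
}

def shard_pattern(config_name: str, split: str) -> str:
    """Return the glob pattern for one config's split shards."""
    if config_name not in CONFIGS:
        raise ValueError(
            f"unknown config {config_name!r}, choose from {sorted(CONFIGS)}"
        )
    return str(PurePosixPath(CONFIGS[config_name]) / split / "*.arrow")

print(shard_pattern("task_a", "train"))  # task_a/train/*.arrow
```

With something like this in place, each related dataset stays in its own folder of the one repo, and a caller would select it the way GLUE users select a task, e.g. load_dataset("<my_username>/<my_repo_name>", "task_a") (assuming the configs are defined in the script).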