Sharing ArrowDataset with subfolders

Hello everyone!

I want to share multiple datasets in the same repo <my_username>/<my_repo_name>, each in its own folder. The datasets in each folder are already in sharded Arrow format (for best performance) and contain the usual splits. To read any of these datasets with load_dataset, would I need a loading script to tell HF how to read from the folders? If so, should I use ArrowBasedBuilder, and how? I only see tutorials for GeneratorBasedBuilder!
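Concretely, this is the kind of layout and usage I have in mind (the repo and folder names below are just placeholders):

```python
# Hypothetical repo layout, one dataset per subfolder:
#   my_username/my_repo_name/
#     dataset_a/train/data-00000-of-00002.arrow
#     dataset_a/test/data-00000-of-00001.arrow
#     dataset_b/train/data-00000-of-00004.arrow
#     ...
from datasets import load_dataset

# What I would like to be able to do: pick a subfolder by name
ds_a = load_dataset("my_username/my_repo_name", "dataset_a")
ds_b = load_dataset("my_username/my_repo_name", "dataset_b")
```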

Thanks!


If it’s already been converted to a Dataset class, is datasets.concatenate_datasets sufficient…? @lhoestq
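Roughly something like this, though it only applies when the pieces share the same features (the repo, folder, and split names below are placeholders):

```python
from datasets import concatenate_datasets, load_dataset

# Only works if both datasets have exactly the same features
# (same column names and types).
ds_a = load_dataset("my_username/my_repo_name", data_dir="dataset_a", split="train")
ds_b = load_dataset("my_username/my_repo_name", data_dir="dataset_b", split="train")
merged = concatenate_datasets([ds_a, ds_b])
```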

@John6666 No, because I don’t want to concatenate the datasets! Each folder is a different dataset with different features. So do I need the Arrow builder to tell HF how to load the different datasets from the subfolders?


Hmm…
In that case, I thought it would be easier for Hugging Face, which is built around one model (or dataset) per repo, to work properly if the datasets with different structures were kept in separate repos. :thinking:
However, I think there was a way to handle datasets with different structures. Let’s wait for lhoestq.

Yeah, maybe. I’m hesitant to split them into different repos because the datasets are related; they’re not completely separate projects. Think of it like GLUE, which is a set of multiple datasets that are all related to one objective or project, as shown here: Create a dataset loading script


You can configure the subsets present in your dataset repository in YAML :slight_smile: see the docs at Manual Configuration

See the GLUE dataset for example: nyu-mll/glue at main
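As a rough sketch, assuming each subfolder holds sharded .arrow files organized by split (the repo, folder, and config names below are placeholders), the YAML in the repo’s README.md could look like this:

```yaml
configs:
- config_name: dataset_a
  data_files:
  - split: train
    path: dataset_a/train/*.arrow
  - split: test
    path: dataset_a/test/*.arrow
- config_name: dataset_b
  data_files:
  - split: train
    path: dataset_b/train/*.arrow
  - split: test
    path: dataset_b/test/*.arrow
```

Each subset can then be loaded by its config name, without any loading script:

```python
from datasets import load_dataset

ds_a = load_dataset("my_username/my_repo_name", "dataset_a", split="train")
```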


Thank you!

This is amazing! Thank you very much.

