Sharing ArrowDataset with subfolders

Hello everyone!

I want to share multiple datasets in the same repo <my_username>/<my_repo_name>, each in its own folder. The datasets in each folder are already in sharded Arrow format (for best performance) and contain the usual splits. To read any of these datasets with load_dataset, would I need a loading script to tell HF how to read from the folders? If so, should I use ArrowBasedBuilder, and how? I only see tutorials for GeneratorBasedBuilder!
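Concretely, this is the kind of layout and usage I have in mind (the repo and folder names below are just placeholders):

```python
# Hypothetical repo layout, one dataset per subfolder:
#   my_username/my_repo_name/
#     dataset_a/train/data-00000-of-00002.arrow
#     dataset_a/test/data-00000-of-00001.arrow
#     dataset_b/train/data-00000-of-00004.arrow
#     ...
from datasets import load_dataset

# What I would like to be able to do: pick a subfolder by name
ds_a = load_dataset("my_username/my_repo_name", "dataset_a")
ds_b = load_dataset("my_username/my_repo_name", "dataset_b")
```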

Thanks!


If it’s already been converted to a Dataset class, is datasets.concatenate_datasets sufficient…? @lhoestq
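Roughly something like this, though it only applies when the pieces share the same features (the repo, folder, and split names below are placeholders):

```python
from datasets import concatenate_datasets, load_dataset

# Only works if both datasets have exactly the same features
# (same column names and types).
ds_a = load_dataset("my_username/my_repo_name", data_dir="dataset_a", split="train")
ds_b = load_dataset("my_username/my_repo_name", data_dir="dataset_b", split="train")
merged = concatenate_datasets([ds_a, ds_b])
```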

@John6666 No, because I don’t want to concatenate the datasets! Each folder is a different dataset with different features. So do I need the Arrow builder to tell HF how to load the different datasets from the subfolders?


Hmm…
In that case, I thought it would be easier for Hugging Face, which is built around one model (or dataset) per repo, to work properly if the datasets with different structures were kept in separate repos. :thinking:
However, I think there was a way to handle datasets with different structures. Let’s wait for lhoestq.

Yeah, maybe. I’m hesitant to split them into different repos because the datasets are related; they’re not completely separate projects. Think of it like GLUE, which is a set of multiple datasets that are all related to one objective or project, as shown here: Create a dataset loading script


You can configure the subsets present in your dataset repository in YAML :slight_smile: see the docs at Manual Configuration

See the GLUE dataset for example: nyu-mll/glue at main
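As a rough sketch, assuming each subfolder holds sharded .arrow files organized by split (the repo, folder, and config names below are placeholders), the YAML in the repo’s README.md could look like this:

```yaml
configs:
- config_name: dataset_a
  data_files:
  - split: train
    path: dataset_a/train/*.arrow
  - split: test
    path: dataset_a/test/*.arrow
- config_name: dataset_b
  data_files:
  - split: train
    path: dataset_b/train/*.arrow
  - split: test
    path: dataset_b/test/*.arrow
```

Each subset can then be loaded by its config name, without any loading script:

```python
from datasets import load_dataset

ds_a = load_dataset("my_username/my_repo_name", "dataset_a", split="train")
```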


Thank you!

This is amazing! Thank you very much.

