Upload efficiently for lazy split download

Hi everyone,

I’m a beginner with Hugging Face and I must say I’m completely lost in their tutorials.

The data I have locally

Essentially CIFAR 10, structured as follows:

data/airplane/airplane_xxxx.png
data/cat/cat_yyyy.png
...

where xxxx goes from 0000 to 5999 and

  • 0000 -> 0999 belong to test,
  • 1000 -> 5999 belong to train.
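For reference, the index rule above can be written as a tiny helper (the name `split_for_index` is my own, not part of any library):

```python
def split_for_index(index: int) -> str:
    """Map a file's four-digit index to its split.

    Per the layout above: 0000-0999 belong to test,
    1000-5999 belong to train.
    """
    if not 0 <= index <= 5999:
        raise ValueError(f"index out of range: {index}")
    return "test" if index <= 999 else "train"
```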

What I want

To upload it with:

  • Customized split strategies (for example, leave_out="cat" to treat cats separately).
  • Splits train, test and leftout.
  • Lazy loading of the splits, meaning that if a user requests leave_out="cat", split="leftout", then HF only downloads the cat samples.

I have trouble with the last part honestly…

What I am currently trying

From what I understood here, I think I need to create a custom dataset.py file with a BuilderConfig and a DatasetBuilder. But I have many questions:

  1. Their example

class Squad(datasets.GeneratorBasedBuilder):
    """SQUAD: The Stanford Question Answering Dataset. Version 1.1."""

    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        downloaded_files = dl_manager.download_and_extract(_URLS)

        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_files["train"]}),
            datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"filepath": downloaded_files["dev"]}),
        ]

seems to eagerly download every split??
2. I don’t really understand whether the script defining the DatasetBuilder will be used locally by me to upload to the HF Hub, or whether it will be executed remotely by users, in which case I should simply upload the raw files as I currently have them locally?
3. I think I can maybe group files by test/train and class into zipballs to provide more efficient downloading? But at this point it seems like I’m doing all the optimizing Hugging Face should do for me?
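To make the desired split strategy concrete, here is a plain-Python sketch of how paths could be routed to train / test / leftout given a leave_out class. It is independent of the `datasets` API; `partition_files` and its return format are my own invention, shown only to pin down the semantics:

```python
import re
from collections import defaultdict


def partition_files(paths, leave_out=None):
    """Route 'data/<class>/<class>_<index>.png' paths into splits.

    Files of the left-out class go to 'leftout'; the rest follow
    the index rule (0000-0999 -> test, 1000-5999 -> train).
    """
    pattern = re.compile(r"data/(?P<label>[^/]+)/[^/]+_(?P<idx>\d{4})\.png$")
    splits = defaultdict(list)
    for path in paths:
        m = pattern.search(path)
        if m is None:
            raise ValueError(f"unexpected path: {path}")
        label, idx = m.group("label"), int(m.group("idx"))
        if label == leave_out:
            splits["leftout"].append(path)
        else:
            splits["test" if idx <= 999 else "train"].append(path)
    return dict(splits)
```

A builder script would then hand each of the three lists to its own SplitGenerator.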

Thanks in advance, it’s really hard to get into this from a beginner POV.

All the best!
Élie


Currently, your dataset has labels (such as “cat”) in the file names, but if you instead use directory (or archive file) names as labels and organize the files hierarchically, you should be able to load the dataset via ImageFolder.
Incidentally, ImageFolder does not seem to be very efficient when the dataset is huge.
https://github.com/huggingface/datasets/issues/5317
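If it helps, here is a stdlib-only sketch of reorganizing the flat `data/<class>/` layout described earlier into the `<split>/<class>/` layout that ImageFolder-style loading expects. The helper name `reorganize` and the index-based split rule are taken from the original post, not from the `datasets` library:

```python
import re
import shutil
from pathlib import Path


def reorganize(src: Path, dst: Path) -> None:
    """Copy data/<class>/<class>_<idx>.png into <split>/<class>/ folders.

    The split is derived from the four-digit index in the file name:
    0000-0999 -> test, 1000-5999 -> train.
    """
    pattern = re.compile(r"_(\d{4})\.png$")
    for path in sorted(src.glob("*/*.png")):
        m = pattern.search(path.name)
        if m is None:
            continue  # skip files that don't match the naming scheme
        split = "test" if int(m.group(1)) <= 999 else "train"
        target = dst / split / path.parent.name
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target / path.name)
```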

2

I think the dataset builder script is executed locally.
By the way, since executing the dataset builder directly from the Hub is no longer recommended, it might be more convenient to publish the built dataset if you want to make it public.

3

That may be true. Still, in some cases it’s more convenient to partition the files intentionally to a certain extent.

Thanks for your answer and the interesting pointers!

I am using ImageFolder structure currently but:

  • I cannot get it to work with the “calibration” split name.
  • It’s extremely slow at download since it loads files one by one (1h20 yesterday when I tried to download it all).
  • It does not allow custom split strategies (like the leave_out="cat" I mentioned).

By the way, since executing the dataset builder directly from Hub is no longer recommended,

Hmmm that’s a bummer.

it might be more convenient to publish the built data set if you want to make it public.

Could you explain what you mean by “built”, please? When I browse other datasets, they never upload files the way I did (which seemed unwise, so I expected that); they often use Parquet (I don’t think it’s very appropriate for images? Maybe zip is better?). Is that what you mean?

Or do you mean “built” as in “publish it 11 times with 11 strategies in 11 folders (entire dataset + 10 times minus one class)”?

All the best.


I cannot get it to work with “calibration” split name

In many cases, placing files and folders into the data folder works well.
File names and splits

Could you explain what you mean by “built” please? Because when I browse other datasets, they never upload files like I did (it seems stupid to, so I expected that), they often use parquet (I don’t think it’s very appropriate for images? Maybe zip better?). Is that what you mean?

Yes. In parquet (default) or in WebDataset.

Yes. In parquet (default) or in WebDataset.

Ok thanks, I’ll eventually lean towards this.


Regarding the names, I already knew that “calibration” isn’t one of the standard split names, but following the tutorial for manual configuration with this metadata in my README.md:

configs:
  - config_name: default
    data_files:
      - split: train
        path: train/*/*.png
      - split: calibration
        path: calibration/*/*.png
      - split: test
        path: test/*/*.png

I made it work now!

I think I’ll eventually settle for this, and use the filters option to leave_out specific classes on-the-fly. I cannot find the proper documentation for the filters format though. If you have a pointer, that’d be lovely!
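For what it’s worth, PyArrow’s Parquet reader commonly takes filters as lists of (column, op, value) tuples. Assuming the filters option follows that convention (an assumption on my part, I haven’t verified it against the datasets docs), the semantics of such predicates can be illustrated in plain Python:

```python
import operator

# Comparison operators used in PyArrow-style (column, op, value) predicates.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
}


def matches(row: dict, predicates) -> bool:
    """True if the row satisfies every (column, op, value) predicate (AND)."""
    return all(_OPS[op](row[col], val) for col, op, val in predicates)


# e.g. keep everything except the "cat" class:
leave_out_cat = [("label", "!=", "cat")]
```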

Again, thank you very much for your help!

All the best.


I edited the original message as I made a typo in the manual config paths previously.

Second edit, I still had a typo, now it seems to work!


Great! :laughing:

Since many people use .filter, I don’t know much about filters option, but it seems that they need to be passed in PyArrow format.


This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.