Allow streaming of large datasets with image/audio

All right, I’m gonna have more time to look into it:

  • dataset
  • metadata.jsonl contains all the metadata line by line.
  • key represents the file path (e.g. ‘601e28f77125baea9baa8591d1cbe48’)
    • the file will be in a zip archive at f"data/{key[:3]}.zip"
    • within that zip file, the image path is f"data/images/{key[:3]}/{key[3:6]}/{key}.jpg"
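To make the layout above concrete, here is a minimal sketch of the key-to-path mapping (pure string formatting; the helper names are mine, not from any library):

```python
# Sketch of the layout described above: a metadata key maps to a zip
# archive and to an image path inside that archive. Helper names are
# illustrative only.

def archive_for(key: str) -> str:
    # Zip archive holding this key's file, sharded on the first 3 chars.
    return f"data/{key[:3]}.zip"

def image_path_in_archive(key: str) -> str:
    # Path of the image inside the archive, sharded on key[:3] and key[3:6].
    return f"data/images/{key[:3]}/{key[3:6]}/{key}.jpg"

key = "601e28f77125baea9baa8591d1cbe48"
print(archive_for(key))            # data/601.zip
print(image_path_in_archive(key))  # data/images/601/e28/601e28f77125baea9baa8591d1cbe48.jpg
```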

The download and extraction involve a lot of data and take a very long time locally!
Does that mean you keep the files permanently hosted?

Also, it seems that I’m supposed to pass URLs. Is that the case even when the files are hosted on the Hub as a dataset, which means I would use URLs such as https://huggingface.co/datasets/ENTITY/DATASET_NAME/resolve/main/…/FILE.zip, or are they considered local because they live on the Hub?
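For concreteness, the URL pattern I mean looks like this (ENTITY/DATASET_NAME is a placeholder for the real repo id, and the helper is mine):

```python
# Building a resolve URL for a file in a dataset repo on the Hub.
# REPO is a placeholder; substitute the real ENTITY/DATASET_NAME.
REPO = "ENTITY/DATASET_NAME"

def resolve_url(path_in_repo: str) -> str:
    # Files in a dataset repo are served under .../resolve/main/<path>.
    return f"https://huggingface.co/datasets/{REPO}/resolve/main/{path_in_repo}"

print(resolve_url("data/601.zip"))
# https://huggingface.co/datasets/ENTITY/DATASET_NAME/resolve/main/data/601.zip
```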

As for _split_generators, what if I don’t have multiple splits and all my data is training data?
Should I return a list with only one split, named datasets.Split.TRAIN?
Is it recommended to arbitrarily define at least a validation split (and maybe a test split), knowing that I can always choose to use all splits later?

Finally, when I have an image, is it recommended to just yield the file path and let the image loading be performed in a later map function?
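What I have in mind, sketched (column names are illustrative; actual decoding would happen later, e.g. in a map function):

```python
# Sketch: _generate_examples yields only the image path plus metadata,
# leaving the actual image decoding to a later step such as a map
# function. Column names ("image_path", "caption") are illustrative.

def generate_examples(metadata_rows):
    for idx, row in enumerate(metadata_rows):
        key = row["key"]
        yield idx, {
            "image_path": f"data/images/{key[:3]}/{key[3:6]}/{key}.jpg",
            "caption": row.get("caption"),
        }

rows = [{"key": "601e28f77125baea9baa8591d1cbe48", "caption": "a photo"}]
idx, example = next(iter(generate_examples(rows)))
print(example["image_path"])
# data/images/601/e28/601e28f77125baea9baa8591d1cbe48.jpg
```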