Allow streaming of large datasets with image/audio

Hi,

I'm trying to recreate the OpenAI subset of YFCC100M and, since the dataset is very large, I'd like to push it to the Hub and allow streaming.

I was seeing it as:

  • large metadata file (containing title, description, file path): I had planned to use JSON but I guess I should use a tsv or csv instead (because a JSON file probably needs to be read entirely before it can be parsed?)
  • an archive with all images: it seems that it needs to be a zip file

Would streaming be feasible? Is there a loading script I can get inspiration from?


pinging @lhoestq here :slight_smile:

Hi !
For the metadata file I'd suggest any line-by-line format (json lines, csv, tsv)
Then having the images in a ZIP file does the job :slight_smile:

Most formats are compatible with streaming, with a few exceptions like the TAR format (you can't extract a single file from a TAR archive without reading the whole archive, whereas ZIP supports this).

I don't think there is a big image dataset on the hub that works with streaming yet, since the image datasets I can see use pickle/torch file formats which are not streamable (you can't just load one image from such a file without reading the full file).

Therefore if you have any questions or if I can help, feel free to ping me :slight_smile:


Great! I expect I'm gonna end up with about 2TB of data.
Do you think I can just zip it and upload it as a single file?

For the metadata I didn't realize I could also use JSON (I thought the whole file would need to be read up to the closing bracket since it's one large array). Given potential special characters, do you think it is a safer format, or should I just stick with csv or tsv?

You should probably shard your dataset into parts that are smaller than 1GB each :wink:
This is convenient, especially if some people just want to download part of it to run some tests, for example. This is also convenient when building a distributed processing pipeline.

I mentioned the JSON Lines format, which is not exactly JSON: it is one JSON object per line in your metadata file. This is a nice format :slight_smile: But if you want to use csv or tsv instead feel free to do so
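For example, writing the metadata as JSON Lines is just one json.dumps call per line (made-up values, just to illustrate the format):

import json

items = [
    {"key": "abc123", "title": "A cat", "description": "on the stairs"},
    {"key": "def456", "title": "Sunset", "description": "over the bay"},
]
with open("metadata.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")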


I hadn't thought about sharding the dataset, but great idea!
The file paths are naturally organized by their hash, so I could zip them based on the first 3 letters.
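Something like this rough sketch is what I have in mind (the paths are just placeholders):

import os
import zipfile
from collections import defaultdict

images_dir = "data/images"  # placeholder local directory with the extracted images
shards = defaultdict(list)
for root, _, files in os.walk(images_dir):
    for name in files:
        key = os.path.splitext(name)[0]
        shards[key[:3]].append(os.path.join(root, name))

# one zip shard per 3-letter hash prefix
for prefix, paths in shards.items():
    with zipfile.ZipFile(f"data/{prefix}.zip", "w") as zf:
        for path in paths:
            zf.write(path)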

Didn't know about JSON Lines, I'll give it a try.

So I expect the metadata in JSONL would take about 15 GB. Is it too much?

Here is a sample item (as a dict):

{'accuracy': None,
 'capturedevice': 'EASTMAN+KODAK+COMPANY+KODAK+CX4200+DIGITAL+CAMERA',
 'datetaken': '2004-09-06 10:47:16.0',
 'dateuploaded': '1094564954',
 'description': 'lounging+on+the+stairs',
 'downloadurl': 'http://farm1.staticflickr.com/1/364426_2b5099471f.jpg',
 'ext': 'jpg',
 'farmid': 1,
 'key': '601e28f77125baea9baa8591d1cbe48',
 'latitude': None,
 'licensename': 'Attribution-NonCommercial-ShareAlike License',
 'licenseurl': 'http://creativecommons.org/licenses/by-nc-sa/2.0/',
 'longitude': None,
 'machinetags': '',
 'marker': 0,
 'pageurl': 'http://www.flickr.com/photos/48600090655@N01/364426/',
 'photoid': 364426,
 'secret': '2b5099471f',
 'secretoriginal': '2b5099471f',
 'serverid': 1,
 'title': 'Lana',
 'uid': '48600090655@N01',
 'unickname': 'emmanslayer',
 'usertags': 'cat,stairs'}

I'm thinking of doing a bit of processing:

  • title and description: I'll decode them with urllib.parse.unquote_plus() (see the sketch after this list)
  • I'm wondering if I should keep all the metadata. Most likely I'm only gonna use description + maybe title + potentially usertags + key (to retrieve the corresponding file)
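For example, unquote_plus on the sample above gives:

from urllib.parse import unquote_plus

print(unquote_plus('lounging+on+the+stairs'))
# -> 'lounging on the stairs'
print(unquote_plus('EASTMAN+KODAK+COMPANY+KODAK+CX4200+DIGITAL+CAMERA'))
# -> 'EASTMAN KODAK COMPANY KODAK CX4200 DIGITAL CAMERA'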

Is there an example script that reads a metadata file (whatever format) and can stream from a zip file?

You can just write it as a regular dataset script, as if you were not streaming:

  • in _split_generators you download_and_extract the metadata file and all the image archives;
  • in _generate_examples you open your metadata file and iterate over each line to get your examples. For each example you open the associated image file from one of the extracted zip directories.

In streaming mode, all the download and extract steps in _split_generators are done virtually, when the dataset script is loaded. It is only when you iterate over the loaded dataset that _generate_examples is called. In _generate_examples, it is when open is called that the file starts to be downloaded and extracted.
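The skeleton looks roughly like this (a minimal sketch: the URLs, feature names and file layout are placeholders to adapt to your dataset):

import json
import os

import datasets

_METADATA_URL = "https://example.org/metadata.jsonl"  # placeholder
_IMAGES_URL = "https://example.org/images.zip"        # placeholder

class MyImageDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                "title": datasets.Value("string"),
                "image_path": datasets.Value("string"),
            })
        )

    def _split_generators(self, dl_manager):
        # In streaming mode these calls are virtual: nothing is downloaded yet.
        metadata_path = dl_manager.download_and_extract(_METADATA_URL)
        images_dir = dl_manager.download_and_extract(_IMAGES_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"metadata_path": metadata_path, "images_dir": images_dir},
            )
        ]

    def _generate_examples(self, metadata_path, images_dir):
        # In streaming mode, the actual download/extraction happens when open is called.
        with open(metadata_path, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                item = json.loads(line)
                yield idx, {
                    "title": item["title"],
                    "image_path": os.path.join(images_dir, item["key"] + ".jpg"),
                }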

Feel free to ping me if you have questions or if I can help you with the script :slight_smile:


All right, I'm gonna have more time to look into it. The dataset layout:

  • metadata.jsonl contains all the metadata, one item per line.
  • key identifies each file (ex: '601e28f77125baea9baa8591d1cbe48'):
    • the image is stored in a zip file at f"data/{key[:3]}.zip"
    • within that zip file, the image path is f"data/images/{key[:3]}/{key[3:6]}/{key}.jpg"

Downloading and extracting is a lot of data and takes a very long time locally!
Does that mean you keep the files permanently hosted?

Also, it seems that I'm supposed to pass URLs. Is that the case even when the files are hosted on datasets, which means I would use URLs such as https://huggingface.co/datasets/ENTITY/DATASET_NAME/resolve/main/…/FILE.zip, or are they considered local since they are on the Hub?

As for _split_generators, what if I don't have multiple splits and all my data is training data?
Should I return a list with only one split, named datasets.Split.TRAIN?
Is it recommended to arbitrarily define at least a validation split (and maybe a test split), knowing that I can always choose to use all the splits later?

Finally when I have an image, is it recommended to just yield the file path, and let the image loading be performed in a later map function?
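For reference, this is what I have in mind for the later map step (a rough sketch; the column names are placeholders):

from PIL import Image

def add_image_size(example):
    # Load the image from the yielded file path only at map time.
    with Image.open(example["image_path"]) as img:
        example["width"], example["height"] = img.size
    return example

# dataset = dataset.map(add_image_size)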

I created a loading script.

  • It works locally and returns metadata + path, such as '/home/boris/.cache/huggingface/datasets/downloads/extracted/e109da16b51111fe97c22ce8134c5d3beb024c4c7b4e7081337045446a5566c2/data/images/01c/d85/01cd8545592671568162de8758b76.jpg'
  • In streaming mode, it returns the path as 'zip:/::https:/huggingface.co/datasets/flax-community/YFCC100M_OpenAI_subset/resolve/main/data/d29.zip/data/images/d29/e7c/d29e7c6a3028418c64eb15e3cf577c2.jpg'

How can I access the file from this url?

Note: to test the script locally, I recommend keeping only the first 10 _SHARDS. It takes about 3-5 min to load the data and you will only have the images present in those shards.

Hi ! Currently the streaming mode doesn't support pathlib.Path; you have to use os.path.join instead.
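For example (guessing at the path construction in your script):

import os

images_dir = "data/images"                  # placeholder
key = "d29e7c6a3028418c64eb15e3cf577c2"     # placeholder key

# instead of: Path(images_dir) / key[:3] / key[3:6] / f"{key}.jpg"
image_path = os.path.join(images_dir, key[:3], key[3:6], f"{key}.jpg")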

You should end up with the url

zip://data/images/d29/e7c/d29e7c6a3028418c64eb15e3cf577c2.jpg::https://huggingface.co/datasets/flax-community/YFCC100M_OpenAI_subset/resolve/main/data/d29.zip

which is a compound URL. It allows reading/streaming a single file from a remote zip file.
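For example, you can open that single file directly with fsspec (a quick sketch based on your URL):

import fsspec
from PIL import Image

url = (
    "zip://data/images/d29/e7c/d29e7c6a3028418c64eb15e3cf577c2.jpg"
    "::https://huggingface.co/datasets/flax-community/YFCC100M_OpenAI_subset/resolve/main/data/d29.zip"
)
with fsspec.open(url) as f:
    img = Image.open(f)  # only this one member of the remote zip is read
    print(img.size)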

Feel free to open an issue on github on huggingface/datasets so that we can start working on the support for pathlib :wink:

I did a few tests on a TPU v3-8 with 8,320 images split over 50 shards (see my loading script).

Mode           | Load data + map (extract img dimensions) | Iterate through dataset
Local          | 24.2 s                                   | 1.62 s
Local (cached) | 0.023 s                                  | same
Streaming      | 0.072 s                                  | 26m 47s

Obviously in streaming mode the task is performed only when iterating.

Then I checked the time in streaming mode if I didn't open the file.

Mode      | Open img (extract img dimensions) | Time to iterate | Iterations per second
Streaming | No                                | 27.4 s          | 303.6
Streaming | Yes                               | 26m 47s         | 5.2

In streaming mode, we don't take advantage of multiprocessing during mapping. So I then tested the following:

  • a local dataset where img is an fsspec URL pointing to the remote file
  • a map that streams the files from multiple processes

Results:

  • it's actually not faster, probably because it uses requests, which reads from only one stream at a time
  • it sometimes results in errors (similar to when I use datasets-cli to download lots of large archives) due to streams having been closed. I think they are opened concurrently but read only one at a time, which sometimes results in timeouts.

Conclusions:

  • streaming data is currently a bottleneck for our application (we could only iterate through 0.5M images per day)
  • using local copies requires very large disks. The data is also duplicated since we have the archives + the uncompressed data + the arrow files. It could be improved a bit by removing the compressed files (maybe after this line)
  • there's also webdataset, which looks pretty cool (the videos are nice as well). Interestingly, it uses the tar format. I'm gonna test it out.

I've been doing some tests on the same dataset.

First, I tried to make parallel requests for image loading and decoding. I used batched mode to retrieve a number of items sequentially, and then a simple Pool to load the images from the remote fsspec URLs that point to zip files. Unfortunately, it didn't work: processes stalled and deadlocked.

Seeing that the backend of the fsspec URLs is HTTP, and that the fsspec library is capable of concurrent downloading of HTTP files, I tried to load several images at once using cat, as demonstrated here. That doesn't work either: since the fsspec file is zip::, the filesystem is resolved as a zip archive even though the backend is HTTP, and cat is not supported for lists of files inside zip archives.
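For reference, this is roughly the concurrent fetch that works for plain HTTP URLs but not for the zip:: compound ones (the URLs are placeholders):

import fsspec

urls = [
    "https://example.org/images/d29/e7c/d29e7c6a3028418c64eb15e3cf577c2.jpg",
    "https://example.org/images/01c/d85/01cd8545592671568162de8758b76.jpg",
]
fs = fsspec.filesystem("https")  # async-capable HTTP filesystem
contents = fs.cat(urls)          # fetches all URLs concurrently; returns {url: bytes}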

Then I used URLs that point to extracted images instead of entries inside zip files. I set up an http server pointing to an unarchived checkpoint of the dataset, and updated my loading script to return those URLs. This worked and was very fast, but the times are not comparable because the server was in the same LAN. More interestingly, a parallel Pool was able to retrieve images concurrently in this setup.

I'm going to repeat this last test in a more realistic scenario, loading images from a distant server, and then I'll post some figures.

Meanwhile, I have the following questions on how to proceed.

  • Are there any hints to make fsspec work in parallel? We could implement an async zip filesystem, but that looks like a lot of hassle.
  • Does it make sense to store the files in the dataset in unarchived format, instead of inside zips?

Thank you!


A couple of figures from my dataset loading tests using streaming mode from a remote server:

Dataset structure | Sequential iteration | Parallel iteration
Zip shards        | 5m 53s               | Unsupported
Unarchived images | 6m 20s               | 35s
  • Dataset volume
    20 small shards with a total of just 1398 images.

  • Server
    4 processors, 8 GB RAM, SSD disk, nginx http server, hosted in Amsterdam. Same for both tests.
    The ā€œZip shardsā€ dataset loading script returns compound fsspec URLs that refer to entries inside the zip files, whereas the ā€œUnarchived imagesā€ script returns direct URLs to the extracted image files, stored in the same server. Both versions use the same metadata archives in JSON lines format.

  • Parallel configuration
    A Pool of 16 parallel workers, with a pre-processing batch size of 128 items.
    The dataset map function is invoked as dataset.map(p_function, batched=True, batch_size=128). The mapping function p_function uses a Pool of 16 workers.

  • Client: computer with 16 processors, not a TPU.

The conclusions are the same as those I alluded to in my previous message: dataset streaming could be improved if fsspec retrieval of zip contents worked in parallel, but I couldn't make it work. If the dataset server stores unarchived files, then parallelization works fine, but file management becomes inconvenient for huge datasets. (In addition, parallelization needs to be handcrafted as part of the pre-processing code, as in the sketch below, but this could be abstracted into a class.)
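For reference, the handcrafted parallel pre-processing looks roughly like this (a sketch; the column names and the image decoding are placeholders for what I used in the test):

import io
from multiprocessing import Pool

import fsspec
from PIL import Image

def _fetch_and_decode(url):
    # Download one image and return its dimensions.
    with fsspec.open(url, "rb") as f:
        data = f.read()
    with Image.open(io.BytesIO(data)) as img:
        return img.size

def p_function(batch):
    # Fetch and decode the images of one batch with 16 parallel workers.
    with Pool(16) as pool:
        sizes = pool.map(_fetch_and_decode, batch["image_url"])
    batch["width"] = [w for w, _ in sizes]
    batch["height"] = [h for _, h in sizes]
    return batch

# dataset = dataset.map(p_function, batched=True, batch_size=128)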


Subscribed. We have a similar situation here with large vision datasets and are wondering whether we can utilize the Hugging Face paradigm. Maybe take care of the sharding when the dataset is being created? Splitting the CSV into proper shards would need to go into the data processing script.

You can indeed define CSV shards. Right now the datasets library parallelizes the download of those shards, but _generate_examples still runs in a single process. We plan to parallelize the dataset generation as well as long as you pass a list of shards to _generate_examples :slight_smile:

In streaming mode it will be the same: using a PyTorch DataLoader with num_workers > 1, for example, it will download and run _generate_examples in parallel over the different shards (you can track the progress of this feature here: Support DataLoader with num_workers > 0 in streaming mode by lhoestq · Pull Request #4375 · huggingface/datasets · GitHub)
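Once that lands, usage would look roughly like this (a sketch, assuming a recent version of datasets):

from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("flax-community/YFCC100M_OpenAI_subset", split="train", streaming=True)
ds = ds.with_format("torch")
# Each DataLoader worker streams and generates examples from a different subset of the shards.
dataloader = DataLoader(ds, num_workers=4, batch_size=32)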


Thanks Quentin for the reply; that makes sense! I see you can specify data_files to include the list of csv shards and the downloading will be done in parallel.

On a side note, I didn't see streaming mode being supported for tensorflow/tf.data; did I miss something, or is the feature coming in the next release?

Yep we will definitely work on supporting tf.data for streaming datasets :slight_smile: