Allow streaming of large datasets with image/audio

So I expect the metadata in JSONL to come to about 15 GB. Is that too much?
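
For context, the whole file never has to sit in memory: JSONL can be read lazily, line by line. A minimal sketch of what I mean, assuming the Hugging Face datasets library and a purely illustrative local filename:

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: records are parsed lazily,
# so memory use stays flat no matter how large the JSONL file is.
ds = load_dataset("json", data_files="metadata.jsonl", streaming=True, split="train")

for item in ds:
    print(item["title"])  # e.g. 'Lana' for the sample item below
    break
```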

Here is a sample item (as a dict):

{'accuracy': None,
 'capturedevice': 'EASTMAN+KODAK+COMPANY+KODAK+CX4200+DIGITAL+CAMERA',
 'datetaken': '2004-09-06 10:47:16.0',
 'dateuploaded': '1094564954',
 'description': 'lounging+on+the+stairs',
 'downloadurl': 'http://farm1.staticflickr.com/1/364426_2b5099471f.jpg',
 'ext': 'jpg',
 'farmid': 1,
 'key': '601e28f77125baea9baa8591d1cbe48',
 'latitude': None,
 'licensename': 'Attribution-NonCommercial-ShareAlike License',
 'licenseurl': 'http://creativecommons.org/licenses/by-nc-sa/2.0/',
 'longitude': None,
 'machinetags': '',
 'marker': 0,
 'pageurl': 'http://www.flickr.com/photos/48600090655@N01/364426/',
 'photoid': 364426,
 'secret': '2b5099471f',
 'secretoriginal': '2b5099471f',
 'serverid': 1,
 'title': 'Lana',
 'uid': '48600090655@N01',
 'unickname': 'emmanslayer',
 'usertags': 'cat,stairs'}

I’m thinking of doing a bit of processing:

- title and description: I'll decode them with urllib.parse.unquote_plus()
- I'm wondering whether I should keep all of the metadata. Most likely I'll only use description, maybe title, potentially usertags, plus key (to retrieve the corresponding file); see the sketch after this list.
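
Something like this, as a rough first pass (the keep-list mirrors the tentative fields above, and the helper name is just mine):

```python
from urllib.parse import unquote_plus

# Fields I'd tentatively keep; everything else is dropped.
KEEP = ("key", "title", "description", "usertags")

def process(item):
    """Trim an item to KEEP and decode its plus-escaped text fields."""
    out = {k: item[k] for k in KEEP}
    for field in ("title", "description", "usertags"):
        if out[field]:  # skip None / empty strings
            out[field] = unquote_plus(out[field])
    return out

# With the sample item above, 'lounging+on+the+stairs'
# becomes 'lounging on the stairs'.
```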