I am wondering whether there is a free way to use a large image-text dataset from the Hugging Face Hub (see below) for pretraining transformers without downloading it. If that is possible, please let me know how to do it.
CC3M: conceptual_captions · Datasets at Hugging Face
LAION2B-en: laion/laion2B-en · Datasets at Hugging Face
Hi!
You can use streaming to avoid downloading the datasets locally. These are the calls:
load_dataset("conceptual_captions", streaming=True)
load_dataset("laion/laion2B-en", streaming=True)
This is, of course, free. You can find more info on streaming here.
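A streamed dataset is an iterable that yields examples lazily, so you can inspect the first few without fetching the rest. Below is a minimal sketch of that idea. `peek` and `fake_stream` are hypothetical helpers made up for illustration, and the generator only stands in for the real dataset so the snippet runs offline; the commented lines show how the same helper would presumably be used with the calls above.

```python
from itertools import islice


def peek(stream, n=3):
    """Return the first n items of an iterable; the rest is never produced."""
    return list(islice(stream, n))


def fake_stream():
    # Hypothetical stand-in for a streamed dataset: yields examples lazily.
    for i in range(1_000_000):
        yield {"caption": f"caption {i}", "image_url": f"http://example.com/{i}.jpg"}


first = peek(fake_stream(), n=2)
print(first)

# With the real dataset this would look like (requires network access):
# from datasets import load_dataset
# dset = load_dataset("conceptual_captions", split="train", streaming=True)
# print(peek(dset, n=2))
```

Only the examples you actually iterate over are produced, which is why nothing has to be downloaded up front.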
The map method’s fn_kwargs parameter does not work with streaming=True, but it works otherwise. Here is my implementation:
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import io
import urllib.request

import PIL.Image
from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent

USER_AGENT = get_datasets_user_agent()

def fetch_single_image(image_url, timeout=None, retries=0):
    # Try the download up to retries + 1 times; return None if every attempt fails.
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": USER_AGENT},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image

def fetch_images(batch, num_threads, timeout=None, retries=0):
    # Download all images of a batch in parallel and store them in a new "image" column.
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
    return batch

num_threads = 20
dset = load_dataset("conceptual_captions", split="train", streaming=True)
# dset = dset.map(fetch_images, batched=True, batch_size=32)
dset = dset.map(fetch_images, batched=True, batch_size=32, fn_kwargs={"num_threads": num_threads})
print(next(iter(dset)))
Here is the error
Hi again! Currently, fn_kwargs is not supported as a param in map in the streaming mode. I’ll open a PR to fix that.
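Until that is supported, one possible workaround is to bind the extra arguments with functools.partial, so that map receives a function that takes only the batch. The sketch below shows just the binding step and runs offline: `fake_fetch` is a hypothetical stand-in for fetch_single_image that does no network access, and the final map call is left as a comment because it is an assumption that has not been run against the streamed dataset here.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fake_fetch(url, timeout=None):
    # Hypothetical stand-in for fetch_single_image; no network access.
    return f"image-for:{url}"


def fetch_images(batch, num_threads, timeout=None):
    fetch = partial(fake_fetch, timeout=timeout)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # executor.map returns results in the same order as the input URLs.
        batch["image"] = list(executor.map(fetch, batch["image_url"]))
    return batch


# Bind the keyword arguments up front instead of passing fn_kwargs.
fetch_images_bound = partial(fetch_images, num_threads=4, timeout=5)

example_batch = {"image_url": ["http://a/1.jpg", "http://a/2.jpg"]}
result = fetch_images_bound(example_batch)
print(result["image"])

# With a streamed dataset, the call would then be (assumption, not run here):
# dset = dset.map(fetch_images_bound, batched=True, batch_size=32)
```

Because the bound function takes a single positional argument, it should fit wherever map expects a plain batch function.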
Hi @mariosasko, thank you for helping me out here. Could you please cc me on the PR so that I know when it is done and can start using the feature? Thank you again.
This is the link to the PR: Add `fn_kwargs` param to `IterableDataset.map` by mariosasko · Pull Request #4975 · huggingface/datasets · GitHub. It will be included in the next release of datasets, which is planned for this week.
Pretraining on these datasets without downloading them is possible with the code provided in this thread. However, fetching the images on the fly is very slow, so I think you will have to download the datasets to keep GPU utilization high.