How to use large image-text datasets on the Hugging Face Hub for free without downloading them

I am wondering whether there is a way to use a large image-text dataset available on the Hugging Face Hub (see below) for pretraining transformers without downloading it, and for free. If that is possible, please let me know how to do it.

CC3M: conceptual_captions · Datasets at Hugging Face
LAION2B-en: laion/laion2B-en · Datasets at Hugging Face

Hi!

You can use streaming to avoid downloading the datasets locally. These are the calls:

  • load_dataset("conceptual_captions", streaming=True)
  • load_dataset("laion/laion2B-en", streaming=True)

This is, of course, free. You can find more info on streaming in the datasets streaming documentation.
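A streamed dataset is consumed lazily, one example at a time, so you can look at the first few examples without touching the rest. Below is a minimal offline sketch of that idea. The generator `fake_caption_stream` and the helper `peek` are hypothetical names introduced only for illustration; the generator stands in for the object returned by `load_dataset(..., streaming=True)`, and the field names `image_url` and `caption` are assumed to match conceptual_captions.

```python
from itertools import islice


def fake_caption_stream():
    """Hypothetical stand-in for a streamed dataset: yields examples lazily, forever."""
    index = 0
    while True:
        yield {
            "image_url": f"http://example.com/{index}.jpg",
            "caption": f"caption {index}",
        }
        index += 1


def peek(stream, n):
    """Return the first n examples without consuming the rest of the stream."""
    return list(islice(stream, n))


# With the real dataset this would be, for instance:
#   dset = load_dataset("conceptual_captions", split="train", streaming=True)
#   first_examples = peek(dset, 3)
first_examples = peek(fake_caption_stream(), 3)
for example in first_examples:
    print(example["caption"], example["image_url"])
```

Only the examples that are actually requested get produced, which is why nothing has to be downloaded up front.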

The map method's fn_kwargs does not work with streaming=True, but it works otherwise. Here is my implementation:

    import io
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor
    from functools import partial

    import PIL.Image
    from datasets import load_dataset
    from datasets.utils.file_utils import get_datasets_user_agent

    USER_AGENT = get_datasets_user_agent()

    def fetch_single_image(image_url, timeout=None, retries=0):
        # Try the download up to retries + 1 times; return None if every attempt fails.
        for _ in range(retries + 1):
            try:
                request = urllib.request.Request(
                    image_url,
                    data=None,
                    headers={"user-agent": USER_AGENT},
                )
                with urllib.request.urlopen(request, timeout=timeout) as req:
                    image = PIL.Image.open(io.BytesIO(req.read()))
                break
            except Exception:
                image = None
        return image

    def fetch_images(batch, num_threads, timeout=None, retries=0):
        # Download all images of a batch in parallel and store them in a new "image" column.
        fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
        with ThreadPoolExecutor(max_workers=num_threads) as executor:
            batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
        return batch

    num_threads = 20
    dset = load_dataset("conceptual_captions", split="train", streaming=True)
    # dset = dset.map(fetch_images, batched=True, batch_size=32)
    # The next call is the one that fails in streaming mode because of fn_kwargs.
    dset = dset.map(fetch_images, batched=True, batch_size=32, fn_kwargs={"num_threads": num_threads})
    print(next(iter(dset)))

Here is the error

Hi again! Currently, fn_kwargs is not supported as a param in map in the streaming mode. I’ll open a PR to fix that.
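Until a release with that fix is available, one possible workaround (a suggestion, not something proposed in the reply above) is to bind the extra arguments with `functools.partial`, so that the function handed to `map` takes only the batch. The sketch below runs offline: `fake_fetch` and the `fetch_fn` parameter are hypothetical additions that replace the real download step, and the `dset.map` call is shown only as a comment.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fetch_images(batch, num_threads, fetch_fn):
    """Apply fetch_fn to every URL in the batch using a thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_fn, batch["image_url"]))
    return batch


def fake_fetch(url):
    # Hypothetical stand-in for fetch_single_image so the sketch needs no network.
    return f"image-for:{url}"


# Bind the extra arguments up front; the resulting callable takes only `batch`.
fetch_images_bound = partial(fetch_images, num_threads=4, fetch_fn=fake_fetch)

# With a streaming dataset this would be passed without fn_kwargs:
#   dset = dset.map(fetch_images_bound, batched=True, batch_size=32)
batch = {"image_url": ["http://example.com/a.jpg", "http://example.com/b.jpg"]}
result = fetch_images_bound(batch)
print(result["image"])
```

Because `executor.map` preserves input order, the images line up with their URLs in the batch.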

Hi @mariosasko, thank you for helping me out here. Could you please cc me on the PR so that I know when it is merged and can start using the feature? Thank you again.

This is the link to the PR: Add `fn_kwargs` param to `IterableDataset.map` by mariosasko · Pull Request #4975 · huggingface/datasets · GitHub. It will be included in the next release of datasets, which is planned for this week.


It is possible with the code provided in this thread. However, fetching the images on the fly is very slow, so I think you will have to download those datasets to keep GPU utilization high.