How to use large image-text datasets on the Hugging Face Hub without downloading them, for free

Sunny111 · September 8, 2022, 4:11pm

I am wondering whether there is a way to use a large image-text dataset available on the Hugging Face Hub (see below) for pretraining transformers without downloading it, and for free. If that is possible, please also let me know how to do it.

CC3M: conceptual_captions · Datasets at Hugging Face
LAION2B-en: laion/laion2B-en · Datasets at Hugging Face

mariosasko · September 8, 2022, 6:09pm

Hi!

You can use streaming to avoid downloading the datasets locally. These are the calls:

load_dataset("conceptual_captions", streaming=True)
load_dataset("laion/laion2B-en", streaming=True)

This is, of course, free. You can find more info on streaming in the datasets streaming documentation.
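As a minimal sketch of what working with a streamed dataset looks like: with streaming=True and a split given, load_dataset returns an iterable dataset, so examples are pulled lazily as you iterate. The helper names peek and open_stream below are illustrative and not part of the datasets library.

```python
from itertools import islice


def peek(stream, n=3):
    """Return the first n examples of any iterable, e.g. a streamed dataset."""
    return list(islice(stream, n))


def open_stream(name="conceptual_captions", split="train"):
    """Open a Hub dataset in streaming mode (needs `datasets` and network access)."""
    # Imported lazily so that peek() can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(name, split=split, streaming=True)
```

Something like peek(open_stream(), 3) would then fetch only the first three records rather than the whole dataset.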

Sunny111 · September 10, 2022, 12:33am

The map method’s fn_kwargs is not working with streaming=True, but it works otherwise. Here is my implementation:

import io
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import PIL.Image
from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent

USER_AGENT = get_datasets_user_agent()


def fetch_single_image(image_url, timeout=None, retries=0):
    # Try the download up to retries + 1 times; return None if every attempt fails.
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": USER_AGENT},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image


def fetch_images(batch, num_threads, timeout=None, retries=0):
    # Download all images of a batch in parallel and store them under "image".
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
    return batch


num_threads = 20
dset = load_dataset("conceptual_captions", split="train", streaming=True)
# dset = dset.map(fetch_images, batched=True, batch_size=32)
dset = dset.map(fetch_images, batched=True, batch_size=32, fn_kwargs={"num_threads": num_threads})
print(next(iter(dset)))

Here is the error

mariosasko · September 12, 2022, 10:33am

Hi again! Currently, fn_kwargs is not supported as a param in map in the streaming mode. I’ll open a PR to fix that.
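Until fn_kwargs is available in streaming mode, one possible workaround is to bind the extra arguments to the function with functools.partial before passing it to map. The sketch below assumes only that the dataset object exposes a map method accepting a callable plus keyword arguments; map_with_kwargs is a hypothetical helper, not a datasets API.

```python
from functools import partial


def map_with_kwargs(dset, fn, fn_kwargs=None, **map_kwargs):
    """Call dset.map with fn_kwargs pre-bound to fn (hypothetical helper)."""
    # partial() fixes the keyword arguments up front, so map() only ever
    # needs to call the resulting function with the batch itself.
    bound = partial(fn, **(fn_kwargs or {}))
    return dset.map(bound, **map_kwargs)
```

With the code from the post above, this could be called as dset = map_with_kwargs(dset, fetch_images, {"num_threads": 20}, batched=True, batch_size=32).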

Sunny111 · September 13, 2022, 3:49pm

Hi @mariosasko, thank you for helping me out here. Could you please cc me on the PR so that I know when it is done and can start using the feature? Thank you again.

mariosasko · September 13, 2022, 4:21pm

This is the link to the PR: Add `fn_kwargs` param to `IterableDataset.map` by mariosasko · Pull Request #4975 · huggingface/datasets · GitHub. It will be included in the next release of datasets, which is planned for this week.

bbergner · November 12, 2023, 9:12pm

It is possible with the code provided in this thread. However, it is very slow to do so on-the-fly. I think you will have to download those datasets to ensure a high GPU utilization rate.
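To check whether on-the-fly fetching is fast enough for your setup, a rough throughput measurement can help. The sketch below is an illustration only: examples_per_second is a hypothetical helper that times how quickly the first n examples can be pulled from any iterable, such as a streamed dataset with an image-fetching map applied.

```python
import time
from itertools import islice


def examples_per_second(stream, n=256):
    """Pull up to n examples from an iterable and return the observed rate."""
    start = time.perf_counter()
    count = sum(1 for _ in islice(stream, n))
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very fast iterables.
    return count / max(elapsed, 1e-9)
```

Comparing this rate with the number of examples per second your training loop consumes gives a first indication of whether the data pipeline would leave the GPU idle.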
