Problems loading and managing extremely large datasets (like SA1B)

Hi! I’m working on SA1B data loading at the moment, but I’m running into some trouble with loading and caching.

Let me first describe my setup: I’ve downloaded all of the SA1B data (10 GB per tar file, 1000 tar files, 10 TB in total) onto my data server (not the GPU server, which only has 500 GB of storage available), and I’ve also written a loading script.

However, in the loading script running on the GPU server, I do not download all the tar files in _split_generators, for fear of blowing up my local storage. Instead, I download them one tar file at a time in _generate_examples. Here is the code snippet:

    def _split_generators(self, dl_manager: datasets.DownloadManager):
        # Do not download the tar files here; just list their URLs
        sa1b_tar_list = self.config.sa1b_tar_list
        # list of URLs pointing at the tar files on my data server

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "sa1b_tar_list": sa1b_tar_list,
                    "iter_archive_func": dl_manager.iter_archive,
                    "download_func": dl_manager.download,
                },
            ),
        ]


    def _generate_examples(
        self,
        sa1b_tar_list,
        iter_archive_func,
        download_func,
    ):
        for sa1b_tar in sa1b_tar_list:  # zip() here was a bug: it yields 1-tuples
            # dl_manager.download returns the local path of the fetched file
            local_path = download_func(sa1b_tar)
            with open(local_path, "rb") as f:
                archive = iter_archive_func(f)
                data_iter = self._process_one_tar(archive)
                for ret in data_iter:
                    yield ret

    def _process_one_tar(
        self,
        sa1b_tar,
    ):
        # ...
        yield image_id, dict(
            **image_dict,
        )

I am sure there are problems in this code:

  1. Following the docs, files should be downloaded with dl_manager.download in _split_generators, but I’m afraid of blowing up the local machine’s storage.
  2. Although I moved the download call into _generate_examples, the fetched files are not cleared once they have been loaded, so they gradually take up all the local storage. Are there any cache-management strategies for such an extremely big dataset? I think directly removing the used files is problematic, since I use multiple processes in PyTorch’s DataLoader to load the data.
  3. I assume this is the use case for streaming mode / dl_manager.iter_archive. Is that correct? What is the best practice for this situation?
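To make point 2 concrete, here is a minimal sketch (pure standard library; `fetch` is a hypothetical helper standing in for whatever actually copies a tar to local disk) of the download-then-delete pattern I have in mind. With multiple DataLoader workers, each worker would only ever delete files it downloaded into its own scratch directory:

```python
import tarfile
import tempfile


def stream_examples(tar_paths, fetch):
    """Download one tar at a time into a scratch dir, yield its
    members, then delete the local copy before moving on."""
    for remote_path in tar_paths:
        with tempfile.TemporaryDirectory() as scratch:
            # fetch() is a hypothetical helper: copies the tar into
            # `scratch` and returns the local path
            local_path = fetch(remote_path, scratch)
            with tarfile.open(local_path) as tar:
                for member in tar:
                    if member.isfile():
                        yield member.name, tar.extractfile(member).read()
        # the TemporaryDirectory is removed here, so local disk usage
        # stays bounded to roughly one tar file per worker
```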

Many thanks in advance! :hugs:

If you want to save disk space I’d encourage you to use streaming, e.g.

load_dataset("path/to/script.py", streaming=True)

In that case all the dl_manager calls are done lazily, so you can actually call dl_manager.download inside _split_generators this way :wink:

Also, I confirm that dl_manager.iter_archive is indeed the way to go to stream TAR archives!
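Conceptually, iter_archive walks the TAR member by member and yields (path, file-like object) pairs without extracting anything to disk. Here is a rough standard-library sketch of that behavior (not the actual implementation):

```python
import tarfile


def iter_tar(fileobj):
    """Yield (member_path, file_object) pairs from an open TAR,
    mimicking dl_manager.iter_archive, without extracting to disk."""
    # mode "r|*" reads the archive as a forward-only stream, so this
    # also works on non-seekable inputs; each yielded file object must
    # be read before advancing to the next member
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)
```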


Thank you! It definitely works well! :v: