Problems loading and managing extremely large datasets (like SA1B)

Hi! I’m working on SA1B data loading at the moment, but I’m running into some trouble with loading and caching.

Let me first describe my setup: I’ve downloaded all of the SA1B data (10 GB per tar file, 1000 tar files, 10 TB in total) onto my data server (not the GPU server, which only has 500 GB of storage available), and I’ve also written a loading script.

However, in the loading script running on the GPU server, I do not download all the tar files in _split_generators, for fear of blowing up my local storage. Instead, I download them one tar file at a time in _generate_examples. Here is the code snippet:

    def _split_generators(self, dl_manager: datasets.DownloadManager):
        # Do not download the tar files here; just list their URLs
        sa1b_tar_list = self.config.sa1b_tar_list
        # list of URLs pointing at the tar files on my data server

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "sa1b_tar_list": sa1b_tar_list,
                    "iter_archive_func": dl_manager.iter_archive,
                    "download_func": dl_manager.download,
                },
            ),
        ]


    def _generate_examples(
        self,
        sa1b_tar_list,
        iter_archive_func,
        download_func,
    ):
        for sa1b_tar in sa1b_tar_list:  # zip() here was a bug: it yields 1-tuples
            # dl_manager.download returns the local path of the fetched file
            local_path = download_func(sa1b_tar)
            with open(local_path, "rb") as f:
                archive = iter_archive_func(f)
                data_iter = self._process_one_tar(archive)
                for ret in data_iter:
                    yield ret

    def _process_one_tar(
        self,
        sa1b_tar,
    ):
        # ...
        yield image_id, dict(
            **image_dict,
        )

I am sure there are problems in this code:

  1. Following the docs, files should be downloaded with dl_manager.download in _split_generators, but I’m afraid of blowing up the local machine’s storage.
  2. Although I moved the download call into _generate_examples, the fetched files are not cleared once they have been loaded, so they gradually take up all the local storage. Are there any cache-management strategies for such an extremely big dataset? I think directly removing the used files is problematic, since I use multiple processes in PyTorch’s DataLoader to load the data.
  3. I assume this is the use case for streaming mode / dl_manager.iter_archive. Is that correct? What is the best practice for this situation?
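To make point 2 concrete, here is a minimal sketch (pure standard library; `fetch` is a hypothetical helper standing in for whatever actually copies a tar to local disk) of the download-then-delete pattern I have in mind. With multiple DataLoader workers, each worker would only ever delete files it downloaded into its own scratch directory:

```python
import tarfile
import tempfile


def stream_examples(tar_paths, fetch):
    """Download one tar at a time into a scratch dir, yield its
    members, then delete the local copy before moving on."""
    for remote_path in tar_paths:
        with tempfile.TemporaryDirectory() as scratch:
            # fetch() is a hypothetical helper: copies the tar into
            # `scratch` and returns the local path
            local_path = fetch(remote_path, scratch)
            with tarfile.open(local_path) as tar:
                for member in tar:
                    if member.isfile():
                        yield member.name, tar.extractfile(member).read()
        # the TemporaryDirectory is removed here, so local disk usage
        # stays bounded to roughly one tar file per worker
```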

Many thanks in advance! :hugs:

If you want to save disk space I’d encourage you to use streaming, e.g.

load_dataset("path/to/script.py", streaming=True)

In that case all the dl_manager calls are done lazily, so you can actually call dl_manager.download inside _split_generators this way :wink:

Also, I confirm that dl_manager.iter_archive is indeed the way to go to stream TAR archives!
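Conceptually, iter_archive walks the TAR member by member and yields (path, file-like object) pairs without extracting anything to disk. Here is a rough standard-library sketch of that behavior (not the actual implementation):

```python
import tarfile


def iter_tar(fileobj):
    """Yield (member_path, file_object) pairs from an open TAR,
    mimicking dl_manager.iter_archive, without extracting to disk."""
    # mode "r|*" reads the archive as a forward-only stream, so this
    # also works on non-seekable inputs; each yielded file object must
    # be read before advancing to the next member
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)
```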


Thank you! It definitely works well! :v: