Optimizing Disk Usage for Large (Audio) Datasets

John6666 · November 29, 2024, 2:29pm

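I’m not very familiar with the datasets library, but could iter_archive be used here? Going by its docstring, it iterates over the files inside an archive and yields (path, file object) pairs, so you may be able to read each audio file straight from the archive instead of extracting everything to disk first.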

huggingface/datasets/blob/3.1.0/src/datasets/download/download_manager.py#L234

    def _download_single(self, url_or_filename: str, download_config: DownloadConfig) -> str:
        url_or_filename = str(url_or_filename)
        if is_relative_path(url_or_filename):
            # append the relative path to the base_path
            url_or_filename = url_or_path_join(self._base_path, url_or_filename)
        out = cached_path(url_or_filename, download_config=download_config)
        out = tracked_str(out)
        out.set_origin(url_or_filename)
        return out

    def iter_archive(self, path_or_buf: Union[str, io.BufferedReader]):
        """Iterate over files within an archive.

        Args:
            path_or_buf (`str` or `io.BufferedReader`):
                Archive path or archive binary file object.

        Yields:
            `tuple[str, io.BufferedReader]`:
                2-tuple (path_within_archive, file_object).
                File object is opened in binary mode.
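
To make the idea concrete, here is a rough standard-library sketch of the same pattern the docstring describes: walking a TAR archive sequentially and yielding (path_within_archive, file_object) pairs without unpacking anything to disk. This is not the datasets implementation. The names iter_tar_members, build_demo_archive, and member_sizes are hypothetical helpers made up for illustration, and the tiny in-memory archive stands in for a real audio archive. In a loading script you would presumably pass the downloaded archive to dl_manager.iter_archive and loop over it the same way.

```python
import io
import tarfile
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_tar_members(fileobj: BinaryIO) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield (path_within_archive, file_object) pairs from a TAR stream.

    Hypothetical helper, NOT part of `datasets`; it only mimics the
    yield shape documented for `iter_archive`.
    """
    # "r|*" reads the archive as a forward-only stream and detects the
    # compression automatically, so nothing is extracted to disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, etc.
            extracted = tar.extractfile(member)
            if extracted is not None:
                # In stream mode the caller must read this file object
                # before asking for the next member.
                yield member.name, extracted


def build_demo_archive() -> bytes:
    """Build a small gzip-compressed TAR in memory (stand-in for real data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in [("clips/a.wav", b"RIFF-a"), ("clips/b.wav", b"RIFF-bb")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def member_sizes(archive_bytes: bytes) -> Dict[str, int]:
    """Read every member once and record how many bytes it contains."""
    sizes = {}
    for path, file_obj in iter_tar_members(io.BytesIO(archive_bytes)):
        sizes[path] = len(file_obj.read())
    return sizes


if __name__ == "__main__":
    for path, size in member_sizes(build_demo_archive()).items():
        print(path, size)
```

The point of the streaming mode is that only one member is held in memory at a time, so disk usage stays at the size of the compressed archive rather than archive plus extracted copy. The trade-off is that access is strictly sequential: you cannot jump back to an earlier file without reopening the archive.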
