Creating a batch from a list of ids in the dataset is very slow

Hi all,

For a given use case, I need to forge batches where the sampling depends on the elements within the batch, e.g. to create batches of similar elements.
The solution I found online for such approaches is to first get elements from the dataloader (as usual) and then get additional elements depending on these first ones in the collate function.
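
Roughly, the pattern I mean looks like this (just a sketch, assuming train_dataset is the Hugging Face dataset; mine_similar_ids and the "id"/"labels" columns are placeholders for my actual mining logic and schema):

    import torch
    from torch.utils.data import DataLoader

    def mine_similar_ids(initial_ids):
        # Placeholder for the mining step: return the ids of extra examples that are
        # similar to the initial ones (e.g. nearest neighbours in some index).
        return []

    def collate_fn(initial_examples):
        initial_ids = [ex["id"] for ex in initial_examples]   # assumes an "id" column
        mined_ids = initial_ids + mine_similar_ids(initial_ids)
        examples = train_dataset[mined_ids]                    # this indexing step is the slow part
        return {
            # assumes the sequences are already padded to the same length
            "input_ids": torch.tensor(examples["input_ids"], dtype=torch.long),
            "labels": torch.tensor(examples["labels"], dtype=torch.long),
        }

    loader = DataLoader(train_dataset, batch_size=8, collate_fn=collate_fn)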

So I built a custom collate function where I get a list of ids corresponding to the initial elements and the additional mined ones, and I try to return the corresponding batch. However, constructing the dictionary to return is really slow. It seems to come from the selection of items given the list of ids (examples = train_dataset[mined_ids]).
I also tried to use examples = train_dataset.select(mined_ids), which is faster, but then accessing columns is very slow (input_ids = torch.tensor([example["input_ids"] for example in examples], dtype=torch.long)).
Finally, I also tried to use set_format to directly get numpy arrays/tensors, but it is still very slow.
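
In code, besides the direct indexing shown in the sketch above, the two other variants I tried look like this (column names other than input_ids are placeholders):

    import torch

    # .select() itself is fast, but accessing columns example by example afterwards is slow
    examples = train_dataset.select(mined_ids)
    input_ids = torch.tensor([example["input_ids"] for example in examples], dtype=torch.long)

    # asking the dataset to return numpy directly is still slow when indexing with a list of ids
    train_dataset.set_format(type="numpy", columns=["input_ids", "labels"])
    examples = train_dataset[mined_ids]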

Is there a better way to create batches given a list of ids, or another way to build batches where elements depend on other elements of the batch?

After digging more into this, it is clearly coming from the _getitem() function, but not from querying the table; rather, from formatting the resulting table.
More precisely, it comes from this function:

    def extract_batch(self, pa_table: pa.Table) -> dict:
        return {col: self._arrow_array_to_numpy(pa_table[col]) for col in pa_table.column_names}

I suppose this is due to the stored format of the data, but it seems like a numpy dataset allows faster indexing through a list of ids.
Is there a special method I’m missing? Is there any workaround?
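
The only workaround I have found so far (just a sketch, and it assumes the sequences are already padded to a fixed length and that the columns fit in RAM) is to pay the extraction cost once up front, keep the columns as plain numpy arrays, and index those in the collate function instead of the Arrow-backed dataset:

    import numpy as np
    import torch

    # One-time conversion: pays the Arrow -> Python -> numpy cost once instead of per batch.
    # Assumes fixed-length sequences, so the result is a regular 2-D array.
    all_input_ids = np.asarray(train_dataset["input_ids"], dtype=np.int64)
    all_labels = np.asarray(train_dataset["labels"], dtype=np.int64)

    def fast_batch(mined_ids):
        # Fancy indexing on contiguous numpy arrays is fast, even for large id lists.
        ids = np.asarray(mined_ids)
        return {
            "input_ids": torch.from_numpy(all_input_ids[ids]),
            "labels": torch.from_numpy(all_labels[ids]),
        }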

I’m running into this myself. PythonArrowExtractor.extract_batch is very slow.

Is there a better or faster way to do this? It’s causing my GPUs to be work-starved while data is being extracted from the arrow table. I chose Arrow because it was supposed to be memory-mapped, zero-copy, fast, etc., but this seems largely untrue now.

I tried a suggestion from the thread "Local dataset loading performance: HF's arrow vs torch.load" (#3 by mztelus) to call .with_format('torch'), but that did NOT help either. Now most of the time is spent in PyArrow’s ChunkedArray.to_numpy() method (pyarrow.ChunkedArray, Apache Arrow v18.0.0 docs).
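
For reference, this is roughly what that looks like (dataset, list_of_ids and the column names here are placeholders for my setup):

    # Ask the dataset to return torch tensors directly instead of Python objects.
    dataset = dataset.with_format("torch", columns=["input_ids", "labels"])
    batch = dataset[list_of_ids]     # still slow: the time just moves into ChunkedArray.to_numpy()
    input_ids = batch["input_ids"]   # already a torch.Tensor with this format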

Update: for me, a suggestion from @nbroad helped: increasing the number of dataloader workers to 2. I also increased the prefetch factor to 16.
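
In DataLoader terms, that is roughly the following (batch size and collate_fn being whatever you already use):

    from torch.utils.data import DataLoader

    # More workers plus a larger prefetch factor lets batch construction overlap with GPU work
    # (prefetch_factor only applies when num_workers > 0).
    loader = DataLoader(
        train_dataset,
        batch_size=8,
        collate_fn=collate_fn,
        num_workers=2,
        prefetch_factor=16,
    )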

Hi! I am experiencing the same issue. I have a dataset with images, not too many, about 2,000. I want to create embeddings from a ViT model using the .map() method.
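
Roughly what I am doing (a sketch; the checkpoint name and the image/embedding column names are just my setup):

    import torch
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
    model = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()

    def embed(batch):
        # With batched=True, batch["image"] is a list of PIL images.
        inputs = processor(images=batch["image"], return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # Use the CLS token as the embedding for each image.
        batch["embedding"] = outputs.last_hidden_state[:, 0].numpy()
        return batch

    dataset = dataset.map(embed, batched=True, batch_size=32)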

The issue I see is that, compared to iterating over the dataset in a simple for loop, it is ~5 times slower!

I tried many different options, like various batch sizes and keep_in_memory=True, but it makes no difference. The only thing that makes a difference is to set batched=False. Then I can see it runs at the same speed as the for loop, but it gets stuck at a certain point, waiting for something, before continuing, which is also not ideal.

It’s kind of OK for smaller datasets, but definitely an issue for bigger ones. Does anyone have similar experiences? Or any idea what could be wrong? Or is this normal behaviour with the map() function?
