Datasets mapper hanging issue

There’s a lot to detail here because what is happening is not obvious. I’ll try to simplify it:

  • I’m using Dataset.map to encode a dataset that has Sequence, Array3D and Array2D features. Note that the Array2D features have a dynamic first dimension. (A minimal sketch of this setup is included further below.)

  • I can successfully encode and save to disk locally on a fraction of the data. I can also encode and save to S3, running on SageMaker, on about a third of the data (~3,000 rows).

  • However, when I run on the full dataset (~9,000 rows) on SageMaker with 16 processes and a batch size of 1000, the mapper reaches the end of mapping across all processes and then hangs indefinitely.

  • Oddly, when I manually press “stop” on the SageMaker processing instance, whatever “stop” does jostles the CPU(s) out of their hanging state and the mapper proceeds to the next step (where it appears to be flattening the indices) before the instance dies.

  • In an attempt to debug, I ran the same dataset and same code but with num_proc=1. This time mapping fully completed, but I got an error on the save_to_disk call:

  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1348, in save_to_disk
    dataset = self.flatten_indices(num_proc=num_proc) if self._indices is not None else self
  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3541, in flatten_indices
    return self.map(
  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2953, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3346, in _map_single
    writer.write_batch(batch)
  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py", line 555, in write_batch
    self.write_table(pa_table, writer_batch_size)
  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py", line 567, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 3241, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

This is odd because I’ve used Array2D successfully multiple times in the past.

I tried this with num_proc=1 and with both batch_size=1000 and batch_size=100, and got the same error.
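
For reference, here is a minimal sketch of the kind of setup described above. The feature names, shapes, encode function and num_proc value are hypothetical placeholders based on the description, not the actual pipeline:

from datasets import Array2D, Array3D, Dataset, Features, Sequence, Value

# Hypothetical schema mirroring the description: a Sequence, an Array3D, and an
# Array2D whose first dimension is dynamic (shape=(None, ...)).
features = Features({
    "tokens": Sequence(Value("int64")),
    "image": Array3D(shape=(3, 4, 4), dtype="float32"),
    "boxes": Array2D(shape=(None, 4), dtype="float32"),
})

raw = Dataset.from_dict({"n_boxes": list(range(1, 101))})  # stand-in for the real data

def encode(batch):
    # Stand-in encoder that emits the three feature types for each example.
    return {
        "tokens": [[i, i + 1] for i in batch["n_boxes"]],
        "image": [[[[0.0] * 4] * 4] * 3 for _ in batch["n_boxes"]],
        "boxes": [[[0.0, 0.0, 1.0, 1.0]] * n for n in batch["n_boxes"]],
    }

encoded = raw.map(
    encode,
    batched=True,
    batch_size=1000,
    num_proc=16,                   # multiprocess mapping, as in the report above
    remove_columns=["n_boxes"],
    features=features,
)
encoded.save_to_disk("/tmp/encoded")  # the step that hung / errored in the report
                                      # (the report saved to S3 from SageMaker)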

This problem was caused by datasets==2.10.0. Downgrading to 2.9.0 fixed it.
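
In case it helps anyone else, a small illustrative guard (the message text is just ours, not anything from the library):

import warnings

import datasets

# datasets 2.10.0 is where the hang / ArrowNotImplementedError showed up for us;
# 2.9.0 worked, so flag the affected version early rather than at save time.
if datasets.__version__ == "2.10.0":
    warnings.warn(
        "datasets 2.10.0 failed to combine chunks of Array2D/Array3D extension "
        "arrays in this pipeline (pyarrow ArrowNotImplementedError); 2.9.0 worked."
    )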

Hi ! This error comes from Arrow, which doesn’t implement concatenation for extension types yet: [C++] Can't concatenate extension arrays · Issue #31868 · apache/arrow · GitHub

It was working in datasets 2.9 because the .combine_chunks() call was missing there; that call is needed to avoid ending up with one-row record batches, which cause performance issues.
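
To make the Arrow-side limitation concrete, here is a small standalone sketch. The DummyExtensionType is made up for illustration; datasets’ Array2DExtensionType is likewise built on pa.PyExtensionType, which is why the error message mentions arrow.py_extension_type. On pyarrow versions that predate the fix for the linked issue, both calls at the bottom raise ArrowNotImplementedError:

import pyarrow as pa

# Made-up extension type, for illustration only.
class DummyExtensionType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32()))

    def __reduce__(self):  # required so the type can be pickled
        return DummyExtensionType, ()

storage = pa.array([[0.0, 1.0], [2.0]], type=pa.list_(pa.float32()))
ext_array = pa.ExtensionArray.from_storage(DummyExtensionType(), storage)

# A one-column table whose column holds two chunks of the extension array.
table = pa.Table.from_arrays([pa.chunked_array([ext_array, ext_array])], names=["x"])

# Concatenating extension arrays, directly or via combine_chunks, is what Arrow
# didn't implement at the time; newer pyarrow releases handle it.
pa.concat_arrays([ext_array, ext_array])
table.combine_chunks()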

Unfortunately I don’t have any workaround for now other than using an older version of datasets, though we should definitely discuss with the Arrow community what we can do about this.