There's a lot to detail here because what is happening is not obvious. I'll try to simplify it:
- Using `Dataset.map` to encode a dataset that has Sequence, Array3D, and Array2D features. Note that the Array2D features have a dynamic first dimension (a rough sketch of the setup is at the end of this post).
- I can successfully encode and save to disk locally on a fraction of the data. I can also encode and save to S3, running on SageMaker, on a third of the data (~3,000 rows).
- However, when I run on the full dataset (~9,000 rows) on SageMaker with 16 processes and a batch size of 1000, the mapper gets to the end of mapping across all processes and then hangs indefinitely.
- Oddly, when I manually press "stop" on the SageMaker processing instance, whatever "stop" does jostles the CPU(s) out of their hanging state and the mapper proceeds to the next step (where it appears to be flattening the indices) before the instance dies.
- In an attempt to debug, I ran the same dataset and the same code but with `num_proc=1`. This time the mapping fully completed, but I got an error on the `save_to_disk` call:
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1348, in save_to_disk
dataset = self.flatten_indices(num_proc=num_proc) if self._indices is not None else self
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
out = func(dataset, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3541, in flatten_indices
return self.map(
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2953, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3346, in _map_single
writer.write_batch(batch)
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py", line 555, in write_batch
self.write_table(pa_table, writer_batch_size)
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py", line 567, in write_table
pa_table = pa_table.combine_chunks()
File "pyarrow/table.pxi", line 3241, in pyarrow.lib.Table.combine_chunks
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>
This is odd because I've used Array2D successfully multiple times in the past.
I retried with `num_proc=1` using both `batch_size=1000` and `batch_size=100` and got the same error each time.
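For reference, here is roughly what the setup looks like. This is a minimal sketch, not my actual code: the feature names, shapes, and the encode function are placeholders, but the structure (a Sequence feature, an Array3D feature, Array2D features with a dynamic first dimension, encoded via `map` and then written with `save_to_disk`) matches what I'm running:

```python
# Minimal sketch -- feature names, shapes, and the encode function are placeholders,
# not the real pipeline.
import numpy as np
from datasets import Array2D, Array3D, Dataset, Features, Sequence, Value

features = Features(
    {
        "tokens": Sequence(Value("int64")),
        "boxes": Array2D(shape=(None, 4), dtype="float32"),   # dynamic first dimension
        "image": Array3D(shape=(3, 32, 32), dtype="float32"),
    }
)

def encode(batch):
    # Stand-in for the real encoding logic.
    batch["boxes"] = [np.asarray(b, dtype="float32") for b in batch["boxes"]]
    batch["image"] = [np.asarray(img, dtype="float32") for img in batch["image"]]
    return batch

ds = Dataset.from_dict(
    {
        "tokens": [[1, 2, 3], [4, 5]],
        "boxes": [
            [[0.0, 0.0, 1.0, 1.0]],
            [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
        ],
        "image": [np.zeros((3, 32, 32)).tolist(), np.zeros((3, 32, 32)).tolist()],
    }
)

encoded = ds.map(
    encode,
    batched=True,
    batch_size=1000,   # also tried 100
    num_proc=1,        # 16 in the run that hangs on SageMaker
    features=features,
)
encoded.save_to_disk("encoded_dataset")  # this is the call that produced the traceback above
```

One thing worth noting from the traceback: `save_to_disk` only calls `flatten_indices` when the dataset has an indices mapping, so my dataset evidently has one from an earlier operation, and it's the `combine_chunks` on the Array2D extension arrays inside that re-map that raises the ArrowNotImplementedError.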