DatasetDict save_to_disk with num_proc > 1 seems to hang with error

I set up a simple test in Colab with the following code:

from datasets import load_dataset, concatenate_datasets, DatasetDict, logging, disable_progress_bars
%pip show datasets  # Confirming I am using 2.18

logging.set_verbosity_warning()
# Load the MNLI and SNLI train datasets
mnli_train = load_dataset("glue", "mnli", split="train")
snli_train = load_dataset("snli", split="train")

# Concatenate the MNLI and SNLI datasets
combined_train = concatenate_datasets([mnli_train, snli_train])

mnli_matched = load_dataset("glue", "mnli_matched")

combined_dataset = DatasetDict({
    'train': combined_train,  # This is the new combined train split
    # Include other splits from mnli_matched if needed, like:
    'validation': mnli_matched['validation'],
    'test': mnli_matched['test']
})

logging.set_verbosity_info()
disable_progress_bars()
print(combined_dataset)
print("saving")

combined_dataset.save_to_disk("./test", num_proc=2)

With logging.set_verbosity_info() enabled, you get the following error:

Exception in thread Thread-34 (_handle_results):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 579, in _handle_results
    task = get()
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 254, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/usr/local/lib/python3.10/dist-packages/dill/_dill.py", line 303, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/dill/_dill.py", line 289, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.10/dist-packages/dill/_dill.py", line 444, in load
    obj = StockUnpickler.load(self)
TypeError: CastError.__init__() missing 2 required keyword-only arguments: 'table_column_names' and 'requested_column_names'
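
A likely reading of this traceback: the TypeError is raised while the parent process unpickles an exception sent back by a worker, so the worker's original CastError is lost and the pool's result-handler thread dies, which would explain the hang. The sketch below reproduces that mechanism with the standard library only. DemoCastError is a hypothetical stand-in, not the real class from datasets; it only assumes an exception whose __init__ takes required keyword-only arguments, as the message above suggests.

```python
import pickle


class DemoCastError(ValueError):
    """Hypothetical stand-in for an exception with required keyword-only arguments."""

    def __init__(self, *args, table_column_names, requested_column_names):
        super().__init__(*args)
        self.table_column_names = table_column_names
        self.requested_column_names = requested_column_names


original = DemoCastError(
    "Couldn't cast",
    table_column_names=["premise", "hypothesis", "label"],
    requested_column_names=["premise", "hypothesis", "label", "idx"],
)

# Pickling succeeds: exceptions are stored as (class, positional args, __dict__).
payload = pickle.dumps(original)

# Unpickling calls DemoCastError(*args) with no keyword arguments, so it fails
# with a TypeError before the saved attributes can be restored.
try:
    pickle.loads(payload)
    unpickle_error = None
except TypeError as exc:
    unpickle_error = exc

print(type(unpickle_error).__name__, unpickle_error)
```

If this reading is right, the TypeError is a symptom of how the exception is transported between processes, and the underlying problem is the cast failure raised inside the worker.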

(On a side note, I was hoping set_verbosity_info() would show the saving progress. Does anyone know how I can see the progress? At the moment, progress bars don't work when using Papermill.)

After looking into a couple of other scenarios, it seems the problem is caused by saving a dataset that has been concatenated like this, where the snli dataset has no “idx” column.

Hi! Thanks for reporting. I opened a PR to fix this bug: "Fix sliced ConcatenationTable pickling with mixed schemas vertically" (Pull Request #6715, huggingface/datasets on GitHub).