Datasets map tokenization throws OSError: No space left on device

I am running the run_mlm.py example script with my custom dataset, but I am getting a "No space left on device" error, even when using the keep_in_memory=True parameter. My custom dataset is a set of CSV files, but for now I'm only loading a single file (200 MB) with 200 million rows.

Before running the script I have about 128 GB of free disk space; when I run the script, it creates a couple of Arrow files of about 11 GB each until the disk is full. Although my available disk space is limited, I do have 128 GB of RAM, and only about 30 GB of it gets used during execution. Shouldn't the RAM usage be greater and the disk usage smaller?

        with training_args.main_process_first(desc="dataset map tokenization"):
            tokenized_datasets = raw_datasets.map(
                tokenize_function,
                batched=True,
                num_proc=data_args.preprocessing_num_workers,
                remove_columns=column_names,
                desc="Running tokenizer on every text in dataset",
                keep_in_memory=True,
            )

Stacktrace:

Traceback (most recent call last):
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/table.py", line 54, in _write_table_to_file
    writer.write_batch(batch)
  File "pyarrow/ipc.pxi", line 384, in pyarrow.lib._CRecordBatchWriter.write_batch
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 627, in <module>
    main()
  File "train.py", line 477, in main
    tokenized_datasets = raw_datasets.map(
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/dataset_dict.py", line 484, in map
    {
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/dataset_dict.py", line 485, in <dictcomp>
    k: dataset.map(
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2141, in map
    transformed_shards[index] = async_result.get()
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/multiprocess/connection.py", line 209, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/dill/_dill.py", line 498, in dump
    StockPickler.dump(self, obj)
  File "/usr/local/lib/python3.8/pickle.py", line 487, in dump
    self.save(obj)
  File "/usr/local/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/lib/python3.8/pickle.py", line 901, in save_tuple
    save(element)
  File "/usr/local/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/lib/python3.8/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/local/lib/python3.8/pickle.py", line 717, in save_reduce
    save(state)
  File "/usr/local/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/lib/python3.8/pickle.py", line 578, in save
    rv = reduce(self.proto)
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/table.py", line 176, in __getstate__
    _write_table_to_file(table=table, filename=filename)
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/table.py", line 56, in _write_table_to_file
    return sum(batch.nbytes for batch in batches)
OSError: [Errno 28] No space left on device
Makefile:11: recipe for target 'train' failed

As a workaround, I used my own dataset class that tokenizes samples in the __getitem__ method. This way only RAM is used, although it is slower.
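For reference, that workaround can be sketched roughly like this. The class and the fake-tokenizer interface below are illustrative, not the actual training code; the idea is a map-style dataset that keeps raw text in RAM and defers tokenization to __getitem__, so no Arrow cache files are written:

```python
class LazyTokenizingDataset:
    """Map-style dataset: rows stay as raw strings in RAM and are
    tokenized only when an item is requested (no Arrow files on disk)."""

    def __init__(self, texts, tokenizer, max_length=128):
        self.texts = texts          # list of raw strings, held in RAM
        self.tokenizer = tokenizer  # e.g. a transformers tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenization happens here, per sample, at training time —
        # slower per step, but nothing is materialized on disk.
        return self.tokenizer(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length,
        )
```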

Hi! What version of datasets are you using?

Also, you may want to check whether your dataset is actually in RAM or loaded from disk by inspecting dataset.cache_files. If it's empty, your dataset is fully in RAM. If not, it lists the paths of the cached Arrow files on disk that are loaded.

Hi Quentin,

I am using the 1.13.1 version.

dataset.cache_files contains {'train': [], 'validation': []}.

Since the lists are empty, it should be expected to only use RAM, right?

Yes, exactly: in this case the dataset is loaded in memory.

The issue might come from multiprocessing. There is a limitation in pyarrow that makes it impossible to send more than 4 GB of data from one process to another. To work around that, datasets writes the data temporarily to disk, and the other process reads it from disk to put it in RAM. The temporary files are removed after map concatenates all the results from all the processes, but it looks like your disk fills up before that happens.
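The disk round-trip described above can be illustrated in plain Python. This is a simplified analogue, not the actual datasets internals: instead of pushing a large object through the multiprocessing pipe, the sender writes it to a temporary file and the receiver reads it back and deletes it — which is why the cache directory temporarily needs enough room for the data in flight:

```python
import os
import pickle
import tempfile


def send_via_disk(obj, tmp_dir):
    """Sender side: serialize obj to a temp file and return its path.
    This write is what consumes disk space during the transfer."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(obj, f)
    return path


def receive_via_disk(path):
    """Receiver side: load the object, then delete the temp file —
    the space is only reclaimed once the transfer completes."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    os.remove(path)
    return obj
```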

Let me know if you still have issues with this behavior so we can take a look


This seems to be the behavior I am facing. I don't know how pyarrow works, but for a single 250 MB CSV file I ran out of disk space quickly.
I was able to make this work by using my own Dataset implementation; that way pyarrow isn't used and everything works fine.
I could not find any mention of this limitation in the datasets documentation. I think it is worth adding, since it requires some knowledge of pyarrow internals.

Maybe we can avoid this behavior by sharding the data into tables that are smaller than 4 GB when sending them from one process to another. That would also feel less hacky.
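The sharding idea could be sketched generically as follows. The helper and the (item, nbytes) input shape are illustrative, not datasets internals: consecutive record batches are grouped into shards whose combined byte size stays under a budget (here, the 4 GB pyarrow transfer limit):

```python
def shard_by_budget(batches, budget_bytes):
    """Group consecutive (item, nbytes) pairs into shards whose total
    size stays under budget_bytes; an oversized single item still
    becomes its own shard."""
    shards, current, size = [], [], 0
    for item, nbytes in batches:
        if current and size + nbytes > budget_bytes:
            shards.append(current)   # close the full shard
            current, size = [], 0
        current.append(item)
        size += nbytes
    if current:
        shards.append(current)       # flush the last partial shard
    return shards
```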

I can take a look at this next week, I think.


Hi, is there any update on this issue? It still happens with multi-CPU usage.