I am running the run_mlm.py example script with my custom dataset, but I am getting a "No space left on device" error, even when using the keep_in_memory=True parameter. My custom dataset is a set of CSV files, but for now I'm only loading a single file (200 MB) with 200 million rows.
Before running the script I have about 128 GB of free disk space; when I run it, it creates several Arrow files of roughly 11 GB each until the disk is full. Although I have little disk space available, I do have 128 GB of RAM, and only about 30 GB of it gets used during execution. Shouldn't the RAM usage be higher and the disk usage lower?
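For reference, here is my back-of-envelope estimate of how big the tokenized output could get. The sequence length, dtype, and column count below are guesses for illustration, not values taken from my actual config:

```python
# Rough estimate of the tokenized Arrow output size.
# Assumptions (hypothetical, not from my script): max_seq_length=128,
# int32 token ids, and three columns of the same shape
# (input_ids, attention_mask, token_type_ids).
rows = 200_000_000
seq_len = 128          # hypothetical max_seq_length
bytes_per_token = 4    # int32
columns = 3            # input_ids + attention_mask + token_type_ids

total_bytes = rows * seq_len * bytes_per_token * columns
print(f"{total_bytes / 1e9:.0f} GB")  # roughly 307 GB
```

So even under modest assumptions the tokenized table could far exceed my 128 GB of free disk, which would explain why the Arrow files fill it.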
with training_args.main_process_first(desc="dataset map tokenization"):
    tokenized_datasets = raw_datasets.map(
        tokenize_function,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        desc="Running tokenizer on every text in dataset",
        keep_in_memory=True,
    )
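Reading the traceback below, the failure happens inside _write_table_to_file while a shard is being pickled and sent to a worker process (via multiprocess put(task)), so it looks like keep_in_memory=True may not prevent disk spills when num_proc is set. One workaround I'm considering, assuming the spill goes to the datasets cache location, is pointing that cache at a larger disk. An untested sketch (the mount path is made up):

```python
import os

# Redirect the Hugging Face datasets cache to a disk with more free
# space. This must be set *before* importing datasets, since the
# library reads the variable at import time.
# "/mnt/bigdisk" is a hypothetical path; substitute a real mount point.
os.environ["HF_DATASETS_CACHE"] = "/mnt/bigdisk/hf_datasets_cache"

# import datasets  # must come after setting the variable
```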
Stacktrace:
Traceback (most recent call last):
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/table.py", line 54, in _write_table_to_file
    writer.write_batch(batch)
  File "pyarrow/ipc.pxi", line 384, in pyarrow.lib._CRecordBatchWriter.write_batch
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 627, in <module>
    main()
  File "train.py", line 477, in main
    tokenized_datasets = raw_datasets.map(
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/dataset_dict.py", line 484, in map
    {
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/dataset_dict.py", line 485, in <dictcomp>
    k: dataset.map(
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2141, in map
    transformed_shards[index] = async_result.get()
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/multiprocess/connection.py", line 209, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/dill/_dill.py", line 498, in dump
    StockPickler.dump(self, obj)
  File "/usr/local/lib/python3.8/pickle.py", line 487, in dump
    self.save(obj)
  File "/usr/local/lib/python3.8/pickle.py", line 560, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python3.8/pickle.py", line 901, in save_tuple
    save(element)
  File "/usr/local/lib/python3.8/pickle.py", line 560, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/lib/python3.8/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/local/lib/python3.8/pickle.py", line 717, in save_reduce
    save(state)
  File "/usr/local/lib/python3.8/pickle.py", line 560, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/lib/python3.8/pickle.py", line 578, in save
    rv = reduce(self.proto)
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/table.py", line 176, in __getstate__
    _write_table_to_file(table=table, filename=filename)
  File "/home/emanuel/twitter-br/env/lib/python3.8/site-packages/datasets/table.py", line 56, in _write_table_to_file
    return sum(batch.nbytes for batch in batches)
OSError: [Errno 28] No space left on device
Makefile:11: recipe for target 'train' failed