Saving memory when running run_mlm.py on the Wikipedia dataset

Hi,
I am trying to run run_mlm.py with mBERT on the Wikipedia dataset using this command:

python run_mlm.py --model_name_or_path bert-base-multilingual-cased --dataset_name wikipedia --dataset_config_name 20200501.en --do_train --do_eval --output_dir /dara/test  --max_seq_length 128

You can find the code for run_mlm.py in the Hugging Face repo here: transformers/run_mlm.py at v4.3.2 · huggingface/transformers · GitHub

I am using transformers version 4.3.2.

But I get a memory error with this dataset. Is there a way I could save memory with the run_mlm.py script? Any suggestion is appreciated. Thanks @sgugger, @patil-suraj
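For context, the call that fails (per the traceback below) boils down to the following sketch. The variable names mirror the script's arguments; the actual load_dataset call is left commented out because it needs the full English Wikipedia Arrow cache:

```python
# Sketch of the failing call in run_mlm.py (datasets 1.3.0).
validation_split_percentage = 5  # run_mlm.py's default

# run_mlm.py builds this split expression (the frame at line 233):
train_split = f"train[{validation_split_percentage}%:]"

# Reproducing requires downloading/caching ~17 GB of Arrow files,
# so the real call is commented out here:
# from datasets import load_dataset
# dataset = load_dataset("wikipedia", "20200501.en", split=train_split)
```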

  File "run_mlm.py", line 441, in <module>
    main()
  File "run_mlm.py", line 233, in main
    split=f"train[{data_args.validation_split_percentage}%:]",
  File "/dara/libs/anaconda3/envs/code/lib/python3.7/site-packages/datasets-1.3.0-py3.7.egg/datasets/load.py", line 750, in load_dataset
    ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications, in_memory=keep_in_memory)
  File "/dara/libs/anaconda3/envs/code/lib/python3.7/site-packages/datasets-1.3.0-py3.7.egg/datasets/builder.py", line 740, in as_dataset
    map_tuple=True,
  File "/dara/libs/anaconda3/envs/code/lib/python3.7/site-packages/datasets-1.3.0-py3.7.egg/datasets/utils/py_utils.py", line 225, in map_nested
    return function(data_struct)
  File "/dara/libs/anaconda3/envs/code/lib/python3.7/site-packages/datasets-1.3.0-py3.7.egg/datasets/builder.py", line 757, in _build_single_dataset
    in_memory=in_memory,
  File "/dara/libs/anaconda3/envs/code/lib/python3.7/site-packages/datasets-1.3.0-py3.7.egg/datasets/builder.py", line 829, in _as_dataset
    in_memory=in_memory,
  File "/dara/libs/anaconda3/envs/code/lib/python3.7/site-packages/datasets-1.3.0-py3.7.egg/datasets/arrow_reader.py", line 215, in read
    return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)
  File "/dara/libs/anaconda3/envs/code/lib/python3.7/site-packages/datasets-1.3.0-py3.7.egg/datasets/arrow_reader.py", line 236, in read_files
    pa_table = self._read_files(files, in_memory=in_memory)
  File "/dara/libs/anaconda3/envs/code/lib/python3.7/site-packages/datasets-1.3.0-py3.7.egg/datasets/arrow_reader.py", line 171, in _read_files
    pa_table: pa.Table = self._get_dataset_from_filename(f_dict, in_memory=in_memory)
  File "/dara/libs/anaconda3/envs/code/lib/python3.7/site-packages/datasets-1.3.0-py3.7.egg/datasets/arrow_reader.py", line 302, in _get_dataset_from_filename
    pa_table = ArrowReader.read_table(filename, in_memory=in_memory)
  File "/dara/libs/anaconda3/envs/code/lib/python3.7/site-packages/datasets-1.3.0-py3.7.egg/datasets/arrow_reader.py", line 322, in read_table
    stream = stream_from(filename)
  File "pyarrow/io.pxi", line 782, in pyarrow.lib.memory_map
  File "pyarrow/io.pxi", line 743, in pyarrow.lib.MemoryMappedFile._open
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Memory mapping file failed: Cannot allocate memory
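In case it helps diagnose: the failure comes from pyarrow.lib.memory_map, so it looks like the process runs out of virtual address space rather than physical RAM. This is only my assumption, but one quick check of the per-process limit is:

```shell
# Show the virtual memory (address space) limit for this shell and its
# children; memory-mapping large Arrow files can fail if this is finite.
ulimit -v
```

If this prints a finite number instead of "unlimited", raising it (e.g. running `ulimit -v unlimited` before launching the script) might be worth trying.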