IndexError: index out of bounds

Hi, I am trying to further pretrain “allenai/scibert_scivocab_uncased” on my own dataset using MLM. I am using the following command:

python3 ./transformers/examples/language-modeling/run_mlm.py \
    --model_name_or_path "allenai/scibert_scivocab_uncased" \
    --train_file train.txt \
    --validation_file validation.txt \
    --do_train \
    --do_eval \
    --output_dir test1 \
    --overwrite_cache \
    --cache_dir ./tt
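For context, my understanding is that run_mlm.py applies the standard BERT masking recipe during this kind of continued pretraining: 15% of token positions are selected, and of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged. Here is a toy, dependency-free sketch of that scheme (the `MASK_ID` / `VOCAB_SIZE` values are made up for illustration, not read from the real tokenizer):

```python
import random

# Illustrative ids only -- the real values come from the tokenizer.
MASK_ID = 103
VOCAB_SIZE = 30522

def mlm_mask(token_ids, mask_prob=0.15, rng=None):
    """Return (inputs, labels) following BERT's MLM recipe.

    15% of positions are selected; of those, 80% become [MASK],
    10% become a random token, and 10% stay unchanged. Labels are
    -100 everywhere except the selected positions, so the loss is
    only computed where a prediction is expected.
    """
    rng = rng or random.Random(0)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                  # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID          # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% of selected positions keep the original token
    return inputs, labels
```

In the real script this is handled by DataCollatorForLanguageModeling; the sketch is just to show what the objective looks like.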
However, I am getting this error:

 0% 0/240 [00:00<?, ?ba/s]Traceback (most recent call last):
  File "./transformers/examples/language-modeling/run_mlm.py", line 409, in <module>
    main()
  File "./transformers/examples/language-modeling/run_mlm.py", line 355, in main
    load_from_cache_file=not data_args.overwrite_cache,
  File "/usr/local/lib/python3.6/dist-packages/datasets/dataset_dict.py", line 303, in map
    for k, dataset in self.items()
  File "/usr/local/lib/python3.6/dist-packages/datasets/dataset_dict.py", line 303, in <dictcomp>
    for k, dataset in self.items()
  File "/usr/local/lib/python3.6/dist-packages/datasets/arrow_dataset.py", line 1259, in map
    update_data=update_data,
  File "/usr/local/lib/python3.6/dist-packages/datasets/arrow_dataset.py", line 157, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/datasets/fingerprint.py", line 163, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/datasets/arrow_dataset.py", line 1528, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.6/dist-packages/datasets/arrow_writer.py", line 278, in write_batch
    pa_table = pa.Table.from_pydict(typed_sequence_examples)
  File "pyarrow/table.pxi", line 1474, in pyarrow.lib.Table.from_pydict
  File "pyarrow/array.pxi", line 322, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 222, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/usr/local/lib/python3.6/dist-packages/datasets/arrow_writer.py", line 100, in __arrow_array__
    if trying_type and out[0].as_py() != self.data[0]:
  File "pyarrow/array.pxi", line 1058, in pyarrow.lib.Array.__getitem__
  File "pyarrow/array.pxi", line 540, in pyarrow.lib._normalize_index
IndexError: index out of bounds

Can someone help me understand this problem and how to resolve it? When I try the same command with bert-base-uncased, it runs fine. Also, what is the best practice for further pretraining a model on a custom dataset?
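I am not sure of the root cause yet, but one thing I am trying to rule out is blank or whitespace-only lines in my training file, since the failure happens while datasets is writing the tokenized batches to Arrow. A small helper I use to produce cleaned copies (strip_blank_lines is my own name, not part of the script):

```python
import os
import tempfile

def strip_blank_lines(src_path, dst_path):
    """Copy src_path to dst_path, dropping blank / whitespace-only lines.
    Returns the number of lines kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():          # keep only non-blank lines
                dst.write(line)
                kept += 1
    return kept

# Quick self-check on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "train.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("first sentence\n\n   \nsecond sentence\n")
    print(strip_blank_lines(src, os.path.join(tmp, "clean_train.txt")))  # prints 2
```

After cleaning, I point --train_file and --validation_file at the cleaned copies.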

Any progress on this? I’m currently facing the same issue.