IndexError: index out of bounds, MLM+XLA

This is an error from the MLM script (PyTorch) while attempting to pre-train BigBird on TPUs over XLA. The dataset in question is a custom one, and the model config and tokenizer have been initialized appropriately.
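For context, the ./config and ./tokenizer directories were produced with save_pretrained(). A minimal sketch of that step (the values below are placeholders rather than my real hyperparameters; the relevant point is that max_position_embeddings has to cover --max_seq_length):

from transformers import BigBirdConfig, BigBirdTokenizerFast

# Placeholder config -- the real one differs, but max_position_embeddings
# must be at least the --max_seq_length passed to run_mlm.py (16000).
config = BigBirdConfig(max_position_embeddings=16000)
config.save_pretrained("./config")

# Any BigBird-compatible tokenizer saved to disk works here; run_mlm.py
# then loads it via --tokenizer_name="./tokenizer".
tokenizer = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
tokenizer.save_pretrained("./tokenizer")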

This is a continuation of this unanswered Forum post, which hits the same error.

Command used to run the script:

%%bash
python xla_spawn.py --num_cores=8 ./run_mlm.py --output_dir="./results" \
    --model_type="big_bird" \
    --config_name="./config" \
    --tokenizer_name="./tokenizer" \
    --train_file="./dataset.txt" \
    --validation_file="./val.txt" \
    --line_by_line="True" \
    --max_seq_length="16000" \
    --weight_decay="0.01" \
    --per_device_train_batch_size="1" \
    --per_device_eval_batch_size="1" \
    --learning_rate="3e-4" \
    --tpu_num_cores='8' \
    --warmup_steps="1000" \
    --overwrite_output_dir \
    --pad_to_max_length \
    --num_train_epochs="5" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --do_train \
    --do_eval \
    --logging_steps="50" \
    --evaluation_strategy="steps" \
    --eval_accumulation_steps='10' \
    --report_to="tensorboard" \
    --logging_dir='./logs' \
    --save_strategy="epoch" \
    --load_best_model_at_end='True' \
    --metric_for_best_model='validation' \
    --preprocessing_num_workers='15'
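For anyone trying to reproduce this outside xla_spawn, the tokenization step that this command triggers can be run standalone with something like the sketch below. It mirrors run_mlm.py's line-by-line path; the tokenize helper is my own naming, and the call details are my approximation of what the script does:

from datasets import load_dataset
from transformers import AutoTokenizer

raw = load_dataset("text", data_files={"train": "./dataset.txt",
                                       "validation": "./val.txt"})
tokenizer = AutoTokenizer.from_pretrained("./tokenizer")

def tokenize(batch):
    # Approximates --line_by_line with --pad_to_max_length and
    # --max_seq_length=16000 from the command above.
    return tokenizer(batch["text"], padding="max_length",
                     truncation=True, max_length=16000)

# Same worker count as --preprocessing_num_workers=15.
tokenized = raw.map(tokenize, batched=True, num_proc=15,
                    remove_columns=["text"])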

To be precise, I am facing two errors:

Exception in device=TPU:0: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1006, in main_process_first
    yield
  File "/content/run_mlm.py", line 393, in main
    desc="Running tokenizer on dataset line_by_line",
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in map
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in <dictcomp>
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in map
    for rank in range(num_proc)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in <listcomp>
    for rank in range(num_proc)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2664, in shard
    writer_batch_size=writer_batch_size,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 186, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2254, in select
    return self._new_dataset_with_indices(indices_buffer=buf_writer.getvalue(), fingerprint=new_fingerprint)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2170, in _new_dataset_with_indices
    fingerprint=fingerprint,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 297, in __init__
    self._indices.column(0)[0].type
  File "pyarrow/table.pxi", line 162, in pyarrow.lib.ChunkedArray.__getitem__
  File "pyarrow/array.pxi", line 549, in pyarrow.lib._normalize_index
IndexError: index out of bounds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/run_mlm.py", line 529, in _mp_fn
    main()
  File "/content/run_mlm.py", line 393, in main
    desc="Running tokenizer on dataset line_by_line",
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1011, in main_process_first
    torch.distributed.barrier()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 2523, in barrier
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

I haven’t modified the script to call init_process_group yet; I am focusing on the earlier IndexError first. Clearly, the problem arises from my own dataset (which was working before, however), and interestingly it shows up during the tokenization stage.
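One guess based on the traceback (unconfirmed): the failure is inside Dataset.shard() in the multiprocessed map(), so if either text file has fewer lines than --preprocessing_num_workers=15, some shards would come out empty. That at least is easy to check:

from datasets import load_dataset

raw = load_dataset("text", data_files={"train": "./dataset.txt",
                                       "validation": "./val.txt"})
for split, ds in raw.items():
    # Any split smaller than 15 rows would produce empty shards
    # when map() runs with num_proc=15.
    print(split, len(ds))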

Beyond that guess, all I know is that the failure happens while the Arrow dataset is being constructed. I have no idea about Apache Arrow internals, so I can’t debug much further :sweat_smile:
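If the empty-shard reading of the traceback is right, the same code path should be reachable with a toy dataset smaller than num_proc (untested on my side, and the datasets behavior here may differ across versions):

from datasets import Dataset

tiny = Dataset.from_dict({"text": ["a", "b", "c"]})  # only 3 rows
# With num_proc > len(dataset), map() shards the table into empty pieces;
# building a Dataset from an empty indices table is what reaches
# pyarrow.lib._normalize_index and raises IndexError: index out of bounds.
tiny.map(lambda x: x, num_proc=15)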

Can anyone give me some guidance on where I should start investigating the error, and some possible leads as to its origin?

Any ideas, anyone?