run_qa.py with a custom dataset seems to expect a batch size of 1000 but receives a batch size of 1362

Hi, I am trying to fine-tune xlm-roberta-base with run_qa.py on a custom, machine-translated Dutch SQuAD v2 dataset. The dataset is a line-delimited JSON file with the columns “question”, “answers”, and “context”, and it seems to load correctly with load_dataset.
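For reference, this is roughly how the data is structured and loaded. It is only a minimal sketch: the example record is made up, but the file names match the ones used in the command below.

```python
from datasets import load_dataset

# Each line of the JSON file holds one example, roughly of this shape:
# {"question": "...", "context": "...",
#  "answers": {"text": ["..."], "answer_start": [42]}}
data_files = {
    "train": "nl_squad_train_filtered2.json",
    "validation": "nl_squad_dev_filtered2.json",
}
raw_datasets = load_dataset("json", data_files=data_files)

print(raw_datasets["train"].column_names)  # e.g. ['question', 'answers', 'context']
print(raw_datasets["train"].num_rows)
```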

I am running the following run_qa.py command:

!python transformers/examples/pytorch/xla_spawn.py transformers/examples/pytorch/question-answering/run_qa.py --model_name_or_path xlm-roberta-base --train_file nl_squad_train_filtered2.json --validation_file nl_squad_dev_filtered2.json --do_train --do_eval --per_device_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 2 --max_seq_length 256 --pad_to_max_length --version_2_with_negative --doc_stride 128 --output_dir /tmp/debug_squad_2/

While running run_qa.py, the following error is raised:

Running tokenizer on train dataset:   0% 0/96 [00:00<?, ?ba/s]
Traceback (most recent call last):
  File "transformers/examples/pytorch/xla_spawn.py", line 85, in <module>
    main()
  File "transformers/examples/pytorch/xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 387, in spawn
    _start_fn(0, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/question-answering/run_qa.py", line 648, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/question-answering/run_qa.py", line 438, in main
    desc="Running tokenizer on train dataset",
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1971, in map
    desc=desc,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 519, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 486, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2354, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 495, in write_batch
    pa_table = pa.Table.from_arrays(arrays, schema=schema)
  File "pyarrow/table.pxi", line 1702, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1314, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 2 named input_ids expected length 1000 but got length 1362

Printing the dimensions of the examples parameter inside prepare_train_features gives the following:

ID dimensions: (1000,)
Title dimensions: (1000,)
Context dimensions: (1000,)
Question dimensions: (1000,)
Answers dimensions: (1000,)

Printing the dimensions of the tokenized_examples returned by prepare_train_features gives the following:

Input ID dimensions: (1362, 256)
Attention mask dimensions: (1362, 256)
Start positions dimensions: (1362,)
End positions dimensions: (1362,)
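
(These shapes come from a small debug helper I added inside prepare_train_features; print_batch_shapes and its arguments are just my own names, not part of run_qa.py.)

```python
import numpy as np

def print_batch_shapes(examples, tokenized_examples):
    # examples: the raw batch passed to prepare_train_features
    # tokenized_examples: the dict returned after tokenization
    for key in ("id", "title", "context", "question", "answers"):
        print(key, "dimensions:", np.shape(examples[key]))
    for key in ("input_ids", "attention_mask", "start_positions", "end_positions"):
        print(key, "dimensions:", np.shape(tokenized_examples[key]))
```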

If I understand the error correctly, a batch of 1000 examples is passed in and turns into a batch of 1362 features after tokenization, while a batch size of 1000 is still expected somewhere. My guess is that this is caused by long contexts being truncated and split into multiple features based on the max_seq_length of 256. As stated in the datasets documentation on batch mapping, the input batch size does not have to equal the output batch size. Furthermore, the output columns all have the same length (1362), which is the requirement stated in the documentation, so that should not be the problem. Any ideas on what could be going wrong here?
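
To check my understanding of the mechanism, here is a toy sketch (made-up data, not my actual dataset) of how a batched map can return more rows than it receives, and how I believe the same kind of ArrowInvalid error appears as soon as the resulting columns end up with different lengths:

```python
from datasets import Dataset

toy = Dataset.from_dict({"context": ["a b", "c d e"]})

def split_into_features(batch):
    # Returns 5 rows for the 2 input rows, similar to how the tokenizer
    # produces extra features when long contexts are split up.
    return {"input_ids": [tok for text in batch["context"] for tok in text.split()]}

# Works: the original columns are dropped, so every column has length 5.
ok = toy.map(split_into_features, batched=True, remove_columns=toy.column_names)

# As far as I understand, this variant fails with the same kind of error,
# because "context" keeps length 2 while "input_ids" has length 5:
# broken = toy.map(split_into_features, batched=True)
```

As far as I can tell, run_qa.py already passes remove_columns to map, so I do not see where a column of length 1000 could be coming from in my case.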

Feel free to ask if you need more information; I will try to respond as quickly as possible.