Hi, I am trying to fine-tune xlm-roberta-base with run_qa.py on a custom, machine-translated Dutch squad_v2 dataset. The dataset is a line-delimited JSON file with the columns "question", "answers", and "context", and it appears to load correctly with load_dataset.
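For reference, here is a minimal sketch of the record shape involved, using only the standard library. The exact answers layout shown is an assumption, mirroring the standard squad_v2 schema that run_qa.py expects (a dict with parallel text and answer_start lists):

```python
import json

# One JSON object per line (line-delimited JSON / JSON Lines).
# The answers structure is an assumption based on the squad_v2 schema.
sample = {
    "question": "Waar ligt Amsterdam?",
    "context": "Amsterdam ligt in Nederland.",
    "answers": {"text": ["Nederland"], "answer_start": [18]},
}

# A line-delimited file is just one such object per line:
jsonl = json.dumps(sample, ensure_ascii=False) + "\n"

# load_dataset("json", data_files=...) parses each line the same way:
records = [json.loads(line) for line in jsonl.splitlines()]

# Sanity check: answer_start must index the answer text inside context.
rec = records[0]
start = rec["answers"]["answer_start"][0]
print(rec["context"][start:start + len("Nederland")])  # -> Nederland
```

A quick check like the last line can catch misaligned answer_start offsets, which machine translation of the contexts can easily introduce.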
I am running the following run_qa.py command:
!python transformers/examples/pytorch/xla_spawn.py transformers/examples/pytorch/question-answering/run_qa.py --model_name_or_path xlm-roberta-base --train_file nl_squad_train_filtered2.json --validation_file nl_squad_dev_filtered2.json --do_train --do_eval --per_device_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 2 --max_seq_length 256 --pad_to_max_length --version_2_with_negative --doc_stride 128 --output_dir /tmp/debug_squad_2/
While running run_qa.py, the following error is shown:
Running tokenizer on train dataset: 0% 0/96 [00:00<?, ?ba/s]
Traceback (most recent call last):
  File "transformers/examples/pytorch/xla_spawn.py", line 85, in <module>
    main()
  File "transformers/examples/pytorch/xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 387, in spawn
    _start_fn(0, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/question-answering/run_qa.py", line 648, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/question-answering/run_qa.py", line 438, in main
    desc="Running tokenizer on train dataset",
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1971, in map
    desc=desc,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 519, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 486, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2354, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 495, in write_batch
    pa_table = pa.Table.from_arrays(arrays, schema=schema)
  File "pyarrow/table.pxi", line 1702, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1314, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 2 named input_ids expected length 1000 but got length 1362
Printing the dimensions of the examples parameter passed to prepare_train_features gives:
ID dimensions: (1000,)
Title dimensions: (1000,)
Context dimensions: (1000,)
Question dimensions: (1000,)
Answers dimensions: (1000,)
Printing the dimensions of the tokenized_examples returned by prepare_train_features gives:
Input ID dimensions: (1362, 256)
Attention mask dimensions: (1362, 256)
Start positions dimensions: (1362,)
End positions dimensions: (1362,)
If I understand the error correctly, a batch of 1000 examples turns into a batch of 1362 features after tokenization, while a batch of 1000 is expected. My guess is that this comes from contexts longer than max_seq_length (256) being split into multiple overflowing features via the doc stride, rather than from truncation alone. As the documentation on batch mapping states, the output of the mapped function does not have to be the same size as the input. Furthermore, all output columns have the same length (1362), which the documentation lists as a requirement, so that should not be the problem either. Any ideas what could be going wrong here?
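For intuition, here is a stdlib-only sketch (no transformers or datasets needed; the token counts and the chunking formula are illustrative assumptions modeled on return_overflowing_tokens with a doc stride) of how a batch of examples expands into more features, and why a batch that still contains a column of the original length then fails a same-length check like pyarrow's:

```python
import math

def n_features(n_tokens, max_len=256, stride=128):
    """Illustrative count of overflowing windows one context produces,
    assuming each extra window advances by (max_len - stride) tokens."""
    if n_tokens <= max_len:
        return 1
    step = max_len - stride
    return 1 + math.ceil((n_tokens - max_len) / step)

# A toy batch of 4 examples; contexts longer than 256 tokens overflow.
token_counts = [100, 300, 700, 256]
n_in = len(token_counts)                        # rows going into map()
n_out = sum(n_features(n) for n in token_counts)  # rows coming out

# The writer requires every column in the written batch to have the
# same length. If an original column (length n_in) ends up next to a
# tokenized column (length n_out), the batch is inconsistent -- the
# same shape of mismatch as the ArrowInvalid above.
batch = {
    "question": ["q"] * n_in,     # original column, not removed
    "input_ids": [[0]] * n_out,   # new tokenized column
}
mismatch = len({len(v) for v in batch.values()}) > 1
print(n_in, n_out, mismatch)  # -> 4 9 True
```

This is only a sketch of the failure mode, but it suggests one thing worth checking: the error message compares input_ids (1362) against an expected length of 1000, i.e. against a column that still has the pre-tokenization batch size. run_qa.py normally passes remove_columns=column_names to map for exactly this reason, so it may be worth verifying that all of the original dataset columns are actually listed there and dropped.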
Feel free to ask if you need more information; I will try to respond as quickly as possible.